solving go (alphago + alphazero) [3x]

Feb 08, 2018 in ADVANCED • PIPELINE.AI • REINFORCEMENT LEARNING
27 min read

A talk on how MCTS, Alpha Go, Alpha Go Zero and Alpha Zero can be used to solve the game of go.

I have given it a few times now:

Note: for both presentations the audio gets a lot better about five minutes in!

First version @ Thumbtack to the Advanced Spark/Tensorflow meetup (february 8, 2018):

Video • Slides

Second @ Advanced Spark/Tensorflow meetup at Strata Conf in San Jose (march 6, 2018):

Video • Slides

Third @ hackillinois 2019 (feburary 24th, 2019):

Slides

Transcript

Thank you all for showing up. I put the slides online if you’d like to follow along. I’d like to thank Chris for loaning me his soapbox. Quick show of hands, who here works with TensorFlow? Who here works with neural networks? And who here has played Go?

My background – I used to play a lot of chess. Two years ago back in grad school, I took some AI courses, and the professor let me build a bot to play Go. So I’ve just been watching it from afar and just studying it.

I will do a brief summary of Go with a discussion of the rules of the game. We’ll look at how to combine UCT with random rollouts to produce MCTS, which is the basis of all of the modern approaches to this game. Then we’ll look at how the combination of MCTS with policy and value networks produces the Alpha Go series of engines. Then last year, they combined the policy and value portions to produce Alpha Go Zero which came out last May or so, and then in December of last year, they released Alpha Zero which is like Alpha Go Zero without domain specific knowledge, so it’s like a general purpose algorithm that you can apply to other domains. Then we’ll do a quick demo of that at the end.

If you’ve never seen a Go board, here’s what it looks like. Our basic objective is to fill the board with stones and capture stones while your opponent is trying to do the same thing. The Go game comes to us from Asia, it started about 2500 years ago. The goal is basically to fill the 19x19 board, which is 361 squares, keeping score as you go. We’ll focus on the Chinese scoring system, which can be summed up as nothing more than the number of squares that you have stones on in the game and the number of stones that you’ve captured. The first player to move in Go has a little advantage. This is black as well – not white, which is different from chess. So black has 7.5 points subtracted from the score in order to balance that out, which is called komi. The game will last about 300 moves, which gives us a state-space of about 10^170 power. Chess is about 10^50, and the number of atoms in the observable universe is about 10^80. So if you can imagine a universe for every atom in our universe, that would be still be slightly shy of the state space in go. Because of this, Go has historically been the holy grail of AI research.

The first part of the algorithm is called UCT which is a stochastic algorithm. It’s designed around the multi armed bandit problem, which is a fancy way of saying, how do you win the most money from a room of slot machines, given a certain number of quarters?

The basic idea of the UCT algorithm is to explore new machines, you don’t want to put all your quarters into one machine. Then when you find a winning machine, you then put the max amount of your money. UCT doesn’t actually give you the best algorithm for making money from those machines, it’s actually designed to give you the least worst solution. But when we convert the least worst solution from that domain, to game theory, the least worst move is the best move, basically.

Next in line is UCT which is just combining it with trees. So at the bottom here, we have basically what the UCB algorithm would look like whenever applied to a Go board. We start off on the game tree and we decided that we want to evaluate a new move. We evaluate our new move, and then for example in this picture, we can take the area circled ‘r’ and we go back to the tree. So black would want to go towards this line so we increase their score and their probability of winning. But then we go up a move, and white does not want to go down this line, so for them we decrease the probability of this branch. We go back and forth until the climb up to the top, and you repeat the cycle over and over again.

The UCT algorithm itself is a pretty valid way to approach the game. But the c part of this picture, the evaluation is where the real magic happens. Unlike the slot machine where you try to put a quarter in, you instantly find out whether or not you win – with Go, it might be 200 moves until you finish the game. So historically, this has been out of reach.

About a decade ago they came out with random rollouts. The basic idea is to take the current board state and simply add pieces of the board, adding stones randomly until both sides cannot play – a pass. Then you score the board using a Chinese rule set which is easy to implement. Then basically this random score, it gives you a win or loss prediction, then you can go back and update the tree. So this combination of using UCT and random rollouts, this is called MCTS. This can be proven to approach perfect play on the Go board. The only minor bug is that this may take longer than the universe will ever exist, but that’s what math people call an implementation problem.

A couple of years ago as you’re probably aware, Alpha Go beat the world champion of this game. This is the former world champion of Go, a human player – Lee Sedol, realizing he’s probably one of the last people to ever beat a computer at this game.

There are a few different Alpha Go engines. The original one is Alpha Go Fan. This is from 2015 when Fan Hui was one of the strongest players in Europe and he did a demonstration match against alpha go. Then there was the Alpha Go Lee engine from two years ago, and finally, in January of last year, Google released the paper Alpha Go Master. These three engines are basically roughly having the same basic algorithm, so we’re going to go through that right now. A basic idea of how these engines approach the game is to take the Monte Carlo tree search approach but to improve their search algorithm. So first with the policy network, we typically take moves. Ideally moves that are good, rather than random ones. Secondly, we want to use a value network to predict the odds of a particular move.

If we can just get a good guess what we now think a particular move is worth, where we can explore more areas of the tree. Then finally, we don’t get rid of the Monte Carlo tree search at all, altogether, rather – we are still using it to perform deeper evaluations whenever we need. So we’re still building upon solid bits.

So how do we build this? Well, the policy network that’s probably the closest to traditional machine learning, if you will – they took a collection of human games, about 150,000 of them, and applied supervised learning which was a CNN to create a policy network. So given a particular position, could it predict the next move? Then, they took this policy network and recombined it with the Monte Carlo tree search, played a lot more games, so now they have a larger collection of both human and computer games, and then they can rebuild the policy network again to produce an even larger, stronger network, which is the reinforcement learning part of this process. The end goal of all of this is to use the policy network to make moves rapidly.

So the first engine they constructed was able to make a human-looking move in a particular position about 55% of the time, which was significantly better than the other attempts at that point in time. But moreover, you can make that prediction in about 3 milliseconds. They then constructed a second version of this policy network that wasn’t even really a neural network anymore. It was based on a SoftMax function combined with small 3x3 patterns. So this one was only half as accurate, we’ll say 24% as accurate. But this smaller engine could run even faster, and so it could make its predictions in about two microseconds. An important note that we’ll come back to here in a second is that this policy network alone, which is just sort of copying what it things a human would do in a particular position. It doesn’t actually add any true knowledge, but it can actually beat a number of engines.

The second part of the equation is to build a value network. The basic question we’re trying to ask is from a given input stage, we predict who is going to win in a particular situation without actually having to perform a full simulation. So for this, they constructed a second CNN, similar to the policy network one, to simply predict the winning probability of getting into a position. They trained is on the MSE between the prediction and the outcome, and then they added a second heuristic of how many losses there were in the game.

The problem with this approach is basically that the network just copies its input, it overtrains to its input games, so they had to relax the network. If you look at a Go board, you can imagine that you can turn it 90 degrees. The stones would all have different numbers for where they are, but the game itself would remain the same. In data boards, you can rotate four times. By the same idea, you can also take the board and flip it over itself. So each game could have 8 degrees of freedom. So by taking the original gains, add in these 8 degrees of freedom, it allowed the network to properly train. Then ultimately, the policy network, we can use the policy network to make quick predictions. So it can run about 15,000x faster than the actual Monte Carlo tree search simulation.

So on this slide, we see the power combining these different approaches. Starting on this side, we have the two versions of Alpha Go – this is Alpha Go Fan. And then the third one is actually Fan Hui, who is a human player, and he’s one of the strongest human players in Europe at that point in time. These red bars are like five contemporary Monte Carlo tree search-based engines of that day. Then on the far side, we have the Alpha Go components. We’ll start on the very far right.

The very farthest right is just the pure Monte Carlo tree search part of the Alpha Go engine. This is interesting to me because it’s actually weaker than the other Monte Carlo tree search engines of its day. The next one over we have is the policy network. This is simply a neural network that’s emulating human games. It’s actually even slightly stronger. Next, we have the value network, which is just the network that can very quickly evaluate moves. I would argue it’s weaker at playing Go, but it can do a lot more simulations and it can explore more of its state space, so it’s stronger by that approach here. As we keep on coming over though, you can see the power of combining these different approaches. By combining the Monte Carlo tree search with the value network, you can get about a 500 point improvement performance by combining the value of policy networks – we can get a little bit more. And then finally by combining the Monte Carlo tree search with a policy network, we can get almost up to 2,500 ELO rating.

Then finally on this far graph here, we have all three approaches combined together. So we have the Monte Carlo tree search value and the policy networks, and that was produced with a full-blown engine of Alpha Go. That was able to beat Fan Hui in their demonstration match. The interesting one to me from this slide is the third one going that way for the Alpha Go engine. This is just a pure combination of value and policy networks with no Monte Carlo tree search. The reason this is interesting to me is that it’s the basis of the Alpha Go Zero engine which came out last year.

Alpha Go Zero built upon this foundation, but it had a couple of important concepts. But the largest one I would say is basically that we input a position, we use a single network, it combines both the policy and the value networks together to predict simultaneously both the best move and the winning odds of that move which we can then use to build our game tree. Because we’re using one network, we can play games against itself. We can train a new network to categorize wins and losses and reduce the prediction error, and then finally we can evaluate the new network against the old one and pick a winner, and from there, we repeat this process for about 700k generations, and then we have a master level play – starting from a completely randomly initialized state.

So now we’ll go into how the actual neural network in Alpha Go Zero works. This is a cheat sheet you can get off the internet, which is a nice PDF where you can zoom way in and see everything in great detail. We’ll go through the first few parts here a little bit more deeply in a second, but the basic concept is that we play games against ourselves and we train the neural network against that, and we play the new neural network against the old one, we pick a winner, and we repeat the process.

The far slide we have over here is how the game state is actually encoded in the Alpha Go Zero engine – it’s 19x19, a full board. So we have 8 stacks for black, the last 8 are for black, and then we have 8 stacks for white, the last 8 is for white, and then finally the last layer is simply black or white depending on who is going to move. This is how we store a single state for the Alpha Go Engine.

Then over here we take our input state and we come over to the bottom left corner – we apply a 3x3 convolution, a batch normalization, and a regular step. Then from there, we go over here to the far bottom right side. We do another 3x3 convolution, batch norm, a 3x3 convolution batch, then we add a skip connection. This network uses a resnet style of approach. From there, we simply have 40 of these blocks going all the way up. At the top on our left, we have ahead to output whether or not we think a move has a winning probability. So we have a value head, and we have another note on the other side that outputs what we think is the most likely move in this position. So, the policy. We take our input state, we run it through 40 residual blocks, and then at the end, we have a prediction of what we think the best next move is, and a score for how likely we think that move is.

From there, basically, we can take our whole network and flip it all the way upside back down, and basically, we can reconstruct our Monte Carlo tree search style tree, but now we can just do our single neural network to do all the steps. Then finally we apply a global simulated annealing temperature schedule just to make sure we don’t get stuck in a local minimum or maximum while we’re beginning our state space search.

Here we can see how this process plays out, it’s where we’re actually training things. The network starts out extremely weak, it basically makes completely random moves. It takes it literally a day or two to reach even just beginner level play. But then it slowly improves until it passes the dotted green line here, which is the Alpha Go Lee that played against Lee Sedol. The top one is the Alpha Go Master engine, which we’ll talk a little bit more about here. But finally, it just surpasses all of these other engines and it just ends up in its own world. It’s beyond anything that’s ever been done before and has nobody to compete with anymore.

This other graph here is basically the 2017 version of our ELO ratings from before. So we have our same Monte Carlo tree search style engines over here, the red blocks. Then coming this way, the far dark blue one is the Alpha Go Fan engine from before. The next one is the Alpha Go Lee engine, which is a little bit stronger still. Then you have the Alpha Go Master, and then a little bit higher up there is Alpha Go Zero. A portion of the reason that we’re able to make these improvements between these engines has to do with simply increasements in computational power. The Alpha Go Fan was trained on about 172 GPUs, similar to what probably many of you have. Alpha Go Lee was trained on a cluster of 48 TPUv1 devices like Google had for internal hardware. Then the Alpha Go Master and the Alpha Go Zero were trained on a cluster of TPU-v3 pods.

So we’ll go a little bit into the differences between the Alpha Go Zero and Alpha Go Master, they’re both trained on the same hardware, but the second engine is able to achieve a little bit higher performance. But the one thing I would like you to remember is this raw policy network from the Alpha Go Zero engine, which is the gray bar here, which is able to play at about 3,000 level rating. We’ll come back to that one here in a second.

The difference between the Alpha Go Master and the Alpha Go Zero approaches comes down to a couple of different things. This bottom slide is illustrating the differences between the two networks. One is that the Alpha Go Zero uses Resnet as opposed to CNN’s, so it can be trained to a slightly lower amount of error. The second thing is the combination of both the value and policy networks. The Google team actually built a version of the engine that trained a policy network on its own and the value network on its own and combined the results at the end. But as you can see in this bottom left graph, combining the two things together to have a shared goal state I think gives it a solid 500 to 1,000 point increase in performance just because the two things don’t have separate goals.

Then the second part is that the Resnet allows you to train to a higher degree of error. So the top left right here is basically showing you the difference between the supervised learning and the reinforcement learning approach. So the supervised learning actually does pretty well. It starts out very quickly, it learns from the games it was shown, but ultimately it cannot improve beyond its input dataset. It becomes limited by its data that it is trained on. The reinforcement learning starts out much slower, takes a lot longer to improve, but because it doesn’t have any limitations upon its growth, it’s able to ultimately achieve that higher level of performance.

So this slide is showing you the Alpha Go Zero training as it’s part of the process. So we start off over here from zero. As we go across, we’re getting deeper and deeper into the training. So from zero to 72 hours is about the amount of time it goes from being a raw beginner to playing at a high expert level. So at the very beginning, if we go down here to this bottom-left slide, you can see that the engine is just trying to gobble as many stones as they can. It’s playing like a pure beginner, it’s just playing purely for stones. If we go a little bit further into the game training state though, it’s starting to branch out and play on the edges of the board. Then finally over here, it finally has a full board vision and a global understanding of the game.

If you’ve played Go, the engine doesn’t actually learn ladders until very late in its training, which is something that I find interesting. Only after it’s learned all the little parts of the game can it actually have the global vision to see that it can’t move its stones across the board like that. Ladders are like a play that’s often taught to beginners. It’s an easy pattern for them to learn, but this engine doesn’t learn that until after it’s basically mastered everything else.

If you go across the top, we have various Joseki. These are basically opening patterns in Go. The engine is starting off from a completely blank randomly initialized state. As it improves, it basically rediscovers 2,000 years worth of opening theory in this game and it learns how to apply it and it learns how to refute it as well.

So if we take these top 5 Joseki, we can see right here a graph with the same axis of time played, and then the second axis is how often it’s actually playing it. So we’ve discovered some of these openings, it plays them a whole bunch, it discovers how to refute the opening, and then it abandons it altogether. Like this one right here. Other ones, it discovers and decides that it loves them and keeps on playing them a whole bunch.

This combination of these approaches is the Alpha Go Zero engine and it more or less has solved the game of Go from a practical standpoint. No human player will ever be able to beat a computer again, I’m sorry. That’s the current state of affairs.

From here, we’ll go into Alpha Zero. This is a paper that Google released in December of last year. It’s a generalized version of the Alpha Go approach, but it doesn’t actually have any GO specific knowledge. All the Alpha Zero engine takes is an input, it takes a board state, a list of possible moves, an evaluation function in some form, and then from there, it can generate value and policy networks by playing against itself. So you can give it these three input conditions, it can teach itself how to play and it will quickly improve to master level, given enough TPU time.

So they gave it a chessboard, they trained it up, it plays chess at like a 3,000 level, which is a grandmaster level. They gave it GO again, and it taught itself to play GO. It should be noted that the Alpha Go Zero, the green line right here at the top is like a version of the Alpha Go Zero engine that is only trained for three days. The full-blown one that was trained for 40 days is significantly stronger. This one can’t actually beat Alpha Go Zero.

The middle one is the one that’s cool to me. Shogi is a game from Japan. It’s similar to chess, it’s played on a 10x10 board. Historically, it’s not been well studied, but they’ve basically been able to give it the input board’s state listed moves, it was able to produce a master level player just by adding TPU time.

Here we can see Alpha Zero, similar to the Joseki of GO. Here’s Alpha Zero teaching itself to play various chess openings, and then either embracing the chess opening and using it a whole bunch or learning it and then deciding it wasn’t really that great of an opening and abandoning it. What’s interesting to me about this is that I play a lot of chess – I like the English opening. A lot of players for black play Sicilian defense. It’s kind of popular because it’s an aggressive dynamic way to play. But the computer does not like that opening at all. One of the arguments against it is that it’s the English opening, but with one less move on the board. Alpha Go likes a number of openings that were popular in the 1800s but fell out of favor because they were felt to be not aggressive enough.

So now we’ll apply the Alpha Zero engine to a slightly smaller problem of tic tac toe. If you’re not familiar with tic tac toe, you have a board with three rows. You fill out X’s and O’s, and if you play perfectly, you should always be able to draw. But it’s a board game just like these other ones, and we can model it the same way. So we have a 3x3 board for tic tac toe. Any empty square is a legal move. And in our evaluation functions, if there are 3 in a row, that’s a win.

So we’ll set this up, we’ll play a basic loop. We’ll play games against itself. We’ll train a network using Keras and TensorFlow to recognize the winners and minimize the losses. We’ll evaluate the neural network and then we’ll repeat the process until we have a tic tac toe bot.

This is some code I got off of the internet. A student in Stanford names Surag Nair wrote an Alpha Zero implementation in Python to play Othello, and then a Russian guy by the name of Evgeny Tyurin modified it to run tic tac toe, so I’m just running his code. We’ll run the script here.

So this is going to play about 50 games with itself. Then we’re going to train the network using Keras and TensorFlow. I modified his code slightly. He runs it for 10 epochs, mine only runs for 3, which is a little bit dangerous because it doesn’t always play perfectly. But hopefully, this will work.

Audience: While this is running, you had shown the spaces to be about 8 white and 8 black, and current probability – but on your earlier slide, it said it had almost 10^170th?

Sorry, that would be the state space for a single move. That’s how the state space is encoded for the network itself. I’m talking about how… There are 361 squares, there are 3 possible states for each square, then 300 possible moves, that’s how you get to that other number.

Okay, so we’ve trained the network, now we played the new network against the old one. Our random model actually beat our alpha zero engine’s training first model. If we let this run deep enough, it will eventually converge to a proper engine. But let’s stop it right here because this is going to start taking a while. Here’s the state state of our neural network. We’ll play it against a properly trained network that had about an hour’s worth of CPU time.

We’ll see how well our initial bot that we just trained does. So now they’re just blindly playing against each other as fast as possible. We’ll have them do a hundred games. So our new engine… I’m playing against a perfect engine. It won zero games, lost 46, but managed to draw 64. But if we sit here and let it run for an hour, it will play perfectly and there will be 100 draws if we let these two bots play against each other.

We can open up our board state-space right here, and this implementation of tic tac toe is 3 squares – if we take it and change 3 to 4, now we can have 4 x 4 squares. Now we can go back here, and we can rerun our training script. Now each game will take longer to simulate. We’ll have to play more games and we’ll have to train for a much deeper time, but if we let this one run, it will take about 20 hours to 24 hours, almost a day. We’ll have an engine that can play 4x4 tic tac toe perfectly, so on and so forth.

Audience: Can we explore new transfer learning where maybe learning one thing could help you get a running start with another thing?

For the Alpha Zero engines, we start from random states, so we don’t actually have to do transfer learning. So here is the 4x4 version of tic tac toe. You guys want to try to play against the bot? [laughs]

Okay, a recap of what we’ve seen so far. We’ve seen the Monte Carlo tree search in the form of UCT plus rollouts, it solves GO, but that approach doesn’t really scale. We’ve seen how you can combine expert knowledge in the form of prior games with value and policy networks which can be thought of being an optimization on top in order to surpass human players. From there, we’ve seen how a single network randomly initialized can reach even greater performance via self-play. Then finally I would argue that this approach generalizes to other domains and is very human-like. What do I mean by that?

I will ask you the question – what is AI? We’ll start off with what would be like the Chinese Room example, which would be like the negative case. You might say to me, “I can play tic tac toe perfectly all day long.” I’ll say, “Great, show me. Let’s get a piece of paper.” Then we just start writing down all the possible games until we’ve enumerated every single possible tic tac toe game on a 3x3 board that’s possible. Then I can take this stack of papers and give it to somebody else who doesn’t even know how to play tic tac toe at all, and I can say, “Just find the position that’s in the book, and make the next move.” We can do this for 3x3 tic tac toe, and we can do it again for 4x4 tic tac toe, it will just take a bigger book.

If we were to take the Alpha Go Zero policy network and we can somehow print it out, it would be incomprehensibly large, but if we can somehow print it out, then we can make a gigantic book that we can hand to any person in this room and you could play Go simply by taking the book, and whenever your opponent makes a move, you open it up to the right page and you make the next move. Because the Alpha Go policy engine plays at about a 3,000 level, you would be close to world master status without even having to understand what you were doing. Then based on this, all we’ve really done is invented a fancy form of compression, we’ll say. A way to compress the state space down. This would be like what I would say is the negative example.

So let’s go the other direction. If we were to teach somebody these board games, what would we do? We would give them a board, we would teach them the rules, and we would have them practice. Really, that’s all we do with this engine. We give them a board and the rules, and we have them do lots and lots of CPU time, but it doesn’t really have any advantage that we don’t have. Moreover, the question then becomes, “What is the difference between a master level player and a beginner?” I would argue that it’s not that the master necessarily can say checkmate in 36 or whatever and knows everything to do in any situation.

It’s more that the master level players know what situations to seek out, what situations to avoid, and the reason they can do that is just because they have lots and lots of experience. About a hundred years ago there was a world grandmaster of chess named Jose Casablanca. They asked him, “How many moves ahead do you think?” He said, “I only think ahead one move, but it’s the right one.” At the risk of anthropomorphizing our Go bot – if you could ask Alpha Zero that same question, it would tell you that answer is absolutely correct.

Video • Slides

Video • Slides

Slides

Transcript

Papers: