DeepMind's AlphaGo
Max Welling, University of Amsterdam

Man vs. Machine: 1-4

Machine Learning Ingredients
• Supervised deep learning
• Reinforcement learning (self-play)
• Monte Carlo Tree Search

Deep Convolutional Networks
Forward pass: filter, nonlinearity, subsample, filter, nonlinearity, subsample, ..., predict.
Backward pass: backpropagation (determines how you need to turn the parameter knobs).

Reinforcement Learning
"Learning from rewards received by interacting with the world"
• State s: the board configuration
• Action a: the move, i.e. where to drop my stone
• Reward z: "I won/lost"

The Four Deep Networks of AlphaGo

The Policy Networks (supervised learning)
Deep CNN trained on recorded expert games to predict the probability that an expert would play each move. A fast, "shallow" version is also trained for cheap rollouts.
• a = expert move
• s = board state

The Policy Network (reinforcement learning)
Deep CNN trained to predict the optimal move to win, by playing against older versions of itself using reinforcement learning.
• s = board configuration
• a ~ p(a|s)
• z = +1 if the game was won, z = -1 if the game was lost

The Value Network
Network trained to predict the value of a state, on a dataset generated from self-play using the policy network.
Note the difference with the policy networks: this is regression onto a single value per state, not a distribution over all possible (legal) moves.
• s = board configuration
• z = +1 if the game was won, z = -1 if the game was lost

Monte Carlo Tree Search
• Every move has a value Q(s,a).
• Every move also gets an "exploration bonus" u(s,a).
• Pick the next node according to: a = argmax_a [ Q(s,a) + u(s,a) ]
• u(s,a) decreases with more visits: u(s,a) ∝ P(s,a) / (1 + N(s,a))

AlphaGo's Monte Carlo Tree Search
FOR EVERY MOVE DO:
  REPEAT:
    1. Choose a path in the existing tree by maximizing (over a) Q(s,a) + u(s,a).
    2. Expand the leaf node: evaluate the policy-network prior P(s,a) for all children.
    3. Compute the value at the leaf node with the value network: v(s_L).
    4. Perform a Monte Carlo rollout from the leaf with the cheap (fast) policy.
    5. Record the rollout reward z (= ±1).
    6. The value of the new leaf node is the mixture: V(s_L) = (1 - λ) v(s_L) + λ z.
    7. Update every Q(s,a) on the path as the average of the values V of the nodes below it.
  END REPEAT.
  Pick the most visited move from the root node: MOVE.

Computation

Conclusions
• AlphaGo is a mix of supervised deep learning, reinforcement learning, and Monte Carlo Tree Search.
• AlphaGo uses about 1000x fewer board evaluations than Deep Blue; instead it relies more on machine learning.
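The tree-search loop above can be sketched in a few lines of Python. This is a minimal illustration, not AlphaGo's implementation: the exploration constant `c_puct` and the √(total visits) factor in the bonus are assumptions (the slides only say that u(s,a) grows with the prior and shrinks with visits), and the names `Edge`, `select_move`, and `backup` are hypothetical.

```python
import math

class Edge:
    """Statistics stored for one candidate move a from a node s."""
    def __init__(self, prior):
        self.P = prior   # prior probability P(s,a) from the policy network
        self.N = 0       # visit count N(s,a)
        self.W = 0.0     # total value of all simulations through this edge

    @property
    def Q(self):
        """Mean value Q(s,a): average of values of simulations below it."""
        return self.W / self.N if self.N else 0.0

def select_move(edges, c_puct=1.0):
    """Step 1: pick argmax_a [ Q(s,a) + u(s,a) ], u shrinking with visits."""
    total_visits = sum(e.N for e in edges.values())
    def score(e):
        u = c_puct * e.P * math.sqrt(total_visits + 1) / (1 + e.N)
        return e.Q + u
    return max(edges, key=lambda a: score(edges[a]))

def backup(path, v):
    """Step 7: update every edge on the chosen path with the leaf value v."""
    for e in path:
        e.N += 1
        e.W += v  # Q then equals the running average of backed-up values

# Usage: two candidate moves; the higher-prior move is explored first,
# but a losing simulation shifts the search toward the alternative.
edges = {"a1": Edge(prior=0.7), "a2": Edge(prior=0.3)}
first = select_move(edges)       # "a1": higher prior, no visits yet
backup([edges[first]], v=-1.0)   # a losing rollout lowers Q(a1)
second = select_move(edges)      # "a2" now wins the Q + u comparison
```

Note how the bonus term implements the trade-off stated on the slide: an unvisited move keeps a large u(s,a), while a heavily visited move is chosen mainly on its empirical value Q(s,a).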