DeepMind's AlphaGo
Max Welling
University of Amsterdam
Man vs. Machine: 1-4
Machine Learning Ingredients
• Supervised deep learning
• Reinforcement learning (self-play)
• Monte Carlo Tree Search
Deep Convolutional Networks
Forward: filter, nonlinearity, subsample, filter, nonlinearity, subsample, …, predict
Backward: backpropagation (determines how you need to turn the parameter knobs)
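The forward pass above can be sketched in plain NumPy. This is a toy, assuming a single 3×3 filter, ReLU as the nonlinearity, and 2×2 max-pooling as the subsample step (the real networks stack many such layers):

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D convolution (cross-correlation) of input x with filter w."""
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def relu(x):
    return np.maximum(0.0, x)

def subsample(x):
    """2x2 max-pooling."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# Forward: filter -> nonlinearity -> subsample
board = np.random.randn(8, 8)     # toy 8x8 input "image"
filt = np.random.randn(3, 3)
features = subsample(relu(conv2d(board, filt)))   # shape (3, 3)
```

Backpropagation would then push gradients through these same operations in reverse to adjust the filter weights.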
Reinforcement Learning
"Learning from rewards received by interacting with the world"
• State: "board configuration"
• Reward: "I won/lost"
• Action: "move: where to drop my stone"
The Four Deep Networks of AlphaGo
The Policy Networks
(supervised learning)
Deep CNN trained from recorded games to predict the probability p_σ(a|s) that an expert would move a stone into that position.
A fast "shallow" rollout version p_π(a|s) is trained the same way.
a = expert move
s = board state
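The supervised objective can be sketched with a toy linear "network" standing in for the deep CNN: maximize the log-probability the model assigns to the recorded expert move (all sizes and labels here are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(0)
n_points = 361                      # the 361 points of a 19x19 board
W = rng.normal(scale=0.01, size=(n_points, n_points))
s = rng.normal(size=n_points)       # toy flattened board encoding
a_expert = 42                       # recorded expert move (toy label)

p = softmax(W @ s)                  # p_sigma(a|s): distribution over moves
loss = -np.log(p[a_expert])         # cross-entropy on the expert's move

# One gradient step: d loss / d logits = p - onehot(a_expert)
grad_logits = p.copy()
grad_logits[a_expert] -= 1.0
W -= 0.1 * np.outer(grad_logits, s)
new_loss = -np.log(softmax(W @ s)[a_expert])   # lower than before the step
```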
The Policy Network
(Reinforcement Learning)
Deep CNN trained to predict the optimal move
to win by playing against older versions of itself
using reinforcement learning.
s = board configuration
a ~ p_ρ(a|s) = sampled move
z = +1 if the game was won
z = −1 if the game was lost
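The self-play update can be sketched as a REINFORCE step (again with a toy linear policy in place of the deep CNN): scale the log-probability gradient of the played move by the game outcome z, so winning moves become more likely and losing moves less likely:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(1)
n_moves = 9                          # toy 3x3 board
W = np.zeros((n_moves, n_moves))
s = rng.normal(size=n_moves)

p = softmax(W @ s)
a = rng.choice(n_moves, p=p)         # a ~ p_rho(a|s): sample a move
z = 1.0                              # suppose this game ended in a win

# REINFORCE: push the played move up if z = +1, down if z = -1
grad_log = -p.copy()                 # d log p(a|s) / d logits ...
grad_log[a] += 1.0                   # ... = onehot(a) - p
W += 0.1 * z * np.outer(grad_log, s)
p_new = softmax(W @ s)               # played move is now more probable
```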
The Value Network
Network to predict the value v_θ(s) of a state, trained on a dataset generated from self-play using the RL policy p_ρ.
Note the difference with the policy networks: regression onto a single state, not all possible (legal) moves.
s = board configuration
z = +1 if the game was won
z = −1 if the game was lost
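The regression can be sketched with a toy tanh-squashed linear model standing in for the deep CNN, fit by gradient descent on the squared error between v_θ(s) and the outcome z (the dataset here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9
theta = np.zeros(n)

# Toy "self-play" dataset: board encodings S with outcomes z = +/-1
S = rng.normal(size=(100, n))
true_w = rng.normal(size=n)
Z = np.sign(S @ true_w)             # pretend outcomes depend linearly on s

for _ in range(200):                # gradient descent on mean squared error
    v = np.tanh(S @ theta)          # v_theta(s), squashed into [-1, 1]
    grad = ((v - Z) * (1 - v ** 2)) @ S / len(S)
    theta -= 0.5 * grad

mse = np.mean((np.tanh(S @ theta) - Z) ** 2)   # much lower than the initial 1.0
```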
Monte Carlo Tree Search
• Every move has a value: Q(s,a)
• Every path gets an "exploration bonus" U(s,a)
• Pick the next node according to: argmax_a [Q(s,a) + U(s,a)]
• U(s,a) decreases with more visits: U(s,a) ∝ P(s,a) / (1 + N(s,a)), where N(s,a) is the visit count and P(s,a) the policy network's prior
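The selection rule above can be sketched directly. The node layout, move names, and exploration constant below are illustrative; the prior P(s,a) is assumed to come from the policy network and N(s,a) is the visit count:

```python
def select_move(children):
    """Pick a maximizing Q(s,a) + U(s,a), with U(s,a) = c * P(s,a) / (1 + N(s,a))."""
    c = 1.0  # exploration constant (assumed value, for illustration)

    def score(a):
        ch = children[a]
        return ch["Q"] + c * ch["P"] / (1 + ch["N"])

    return max(children, key=score)

# Toy node with two legal moves: prior P, visit count N, value Q
children = {
    "d4":  {"P": 0.6, "N": 10, "Q": 0.10},   # strong so far, but heavily visited
    "q16": {"P": 0.4, "N": 1,  "Q": 0.00},   # weaker so far, barely explored
}
best = select_move(children)   # the shrinking bonus favors the under-visited move
```

Because U shrinks as 1/(1+N), a move that keeps getting visited must keep justifying itself through Q alone, which drives the search toward under-explored alternatives.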
AlphaGo's Monte Carlo Tree Search
FOR EVERY MOVE DO: REPEAT:
1. Choose a path in the existing tree by maximizing (over a) Q(s,a) + U(s,a)
2. Expand leaf node: evaluate the policy prior p_σ(a|s) for all children
3. Compute value at node using the value network: v_θ(s)
4. Perform Monte Carlo run to the end of the game with the cheap policy p_π
5. Record reward R (= ±1).
6. Value of new leaf node is: V = (1 − λ) v_θ(s) + λ R
7. Update all Q(s,a) as the average of all values V of nodes below it
7. Update all Q(s,a) as the average of all values V of nodes below it
END REPEAT.
Pick the most visited move from the root node: MOVE
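The loop above can be sketched as one simulation of the search. The three networks are replaced by toy stand-ins here, and λ mixes the value-network estimate with the rollout reward as in step 6:

```python
import random

LAMBDA = 0.5  # mixing weight between value-network estimate and rollout reward

def simulate(tree, priors, value_net, rollout, root):
    """One MCTS simulation: select by Q+U, expand, evaluate, back up."""
    path, s = [root], root
    while tree[s]["children"]:                      # 1. descend by Q(s,a) + U(s,a)
        s = max(tree[s]["children"],
                key=lambda a: tree[a]["Q"] + tree[a]["P"] / (1 + tree[a]["N"]))
        path.append(s)
    for a, p in priors(s):                          # 2. children get policy priors
        tree[a] = {"children": [], "P": p, "N": 0, "Q": 0.0, "W": 0.0}
        tree[s]["children"].append(a)
    v = value_net(s)                                # 3. value network at the leaf
    r = rollout(s)                                  # 4-5. cheap-policy rollout, R = +/-1
    leaf = (1 - LAMBDA) * v + LAMBDA * r            # 6. mixed leaf value
    for n in path:                                  # 7. back up running averages
        tree[n]["N"] += 1
        tree[n]["W"] += leaf
        tree[n]["Q"] = tree[n]["W"] / tree[n]["N"]

# Toy run: dummy networks, states keyed by move histories
tree = {(): {"children": [], "P": 1.0, "N": 0, "Q": 0.0, "W": 0.0}}
priors = lambda s: [(s + ("a",), 0.7), (s + ("b",), 0.3)]    # fake p_sigma
value_net = lambda s: 0.2                                    # fake v_theta
rollout = lambda s: random.choice([1, -1])                   # fake p_pi rollout
for _ in range(20):
    simulate(tree, priors, value_net, rollout, ())
move = max(tree[()]["children"], key=lambda a: tree[a]["N"])  # most-visited move
```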
Computation
Conclusions
• AlphaGo is a mix of supervised deep learning, reinforcement learning,
and Monte Carlo Tree Search.
• AlphaGo uses roughly 1000× fewer board evaluations than Deep Blue;
instead it relies more on machine learning.