Deep Learning and Reinforcement Learning
Razvan Pascanu (Google DeepMind)
17 August 2015
Disclaimers:
- Slides based on David Silver's Lecture Notes
- From a DL perspective
- Not complete, but rather biased and focused
- It is meant to make you want to learn this
What is Reinforcement Learning?
(Diagram: Supervised Learning, Unsupervised Learning, Reinforcement Learning)
Laundry list of differences for RL
- Moving target
- Active learning
- Weak error signal
- Long term dependencies
- Exploration/Exploitation
RL problem
- Reward – a scalar feedback signal
- Goal – pick the sequence of actions that maximizes the cumulative reward

Reward Hypothesis: all goals can be described by the maximization of expected cumulative reward.
RL problem
- Agent and Environment (interaction loop sketched below)
Source: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf
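The loop in the figure can be summarised in a few lines of Python. This is a toy sketch with a made-up environment and a random policy; none of the names refer to an actual library:

import random

# Toy sketch of the agent-environment loop; env_step is a made-up environment.
def env_step(state, action):
    # returns (next observation, reward, episode finished?)
    return state + 1, random.random(), state + 1 >= 10

state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([0, 1])                 # the agent picks an action
    state, reward, done = env_step(state, action)  # environment returns observation and reward
    total_reward += reward                         # cumulative reward the agent tries to maximize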
RL problem
- History – the sequence of observations and actions
- State – the information used to decide what happens next (MDP/POMDP), satisfying the Markov property:

P(S_t | S_{t-1}) = P(S_t | S_1, ..., S_{t-1})
Inside the RL agents
An RL agent has one or more of these components:
- Policy – given a state, provides a distribution over the actions
- Value function – given a state (or state/action pair), estimates the expected future reward
- Model – the agent's representation of the world (used for planning)
RL agents taxonomy
(Diagram: Venn diagram over Value Function, Policy and Model, with Actor-Critic at the intersection of value function and policy)
Policy based methods (digression)
- Effective in high-dimensional / continuous action spaces
- Can learn stochastic policies
- Better convergence properties
- Noisy gradients!
REINFORCE
Directly maximize the cumulative reward!

J(θ) = E_{π_θ}[r] = Σ_s d(s) Σ_a π_θ(s, a) r_{s,a}

Maximize J. Using the log trick we have:

∂J/∂θ = Σ_s d(s) Σ_a π_θ(s, a) (∂ log π_θ(s, a)/∂θ) r_{s,a} = E_{π_θ}[ (∂ log π/∂θ) r ]
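As a toy illustration of the estimator above, here is a minimal numpy sketch of REINFORCE for a softmax policy over three actions in a single state (a bandit); the rewards below are made up for illustration:

import numpy as np

theta = np.zeros(3)                              # policy parameters
rewards = np.array([1.0, 0.0, 0.5])              # hypothetical r_{s,a}
for _ in range(1000):
    pi = np.exp(theta) / np.exp(theta).sum()     # softmax policy pi_theta(a)
    a = np.random.choice(3, p=pi)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                        # d log pi(a) / d theta for a softmax
    theta += 0.1 * grad_log_pi * rewards[a]      # sample of E_pi[ (d log pi / d theta) r ]
# the policy concentrates on the highest-reward action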
Primer: Dynamic Programming
(Diagram: a small directed graph from START to END with rewards on the edges)
Primer: Dynamic Programming
(Diagram: the same graph with the best cumulative reward computed at each node, working backwards from END)
Primer: Dynamic Programming
Bellman's Principle of Optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy (Bellman, 1957).

V(x) = max_{a ∈ Γ(x)} { F(x, a) + β V(T(x, a)) }
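A minimal sketch of this recursion on a made-up deterministic graph; F holds immediate rewards, T the transitions, and both are illustrative only:

# Bellman recursion V(x) = max_a { F(x, a) + beta * V(T(x, a)) } on a tiny graph.
F = {"start": {"up": 3.0, "down": 5.0}, "mid": {"go": 10.0}}       # immediate rewards
T = {"start": {"up": "mid", "down": "end"}, "mid": {"go": "end"}}  # transitions
beta = 1.0
V = {"end": 0.0}                       # terminal state has value 0
for x in ["mid", "start"]:             # successors are solved before their predecessors
    V[x] = max(F[x][a] + beta * V[T[x][a]] for a in F[x])
# V["start"] = max(3 + 10, 5 + 0) = 13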
Q-Values
The Q-value Q(x, a) is the expected cumulative reward for picking action a in state x.
We can act greedily or ε-greedily:

π(a_i | x) = 1 − ε   if Q(a_i, x) > Q(a_j, x) ∀j
             ε       otherwise
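A minimal sketch of ε-greedy action selection over a list of Q-values (pure Python, illustrative only):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # with probability epsilon explore, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

action = epsilon_greedy([0.2, 1.3, -0.5])   # returns action 1 most of the time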
Q-learning
Think of Q-values as the length of the path in the graph. Use dynamic programming (the Bellman equation):

Q̂_t(x_t, a_t) = r_{x_t, a_t} + β max_a Q_t(x_{t+1}, a)   ⇒

Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + γ (Q̂_t(x_t, a_t) − Q_t(x_t, a_t))

where γ is the learning rate and the update term comes from the derivative of the squared error (Q̂ − Q)²; i.e. we regress Q towards Q̂ using SGD.
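A minimal tabular sketch of this update on a single, made-up transition; as on the slide, γ is the learning rate and β the discount:

from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)], zero-initialised
gamma, beta = 0.1, 0.99                # learning rate and discount

def q_update(x_t, a_t, r, x_tp1, actions):
    q_hat = r + beta * max(Q[(x_tp1, a)] for a in actions)   # Bellman target Q_hat
    Q[(x_t, a_t)] += gamma * (q_hat - Q[(x_t, a_t)])         # regress Q towards Q_hat

q_update("x0", "left", 1.0, "x1", ["left", "right"])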
Finally using deep learning for RL
What role does Deep Learning play in RL?
- provides a compact form for Q (a function approximator)

θ_{t+1} = θ_t + γ (Q̂(x_t, a_t) − Q_{θ_t}(x_t, a_t)) ∂Q_θ/∂θ

i.e. the update follows the derivative of the squared error, ∂(Q̂ − Q)²/∂θ.
Q-learning in Theano? (Theano pseudocode)
# W, b, Wout, bout are Theano shared variables (the network parameters), defined elsewhere
x = TT.vector("x")
q = TT.dot(Wout, TT.nnet.relu(TT.dot(W, x) + b)) + bout
forward = theano.function([x], q)
Q-learning in Theano? (Theano pseudocode)
x = TT.vector("x")
q = TT.dot(Wout, TT.nnet.relu(TT.dot(W, x) + b)) + bout
params = [W, b, Wout, bout]
target_q = TT.scalar("target_q")
action = TT.iscalar("action")
lr = TT.scalar("lr")
# gradient of the squared error (q[action] - target_q)^2 w.r.t. the parameters
gparams = TT.grad((q[action] - target_q) ** 2, params)
learn = theano.function([x, action, target_q, lr], [],
                        updates=[(p, p - lr * gp) for p, gp in zip(params, gparams)])
Q-learning in Theano? (Theano pseudocode)
for x, x_tp1, act, reward in memory:       # iterate over stored transitions
    # note: the Bellman target would normally include the discount β in front of the max
    target_q = reward + forward(x_tp1).max()
    learn(x, act, target_q, 1e-3)
But learning can be tricky
Consecutive samples are highly correlated (e.g. x_t = 2, x_{t+1} = 2, x_{t+2} = 2), and correlated samples break learning.
Solution: replay buffer
(Diagram: the agent's transitions (x_t, a_t, r_t) are stored in a replay buffer; Q-learning updates θ from samples drawn out of the buffer)
High variance and minibatches
- Reinforcement Learning is inherently sequential
- The replay buffer gives an elegant way to employ minibatches (sketched below)
- Minibatches mean reduced variance in the gradients
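A minimal sketch of such a replay buffer with uniform minibatch sampling; the capacity and batch size here are arbitrary, not the DQN settings:

import random
from collections import deque

buffer = deque(maxlen=100000)              # old transitions are discarded when full

def store(x_t, a_t, r_t, x_tp1):
    buffer.append((x_t, a_t, r_t, x_tp1))  # transitions are stored as they arrive

def sample_minibatch(batch_size=32):
    # uniform sampling breaks the correlation between consecutive transitions
    return random.sample(buffer, batch_size)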
Target network
- Q̂ changes as fast as Q (a moving target)
- Fix Q̂ (the target network) and update it only periodically (sketched below)
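A minimal sketch of the target-network trick, assuming the parameters are kept as a list of numpy arrays; the sync period is illustrative:

import numpy as np

def maybe_sync(step, params, target_params, sync_every=10000):
    # every sync_every updates, freeze the current Q parameters as the target Q_hat
    if step % sync_every == 0:
        for p, tp in zip(params, target_params):
            np.copyto(tp, p)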
Other details
- SGD can be slow... rely on RMSProp (or any newer optimizer)
- Convolutional models are more efficient than MLPs
- DQN uses an action repeat of 4
- DQN receives 4 frames of the game at a time (grayscale)
- ε is annealed from 1 to 0.1 (a minimal schedule is sketched below)
- Training takes time (roughly 12-14 days)
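For the ε annealing, a minimal linear schedule looks like this; the number of annealing steps is illustrative, not the DQN value:

def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1000000):
    # linearly anneal epsilon from start to end over anneal_steps, then keep it at end
    frac = min(step / float(anneal_steps), 1.0)
    return start + frac * (end - start)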
Results
(Plots: average reward per episode vs. training epochs on Breakout and Seaquest)
Source: Mnih et al., Human-level control through deep reinforcement learning, Nature 2015
Results
Source: Mnih et al., Human-level control through deep reinforcement learning, Nature 2015
Results
Nature paper videos
Parallelization – Gorila
Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop
Results
(Plot: number of games beating the highest score vs. training time in days)
Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop
Results
(Bar chart: per-game scores of GORILA and DQN relative to the human score across the Atari games, split into games at human-level or above and games below human-level)
Source: Nair et al., Massively Parallel Methods for Deep Reinforcement Learning, ICML DL workshop
Where this doesn't work (straightforwardly)
- Continuous control
- Robotics (experience is very expensive)
- Sparse rewards (Montezuma!?)
- Long term correlations (Montezuma!?)

But this does not mean that RL+DL cannot be the solution!
DeepMind Research
Source: http://deepmind.com/publications.html
And we are hiring
[email protected]
Thank you
Questions?
Possible exercise for the afternoon sessions I
Pick one or several tasks from the Deep Learning Tutorials:
- Logistic Regression
- MLP
- AutoEncoders / Denoising AutoEncoders
- Stacked Denoising AutoEncoders
Possible exercise for the afternoon sessions II
Compare different initializations for neural networks (MLPs and ConvNets) with rectifiers or tanh. In particular, compare (minimal sketches below):
- The initialization proposed by Glorot et al.
- Sampling uniformly from [−1/fan_in, 1/fan_in]
- Setting all singular values to 1 (and biases to 0)
How do different optimization algorithms help with these initializations? Extra kudos for interesting plots or analysis. Please make use of the Deep Learning Tutorials.
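Minimal numpy sketches of the three initializations; the shapes and conventions are one possible choice, to be treated as a starting point rather than a reference implementation:

import numpy as np

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))    # Glorot & Bengio (2010)
    return np.random.uniform(-limit, limit, (fan_in, fan_out))

def small_uniform(fan_in, fan_out):
    return np.random.uniform(-1.0 / fan_in, 1.0 / fan_in, (fan_in, fan_out))

def unit_singular_values(fan_in, fan_out):
    W = np.random.randn(fan_in, fan_out)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return np.dot(U, Vt)                         # all singular values equal to 1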
Possible exercise for the afternoon sessions III
Requires convolutions
Re-implement the AutoEncoder tutorial using convolutions both in the encoder and
the decoder. Extra kudos for allowing pooling (or strides) in the encoder.
Possible exercise for the afternoon sessions IV
Requires Reinforcement Learning
Attempt to solve the Catch game.