A Crash Course in Reinforcement Learning

Oliver Schulte
Simon Fraser University
Outline
• What is Reinforcement Learning?
• Key Definitions
• Key Learning Tasks
• Reinforcement Learning Techniques
• Reinforcement Learning with Neural Nets
Learning To Act
• So far: learning to predict
• Now: learn to act
– In engineering: control theory
– Economics, operations research: decision and game theory
• Examples:
– fly helicopter
– drive car
– play Go
– play soccer
RL at a glance
[Figure: agent-environment interaction loop. The agent observes the state, chooses an action, and receives a reward from the environment, generating the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]
Goal: Learn to choose actions that maximize
r0 + γr1 + γ²r2 + ..., where 0 < γ < 1
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf
Acting in Action
• Autonomous Helicopter
– An example of imitation learning: start by observing human actions
• Learning to play video games
– “Deep Q works best when it lives in the moment”
• Learn to flip pancakes
Markov Decision Processes
• Recall Markov process (MP)
– state = vector x ≅ s of input variable values
– can contain hidden variables = partially observable (POMDP)
– transition probability P(s’|s)
• Markov reward process (MRP) = MP + rewards r
• Markov decision process (MDP) = MRP + actions a
• Markov game = MDP with actions and rewards for > 1 agent
Model Parameters: transition probabilities
• Markov process: P(s(t+1)|s(t))
• MDP: P(s(t+1)|s(t),a(t)) and the expected reward E(r(t+1)|s(t),a(t))
• recall basketball example
• also hockey example
• grid example (David Poole's demo)
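To make these model parameters concrete, here is a minimal Python sketch of an MDP stored as plain tables; the two states, two actions, and all numbers are invented for illustration. P[s][a] holds the transition probabilities P(s'|s,a) and R[s][a] holds the expected reward E(r|s,a).

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# P[s][a][s'] = P(s'|s,a); R[s][a] = E(r|s,a).
P = {
    "s0": {"left":  {"s0": 0.9, "s1": 0.1},
           "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left":  {"s0": 0.7, "s1": 0.3},
           "right": {"s0": 0.0, "s1": 1.0}},
}
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 5.0},
}

# Sanity check: the transition probabilities out of every (s,a) pair sum to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9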
Returns and discounting
• A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),...
• The return is the total sum of rewards.
• But: if the trajectory is infinite, we have an infinite sum!
• Solution: weight by a discount factor γ between 0 and 1.
• Return = r(0) + γr(1) + γ²r(2) + ...
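A minimal sketch of this computation for a finite trajectory, assuming the rewards are collected in a Python list and using an illustrative discount factor γ = 0.9:

def discounted_return(rewards, gamma=0.9):
    """Compute r(0) + gamma*r(1) + gamma^2*r(2) + ... for a finite list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps with reward 1 each gives 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0]))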
RL Concepts
These 3 functions can be computed by neural networks.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch13.pdf
Policies and Values
• A deterministic policy π is a function that maps states to actions.
– i.e., it tells us how to act.
• Can also be probabilistic.
• Can be implemented using neural nets.
• Given a policy and an MDP, we have the expected return from using the policy at a state.
• Notation: Vπ(s)
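A minimal sketch of what a policy looks like for the toy MDP above (the action choices and probabilities are arbitrary): a deterministic policy is just a state-to-action table, while a probabilistic policy maps each state to a distribution over actions.

# Deterministic policy: one action per state (choices are hypothetical).
pi = {"s0": "right", "s1": "right"}

# Probabilistic policy: a distribution over actions per state.
pi_stochastic = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.1, "right": 0.9},
}

def act(state):
    """Act according to the deterministic policy pi."""
    return pi[state]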
Optimal Policies
• A policy π* is optimal if, for any other policy π and for all states s,
Vπ*(s) ≥ Vπ(s)
• The value of the optimal policy is written as V*(s).
The action value function
• Given a policy π, the expected return from taking action a in state s and then following π is denoted Qπ(s,a).
• Similarly, Q*(s,a) is the value of an action under the optimal policy.
• grid example
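Given Q*, acting optimally reduces to picking the action with the highest action value in each state. A minimal sketch with a hypothetical Q* table for the toy MDP above:

# Hypothetical optimal action values Q*(s,a).
Q_star = {
    "s0": {"left": 2.0, "right": 4.5},
    "s1": {"left": 3.1, "right": 9.0},
}

def greedy_action(q, state):
    """Return the action with the largest action value in the given state."""
    return max(q[state], key=q[state].get)

print(greedy_action(Q_star, "s0"))  # "right"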
Two Learning Problems
• Prediction: For a fixed policy, learn Vπ(s).
• Control: For a given MDP, learn V*(s) (the optimal policy).
• Variants for Q-function.
Model-Based Learning
Data → estimate transition probabilities → dynamic programming → value function
Bellman equation:
Vπ(s) = Σa π(a|s) [ E(r|s,a) + γ Σs' P(s'|s,a) Vπ(s') ]
• Developed for transition probabilities that are "nice": discrete, Gaussian, Poisson, ...
• grid example
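A minimal sketch of model-based prediction by dynamic programming: iterate the Bellman equation until the values stop changing, using the hypothetical P, R, and deterministic policy pi from the earlier sketches and an assumed γ = 0.9.

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively apply V(s) <- E(r|s,pi(s)) + gamma * sum_s' P(s'|s,pi(s)) * V(s')."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            a = pi[s]
            new_v = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

print(policy_evaluation(P, R, pi))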
Model-free Learning
• Bypass estimating transition probabilities.
• Why? Continuous state variables, no "nice" functional form.
• (How about using an LSTM/RNN dynamics model? Deep dynamic programming?)
Model-free Learning
• Directly learn the optimal policy π* (policy iteration).
• Directly learn the optimal value function V*.
• Directly learn the optimal action-value function Q*.
• All of these functions can be implemented in a neural network.
– NN learning = reinforcement learning
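As an illustration of the last point, here is a minimal sketch of an action-value network, assuming PyTorch is available; the state dimension, number of actions, and layer sizes are invented for the example. The network maps a state vector to one Q-value per action, and the greedy action is the argmax of its output.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2   # hypothetical sizes

# Small feed-forward network: state features in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)

state = torch.randn(1, STATE_DIM)   # a dummy state vector
q_values = q_net(state)             # shape (1, N_ACTIONS)
greedy = q_values.argmax(dim=1)     # index of the greedy action
print(q_values, greedy)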
Model-free Learning: What are the data?
• The data are simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),...; this alone doesn't tell us expected values or optimal actions.
• Monte Carlo learning: to learn V, observe the return at the end of each episode.
• e.g., ChessBase gives the percentage of wins by White for any position.
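A minimal sketch of (every-visit) Monte Carlo value estimation, assuming each episode is given as a list of (state, reward) pairs and reusing the discounted_return helper from the earlier sketch: V(s) is estimated as the average of the discounted returns observed after visits to s.

from collections import defaultdict

def monte_carlo_V(episodes, gamma=0.9):
    """Estimate V(s) as the average discounted return observed after each visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (s, _) in enumerate(episode):
            # Return from time t onward, observed at the end of the episode.
            returns[s].append(discounted_return(rewards[t:], gamma))
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}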
Temporal Difference Learning
• Consistency idea: using the current model and the given data s(0),a(0),r(0),s(1),a(1),r(1),..., estimate
1. the value V(s(t)) at the current state
2. the next-step value V1(s(t)) = r(t) + γV(s(t+1))
3. Minimize the "error" [V1(s(t)) − V(s(t))]²
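A minimal sketch of the tabular TD(0) update implied by this error term; the step size alpha = 0.1 is an assumed value.

def td0_update(V, s, r, s_next, gamma=0.9, alpha=0.1):
    """Move V(s) a small step toward the one-step target r + gamma * V(s_next)."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

# Usage: initialize V to zero and apply the update along an observed trajectory.
V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", 1.0, "s1")
print(V)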
Model-Free Learning Example
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html