Lecture 7: Value Function Approximation

Joseph Modayil
Outline
1 Introduction
2 Incremental Methods
3 Batch Methods
Introduction
Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
Backgammon: 10^20 states
Computer Go: 10^170 states
Helicopter/Mountain Car: continuous state space
Robots: informal state space (physical universe)
How can we scale up the model-free methods for prediction and
control from the last two lectures?
Value Function Approximation
So far we have represented the value function by a lookup table
Every state s has an entry V (s)
Or every state-action pair s, a has an entry Q(s, a)
Problem with large MDPs:
There are too many states and/or actions to store in memory
It is too slow to learn the value of each state individually
Solution for large MDPs:
Estimate value function with function approximation
Vθ (s) ≈ v π (s)
or Qθ (s, a) ≈ q π (s, a)
Generalise from seen states to unseen states
Update parameter θ using MC or TD learning
Which Function Approximator?
There are many function approximators, e.g.
Artificial neural network
Decision tree
Nearest neighbour
Fourier / wavelet bases
Coarse coding
In principle, any function approximator can be used. However, the
choice may be affected by some properties of RL:
Experience is not i.i.d. - successive time-steps are correlated
During control, value function v π (s) is non-stationary
Agent’s actions affect the subsequent data it receives
Feedback is delayed, not instantaneous
Classes of Function Approximation
Tabular (No FA): a table with an entry for each MDP state
State Aggregation: Partition environment states
Linear function approximation: fixed features (or fixed kernel)
Differentiable (nonlinear) function approximation: neural nets
So what should you choose? Depends on your goals.
Top: good theory but weak performance
Bottom: excellent performance but weak theory
Linear function approximation is a useful middle ground
Neural nets now commonly give the highest performance
Incremental Methods
Gradient Descent
Let J(θ) be a differentiable function of parameter vector θ
Define the gradient of J(θ) to be
∇θ J(θ) = ( ∂J(θ)/∂θ1 , . . . , ∂J(θ)/∂θn )>
To find a local minimum of J(θ)
Adjust the parameter θ in the direction of the negative gradient
∆θ = −½ α ∇θ J(θ)
where α is a step-size parameter
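As a concrete illustration (not from the slides), a minimal sketch of this scaled step ∆θ = −½ α ∇θ J(θ) on a simple quadratic objective J(θ) = ||θ||²:

import numpy as np

def grad_J(theta):
    # Gradient of the illustrative objective J(theta) = ||theta||^2
    return 2.0 * theta

theta = np.array([1.0, -2.0])
alpha = 0.1
for _ in range(100):
    theta = theta - 0.5 * alpha * grad_J(theta)   # Delta theta = -1/2 * alpha * grad J
print(theta)   # approaches the minimiser [0, 0]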
Value Function Approx. By Stochastic Gradient Descent
Goal: find parameter vector θ minimising mean-squared error
between approximate value fn Vθ (s) and true value fn v π (s)
J(θ) = Eπ[(vπ(S) − Vθ(S))²]
Note: the notation Eπ[·] means that the random variable S is drawn from a distribution induced by π: Eπ[f(S)] = Σs dπ(s) f(s)
Gradient descent finds a local minimum
∆θ = −½ α ∇θ J(θ) = α Eπ[(vπ(S) − Vθ(S)) ∇θ Vθ(S)]
Stochastic gradient descent samples the gradient
∆θ = α(v π (s) − Vθ (s))∇θ Vθ (s)
Expected update is equal to full gradient update
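Filling in the single differentiation step between the objective and the update (a short derivation, consistent with the definitions above):

\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_\pi\!\left[(v^\pi(S) - V_\theta(S))^2\right]
  = -2\,\mathbb{E}_\pi\!\left[(v^\pi(S) - V_\theta(S))\,\nabla_\theta V_\theta(S)\right]

\Delta\theta = -\tfrac{1}{2}\alpha\,\nabla_\theta J(\theta)
  = \alpha\,\mathbb{E}_\pi\!\left[(v^\pi(S) - V_\theta(S))\,\nabla_\theta V_\theta(S)\right]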
Feature Vectors
Represent state by a feature vector
φ(s) = ( φ1(s), . . . , φn(s) )>
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
Linear Value Function Approximation
Approximate value function by a linear combination of features
Vθ(s) = φ(s)>θ = Σ_{j=1}^{n} φj(s) θj
Objective function is quadratic in parameters θ
J(θ) = Eπ[(vπ(S) − φ(S)>θ)²]
Stochastic gradient descent converges on global optimum
Update rule is particularly simple
∇θ Vθ (s) = φ(s)
∆θ = α(v π (s) − Vθ (s))φ(s)
Update = step-size × prediction error × feature vector
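A minimal sketch of this update rule in Python (the feature vector and sampled target below are placeholder numbers, not from the slides):

import numpy as np

def linear_vfa_update(theta, phi_s, v_target, alpha):
    """One SGD step: update = step-size x prediction error x feature vector."""
    v_hat = phi_s @ theta                  # V_theta(s) = phi(s)^T theta
    return theta + alpha * (v_target - v_hat) * phi_s

# Example usage with made-up numbers
theta = np.zeros(4)
phi_s = np.array([1.0, 0.5, 0.0, 2.0])    # phi(s)
theta = linear_vfa_update(theta, phi_s, v_target=3.0, alpha=0.1)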
Table Lookup Features
Table lookup can be implemented as a special case of linear
value function approximation
Let the n states be given by S = {s(1), . . . , s(n)}.
Using table lookup features
φtable(s) = ( 1(s = s(1)), . . . , 1(s = s(n)) )>
Parameter vector θ gives value of each individual state
V(s) = ( 1(s = s(1)), . . . , 1(s = s(n)) ) · ( θ1, . . . , θn )>
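A small sketch of table lookup as one-hot linear function approximation (the state indexing and sizes are illustrative):

import numpy as np

def one_hot(state_index, n_states):
    """Table-lookup features: 1(s = s^(i)) for each state."""
    phi = np.zeros(n_states)
    phi[state_index] = 1.0
    return phi

n_states = 5
theta = np.zeros(n_states)       # one parameter per state = the table entries
phi = one_hot(2, n_states)
value = phi @ theta              # equals theta[2], the table entry for that state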
Coarse Coding
Example of linear value function approximation:
Coarse coding provides large feature vector φ(s)
Parameter vector θ gives a value to each feature
[Figure: the original state representation is expanded into a representation with many coarse features, which is then linearly combined to form the approximation.]
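A rough sketch of coarse coding in one dimension, using overlapping intervals as binary features (the interval layout and width are arbitrary choices for illustration):

import numpy as np

def coarse_code(x, centers, width):
    """One binary feature per overlapping interval: 1 if x falls inside it."""
    return np.array([1.0 if abs(x - c) <= width / 2 else 0.0 for c in centers])

centers = np.linspace(0.0, 1.0, 20)       # 20 overlapping intervals on [0, 1]
phi = coarse_code(0.37, centers, width=0.2)
# Nearby states share active features, so updates to theta generalise locally.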
Generalization in Coarse Coding
Stochastic Gradient Descent with Coarse Coding
Incremental Prediction Algorithms
Have assumed true value function v π (s) given by supervisor
But in RL there is no supervisor, only rewards
In practice, we substitute a target for v π (s)
For MC, the target is the return Gt
∆θ = α(Gt − Vθ(s))∇θ Vθ(s)
For TD(0), the target is the TD target r + γVθ(s′)
∆θ = α(r + γVθ(s′) − Vθ(s))∇θ Vθ(s)
For TD(λ), the target is the λ-return Gtλ
∆θ = α(Gtλ − Vθ(s))∇θ Vθ(s)
Monte-Carlo with Value Function Approximation
The return Gt is an unbiased, noisy sample of true value v π (s)
Can therefore apply supervised learning to “training data”:
⟨S1, G1⟩, ⟨S2, G2⟩, . . . , ⟨ST, GT⟩
For example, using linear Monte-Carlo policy evaluation
∆θ = α(Gt − Vθ (s))∇θ Vθ (s)
= α(Gt − Vθ (s))φ(s)
Monte-Carlo evaluation converges to a local optimum
Even when using non-linear value function approximation
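A compact sketch of linear Monte-Carlo evaluation from one episode (the episode data and feature map are stand-ins, not from the slides):

import numpy as np

def mc_evaluate(theta, episode, feature_fn, alpha, gamma):
    """episode: list of (state, reward) pairs in time order."""
    states, rewards = zip(*episode)
    G = 0.0
    returns = []
    for r in reversed(rewards):           # compute returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, G_t in zip(states, returns):   # supervised-style updates on <S_t, G_t>
        phi = feature_fn(s)
        theta = theta + alpha * (G_t - phi @ theta) * phi
    return theta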
TD Learning with Value Function Approximation
The TD-target Rt+1 + γVθ (St+1 ) is a biased sample of true
value v π (St )
Can still apply supervised learning to “training data”:
⟨S1, R2 + γVθ(S2)⟩, ⟨S2, R3 + γVθ(S3)⟩, . . . , ⟨ST−1, RT⟩
For example, using linear TD(0)
∆θ = α(r + γVθ(s′) − Vθ(s))∇θ Vθ(s)
   = αδ φ(s)
Linear TD(0) converges (close) to global optimum
TD(λ) with Value Function Approximation
The λ-return Gtλ is also a biased sample of true value v π (s)
Can again apply supervised learning to “training data”:
⟨S1, G1λ⟩, ⟨S2, G2λ⟩, . . . , ⟨ST−1, GT−1λ⟩
Forward view linear TD(λ)
∆θ = α(Gtλ − Vθ (St ))∇θ Vθ (St )
= α(Gtλ − Vθ (St ))φ(St )
Backward view linear TD(λ)
δt = Rt+1 + γVθ (St+1 ) − Vθ (St )
et = γλet−1 + φ(St )
∆θ = αδt et
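A minimal sketch of the backward-view update with an accumulating eligibility trace (the feature map and transition data are placeholders):

import numpy as np

def td_lambda_step(theta, e, phi_s, r, phi_s_next, alpha, gamma, lam):
    """One backward-view linear TD(lambda) step; returns updated (theta, e)."""
    delta = r + gamma * (phi_s_next @ theta) - (phi_s @ theta)   # TD error
    e = gamma * lam * e + phi_s                                  # eligibility trace
    theta = theta + alpha * delta * e
    return theta, e

n = 8
theta, e = np.zeros(n), np.zeros(n)
# theta, e = td_lambda_step(theta, e, phi_s, r, phi_s_next, 0.1, 0.99, 0.8)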
Control with Value Function Approximation
[Diagram: generalised policy iteration with action-value function approximation — starting from θ, repeatedly apply approximate policy evaluation (Qθ ≈ qπ) and ε-greedy improvement (π = ε-greedy(Qθ)), moving towards Qθ ≈ Q*.]
Policy evaluation: approximate policy evaluation, Qθ ≈ qπ
Policy improvement: ε-greedy policy improvement
Action-Value Function Approximation
Approximate the action-value function
Qθ (s, a) ≈ q π (s, a)
Minimise mean-squared error between approximate
action-value fn Qθ (s, a) and true action-value fn q π (s, a)
J(θ) = Eπ[(qπ(S, A) − Qθ(S, A))²]
Here, Eπ [] means both S and A are drawn from a distribution
induced by π.
Use stochastic gradient descent to find a local minimum
−½ ∇θ J(θ) = (qπ(s, a) − Qθ(s, a)) ∇θ Qθ(s, a)
∆θ = α(qπ(s, a) − Qθ(s, a)) ∇θ Qθ(s, a)
Linear Action-Value Function Approximation
Represent state and action by a feature vector
φ(s, a) = ( φ1(s, a), . . . , φn(s, a) )>
Represent action-value fn by linear combination of features
Qθ(s, a) = φ(s, a)>θ = Σ_{j=1}^{n} φj(s, a) θj
Stochastic gradient descent update
∇θ Qθ (s, a) = φ(s, a)
∆θ = α(qπ(s, a) − Qθ(s, a))φ(s, a)
Incremental Linear Control Algorithms
As in prediction, we must substitute a target for qπ(s, a)
For MC, the target is the return Gt
∆θ = α(Gt − Qθ(St, At))φ(St, At)
For SARSA(0), the target is the TD target
Rt+1 + γQ(St+1 , At+1 )
∆θ = α(Rt+1 + γQθ (St+1 , At+1 ) − Qθ (St , At ))φ(St , At )
For forward-view Sarsa(λ), target is the λ-return with
action-values
∆θ = α(Gtλ − Qθ (St , At ))φ(St , At )
For backward-view Sarsa(λ), equivalent update is
δt = Rt+1 + γQθ (St+1 , At+1 ) − Qθ (St , At )
et = γλet−1 + φ(St , At )
∆θ = αδt et
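A sketch of one backward-view linear Sarsa(λ) step combined with ε-greedy action selection (the feature map, action set and environment interaction are placeholders, not from the slides):

import numpy as np

def q_hat(theta, phi_sa):
    return phi_sa @ theta

def epsilon_greedy(theta, s, actions, feature_fn, epsilon, rng):
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    values = [q_hat(theta, feature_fn(s, a)) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_lambda_step(theta, e, s, a, r, s_next, a_next,
                      feature_fn, alpha, gamma, lam):
    phi, phi_next = feature_fn(s, a), feature_fn(s_next, a_next)
    delta = r + gamma * q_hat(theta, phi_next) - q_hat(theta, phi)
    e = gamma * lam * e + phi
    return theta + alpha * delta * e, e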
Linear Sarsa with Coarse Coding in Mountain Car
Linear Sarsa with Radial Basis Functions in Mountain Car
Study of λ: Should We Bootstrap?
Convergence Questions
The previous results show it is desirable to bootstrap
But now we consider convergence issues
When do incremental prediction algorithms converge?
When using bootstrapping (i.e. TD with λ < 1)?
When using linear value function approximation?
When using off-policy learning?
Ideally, we would like algorithms that converge in all cases
Baird’s Counterexample
Parameter Divergence in Baird’s Counterexample
Convergence of Prediction Algorithms
On/Off-Policy   Algorithm   Table Lookup   Linear   Non-Linear
On-Policy       MC          ✓              ✓        ✓
                TD(0)       ✓              ✓        ✗
                TD(λ)       ✓              ✓        ✗
Off-Policy      MC          ✓              ✓        ✓
                TD(0)       ✓              ✗        ✗
                TD(λ)       ✓              ✗        ✗
Gruesome Threesome
We have not quite achieved our ideal goal for prediction algorithms.
Gradient Temporal-Difference Learning
TD does not follow the gradient of any objective function
This is why TD can diverge when off-policy or using
non-linear function approximation
Gradient TD follows true gradient of projected Bellman error
On/Off-Policy   Algorithm     Table Lookup   Linear   Non-Linear
On-Policy       MC            ✓              ✓        ✓
                TD            ✓              ✓        ✗
                Gradient TD   ✓              ✓        ✓
Off-Policy      MC            ✓              ✓        ✓
                TD            ✓              ✗        ✗
                Gradient TD   ✓              ✓        ✓
Convergence of Control Algorithms
In practice, the tabular control learning algorithms are
extended to find a control policy (with linear FA or with
neural nets).
In theory, many aspects of control are not as simple to specify
under function approximation.
e.g. the starting state distribution is required before specifying an optimal policy, unlike in the tabular setting. The optimal policy can differ when starting from state s(1) and from state s(2), but the state aggregation may not be able to distinguish between them.
Such situations commonly arise in large environments (e.g. robotics), and tracking is often preferred to convergence (continually adapting the policy instead of converging to a fixed policy).
Batch Methods
Batch Reinforcement Learning
Gradient descent is simple and appealing
But it is not sample efficient
Batch methods seek to find the best fitting value function for a given set of past experience (“training data”)
Least Squares Prediction
Given value function approximation Vθ (s) ≈ v π (s)
And experience D consisting of ⟨state, estimated value⟩ pairs
D = { ⟨S1, V̂1π⟩, ⟨S2, V̂2π⟩, . . . , ⟨ST, V̂Tπ⟩ }
Which parameters θ give the best fitting value fn Vθ(s)?
Least squares algorithms find parameter vector θ minimising sum-squared error between Vθ(St) and target values V̂tπ,
LS(θ) = Σ_{t=1}^{T} (V̂tπ − Vθ(St))² = E_D[(V̂π − Vθ(s))²]
Stochastic Gradient Descent with Experience Replay
Given experience consisting of ⟨state, value⟩ pairs
D = { ⟨S1, V̂1π⟩, ⟨S2, V̂2π⟩, . . . , ⟨ST, V̂Tπ⟩ }
Repeat:
1. Sample state, value from experience: ⟨s, V̂π⟩ ∼ D
2. Apply stochastic gradient descent update: ∆θ = α(V̂π − Vθ(s))∇θ Vθ(s)
Converges to least squares solution
θπ = argminθ LS(θ)
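A small sketch of experience replay with SGD over stored ⟨state, value⟩ pairs (the dataset and feature map here are placeholders):

import numpy as np

def replay_sgd(theta, dataset, feature_fn, alpha, n_updates, rng):
    """dataset: list of (state, value_estimate) pairs collected earlier."""
    for _ in range(n_updates):
        s, v_hat = dataset[rng.integers(len(dataset))]   # sample <s, v> ~ D
        phi = feature_fn(s)
        theta = theta + alpha * (v_hat - phi @ theta) * phi
    return theta

rng = np.random.default_rng(0)
# theta = replay_sgd(np.zeros(4), dataset, feature_fn, 0.05, 10_000, rng)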
Linear Least Squares Prediction
Experience replay finds least squares solution
But it may take many iterations
Using linear value function approximation Vθ (s) = φ(s)> θ
We can solve for the least squares solution directly
Linear Least Squares Prediction (2)
At the minimum of LS(θ), the expected update must be zero
E_D[∆θ] = 0
α Σ_{t=1}^{T} φ(St)(V̂tπ − φ(St)>θ) = 0
Σ_{t=1}^{T} φ(St)V̂tπ = Σ_{t=1}^{T} φ(St)φ(St)>θ
θ = ( Σ_{t=1}^{T} φ(St)φ(St)> )⁻¹ Σ_{t=1}^{T} φ(St)V̂tπ
For N features, direct solution time is O(N³)
Incremental solution time is O(N²) using Sherman-Morrison
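One way to realise the O(N²) incremental version is to maintain the inverse of Σ φ(St)φ(St)> with the Sherman-Morrison formula; a sketch under that assumption (variable names and the ridge initialisation are illustrative):

import numpy as np

def sherman_morrison_update(A_inv, b, phi, v_hat):
    """Incrementally maintain A_inv = (sum phi phi^T)^-1 and b = sum phi * v_hat."""
    Au = A_inv @ phi
    A_inv = A_inv - np.outer(Au, Au) / (1.0 + phi @ Au)   # rank-1 inverse update
    b = b + phi * v_hat
    return A_inv, b                                       # theta = A_inv @ b

n = 4
A_inv = np.eye(n) / 1e-3       # start from (epsilon * I)^-1 so the inverse exists
b = np.zeros(n)
# A_inv, b = sherman_morrison_update(A_inv, b, phi, v_hat); theta = A_inv @ b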
Linear Least Squares Prediction Algorithms
We do not know true values vtπ (have estimates V̂tπ )
In practice, our “training data” must use noisy or biased
samples of vtπ
LSMC Least Squares Monte-Carlo uses return
vtπ ≈ Gt
LSTD Least Squares Temporal-Difference uses TD target
vtπ ≈ Rt+1 + γVθ (St+1 )
LSTD(λ) Least Squares TD(λ) uses λ-return
vtπ ≈ Gtλ
In each case solve directly for fixed point of MC / TD / TD(λ)
Linear Least Squares Prediction Algorithms (2)
LSMC
0 = Σ_{t=1}^{T} α(Gt − Vθ(St))φ(St)
θ = ( Σ_{t=1}^{T} φ(St)φ(St)> )⁻¹ Σ_{t=1}^{T} φ(St)Gt

LSTD
0 = Σ_{t=1}^{T} α(Rt+1 + γVθ(St+1) − Vθ(St))φ(St)
θ = ( Σ_{t=1}^{T} φ(St)(φ(St) − γφ(St+1))> )⁻¹ Σ_{t=1}^{T} φ(St)Rt+1

LSTD(λ)
0 = Σ_{t=1}^{T} α δt et
θ = ( Σ_{t=1}^{T} et(φ(St) − γφ(St+1))> )⁻¹ Σ_{t=1}^{T} et Rt+1
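A compact LSTD sketch assembling these sums from transition data (the feature map, transition list and small ridge term are placeholders, not from the slides):

import numpy as np

def lstd(transitions, feature_fn, gamma, reg=1e-3):
    """transitions: list of (s, r, s_next); returns the LSTD fixed-point theta."""
    n = len(feature_fn(transitions[0][0]))
    A = reg * np.eye(n)                        # small ridge term for invertibility
    b = np.zeros(n)
    for s, r, s_next in transitions:
        phi, phi_next = feature_fn(s), feature_fn(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)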
Convergence of Linear Least Squares Prediction Algorithms
On/Off-Policy   Algorithm   Table Lookup   Linear   Non-Linear
On-Policy       MC          ✓              ✓        ✓
                LSMC        ✓              ✓        -
                TD          ✓              ✓        ✗
                LSTD        ✓              ✓        -
Off-Policy      MC          ✓              ✓        ✓
                LSMC        ✓              ✓        -
                TD          ✓              ✗        ✗
                LSTD        ✓              ✓        -
Least Squares Policy Iteration
[Diagram: least squares policy iteration — starting from θ, alternate approximate policy evaluation (Qθ ≈ qπ) with greedy improvement (π = greedy(Qθ)), moving towards Qθ ≈ Q*.]
Policy evaluation: policy evaluation by least squares Q-learning
Policy improvement: greedy policy improvement
Least Squares Action-Value Function Approximation
Approximate action-value function q π (s, a)
using linear combination of features φ(s, a)
Qθ (s, a) = φ(s, a)> θ ≈ q π (s, a)
Minimise least squares error between Qθ (s, a) and q π (s, a)
from experience generated using policy π
consisting of ⟨(state, action), value⟩ pairs
D = { ⟨(S1, A1), V̂1π⟩, ⟨(S2, A2), V̂2π⟩, . . . , ⟨(ST, AT), V̂Tπ⟩ }
Least Squares Control
For policy evaluation, we want to efficiently use all experience
For control, we also want to improve the policy
This experience is generated from many policies
So to evaluate q π (s, a) we must learn off-policy
We use the same idea as Q-learning:
Use experience generated by old policy
St , At , Rt+1 , St+1 ∼ πold
Consider alternative successor action a′ = πnew(St+1)
Update Qθ(St, At) towards value of alternative action
Rt+1 + γQθ(St+1, a′)
Least Squares Q-Learning
Consider the following linear Q-learning update
δ = Rt+1 + γQθ (St+1 , π(St+1 )) − Qθ (St , At )
∆θ = αδφ(St , At )
LSTDQ algorithm: solve for total update = zero
0 = Σ_{t=1}^{T} α(Rt+1 + γQθ(St+1, π(St+1)) − Qθ(St, At))φ(St, At)
θ = ( Σ_{t=1}^{T} φ(St, At)(φ(St, At) − γφ(St+1, π(St+1)))> )⁻¹ Σ_{t=1}^{T} φ(St, At)Rt+1
Least Squares Policy Iteration Algorithm
The following pseudocode uses LSTDQ for policy evaluation
It repeatedly re-evaluates experience D with different policies
function LSPI-TD(D, π0)
    π′ ← π0
    repeat
        π ← π′
        Q ← LSTDQ(π, D)
        for all s ∈ S do
            π′(s) ← argmax_{a∈A} Q(s, a)
        end for
    until (π ≈ π′)
    return π
end function
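A rough Python sketch of this loop, with LSTDQ evaluated on a fixed dataset of transitions (the feature map, action set, convergence test and regularisation are all illustrative choices, not from the slides):

import numpy as np

def lstdq(data, policy, feature_fn, gamma, reg=1e-3):
    """data: list of (s, a, r, s_next). Evaluate `policy` off-policy by least squares."""
    n = len(feature_fn(data[0][0], data[0][1]))
    A, b = reg * np.eye(n), np.zeros(n)
    for s, a, r, s_next in data:
        phi = feature_fn(s, a)
        phi_next = feature_fn(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)

def lspi(data, feature_fn, actions, gamma, n_iters=20):
    theta = np.zeros(len(feature_fn(data[0][0], data[0][1])))
    for _ in range(n_iters):
        # Greedy policy with respect to the current Q_theta
        policy = lambda s, th=theta: max(
            actions, key=lambda a: feature_fn(s, a) @ th)
        new_theta = lstdq(data, policy, feature_fn, gamma)
        if np.allclose(new_theta, theta, atol=1e-4):   # pi ≈ pi' (via theta)
            break
        theta = new_theta
    return theta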
Chain Walk Example
[Figure: "The problematic MDP" — a four-state chain with actions L and R; each action moves in its intended direction with probability 0.9 and in the opposite direction with probability 0.1; reward r = 1 in two of the states and r = 0 in the others. The embedded excerpt also shows the per-action polynomial basis used in the original LSPI study, φ(s, a) = ( I(a = L)·1, I(a = L)·s, I(a = L)·s², I(a = R)·1, I(a = R)·s, I(a = R)·s² )>.]
Consider the 50-state version of this problem
Reward +1 in states 10 and 41, 0 elsewhere
Optimal policy: R (1-9), L (10-25), R (26-41), L (42-50)
Features: 10 evenly spaced Gaussians (σ = 4) for each action
Experience: 10,000 steps from a random walk policy
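A sketch of the per-action Gaussian feature construction described above (the number of centres, σ and state range follow the slide; encoding each action as its own feature block is an assumption):

import numpy as np

def chain_features(s, a, n_states=50, n_centers=10, sigma=4.0, actions=("L", "R")):
    """Per-action block of evenly spaced Gaussian features over the chain states."""
    centers = np.linspace(1, n_states, n_centers)
    rbf = np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))
    phi = np.zeros(n_centers * len(actions))
    i = actions.index(a)
    phi[i * n_centers:(i + 1) * n_centers] = rbf   # only the chosen action's block is active
    return phi

phi = chain_features(17, "R")   # feature vector for state 17, action R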
LSPI in Chain Walk: Action-Value Function
[Figure: action-value functions over the 50 chain states at LSPI iterations 1-7 — exact values (solid blue) vs. function approximation (red dashes).]
Questions?