Approximate value function
Vladimir Li
Jiexiong Tang
RPL KTH
March 21
Overview
1. Previously in this course
2. Why?
3. How?
   MSVE
   Linear models
   Gradient descent
   Feature Construction
Previously in this course
Dynamic programming
Monte Carlo
TD(0)
Problems with huge state space
Backgammon: 10^20 states
Game of Go: 10^170 states
Robot: continuous, multi-dimensional state space
Why is a large state space an issue for RL?
How can we scale up the model-free methods from the previous lectures?
Approximation of Value function
Until now: represent the value function by a lookup table
   For every state s, an entry V(s)
   For each state-action pair, an entry Q(s, a)
Models with a large state space:
   Too many states to store
   Too slow to learn a value for each state
Solution: use value function approximation
   v̂θ : s → vπ(s)
   Generalize from seen states
   Learn the parameters θ
Find approximation for the "true" value function
If vπ(s) is the true value function and v̂(s, θ) is its approximation,
then we can write the objective of the prediction problem as:

MSVE(θ) ≐ Σ_{s∈S} d(s) [vπ(s) − v̂(s, θ)]²
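For concreteness, here is a minimal Python sketch of this objective for a finite state space and a linear v̂ (as introduced later in the slides); the arrays `d`, `v_pi`, and `features` are illustrative placeholders, not part of the lecture material.

```python
import numpy as np

def msve(theta, d, v_pi, features):
    """MSVE(theta) = sum_s d(s) * (v_pi(s) - v_hat(s, theta))^2.

    d        : (n_states,) on-policy state distribution
    v_pi     : (n_states,) true state values under the policy
    features : (n_states, n_features) feature matrix phi(s) for a linear v_hat
    """
    v_hat = features @ theta          # linear approximation theta^T phi(s)
    return np.sum(d * (v_pi - v_hat) ** 2)
```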
On-policy distribution for episodic tasks

η(s) = h(s) + Σ_{s̄} η(s̄) Σ_a π(a|s̄) p(s|s̄, a)

d(s) = η(s) / Σ_{s′} η(s′)

where:
η(s): number of time steps spent, on average, in state s in a single episode
h(s): probability that an episode starts in s
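A minimal sketch of how η and d could be computed for a small episodic MDP by fixed-point iteration, assuming NumPy arrays `h` (start distribution), `pi[s_bar, a]` (policy), and `p[s_bar, a, s]` (transition probabilities) in which terminal states have no outgoing transitions; these names are illustrative, not from the slides.

```python
import numpy as np

def on_policy_distribution(h, pi, p, n_iters=1000):
    """eta(s) = h(s) + sum_{s_bar} eta(s_bar) * sum_a pi(a|s_bar) * p(s|s_bar, a),
    solved by fixed-point iteration; returns d(s) = eta(s) / sum_s' eta(s')."""
    # State-to-state transition matrix under pi: P[s_bar, s] = sum_a pi(a|s_bar) p(s|s_bar, a).
    P = np.einsum('ba,bas->bs', pi, p)
    eta = h.copy()
    for _ in range(n_iters):           # converges when terminal rows of P are zero
        eta = h + eta @ P
    return eta / eta.sum()
```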
MSVE

MSVE(θ) ≐ Σ_{s∈S} d(s) [vπ(s) − v̂(s, θ)]²
Linear models
A function approximation using linear combinations of features:
   The approximation is linear in the weights θ
   For each state s, there is a real-valued feature vector φ(s)
   e.g., Polynomial, Fourier, Coarse Coding, Tile Coding, RBF, etc.

v̂(s, θ) = θᵀφ(s) = Σ_{i=1}^{n} θi φi(s)

The gradient with respect to θ is:

∇θ v̂(s, θ) = φ(s)
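A minimal sketch of a linear value-function approximator and its gradient, assuming a user-supplied feature function `phi` (an illustrative name).

```python
import numpy as np

def v_hat(s, theta, phi):
    """Linear value estimate: v_hat(s, theta) = theta^T phi(s)."""
    return theta @ phi(s)

def grad_v_hat(s, theta, phi):
    """For a linear model, the gradient w.r.t. theta is just the feature vector."""
    return phi(s)
```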
Linear models
Simple
Computationally efficient
Work well in practice
Come with convergence guarantees
Table Lookup (Special case)
Table lookup is a special case of linear value function approximation:

v̂(s, θ) = θᵀφ(s)

If we have n states, create the table-lookup features as follows:

φ(s) = [1(s = s1), ..., 1(s = sn)]ᵀ

Then the value function approximation is:

v̂(s, θ) = θᵀφ(s) = Σ_{i=1}^{n} θi 1(s = si)
1000-state random walk example
Reward:
   +1 if ST = 1000
   −1 if ST = 0
   0 otherwise
Actions:
   Go k steps to the left
   Go k steps to the right
   where k is uniformly sampled from [1, 100]
1000-state random walk example
State aggregation:

θ = [θ1, ..., θ10]ᵀ

φ(s) = [1(1 ≤ s ≤ 100), ..., 1(901 ≤ s ≤ 1000)]ᵀ

i.e., 10 groups of 100 states each.
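A minimal sketch of this state-aggregation feature map, assuming states are numbered 1..1000 and split into 10 groups of 100 (the function name is illustrative).

```python
import numpy as np

def phi_aggregate(s, n_states=1000, n_groups=10):
    """One-hot group-membership feature: phi_i(s) = 1 if s falls in group i."""
    group_size = n_states // n_groups          # 100 states per group
    phi = np.zeros(n_groups)
    phi[(s - 1) // group_size] = 1.0           # states 1..100 -> group 0, etc.
    return phi
```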
Goal
We are looking for the optimal parameters θ*:

MSVE(θ*) ≤ MSVE(θ)

for all θ, or at least for all θ in a neighborhood of θ*.
Gradient descent
J(θ) is a differentiable objective function.
The gradient of J(θ) is defined as:

∇θ J(θ) = [∂J(θ)/∂θ1, ..., ∂J(θ)/∂θn]ᵀ

The gradient descent update is:

θ ← θ + ∆θ,   ∆θ = −½ α ∇θ J(θ)

where α is a positive step-size parameter.
Stochastic gradient descent
Goal: find the parameter vector θ that minimizes MSVE(θ), i.e. the
difference between the approximated value v̂(s, θ) and the "true" value
vπ(s). If we assume that states appear with the distribution d, we can
write the gradient update for a single sampled state S as:

∆θ = −½ α ∇θ J(θ)
   = −½ α ∇θ [vπ(S) − v̂(S, θ)]²
   = α [vπ(S) − v̂(S, θ)] ∇θ v̂(S, θ)

Since vπ is unknown, it is substituted with a target output Ut:

∆θ = α (Ut − v̂(S, θ)) ∇θ v̂(S, θ)
Ut ≐ Gt (MC)
Monte Carlo:

Ut ≐ Gt

where Gt is the return, i.e. the total discounted reward.
Update rule:

θ ← θ + α[Gt − v̂(St, θ)] ∇θ v̂(St, θ)

Gt is an unbiased estimate of vπ(St) → θ is guaranteed to converge (to a local optimum).
Gradient Monte Carlo Algorithm
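The algorithm box on this slide is a figure and is not reproduced here; the following is a minimal Python sketch of gradient Monte Carlo prediction with a linear v̂, assuming a simple environment with `reset()` and `step(a)` returning `(next_state, reward, done)` and a fixed `policy(s)` (all names are illustrative assumptions).

```python
import numpy as np

def gradient_mc_prediction(env, policy, phi, n_features,
                           alpha=0.1, gamma=1.0, n_episodes=1000):
    """Every-visit gradient MC: theta <- theta + alpha*[G_t - v_hat(S_t, theta)]*phi(S_t)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        # Generate one episode following the policy.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            states.append(s)
            rewards.append(r)
            s = s_next
        # Accumulate returns backwards and update theta for every visited state.
        G = 0.0
        for S_t, R_t1 in zip(reversed(states), reversed(rewards)):
            G = R_t1 + gamma * G
            theta += alpha * (G - theta @ phi(S_t)) * phi(S_t)
    return theta
```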
1000-state random walk example
State aggregation:

θ = [θ1, ..., θ10]ᵀ

We start in state 631 and terminate in state 1000.
What is the MC update at t = 0?
   d(s) is uniform
   initial θ is zero
   α = 0.1
   γ = 1
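A sketch of the answer, under the assumptions that the only non-zero reward is the +1 on termination in state 1000 (so with γ = 1 the return is Gt = 1 at every step of this episode) and that state 631 falls in the seventh aggregation group (states 601–700):

∆θ = α [G0 − v̂(S0, θ)] ∇θ v̂(S0, θ) = 0.1 · (1 − 0) · φ(631)

so only the seventh component of θ changes: θ7 ← 0 + 0.1 = 0.1, while all other components stay at 0.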
1000-state random walk example
State aggregation:

θ = [θ1, ..., θ10]ᵀ

In the same episode, at t = 1 we are in state 650.
What is the MC update at t = 1?
   d(s) is uniform
   initial θ is zero
   α = 0.1
   γ = 1
Ut ≐ Rt+1 + γ v̂(St+1, θ) (TD(0))
TD(0):

Ut ≐ Rt+1 + γ v̂(St+1, θ)

Update rule:

θ ← θ + α[Rt+1 + γ v̂(St+1, θ) − v̂(St, θ)] ∇θ v̂(St, θ)

Ut is a bootstrapped, biased estimate of vπ(S).
Semi-gradient TD(0)
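The algorithm box here is also a figure; below is a minimal Python sketch of semi-gradient TD(0) with a linear v̂, under the same illustrative environment interface as the Monte Carlo sketch above.

```python
import numpy as np

def semi_gradient_td0(env, policy, phi, n_features,
                      alpha=0.1, gamma=1.0, n_episodes=1000):
    """theta <- theta + alpha*[R + gamma*v_hat(S') - v_hat(S)] * phi(S).
    The gamma*v_hat(S') term is not differentiated, which is what makes
    this a *semi*-gradient method."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_next = 0.0 if done else theta @ phi(s_next)   # terminal value is 0
            td_error = r + gamma * v_next - theta @ phi(s)
            theta += alpha * td_error * phi(s)
            s = s_next
    return theta
```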
Semi-gradient TD(0) convergence proof
Substitute the linear gradient ∇θ v̂(St, θ) = φt into the semi-gradient TD(0) update:

θt+1 = θt + α (Rt+1 + γ θtᵀφt+1 − θtᵀφt) φt
     = θt + α (Rt+1 φt − φt (φt − γφt+1)ᵀ θt)

The expected next weight vector can be written as:

E[θt+1 | θt] = θt + α (b − A θt)

where b = E[Rt+1 φt] and A = E[φt (φt − γφt+1)ᵀ]
Semi-gradient TD(0) convergence proof
The TD fixed point is the θ at which the expected update is zero:

θTD = A⁻¹ b

Linear semi-gradient TD(0) converges to this point.
At the TD fixed point, the MSVE is within a bounded expansion of the lowest possible error:

MSVE(θTD) ≤ (1 / (1 − γ)) minθ MSVE(θ)
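For intuition, here is a small sketch (under the same illustrative interface as above) that estimates A and b from sampled transitions and solves for the TD fixed point directly; this is essentially the idea behind least-squares TD and is not an algorithm shown on the slide.

```python
import numpy as np

def td_fixed_point(transitions, phi, n_features, gamma=1.0):
    """Estimate A = E[phi_t (phi_t - gamma*phi_{t+1})^T] and b = E[R_{t+1} phi_t]
    from (s, r, s_next, done) samples and return theta_TD = A^{-1} b."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(n_features) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    A /= len(transitions)
    b /= len(transitions)
    return np.linalg.solve(A, b)   # theta_TD = A^{-1} b
```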
n-step semi-gradient TD
Semi-gradient n-step TD algorithm:

θ ← θ + α [Gt^(n) − v̂(St, θt+n−1)] ∇θ v̂(St, θt+n−1),   0 ≤ t ≤ T

where

Gt^(n) = Rt+1 + γ Rt+2 + ... + γ^(n−1) Rt+n + γ^n v̂(St+n, θt+n−1),   0 ≤ t ≤ T − n
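A minimal sketch of the n-step return and the corresponding update for a linear v̂, assuming the episode's states and rewards are already stored in lists (names and conventions are illustrative; `rewards[k]` holds Rk+1).

```python
import numpy as np

def n_step_return(rewards, states, t, n, theta, phi, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n v_hat(S_{t+n}).
    If t + n reaches the end of the episode, the bootstrap term is dropped."""
    T = len(rewards)
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * (theta @ phi(states[t + n]))
    return G

def n_step_update(theta, G, s_t, phi, alpha=0.1):
    """theta <- theta + alpha * [G - v_hat(S_t, theta)] * phi(S_t)."""
    return theta + alpha * (G - theta @ phi(s_t)) * phi(s_t)
```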
MC vs TD(0)
MC:     θ ← θ + α [Gt − v̂(St, θ)] ∇θ v̂(St, θ)
TD(0):  θ ← θ + α [Rt+1 + γ v̂(St+1, θ) − v̂(St, θ)] ∇θ v̂(St, θ)

The TD(0) target is bootstrapped → a biased estimate
The target depends on the current θ, and we only include a part of the gradient (semi-gradient)
Converges reliably for linear functions (see the proof above)
Faster to learn
Enables continual and online learning
Feature Construction
Value estimates are sums of features times their corresponding weights
Choosing features is one way to apply domain knowledge
Feature Construction - Polynomial
Each d-dimensional polynomial basis function φi can be written as:

φi(s) = Π_{j=1}^{d} sj^(ci,j)

where ci,j ∈ [0, N] and sj is normalized to [0, 1]
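A minimal sketch of polynomial feature construction for a d-dimensional state, assuming the exponents ci,j are integers in {0, ..., N} (an assumption on top of the slide, which only gives the range [0, N]).

```python
import numpy as np
from itertools import product

def polynomial_features(s, N):
    """phi_i(s) = prod_j s_j ** c_{i,j}, one feature per exponent vector
    with entries in {0, ..., N}; s is assumed normalized to [0, 1]^d."""
    s = np.asarray(s, dtype=float)
    exponents = product(range(N + 1), repeat=len(s))   # all (N+1)^d combinations
    return np.array([np.prod(s ** np.array(c)) for c in exponents])
```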
Feature Construction - Fourier
Fourier cosine basis functions:

φi(s) = cos(π ciᵀ s)

where ci,j ∈ [0, N] and sj is normalized to [0, 1]
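A minimal sketch of the Fourier cosine basis, again assuming integer coefficient vectors with entries in {0, ..., N} (an assumption; the slide only states the range [0, N]).

```python
import numpy as np
from itertools import product

def fourier_features(s, N):
    """phi_i(s) = cos(pi * c_i . s) for every coefficient vector c_i in {0,...,N}^d,
    with s normalized to [0, 1]^d."""
    s = np.asarray(s, dtype=float)
    C = np.array(list(product(range(N + 1), repeat=len(s))))   # (n_features, d)
    return np.cos(np.pi * C @ s)
```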
Feature Construction - Coarse Coding
Coarse coding: binary features with overlapping receptive fields (e.g. circles); a state is represented by the set of fields it falls within, and generalization between states depends on the number of features whose receptive fields they share.
Corresponding to each circle, a single weight is learned as a component of θ.
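A minimal sketch of coarse coding with circular receptive fields in 2D; the grid of centers and the radius are illustrative choices, not values from the lecture.

```python
import numpy as np

def coarse_code(s, centers, radius):
    """Binary features: phi_i(s) = 1 if the 2D state s lies inside circle i."""
    s = np.asarray(s, dtype=float)
    dists = np.linalg.norm(centers - s, axis=1)
    return (dists <= radius).astype(float)

# Example: a 5x5 grid of overlapping circles covering the unit square.
xs, ys = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
centers = np.column_stack([xs.ravel(), ys.ravel()])
phi = coarse_code([0.3, 0.7], centers, radius=0.3)
```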
Feature Construction - Tile Coding
Receptive fields of the features are grouped into partitions of the input space.
Each such partition is called a tiling, and each element of the partition is called a tile.
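A minimal sketch of tile coding for a scalar state with several offset tilings; the number of tilings, tiles per tiling, and offsets are illustrative choices, not values from the slides.

```python
import numpy as np

def tile_code(s, n_tilings=4, n_tiles=10, low=0.0, high=1.0):
    """Binary feature vector of length n_tilings * n_tiles for a scalar state s.
    Each tiling partitions [low, high] into n_tiles tiles; successive tilings
    are offset from each other by a fraction of the tile width."""
    phi = np.zeros(n_tilings * n_tiles)
    tile_width = (high - low) / n_tiles
    for k in range(n_tilings):
        offset = k * tile_width / n_tilings            # shift each tiling slightly
        idx = int((s - low + offset) / tile_width)
        idx = min(max(idx, 0), n_tiles - 1)            # clip to a valid tile index
        phi[k * n_tiles + idx] = 1.0                   # exactly one active tile per tiling
    return phi
```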
Figure: tile coding with multiple offsets.
Figure: tile coding compared with direct state aggregation.
Feature Construction - RBF
Radial basis functions (RBFs):

φi(s) = exp(−‖s − ci‖² / (2σ²))

where ci is the center of feature i, σ is its width, and s is normalized to [0, 1]
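A minimal sketch of RBF features, with the centers laid out on a grid and the width σ chosen arbitrarily for illustration.

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 * sigma^2))."""
    s = np.asarray(s, dtype=float)
    sq_dists = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Example: centers on a uniform 5x5 grid over [0, 1]^2.
xs, ys = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
centers = np.column_stack([xs.ravel(), ys.ravel()])
phi = rbf_features([0.2, 0.8], centers, sigma=0.25)
```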
Nonlinear approximation
Artificial neural networks (ANNs) are a widely used nonlinear model:
   Universal approximation
   Deep ANNs are capable of learning features directly from data