Value Function Approximation

Introduction to Reinforcement Learning
Part 4: Value Function Approximation
Rowan McAllister
Reinforcement Learning Reading Group
06 May 2015
Note
I’ve created these slides whilst following, and using figures from,
“Algorithms for Reinforcement Learning” lectures by Csaba Szepesvári,
specifically Section 3.2.
The lectures themselves are available on Professor Szepesvári’s homepage:
http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
For any errors, please email me: [email protected]
Previous talks (block worlds), where the X-space is:
- small, discrete
- values stored in a table, indexed by state
- exact values
Today’s talk, where the X-space is:
- large or infinite, perhaps continuous
- cannot possibly store/tabulate a separate value V(x) per state x
- approximated values
Value Function Approximation
Common form: linear combination of basis functions:
    V_\theta(x) = \theta^\top \phi(x), \quad x \in \mathcal{X}
where
    \theta \in \mathbb{R}^d               (weights)
    \phi : \mathcal{X} \to \mathbb{R}^d   (basis functions)
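As a concrete illustration, here is a minimal NumPy sketch of this linear form. The feature map `phi` and the weights `theta` below are hypothetical placeholders of my own choosing, not taken from the lecture notes.

```python
import numpy as np

def v_theta(x, theta, phi):
    """Approximate value V_theta(x) = theta^T phi(x)."""
    return theta @ phi(x)

# Example: d = 3 hand-picked features of a 1-D state x.
phi = lambda x: np.array([1.0, x, x**2])   # basis functions
theta = np.array([0.5, -0.1, 0.02])        # weights
print(v_theta(2.0, theta, phi))            # 0.5 - 0.2 + 0.08 = 0.38
```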
Basis Function φ Examples
    V_\theta(x) = \theta^\top \phi(x)
1. Radial Basis Functions (RBFs):
    \phi(x) = \big( f(\|x - c_1\|), \ldots, f(\|x - c_d\|) \big)
2. State Aggregation:
    \phi(x) \in \{0, 1\}^d, \quad \text{e.g. } V_\theta(x) = \sum_{i : \phi_i(x) = 1} \theta_i
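A hedged sketch of both feature constructions in NumPy; the centres, bandwidth, and bin edges are illustrative choices of mine, not prescribed by the slides.

```python
import numpy as np

def rbf_features(x, centres, sigma=1.0):
    """phi_i(x) = f(||x - c_i||) with a Gaussian choice of f."""
    dists = np.linalg.norm(np.atleast_2d(x) - centres, axis=1)
    return np.exp(-(dists ** 2) / (2 * sigma ** 2))

def aggregation_features(x, bin_edges):
    """One-hot phi(x) in {0,1}^d: indicates which bin (aggregate state) x falls in."""
    d = len(bin_edges) - 1
    i = np.clip(np.digitize(x, bin_edges) - 1, 0, d - 1)
    phi = np.zeros(d)
    phi[i] = 1.0
    return phi

centres = np.linspace(0.0, 1.0, 5).reshape(-1, 1)            # 5 RBF centres on [0, 1]
print(rbf_features(np.array([0.3]), centres, sigma=0.25))
print(aggregation_features(0.3, np.linspace(0.0, 1.0, 6)))   # 5 bins; one-hot output
```

With state aggregation, V_\theta(x) reduces to the single weight \theta_i of the bin containing x, matching the sum over \{i : \phi_i(x) = 1\} above.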
Curse of Dimensionality
What happens when dim(X) is large, e.g. hundreds?
Usually there is an intrinsic lower-dimensional sub-manifold within the chosen state space.
E.g. multiple video cameras recording a 12-DOF robot: the observational
dimensionality is in the millions, but only 12 intrinsic dimensions exist.
Basis functions that help (a random-projection sketch follows below):
- Hash functions
- Random projections
- Neural nets
- RBFs
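One of the listed options, random projections, can be sketched as follows. This is a toy illustration; the dimensions and the Gaussian projection matrix are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(d_obs, d_feat):
    """Fixed random matrix mapping a high-dimensional observation to d_feat features."""
    return rng.normal(size=(d_feat, d_obs)) / np.sqrt(d_feat)

P = random_projection(d_obs=10_000, d_feat=50)   # e.g. raw pixels -> 50 features
phi = lambda x: P @ x                            # phi(x) in R^50
x = rng.normal(size=10_000)                      # a fake high-dimensional state
print(phi(x).shape)                              # (50,)
```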
TD(λ) with Function Approximation
(with accumulating eligibility traces)
Previously, in the block world (tabular case):
    \delta_{t+1}  = R_{t+1} + \gamma V_t(X_{t+1}) - V_t(X_t)
    z_{t+1}(x)    = \mathbb{I}\{x = X_t\} + \gamma \lambda z_t(x)
    V_{t+1}(x)    = V_t(x) + \alpha_t \delta_{t+1} z_{t+1}(x)
    z_0(x)        = 0
Now, with function approximation:
    \delta_{t+1}  = R_{t+1} + \gamma V_{\theta_t}(X_{t+1}) - V_{\theta_t}(X_t)
    z_{t+1}       = \nabla_\theta V_{\theta_t}(X_t) + \gamma \lambda z_t
    \theta_{t+1}  = \theta_t + \alpha_t \delta_{t+1} z_{t+1}
    z_0           = 0
TD(λ) with Function Approximation
We have data D = \{(X_t, R_t); t > 0\} and now estimate V_\theta(x);
the gradient \nabla_\theta V_\theta must exist.

Algorithm 4: TD(λ) with linear function approximation
(called after each transition)
1: Input: X (last state), Y (next state), R (reward), θ (current value-function parameters), z (eligibility traces)
2: δ ← R + γ · θᵀφ[Y] − θᵀφ[X]        (i.e. R + γ V[Y] − V[X])
3: z ← φ[X] + γ · λ · z               (φ[X] = ∇_θ V_θ(X))
4: θ ← θ + α · δ · z
5: return (θ, z)
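A minimal Python rendering of Algorithm 4, assuming a caller-supplied feature map `phi`, step size `alpha`, and discount/trace parameters `gamma`, `lam` (these names and defaults are mine, not from the notes).

```python
import numpy as np

def td_lambda_update(x, y, r, theta, z, phi, alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) update with linear function approximation
    and accumulating eligibility traces (cf. Algorithm 4)."""
    delta = r + gamma * theta @ phi(y) - theta @ phi(x)  # TD error
    z = phi(x) + gamma * lam * z                         # accumulate traces
    theta = theta + alpha * delta * z                    # gradient-style weight update
    return theta, z

# Usage: call after every observed transition (x, r, y).
d = 3
phi = lambda s: np.array([1.0, s, s**2])
theta, z = np.zeros(d), np.zeros(d)
theta, z = td_lambda_update(x=0.0, y=1.0, r=1.0, theta=theta, z=z, phi=phi)
```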
Least Squares Temporal Difference (LSTD) learning (1)
Writing \phi_t := \phi(X_t):
    \theta_{t+1} = \theta_t + \alpha_t \delta_{t+1} z_{t+1}
    \delta_{t+1} = R_{t+1} + \gamma V_\theta(X_{t+1}) - V_\theta(X_t)
                 = R_{t+1} + \gamma \theta^\top \phi_{t+1} - \theta^\top \phi_t
    z_{t+1}      = \nabla_\theta V_{\theta_t}(X_t) + \gamma \lambda z_t
                 = \sum_{s=0}^{t} (\gamma \lambda)^{t-s} \phi_s
So try to satisfy:
    \mathbb{E}[\delta_{t+1} z_{t+1}] = 0 \approx \frac{1}{T} \sum_t \delta_{t+1} z_{t+1}
Least Squares Temporal Difference (LSTD) learning (2)
    0 = \frac{1}{T} \sum_t \delta_{t+1} z_{t+1}
      = \frac{1}{T} \sum_t \left( R_{t+1} + \gamma \theta^\top \phi_{t+1} - \theta^\top \phi_t \right) z_{t+1}
      = \underbrace{\frac{1}{T} \sum_t R_{t+1} z_{t+1}}_{b} - \underbrace{\frac{1}{T} \sum_t z_{t+1} (\phi_t - \gamma \phi_{t+1})^\top}_{A} \, \theta
So... (if A is invertible)
    \theta = A^{-1} b
Special λ = 0 case: z_{t+1} = \phi_t.
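A sketch of batch LSTD for the λ = 0 special case (so z_{t+1} = φ_t), assuming the data are given as a list of (x, r, y) transitions and a feature map `phi`. The small ridge term `eps` and `np.linalg.solve` (instead of an explicit inverse) are my own additions for numerical safety; for λ > 0 the trace z would instead accumulate (γλ)-discounted features.

```python
import numpy as np

def lstd(transitions, phi, d, gamma=0.99, eps=1e-6):
    """Batch LSTD with lambda = 0: accumulate A and b, then solve A theta = b.
    transitions: iterable of (x, r, y) tuples; phi: state -> R^d."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, y in transitions:
        fx, fy = phi(x), phi(y)
        A += np.outer(fx, fx - gamma * fy)   # the 1/T factors cancel in A^{-1} b
        b += r * fx
    return np.linalg.solve(A + eps * np.eye(d), b)   # small ridge aids invertibility
```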
Recursive-LSTD
What if inverting A is expensive or impossible?
    C_0          = I_d
    C_{t+1}      = C_t - \frac{C_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top C_t}{1 + (\phi_t - \gamma \phi_{t+1})^\top C_t \phi_t}
    \theta_{t+1} = \theta_t + \frac{\delta_{t+1} C_t \phi_t}{1 + (\phi_t - \gamma \phi_{t+1})^\top C_t \phi_t}
LSTD complexity:  O(n d^2 + d^3)
RLSTD complexity: O(n d^2)
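A hedged sketch of the recursive update above (again for the λ = 0 case); the variable names and the online interface are my own.

```python
import numpy as np

def rlstd_update(x, r, y, theta, C, phi, gamma=0.99):
    """One Recursive-LSTD step: update C (a running approximation of A^{-1})
    and theta, without ever forming or inverting A explicitly."""
    fx, fy = phi(x), phi(y)
    u = fx - gamma * fy                           # phi_t - gamma * phi_{t+1}
    denom = 1.0 + u @ C @ fx                      # shared denominator
    delta = r + gamma * theta @ fy - theta @ fx   # TD error
    theta = theta + (delta / denom) * (C @ fx)    # theta update uses the current C_t
    C = C - np.outer(C @ fx, u @ C) / denom       # Sherman-Morrison style C update
    return theta, C

# C is initialised to the identity: C_0 = I_d, e.g. C = np.eye(d).
```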
(R)LSTD Summary
Name: because LSTD minimises the projected squared Bellman error:
    \| \Pi_{\mathcal{F}, \mu} (T V - V) \|^2_\mu
Pros:
1. No \alpha_t to tune
2. Converges faster than TD(λ)