Introduction to Reinforcement Learning
Part 4: Value Function Approximation
Rowan McAllister
Reinforcement Learning Reading Group
06 May 2015

Note
I've created these slides whilst following, and using figures from, the "Algorithms for Reinforcement Learning" lectures by Csaba Szepesvári, specifically section 3.2. The lectures themselves are available on Professor Szepesvári's homepage:
http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
Any errors, please email me: [email protected]

Previous talks (block worlds):
- X-space is small and discrete
- values stored in a table, indexed by state
- exact values

Today's talk:
- X-space is large or infinite, perhaps continuous
- we cannot possibly store / tabulate a separate value V(x) per state x
- approximated values

Value Function Approximation
Common form: a linear combination of basis functions:
  V_θ(x) = θᵀφ(x),  x ∈ X
where
  θ ∈ R^d        (weights)
  φ : X → R^d    (basis functions)

Basis Function φ Examples
V_θ(x) = θᵀφ(x)
1. Radial Basis Functions (RBF):
   φ(x) = ( f(||x − c_1||), ..., f(||x − c_d||) )
2. State Aggregation:
   φ(x) ∈ {0, 1}^d,  e.g. V_θ(x) = Σ_{i : φ_i(x)=1} θ_i

Curse of Dimensionality
What happens when dim(X) is large, e.g. hundreds? Usually the data lie on an intrinsic sub-manifold of lower dimension than the chosen state space. E.g. multiple video cameras recording a 12-DOF robot: the observational dimensionality is in the millions, but only 12 intrinsic dimensions exist.
Basis functions that help:
- Hash functions
- Random projections
- Neural nets
- RBFs

TD(λ) with Function Approximation (with accumulating eligibility traces)
Previously, in the block world:
  δ_{t+1}    = R_{t+1} + γ V_t(X_{t+1}) − V_t(X_t)
  z_{t+1}(x) = I{x = X_t} + γλ z_t(x),   z_0(x) = 0
  V_{t+1}(x) = V_t(x) + α_t δ_{t+1} z_{t+1}(x)
Now, with function approximation:
  δ_{t+1} = R_{t+1} + γ V_{θ_t}(X_{t+1}) − V_{θ_t}(X_t)
  z_{t+1} = ∇_θ V_{θ_t}(X_t) + γλ z_t,   z_0 = 0
  θ_{t+1} = θ_t + α_t δ_{t+1} z_{t+1}

TD(λ) with Function Approximation
Have data D = {(X_t, R_t); t > 0}. Now estimate V_θ(x); the gradient ∇_θ V_θ must exist.

Algorithm 4: TD(λ) with linear function approximation (called after each transition)
1: Input: X (last state), Y (next state), R (reward), θ (current value fn parameters), z (eligibility traces)
2: δ ← R + γ·θᵀφ[Y] − θᵀφ[X]        (θᵀφ[Y] = V[Y],  θᵀφ[X] = V[X])
3: z ← φ[X] + γ·λ·z                 (φ[X] = ∇_θ V_θ(X))
4: θ ← θ + α·δ·z
5: return (θ, z)

Least Squares Temporal Difference (LSTD) learning (1)
  θ_{t+1} = θ_t + α_t δ_{t+1} z_{t+1}
  δ_{t+1} = R_{t+1} + γ V_θ(X_{t+1}) − V_θ(X_t) = R_{t+1} + γ θᵀφ_{t+1} − θᵀφ_t
  z_{t+1} = ∇_θ V_{θ_t}(X_t) + γλ z_t = Σ_{s=0}^{t} (γλ)^{t−s} φ_s
So try to satisfy:
  E[δ_{t+1} z_{t+1}] = 0 ≈ (1/T) Σ_t δ_{t+1} z_{t+1}

Least Squares Temporal Difference (LSTD) learning (2)
  0 = (1/T) Σ_t δ_{t+1} z_{t+1}
    = (1/T) Σ_t (R_{t+1} + γ θᵀφ_{t+1} − θᵀφ_t) z_{t+1}
    = (1/T) Σ_t R_{t+1} z_{t+1}  −  [ (1/T) Σ_t z_{t+1} (φ_t − γφ_{t+1})ᵀ ] θ
    =            b               −                    A                     θ
So (if A is invertible): θ = A⁻¹ b
Special λ = 0 case: z_{t+1} = φ_t.
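To make the LSTD derivation concrete, here is a minimal Python/NumPy sketch (not from the slides): it accumulates b = (1/T) Σ_t R_{t+1} z_{t+1} and A = (1/T) Σ_t z_{t+1} (φ_t − γφ_{t+1})ᵀ over a recorded trajectory, then solves Aθ = b. The function name, the transition-list input format and the pseudo-inverse fallback are assumptions of this sketch.

```python
import numpy as np

def lstd(transitions, gamma, lam, d):
    """Batch LSTD(lambda) for a linear value function V_theta(x) = theta^T phi(x).

    transitions: iterable of (phi_t, r_next, phi_next) triples, where phi_t = phi(X_t),
                 r_next = R_{t+1}, phi_next = phi(X_{t+1})  (names assumed for the sketch)
    gamma: discount factor, lam: trace parameter lambda, d: feature dimension.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)                                   # accumulating trace, z_0 = 0
    for phi_t, r_next, phi_next in transitions:
        z = phi_t + gamma * lam * z                   # z_{t+1} = phi_t + gamma*lambda*z_t
        A += np.outer(z, phi_t - gamma * phi_next)    # sum of z_{t+1}(phi_t - gamma*phi_{t+1})^T
        b += r_next * z                               # sum of R_{t+1} z_{t+1}
    # The 1/T factors cancel when solving A theta = b; fall back to a
    # pseudo-inverse in case A is singular.
    try:
        return np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(A) @ b
```

With λ = 0 the trace reduces to z_{t+1} = φ_t, recovering the special case noted on the slide.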
Recursive-LSTD
What if inverting A is expensive or impossible?
  C_0     = I_d
  C_{t+1} = C_t − C_t φ_t (φ_t − γφ_{t+1})ᵀ C_t / (1 + (φ_t − γφ_{t+1})ᵀ C_t φ_t)
  θ_{t+1} = θ_t + C_t δ_{t+1} φ_t / (1 + (φ_t − γφ_{t+1})ᵀ C_t φ_t)
LSTD complexity:  O(nd² + d³)
RLSTD complexity: O(nd²)

(R)LSTD Summary
Name: because LSTD minimises the projected squared Bellman error:
  ||Π_{F,µ}(TV − V)||²_µ
Pros:
1. No α_t to tune
2. Converges faster than TD(λ)
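For comparison, here is a minimal NumPy sketch of the Recursive-LSTD update in its λ = 0 form (z_{t+1} = φ_t), where C_t plays the role of A⁻¹ and is maintained by a Sherman–Morrison rank-one update instead of an explicit inversion. The class name and per-transition interface are my own choices for the sketch, not part of the slides.

```python
import numpy as np

class RLSTD:
    """Recursive LSTD (lambda = 0): maintains C ~ A^{-1} and theta incrementally."""

    def __init__(self, d):
        self.C = np.eye(d)         # C_0 = I_d, as on the slide
        self.theta = np.zeros(d)   # initial weights (assumed zero)

    def update(self, phi_t, r_next, phi_next, gamma):
        """Process one transition (phi(X_t), R_{t+1}, phi(X_{t+1}))."""
        u = phi_t - gamma * phi_next                   # phi_t - gamma*phi_{t+1}
        denom = 1.0 + u @ self.C @ phi_t               # 1 + (phi_t - gamma*phi_{t+1})^T C_t phi_t
        delta = r_next + gamma * (self.theta @ phi_next) - self.theta @ phi_t  # TD error
        # theta_{t+1} = theta_t + C_t delta_{t+1} phi_t / denom
        self.theta = self.theta + (self.C @ phi_t) * (delta / denom)
        # C_{t+1} = C_t - C_t phi_t (phi_t - gamma*phi_{t+1})^T C_t / denom
        self.C = self.C - np.outer(self.C @ phi_t, u @ self.C) / denom
        return self.theta
```

Each update costs O(d²) rather than the O(d³) of re-solving from scratch; in practice C_0 is often taken to be a scaled identity βI_d, but the slide's choice C_0 = I_d is kept here.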