Approximate value function
Vladimir Li, Jiexiong Tang
RPL, KTH
March 21

Overview
1. Previously in this course
2. Why?
3. How?
   - MSVE
   - Linear models
   - Gradient descent
   - Feature construction

Previously in this course
- Dynamic programming
- Monte Carlo
- TD(0)

Previously in this course [figure]

Problems with huge state spaces
- Backgammon: 10^20 states
- Game of Go: 10^170 states
- Robot: continuous, multi-dimensional state space
- Why is a large state space an issue for RL?
- How do we scale up the model-free methods from the previous lectures?

Approximation of the value function
Until now we have represented the value function by a lookup table:
- for every state s, an entry V(s)
- for each state-action pair, an entry Q(s, a)
Problems with large state spaces:
- too many states to store
- too slow to learn a value for each state individually
Solution: use value function approximation, v̂_θ : s → v_π(s)
- generalize from seen states
- learn a parameter vector θ

Finding an approximation of the "true" value function
If v_π(s) is the true value function and v̂(s, θ) is its approximation, then we can write the objective of the prediction problem as:

MSVE(θ) ≐ Σ_{s∈S} d(s) [v_π(s) − v̂(s, θ)]²

On-policy distribution for episodic tasks

η(s) = h(s) + Σ_{s̄} η(s̄) Σ_a π(a|s̄) p(s|s̄, a)
d(s) = η(s) / Σ_{s'} η(s')

where:
- η(s): number of time steps spent, on average, in state s in a single episode
- h(s): probability that an episode starts in s

MSVE

MSVE(θ) ≐ Σ_{s∈S} d(s) [v_π(s) − v̂(s, θ)]²

Linear models
A function approximation using linear combinations of features:
- the approximation function is linear in the weights θ
- for each state s there is a real-valued feature vector φ(s), e.g. polynomial, Fourier, coarse coding, tile coding, RBF, etc.

v̂(s, θ) = θ^T φ(s) = Σ_{i=1}^n θ_i φ_i(s)

The gradient with respect to θ is:

∇_θ v̂(s, θ) = φ(s)

Linear models
- Simplicity
- Computationally efficient
- Work well in practice
- Convergence guarantees

Table lookup (special case)
Table lookup is a special case of linear value function approximation, v̂(s, θ) = θ^T φ(s). If we have n states, create one table-lookup feature per state:

φ(s) = (1(s = s_1), ..., 1(s = s_n))^T

Then the value function approximation is:

v̂(s, θ) = θ^T φ(s) = Σ_{i=1}^n θ_i 1(s = s_i)
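To make the table-lookup special case concrete, here is a minimal Python sketch (not from the slides; the number of states and the queried state index are illustrative) of a linear value function with one-hot features, showing v̂(s, θ) = θ^T φ(s) and that the gradient with respect to θ is just φ(s).

```python
# Minimal sketch: linear value function with table-lookup (one-hot) features.
import numpy as np

n_states = 5  # illustrative

def phi(s):
    """One-hot feature vector: phi_i(s) = 1(s == s_i)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def v_hat(s, theta):
    """Linear approximation: v_hat(s, theta) = theta^T phi(s)."""
    return theta @ phi(s)

def grad_v_hat(s, theta):
    """For a linear model the gradient w.r.t. theta is the feature vector."""
    return phi(s)

theta = np.zeros(n_states)
print(v_hat(2, theta), grad_v_hat(2, theta))  # 0.0 and a one-hot vector at index 2
```

With one-hot features, each component of θ is exactly the table entry V(s), so updating θ reduces to the tabular updates from the previous lectures.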
1000-state random walk example
Reward:
- +1 if S_T = 1000
- −1 if S_T = 0
- 0 otherwise
Actions:
- go left k steps
- go right k steps
where k is uniformly sampled from [1, 100].

1000-state random walk example
State aggregation:

θ = (θ_1, ..., θ_10)^T,   φ(s) = (1(1 ≤ s < 100), ..., 1(901 ≤ s < 1000))^T

Goal
We are looking for optimal parameters θ*:

MSVE(θ*) ≤ MSVE(θ) for all θ

or at least for all θ in a neighborhood of θ* (a local optimum).

Gradient descent
J(θ) is a differentiable objective function. The gradient of J(θ) is defined as:

∇_θ J(θ) = (∂J(θ)/∂θ_1, ..., ∂J(θ)/∂θ_n)^T

The gradient descent update is:

θ ← θ + Δθ,   Δθ = −(1/2) α ∇_θ J(θ)

where α is a positive step-size parameter.

Stochastic gradient descent
Goal: find the parameter vector θ that minimizes MSVE(θ), i.e. the difference between the approximated value v̂(s, θ) and the "true" value v_π(s). If we assume that sampled states S appear with the same distribution d as in the MSVE, we can write the gradient update as:

Δθ = −(1/2) α ∇_θ J(θ)
   = −(1/2) α ∇_θ [v_π(S) − v̂(S, θ)]²
   = α [v_π(S) − v̂(S, θ)] ∇_θ v̂(S, θ)

v_π is substituted with a target output U_t:

Δθ = α (U_t − v̂(S, θ)) ∇_θ v̂(S, θ)

Monte Carlo: U_t ≐ G_t
where G_t is the return, i.e. the total discounted reward. Update rule:

θ ← θ + α [G_t − v̂(S_t, θ)] ∇_θ v̂(S_t, θ)

G_t is an unbiased estimate of v_π(S_t) → θ is guaranteed to converge.

Gradient Monte Carlo algorithm [algorithm figure]

1000-state random walk example
State aggregation: θ = (θ_1, ..., θ_10)^T. We start in state 631 and terminate in state 1000. What is the MC update at t = 0?
- d(s) is uniform
- the initial θ is zero
- α = 0.1
- γ = 1

1000-state random walk example
In the same episode, at t = 1 we are in state 650. What is the MC update at t = 1?
- d(s) is uniform
- the initial θ is zero
- α = 0.1
- γ = 1

1000-state random walk [figure]

TD(0): U_t ≐ R_{t+1} + γ v̂(S_{t+1}, θ)
Update rule:

θ ← θ + α [R_{t+1} + γ v̂(S_{t+1}, θ) − v̂(S_t, θ)] ∇_θ v̂(S_t, θ)

U_t is a bootstrapped, biased estimate of v_π(S).

Semi-gradient TD(0) [algorithm figure]
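The semi-gradient TD(0) algorithm referenced above can be sketched as follows. This is a minimal Python sketch rather than the slide's pseudocode verbatim: the environment interface (env.reset / env.step returning (next_state, reward, done)), the policy, and the feature map phi are assumptions for illustration.

```python
# Minimal sketch: semi-gradient TD(0) prediction with a linear approximator.
import numpy as np

def semi_gradient_td0(env, policy, phi, n_features, alpha=0.1, gamma=1.0,
                      n_episodes=100):
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()                      # assumed interface
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)    # assumed interface
            v_s = theta @ phi(s)
            # Bootstrapped target U_t = R_{t+1} + gamma * v_hat(S_{t+1}, theta);
            # terminal states are assigned value 0.
            target = r if done else r + gamma * (theta @ phi(s_next))
            # Semi-gradient update: only the gradient of v_hat(S_t, theta) is used,
            # the dependence of the target on theta is ignored.
            theta += alpha * (target - v_s) * phi(s)
            s = s_next
    return theta
```

Note that the target itself depends on θ, but the update differentiates only v̂(S_t, θ); this is exactly what makes the method a semi-gradient method.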
Semi-gradient TD(0) convergence proof
Substitute the linear form into the semi-gradient TD(0) update:

θ_{t+1} = θ_t + α (R_{t+1} + γ θ_t^T φ_{t+1} − θ_t^T φ_t) φ_t        (1)
        = θ_t + α (R_{t+1} φ_t − φ_t (φ_t − γ φ_{t+1})^T θ_t)        (2)

The expected next weight vector can be written as:

E[θ_{t+1} | θ_t] = θ_t + α (b − A θ_t)                               (3)

where b = E[R_{t+1} φ_t] and A = E[φ_t (φ_t − γ φ_{t+1})^T].

Semi-gradient TD(0) convergence proof
The TD fixed point is found where the expected update is zero:

θ_TD = A^{-1} b                                                      (4)

Linear semi-gradient TD(0) converges to this point, and at the fixed point the MSVE is within a bounded expansion of the lowest possible error:

MSVE(θ_TD) ≤ (1 / (1 − γ)) min_θ MSVE(θ)                             (5)

n-step semi-gradient TD
Semi-gradient n-step TD update:

θ ← θ + α [G_t^{(n)} − v̂(S_t, θ_{t+n−1})] ∇_θ v̂(S_t, θ_{t+n−1}),   0 ≤ t < T

where

G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v̂(S_{t+n}, θ_{t+n−1}),   0 ≤ t ≤ T − n

n-step semi-gradient TD [figure]

MC vs TD(0)
MC:     θ ← θ + α [G_t − v̂(S_t, θ)] ∇_θ v̂(S_t, θ)
TD(0):  θ ← θ + α [R_{t+1} + γ v̂(S_{t+1}, θ) − v̂(S_t, θ)] ∇_θ v̂(S_t, θ)
- The TD(0) target is bootstrapped → a biased estimate.
- The target depends on the current θ and we only include part of the gradient (semi-gradient).
- TD(0) converges reliably for linear functions (see the proof above).
- It is faster to learn.
- It enables continual and online learning.

MC vs TD(0) [figure]

Feature construction
- Value estimates are sums of features times their corresponding weights.
- Choosing features is one way to apply domain knowledge.

Feature construction - Polynomial
Each d-dimensional polynomial basis function φ_i can be written as:

φ_i(s) = Π_{j=1}^d s_j^{c_{i,j}}                                     (6)

where c_{i,j} ∈ [0, N] and s_j is normalized to [0, 1].

Feature construction - Polynomial [figure]

Feature construction - Fourier
Fourier cosine basis functions:

φ_i(s) = cos(π c^i · s)                                              (7)

where c_{i,j} ∈ [0, N] and s_j is normalized to [0, 1].

Feature construction - Fourier [figures]

Feature construction - Coarse coding
- A state is represented by the binary features whose overlapping receptive fields (here, circles) it falls inside; generalization between two states depends on the number of their features whose receptive fields overlap.
- Corresponding to each circle is a single weight, learned as a component of θ.

Feature construction - Coarse coding [figure]

Feature construction - Tile coding
- The receptive fields of the features are grouped into partitions of the input space.
- Each such partition is called a tiling, and each element of the partition is called a tile.

Feature construction - Tile coding: tile coding with multiple offsets [figure]

Feature construction - Tile coding: tile coding compared with direct state aggregation [figure]

Feature construction - RBF
Radial basis functions (RBFs):

φ_i(s) = exp(−||s − c_i||² / (2σ²))                                   (8)

where c_{i,j} ∈ [0, N] and s is normalized to [0, 1].

Nonlinear approximation
Artificial neural networks (ANNs) are widely used nonlinear models:
- universal approximation
- a deep ANN is capable of learning features directly from the data
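To make the feature constructions of equations (7) and (8) concrete, here is a minimal Python sketch for a one-dimensional state normalized to [0, 1]. The order N, the RBF centers, and the width σ are illustrative choices rather than values from the slides; either feature vector can be plugged into the linear approximation v̂(s, θ) = θ^T φ(s).

```python
# Minimal sketch: Fourier cosine features (eq. 7) and RBF features (eq. 8)
# for a scalar state normalized to [0, 1].
import numpy as np

def fourier_features(s, N=4):
    """phi_i(s) = cos(pi * c_i * s) with c_i = 0, 1, ..., N (1-D case of eq. 7)."""
    c = np.arange(N + 1)
    return np.cos(np.pi * c * s)

def rbf_features(s, centers=np.linspace(0.0, 1.0, 10), sigma=0.1):
    """phi_i(s) = exp(-||s - c_i||^2 / (2 * sigma^2)) (eq. 8)."""
    return np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))

s = 0.631  # e.g. state 631 of the 1000-state walk, normalized to [0, 1]
print(fourier_features(s))
print(rbf_features(s))
```

Fourier features are global (every feature responds to every state), while RBF features are local, which changes how updates at one state generalize to others.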