Space-Indexed Dynamic Programming: Learning to Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
ICML, July 2008

Outline
• Reinforcement learning and following trajectories
• Space-indexed dynamical systems and space-indexed dynamic programming
• Experimental results

Reinforcement Learning and Following Trajectories

Trajectory Following
• Consider the task of following a trajectory in a vehicle such as a car or helicopter
• The state space is too large to discretize, so tabular RL / dynamic programming cannot be applied
• Dynamic programming algorithms with non-stationary policies seem well-suited to the task
  – Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)

Dynamic Programming
• Divide the control task into discrete time steps t = 1, 2, …, T
• Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1
• Key advantage: policies are local (each only needs to perform well over a small portion of the state space)
[Figure: trajectory divided into time steps t = 1, …, 5, with policies π_1, …, π_5 learned backwards starting from π_5]

Problems with Dynamic Programming
• Problem #1: policies from traditional dynamic programming algorithms are time-indexed
  – Suppose we learned policy π_5 assuming a particular distribution over states at t = 5
  – Due to the natural stochasticity of the environment, the car may actually be somewhere else at t = 5
  – The resulting policy will perform very poorly
• Partial solution (re-indexing): execute the policy learned for the location closest to the current state, regardless of time (a sketch follows below)
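As a rough illustration of the re-indexing heuristic above, here is a minimal Python sketch. The callable per-time-step policies, the array of nominal trajectory states, and all names are assumptions for illustration, not details from the paper: at each step it executes the time-indexed policy whose nominal state is closest to where the vehicle actually is.

```python
import numpy as np

def reindexed_action(state, policies, nominal_states):
    """Re-indexing heuristic for time-indexed DP policies.

    Instead of executing the policy for the current time step, execute the
    policy learned for the time step whose nominal (planned) state is closest
    to the vehicle's actual state.

    state          : current state vector
    policies       : list of per-time-step policies [pi_1, ..., pi_T], each a
                     callable mapping a state to a control action
    nominal_states : (T, n) array of nominal states along the trajectory
    """
    # Distance from the current state to each nominal state on the trajectory.
    dists = np.linalg.norm(nominal_states - state, axis=1)
    t_closest = int(np.argmin(dists))   # index of the nearest time step
    return policies[t_closest](state)
```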
Problems with Dynamic Programming
• Problem #2: uncertainty over future states makes it hard to learn any good policy
  – Due to stochasticity, there is large uncertainty over states in the distant future (a wide distribution over states at time t = 5)
  – DP algorithms require learning a policy that performs well over this entire distribution

Space-Indexed Dynamic Programming
• Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)

Space-Indexed Dynamical Systems and Dynamic Programming

Difficulty with SIDP
• No guarantee that taking a single action will move the vehicle to the next plane along the trajectory
• Introduce the notion of a space-indexed dynamical system

Time-Indexed Dynamical Systems
• Start from a continuous-time model ṡ = f(s, u), where s is the current state, u is the control action, and ṡ is the time derivative of the state
• Euler integration gives the time-indexed system: s_{t+Δt} = s_t + f(s_t, u_t) Δt

Space-Indexed Dynamical Systems
• Simulate forward from space index d until the vehicle hits the next tangent plane (space index d+1)
• This gives the space-indexed update: s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
• The time to reach the next plane is
    Δt(s, u) = (ṡ*_{d+1})ᵀ (s*_{d+1} − s) / ((ṡ*_{d+1})ᵀ ṡ),
  where s*_{d+1} and ṡ*_{d+1} are the trajectory point and velocity defining the next tangent plane
  (a positive solution Δt(s, u) exists as long as the controller makes some forward progress)
• The result is a dynamical system indexed by the spatial-index variable d rather than time
• Space-indexed dynamic programming runs DP directly on this system (see the sketch after this section)

Space-Indexed Dynamic Programming
• Divide the trajectory into discrete space planes d = 1, 2, …, D
• Proceeding backwards, learn policies for d = D, D-1, …, 2, 1
[Figure: trajectory divided into space planes d = 1, …, 5, with policies π_1, …, π_5 learned backwards starting from π_5]

Problems with Dynamic Programming, Revisited
• Problem #1 (time-indexed policies): time-indexed DP can execute a policy learned for a different location; space-indexed DP always executes the policy matching the current spatial index
• Problem #2 (uncertainty over future states): time-indexed DP faces a wide distribution over states at time t = 5, while space-indexed DP faces a much tighter distribution over states at index d = 5
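The following is a minimal sketch of the space-indexed step defined above, assuming a generic continuous-time model f(s, u) and a single Euler step of duration Δt(s, u); the waypoint s*_{d+1} and its velocity ṡ*_{d+1} (here `s_star_next` and `sdot_star_next`) define the next tangent plane. Function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def delta_t(s, s_dot, s_star_next, sdot_star_next):
    """Time for a state s moving with velocity s_dot = f(s, u) to reach the
    plane through s*_{d+1} whose normal is the trajectory velocity ṡ*_{d+1}.
    Positive as long as the controller makes forward progress toward the plane."""
    return float(sdot_star_next @ (s_star_next - s)) / float(sdot_star_next @ s_dot)

def space_indexed_step(s_d, u_d, f, s_star_next, sdot_star_next):
    """One step of the space-indexed dynamical system:
        s_{d+1} = s_d + f(s_d, u_d) * Δt(s_d, u_d),
    i.e. an Euler step whose duration is chosen so that the successor state
    lands on the next tangent plane along the trajectory."""
    s_dot = f(s_d, u_d)                                   # time derivative of the state
    dt = delta_t(s_d, s_dot, s_star_next, sdot_star_next)
    return s_d + dt * s_dot
```

In practice the slides note that one can simulate forward until the vehicle crosses the next tangent plane; the single Euler step above simply mirrors the update equation shown on the slide.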
Experiments

Experimental Domain
• Task: following a race-track trajectory in an RC car, with randomly placed obstacles

Experimental Setup
• Implemented a space-indexed version of the PSDP algorithm
  – Policy chooses the steering angle using an SVM classifier (constant velocity)
  – Used a simple textbook model simulator of the car dynamics to learn the policy
• Evaluated PSDP time-indexed, time-indexed with re-indexing, and space-indexed
[Video slides: Time-Indexed PSDP, Time-Indexed PSDP w/ Re-indexing, Space-Indexed PSDP]

Empirical Evaluation
• Time-indexed PSDP: cost infinite (no trajectory succeeds)
• Time-indexed PSDP with re-indexing: cost 59.74
• Space-indexed PSDP: cost 49.32

Additional Experiments
• In the paper: additional experiments on the Stanford Grand Challenge car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

Related Work
• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
• Differential dynamic programming: Atkeson, 1994; Tassa et al., 2008
• Gain scheduling, model predictive control: Leith and Leithead, 2000; Garcia et al., 1989

Summary
• Trajectory following uses non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
• Demonstrated the usefulness of these methods on real-world control tasks

Thank you! Videos available online at http://cs.stanford.edu/~kolter/icml08videos