Space-Indexed Dynamic Programming: Learning to Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
ICML, July 2008
Outline
• Reinforcement Learning and Following Trajectories
• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming
• Experimental Results
Reinforcement Learning
and Following Trajectories
Trajectory Following
• Consider the task of following a trajectory in a vehicle such as a car or helicopter
• State space too large to discretize; can't apply tabular RL / dynamic programming
Trajectory Following
• Dynamic programming algorithms with non-stationary policies seem well-suited to this task
  – Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)
Dynamic Programming
[Figure: trajectory divided into discrete time steps t = 1, 2, …, 5]
Divide control task into discrete time steps
Dynamic Programming
[Figure: policies π5, π4, …, π1 learned backwards along the trajectory]
Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1
Dynamic Programming
[Figure: trajectory with local policies π1, …, π5]
Key Advantage: policies are local (only need to perform well over a small portion of the state space)
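Below is a minimal sketch of this backward pass in the style of PSDP; the helper names (sample_states, learn_policy, rollout_cost) are hypothetical stand-ins, not the authors' code.

```python
# Schematic time-indexed backward pass (PSDP-style), assuming hypothetical
# helpers: sample_states(t) draws states the vehicle may occupy at time t,
# learn_policy fits a local policy minimizing a supplied cost estimate, and
# rollout_cost evaluates cost-to-go under the already-learned later policies.

def time_indexed_dp(T, sample_states, learn_policy, rollout_cost):
    policies = [None] * (T + 1)
    for t in range(T, 0, -1):                  # t = T, T-1, ..., 1
        states = sample_states(t)
        later = policies[t + 1:]               # policies for t+1, ..., T (already learned)
        policies[t] = learn_policy(
            states, lambda s, a, later=later: rollout_cost(s, a, later)
        )
    return policies[1:]                        # one local policy per time step
```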
Problems with Dynamic Programming
Problem #1: Policies from
traditional dynamic
programming algorithms
are time-indexed
Problems with Dynamic Programming
[Figure: distribution over states at t = 5 and policy π5]
Suppose we learned policy π5 assuming this distribution over states
Problems with Dynamic Programming
[Figure: π5 executed far from the states it was trained on]
But, due to natural stochasticity of the environment, the car is actually here at t = 5
Problems with Dynamic Programming
[Figure: π5 applied at the wrong location]
Resulting policy will perform very poorly
Problems with Dynamic Programming
[Figure: policies π1, …, π5 along the trajectory]
Partial Solution: Re-indexing
Execute policy closest to current location, regardless of time
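A sketch of this re-indexing rule, assuming each time-indexed policy is stored together with a 2-D reference point on the trajectory (names are illustrative):

```python
import numpy as np

def reindexed_action(position, state, policies, ref_points):
    """Execute the time-indexed policy whose trajectory reference point is
    closest to the vehicle's current position, regardless of the time step."""
    dists = np.linalg.norm(np.asarray(ref_points) - np.asarray(position), axis=1)
    return policies[int(np.argmin(dists))](state)
```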
Problems with Dynamic Programming
Problem #2: Uncertainty over
future states makes it hard to
learn any good policy
Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
Due to stochasticity, large uncertainty over states in the distant future
Problems with Dynamic Programming
[Figure: distribution over states at time t = 5]
DP algorithms require learning a policy that performs well over this entire distribution
Space-Indexed Dynamic Programming
• Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)
Space-Indexed Dynamical Systems and Dynamic Programming
Difficulty with SIDP
• No guarantee that taking a single action will move to the next plane along the trajectory
• Introduce the notion of a space-indexed dynamical system
Time-Indexed Dynamical System
• Creating time-indexed dynamical systems:
  ṡ = f(s, u)
  where s is the current state, u is the control action, and ṡ is the time derivative of the state
• Euler integration:
  s_{t+Δt} = s_t + f(s_t, u_t) Δt
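For concreteness, a one-step Euler integrator for such a time-indexed system (f and the step size dt are whatever model and discretization one chooses):

```python
def euler_step(f, s, u, dt):
    """Advance the time-indexed system by a fixed time step:
    s_{t+dt} = s_t + f(s_t, u_t) * dt."""
    return s + f(s, u) * dt
```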
Space-Indexed Dynamical Systems
• Creating space-indexed dynamical systems: starting from ṡ = f(s, u), simulate forward until the vehicle hits the next tangent plane (from space index d to space index d+1):
  s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
• A positive solution Δt(s, u) exists as long as the controller makes some forward progress:
  Δt(s, u) = ( (ṡ*_{d+1})ᵀ (s*_{d+1} − s) ) / ( (ṡ*_{d+1})ᵀ ṡ )
  where s*_{d+1} is the desired trajectory state at plane d+1 and ṡ*_{d+1} its tangent (plane normal) direction
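A minimal sketch of one space-indexed step under these definitions, assuming plane d+1 is given by a trajectory point s_star and its tangent (plane normal) direction sdot_star; variable names are illustrative:

```python
import numpy as np

def space_indexed_step(f, s, u, s_star, sdot_star):
    """Advance the state to the tangent plane at space index d+1.

    Solves sdot_star^T (s + f(s, u) * dt - s_star) = 0 for dt, i.e.
    dt(s, u) = sdot_star^T (s_star - s) / (sdot_star^T f(s, u)),
    then takes a single Euler step of that (state-dependent) duration.
    dt is positive as long as the control makes some forward progress."""
    sdot = f(s, u)
    dt = float(np.dot(sdot_star, s_star - s) / np.dot(sdot_star, sdot))
    return s + sdot * dt, dt
```

In practice one could also integrate with several smaller sub-steps until the plane is crossed; a single Euler step is shown here to match the update above.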
Space-Indexed Dynamical Systems
• The result is a dynamical system indexed by the spatial-index variable d rather than time:
  s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
• Space-indexed dynamic programming runs DP directly on this system
Space-Indexed Dynamic Programming
[Figure: trajectory divided into discrete space planes d = 1, 2, …, 5]
Divide trajectory into discrete space planes
Space-Indexed Dynamic Programming
[Figure: policies π5, π4, …, π1 learned backwards over the space planes]
Proceeding backwards, learn policies for d = D, D-1, …, 2, 1
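The space-indexed backward pass then mirrors the time-indexed one, except that rollouts step the space-indexed system from plane to plane; the helper names below are again hypothetical stand-ins:

```python
def space_indexed_dp(D, sample_states_on_plane, learn_policy, rollout_cost):
    """Schematic space-indexed DP: learn one local policy per tangent plane,
    proceeding backwards over space indices d = D, D-1, ..., 1."""
    policies = [None] * (D + 1)
    for d in range(D, 0, -1):
        states = sample_states_on_plane(d)     # states sampled on or near plane d
        later = policies[d + 1:]               # already-learned policies for planes d+1, ..., D
        policies[d] = learn_policy(
            states, lambda s, a, later=later: rollout_cost(s, a, later)
        )
    return policies[1:]
```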
Problems with Dynamic Programming
Problem #1: Policies from
traditional dynamic
programming algorithms
are time-indexed
Space-Indexed Dynamic Programming
[Figure: policies π4 and π5 near the trajectory]
Time-indexed DP: may execute a policy learned for a different location
Space-indexed DP: always executes the policy for the current spatial index
Problems with Dynamic Programming
Problem #2: Uncertainty over
future states makes it hard to
learn any good policy
Space-Indexed Dynamic Programming
[Figure: distribution over states at time t = 5 vs. at space index d = 5]
Time-indexed DP: wide distribution over future states
Space-indexed DP: much tighter distribution over future states
Experiments
Experimental Domain
• Task: following a racetrack trajectory with an RC car, with randomly placed obstacles
Experimental Setup
• Implemented a space-indexed version of the PSDP algorithm
  – Policy chooses the steering angle using an SVM classifier (constant velocity)
  – Used a simple textbook model of the car dynamics as the simulator for learning the policies
• Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
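As a rough illustration of this policy class (not the authors' implementation), each space-indexed policy could be a multiclass SVM over a small discrete set of steering angles, fit with scikit-learn to state/best-action pairs produced during the DP step:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative discrete steering set (radians); the actual action set is an assumption here.
STEERING_ANGLES = np.linspace(-0.5, 0.5, 7)

def fit_steering_policy(state_features, best_action_idx):
    """Fit one local policy: an SVM classifier mapping state features to the
    index of the best steering angle found for that space index."""
    clf = SVC(kernel="rbf")
    clf.fit(state_features, best_action_idx)
    return lambda s: STEERING_ANGLES[int(clf.predict(np.asarray(s).reshape(1, -1))[0])]
```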
Time-Indexed PSDP
Time-Indexed PSDP w/ Re-indexing
Space-Indexed PSDP
Empirical Evaluation
• Time-indexed PSDP: cost infinite (no trajectory succeeds)
• Time-indexed PSDP with re-indexing: cost 59.74
• Space-indexed PSDP: cost 49.32
Additional Experiments
• In the paper: additional experiments on the Stanford Grand Challenge car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP
Related Work
• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008
• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989
Summary
• Trajectory following uses non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
• Demonstrated the usefulness of these methods on real-world control tasks
Thank you!
Videos available online at
http://cs.stanford.edu/~kolter/icml08videos