
Applying Online Search Techniques to Reinforcement Learning
Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University
The Agony of Continuous State Spaces
• Learning useful value functions for
continuous-state optimal control problems
can be difficult
– Small inaccuracies/inconsistencies in approximated
value functions can cause simple controllers to fail
miserably
– Accurate value functions can be very expensive to
compute even in relatively low-dimensional spaces
with perfectly accurate state transition models
Combining Value Functions With Online Search
• Instead of modeling the value function accurately
everywhere, we can perform online searches for
good trajectories from the agent’s current position
to compensate for value function inaccuracies
• We examine two different types of search:
– “Local” searches in which the agent performs a finite-depth look-ahead search
– “Global” searches in which the agent searches for
trajectories all the way to goal states
Typical One-Step “Search”
Given a value function V(x) over the state space, an agent
typically uses a model to predict where each possible one-step
trajectory T takes it, then chooses the trajectory that
maximizes
$R_T + \gamma\, V(x_T)$
where $R_T$ is the reward accumulated along T,
$\gamma$ is the discount factor, and
$x_T$ is the state at the end of T.
This takes O(|A|) time, where A is the set of possible actions.
Given a perfect V(x), this would lead to optimal behavior.
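To make the selection concrete, here is a minimal Python sketch of the one-step greedy choice. The helpers `step(x, a)` (deterministic transition model), `reward(x, a)`, the approximate value function `V`, the action set `ACTIONS`, and the discount `GAMMA` are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of one-step greedy action selection (illustrative, not the
# authors' code). step(x, a) returns the model's predicted next state,
# reward(x, a) the immediate reward, V an approximate value function.

def one_step_greedy(x, step, reward, V, ACTIONS, GAMMA=0.99):
    """Choose the action maximizing reward + GAMMA * V(next state): O(|A|)."""
    best_action, best_score = None, float("-inf")
    for a in ACTIONS:                       # one candidate trajectory per action
        x_next = step(x, a)                 # model-predicted successor state
        score = reward(x, a) + GAMMA * V(x_next)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```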
Local Search
• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes $R_T + \gamma^d V(x_T)$.
  – Computational expense: $O(|A|^d)$.
• To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.
  – Computational expense: $O\!\left(d \binom{d}{s} |A|^{s+1}\right)$
    (considerably cheaper than a full d-step search if s << d)
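A minimal recursive sketch of this limited-switch local search follows, under the same assumed helpers (`step`, `reward`, `V`, `ACTIONS`, `GAMMA`) as in the previous sketch; it scores a trajectory by its discounted reward plus $\gamma^d$ times V at its endpoint.

```python
# Minimal sketch of depth-d local search with at most s action switches
# (illustrative, not the authors' code). Helpers as in the previous sketch.

def local_search(x0, d, s, step, reward, V, ACTIONS, GAMMA=0.99):
    """Return the first action of the best d-step trajectory from x0 that
    switches actions at most s times (score = R_T + GAMMA**d * V(x_T))."""

    def best(x, depth_left, prev_a, switches_left, disc):
        # Best achievable score continuing from x with the given budgets;
        # disc is the discount applied to the next immediate reward.
        if depth_left == 0:
            return disc * V(x)
        best_val = float("-inf")
        for a in ACTIONS:
            remaining = switches_left
            if a != prev_a:
                if switches_left == 0:
                    continue                # switch budget exhausted
                remaining = switches_left - 1
            val = disc * reward(x, a) + best(step(x, a), depth_left - 1,
                                             a, remaining, disc * GAMMA)
            best_val = max(best_val, val)
        return best_val

    best_a, best_val = None, float("-inf")
    for a in ACTIONS:                       # the first action never counts as a switch
        val = reward(x0, a) + best(step(x0, a), d - 1, a, s, GAMMA)
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```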
Local Search: Example
Title:
hillc ar.fig
Creator:
fig2dev Version 3.1 Patchlevel 2
Prev iew :
This EPS picture w as not s av ed
w ith a preview inc luded in it.
Comment:
This EPS picture w ill print to a
Pos tSc ript printer, but not to
other ty pes of printers.
Title:
Creator:
Preview :
This EPS picture w as not saved
w ith a preview included in it.
Comment:
This EPS picture w ill print to a
Post Script print er, but not to
ot her ty pes of printers .
Velocity
• Two-dimensional state space
(position + velocity)
• Car must back up to take
“running start” to make it
Position
Search over 20-step trajectories
with at most one switch in actions
Using Local Search Online
Repeat:
• From current state, consider all possible d-step trajectories T in which
the action is changed at most s times
• Perform the first action in the trajectory that maximizes $R_T + \gamma^d V(x_T)$.

Let B denote the “parallel backup operator” such that
    $(BV)(x) = \max_{a \in A} \left[ R(x,a) + \gamma\, V(\delta(x,a)) \right]$
where $\delta(x,a)$ denotes the model’s predicted next state.

If s = d−1, Local Search is formally equivalent to behaving greedily with respect to the new value function $B^{d-1}V$.
Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
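A minimal sketch of the resulting receding-horizon control loop, reusing the hypothetical `local_search` above; `env`, `env.state`, `env.apply`, and `at_goal` are assumed interfaces, not the authors'.

```python
# Minimal sketch of the online loop: search d steps ahead, execute only the
# first action, then re-plan from the new state (illustrative, not the
# authors' code).

def run_online(env, V, step, reward, ACTIONS, d=6, s=2, GAMMA=0.99,
               at_goal=lambda x: False, max_steps=1000):
    for _ in range(max_steps):
        x = env.state
        if at_goal(x):
            break
        a = local_search(x, d, s, step, reward, V, ACTIONS, GAMMA)
        env.apply(a)                        # execute only the first action
```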
Uninformed Global Search
• Suppose we have a minimum-cost-to-goal problem in a
continuous state space with nonnegative costs. Why not
forget about explicitly calculating V and just extend the
search from the current position all the way to the goal?
• Problem: combinatorial explosion.
• Possible solution:
– Break state space into partitions, e.g. a uniform grid. (Can be
represented sparsely.)
– Use previously discussed local search procedure to find
trajectories between partitions
– Prune all but least-cost trajectory entering any given partition
Uninformed Global Search
• Problems:
– Still computationally
expensive
– Even with fine
partitioning of state
space, pruning the
wrong trajectories can
cause search to fail
Informed Global Search
• Use approximate value function V to guide the
selection of which points to search from next
• Reasonably accurate V will cause search to stay
along optimal path to goal: dramatic reduction in
search time
• V can help choose effective points within each
partition from which to search, thereby improving
solution quality
• Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0
Informed Global Search Algorithm
• Let x0 be current state, and g(x0) be the grid
element containing x0
• Set g(x0)’s “representative state” to x0, and add
g(x0) to priority queue P with priority V(x0)
• Until goal state found or P empty:
– Remove grid element g from top of P. Let x denote g’s
“representative state.”
– SEARCH-FROM(g, x)
• If goal found, execute trajectory; otherwise signal
failure
Informed Global Search Algorithm, cont’d
SEARCH-FROM(g, x):
• Starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.
• Each time another grid element g′ is reached at a state x′:
  – If g′ was previously SEARCHED-FROM, do nothing.
  – If g′ has never previously been reached, add g′ to P with priority $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, where T is the trajectory from x0 to x′. Set g′’s “representative state” to x′. Record the trajectory from x to x′.
  – If g′ has previously been reached but its previous priority is lower than $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, update g′’s priority to this value and set its “representative state” to x′. Record the trajectory from x to x′.
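A minimal Python sketch of the informed global search above (illustrative, not the authors' code). It assumes the helpers from the earlier sketches plus `grid_of(x)`, which maps a state to a hashable grid-cell id, and `is_goal(x)`; to keep it short, the local search between cells is reduced to constant-action rollouts (s = 0) of bounded depth.

```python
# Minimal sketch of informed global search (A*-like, priorities R_T + GAMMA^|T| * V).
# Illustrative only; the between-cell local search is simplified to constant-action
# rollouts. grid_of(x) -> hashable cell id, is_goal(x) -> bool are assumed helpers.

import heapq
import itertools

def informed_global_search(x0, step, reward, V, grid_of, is_goal,
                           ACTIONS, GAMMA=0.99, depth=20):
    start = grid_of(x0)
    rep = {start: (x0, 0.0, 0, [])}     # cell -> (repr. state, R_T, |T|, actions)
    searched = set()
    tie = itertools.count()             # tie-breaker for the heap
    pq = [(-V(x0), next(tie), start)]   # max-priority queue via negated priorities

    while pq:
        _, _, g = heapq.heappop(pq)
        if g in searched:
            continue
        searched.add(g)
        x, R, n, actions = rep[g]

        # SEARCH-FROM(g, x): roll each constant action forward until it leaves g.
        for a in ACTIONS:
            xc, Rc, nc, acts = x, R, n, list(actions)
            for _ in range(depth):
                Rc += (GAMMA ** nc) * reward(xc, a)
                xc = step(xc, a)
                nc += 1
                acts.append(a)
                if is_goal(xc):
                    return acts         # found a trajectory all the way to the goal
                if grid_of(xc) != g:
                    break               # prune the search at the cell boundary
            gc = grid_of(xc)
            if gc == g or gc in searched:
                continue                # never left g, or gc already SEARCHED-FROM
            prio = Rc + (GAMMA ** nc) * V(xc)
            old = rep.get(gc)
            if old is None or old[1] + (GAMMA ** old[2]) * V(old[0]) < prio:
                rep[gc] = (xc, Rc, nc, acts)   # new representative state for gc
                heapq.heappush(pq, (-prio, next(tie), gc))
    return None                         # priority queue exhausted: search failed
```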
Informed Global Search Examples
[Figures: hill-car search trees produced by Informed Global Search, using a 7×7 simplex-interpolated V (left) and a 13×13 simplex-interpolated V (right)]
Informed Global Search as A*
• Informed Global Search is essentially an A* search using
the value function V as a search heuristic
• Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.
• Uninformed global search effectively uses the trivially optimistic heuristic V(s) = 0. Might we then expect better solution quality with uninformed search than with a non-optimistic, crude approximate value function V?
• Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree
Hill-car
• Car on steep hill
• State variables:
position and velocity
(2-d)
• Actions: accelerate
forward or backward
• Goal: park near top
• Random start states
• Cost: total time to goal
Acrobot
[Figure: two-link acrobot (links 1 and 2) with the goal height marked]
• Two-link planar robot acting in vertical plane
under gravity
• Torque applied only at the elbow joint; shoulder joint unactuated (the system is underactuated)
• Two angular positions & their velocities (4-d)
• Goal: raise tip at least one link’s height above
shoulder
• Two actions: full torque clockwise /
counterclockwise
• Random starting positions
• Cost: total time to goal
Move-Cart-Pole
[Figure: cart on a track with attached pole; x = cart position, θ = pole angle; goal configuration shown]
• Upright pole attached to cart by
unactuated joint
• State: horizontal position of cart, angle
of pole, and associated velocities (4-d)
• Actions: accelerate left or right
• Goal configuration: cart moved, pole
balanced
• Start with random x; θ = 0
• Per-step cost quadratic in distance from
goal configuration
• Big penalty if pole falls over
Planar Slider
• Puck sliding on bumpy 2-d
surface
• Two spatial variables &
their velocities (4-d)
• Actions: accelerate NW,
NE, SW, or SE
• Goal in NW corner
• Random start states
• Cost: total time to goal
Local Search Experiments
Move-Cart-Pole
[Plots: CPU time per trial (sec.) and solution cost, each plotted against search depth 1–10]
• CPU time and solution cost vs. search depth d
• No limits imposed on number of action switches (s = d)
• Value function: 13⁴ simplex-interpolation grid
Local Search Experiments
Hill-car
[Plots: CPU time per trial (sec.) and solution cost, each plotted against search depth 0–30]
• CPU time and solution cost vs. search depth d
• Max. number of action switches fixed at 2 (s = 2)
• Value function: 7² simplex-interpolated value function
Comparative experiments: Hill-Car
Search Method           None    Local   Uninf. Glob.   Inf. Glob.
Solution Cost           187     140     FAIL           151
CPU Time/Trial (sec.)   0.02    0.36    N/A            0.14

• Local search: d=6, s=2
• Global searches:
  – Local search between grid elements: d=20, s=1
  – 50² search grid resolution
• 7² simplex-interpolated value function
Hill-Car results cont’d
• Uninformed Global Search prunes wrong trajectories
  [Failed search trajectory picture goes here]
• Increase search grid to 100² so this doesn't happen:
  – Uninformed does near-optimal
  – Informed doesn't: crude value function not optimistic

Search Method           Uninf. Glob. 1   Inf. Glob. 1   Uninf. Glob. 2   Inf. Glob. 2
Solution Cost           FAIL             151            109              138
CPU Time/Trial (sec.)   N/A              0.14           0.82             0.29
(1: 50² search grid, 2: 100² search grid)
Comparative Results: Four-d domains
All value functions: 13⁴ simplex interpolations
All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)
• Acrobot:
  – Local Search: depth 4; no action switch restriction (d=4, s=4)
  – Global: 50⁴ search grid
• Move-Cart-Pole: same as Acrobot
• Slider:
  – Local Search: depth 10; at most 1 action switch (d=10, s=1)
  – Global: 20⁴ search grid
Acrobot
                   No search       Local search     Uninf. Global Search      Inf. Global Search
                   cost    time    cost    time     cost    time    #LS       cost    time    #LS
Acrobot            454     0.1     305     1.2      407     5.8     14250     198     0.47    914
Move-Cart-Pole     49993   0.66    10339   1.13     3164    3.45    7605      5073    0.64    1072
Planar Slider      212     1.9     197     52       104     94      23690     54      2       533

(cost = solution cost; time = CPU seconds per trial)
#LS: number of local searches performed to find paths between elements of global search grid

• Local search significantly improves solution quality, but increases CPU time by order of magnitude
• Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
• Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions
Move-Cart-Pole
(results table repeated from the “Acrobot” slide above)
• No search: pole often falls, incurring large penalties; overall
poor solution quality
• Local search improves things a bit
• Uninformed search finds better solutions than informed
– Few grid cells in which pruning is required
– Value function not optimistic, so informed search solutions suboptimal
• Informed search reduces costs by order of magnitude with no
increase in required CPU time
Planar Slider
(results table repeated from the “Acrobot” slide above)
• Local search almost useless, and incurs massive CPU expense
• Uninformed search decreases solution cost by 50%, but at even
greater CPU expense
• Informed search decreases solution cost by factor of 4, at no
increase in CPU time
Using Search with Learned Models
• Toy Example: Hill-Car
– 7² simplex-interpolated value function
– One nearest-neighbor function approximator per possible action used to learn dx/dt
– States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration (see the sketch below)
• Average costs over first few hundred trials:
– No search: 212
– Local search: 127
– Informed global search: 155
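A minimal sketch of the learned-model setup above (illustrative, not the authors' code): one nearest-neighbor approximator per action predicting dx/dt, with states far from all stored data optimistically treated as absorbing. The parameters `radius` and `dt` are assumed for illustration.

```python
# Minimal sketch of a per-action nearest-neighbor model of dx/dt with
# optimistic absorbing states far from the data (illustrative, not the
# authors' code). radius and dt are assumed parameters.

import numpy as np

class NearestNeighborModel:
    def __init__(self, actions, radius=0.1, dt=0.05):
        self.data = {a: [] for a in actions}   # per-action list of (state, dx/dt)
        self.radius = radius                   # beyond this distance: "absorbing"
        self.dt = dt

    def observe(self, x, a, x_next):
        """Store an observed transition as a (state, estimated dx/dt) pair."""
        x, x_next = np.asarray(x, float), np.asarray(x_next, float)
        self.data[a].append((x, (x_next - x) / self.dt))

    def predict(self, x, a):
        """Return (predicted next state, absorbing?) from the nearest neighbor."""
        x = np.asarray(x, float)
        if not self.data[a]:
            return x, True                      # no data yet: optimistically absorbing
        states = np.stack([s for s, _ in self.data[a]])
        derivs = np.stack([d for _, d in self.data[a]])
        dists = np.linalg.norm(states - x, axis=1)
        i = int(np.argmin(dists))
        if dists[i] > self.radius:
            return x, True                      # far from all data: treat as absorbing
        return x + self.dt * derivs[i], False   # Euler step using neighbor's dx/dt
```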
Using Search with Learned Models
• Problems do arise when using learned models:
– Inaccuracies in models may cause global searches to
fail. Not clear then if failure should be blamed on
model inaccuracies or on insufficiently fine state space
partitioning
– Trajectories found will be inaccurate
• Need adaptive closed-loop controller
• Fortunately, we will get new data with which to increase the
accuracy of our model
– Model approximators must be fast and accurate
Avenues for Future Research
• Extensions to nondeterministic systems?
• Higher-dimensional problems
• Better function approximators for model
learning
• Variable-resolution search grids
• Optimistic value function generation?