Applying Online Search Techniques to Reinforcement Learning
Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University

The Agony of Continuous State Spaces
• Learning useful value functions for continuous-state optimal control problems can be difficult
  – Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
  – Accurate value functions can be very expensive to compute, even in relatively low-dimensional spaces with perfectly accurate state transition models

Combining Value Functions with Online Search
• Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value function inaccuracies
• We examine two different types of search:
  – "Local" searches, in which the agent performs a finite-depth look-ahead search
  – "Global" searches, in which the agent searches for trajectories all the way to goal states

Typical One-Step "Search"
• Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes
      R_T + γ·V(x_T)
  where R_T is the reward accumulated along T, γ is the discount factor, and x_T is the state at the end of T
• This takes O(|A|) time, where A is the set of possible actions
• Given a perfect V(x), this would lead to optimal behavior

Local Search
• An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d·V(x_T)
  – Computational expense: O(|A|^d)
• To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times
  – Computational expense: O(d^s·|A|^(s+1)), considerably cheaper than a full d-step search if s << d

Local Search: Example
[Figure: hill-car domain; two-dimensional state space of position × velocity]
• Two-dimensional state space (position + velocity)
• Car must back up to take a "running start" to make it up the hill
• Search over 20-step trajectories with at most one switch in actions

Using Local Search Online
• Repeat:
  – From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
  – Perform the first action of the trajectory that maximizes R_T + γ^d·V(x_T)
• Let B denote the "parallel backup operator" such that
      BV(x) = max_{a∈A} [ R(x,a) + γ·V(δ(x,a)) ]
  where δ(x,a) is the state reached from x under action a
• If s = d−1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d−1)V. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V
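To make the bounded-switch local search concrete, here is a minimal Python sketch of the procedure on the slide above. The interfaces are illustrative assumptions, not from the original slides: a deterministic one-step model step(x, a) -> (x_next, reward), a value-function approximator V(x), an action set actions, and a discount factor gamma.

```python
# Minimal sketch of depth-d local search with at most s action switches.
# Assumed interfaces (illustrative, not from the slides):
#   step(x, a) -> (x_next, reward)   deterministic one-step model
#   V(x) -> float                    approximate value function
#   actions                          finite action set
#   gamma                            discount factor in (0, 1]

def local_search_action(x0, d, s, step, V, actions, gamma):
    """Return the first action of the best d-step trajectory that switches
    actions at most s times, scored by R_T + gamma^d * V(x_T)."""

    def best_return(x, depth, prev_action, switches_left):
        # Best achievable score for the remaining trajectory suffix from state x.
        if depth == 0:
            return (gamma ** d) * V(x)   # leaf: discounted value-function estimate
        best = float("-inf")
        for a in actions:
            switching = prev_action is not None and a != prev_action
            if switching and switches_left == 0:
                continue                 # would exceed the switch budget
            x_next, r = step(x, a)
            # this is step (d - depth + 1) of the trajectory, discounted by gamma^(d - depth)
            val = (gamma ** (d - depth)) * r + best_return(
                x_next, depth - 1, a, switches_left - (1 if switching else 0))
            best = max(best, val)
        return best

    best_value, best_action = float("-inf"), None
    for a in actions:
        x_next, r = step(x0, a)
        val = r + best_return(x_next, d - 1, a, s)
        if val > best_value:
            best_value, best_action = val, a
    return best_action
```

With s = d−1 this enumerates all |A|^d trajectories, matching the B^(d−1)V equivalence noted above; with small s the branching is much cheaper, which is the point of the switch limit.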
Uninformed Global Search
• Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
• Problem: combinatorial explosion
• Possible solution:
  – Break the state space into partitions, e.g. a uniform grid (can be represented sparsely)
  – Use the previously discussed local search procedure to find trajectories between partitions
  – Prune all but the least-cost trajectory entering any given partition

Uninformed Global Search
[Figure: example uninformed global search trajectories]
• Problems:
  – Still computationally expensive
  – Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail

Informed Global Search
• Use the approximate value function V to guide the selection of which points to search from next
• A reasonably accurate V will cause the search to stay along the optimal path to the goal: dramatic reduction in search time
• V can help choose effective points within each partition from which to search, thereby improving solution quality
• Uninformed Global Search is the same as "Informed" Global Search with V(x) = 0

Informed Global Search Algorithm
• Let x0 be the current state, and g(x0) be the grid element containing x0
• Set g(x0)'s "representative state" to x0, and add g(x0) to priority queue P with priority V(x0)
• Until a goal state is found or P is empty:
  – Remove the grid element g at the top of P. Let x denote g's "representative state."
  – SEARCH-FROM(g, x)
• If the goal was found, execute the trajectory; otherwise signal failure

Informed Global Search Algorithm, cont'd
SEARCH-FROM(g, x):
• Starting from x, perform "local search" as described earlier, but prune the search wherever it reaches a different grid element g' ≠ g
• Each time another grid element g' is reached at a state x':
  – If g' was previously SEARCHED-FROM, do nothing
  – If g' was never previously reached, add g' to P with priority R_T(x0…x') + γ^|T|·V(x'), where T is the trajectory from x0 to x'. Set g''s "representative state" to x'. Record the trajectory from x to x'
  – If g' was previously reached but its previous priority is lower than R_T(x0…x') + γ^|T|·V(x'), update the priority of g' to R_T(x0…x') + γ^|T|·V(x') and set its "representative state" to x'. Record the trajectory from x to x'

Informed Global Search Examples
[Figure: hill-car search trees with a 7×7 and a 13×13 simplex-interpolated V]

Informed Global Search as A*
• Informed Global Search is essentially an A* search using the value function V as a search heuristic
• Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal
• Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we then expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?
• Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree
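A compact Python sketch of the informed global search loop above, treated as an A*-style best-first search over grid cells. The interfaces (step, V, gamma, actions), the cell-index function grid_index, and the goal test is_goal are illustrative assumptions; for brevity the bounded-switch local search between cells is simplified here to single one-step expansions (d = 1), whereas the full algorithm keeps expanding inside the current cell until the trajectory leaves it.

```python
import heapq
from itertools import count

# Sketch of informed global search as best-first (A*-style) search over grid cells.
# Illustrative assumptions (not from the slides): step(x, a) -> (x_next, reward) is a
# deterministic model, V(x) is the approximate value function, grid_index(x) maps a
# state to its grid cell, and is_goal(x) tests for goal states.

def informed_global_search(x0, step, V, actions, gamma, grid_index, is_goal):
    start_cell = grid_index(x0)
    # Per-cell bookkeeping: (best priority, representative state,
    # accumulated discounted reward from x0, steps taken, action plan so far).
    best = {start_cell: (V(x0), x0, 0.0, 0, [])}
    searched = set()
    tie = count()                        # tiebreaker so heap entries always compare
    # heapq is a min-heap, so push negated priorities to pop the highest first.
    frontier = [(-V(x0), next(tie), start_cell)]

    while frontier:
        neg_priority, _, cell = heapq.heappop(frontier)
        if cell in searched or -neg_priority < best[cell][0]:
            continue                     # stale queue entry
        searched.add(cell)
        _, x, reward_so_far, steps, plan = best[cell]
        if is_goal(x):
            return plan                  # action sequence from x0 to the goal

        for a in actions:                # simplified "local search" from the representative state
            x_next, r = step(x, a)
            cell_next = grid_index(x_next)
            if cell_next in searched:
                continue                 # already expanded; prune this trajectory
            reward_next = reward_so_far + (gamma ** steps) * r
            priority = reward_next + (gamma ** (steps + 1)) * V(x_next)
            if cell_next not in best or priority > best[cell_next][0]:
                best[cell_next] = (priority, x_next, reward_next, steps + 1, plan + [a])
                heapq.heappush(frontier, (-priority, next(tie), cell_next))
    return None                          # goal not reached: signal failure
```

With an admissible (optimistic) V this behaves like standard A* under the stated assumptions; with a crude non-optimistic V it can still prune far less wrongly than the V(x) = 0 heuristic, which is the point of the comparisons on the following slides.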
Hill-Car
[Figure: car on a steep hill (hillcar.fig)]
• Car on a steep hill
• State variables: position and velocity (2-d)
• Actions: accelerate forward or backward
• Goal: park near the top
• Random start states
• Cost: total time to goal

Acrobot
[Figure: two-link acrobot with the goal height marked]
• Two-link planar robot acting in a vertical plane under gravity
• Torque applied only at the elbow joint; the shoulder is unactuated
• Two angular positions and their velocities (4-d)
• Goal: raise the tip at least one link's height above the shoulder
• Two actions: full torque clockwise / counterclockwise
• Random starting positions
• Cost: total time to goal

Move-Cart-Pole
[Figure: cart-pole with goal configuration at position x]
• Upright pole attached to a cart by an unactuated joint
• State: horizontal position of the cart, angle of the pole, and the associated velocities (4-d)
• Actions: accelerate left or right
• Goal configuration: cart moved, pole balanced
• Start with random x, θ = 0
• Per-step cost quadratic in the distance from the goal configuration
• Big penalty if the pole falls over

Planar Slider
[Figure: puck on a bumpy 2-d surface]
• Puck sliding on a bumpy 2-d surface
• Two spatial variables and their velocities (4-d)
• Actions: accelerate NW, NE, SW, or SE
• Goal in the NW corner
• Random start states
• Cost: total time to goal

Local Search Experiments: Move-Cart-Pole
[Plots: solution cost and CPU time per trial (sec.) vs. search depth]
• CPU time and solution cost vs. search depth d
• No limits imposed on the number of action switches (s = d)
• Value function: 13^4 simplex-interpolation grid

Local Search Experiments: Hill-Car
[Plots: solution cost and CPU time per trial (sec.) vs. search depth]
• CPU time and solution cost vs. search depth d
• Maximum number of action switches fixed at 2 (s = 2)
• Value function: 7×7 simplex-interpolated value function

Comparative Experiments: Hill-Car

  Search method    Solution cost    CPU time/trial (sec.)
  None             187              0.02
  Local            140              0.36
  Uninf. Global    FAIL             N/A
  Inf. Global      151              0.14

• Local search: d = 6, s = 2
• Global searches:
  – Local search between grid elements: d = 20, s = 1
  – 50×50 search grid resolution
• 7×7 simplex-interpolated value function

Hill-Car Results, cont'd
• Uninformed Global Search prunes the wrong trajectories
[Figure placeholder: failed search trajectory]
• Increase the search grid to 100×100 so this doesn't happen:
  – Uninformed search is now near-optimal
  – Informed search isn't: the crude value function is not optimistic

  Search method              Solution cost    CPU time/trial (sec.)
  Uninf. Global (50×50)      FAIL             N/A
  Inf. Global (50×50)        151              0.14
  Uninf. Global (100×100)    109              0.82
  Inf. Global (100×100)      138              0.29

Comparative Results: Four-d Domains
• All value functions: 13^4 simplex interpolations
• All local searches between global search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
• Acrobot:
  – Local search: depth 4; no action switch restriction (d = 4, s = 4)
  – Global: 50^4 search grid
• Move-Cart-Pole: same as Acrobot
• Slider:
  – Local search: depth 10; at most 1 action switch (d = 10, s = 1)
  – Global: 20^4 search grid
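The value functions above are stored on uniform grids (e.g. the 13^4 grids just mentioned) and evaluated by simplex interpolation. As a rough illustration of how such a grid-based V(x) can be queried by the searches, here is a sketch that uses multilinear interpolation as a stand-in for the simplex scheme actually used in the experiments; the class name, the lows/highs bounds, and the values array are all assumptions for the example, not details from the slides.

```python
import numpy as np
from itertools import product

# Sketch of a grid-based value function queried by the search procedures.
# The experiments use simplex interpolation on uniform grids; this stand-in
# uses multilinear interpolation, which is simpler to show but touches 2^D
# corners per query rather than the D+1 vertices of a simplex scheme.

class GridValueFunction:
    def __init__(self, lows, highs, values):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.values = np.asarray(values, dtype=float)   # shape: (n1, ..., nD) vertex values
        self.res = np.array(self.values.shape)          # vertices per dimension

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        # Map x into continuous grid coordinates in [0, res - 1], clipping at the edges.
        coords = (x - self.lows) / (self.highs - self.lows) * (self.res - 1)
        coords = np.clip(coords, 0, self.res - 1)
        base = np.minimum(np.floor(coords).astype(int), self.res - 2)
        frac = coords - base                             # position within the enclosing cell
        # Multilinear interpolation over the 2^D corners of that cell.
        total = 0.0
        for corner in product((0, 1), repeat=len(base)):
            corner = np.array(corner)
            weight = np.prod(np.where(corner == 1, frac, 1.0 - frac))
            total += weight * self.values[tuple(base + corner)]
        return float(total)

# Example usage: a 13^4 grid for a 4-d domain, filled with zeros as a placeholder.
# V = GridValueFunction(lows=[-1, -1, -1, -1], highs=[1, 1, 1, 1],
#                       values=np.zeros((13, 13, 13, 13)))
```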
Acrobot Results

                    No search        Local search     Uninf. Global Search       Inf. Global Search
                    cost     time    cost     time    cost     time     #LS      cost     time    #LS
  Acrobot           454      0.1     305      1.2     407      5.8      14250    198      0.47    914
  Move-Cart-Pole    49993    0.66    10339    1.9     3164     3.45     7605     5073     0.64    1072
  Planar Slider     212      1.13    197      52      104      94       23690    54       2       533

  (times are CPU seconds per trial; #LS = number of local searches performed to find paths between elements of the global search grid)

• Local search significantly improves solution quality, but increases CPU time by an order of magnitude
• Uninformed global search takes even more time; its poor solution quality indicates suboptimal trajectory pruning
• Informed global search finds much better solutions in relatively little time: the value function drastically reduces the search, and better pruning leads to better solutions

Move-Cart-Pole Results
[Results table repeated from the Acrobot slide]
• No search: the pole often falls, incurring large penalties; overall poor solution quality
• Local search improves things a bit
• Uninformed search finds better solutions than informed search:
  – Few grid cells in which pruning is required
  – The value function is not optimistic, so the informed search's solutions are suboptimal
• Informed search reduces costs by an order of magnitude with no increase in required CPU time

Planar Slider Results
[Results table repeated from the Acrobot slide]
• Local search is almost useless, and incurs massive CPU expense
• Uninformed search decreases solution cost by 50%, but at even greater CPU expense
• Informed search decreases solution cost by a factor of 4, at no increase in CPU time

Using Search with Learned Models
• Toy example: Hill-Car
  – 7×7 simplex-interpolated value function
  – One nearest-neighbor function approximator per possible action used to learn dx/dt (a rough sketch of this kind of model follows the final slide)
  – States sufficiently far away from their nearest neighbor are optimistically assumed to be absorbing, to encourage exploration
• Average costs over the first few hundred trials:
  – No search: 212
  – Local search: 127
  – Informed global search: 155

Using Search with Learned Models, cont'd
• Problems do arise when using learned models:
  – Inaccuracies in the models may cause global searches to fail. It is then unclear whether the failure should be blamed on model inaccuracies or on an insufficiently fine state-space partitioning
  – The trajectories found will be inaccurate
      • Need an adaptive closed-loop controller
      • Fortunately, we will get new data with which to increase the accuracy of our model
  – Model approximators must be fast and accurate

Avenues for Future Research
• Extensions to nondeterministic systems?
• Higher-dimensional problems
• Better function approximators for model learning
• Variable-resolution search grids
• Optimistic value function generation?
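Referring back to the "Using Search with Learned Models" slides, here is a minimal sketch of a one-nearest-neighbor dynamics model per action, with queries far from all stored data optimistically treated as absorbing, as described there. The class name, the data layout, the distance threshold max_dist, and the convention of returning None for "absorbing" are illustrative assumptions; the slides do not specify these details.

```python
import numpy as np

# Sketch of the learned-model idea: one nearest-neighbor approximator per action
# predicts the state change over one time step, and queries far from all stored
# data are treated as absorbing (to be scored optimistically by the planner),
# which encourages exploration of unvisited regions.

class NearestNeighborModel:
    def __init__(self, actions, max_dist):
        self.max_dist = max_dist
        # Per action: observed states and the state changes that followed them.
        self.data = {a: ([], []) for a in actions}

    def update(self, x, a, x_next):
        """Record one observed transition (x, a) -> x_next."""
        states, deltas = self.data[a]
        x = np.asarray(x, dtype=float)
        states.append(x)
        deltas.append(np.asarray(x_next, dtype=float) - x)

    def predict(self, x, a):
        """Return the predicted next state, or None if x is too far from all data
        for this action (the planner then treats the state as optimistically absorbing)."""
        states, deltas = self.data[a]
        if not states:
            return None
        x = np.asarray(x, dtype=float)
        dists = np.linalg.norm(np.stack(states) - x, axis=1)
        i = int(np.argmin(dists))
        if dists[i] > self.max_dist:
            return None
        return x + deltas[i]
```

A planner using this model would give a None prediction an optimistic score (e.g. zero cost-to-go), which is what draws the agent toward unexplored regions; each real transition then refines the model through update.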