Reinforcement Learning for Soaring
CDMRG – 24 May 2010
Nick Lawrance

Reinforcement Learning for Soaring
• What I want to do
• I have a good understanding of the dynamics involved in aerodynamic soaring in known conditions, but:
1. Dynamic soaring requires energy-loss actions to achieve net energy-gain cycles, which can be difficult using traditional control or path-generation methods
2. Wind is difficult to predict; guidance and navigation must be done on-line whilst simultaneously maintaining reasonable energy levels and safety requirements
3. Classic exploration-exploitation problem, with the added catch that exploration requires energy gained through exploitation

Reinforcement Learning for Soaring
• Why reinforcement learning
• Previous work focused on understanding soaring and examining alternatives for generating energy-gain paths
• There is always the issue of balancing exploration and exploitation; my code ended up being long sequences of heuristic rules
• Reinforcement learning could provide the link from known good paths towards optimal paths

Monte Carlo, TD, Sarsa & Q-learning
• Monte Carlo – learn an average reward for actions taken during a series of episodes
• Temporal Difference (TD) – simultaneously estimate the expected reward and the value function
• Sarsa – on-policy control using TD
• Q-learning – off-policy TD control

Figure 6.13 (Sutton & Barto): The cliff-walking task. Off-policy Q-learning learns the optimal policy, along the edge of the cliff, but then keeps falling off because of the ε-greedy action selection. On-policy Sarsa learns a safer policy that takes the action-selection method into account. These data are from a single run, but smoothed.

Eligibility Traces
• TD(0) is effectively a one-step backup of Vπ (the reward only propagates to the previous action)
• Eligibility traces extend this to reward the sequence of actions that led to the current reward, as in the TD(λ) update sketched below
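For reference, the standard one-step TD(0) backup and its eligibility-trace extension TD(λ) with accumulating traces, in the usual Sutton & Barto notation (textbook material, not specific to the soaring setup):

```latex
% One-step TD(0): only the most recently visited state is updated.
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

% TD(\lambda) with accumulating eligibility traces: every recently visited
% state shares in the current TD error \delta_t.
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad
e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbf{1}[s = s_t]

V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s
```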
Sarsa(λ)
• Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
• Repeat (for each episode):
  • Initialize s, a
  • Repeat (for each step of episode):
    • Take action a, observe r, s'
    • Choose a' from s' using policy derived from Q (ε-greedy)
    • δ ← r + γQ(s',a') − Q(s,a)
    • e(s,a) ← e(s,a) + 1
    • For all s, a:
      • Q(s,a) ← Q(s,a) + αδe(s,a)
      • e(s,a) ← γλe(s,a)
    • s ← s'; a ← a'
  • until s is terminal
(A runnable sketch of this algorithm on a toy grid is given at the end.)

Simplest soaring attempt
• Square grid, simple motion, energy sinks and sources
• Movement cost, turn cost, edge cost

Simulation - Static
[Figures: results on the square grid; average reward over 10,000 time steps for ε-greedy (ε = 0.1, 0.01) and Softmax (τ = 5, 1, 0.5) action selection.]

Hex grid, dynamic soaring
• Energy-based simulation
• Drag movement cost, turn cost
• Constant speed
• No wind motion (due to limited states)

Hex grid, dynamic soaring
[Figures: results on the hex grid and average reward (W) versus time step (Δt = 1 s) over 10^5 steps, comparing Softmax (τ = 50, 100, 500) and ε-greedy (ε = 0.01, 0.1) action selection.]

Next
• Reinforcement learning has advantages to offer our group, but our contribution should probably be focused in well-defined areas
• For most of our problems the state spaces are very large and usually continuous; we need estimation methods
• We usually have a good understanding of at least some aspects of the problem; how can/should we use this information to give better solutions?
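To make the tabular grid experiments above concrete, here is a minimal sketch of Sarsa(λ) with ε-greedy action selection on a small square grid. It is an illustrative reconstruction only, not the original simulation code: the grid size, reward values, and learning parameters are assumptions chosen for the example, and the real experiments additionally included turn and edge costs, energy sinks, and the hex-grid energy model.

```python
import numpy as np

# Minimal tabular Sarsa(lambda) on a toy square grid.
# Grid layout, rewards, and parameters are illustrative assumptions,
# not the values used in the original soaring simulations.

N = 10                                          # grid is N x N
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
GOAL = (7, 7)                                   # "energy source" cell
MOVE_COST = -1.0                                # cost of each move
GOAL_REWARD = 20.0                              # reward at the energy source

ALPHA, GAMMA, LAM, EPS = 0.1, 0.95, 0.9, 0.1

rng = np.random.default_rng(0)
Q = np.zeros((N, N, len(ACTIONS)))              # tabular action values

def step(state, a_idx):
    """Apply an action, clipping at the grid edges, and return (next_state, reward)."""
    dr, dc = ACTIONS[a_idx]
    r = min(max(state[0] + dr, 0), N - 1)
    c = min(max(state[1] + dc, 0), N - 1)
    reward = MOVE_COST + (GOAL_REWARD if (r, c) == GOAL else 0.0)
    return (r, c), reward

def epsilon_greedy(state):
    """Pick a random action with probability EPS, otherwise the greedy action."""
    if rng.random() < EPS:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[state[0], state[1]]))

for episode in range(200):
    e = np.zeros_like(Q)                        # eligibility traces, reset each episode
    s = (0, 0)
    a = epsilon_greedy(s)
    for t in range(200):
        s2, reward = step(s, a)
        a2 = epsilon_greedy(s2)
        # TD error and accumulating trace for the visited state-action pair
        delta = reward + GAMMA * Q[s2[0], s2[1], a2] - Q[s[0], s[1], a]
        e[s[0], s[1], a] += 1.0
        # Update every state-action pair in proportion to its eligibility
        Q += ALPHA * delta * e
        e *= GAMMA * LAM
        s, a = s2, a2
        if s == GOAL:                           # treat reaching the source as terminal
            break

print("Greedy value of start state:", Q[0, 0].max())
```

The accumulating traces let the credit from reaching the energy source flow back along the whole visited path in a single update, which is exactly what the one-step TD(0) backup lacks.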