Reinforcement Learning for Soaring
CDMRG – 24 May 2010
Nicholas Lawrance
Persistent Autonomous Flight
Reinforcement Learning for Soaring
• What I want to do
• We have a good understanding of the dynamics involved in aerodynamic soaring in known conditions, but:
1. Dynamic soaring requires energy-loss actions to achieve net energy-gain cycles, which can be difficult using traditional control or path-generation methods
2. Wind is difficult to predict; guidance and navigation must be done on-line while simultaneously maintaining reasonable energy levels and meeting safety requirements
3. The classic exploration-exploitation problem has the added catch that exploration requires energy gained through exploitation
Reinforcement Learning for Soaring
• Why reinforcement learning?
• Previous work focused on understanding soaring and examining alternatives for generating energy-gain paths
• Balancing exploration and exploitation was always an issue; my code ended up being long sequences of heuristic rules
• Reinforcement learning could provide the link from known good paths towards optimal paths
Monte Carlo, TD, Sarsa & Q-learning
• Monte Carlo – learn an average reward for actions taken during a series of episodes
• Temporal Difference (TD) – estimate the value function incrementally, bootstrapping from the current estimate at each step rather than waiting for the end of an episode
• Sarsa – on-policy TD control
• Q-learning – off-policy TD control (one-step update rules for both are sketched below)
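For reference, the Sarsa and Q-learning one-step updates differ only in their backup target. This is a generic Python sketch against a numpy Q-table, not code from the soaring simulations:

import numpy as np

# Sarsa (on-policy): the backup target uses the action a2 actually chosen
# by the behaviour policy in the next state s2.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

# Q-learning (off-policy): the backup target uses the greedy action in s2,
# regardless of which action the behaviour policy takes next.
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.95):
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])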
Figure 6.13 (Sutton & Barto): The cliff-walking task. Off-policy Q-learning learns the optimal policy, along the edge of the cliff, but then keeps falling off because of the ε-greedy action selection. On-policy Sarsa learns a safer policy that takes the action-selection method into account. These data are from a single run, but smoothed.
Eligibility Traces
• TD(0) is effectively a one-step backup of Vπ (the reward is credited only to the immediately preceding action)
• Eligibility traces extend this to reward the whole sequence of actions that led to the current reward (see the trace update below)
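As a reminder, the standard accumulating-trace update (textbook form, in the same notation as the Sarsa(λ) slide that follows, not anything specific to the soaring work) is:

e(s,a) ← γλ e(s,a) for all s, a, then e(s_t, a_t) ← e(s_t, a_t) + 1

so the current TD error is credited to every recently visited state-action pair, with weight decaying geometrically in the time since that visit.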
Sarsa(λ)
• Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
• Repeat (for each episode):
  • Initialize s, a
  • Repeat (for each step of episode):
    • Take action a, observe r, s'
    • Choose a' from s' using policy derived from Q (ε-greedy)
    • δ ← r + γ Q(s',a') − Q(s,a)
    • e(s,a) ← e(s,a) + 1
    • For all s, a:
      • Q(s,a) ← Q(s,a) + α δ e(s,a)
      • e(s,a) ← γλ e(s,a)
    • s ← s'; a ← a'
  • until s is terminal
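A minimal Python sketch of the same tabular Sarsa(λ) loop with accumulating traces. The environment interface (env.reset(), env.step()) and the parameter values are illustrative assumptions, not the soaring simulation itself:

import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    # Tabular Sarsa(lambda) with accumulating traces (sketch).
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)           # eligibility traces, reset each episode
        s = env.reset()                # assumed environment interface
        a = epsilon_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)  # assumed to return (next state, reward, terminal)
            a2 = epsilon_greedy(s2)
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            e[s, a] += 1.0             # accumulating trace
            Q += alpha * delta * e     # update every (s, a) in proportion to its trace
            e *= gamma * lam           # decay all traces
            s, a = s2, a2
    return Q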
Simplest soaring attempt
• Square grid, simple motion model, energy sinks and sources
• Movement cost, turn cost, edge cost (an illustrative reward sketch follows)
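To illustrate the kind of step reward this implies (the cost values, map and function names here are placeholders, not the actual simulation code):

# Hypothetical reward for one move on the square grid.
MOVE_COST = 1.0    # energy spent per step (placeholder value)
TURN_COST = 0.5    # extra cost for changing heading (placeholder value)
EDGE_COST = 5.0    # penalty for running into the grid boundary (placeholder value)

def step_reward(energy_map, new_pos, heading, new_heading, hit_edge):
    # Energy change for a single grid move (sketch).
    r = energy_map[new_pos] - MOVE_COST      # energy source/sink at the destination cell
    if new_heading != heading:
        r -= TURN_COST                       # turn cost
    if hit_edge:
        r -= EDGE_COST                       # edge cost
    return r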
Simulation - Static
[Figures: results of the static simulation. A map over the 30×30 grid, and average reward over 10,000 time steps comparing ε-greedy (ε = 0.1, 0.01) and softmax (τ = 5, 1, 0.5) action selection.]
Hex grid, dynamic soaring
• Energy-based simulation (a sketch of the energy model follows this list)
• Drag movement cost, turn cost
• Constant speed
• No motion due to wind (because of the limited number of states)
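A rough sketch of what an energy-based step reward at constant speed could look like; the drag polar, the constants and the wind-power input are placeholder assumptions rather than the model actually used here:

import numpy as np

RHO, S, CD0, K = 1.2, 0.6, 0.02, 0.05   # air density, wing area, drag-polar coefficients (placeholders)
SPEED = 12.0                            # constant airspeed in m/s (placeholder)
DT = 1.0                                # time step in seconds, per the results slides

def step_energy(mass, wind_power_in, turning, bank_deg=45.0):
    # Net energy change (J) over one hex-grid step at constant speed (sketch).
    n = 1.0 / np.cos(np.radians(bank_deg)) if turning else 1.0   # load factor rises in a turn
    q = 0.5 * RHO * SPEED**2                    # dynamic pressure
    cl = n * mass * 9.81 / (q * S)              # lift coefficient required to hold the turn
    drag = q * S * (CD0 + K * cl**2)            # parasitic + induced drag
    return (wind_power_in - drag * SPEED) * DT  # energy extracted from wind minus drag loss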
Hex grid, dynamic soaring
[Figures: hex-grid dynamic soaring results. Maps over the hex grid, and average reward (W) over 10^5 time steps (Δt = 1 s) comparing softmax (τ = 50, 100, 500) and ε-greedy (ε = 0.01, 0.1) action selection.]
Next
• Reinforcement learning has advantages to offer our group, but our contribution should probably be focused on well-defined areas
• For most of our problems the state spaces are very large and usually continuous, so we need estimation (function approximation) methods; a sketch follows
• We usually have a good understanding of at least some aspects of the problem; how can/should we use this information to give better solutions?
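One standard route to handling large or continuous state spaces (noted here as an option, not something already implemented) is a linear approximation Q(s,a) ≈ w·φ(s,a) trained with semi-gradient Sarsa; the feature function phi below is an assumed placeholder, e.g. a tile coding of the continuous state:

import numpy as np

def semi_gradient_sarsa_update(w, phi, s, a, r, s2, a2, alpha=0.01, gamma=0.95):
    # One semi-gradient Sarsa(0) update for a linear approximator Q(s, a) = w . phi(s, a).
    # phi(s, a) must return a feature vector the same length as w (assumed, illustrative).
    td_error = r + gamma * np.dot(w, phi(s2, a2)) - np.dot(w, phi(s, a))
    return w + alpha * td_error * phi(s, a)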