RL 12: Hierarchical RL
Generalisation, Abstraction, Options
Michael Herrmann
University of Edinburgh, School of Informatics
01/03/2013
Overview
Options [last time]
SMDP [last time]
Feudal RL (Dayan and Hinton)
Hierarchical Abstract Machines (Parr)
MAXQ (Dietterich)
Cooperation
HPOMDP
Hierarchical RL
Learning Problems (T. G. Dietterich, 2006)
Given a set of options, learn a policy over those options
[Precup, Sutton, and Singh; Kalmar, Szepesvari, and Lörincz]
Given a hierarchy of partial policies, learn policy for the entire
problem [Parr and Russell]
Given a set of subtasks, learn policies for each subtask
[Mahadevan and Connell; Sutton, Precup and Singh; Ryan and
Pendrith]
Given a set of subtasks, learn policies for the entire problem
[Kaelbling (HDG), Singh (Compositional Tasks), Dayan and
Hinton (Feudal Q), Dietterich (MAXQ); Dean and Lin]
Feudal RL (Dayan and Hinton, 1995)
Quotes
Reward Hiding:
If a sub-manager fails to achieve the sub-goal set by its
manager it is not rewarded
Conversely, if a sub-manager achieves the sub-goal it is given it
is rewarded, even if this does not lead to satisfaction of the
manager’s own goal.
In the early stages of learning, low-level managers can become
quite competent at achieving low-level goals even if the
highest level goal has never been satisfied.
Information Hiding:
Managers only need to know the state of the system at the
granularity of their own choices of tasks.
A super-manager does not know what choices its manager has
made to satisfy its command.
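These two principles can be made concrete with a small sketch; the grid representation, reward value and function names below are illustrative assumptions, not Dayan and Hinton's implementation.

```python
# Minimal sketch of reward hiding and information hiding in a feudal
# hierarchy (illustrative assumptions, not Dayan & Hinton's original setup).

def submanager_reward(state, subgoal):
    """Reward hiding: a sub-manager is rewarded only for reaching the
    subgoal set by its manager, independently of the top-level goal."""
    return 1.0 if state == subgoal else 0.0

def manager_observation(state, resolution):
    """Information hiding: a manager sees the world only at the coarse
    granularity of its own level (here: square cells of a given size)."""
    x, y = state
    return (x // resolution, y // resolution)

# A sub-manager that reaches its subgoal is rewarded even if the coarse
# cell seen by the super-manager has not changed.
print(submanager_reward((3, 5), (3, 5)))          # 1.0
print(manager_observation((3, 5), resolution=4))  # (0, 1)
```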
Feudal RL (Dayan and Hinton, 1995)
Lords dictate subgoals to
serfs
Subgoals = reward
functions?
Demonstrated on a
navigation task
Markov property problem
Stability?
Optimality?
P. Dayan and G. E. Hinton (1993). "Feudal reinforcement learning." NIPS 5, 271-278.
Hierarchical Abstract Machines (HAM), (R. Parr, 1998)
Policies of a core MDP M are defined as programs which
execute based on their own state and the current state of the core MDP
HAM policy ≈ collection of FSMs: {H_i}
Four types of states of H_i
Action: generate an action (for M) based on the state of M and the
currently executing H_i:
\[
a_t = \pi(m_t^i, s_t) \in A_{s_t}
\]
where m_t^i is the current state of H_i and s_t is the current state
of M; this results in a state change in M
Call: Suspend Hi and start Hj setting its state based on st
Choose: (Non-deterministically) pick the next state of Hi
Stop: return control to calling machine (and continue there)
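A HAM can be represented as a small finite-state machine whose internal states are tagged with one of these four types. The following data-structure sketch is an illustrative assumption (the concrete wall-following machine is not Parr's):

```python
# Minimal sketch of HAMs as tagged finite-state machines; the machine
# contents are illustrative assumptions.

ACTION, CALL, CHOOSE, STOP = "action", "call", "choose", "stop"

# Each machine maps its internal state to (type, payload).
follow_wall = {
    "step":  (ACTION, "forward"),         # emit a primitive action for M
    "check": (CHOOSE, ["step", "done"]),  # choice point: resolved by learning
    "done":  (STOP, None),                # return control to the caller
}

navigate = {
    "go":      (ACTION, "forward"),
    "blocked": (CALL, "follow_wall"),     # suspend navigate, start follow_wall
    "pick":    (CHOOSE, ["go", "blocked", "end"]),
    "end":     (STOP, None),
}

machines = {"follow_wall": follow_wall, "navigate": navigate}
```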
State Transition Structure in HAM
Upon hitting an obstacle, a Choose state starts (recursively) either
follow-wall or back-off
HAM H: Initial machine + closure of all machine states in all
machines reachable from possible initial states of the initial machine
HAM ◦ MDP = SMDP
Composition of HAM and core MDP defines an SMDP
The only actions of the composite machine occur at choice points
These actions only affect the HAM part
As an SMDP, the system runs autonomously between choice points
What are the rewards?
Moreover, the optimal policy of the composite machine depends
only on the choice points!
There is a reduced SMDP, reduce(H ◦ M), whose states are just the
choice points of the original H ◦ M (i.e. Action, Call and Stop states
are contracted into neighbouring choice points).
We can apply SMDP Q-learning to keep updates manageable
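As a sketch, the SMDP Q-learning update applied at choice points looks as follows (the table layout and function signature are assumptions):

```python
def smdp_q_update(Q, choice, action, next_choice, disc_reward, tau,
                  alpha=0.1, gamma=0.99):
    """SMDP Q-learning at choice points of H o M.
    choice/next_choice: (HAM state, MDP state) at successive choice points;
    disc_reward: discounted reward accumulated over the tau steps between them."""
    best_next = max(Q[next_choice].values(), default=0.0)
    target = disc_reward + gamma ** tau * best_next
    Q[choice][action] += alpha * (target - Q[choice][action])
```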
Partial Policies
Example: Parr’s Maze Problem
TraverseHallway(d) calls ToWallBouncing and BackOut
ToWallBouncing(d1, d2) calls ToWall, FollowWall
FollowWall(d)
ToWall(d)
BackOut(d1, d2) calls BackOne, PerpFive
BackOne(d)
PerpFive(d1, d2)
Partial Policies
Results on Parr’s Maze Problem
Shown is the value of the starting state; "flat" Q-learning is ultimately better.
MAXQ
Included temporal abstraction
Introduced “safe” abstraction
Handled subgoals/tasks elegantly
Subtasks with repeated structure can appear in multiple copies
throughout the state space
Subtasks can be isolated without violating the Markov property
Separated subtask reward from completion reward
Example taxi/logistics domain
Subtasks move between locations
High level tasks pick up/drop off assets
MAXQ: Maze task
Application: A taxi driver should know the best route from
anywhere to anywhere
100 states × 100 tasks:
10,000 combinations of
start and goal
Actions: North, South, East, West. Each action succeeds
with prob. 0.8 and moves perpendicularly with prob. 0.2.
Requires 10000 values (or
40000 values in
Q-learning)
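These counts follow directly from the problem size: one value per start-goal pair for V, and one per pair and action for Q,
\[
100 \times 100 = 10\,000 \ \text{values for } V, \qquad 10\,000 \times 4 = 40\,000 \ \text{values for } Q .
\]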
MAXQ: Decomposition Strategy
Impose a set of landmark states. This partitions the state
space into Voronoi cells.
Constrain the policy. The
policy will go from the
starting cell via a series of
landmarks until it reaches
the landmark in the goal
cell. From there, it will go
to the goal.
Decompose the value
function.
MAXQ: Decomposition Strategy
Subtasks:
From the current location x to a landmark ℓ: V1(x, ℓ)
From landmark ℓ1 to landmark ℓ2: V2(ℓ1, ℓ2)
From a landmark ℓ2 to the goal location g: V3(ℓ2, g)
\[
V(x, g) = \min_{\ell_1 \in NL(x),\; \ell_2 \in NL(g)} \left\{ V_1(x, \ell_1) + V_2(\ell_1, \ell_2) + V_3(\ell_2, g) \right\}
\]
L(x) is the landmark nearest to x, NL(x) is the set of
neighbouring landmarks including L(x); no discounting (γ = 1)
Requires only 6,070 values to store as a set of Q functions.
Each of these subtasks can be shared by many combinations of
initial states and goal states.
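A minimal sketch of how the decomposed value would be evaluated, assuming the three tables V1, V2, V3 and the neighbour sets NL are stored as dictionaries (the data layout is an assumption):

```python
def decomposed_value(x, g, V1, V2, V3, NL):
    """V(x, g) = min over l1 in NL(x), l2 in NL(g) of
    V1(x, l1) + V2(l1, l2) + V3(l2, g), with no discounting (gamma = 1)."""
    return min(V1[(x, l1)] + V2[(l1, l2)] + V3[(l2, g)]
               for l1 in NL[x] for l2 in NL[g])
```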
Variant
Non-Hierarchical Execution (Kaelbling)
"Early Termination" of options/subtasks
MAXQ Value Function Decomposition
Does not rely on a single SMDP
Instead: a hierarchy of SMDPs
Starts with a decomposition of a core MDP M into a set of
subtasks {M_0, M_1, ..., M_n}
M_0 is the root subtask, i.e. solving M_0 solves M
Actions in M_0 are either elementary actions or subtasks M_k
for k > 0
Actions in M_k are either elementary actions or subtasks M_m
for m > k
This results in a task graph
MAXQ Value Function Decomposition (ctd.)
Subtask M_i is a triple ⟨π_i, S_i, T_i⟩
π_i can select any of the child nodes of M_i (or primitive
actions)
S_i: set of active states in which π_i can be applied
T_i: set of termination states
plus a pseudo-reward function (defined for states in T_i)
K: a pushdown stack that
contains the names and parameter values of the hierarchy of
calling subtasks;
the top of the stack contains the name of the subtask currently
being executed.
MAXQ Value Function Decomposition (ctd.)
π = (π_0, π_1, ..., π_n), where π_i is the policy of M_i
V^π([s, K]): hierarchical value function for π, i.e. the expected return
given that π is followed after starting in s with stack K
top-level value: V^π([s, nil])
projected value function V^π(i, s): expected return when
following π_i from s until termination
Each subtask i defines a discrete-time SMDP with state set S_i
transition probabilities: P_i^π(s', τ | s, a) (these involve probabilities
at lower levels)
Immediate reward: for all s ∈ S_i and all child subtasks M_a:
R_i(s, a) ≡ V^π(a, s)
Bellman equation:
\[
V^\pi(i, s) = V^\pi(\pi_i(s), s) + \sum_{s', \tau} P_i^\pi(s', \tau \mid s, \pi_i(s))\, \gamma^\tau V^\pi(i, s')
\]
The term V^\pi(i, s') is the expected return on completing subtask M_i starting at s'.
MAXQ Value Function Decomposition (ctd.)
\[
Q^\pi(i, s, a) = V^\pi(a, s) + \sum_{s', \tau} P_i^\pi(s', \tau \mid s, a)\, \gamma^\tau Q^\pi(i, s', \pi_i(s'))
\]
Completion function: expected return for completing subtask M_i
after subtask M_a terminates,
\[
C^\pi(i, s, a) = \sum_{s', \tau} P_i^\pi(s', \tau \mid s, a)\, \gamma^\tau Q^\pi(i, s', \pi_i(s')),
\]
so that Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a).
MAXQ hierarchical value function decomposition:
\[
V^\pi(0, s) = V^\pi(a_n, s) + C^\pi(a_{n-1}, s, a_n) + \cdots + C^\pi(a_1, s, a_2) + C^\pi(0, s, a_1)
\]
where V^\pi(a_n, s) = \sum_{s'} P(s' \mid s, a_n)\, R(s' \mid s, a_n) for the primitive action a_n.
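Read recursively, the decomposition evaluates the root value by walking down the chain of subtasks chosen by π; a minimal evaluation sketch under assumed table layouts (V_prim for primitive actions, C for completion functions):

```python
def maxq_value(i, s, policy, V_prim, C, is_primitive):
    """Evaluate V^pi(i, s) from the MAXQ decomposition:
    V(i, s) = V(a, s) + C(i, s, a) with a = pi_i(s), bottoming out at
    primitive actions, whose value is the expected one-step reward."""
    if is_primitive(i):
        return V_prim[(i, s)]      # = sum_s' P(s'|s,i) R(s'|s,i), precomputed
    a = policy[i](s)               # child subtask chosen by pi_i in state s
    return maxq_value(a, s, policy, V_prim, C, is_primitive) + C[(i, s, a)]
```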
Consequences
Non-Markovian w.r.t. state set of the core MDP
Markovian w.r.t. state set augmented by the stack K
As a consequence, subtask policies have to assign actions to
every combination (s, K) of core state s and stack K
Two kinds of optimality
A hierarchically optimal policy for MDP M is a policy that
achieves the highest cumulative reward among all policies
consistent with the given hierarchy.
A recursively optimal policy for Markov decision process M is
a hierarchical policy such that for each subtask, the
corresponding policy is optimal for the implied SMDP
A policy may be recursively optimal but not hierarchically
optimal because hierarchical optimality of a subtask generally
depends not only on that subtask’s children, but also on how
the subtask participates in higher-level subtasks.
Algorithm
The algorithm learns hierarchical policies from sample
trajectories
It is a recursively applied form of SMDP Q-learning
It takes advantage of the MAXQ value function decomposition
to update estimates of the subtask completion functions.
If the pseudo-reward functions of all the subtasks are
identically zero:
Function MAXQ-0 calls itself recursively to descend through
the hierarchy to finally execute primitive actions
When it returns from each call, it updates the completion
function corresponding to the appropriate subtask, with
discounting determined by the returned count of the number
of primitive actions executed.
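A compressed sketch of this recursion (exploration, table initialisation and the environment interface are simplified assumptions, not Dietterich's exact pseudocode):

```python
import random

def greedy_value(i, s, C, V_prim, children, is_primitive):
    """Value of subtask i in s under the current tables (greedy evaluation)."""
    if is_primitive(i):
        return V_prim[(i, s)]
    return max(greedy_value(a, s, C, V_prim, children, is_primitive)
               + C[(i, s, a)] for a in children[i])

def maxq0(i, s, env, C, V_prim, children, is_primitive, terminal,
          alpha=0.1, gamma=1.0):
    """Recursive MAXQ-0 learning for subtask i from state s.
    Returns (number of primitive steps executed, resulting state)."""
    steps = 0
    while not terminal(i, s):
        a = random.choice(children[i])              # placeholder exploration
        if is_primitive(a):
            s_next, r = env.step(a, s)              # assumed environment API
            V_prim[(a, s)] += alpha * (r - V_prim[(a, s)])
            tau = 1
        else:
            tau, s_next = maxq0(a, s, env, C, V_prim, children,
                                is_primitive, terminal, alpha, gamma)
        # Update the completion function of i for child a, discounting by
        # gamma^tau, where tau counts the primitive actions a executed.
        best = max(greedy_value(a2, s_next, C, V_prim, children, is_primitive)
                   + C[(i, s_next, a2)] for a2 in children[i])
        C[(i, s, a)] += alpha * (gamma ** tau * best - C[(i, s, a)])
        s, steps = s_next, steps + tau
    return steps, s
```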
Performance
MAXQ-Q learning converges to a recursively optimal policy
assuming that state abstractions are applied under the
conditions (see next slide)
State Abstraction is very important for achieving reasonable
performance with MAXQ-Q learning
Conditions: Independent State Variables
Independence of transition probability:
partition the state variables into two sets X and Y such that
\[
P(x', y' \mid x, y, a) = P(x' \mid x, a)\, P(y' \mid y, a)
\]
Independence of reward:
\[
R(x', y' \mid x, y, a) = R(x' \mid x, a)
\]
Result distribution irrelevance: for all s_1 and s_2 such that
s_1 = (x, y_1) and s_2 = (x, y_2), and all s,
\[
P(s \mid s_1, a) = P(s \mid s_2, a)
\]
If these properties hold, then Y is irrelevant for option/subtask a,
and the completion function C(i, s, a) can be represented as
C(i, x, a).
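For illustration only (a hedged sketch; the variable split follows the standard taxi example, where the passenger's destination is irrelevant while navigating to a fixed landmark):

```python
from collections import defaultdict

# Completion values stored over the abstracted state x only, instead of the
# full state (x, y): C(i, x, a) rather than C(i, (x, y), a).
C = defaultdict(float)

def completion(subtask, state, action):
    x, y = state            # y (e.g. passenger destination) is dropped
    return C[(subtask, x, action)]
```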
Tradeoff Between Abstraction and Hierarchical Optimality
Claim: To obtain state abstraction, there must be a barrier between
the subtask and the context. Information must not cross that
barrier.
Suppose that the optimal policy within a subtask depends on
the relative value of two states s1 and s2 outside of that
subtask.
Then information about the difference between s1 and s2 must
flow into the subtask, and the representation of the value
function within the subtask must depend on the differences
between s1 and s2 .
This could defeat any state abstraction that satisfies the two
conditions given above.
Therefore, this dependency must be excluded from the subtask
(e.g., by a pseudo-reward function)
But to achieve hierarchical optimality, we must have this kind
of information flow. Therefore, state abstraction and
hierarchical optimality are incompatible.
Cooperation between agents based on MAXQ
Hierarchical State Estimator
Robot Navigation Task:
Navigation is mainly decided at high-level states
Locomotion at low-level states
Reaching a dead end is bad, but walking straight is still good.
Hierarchical POMDP
Two adjacent corridors in a robot navigation task.
Two primitive actions, "go-left" (dotted) and "go-right" (dashed)
Unobservable abstract states s1 and s2, each with two entry
and two exit states.
Hidden states s4, s5, s6, s9, s10 with associated observation
models.
HPOMDP: State uncertainty (belief state entropy)
Spatiotemporal abstraction reduces the uncertainty and requires
less frequent decision-making.
Frontier Problems
Automatic hierarchy discovery: Topology, Connectivity
Bottlenecks, “normal forms”, repeated components
“Safe” state abstraction as a kind of factorisation
Dynamic abstraction
Relationship between hierarchy and model
Computational complexity, scaling
The SMDP connection
Probabilistic policies over options µ : S × O → [0, 1]
Value functions over options: V^µ(s), Q^µ(s, o), V_O^*(s), Q_O^*(s, o)
Learning methods: Q-learning, TD, Dyna-Q (using a reward model)
Models of options
Planning methods: value iteration, policy iteration, ...
A coherent theory of learning and planning with courses of
action at variable time scales, yet at the same level.
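For reference, the SMDP Q-learning update over options used by these methods (cf. the SMDP material from last time), where r is the discounted reward accumulated over the k steps for which option o ran:
\[
Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r + \gamma^{k} \max_{o' \in \mathcal{O}_{s'}} Q(s', o') - Q(s, o) \right]
\]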
Summary
Termination improvement: Improving the value function by
changing the termination conditions of options
Combine option and intra-option learning (without necessarily
terminating the options)
Define subgoals: Refine and simplify policy learning within
options
Acknowledgements
Some material was adapted from web resources associated with
Sutton and Barto’s Reinforcement Learning book
. . . before being used by Dr. Subramanian Ramamoorthy in this
course in the last three years.
T. G. Dietterich (2000). "Hierarchical reinforcement learning with the
MAXQ value function decomposition." Journal of Artificial Intelligence
Research 13, 227-303; also using a talk from 2008.
A. G. Barto and S. Mahadevan (2003) Recent advances in
hierarchical reinforcement learning. Discrete Event Dynamic
Systems 13:4, 341-379.
Many slides are adapted from: S. Singh, Reinforcement Learning: A
Tutorial. Computer Science & Engineering, U. Michigan, Ann
Arbor. www.eecs.umich.edu/~baveja/ICML06Tutorial/
S. J. Bradtke and M. O. Duff (1995). Reinforcement learning methods
for continuous-time Markov decision problems. NIPS 7, MIT Press.