
Hierarchical Reinforcement Learning in
Continuous State Spaces
Dissertation
zur Erlangung des Doktorgrades Dr.rer.nat.
der Fakultät für Informatik
der Universität Ulm
Hans Vollbrecht
Abteilung Neuroinformatik
Universität Ulm, 89069 Ulm
At this point I would like to thank Günther Palm for his enduring willingness to engage in the many technical discussions with me, which accompanied me through the long gestation period of this work and also contributed substantially to my motivation. Thanks also to Laura for her patience with the single-mindedness that such a work brings with it.
Date of the oral examination: 30.10.2003
First reviewer: Prof. Dr. Günther Palm
Second reviewer: Prof. Dr. Uwe Schöning
Third reviewer: Prof. Dr. Martin Riedmiller
Contents

1 Introduction
1.1 Why use RL for control problems?
1.2 Hierarchical approaches to generalization, abstraction and modularity in RL
1.2.1 Hierarchical discretization
1.2.2 Hierarchical modularization
1.2.3 Benefits for RL by hierarchical approaches: shorter learning times
1.3 Organization of the thesis

2 RL in continuous state spaces
2.1 The optimization task
2.2 Dynamic Programming (DP)
2.3 The learning rule
2.3.1 Time discretization, action discretization, and the resulting DP-equation
2.3.2 State space discretization
2.3.3 The Q-learning rule on finite state partitionings
2.3.4 Q-learning for deterministic problems?
2.3.5 The action model
      Semi Markov Decision Processes
      Options
2.4 Generalization, Aliasing and Non-Markovian models
2.4.1 Definition of Aliasing and Generalization
2.4.2 Do we need a Markovian transition model?

3 The example task domains: the TBU problem and the mountain car problem
3.1 The TBU problem
3.2 The mountain car problem

4 Hierarchic Task Composition
4.1 Basic Motivation
4.1.1 Hierarchical Task Architectures
4.1.2 Optimality of Policies in Hierarchical Task Architectures
4.1.3 Three Hierarchical Composition Principles
4.2 The Veto Principle
4.2.1 Definition
4.2.2 Realization of the veto principle
      a) Learning the Q-functions of an avoidance task
      b) Defining the veto-function
      c) Learning a task under the veto function
4.2.3 Discussion of the veto principle
      1. Benefits from the veto principle
      2. Vetoed tasks are hierarchically optimal
      3. State generalization in avoidance tasks
      4. State generalization in tasks under the veto principle
4.3 The Subtask Principle
4.3.1 Definition
4.3.2 Q-learning under the Subtask Principle
4.3.3 Action durations and partitionings
      a) Action duration
      b) Partitionings in the subtask principle
4.3.4 The algorithm of the subtask principle
4.3.5 Discussion
      1. Reduction of complexity by the subtask principle
      2. Optimality and applicability of the subtask principle
4.4 The Perturbation Principle
4.4.1 Definition
4.4.2 Q-learning under the Perturbation Principle
4.4.3 Action durations and partitionings
      Action durations
      Partitioning
4.4.4 A Multilevel Perturbation Architecture: Definition and Algorithms
4.4.5 Discussion
      1. Complexity Reduction
      2. What is the meaning of a level's equilibrium?
      3. Optimality
      4. Instability
4.5 Implementation
4.6 Results
4.6.1 The veto principle
4.6.2 The perturbation principle
4.6.3 The subtask principle
4.6.4 All three principles together
4.6.5 Complex TBU tasks

5 kd-Q-Learning
5.0.6 Aliasing problems
5.0.7 Dynamic state space discretization is required on all semantic action levels
5.0.8 Acceleration of learning
5.0.9 Approaches to adaptive state space discretization
5.1 State Splitting
5.1.1 Some simple state splitting rules
5.1.2 State splitting and confidence in the value function
5.2 kd-Q-learning
5.2.1 Basic idea
5.2.2 The learning algorithm
      The Level Descent Process
      Definition of the value of the (hierarchic) successor state in a state transition
      Confidence in the value of a state
5.3 Results
5.3.1 Mountain Car
5.3.2 The TBU problem
5.4 Discussion
5.4.1 Problems with the high variance of the performance
5.4.2 Special features and possible extensions

6 Conclusion
6.1 Contributions
6.2 Future Work
Zusammenfassung (Summary)

Reinforcement learning (RL) has now seen two decades of intensive research. It is attractive both for theoretical work, because of its solid mathematical foundation, and for practical applications, because of its purely empirical approach. In the second half of the nineties, this learning method received a mature theoretical formulation through a unified view of a whole range of seemingly different RL methods in the work of R. Sutton, A. Barto, D. Bertsekas and J. Tsitsiklis [Sutton & Barto, 1998, Bertsekas & Tsitsiklis, 1996]. On the application side, however, RL finds its greatest obstacle to widespread adoption in the problem of long learning times, and much research effort has been devoted to overcoming this problem.

On the other hand, and quite independently of RL research, a number of so-called behavioral architectures have been developed and applied to real-world problems (mostly in robotics) since the second half of the eighties, in a spirit of new departures and with great creativity, triggered by the loosening of some dogmas of classical AI approaches. These behavioral architectures were remarkably successful in practical applications, but they mostly lacked a theoretical foundation: designing a successful subsumption architecture [Brooks, 1986] often requires a kind of art.

This dissertation starts from the conviction that behavioral architectures hold one of the keys to making RL more efficient. The work began with the idea that a behavioral architecture can be described by well-defined principles for composing individual behaviors, if such a composition is seen primarily as a learning problem and not merely as an execution problem, as it has mostly been seen in the past. Since RL offers a high degree of abstraction for elementary concepts such as state, action, and goal-directedness, within a uniform and simple theoretical framework, it seemed promising to choose RL as the sole learning method for behavioral architectures. This choice allows the evaluation of architectures to be put on a simple and transparent basis: optimality in the sense of the optimality of a solution, learned through reward and punishment, to a Markov decision problem (maximization of the expected future reward). This definition of optimality admits a coherent extension to hierarchical behavioral architectures.

This dissertation presents three elementary principles for composing learned behaviors in a modular and hierarchical behavioral architecture, in which a composition is itself learned with RL. Hierarchy plays an important role in these composition principles. Much as in software architectures, hierarchy in behavioral architectures is meant to promote modularization; it rests on so-called top-down abstraction - tasks at higher levels abstract from certain operational details, which they delegate to lower-level tasks - and on bottom-up abstraction - tasks at lower levels are defined and learned independently of the specific context in which they are used by higher-level tasks. The concept of a semi-Markov decision process makes it possible to give a unified theoretical formulation of the composition of learned behaviors into more complex behaviors. The ultimate benefit of this approach lies in the shortening of learning times.

The application domain to which this dissertation contributes is that of control problems in continuous state spaces which are not restricted to physical space (not only "maze tasks"!). In such control problems, the interactions between several partial behaviors can become much more complex than in physical space. Moreover, suitable discretization techniques have to be defined, which constitutes a further focus of this work. In this area, a contribution is made to an important research topic of RL: that of generalization in the state and action space of an RL problem. A new RL algorithm (kd-Q-learning) is presented in which an adaptive discretization of the state space is achieved. The discretization is represented by a so-called kd-trie, and the agent learns simultaneously at several temporal and spatial resolution scales. Adaptation then happens by choosing the coarsest resolution at which the optimal action is unique across all finer resolution scales.
Abstract
Reinforcement learning (RL) has been studied intensively for almost two decades. It has been attractive both for theoretical investigation, because of its sound mathematical foundation, and for practical applications, because it permits purely empirical learning. In the second half of the nineties, the theory was given a mature formulation in a unifying view by the work of R. Sutton, A. Barto, D. Bertsekas and J. Tsitsiklis [Sutton & Barto, 1998, Bertsekas & Tsitsiklis, 1996]. On the application side, however, RL has found its major obstacle to broad acceptance in the problem of long learning times, and much research has been done in the attempt to overcome this problem.
On the other hand, and independently of RL research, behavioral architectures have been developed and applied to real-world problems (mostly in robotics) since the second half of the eighties, in an atmosphere of new departures and great creativity that overcame some dogmas of the classical AI approaches. These approaches had notable success in practical applications, but lacked a theoretical foundation: building a successful subsumption architecture [Brooks, 1986] for a given control problem, for example, is something of an art.
The work of this thesis started from the idea that a behavioral architecture can be defined with sound principles of behavior composition when it is viewed primarily as a learning problem, and not exclusively as an execution problem, as has mostly been done in the past. Since RL uses a high degree of abstraction for basic concepts such as state, action, and goal-directedness, within a homogeneous, simple theoretical framework, it seemed promising to take RL as the unique framework for learning in a behavioral architecture. This choice allows for putting the evaluation of architectures on a solid and transparent basis: optimality in the sense of optimality of an RL-learned solution to a control problem (i.e. maximization of the expected future reward). This definition of optimality finds a coherent extension applicable to hierarchical behavior architectures.
The thesis presents basic composition principles of learned behaviors in a modular and hierarchical behavior architecture in which a composition has to be learned, again by RL. Hierarchy plays an important role for these composition principles. Very much as in software architectures, hierarchy in behavioral architectures aims at modularization and is based on top-down abstraction - high-level tasks abstract from operational details which are delegated to low-level tasks - and on bottom-up abstraction - low-level tasks are defined and learned independently of the context they will be used in by high-level tasks. The concept of a Semi Markov Decision Process allows for a unifying theoretical formulation of the composition of learned behaviors into a complex behavior. The ultimate benefit of this approach is the shortening of learning time.
The application domain at which the contributions of this thesis aim is that of control problems in continuous state spaces that are not limited to physical space (not only maze tasks!). In this case, interactions between behaviors in a behavior architecture become much more complex than in physical space. Furthermore, suitable discretization techniques have to be developed, contributing to an important research topic in RL: that of generalization in state and action space. In this respect, the thesis presents a new RL algorithm (kd-Q-learning) with an adaptive state space discretization technique based on a state representation with kd-tries. The agent learns simultaneously at different scales of temporal and spatial resolution. During this learning process, adaptation is accomplished by choosing the coarsest resolution such that the optimal action is unique across all scales of resolution at lower levels of the kd-trie.
Glossary

Symbol : Short Description
MDP : Markov Decision Problem
A : action set of an MDP
A(s) : action permission function of state s
S, s : (continuous) state space of an MDP, single state in S
s_i : state after i-th transition in a (learning) episode
π(s) : a policy selecting an action a in state s
ε : exploration probability in an ε-greedy policy
V^π(s) : value function evaluating state s for policy π
V^{π*}(s), V*(s) : value function evaluating state s for the optimal policy π*
Q^π(s, a) : quality (Q-)function evaluating action a in state s for policy π
Q^{π*}(s, a), Q*(s, a) : quality (Q-)function for the optimal policy π*
Q, V : used whenever it is clear from the context which policy is meant
r, r(s), r(s, a) : running reward
R, R(s) : boundary reward in terminal state s
J_t(s, a(.)) : reward functional under control a(t)
J_∞(s, a(.)) : lim_{t→∞} J_t(s, a(.))
h : the basic time step in time discretization
Q_h(s, a), V_h(s), r_h : quality, value, reward with explicit reference to the step size
dur(a), dur(a, s) : duration of action a in state s: number of steps h
s --a--> s' : transition from state s to s' with action a, of length dur(s, a) · h
α, α_i, α_t : learning factor after i-th update for stochastic approximation
γ : a reward discount factor in [0, 1]
Ŝ : a (finite) partitioning of S
S(s) : a partition of S from Ŝ with s ∈ S(s)
Ŝ_a, Ŝ_a(s) : a separate partitioning for each action a
V̂, Q̂ : value and quality function approximation on partitioning Ŝ
SMDP : Semi Markov Decision Problem
option : generalized action, given a fixed policy and a termination probability
T, T(p), T(p | S) : a task in state space S, with predicate p defining goal states
T°, T°(p), T°(p | S) : an avoidance task, with predicate p defining collision states
v(s) : a veto function on state s
Ā : flat action set (only elementary actions, no options)
S^(j) : state space of the j-th subtask in the subtask principle
S^eq : state space of the equilibration task in the perturbation principle
S_eq : set of equilibrium states
S_term : set of terminal states
C_LF : constant for scaling the decay of the learning factor α
ŝ = (ŝ_k, .., ŝ_{k−d}) : a hierarchic state in a kd-trie
ŝ_i : a node in the kd-trie at level i
d : hierarchic depth in kd-Q-learning
β : confidence propagation factor in kd-Q-learning
Chapter 1
Introduction
This thesis examines how a certain class of machine learning techniques
known as reinforcement learning (RL) can be applied, by means of hierarchical representations, to deterministic control problems. Typical application domains are pH-value regulation in chemical processes, mechanical control tasks such as damping down the swinging load of a crane, or problems such as navigating a trailer truck backwards. Problems of this type are defined in real-valued state-spaces with system dynamics describable by differential equations, and the learning controller must find a feedback control law mapping states to control actions in discrete time steps. A controller of such a system applies an action a_t in state s_t at time t and transits to a
The controlling agent at time t
- perceives with his sensors the system's state s_t
- chooses an action a_t = π(s_t) with his policy π
- executes action a_t
and at time t+1
- receives a reward r_{t+1} ∈ {+1, -1, 0}
- perceives with his sensors the system's state s_{t+1}
- chooses an action a_{t+1} = π(s_{t+1}) with his policy π
- etc.
Goal: learn a policy π : S → A (S, A the set of states and actions, resp.) that maximizes
    r_{t+1} + γ · r_{t+2} + γ² · r_{t+3} + . . .   (γ ≤ 1)
with r_{t+i} ∈ {+1, -1, 0} a reward, a punishment, or no reinforcement. The expected value of this accumulated future reward following policy π is called the value V(s_t) of state s_t.
Figure 1.1: a controller learning by reinforcement
new state s_{t+1}, eventually receiving a reward r_{t+1} (figure 1.1). The states are the controller's representation of the controlled system, based mainly on sensor signals. Reward, as a reinforcement, is a qualitative signal given to the agent only sparsely. It does not measure an error, but merely signals to the agent the success or failure of its current or past actions. In the form used in this thesis, an agent is rewarded only when it reaches a desired region of state-space, or punished when it violates some constraint, for example when a robot collides with an obstacle.
Learning control by reinforcement is not new. The problem of balancing
an inverted pendulum, for example, is a classical control problem that has
been solved by reinforcement learners already in the early days of RL [Barto
& Sutton & Anderson, 1983]. The new ideas presented and developed in
this thesis consist in applying hierarchical structures to RL in the following
two areas:
• hierarchical task composition in a modular controller architecture,
• hierarchical abstraction in state and action-space of the learning system.
Before introducing and motivating these two points, a more general question
shall be considered.
1.1 Why use RL for control problems?
Traditional approaches of the engineering sciences to control theory are
mainly concerned with the definition of the system dynamics by a set of
differential equations with boundary conditions and constraints such as
minimizing a cost function, numerical solution schemes for such equations,
stability analysis, and system identification techniques. Learning is considered at most a means for identifying systems (system parameter estimation), and for finding robust controller designs for some kind of adaptive
control adapting to drifting system parameters.
The motivation for applying RL to control problems is to
1. define the problem as simply as possible: by some qualitative signal provided in critical or desired state regions - punishment or reward. Going even further in simplification, no a priori knowledge of the system dynamics might be required.
2. understand how state (input) representation, action (output) representation, and a behavioral architecture may help reduce the complexity of searching for a solution to a control problem.
3. use a closed, elegant learning theory that helps in understanding the
impact of predefined structure and of the environment on a completely
empirical learning process.
4. learn a control policy for stochastic systems.
This thesis is motivated by the first three points. The first point is attractive mainly for practical reasons. Some systems are hard to model in terms of differential equations, either because we lack a theory of their dynamics, or because the system is not accessible (a black box), or because the person interested in the control agent knows what the system should do, but she may not have the skill to define a model of the system dynamics.
In all these cases, significant support is offered by a method for learning
a controller from a simple problem formulation in terms of a qualitative
reward/punishment signal (three-valued: > 0, = 0, < 0), of a set of sensor
signals and of action commands. Let us call a learning approach model-free
whenever learning occurs exclusively by experience without a predefined
model of system dynamics. All learning approaches presented in this thesis
are model-free.
The second point of motivation is the central point of interest for this work, since input/output representation and modularity in complex behavior are issues important not only for accelerating the learning times of basic RL techniques, which are notoriously slow [Kaelbling & Littman & Moore, 1996]. Classical approaches to control theory in numerical mathematics, established among others by Kushner [Kushner & Dupuis, 1992] and Hackbusch [Hackbusch & Trottenberg, 1982], are based on discretization techniques for the state space which may also be improved by the insights gained with RL (chapter 5). In this respect, the advantage of RL is that it offers a basis for defining what a controlling agent has to remember from its experience during learning in order to learn to control some process. This may be much less than predictive knowledge of the system dynamics in the form of predicted state transitions under any action (a state transition model). The point is that there is a fundamental difference between prediction and evaluation, the latter meaning the evaluation of a certain action in a given state with respect to expected future reward (as indicated in figure 1.1). When reducing the requirements from prediction to evaluation (when we can predict, we can also evaluate, but when we can evaluate, we might not be able to predict), new insights into the discretization of state- and action-space will come up, as will be discussed in chapters 2 and 5.
As far as modularity in behavioral architectures of control systems is concerned, RL with its closed and elegant theory offers a sound basis for developing and evaluating such architectures in a homogeneous way. The importance of this issue is underlined by the long-standing discussion in the AI community on planning and behavioral architectures, ever since fundamental criticism of traditional AI approaches was expressed, for example, by Brooks with his subsumption architecture [Brooks, 1986], and by Chapman and Agre with their "deictic" approach to planning [Agre & Chapman, 1990].
The third point of motivation for employing RL in control is that RL as
a purely empirical learning method allows for studying the effect of predefined structure and of the environment on learning and behavior of the
controller. In contrast to classical control theory which is based on assumptions regarding the system type (linearity, or type of non-linearity for example), RL starts with no assumption of this type, learning in its pure form
only by trial and error. This is an important feature for understanding the
impact of predefined structure such as the choice of the sensor signals, the
discretization schemes for state and action space, composition principles
for several controllers pursuing different, possibly conflicting objectives, constraints on action choice, and the like. Moreover, classical control theory requires a system model that is often idealized with respect to the real system, which is usually characterized by many constraints. In this respect, RL represents more of an ecological approach in which we can observe how the interplay between the predefined structure and the specific environment shapes the learned behavior. Whoever has tried to define a "good" reward function for a specific control problem knows how important this environmental influence is for successfully learning control. See figure 1.2 for an illustrative example.
RL with optical flow (from [Leb00]); a) - c): reward for zero optical flow difference between left and right flow (balancing); d) - e): punishment for wall collision (d) learned with start positions only in the shaded area).
Figure 1.2: environmental influence on the learned behavior in RL
1.2 Hierarchical approaches to generalization, abstraction and modularity in RL
This thesis starts from the following basic design problems for a controller
that learns a policy (a mapping π from states to actions, see figure 1.3) by
reinforcement:
Figure 1.3: the thesis's basic questions on RL - from sensor values and internal state through the policy π to actions: representation? generalization? architecture? spatial abstraction? temporal abstraction?
1. which representation of the controller’s current situation is best for
acting optimally, and for learning such an optimal behavior?
A basic structure of a controller’s state is given by the distinction between the sensory state and the internal state of the controller. While
the former relates only to sensor values or to features extracted from
them, the latter represents behavioral states of the controlling agent
such as subgoals generated by the agent, or temporary relationships
between several simultaneously active controllers in a modular controller architecture. This distinction leads to the following basic representation problems:
(a) Which sensors are relevant to the behavior to be learned? Is a
feature-based representation available? Shall the sensory state-space
be discretized? Which form of generalization1 for the state-based policy can be aimed at?
(b) What should the internal architecture of the controller be? Should
it be different for learning and exploiting learned behavior? In case
of a modular architecture, what should the interactions be between
1 generalization:
the act of extending a statement or a behavior for a specific situation
to a more general situation. It is based on abstraction (see next footnote).
different controllers? How should they be composed? Does the architecture define internal states which support the basic model requirements for RL (for example Markovian state transitions)?
2. what are the range and the contents of possible temporal abstractions2
for actions?
Temporal abstraction is strongly related to the semantic level of actions in an agent. More abstract actions are usually more extended in time. Reinforcement learning supports a very broad meaning of actions: an action can be anything the agent has complete control over (not over its effects, but over its identification) and which may modify the system's situation. It only has to satisfy the basic model requirement of causing state transitions that are Markovian. Thus, the following two questions are fundamental:
(c) What does an action represent? For example, a motor signal kept
constant for some time, or a temporally extended behavior already
learned by some controller, or a parameter setting for another action?
(d) What is the relationship between temporal abstraction in action-space and spatial abstraction in state-space?
All these questions are not completely new, and current and past research
on them will be discussed later on.
The techniques presented here, which will answer some of the above questions, are all based on a common concept: that of hierarchy. Hierarchy
is commonly understood as a static organization principle. It is based
on abstraction: higher-level units abstract from the details of lower-level
ones, representing a common characteristic or responsibility. A detailed
discussion of the meaning of ”hierarchy” in behavioral architectures will be
given at the beginning of chapter 4.
1.2.1 Hierarchical discretization
Applied to control and RL, hierarchy plays an important role in organizing
the generalization, based on spatial and temporal abstraction, of a policy
during learning. I start with a hierarchical representation of a cellular
discretization of the sensory state-space in the form of a so-called kd-trie, with
lower-level nodes representing smaller cells that refine the discretization
of larger cells represented by higher-level nodes (see figure 2.1, page 34).
Based on this hierarchical abstraction of sensory state-space, a hierarchical
2 abstraction: (in this thesis) the act of substituting the multiplicity or the details of
a situation (set of states) or of a behavior (temporal course of acting) by a higher-level
unit. In the first case, I speak about spatial abstraction, in the second about temporal
abstraction.
generalization of a policy will be defined (i.e. one best action for all states
within a discretization cell) that relates spatial abstraction to the most
simple form of temporal abstraction for actions: that of repetition. This
leads in a natural way to the distinction between the intensity (the value
of a motor signal, for example) and the duration of an action. A policy will
select only the intensity from a finite value set, while the duration will be
adapted to the cell width the current state falls into. A new RL-algorithm
will be presented in chapter 5, the so-called kd-Q-learning3 , which at the
beginning learns simultaneously on several hierarchical levels representing
different spatial abstractions, using the intrinsic hierarchical structure of
kd-tries. As learning proceeds, state transitions get increasingly refined
by a descent in the kd-trie scaling down both the spatial and temporal
abstraction.
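To make this representation concrete, the following minimal sketch shows a kd-trie cell structure of the kind described above; the class name, the halving of cells along a split axis that cycles with the depth, and the leaf lookup are illustrative assumptions of this example, not the exact data structure defined later in chapters 2 and 5.

```python
class KdTrieCell:
    """A cell (node) of a kd-trie partitioning a box-shaped state space.

    A split cuts the cell into two equal halves along one coordinate axis;
    here the split axis simply cycles with the depth (an illustrative choice).
    """

    def __init__(self, low, high, level=0):
        self.low, self.high = list(low), list(high)  # cell bounds in R^n
        self.level = level                           # depth in the trie
        self.children = None                         # None => leaf cell

    def split(self):
        axis = self.level % len(self.low)
        mid = 0.5 * (self.low[axis] + self.high[axis])
        left_high = list(self.high); left_high[axis] = mid
        right_low = list(self.low);  right_low[axis] = mid
        self.children = (KdTrieCell(self.low, left_high, self.level + 1),
                         KdTrieCell(right_low, self.high, self.level + 1))

    def leaf(self, s):
        """Return the finest cell currently containing state s."""
        node = self
        while node.children is not None:
            axis = node.level % len(s)
            mid = 0.5 * (node.low[axis] + node.high[axis])
            node = node.children[0] if s[axis] < mid else node.children[1]
        return node
```

A descent in this trie halves the cell width along one dimension, which corresponds to refining the spatial abstraction; the accompanying refinement of temporal abstraction follows from adapting the action duration to the cell width, as described above.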
This technique improves on state-splitting techniques, which have long been proposed as a means for adaptive resolution of the discretization in RL. It will be shown how state-splitting can be interpreted as a level descent in a kd-trie. kd-Q-learning improves on state-splitting in that it avoids the loss of information
that occurs in plain state-splitting with respect to the experience of the
learning agent that led to the decision to split a state.
1.2.2 Hierarchical modularization
A second area, central to this thesis, in which hierarchy plays an important role is the architecture of a modular control system. A well-known technique for solving a complex problem is to break it down into simpler ones: the top-down problem solving approach using the principle of divide-and-conquer. Applied to complex control problems, the idea is to break the control problem down into simpler ones, then let separate controllers
learn each single control problem, and finally combine them to perform the
complex task. In this respect, the dual approach has to be considered too:
build up from already learned simpler controllers, in a bottom-up manner,
a more complex controller. This idea is a common one, for example, in
software engineering, where reusable components are composed to build
up more complex software. This requires a kind of bottom-up abstraction, in which simple components have to be defined in such a way that
they abstract from their specific use at a higher level for being reusable in
different (application) contexts. In this thesis, I will present results on a
problem to be solved for a modular controller architecture: what are the
principles of composition of different controllers learned by RL. Continuing
3 a hierarchical extension of Watkins' plain Q-learning, which is based on simple look-up tables. See chapter 2.
the analogy with software engineering, object-oriented software development, for example, uses messages, events, delegation and inheritance for
composing and extending the functionality of a software system. They are the basic composition principles for predefined or newly developed, simpler components which guide the software designer in decomposing a complex problem into simpler ones solved by separate pieces of software. The design of a modular controller architecture needs such basic composition principles for controllers as well. These principles, however, must solve problems different from those of software engineering. In control, support is required for
• the stability of solutions
• the handling of dependencies between solution paths to the goals of
different controllers
• the reduction of state space complexity
• the reduction of the temporal depth for the outcome of decisions
• the coherent temporal resolutions of different controllers
to mention just the most important ones.
Again, this thesis follows the concept of hierarchy for the definition of
composition principles of controllers. Chapter 4 presents three new basic principles of hierarchical composition in which complex control tasks
are described by conjunctions of goal predicates each learned by a separate
controller. Since RL in this thesis is required to be the only learning method
employed, these principles are defined such that whenever composition has
to be learned, it will be learned by RL. The motivation for this requirement
has been given in section 1.1.
In the first composition principle called the veto-principle, the evaluation
function4 of a controller of an avoidance task (for example the avoidance of
obstacles for a vehicle) will be combined with the evaluation function of a
controller of a goal seeking task (for example the navigation of a vehicle to
a specific place) in such a way that the former may put a veto on the action
selection of the policy of the latter in collision-critical parts of state-space,
but leaving the policy of the latter in noncritical parts completely uninfluenced. This is a non-trivial interaction since optimal paths to desired
regions of state-space might pass as close as possible to obstacles.
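For illustration only, action selection under such a veto might be sketched as follows; the names q_goal and veto, and the fallback when every action is vetoed, are assumptions of this sketch, while the precise definition of the veto function is given in chapter 4.

```python
def vetoed_greedy_action(s, actions, q_goal, veto):
    """Greedy action of the goal-seeking task, restricted by the avoidance task.

    veto(s, a) is assumed to be derived from the avoidance task's Q-function and
    to be True only in collision-critical parts of state-space; elsewhere the
    goal-seeking policy is left completely uninfluenced.
    """
    allowed = [a for a in actions if not veto(s, a)]
    if not allowed:              # illustrative fallback if every action is vetoed
        allowed = list(actions)
    return max(allowed, key=lambda a: q_goal(s, a))
```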
The second principle, the subtask principle, uses explicit task activation
as a hierarchical relationship between two tasks, where the execution of
4 evaluating
the expected accumulated future reward as indicated in figure 1.1
one task becomes a single action of the second task. Unlike the common
approach in which the activated controller of a (sub)task is active until
achievement of its subgoal (useful for independent subgoals [Singh, 1993]
or for disjunctions of subgoals), the subtask principle lets it perform just one step, the length of which is adapted to the particular discretization of the subtask's state-space.
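A minimal sketch of this idea, under the assumption of a simple simulated environment interface (env_step) and a duration function dur as introduced in chapter 2, might look as follows; it is not the learning algorithm of chapter 4, only an illustration of "one adapted step of a subtask as a single action".

```python
def one_subtask_step(state, subtask_policy, dur, env_step, h):
    """Execute one step of an activated subtask as a single higher-level action.

    The subtask's policy chooses a constant control a, which is held for
    dur(a, state) basic time steps of length h (the duration adapted to the
    subtask's state-space discretization); then control returns to the caller.
    """
    a = subtask_policy(state)
    for _ in range(dur(a, state)):
        state = env_step(state, a, h)   # advance the system by one basic step h
    return state
```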
In the third principle, the perturbation principle, two controllers are hierarchically related to each other such that the higher one perturbs the
goal state of the lower one in the direction of its own goal. After each
such perturbation action, controllers of the lower levels re-establish a system state in which their goal predicates hold. This principle defines a multi-layer architecture, with sequential task composition on each layer, each layer maintaining the system in an equilibrium condition which can be perturbed by the next higher one. The main advantage of this architecture consists in a reduction of state space complexity at higher levels.
The perturbation principle is related to a similar principle in biomechanics,
the so-called Equilibrium Point Hypothesis [Feldman, 1974], which postulates a principle of motor control based on equilibrium states of multiple,
antagonistic muscles which get perturbed by motor neurons during motor
activities. The same benefit of reduction of the degrees of freedom during
control underlies this hypothesis.
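A rough sketch of one cycle of this interaction is given below; the function names, the environment interface, and the step bound are illustrative assumptions, not the multilevel algorithm defined in chapter 4.

```python
def perturb_and_reequilibrate(state, perturbation, low_policy, in_equilibrium,
                              env_step, max_steps=1000):
    """One macro step of a two-level perturbation architecture (illustrative)."""
    state = env_step(state, perturbation)        # higher level perturbs the equilibrium
    steps = 0
    while not in_equilibrium(state) and steps < max_steps:
        state = env_step(state, low_policy(state))   # lower level re-establishes its goal
        steps += 1
    return state                                  # seen by the higher level as one transition
```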
As already mentioned, the task composition principles of this thesis are all
defined within the RL framework. This common theoretical basis makes
it possible to define different types of optimality and suboptimality which
can be achieved by these composition principles. In contrast, early steps
to task composition such as those presented by the work of Brooks or Mahadevan [Brooks, 1985, Mahadevan & Connell, 1992], lacked a framework
for optimality, and their benefits were argued rather intuitively.
This thesis treats only task composition, not the dual question of how to
decompose a complex problem. Now, task (controller) composition is the
ultimate objective of decomposing a complex task. Thus, once powerful
composition principles have been identified, and their potential benefits
have been formulated quantitatively and understood formally, they can be exploited by crafting a task decomposition by hand, or by trying to find a way in which the agent itself can learn the decomposition. This latter issue points in the direction of future research.
1.2.3 Benefits for RL by hierarchical approaches: shorter learning times
Reinforcement learning is attractive for control problems in which we lack
a model of the system dynamics at the outset of learning. This appeal has lasted for more than a decade; nonetheless, a major obstacle to RL becoming attractive in practice as well, namely its long learning time, has not been overcome so far. Maybe this problem is basically part of its appeal: it is a kind of learning "ex nihilo". However, in most cases in which we prefer or even have to learn without specifying the system dynamics, the system will be real, and long learning times become prohibitively expensive. The hierarchical approaches presented in this thesis have been developed mainly with the objective of improving learning times considerably. This objective has been pursued at two different levels:
• at the level of basic reinforcement learning theory, a hierarchical approach to state space discretization in control has been developed in the framework of Q-learning.
• at the architectural level, hierarchical composition principles for controllers in a plain RL framework have been developed that foster modular reuse and complexity reduction of the learning process.
Both developments improve learning times. Moreover, they are tightly
coupled since
• the composition of controllers learned with RL requires a careful integration of their spatial and temporal abstractions, and
• both require a more general concept of action than that offered by
the predefined set of elementary actions: constant actions of variable
duration, or actions which are themselves complex behaviors, as in
the subtask and the perturbation principle.
This common ground of two apparently different developments has come
up during the several years of work on this thesis. It explains, together with
the basic idea of hierarchy, why the two issues are presented here in one
work on RL.
1.3 Organization of the thesis
In the following chapter, the basic system model, the learning theory, the
discretization structure of state and action space, and some problems that
arise with discretization, are presented.
Chapter 3 introduces two example control problems used for illustration
and quantitative evaluation of the concepts of the following chapters.
Chapter 4 treats modular, hierarchic controller architectures. After an
overview of the different types of modularity in existing research, three
hierarchical composition principles developed in this thesis will be presented, together with their learning rules. The concrete architecture for the example problem of a trailer truck navigating backward will be presented, together with performance results. Special topics regarding stability and convergence will be discussed at the end of the chapter. In chapter 5, kd-Q-learning will be presented. As a starting point, so-called state-splitting techniques in RL will be discussed, and some state-splitting rules developed for this thesis will be presented, together with the major drawbacks of this approach in general. The algorithm of kd-Q-learning will then
be presented, and it will be shown how state-splitting can be integrated in
a natural way into this new learning method. Performance evaluations will
demonstrate the improvement in the two example problems.
Chapter 2
RL in continuous state spaces
This chapter defines the basic system model for continuous control and the
optimization problem. It will then mainly present the basic learning theory.
2.1 The optimization task
The optimization tasks considered in this thesis are defined for deterministic, controlled systems with real-valued system states. Let x(t) ∈ S be the
state of a system at time t, where S ⊂ R^n is the set of system states. The system can be controlled by a control a(t) ∈ A, where A ⊂ R^m. S will be called the state-space, and A the action-space. In this thesis, A will always be a finite set1. System dynamics are given by a controlled differential equation

    ẋ(t) = f(x(t), a(t)),   x(0) = s_0,   x(t) ∈ S,   a(t) ∈ A        (2.1)

with f a measurable function. For a given starting state s_0 and a given control a(.), the (unique) solution

    x_{s_0,a(.)}(t) : solution of equation (2.1)        (2.2)

is the trajectory of the system in state space, when starting from state s_0 and controlled by a(t) at time t.
In reinforcement learning (RL), the system control gets evaluated qualitatively by a reward r : S ×A → Rew with Rew ⊂ R. Since a reward does not
1 for
a work on continuous action signals, see [Gullapalli, 1992]
indicate a precise error measure, but simply a qualitative reinforcement or
punishment given sparsely, Rew is usually defined by a finite set of values,
with positive values indicating a reward, negative values a punishment, and
zero for no reinforcement (no hint). Thus, r(s, a) indicates either a reward
or a punishment or no reinforcement for taking action a in state s. If the
reward does not depend on the state, the form r(a) will be used meaning
for example a cost (r < 0) of the action. If the reward does not depend
on the action but only on the state of the system, the form r(s) will be
used, usually with Rew = {+1, −1, 0} and s being either a target state, or
a ”collision” state, or a state without reinforcement.
Optimal controls are those controls a∗ (.) which maximize the total future
accumulated reward. This can be formalized by defining the
infinite horizon reward functional J_∞ := lim_{t→∞} J_t with

    J_t(s_0, a(.)) := ∫_0^{t_e} γ^τ · r(x_{s_0,a(.)}(τ), a(τ)) dτ + γ^{t_e} · R(x_{s_0,a(.)}(t_e))        (2.3)
Here, r specifies the running reward, R is a boundary reward for terminal
states (with the convention that the agent stays in such a state forever, receiving the reward R > 0 only once at time t_e, and that R is 0 in non-terminal states). Here t_e ≤ t is the time when the agent enters such a terminal state; otherwise (i.e. when at time t the system is in a non-terminal state) t_e = t. Finally, γ ∈ (0, 1] is a discount factor for future reward. When
γ < 1, the reward decays in the future. The boundary reward and the discount factor allow for defining within the same framework both controllers
with stopping states (finite time horizon, usually when reaching a target
state, or a ”collision” state) and without stopping (infinite time horizon,
usually keeping the system within some target state region).
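As an illustration of this functional, the following sketch approximates J_t(s_0, a(.)) from (2.3) for a given open-loop control by simple Euler integration; the discrete-time approximation, the function names, and the integration scheme are assumptions of this example and not part of the formal development.

```python
def reward_functional(f, r, R, is_terminal, s0, a, T=10.0, h=0.01, gamma=0.95):
    """Euler approximation of J_T(s0, a(.)) from (2.3): the running reward r is
    accumulated with discount gamma**t until a terminal state is entered at time
    t_e, which then contributes the boundary reward gamma**t_e * R(x(t_e))."""
    x, t, J = list(s0), 0.0, 0.0
    while t < T:
        if is_terminal(x):
            return J + gamma**t * R(x)           # boundary reward, received once
        u = a(t)                                 # open-loop control a(t)
        J += gamma**t * r(x, u) * h              # rectangle rule for the integral
        x = [xi + h * fi for xi, fi in zip(x, f(x, u))]   # Euler step of dx/dt = f(x, a)
        t += h
    return J
```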
The optimization problem consists now in finding an optimal control that
maximizes all accumulated future reward:
Optimization Problem 1
For any state s_0 ∈ S, find an optimal control a*_{s_0}(.) that maximizes J_∞, i.e. the accumulated future reward:

    J_∞(s_0, a*_{s_0}(.)) = sup_{a(.)} J_∞(s_0, a(.))
Within this formulation of optimization, a discount factor γ < 1 can be used in problems with terminal target states (positively rewarded) to find shortest paths to a goal region, since maximization of J_∞(s_0, a(.)) = γ^{t_e} · R(x_{s_0,a(.)}(t_e)) means minimization of t_e. For infinite horizon problems with a (positively rewarded) target state region, a discount factor γ < 1 assures the boundedness of J_∞. In this case, however, an optimal control does not necessarily find a shortest path to the rewarded state region, but looks both for a short path to the goal region and for how long it will be able to keep the system therein. Obviously, the value of γ is crucial for the integration of these two optimizations (see figure 3.2 for an illustrative example).
In RL, a controller has to learn an optimal feedback control law2 usually
called the optimal policy π ∗ : S → A that maximizes J∞ . For this purpose,
(2.1), (2.2) are reformulated into (2.1’), (2.2’) by replacing a(t) with π(x(t))
in (2.1), xs0 ,a(.) (t) with xs0 ,π (t) in (2.2) as the solution of (2.1’), and the
optimization problem 1 with:
Optimization Problem 1'
Find an optimal policy π*: S → A such that for any state s_0 ∈ S, the reward functional J_∞ is maximized at the control a*_{s_0}(.) defined by a*_{s_0}(t) = π*(x_{s_0,π*}(t)):

    J_∞(s_0, a*_{s_0}(.)) = sup_{a(.)} J_∞(s_0, a(.))
For notational convenience (referred to in the sequel as Definition 2.3'), let J_t(s_0, π) for a given policy be defined as J_t(s_0, a_π(.)), where a_π(τ) is defined as π(x_{s_0,π}(τ)).
The next section will report the basic results of the theory of dynamic
programming [Bertsekas, 1987], stating the existence of a solution to the
optimization problems 1 and 1’, and the equivalence of both problems.
2 i.e.
a mapping from states to actions, instead of a time-dependent control signal as
used in (2.1) and (2.3)
2.2 Dynamic Programming (DP)
The theory of dynamic programming [Bellman, 1957, Bertsekas, 1995] formulates statements about the existence of, and methods for finding, a solution to Markov decision problems, starting from a central equation called
the Bellman equation.
A Markov decision problem is defined as an optimization problem for a sequential Markovian decision process given by a 6-tuple Ω(S, A, A, p, r, π).
Sequential Markov Decision Process and its Value Function:
Given Ω(S, A, A, p, r, π) with:
S, a finite set of states.
A, a finite set of actions.
A : S → ℘(A) \ ∅, the action permission function indicating the set of actions A(s) allowed in state s.
p^a_{s,s'} for all s, s' ∈ S and a ∈ A(s), the probabilities of transition from state s to successor state s' when applying action a.
r : SA × S → [minReward, maxReward], a (bounded) reward function where r(s, a, s') is the reward for having executed action a in state s with resulting successor state s', with SA := {(s, a) ∈ S × A | a ∈ A(s)}.
π : S ∋ s ↦ a ∈ A(s), a decision policy.
Then (s_t, π(s_t), r_t) is called a sequential Markov Decision Process (MDP) if it is a stochastic process with p(s_{t+1} = s' | s_t = s, a_t = π(s)) = p^{π(s)}_{s,s'} and r_t = r(s_t, π(s_t), s_{t+1}).
A Value Function V^π : S → R for a given MDP (s_t, π(s_t), r_t) on Ω(S, A, A, p, r, π) is defined as

    V^π(s) := lim_{n→∞} E[ Σ_{t=0}^{n} γ^t · r_t | s_0 = s ],   γ ∈ [0, 1]        (2.4)

for any state s_0 ∈ S.
A Markov decision problem is defined as the following optimization problem:

Markov Decision Problem
Find an optimal (stationary) policy π*: S → A which maximizes the value function V^π : S → R for each s_0 ∈ S over all MDPs (s_t, π(s_t), r_t) in Ω(S, A, A, p, r, .):

    V^{π*}(s) = max_π V^π(s)   ∀s ∈ S        (2.5)
From the theory of dynamic programming, the following well-known results on MDPs (s_t, π(s_t), r_t) on Ω(S, A, A, p, r, π) are fundamental for RL:

1. for any policy π, the value function V^π is the solution of the Bellman Equation for policy evaluation:

    V^π(s) = E[ r(s, π(s), s') + γ · V^π(s') | s_0 = s ],   s' = s_1        (2.6)

2. the optimal value function V* := V^{π*} is the solution of the Bellman Equation for the optimal policy evaluation:

    V*(s) = max_{a∈A(s)} E[ r(s, a, s') + γ · V*(s') | s_0 = s, a_0 = a ],   s' = s_1        (2.7)

Interpreting the right-hand sides of (2.6) and (2.7) as operators on real-valued functions on the state space, the so-called DP-operators T_π and T : F(S) → F(S) can be defined:

    T_π(V)(s) := E[ r(s, π(s), s') + γ · V(s') | s_0 = s ]        (2.8)
    T(V)(s) := max_{a∈A(s)} E[ r(s, a, s') + γ · V(s') | s_0 = s, a_0 = a ]        (2.9)
If the discount factor γ < 1, then the DP-operators are contraction operators (with respect to the maximum norm on F(S)), and the following two results follow from the fixed point theorem of functional analysis:

3. for any policy π, the value function V^π is given by

    V^π = lim_{n→∞} T_π^n(V)        (2.10)

with V ∈ F(S) any starting function, and T_π^n(V) := T_π(T_π^{n−1}(V)).

4. the optimal value function V* is given by

    V* = lim_{n→∞} T^n(V)        (2.11)

with V ∈ F(S) any starting function.
The iteration defined in 4. is called Value Iteration.
Knowing the optimal value function V* for a given MDP, the optimal policy is given by

    π*(s) := arg max_{a∈A(s)} E[ r(s, a, s') + γ · V*(s') ]        (2.12)
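As a concrete illustration of value iteration (2.11) together with the policy extraction (2.12) on a finite MDP, consider the following sketch; the dictionary representation of the transition probabilities and the fixed number of sweeps are assumptions of this example.

```python
def value_iteration(S, A, P, r, gamma=0.9, sweeps=200):
    """Value iteration on a finite MDP.

    S: iterable of states; A(s): allowed actions in s; P[s][a][s2]: transition
    probability from s to s2 under a; r(s, a, s2): reward. Returns the
    approximated V* and the greedy policy of (2.12)."""
    V = {s: 0.0 for s in S}

    def backup(s, a):
        return sum(p * (r(s, a, s2) + gamma * V[s2]) for s2, p in P[s][a].items())

    for _ in range(sweeps):                      # repeated application of T, eq. (2.11)
        V = {s: max(backup(s, a) for a in A(s)) for s in S}
    pi = {s: max(A(s), key=lambda a: backup(s, a)) for s in S}
    return V, pi
```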
Thus, by value iteration it is possible to compute the optimal policy. In practice, however, the problem is that we usually do not know in advance the transition probabilities p^a_{s,s'}, and thus cannot calculate the expectation required for the DP operator in (2.7). In order to calculate the optimal policy without knowledge of this expectation, a stochastic approximation technique based only on the transition experiences of a learning agent has been defined by [Watkins, 1989], which thereafter has become well known under the name Q-learning. It is an empirical learning technique which learns the so-called optimal Q-values. A Q-function Q* : SA → R is defined by isolating part of the right-hand side of the Bellman Equation (2.7):

    Q*(s, a) := E[ r(s, a, s') + γ · V*(s') ]
for any state s ∈ S and a ∈ A(s). Recalling the definition of the optimal value function V* as given above, Q*(s, a) is the total discounted future reward given that action a is taken in state s, and that thereafter the optimal policy is followed.
The Bellman Equation (2.7) then can be rewritten as

    V*(s) = max_{a∈A(s)} Q*(s, a)
The last two equations can be combined into the Bellman Equation for the optimal Q-function:

    Q*(s, a) = E[ r(s, a, s') + γ · max_{a'∈A(s')} Q*(s', a') ]        (2.13)
This means that the optimal Q-function can be calculated as the solution of the above equation. Or, in other terms, the optimal Q-function is the fixed point of the DP-operator T_q : F(SA) → F(SA) with

    T_q(Q)(s, a) := E[ r(s, a, s') + γ · max_{a'∈A(s')} Q(s', a') ]

This again is a contraction operator, and again

    Q* = lim_{n→∞} T_q^n(Q)        (2.14)
Equation (2.13) is similar to equation (2.7), but the important difference is that maximum and expectation have been switched (they do not commute), and equation (2.13) can now be solved by applying an approximation of the DP operator T_q for each experienced state transition of the learning agent, approximating the expectation by the single experience:

    r(s_t, a_t, s_{t+1}) + γ · max_{a∈A(s_{t+1})} Q(s_{t+1}, a)

A full replacement, however, of the old value of the Q-function at (s_t, a_t) by this single experience would not work, due to the variance of the experienced transitions. Watkins' Q-learning rule uses the Robbins-Monro stochastic approximation algorithm [Bertsekas & Tsitsiklis, 1996] to approximate the unknown expectation of the DP operator.
Watkins' Q-learning:
Given a Markov Decision Problem,
• let Q_0 : SA → R be any function
• let (α_i)_{i∈N} be a sequence of learning factors (0 ≤ α_i ≤ 1) for which

    Σ_{i∈N} α_i = ∞   and   Σ_{i∈N} α_i² < ∞

• let the state transition at the i-th learning step be s_i --a_i--> s_i'
Then the sequence of functions Q_i : SA → R defined by

    Q_{i+1}(s, a) := (1 − α_i) · Q_i(s_i, a_i) + α_i · ( r(s_i, a_i, s_i') + γ · max_{a'∈A(s_i')} Q_i(s_i', a') )   if s = s_i and a = a_i
    Q_{i+1}(s, a) := Q_i(s, a)   otherwise

converges to the optimal Q-function Q* whenever the sequence (s_i, a_i)_{i∈N} is such that in every state s ∈ S, all actions are applied infinitely often.
A proof can be found in [Bertsekas & Tsitsiklis, 1996], section 5.6.
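A compact sketch of this rule in the tabular case, for a simulated environment, might look as follows; the environment interface (env_reset, env_step), the ε-greedy exploration, and the 1/n decay of the learning factor are assumptions chosen here so that the conditions on (α_i) above hold per state-action pair.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, A, episodes=1000, gamma=0.95, eps=0.1):
    """Tabular Q-learning following the update rule above.

    env_reset() returns a start state; env_step(s, a) returns (s2, reward, done);
    A(s) returns the finite set of actions allowed in s."""
    Q = defaultdict(float)                  # Q[(s, a)], initialised with Q_0 = 0
    visits = defaultdict(int)               # update counts per (s, a) pair

    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            acts = list(A(s))
            if random.random() < eps:       # exploration keeps visiting all actions
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda b: Q[(s, b)])
            s2, rew, done = env_step(s, a)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]    # learning factor alpha_i
            target = rew if done else rew + gamma * max(Q[(s2, b)] for b in A(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```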
2.3 The learning rule
The basic problem studied in this thesis is to learn optimal control for continuous, deterministic control problems just by trial and error, without any
prior knowledge of the system dynamics. The basic learning method is chosen to be Q-learning as presented in the previous section. The motivation is
twofold. First, Q-learning offers a sound theoretical basis, thus supporting
the understanding of learning rules derived from the basic Q-learning. Second, Q-learning copes with stochastic processes. Although I will consider
here only deterministic control problems, stochasticity appears necessarily
whenever state space aggregation / discretization is introduced: the agent
has no notion of where it is in state space within an aggregated state, resulting in apparently nondeterministic state transitions. This requires some
kind of averaging learning rule such as Q-learning.
In this section, I develop and present the basic learning rule used in
this thesis. Similar approaches (for model-based learning, however) have
been presented by [Munos, 1998] and [Pareigis, 1998]. First, I’ll develop
the discretization model for time and actions in order to reformulate the
optimization problem of section 2.1 as a sequential MDP. Then, the function approximation for the value functions will be defined as a step function
on a finite state partitioning, and thereafter, a particular state space partitioning based on kd-tries will be defined. Finally, a Q-learning rule will be
defined in this context, and particular topics concerning convergence and
non-Markovian models will be discussed.
2.3.1 Time discretization, action discretization, and the resulting DP-equation
In order to define optimization problem 1’ of section 2.1 as a sequential
MDP, the control agent has to discretize time, defining a sequence of decision points.
Time Discretization
In this thesis, a basic, fixed time step h > 0 is used for discretization.
As in numerical analysis, the choice of the stepsize might be important.
I will assume a sufficiently small stepsize chosen by the designer of the learning controller: any stepsize adaptation (which is a main issue of this thesis) will be given by multiples of the basic stepsize, not by fractions of it.
Now, what is an action like during such a time step? The action set could
be any set of fixed local controls. The simplest such local control in state
s is given by a fixed control value constant for some time. This discretized
action model will be used in this thesis.
Action Discretization
Actions a are from finite sets A(s), and they are controls constant during
some fixed time.
The set of all actions ⋃_{s∈S} A(s) is finite, too.
The constant control value is called the action intensity, denoted by the action a itself.
The time for which the action is kept constant is given by the action duration, denoted dur(a):
with the fixed step size h, the effective action duration equals dur(a) · h.
Thus, an action a is a control which is constant for a duration dur(a) · h
with value a. The duration may depend on the state. In that case, I’ll
write dur(a, s). Different actions, or an action in different states with the
same control value, may have different durations.
It is straightforward to see that the results of dynamic programming (i.e. the Bellman equations) as presented in the last section can be extended to a continuous state space. Therefore, the

Bellman equation

Q_h(s, a) = r_h(s, a) + γ^{dur(a)·h} · max_{a′∈A} Q_h(x_{s,a}(dur(a)·h), a′)    (2.15)

with r_h(s, a) := J_{dur(a)·h}(s, a) (see (2.3)), has a unique solution (if γ < 1 and r is bounded^3)^4:

^3 we always assume that Q_h(s, a) ≡ 0 for terminal states
^4 Note that the only difference between (2.15) and (2.13) is the exponent of γ. Since the contraction property remains valid, all results on the corresponding DP operator as presented in the last section are still valid.
The optimal Q-function and the optimal policy

Q*_h : SA → ℝ, the solution of the Bellman equation (2.15), is the optimal Q-function, which defines an optimal policy through

π*(s) := arg max_{a∈A(s)} Q*_h(s, a)    (2.16)

Thus, the optimal value function defined by

V*_h(s) := max_{a∈A} Q*_h(s, a)    (2.17)

equals sup_π J_∞.
Note that the action set A(s) contains controls a(.) whose values are constrained to be constant for their duration dur(a). Thus, the optimality as defined above is defined for a constrained policy. It is easy to prove the following

Proposition 2.1
Given the action sets A(s), s ∈ S, with action durations dur(a, s) and optimal Q-function Q*_h, the same control problem, but with minimal action durations dur(a, s) = 1 ∀s∈S, a∈A(s), has an optimal Q-function Q̃*_h for which

Q*_h(s, a) ≤ Q̃*_h(s, a)    ∀s∈S, a∈A(s)

holds.

If the action duration of 1 is guaranteed to yield the best optimal policy, why introduce the possibility of multiple stepsizes as action durations? The answer is that generalization over time in control may considerably reduce the learning effort (see chapter 5). The simplest form of such generalization is action repetition.
This indicates a type of compromise typical for reinforcement learning: accelerate the learning phase while possibly accepting a suboptimal solution.
2.3.2 State space discretization
In order to define a Q-learning rule similar to that given in section 2.2 for finite state spaces, the Q-function must be represented as a function with a finite set of parameters to be learned: a function approximation. Such a parameter vector could contain the weights of a neural network, the coefficients of a linear function, or the parameters of a radial basis function network, to mention some of the function approximation methods employed in RL [Sutton & Barto, 1998]. In this thesis, a state space discretization is employed, resulting in a tabular form of the Q-function. This choice is motivated by the requirement of directly controlling the resolution of the discretization when composing tasks as described later in this thesis.
A discretization of state space is given by a partitioning Ŝ = {S_1, .., S_N} of state space, with S = ∪_{i=1}^{N} S_i and S_i ∩ S_j = ∅ for i ≠ j.
All terminal states are assumed to be in a particular partition not explicitly listed in Ŝ, with the convention that V ≡ 0 in this partition.
In practice, the partitioning can be any kind of tiling structure^5,^6.
The basic partitioning structure applied in this work is a kd-trie [Friedman et al, 1977] in which n-dimensional cuboid-like cells (the whole state
space S is a single cell at the beginning) get successively split along (n−1)-dimensional hyperplanes which cut a cell into two halves along a selected
dimension, defining thereby two son nodes of a father node in a binary
tree (figure 2.1). There are several advantages offered by kd-tries: simple
and fast access to tiles (by multidimensional binary search), ease of refinement allowing an adaptive discretization of state space, and a hierarchical
representation allowing for different levels of generalization as described in
chapter 5.
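To make this data structure concrete, the following is a minimal sketch of a kd-trie with binary cell splits and multidimensional binary search; the class and method names (`KdTrieNode`, `split`, `leaf`, `contains`) are illustrative and not taken from the thesis.

```python
class KdTrieNode:
    """A cuboid cell of state space; the leaves form the current partitioning."""
    def __init__(self, low, high):
        self.low, self.high = low, high   # per-dimension bounds of the cell
        self.split_dim = None             # dimension along which the cell was split
        self.left = self.right = None     # son nodes after splitting
        self.q = {}                       # tabular Q-values of this leaf, keyed by action

    def is_leaf(self):
        return self.split_dim is None

    def contains(self, s):
        """Check whether the continuous state s lies in this cell."""
        return all(lo <= x < hi for x, lo, hi in zip(s, self.low, self.high))

    def split(self, dim):
        """Cut the cell into two halves along dimension `dim` (one refinement step)."""
        mid = 0.5 * (self.low[dim] + self.high[dim])
        left_high = list(self.high); left_high[dim] = mid
        right_low = list(self.low);  right_low[dim] = mid
        self.left = KdTrieNode(list(self.low), left_high)
        self.right = KdTrieNode(right_low, list(self.high))
        self.split_dim = dim

    def leaf(self, s):
        """Multidimensional binary search: return the leaf cell S(s) containing s."""
        node = self
        while not node.is_leaf():
            d = node.split_dim
            node = node.left if s[d] < 0.5 * (node.low[d] + node.high[d]) else node.right
        return node
```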
Basically, there are two types of approximation with step functions on
state space partitionings:
1. define a single state space partitioning:
both the value function and the Q-function are step functions (constant in each partition) on one state partitioning. This is the most
common approach, in which the main focus is on approximating the
value function V (s) [Sutton & Barto, 1998]. However, in Q-learning
the main focus is on approximating the Q-function. Thus,
2. define different state space partitionings for Q(s, a) for each a ∈ A:
this approach allows for different partitionings for the Q-function depending on the particular action:

Ŝ_a = {S_{1,a}, .., S_{N_a,a}}    ∀a ∈ A

^5 examples of different tiling types can be found in [Sutton & Barto, 1998]
^6 in some theoretical parts of this thesis, more abstract partitionings will be used, as for example in the "question" of section 2.4
Figure 2.1: state space partitioning with a kd-trie (a transition from s ∈ S(s) with action a and duration dur(a) leads to s′ ∈ S(s′))
with S = ∪_{i=1}^{N_a} S_{i,a} and S_{i,a} ∩ S_{j,a} = ∅ for i ≠ j.
Then, for each a ∈ A, the function Q(., a) : S → ℝ is approximated by a step function on the partitioning Ŝ_a:

Q̂(., a) : Ŝ_a → ℝ

This approach allows a different local discretization resolution for each action. With respect to the result of learning (the final Q-function) it may seem equivalent to the first approach, since at that point we are interested only in the Q-value of the best action in each state; during learning of the Q-function, however, the two approaches can behave quite differently.
The first approach is a specialization of the second one. All experimental results presented in this work are based on the first approach (one partitioning), but the definitions of the composition principles are also valid for the second approach, and at least Proposition 4.1 for the first principle (veto principle, chapter 4.2.1) requires this approximation type. In practice, however, a discretization with multiple kd-tries can become quite resource-intensive, which is why simulations in this work have been performed only for the first type of discretization.
As a notational convenience, the expression S(s, a) will denote that partition S_{i,a} ∈ Ŝ_a for which s ∈ S_{i,a} holds, and in the case of a single partitioning, S(s) denotes the partition that contains s.
2.3.3 The Q-learning rule on finite state partitionings
Given
• a finite action set A
• state space partitionings Ŝ_a = {S_{1,a}, .., S_{N_a,a}}  ∀a ∈ A
• A : S → ℘(A) \ ∅, the action permission function indicating the set of actions A(s) allowed in state s
• Â : Ŝ_a → ℘(A) \ ∅, the action permission function indicating the set of actions allowed in a partition, defined by Â(S_{i,a}) := ∩_{s∈S_{i,a}} A(s), requiring from the partitioning Ŝ that ∀s∈S ∃a∈A : a ∈ Â(S(s, a)),

the agent has to learn an approximation of Q*_h by a step function Q̂_h using the following Q-learning rule^7 for the n-th step in a learning episode:

^7 Just for the sake of simplicity, we use a plain Q-learning rule here. More efficient learning rules, such as Q(λ), can be used as well in what follows. In effect, some results reported in chap. 5 are based on Q(λ).
Q-learning rule for the n-th step in a learning episode:

Q̂_h^{(n+1)}(S(s_n, a_n), a_n) := (1 − α_n) · Q̂_h^{(n)}(S(s_n, a_n), a_n) + α_n · [ r_h(s_n, a_n) + γ^{dur(a_n)} · max_{a′∈Â(S(s_{n+1},a′))} Q̂_h^{(n)}(S(s_{n+1}, a′), a′) ]    (2.18)

with

r_h(s_n, a_n) = Σ_{i=0}^{dur(a_n)−1} γ^i · r(s_{(n,i)}, a_n, s_{(n,i+1)})

where

S(s_n, a_n) ∋ s_n = s_{(n,0)} --a_n--> s_{(n,1)} --a_n--> ... s_{(n,dur(a_n))} = s_{n+1} ∈ S(s_{n+1}, .),

and a_n ∈ Â(S(s_n, a_n)), and Q̂_h^{(n)}(S(s_{n+1}, .), .) = 0 if s_{n+1} is a terminal state.
Note that the definition of an action permission function for a whole partition is required, since an MDP as defined in section 2.2 requires a fixed action set for each state, and states are partitions in the discretized model. The meaning of the above definition of the action permission function is that an action is permitted in a partition when it is permitted in all of its states.
Note furthermore that the basic time step h has been dropped from the decay factor: the γ used here stands for γ^h, without loss of generality.
The Q-learning rule will be applied mostly with discretizations defined by a single partitioning, independent of the actions, as has been discussed in the previous section. In that case, the notation becomes considerably simpler. In order to achieve a canonical form that fits both cases, (2.18) can be written in a more readable form. For this purpose, define a partitioning Ŝ_V defined by the partitions

S_V(s) := ∩_{a∈Â(S(s,a))} S(s, a)    (2.19)

for all s ∈ S, and define the value function approximation

V̂_h^{(n)}(S_V(s)) := max_{a′∈Â(S(s,a′))} Q̂_h^{(n)}(S(s, a′), a′)    (2.20)
Then, (2.18) can be written as

Q̂_h^{(n+1)}(S(s_n, a_n), a_n) := (1 − α_n) · Q̂_h^{(n)}(S(s_n, a_n), a_n) + α_n · [ r_h(s_n, a_n) + γ^{dur(a_n)} · V̂_h^{(n)}(S_V(s_{n+1})) ]    (2.21)

or, when the partitioning does not depend on actions,

Q̂_h^{(n+1)}(S(s_n), a_n) := (1 − α_n) · Q̂_h^{(n)}(S(s_n), a_n) + α_n · [ r_h(s_n, a_n) + γ^{dur(a_n)} · V̂_h^{(n)}(S(s_{n+1})) ]    (2.22)
The basic Q-learning algorithm works as follows:

basic Q-learning algorithm on state space partitionings:
initialize:
    partition S
    initialize Q on the partitioning
    initialize α, the initial learning rate
repeat (for each learning episode):
    initialize s ∈ S (randomly)
    repeat (for each action selection):
        select action a ∈ A(s) (for example ε-greedy, see below)
        t ← 0, r ← 0, c ← 1
        repeat (for each time step of size h):
            execute a
            observe next state s′ and reward r_step
            r ← r + c · r_step
            t ← t + 1
            c ← c · γ
        until s′ ∉ S(s)
        Q(S(s), a) ← (1 − α) · Q(S(s), a) + α · [ r + γ^t · max_{a′∈A(s′)} Q(S(s′), a′) ]
        s ← s′, update α
    until learning episode finished
end repeat
An ε-greedy policy is one that takes the currently best action with probability 1 − ε, and a random action (uniform distribution) with probability ε.
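A compact Python sketch of this algorithm is given below; the environment interface (`reset`, one elementary `step` of size h) and the kd-trie lookup `trie.leaf(s)` are illustrative assumptions in the spirit of the data structure sketched in section 2.3.2.

```python
import random

def q_learning_on_partitioning(env, trie, actions, gamma=0.99,
                               alpha=0.2, eps=0.1, episodes=500):
    """Q-learning with an action kept constant until the continuous state
    leaves its current cell (dynamic action durations)."""
    for _ in range(episodes):
        s = env.reset()                      # continuous state
        terminal = False
        while not terminal:
            cell = trie.leaf(s)
            # epsilon-greedy selection on the Q-values of the current cell
            a = random.choice(actions) if random.random() < eps else \
                max(actions, key=lambda b: cell.q.get(b, 0.0))
            t, r_acc, c = 0, 0.0, 1.0
            # repeat the elementary action of size h until the cell is left
            while True:
                s, r_step, terminal = env.step(s, a)
                r_acc += c * r_step
                t += 1
                c *= gamma
                if terminal or trie.leaf(s) is not cell:
                    break
            v_next = 0.0 if terminal else \
                max(trie.leaf(s).q.get(b, 0.0) for b in actions)
            cell.q[a] = (1 - alpha) * cell.q.get(a, 0.0) + \
                alpha * (r_acc + gamma ** t * v_next)
```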
2.3.4 Q-learning for deterministic problems?
Why should we employ Q-learning, which is based on stochastic approximation, for a deterministic problem? Due to the state space discretization by partitioning, the state transition model becomes nondeterministic because of the random position of the continuous state within a partition. Assuming a distribution of s_t ∈ S_i independent of S(s_{t−1}) (which, obviously, is not the case during sequential learning, but would be true only for episodic learning with each episode of length one and uniformly distributed initial states^8), the transition

S_i = S(s_t) --a_t--> S(s_{t+1}) = S_j

defines an MDP for the learning agent that perceives S(s_t), not s_t itself, with transition probabilities p^a_{S_i,S_j}, S_i, S_j ∈ Ŝ. See figure 2.2.
Thus, under the ideal assumption of Markovian state transitions, we can apply Q-learning as defined in the last subsection. The only difference to discrete Q-learning (section 2.2) is a discount factor whose decay depends on the action duration. Now, looking at the process in elementary time steps h, we get

J(s_0, a(.)) = r(s_0, a_0, s_1) + γ · r(s_1, a_0, s_2) + ... + γ^{dur(a_0)} · r(s_{dur(a_0)}, a_1, s_{dur(a_0)+1}) + ...
            = r_h(s_0, a_0) + γ^{dur(a_0)} · J(s_{dur(a_0)}, a(.))

giving rise to (2.18).
The problem of non-Markovian state transitions will be discussed in detail in section 2.4.

^8 Q-learning converges, by the way, even under these conditions
Figure 2.2: stochastic process due to state space discretization. A transition with action a and a duration of dur(a) elementary time steps h defines a stochastic process given by the transition probabilities p(S(s_{t+1}) | S(s_t), a) and the duration probabilities p(dur(a) | S(s_t) → S(s_{t+1})).
2.3.5 The action model
The spatial abstraction introduced by the partitioning has its counterpart in a corresponding temporal abstraction in action space, introduced by the action duration as defined in 2.3.1. Both abstractions depend on each other: the learning agent doesn't distinguish between different states within a partition, and it cannot change the action for the time an action lasts (dur(a)), since such a decision could not be evaluated. This motivates the following:
Definition of static action durations in state partitionings:
Given the state space partitionings Ŝ_a = {S_{1,a}, .., S_{N_a,a}}  ∀a ∈ A, define the duration of an action a in a partition S_{i,a} as

dur(S_{i,a}, a) := max_{s∈S_{i,a}} min{ n > 0 | x_{s,a}(n · h) ∉ S_{i,a} }    (2.23)

Thus, the duration of an action is defined as the maximum number of steps that are required to get just out of that cell. Note that this number may be infinite, in which case the action will simply not be permitted in this partition, except for terminal states (see section 2.1).
This definition defines static action durations in the sense that the duration
does not depend on the specific state s ∈ Si,a , which may be different for
different visits of the partition. An example for an infinite action duration
appears in the mountain car, see chapter 3.2.
An alternative approach is given by

Definition of dynamic action durations in state partitionings:
Given the state space partitionings Ŝ_a = {S_{1,a}, .., S_{N_a,a}}  ∀a ∈ A, and a visit of partition S_{i,a} at state s ∈ S_{i,a}, define the duration of an action a as

dur(s, S(s, a), a) := min{ n > 0 | x_{s,a}(n · h) ∉ S(s, a) }    (2.24)

Thus, dur(s, S(s, a), a) is the minimum number of steps with action a needed to get out of the cell starting from s.
This definition defines dynamic action durations in the sense that the duration depends on the cell S(s, a) and on the state s ∈ S(s, a) in which the action is applied^9. The basic Q-learning algorithm with dynamic action durations is the one given at the end of section 2.3.3; the Q-learning algorithm with static action durations can easily be derived from it. The main difference between the two action models is that the static model (2.23) defines a deterministic action duration depending only on the partition, while the dynamic model (2.24) also depends on the random position of the state within the partition, s_t ∈ S(s_t, a). Note that this action model causes the learning system to take decisions near the borders of partitions.
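The two duration models (2.23) and (2.24) can be sketched as follows; `simulate_step` (one elementary step of size h under the constant control a) and the `cell.contains` test are illustrative assumptions matching the kd-trie sketch above.

```python
def dynamic_duration(s, a, cell, simulate_step, max_steps=10000):
    """Minimum number of elementary steps with constant action a until the
    trajectory starting in s leaves the given cell (definition (2.24))."""
    x = s
    for n in range(1, max_steps + 1):
        x = simulate_step(x, a)           # x_{s,a}(n * h)
        if not cell.contains(x):
            return n
    return float('inf')                   # the action never leaves the cell

def static_duration(sample_states, a, cell, simulate_step):
    """Maximum of the dynamic durations over (sampled) states of the cell
    (definition (2.23)); an infinite value means the action is not permitted."""
    return max(dynamic_duration(s, a, cell, simulate_step) for s in sample_states)
```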
Semi Markov Decision Processes
A learning theory for a generalization of the Bellman equation that treats probabilistic action durations has been developed, among others, by [Bradtke & Duff, 1995], based on the concept of Semi Markov Decision Processes. Since a probabilistic action duration appears in two of the three composition principles presented in chapter 4, SMDPs shall be briefly introduced now. A Semi Markov Decision Process (SMDP) is defined as a Markov Decision Process (section 2.2) with the extension that actions may have a probabilistic duration dur(a). The probability of a transition to state s′ with action a starting from state s is still given by

p^a_{s,s′} = p(s′ | s, a)    ∀ s, s′ ∈ S, a ∈ A(s)

An SMDP has a second probability

p_T(dur(a) = t | s --a--> s′)    ∀ s, s′ ∈ S, a ∈ A(s), t > 0

^9 as will be explained in chapter 4 (section 4.4.3), this definition of dynamic action duration does not support the third composition principle (perturbation principle)
for the action duration being t, under the condition that the transition is s --a--> s′. The Bellman equation (2.7) from section 2.2 now rewrites as

V*(s) = max_{a∈A(s)} E[ r̄(s, a, s′, dur(a)) + γ^{dur(a)} · V*(s′) ]    (2.25)
      = max_{a∈A(s)} Σ_{s′} p(s′ | s, a) · Σ_{dur(a)=1}^{∞} p(dur(a) | s --a--> s′) · ( r̄(s, a, s′, dur(a)) + γ^{dur(a)} · V*(s′) )

where the reward r̄(s, a, s′, dur(a)) is an expectation value defined by

r̄(s, a, s′, n) := E[ Σ_{i=1}^{n} γ^{i−1} · r(s_{i−1}, a, s_i) | s_0 = s, s_n = s′ ]

(the expectation is with respect to the intermediate states s_1, ..., s_{n−1}). The right-hand side of (2.25) can be interpreted as a DP operator in the same way as in section 2.2. It is easy to see that it is again a contraction operator, with a (random) variable contraction factor γ^{dur(a)}, of which γ is an upper bound. The reward value r̄(s, a, s′, dur(a)) is bounded whenever the one-step reward r is bounded. Therefore, the theory of DP as presented in section 2.2 applies here again, and value iteration would converge to the optimal value function V*.
Bradtke and Duff [Bradtke & Duff, 1995] have proven that under conditions similar to those of Q-learning for MDPs (see section 2.2), the following SMDP Q-learning rule converges to the optimal policy (i.e. the policy that maximizes the right-hand side of (2.25)):

Q^{(n+1)}(s_t, a_t) := (1 − α_n) · Q^{(n)}(s_t, a_t) + α_n · ( r̄_t + γ^k · max_{a∈A(s_{t+k})} Q^{(n)}(s_{t+k}, a) )    (2.26)

with observed values k := dur(a_t) and r̄_t := Σ_{i=1}^{k} γ^{i−1} · r(s_{t+i−1}, a_t, s_{t+i}).
This learning rule is the discrete form of the learning rule (2.18) in the case of the dynamic action model (2.24).
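A sketch of a single update of this rule, assuming the action has just been executed for an observed number of steps with the per-step rewards recorded; all names are illustrative.

```python
def smdp_q_update(Q, s, a, rewards, s_next, next_actions,
                  gamma=0.99, alpha=0.2, terminal=False):
    """One update of rule (2.26); `rewards` is the list of one-step rewards
    observed while action a was active, so k = len(rewards)."""
    k = len(rewards)
    r_bar = sum(gamma ** i * r for i, r in enumerate(rewards))  # discounted accumulated reward
    v_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in next_actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r_bar + gamma ** k * v_next)
```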
Options
SMDPs and the corresponding Bellman equation (2.25) can be generalized
even further to an action model called options which was studied theoretically in [Precup & Sutton & Singh, 1998]. This model is relevant to this
thesis since the action model underlying the second and third composition principle can be mapped to a special case of options, as will be discussed in chapter 4. An option o is a generalization of an action with duration. It is defined by a triple o = ⟨π, β, I⟩ with π a policy on I ⊂ S, β : I → [0, 1] a function indicating the probability of termination of the option in a state, and I ⊂ S the set of states in which the option is applicable. If an option is taken in state s ∈ I, an action is chosen by π and then executed, transiting to a state s′ in which the option terminates with probability β(s′); otherwise the agent follows π again until the option terminates at some state ŝ. In that state, a new option is selected and executed. Thus, the agent is given a fixed set of options O, and it must learn a policy that selects options instead of elementary actions^10.
Formally, an SMDP modelling an option is defined by a joint probability of a transition to state s′ after k (a random variable) time steps, starting from state s and executing option o:

p(s′, k | s, o)

The Bellman equation (2.7) from section 2.2 again rewrites as

V*(s) = max_{o∈O(s)} E[ r̄(s, o, s′, k) + γ^k · V*(s′) ]    (2.27)
      = max_{o∈O(s)} Σ_{s′, k>0} p(s′, k | s, o) · ( r̄(s, o, s′, k) + γ^k · V*(s′) )

with O(s) the set of options applicable in s, and where the reward r̄(s, o, s′, k) is an expectation value defined by

r̄(s, o, s′, k) := E[ Σ_{i=1}^{k} γ^{i−1} · r(s_{i−1}, π(s_{i−1}), s_i) | s_0 = s, s_k = s′ ]

with o = ⟨π, β, I⟩. The distribution of the random variable k depends on s and can be calculated from β as part of the option.
The following modified SMDP Q-learning rule has been defined by [Precup & Sutton & Singh, 1998] and shown to converge for options:

^10 within this framework, options can be mixed with elementary actions a ∈ A, since the latter can easily be modelled as options
Q-learning for Options:

Q^{(n+1)}(s_t, o_t) := (1 − α_n) · Q^{(n)}(s_t, o_t) + α_n · ( r̄_t + γ^k · max_{o∈O(s_{t+k})} Q^{(n)}(s_{t+k}, o) )    (2.28)

with observed values:
k, the duration of the option o_t,
s_t --o_t--> s_{t+k}, the observed state transition after termination of o_t,
r̄_t := Σ_{i=1}^{k} γ^{i−1} · r(s_{t+i−1}, π_t(s_{t+i−1}), s_{t+i}), the observed reward during execution of o_t = ⟨π_t, β_t, I_t⟩,

and where
O is the set of options,
O(s) = {⟨π, β, I⟩ ∈ O | s ∈ I},
Q : SO → ℝ is the Q-function on options, with SO := {(s, o) | o applicable in s}.
Note that the main difference between options and SMDPs is in the points of decision: in an SMDP, a decision (policy application) is taken at the beginning of an action, and for a random number of steps the action gets repeated without any feedback from the states (no decision). With options, instead, a first decision (main policy) is taken when choosing an option; then decisions (option policy) are taken within a fixed feedback control for a random number of steps, until the option ends. Both approaches, however, have in common that any decision d is Markovian in that it results in a Markov process: p^d_{s,s′} describes the probability of a transition to state s′ after taking decision d in state s.
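The execution model of an option can be sketched as follows: the internal policy is followed until the termination condition fires, and the observed duration and discounted reward then feed an update of the form (2.28), analogous to the SMDP sketch above. The `Option` container and the environment interface are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    pi: Callable          # internal policy: state -> elementary action
    beta: Callable        # termination probability: state -> value in [0, 1]
    initiation: Callable  # applicability test: state -> bool (the set I)

def execute_option(env, s, option, gamma=0.99):
    """Run an option until it terminates; return (final state, duration k,
    discounted accumulated reward, terminal flag)."""
    k, r_bar, c = 0, 0.0, 1.0
    while True:
        a = option.pi(s)
        s, r, terminal = env.step(s, a)
        r_bar += c * r
        c *= gamma
        k += 1
        if terminal or random.random() < option.beta(s):
            return s, k, r_bar, terminal
```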
2.4 Generalization, Aliasing and Non-Markovian models

2.4.1 Definition of Aliasing and Generalization
A discretized state space model as defined in the previous sections may lead to non-Markovian processes. Two different transitions

S_{i,a} ∋ s_n --a--> s_{n+1} ∈ S_{j,a}
S_{i,a} ∋ s_m --a--> s_{m+1} ∈ S_{k,a}

with j ≠ k may be observed, and the different outcome doesn't depend only on the state S_{i,a}, but also on previous states. See figure 2.3 for an example.
Figure 2.3: non-Markovian state transitions due to discretization

In the decision problem to be learned, what does this potential violation of the Markov property mean? The following two definitions are important in this respect.
Definition of State Aliasing
The learning agent has state aliasing whenever there are states s1, s2 which it cannot distinguish (they are in the same partition), but that must be distinguished in order to
1. represent the optimal policy π* with π*(s1) ≠ π*(s2), or to
2. learn the optimal policy

Definition of State Generalization
The learning agent generalizes states s1 and s2 whenever s1 and s2 belong to the same partition without having state aliasing (neither case 1 nor 2).
In order to understand the difference between the first and the second case of state aliasing, the following partitioning, which I will call the "optimal policy partitioning", shall be examined. This partitioning is the coarsest partitioning that can represent the optimal policy of a given MDP:

Definition of the optimal policy partitioning
Let π* be the optimal policy of a given control problem as defined in 2.3.1, and define a partitioning Ŝ := (S_i)_{1≤i≤|A|} with S_i := π*^{−1}(a_i), a_i ∈ A. Refine Ŝ such that S_i = ∪_{j=1..n_i} S_{i,j} with S_{i,j} the connected components of S_i.
Figure 2.4: example for a partitioning that can represent, but not learn, the optimal policy: a) a layout of driveways for a vehicle with the actions north, south, east, west and a goal region G; b) the corresponding optimal policy partitioning with cells Ŝ1, ..., Ŝ4, where Ŝ1 is a corridor of length l1
Within this partitioning, and with the dynamic action model (2.24), the
agent following π ∗ acts optimally.
The following (purely theoretical11 ) question is of particular interest in
order to understand the two aspects of state aliasing.
Question:
Does Q-learning as defined in (2.21) learn the optimal policy on the optimal
policy partitioning?
The general answer is negative, as the following counterexample shows.
Example 2.1
Figure 2.4a shows a layout of driveways for a vehicle that can move δs to
south, east, north or west as a basic action during time h. Actions driving
the vehicle against a wall are without effect. In the goal region at the left
bottom (terminal states), the vehicle receives a reward R > 0, and r = 0
anywhere else. The discount factor γ < 1 lets the vehicle learn to go for the
^11 since the optimal policy partitioning requires knowledge of the optimal policy, which is the result of any learning
shortest path to the goal region. Figure 2.4b shows the partitioning defined
by the optimal policy as defined in the previous question. Now, assuming
a uniform distribution of s_t in Ŝ1, the value of that partition as learned by (2.21) is:

V(Ŝ1) = (1/l1) · ∫_0^{l1} γ^x · R dx = (R / (l1 · ln γ)) · (γ^{l1} − 1)

For l1 → ∞, we get V(Ŝ1) → 0. Thus, we can select some l1 large enough to let V(Ŝ3) > V(Ŝ1). But then, Q-learning will converge to π(Ŝ2) = east, while the optimal policy is π*(Ŝ2) = west. The problem is that Ŝ1 generalizes too much, although the optimal policy is constant on Ŝ1.
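For completeness, a short worked derivation of this value; it only uses ∫γ^x dx = γ^x / ln γ and the assumed uniform distribution of the state over the corridor of length l1:

```latex
\[
V(\hat S_1)
  = \frac{1}{l_1}\int_0^{l_1} \gamma^{x} R \,\mathrm{d}x
  = \frac{R}{l_1}\left[\frac{\gamma^{x}}{\ln\gamma}\right]_0^{l_1}
  = \frac{R}{l_1 \ln\gamma}\bigl(\gamma^{l_1}-1\bigr)
  \xrightarrow{\;l_1\to\infty\;} 0,
\]
```

since γ^{l1} − 1 tends to −1 while the factor 1/(l1 · ln γ) vanishes.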
2.4.2 Do we need a Markovian transition model?
State generalization may greatly reduce the learning effort of the agent. It directly affects the propagation depth of the goal reward back to remote regions of state space, since plain Q-learning is a 1-step learning method updating evaluations of just one-step transitions. Therefore, state generalization is a primary objective of any continuous RL approach. Adaptive discretization techniques have been developed within this thesis and will be a recurrent theme in this work. The main challenge in this regard is the problem of aliasing as presented in the previous section.
All known proofs of convergence of Q-learning require the Markov property of the decision process. Yet, it is not always necessary in practice. In fact, it is easy to give examples of models with generalization but without aliasing that, however, are not Markovian: just start from any model with state space discretization without aliasing, and refine a partition (cell) in the way shown in figure 2.3.
The point about the Markov property is that it characterizes the state dynamics independently of the task to be learned. This explains why many methods developed as some variant of plain Q-learning are successful in practice, though lacking any proof of convergence.
Coming back to Q-learning on state space discretizations, a proof of convergence in the limit case max_{Ŝ_i}[ max{ ‖s − s′‖ : s, s′ ∈ Ŝ_i } ] → 0 for triangulation or grid partitionings has been published by [Munos, 1998].
Chapter 3

The example task domains: the TBU problem and the mountain car problem
3.1 The TBU problem
In the Truck-Backer-Upper (TBU) example [Nguyen & Widrow, 1991], the
control agent has to navigate a trailer truck backwards to a docking point.
State space and actions are given in figure 3.1a. The truck can move only
backwards, and it has to avoid an inner blocking between cab and trailer
(Φint > Φmaxintern ) and hitting the wall. The dynamics of this system is
nonlinear and unstable, and the TBU task
T_final(Φint = 0 ∧ Φext = 0 ∧ x = 0 ∧ y = 0 | (Φint, Φ̇int, Φext, x, y))^1
is quite complex when learned as a monolithic task.
The TBU problem offers a great variety of control tasks. I will categorize
control tasks by the following task types2 :
• Goal seeking tasks defined by termination within a rewarded goal
region and no reward elsewhere: R > 0 and r ≡ 0 in (1), section 2.1.
^1 I use the notation T(p) or T(p|S) for a task T in state space S with goal predicate p on states. T°(p|S) denotes an avoidance task with p defining the collision region.
^2 this categorization is independent of the TBU domain, but is presented in this chapter for illustrative reasons
Figure 3.1: a) the TBU problem; b) state partitioning for task T6. The state space is (Φint, Φ̇int, Φext, Φgoal, x, y) with Φint ∈ [−π/2, π/2], Φext ∈ [−π, π], Φgoal ∈ [−π, π], x ∈ [−maxx, maxx], y ∈ [0, maxy]; the action intensities A_int are ∆Φsteer ∈ {0°, ±2°, ±4°, ±6°, ±8°}.
• Avoidance tasks defined by negative reward in the collision region of state space and no reward elsewhere: R < 0 and r ≡ 0 in (1). Note that collision states are terminal.
• State maintaining tasks which are like goal seeking tasks but do not terminate when entering the rewarded region: r > 0 and R ≡ 0 in (1).
The following list shows some examples of TBU tasks:
avoidance tasks:
- T1◦ (Φint ≥ Φmaxintern |(Φint , Φ̇int )): avoid blocking between cab and
trailer
- T2◦ (”hits the wall”|(Φint , Φ̇int , Φext , y)): avoid wall collision
goal seeking tasks:
- T3 (Φgoal = 0|(Φ̇int , Φint , Φgoal )): line up the trailer with the goal point
state maintaining tasks:
- T4 (Φ̇int = 0|(Φ̇int , Φint )): run on an arclike curve
- T5 (Φint = 0|(Φ̇int , Φint )): run on a straight line
- T6 (Φgoal = 0 ∧ Φint = 0|(Φ̇int , Φint , Φgoal )): line up cab and trailer with
the goal point
Figure 3.2: difference between goal-seeking and state-maintaining tasks (a: goal-seeking, b: state-maintaining, shown for T3(Φgoal = 0))

Note that solution paths to goals of different tasks may interfere, as for
example in the tasks T3 and T5 : these tasks cannot be sequentialized in
order to reach the conjunction of the two goals.
Note further the difference between goal seeking and state maintaining
tasks:
• goal seeking tasks have greedy policies in the sense that they always
follow the shortest path to a rewarded state region.
• state maintaining tasks have policies that maximize the future accumulated reward with a tradeoff between the shortest path to the
rewarded state region and the possibility to keep the system within
that region.
This difference is best demonstrated by looking at a task with a fixed rewarded state region which is learned once as a goal seeking task and once as a state maintaining task. Figure 3.2 illustrates this difference for the TBU task T3.
3.2 The mountain car problem

Figure 3.3: a) the Mountain Car problem (three actions: forward thrust, backward thrust, and no thrust); b) its optimal value function over position and velocity
In the mountain car problem, an underpowered car has to climb up a
steep mountain road (figure 3.3). It has three actions - accelerate forward
or backward, and no acceleration -, but the power is not sufficient to climb
up directly: the car has to reach a certain level of kinetic energy before it
can reach the top of the hill at the right side where it will receive a positive
reward (R > 0 at the top and r ≡ 0 elsewhere). When the car reaches
the left border, however, the episode will stop without any reward. In the
modified mountain car problem, the car has to reach the top of the hill with
velocity 0 (i.e. below a small positive threshold).
Note that the optimal value function is discontinuous near the left border,
since any state near it and having too negative a velocity, necessarily ends
up in a terminal state without reward. Note furthermore that the value
function has a strong slope in the region of state space in which the kinetic
energy is almost insufficient to climb up directly towards the top of the hill
- this is the region of a policy change: go downhill first, then climb up the
hill with more kinetic energy. The mountain car is a well-known control
problem and has been extensively studied in RL by Sutton [Sutton, 1996] and Moore [Moore, 1991, Munos & Moore, 2002]. It will also be discussed in detail in chapter 5 of this thesis, in the context of a variable resolution of the discretized state space.
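For reference, a sketch of the mountain car dynamics in the commonly used formulation of Sutton & Barto; the constants and state bounds below are the standard ones of that formulation and are an assumption here, since the thesis does not list them in this chapter.

```python
import math

def mountain_car_step(position, velocity, action):
    """One elementary step; action in {-1, 0, +1} (backward thrust, no thrust, forward thrust).

    Standard Sutton & Barto constants (assumed, not quoted from the thesis).
    """
    velocity += 0.001 * action - 0.0025 * math.cos(3.0 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                 # left border: episode ends without reward
        position, velocity = -1.2, 0.0
    reached_goal = position >= 0.5      # top of the hill on the right side
    return position, velocity, reached_goal
```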
Chapter 4

Hierarchic Task Composition
4.1 Basic Motivation
A strategy for tackling a control problem in a complex state space and with
complex dynamics is to decompose it into several simpler learning problems
in the form of subtasks. Greater simplicity may result from:
• state space reduction: subtasks may solve a control problem in a reduced part of state space, reducing the number of states (cells)
• state space abstraction: a) certain dimensions of the state space of the
overall task may be irrelevant to the subtask, or b) the discretization
of the subtask may be coarser (see chapter 2.3.2), reducing in both
cases the number of states
• temporal abstraction in a hierarchy of subtasks: a subtask at a higher
level within a hierarchical decomposition may control actions lasting
longer than actions of subtasks at lower levels
• simpler goal structure: a simpler goal structure of a subtask may
result in a larger goal region in state space, shortening also the mean
depth of propagation of the reward in state space
• action set reduction: large action sets of the overall task can be reduced for subtasks by irrelevance, or by defining subtasks with abstract actions (options), such as the activation of lower-level subtasks.
There seems to be no natural way to learn and compose subtasks into a complex task. They may be learned simultaneously [Dietterich, 1997, Singh,
1993, Dayan & Hinton, 1994] or separately [Sutton & Precup, 1998], and
the composition may be sequential and flat [Singh, 1993, Sutton & Precup, 1998] or hierarchical [Dietterich, 1997, Singh 1993, Dayan & Hinton,
1994]. Subtask activation may be one-at-a-time or in parallel as it happens for navigation and simultaneous camera positioning in robotics. The
composition may result in a conjunction of the goals of the subtasks (most
approaches cited above), or in a disjunction [Whitehead & Karlsson &
Tenenberg, 1992], or in a successive (but not necessarily simultaneous) fulfillment, such as in a task agenda (to-do list).
Finally, even for the optimality criterion of task composition there is no unique definition. Before the three main definitions of such optimality are formulated later in this section, I want to briefly discuss the meaning of hierarchy in behavior architectures. This issue has been debated for some time, particularly in the context of robotics since Brooks presented his subsumption architecture [Brooks, 1986].
4.1.1 Hierarchical Task Architectures
Given a set of tasks of the three types defined in the last chapter (avoidance,
goal seeking, state maintaining tasks), a hierarchical composition of these
tasks can be characterized in the following way:
A) Leveling
the architecture is leveled, where
(a) each level of tasks is in some way semantically closed: for example, the tasks at one level may represent the full reservoir of
behavior at a specific level of abstraction, or they may represent
all the behaviors which must be strictly sequentialized, or which
must use the same state space discretization,
(b) levels are in some way ordered:
• tasks at higher levels control/activate tasks at lower levels,
or
• goals of lower levels are subordinate to goals at higher levels
(as for examples avoidance tasks may be subordinate to goal
seeking tasks)
B) Delegation
a task delegates the control for achieving one of its (sub)goals to a
task at a lower level
C) Refinement
a refinement of
a) state space (discretization)
b) time
both of which may result in a refinement of
c) control
D) Abstraction
the opposite of refinement: higher levels are defined by abstraction of
(a) control, in particular of the actions: at the lowest level basic
control actions with constant duration, at higher levels options
with fixed policies and state-dependent durations
(b) state, more concretely: abstraction of the evaluation of policies
in a state partitioning, as for example in kd-tries (see chapter
2.3.2)
E) Repetition
the same hierarchical principles are repeated at each level: refinement,
delegation, abstraction
All hierarchical task architectures refer, in one way or the other, to these characteristics. In section 4.1.3, I will specify in more detail how the architecture developed in this thesis relates to these characteristics.
In the following section, the important issue of optimality in hierarchical
architectures will be discussed. Since RL is a learning method that optimizes a cost functional, it is natural to define optimality in the same way
for the composed task as for the component (sub-)tasks:
the execution of actions results in observed reward which accumulates, in
basic steps of size h with decay factor γ, to the reward functional J∞ . An
optimal policy is one which maximizes J∞ .
Different types of optimality can be defined when comparing the behavior
of the composed task with the behavior of its component tasks and with
the behavior of the ”flat”, monolithic (i.e. uncomposed) task defined by
the same reward function as that of the composed task.
For this purpose, we may assume one single ”objective” state st ∈ S for
each time t in each (sub)task within a hierarchical task architecture, even
though different (sub)tasks Ti may use different state spaces or partitionings for learning and executing their own policies πi^1. When comparing task behaviors, the values V^{πi}(s)^2 will be compared.

^1 more formally, we assume the existence of a global state space S, and for each (sub)task Ti a mapping φ : S → Si where Si is the state space/partitioning of Ti. In sections 4.2-4.4, this function will be defined for each of the three composition principles.
^2 more precisely: V^{πi}(φ(s)), see the previous footnote
4.1.2 Optimality of Policies in Hierarchical Task Architectures
Whenever a complex task is decomposed into subtasks within a hierarchical architecture, this architecture gives rise to hierarchical constraints when
compared with the monolithic (flat) task defined by the same reward predicate as the decomposed, complex task. These constraints are basically of
the following three types:
Hierarchical Constraints
1. constraint on action selection (action permission function A(s))
suppose, for example, that a task T may activate (”call”)
(sub)tasks Ti , i = 1..k at lower levels. Then, task T has to learn a
policy π that decides for each state s, which (sub)task Tj , j ∈ {1..k}
to activate. Let Ai be the action permission function of task Ti .
Then
a) task T has the constrained action permission function defined by A(s) := ∪_{i=1..k} A_i(s).
b) having chosen subtask Ti in state s, the action permission
function remains constrained to Ai as long as Ti is active.
c) if the subtasks are already learned, thus having fixed policies
πi (options), the composed task T has a constrained action
permission function A(s) := {π1 (s), ..., πk (s)}.
2. constraint on action duration
suppose that the hierarchic levels are defined by the coarseness
of state space discretization, as for example in a kd-trie. Then,
defining action durations which are adapted to the particular partitioning as described in chapter 2.3.2, each hierarchic level defines
a constraint on the action durations
3. constraint on the solution path
suppose, in the same example as in the first point, that the
(sub)tasks Ti are goal seeking tasks which remain activated until they have reached their goal region in state space. Thus, task T
may decide only in states which belong to the goal region of some
(sub)task. When compared with the monolithic task defined by
the same reward predicate, T is constrained by solution paths that
must run through the goal regions of its subtasks.
Depending on these types of constraints, three different types of optimality can be defined for reinforcement learning within hierarchic task
architectures:
a policy of the composed task may be hierarchically optimal [Parr & Russell,
1998], recursively optimal [Dietterich, 1997]3 or it may have flat optimality:
Definition 4.1a: Hierarchical Optimality
A policy π of a hierarchically composed task T giving rise to constraints of the types defined above is called hierarchically optimal when for its value function V^π it holds that V^π ≡ V^{π*}, where π* is the optimal policy of the corresponding monolithic task which has the same reward predicate and which is subject to the hierarchical constraints of T.

Definition 4.1b: Recursive Optimality
A policy π of a hierarchically composed task T is called recursively optimal when it executes, for any of its goal seeking subtasks Ti, during an observation that starts with the first activation of Ti and ends with the next achievement of Ti's goal, a (recursively) optimal policy also for that subtask.

Definition 4.1c: Flat Optimality
A policy π of a hierarchically composed task T is said to have flat optimality when V^π ≡ V^{π*}, where π* is the optimal policy for the corresponding monolithic task which has the same reward predicate and which is not subject to any of the hierarchical constraints of T.
Note that recursive optimality means plain optimality as defined in (2.16)
for subtasks which are not composed and which thus use elementary actions
as defined in chapter 2.3.3. Recursive optimality for any other subtask in
a hierarchical architecture is recursively defined. This might imply that it
is optimal with respect to actions that are themselves activations of (recursively optimal) subtasks (options, see 2.3.5)
The essential difference between hierarchical and recursive optimality is
that of local and global optimality: recursive optimality requires a policy
that is optimal for each of the subtasks without considering the context
in which they are being applied. Such a policy need not be globally optimal, i.e. optimal for the composed task, even if the hierarchical constraint were only that of passing through the goal regions of the subtasks. An example of this situation is given in figure 4.1. Vice versa, a hierarchically optimal policy is globally optimal, but need not be locally optimal for a single subtask looked at in isolation.

^3 the following definitions differ somewhat from those given by the two authors, which are based on discrete state spaces: Parr and Russell don't use the particular hierarchical constraints defined previously but employ finite state machines for describing constraints. Dietterich doesn't consider the possibility of interlaced subtask activation.

An example of
this situation is given in figure 4.2 (next page) from the TBU domain. Task
T6 (Φgoal = 0 ∧ Φint = 0) is composed of the subtasks T3 (Φgoal = 0) and
T5 (Φint = 0). The hierarchically optimal policy of T6 leads to an execution
of T3 which is not recursively optimal.
Figure 4.1: a recursively, but not hierarchically optimal policy: task1 has goal states indicated by *, task2 and the composed task have goal state G
Finally, note that for a hierarchically composed task, a policy that has
flat optimality is by definition the best policy, because of the absence of hierarchical constraints which might compromise optimality. However, such
a policy obviously does not always exist. In those cases, hierarchical constraints lead to suboptimal solutions, or can even lead to the case where
all policies lead to a continuous increase of the distance from the goal (an
example will be given in section 4.3.5.2, figure 4.14).
Figure 4.2: TBU task T(Φgoal = 0 ∧ Φint = 0), composed of the subtasks T3(Φgoal = 0) and T5(Φint = 0): a hierarchically, but not recursively optimal policy; T3 is not optimally executed under T
4.1.3 Three Hierarchical Composition Principles
An approach to task composition has to be specified by the following three
characteristics:
a) type of composition of tasks for combined execution,
b) composition of the learning process,
c) optimality criterion.
In this thesis, three composition principles for tasks have been developed
and studied, which have the following characteristics:
a) composition type:
• tasks are composed in a hierarchy, and all three principles define
hierarchical relationships between tasks.
• the goal of the composed task is represented as a conjunction of
the subtask goals
• some tasks may act in parallel, observing the others in a demon-like manner
b) learning process:
• each task is defined as a separate, discretized MDP as defined in chapter 2
• tasks are separately learned (one at a time)
• tasks are learned in a bottom-up manner, whereby any task
learns under the influence (composition) of already learned tasks
at lower levels.
c) optimality criterion: the policies of tasks composed by the first and
by the second principle (veto and subtask principle) are hierarchically
optimal, but not necessarily recursively optimal. The third principle
(perturbation principle) is recursively optimal, but not necessarily
hierarchically optimal.
Before presenting in detail the three composition principles, the following
three general features are important since they distinguish this approach
from most of the similar ones [Dietterich, 1997, Dayan & Hinton, 1993,
McGovern & Sutton, 1998, Singh, 1993].
1. The tasks can use independent state spaces, without regard to their future composition context. This is a consequence of the following feature.
2. A task uses only the policies of its subtasks at lower levels, not their
value functions. Thus, each task has to learn its value function from
scratch, and its lower-level subtasks have an influence only on the
action selection during learning and execution of the task. The
very reason for this characteristic is that a value function refers to
the task goal reached on the solution path of that task, independent
of a higher-level context, but at least in the domain considered in
this thesis, solution paths to different goals in a goal conjunction
(i.e. in a composed task) depend on each other. In consequence,
the value function of one subtask will not evaluate correctly the
combined solution path.
3. It is easy to show (using the argument of the last point) that any compositional approach which combines value functions must be based on recursive optimality. In fact, other approaches based on recursive optimality, such as maxQ-learning [Dietterich, 1997], combine the value functions of their component tasks.
In each of the next three subsections, I will present and discuss a single composition principle. In all three principles, tasks T1, T2, ... with goals p1, p2, ... are combined to carry out a combined task with goal p1 ∧ p2 ∧ ...^4.

^4 for an avoidance task Ti°, the goal pi is the negation of the reward predicate representing the collision region of state space with negative reward
4.2 The Veto Principle
In this principle, an avoidance task T1 may veto the action selected by some
other task T2 . As an example, take tasks T1◦ and T6 from section 3.
4.2.1 Definition
Let S, A and A be any state space, action set and action permission function, and let v be a function that maps states to action sets:
v : S → ℘(A)
with v(s) ⊂ A(s) the actions that must be suppressed (vetoed) for execution
whenever the agent is in state s.
Let T be any task with policy π, and let V : S → R and Q : SA → R be
any functions. Then T is said to be under the veto of the function v when
the following holds:
∀s∈S:    π(s) = arg max_{a∈A(s)\v(s)} Q(s, a)    (4.1)

and

∀s∈S:    V(s) = max_{a∈A(s)\v(s)} Q(s, a)    (4.2)
The meaning of (4.1) is that the policy selects the highest-valued action in
state s under the constraint that it is not vetoed in that state. (4.2) defines
the value V (s) of a state s in which actions might get vetoed. This definition is important for the Q-learning rule. Obviously, the term A(s) \ v(s)
in (4.1) and (4.2) can be interpreted as if the veto function belongs to the
action permission function.
With a state space discretization, as defined in chapter 2.3, the previous
definition can be modified in the following way.
Let Ŝ = (Ŝa )a∈A , be a family of partitionings of state space S (see chapter 2.3.2), and let Q̂(., a) be any function on Ŝa , Â the action permission
function on Ŝ, and V be any function on S. Then, (4.1) and (4.2) become
∀s∈S:    π(s) = arg max_{a∈Â(S(s,a))\v(s)} Q̂(S(s, a), a)    (4.3)

∀s∈S:    V(s) = max_{a∈Â(S(s,a))\v(s)} Q̂(S(s, a), a)    (4.4)
where S(s, a) ∈ Ŝa such that s ∈ S(s, a).
With these definitions, a Q-learning rule can be defined for a task under
the veto of function v , following (2.18):
Q-learning rule with veto function

Q̂^{(n+1)}(S(s_n, a_n), a_n) = (1 − α_n) · Q̂^{(n)}(S(s_n, a_n), a_n) + α_n · [ r(s_n, a_n) + γ^{dur(a_n)} · max_{a′∈Â(S(s_{n+1},a′))\v(s_{n+1})} Q̂^{(n)}(S(s_{n+1}, a′), a′) ]    (4.5)

with S(s_n, a_n) ∋ s_n --a_n--> s_{n+1} ∈ S(s_{n+1}, a′) and a_n ∈ Â(S(s_n, a_n)) \ v(s_n).
This learning rule is based on (4.4). As mentioned previously, the veto function can be interpreted as being part of the action permission function Â. However, Â must be well-defined on partitions, not on states. Therefore, (4.5) can be interpreted as possibly giving rise to a further, finer state partitioning in a discretized state space whenever an action gets vetoed only in part of a partition. Suppose, for the sake of simplicity, that task T uses only a single partitioning, independent of the action. Then, such a refinement can be defined in the following way:
Let S = ∪_{i=1}^{N} S_i be the state space partitioning for task T, and let A(S_j) = {a_1, a_2, ..., a_n} be the actions of task T in state S_j, indexed in a way such that Q̂(S_j, a_i) ≥ Q̂(S_j, a_{i+1}). Then, S_j can be refined as

S_j = ∪_{i=1}^{n} S_{j,a_i}    (4.6)

with

S_{j,a_1} := {s ∈ S_j | a_1 ∉ v(s)}
S_{j,a_{i+1}} := {s ∈ S_j \ ∪_{l=1}^{i} S_{j,a_l} | a_{i+1} ∉ v(s)}    (4.7)

Thus, S_{j,a_{i+1}} is that part of S_j in which a_1, .., a_i are vetoed, but not a_{i+1}. Obviously, for each s ∈ S_j there is a unique i with S_{j,a_i} ∋ s, and V(s) = Q̂(S_j, a_i) follows from (4.4) and from the way the actions have been indexed. This virtual refinement of the partitioning, as it may be induced by the veto function, might create problems of state aliasing which will be discussed in section 4.2.3.
The next proposition states that a task under some veto function can be reformulated as a task without a veto function, and that the two tasks are equivalent, based on the following definition of equivalence:
Definition of Equivalent Tasks
Two tasks T1 and T2 (with ”objective”, continuous state space S) are called
equivalent, if for any set of learning episodes (state and action sequences),
for identical initialization of their Q-functions, and for identical treatment
of their learning rates, the resulting policies from Q-learning ((2.18) or
(4.5)) are identical.
Proposition 4.1
Let a task T be under the influence of a veto function v. Then, there exists
an equivalent task T1 which is not influenced by a veto function.
Proof:
Let Ŝ_a = {S_{1,a}, .., S_{N_a,a}}, a ∈ A, be the state space partitioning of task T, with Q-function Q̂, action permission function Â : Ŝ → ℘(A), and action durations dur(S_{i,a}, a).
For each partition S_{i,a}, define

S′_{i,a} := {s ∈ S_{i,a} | a ∉ v(s)}
S″_{i,a} := {s ∈ S_{i,a} | a ∈ v(s)}

Define for each a ∈ A a new partitioning S̆_a := {S′_{i,a}}_{1≤i≤N_a} ∪ {S″_{i,a}}_{1≤i≤N_a}, and define a new action permission function Ă : S̆ → ℘(A) with

Ă(S′_{i,a}) := Â(S_{i,a})
Ă(S″_{i,a}) := Â(S_{i,a}) \ {a}

for all 1 ≤ i ≤ N_a. Define a task T1 on S̆ with the same reward function as task T, with action permission function Ă, and with action durations

dur(S′_{i,a}, a) := dur(S_{i,a}, a)

Let Q̂^{(t)} and Q̆^{(t)} be the Q-functions of T and T1 at time t.
It is sufficient to show that for any time t > 0 it holds that

∀a∈A ∀i=1..N_a    [S′_{i,a} ≠ ∅] → [Q̂^{(t)}(S_{i,a}, a) = Q̆^{(t)}(S′_{i,a}, a)]
This holds certainly for t = 0. Assume now that it holds for some t ≥ 0.
Let S_{i,a_t} ∋ s_t --a_t--> s_{t+1} ∈ S(s_{t+1}, a′) be the transition to time t + 1, with a_t ∈ Â(S_{i,a_t}) \ v(s_t) = Ă(S′_{i,a_t}). Then, since s_t ∈ S′_{i,a_t}:

Q̂^{(t+1)}(S_{i,a_t}, a_t) = (1 − α_t) · Q̂^{(t)}(S_{i,a_t}, a_t) + α_t · [ r(s_t, a_t) + γ^{dur(S_{i,a_t},a_t)} · max_{a′∈Â(S(s_{t+1},a′))\v(s_{t+1})} Q̂^{(t)}(S(s_{t+1}, a′), a′) ]
= (1 − α_t) · Q̆^{(t)}(S′_{i,a_t}, a_t) + α_t · [ r(s_t, a_t) + γ^{dur(S_{i,a_t},a_t)} · max_{a′∈Ă(S̆(s_{t+1},a′))\v(s_{t+1})} Q̆^{(t)}(S̆(s_{t+1}, a′), a′) ]
= Q̆^{(t+1)}(S′_{i,a_t}, a_t)

The first and last equality is (4.5); the second one holds because S′_{i,a_t} ≠ ∅, since a_t ∉ v(s_t). (end of proof)
4.2.2 Realization of the veto principle
The Q-functions of avoidance tasks can define a veto function in the following way. In a deterministic avoidance task, the state space divides into two
parts: a collision part (future collision inevitable), and a non-collision part
(future collision avoidable) 5 . The desired interaction between such a task
and a goal-seeking or a state-maintaining task is a veto on those actions
taken by the latter ones that lead directly into the collision part. Such a
requirement is based on the following definition of collision:
Collision
A collision is given by a region of state space which the agent must be
kept out of as an absolute requirement (a kind of survival condition). In
RL terms, it means that the negative reward for such a collision state
is not to be taken into account in any value function of any task of
the agent: it is not evaluated numerically by any goal seeking or state
maintaining task.
To understand the meaning of this definition, imagine a task T defined by
a positive reward for goal states, and a negative reward for collision states.
Applying a discount factor γ < 1 for searching the shortest path to a goal
state, the learning task would eventually find a compromise between taking the shortest path to the goal and entering with a certain probability
into the collision region, especially if it might take a long time to final
^5 in a system with non-deterministic state transitions, this characterization must be described with probabilities, as will be described later on in this section
collision in the collision-inevitable region, thus letting the negative reward decay considerably. Although the described situation should not occur in deterministic systems - in each state, the best action always either leads to collision or to the goal - remember that discretization results in nondeterminism, and can therefore lead to such a compromise. By explicitly declaring certain states as a collision region, the above definition can be used to implement absolute avoidance without compromise.
a) Learning the Q-functions of an avoidance task
1. Define avoidance tasks each on their specific state space, but with the full action permission function as defined in section 2.3.3. This is because all possible actions in a state must be evaluated for a useful veto function.
2. Define a state space discretization. The finer its resolution is, the
better it will approximate the discontinuity of the value function. See
next section for a technique of adaptive discretization with resolution
enhancement right at this border.
3. Choose γ = 1 in learning rule (2.18) when learning the avoidance task
T1 . This choice (an undiscounted MDP) is motivated by the fact that
an avoidance task is not a shortest path problem. Define a ”safe”
region Ssaf e of state space as a set of terminal states for T1 with
R ≡ 0. T1 ’s optimal value function thus results in V ∗ ≡ 0 in the
”collision-avoidable” region of state space and V ∗ ≡ R (R negative)
in the ”collision-inevitable” region of state space. The (optimal) value
function is thus discontinuous. What is required for the veto principle
is a good approximation of this discontinuity.
b) Defining the veto-function
Let Q̂_1, ..., Q̂_m be the Q-functions of avoidance tasks T1, ..., Tm with collision reward R < 0. Define

Q_avoid(s, a) := min_{1≤i≤m} Q̂_i(S_i(s, a), a)

with S_i(s, a) the partition of task Ti's partitioning that contains s. Then, define the veto function v as

v(s) := { a ∈ A(s) | −1 / (1 + e^{c_2·(Q_avoid(s,a)−c_3)}) < −0.5 }    (4.8)

A natural choice for c_3 is R/2.
The use of a squashing function in (4.8) approximates the discontinuity of the Q-function of the avoidance tasks. The factor c_2 is a squashing factor.
See figure 4.3.
Figure 4.3: approximation of the discontinuity of the function Q*_avoid(s, a) as a function of the (here 1-dimensional) state space with fixed action a; Q*_avoid(s, a) is the optimal Q-function on the continuous state space, Q_avoid(s, a) the learned Q-function, and the squashing function from (4.8) determines where action a is vetoed (around the threshold c_3)
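A sketch of how such a veto function and vetoed action selection could be computed from the learned avoidance Q-functions; the cell-lookup interface, the value of c_2 and the overflow clipping are illustrative assumptions, with c_3 = R/2 as suggested above.

```python
import math

def q_avoid(s, a, avoidance_tasks):
    """Minimum over the avoidance tasks' learned Q-values for (s, a); each task is
    assumed to expose cell(s) -> cell with a dict `q` of Q-values per action."""
    return min(task.cell(s).q.get(a, 0.0) for task in avoidance_tasks)

def veto(s, actions, avoidance_tasks, R, c2=10.0):
    """Set of vetoed actions v(s) according to (4.8), with c3 = R / 2."""
    c3 = R / 2.0
    vetoed = set()
    for a in actions:
        z = min(c2 * (q_avoid(s, a, avoidance_tasks) - c3), 50.0)  # clipped to avoid overflow
        if -1.0 / (1.0 + math.exp(z)) < -0.5:
            vetoed.add(a)
    return vetoed

def vetoed_greedy_action(s, cell, actions, avoidance_tasks, R):
    """Highest-valued non-vetoed action in the current cell (cf. (4.3))."""
    allowed = [a for a in actions if a not in veto(s, actions, avoidance_tasks, R)]
    if not allowed:          # every action vetoed: no safe choice in s
        return None
    return max(allowed, key=lambda a: cell.q.get(a, 0.0))
```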
c) Learning a task under the veto function
A (goal-seeking or state-maintaining) task T is learned with the Q-learning rule (4.5).
Note that a veto may also occur during repetition of an action a (in steps of size h) inside the current state cell of task T. This case is not treated in the definition of Q-learning with veto function (4.5). In practice, such a veto is treated as if the action a_t were vetoed in the starting state s_t of the transition, and thus Q̂^{(t)}(S(s_t, a_t), a_t) will not be updated. This can be interpreted as an extension of the veto function to those states. Note, however, that the definition of the veto function in section 4.2.1 is independent of the tasks that might get vetoed, while the extended veto function depends on the partitioning of the vetoed task.
As an example of the effectiveness of this approach, see figure 4.4 for the TBU example: the optimal trajectory of the goal-seeking task T6 traverses a region close to the critical region (i.e. inner blocking between cab and trailer).
Figure 4.4: TBU task T6(Φgoal = 0 ∧ Φint = 0): a) with veto principle (vetoed by T1°(Φint ≥ Φmaxintern)); b) without veto principle (a single task with r > 0 in (Φgoal = 0 ∧ Φint = 0) and R < 0 in Φint ≥ Φmaxintern)
4.2.3 Discussion of the veto principle
1. Benefits from the veto principle
The Veto Principle, first of all, offers benefits common to any task (de)composition:
reuse of learned behaviour without the necessity to learn it again and again.
Avoidance (of obstacles, for instance) is a useful behaviour that can be factored out of a complex behaviour. With a veto function as defined in the
previous section, the agent learns a goal-seeking or state-maintaining task
searching for the shortest path to the reward region in the ”collision avoidable” region of state space, without the necessity to evaluate again the cost
of collision. Second, and more specifically, the agent avoids costly collisions
from the very beginning of learning a new task. For numerical results, see
section 4.6.1.
2. Vetoed tasks are hierarchically optimal
From the definition in 4.2.1 and the construction of the equivalent flat task
in the proof of proposition 4.1, it is clear that a collision avoidance task
imposes a hierarchical constraint on the vetoed task (of type 2, see 4.1.2).
Because a vetoed task is equivalent to an unconstrained (flat) task with the
same reward function, it is hierarchically optimal. When we assume the
case of a continuous state space for both the avoidance and the vetoed task
(i.e. the limit case of a partition with infinitely fine resolution), the veto
principle even achieves flat optimality. This is obvious from the definition of
the veto function when defined by an avoidance task, in which a collision
occurs in a terminal state.
3. State generalization in avoidance tasks
It is desirable to apply an adaptive partitioning of state space when learning
an avoidance task in order to generalize states without aliasing. Recall
that the state space of a collision task divides into two regions with optimal
value function V ∗ ≡ 0 in the ”collision avoidable” region and V ∗ ≡ R < 0
in the ”collision inevitable” region. It is easy to define an adaptive partitioning of state space that enhances the resolution in the critical border
region of discontinuity by state splitting in order to better approximate it:
use the variance of observed errors in the Q-learning rule.
The following state splitting technique for avoidance tasks was implemented
for the TBU domain. Starting with a minimum resolution in the kd-trie
representing T1 ’s state space (see chapter 2), and with a Q-function globally initialized to the (negative) reward R, a cell $S_{i_n}$ (node $i_n$) gets split
whenever the observed Q-samples
$$r(s_n, a_n) + \hat{V}^{(n)}(S_{i_{n+1}}),$$
defined for transitions $S_{i_n} \ni s_n \xrightarrow{a_n} s_{n+1} \in S_{i_{n+1}}$, vary in their minimum
and maximum value by more than c3 · (−R), with c3 a constant (I used 0.8
for the TBU task, see below). This splitting criterion is motivated by the
fact that a high variation of the observed Q-samples for a given action in a
given cell implies that the border between the ”collision avoidable” region
and the ”collision inevitable” region must cross that cell.
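A condensed sketch of this splitting criterion (omitting the kd-trie bookkeeping itself): track the minimum and maximum observed Q-sample per (cell, action) and request a split once their spread exceeds c3 · (−R). The class and method names are illustrative:

class SplitMonitor:
    """Tracks observed Q-samples r + V̂(S_next) per (cell, action) and flags a
    cell for splitting when the min and max samples differ by more than c3*(-R)."""
    def __init__(self, R, c3=0.8):           # c3 = 0.8 was used for the TBU task
        self.R, self.c3 = R, c3
        self.stats = {}                        # (cell, action) -> [min, max]

    def observe(self, cell, action, q_sample):
        lo_hi = self.stats.setdefault((cell, action), [q_sample, q_sample])
        lo_hi[0] = min(lo_hi[0], q_sample)
        lo_hi[1] = max(lo_hi[1], q_sample)
        return lo_hi[1] - lo_hi[0] > self.c3 * (-self.R)   # True -> split this cell

# usage sketch, after a transition (s_n, a_n) -> s_{n+1}:
#   q_sample = r + v_hat(cell_of(s_next))     # r(s_n, a_n) + V̂(S_{i_{n+1}})
#   if monitor.observe(cell_of(s), a, q_sample): split(cell_of(s))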
Figure 4.5: adaptive partitioning of state space for the TBU avoidance task
T1◦ (Φint ≥ Φmaxintern )
4. State generalization in tasks under the veto principle
Particularly important is the question of how state generalization of a vetoed task relates to state generalization in avoidance tasks defining the
veto function. In section 4.2.1, it was shown that a veto function might
introduce a refinement of the partitioning of the vetoed task. Under which
circumstances can the vetoed task generalize over states in some of which actions get vetoed and in others not? The answer must account for the problem
of aliasing as defined in section 2.4.
In general, it is clear that a veto function can impose any kind of constraints
on the system dynamics and on the reachable regions of state space. Consider the following
Example 4.1 (figure 4.6)
A frictionless body of mass m moves around in a forceless space. It
can navigate with actions causing a constant impulse in the four directions, plus a noop operation causing no impulse at all. Its goal-seeking
task T must cross the (dashed) goal line, and the underlying veto
function T ◦ avoids collision with the u-shaped obstacle. Task T will
learn, among other things, to get out of the u-shape in a way that brings it as fast as
possible to the goal line, although T has no state information about
the obstacle’s shape. It is guided just by getting vetoed in the neighborhood of the obstacle.
[Figure: a frictionless body of mass m, actions fn , fs , fe , fw (impulse to north, south, east, west) and noop (no impulse), a u-shaped obstacle, and the goal line.]
Figure 4.6: a strongly constraining veto function
It is clear from this example that, in general, a vetoed task must adapt
its partitioning to the constraints imposed by the veto function in order
to minimize the effect of aliasing. This adaptation cannot, in general,
be limited to simply follow the resolution of the avoidance tasks (known
prior to learning the vetoed task) defining the veto.
In fact, the partitioning resolution depends highly on the dynamics and
the goal of the vetoed task itself. For example, a state with a vetoed action
may attract the policy, while avoiding the veto may require a different policy
already in a state far away from the state in which the veto occurs; in that
distant state, however, the vetoed task may generalize too much. See figure
4.7 for an illustrative example.
In conclusion, in the general case, the vetoed task must either use a partitioning with high resolution, or it must apply adaptive state partitionings
such as those described in chapter 5.
In the special case of the TBU avoidance task T1◦ (Φint ≥ Φmaxintern )
vetoing the goal-seeking task T3 (Φgoal = 0|(Φ̇int , Φint , Φgoal )) (line up the
trailer with the goal point), state generalization in T3 need not take into
account the partitioning of T1◦ , since all optimal trajectories will keep the
truck as close as possible to the border between the ”collision avoidable” and the
”collision inevitable” region of state space, independently of whether an
action will get vetoed or not. This makes the task different from that of
figure 4.7.
[Figure: outside the tunnel the setting is as in example 4.1; inside the tunnel (between x1 and x2 , with a hole p) the vetoed task is similar to that of example 4.1, except that 1. only three actions are allowed (fe : to east, fw : to west, noop: no impulse), and 2. learning episodes are such that with probability 0.9 the mass point enters the tunnel in the center, so that it may exit through the hole p. The problem: the discretization of the vetoed task must be of high resolution in front of the tunnel entrance, while the avoidance task must have a high resolution in the tunnel between x1 and x2 .]
Figure 4.7: example of independent discretizations of avoidance and vetoed
task
4.3 The Subtask Principle
This principle hierarchically combines tasks such that the learning task T
can activate (”call”) already learned tasks Ti ∈ {T1 , ..., Tm } (called the subtasks of T ), and T has to learn a policy that chooses among these abstract
actions defined by subtask activation:
A := {activate(Tj ) | 1 ≤ j ≤ m}
The theoretical framework for subtask activations is the Semi Markov Decision Process (SMDP) defined by options (see chapter 2.3.5): an option is
a fixed action policy π with a termination probability β defined on states,
and a state subset I ⊂ S in which the option is defined. A policy may
now choose among options, and evaluations are learned for execution of an
option in a state.
The basic idea behind the subtask principle is to decompose the goal predicate of a goal-seeking or state-maintaining task T (p1 ∧ p2 ... ∧ pm ) (in conjunctive normal form) into its parts p1 , ..., pm , and to combine tasks Ti
that know how to achieve (or maintain) the single goal predicates pi , in
such a way that the combination achieves the conjunction of the partial
goals pi . The subtask principle defines both the way of combination, and
a Q-learning for finding the optimal combination. As such, the subtask
principle aims at
a) a divide-and-conquer approach reducing the complexity of the overall
task.
b) the reuse of already learned, simpler subtasks by composing them to
more complex tasks.
In 4.3.1, the combination of subtasks will be formally defined, and the Q-learning for the subtask principle will be defined in 4.3.2. A motivation for
the choices to be made in 4.3.1 and 4.3.2 shall be discussed in the following.
The simplest form of combination is a sequentialization of subtasks:
$$\text{start} \xrightarrow{T_{i_1}} p_{i_1} \xrightarrow{T_{i_2}} (p_{i_1} \wedge p_{i_2}) \;\cdots\; \xrightarrow{T_{i_m}} (p_{i_1} \wedge \ldots \wedge p_{i_m})$$
The combined task has to learn the optimal sequence, given the possibility
of achieving sequentially and incrementally the conjunction of subgoals. In
other words, there must exist a sequence i1 , ..., im such that the subgoals
are conservative, in the sense that the subtask Tij+1 will not invalidate the
goal pi1 ∧ ... ∧ pij already achieved by the previous subtasks Ti1 , ..., Tij .
This is a typical situation in high-level symbolic task domains, as for example in production assembly tasks where an (RL-)learner has to find an
optimal sequence of production (assembly) steps, or in complex maze tasks
such as Dietterich’s taxi domain task [Dietterich, 1997]. The peculiarity of
these domains is that subgoals are (required to be) conservative: already
assembled parts, for instance, won’t get disassembled by further assembly,
and a passenger picked up by a taxi won’t be dropped during a drive to
the passenger’s destination.
One approach to task composition with non-interacting subgoals is Singh’s
compositional Q-learning [Singh, 1993], another one is Dietterich’s MAXQ-learning [Dietterich, 1997]. In both approaches, a subtask activation lasts
until achievement of the subtask goal.
At the subsymbolic level, however, with control problems in continuous
state space such as the TBU task or the modified mountain car task (reach
the top of the hill with velocity 0), subgoals are not conservative but
notoriously depend on each other. Just take the TBU domain: the task T6 (Φgoal =
0 ∧ Φint = 0) with the two subtasks T3 (Φgoal = 0) and T5 (Φint = 0). In this
case, the activation of subtask T3 should not last until goal achievement,
but the optimal switching point between the two subtasks occurs before
achievement of (Φgoal = 0) . See figure 4.8a. The same holds for the modified Mountain-Car problem (figure 4.8b): sometime before reaching the top
[Figure: a) TBU state space with the switch between subtasks T3 and T5 ; b) Mountain Car (pos, vel) phase plane with subtasks T1 (pos=goal) and T2 (vel=0).]
Figure 4.8: dependent subtasks
a) wrong TBU subtask sequencing: when activations last until goal achievement
b) in Mountain Car: subtask T2 (vel = 0) must start before (pos=goal) becomes true
of the hill, the combined task must switch from subtask T1 (pos = goal) to
T2 (vel = 0).
Thus, the most important question in defining the subtask principle is that
of the action duration, which means in terms of a subtask activation defined
as an action: what shall the termination condition of a subtask activation
be?
The theoretical framework of SMDPs and options as presented in chapter
2.3.5 is at the basis of the definition of the subtask principle in the next
subsection.
4.3.1 Definition
Given
a) Ti (pi | S (i) ), i = 1..m: a set of goal-seeking and/or state-maintaining
tasks with goal predicates pi , state spaces S (i) , and (already learned)
policies πi
b) S: a combined state space, with surjective mappings fj : Sj → S (j)
for some subsets Sj ⊂ S, j = 1..m, and with
Ij ⊂ Sj (with $\bigcup_{j=1..m} I_j = S$) the set of states (i.e. fj (Ij )) in which
Tj is applicable, while S (j) \ fj (Ij ) is the set of terminal states for Tj .
c) T ($\bigwedge_{i=1..m} p_i$ | S): the combined task, with action set
A := {activate(Tj ) | 1 ≤ j ≤ m}, possibly random (in the sense of an
SMDP) action durations dur(s, activate(Tj )) ∀s∈Ij with elementary
time step h as their unit, and
goal predicates pj with: ∀j=1..m ∀s∈Sj : pj (s) ⇐⇒ pj (fj (s))
d) a policy of task T selecting a subtask:
π : S → {1, .., m} with π(s) such that s ∈ Iπ(s) (i.e. π(s) is applicable in s)
e) Ā: the flat action set (a flat action set consists only of elementary
actions) of T , defined by $\bar{A} := \bigcup_{i=1..m} \bar{A}_i$ with Āi the flat action set
of Ti (which might again be composed of subtasks).
f) r(s, a, s′ ): a running or a boundary (flat) reward function of task T
(see chapter 2.1) with s, s′ ∈ S and a ∈ Ā
the subtask principle defines a composition of the subtask policies πi under
the combined policy π of task T to a flat policy π̄ (a flat policy is a policy
that maps states to elementary actions) in the following way:
Definition of the combined flat policy of the Subtask Principle:
when the policy π of task T has selected the subtask Tj in state s (with s ∈ Ij ),
then the flat policy π̄ selects in state s ∈ S the following sequence of actions, resulting for task T in a state transition to state s′ on the following trajectory of elementary
time steps h:
$$s = s_0 \xrightarrow{a_{1,1}} s_1 \,\ldots \xrightarrow{a_{1,k(1)}} s_{k'(1)} \quad \text{(first activation of } T_j\text{)}$$
$$\xrightarrow{a_{2,1}} s_{k'(1)+1} \,\ldots \xrightarrow{a_{2,k(2)}} s_{k'(2)} \quad \text{(second activation of } T_j\text{)}$$
$$\ldots$$
$$\xrightarrow{a_{d,1}} s_{k'(d-1)+1} \,\ldots \xrightarrow{a_{d,k(d)}} s_{k'(d)} = s' \quad \text{(d-th (last) activation)} \qquad (4.9)$$
with
$a_{i,l} \in \bar{A}_j$ selected by the flat policy π̄j of Tj in the l-th step of the i-th time that
subtask Tj (with policy πj ) has been activated, in state $f_j(s_{k'(i-1)})$, and
$k'(i) := \sum_{l=1..i} k(l)$, with $k'(d) = dur(s, activate(T_j))$, and
k(i) the (flat) duration (i.e. the duration in elementary time steps h) of the i-th
activation of Tj : $k(i) := dur\big(f_j(s_{k'(i-1)}), \pi_j(f_j(s_{k'(i-1)}))\big)$, 1 ≤ i < d,
and the duration k(d) for the last activation such that
$k(d) \le dur\big(f_j(s_{k'(d-1)}), \pi_j(f_j(s_{k'(d-1)}))\big)$ and $k'(d) = dur(s, activate(T_j))$
Note that
1. the policy π of the combined task T can be interpreted as if in a state
s it were ”calling” d times a subtask Tj . See figure 4.9a.
2. the duration of the subtask activation is counted in elementary time
steps h, not by the number of subtask activations, which will be
stressed sometimes in the sequel by calling it the flat duration. The
reason for this choice will be given in section 4.3.4. An illustration of
(4.9) is given in figure 4.9b.
3. in most cases (in all of this thesis), the surjective mappings
fj : Sj → S (j) will be projections of an n-dimensional real-valued
space to an m-dimensional subspace, m ≤ n. However, S (j) could
also be, for example, some lower-dimensional space of features of S
important to subtask Tj (for feature-based state space reductions, see
[Tsitsiklis & Van Roy, 1996]).
Figure 4.9: subtask principle: a) activation of a subtask as a ”call”
b) an example state transition (4.9) with final subtask interrupt
4.3.2 Q-learning under the Subtask Principle
Given a)-f) of the last section, and given
g) partitionings of S and S (i) (for a state s ∈ S, S(s) ∋ s denotes the partition of S
containing s, and for s′ ∈ S (i) , S (i) (s′ ) ∋ s′ denotes a partition of S (i) ) with the condition:
$$\forall_{j=1..m}\ \forall_{s, s' \in S_j}:\quad (s \in I_j \wedge s' \notin I_j) \;\Rightarrow\; S(s) \neq S(s')$$
(note that this condition assures that a subtask cannot terminate in the
same partition of S in which it was activated), a Q-learning rule under the
subtask principle can be defined for task T on partitionings on S and on
S (i) , using Q(S(s), Tj ) as a shorthand for Q(S(s), activate(Tj )):
Q-learning rule of the subtask principle
after a transition in S from s to s′ under activate(Tj ) on the trajectory:
$$S(s) \ni s = s_0 \xrightarrow{a_{1,1}} s_1 \,\ldots \xrightarrow{a_{1,k(1)}} s_{k'(1)} \quad \text{(first activation of } T_j\text{)}$$
$$\xrightarrow{a_{2,1}} s_{k'(1)+1} \,\ldots \xrightarrow{a_{2,k(2)}} s_{k'(2)} \quad \text{(second activation of } T_j\text{)}$$
$$\ldots$$
$$\xrightarrow{a_{d,1}} s_{k'(d-1)+1} \,\ldots \xrightarrow{a_{d,k(d)}} s_{k'(d)} = s' \in S(s')$$
with d, $a_{i,j}$, $k'(i)$, and k(i) as defined in (4.9), and with $S(s') \neq S(s)$,
update the Q-function as follows:
$$Q^{(n+1)}(S(s), T_j) := (1 - \alpha_n) \cdot Q^{(n)}(S(s), T_j) + \alpha_n \cdot \Big( \bar{r}(s, T_j, s') + \gamma^{k'(d)} \cdot \max_{T_i \in T(s')} Q^{(n)}(S(s'), T_i) \Big) \qquad (4.10)$$
with $\bar{r}(s, T_j, s')$ the accumulated reward of T during Tj defined as:
$$\bar{r}(s, T_j, s') = \sum_{i=0}^{d-1} \gamma^{k'(i)} \sum_{j=0}^{k(i+1)-1} \gamma^{j} \cdot r\big(s_{k'(i)+j},\, a_{i+1,j+1},\, s_{k'(i)+j+1}\big) \qquad (4.11)$$
where k′(0) = 0, $Q^{(n)}(S(s'), \cdot) = 0$ if s′ is a terminal state, and
T (s) := {Tj | j = 1..m ∧ s ∈ Ij } the set of subtasks applicable in s.
The combined task T then defines its policy as
$$\pi(s) := activate\Big( \arg\max_{T_j \in T(s),\, j=1..m} \{\, Q(S(s), T_j) \,\} \Big)$$
The sequence
$$s = s_0,\, a_{1,1},\, s_1,\, \ldots,\, s_{k'(d)-1},\, a_{d,k(d)},\, s_{k'(d)} = s'$$
can be seen as the application of an option in state s, bringing the system to
the next state s′ . Thus, (4.10) can be interpreted as an SMDP-Q-learning
for options to approximate the solution to the following
Bellman equation of the subtask principle
$$Q^*(s, T_j) = E\Big[\, \bar{r}(s, T_j, s') + \gamma^{k} \cdot \max_{T_l \in T(s')} Q^*(s', T_l) \,\Big], \qquad s, s' \in S \qquad (4.12)$$
where the reward r̄(s, Tj , s′ ) is the accumulated reward of T as defined in
(4.11). The expectation is with respect to the flat duration k := k′ (d) of
activate(Tj ), to the flat durations k(i), i = 1..d (see (4.9)) of single activations of Tj , and to the states of the trajectory from s to s′ .
The approximation given by (4.10) is twofold: first, it is due to Q-learning as a stochastic approximation of the solution of (4.12), and second, it is due to approximation by discretization using partitioning. This
second type of approximation introduces non-Markovian state transitions,
as already discussed in chapter 2.4. Again, known convergence results of
Q-learning do not apply for this reason. However, the problem of state
aliasing is a key issue to be controlled for successfully applying the subtask principle. This point will be discussed in further detail in the next
subsection.
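For illustration, a small sketch of update (4.10) with the accumulated reward (4.11), assuming the trajectory of one activate(Tj) transition has been recorded as lists of elementary rewards grouped by single activations of Tj; the function names and data layout are assumptions, not the thesis code:

def accumulated_reward(reward_blocks, gamma):
    """r_bar(s, T_j, s') as in (4.11): reward_blocks[i] holds the elementary
    rewards of the (i+1)-th activation of T_j, in the order of the time steps."""
    r_bar, k_prime = 0.0, 0                 # k_prime = k'(i), elementary steps so far
    for block in reward_blocks:
        for j, r in enumerate(block):
            r_bar += gamma ** (k_prime + j) * r
        k_prime += len(block)
    return r_bar, k_prime                   # k_prime is now k'(d)

def subtask_q_update(q, cell_s, cell_s2, T_j, applicable, reward_blocks,
                     alpha, gamma, terminal=False):
    """One application of update rule (4.10) for Q(S(s), activate(T_j)),
    with 'applicable' the set T(s') of subtasks applicable in s'."""
    r_bar, k_d = accumulated_reward(reward_blocks, gamma)
    bootstrap = 0.0 if terminal else max(q.get((cell_s2, T_i), 0.0) for T_i in applicable)
    key = (cell_s, T_j)
    q[key] = (1 - alpha) * q.get(key, 0.0) + alpha * (r_bar + gamma ** k_d * bootstrap)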
4.3.3 Action durations and partitionings
As already mentioned at the beginning of this section, the most important
issue in the subtask principle is that of the duration of subtask activations.
Different from other approaches to task composition, a subtask activation
should not last until goal achievement of the subtask, but should have
the possibility to terminate some time before, in order to cope with subtasks having dependent dynamics. Since the action duration determines a
temporal generalization, and temporal generalization is bound to spatial
generalization as discussed in chapter 2.3-2.4, I will treat in this subsection
both issues, the action duration and the relationship between partitionings
of the combined task and of the subtasks.
a) action duration
The natural extension of the action duration model as presented in chapter
2.3.5 consists in adapting the duration of a subtask activation to the partition S(s) of state space S of the combined task, just in the same way it
was done in (2.23) and (2.24):
$$dur(S(s), activate(T_j)) := \max_{\hat{s} \in S(s)} \min\{\, k > 0 \mid \hat{s}_k \notin S(s) \,\} \qquad (4.13)$$
using the notation of (4.9). Thus, the duration of an action is defined as
the maximum number of elementary time steps during successive subtask
activations that are required to get just out of that cell (i.e. from the
”worst” state ŝ ∈ S(s)). This definition depends only on the partition, not
on the state within the partition. Note that this number may be infinite,
in which case the action (Tj ) will simply not be permitted in this partition,
except for terminal states.
Even more ”natural” (and easier to implement) would be a definition that
allows full activation durations of all, i.e. even of the last subtask activation
in sequence (this definition would allow for subtask activation durations
that are independent of the partitioning of the combined state space S).
For this definition, define at first the number of subtask activations
$$d(S(s), activate(T_j)) := \max_{\hat{s} \in S(s)} \min\{\, d > 0 \mid \hat{s}_{k'(d)} \notin S(s) \,\} \qquad (4.14)$$
where $k'(d) := \sum_{l=1..d} k(l)$ and $k(d) = dur\big(f_j(\hat{s}_{k'(d-1)}), \pi_j(f_j(\hat{s}_{k'(d-1)}))\big)$,
using the notation of (4.9). Then, define
$$dur(S(s), activate(T_j)) := k'\big(d(S(s), activate(T_j))\big) \qquad (4.15)$$
The alternative approach extending (2.24) is given by
$$dur(s, S(s), activate(T_j)) := \min\{\, k > 0 \mid s_k \notin S(s) \,\} \qquad (4.16)$$
The rationale behind both definitions is that the temporal generalization
given by the duration of an action should be adapted to the spatial generalization given by the width of the partition containing the current state.
As will be shown in the next subsection, the partitioning of the combined
task may be required to have a higher resolution in certain regions than
those of its subtasks. This would lead to a problem when using the action
durations as defined in (4.15): the duration of an action of a subtask is itself
adapted again to the width of the partition of the subtasks partitioning,
independently from the partition of the combined state space S. If the partition of the subtask is coarser than that of the combined task, the subtask
will return control too late with respect to the partition of the combined
task.
Thus, the duration of a subtask activation must be controlled at the elementary step level as defined in (4.13) or in (4.16).
Note that this definition of action duration gives rise to a random action
duration. Note further that this duration may ”interrupt” a subtask which
has not yet reached its own action duration.
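A sketch of the state-dependent duration (4.16), obtained by simulating the subtask's flat policy until the cell S(s) is left; (4.13) is then approximated by maximizing over sampled states of the cell. A step cap stands in for the "may be infinite" case, and all names are illustrative:

def dur_state(s, cell, flat_policy_j, env_step, max_steps=10_000):
    """dur(s, S(s), activate(T_j)) as in (4.16): number of elementary steps
    needed to leave the cell S(s) when following T_j's flat policy from s."""
    start_cell, s_k = cell(s), s
    for k in range(1, max_steps + 1):
        s_k, _ = env_step(s_k, flat_policy_j(s_k))      # one step of size h
        if cell(s_k) != start_cell:
            return k
    return None          # cell never left: action not permitted here (see text)

def dur_cell(sample_states, cell, flat_policy_j, env_step):
    """dur(S(s), activate(T_j)) as in (4.13), approximated by the maximum of
    (4.16) over sampled states of the cell (the 'worst' start state)."""
    durations = [dur_state(s, cell, flat_policy_j, env_step) for s in sample_states]
    return None if any(d is None for d in durations) else max(durations)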
b) partitionings in the subtask principle
Obviously, this definition of action duration puts into relation the partitioning of the combined task and the partitionings of the subtasks. What
should this relation look like?
There are three aspects that determine it:
1. near and within the goal region of the combined task there should
be a finer resolution in order to fine-tune the policy. As an example,
take the TBU task T6 again. See figure 4.10.
2. a high resolution is required in the region where the policy of the combined task performs a subtask switch, which, as already mentioned,
may not be in the goal region of either of the two sequenced subtasks.
As an example, see this border line in the state space of the modified
Mountain Car example, figure 4.8b.
3. in general, we can expect a greater temporal generalization for the
combined task with respect to the temporal generalization of the subtasks, since the subtask principle
• should define its actions at a higher semantic level than that of
the subtasks, and consequently
• looks for a sequentialization of the subtasks simpler than the
sequence of the actions of the subtasks, in terms of number of
actions or of number of action changes (in the best case a sequence in which each subtask occurs just once). This will be
discussed in more detail in 4.3.5.1.
Only aspect 1 allows for a predefined, fixed strategy, which I call goal-centered partitioning, as illustrated in figure 4.10:
Definition of a goal-centered partitioning
A goal-centered partitioning is defined by a (predefined) minimum
and maximum depth in the kd-trie for each dimension of state space
(i.e. the number of splittings of a cell which cuts it into two halves
along this dimension). The partitioning is then defined by constructing a kd-trie in which cells containing states of the goal region, have
maximum depth within the kd-trie. Then, the kd-trie gets smoothed
such that any two adjacent cells do not differ in their cell width for
each dimension by more than a factor two.
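A simplified one-dimensional illustration of this definition (not the kd-trie implementation used in the thesis): leaves of a binary interval tree are split to maximum depth where they intersect the goal interval, to minimum depth elsewhere, and then smoothed so that adjacent leaves differ by at most one depth level (a factor of two in width):

def goal_centered_partition(goal, min_depth, max_depth, lo=0.0, hi=1.0):
    """Return sorted leaf intervals (lo, hi, depth) of a 1-D binary split tree:
    max depth inside the goal interval, min depth elsewhere, then smoothed so
    that neighbouring leaves differ by at most one depth level."""
    def overlaps(a, b):                        # open-interval overlap test
        return a[0] < b[1] and b[0] < a[1]

    leaves, changed = [(lo, hi, 0)], True
    while changed:                             # split until required depths are reached
        changed, nxt = False, []
        for (a, b, d) in leaves:
            need = max_depth if overlaps((a, b), goal) else min_depth
            if d < need:
                m = (a + b) / 2.0
                nxt += [(a, m, d + 1), (m, b, d + 1)]
                changed = True
            else:
                nxt.append((a, b, d))
        leaves = sorted(nxt)

    changed = True
    while changed:                             # smoothing: neighbours differ by <= 1 level
        changed, nxt = False, []
        for i, (a, b, d) in enumerate(leaves):
            nbr_depths = [leaves[j][2] for j in (i - 1, i + 1) if 0 <= j < len(leaves)]
            if nbr_depths and max(nbr_depths) - d > 1:
                m = (a + b) / 2.0
                nxt += [(a, m, d + 1), (m, b, d + 1)]
                changed = True
            else:
                nxt.append((a, b, d))
        leaves = sorted(nxt)
    return leaves

# usage sketch: partition = goal_centered_partition(goal=(0.45, 0.55), min_depth=2, max_depth=5)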
[Figure: kd-trie partitioning over the (Φint , Φgoal ) plane, with Φint ∈ [−π/8, π/4] and Φgoal ∈ [−π/4, π/4].]
Figure 4.10: example of goal-centered partitioning:
TBU task T6 (Φgoal = 0 ∧ Φint = 0), with initially unknown goal region
Aspect 2 instead requires a dynamic partitioning strategy, since the border
lines of a subtask switch are not known at the outset of learning. Note
that this aspect is not only important for the subtask principle, but is of
general importance. In fact, any type of action change of a policy defines an
important region of state space that must have a high resolution. This has
been discussed in detail in chapter 2.5 with respect to the optimal policy
partitioning which was defined exactly by these borders of action change.
Since generalization is the key issue in exploiting the subtask principle, it
must be remembered that too strong a generalization may lead to a poor policy, as discussed in example 2.1 of chapter 2.5. In chapter 5, I will present
a Q-learning with dynamic, adaptive partitionings (so-called kd-Q-learning)
that can cope with both problems, that of refining resolution in the region of a subtask switch, and that of generalizing while minimizing aliasing
problems.
8 if the reward area is unknown to the learning agent at the outset of learning, it can,
in a preliminary exploration phase, collect reward points
9 without such smoothing, the tiling structure of a kd-trie may have strong discontinuities in the discretization resolution which might make it hard for the learning controller
to navigate the system into the desired goal region, at least with action duration model
(4.16)
4.3.4 The algorithm of the subtask principle
Assume that a)-g) of subsections 4.3.1 and 4.3.2 are given, and let
π̄j denote the flat policy of subtask Tj (i.e. the policy that selects elementary actions for each time step h, which results from the policy πj ). Then,
the algorithm of Q-learning of the subtask principle is defined as follows:
Q-learning algorithm for the subtask principle

initialize:
    partitioning of S
    Q
    α, the initial learning rate
repeat (for each learning episode):
    initialize: s ∈ S (randomly)
    repeat (subtask selection):
        select Tj ∈ T (s) (for example ε-greedy)
        s0 ← s
        t1 ← 0
        r̄ ← 0
        c ← 1
        repeat (for single, elementary action with step size h):
            execute π̄j (fj (st1 ))
            observe next state st1 +1 and reward r
            r̄ ← r̄ + c · r
            t1 ← t1 + 1
            c ← c · γ
        until st1 ∉ S(s)
        s′ ← st1
        Q(S(s), Tj ) ← (1 − α) · Q(S(s), Tj ) + α · ( r̄ + c · maxTk ∈T (s′ ) Q(S(s′ ), Tk ) )
        update α
        s ← s′
    until learning episode finished
end repeat
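An illustrative Python rendering of the algorithm above; the environment, partitioning, subtask policies and exploration are supplied as assumed callables, the learning rate is kept constant for brevity, and an explicit terminal check is added. This is a sketch under these assumptions, not the implementation used in the thesis:

import random

def subtask_q_learning(episodes, init_state, cell, applicable, flat_policy,
                       env_step, is_terminal, alpha, gamma, epsilon=0.1):
    """Q-learning of the subtask principle: select a subtask epsilon-greedily,
    execute its flat policy in elementary steps of size h until the current
    cell S(s) is left, then apply the SMDP-style update of the algorithm above."""
    q = {}
    for _ in range(episodes):
        s = init_state()                                  # s chosen randomly in S
        while not is_terminal(s):
            subs = applicable(s)                          # T(s): subtasks applicable in s
            if random.random() < epsilon:
                T_j = random.choice(subs)
            else:
                T_j = max(subs, key=lambda T: q.get((cell(s), T), 0.0))
            s0, s_t, r_bar, c = s, s, 0.0, 1.0
            while True:                                   # elementary steps inside S(s0)
                s_t, r = env_step(s_t, flat_policy(T_j, s_t))
                r_bar += c * r
                c *= gamma
                if cell(s_t) != cell(s0) or is_terminal(s_t):
                    break
            boot = 0.0 if is_terminal(s_t) else max(
                q.get((cell(s_t), T_k), 0.0) for T_k in applicable(s_t))
            key = (cell(s0), T_j)
            q[key] = (1 - alpha) * q.get(key, 0.0) + alpha * (r_bar + c * boot)
            s = s_t
    return q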
Note that the subtask principle combines in a simple, straightforward way
with the veto principle:
1. all subtasks should be learned previously under the veto principle.
2. when the policy of the task selects a subtask for execution, that subtask will be executed under the veto principle. The selection of the
subtask itself cannot be vetoed.
In other words, a task that learns and runs under the subtask principle is
not directly concerned with the veto principle: any veto may occur only
at the elementary action level, and the choice of elementary actions is not
under the control of the composed task.
4.3.5 Discussion
The subtask principle aims at reducing the complexity of difficult control tasks,
and at reusing already learned behaviors by combining them into a new behavior. The context this principle is placed in is that of control tasks at a
subsymbolic level defined in continuous state spaces. As such, it is able to
cope with dependent subtasks, which is unique with respect to other compositional approaches. In this section, two questions will be discussed more
in depth:
1. what is the concrete reduction of complexity the subtask principle
brings about?
2. what type of optimality can be aimed at with the subtask principle,
and what are the conditions under which a task can be learned with
the subtask principle?
For the following discussion, the reward function will be constrained in the
following way:
any task or subtask is a goal-seeking or a state-maintaining task, with the
combined task having a goal (or state) region (to maintain) defined by a
goal predicate $\bigwedge_{i=1..m} p_i$, and subtasks Ti having goal predicates pi , and
the combined task having a reward function defined by
• if the combined task is goal-seeking:
r ≡ 0 as running reward, and a boundary reward R(s) ≡ const > 0
for all s ∈ S with $\bigwedge_{i=1..m} p_i(s)$ (any such state is terminal), and
R(s) ≡ 0 for any other terminal state;
• if the combined task is state-maintaining:
a running reward r with r(s) ≡ const1 > 0 for all nonterminal s ∈ S
with $\bigwedge_{i=1..m} p_i(s)$, and with r ≡ 0 for any other nonterminal state, and
a boundary reward R(s) ≡ const2 > 0 for all terminal s ∈ S with
$\bigwedge_{i=1..m} p_i(s)$, and R(s) ≡ 0 for any other terminal state.
1. Reduction of complexity by the subtask principle
The question of what the concrete reduction of complexity of the subtask
principle is like has two parts: first, concerning the reduction given the
multiple learning problem of learning the combined task and the subtasks;
second, concerning just the learning problem of the combined task, given
already learned subtasks. The second part is important in the context of
reusability. While both questions will be answered by experimental results
at the end of this section, the second question will now be examined more
thoroughly.
In the subtask principle, a composed task should define its actions at a
higher semantic level than that of the subtasks. An action like ”move onto
the top of the hill” is obviously semantically at a higher level than the
action ”accelerate forward” or ”accelerate backward”. The same holds for
the TBU action ”align trailer with goal point” compared with the steering
action ”steer more to the left”. What is the concrete meaning of ”semantically higher”? In some way, the control dynamics of a behavior should
be simpler and at a larger time scale at a higher semantic level. This can
be defined more formally and with specific reference to a sequential decision problem, by the following two factors having a direct influence on the
complexity of an MDP viewed as a search problem10 :
• the smaller the number of actions, the simpler the MDP becomes
• the smaller the number of action changes in the action sequence of a
solution trajectory, the simpler the MDP
It is clear that this list of complexity factors is not exhaustive. The number
of states in a state space, the number of successor states of a state given
an action ($card\{\, s' \mid p^a_{s,s'} > \epsilon \,\}$), the extension and the fan-in of the goal
region, and the like, also determine the complexity of the search problem.
However, the two mentioned above, together with the action duration, are
tightly related to the semantic level of an action. Thus, the following three
complexity factors are defined, which are relevant for hierarchical task composition:
10 search in the (finite) space of policies for the optimal one
Complexity factor 1:
the number of actions
Complexity factor 2:
the number of action changes in the action
sequence of a solution trajectory
(consecutive action repetitions not counted)
Complexity factor 3:
the time scale (frequency) of decision points
Note that the most concrete complexity measure for a control learning problem is given by the number of learning steps (and the amount of memory
required), given a desired approximation of the optimal control. However,
since the approximation is to an a priori unknown function (depending on
the particular environment), this effort cannot be estimated a priori, and
consequently it is hardly generalizable (i.e. it is difficult to quantify the
problem size)11 .
Thus, the three complexity factors 1)-3) as defined above shall be examined
here to determine in which way the subtask principle reduces complexity,
in particular in the two examples considered in this thesis. In what follows, I will assume a random walk during learning (total exploration with
uniformly distributed action selection), and a sequentializability of the subtasks that is hierarchically optimal with respect to the combined task, as
for example the TBU-task T6 and the modified Mountain-Car task (see Definition 4.1).
factor 1) fewer actions:
The reduction is obvious, since the Q-function in tabular form has to learn
O(|S| · |A|) values. In the TBU domain, the subtask principle applied to
the task T6 reduces the number of actions from 5 to 2. In the modified
mountain car example, however, no reduction of this type takes place (2
elementary actions, 2 subtasks).
factor 2) the number of action changes, and factor 3) the frequency of
decision points:
The number of action changes in the action sequence on a solution trajectory determines the complexity of a control task. In the lower limit, and
11 little has been done in research on algorithmic complexity and its influencing factors
in RL. A rare exception is the work of [Littman, 1992] concerning the minimum required
memory for RL-tasks in certain categories of environments, which unfortunately had no
follow-up
Figure 4.11: example of complexity reduction, TBU task T6 (φgoal = 0 ∧ φint = 0):
a) monolithic task: 3 action changes; b) subtask principle: 1 action change
thus for simple control tasks, we have a sequence in which each action appears at most once (action repetitions not counted). More complex tasks
require more action changes which may repeat. Illustrative examples are
depicted in figure 4.11 for the TBU and figure 4.12 for the modified Mountain Car problem. A closer look at the TBU example (the action sequence
between a3 and a4 in figure 4.11) reveals that this complexity factor, in
case of a discretization of a continuous action signal, has to be defined
more rigorously12 :
Complexity factor 2’ (defined for a continuous, scalar action signal):
the order of the optimal control a(t) (see chapter 2.1) defined by the number
of local maxima and minima counted on the trajectory to the goal
Why does factor 2 (2’) influence the complexity of the learning task?
12 because a discretized action signal may require frequent action changes due to the
action discretization, oscillating around a single setpoint in the continuous action space - take as an example of the TBU the trajectory on a constant, arclike curve
Figure 4.12: example of complexity reduction, modified Mountain Car task T (pos = goal ∧ vel = 0),
with actions a1 (thrust →), a2 (thrust ←) and subtasks T1 (pos=goal), T2 (vel=0):
a) monolithic task: 3 action changes; b) subtask principle: 1 action change
In plain Q-learning, this criterion does not apply, since the learning agent
has to try in each state each action infinitely often for achieving an approximation with vanishing error13 . Thus, the number of action changes does
not influence the number of learning steps to convergence.
Things change when temporal generalization gets applied. Assuming
that factor 2 (2’) applies, resulting in fewer changes of subtasks than action
changes of the flat policy on an optimal trajectory to the goal region, a
temporally extended subtask activation is likely to execute the optimal action sequence, under the condition that the selected subtask is the optimal
one.
This argument is not a strong one in the sense that it doesn’t hold independently of the system dynamics: temporal abstraction shifts the decision
problem of frequent action selection towards the decision problem of selecting the right point of time for a less frequent action selection: it should be
easy to define a system with dynamics where this latter decision is crucial
resulting in the same complexity as without temporal generalization.
Nevertheless, the argument has great importance in practice. In fact, many
control problems allow for solutions (trajectory to the goal) that are suboptimal since the partitionings don’t match well the borders of policy (subtask) changes. Since, in practice, an exploration strategy will not be a
random walk, but will somehow be a compromise between exploration and
exploitation (for example, following the ε-greedy policy), the search for the
optimal path in an adaptive discretization approach will be biased towards
the suboptimal solution paths.
13 although, for finite state and action sets, the algorithm will converge after a finite
number of steps to a Q-function defining the optimal policy - however, there is no upper
bound of this number which is independent from system dynamics, see chapter 2.2
Thus, a useful approach should start with high generalization in space and
time, and then it should detect during learning any aliasing due to overgeneralization and react adaptively by refining state space discretization.
This is the approach taken by kd-Q-learning as defined in chapter 5.
In this way, temporal generalization will accelerate the learning process. In
the algorithm of 4.3.4, temporal generalization is controlled by the partitioning of the combined state space S which the agent gets initialized to at
the beginning of the algorithm.
Returning to the argument that temporal abstraction shifts the decision
problem of frequent action selection towards the decision problem of selecting the right point of time for a less frequent action selection, it is
interesting to note the following. In control theory, it is a well known fact
that for certain systems a bang-bang control (as the Mountain Car with
its two maximum accelerations forward and backward, see chapter 3.2) can
be optimal (minimum time to goal) [Meystel & Albus, 2002]. Bang-bang
control is characterized by the right instants of switching (action change
between the two values) which has to be searched for in this kind of control. This is exactly the problem in temporal generalization in RL - find
the right point of time for a policy change.
2. optimality and applicability of the subtask principle
The subtask principle is based on the decomposition of a complex goal predicate in conjunctive normal form, and on subtasks defined by the terms of
the decomposed goal predicate. The combined task then has to learn the
shortest path to the combined goal region, learning a policy that activates
these subtasks. By definition, the flat policy of the combined task selects
elementary actions from the same set of elementary actions as its subtasks
do (4.3.1.e). However, there are constraints on using these actions: they
must always be selected in the context of a subtask activation, leading to
the constraints discussed in 4.1.2. In this subsection, I will compare this
combined task (i.e. one with these subtask constraints) with the corresponding unconstrained, flat task.
Definition: the unconstrained, flat task
given a combined task under the subtask principle, the corresponding
unconstrained, flat task is a task which has the same state space, a
set of elementary actions that is the same as the flat action set of
the combined task, the same reward function, and an unconstrained
policy for the elementary actions.
In chapter 2.3.1, a constraint on a policy has already been discussed: that
which constrains an action to having a control value constant for multiple
elementary time steps. The constraint imposed by the subtask principle on
the flat policy is defined by options instead of action repetitions: in each
state, the set of applicable subtask policies πj determines the set of actions
applicable in state s - all those actions that are optimal for at least one of
the subtasks.
What type of optimality has to be considered for the subtask principle?
The combined task defined under the subtask principle, has an optimal
policy which, by definition, is the solution of the Bellman equation (4.12)
(in this subsection, the learning process itself and the corresponding partitionings are ignored, and just the theoretically existing solutions to the
Bellman equations are considered now).
Reminding the concepts of hierarchic and recursive optimality from chapter
4.1.2, are these concepts useful for characterizing the subtask principle?
Now, the optimal policy of the combined task need not be recursively optimal. This is due to the fact that subtasks may depend on
each other, which results in subtask activations that may not last until goal
achievement, and consequently each single subtask need not be
performed optimally in the context of the combined task. The subtasks use
locally optimal policies, but the composed task may not apply them in a
locally optimal way. On the other hand, by definition, the combined task
is always hierarchically optimal with respect to the hierarchical constraint
(in 4.1.2: constraint type 1) that the flat policy may select only from those
actions that are selected also by some subtask.
From these considerations, it is clear that we should consider the third type
of optimality in hierarchies for the subtask principle: the flat optimality
(definition 4.1.c), which refers to the optimal policy of the unconstrained,
flat task. I redefine this optimality in the context of the subtask principle:
Redefinition: unconstrained, flat optimality
A policy of a combined task under the subtask principle has unconstrained, flat optimality when its flat policy is an optimal policy for
the unconstrained, flat task.
In order to discuss this type of optimality, any constraint other than that
introduced by options must be abstracted from, in particular that
imposed by action durations. Recalling proposition 2.1, which stated that
the unconstrained optimal policy (i.e. for actions with duration 1) is the
best optimal policy among those constrained by action durations, I will
assume in the sequel an action duration of 1 for all tasks and subtasks.
Now we can formulate the following two questions most relevant when applying the subtask principle:
1. when has the combined task a policy with unconstrained, flat optimality?
2. does the optimal policy of the combined task find a (suboptimal) solution path from any state from which there exists a solution path for
the unconstrained, flat task?
this question is imprecise: for instance, is a random walk to a goal region a suboptimal solution path? In order to simplify the discussion,
any path that ends in the goal region will be regarded as a solution path:
Definition: suboptimality
A policy π of a combined task is called suboptimal if the following
holds:
$$\forall_{s \in S}:\quad V^*_{unconst}(s) > 0 \;\Rightarrow\; V^{\pi}(s) > 0$$
with $V^*_{unconst}$ the optimal value function of the unconstrained flat task
(note that the MDP must be deterministic in this formulation). In
other words, a policy of the combined task is called suboptimal when
it is able to bring the controlled system into the goal region from any
state from which the optimal flat, unconstrained policy is able to do
so, instead of terminating elsewhere (terminal state, infinite loop).
2a. thus, the question is: when does a suboptimal policy of the combined
task exist?
A general answer to these questions falls into the field of control theory
and is beyond the scope of this thesis14 .
Instead, I will simply present the three possible cases with concrete examples:
a) unconstrained, flat optimality, b) suboptimality, and c) possibility of no
(suboptimal) solution path at all. For the third case, I will present a sufficient condition which a learning agent may detect during learning.
Case 1: unconstrained, flat optimality
there exists a sequential composition of the tasks Ti defining a solution
for goal conjunction which has flat, unconstrained optimality. This is
the case whenever there exists an optimal unconstrained, flat policy π ∗
for T such that
$$\forall_{s \in S}\ \exists_{j \in \{1,..,m\}\ \text{with}\ s \in I_j}:\quad \pi^*(s) = \pi_j^*(f_j(s)) \qquad (4.17)$$
with πj∗ the optimal (flat) policy of subtask Tj .
In this case, the constraints imposed by the subtask principle may reduce the learning complexity as discussed previously without compromising
the optimality of the policy with respect to the unconstrained flat task.
It is difficult to give sufficient conditions for this case.
A TBU example is given by T6 (φgoal = 0 ∧ φint = 0) with subtasks
T3 (φgoal = 0) and T5 (φint = 0) with optimal sequence T3 then T5 , see figure 4.11. The sequentializability in this example follows from the fact that
T5 (i.e. its policy and the value function) is independent of Φgoal , which is
the state space dimension in which the goal predicate of the other subtask
is defined.
A mountain car example is given in the modified mountain car task
T (pos = goal ∧ vel = 0) with subtasks T1 (pos = goal) and T2 (vel = 0), see
figure 4.12b. Note that T2 is not independent of pos. Thus, the condition
of the previous example is not necessary for case 1.
14 to my knowledge, there are no results in the reference literature (with reference to
[Meystel & Albus, 2002], a reference book in this field) reguarding the composability of
controllers for general goal predicates to a conjunctive form
Case 2: suboptimal, guaranteed solution path
there exists no optimal solution path but only a policy π for the sequential composition of tasks Tj defining a suboptimal solution for goal
conjunction.
This is the case when
$$\forall_{s \in S}:\quad V^*_{unconst}(s) > 0 \;\Rightarrow\; V^{\pi}(s) > 0$$
but there is some s ∈ S such that none of the actions optimal for some
subtask Tj in fj (s) is also optimal for the unconstrained flat task.
In this case, (4.17) does not hold. Therefore, the constraints imposed by
the subtask principle may reduce the learning complexity as discussed previously, compromising however the optimality of the policy with respect to
the unconstrained flat task.
A TBU example is given by T7 (|x| < δ1 ∧|y| < δ2 ) with subtasks T8 (|x| < δ1 )
and T9 (|y| < δ2 ), see figure 4.13.
Case 3: no suboptimal, guaranteed solution path
there exists some s ∈ S with $V^*_{unconst}(s) > 0$ such that for all policies π
of the combined task:
$$V^{\pi}(s) = 0$$
In this case, the subtask principle cannot be applied: any sequentialization
of the subtasks either ends in a terminal state outside of the goal region,
or runs forever within S without reaching a goal state.
A TBU example is given by the task T10 (φgoal = 0∧φext = 0) with subtasks
T3 (φgoal = 0) and T11 (φext = 0). In this example, the state space has been
extended by the angle φext between the trailer and the wall normal. See
figure 4.14. This example represents a case of divergence in the sense of the
following definition:
Figure 4.13: TBU-example of suboptimal task composition in the subtask
principle: task T7 (|x| < δ1 ∧ |y| < δ2 ) with subtasks T8 (|x| < δ1 ) and T9 (|y| < δ2 )
Definition: divergence type A of a subtask composition
Any improvement with respect to one subtask Tj (i.e. to the term pj
of the decomposed goal predicate)
$$V_j(s') > V_j(s) \quad \text{after} \quad s \xrightarrow{T_j} s'$$
causes a deterioration with respect to all of the other subtasks:
$$V_i(s') < V_i(s) \qquad \forall\, i \ne j$$
Or, more formally:
Let Vj be the optimal value function of subtask Tj , and define π ◦ πj to
be the application of the policy of task T followed by the application
of the policy of subtask Tj , and define $\pi^n := \pi \circ \pi^{n-1}$;
then a subtask composition is said to be divergent when the following
holds:
$$\exists_{c>0}\ \ \forall_{s \in S \,\text{with}\, \min_{j=1..m} V_j(s) > c}\ \ \forall_{\pi \text{ policy of } T}\ \ \forall_{n \ge 0}\ \forall_{1 \le j \le m}:$$
with $s'(k)$ the state reached from s under $\pi^n \circ \pi_j^{\,k}$ and
$l := \min\{\, k \ge 0 \mid V_j\big(f_j(s'(k))\big) \ge V_j\big(f_j(s)\big) \,\}$, it holds that
$$\min_{i=1..m,\, i \ne j} V_i\big(f_i(s'(l))\big) \;\le\; \min_{i=1..m,\, i \ne j} V_i\big(f_i(s)\big) \qquad (4.18)$$
A method could be defined for the learning agent to detect this type of
Figure 4.14: TBU example for divergence under the subtask principle:
task T10 (Φgoal = 0 ∧ Φext = 0) with subtasks T3 (Φgoal = 0) and T11 (Φext = 0)
divergence when it is learning.
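The thesis leaves such a detection method open. Purely as an illustration, one possible heuristic monitor in the spirit of definition (4.18) could record, for each completed activation of a subtask Tj , whether an improvement of Vj coincided with a decrease of the minimum of the other subtasks' value functions; all names are assumptions and this is not the author's method:

def divergence_monitor(history, v, f, threshold=0.9):
    """Heuristic check for divergence type A: fraction of subtask activations
    (s, j, s2) in which V_j improved while min over i != j of V_i deteriorated.
    v[i] is the learned value function of subtask T_i, f[i] its state mapping."""
    conflicts, improvements = 0, 0
    for (s, j, s2) in history:                  # one entry per completed activation
        if v[j](f[j](s2)) > v[j](f[j](s)):      # progress w.r.t. the activated subtask
            improvements += 1
            others_before = min(v[i](f[i](s)) for i in range(len(v)) if i != j)
            others_after = min(v[i](f[i](s2)) for i in range(len(v)) if i != j)
            if others_after < others_before:
                conflicts += 1
    return improvements > 0 and conflicts / improvements >= threshold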
However, case 3 might also take place when this type of divergence is absent,
and instead a monotonic increase of the minimum of the value functions
of the subtasks can be observed. An example is given by the same task as
T10 , with the difference that the truck can move only forward, both for the
combined and the subtasks.
The agent learning under the subtask principle knows in advance neither the optimal value function of the flat, unconstrained task, nor that
of the combined task. Thus, a criterion for detecting case 3 must be based
on the optimal value functions of the subtasks, known in advance to the
agent. The two examples given for case 3 indicate that it will be hard to
give a necessary condition on these value functions to guarantee suboptimality. For the present state of this work, the responsibility of defining a
subtask decomposition which leads to case 1 or 2 is completely left to the
designer of the learning agent.
4.4 The Perturbation Principle
Many systems can be characterized by equilibrium states which they cannot
leave for a long time without returning to them again. Physiological systems
based on the equilibrium of certain concentrations of chemical substances
(adrenalin for example), or muscle systems based on an equilibrium of the
extensor and flexor muscle (keeping the joint angle constant) are two examples. Imagine you wanted to control such a muscle system. One way could
be to control both muscles separately, and the corresponding state space
is composed of the joint angle and of the two muscle tensions. Another
way could be to control both muscles together in an equilibrium state. The
state space is composed of the joint angle and the tension of one muscle,
supposing that the other one is in a state of tension such that the joint angle
is stable (constant). A control action now consists in changing (”perturbing”) the tension of one of the muscles, after which the system will re-enter
equilibrium, having, as a result, changed the joint angle. This control has reduced the complexity of the controlled system by one dimension.
However, it requires a preexisting control, i.e. that of reaching from any
combination of the two muscle tensions the equilibrium (stable joint angle)
again. This is, in a simplified formulation, the idea of the ”Equilibrium
Point Hypothesis” in biomechanics [Feldman, 1974].
Recall from the introduction to hierarchical task architectures (chapter
4.1.1) that this idea refers to a task decomposition where the perturbing task (controlling the joint angle for some action) is on a semantically
higher level than the reequilibration task (establish muscle tensions for a
stable joint angle): the perturbing task in a certain sense ”abstracts” from
the fact that there are two muscles, and is focused on the joint angle and
on the tension as a whole, both factors that are important for manipulation
tasks of an arm or a hand, for example.
The basic idea behind the perturbation principle is that many control systems can be organized by levels which represent certain system equilibria.
One obvious advantage is state space reduction: the perturbing task has to
learn control only for states which are in equilibrium. Whenever we know
that a system cannot stay for a significant time out of a certain equilibrium,
we can introduce a control level which holds this equilibrium.
Note that the use of the word ”equilibrium” is just a metaphor:
in its most concrete meaning it doesn’t mean anything but a specific region
of state space which the system is guaranteed to stay inside when observed
at a certain level in this leveled task organization.
Another advantage is that this principle introduces task decomposition,
allowing a complex goal predicate p ∧ p′ (p′ defining the equilibrium states)
to be split and learned separately in its components.
Generally, the perturbation principle hierarchically combines two tasks T
and T′ such that the hierarchically higher one, say T , perturbs with its
actions the goal state of the lower one, T′ , in the direction of its own goal
(rewarded state region). This is accomplished by activating T′ after execution of any action a of T (for dur(a, s) timesteps) and, unlike the subtask
principle, leaving T′ active, and T interrupted, until T′ reaches again its
goal region. T′ then returns control to T , and only at this point, the
(perturbation-) action a of T is concluded and thus evaluated.
Any action in a perturbation hierarchy is composed of a perturbation part
executed by T and a reequilibration part executed by T′ .
Note that if the goal predicate of T′ is p′ , and if the goal predicate which
the perturbation of T tries to reach is p, then the goal predicate of T combined with T′ in the perturbation hierarchy has the structure:
p ∧ p′ .
See figure 4.15 which illustrates the basic idea with a TBU example.
Figure 4.15: basic idea of the perturbation principle with a TBU-example:
complex task T = T10 (Φext = 0 ∧ Φgoal = 0 ∧ Φint = 0), equilibration task T′ = T3 (Φgoal = 0 ∧ Φint = 0);
a perturbation phase (action a (∆steer > 0) repeated dur(s, a) times, from s to sk ) is followed by a
reequilibration phase under T′ (from sk to s′ )
Interpreting the goal of T′ as a kind of system equilibrium, T tries to
perturb this equilibrium in the direction of its own goal. This principle can
be extended over several hierarchic levels, in which case the reequilibration phase of a perturbation extends over several lower levels in a nested
reequilibration procedure. The general architecture for this kind will be
presented in section 4.4.4 together with its algorithm. Before that, I will
define more formally the perturbation principle in the next section, and in
section 4.4.2, I will discuss what kind of action duration is required for this
principle. Finally, in section 4.4.5, I will discuss the perturbation principle
regarding its benefits and when to apply it (i.e. what relationship between
perturbation and reequilibration task is required in order to successfully
apply the perturbation principle).
4.4.1 Definition
Given
a) a state maintaining task Teq (peq | S (eq) ): the equilibration task, with
state space S (eq) , equilibration predicate peq on S (eq) , and already
learned policy πeq such that for any state s ∈ S (eq) , the policy leads
to a goal state s′ (i.e. with peq (s′ ) TRUE) or to a terminal state
b) S: a state space, with
• a surjective mapping feq : S → S (eq)
• an equilibration predicate peq on S defined by:
peq (s) := peq (feq (s))
• Seq ⊂ S, the set of equilibrium states in S:
Seq := {s ∈ S | peq (s) is TRUE }
• Sterm ⊂ S, the set of terminal states in S
c) a goal seeking or state maintaining task Tpb (ppb | Seq ∪ Sterm ): the
perturbation task, with state space Seq ∪ Sterm ⊂ S, goal predicate
ppb on Seq ∪ Sterm , and perturbation action set A.
d) a policy π : Seq → A of Tpb
e) a running and a boundary reward function r(s, a, s′ ) and R(s, a, s′ )
for Tpb , defined on S and A such that
(type 1) r(s, a, s′ ) > 0 or R(s, a, s′ ) > 0 iff ppb (s) ∧ peq (s)
(type 2) r(s, a, s′ ) ≡ 0, and R(s, a, s′ ) > 0 iff ppb (s)
the perturbation principle defines the execution of the perturbation task
Tpb in the following way:
Definition of an action execution in the Perturbation Principle:
when the policy π of task Tpb has selected an action a in state s ∈ Seq , the
following sequence of actions will be executed, resulting for Tpb in a state transition to state s′ on the following trajectory of elementary time steps h:
$$s = s_0 \xrightarrow{a} s_1 \xrightarrow{a} \ldots \xrightarrow{a} s_k \quad \text{(perturbation, } dur(s,a) \text{ times } a\text{)}$$
$$s_k \xrightarrow{a_k} s_{k+1} \xrightarrow{a_{k+1}} \ldots \xrightarrow{a_{l-1}} s_l = s' \quad \text{(reequilibration)} \qquad (4.19)$$
with
k := dur(s, a), sj ∈ S, $a_k := \pi_{eq}(f_{eq}(s_k))$,
$a_{k+i} := \pi_{eq}(f_{eq}(s_{k+i}))$ or $a_{k+i} := a_{k+i-1}$ because of action repetition based on
action duration > 1,
s ∈ Seq , and s′ ∈ Seq ∪ Sterm ,
$l := \min\{\, m \ge k \mid s_m \in S_{eq} \cup S_{term} \,\}$,
and with the accumulated reward
$$\bar{r}(s, a, s') := \sum_{j=0..k-1} \gamma^j \cdot r(s_j, a, s_{j+1}) \;+\; \sum_{j=k..l-1} \gamma^j \cdot r(s_j, a_j, s_{j+1}) \qquad (4.20)$$
Note that
1. the equilibration task Teq is required to be a state maintaining task,
and not a goal seeking task. This seems to contradict the requirement
defined by (4.19), which lets the reequilibration terminate as soon as
the first equilibration state has been reached. Recalling the discussion of the difference between goal seeking and state maintaining tasks
in chapter 3, a goal seeking task will in any case find the shortest path
to an equilibrium state, which might not be true for a state maintaining task (see fig. 3.2). The reason for requiring the equilibration task
to be state maintaining consists in the fact that an equilibrium by
definition is a state region to be maintained. Therefore, the long term
reward for this state region, and not the greedy reward, is opportune
for reequilibration after perturbation.
2. the reward function is defined in e) on all states of S, while the policy
of Tpb is defined just on Seq . This is because an action execution may
last several time steps traversing nonequilibrium states, as indicated
in (4.19). However, positive reward (of type 1) is defined only on
equilibrium states Seq , since task Tpb (i.e. π) is supposed to learn the
optimal policy for the goal predicate ppb ∧ peq .
Although this does not conform to the basic definition of an MDP
as defined in chapter 2.2, it is obvious that the basic property of the
reward function is still guaranteed:
the expectation of the reward r̄(s, a, s′ ) (with s, s′ ∈ Seq ) does not
depend on previous state transitions.
3. a second type of reward (type 2) allows for modelling tasks under the
perturbation principle that may end in terminal goal states that are
not in equilibrium. In this case, the goal predicate of the combined
task Tpb will be ppb and not ppb ∧ peq . See example 4.2 below.
4. in most cases (in all of this thesis), the surjective mapping feq : S → S^(eq) will be a projection of the n-dimensional real-valued space S to an m-dimensional subspace S^(eq), m ≤ n. However, S^(eq) could also be, for example, some lower-dimensional space of features of S important to the subtask Teq (for feature-based state space reductions, see [Tsitsiklis & VanRoy, 1996]);
5. the length of the reequilibration phase is finite because of a)
6. the policy π of the perturbation task Tpb has no control over the total action duration, but only over the duration of the perturbation phase. The duration of the reequilibration phase, as part of the duration of the entire action, need not coincide with the action duration of πeq(feq(sk)) (in effect, it may be longer or shorter), since it depends only on the first time an equilibrium state is reached;
7. it is straightforward to extend this definition to the case where the
perturbation actions are subtask activations (Tpb under the subtask
principle): just replace in (4.19) and (4.20) the perturbation part
of the trajectory by the trajectory as defined in (4.9) for the subtask
principle. The same holds obviously for Teq - it could as well be under
the subtask principle.
8. there is no problem in combining the perturbation principle with the veto principle: the veto function is applied explicitly to the action selection of π; as far as the reequilibration part is concerned, Teq is required to have been learned already, and it can be assumed that it has been learned under the veto function as described in 4.2.
Example 4.2 (figure 4.16)
In the TBU domain, the predicate peq (s) := (Φ̇int = 0) represents an
important equilibrium for the truck navigation: a stable path on an
arclike curve. Consider the state maintaining subtask
T4 (Φ̇int = 0|(Φ̇int , Φint ))
as an equilibration task. Then, a task such as
T3 (Φgoal = 0|(Φ̇int , Φint , Φgoal )) (line up the trailer with the goal)
can be redefined under the perturbation principle in the following
way:
• Teq := T4 , with S (eq) := (Φ̇int , Φint )
• S := (Φ̇int , Φint , Φgoal )
• the projection feq : (Φ̇int , Φint , Φgoal ) → (Φ̇int , Φint )
• Seq := (0, Φint , Φgoal ), identified by (Φint , Φgoal )
• Tpb := T3 (Φgoal = 0|(Φint , Φgoal ))
• r(s, a, s') ≡ 0 and R(s, a, s') > 0 iff Φgoal(s') = 0
Note that the perturbation task uses a reward function of type 2.
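To illustrate how the ingredients of Example 4.2 instantiate the definition of section 4.4.1, the following sketch spells out feq, the lifted equilibration predicate peq, the goal predicate ppb and the type-2 boundary reward for a TBU state represented as a tuple (Φ̇int, Φint, Φgoal). The numeric tolerance EPS for testing equalities on the continuous variables and the reward magnitude 1.0 are my own assumptions, not values from the thesis.

EPS = 1e-3   # assumed tolerance for testing equalities of continuous variables

def f_eq(s):
    """Projection (dΦint/dt, Φint, Φgoal) -> (dΦint/dt, Φint) of Example 4.2."""
    phi_dot_int, phi_int, phi_goal = s
    return (phi_dot_int, phi_int)

def p_eq(s):
    """Equilibrium predicate of T4 lifted to S: dΦint/dt = 0 (within tolerance)."""
    phi_dot_int, _, _ = s
    return abs(phi_dot_int) < EPS

def p_pb(s):
    """Goal predicate of the perturbation task T3: Φgoal = 0 (within tolerance)."""
    _, _, phi_goal = s
    return abs(phi_goal) < EPS

def boundary_reward(s, a, s_next):
    """Type-2 reward of Example 4.2: r = 0, R(s, a, s') > 0 iff Φgoal(s') = 0.
    The reward magnitude 1.0 is an assumption."""
    return 1.0 if p_pb(s_next) else 0.0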
Figure 4.16: perturbation task T3(Φgoal = 0 | (Φint, Φgoal)) with equilibration task T4(Φ̇int = 0 | (Φ̇int, Φint)) (perturbation phase followed by reequilibration phase)
4.4.2 Q-learning under the Perturbation Principle
Given a)-e) of the last section, and given
g) a partitioning of Seq,
a Q-learning rule under the perturbation principle can be defined for task Tpb on the partitioning of Seq:
Q-learning rule of the Perturbation Principle
After a transition in S from s to s' under action a on the trajectory of elementary time steps h

$$
s = s_0 \xrightarrow{\;a\;} s_1 \xrightarrow{\;a\;} \cdots \xrightarrow{\;a\;} s_k
\;\xrightarrow{\;a_k\;} s_{k+1} \xrightarrow{\;a_{k+1}\;} \cdots \xrightarrow{\;a_{l-1}\;} s_l = s'
$$

(perturbation phase followed by reequilibration phase), with $k, l, s_j, a_i$ as in (4.19), and with $s \in S_{eq}$, and $s' \in S_{eq}$ where $S(s) \neq S(s')$, or $s'$ a terminal state, update the Q-function

$$
Q^{(n+1)}(S(s), a) := (1 - \alpha_n) \cdot Q^{(n)}(S(s), a)
+ \alpha_n \cdot \Big( \bar r(s, a, s') + \gamma^{\,l} \cdot \max_{a' \in \hat A(S(s'))} Q^{(n)}(S(s'), a') \Big)
\qquad (4.21)
$$

with the accumulated reward

$$
\bar r(s, a, s') := \sum_{j=0}^{k-1} \gamma^j \, r(s_j, a, s_{j+1})
\;+\; \sum_{j=k}^{l-1} \gamma^j \, r(s_j, a_j, s_{j+1})
$$
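A minimal sketch of the update (4.21) over a partitioning of Seq might look as follows. Here Q is a mapping keyed by (partition, action) pairs, partition_of maps a state to its partition S(s), admissible_actions returns Â(S(s')), and the composed transition (s', r̄, l) is assumed to come from an action execution as sketched after (4.20). The update is skipped for autotransitions (S(s') = S(s)); the handling of terminal successors (no successor value in the target) is a common convention and an assumption of mine.

def q_update_perturbation(Q, s, a, s_next, r_bar, l, alpha, gamma,
                          partition_of, admissible_actions, is_terminal):
    """Q-update (4.21) after one composed action from s in S_eq to s'.
    Q maps (partition, action) to the current Q-value (e.g. a defaultdict(float))."""
    p, p_next = partition_of(s), partition_of(s_next)
    if p_next == p and not is_terminal(s_next):
        return                                   # autotransition: no update
    if is_terminal(s_next):
        target = r_bar                           # assumed: no successor value at terminal states
    else:
        target = r_bar + (gamma ** l) * max(
            Q[(p_next, a2)] for a2 in admissible_actions(p_next))
    Q[(p, a)] = (1.0 - alpha) * Q[(p, a)] + alpha * target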
Note that the partitioning of the state space S^(eq) of the equilibration task Teq is not constrained in any way by the use of Teq in the perturbation hierarchy of Tpb. In effect, the only influence this partitioning has on the performance of Tpb is given by the optimality of πeq, which is defined independently of Tpb.
The perturbation principle defines an option:
As in the subtask principle, the sequence
$$
s = s_0,\; a,\; s_1,\; \dots,\; a,\; s_k,\; a_k,\; s_{k+1},\; \dots,\; a_{l-1},\; s_l = s'
$$
can be seen as the application of an option in state s, bringing the system to the next state s'. Thus, (4.21) can be interpreted as an SMDP-Q-learning for options to approximate the solution to the following

Bellman equation of the Perturbation Principle
$$
Q^*(s, a) = E\Big[\, \bar r(s, a, s') + \gamma^{\,k+m} \cdot \max_{a' \in \hat A(s')} Q^*(s', a') \,\Big],
\qquad s, s' \in S_{eq}
\qquad (4.22)
$$
where k is the action duration of the perturbation phase, m is the duration of the reequilibration phase, and the reward r̄(s, a, s') is the accumulated reward as defined in (4.20). The expectation is with respect to the duration m of the reequilibration phase and to the states sj on the trajectory from s to s'.
The approximation learned by (4.21) is twofold: first, it is a stochastic approximation, by Q-learning, of the solution of (4.22); second, it is an approximation by discretization using a partitioning. This second type of approximation introduces non-Markovian state transitions, as already discussed in chapter 2.4. Again, the known convergence results for Q-learning do not apply for this reason. The problem of state aliasing is therefore a key issue to be controlled for successfully applying the perturbation principle. It will be discussed in further detail in the next subsection.
4.4.3 Action durations and partitionings
Action durations
Recall from chapter 2.3.5 that we can define an action duration for a given partitioning either in a static way (2.23), taking the maximum number of perturbation action repetitions (ever observed) required to get just out of the partition S(s) - a duration that depends only on S(s), not on s ∈ S(s) -, or in a dynamic way (2.24), taking the minimum number of perturbation action repetitions needed to get out of S(s) when starting from s - thus depending on s ∈ S(s).
Now, since the action as defined in (4.19) does not finish after dur(s, a) repetitions, but may be followed by a reequilibration phase of non-constant duration (in effect, this duration depends on s ∈ S), a dynamic action duration is impossible in the perturbation principle. In effect, at the time of execution of a perturbation action, it is generally impossible to predict whether the system will have moved to a new partition S(s') ≠ S(s) after execution of the reequilibration task. For example, simply getting out of S(s) by the perturbation itself does not guarantee that the reequilibration phase will not eventually let the system return to a state s' ∈ S(s).
Consequently, we must use a static action duration for the perturbation principle.
Let Sj ⊂ Seq be any partition of the partitioning of Seq. Then define
$$
\mathrm{dur}(S_j, a) := \max_{s \in S_j} \; \min\{\, k > 0 \mid s_l \notin S_j \,\}
\qquad (4.23)
$$
with k, l and $s_l$ as defined in (4.19) (i.e. k the number of perturbation steps, $l - k \ge 0$ the number of reequilibration steps, and $s_l$ the state reached after l time steps starting from s).
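Because (4.23) is a maximum over the start states of a partition, the static duration can in practice only be estimated from observed trajectories. A minimal sketch under this assumption keeps a running maximum of the number k of perturbation steps needed to leave the start partition, per (partition, action) pair; the class layout is my own illustration, and the increment method mirrors the fallback used in algorithm 1 below when an autotransition is observed.

class StaticDuration:
    """Running estimate of the static action duration dur(S_j, a) of (4.23):
    the maximum, over observed start states s in S_j, of the minimal number
    of perturbation steps k needed to leave S_j."""
    def __init__(self, initial=1):
        self.initial = initial
        self.dur = {}                      # (partition, action) -> duration

    def __call__(self, partition, action):
        return self.dur.get((partition, action), self.initial)

    def observe(self, partition, action, k_to_leave):
        """Record the minimal k observed for one start state in `partition`."""
        key = (partition, action)
        self.dur[key] = max(self.dur.get(key, self.initial), k_to_leave)

    def increment(self, partition, action):
        """Fallback used when an autotransition is observed (cf. algorithm 1)."""
        key = (partition, action)
        self.dur[key] = self.dur.get(key, self.initial) + 1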
Partitioning
In the last section it was already mentioned that the partitionings of the state spaces Seq and S^(eq) of the two tasks Tpb and Teq are independent of each other, and that the partitioning of S^(eq) in particular has no other requirement than that of supporting the optimality of πeq.
But what about the partitioning of Seq?
In chapter 2.4, the role of aliasing in state partitionings was demonstrated. In the perturbation principle, aliasing (non-Markovian state transitions) has a generic aspect common to any Q-learning in discretized state spaces (see chapter 2.4), and a specific aspect due to the action definition (4.19): the number of steps needed for reequilibration depends on s ∈ S(s).
This means that in regions of Seq which are close to regions of the state space S^(eq) where the system dynamics of the reequilibration (task Teq(peq | S^(eq))) vary strongly, the state space Seq should have a higher resolution in its partitioning. This requires, once again, an adaptive approach to dynamic partitioning. Unfortunately, the adaptive partitioning defined by kd-Q-learning will work for the perturbation principle only in a very limited way (see chapter 5). This is because the action duration model for the perturbation principle may have problems with scalability: it does not behave incrementally (the trajectory of an action a with duration n need not be part of the trajectory of the same action a with duration n + 1).
4.4.4 A Multilevel Perturbation Architecture: Definition and Algorithms

Figure 4.17: a) perturbation hierarchy b) hierarchy of the TBU-example
The perturbation principle as defined in section 4.4.1 combines two tasks
in a hierarchical structure. In a complex task architecture, perturbation
may extend over several levels, where each level has its own equilibrium
predicate. There are basically two scenarios in which reequilibration at a
specific level takes place within a multilevel architecture:
• a reequilibration task may itself use perturbation to bring the system into an equilibrium state each time this equilibrium has been perturbed by a task at a higher level; it will therefore have to rely on a reequilibration task at a lower level, and so on. See figure 4.17a for the general architecture and figure 4.17b for a simplified view of a TBU-example (the full architecture is depicted in figure 4.19). In this scenario, equilibria grow incrementally, in the sense that an equilibrium implies all equilibria at lower levels.
• a task T activates a subtask T' at a lower level, and after each single activation of T', equilibration of the lower level must be achieved. After dur(s, T') activations of T', the equilibration of the level of task T must be achieved. See figure 4.18 for an illustration.
Figure 4.18: integration of the perturbation and the subtask principle
The text box below defines the general perturbation architecture. The learning algorithm, which is quite complex, is then presented in two parts.
Multilevel Perturbation Architecture: Definition
1. Tasks are structured by levels 0, ..., maxlevel, where each level i, i ≠ 0, represents an equilibrium of the system defined by a predicate $p_{eq}^{(i)}$ on the state space S of the complex (flat) task T. Level 0 does not represent any equilibrium and thus contains all tasks that do not use the perturbation principle as a perturbing task. Level maxlevel contains the perturbing task Tmax with the goal predicate of the complex task T: its policy seeks or maintains the goal states of the overall task. maxlevel may further contain subtasks of Tmax (see 6.).
2. The equilibrium of level i implies the equilibria of all lower levels j, j ≤ i. States at level i satisfy $\bigwedge_{0 < j \le i} p_{eq}^{(j)}(s)$.
3. Thus, each task Ti is situated at some level, denoted by Ti.level, and its state space can be identified with $S^{(i)} := \{\, s \in S \mid \bigwedge_{0 < j \le T_i.level} p_{eq}^{(j)}(s) \,\}$.
4. For each level i, 0 < i ≤ maxlevel, there is a unique equilibration task $T_{eq}^{(i)}$ (see 4.4.1.a), situated at level i − 1 and having goal predicate $p_{eq}^{(i)}$.
5. At each level i, i > 0, a perturbation action may be initiated only when all levels j, 0 < j ≤ i, are in equilibrium (level i is consistent): $\bigwedge_{0 < j \le i} p_{eq}^{(j)}$. Otherwise, level i is called inconsistent (see the sketch below). During action execution (if the duration is > 0), this is not required any more.
6. At each level i, 0 ≤ i < maxlevel, besides the equilibration task for the next higher level i + 1, there may be other tasks, which must be subtasks under the subtask principle.
7. The subtasks Tj activated by a composed task T' under the subtask principle cannot be situated at a level higher than that of T': Tj.level ≤ T'.level. A subtask Tj may use the perturbation principle (Tj.level > 0) or may not use it (Tj.level = 0). Note that when it uses it, and the level of the subtask is the same as the level of T', then the perturbation phase of T' does not need any reequilibration phase, since it will already have been executed in the subtask execution (as part of each single perturbation step!).
8. Learning occurs bottom-up, with the learning task Tmax at the highest level and all other tasks already learned (see the discussion in section 4.4.5).
9. All tasks at levels i, i > 0, are perturbing tasks. Thus, they must use static action durations as defined in (4.23).
10. The system is in an equilibration phase (for level j) whenever there is some level i with NOT $p_{eq}^{(i)}(s_t)$ and there is some active Tj, j ≥ i, which is not in its perturbation phase. Note that during the perturbation phase of Tmax, an equilibration phase may be active if Tmax uses the subtask principle, activating a subtask at some level k, k > 0.
11. During an equilibration phase for level k, any action of a task at level i, i < k, gets interrupted as soon as level k becomes consistent, even if the action duration of its perturbation phase is not yet exhausted.
12. A task can be active or in execution:
• in execution: in its own perturbation phase, or active at level 0
• active: in execution, or in its own equilibration phase
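Operationally, items 2 and 5 reduce to checking the conjunction of equilibrium predicates up to a given level, and the first inconsistent level used by algorithm 2 below is the smallest level whose predicate fails. A minimal sketch, assuming the level predicates are available as a mapping p_eq from level index (1..maxlevel) to a boolean function of the current state:

def level_consistent(s, p_eq, level):
    """Level `level` is consistent iff all equilibrium predicates of
    levels 1..level hold in state s (item 5 of the definition)."""
    return all(p_eq[j](s) for j in range(1, level + 1))

def first_inconsistent_level(s, p_eq, maxlevel):
    """Smallest level whose equilibrium predicate fails; by convention
    maxlevel + 1 counts as inconsistent (note 1 of algorithm 2)."""
    for j in range(1, maxlevel + 1):
        if not p_eq[j](s):
            return j
    return maxlevel + 1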
Algorithm 1: Q-learning under the perturbation principle

initialize partitioning of S for Tmax
initialize Q for Tmax
initialize α, the initial learning rate
repeat (for each learning episode):
    initialize s ∈ S
    dur ← 0
    while learning episode not finished:  {1   // one learning step of Tmax
        t ← 0,  r̄ ← 0,  s0 ← s
        repeat  {2   // one action of Tmax with dur perturbation steps
            repeat (for elementary action with step size h)  {3
                select task Ti for st                        // algorithm 2, see below
                if Ti ≠ Tmax
                    a' ← π̂Ti(st)                             // note 1
                else  {4
                    if dur = 0  {5                           // note 2
                        if t > MAXSTEPS
                            disable action a
                        select action a for Tmax in st       // for example ε-greedy
                        if t < MAXSTEPS
                            dur ← dur(a, S(st))
                        else
                            dur ← 1
                    }5
                    else
                        if action a vetoed
                            select another, not vetoed action a   // for example ε-greedy
                            dur ← 1
                    a' ← a
                    dur ← dur − 1
                }4
                while a' is a subtask T' (under the subtask principle)
                    activate T';  a' ← πT'(st)
                execute action a', observe next state st+1 and reward r
                r̄ ← γ · r̄ + r
                t ← t + 1
            }3 until (Ti = Tmax ∧ dur = 0 ∧ level maxlevel consistent) ∨ st terminal
        }2 until S(st) ≠ S(s0) ∨ st terminal
        if no autotransition and no veto since s0            // outer "repeat" didn't repeat
            Q(S(s0), a) ← (1 − α) · Q(S(s0), a) + α · ( r̄ + γ^t · max_{â ∈ A(S(st))} Q(S(st), â) )
            update α, if Q has changed
        else
            increment dur(S(s0), a)
        s ← st
    }1 end while
end repeat
note 1: π̂T denotes the policy of task T for each timestep t. It is given by πT(st) if at time t no action of T is active, and otherwise (duration of the action still > 0) by the still active action, i.e. π̂T(st) := π̂T(st−1).
note 2: the last action a of Tmax has finished completely: the perturbation (since dur = 0) and the equilibration (since Tmax has been selected, see algorithm 2).
Algorithm 2: task selection in the perturbation principle

fil ← first inconsistent level (from bottom)                 // note 1
if there is no active task                                   // note 2
    if fil ≤ maxlevel
        activate equilibration phase for maxlevel
        activate and select equilibration task Tj with Tj.level + 1 = fil
    else
        activate and select Tmax
else  {0
    level ← 0
    while level ≤ maxlevel  {1
        if there is an active task at level  {2
            Ti ← active task Tk with Tk.level = level and Tk hasn't activated
                 any other active task T' with T'.level = level          // note 3
            if action of Ti not finished AND NOT (equilibration phase active and fil > level + 1)
                // Ti hasn't finished its perturbation phase and hasn't recovered the level goal
                // in an equilibration phase: go on with the perturbation
                select Ti
            else  {3
                // Ti has finished its perturbation phase, or has recovered the level goal
                // in an equilibration phase, in which case Ti has finished anyway
                if fil ≤ level  {4
                    // Ti still in its equilibration phase (but perturbation phase finished)
                    if no active equilibration phase
                        activate equilibration phase for level
                    activate and select equilibration task Tj with Tj.level + 1 = fil
                }4
                else  {5
                    // Ti has finished its own equilibration phase
                    if equilibration phase active for a level m with m < fil
                        deactivate equilibration phase
                    if Ti has been activated as a subtask of another task Tj
                        deactivate Ti
                        level ← Tj.level
                    else  {6
                        if fil = level + 1
                            // level + 1 inconsistent, or level = maxlevel: continue (equilibration) on level
                            select Ti
                        else
                            deactivate Ti
                            level ← level + 1
                    }2}3}4}5}6
        else
            level ← level + 1
    }1
}0
note 1: 1 ≤ fil ≤ maxlevel + 1 (maxlevel + 1 is inconsistent by definition)
note 2: in the initialization phase of an episode - go bottom-up
note 3: there is exactly one such task at level, if there are active tasks at level
Figure 4.19 illustrates the final architecture used in the TBU-example. It consists of 6 subtasks at three levels. All three principles are applied. The results of this architecture will be presented and discussed in chapter 4.6. Note that the avoidance task defining the veto function is situated at level 0. The question of at which level to learn a veto function will be discussed in the next section.
Figure 4.19: complete hierarchy of the TBU-example
4.4.5 Discussion
1. Complexity Reduction
The perturbation principle aims at reducing the complexity of difficult control tasks. The advantage of this approach consists in the reduction of state-space complexity for the perturbing task T, in the sense that its state space S reduces to
$$
S_{eq} := \{\, s \in S \mid p_{eq}(s) \text{ is TRUE} \,\}
\qquad (4.24)
$$
with peq the goal predicate of the equilibration task Teq (or, in a multilevel perturbation architecture, the conjunction of all equilibrium predicates of the lower levels). This reduction proceeds incrementally in a perturbation architecture extending over several levels. In the TBU example (fig. 4.19), I used the most immediate reduction of state space, i.e. dropping particular state space dimensions along which the lower task keeps the system at some equilibrium point (or within a restricted interval, see subsection 4.4.5.4). However, more general use of (4.24) can be made. Just take the following example from the TBU domain:
$$
p_{eq}(x, y) := (y = |x|)
$$
Here, the truck has to navigate on one of the two lines running at an angle of 45° towards the goal point. How does state space reduction come about in cases like this, in which we cannot simply drop a state space dimension?
Distinguishing between memory resources and computational resources, the designer of a learning agent using the perturbation principle may ignore the memory demand and define a Q-function on the entire state space S of the complex task. The Q-learning algorithm under the perturbation principle will learn Q-values only for states s ∈ Seq, and that is all that is required by a policy for action execution as defined in (4.19). This brings about a reduction of computational complexity during learning. On the other hand, by accepting a greater memory demand when starting with the entire state space S, we can free the designer from the possibly difficult task of modelling the reduced state space Seq explicitly.
The same argument holds when applying an adaptive discretization of state space, as for example in kd-Q-learning, which will be presented in chapter 5: the enhancement of the discretization resolution will be restricted to an ε-environment of Seq. In this case, both computational and memory complexity are reduced by the perturbation principle.
2. What is the meaning of a level’s equilibrium?
The operational semantics of an equilibrium region of state space has been given by the definition of action execution using perturbation and reequilibration (section 4.4.1).
But how do we find equilibria when designing a learning agent?
Recall that "equilibrium" is just a metaphor: in its most concrete meaning, it denotes nothing but a specific region of state space which the system is guaranteed to stay inside when observed at a certain level of a leveled task organization. Nevertheless, the concept of equilibrium has to do with stability, which can be applied when designing a learning agent. Unlike the general meaning of a stable equilibrium in mechanics - a field of forces whose vector sum keeps a body at rest in a stable situation - in MDPs we are dealing with purely deliberative agents acting continuously. Keeping this difference in mind, stability can be interpreted and applied in two concrete ways for defining levels and their equilibria:
1. stability with respect to collisions:
given a particular collision type, define a system equilibrium to be a state region not containing collision states and for which there exists an (elementary) action that can be repeated forever, keeping the system within the equilibrium15, while for any state outside the equilibrium region, any action, when repeated, eventually brings the system to a collision of the given type.
2. more generally, stability with respect to a condition during movement through state space:
depending on the goal and environment types. In physical space, this could be rectilinear movement when space is not cluttered with obstacles (Φint = 0 in the TBU-example), or wall-following movement in a labyrinth (constant distance to a wall).
A stability of the first type is particularly useful for learning when it is clear from the system dynamics that, relative to the length of the overall solution path to (or within) the final goal region, the system cannot leave the equilibrium state region for a substantial time. This is obviously true for the condition Φ̇intern = 0 in the TBU-example. As will be further discussed in the section on instability of the perturbation principle, this requires that the dynamics of reequilibration be at a time scale of lower order than that of achieving the goal of the next higher level.

15 this may turn out to be hard to achieve in practice (discretized actions!). It is clear that this definition is of a purely theoretical nature.
3. Optimality
How does the perturbation principle influence the solution path compared to the optimal solution path of the flat, undecomposed task?
Recalling the concepts of hierarchical and recursive optimality from chapter 4.1.2, it is clear that the perturbation principle is recursively optimal: any subtask follows an optimal reequilibration path.
Is it also hierarchically optimal?
Clearly, the reequilibration phase, as the final part of any action in the perturbation principle, defines a hierarchical constraint on the flat policy. This can alternatively be viewed as a hierarchical constraint of either type 1 or type 3, as defined in 4.1.2: a constraint on action selection (follow the reequilibration task as the final part of any action), or a constraint on the solution path (all decision points must be equilibrium states). Since the basic idea of the perturbation principle is that of state space reduction, hierarchical optimality should be interpreted as being constrained on the solution path:

Hierarchical Constraint in the Perturbation Principle:
The perturbation principle is hierarchically constrained on the solution path (type 3): any perturbing action must begin in an equilibrium state.

Thus, the following question emerges:

Question: does Q-learning under the perturbation principle lead to hierarchically optimal policies? In other words: is the optimality of the equilibration task a necessary condition for hierarchical optimality?

In general, this question has to be answered negatively: optimality of the reequilibration is neither necessary nor sufficient for hierarchical optimality. Just take the example of figure 4.15 and consider a reequilibration path from sk which moves the truck on the shortest path to a goal state with Φext = 0 ∧ Φgoal = 0 ∧ Φint = 0. This path would not be optimal for the reequilibration task T', but it would be optimal for the composed task T.
However, it is reasonable to assume that there is no condition on the reequilibration policy that is sufficient for achieving hierarchical optimality for all perturbation tasks. In other words, any such condition will be context dependent (i.e. depend on the particular perturbing task Tpb). But this violates a basic requirement of the entire approach - that of incremental learning, in which the reequilibration task has to be learned independently of any perturbation task. With these considerations in mind, an optimal reequilibration (shortest path) can be justified, since it guarantees the shortest intervals between consecutive decision points, binding temporal generalization just to the choice of spatial generalization during state space discretization, as discussed in chapter 2.3.5.
Closely related to this discussion is the question whether the equilibration task should be a goal seeking or a state maintaining task. Since hierarchical optimality as defined above is not guaranteed in the perturbation principle, the discussion of this question cannot be reduced to optimality considerations. Now, since perturbation is the basic idea of the technique, it is quite intuitive to argue that maintenance of the equilibrium is not an objective in itself in this approach. Therefore, a goal seeking task is generally better suited than a state maintaining task for reequilibration, especially when considering again the advantage of the shortest intervals between consecutive decision points, which only goal seeking tasks guarantee.
A second issue that always has to be discussed in hierarchical task architectures is the relationship between the policy learned in the hierarchical composition and the optimal policy of the undecomposed, complex task (i.e. with flat optimality, see definition 4.1.c). Returning to the end of the discussion of subsection 4.4.5.2, reequilibration may become a "natural" part of the policy with flat optimality whenever the system dynamics are such that, relative to the length of the overall solution path to (or within) the final goal region, the system cannot leave the equilibrium state region for a substantial time without collision. This is obviously true for the condition Φ̇intern = 0 in the TBU-example, see figure 4.11a. Thus, whenever the task architecture is based on levels with equilibria defined by stability with respect to collisions as defined above, we can expect policies close to flat optimality.
4. Instability
Reequilibration may introduce instability during learning due to state aliasing, especially when it extends over several levels. A low ratio (≪ 1) between the length of the perturbation phase and that of the reequilibration phase within a single action execution leads to a loss of control (of the perturbation) by the learning agent. This is yet another argument, besides those discussed in the last two subsections, for defining levels such that the dynamics of reequilibration is at a time scale of lower order than that of achieving the goal of the next higher level.
In practice, such instability can be controlled by broadening the equilibrium region Seq in state space, thereby decreasing the mean length of the reequilibration phase. In particular, when $S \subset \mathbb{R}^n$ and $S_{eq} := \{\, s = (s_i)_{1 \le i \le n} \in S \mid s_j = const \,\}$ for some $1 \le j \le n$, then broaden $S_{eq}$ to $\{\, (s_i)_{1 \le i \le n} \mid const - \delta < s_j < const + \delta \,\}$.
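The broadening described here amounts to replacing the exact condition sj = const by a δ-band; as a one-line sketch (the dimension index j, the constant and δ are placeholders):

def broadened_p_eq(s, j, const, delta):
    """Broadened equilibrium predicate: const - delta < s_j < const + delta
    instead of the exact condition s_j = const."""
    return const - delta < s[j] < const + delta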
4.5 Implementation
The three principles have been implemented in a learning environment in the form of a library of C++ classes. This library offers
1. a framework for defining tasks in a hierarchical task architecture based on the three composition principles defined in this chapter.
The user has to define the total task by specifying
• the global state space in Rn, defining symbolic constants for the name of each dimension (see chapter 3 for the TBU-example) and a numerical range for each dimension. Note that the discretized state space must be compact. The region outside the compact discretized part is treated as one partition. The implementation takes as input an array of floating-point numbers. This can either be generated by a simulator, or it can be raw sensor data, or sensory data from a preprocessing unit, of a real system.
• the global action set, defined simply by the constant MAX_ACTION > 0. The agent encodes all actions as integer numbers from 1 to MAX_ACTION. The translation to a control signal is left to an external component (for example a simulator of the system, or an actuator controller of a real system)
• the termination conditions for collisions, goal regions, and terminal states with no reward, in the form of boolean functions on the current state
• the levels of a perturbation hierarchy, by MAX_LEVEL > 0 (levels are identified by integers from 0 - the lowest level, without equilibrium - to MAX_LEVEL) and by a level goal for each level in the form of a boolean function on the current state. The highest level has no level goal: the goal here is defined by its main task (which may have subtasks in the subtask principle).
Then, the user has to derive subclasses from the two superclasses Subclass and VSubclass (for learning a veto function), which are predefined in the framework. For each of these derived subclasses, the framework will instantiate a single object representing a subtask and connect it to an object of the class UtilityFunction (containing all of the logic of an approximation with a kd-trie), which is predefined in the framework. In a class for a concrete task, the user defines
• task type: goal seeking, state maintaining or avoidance task
• state space: projections, defined by a selection of dimensions of the total state space
• actions: either a selection from the total set of elementary actions, or subtasks (when the task uses the subtask principle)
• level: in the perturbation hierarchy
• reward function: a function on the current state returning the reward value for the previous step
• applicability condition: a boolean function on state space. This function, together with a priority, can be used to model several tasks (at the same level) to be activated in sequence, in a basic subsumption style ([Brooks86], as realized in the Obelix architecture [Mahadevan93]): any active task suppresses other active tasks with a lower priority at the same level.
2. a learning and execution environment for the agent composed of (sub)tasks as defined previously. The user has to edit a configuration file for the initialization of the learning task, defining
• the learning task: the id (number) of one of the tasks defined previously
• the discretization: minimum and maximum depth of the kd-trie for each dimension of the state space of the subtask
• the initial learning factor α0
• the decay rate of the learning factor: an integer CLF > 0 such that the learning factor is $\alpha_t := \alpha_0 \cdot \frac{CLF}{CLF + visit(s_t, t)}$, where visit(st, t) is the number of times state st has been visited by time t (see the sketch below)
• the reward decay rate 1 > γ > 0, if the task is not an avoidance task
• the action exploration, defined either by 1 ≥ ε ≥ 0 and a decay rate for an ε-greedy policy16, or by a temperature decay rate for a Boltzmann-like exploration distribution [see for example Sutton & Barto, 1998]
• the action duration type: either type 1 - adapt the action duration to the cell width, or type 2 - go on until a new state (partition) is reached. See chapter 2.3.5.

16 a policy that takes with probability 1 − ε the currently best action, and with probability ε a random action (uniform distribution)
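For concreteness, the learning-factor schedule and the ε-greedy exploration named in this configuration could be realized as in the following sketch. This is an illustration in Python, not the interface of the C++ framework; the per-state visit counter and the dictionary representation of Q are my own assumptions.

import random
from collections import defaultdict

class LearningRate:
    """alpha_t = alpha_0 * CLF / (CLF + visit(s_t, t)), with visit(s_t, t)
    the number of times state (partition) s_t has been visited so far."""
    def __init__(self, alpha0, clf):
        self.alpha0, self.clf = alpha0, clf
        self.visits = defaultdict(int)

    def __call__(self, state):
        self.visits[state] += 1
        return self.alpha0 * self.clf / (self.clf + self.visits[state])

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability 1 - epsilon take the currently best action,
    otherwise a uniformly random one (cf. footnote 16)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])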
The execution environment allows for defining various statistics (accumulated rewards, data of single state visits) and (periodically saved) numerical output for generating graphical representations of value functions, policies, and the dynamic state space partitionings. Two simulators have been implemented as well, one for the TBU and the other for the Mountain Car domain, and for both domains a graphical animation running offline on a trace optionally generated by the execution environment.
4.6 Results
The three principles have been tested for their efficiency in the TBU domain. All results presented in the sequel are based on the same testing setup:
• the learning run consists of a predefined number (specific for each experiment) of episodes, which start at a random position in state space and end either after a maximum number of steps (300) or in a terminal state. The learned Q-function is saved periodically during each learning run, 10 times per run.
• the evaluation run consists of 1000 episodes, each starting from a random starting state; the set of starting states, however, is fixed for all evaluation runs and thereby represents a benchmark set. Each evaluation run uses a Q-function saved previously during learning and applies the greedy policy on that Q-function, which is not changed any more during evaluation. The run generates a trace with two values for each episode: the length of the episode in elementary steps, and the accumulated, decayed reward (Jte from (2.3) in chapter 2.1).
• the sample space of independent learning runs consists of 50 independent initializations (using the system time) of the pseudo random number generator (I used Knuth's algorithm from "Numerical Recipes in C"). The mean value and standard deviation of the run length and of the accumulated decayed reward are taken over this sample of 50 independent learning runs.
In all experiments, I used action duration type 1 (static action duration).
4.6.1 The veto principle
In the first experiment, the system has to learn a veto function for avoiding
an inner blocking between cab and trailer: task T°1 as defined in chapter
3.1. In a learning run, the system uses a total exploration (random walk)
updating its q-function with no decay. The parameters of the subtask are
as follows:
Task Settings
level: 1
state space: (Φintern , Φ̇intern )
actions: all (7) actions
initial learning rate α0 : 0.7
learning rate decay factor CLF : 25
discount factor γ : 1.0
exploration rate ε: 1.0
kd-trie parameters:
maximum (minimum) depth for Φintern : 8 (3)
maximum (minimum) depth for Φ̇intern : 4 (2)
The performance has been measured by a task which lets the truck wander
by selecting the action with highest value of the veto function (least vetoed), counting the length in elementary steps of an evaluation run ending
either after 100 steps or after a collision (inner blocking). The result is
shown in figure 4.20.
Figure 4.20: result: learning a veto function (performance of avoidance task T°1, avoid inner blocking: run length in steps when wandering by the least vetoed action, plotted over episodes ×100)

The next experiment tested the learning of task T3, which has to align the
trailer with the goal point. The experiment compares the performance of
learning with and without the veto principle. The result is shown in figure
4.21. Performance is measured in the number of goal achievements.
Task Settings
level: 1
state space: (Φintern , Φ̇intern , Φgoal )
actions: 5 actions
initial learning rate α0 : 0.6
learning rate decay factor CLF : 30
discount factor γ : 0.95
exploration rate ε: 0.5
kd-trie parameters:
maximum (minimum) depth for Φintern : 3 (3)
maximum (minimum) depth for Φ̇intern : 3 (3)
maximum (minimum) depth for Φgoal : 4 (4)
Figure 4.21: result: learning under the veto principle (number of goal achievements of subtask T3, align trailer with goal, plotted over episodes ×500; compared are learning with the veto principle for inner blocking, without the veto principle with negative reward for inner blocking, and without the veto principle with no reward for inner blocking)
4.6.2 The perturbation principle
The perturbation principle has been applied to task T3 which has to align
the trailer with the goal point. It uses the perturbation principle with the
goal of level 1: Φ̇intern = 0. The corresponding task (T4 from chapter 3.1)
has been learned previously. The experiment compares the learning of this
task with and without the perturbation principle. Note that both versions
used the veto principle with the veto function of the last subsection. Performance is measured by the decayed reward.
Task Settings
level: 2
state space: (Φintern , Φgoal )
actions: 5 actions
initial learning rate α0 : 0.6
learning rate decay factor CLF : 30
discount factor γ : 0.95
exploration rate ε: 0.5
kd-trie parameters:
maximum (minimum) depth for Φintern : 3 (3)
maximum (minimum) depth for Φgoal : 4 (4)
without perturbation:
state space: (Φintern , Φ̇intern , Φgoal )
maximum (minimum) depth for Φ̇intern : 3 (3)
The result is shown in figure 4.22.
Figure 4.22: result: learning under the perturbation principle (decayed reward of subtask T3, align trailer with goal, with perturbation by subtask T4 versus without perturbation, plotted over episodes)
4.6.3 The subtask principle
The subtask principle has been applied to task T6 which has to align both
trailer and cab with the goal point. It uses the subtasks T3 (align trailer
with goal point) and T5 (Φintern = 0: run on a straight line), the latter being a state maintaining task. T6, too, is a state maintaining task. In order to evaluate the net effect of the subtask principle, it does not use the
nodes). This explains the long learning time (number of episodes). The
performance gain of the subtask principle is very significant. The result is
shown in figure 4.23.
17 Evaluation
of both principles in combination is presented in the next section
Task Settings
level: 1
state space: (Φintern , Φ̇intern , Φgoal )
actions: 2 actions
initial learning rate α0 : 0.6
learning rate decay factor CLF : 80
discount factor γ : 0.95
exploration rate ε: 0.5
kd-trie parameters:
maximum (minimum) depth for Φintern : 4 (4)
maximum (minimum) depth for Φ̇intern : 4 (4)
maximum (minimum) depth for Φgoal : 5 (5)
Figure 4.23: result: learning under the subtask principle (decayed reward of task T6, align cab and trailer with goal, with the subtasks T3 (align trailer with goal) and T5 (run on a straight line) versus without the subtask principle, plotted over episodes ×800)
4.6.4 All three principles together
The next experiment tested the task T6 which has to align both trailer and
cab with the goal point. It uses the subtasks T3 (align trailer with goal
point) and T5 (Φintern = 0: run on a straight line), and the perturbation
task T4 (Φ̇intern = 0). Thus, it combines the subtask and the perturbation
principle, and learns under the veto principle. The result is shown in figure
4.24.
Task Settings
level: 2
state space: (Φintern , Φgoal )
actions: 2 actions
initial learning rate α0 : 0.6
learning rate decay factor CLF : 20
discount factor γ : 0.95
exploration rate ε: 0.5
kd-trie parameters:
maximum (minimum) depth for Φintern : 4 (4)
maximum (minimum) depth for Φgoal : 5 (5)
Note that the version with subtask principle learns faster at the beginning,
but then gets outperformed by the version without the subtask principle.
This is due to the fact that the subtask principle constrains the combined
task by the previously learned subtasks which will not be (and in this experiment are not) optimal.
Figure 4.24: result: learning under the subtask and the perturbation principle (decayed reward of task T6, align cab and trailer with goal, with perturbation by task T4; with the subtask principle (subtasks T3 and T5) versus without the subtask principle, plotted over episodes ×20)
4.6.5 Complex TBU tasks
The two most complex tasks are the avoidance task
T°2("hits the wall" | (Φintern, Φ̇intern, Φextern, y)),
which avoids hitting the wall, and the final docking task
Tfinal(Φintern = 0 ∧ Φextern = 0 ∧ x = 0 ∧ y = 0 | (Φintern, Φ̇intern, Φextern, x, y)).
The avoidance task was learned under the perturbation principle, with the following parameters:
Task Settings
level: 2
state space: (Φintern , Φextern , y)
actions: 5 actions
initial learning rate α0 : 0.5
learning rate decay factor CLF : 25
discount factor γ : 1.0
exploration rate ε: 0.3
kd-trie parameters:
maximum (minimum) depth for Φintern : 3 (5)
maximum (minimum) depth for Φextern : 3 (6)
maximum (minimum) depth for y : 2 (3) (clipped to
the interval [0,120])
Figure 4.25a shows the avoidance behavior learned after 5000 episodes.
The final task was learned with the task architecture given in figure 4.19.
Its behavior is shown in figure 4.25b. The composition was tricky since
many parameters had to be fine-tuned in order to minimize instability of
the composed controller. This has been discussed in section 4.4.5.4.
Figure 4.25: a) wall avoidance after 5000 trials b) final docking task
Chapter 5
kd-Q-Learning
5.0.6 Aliasing problems
In chapter 2.4, the discussion on state aliasing and state generalization brought up the sensitivity of Q-learning to the particular state space discretization. Recall that the "optimal policy partitioning" was defined in chapter 2.4 as the coarsest partitioning that can represent the optimal policy of a given MDP. I gave an example (2.1) in which an agent learning on that partitioning is not able to learn the optimal policy. This is a clear hint that a partitioning which guarantees successful learning may not be as easily definable at the outset of learning as the reward function itself: even "a posteriori" knowledge of the optimal policy is insufficient for this definition. Since a simple problem definition is a key advantage of reinforcement learning, partitioning should be supported dynamically.
The same requirement of an adaptive, dynamic partitioning has come up in the context of hierarchic task composition in chapter 4. Hierarchic task composition defines (semantically) higher-level tasks with abstract actions (options) extended in time. It was shown in the discussion on complexity reduction in section 4.3.5.1 that abstract actions in the subtask principle may reduce complexity, but only when supported by an adaptive discretization (section 4.3.3.b). This is due to the fact that temporal abstraction shifts the decision problem of a frequent selection among many actions towards the decision problem of selecting the right point in time for a less frequent selection among fewer actions.
5.0.7 Dynamic state space discretization is required on all semantic action levels
Throughout the previous chapters, evidence of a strong relationship between abstraction in state space and abstraction in action space has come up. Abstraction in state space leads in several ways to a generalization of a policy: its simplest type is action repetition, leading to elementary actions with constant intensity and a dynamic duration adapted to the resolution of the spatial discretization. At a higher semantic level, options representing goal-oriented behavior again use dynamic action durations as a means of policy generalization, in a way that is different and specific for each of the three composition principles. However, with the only exception of the reequilibration subtask in the perturbation principle, the action duration always depends on the resolution of the discretization of state space. Having established this basic relationship, which is independent of the level of action semantics, support for a dynamic discretization of state space is required in all three composition principles:
• in the veto principle: the avoidance task requires an enhanced resolution near the discontinuity of its optimal value function (see 4.2.2), which is not known prior to learning. The vetoed task may need an enhanced resolution depending on the constraints of action selection imposed by the veto function, and thereby on the relationship between these constraints and the goal of the vetoed task (see 4.2.3.4), again not known prior to learning.
• in the subtask principle: the combined task needs an enhanced resolution in those parts of state space where a subtask switch is required of its policy. This need not be in the proximity of the combined goal region (see 4.3.3.b), and thus it is unknown prior to learning. See figure 4.11b for an illustration with the TBU task.
• in the perturbation principle: a high resolution of the discretization is required in those parts of state space in which aliasing (as defined in 2.4) results from strong variations of the value function of the reequilibration subtask (see 4.3.3). This, as any type of aliasing, is not known prior to learning.
5.0.8 Acceleration of learning
An important requirement for good reinforcement learners is a fast generation of exploitable control behavior, especially in cases where expensive
learning has to be carried out on a real system whose system dynamics is
unknown. RL techniques must hence improve on generating useful control
behavior early on, even though this may be suboptimal in the beginning.
In this respect, a key issue is the approximation of the value function V : S → R of a state (or of the quality function of a state-action pair, Q : S × Act → R). For fast learning of suboptimal behavior, the use we want to make of these functions ultimately determines the requirements on the approximation error:
a) it is sufficient for the value and quality functions to determine correctly the action policy π : S → Act defined by π(s) := argmaxa∈Act Q(s, a) (see (2.16)). "Correctly" means that it determines the optimal policy.
b) the decision points in state space for a policy need only lie at the borders of a policy change (change of action).
The first point tells us that we might achieve an optimal policy long before we approximate the function well with respect to some norm on the space of approximation functions. The second point is about temporal generalization. It tells us that optimal actions are constant over more or less vast regions of state space (especially when using only a small number of control actions) and that they can thus be temporally generalized over these regions, i.e. without consulting the action policy. Consequently, a quality function that generates good or even optimal behavior could be represented with low resolution in these areas of constant control, and it would need a finer spatial resolution only at the borders of these areas.
These two points indicate the direction in which an acceleration of the learning process can be sought. However, the approximation must be learned, and no precise state-value pairs are known at the beginning. The evaluation must be propagated through state space, which makes its approximation something like a dynamic process, and adequate means are required for a variable resolution of generalization in space (value function) and in time (action duration) that evolves during learning.
5.0.9 Approaches to adaptive state space discretization
Various strategies have been proposed for adapting the partitioning’s granularity to the specific problem at hand. Sutton proposes in [Sutton, 1996]
a Sarsa algorithm using a CMAC approach based on tilings, and presents
good results with this kind of generalization in state space. Other researchers have employed tree structures such as kd-tries [Moore, 1991 &
1994, Munos & Moore 2002, Vollbrecht, 1998]. These are particularly suitable for state splitting techniques used for enhancing dynamically the resolution in state space discretization.
A series of publications takes up temporal abstraction in action space again [McGovern & Sutton, 1998, Precup & Sutton & Singh, 1998] in order to accelerate policy learning by a temporal propagation deeper than that achieved by 1-step actions. Sutton et al. [Precup & Sutton & Singh 98] proposed the use of options as defined in chapter 2.3.5. However, their use in this approach is not coupled with generalization in state space. This is justified from a theoretical point of view, since the Markov property of the process model must not be violated if convergence is to be guaranteed.
The same holds for the MSA-Q-learning algorithm proposed by Schoknecht
and Riedmiller [Schoknecht & Riedmiller, 2002]. This approach follows an
idea similar to that used in kd-Q-learning: learn simultaneously at several
levels of temporal abstraction (so-called multistep actions). But it does not
combine it with spatial abstraction as does the kd-Q-learning. Dayan and
Hinton propose in [Dayan & Hinton, 1993] the Feudal RL with a hierarchy of local controllers in a corresponding hierarchic partition of (physical)
space. Controllers are responsible for specific cell transitions ordered by
the next higher level. To my knowledge, this approach is the first one that
combines generalization in state space with temporal generalization in action space. Pareigis employs in [Pareigis, 1998] local error estimates for an
adaptive refinement of grid and time in RL, showing that the ratio of time
to space generalization is crucial. His use of temporal action generalization
is very similar to the one presented here: actions are kept constant over
a variable time that depends on the local spatial generalization, and thus
follow the most simple temporal abstraction - that of repetition. Following
these arguments of variable state space resolution in tiling structures, and
of the relation between spatial and temporal generalization, I will show that
state splitting techniques (section 5.2), important for learning with variable
resolution, present however some drawbacks. The most important one concerns the loss of information during splitting with respect to the experience
of the learning agent that led to the decision to split a cell. Spatial and
temporal generalization get locally refined by each single cell splitting, but
the already learned information for that cell is an averaged information. It
is thus the same information that gets inherited by the newly created cells,
while the splitting decision is usually based on a variance in experienced
state transitions. I will show how the information that leads to a splitting
decision can be exploited better when state splitting is not viewed as a
creation of new states, but instead as a level descent in a kd-trie. In section
5.3, a learning algorithm will be presented which at the beginning learns
for each experienced state transition simultaneously on several hierarchic
levels representing different spatial generalizations. As learning proceeds, state transitions get increasingly refined by a descent in the kd-trie, scaling down both the spatial and the temporal generalization. In this way, while increasing the representational means within the learner, we can nevertheless considerably reduce the learning effort. The benefit of the approach will be demonstrated in section 5.4 by the results of learning two control tasks: the mountain car problem and the TBU task (defined in chapter 3). Some interesting features of the algorithm will be discussed in section 5.5, together with enhancements to be developed in future work.

Figure 5.1: state space partitioning with a kd-trie (a transition with action a and duration dur(a) leads from s ∈ S(s) to s' ∈ S(s'); each leaf of the kd-trie represents one partition)
5.1 State Splitting
Within the discretization framework presented in chapter 2.3, it is reasonable to use a non-uniform resolution represented by an unbalanced kd-trie. See figure 5.1 for an illustration. However, since nothing is known about the optimal value function and the optimal policy at the outset of learning, the partitioning has to be dynamic during learning. Several researchers have employed state splitting techniques [Moore, 1991 & 1994, Munos & Moore, 2002, Kaelbling, 1993]. Munos and Moore present in [Munos & Moore, 2002] a number of splitting criteria based
• on the value function: split in the direction of the greatest gradient, or of the greatest variance (with respect to the successor states of transitions);
• on the policy: disagreement between the policy defined by the solution of an MDP problem and the policy defined by the gradient of the value function as given by the Hamilton-Jacobi-Bellman equation;
• on a global criterion using a so-called influence of a state on the global value function.
All these approaches are model-based and require the solution of an MDP problem (for example by value iteration) before a splitting decision can be taken. Things become different, however, when we start without a predefined model of the system dynamics. The splitting rules could still be the same, but now we have to apply them during the learning process, and the problem is when to apply them. A concept of confidence in the current value of the cell (or of neighboring cells) to be split is required.
In the next section, I present some of my own state splitting rules (which are not model-based, i.e. they apply during learning of the policy) without going too much into detail. What is important is to convey the general idea of state splitting in order to motivate the learning algorithm to be presented in the next section, which improves on some major drawbacks of the state splitting approach.
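To fix ideas about the representation used in the following sections, here is a minimal sketch of a kd-trie over an axis-aligned box: leaves play the role of partitions S(s), inner nodes split one dimension of their cell at its midpoint, and splitting a leaf creates two son nodes. The concrete layout (midpoint splits, class and method names) is my own illustration and not the UtilityFunction class of the C++ framework described in section 4.5.

class KdTrieNode:
    """Node of a kd-trie over an axis-aligned box [lo, hi] in R^n.
    Leaves represent partitions S(s); inner nodes split one dimension
    of their cell at the midpoint."""
    def __init__(self, lo, hi, depth=0):
        self.lo, self.hi, self.depth = list(lo), list(hi), depth
        self.dim = None                # split dimension (None for a leaf)
        self.left = self.right = None

    def leaf(self, s):
        """Return the leaf (partition) containing state s."""
        node = self
        while node.dim is not None:
            mid = 0.5 * (node.lo[node.dim] + node.hi[node.dim])
            node = node.left if s[node.dim] < mid else node.right
        return node

    def split(self, dim):
        """Split this leaf along dimension dim, creating two son nodes."""
        assert self.dim is None
        mid = 0.5 * (self.lo[dim] + self.hi[dim])
        left_hi = self.hi[:];  left_hi[dim] = mid
        right_lo = self.lo[:]; right_lo[dim] = mid
        self.dim = dim
        self.left = KdTrieNode(self.lo, left_hi, self.depth + 1)
        self.right = KdTrieNode(right_lo, self.hi, self.depth + 1)

# usage: root = KdTrieNode([-1.0, -1.0], [1.0, 1.0]); root.split(0)
#        cell = root.leaf((0.3, -0.2))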
5.1.1 Some simple state splitting rules
Using a kd-trie as the discretization representation, a state splitting is implemented by the creation of two son nodes for the leaf node to be split. Starting from a minimal depth within the kd-trie, defined manually (together with a maximally reachable depth) by the designer of the learning agent, state splitting rules are defined depending on the type of task (see chapter 3 for the definition of the task types):
• Goal seeking tasks (defined by termination within a rewarded goal
region and no reward elsewhere: R > 0 and r ≡ 0):
a cell Sin with currently best action a gets split when it has not yet
reached its maximal level in the kd-trie, and when
a) a rewarded goal state is sometimes reached from within that cell,
but not often enough:
0 < P [r(s, a) > 0|s ∈ Sin ] < c1 ,
c1 < 1
(5.1)
b) or, after a transition to Sin+1 , and under a certain confidence
condition (see below) on the values V̂ (Sin ) := maxa ∈A Q̂(Sin , a ) and
kd-Q-Learning
139
mountain car problem:
value function
after 20000 trials: 58 steps to the top
(mean value on 2500 uniformly
distributed initial positions)
>0 16
velocity
14
12
10
>0
8
6
<0
top
<0
position
velocity
4
>0
2
0
<0
position
>0
Figure 5.2: kd-trie partitioning for the mountain car task
V̂ (Sin+1 ), the following condition is met:
V̂ (Sin+1 ) < c2 · V̂ (Sin )
(c2 < 1)
(5.2)
Although this rule is quite intuitive (in deterministic systems the best action should always lead to a better situation; otherwise the cell generalizes too much), it is less intuitive when to have confidence in the values of the two cells. This issue will be treated in more detail in the next section. A small code sketch of this check is given after this list.
This splitting rule has been applied to the mountain car problem, in which an underpowered car has to climb up a steep mountain road (see chap. 3.2). The result of the splitting technique for the mountain car problem is shown in figure 5.2 (with c_2 = 0.5). Note that the state splitting has occurred along the highest gradient of the value function. Note furthermore that the optimal value function is discontinuous near the left border, since any state near it with too negative a velocity necessarily ends up in a terminal state without reward. In that region, the state splitting has again occurred according to splitting rule (5.2).
• Avoidance tasks (defined by negative reward in the collision region of state space and 0 elsewhere: R < 0 and r ≡ 0 in (1)):
Starting from the minimum resolution, and with a Q-function initialized to the negative reward everywhere, a cell S_{i_n} gets split whenever the observed Q-samples

r(s_n, a_n) + V̂^{(n)}(i_{n+1})

defined for transitions to cells S_{i_{n+1}} under the currently (in S_{i_n}) best action a_n vary in their minimum and maximum value by more than c_3 · (−R), with c_3 a constant (I used 0.8 for the TBU task). This rule is based on the idea that in avoidance tasks the state space splits into two regions - one in which collision is inevitable and one in which collision can still be avoided by some action - and that the finest resolution should be at the border between them. See 4.2.3.3 for a discussion of this rule.
• State maintaining tasks (which are like goal seeking tasks but do not
terminate when entering into the rewarded region: r > 0 and R = 0
in (1)):
The rule for state splitting that is used for goal seeking tasks is also
applied here. In addition, the kd-trie gets smoothed such that any two
adjacent cells do not differ in their cell width for each dimension by
more than a factor of two. This requirement lets the agent navigate
smoothly into the desired region, which is important when the goal
region is narrow.
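As a concrete illustration of the goal-seeking check referenced above, the following minimal Python sketch shows how splitting rule (5.2) could be tested on a kd-trie cell. The node layout, the confident flag, and the helper names are hypothetical and not taken from the thesis implementation; the sketch only mirrors the rule as stated.

# Hypothetical sketch of splitting rule (5.2) for goal-seeking tasks.
class Node:
    def __init__(self, low, high, depth, max_depth):
        self.low, self.high = low, high            # cell bounds per dimension
        self.depth, self.max_depth = depth, max_depth
        self.value = 0.0                           # current value estimate of the cell
        self.confident = False                     # confidence flag (see section 5.1.2)
        self.children = None                       # (left, right) after a split

    def split(self, dim):
        """Create two son nodes by halving the cell along dimension dim."""
        mid = 0.5 * (self.low[dim] + self.high[dim])
        left_high = list(self.high); left_high[dim] = mid
        right_low = list(self.low);  right_low[dim] = mid
        self.children = (Node(self.low, left_high, self.depth + 1, self.max_depth),
                         Node(right_low, self.high, self.depth + 1, self.max_depth))

def should_split(cell, successor, c2=0.5):
    """Rule (5.2): under the currently best action, the successor cell's value
    should not fall below c2 times the cell's own value."""
    if cell.depth >= cell.max_depth:
        return False                               # maximal resolution reached
    if not (cell.confident and successor.confident):
        return False                               # no confidence yet in the values
    return successor.value < c2 * cell.value

With c_2 = 0.5, this corresponds to the setting used for the mountain car experiment of figure 5.2.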
5.1.2 State splitting and confidence in the value function
The state splitting rule (5.2) of the last section needs a precondition in order to be applied successfully: confidence in the estimated values of the two partitions of a state transition¹. A statistically sound basis for this confidence
cannot be expected to exist in early learning phases in which the stochastic
learning process is in a transient phase. The most promising feature for
estimating the confidence seems to be the variance of the value of a state.
Various experiments with variance estimators have brought negative results. The reason for this failure seems to be clear, especially when using a
discrete action set as in the two example domains: in general, any switching
from a suboptimal policy to a better one will take place after any number
¹ Note that having confidence or not in the value of a state is also at the basis of a fundamental problem of reinforcement learning: when should learning end?
of learning episodes not known in advance, resulting in a discontinuous, non-monotonic evolution of the variance of the value function in regions of state space that are likewise not known in advance. See figure 5.3 for an example state of the mountain car problem. Criteria for confidence estimation based on the variance of the value function therefore seem hard to apply.

Figure 5.3: trace of the learned value of state 460 (see figure 5.11) in the MC-problem (2000 trials). [Axes: value vs. number of trials.]
An alternative, heuristic approach for defining a confidence will be given
in the context of the kd-Q learning algorithm in the next section. It is
restricted to goal seeking tasks and is based on the idea that confidence
should propagate through state space much the same as the value function, starting from those terminal states in which the agent has received
(deterministic) reward, and which obviously must have a 100% confidence.
5.2 kd-Q-learning
State splitting techniques offer a way to employ adaptive generalization
with variable resolution for faster propagation of the experienced reward
through state space. However, they suffer generally from two drawbacks:
1. Collecting statistics for splitting rules takes a long time.
2. The learning agent uses its experience only for the splitting decision
and for the value of the node to be split, but not for the values of the
newly created nodes during splitting.
Although a splitting rule can be quite simple, like the rules presented in the last section, the problem is when to apply these rules. As I mentioned before, a concept of confidence in the current value of the state to be split (or of its successor states, or of both) is required, and it takes a long time to collect the corresponding statistics. The second drawback is of information-theoretical nature: the information inherited by the newly created cells during splitting will always be averaged information and will thus be the same for both cells. But the splitting decision in most splitting rules is based on information that measures a variance. Thus, information gets lost. The inherited information may basically be the following (see fig. 5.4):
[Figure 5.4 sketch: a cell S1 with experienced transitions st1 and st2 into cells S2 and S3 inherits averaged information, e.g. Q(S1, a) = E(r(s_t, a) + γ^{dur(a)} · V(S_j)), or a learned model of (averaged) state transitions; after the state splitting, the son cells S11 and S12 carry the same information.]
Figure 5.4: information inheritance during splitting
• the current Q-values of the cell S to be split, i.e. the estimator of

E[r(s, a) + γ^{dur(a,s)} · V(S') | s ∈ S]

with s ∈ S and s --a--> s' ∈ S'. This is, however, averaged information and will thus be identical for the two created cells.
• a learned model of state dynamics, in the form of estimators of the state transition probabilities or in the form of an approximation of the function f(s, a) of the system dynamics as defined in (2.1) of chapter 2. In the first case it is clearly averaged information. But even when we use a non-probabilistic model as in the second case (for details, see [Munos 98]), the approximation of f will be based on information averaged over experienced state transitions. In any case, the information inherited by the two created cells will be identical again. (A small sketch of this inheritance follows the list.)
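To make the information-loss argument concrete, here is a minimal, hypothetical Python sketch of what the inheritance of averaged information looks like when a cell is split; the dictionary-based Q-table is an assumption for illustration only.

# Hypothetical illustration: both children inherit the parent's *averaged*
# Q-values, so the split itself adds no information about where the value
# function differs inside the parent cell.
from copy import deepcopy

def split_cell(q_table, parent, children):
    """q_table maps cell-id -> {action: Q-value}; after the split, the two
    new cells start with identical copies of the parent's Q-values."""
    for child in children:
        q_table[child] = deepcopy(q_table[parent])
    del q_table[parent]

q = {"S1": {"a0": 0.8, "a1": 0.3}}
split_cell(q, "S1", ("S11", "S12"))
assert q["S11"] == q["S12"]    # identical averaged information in both cells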
5.2.1 Basic idea
The argument of information loss during state splitting is the starting point
for the new learning algorithm which I call kd-Q-learning. The basic idea
is the following (see fig. 5.5).
State splitting is now interpreted as a level descent in a kd-trie that represents maximum resolution in all parts of state space from the very beginning
of learning. However, learning begins at an intermediate level in the kd-trie
(level k − d in fig. 5.5) whose nodes represent a coarser resolution in state
space. A state transition executed at that level (with an action duration
corresponding to the resolution of this level) is then evaluated by updating
simultaneously the Q-functions on all levels equal to or below the higher
level. For this purpose, Q-values are kept at all nodes in the kd-trie, not
just at the leaf nodes. Now, actions at levels with a coarser resolution last
longer and can only be suboptimal since they are more constrained (by
being constant for their duration) than actions at lower levels. Thus, what
might eventally be required is a level descent which refines the temporal
generalization. The key idea in this approach is to view state splitting as
a level descent in a tree structure that already exists and gets used before
state splitting (i.e. level descent) actually occurs.
[Figure 5.5 sketch: state transitions executed at levels (k−d), (k−d+1), ..., k of the kd-trie. kd-Q learning learns simultaneously at several levels with different spatial resolutions but with a unique temporal resolution; the temporal resolution gets refined in a level descent process.]
Figure 5.5: kd-Q-learning: basic idea
5.2.2 The learning algorithm
For s ∈ S, let the hierarchic state of depth d in a kd-trie of depth k be defined as

ŝ := (ŝ_k, ŝ_{k−1}, ..., ŝ_{k−d})    (5.3)

where ŝ_{k−i} is the cell of state space corresponding to the (unique) node in the kd-trie at level (k − i) which contains the state s, ŝ_k corresponding to a leaf node.
Define the hierarchic Q-function in state ŝ as (see fig. 5.6)

Q̂(ŝ, a) := (Q(ŝ_k, a), Q(ŝ_{k−1}, a), ..., Q(ŝ_{k−d}, a))    (5.4)
Figure 5.6: kd-Q function (illustrated for d = 3, with components Q(ŝ_k, a), Q(ŝ_{k−1}, a), Q(ŝ_{k−2}, a), Q(ŝ_{k−3}, a))
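The following Python sketch shows one possible way of obtaining the hierarchic state (5.3) and the hierarchic Q-function (5.4) by walking down a kd-trie; the node layout, the per-node Q-dictionary, and the cyclic split dimension are assumptions for illustration, not the thesis implementation.

# Hypothetical kd-trie lookup of the hierarchic state (5.3) and the
# hierarchic Q-function (5.4): collect the nodes from level k-d down to
# the leaf at level k that contain the continuous state s.
def hierarchic_state(root, s, k, d):
    """Return the list [ŝ_{k-d}, ..., ŝ_k] (coarsest node first)."""
    node, level, path = root, 0, []
    while node is not None:
        if level >= k - d:
            path.append(node)          # node belongs to the hierarchic state
        if node.children is None or level == k:
            break                      # reached the leaf ŝ_k
        dim = level % len(s)           # assumed cyclic splitting order
        mid = 0.5 * (node.low[dim] + node.high[dim])
        node = node.children[0] if s[dim] < mid else node.children[1]
        level += 1
    return path

def hierarchic_q(path, a):
    """Q̂(ŝ, a): vector of Q-values of all nodes in the hierarchic state."""
    return [node.q[a] for node in path]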
With this definition, the algorithm is given in figure 5.7.
Q-learning is based on the Bellman equations (2.13) and (2.15), which define the relationship between the q-functions of the two successive states of a state transition under a given but arbitrary action: s --a--> s'. An extension of basic q-learning to hierarchic states with hierarchic q-functions requires
a) the selection of a level j for the starting state ŝ, thereby defining
– the action duration adapted to the resolution of ŝ_j
– the policy defined by Q(ŝ_j, ·)
b) the definition of the value of the (hierarchic) successor state ŝ' as needed in the q-learning rule
In order to additionally allow for a level descent in a), we furthermore need
c) a descent criterion deciding when to descend from level j to a deeper level j + l.
The choices made for a) to c) in kd-q-learning are defined in figure 5.8 and will be discussed in the rest of this chapter.
Figure 5.7: kd-Q learning algorithm
- choose the maximum depth k of the kd-trie and the depth k−d to start with; mark all partitions at depth k−d as "active"
- repeat (for each episode):
  - choose s ∈ S randomly
  - repeat (for each step in the episode):
    - let ŝ = (ŝ_k, ŝ_{k−1}, ..., ŝ_{k−d}) be the hierarchic state of s (ŝ_k is a leaf node)
    - j ← k − i such that ŝ_{k−i} is marked as "active"
    - select and execute action a in ŝ_j with duration dur(a, ŝ_j); observe the reward r and the successor state s' with hierarchic state ŝ' = (ŝ'_k, ŝ'_{k−1}, ..., ŝ'_{k−d}) such that ŝ_j ≠ ŝ'_j
    - V' ← get_value(ŝ', s): the value of the successor state ŝ' (function: see figure 5.8)
    - determine the hierarchic evaluation error δ̂ := (δ_k, δ_{k−1}, ..., δ_j) with δ_{k−i} ← r + γ^{dur(a)} · V' − Q(ŝ_{k−i}, a)
    - if δ̂ ≠ 0: update the hierarchic Q-function Q̂(ŝ, a): Q(ŝ_{k−i}, a) ← Q(ŝ_{k−i}, a) + α · δ_{k−i} for j ≤ k−i ≤ k
    - update the learning rate α
    - update the confidence of ŝ: update_state_confidence(ŝ, ŝ') (function: see figure 5.8)
    - check whether a level descent is needed in ŝ: det_active_state(ŝ) (function that marks some ŝ_{k−i} in ŝ as "active"; see figure 5.8)
    - s ← s' and ŝ ← ŝ'
  - until ŝ is terminal
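For illustration, a minimal Python sketch of the simultaneous update step of figure 5.7 is given below. The arguments (q, hstate, dur, v_succ) and their layout are hypothetical stand-ins for the thesis' data structures; the sketch only mirrors the error and update lines of the algorithm.

# Hypothetical sketch of the core kd-Q update of figure 5.7: after one
# transition executed at the active level, the Q-functions of all levels
# between the active level and the leaf level are updated with one target.
def kd_q_update(q, hstate, active_idx, a, r, v_succ, dur, gamma, alpha):
    """q[node][a] : Q-value table for every kd-trie node
    hstate        : hierarchic state (ŝ_k, ..., ŝ_{k-d}), leaf node first
    active_idx    : index of the active node within hstate
    v_succ        : value V' of the hierarchic successor state (figure 5.8)
    Returns the hierarchic evaluation error (δ_k, ..., δ_j)."""
    target = r + gamma ** dur * v_succ
    deltas = []
    for node in hstate[:active_idx + 1]:   # levels k down to the active level j
        delta = target - q[node][a]
        q[node][a] += alpha * delta        # same target, one update per level
        deltas.append(delta)
    return deltas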
Figure 5.8: functions used by the kd-Q learning algorithm
det_active_state(ŝ)
(detects a level descent in ŝ and, in that case, marks a higher level as "active")
- let ŝ = (ŝ_k, ŝ_{k−1}, ..., ŝ_{k−d}), and ŝ_{k−j} the partition marked as "active"
- if there is a policy disagreement at level k−j (policy disagreement: see definition (5.5)):
  - unmark ŝ_{k−j}
  - mark ŝ_{k−i} with i := max{ m < j | ŝ_{k−m} is confident and there is no policy disagreement at level k−m }

get_value(ŝ', s)
(returns the value of the successor state ŝ' after the transition from s)
- let ŝ' = (ŝ'_k, ŝ'_{k−1}, ..., ŝ'_{k−d}), and ŝ'_{k−j} the partition marked as "active"
- let i := max{ m ≤ j | s ∈ ŝ'_{k−m} }
- return max_{0 ≤ m ≤ i} { V(ŝ'_{k−m}) }

update_state_confidence(ŝ, ŝ')
(updates the confidence of all nodes in ŝ after the transition to ŝ')
- for all i with 0 ≤ i ≤ j, where ŝ_{k−j} is the partition marked as "active":
  - if s' is a rewarded terminal state:
    - successor_confidence ← 1.0
  - else:
    - successor_confidence ← max_{0 ≤ m ≤ l} { confidence(ŝ'_{k−m}) }, where l := max{ n ≤ j' | ŝ'_{k−n} ∩ ŝ_{k−i} ≠ ∅ } and ŝ'_{k−j'} is the partition of ŝ' marked as "active"
  - confidence(ŝ_{k−i}) ← (1 − β) · confidence(ŝ_{k−i}) + β · successor_confidence
The Level Descent Process
Any rule that is used in state splitting techniques may also be employed for level descent. The rule employed in kd-q-learning for level descent is based on a policy disagreement between the policies defined by the hierarchic q-function Q̂ of the hierarchic state ŝ:
Hierarchic Policy Disagreement
Given a distance d : A × A → [0, 1] between actions,
a hierarchic state ŝ := (ŝk , ŝk−1 , ..., ŝk−d ) with
hierarchic q-function Q̂(ŝ, a) := (Q(ŝk , a), Q(ŝk−1 , a), ..., Q(ŝk−d , a)),
ŝ is said to have a hierarchic policy disagreement at confident level j
(k ≥ j > k − d) whenever there exists a confident level i (k ≥ i > j)
such that
d(π(ŝ_j), π(ŝ_i)) > d_0    (0 < d_0 < 1 a constant)    (5.5)

where π(ŝ_j) := argmax_{a∈A(ŝ_j)} Q(ŝ_j, a) (and analogously π(ŝ_i))
The meaning of this definition is that the policies of two nodes in the hierarchic state, as defined by the q-functions, disagree "considerably" on the best action. "Considerably" is quantified by a threshold on a distance measure between actions.² Whenever a policy disagreement is detected after
an update of the q-function, a level descent occurs from level j to level i.
As with all state splitting rules, a measure of confidence is required, here
for the policy of a node in a hierarchic state. It will be defined in the section
after the next one.
Note that this rule for level descent is based on the policy, not on the value
function. Since, by definition, the optimal policy of an MDP (with finite
action set) is learned before the optimal value (or q-)function has been approximated accurately, this rule is preferable with respect to rules based
on the value function. Note further that this criterion is based only on the
starting state ŝ, not on both starting and successor state as in criterion
(5.2): instead of comparing values of successive states, it compares policies
² The value of d_0 used for producing the results reported in the results section is 0.5. In the two example domains of this thesis, the definition of a distance between actions comes up quite naturally: in the TBU task, actions are numbered from 0 to 8, starting from a steering action of -8 degrees up to +8 degrees in increments of 2 degrees each. In the mountain car problem, action 0 is backward acceleration, 1 is no acceleration, and 2 is forward acceleration. The distance is the absolute difference of the two action numbers, divided by the number of actions.
at different spatial resolutions. This requires simultaneous q-learning at
different levels in the kd-trie, the basic idea of kd-q-learning.
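A minimal Python sketch of the hierarchic policy disagreement check (5.5) is given below. The normalized index distance corresponds to the one described in footnote 2, while the data layout (per-node Q-dictionaries and a confidence flag per node) is an assumption for illustration.

# Hypothetical sketch of hierarchic policy disagreement (5.5): compare the
# greedy actions of confident nodes of the hierarchic state using a
# normalized distance on the discrete, ordered action indices.
def action_distance(a, b, n_actions):
    """Absolute difference of the action numbers, divided by the number of
    actions (the distance used for the TBU and mountain car tasks)."""
    return abs(a - b) / n_actions

def greedy_action(q_node):
    """π(ŝ_l): the action index with the highest Q-value in that node."""
    return max(q_node, key=q_node.get)

def policy_disagreement(q, hstate, confident, active_idx, n_actions, d0=0.5):
    """True if some confident node deeper than the active node prefers an
    action whose distance to the active node's greedy action exceeds d0."""
    pi_active = greedy_action(q[hstate[active_idx]])
    for node in hstate[:active_idx]:       # strictly deeper nodes, leaf first
        if not confident[node]:
            continue
        if action_distance(greedy_action(q[node]), pi_active, n_actions) > d0:
            return True
    return False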
The rationale behind this choice is that state partitions which spatially generalize too much have a policy change (i.e. two different best actions) of the optimal policy within the partition. Assuming that this optimal policy is the one learned at level k by plain (i.e. non-hierarchical) q-learning as defined in 2.3.3 by (2.22) (with k → ∞, the learned q-function converges to the optimal continuous q-function, which is the solution of the Bellman equation), the following question arises:
Question 5.1
Does policy disagreement generally occur in a partition ŝ_j at level j that generalizes too much? "Too much" generalization is meant in the sense that ŝ_j contains a partition ŝ_k at level k (> j) with a policy disagreement between the policy of level j and the optimal policy learned by plain q-learning at level k (i.e. with action durations adapted to level k).
The answer must be negative - a general guarantee for policy disagreement
to work cannot be given independently of the particular system dynamics
of the task. In fact, the problem is that policy disagreement is only based
on spatial refinement as given by hierarchical states, while temporal resolution, which determines the frequency of the decision points for changing
the policy, is uniform for all levels.
This may lead to a situation (see figure 5.9) in which the optimal policy
learned by plain q-learning at level k (k > j), which has both a finer spatial
and temporal resolution than level j, may result in a solution path within
a partition of level j which has a policy change within it. But this policy
change happens to be in a partition (the red one in figure 5.9) at level k
from which, when starting a transition in it at level j with an action that is
optimal at level k, the resulting trajectory is not optimal any more because
of the longer duration as imposed by low temporal resolution at level j.
The success of the policy disagreement criterion clearly depends on the starting level k−d and on the number and distribution of policy changes of the optimal policy within a partition at that level, and thus ultimately on the particular system dynamics.
[Figure 5.9 labels: the optimal path at level k (> j); a partition at level j and a partition at level k; an action at level j that would be optimal for two steps at level k but is not optimal because of the longer action duration; the optimal action at level k by hierarchic kd-q-learning vs. by plain q-learning.]
Figure 5.9: sketch of a situation in which policy disagreement doesn’t
work
Definition of the value of the (hierarchic) successor state in a state
transition
The definition of the value of the hierarchic successor state, as required by the q-learning rule, is an important one. The example given in figure 5.10
illustrates this importance.
In this example, assume that in the hierarchic state ŝ = (ŝ_i, ŝ_{i+1}, ...) the best action is to move to the right, without any policy disagreement for either of the hierarchic states ŝ and ŝ'' = (ŝ''_i, ŝ''_{i+1}, ...) (such that it is the best action also for the partitions ŝ_{i+1} and ŝ''_{i+1}). However, the system dynamics in ŝ_i is such that the successor state reached from ŝ_{i+1} has a value (V_2) much higher than the one (V_1) reached from state ŝ''_{i+1}. Since there is no policy disagreement in state ŝ, its level i is considered to be the correct resolution in kd-q-learning. But taking V(ŝ_i) as the successor value of transitions from ŝ_i, which is an average over all experienced transitions from both ŝ_{i+1} and ŝ''_{i+1}, the learning agent will never be able to distinguish in ŝ_i between the best policy in ŝ^o_{i+1}, which is to move upwards in order to reach ŝ_{i+1}, and the best policy in ŝ''_{i+1}, which is assumed here to be to move to the right.
[Figure 5.10 sketch: cells ŝ_i and ŝ'_i at level i with their sub-cells at level i+1 and the optimal actions; the successor values satisfy V_2 >> V_1.]
Figure 5.10: the problem of finding the right level (here: i or i + 1?) for
successor value V (ŝ ): an example
This example makes clear that policy disagreement is not the right criterion for selecting the level in the successor state of a transition when determining the successor state's value. The example illustrates furthermore that a good choice for the value (certainly for states s ∈ ŝ_{i+1}) is given by the maximum of the values of all nodes in the hierarchic state:

V(ŝ') := max{V(ŝ'_{k−m}) | 0 ≤ m ≤ j}    (5.6)

where j ≤ d is the currently active level selected by the level descent function. Since the q-value of a partition is updated less frequently than those of its ancestor partitions in the kd-trie with lower resolution, the maximum value usually is at level k − j, unless a situation like the one in the example is encountered.
However, this criterion would choose in state ŝ'' the overestimated value of node ŝ''_i instead of ŝ''_{i+1}. In order to choose the value V(ŝ''_{i+1}), we would need to discover that

V(ŝ''_{i+1}) = max{V(ŝ''_{k−m}) | 0 ≤ m ≤ j} > V(ŝ''_i)

holds. Because this cannot be detected within the hierarchic successor state (linear complexity), but requires an analysis of a whole subtree, which may be computationally expensive (exponential complexity), I applied in kd-q-learning the criterion (5.6) in all cases. This should be sufficient for detecting, in a situation like the one of the example, that state ŝ has a policy disagreement, since V(ŝ') > V(ŝ'').
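A one-function Python sketch of criterion (5.6), under the same hypothetical data layout as in the earlier sketches:

# Hypothetical sketch of (5.6): the successor value is the maximum value
# over all nodes of the hierarchic successor state from the leaf up to the
# currently active level.
def successor_value(values, hsucc, active_idx):
    """values[node]: value estimate of a kd-trie node;
    hsucc: hierarchic successor state (ŝ'_k, ..., ŝ'_{k-d}), leaf node first."""
    return max(values[node] for node in hsucc[:active_idx + 1])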
Confidence in the value of a state
The problem of a sound definition of the confidence in the value of a state has been discussed in section 5.1.2. In kd-q-learning, I used for goal-seeking tasks a simple propagation mechanism which starts with 100% confidence (1.0) in rewarded terminal states (note that reward is assumed to be deterministic in the task domains of this thesis). Confidence then spreads from those states back through state space by updating the confidence of each node ŝ_i in a hierarchic state ŝ with the weighted mean of the current confidence of ŝ_i and the maximum confidence of the nodes of the successor state ŝ':

ŝ_i.confidence ← (1 − β) · ŝ_i.confidence + β · confidence(ŝ')

where confidence(ŝ') = 1.0 if s' is a rewarded terminal state, and otherwise confidence(ŝ') = max{ŝ'_l.confidence | k ≥ l ≥ k − j} with k − j the currently active level in ŝ'.
A state (i.e. a node in a hierarchic state) is then called confident if its confidence is above a predefined, global threshold value.
This definition of confidence makes no claim of being a statistically sound
definition of confidence. It simply gives, by means of the parameter β controlling the propagation speed of confidence, and by the threshold parameter, the possibility to control level descent on the basis of the complexity
and particular dynamic nature of the task to be learned. Future work will
have to find more sound definitions of confidence which, as has been discussed in the previous section, will be a key issue in adaptive discretization
techniques.
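As a sketch, the confidence propagation described above could look as follows in Python; the confidence dictionary, the terminal-state flag, and the index convention are hypothetical.

# Hypothetical sketch of confidence propagation: confidence spreads backwards
# from rewarded terminal states with propagation speed beta.
def update_confidence(confidence, hstate, hsucc, active_idx, succ_active_idx,
                      succ_is_rewarded_terminal, beta=0.2):
    """confidence[node] lies in [0, 1]; a node counts as confident once its
    confidence exceeds a global threshold (e.g. 0.1, cf. section 5.3.1)."""
    if succ_is_rewarded_terminal:
        succ_conf = 1.0
    else:
        succ_conf = max(confidence[n] for n in hsucc[:succ_active_idx + 1])
    for node in hstate[:active_idx + 1]:
        confidence[node] = (1.0 - beta) * confidence[node] + beta * succ_conf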
5.3 Results
The kd-Q-learning algorithm has been evaluated for the mountain car and
for the TBU problem, with a more thorough analysis of the mountain car.
The emphasis on the Mountain Car is due to its particular system dynamics
and because it has been studied intensively in the literature in the context
of state splitting techniques.
5.3.1 Mountain Car
The algorithm of figure 5.7 has been implemented with the following parameter settings:
- learning rate α: 0.5
- discount factor γ: 0.98
- exploration rate ε: 0.5
- maximum depth k: 10
- hierarchic depth d: 4 (i.e. the algorithm learns on levels 6 to 10)
- action duration type: static action duration (2.23)
- confidence propagation factor β: 0.2
- confidence threshold: 0.1
The learning episodes start randomly (uniform distribution) in the entire
state space, and end either in a terminal state, or after a maximum of 200
steps (of step size h which moves the trailer a 1/20 of its total length when
running straight forward).
Figure 5.11 shows the state space discretization at level 10 with the isolines
of the optimal value function from figure 5.2 (which was learned at level 10
with plain q-learning, after 20000 episodes). Most of the following results
of the analysis of the performance and of policy disagreement refer to state
115, in particular to state 1840: the red point in figure 5.11. General performance results refer to 10 states all with velocity 0: the blue points in
figure 5.11. They are listed in appendix A.
The performance results refer to a sample of 50 independent runs each
with 2000 learning episodes starting from uniformly distributed starting
states and ending either in a terminal state or after 200 steps. Results are
averaged over these 50 samples.
[Figure 5.11 panels: state 115 at level 6 in the kd-trie; its refinement at levels 7 and 8 (states 230, 231, 460-463); its refinement at levels 9 and 10 (states 920, 921, 926, 927, 1840-1843, 1852-1855); value-function isolines at 0-3, 3-6, 6-9, 9-12, and 12-15.]
Figure 5.11: discretized state space of the MC-problem (with isolines of the value function), with examined state 115
Figure 5.12 shows the performance (measured in number of steps to
the goal state, starting from state 1840), and the corresponding variance.
The figures compare the performance of kd-q-learning and of plain
q-learning on levels 6, 8 or 10 (each dimension being split 3, 4 or 5 times).
Figure 5.12: performance comparison for hierarchic and plain q-learning
[Panels: "performance in state 1840" (steps to goal vs. episodes, 200-2000) and "variance on the performance in state 1840" (variance vs. episodes), each with curves for k=10, d=0; k=8, d=0; k=6, d=0; k=10, d=4.]
Figure 5.13 shows the performance for the early learning phase, and the
corresponding variance.
Figure 5.13: performance comparison for hierarchic and plain q-learning:
early learning phase
[Panels: "performance: early learning phase" (steps to goal vs. episodes, 50-500) and "variance on performance: early learning phase", each with curves for k=10, d=0; k=8, d=0; k=6, d=0; k=10, d=4.]
Figures 5.14 and 5.15 show the level descent process in state 1840.
Figure 5.14: level descent process: a) after 200 episodes b) after 400
episodes
[Panels a) and b): kd-trie levels 6-10 over the state space.]
Figure 5.15: level descent process: a) after 800 episodes b) after 1200
episodes
[Panels a) and b): kd-trie levels 6-10 over the state space.]
The value function for the early learning phase is shown in figure 5.16.
Note that the value function as depicted here refers to the first consistent
state bottom-up. This usually is lower than the activated state during
learning, as shown in the previous figures.
Figure 5.16: value function in the early learning phase
[Panels: value function V (0-16) after 50, 100, 200, and 500 episodes.]
The process of policy disagreement is illustrated in figure 5.17, which shows the hierarchic q-function of state 1840. A policy disagreement occurs after about 1000 episodes: the best action changes from "forward" (to the right) to "backward".
Figure 5.17: hierarchic q-function for state 1840
[Panels: q-values of the actions forward, noacc, and backward over 2000 episodes for states 1840, 920, and 460.]
Figure 5.18: hierarchic q-function for state 1840: continued
[Panels: q-values of the actions forward, noacc, and backward over 2000 episodes for states 230 and 115.]
Finally, figure 5.19 shows a comparison of Sutton’s Sarsa algorithm using
a CMAC approach based on tilings [Sutton, 1996] and kd-Q(λ)-learning, a
variant of the algorithm of figure 5.7 using an eligibility trace. For details
of Q(λ)-learning, see [Sutton & Barto, 1998].
[Figure 5.19 legend: (k=·, d=·) denotes kd-Q(λ)-learning with α=0.6, ε=0.5, λ=0.9 and trie depth k, hierarchic depth d; CMAC uses the same α, ε, λ. Mean number of steps to the hill vs. number of steps (0-40000); curves for (k=5, d=0), (k=7, d=2), (k=7, d=0), and CMAC.]
Figure 5.19: performance of kd-Q(λ) versus Sutton’s CMAC algorithm
5.3.2 The TBU problem
The kd-q-learning algorithm has been applied to the TBU task in which the truck has to align cab and trailer with the goal point, without the final docking task, and in which it may only navigate backwards³. This task has been learned as a monolithic task, without decomposition into subtasks and without a veto task. It is a problem with a three-dimensional state space (φ_int, φ_goal, φ_steer in fig. 3.1). In a kd-trie with maximum depth 10 (1023 nodes), the influence of various choices of the hierarchic depth on the performance can be seen in figure 5.20. The performance is measured by the number of goal achievements from a uniformly distributed test set of 2500 starting states. Any trial ends either in a goal state, with an inner blocking, or after a maximum of 300 steps.
³ This is task T6 of chapter 3.1.
Interestingly, there is a distinct hierarchic depth (l = 4) from which on performance clearly accelerates. Furthermore, note that a kd-trie with too low a maximal resolution (k = 7) performs poorly in the end, although it starts well, since it generalizes too much for this specific task.
[Figure 5.20: number of goal achievements (out of 2500 start points) vs. number of learning trials (0-6000); curves for k=10 with l=0, 3, 4, 5 and for k=7 with l=0.]
Figure 5.20: influence of the hierarchic depth on performance in the TBU
problem
5.4 Discussion
5.4.1 Problems with the high variance of the performance
The kd-q-algorithm aims at accelerating the learning process in the early learning phase. I have shown that, in that phase, an acceleration can be achieved, but the variance of the resulting performance of the policy is very high, as can be seen from figures 5.12 and 5.13: during the first 500 trials of the Mountain Car problem, the variance is unacceptably high. I have explored a great variety of variants of the algorithm (not to speak of the huge number of parameter tunings), mostly concerning the level descent criteria, the confidence definition, and the definition of the value of the successor state. None of these modifications has yielded better results (i.e. lower variance together with an acceleration of learning). Thus, the kd-q-learning algorithm seems to follow the right basic idea (as can be seen from the results), but its development and analysis have to be considered inconclusive up to now.
5.4.2 Special features and possible extensions
The kd-Q-learning algorithm has a number of interesting features and potential extensions:
1. Monotonic increase of the value function
For the optimal value function V* := max_a Q* (see chap. 2), the following holds:

r(s, a^{(k−i)}) + γ^{dur(a^{(k−i)})} · V*(s') ≥ r(s, a^{(k−j)}) + γ^{dur(a^{(k−j)})} · V*(s'')    if i < j    (5.7)

with a^{(k−i)} (a^{(k−j)}) the best action of node ŝ_{k−i} (ŝ_{k−j}) in the hierarchic state ŝ, and s' (s'') the state reached under action a^{(k−i)} (a^{(k−j)}). This can easily be shown, since the actions at the higher level are more constrained than those at lower levels (dur(s, ŝ_{k−j}, a) ≥ dur(s, ŝ_{k−i}, a)). We can expect (5.7) to hold also for the approximation V̂ at level k−i. This means that, in case of convergence to a (suboptimal) solution, the learned Q-function increases towards the (sub)optimal Q-function as the algorithm descends to the leaf nodes. This, however, holds only in case of a guaranteed level descent. Possible problems related to this point have been discussed in section 5.2.2, question 5.1. Convergence to the optimal function Q* will generally hold only for k → ∞ [Munos 98].
2. Dimensioning of the tree depth k and the hierarchic depth d
The choice of k and d is important (see fig. 5.20). However, if k and d are chosen large enough, the algorithm can keep descending and use any type of stopping criterion (time, performance, policy agreement, or variance statistics). The purely exploiting controller will then define its policy by the Q-function at the lowest confident node.
If dimensioning at the outset of learning is difficult, the algorithm could start with an underestimated value for k and then dynamically allocate new leaf nodes one level deeper whenever the level descent has proceeded far enough.
3. Splitting order of state space dimensions
A problem in kd-Q-learning is the a priori choice of the order of the state space dimensions along which cells get split in the kd-trie, whereas state splitting techniques can decide dynamically on the splitting dimension. So far, I have used a cyclic splitting order (sketched below). I have not run dedicated experiments analysing the influence of this order on the performance.
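For completeness, the cyclic order used so far amounts to nothing more than the following; the helper name is hypothetical.

# Hypothetical sketch of the a priori cyclic splitting order: the dimension
# along which a kd-trie cell is halved depends only on the node's depth.
def split_dimension(depth, n_dims):
    """Cycle through the state space dimensions as the trie gets deeper."""
    return depth % n_dims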
Chapter 6
Conclusion
Reinforcement Learning is an elegant learning theory which has a sound mathematical basis. In its essential form, as a synthesis of many previously separate views on basically the same problem, it found its mature formulation in the second half of the nineties [Sutton & Barto, 1998, Bertsekas & Tsitsiklis, 1996]. This formulation established a solid reference basis for further research on topics regarding state representation, state abstraction, temporal abstraction, and in general scalability to complex learning problems, in order to gain acceptance in real-world application domains.
On the other hand, and independently of RL research, behavioral architectures have been developed and applied to real-world applications such as autonomous robots. The inspiration for these architectures was manifold and very creative, overcoming some dogmas of the classical approaches of AI (such as hierarchical planning and reasoning with a closed-world assumption). This creativity was certainly favored by the lack of a theoretical reference frame within which behavioral architectures had to be formulated. The result of this movement is, besides a number of interesting architectural approaches such as the subsumption architecture, a prevalently intuitive approach to the design of a behavioral architecture for a specific control problem.
6.1 Contributions
This thesis has developed a number of results, all starting from the idea that a behavioral architecture can be defined with sound principles of behavior composition when it is viewed in the first place as a learning problem, and not exclusively as an execution problem, as has mostly been done in the past. Since RL in its mature formulation has reached a high degree of abstraction for basic concepts such as state, action, and goal-directedness, within a homogeneous, simple theoretical framework, it seemed promising to take RL as the unique framework for learning. This approach allowed for putting the evaluation of architectures on a sound and transparent basis: optimality in the sense of optimality of an RL-learned solution to a control problem (i.e. maximization of a value function). This basic definition of optimality finds a coherent extension applicable to hierarchical behavior architectures, in the form of the three optimality types: hierarchical, recursive, and flat optimality (see chapter 4.1.2).
Besides that, the concept of hierarchy as an organizational principle has
driven the work of this thesis. Very much the same as in software architectures, hierarchy in behavioral architectures aims at modularization and is
based on top-down abstraction - high-level tasks abstract from operational
details which are delegated to low-level tasks -, and on bottom-up abstraction - low-level tasks are defined and learned independently of the context
they will be used in by high-level tasks. Again in analogy to software architectures, hierarchical and modular behavior architectures must be defined
by composition principles for these modules1 .
This thesis has presented three such principles, all of which are defined in
the context of learnability by RL, giving for each of them
• a specific Bellman equation in the continuous state space,
• an execution model for the composition, which for the second and third principle represents an SMDP (Semi Markov Decision Process),
• a Q-learning algorithm in discretized state space,
• a discussion of the specific requirements on state and temporal abstraction.
In particular, these principles have been identified as
1. the Veto Principle:
an avoidance task gives rise to a veto function which may veto the
execution, on behalf of the learning task, of an action in a specific
state, in order to avoid a future state representing a severe violation of
a system constraint (such as a collision during navigation of a robot).
It has been shown that a task learning under the veto principle, is
equivalent to another task having the same goal and no veto function,
¹ In object-oriented software architectures, for example, these principles are message passing, inheritance, polymorphism, object aggregation and functional delegation.
but possibly having a different discretization of state space. Tasks learning under the veto principle are hierarchically optimal. They benefit from a case of genuine modularization: a concern common to many different tasks - that of avoiding the violation of basic system constraints - is factored out.
2. the Subtask Principle:
a task learns by delegating detailed operations to already learned subtasks. Each such activation is treated as an execution of one action
of the learning task. Since this thesis aims at control problems in
continuous state spaces not restricted to physical space, the eventual
dependence between the goals of several subtasks requires particular
attention. This has been solved, unlike the common approach to task
composition in RL in which the activated subtask remains active until
achievement of its subgoal, by letting the activated subtask perform
just one action the length of which gets adapted to the state space
discretization of the learning task.
It has been shown that the subtask principle results in a reduction
of task complexity in the sense of reduction of the number of actions
and/or of the number of action changes.
The analysis of the question of optimality of the subtask principle has revealed that it is obviously hierarchically, but not recursively, optimal, and that the most important question, namely that of conditions for generating suboptimal (i.e. any) solutions (and thus for the applicability of this principle), cannot be answered in general. Instead, some sufficient conditions have been formulated for the three cases of flat optimality, of mere suboptimality, and of divergence (no solution found by the learning task).
3. the Perturbation Principle:
the learning task learns to "perturb" with its actions a kind of equilibrium state towards its own goal states. The re-equilibration after a perturbation is accomplished by already learned tasks at lower levels. This results in a substantial state space reduction for the learning task, since it learns an evaluation, and thus a policy, only for equilibrium states. This principle is inspired by the so-called Equilibrium Point Hypothesis of biomechanics [Feldman, 1974].
The entire process of perturbation and re-equilibration can be seen as an option, which is a theoretically sound abstraction of actions in RL [Precup & Sutton & Singh, 1998], see chapter 2.3.5. An informal interpretation of what an equilibrium represents has been given, with the main requirement that the re-equilibration dynamics must run on
a different (shorter) time scale than that of the perturbing task, in
order to avoid instability of the solutions.
All three principles have been given a common architectural framework in
which they can be combined with each other. A corresponding Q-learning
algorithm has been presented. It has been tested for the TBU problem
(backing up a trailer truck), and it has been shown that there are substantial savings of learning time.
Besides this central part of the thesis, the first part has presented
4. the theoretical basis for RL in discretized continuous state spaces:
starting from the theory of Dynamic Programming [Bellman 1957,
Bertsekas 1987], and from the formulation of the general optimization
task in continuous control problems, a learning rule (Q-learning) has
been formulated in a general (finite) state space partitioning.
5. the fundamental relationship between spatial and temporal abstraction:
spatial abstraction, as defined by a partitioning, and temporal abstraction, as defined by the duration of an action (of either constant
value or of a predefined, previously learned policy) have been discussed with respect to their relation to each other in the context of
aliasing and generalization problems for the learned policy. Although
a general convergence proof for RL in discretized state spaces is impossible since the Markov property is violated, it has been argued
that the Markov property of state transitions generally is too strong
a requirement for convergence, which explains why RL methods quite
often behave better than guaranteed by their theory.
Finally, the work of this thesis has brought up a strong need for an adaptive discretization technique² during learning, since little is known about the required resolution of the discretization prior to learning; this remains true, and might even become a key issue, for the subtask principle when learning a task composition. For this purpose, I defined in the third part of this thesis the so-called
6. kd-Q-learning:
a Q-learning variant on a state space discretization represented by a kd-trie. In such a kd-trie, a node representing a cell of state space has two son nodes representing the two cells that result from splitting the father cell along a hyperplane which cuts it into two halves.
² which has been recognized by some other researchers too as a fundamental area of research in RL [Moore 1994, Munos & Moore 2002]
Based on
hierarchical abstraction of state space, a new learning algorithm has
been presented and tested which learns simultaneously at different
levels in the kd-trie representing different resolutions of state space
discretization. A hierarchical Q-function is defined for a generalized
hierarchical state represented by a vector of ancestor nodes of a leaf
node of the kd-trie. It has been shown that the resulting policy adapts
during learning to different levels of resolution depending on a policy
disagreement within a hierarchical state. In this way, the learning
task learns to minimize aliasing problems not known prior to learning.
This learning method improves on state splitting techniques which
do not exploit efficiently the experience of the learner that leads to a
splitting decision.
The main advantage of this technique consists in acceleration of the
very first learning phase. This, however, brings about the yet unsolved
problem of high variance in the results of kd-Q-learning. This is due to
the lack of a sound criterion of confidence in the policy disagreement
during the first learning phase in which the stochastic process is in a
transient phase.
Nevertheless, positive results on the application of kd-Q-learning to the Mountain Car task (an underpowered car has to climb up a steep mountain road, for which it requires kinetic energy) and to the TBU task support the hope that this algorithm moves in the right direction for shortening learning times.
6.2 Future Work
A number of issues of this thesis remain open (only partially answered),
and require further research:
• kd-Q-learning: improvement of the level descent criterion:
the algorithm for adaptive partitioning has to be made more robust with respect to the problem of the high variance of the learned
behavior. A more thorough analysis of the alternatives for a criterion
of level descent in the kd-trie has to be done. This should be carried
out possibly with more examples than the two used in this thesis.
The Mountain Car task remains a difficult benchmark problem, but
a broader range of examples would better structure the research on
this topic.
• a formal framework for complexity measures of learning algorithms in
hierarchical architectures
In 4.3.5, I gave some informal definitions of such complexity measures for the learning algorithm of the subtask principle. As I mentioned there, little to nothing has been done in this direction in the RL research community. This might be due to the fact that the general RL
problem (optimize a value function in a finite state and action space
with any kind of system dynamics) is too general to be described by
a useful complexity or even just problem space measure. However,
things might become different when more structure of the learning
problem is predefined by a hierarchical learning architecture, especially when recurring to already learned subtasks in an incremental
learning process. I limited the discussion on this issue to informal
definitions since I believe that further restrictions (on system dynamics in first place) have to be defined in advance in order to arrive at
useful complexity measures.
• convergence proofs of the algorithms:
further research has to be done for defining conditions under which
convergence of the algorithms (composition principles and kd-Q-learning) can be proven. This includes a precise formulation of limit
cases (such as δS(s) → 0) which certainly will be necessary for any
kind of convergence theorem. The reference work for future research
in this direction is the work of Munos (for example [Munos 1998]).
• acceleration of learning time by employing model-based approaches:
the algorithms presented are completely model-free (except for the
predefined task decomposition). This is justified by the objective of
this thesis to study the basic composition principles in a modular, hierarchical architecture for plain RL. However, the information of the
experience during learning could be exploited better by the learning
agent by learning a model of state dynamics, and then performing
updates with the Q-learning rule on hypothetical state transitions.
This would accelerate the value iteration of Dynamic Programming.
An early example has been Sutton’s Dyna-Q [Sutton 1991].
A promising recent approach for such an extension can be found in
Munos’ Finite Difference Reinforcement Learning [Munos 1998]: in
equation (2.1) of chapter 2, the function f modelling the system dynamics can be approximated by finite differences in state space observed during learning. Based on the Hamilton-Jacobi-Bellman differential equation and its reinterpretation as an MDP, Munos defined
a model-based RL algorithm and proved it to converge in the limit
case δS(s) → 0 (he used triangulations for discretization). It should
be possible to integrate this approach with the one presented in this
thesis in order to extend the model-free algorithms towards partly
model-based ones as proposed by Sutton’s Dyna-Q.
• the inverse learning problem: learning a task decomposition
the hierarchical learning architecture presented in this thesis is concerned with the composition of already learned tasks and requires a
preliminary manual decomposition of the control problem. A machine
learning approach for this activity seems to be a hard problem, and
has been tackled only by a small number of researchers [for example Thrun & Schwartz, 1995]. There are two approaches that would
integrate into the work of this thesis which has defined composition
principles:
– symbolic decomposition of a complex reward function:
the agent learns to group the terms of a complex reward predicate to define subtasks, and to find which of the three composition principles apply for these subtasks
– subsymbolic analysis of the structure of a reward predicate:
the agent collects reward information in a preliminary exploration phase and analyses this set of reward points (for example
with a Principal Component Analysis) in order to decompose the
reward function into its components. This could be examined in
a first step for complex avoidance tasks.
Acknowledgment: This work has been supported by the German Science Foundation (CRC 527: Integration of Symbolic and Subsymbolic Information Processing in Adaptive Sensorimotor Systems).
Bibliography
[1] Agre, P.E. & Chapman, D. (1990): What are plans for? Robotics and
Autonomous Systems 6: 17-34
[2] Barto, A.G. & Sutton, R.S. & Anderson, C.W. (1983): Neuronlike Elements that can solve difficult learning control problems. IEEE Transactions on System, man and Cybernetics 13: 835-846
[3] Bellman, R.E. (1957): Dynamic Programming. Princeton University
Press, Princeton
[4] Bertsekas, D.P. (1987): Dynamic Programming. Prentice-Hall, N.J.
[5] Bertsekas, D.P. (1995): Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA
[6] Bertsekas, D.P. & Tsitsiklis, J.N. (1996): Neuro-Dynamic Programming. Athena Scientific, Belmont, MA
[7] Bradtke, S. J. & Duff, M. O. (1995): Reinforcement learning methods
for continuous time Markov decision problems. Advances in Neural Information Processing Systems 7: Proceedings of the 1994 Conference
Denver, Colorado. MIT Press.
[8] Brooks, R. A. (1986): A robust layered control system for a mobile
robot. IEEE Journal of Robotics and Automation, 2, 14–23.
[9] Caironi, P.V.C. (1997): Gradient-Based Reinforcement Learning: Learning Combinations of Control Policies. Tech. Rep. 97.50, Politecnico di Milano
[10] Dayan, P. & Hinton, G.E. (1993): Feudal Reinforcement Learning, in:
Advances in Neural Information Processing Systems 5, pp.271-278. San
Mateo, CA
[11] Dietterich, T. G. (1997): Hierarchical reinforcement learning with the
MAXQ value function decomposition. Tech. rep., Department of Computer Science, Oregon State University, Corvallis, Oregon.
[12] Dietterich, T.G. (2000a): Hierarchical reinforcement learning with the
MAXQ value function decomposition, Journal of Artificial Intelligence
Research, 13, S. 227-303
[13] Dietterich, T. G. (2000b): State abstraction in MAXQ hierarchical
reinforcement learning. In Advances in Neural Information Processing
Systems, 12. S. A. Solla, T. K. Leen, and K.-R. Muller (eds.), 994-1000,
MIT Press
[14] Feldman, A.G. (1974): Control of the Length of a Muscle. Biophysics
19, pp.766-771
[15] Friedman, J.H. & Bentley, J.L. & Finkel, R.A. (1977): An Algorithm for Finding Best Matches in Logarithmic Expected Time. In ACM
Trans. on Mathematical Software, 3(3): pp.209-226
[16] Ghavamzadeh, M. & Mahadevan, S. (2001): Continuous-time Hierarchical Reinforcement Learning, Eighteenth International Conference on
Machine Learning (ICML), Williams College, Massachusetts
[17] Gullapalli, V. (1992): Reinforcement Learning and its Application to
Control. PhD. Thesis, University of Massachusetts, Amherst, MA01003
[18] Hackbusch, W. & Trottenberg, U. (1982): Multigrid Methods I+II,
Springer Verlag, Berlin
[19] Hernandez, N. & Mahadevan, S. (2000): Hierarchical Memory-based Reinforcement Learning, Fifteenth International Conference on Neural Information Processing Systems, Denver
[20] Huber, M. & Grupen, R.A. (1997): Learning to Coordinate Controllers - Reinforcement Learning on a Control Basis. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI-97, San Francisco, CA, Morgan Kaufmann
[21] Kaelbling, L.P. (1993): Learning in Embedded Systems. MIT Press,
Cambridge MA.
[22] Kaelbling, L. P. & Littman, M. L. & Moore, A. W. (1996): Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4,
237–285
[23] Kalmar, Z. & Szepesvari, Cs. & Lorincz, A. (1998): Module-Based
Reinforcement Learning: Experiments with a Real Robot, in Machine
Learning 31, pg.55- 85, Kluwer Academic Publishers
[24] Koenig, S. & Liu, Y. (2000): Representations of Decision-Theoretic
Planning Tasks, Proceedings of the International Conference on AI Planning Systems, 187-195
[25] Kushner, H.J. & Dupuis, P. (1992): Numerical Methods for Stochastic
Control in Continuous Time. Springer Verlag, N.Y.
[26] Mahadevan, S. & Connell, J. (1992): Automatic Programming of Behavior Based Robots using Reinforcement Learning. Artificial Intelligence 55(2-3)
[27] Makar, R. & Mahadevan, S. & Ghavamzadeh, M. (2001): Hierarchical
Multi-Agent Reinforcement Learning, Fifth International Conference on
Autonomous Agents, Montreal
[28] McGovern, A. & Barto, A. G. (2001): Automatic Discovery of Subgoals
in Reinforcement Learning using Diverse Density. Proceedings of the
18th International Conference on Machine Learning, pages 361-368.
[29] McGovern, A. & Sutton, R.S. (1998): Macro-Actions in Reinforcement
Learning: An Empirical Analysis, Amherst Technical Report Number
98-70, University of Massachusetts
[30] Meystel, A.M. & Albus, J.S. (2002): Intelligent Systems. Wiley Series
in Intelligent systems. John Wiley & Sons.
[31] Moore, A.W. (1991): Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-Valued State-Spaces. In Birnbaum, L., Collins, G. (eds.), Machine Learning: Proceedings of the Eighth International Workshop, Morgan Kaufmann.
[32] Moore, A.W. (1994): The Parti-Game Algorithm for Variable Resolution RL in Multidimensional State Spaces. In Advances in Neural
Information Processing Systems 6, pp.711-718. Morgan Kaufman, San
Francisco
[33] Moore, A.W. & Baird, L. & Kaelbling, L.P. (1999): Multi-Value-Functions: Efficient Automatic Action Hierarchies for Multiple Goal MDPs, International Joint Conference on Artificial Intelligence (IJCAI-99).
[34] Munos, R. (1998): A General Convergence Method for Reinforcement
Learning in the Continuous Case. Proceedings of the European Conference on Machine Learning ECML’98.
[35] Munos, R. & Moore A.W. (2002): Variable Resolution Discretization
in Optimal Control. Machine Learning 49, Nr 2/3, Kluwer Academic
Publishers
[36] Nguyen, D. & Widrow, B. (1991): The Truck Backer-Upper: An Example of Self-Learning in Neural Networks. In: W.T. Miller, R.S. Sutton,
P.J. Werbos (eds.): Neural Networks for Control. MIT Press, Cambridge
1991
[37] Pareigis, S. (1998): Adaptive Choice of Grid and Time in Reinforcement Learning, in Neural Information Processing Systems: Proceedings
of the 1997 Conference,pp.1036-1042. MIT Press, Cambridge, Ma
[38] Parr, P. & Russell, S. (1998): Reinforcement Learning with Hierarchies
of Machines, in: Advances Neural Information Processing Systems: Proceedings of the 1997 Conference. MIT Press, Cambridge,MA
[39] Precup, D. & Sutton, R.S. (1998): Multi-time Models for Temporally
Abstract Planning. Advances in Neural Information Processing Systems:
Proceedings of the 1997 Conference. MIT Press, Cambridge, Ma
[40] Precup, D. & Sutton, R.S. & Singh, S.P. (1998): Theoretical Results on Reinforcement Learning with Temporally Abstract Actions, in
Proceedings of the Tenth European Conference on Machine Learning
ECML98,pp.382-393. Springer Verlag
[41] Reynolds, S.I. (2000): Adaptive Resolution Model-Free Reinforcement Learning: Decision Boundary Partitioning Generating Hierarchical
Structure in Reinforcement Learning from State Variables, International
Conference of Machine Learning, ICML2000, Stanford University
[42] Rohanimanesh, K. & Mahadevan, S. (2001): Decision-Theoretic Planning with Concurrent Temporally Extended Actions, in Proceedings of
the Seventeenth Conference on Uncertainty in Artificial Intelligence ,
August 3-5, 2001
[43] Ryan, M.R.K & Pendrith, M.D. (1998): RL-TOPs: An Architecture
for Modularity and Re-Use in Reinforcement Learning, The Fifteenth
International Conference on Machine Learning, ICML98 Madison, Wisconsin, July 1998
[44] Ryan, M.R.K. & Reid, M.D. (2000): Learning to Fly: An Application
of Hierarchical Reinforcement Learning The Seventeenth International
Conference on Machine Learning, ICML2000 San Francisco, California,
June 2000
[45] Schoknecht, R. & Riedmiller, M. (2002): Speeding Up Reinforcement
Learning with Multi-Step Actions. Proceedings of the ICANN 2002, Lecture Notes of Computer Science, 2415, p. 813-818
[46] Singh, S.P. (1993): Learning to solve Markovian Decision Processes.
Ph.D. Thesis, University of Massachusetts, Amherst, CMPSCI Technical
Report 93-77
[47] Sutton, R.S. (1991): Dyna, an Integrated Architecture for Learning,
Planning, and Reacting. SIGART Bulletin, 2:160-163, ACM Press
[48] Sutton, R.S. (1996): Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, in Advances in Neural
Information Processing Systems: Proceedings of the 1995 Conference,
pp.1038-1044. MIT Press, Cambridge, Ma
[49] Sutton, R.S. & Barto, A.G. (1998): Reinforcement Learning. An Introduction. MIT Press, Cambridge, Ma
[50] Sutton, R.S. & Precup, D. & Singh, S. (1999): Between MDPs and
semi-MDPs: A Framework for Temporal Abstraction in Reinforcement
Learning. Artificial Intelligence 112:181-211.
[51] Thrun, S. & Schwartz, A. (1995): Finding Structure in Reinforcement
Learning in Advances in Neural Information Processing Systems, Volume 7, MIT Press, Cambridge, Ma
[52] Tsitsiklis, J.N. & van Roy, B. (1996): Feature-Based Methods for Large
Scale Dynamic Programming. In Machine Learning, 22, pp. 59-94
[53] Tyrrel, T. (1993): The Use of Hierarchies for Action Selection. Adaptive Behavior, Vol. 1, Nr. 4, pp. 387-420
[54] Vollbrecht, H. (1998): Three Principles of Hierarchical Task Composition in Reinforcement Learning, in Proceedings of the ICANN98,
pp.1121-1126 Springer Verlag
[55] Vollbrecht, H. (1999): kd-Q-learning with Hierarchical Generalization
in Action- and State Space, Ulm, SFB 527 Technical Report 1999/8
[56] Vollbrecht, H. (2000): Hierarchic function approximation in kdQ-learning. In Proceedings of the Knowledge Engineering Systems
KES2000, Brighton, Great Britain
[57] Watkins, C.J.C.H. (1989): Learning from Delayed Rewards. PhD thesis, Kings College, Cambridge, UK
[58] Whitehead, S.D. & Karlsson, J. & Tenenberg, J. (1992): Learning
Multiple Goal Behavior via Task Decomposition and Dynamic Policy
Merging. Robot Learning (J. Connell, S. Mahadevan, eds.), Kluwer Academic Press.