KATHOLIEKE UNIVERSITEIT LEUVEN
FACULTEIT TOEGEPASTE WETENSCHAPPEN
DEPARTEMENT COMPUTERWETENSCHAPPEN
Celestijnenlaan 200A – 3001 Leuven (Heverlee)

RELATIONAL REINFORCEMENT LEARNING

Promotoren (supervisors):
Prof. Dr. L. DE RAEDT
Prof. Dr. ir. M. BRUYNOOGHE

Dissertation presented in partial fulfilment of the requirements for the degree of Doctor in Applied Sciences

by
Kurt DRIESSENS

May 2004
KATHOLIEKE UNIVERSITEIT LEUVEN
FACULTEIT TOEGEPASTE WETENSCHAPPEN
DEPARTEMENT COMPUTERWETENSCHAPPEN
Celestijnenlaan 200A – 3001 Leuven (Heverlee)

RELATIONAL REINFORCEMENT LEARNING

Jury:
Prof. Dr. ir. J. Berlamont, chair
Prof. Dr. L. De Raedt, promotor
Prof. Dr. ir. M. Bruynooghe, promotor
Prof. Dr. D. De Schreye
Prof. Dr. ir. E. Steegmans
Prof. Dr. S. Džeroski (Institut Jožef Stefan, Ljubljana, Slovenia)
Prof. Dr. P. Tadepalli (Oregon State University, Corvallis, Oregon, USA)

Dissertation presented in partial fulfilment of the requirements for the degree of Doctor in Applied Sciences

by
Kurt DRIESSENS

U.D.C. 681.3*I2
May 2004
© Katholieke Universiteit Leuven – Faculteit Toegepaste Wetenschappen
Arenbergkasteel, B-3001 Heverlee (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden door middel van druk, fotocopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the publisher.

D/2004/7515/39
ISBN 90-5682-500-3
Relational Reinforcement Learning
Reinforcement learning is a subtopic of machine learning that is concerned with software systems that learn to behave through interaction with their environment and receive only feedback on the quality of their current behavior instead of a set of correct (and possibly incorrect) learning examples. Although reinforcement learning algorithms have been studied extensively in a propositional setting, their usefulness in complex problems is limited by their inability to incorporate relational information about the environment.

In this work, a first relational reinforcement learning (or RRL) system is presented. This RRL system combines Q-learning with the representational power of relational learning by using relational representations for states and actions and by employing a relational regression algorithm to approximate the Q-values generated through a standard Q-learning algorithm. The use of relational representations permits the use of structural information, such as the existence of objects and relations between objects, in the description of the resulting policy (through the learned Q-function approximation).

Three incremental relational regression techniques are developed that can be used in the RRL system. These techniques consist of an incremental relational regression tree algorithm, a relational version of instance based regression with several example selection mechanisms, and an algorithm based on Gaussian processes that uses graph kernels as a covariance function. The capabilities of the RRL approach and the performance of the three regression algorithms are evaluated empirically using the blocks world with a number of different goals and the computer games Digger and Tetris.

To further increase the applicability of relational reinforcement learning, two techniques are introduced that allow the integration of background knowledge into the RRL system. It is shown how guided exploration can improve performance in environments with sparse rewards, and a new hierarchical reinforcement learning method is presented that can be used for concurrent goals.
Preface
I'm showing off the Digger demo to my dad.
me: "So it walks around and remembers what it encounters."
The demo pauses while tg goes through the learning examples.
me: "And then it thinks about what it has done."
dad: "But computers can't think."
me: "Euhm ... it's mathematics ... it's basically the same ..."
When I started my academic career at the Catholic University of Leuven, I situated myself between two research groups, "Machine Learning" and "Distributed Systems". I started my work on a subject that shared interests with both groups, i.e., the RoboCup challenge. After two years of working (or playing) with this subject, and being invited to give all kinds of presentations on it together with my partner in crime Nico Jacobs¹, it became clear that the RoboCup environment at that time demanded too much technical work to enable a team of 1 or 2 researchers to do actual machine learning research on it.

At that time, I decided to abandon the RoboCup community and focus my research on relational reinforcement learning, at the same time abandoning my connection to the Distributed Systems research group. The appeal of this topic for me lay partially in its connection with psychology and human and animal learning. Reinforcement learning research on computers allows for the investigation of pure learning mechanisms, and excludes almost all possible side-effects such as prejudice. Unfortunately, it also excludes actual comprehension of the learning task. It's cheaper than research on humans though.

What did survive from my RoboCup days was the urge to see the system I was building actually learn something. This drove the development of the RRL system with its different regression algorithms and the many experiments that were performed. Instead of building a theoretical framework for relational reinforcement learning, I was driven to demonstrate its capabilities and tried to apply the RRL system to appealing applications.

Imagine my surprise when it all turned out to be mathematics after all ...

¹ We even made national TV!
“Come on Kurt, what is required to make RRL
applicable in the industry?”
(Luc De Raedt)
Most PhD students thank their promotors in their preface, but there are usually very good reasons for this. I need to thank Luc De Raedt for coming up with this great topic. I think that not every PhD student can benefit from being handed a topic that made such an impact in the research community. I must admit that I was a little worried when Luc told me that he would be leaving the department to go work in Freiburg, but in retrospect I have to thank him for making that decision as well. It has led to great working visits to a beautiful city and initiated contact with new people with whom it was a pleasure to work together. I will never forget the last three meetings Luc and I had on the text of this dissertation, which took place in a plane flying from Washington to Brussels, in a pub in the Black Forest in Germany (after midnight) and in his own living room, in a house on a street that exists on no available map.

Maurice Bruynooghe I need to thank especially for his support during the last few months. With Luc living and working at a safe distance, he had to put up with all my (at that time) urgent questions and requests, and always found time to squeeze the proofreading of yet another new version of a chapter into his busy schedule. He often succeeded in giving me the needed confidence boost to get me back to work again. What I have learned and will remember most from Maurice is that a lot can be said in very few words, and that there is always a bigger context to your research challenges.
“Isn’t it frustrating that nobody understands
what you are working on?”
(Wendy de Pree)
I would like to thank the members of my jury: Danny De Schreye, Eric Steegmans, Sašo Džeroski and Prasad Tadepalli for their valuable comments on this text during the last months. It is amazing how much better a text can become after it is already finished. I also thank Jean Berlamont.

Of this jury I need to spend a little extra time on Sašo. Together with Luc, he laid the foundations for this research and has, with lots of enthusiasm, followed up on and contributed to it ever since. However, Sašo went above and beyond and became as good as a third promotor of this work, and a much appreciated personal friend.

Sašo was however not the only person to contribute to the research in this dissertation. Jan Ramon's expertise was a great help on a number of topics and Thomas Gärtner was kind enough to include me as an author of one of his award-winning publications. Kristian Kersting and Martijn van Otterlo were enthusiastic discussion partners. Honorable mentions go to Nico Jacobs and Hendrik Blockeel, whose direct contributions to this work were limited, but who have provided me with lovely work-related memories. I must not forget Jan, Raymond, Stefan, Sofie, Anneleen, Celine, Joost, Tom and Daan who make up the rest of the rapidly growing machine learning research group in Leuven, and also Luc and Wim, previous members who have left the group and taken up real jobs.
“Every piece of paper in this house has “Relational
Reinforcement Learning” on it!”
(Ilse Goossens)
I need to thank my parents for letting me grow up in an environment which has led me here, and my sister for setting me a tough target to aim for. I want to thank my friends for letting me do my thing, for showing interest when I needed it (or at least faking it), for trying to comprehend what I was working on (or at least pretending to), but most of all for pulling me out of the workplace and making me forget about algorithms and statistics once in a while.

And last of all, I would like to thank Ilse, together with whom I've battled the rest of the world since long before anyone had ever heard of relational reinforcement learning. She keeps me on my toes, she keeps me happy, she keeps me going, she keeps me sane ... she's a keeper.
I’ll end this preface with one last quote. It’s not from a movie, it’s not
even from a human. It originates from an internet chat-bot, i.e., a computer
program that talks back when you chat with it. I think it’s a pretty smart
program ...
“I think that all learning is essentially reinforcement learning.
Can you think of learning which has no motive behind it?
This may sound disappointing, but that’s what it’s all about:
pleasure and pain in different degrees and flavors.”
Alan, a chat-bot².
² http://www.a-i.com/
Contents
I Introduction

1 Introduction
  1.1 Intelligent Computer Programs
  1.2 Adaptive Computer Programs
  1.3 Learning from Reinforcements
  1.4 Relational Reinforcement Learning
  1.5 Contributions
  1.6 Organization of the Text
  1.7 Bibliographical Note

2 Reinforcement Learning
  2.1 Introduction
  2.2 The Reinforcement Learning Framework
    2.2.1 The Learning Task
    2.2.2 Value Functions
    2.2.3 Nondeterministic Environments and Policies
  2.3 Solution Methods
    2.3.1 Direct Policy Search
    2.3.2 Value Function Based Approaches
    2.3.3 Q-learning
      2.3.3.1 Q-value Function Generalization
      2.3.3.2 Exploration vs. Exploitation
  2.4 Conclusions

3 State and Action Representation
  3.1 Introduction
  3.2 A Very Simple Format
  3.3 Propositional Representations
  3.4 Deictic Representations
  3.5 Structural (or Relational) Representations
    3.5.1 Relational Interpretations
    3.5.2 Labelled Directed Graphs
    3.5.3 The Blocks World
  3.6 Conclusions

4 Relational Reinforcement Learning
  4.1 Introduction
  4.2 Relational Q-Learning
  4.3 The RRL System
    4.3.1 The Suggested Approach
    4.3.2 A General Algorithm
  4.4 Incremental Relational Regression
  4.5 A Proof of Concept
  4.6 Some Closely Related Approaches
    4.6.1 Translation to a Propositional Task
    4.6.2 Direct Policy Search
    4.6.3 Relational Markov Decision Processes
    4.6.4 Other Related Techniques
  4.7 Conclusions

II On First Order Regression

5 RRL-tg
  5.1 Introduction
  5.2 Related Work
  5.3 The tg Algorithm
    5.3.1 Relational Trees
    5.3.2 Candidate Test Creation
    5.3.3 Candidate Test Selection
    5.3.4 RRL-tg
  5.4 Experiments
    5.4.1 The Experimental Setup
      5.4.1.1 Tasks in the Blocks World
      5.4.1.2 The Learning Graphs
    5.4.2 The Results
  5.5 Possible Extensions
  5.6 Conclusions

6 RRL-rib
  6.1 Introduction
  6.2 Nearest Neighbor Methods
  6.3 Relational Distances
  6.4 The rib Algorithm
    6.4.1 Limiting the Inflow
    6.4.2 Forgetting Stored Examples
      6.4.2.1 Error Contribution
      6.4.2.2 Error Proximity
    6.4.3 A Q-learning Specific Strategy: Maximum Variance
    6.4.4 The Algorithm
  6.5 Experiments
    6.5.1 A Simple Task
      6.5.1.1 Inflow Behavior
      6.5.1.2 Adding an Upper Limit
      6.5.1.3 The Effects of Maximum Variance
    6.5.2 The Blocks World
      6.5.2.1 The Influence of Different Data Base Sizes
      6.5.2.2 Comparing rib and tg
  6.6 Possible Extensions
  6.7 Conclusions

7 RRL-kbr
  7.1 Introduction
  7.2 Kernel Methods
  7.3 Gaussian Processes for Regression
  7.4 Graph Kernels
    7.4.1 Labeled Directed Graphs
    7.4.2 Graph Degree and Adjacency Matrix
    7.4.3 Product Graph Kernels
    7.4.4 Computing Graph Kernels
    7.4.5 Radial Basis Functions
  7.5 Blocks World Kernels
    7.5.1 State and Action Representation
    7.5.2 A Blocks World Kernel
  7.6 Experiments
    7.6.1 The Influence of the Series Parameter β
    7.6.2 The Influence of the Generalization Parameter ρ
    7.6.3 Comparing kbr, rib and tg
  7.7 Future Work
  7.8 Conclusions

III On Larger Environments

8 Guided RRL
  8.1 Introduction
  8.2 Guidance and Reinforcement Learning
    8.2.1 The Need for Guidance
    8.2.2 Using "Reasonable" Policies for Guidance
    8.2.3 Different Strategies for Supplying Guidance
  8.3 Experiments
    8.3.1 Experimental Setup
    8.3.2 Guidance at the Start of Learning
    8.3.3 A Closer Look at RRL-tg
    8.3.4 Spreading the Guidance
    8.3.5 Active Guidance
    8.3.6 An "Idealized" Learning Environment
  8.4 Related Work
  8.5 Conclusions
  8.6 Further Work

9 Two Computer Games
  9.1 Introduction
  9.2 The Digger Game
    9.2.1 Learning Difficulties
    9.2.2 State Representation
    9.2.3 Two Concurrent Subgoals
  9.3 Hierarchical Reinforcement Learning
  9.4 Concurrent Goals and RRL-tg
  9.5 Experiments in the Digger Game
    9.5.1 Bootstrapping with Guidance
    9.5.2 Separating the Subtasks
  9.6 The Tetris Game
    9.6.1 Q-values in the Tetris Game
    9.6.2 Afterstates
    9.6.3 Experiments
    9.6.4 Discussion
  9.7 Conclusions

IV Conclusions

10 Conclusions
  10.1 The RRL System
  10.2 Comparing the Regression Algorithms
  10.3 RRL on the Digger and Tetris Games
  10.4 The Leuven Methodology
  10.5 In General

11 Future Work
  11.1 Further Work on Regression Algorithms
  11.2 Integration of Domain Knowledge
  11.3 Integration of Planning, Model Building and RRL
  11.4 Policy Learning
  11.5 Applications
  11.6 Theoretical Framework for RRL

V Appendices

A On Blocks World Representations
  A.1 The Blocks World as a Relational Interpretation
    A.1.1 Clausal Logic
    A.1.2 The Blocks World
  A.2 The Blocks World as a Graph
List of Figures
2.1 Reinforcement Learning Agent and its Environment
3.1 Example States in Tic-tac-toe
3.2 Example State in Final Fantasy X
3.3 Delivery Robot in its Environment
3.4 Example Delivery Robot State as a Relational Interpretation
3.5 Graph Representation of a Road Map
3.6 Example State in The Blocks World
3.7 Blocks World State as a Relational Interpretation
3.8 Blocks World State as a Graph
4.1 Example Epoch in the Blocks World
5.1 Relational Regression Tree
5.2 Non Reachable Goal States
5.3 tg on Stacking
5.4 tg on Unstacking
5.5 tg on On(A,B)
5.6 Learned Tree for On(A,B) Task
5.7 First Order Tree Restructuring
6.1 Renaming in the Blocks World
6.2 Different Regions for a Regression Algorithm
6.3 Maximum Variance Example Selection
6.4 The Corridor Application
6.5 Prediction Errors for Varying Inflow Limitations
6.6 Database Sizes for Varying Inflow Limitations
6.7 Effects of Selection by Error Contribution
6.8 Effects of Selection by Error Proximity
6.9 The Shape of a Q-function
6.10 Effects of Selection by Maximum Variance
6.11 rib on Stacking
6.12 rib on Unstacking
6.13 rib on On(A,B)
6.14 rib vs tg on Stacking
6.15 rib vs tg on Unstacking
6.16 rib vs tg on On(A,B)
7.1 The Covariance Matrix for Gaussian Processes
7.2 Examples of a Graph, Labeled Graph and Directed Graph
7.3 Examples of a Walk, a Path and a Cycle
7.4 Direct Graph Product
7.5 Weights of the Geometric Matrix Series
7.6 Weights of the Exponential Matrix Series
7.7 Graph Representation of a Blocks World State
7.8 Graph Representation of a Blocks World (state, action) Pair
7.9 Influence of the Exponential Parameter on Stacking
7.10 Influence of the Exponential Parameter on Unstacking
7.11 Influence of the Exponential Parameter on On(A,B)
7.12 Influence of the Generalization Parameter on Stacking
7.13 Influence of the Generalization Parameter on Unstacking
7.14 Influence of the Generalization Parameter on On(A,B)
7.15 kbr vs rib and tg on Stacking
7.16 kbr vs rib and tg on Unstacking
7.17 kbr vs rib and tg on On(A,B)
8.1 Blocks World Random Policy Success Rate
8.2 Blocks World Random Policy Noise Ratio
8.3 Guidance at Start for Stacking
8.4 Guidance at Start for Unstacking and On(A,B)
8.5 Half Optimal Guidance for tg
8.6 Spread Guidance for Stacking and Unstacking
8.7 Spread Guidance for On(A,B)
8.8 Active Guidance for Stacking and Unstacking
8.9 Active Guidance for On(A,B)
8.10 Stacking in an "Idealized" Environment
8.11 Unstacking in an "Idealized" Environment
8.12 On(A,B) in an "Idealized" Environment
9.1 The Digger Game
9.2 The Freedom of Movement in Digger
9.3 Concurrent Goals in Digger
9.4 Concurrent Goals with Competing Actions
9.5 Performance Results for the Digger Game
9.6 Hierarchical Learning for the Digger Game
9.7 A Tetris Snapshot
9.8 Greedy Action Problem in Tetris
9.9 Afterstates in Tetris
A.1 Example State in The Blocks World
A.2 Graph Representation of a Blocks World State
List of Algorithms
2.1 Q-learning Algorithm
2.2 Episodic Q-learning Algorithm
4.1 The Relational Reinforcement Learning Algorithm
4.2 A First RRL Algorithm
5.1 The tg Algorithm
6.1 The rib Data Selection Algorithm
List of Tables
5.1 Blocks World Sizes
5.2 Q-tree Sizes
6.1 Database Sizes for rib-mv
6.2 Execution Times for RRL-tg and RRL-rib
7.1 Execution Times for RRL-tg, RRL-rib and RRL-kbr
List of Symbols
The following tables list some symbols used throughout the text, together with a short description of their meaning.
Reinforcement Learning Framework
a, a_t – action (at time-step t)
A – set of actions
A(s) – set of state-dependent actions
δ – transition function
δ(s, a) – resulting state when taking action a in state s
E_(δ,r,π)(.) – expected value given δ, r and π
γ – discount factor
goal – goal function
P_δ – transition probability function
P_δ(s, a, s′) – probability that taking action a in state s results in state s′
P_π – probabilistic policy function
P_π(s, a) – probability that action a is chosen in state s
pre – precondition function
π – policy
π^* – optimal policy
π̂ – policy belonging to an approximate Q-function
π(s) – action chosen by policy π in state s
Q – Q-value (or Quality value)
Q̂, Q̂_e – approximated Q-function (after learning epoch e)
Q^* – optimal Q-value
Q^π(s, a) – Q-value of action a in state s following policy π
r – reward or reward function
r_t – reward received at time-step t
r(s, a) – reward for taking action a in state s
s, s_t – state (at time-step t)
S – set of states
V^π – value function according to policy π
V^* – value function according to an optimal policy
V^π(s) – utility of state s following policy π
V_t(s) – utility approximation at time-step t
visits(s, a) – number of times action a was performed from state s

Logic
, – conjunction
; – disjunction
\= – not equal
←, :- – implication operator
b_i – negative literals
h_i – positive literals
p/n – predicate p with arity n

Graphs
δ^+(v) – set of edges starting from vertex v
δ^−(v) – set of edges ending in vertex v
|δ^+(v)| – outdegree of vertex v
|δ^−(v)| – indegree of vertex v
∆^+(G) – maximal outdegree of graph G
∆^−(G) – maximal indegree of graph G
e, e_i – edge
E – set of edges
E – adjacency matrix
G – graph
ℓ, ℓ_i – label
l, l_i – label variable
L – set of labels
label – label function
Ψ – edge to vertices function
ν, ν_i – vertex
v, v_i – vertex variable
V – set of vertices

Learning/Regression Framework
dist_ij – distance between examples i and j
e, e_i – example
E, Examples – set of learning examples
F̂ – function approximation
n – number of examples
q_i – Q-value of example i
σ – variance
t, t_i – target value (of example i)

Instance Based Regression
error_i – cumulative prediction error for example i
error_i^(−i) – cumulative prediction error for example i without example i
EC-score_i – error contribution score of example i
EP-score_i – error proximity score of example i
F_g, F_l – example filter parameters
M – maximum variance

Kernel Based Regression
⟨x, x′⟩ – inner product of x and x′
β – exponential weights parameter
C_N – covariance matrix of the first N examples
C_ij, C(x_i, x_j) – covariance of examples x_i and x_j
φ – feature transformation
γ – geometric weight parameter
H – feature space, Hilbert space
k – kernel function
k_conv – convolution kernel
k_× – product graph kernel
k_* – blocks world kernel
k(x, x′) – kernel value of examples x and x′
µ – mean target vector
R – composition relation
R^−1 – decomposition relation
ρ – radial basis parameter
t, t_i – target value (of example i)
t, t_i – array of target values (up to example i)
x, x_i – example
X – example space

Other
argmax_a(.) – the a that maximizes the given expression
max_a(.) – the maximum value of the expression for varying values of a
exp(.) – exponential function
P(A|B) – probability of A given B
ℝ – set of real numbers
T – temperature (in Boltzmann exploration)
x̄ – average x-value
ℕ – set of positive integers
Part I
Introduction
Chapter 1
Introduction
“I assume I need no introduction.”
Interview with the Vampire
This work is intended to be a very small step in the quest to build intelligent computer programs. As is the case with most doctoral dissertations in computer science (and probably most other scientific fields as well), the research in this work on the topic of Relational Reinforcement Learning represents developments in a small niche of its research field. Situating the topic of this text therefore generates a long list of more general topics, starting from Reinforcement Learning (Sutton and Barto, 1998; Kaelbling et al., 1996), or, more specifically, Q-learning (Watkins, 1989), through the field of Machine Learning (Mitchell, 1997; Langley, 1994), and ending at the relatively broad topic of Artificial Intelligence (Russell and Norvig, 1995).
To facilitate the introduction of this ranking of topics, they will be discussed in more detail in reverse order.
1.1 Intelligent Computer Programs
One definition of artificial intelligence, as given by Russell Beale once, is the
following:
“Artificial Intelligence can be defined as the attempt to get real
machines to behave like the ones in the movies.”
Whereas a few decades ago the field of artificial intelligence was regarded
as a stalling research field, today artificial intelligence is omnipresent in science
fiction and widely regarded as the next big step in technological evolution. From
the enchanting personality of Andrew Martin, the robot from Isaac Asimov’s
story "Bicentennial Man" (Asimov, 1976), later portrayed by Robin Williams in the movie of the same name, to the more menacing HAL 9000 from the novel "2001: A Space Odyssey" by Arthur C. Clarke (Clarke, 1968), the artificial intelligence entities portrayed in science fiction stories display human-level intelligence (or better), often supplemented by an endearing lack of common sense (such as a sense of humor).
The growing presence of computer technology in the everyday life of human beings, together with the imaginative minds of science fiction authors, has greatly increased the public interest in the research field. Driven by the great performance increases in computational hardware and the availability of an ever growing number of research results, the field of artificial intelligence has reinforced this interest with a few impressive accomplishments such as the chess computer Deep Blue.
A more scientific view of the research field is the definition of artificial intelligence by John McCarthy, who coined the term in 1955. He defines it as:
“The science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar
task of using computers to understand human intelligence, but AI
does not have to confine itself to methods that are biologically observable.”
While the research in the field of artificial intelligence originally focussed primarily on expert systems and problem-solving algorithms, the field now includes a larger variety of topics, ranging over knowledge representations, control methods, natural language processing, etc. One important subfield of artificial intelligence is that of "Machine Learning", as some have argued that real intelligence is unattainable without the ability to learn.
1.2 Adaptive Computer Programs
The most obvious shortcoming of artificial intelligence tends to be the predictability exhibited by supposedly intelligent computer programs, such as, for example, artificial adversaries in computer games. The human mind is very well suited to recognize situations in which a system displays repeated and easily predicted behavior, and once this behavior has been spotted it is often very easily exploited. Machine learning research is concerned with computer programs that learn from experience and possibly adapt their behavior when necessary.
The subfield of machine learning is defined by Mitchell (1997) as:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
For example, a computer learning to play chess (task T) would be judged on the increase in the number of won games (performance measure P) per number of played games (experience E).
A less theoretically correct definition, which is however more accessible to the general public that has never heard of the field before, is given by Rob Schapire in his lectures on the "Foundations of Machine Learning":
“Machine Learning studies computer programs for learning to do
stuff.”
Although the machine learning subfield of artificial intelligence is not as well known to the general public, the introduction of machine learning technology into everyday life has already started. Applications range from data mining engines used by most supermarket chains to discover trends in consumer behavior, to the spam-mail filters that are built into most up-to-date e-mail clients.
Although many subtopics can be identified within the field of machine learning, one possible way of dividing the research is by looking at the kind of task one is trying to solve. The first is the extraction of new knowledge from available data. To this field belongs the work on user modelling (Kobsa, 2001), data mining (Witten and Frank, 1999; Džeroski and Lavrac, 2001), etc. The second subfield is concerned with programs that learn to act appropriately. A nice overview of different learning settings in this field is given by Boutilier et al. (1999). It is in this field that reinforcement learning is situated.
1.3 Learning from Reinforcements
Currently, only a very small number of everyday computer programs are pro-active. Most computer applications (luckily) only perform tasks when the user pushes a button or activates some task. Pro-active software systems are usually referred to as software agents (Jennings and Wooldridge, 1995). An example of a pro-active software system is a web-spider, as used by most internet search engines to generate a database of available web-pages. Such a web-spider searches the internet, trying to discover new web-pages by following links on other pages. When deciding which links to follow, the web-spider tries to restrict the links it follows to those that lead to new information, and thus limit the amount of data it needs to download. Although it is possible (and current practice) to design and implement the link selection strategy of the web-spider by hand, it might be possible to generate better performing strategies through the use of appropriate machine learning techniques (Rennie and McCallum, 1999).
Reinforcement Learning problems are characterized by the type of information that is presented to the learning system. Whereas supervised learning
techniques, such as behavioral cloning techniques (Bain and Sammut, 1995; Urbancic et al., 1996), are presented with an array of already classified learning
examples, the only information that is supplied to a reinforcement learning system is a quantitative assessment of its current behavior. Instead of a number of
examples that show the system how to behave, the reinforcement learner has to derive the appropriate behavior by linking the relevant parts of its behavior with the received rewards or punishments. In the example of the web-spider,
the reinforcement learning system could receive a small punishment for each
followed link, but an appropriate reward for each newly discovered web-page.
By attempting to limit the amount of punishment and maximize the received
rewards, the reinforcement learning system might be able to generate a very
good internet exploration strategy.
Q-learning is one of many reinforcement learning algorithms. It uses the
rewards or punishments it receives to calculate a Quality value for each possible
action in each possible state. When the learning system then needs to decide
what action to take, it selects the action with the highest “Q-value” in the
current state.
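To make this concrete, here is a minimal tabular sketch in Python of the Q-value table and greedy action selection just described, together with the standard Q-learning update rule; the learning rate alpha and discount factor gamma are generic textbook parameters, and the function names are chosen for illustration only (the actual algorithm used in this work is presented in Chapter 2).

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to their current Quality estimate.
Q = defaultdict(float)

def greedy_action(state, actions):
    """Select the action with the highest Q-value in the current state."""
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update based on the received reward."""
    best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```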
1.4 Relational Reinforcement Learning
Relational Reinforcement Learning is concerned with reinforcement learning in domains that exhibit structural properties and in which different kinds of related objects exist. These kinds of domains are usually characterized by a very large and possibly unbounded number of different possible states and actions. In this kind of environment, most traditional reinforcement learning techniques break down. Even Q-learning techniques that use a propositional Q-value generalization can be hard to apply to tasks that involve an unknown number of objects or are largely defined by the relations between available objects.
Relational reinforcement learning, as presented in this work, employs a relational regression technique in cooperation with a Q-learning algorithm to build a relational, generalized Q-function. As such, it combines techniques from reinforcement learning with generalization techniques from inductive logic programming. Due to the use of a more expressive representation language to represent states, actions and Q-functions, the proposed relational reinforcement learning system can potentially be applied to a wider range of learning tasks than conventional reinforcement learning. It also enables the abstraction from specific goals or even from specific learning environments and allows for the exploitation of results from previous learning phases when addressing new, more complex, but related situations.
1.5 Contributions
The main contribution of this work is the development of a first applicable relational reinforcement learning system. The foundations for this system were laid by Sašo Džeroski, Luc De Raedt and Hendrik Blockeel in (Džeroski et al., 1998) and further investigated in (Džeroski et al., 2001), but it wasn't until a fully incremental regression algorithm was developed that the system could be applied to non-toy examples.
A second contribution is the development of three incremental relational regression algorithms. A relational regression algorithm generalizes over learning examples with a continuous target value and makes predictions about the value of unseen examples, using a relational representation for both the learning examples and the resulting function. The tg algorithm that builds first order regression trees was developed together with Jan Ramon and Hendrik Blockeel and first published in (Driessens et al., 2001). It uses a number of incrementally updated statistics to build a relational regression tree. The instance based rib algorithm, first discussed in (Driessens and Ramon, 2003), was built using Jan Ramon's expertise on relational distances. Using nearest-neighbor prediction, the rib algorithm selects which examples to store in its database by using example selection criteria based on local prediction errors or maximum Q-function variation. The third regression algorithm, based on Gaussian processes for regression and using graph kernels as a covariance function between examples, is called kbr and was developed with the help of Thomas Gärtner and Jan Ramon and first published in (Gärtner et al., 2003a). Although these regression algorithms were developed for use with the RRL system, they are more widely applicable to other relational learning problems with a continuous prediction class.
A third contribution is the development of two additions to the RRL system that increase its applicability to larger, more difficult tasks. The first is a methodology to supply guidance to the RRL system through the use of external, reasonable policies, which enables the RRL system to perform better on tasks with sparse and hard-to-reach rewards. This methodology was first published in (Driessens and Džeroski, 2002a). The second is a novel hierarchical reinforcement learning method that uses the expressive power of relational representations to supply information about Q-functions learned on partial problems to the learning algorithm. This new hierarchical method can be used to handle concurrent goals and was first presented in (Driessens and Blockeel, 2001).
A last contribution is the application of the RRL system to the non-trivial tasks of Digger and Tetris. Although both Digger and Tetris are still toy applications and have little to do with real-world problems, they are relatively complex compared to the usual tasks handled with reinforcement learning.
1.6 Organization of the Text
This work is composed of a rather large number of relatively small chapters. To accentuate the structure of the text, an additional division into four parts has been made.
Part I discusses a number of introductory issues. While this chapter gives
a very general introduction to the research area that this work is situated
in, Chapter 2 introduces the paradigm of reinforcement learning. Chapter 3
discusses different representational formats that are available to describe the
environment and possible actions of the reinforcement learning agent. A formal
definition of the problem addressed in this work is given in Chapter 4, together
with a general description of the relational reinforcement learning or RRL
system that is the main accomplishment of this thesis.
Three different approaches to incremental relational regression are presented
in Part II. Chapter 5 discusses a regression tree algorithm tg and Chapter 6
introduces relational instance based regression. A regression algorithm based
on Gaussian processes and graph kernels is presented in Chapter 7. Each of
these regression algorithms is thoroughly tested on a variety of problems in the
blocks world.
Part III discusses some extra methodologies that can be added to the basic
RRL system to increase the performance on large problems and illustrates these
using larger environments in the blocks world and also in two computer games:
Digger and Tetris. Chapter 8 illustrates that the performance of the RRL
system can be augmented by supplying a reasonable guidance policy. Chapter
9 introduces hierarchical reinforcement learning for concurrent goals using the
Digger game as a testing environment and illustrates the behavior of the RRL
system on the popular Tetris game.
In Part IV, Chapters 10 and 11 discuss some conclusions that can be drawn
from this work and highlight a number of directions for possible further work.
1.7 Bibliographical Note
Most of this dissertation has already been published elsewhere. The following list contains the key articles:

On the RRL system
1. (Džeroski et al., 1998) and (Džeroski et al., 2001) introduced relational reinforcement learning through the use of Q-learning with a relational generalization technique.
2. (Driessens, 2001) presents an agent-oriented discussion of relational reinforcement learning.

On incremental relational regression
1. (Driessens et al., 2001) introduced the first incremental RRL system through the development of the tg incremental regression tree algorithm.
2. (Driessens and Ramon, 2003) presented relational instance based regression for the RRL system.
3. (Gärtner et al., 2003a) developed a regression technique based on Gaussian processes and graph kernels for the RRL system.

On the use of domain knowledge
1. (Driessens and Džeroski, 2002a) and (Driessens and Džeroski, 2002b) discussed the use of guidance with reasonable policies to enhance the performance of the RRL system in environments with sparse rewards.
2. (Driessens and Blockeel, 2001) presented hierarchical reinforcement learning for concurrent goals applied to the Digger computer game.
Chapter 2
Reinforcement Learning
“No reward is worth this!”
A New Hope
2.1 Introduction
The most basic style of learning, i.e., the type of learning that is performed by all intelligent creatures, is governed by positive and negative rewards. A dog will learn to behave as desired when it is presented with reinforcements at appropriate times. A new-born child will quickly (often too quickly) learn the shortest path to the center of its parents' attention.
Other types of learning, mainly supervised learning, require cognitive interaction between the learner and a third entity, a kind of teacher who provides a set of correct and incorrect examples. This cognitive interaction requires a higher level of intelligence than what is needed for reinforcement learning. Although the rewards used for reinforcement learning can also involve a cognitive interaction, they don't need to, and can address more instinctive needs.
To illustrate the difference between the two, here are some examples of
typical reinforcement learning and supervised learning:
• A dog is hugged and petted when it returns a stick and learns to repeat this behavior. (Reinforcement Learning)
• A math student watches the teacher perform exercises on the blackboard and imitates this behavior for his homework. (Supervised Learning)
• A baby starts to scream because it is hungry and gets a bottle of milk
presented. The baby now screams a lot more. (Reinforcement Learning)
• A reader looks over four examples of reinforcement and supervised learning and learns to distinguish the two. (Supervised Learning)
The following section provides a formal definition of the reinforcement learning problem domain. Section 2.3 discusses an array of possible solution approaches, ranging from dynamic programming methods to temporal difference approaches. In particular, the focus is on Q-learning, a model-free reinforcement learning algorithm, which is discussed at length in Section 2.3.3.
2.2 The Reinforcement Learning Framework
In computer science, reinforcement learning is performed by a software agent that interacts with its environment and receives only feedback on the quality of its performance instead of a set of correct (and possibly incorrect) examples.
The system that is trying to solve the reinforcement learning problem will be
referred to as the agent. Such an agent will interact with its environment,
sometimes also referred to as its world. This interaction consists of actions and
perceptions as depicted in Figure 2.1. The agent is supplied with an indication
of the current state of the environment and chooses an action to execute, to
which the environment reacts by presenting an updated state indication.
Figure 2.1: The interaction between an agent and its environment in a reinforcement learning problem.
Also available in the perceptions presented to the agent is a reward, a numerical value that is given for each taken action (or for each reached world-state).
To solve the reinforcement learning problem, the agent will try to find a policy
that maximizes the received rewards over time.
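The interaction loop described above and depicted in Figure 2.1 can be sketched as a short program; the Environment and Agent objects and their methods (reset, step, select_action, observe) are hypothetical interface names used only for illustration, not part of any particular library.

```python
def run_episode(environment, agent, max_steps=1000):
    """One episode of the agent-environment interaction of Figure 2.1.

    The environment returns a state indication and a numerical reward for
    every executed action; the agent only sees these two signals.
    """
    state = environment.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)                   # agent chooses an action
        next_state, reward, done = environment.step(action)   # world reacts
        agent.observe(state, action, reward, next_state)      # feedback only
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```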
2.2.1 The Learning Task
A more formal definition of a reinforcement learning problem is presented in
this section. This formulation of reinforcement learning is comparable to the
ones given by Mitchell (1997) and Kaelbling et al. (1996).
Definition 2.1 A reinforcement learning task is defined as follows:
Given
• a set of states S,
• a set of actions A,
• a (possibly unknown) transition function δ : S × A → S,
• an unknown real-valued reward function r : S × A → ℝ.
Find a policy π^* : S → A that maximizes a value function V^π(s_t) for all states s_t ∈ S. The utility value V^π(s) is based on the rewards received starting from state s and following policy π.
At each point in time t, the reinforcement learning agent is in state s_t, one of the states of S, and selects an action a_t = π(s_t) ∈ A to execute according to its policy π. Executing an action a_t in a state s_t will put the agent in a new state s_{t+1} = δ(s_t, a_t). The agent also receives a reward r_t = r(s_t, a_t).
The value V^π(s) indicates the value or utility of state s, often related to the cumulative reward an agent can expect starting in state s and following policy π. A few possible definitions of utility functions will be given in Section 2.2.2.
It is possible that not all actions can be executed in all world states. In environments where the set of available actions depends on the state of the environment, the possible actions in state s will be indicated as A(s).
The task of learning is to find an optimal policy, i.e., a policy that will maximize the chosen value function. The optimal policy is denoted by π^* and the corresponding value function by V^*.
For an agent learning to play chess, the set of states S consists of all the legal chess positions that can be reached during play, and the set of actions A(s) consists of all the legal moves in state s. The transition function δ includes both the result of the action chosen by the agent and the result of the counter move of its opponent. Unless the agent is playing a very predictable opponent, δ is unknown to the agent in this case. The reward function could easily be defined to present the agent with a reward of 1 when it wins the game, −1 when it loses and 0 for all other actions. The task of the learning agent is then to find a policy which maximizes the received reward and therefore maximizes the number of won games.
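As a smaller illustration of Definition 2.1, the toy task below (a hypothetical five-cell corridor, chosen purely for this illustration and not an example taken from this thesis) spells out the sets S and A and the functions δ and r explicitly, together with one simple policy.

```python
# A toy reinforcement learning task in the sense of Definition 2.1.
S = list(range(5))            # states: positions 0..4 in a small corridor
A = ["left", "right"]         # actions

def delta(s, a):
    """Deterministic transition function δ : S × A → S."""
    return max(s - 1, 0) if a == "left" else min(s + 1, 4)

def r(s, a):
    """Reward function r : S × A → R: reward 1 for stepping into the last cell."""
    return 1.0 if delta(s, a) == 4 and s != 4 else 0.0

# One (far from optimal in general, but here sufficient) policy π : S → A.
def pi(s):
    return "right"
```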
2.2.2 Value Functions
The most commonly used definition of state utility, and the one that will be used throughout this thesis, is the discounted cumulative future reward:
Definition 2.2 (Discounted Cumulative Reward)
V^π(s_t) ≡ Σ_{i=0}^{∞} γ^i r_{t+i}    (2.1)
with 0 ≤ γ < 1.
This expression computes the discounted sum of all future rewards that the agent will receive starting from state s_t and executing actions according to policy π. The discount factor γ keeps the cumulative reward finite, but it is also used as a measure that indicates the relative importance of future rewards. Setting γ = 0 will make the agent try to optimize only immediate rewards, while setting γ close to 1 means that the agent will regard future rewards almost as important as immediate ones. Requiring that γ < 1 ensures that the state utilities remain finite.
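A minimal sketch of Equation 2.1 for a finite sequence of observed rewards; truncating the infinite sum is a reasonable approximation here, since the weights γ^i vanish for large i.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward of Definition 2.2 for a finite
    sequence of rewards r_t, r_{t+1}, ... collected while following a policy."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: a reward of 1 received three steps in the future.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3 = 0.729
```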
Other Utility Functions
The definition of the utility function as given in Equation 2.1 is not the only possibility. A number of other possible definitions are:
Definition 2.3 (Finite Horizon)
V^π(s_t) ≡ Σ_{i=t}^{h} r_i
where rewards are only considered up to a fixed time-step (h).
Definition 2.4 (Receding Horizon)
V^π(s_t) ≡ Σ_{i=0}^{h} r_{t+i}
where rewards are only considered up to a fixed number of steps (h) starting from the current time-step.
Definition 2.5 (Average Reward)
V^π(s_t) ≡ lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
which considers the long-run average reward.
More about these alternative utility function definitions can be found in the overview paper by Kaelbling et al. (1996).
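For comparison, the three alternative definitions can be computed from the same finite reward sequence; the index convention below (rewards[i] holding the reward of time-step i, with the average-reward limit replaced by the average over a finite run) is an assumption made only for this illustration.

```python
def finite_horizon(rewards, t, h):
    """Definition 2.3: sum of rewards from time-step t up to a fixed time-step h."""
    return sum(rewards[t:h + 1])

def receding_horizon(rewards, t, h):
    """Definition 2.4: sum of the next h rewards starting from time-step t."""
    return sum(rewards[t:t + h + 1])

def average_reward(rewards, t):
    """Definition 2.5 for a finite run: the average reward from time-step t on."""
    tail = rewards[t:]
    return sum(tail) / len(tail)
```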
2.2.3 Nondeterministic Environments and Policies
The environment the agent interacts with is not required to be deterministic. When the execution of an action in a given state does not always result in the same state transition, the transition function δ can be replaced by a transition-probability function.
Definition 2.6 A transition-probability function P_δ : S × A × S → [0, 1], where P_δ(s_t, a_t, s_{t+1}) indicates the probability that taking action a_t in state s_t results in state s_{t+1}. This implies that
∀s ∈ S, a ∈ A(s) : Σ_{s′∈S} P_δ(s, a, s′) = 1
The assumption that P_δ is only dependent on the current state s_t and action a_t is called the Markov property. It allows the agent to make decisions based only on the current state. More about the Markov property of environments can be found in the book of Puterman (1994).
Not only can the state transitions be stochastic, the reward function can be nondeterministic as well. It must be noted, however, that while determinism is not required, the environment is usually assumed to be static, i.e., the probabilities of state transitions or rewards do not change over time.
Stochastic Policies
When using nondeterministic policies, a similar probability function can be defined instead of the deterministic policy π.
Definition 2.7 A probabilistic policy function P_π : S × A → [0, 1]. P_π(s, a) indicates the probability of choosing action a in state s. It is required that
∀s ∈ S : Σ_{a∈A(s)} P_π(s, a) = 1
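Both probability functions of Definitions 2.6 and 2.7 can be represented as tables whose probabilities sum to one over their last argument; the dictionary-based representation and the state and action names below are illustrative assumptions, not notation used elsewhere in this text.

```python
import random

# P_delta[(s, a)] lists (next_state, probability) pairs summing to 1.
P_delta = {("s0", "a"): [("s0", 0.2), ("s1", 0.8)],
           ("s1", "a"): [("s1", 1.0)]}

# P_pi[s] lists (action, probability) pairs summing to 1.
P_pi = {"s0": [("a", 1.0)], "s1": [("a", 1.0)]}

def sample(distribution):
    """Draw an outcome from a list of (outcome, probability) pairs."""
    outcomes, probabilities = zip(*distribution)
    return random.choices(outcomes, weights=probabilities, k=1)[0]

next_state = sample(P_delta[("s0", "a")])   # stochastic transition
action = sample(P_pi["s0"])                 # stochastic policy choice
assert abs(sum(p for _, p in P_delta[("s0", "a")]) - 1.0) < 1e-9  # normalization
```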
Utility Functions for Nondeterministic Environments and Policies
When dealing with stochastic environments or policies, the definition of the utility function of Equation 2.1 is changed to represent the expected value of the discounted sum of future rewards:
V^π(s_t) ≡ E_{(δ,r,π)} [ Σ_{i=0}^{∞} γ^i r_{t+i} ]    (2.2)
Intermezzo: Planning as a Reinforcement Learning Task
There exist important similarities between reinforcement learning as described above and planning without complete knowledge.
In planning, one is given a goal function goal : S → {true, false} that defines which states are target states. The aim of the planning task is: given a starting state s_1, find a sequence of actions a_1, a_2, ..., a_n with a_i ∈ A, such that:
goal(δ(... δ(s_1, a_1) ..., a_n)) = true
Usually, also a precondition function pre : S × A → {true, false} is given. This function specifies which actions can be executed in which states. This puts the following extra constraints on the action sequence:
∀a_i : pre(δ(... δ(s_1, a_1) ..., a_{i−1}), a_i) = true
In normal planning problems, the effect of all actions (i.e., the function δ) is known to the planning system. However, when this function is unknown, i.e., when planning without complete knowledge of the effect of actions, the setting is essentially that of reinforcement learning.
In both settings, a policy π has to be learned and the transition function δ is unknown to the agent. To capture the idea that goal states are absorbing states, δ satisfies:
∀a : δ(s_t, a) = s_t if goal(s_t) = true
This ensures that once the learning agent reaches a goal state, it stays in that goal state.
The reinforcement learning reward can be defined as follows:
r_t = r(s_t, a_t) = 1 if goal(s_t) = false and goal(δ(s_t, a_t)) = true, and 0 otherwise.
A reward is thus only given when a goal-state is reached for the first time.
This reward function is unknown to the learning agent, as it depends on the
unknown transition function δ.
As such, it is possible to cast a problem of learning to plan under incomplete
knowledge as a reinforcement learning task. The optimal policy π^* can be used
to compute the shortest action-sequence to reach a goal state, so this optimal
policy, or even an approximation thereof, can be used to improve planning
performance.
2.3 Solution Methods
There are two major directions that can be explored to solve reinforcement
learning problems: direct policy search algorithms and statistical value-function
based approaches.
2.3.1 Direct Policy Search
Since a solution to a reinforcement learning problem is a policy, one can try to
define the policy space and search in this space for an optimal policy. In this
approach, the policy of the reinforcement learning agent is parameterized and
this parameter vector is tuned to optimize the average return of the related
policy.
This approach is used, for example, by Genetic Programming algorithms.
(The utility function defined in Section 2.2.1 can be used as a fitness function to
evaluate candidate solutions.) A full discussion of the use of genetic algorithms
or genetic programming falls outside the scope of this thesis. Interested readers
can consult the books of Mitchell (1996) or Goldberg (1989).
2.3.2 Value Function Based Approaches
Most research in reinforcement learning focuses on the computation of the
optimal utility of states (i.e., the function V ∗ or related values) to find the
optimal policy. Once this function V ∗ is known, it is easy to translate it into
an optimal policy. The optimal action in a state is the action that leads to the
state with the highest V ∗ -value.
π*(s) = argmax_a [r(s, a) + γV*(δ(s, a))]                (2.3)
However, as can be seen in the equation, this translation requires a model
of the environment through the use of the state transition function δ (and the
reward function r). In many applications, building a model of the environment
is at least as hard as finding an optimal policy, so learning V ∗ is not sufficient
to solve the reinforcement learning problem. Therefore, instead of learning the
utility of states, an agent can learn a different value, which quantifies the utility
of an action in a given state when following a given policy π.
Many approaches exist that try to compute the state or (state, action)
values. Sutton and Barto (1998) discuss methods ranging from dynamic programming and Monte-Carlo methods to temporal difference approaches.
Dynamic programming methods (Barto et al., 1995) try to compute V ∗
using a full model of the environment. Policy evaluation, which computes the
value V π of any policy π, starts from a randomly initialized utility value for
each state and updates these values using the Bellman equation:
V_{i+1}(s) = r(s, π(s)) + γV_i(δ(s, π(s)))                (2.4)
until a fixed point for the update rule has been found. For nondeterministic environments or policies, the right side of the equation is replaced by
its expected value. To find the optimal policy, policy iteration interleaves
policy evaluation steps with policy improvement steps. After each policy evaluation, the policy π is adjusted so that it acts greedily with respect to the current
utility estimates. This step is called policy improvement. Afterwards, the new
(improved) policy is evaluated again, until a fixed point is found.
Value iteration is based on the same principles as policy iteration, but does
not require a full policy evaluation before it adapts the policy. Instead of the
update rule from Equation 2.4 it uses the following:
V_{i+1}(s) = max_a [r(s, a) + γV_i(δ(s, a))]                (2.5)
By using a model of the environment and previous estimates of the utility values
of “neighboring” states, value iteration computes better approximations of the
utilities of all states by finding the neighboring state with the maximum value.
This process is then iterated until no further change is needed. However, since
the number of states can become quite large, even one iteration of computing
better utility approximations can become problematic.
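As an illustration of the update of Equation 2.5, the following sketch implements value iteration for a small deterministic environment whose model is known. The dictionaries states, actions, delta and r, as well as the stopping threshold, are illustrative assumptions and not part of this thesis.

def value_iteration(states, actions, delta, r, gamma=0.9, epsilon=1e-6):
    # states: iterable of states; actions[s]: actions available in s
    # delta[(s, a)]: successor state; r[(s, a)]: immediate reward
    V = {s: 0.0 for s in states}
    while True:
        max_change = 0.0
        for s in states:
            # Equation 2.5: best one-step lookahead using the current estimates
            best = max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions[s])
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < epsilon:
            return V

Note how every sweep touches every state once, which is exactly the step that becomes problematic when the number of states grows large.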
Monte-Carlo methods (Barto and Duff, 1994) base their value predictions on
average rewards collected while exploring the environment. For every possible
(state, action) pair, an array of rewards is collected through exploration of
the environment and the prediction of the (state, action)-value is made by
averaging the collected rewards. This means that no model of the environment
is needed for Monte-Carlo methods.
Temporal difference (TD) algorithms combine the ideas of dynamic programming and Monte-Carlo methods. They use an update rule that expresses
the relation between the utility values of adjacent states to incrementally update these values based on information collected during exploration of the environment. The two most important algorithms that use this approach to learn
(state, action) values are Q-learning (Watkins, 1989) (see also Section 2.3.3)
and SARSA (Sutton and Barto, 1998). While SARSA learns the (state, action)
values of the exploration policy, Q-learning, which is exploration insensitive,
computes the optimal (state, action) values. Of these two, Q-learning will be
discussed in more detail in the following section.
Temporal difference algorithms that use eligibility traces, such as TD(λ),
Sarsa(λ) and Q(λ), use a more elaborate version of the update rule than Q-learning or SARSA, one that does not only take a single action and the resulting
state into account, but incorporates information from multiple steps ahead, weighted by the parameter λ. More information on these
techniques can be found in the book by Sutton and Barto (1998).
More recently, the reinforcement learning problem has also been translated
to the support vector machines setting (Dietterich and Wang, 2002). Readers
interested in more information on the discussed (and more) methods can find
a great deal of knowledge in the work of Kaelbling et al. (1996) and Sutton
and Barto (1998).
2.3.3 Q-learning
Q-learning (Watkins, 1989) is a temporal difference method that computes the
Q-value for the optimal policy associated with each (state, action) pair. The
Q-value of a (state, action) pair with respect to a policy π is closely related to
the utility function defined in the previous section.
Definition 2.8 (Q-value)
Qπ (s, a) ≡ r(s, a) + γV π (δ(s, a))
The Q-value for the optimal policy is denoted by Q∗ (s, a) = r(s, a) +
γV ∗ (δ(s, a)). This is exactly the value that is maximized in Equation 2.3.
Therefore, this value can be used to generate an optimal policy without the
need for a model of the environment (i.e., transition function δ). An action in
a given state will be optimal if the action has the highest Q-value in that state.
Of course, not knowing the δ or r functions means that this definition cannot
be used to compute the Q-function. Luckily, the Q-learning algorithm presents
a way to learn the Q-values without explicit knowledge of these functions.
Used to compute the optimal Q-values, Q-learning is a simple algorithm
that updates these values incrementally while the reinforcement learning agent
explores its environment. Algorithm 2.1 shows a high level description of the
algorithm.
This simple Q-learning algorithm converges towards the optimal Q-function
in deterministic environments provided that each (state, action) pair is visited
often enough (Watkins, 1989; Jaakkola et al., 1993).
The formula used in the described Q-learning algorithm:
Q(s, a) ← r + γ max_{a'} Q(s', a')                (2.6)
is known as a Bellman Equation. It is used in deterministic environments.
Algorithm 2.1 The basic Q-learning algorithm.
for each s ∈ S, a ∈ A do
initialize table entry Q(s, a)
end for
generate a starting state s
repeat
select an action a and execute it
receive an immediate reward r = r(s, a)
observe the new state s0
update the table entry for Q(s, a) as follows:
Q(s, a) ← r + γ max_{a'} Q(s', a')
s ← s'
until no more learning
When reinforcement learning is performed in a non-deterministic world,
another Bellman Equation, which makes use of the ideas of temporal difference
learning, is used:
Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_{a'} Q(s', a')]                (2.7)
with
α = 1 / (1 + visits(s, a))
The Q-learning algorithm using this update rule converges in stochastic environments provided that every (state, action) pair is visited often enough
and that the probabilities of state transitions and rewards are static (Watkins,
1989; Jaakkola et al., 1993).
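A minimal sketch of this stochastic update, with the learning rate α decaying as 1/(1 + visits(s, a)), might look as follows in Python; the table layout and the exact bookkeeping of the visit counts are illustrative choices, not prescribed by the text.

from collections import defaultdict

Q = defaultdict(float)       # table entry Q(s, a), implicitly initialized to 0
visits = defaultdict(int)    # how often (s, a) has been updated so far

def q_update(s, a, r, s_next, next_actions, gamma=0.9):
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    # Equation 2.7: blend the old estimate with the new temporal difference target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    visits[(s, a)] += 1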
Other Q-value update equations exist, but the reader is referred to the
books by Sutton and Barto (1998) and Mitchell (1997) for a more complete
overview of learning strategies.
One improvement that is often used to reduce the time until convergence is to
store the (state, action) pairs together with the received rewards until the
agent reaches a goal-state1 and then compute the Q-values of the encountered
states and actions backwards. This allows the Q-values to spread more rapidly
throughout the state space. One series of (state, action) pairs is called an
episode (Langley, 1994; Mitchell, 1997).
This transforms the Q-learning algorithm into Algorithm 2.2.
1 The reached state does not have to be a terminal state. The backpropagation of the
encountered rewards can be started at any time. However, the resulting speedup will be
larger when backpropagation is only performed when a substantial reward has been received.
Algorithm 2.2 The Q-learning algorithm with Bucket-Brigade updating.
for each s ∈ S, a ∈ A do
initialize table entry Q(s, a)
end for
repeat {for each episode}
generate a starting state s0
i←0
repeat {for each step of episode}
select an action ai and execute it
receive an immediate reward ri = r(si , ai )
observe the new state si+1
i←i+1
until si is terminal
for j = i − 1 to 0 do
Q(sj, aj) ← rj + γ max_{a'} Q(sj+1, a')
end for
until no more episodes
2.3.3.1 Q-value Function Generalization
Up to now, the description of the Q-learning algorithm has assumed that a
separate Q-value is stored for each possible (state, action) pair. This is known
as table-based Q-learning and is limited in practice by the number of different
Q-values that need to be memorized. The number of Q-values is equal to the
number of different (state, action) pairs. Not only does table-based Q-learning
require a memory location for each (state, action) pair, which quickly results
in impractical amounts of required memory space, but also the convergence
of the Q-values to those of the optimal Q∗ -function only occurs when every
(state, action) pair is visited often enough, thus increasing execution times of
the Q-learning algorithm when the state-action space grows large.
This makes Q-learning suffer from the "Curse of Dimensionality" (Bellman,
1961). Since the number of states and the number of actions grow exponentially with the number of used attributes, both the required amount of memory
and the execution time of the Q-learning algorithm also grow exponentially.
To make Q-learning feasible in larger environments one often employs a
kind of generalization function to represent the Q-function. This generalization is built by a regression algorithm which uses Q-value examples (consisting
of a state, an action and a Q-value) encountered during exploration of the environment. The regression algorithm builds an approximate Q-function Q̂, that
predicts approximate Q-values for all (state, action) pairs, even the ones that
were never encountered during exploration. Because this regression task will
be crucial in the rest of this work, it will be defined more explicitly.
Definition 2.9 (Regression) A Regression Task can be defined as follows:
Given a set of learning examples ei ∈ E with a continuous target value ti ,
Build a function
F̂ : E → R
that generalizes over the seen examples and predicts the target values for unseen
examples, thereby minimizing or maximizing a given criterion.
The criterion used can vary from, for example, an accuracy measure such
as the mean squared error to the performance of the associated policy when
generalizing over Q-values.
The use of regression for Q-learning not only reduces the amount of memory
and computation time needed but also enables the learner to make predictions
of the quality of unseen (state, action) pairs.
Q-value generalization is performed with a number of different regression
methods such as neural nets, statistical curve fitting, regression trees, instance
based methods and so on. Special care has to be taken when choosing a regression method for Q-learning as not all regression methods are suited to
deal with the specific requirements of the Q-learning setting. Learning the
Q-function happens online while interacting with the environment. This requires regression algorithms that can deal with an almost continuous stream
of incrementally arriving data, and are able to handle a non-stationary target
function, also known as moving target regression. Regression algorithms used
in this context include neural networks (Tesauro, 1992; Mahadevan et al., 1997)
and decision trees (Smart and Kaelbling, 2000).
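The following sketch shows the general pattern of such a Q-function generalization: the table is replaced by an incremental regression model that is trained on Q-value examples and queried for predictions. The feature function phi(s, a) and the simple gradient-descent linear model are illustrative assumptions; any incremental regression technique could take their place.

class LinearQApproximator:
    """Approximate Q-function trained incrementally on Q-value examples."""

    def __init__(self, n_features, learning_rate=0.01):
        self.w = [0.0] * n_features
        self.lr = learning_rate

    def predict(self, phi_sa):
        # phi_sa: feature vector describing one (state, action) pair
        return sum(w * x for w, x in zip(self.w, phi_sa))

    def update(self, phi_sa, q_target):
        # one stochastic gradient step towards the (moving) Q-value target
        error = q_target - self.predict(phi_sa)
        self.w = [w + self.lr * error * x for w, x in zip(self.w, phi_sa)]

Each example produced by the Q-learning algorithm is turned into a pair (phi(s, a), q) and passed to update, while predict replaces the table lookup in the action-selection and update steps.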
2.3.3.2 Exploration vs. Exploitation
Q-learning is exploration insensitive. This means that the Q-values converge to
the correct values no matter what policy is used to select actions, provided that
each (state, action) pair is visited often enough. This allows the reinforcement
learning agent to adapt its exploration strategy and try to collect as much new
information as quickly as possible. In online learning problems, where the agent
is not only supposed to be learning, but also to reach some minimal performance
on the task, it may have to give up some exploration opportunities and exploit
what it has learned so far.
Several mechanisms for selecting an action during the execution of the Q-learning algorithm exist. For a detailed discussion of exploration techniques the
reader can consult the work by Wiering (1999) and Kaelbling et al. (1996). A
few frequently used strategies are discussed here. Greedy strategies choose the
action with the highest Q-value to optimize the expected payoff. However, to
ensure a sufficient amount of exploration to satisfy the convergence constraints
of the Q-learning algorithm, one often has to fall back on an ε-greedy approach,
where, with a small probability ε, instead of the optimal action a random action
is chosen.
Randomized selection techniques include for example Boltzmann exploration
where an action is chosen according to the following probability distribution:
P(a|s) = e^{Q(s,a)/T} / Σ_{a'} e^{Q(s,a')/T}                (2.8)
The temperature T can be lowered over time. Whereas a high temperature
causes a lot of exploration, a low temperature places a high emphasis on the
computed Q-values.
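A small sketch of Boltzmann action selection according to Equation 2.8 is given below; q_values maps each available action to its current Q-value and T is the temperature. The names and the numerical stabilization are illustrative choices.

import math
import random

def boltzmann_select(q_values, T):
    actions = list(q_values)
    q_max = max(q_values.values())
    # subtracting the maximum Q-value leaves the distribution unchanged
    # but avoids overflow in the exponentials
    weights = [math.exp((q_values[a] - q_max) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]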
Interval-based exploration computes an upper confidence bound on the Q-value
of each action in a state and then chooses the action with the
highest upper bound. This ensures initial exploration of states, decreases
the exploration when confidence in the computed Q-values grows, and turns
out to work well in practice. It does, however, require some form of confidence
interval estimation.
2.4 Conclusions
This chapter introduced the reinforcement learning framework as a machine
learning problem in which the learning system is trying to discover a beneficial policy with respect to the quantitative feedback it receives. The relation
between reinforcement learning and planning was briefly discussed.
A short overview of possible solution methods for reinforcement learning
was presented, including a few model based approaches such as value iteration.
Of the model-free approaches, such as SARSA and Q-learning, the latter was discussed in more
detail as it will be used as the basis for the relational reinforcement learning
system that will be developed in Chapter 4. Two subtasks of the Q-learning
algorithm were discussed, i.e., Q-function generalization to be able to deal with
large state-action spaces and the tradeoff between exploration and exploitation
which will either let the agent profit from what it has already learned or allow
it to discover new and possibly more beneficial areas of the state-action space.
Chapter 3
State and Action Representation
“I represent what everyone is afraid of.”
Bowling for Columbine
3.1 Introduction
The previous chapter discussed the problem setting of reinforcement learning
and introduced Q-learning as a possible solution algorithm but paid no attention to the format used to represent the states and actions that the agent
deals with. In this chapter, an overview is given of different representation
possibilities. They are ordered according to their expressiveness.
The representation of a learning problem can have a great impact on the
performance, both of the learning algorithm and of its results. Although most
work in reinforcement learning is situated in the propositional or attribute-value setting, recently the interest in using more expressive representations has
grown. An overview of higher order representational issues in reinforcement
learning is given by van Otterlo (2002).
3.2 A Very Simple Format
The simplest format is the use of numbered or enumerated states and actions. This leads to a very simple interaction between the agent and its environment
and forces the agent to represent its Q-function as a table, as almost no
possibility of state- or action-interpretation is given. This
representation can be used when little or no information about the learning
task is available.
An example interaction would be the following:
Environment: You are now in state 32. You have 3 possible actions.
Agent: I take action 2.
Environment: You receive a reward of -2. You are now in state 14. You have
5 possible actions.
Agent: I take action 5.
Environment: You receive a reward of +6. You are now in state 46. You
have 3 possible actions.
Agent: ...
Although this format is limited, there are applications where it is sufficient
to represent the problem domain. One example is the game of
Blackjack, a casino game where the object of the game is to obtain cards such
that the sum of the numerical values is as high as possible, without exceeding
21. An Ace can count for either 11 or 1 and all face cards count as 10. The
state description in this game is given by the sum of the card values in the
agent's hand, possibly augmented by whether it is holding a usable Ace
(i.e., an Ace that is counted as 11 and can still be downgraded to count as
a 1) and the value of the dealer card. In each state of the game, there are
2 possible actions: hit (action 1) or stick (action 2). When the sum of the
cards is below 12, the player should always hit, since there is no possibility
of breaking 21 with the next card. For this reason and because the game is
over as soon as the player breaks 21, only states where the values of the cards
add up to a total between 12 and 21 need to be represented. These 10
different values combined with the 10 different values of the dealer’s card and
the possible existence of a usable ace, lead to a total of 200 states that need to
be represented, each with 2 possible actions. This representation is sufficient
to calculate an optimal strategy for — this slightly simplified version of — the
Blackjack game (Sutton and Barto, 1998).
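The state space described above is small enough to enumerate explicitly. A sketch of this enumeration is given below, under the assumption that the dealer card is encoded by its value 1 to 10 (counting an Ace as 1 and face cards as 10); the variable names are illustrative.

from itertools import product

hand_values = range(12, 22)          # sum of the agent's cards: 12 up to 21
dealer_cards = range(1, 11)          # value of the dealer card: Ace (1) up to 10
usable_ace = (True, False)

states = list(product(hand_values, dealer_cards, usable_ace))
actions = (1, 2)                     # action 1: hit, action 2: stick

assert len(states) == 200            # the 200 states mentioned in the text

# a table-based Q-function for this representation needs only 400 entries
Q = {(s, a): 0.0 for s in states for a in actions}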
Intermezzo: Afterstates
Even with this very simple representation, a limited form of interpretation of the encountered (state, action) pairs is possible through the use of afterstates (Sutton and
Barto, 1998). An afterstate is the state that is the partial result from taking a
certain action in a certain state. For example in a two player game, it is usually easy to predict the initial response of the environment to a selected move
(referred to as the partial result of the action), but it might be impossible to
predict the end-results of the chosen action as this includes the counter move
of the opponent.
If the initial or partial dynamics of the environment are simple enough
(or if the environmental observations include this information) the afterstate
of a (state, action) pair can be used instead of the (state, action) pair itself to
predict the value of the given (state, action) pair. This leads to a generalization
over (state, action) pairs that have the same afterstate and can reduce the
number of values that need to be remembered by a factor as large as the
average number of actions possible in a state. Of course, the use of afterstates
is not possible in environments where the outcome of an action is completely
stochastic such as the Blackjack game discussed above.
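A sketch of how afterstates can be used to share value estimates is given below. The function afterstate_fn, which computes the partial result of executing an action in a state, is a domain-specific ingredient that is assumed to be supplied; all other names are illustrative.

from collections import defaultdict

class AfterstateValues:
    def __init__(self, afterstate_fn):
        self.afterstate_fn = afterstate_fn   # assumed, domain-specific
        self.V = defaultdict(float)          # one value per afterstate

    def value(self, state, action):
        # all (state, action) pairs with the same afterstate share this entry
        return self.V[self.afterstate_fn(state, action)]

    def update(self, state, action, target, alpha=0.1):
        key = self.afterstate_fn(state, action)
        self.V[key] += alpha * (target - self.V[key])

In a two-player game, afterstate_fn would, for example, return the board position directly after the agent's move.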
3.3 Propositional Representations
A more expressive format than state enumeration is to represent states and
actions in a propositional format. This corresponds to describing each state
(and possibly the action as well) as a feature vector with an attribute for
each possible property of the agent’s environment. The domain of each of
the attributes can vary from being Boolean over enumerable ranges, to being
continuous.
The game of tic-tac-toe (Figure 3.1) is a reinforcement learning problem
that translates naturally into a propositional representation. There are 9
squares that can be empty or have a cross or a circle in them, so a state
can be represented as a feature vector of length 9, with each attribute ∈
{empty, cross, circle}.
This widely studied propositional representation format allows for a large
array of possible generalization techniques to be used to approximate the Q-function. Among these, the most commonly used are neural networks for
numerical attribute vectors (Tesauro, 1992; Mahadevan et al., 1997) and decision trees for Boolean or enumerable feature vectors (Smart and Kaelbling,
2000). The reader is referred to the work of Sutton and Barto (1998) and
Kaelbling et al. (1996) for a larger overview of value-function generalization
techniques in a propositional setting.
In the tic-tac-toe example above, the agent could learn that not placing a
cross in the bottom-middle square when the feature vector has circle for the
second and fifth attribute and empty for the eighth attribute will result in
losing the game and receiving the corresponding negative reward, whatever the
values of the attributes at other positions are. Such a set of states could be
represented as: [?, circle, ?, ?, circle, ?, ?, empty, ?] where the value “?” denotes
that it doesn't matter what the value of the attribute is. This will generalize
the Q-value of 36 (state, action) pairs with one entry.
Figure 3.1: A few example states of the Tic-tac-toe game and the corresponding
feature vectors: [empty, empty, empty, empty, circle, empty, empty, empty, empty],
[empty, circle, cross, empty, circle, empty, empty, empty, empty] and
[cross, circle, cross, empty, circle, empty, empty, circle, empty].
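The generalization just described can be pictured as a single pattern with "?" wildcards that stands for every concrete feature vector it matches, as in the following sketch; the list encoding is an illustrative assumption.

WILDCARD = "?"

pattern = [WILDCARD, "circle", WILDCARD,
           WILDCARD, "circle", WILDCARD,
           WILDCARD, "empty", WILDCARD]

def matches(pattern, state):
    return all(p == WILDCARD or p == v for p, v in zip(pattern, state))

state = ["cross", "circle", "empty",
         "empty", "circle", "cross",
         "empty", "empty", "circle"]
print(matches(pattern, state))   # True: this state is covered by the pattern

A single Q-value stored for this pattern then covers all matching (state, action) pairs at once.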
Propositional representations have problems dealing with states that are
defined by the objects that are present, their properties and their relations to
other objects. It is not possible to capture regularities between the different
attributes to generalize over different states. In the representation employed
above, it is for example not possible to state that not placing a cross next
to any two adjacent circles will result in losing the game. More examples of
environments which are hard to represent using a propositional representation
will be given in Section 3.5.
3.4 Deictic Representations
Propositional representations are not suited to deal with environments that
include sets of objects, i.e., environments where the number of objects varies
or is unknown at the start of learning. Deictic representations deal with the
varying numbers of objects that can be present in the environment by defining
a focal point for the agent. The rest of the environment is then defined in
relation to that focal point.
One example where a deictic representation seems the natural choice is when
giving directions to someone who is lost. When a person is in unknown territory,
people automatically switch to a deictic representation. Instead of referring
to specific streets by their names or to specific crossroads, one usually uses
constructs such as:
• The street on your left.
• The second crossroad.
These constructs only make sense in relation to a focal point. While giving
directions this is usually the starting location or the location one should have
reached by following the previous set of directions.
The following are some deictic constructs that refer to objects:
• The last person you talked to.
• The sock on your left foot.
• The text you are reading.
In Q-learning one has to explore the entire state-action space. When using
deictic representations, the relativity of the state representation to a focal point
means that the different possible focal points in one environment state also have
to be explored. When the location of the focal point is regarded as a part of the
current state description, this can cause a substantial increase in the complexity
of a learning problem.
In early experiments with deictic representations for reinforcement learning
and Q-learning in particular, the extra complexities caused by manipulation
of the focal point of the agent made deictic representations fail to improve on
Q-learning using propositional representations (Finney et al., 2002).
3.5 Structural (or Relational) Representations
The real world is filled with objects. Objects that display certain properties
and objects that relate to each other. To deal with an environment filled
with this kind of structure, a structural representation of both world states
and actions that can represent objects and their properties and relations is
necessary (Kaelbling et al., 2001; Džeroski et al., 2001).
One example where reinforcement learning could be used, but a relational
representation would be needed to describe states and actions, are role playing
games such as Neverwinter Nights or Final Fantasy. In these computer games,
a player controls a varying number of different characters. The goal of the game
is not only to survive the presented scenario, but also to develop the available
characters by improving their characteristics, making them learn new abilities
and gathering helpful objects. This can be accomplished by fighting varying
numbers of foes or completing certain quests. Figure 3.2 shows a screen shot
of a battle scene from Final Fantasy X.
Even when just looking at the battle part of the role playing game, representing such a state using a propositional array of state-features is problematic
if not impossible:
1. The number of characters in an adventuring party varies during the game.
Also the number of opponents in a battle may be unknown.
Figure 3.2: A screen shot of a battle in the Final Fantasy X role playing game.
2. Different characters usually exhibit different types and even a different
number of abilities. A fighter character will usually physically attack enemies (although the character might develop a number of different attacks)
while a wizard character usually has a range of different spells that can
be used, varying from healing and protective spells to offensive,
i.e., damage dealing, spells.
3. Available objects usually exhibit different properties when wielded by
different characters. For example, a fighter character will usually make
better use of a large weapon than a wizard character but the differences
are often much more subtle than this.
4. A feature often present in this kind of role playing games is the relative
strength of characters and their foes. A certain character can be stronger
against certain types of foes. A given foe can be more vulnerable against
magic than against physical attacks or even different types of magic. The
strength of a weapon can depend on the type of armor currently worn by
the intended target.
The actions available in a role playing game can also be hard to represent in
a propositional way as the player usually needs to specify not only the preferred
action, but also a number of options and the intended target. A magic spell or
a special attack for example might have multiple targets.
All of these features are hard to represent using a single feature vector.
However, given a multi-relational database, it would be easy to construct a
lossless representation of such a game state.
Different representation formats are available that can represent worlds
with objects and relations. Most popular is the relational database representation.
Two other formats will be presented in the following subsections.
3.5.1 Relational Interpretations
A first representation that will be used is that of relational interpretations, as
used in the “learning from interpretations” setting (De Raedt and Džeroski,
1994; Blockeel et al., 1999). In this notation, each (state, action) pair will be
represented as a set of relational facts. This is called a relational interpretation.
Figure 3.3: The delivery robot in its environment.
Consider the package delivery robot of Figure 3.3. The robot’s main task
is to deliver the available packages to their intended destination as fast as
possible. It can carry multiple packages at the same time, with limitations
based on the cumulative size of the packages. The robot is equipped with
some navigational abilities so that the actions available to the robot consist of
moving to any adjacent room or picking up or dropping a package. Packages
appear at random intervals and locations with random destinations.
As the number of objects in this environment, i.e., the packages, is variable,
it is hard to represent all possible (state, action) pairs using a fixed length
feature vector. Figure 3.4 shows the state of Figure 3.3 represented as a relational interpretation, or a set of facts. Additional objects appearing during
interaction with the world will lead to additional facts becoming part of the
set.
location(r2). carrying(p2). maximumload(5).
package(p1). package(p2). package(p4).
destination(p1,r3). destination(p2,r4). destination(p4,r3).
size(p1,3). size(p2,1). size(p4,3).
location(p1,r4). location(p2,r2). location(p4,r2).
Figure 3.4: The state of the delivery robot represented as a relational interpretation.
The action can be represented by an additional fact such as "move(south)"
or “pickup(p4)”. The relational state description can be extended by providing
background knowledge in the form of logical clauses. These clauses can derive
more complex knowledge about the (state, action) pair such as what packages
can be carried together.
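To make this concrete, the following sketch encodes the relational interpretation of Figure 3.4 as a set of ground facts, here represented as Python tuples, together with one piece of background knowledge computed from those facts. The tuple encoding and the helper function are illustrative assumptions, not the representation used by the actual RRL implementation.

state = {
    ("location", "r2"), ("carrying", "p2"), ("maximumload", 5),
    ("package", "p1"), ("package", "p2"), ("package", "p4"),
    ("destination", "p1", "r3"), ("destination", "p2", "r4"), ("destination", "p4", "r3"),
    ("size", "p1", 3), ("size", "p2", 1), ("size", "p4", 3),
    ("location", "p1", "r4"), ("location", "p2", "r2"), ("location", "p4", "r2"),
}
action = ("pickup", "p4")

def loadable_packages(state):
    # background knowledge: which packages would still fit within the maximum load
    max_load = next(f[1] for f in state if f[0] == "maximumload")
    size = {f[1]: f[2] for f in state if f[0] == "size"}
    carried = sum(size[f[1]] for f in state if f[0] == "carrying")
    return {f[1] for f in state
            if f[0] == "package"
            and ("carrying", f[1]) not in state
            and carried + size[f[1]] <= max_load}

print(loadable_packages(state))   # the packages p1 and p4 still fit within the load limit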
3.5.2 Labelled Directed Graphs
Another representation format that can be used to represent relational data
is a graph (Diestel, 2000; Korte and Vygen, 2002). Graphs are particularly
well suited for representing worlds with a lot of structure, where objects are
defined by their relation to other objects, more than by their own properties.
An example of this can be found in navigational tasks, where the agent needs
a representation of all possible paths available to it. Figure 3.5 shows part of
a road map and the corresponding graph that can be used to represent the
environment. The agent is located at the intersection labelled “current position” and wants to get to Paradise. In case of one-way streets, the graph could
be changed to a directed graph, with an edge for each possible travelling
direction.
An action, e.g. travelling between two adjacent intersections, can be represented by an additional, labelled, edge. By only representing the structure of
the environment and not, for example, the individual names of the highways
and streets, the reinforcement learning agent will be forced to learn a general
policy that can be applied to similar environments. By using labelled graphs, it
remains possible however to supply extra information on travelling paths such
as speed limits or congestion probabilities.
In the case of the road map, the graph structure of the state representation
doesn't change, except for the "current position" label. This is not required,
however, as the following example will show.
Figure 3.5: A part of a road map and its representation as a labelled graph;
the current position and the goal are marked by the vertex labels {curpos} and {goal}.
3.5.3 The Blocks World
As a running example used throughout the following chapters, the blocks world
— which is well known in artificial intelligence research (Nilsson, 1980; Langley,
1994; Slaney and Thiébaux, 2001) — can easily be represented either as a
relational interpretation or as a graph.
The blocks world, as used in this work, consists of a constant number of
blocks and a floor large enough to hold all the blocks. Blocks can be on the
floor or can be stacked on one another. Only states with neatly stacked blocks
are considered, i.e. it is not possible for a block to be on top of two different
blocks.
The actions in the blocks world consist of moving a clear block, i.e. a block
that has no other block on top of it, onto another clear block or the floor.
Figure 3.6 shows a possible blocks world state and action pair. The shown
state is in a blocks world with 5 blocks. The blocks are given a number to
establish their identity. The action of moving block 3 onto block 2 is represented
by the dotted arrow.
Although the blocks world is a good testing environment, a goal has to be
set for the agent to make it a reinforcement learning task. In this work, both
specific goals (e.g. “Put block 3 on top of block 1”) and more general goals
(e.g. “Put all the blocks in one stack”) will be considered.
The blocks world can be easily represented in the two proposed formats. As
a relational interpretation, the blocks world state is represented by the “clear”
and "on" predicates. The action is represented by the "move" predicate and
the goal, if necessary, can be represented by a "goal" predicate. Figure 3.7
shows the relational representation of the blocks world (state, action) pair of
Figure 3.6. Possible extensions of the (state, action) representation that can
be added as background knowledge include for example "above(3,5)" and
"numberOfStacks(3)".
Figure 3.6: An example state of a blocks world with 5 blocks, numbered 1 to 5.
The action is indicated by the dotted arrow.
on(1,5). on(2,floor). on(3,1). on(4,floor). on(5,floor).
clear(2). clear(3). clear(4).
move(3,2).
Figure 3.7: The blocks world state represented as a relational interpretation.
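Background knowledge predicates such as "above" and "numberOfStacks" can be derived from the ground facts of Figure 3.7. The sketch below does this in Python over a simple tuple encoding of the "on" facts; the encoding and function names are illustrative assumptions.

on = {("1", "5"), ("2", "floor"), ("3", "1"), ("4", "floor"), ("5", "floor")}

def above(x, y, on_facts=on):
    # above(X,Y): X is somewhere higher than Y in the same stack
    support = dict(on_facts)            # block -> what it rests on
    z = support.get(x)
    while z is not None and z != "floor":
        if z == y:
            return True
        z = support.get(z)
    return False

def number_of_stacks(on_facts=on):
    # every block resting directly on the floor starts a stack
    return sum(1 for (_, s) in on_facts if s == "floor")

print(above("3", "5"))       # True, corresponding to the fact above(3,5)
print(number_of_stacks())    # 3, corresponding to numberOfStacks(3)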
Figure 3.8 shows the graph representation of the same (state, action) pair.
Each block, together with the sky and the table, is represented by a vertex, while
the “clear” and “on” relations are represented by directed edges. The action is
represented by an additional labelled edge, and the extra labels a1 and a2 . If
needed, additional edges can be used to represent the intended goal.
More detailed information about the used blocks world representations can
be found in Appendix A.
Some problem domains will fit easily into a representation as a relational
interpretation, while others may be easier to represent as a graph. In the next
two chapters no distinction will be made between these representations. It will
be assumed that a representation for states and actions is available in which it
is possible to represent objects as well as their relations.
Figure 3.8: The blocks world state represented as a labelled directed graph,
with vertices v0 to v6 labelled {floor}, {block} and {clear}, directed edges
labelled {on} and {action}, and the extra labels a1 and a2 marking the two
blocks involved in the action.
3.6 Conclusions
This chapter gave an overview of possible representation formats with increasing expressiveness, ranging from simple state and action enumeration, through
attribute-value and deictic representations, to relational representations. Two
relational or structural representation possibilities were introduced: relational
interpretations and labelled directed graphs.
Chapter 4
Reinforcement Learning in a Relational Environment
“It is our belief that the message contains instructions for
building something, some kind of machine.”
Contact
4.1 Introduction
As argued in the previous chapter, relational problem domains require their
own representational format. This eliminates the use of fixed size arrays as
state and action representations and thereby greatly limits the use of standard
reinforcement learning or Q-learning techniques as many of them depend on
the fact that the state-action space can be regarded as a vector space. In a
relational application, no limits are placed on the size or even on the dimensions
of the state space.
Problems in relational or structural domains tend to be very large as the
size of the state space increases quickly with the number of objects that exist
in the environment. To deal with this problem, the goal of using relational
reinforcement learning should be to abstract away from the specific identity of
states, actions or even objects in the environment and instead identify states
and actions only by referring to objects through their properties and the relations between the objects. For example, a household robot would be required
to wash and iron any shirt the owner buys, instead of just the shirts that originally came with the purchase of the robot. By defining a shirt by its properties
instead of its specific identity, any object made of cotton with two sleeves,
a collar, buttons in front and the shape to fit one human would be handled
correctly by the robot.
The same remarks can be made for the environment. Different environments
should be generalized over by using their structure instead of learning to behave
in their specific layout. Instead of having to buy a robot together with a house
in which it can move around, the robot should be able to find its way around
any house and be able to identify the kitchen by the objects located inside.
This chapter introduces relational reinforcement learning and presents the
RRL system, one possible approach to relational reinforcement learning that
uses a kind of relational Q-learning.
Section 4.2 defines the relational reinforcement learning task. In Section 4.3,
a general algorithm for relational Q-learning is presented. This will introduce
the need for incremental first order regression and provide a framework for the
next chapters that will discuss various regression techniques that can be used in
the RRL system. This chapter concludes by discussing the first, prototypical
implementation of an RRL system and by presenting an overview of some
closely related work.
4.2 Relational Q-Learning
The relational reinforcement learning task is very similar to the regular reinforcement learning task except that a relational representation is used for the
states and actions.
Definition 4.1 (Relational Reinforcement Learning) The relational reinforcement learning task can be defined as follows:
Given:
1. A set of possible states S, represented in a relational format,
2. A set of possible actions A, also represented in a relational format,
3. An unknown transition function δ: S × A → S, (this function can be
nondeterministic)
4. A real-valued reward function r: S × A → R,
5. Background knowledge generally valid about the environment.
Find a policy for selecting actions π ∗ : S → A that maximizes a value function:
V π (st ) for all st ∈ S.
The states of S need not be explicitly enumerated. They can be defined
through the alphabet of the chosen representational format. For the actions,
the learning system will only have to deal with the actions applicable to a given
state, denoted by A(s). The reward function r can be substituted by a “goal”
function: goal : S → {true,false}. The reward function will then be defined as
discussed in the intermezzo on page 16.
The value function V π can be defined as discussed in Section 2.2.2. In
this work, only the definition of Equation 2.1 (i.e., the discounted cumulative
reward) is considered.
The background knowledge at this point covers a large number of different
kinds of information. Possibilities include information on the shape of the value
function, partial knowledge about the effect of actions, similarity measures
between different states and so on. Specific forms of background knowledge
and how they are used to help learning will be discussed later. The background
knowledge can for example include predicates that derive new relational facts
about a given state when using relational interpretations.
For the blocks world example from Section 3.5.3, the set of all states S would
be the set of all possible blocks world configurations. If the number of blocks
is not limited, there exists an infinite number of states. The possible actions
in the blocks world depend on the state. They consist of moving any block
that is clear onto another clear block or onto the floor. The transition
function δ in the blocks world as used throughout this work is deterministic,
i.e., all actions will have the expected consequences. The reward function r will
depend on the goal defined in the learning experiment and will be computed
as defined in the intermezzo on page 16. Background knowledge in the blocks
world could for instance include which blocks belong to the same stack or even
the number of stacks in a given state.
4.3 A Relational Reinforcement Learning System (or RRL system)
The RRL system was designed to solve the relational Q-learning problem as
defined in the previous section. Some intuition behind the approach is given,
followed by a general algorithm describing the RRL system’s approach.
4.3.1 The Suggested Approach
The RRL system follows the same reinforcement learning approach as most Q-learning algorithms that use Q-function generalization. Instead of representing
the Q-values in a lookup table, the RRL system uses the information it collects about the Q-values of different (state, action) pairs to allow a regression
algorithm to build a Q-function generalization.
The difference between the RRL system and other Q-learning algorithms
using function generalization lies in the fact that the RRL system employs a
Algorithm 4.1 The Relational Reinforcement Learning Algorithm
initialize the Q-function hypothesis Q̂0
e←0
repeat {for each episode}
Examples ← ∅
generate a starting state s0
i←0
repeat {for each step of episode}
choose ai for si using a policy derived from the current hypothesis Q̂e
take action ai , observe ri and si+1
i←i+1
until si is terminal
for j = i − 1 to 0 do
generate example x = (sj , aj , q̂j ) where q̂j ← rj + γ max_a Q̂e (sj+1 , a)
Examples ← Examples ∪ {x}
end for
Update Q̂e using Examples and a relational regression
algorithm to produce Q̂e+1
e←e+1
until no more episodes
relational representation of the environment and the available actions and that
a relational regression algorithm is used to build a Q-function.
This relational regression algorithm avoids using the specific identities
of states, actions and objects in its Q-function and instead relies on the structure and relations present in the environment to define similarities between
(state, action) pairs and to predict the appropriate Q-value.
4.3.2 A General Algorithm
Algorithm 4.1 presents an algorithm to solve the previously defined task. It
uses a relational regression algorithm to approximate the Q-function. The
exact nature of this algorithm is not yet specified. The development of such a
regression algorithm will be the subject of the following chapters.
The algorithm starts by initializing the Q-function. Although this usually means that the regression engine returns the same default value for each
(state, action) pair, it is also possible to include some initial information about
the learning environment in the Q-function initialization.
The algorithm then starts running learning episodes like any standard Q-learning algorithm (Sutton and Barto, 1998; Mitchell, 1997; Kaelbling et al.,
1996). For the exploration strategy, the system translates the current Q-
function approximation into a policy using Boltzmann statistics (Kaelbling et
al., 1996).
During the learning episode, all the encountered states and the selected
actions are stored, together with the rewards connected to each encountered
(state, action) pair. At the end of each episode, when the system encounters
a terminal state, it uses reward back-propagation and the current Q-function
approximation to compute the appropriate Q-value approximation for each
encountered (state, action) pair.
The algorithm then presents the set of (state, action, qvalue) triplets to a
relational regression engine, which will use this set of examples to update the
current Q-function estimate, after which the algorithm continues with the
next learning episode.
The algorithm described is the one that will be used in the rest of the text
and will be referred to as the RRL algorithm. A few choices were made when
designing the algorithm that do not have a direct influence on the usability
of the relational reinforcement learning technique in general. For example,
the RRL system uses Boltzmann based exploration but any other exploration
technique could also be used (See also Chapter 8).
Also, in the described system, learning examples are generated and the
regression algorithm is invoked at the end of an episode, i.e., when the policy
takes the agent into an end-state. While the backpropagation of the reward will
aid the system in generating more correct learning examples more quickly, this
setup is not imperative for the proposed technique. Another approach would
be to let the algorithm explore for a fixed number of steps before starting the
regression algorithm, or even to send a newly generated learning example to
the regression engine after each step. Although this could lead to a slower
convergence of the regression algorithm to a usable Q-function, the general
ideas behind the system would not be lost.
Figure 4.1: An example of a learning episode in the blocks world with 4 blocks
using "on(2,3)" as a goal; the three successive (state, action) pairs, executed
from left to right, are assigned the Q-values 0.81, 0.9 and 1.0, with a reward
of 1 received on reaching the goal state and 0 elsewhere.
Figure 4.1 shows a possible episode for a blocks world with 4 blocks with
“on(2, 3)” as a goal. The episode consists of 3 (state, action) pairs that have
been executed from left to right. When the terminal state is reached, after the
execution of the last action, the Q-values are computed from right to left,
yielding the values shown. At that point in time all three (state, action) pairs together with
their Q-values are presented to the regression algorithm. The actual format of
the examples will depend on the chosen regression algorithm.
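As an illustration of the inner loop of Algorithm 4.1, the sketch below turns one finished episode into the (state, action, qvalue) examples that are handed to the regression engine. The episode is assumed to be a list of (state, action, reward) triples in execution order; q_hat is the current Q-function approximation with a predict method and legal_actions enumerates the actions applicable in a state. All of these names are assumptions made for this illustration.

def episode_to_examples(episode, terminal_state, q_hat, legal_actions, gamma=0.9):
    examples = []
    next_state = terminal_state
    # walk backwards through the episode, as in the "for j = i-1 to 0" loop
    for state, action, reward in reversed(episode):
        future = max((q_hat.predict(next_state, a)
                      for a in legal_actions(next_state)), default=0.0)
        examples.append((state, action, reward + gamma * future))
        next_state = state
    return examples

The resulting list of triplets is what the RRL algorithm passes to the relational regression engine to produce the next hypothesis Q̂e+1.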
4.4 Incremental Relational Regression
The RRL system described above requires a regression algorithm, i.e., an algorithm that can make real-valued predictions for unseen (state, action) pairs
based on previously encountered examples. To work in the relational reinforcement learning system, the algorithm needs to be both incremental and able to
deal with a relational representation of the presented examples.
Definition 4.2 (The RRL Regression Task) The Regression Task for the
relational reinforcement learning system is defined as follows:
Given a continuous stream of (state, action, qvalue) triplets calculated by the
RRL system as described above,
Build and update a function
Q̂ : S × A → R
that generalizes over the seen (state, action, qvalue) examples and predicts Q-values for unseen (state, action) pairs such that the policy defined as:
∀s ∈ S : π̂(s) = argmax_{a∈A(s)} Q̂(s, a)
is an optimal policy with respect to the chosen value function V , i.e. the Utility
Function of Equation 2.1.
Several requirements for the regression algorithm are implicit in this definition.
Incremental (1) : The continuous stream of new (state, action, qvalue) triplets implies that the regression algorithm must be able to deal with incremental data. The RRL system will query the Q̂ function during the
learning process, not only after all examples have been processed.
Incremental (2) : The large number of examples presented to the learning
algorithm that is inherent to the design of the RRL system prevents the
algorithm from storing all encountered learning examples. This requirement is strengthened by the fact that all states and actions are represented
in a relational format, which usually results in larger storage requirements
than other, less expressive, representational formats.
Moving target : Because the learning examples are incrementally computed
by a Q-learning algorithm, the function that needs to be learned is not
stable during learning. As shown in Equation 2.6, the computations that
generate examples to learn the Q-function, make use of current estimates
of the same Q-function. This means that the examples of Q-values will
only gradually converge to the correct values and that early examples
will almost certainly be noisy. The regression algorithm should be able
to deal with this kind of learning data.
No vector space : The learning examples (as well as the examples on which
predictions have to be made) will be represented in a relational format.
This means that the algorithm cannot treat the set of all examples as
a vector space. Regression algorithms often rely on the fact that the
dimension of the example space is finite and (more importantly) known
beforehand. With relational representations, this is not the case. Although in all practical applications the dimension of the state space will
be finite, it will not be known at the start of the learning experiment and
may even vary during the experiment (e.g. when the number of objects in
the agent's environment varies). This prevents the algorithm from using
techniques such as local linear models, convex hull building and instance
averaging, which rely on the use of a vector space.
To the best of the author's knowledge, no work existed in the field of incremental relational regression algorithms. Although a number of first order
regression algorithms existed prior to this thesis (Karalič,
1995; Kramer and Widmer, 2000; Blockeel et al., 1998), none of these systems
is incremental. The next three chapters will discuss three newly developed
regression algorithms that take the requirements described above into account.
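Summarizing these requirements, the interface that the following chapters will have to realize can be sketched as follows; the class and method names are illustrative, not those of an existing implementation.

class IncrementalRelationalRegressor:
    """Maintains the Q-function approximation under a stream of examples."""

    def update(self, state, action, q_value):
        """Process one (state, action, qvalue) example and then discard it,
        so that memory use stays bounded; must tolerate a moving target,
        i.e. later examples that contradict earlier ones."""
        raise NotImplementedError

    def predict(self, state, action):
        """Return the current Q-value estimate for a relationally represented
        (state, action) pair, possibly one that was never seen before."""
        raise NotImplementedError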
4.5 A Proof of Concept
A first prototype of a relational reinforcement learning system that uses a
relational form of Q-value generalization was built by Džeroski et al. (1998).
It used a relational interpretations representation of states and actions and an
off-the-shelf regression tree algorithm Tilde (Blockeel and De Raedt, 1998) to
construct the Q-function. The pseudo code of the system is given in Algorithm
4.2. A more elaborate discussion can be found in (Džeroski et al., 2001).
The relational regression tree that represents the Q-function in the original
RRL system is built starting from a knowledge base which holds examples
of state, action and Q-function value triplets. To generate the example-set for
tree induction, the system stores all encountered (state, action, qvalue)-triplets
into a knowledge base that is then used for the tree induction by Tilde.
Algorithm 4.2 The algorithm used by the proof of concept implementation.
initialize Q̂0 to assign 0 to all (s, a) pairs
Examples ← ∅
e←0
repeat
generate an episode that consists of states s0 to si and actions a0 to ai−1
(where aj is the action taken in state sj ) through the use of a standard
Q-learning algorithm, using the current hypothesis for Q̂e
for j = i − 1 to 0 do
generate example x = (sj , aj , q̂j ),
where q̂j = rj + γ max_{a'} Q̂e (sj+1 , a')
if (sj , aj , q̂old ) ∈ Examples then
Examples ← Examples ∪ {x}\{(sj , aj , q̂old )}
else
Examples ← Examples ∪ {x}
end if
end for
use Tilde and Examples to produce Q̂e+1
e←e+1
until no more learning
Originally a classification algorithm, Tilde was later adapted for regression
(Blockeel et al., 1998). Tilde was not an incremental algorithm, so all the
encountered (state, action) pairs were remembered together with their Q-value.
Therefore, after each learning episode, the newly encountered (state, action)
pairs and their newly computed Q-values are added to the example set and
a new regression tree is built from scratch. Note that old examples are kept
in the knowledge base at all times and never deleted. The system avoids the
presence of contradictory examples in the knowledge base by replacing the old
example with a new one that holds the updated Q-value, if a (state, action)
pair is encountered more than once.
Problems with the original RRL
While the results of this prototype implementation demonstrated that it is
indeed possible to learn a relational Q-function and that this function can
be used to generalize not only over states and actions but also over related
environments, the non-incremental nature of the system limits its usability.
Four problems with the original RRL implementation can be identified that
diminish its performance.
1. The original RRL system needed to keep track of an ever increasing
number of examples: for each different (state, action) pair ever encountered a Q-value is kept. This caused a large amount of memory to be
used.
2. When a (state, action) pair is encountered for the second time, the new Q-value needs to replace the old value. This means that each encountered
(state, action) pair needs to be matched against the entire knowledge
base, to check whether an old example needs to be replaced.
3. Trees are built from scratch after each episode. This step, as well as the
example replacement procedure, takes increasingly more time as the set
of examples grows.
4. A final point is related to the fact that in Q-learning, early estimations
of Q-values are used to compute better estimates. In the original implementation, this leads to the existence of old and probably incorrect
examples in the knowledge base. An existing (state, action, qvalue) example gets an updated Q-value at the moment when exactly the same
(state, action) pair is encountered, but in structural domains, where there
is usually a large number of states and actions, this doesn't occur very
often. In the original implementation, no effort is made to forget old or
incorrect learning examples.
Most of these problems stem from the fact that Tilde expects the full set
of examples to be available when it starts. To solve these problems, a fully
incremental relational regression algorithm is needed as discussed in Section
4.4. Such an algorithm avoids the need to regenerate the Q-function when new
learning examples become available.
4.6 Some Closely Related Approaches
Very recently, the interest in relational reinforcement learning problems has
grown significantly and a number of different approaches have been suggested.
This section highlights a few different routes that can be taken to handle relational reinforcement learning problems.
4.6.1 Translation to a Propositional Task
As shown in Chapter 3, a first step toward relational representations and object
abstraction is the use of a deictic representation. The use of a focal point in
the representation allows a fixed size array to be used to represent a world
with varying numbers of objects and allows for a limited representation of the
structure of the environment. The deictic representation deals with varying
numbers of objects by limiting the represented objects to the surroundings
of the focal point. The representation relies on the focal point to find the
appropriate part of the world state.
Finney et al. (2002) used a deictic representation in a blocks world environment. The use of a focus pointer and state features that describe the block
that is focussed on, as well as the blocks around it, allows representing a blocks
world with an arbitrary number of blocks. The focus pointer is controlled by
the learning agent as well and can be moved in four directions. As it turns
out, the extra complexity caused by the movement of the focus pointer (i.e.,
extra actions and longer trajectories to reach the intended goal) and the fact
that the problem is made partially observable by only representing parts of the
entire state cause Q-learning with deictic representations to perform worse
than expected.
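Purely as an illustration of what such an encoding can look like (the exact features of Finney et al. (2002) differ, and all accessor names below are invented), a deictic blocks world representation could map the neighbourhood of the focus block to a fixed-size feature vector:

```python
def deictic_features(state, focus):
    """Hypothetical fixed-size deictic encoding: only the focus block and its
    direct neighbourhood are described, however many blocks the world contains."""
    below = state.block_below(focus)   # assumed accessors on the state object
    above = state.block_above(focus)
    return (
        state.is_clear(focus),                        # nothing on top of the focus block
        below is None,                                # the focus block stands on the floor
        above is not None and state.is_clear(above),  # exactly one block on top of it
    )
```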
4.6.2 Direct Policy Search
Fern et al. (2003) use an approximate variant of policy iteration to handle
large state spaces. A policy language bias is used to enable the learning system
to build a policy from a sampled set of Q-values. Just like in standard policy
iteration (Sutton and Barto, 1998), approximate policy iteration interleaves
policy evaluation and policy improvement steps.
However, the policy evaluation step generates a set of (state, qvaluelist)
tuples (where the qvaluelist includes the Q-values of all possible actions) as
learning examples by sampling the state space instead of computing the utility
values for the entire state space. In the policy improvement step, a new policy
is learned from the learning examples according to the policy language bias.
The Q-values of the learning examples are estimated using policy roll-out,
i.e., generating a set of trajectories following the policy and computing the costs
of these trajectories. This step requires a model of the environment. Initial
results of this approach are promising, but the need for a world model limits
its applicability.
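As a rough illustration of the roll-out step (hypothetical Python; env_model, policy and all parameter values are placeholders rather than the setup of Fern et al. (2003)), the Q-value of a (state, action) pair can be estimated by simulating trajectories with a model of the environment:

```python
def rollout_q(env_model, policy, state, action, n_rollouts=10, horizon=50, gamma=0.9):
    """Estimate Q(state, action) by policy roll-out: execute the action in a model
    of the environment, then follow the given policy and average the discounted
    returns over a number of sampled trajectories."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a = state, action
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, reward, done = env_model.step(s, a)  # assumed model interface
            ret += discount * reward
            discount *= gamma
            if done:
                break
            a = policy(s)
        total += ret
    return total / n_rollouts
```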
4.6.3 Relational Markov Decision Processes
A lot of attention has gone to relational representations of Markov Decision
Processes (MDPs). These approaches use an abstraction of the state and action space to reduce the size of the learning problem. A distinction can be
made between approaches that use a predefined model of the environment and
algorithms that induce the state space abstraction.
Kersting and De Raedt (2003) introduce Logical Markov Decision Processes
as a compact representation of relational MDPs. They define abstract states as
a conjunction of first order literals, i.e., a logical query. Each abstract state represents the set of states that are covered by the logical query. Between these
abstract states, they define abstract transitions and abstract actions that represent sets of actions of the original problem. Abstract actions are defined in a STRIPS-like manner (Fikes and Nilsson, 1971), defining pre-conditions and post-conditions of the action. By this translation of the “ground” problem into a higher level representation, the number of possible (state, action) pairs and thus the number of Q-values that need to be learned is greatly reduced, and Kersting and De Raedt define a “Logical Q-learning” algorithm to accomplish this task.
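As an illustration only (this particular query is not taken from the cited paper), an abstract state in a blocks world could be the logical query
$$\exists A, B : \mathit{on}(A, B) \wedge \mathit{clear}(A)$$
which covers every ground state in which at least one clear block is stacked on some other block, independently of the total number of blocks in the world.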
Independent of this, Morales (2003) introduced rQ-learning, i.e., Q-learning
in R-Space. R-Space consists of r-states and r-actions which are very comparable to the abstract states and abstract actions in the work of Kersting and
De Raedt (2003). The rQ-learning algorithm tries to compute the Q-values of
the (r-state, r-action) pairs.
Van Otterlo (2004) defines the CARCASS representation that consists of
pairs of abstract states and the set of abstract actions that can be performed in
the given abstract state. While this is again comparable to the two previously
discussed approaches, van Otterlo not only defines a Q-learning algorithm for
his representation but also suggests learning a model of the relational MDP
defined by CARCASS to allow prioritized sweeping to be used as a solution
method.
A downside to all three techniques is that they require a hand-coded definition of the higher level states and actions. This can be cumbersome and greatly influences the performance of the approach. They also suffer from the fact that this translation causes the problem to become partially observable, so no convergence claims are made.
Boutilier et al. (2001) suggest a value-iteration (and thus value based)
approach that uses the Situation Calculus as a representation format for the
relational MDP. This creates a theoretical framework that can be used to derive
a partitioning of the state-space that distinguishes states and actions according
to the utility function of the problem. However, the complexities imposed
by the use of Situation Calculus and the need for sophisticated simplification
techniques have so far prevented a full implementation of the technique.
Very recently, Kersting et al. (2004) introduce a relational version of the
Bellman update rule called ReBel. Using this update rule they have devised
an algorithm that uses value iteration to automatically construct a state-action
space partitioning. In contrast to the Situation Calculus used by Boutilier et al., they use a constraint logic programming language to represent the relational MDP.
The Relation to the RRL System
The idea of abstract states and actions to represent sets of world states and
actions seems both elegant and quite practical. However, instead of requiring
a user defined abstraction of the reinforcement learning problem, it would be
preferable if the task of finding the correct level of abstraction could be left to
the learning algorithm.
To distinguish between similar subsets of states and actions, the learning
algorithm can use the utility values of states as an indication of related states
or Q-values to relate several (state, action) pairs.
The RRL system uses Q-values as a similarity indication of states and actions. However, it does not learn an explicit model of abstract states or actions,
but models the abstract state-action space implicitly by building a Q-function
based on structural and relational information available in the (state, action)
pair. This Q-function will also represent the policy learned by the system as
required by the problem definition.
It must be noted that the use of a Q-function to implicitly define the regions
of related (state, action) pairs can result in a partition of the state-action space
that can’t be directly translated into a set of abstract states and actions. This
can occur when the Q-function generalizes over (state, action) pairs by using
relational features that combine elements from both states and actions.
4.6.4 Other Related Techniques
Yoon et al. (2002) devise an approach to extend policies from small environments to larger environments using first order policy induction. Using a
planning component or a policy that can solve problems in small worlds, a
learning set is generated from which a rule-set is learned that generalizes well
to worlds with more objects and thus a larger state space. This is related to the
P-learning approach suggested in (Džeroski et al., 2001) that uses a Q-function
learned on small worlds to generate learning examples for learning a policy.
Another value based approach is presented by Guestrin et al. (2003). By
assuming that the utility of a state can be represented as the sum of various subparts of the state, Guestrin et al. are able to build a class-based value-function.
Such a value function computes the values for a set of classes in which each
object is assumed to have the same contribution to the total state utility. Using
these class values, a value can be computed for each possible state configuration. However, the assumption that each object belonging to the same class
has the same value contribution can be very restrictive on the type of problems that can be handled by the technique. The authors themselves suspect
problems in dealing with environments where the reward function depends on
the state of a large number of objects, or where there is a strong interaction between many of the world's objects. A typical example of such an environment is the blocks world. A limitation of the suggested learning technique is that, so far, the relations between objects are assumed to be static, which is certainly not the case in, for example, the blocks world.
4.7 Conclusions
This chapter introduced the relational reinforcement learning task. The relational reinforcement learning system or RRL system uses a standard Q-learning approach that incorporates incremental relational regression to build
a Q-function generalization. System specific requirements for the incremental
relational regression engine were discussed. The chapters of part II each discuss
a new regression algorithm that can be used in the RRL system.
A prototype of a relational Q-learning system was briefly discussed and an overview of some closely related work was given, much of which uses a relational abstraction of states and actions to reduce the number of utility values or Q-values that need to be learned. Compared to these approaches, the RRL system uses a Q-value driven search for related (state, action) pairs. The Q-function that results from the RRL algorithm implicitly defines a state-action partition of the learning task.
Part II
On First Order Regression
Chapter 5
Incremental First Order Regression Tree Induction
“You’re out of your tree.”
“It’s not my tree.”
Benny & Joon
5.1 Introduction
Decision trees are a successful and widely studied machine learning technique.
The learning algorithms used to build classification and regression trees use
a greedy learning technique called top-down induction of decision trees or TDIDT. TDIDT applies a divide-and-conquer strategy which makes it a very
efficient learning technique.
After a discussion of some related work, this chapter introduces the first of
three relational regression algorithms that have been developed in this thesis
for use in the RRL system. The tg algorithm is an incremental first order regression tree algorithm. It accepts the same language bias as the Tilde system
(Blockeel and De Raedt, 1998) and adopts some ideas of incremental attribute
value regression tree algorithms. Section 5.4 first introduces the experimental
setting for the next three chapters. This experimental setting uses the blocks
world as introduced in Section 3.5.3 with 3 different goals. The behavior of
the tg algorithm is evaluated on these 3 tasks. This chapter concludes by
discussing a number of possible improvements to the tg system.
The tg algorithm was designed and implemented with the help of Jan
Ramon and Hendrik Blockeel and was first introduced in (Driessens et al.,
2001).
5.2 Related Work
For attribute-value representations, there exists an incremental regression tree
induction algorithm that was designed for Q-learning, i.e., the G-algorithm by
Chapman and Kaelbling (1991). This is a tree learning algorithm that updates
its theory incrementally as examples are added. An important feature is that
examples can be discarded after they are processed. This avoids using a huge
amount of memory to store examples.
On a high level, the G-algorithm stores the current decision tree, and for
each leaf node statistics for all tests that could be used to split that leaf further.
Each time an example is inserted, it is sorted down the decision tree according
to the tests in the internal nodes, and in the leaf the statistics of the tests
are updated. When needed, i.e., when indicated by the statistics in a leaf,
the algorithm splits the leaf under investigation according to the best available
test, again indicated by the stored statistics, and creates two new empty leaves. Once a test is chosen for a certain node, this choice cannot be undone. This is possibly dangerous for Q-learning, because Q-learning requires the regression algorithm to do moving target regression. This means that tests at the top of the tree are chosen based on possibly unreliable learning examples.
The ITI algorithm of Utgoff et al. (1997) incorporates mechanisms for
tree restructuring, i.e., changing the chosen test in a node when the statistics
kept in that node indicate that a change is necessary. The algorithm uses two
tree revision operators: tree transposition and slewing cut-points for numerical attributes. These revision techniques rely on the fact that each decision
node stores statistics about all possible tests that could be used at that node.
When dealing with an attribute value representation of the data, this is feasible because the possible tests are limited to the possible values of the nominal
attributes and the possible cut-points for numerical attributes. As will be discussed later, this is not possible when using a first order representation of the
data.
Another related approach is the U-tree algorithm of MacCallum (1999)
which is specifically designed for Q-learning. U-trees rely on an attribute value
representation of states and actions, but allow the use of a (limited term)
history of a state as part of the decision criteria. For this, the U-tree algorithm
stores the sequence of all (state, action) pairs. This history allows U-trees to be
used with partially observable MDPs, while the feature selection used in the decision tree allows generalization over similar states. In the U-tree algorithm, the trees are not used in the same way as regression is used in the RRL system: instead, value iteration is performed with each step taken by the learning agent, using the leaves of the tree as states. Q-values are thus
stored per (leaf, action) pair. The reward value needed for value iteration is
computed as the average of the rewards of all stored (state, action) pairs that
belong to that leaf and share the same action.
5.3 The tg Algorithm
The tg algorithm is a first order extension of the G algorithm of Chapman and
Kaelbling (1991). Algorithm 5.1 shows a high level description of the regression
algorithm.
Algorithm 5.1 The (T)G-algorithm.
initialize by creating a tree with a single leaf with empty statistics
for each learning example that becomes available do
sort the example down the tree using the tests of the internal
nodes until it reaches a leaf
update the statistics in the leaf according to the new example
if the statistics in the leaf indicate that a new split is needed then
generate an internal node using the indicated test
grow 2 new leaves with empty statistics
end if
end for
The tg algorithm uses a relational representation language for describing
the examples and for the tests that can be used in the regression tree.
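To make the control flow concrete, the following Python skeleton (purely illustrative; the class names and the placeholder helpers update_statistics and significant_test are invented and stand in for the mechanisms described in the next subsections) mirrors Algorithm 5.1:

```python
class Leaf:
    def __init__(self):
        self.stats = {}        # statistics per candidate test (see Section 5.3.3)
        self.prediction = 0.0  # Q-value predicted for examples sorted into this leaf

class Node:
    def __init__(self, test, left, right):
        self.test, self.left, self.right = test, left, right

def update_statistics(leaf, example, q_value):
    """Placeholder: update the leaf's per-test counts, sums and sums of squares."""
    ...

def significant_test(leaf):
    """Placeholder: return the best test if it passes the significance test, else None."""
    return None

def tg_update(root, example, q_value):
    """One pass of Algorithm 5.1: sort the example to a leaf, update that leaf's
    statistics and split the leaf when some candidate test has become significant."""
    node, parent, went_left = root, None, True
    while isinstance(node, Node):
        parent, went_left = node, node.test(example)
        node = node.left if went_left else node.right
    update_statistics(node, example, q_value)
    best = significant_test(node)
    if best is None:
        return root
    new_node = Node(best, Leaf(), Leaf())
    if parent is None:
        return new_node                  # the tree consisted of a single leaf
    if went_left:
        parent.left = new_node
    else:
        parent.right = new_node
    return root
```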
5.3.1 Relational Trees
A relational (or logical) regression tree can be defined as follows (Blockeel and
De Raedt, 1998):
Definition 5.1 (Relational Regression Tree) A relational regression tree
is a binary tree in which
• Every internal node contains a test which is a conjunction of first order
literals.
• Every leaf (terminal node) of the tree contains a real valued prediction.
An extra constraint placed on the first order literals that are used as tests in
internal nodes is that a variable that is introduced in a node (i.e., it does not
occur in higher nodes) does not occur in the right subtree of the node.
Figure 5.1 gives an example of a first order regression tree. The test in a
node should be read as the existentially quantified conjunction of all literals
in the nodes in the path from the root of the tree to that node. In the left
subtree of a node, the test of the node is added to the conjunction; for the right subtree, the negation of the test should be added.
on(BlockA,BlockB)
├─ yes: clear(BlockA)
│       ├─ yes: Qvalue = 0.4
│       └─ no:  on(BlockB,floor)
│               ├─ yes: Qvalue = 0.9
│               └─ no:  Qvalue = 0.3
└─ no:  Qvalue = 0.1

Figure 5.1: A relational regression tree
The constraint on the use of variables stems from the fact that variables
in the tests of internal nodes are existentially quantified. Suppose a node
introduces a new variable X. Where the left subtree of a node corresponds to
the fact that a substitution for X has been found to make the conjunction true,
the right side corresponds to the situation where no substitution for X exists,
i.e., there is no such X. Therefore, it makes no sense to refer to X in the right
subtree.
A relational (logical) regression tree can be easily translated into a Prolog
decision list. For example, the tree of Figure 5.1 can be represented as the
following Prolog program:
q_value(0.4) :- on(BlockA,BlockB), clear(BlockA), !.
q_value(0.9) :- on(BlockA,BlockB), on(BlockB,floor), !.
q_value(0.3) :- on(BlockA,BlockB), !.
q_value(0.1).
The use of the cut-operator “!” can be avoided by the introduction of extra
definite clauses and the negation operator, but this leads to a larger and less
efficient program (Blockeel and De Raedt, 1998).
5.3.2 Candidate Test Creation
The construction of new tests is done through a refinement operator. tg uses a
user-defined refinement operator that originated in the Tilde system (Blockeel
and De Raedt, 1998).
This refinement operator uses a language bias to specify the predicates
that can be used together with their possible variable bindings. This language
bias has to be defined by the user of the system. By specifying possible test-extensions in the form of rmode-declarations, the user indicates what literals
can be used as tests for internal nodes.
An rmode-declaration looks as follows:
rmode(N: conjunction_of_literals).
This declaration means that the conjunction of literals can be used as the
test in an internal node, but at most N times in a path from the top of the
tree to the leaf. When the N is omitted, its value defaults to infinity.
To allow for the unification of variables between the tests used in different
nodes within one path of the tree, the conjunction of literals given in the rmode-declaration includes mode information of the used variables. Possible modes
are ‘+’, ‘−’ and ‘+−’. A ‘+’ indicates that the variable should be used as an
input variable, i.e., that the variable should occur in one of the tests on the
path from the top of the tree to the leaf that will be extended. A ‘−’ stands
for output, i.e., that the associated variable should not yet occur. ‘+−’ means
that both options are allowed, i.e., extensions can be generated both with an
already occurring or a completely new variable.
To illustrate this, look for example at the leftmost node of the regression
tree in Figure 5.1.
With the following rmode-declarations:
rmode(5: clear(+-X)).
rmode(5: on(+X,-Y)).
this leaf could be replaced by an internal node using any of the following tests:
clear(BlockA).
clear(BlockB).
clear(BlockC).
as a result from the first rmode, or
on(BlockA,BlockC).
on(BlockB,BlockC).
resulting from the second. A more detailed description of this language bias
part of the system, which includes the more advanced use of typed variables
and lookahead search, can be found in (Blockeel, 1998).
Because the possible test candidates depend on the previously chosen tests,
it is not possible to restructure relational trees in the same way as done with
the ITI algorithm in the work of Utgoff et al. (1997). In the propositional
case the set of candidate queries consists of the set of all features minus the
features that are already tested higher in the tree. This makes it relatively
easy to keep statistics about all possible tests at each internal node of the tree.
With relational trees, if a test in an internal node of the tree is changed, the
test candidates of all the nodes in the subtrees of that node change as well,
so no statistics can be gathered for all possible tests in each internal node.
Although not yet used in the tg algorithm, Section 5.5 will discuss some tree
restructuring mechanisms that can be used for relational trees.
Intermezzo: Candidate Test Storage
In contrast to the propositional case, keeping track of the candidate-tests (the
refinements of a query) is a non-trivial task. In the first order case, the set of
candidate queries consists of all possible ways to extend a query. The longer a
query is and the more variables it contains, the larger the number of possible
ways to bind the variables becomes and the larger the set of candidate tests is.
These dynamics in the set of possible test-extensions cause problems when trying to extend the tree restructuring approach of the ITI algorithm of Utgoff et al. (1997) to first order representations. Since the set of possible tests for a node is not fixed when the
ancestor nodes are not fixed, the statistics for each possible test in a node will
be useless when an ancestor node is changed by one of the tree restructuring
mechanisms.
Since a large number of such candidate tests exist, they must be stored
as efficiently as possible. To this aim the query packs mechanism introduced
by (Blockeel et al., 2000) is used. A query pack is a set of similar queries
structured into a tree; common parts of the queries are represented only once
in such a structure. For instance, the set of conjunctions
{(p(X), q(X)), (p(X), r(X))}
can be represented as a single term
p(X), (q(X); r(X))
This can yield a significant gain in practice.
Assuming a constant branching factor $b$, the memory use for storing a pack of $n$ queries of length $l$ is proportional to the number of nodes in a tree with $l$ layers, i.e., $b + b^2 + \ldots + b^l = (b^l - 1)b/(b - 1)$ (the root node of the tree is not counted, as not all queries are required to start with the same predicate). Since $n = b^l$ (the number of leaves in the tree), this is equal to $(n - 1)b/(b - 1)$. The amount of memory used to store $n$ queries of length $l$ without using a pack representation is proportional to $nl$. Also, executing a set of queries structured in a pack requires considerably less time than executing them all separately.
Even when stored in packs, the queries still require a considerable amount of memory. However, the packs
in the leaf nodes are very similar. Therefore, a further optimization is to reuse
them. When a node is split, the pack for the new right leaf node is the same
as the original pack of the node.
For the new left sub-node, the pack is currently only reused if a test is
added which does not introduce new variables. In that case the query pack in
the left leaf node will be equal to the pack in the original node except for the
chosen test which of course can’t be taken again. In further work, it is also
possible to reuse query packs in the more difficult case when a test is added
which introduces new variables.
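As a toy illustration of the prefix sharing that query packs exploit (this is not the actual pack implementation of Blockeel et al. (2000), and the literal strings are only placeholders), a set of conjunctions can be stored as a trie in which common leading literals appear only once:

```python
def build_pack(queries):
    """Store a set of conjunctions (tuples of literal strings) as a nested
    dictionary; shared prefixes such as 'p(X)' are represented only once."""
    pack = {}
    for query in queries:
        node = pack
        for literal in query:
            node = node.setdefault(literal, {})
    return pack

# {(p(X), q(X)), (p(X), r(X))} becomes p(X), (q(X); r(X)):
print(build_pack([("p(X)", "q(X)"), ("p(X)", "r(X)")]))
# -> {'p(X)': {'q(X)': {}, 'r(X)': {}}}
```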
5.3.3 Candidate Test Selection
The statistics for each leaf consist of the number of examples on which each
possible test succeeds or fails as well as the sum of the Q-values and the sum of
squared Q-values for each of the two subsets created by the test. These statistics
can be calculated incrementally and are sufficient to compute whether some test
is significant, i.e., whether the variance of the Q-values of the examples would be
reduced sufficiently by splitting the node using that particular test. A standard
F-test with a significance level of 0.001 is used to make this decision. (This may seem like a very strict significance level, but since the Q-learning setting supplies the regression algorithm with lots of examples, the required confidence needs to be this high.)
The F-test compares the variance of the Q-values of the examples collected
in the leaf before and after splitting, i.e.,
$$\frac{n_p}{n}\,\sigma_p^2 + \frac{n_n}{n}\,\sigma_n^2 \quad \text{vs.} \quad \sigma_{total}^2$$
where $n_p$ and $n_n$ are the numbers of examples for which the test succeeds or fails and $\sigma_p^2$ and $\sigma_n^2$ are the variances of the Q-values of the examples for which the test succeeds and fails respectively. The variances of the two subsets are added together, weighted according to the size of the subsets.
Since
$$\sigma^2 \equiv \frac{\sum_{i=0}^{n}(x_i - \bar{x})^2}{n} = \frac{\sum_{i=0}^{n} x_i^2 - n\bar{x}^2}{n}$$
with $\bar{x}$ the average of all $x_i$, the comparison above can be rewritten as:
$$\frac{n_p}{n}\,\frac{\sum_{i=0}^{n_p} q_i^2 - n_p\bar{q}_p^2}{n_p} + \frac{n_n}{n}\,\frac{\sum_{i=0}^{n_n} q_i^2 - n_n\bar{q}_n^2}{n_n} \quad \text{vs.} \quad \frac{\sum_{i=0}^{n} q_i^2 - n\bar{q}^2}{n}$$
or after multiplying by n:
$$\sum_{i=0}^{n_p} q_i^2 - n_p\bar{q}_p^2 + \sum_{i=0}^{n_n} q_i^2 - n_n\bar{q}_n^2 \quad \text{vs.} \quad \sum_{i=0}^{n} q_i^2 - n\bar{q}^2$$
This comparison can be easily expressed using the 6 statistical values stored
in a leaf for each test:
$$\sum_{i=0}^{n_p} q_i^2 - \frac{1}{n_p}\left(\sum_{i=0}^{n_p} q_i\right)^2 + \sum_{i=0}^{n_n} q_i^2 - \frac{1}{n_n}\left(\sum_{i=0}^{n_n} q_i\right)^2 \quad \text{vs.} \quad \sum_{i=0}^{n} q_i^2 - \frac{1}{n}\left(\sum_{i=0}^{n} q_i\right)^2$$
with $n = n_p + n_n$.
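To make the bookkeeping concrete, the Python sketch below (purely illustrative; the names are invented and this is not the actual implementation) maintains the six statistics for one candidate test in a leaf and evaluates the two sides of the comparison derived above, on which the significance decision is then based:

```python
from dataclasses import dataclass

@dataclass
class TestStats:
    """The six values stored in a leaf for one candidate test."""
    n_p: int = 0          # number of examples on which the test succeeds
    n_n: int = 0          # number of examples on which the test fails
    sum_p: float = 0.0    # sum of Q-values of the succeeding examples
    sum_n: float = 0.0    # sum of Q-values of the failing examples
    sumsq_p: float = 0.0  # sum of squared Q-values of the succeeding examples
    sumsq_n: float = 0.0  # sum of squared Q-values of the failing examples

    def update(self, q, succeeds):
        """Incremental update when a new example with Q-value q reaches the leaf."""
        if succeeds:
            self.n_p += 1
            self.sum_p += q
            self.sumsq_p += q * q
        else:
            self.n_n += 1
            self.sum_n += q
            self.sumsq_n += q * q

    def split_comparison(self):
        """Return the two sides of the comparison above: n times the total variance
        and n times the size-weighted variance after splitting on this test."""
        n = self.n_p + self.n_n
        if n == 0:
            return 0.0, 0.0
        before = (self.sumsq_p + self.sumsq_n) - (self.sum_p + self.sum_n) ** 2 / n
        after = 0.0
        if self.n_p > 0:
            after += self.sumsq_p - self.sum_p ** 2 / self.n_p
        if self.n_n > 0:
            after += self.sumsq_n - self.sum_n ** 2 / self.n_n
        return before, after
```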
Each leaf also stores the Q-value that should be predicted for (state, action)
pairs that are assigned to that leaf. This value is obtained from the statistics
of the test used to split its parent node when the leaf was created. Later, this
value is updated as new examples are sorted in the leaf.
A node is split after some minimal number of examples are collected and
some test becomes significant with a high confidence. This minimal number
of examples or minimal sample size is a parameter of the system that can be
tuned. Low values will cause tg to learn faster, but may cause problems by choosing the wrong test based on too little information.
5.3.4 RRL-tg
The introduction of the incremental tg algorithm into the RRL system solves
the problems of the original RRL implementation while keeping most of the
properties of the original system.
• The trees are no longer generated from scratch after each episode but are
built incrementally. This enables the new system to process many more learning episodes with the same computational power.
• Because tg only stores statistics about the examples in the tree and only
references these examples once (when they are inserted into the tree) the
need for remembering and therefore searching and replacing examples has
disappeared.
• Since tg begins each new leaf with completely empty statistics, examples
have a limited life span and old (possibly noisy) Q-value examples will
be deleted even if the exact same (state, action) pair is not encountered
twice.
• Since the bias used by this incremental algorithm is the same as with
Tilde , the same theories can be learned by tg . Both algorithms search
the same hypothesis space and although tg can be misled in the beginning
due to its incremental nature, in the limit the quality of approximations
of the Q-values should be the same.
5.4 Experiments

5.4.1 The Experimental Setup
To test the tg algorithm, a number of experiments using the blocks world
(See Chapter 3) have been performed. To test the ability of RRL and tg
to generalize over varying environments, the number of blocks (and thus the
number of objects in the world) is varied between 3 and 5. Although this
limits the number of different world states to 587 and the number of different
state-action combinations to approximately 2000, this number is large enough
to illustrate the generalization characteristics of the regression algorithms. Per
episode, RRL will collect about 2 to 2.5 examples (depending on the exact
learning task), so after 1000 episodes RRL will have collected around 2000
to 2500 learning examples. However, one should also remember that not all
of these learning examples will carry correct, if any, information and that Q-learning expects each (state, action) pair to be visited multiple times.
5.4.1.1 Tasks in the Blocks World
Three different goals were used, each with their own characteristics.
Stacking all blocks : In this task, the agent needs to build one large stack
of blocks. The order of the blocks in the stack does not matter. The
optimal policy is quite simple, as the agent just needs to stack blocks
onto the highest stack. However, since the RRL algorithm is a Q-learning
technique, tg will still need to predict a number of different Q-values.
Stacking is the simplest task that will be considered.
Unstacking all blocks : The Unstacking task consists of putting all blocks
onto the floor. Again (of course) there is no order in which the blocks
should be put on the floor. The Unstacking task is closely related to the
Stacking task, but there are some important differences. Although the
optimal policy for Unstacking is even simpler than the one for Stacking,
i.e., in each step, put a block on the floor that is not already on it, the task
is quite a bit harder to learn using Q-learning. Not only is there only one
goal state, but the number of possible actions (related to the number of blocks with no other blocks on top of them) increases with each step closer to
the goal. This makes it very hard for a Q-learning agent to reach the goal
state using random exploration and will cause the regression algorithm
to be confronted with a lot of non-informative Q-values as not reaching
the goal state during an episode results in Q-values = 0.0 for all the
(state, action) pairs encountered during the episode.
Stacking two specific blocks: On(A,B) : The third task under consideration is stacking two specific blocks. This task is interesting for a number
of reasons. First, the task is actually a combination of several sub-tasks.
To stack two specific blocks, one first needs to clear the two blocks and
then put the correct block on top of the other. This means that the optimal policy is harder to represent than for the two other tasks and the
same can be expected for the Q-function. Secondly, RRL will be trained
to learn to stack any two specific blocks. In each training episode, the
goal will be changed to refer to a different set of blocks. This way, RRL
will have to describe the learned Q-function with regard to the properties of blocks and relations between blocks instead of referring to specific
block identities. This will allow RRL to use its learned Q-function to
stack “Block 1” onto “Block 2” and to stack “Block 5” onto “Block 3”
without retraining.
Table 5.1: Number of states and number of reachable goal states for the three goals and different numbers of blocks.

No. of blocks   No. of states   RGS stack   RGS on(a,b)   RGS unstack
      3                    13           6             2             1
      4                    73          24             7             1
      5                   501         120            34             1
      6                 4 051         720           209             1
      7                37 633       5 040         1 546             1
      8               394 353      40 320        13 327             1
      9             4 596 553     362 880       130 922             1
     10            58 941 091   3 628 800     1 441 729             1
Table 5.1 shows the number of states in the Blocks World in relation to the
number of blocks in the world. Although only environments with 3 to 5 blocks
are used in the tests in this chapter, the world sizes for larger numbers of blocks are also shown, as these worlds will be used in later chapters. The number of "reachable goal states" (RGS) is also shown for each of the three tasks. Since goal states are modelled as absorbing states and given that RRL always starts an episode in a non-goal state, there is a (large) number of goal states for the On(A,B) task that cannot be reached during an episode. These are states that have block A on top of block B, but also have other blocks on top of A. See Figure 5.2 for an example of a reachable and a non-reachable goal state.
[Figure 5.2: Left a reachable goal state and right a non-reachable goal state for the on(1, 2) task in the Blocks World.]
5.4.1.2 The Learning Graphs
Unless indicated otherwise the learning graphs throughout this work combine
the results of 10-fold experiments. The tested criterion is the average reward
received by the RRL system starting from 200 different, randomly generated
starting states. The rewards in the test are given according to the planning
setting discussed in the intermezzo on page 16, with the added constraint that a
reward is only presented to RRL if it reaches the goal in the minimal number of
steps needed. If RRL does not reach the goal in the minimal number of steps,
the episode is terminated and no reward is given. This means that the y-axis on
the learning curves represents the percentage of cases in which RRL succeeds in
reaching the goal state in the minimal number on steps, i.e., solving the problem
optimally. Giving only a reward of 1 for optimal traces (i.e., reaching the goal
in the minimal number of steps) will not result in only learning from correct
examples. Because RRL calculates the Q-values of new learning examples using
its current Q-function approximation, it will also generate non-zero Q-values
for examples from non-optimal episodes. These may or may not be correct.
The number of blocks is rotated between 3, 4 and 5 during the learning
episodes, i.e., each episode with 3 blocks is followed by one with 4, which is in
turn followed by one with 5, etc. Every 100 episodes, the Q-function built
by the RRL system is translated into a deterministic policy and tested on
200 randomly generated problems with 3 to 5 blocks. Note that during these
testing episodes, RRL uses the greedy policy indicated by the Q-function it has
learned at that time. This removes the influence of the exploration strategy
from the test results. This choice was made because RRL so far only uses
exploration strategies to control the learning examples that are presented to
the used regression algorithm. Because the experiments are designed to test
the performance of the regression algorithm as a Q-function generalization, the
influence of the exploration strategy during testing would only make the results
harder to interpret.
If interested, the reader can check Appendix A for the representation of
the blocks world that was used. The appendix also includes the language bias
needed for the tg algorithm, which is the same as for the original implementation which used Tilde for regression.
5.4.2 The Results
Aside from the language bias, the most important parameter of the tg algorithm is the minimal number of examples that have to be classified to a leaf,
before tg is allowed to split that leaf and choose the most appropriate test.
This minimal example set size influences the number of splits that tg makes
only in the short term, as tg counts on the statistics in the leaves to choose the appropriate tests whenever it becomes necessary to split a leaf. However, since the examples that are presented to the tg algorithm are noisy at the start of learning, and since splitting decisions cannot be undone, it is important to collect a significant amount of learning experience before committing to a decision. The minimal sample size can be tuned to make tg wait long enough before making a split in the Q-tree.
[Figure 5.3: Comparing minimal sample size influences for the Stacking task. Average total reward against the number of episodes for mss = 30, 50, 100, 200 and 500.]
Different minimal sample sizes were tested on the different blocks world
tasks. Figure 5.3 shows the results for the Stacking task. For the Stacking task
in worlds with 3 to 5 blocks, the average length of an episode is only 2 steps,
so 2 learning examples are generated per episode on average. The performance
graph shows that higher minimal sample sizes result in a higher accuracy (e.g., approximately 98% accuracy for mss = 200 compared to approximately 94% for mss = 30), at the
cost of slower convergence. The performance of tg with the sample size set
to 30 jumps up quite fast, but is also the least stable. The performance for
the size set to 500 rises slowly and with large jumps. Each jump represents
the fact that tg has extended the Q-tree with one test. These performance
jumps are typical for RRL using tg as the performance will increase (or even
decrease) only with the split of a leaf. These jumps are also present in the other
performance curves, but because the changes in the Q-tree happen faster, they
are less apparent.
The size of the resulting Q-tree is also influenced by the minimal sample
size. More informed by a larger example set, tg succeeds in choosing better
tests. This yields equivalent or better performance with smaller trees as badly
chosen tests at the top of the tree force tg to make corrections — and thus
build larger subtrees — lower in the Q-tree. Table 5.2 shows the size of the
Q-tree after learning for 1000 episodes. Although tg only builds a tree with 3 leaves for a minimal sample size of 500, the performance of this small Q-tree
Table 5.2: The number of leaves in the Q-tree for different minimal sample sizes (mss)

 mss    Stacking   Unstacking   On(A,B)
  30        10.8         16.3      33.1
  50         9.6         16.6      24.4
 100         7.6         13.1      16.3
 200         5.9          8.2       9.7
 500         3.0          3.6       4.0
1000         2.0          2.0       3.0
when translated into a policy is already very good, as shown in Figure 5.3. The
performance curve for the sample size of 500 clearly shows the performance
increase related to each of the two splits. The table also shows a tree size for
setting the minimal sample size to 1000, but tg is only able to build a tree
with one split in that case and needs more learning episodes to build a useful
tree with that setting. Therefore the performance of tg for that value is not
shown in Figure 5.3.
[Figure 5.4: Comparing minimal sample size influences for the Unstacking task. Average total reward against the number of episodes for mss = 30, 50, 100, 200 and 500.]
Figure 5.4 shows the results for the Unstacking task. An episode for the
Unstacking task in a world with 3 to 5 blocks generates on average 2.3 learning
examples. Since the goal is harder to reach (see the discussion in the previous
section), tg will be presented with more uninformative examples than in the
Stacking task. This results in a slower performance increase. It also causes tg
to build a larger tree (see Table 5.2). For this kind of task, it is better to raise
the minimal example set size and force tg to make decisions based on more
experience.
[Figure 5.5: Comparing minimal sample size influences for the On(A,B) task. Average total reward against the number of episodes for mss = 30, 50, 100, 200 and 500.]
The On(A,B) task does not suffer from the low number of goal states as
much as the Unstacking task does, so it doesn’t need the minimal example set
size to be raised. Compared to the previous two tasks, the On(A,B) task is
harder because it consists of two subtasks as explained. This makes the Q-function more complex, and will force tg to build a more complex (i.e., larger)
Q-tree. An episode in worlds with 3 to 5 blocks generates about 2.5 learning
examples on average.
Figure 5.5 shows the performance graphs for the On(A,B) task. Because
of the need for a larger Q-tree, lower values for the minimal sample size are
better with a fixed number of learning episodes. Figure 5.6 shows a tree that
was learned after 1000 episodes with the minimal sample size set to 200. It
can be easily seen why it doesn’t solve the complete On(A,B) task yet. The
topmost line in the tree is a standard initialization of the tree and is defined
by the user. It allows tg to refer to the blocks participating in either the goal
or the action. It can also compare these blocks to each other, as it does for
the leftmost leaf, which represents the (state, action) pair that moves the goal
blocks on top of each other. Therefore the Q-value of 1.0 is correct for this leaf.
However, the right-hand sibling of this node still overgeneralizes. It treats all (state, action) pairs that move a block other than the A block onto the B block as identical. On the right side of the tree, tg has begun to build a subtree which calculates the distance to the goal. It makes a distinction between the states which have the B block on top of A and the ones which don't. In the rightmost branch, it starts to introduce new variables. This allows it to
check whether there are still other blocks which have to be moved. To solve
[Figure 5.6: A tree learned by tg after 1000 learning episodes for the On(A,B) task, with the minimal sample size set to 200. The root contains the user-defined initialization move(X,Y), goal_on(A,B) together with the test equal(Y,B); further tests include equal(X,A), equal(X,B), on(B,A), above(X,B) and tests that introduce new variables such as above(K,A), above(L,B) and above(M,B), with leaf predictions ranging from Q = 0.0 to Q = 1.0.]
the complete problem, the tree would have to be expanded with additional
tests until it represents at least all possible distances to the goal and with it
all different possible Q-values.
The execution times for the RRL-tg system are almost equal for the 3 tasks
presented. It takes RRL-tg about 15 seconds to process the 1000 learning
episodes and an additional 2 seconds to perform the 2000 test episodes of an entire learning experiment. The Q-tree learned by the tg algorithm is quite simple, and
thus fast, to consult.
5.5 Possible Extensions
The tg algorithm as discussed above can be improved in a number of ways.
The major drawback of the tg algorithm as it stands is the inability to undo
mistakes it makes at the top of the tree.
The minimal sample size as it is used now can be an effective way to reduce
the probability of selecting bad tests at the top of the tree, but also introduces
a number of unwanted side-effects.
Large minimal sample sizes reduce the learning speed of the tg algorithm
and thereby the learning speed of the RRL system, especially in later stages
of learning, since a lot of data has to be collected in each leaf, before tg is
allowed to split. This can be circumvented by the use of a dynamic minimal
sample size. One could, for instance, reduce the minimal sample size for each
level of the tree.
Another method of dealing with early mistakes would be to store statistics
about the tests in each node — not just the leaf nodes of the tree — and allow
tg to alter a chosen test when it becomes clear that a mistake has been made.
Table 5.2 suggests the existence of very good tests that should surface, given
enough training examples.
As explained in Section 5.3.2, using first order trees makes it impossible to store statistics on each possible test for a node if the tests in the nodes above it are not fixed. This makes it impossible to simply propagate the change through the underlying subtrees. However, two possibilities present themselves. The simplest solution would be to delete the two subtrees and start building them from scratch, this time with the assurance of a better parent node (see the left side of Figure 5.7). This is probably the easiest solution; however, a lot of information is lost with the deletion of the two subtrees.
Another possible approach is to reuse the original subtree in each of the two leaves (cf. the right side of Figure 5.7). If the statistics in the subtrees are reset, tg can start to collect information on new best tests without the immediate
loss of all the previously learned information. This approach would also need
a pruning strategy to delete subtrees when they become unnecessary.
To reduce the jumps in performance and possibly speed up the learning rate,
[Figure 5.7: Two possible tree rebuilding strategies when a better test is discovered in a node: either both subtrees below the new test are rebuilt from scratch, or the original subtree is reused in each of the two branches of the new test.]
tg could use the statistics gathered in a leaf to make more informed Q-value
estimates. It is possible for tg to use the information gathered on the different
tests to select the Q-value prediction of the best test without committing to
select that test to expand the tree. This could yield better estimations earlier
on and as a consequence also improve the estimations of the new learning
examples. However, caution should be used here as prediction errors might
cause instability as well.
Another extension that is worth investigating is the addition of aggregate
functions into the language bias of the tg system. Since the Q-function often
implicitly encodes a distance to the goal or to the next reward, this distance
might be more easily represented by allowing tg to use aggregates such as
summation. The inclusion of a history for each state, such as used in the U-tree algorithm of MacCallum (1999), would improve the behavior of RRL-tg in partially observable environments. The relational nature of the tests used by tg easily facilitates the use of a short term memory. The only needed change would be to store the (state, action) pairs which make up the memory of the learning agent. The additional memory requirements would be limited, as only a fixed-size window of (state, action) pairs would have to be remembered.
A larger step from the current tg algorithm would be to grow a set of
trees instead of a single tree. In this setup, the second tree could be trained to
represent the prediction error made by the first tree, the third tree to represent
the error still made after the combination of the first and second tree, and so
on. This would allow tg to build a first tree based on the largest differences in the Q-function, while further tuning of this coarse function would only require building an error-correcting tree once instead of expanding each leaf of the first tree into a full subtree. The use of more than one tree might
allow each tree to focus on a different aspect of the (state, action) pair and its
influence on the Q-value.
5.6 Conclusions
This chapter introduced the first of three relational regression algorithms for
relational reinforcement learning. The tg algorithm incrementally builds and
updates a first order regression tree. The algorithm stores statistics in the leaves that contain information on the possible tests that can be used for splitting the leaves. This alleviates the need to store all encountered examples and greatly
increases the applicability of relational reinforcement learning compared to the
original RRL system.
The tg system was tested on three tasks in the blocks world and the influence of its most important parameter, the minimal number of examples that
needs to be classified to a leaf before that leaf is allowed to split, was investigated.
Chapter 6
Relational Instance Based Regression
“You know, when you’re the middle child in a family of five million,
you don’t get any attention.”
Antz
6.1 Introduction
Instance based learning (classification as well as regression) is known for both
its simplicity and performance. Instance based learning — also known as lazy
learning or as nearest neighbor methods — simply stores the learning examples
or a selection thereof and computes a classification or regression value based on
a comparison of the new example with the stored examples. For this comparison, some kind of similarity measure, often a distance, has to be defined. The
two major concerns for the use of instance based learning as a regression technique for RRL are the noise inherent to the Q-learning setting and the need
for example selection as the instance based learning system will be presented
with a continuous stream of new learning examples.
This chapter describes a relational instance based regression technique that will be called the rib algorithm. Several database management techniques designed to limit the number of examples that are stored by the rib algorithm
are presented. rib is then tested on two applications. A simple corridor application is used to study the influence of the different parameters of the system
and the blocks world environment is used to compare the performance of the
rib system to the tg regression algorithm. The chapter concludes by discussing
some further work for the rib system.
The rib system was designed and implemented with the help of Jan Ramon
and was first presented in (Driessens and Ramon, 2003).
6.2 Nearest Neighbor Methods
This section discusses some previous work in instance based regression and
relates it to the regression task in the RRL system.
Aha et al. introduced the concept of instance based learning for classification (Aha et al., 1991) through the use of stored examples and nearest neighbor
techniques. They suggested two techniques to filter out unwanted examples to
both limit the number of examples that are stored in memory and improve
the behavior of instance based learning when confronted with noisy data. To
limit the inflow of new examples into the database, the IB2 system only stores
examples that are classified wrong by the examples in memory so far. To be
able to deal with noise, the IB3 system removes examples from the database
who’s classification record (i.e., the ratio of correct and incorrect classification
attempts) is significantly worse than that of other examples in the database.
Although these filtering techniques are simple and effective for classification,
they do not translate easily to regression.
The idea of instance based prediction of continuous target attributes was
introduced by Kibler et al. (1989). This work describes an approach that uses
a form of local linear regression. Although Kibler et al. refer to instance based
classification methods for reducing the amount of storage space needed by the
instance based techniques, they have not used these techniques for continuous
value prediction tasks.
This idea of local linear regression is explored in greater detail by Atkeson
et al. (1997), but again, no effort is made to limit the growth of the stored
database. In follow-up work however (Schaal et al., 2000), they do describe
a locally weighted learning algorithm that does not need to remember any
data explicitly. Instead, the algorithm builds “locally linear models” which are
updated with each new learning example. Each of these models is accompanied
by a “receptive field” which represents the area in which this linear model can
be used to make predictions. The algorithm also determines when to create a
new receptive field and the associated linear model. Although this is a good
idea, building local linear models in the relational setting (where data cannot
be represented as a finite length vector) does not seem feasible.
An example where instance based regression is used in Q-learning is in the
work of Smart and Kaelbling (2000) where they use locally weighted regression
as a Q-function generalization technique for learning to control a real robot
moving through a corridor. In this work, the authors do not look toward
limiting the size of the example-set that is stored in memory. They focus
on making safe predictions and accomplish this by constructing a convex hull
around their data. Before making a prediction, they check whether the new
example is inside this convex hull. The computation of the convex hull again
relies on the fact that the data can be represented as a vector, which is not the
case in the relational setting.
Another instance of Q-learning in which instance based learning is used
is given by Forbes and Andre (2002) where Q-learning is used in the context
of automated driving. In this work the authors do address the problem of
large example-sets. They use two parameters that limit the inflow of examples
into the database. First, a limit is placed on the density of stored examples.
They overcome the necessity of forgetting old data in the Q-learning setting by
updating the Q-value of stored examples according to the Q-value of similar
new examples. Secondly, a limit is given on how accurately Q-values have to
be predicted. If the Q-value of a new example is predicted within a given
boundary, the new example is not stored. When the number of examples in
the database reaches a specified number, the example contributing the least to
the correct prediction of values is removed. Some of these ideas will be adopted
and expanded on in the design of the new relational regression algorithm.
6.3 Relational Distances
Nearest neighbor methods need some kind of similarity measure or distance
between the used examples. In the RRL context, this means a distance is
needed between different (state, action) pairs. Because RRL uses a relational
representation of both states and actions, a relational distance is needed.
Since states and actions are represented using a set of ground facts, it might
seem sufficient to use a similarity measure on sets, combined with a distance
on ground facts. However, this approach would treat each fact as a separate
entity while different facts can contain references to the same object in a state
or action. Therefore, a distance between relational interpretations is needed.
Several approaches to compute such a distance exist (Sebag, 1997; Ramon and
Bruynooghe, 2001). Related to this, Emde and Wettschereck (1996) define a
similarity measure on first order representations that doesn’t satisfy the triangle
inequality.
Although these general distances can be used in RRL , they incorporate no
prior knowledge about the structure and dynamics of the environment RRL is
learning in. Where in the RRL-tg system, the user could guide the search by
specifying a language bias for the tg system, an application specific distance
will allow the user to do the same for the RRL-rib system. The use of such a
specific distance will not only allow the user to emphasize the important structures in the environment, but might also reduce the computational complexity
of the distance.
In the blocks world, for example, an application specific distance can emphasize the role of stacks of blocks. For instance, the distance between two blocks
world states can be defined as the distance between two sets of stacks. For a
distance between sets, the matching distance (Ramon and Bruynooghe, 2001)
can be used. This distance first tries to find the optimal matching between the
elements of the two sets and then computes a distance based on the distances
between the matched elements. Also, by treating the blocks that are mentioned neither in the goal nor in the performed action as identical, the use of object specific information can be avoided (comparable to the use of variables in the tg language bias).
Such a distance for the blocks world environment can be computed as follows:
1. Try to rename the blocks so that block-names that appear in the action
(and possibly in the goal) match between the two (state, action) pairs.
If this is not possible, add a penalty to the distance for each mismatch.
Rename each block that does not appear in the goal or the action to the
same name.
2. To compute the distance between the two states, regard each state (with
renamed blocks) as a set of stacks and calculate the distance between
these two sets using the matching-distance between sets based on the
distance between the stacks of blocks (Ramon and Bruynooghe, 2001).
3. To compute the distance between two stacks of blocks, transform each
stack into a string by reading the names of the blocks from the top of the
stack to the bottom, and compute the edit distance (Wagner and Fischer,
1974) between the resulting strings.
While this procedure defines a generic distance, it will adapt itself to deal
with different goals as well as different numbers of blocks in the world. The
renaming step (Step 1) ensures that the blocks that are manipulated and
the blocks mentioned in the goal of the task are matched between the two
(state, action) pairs. The other blocks are given a standard name to remove
their specific identity. This allows instance-based RRL to generalize over actions that manipulate different specific blocks as well as to train on similar
goals which refer to different specific blocks. This is comparable to RRL-tg
which uses variables to represent blocks which appear in the action and goal
description. Blocks which do not appear in the action or goal description are
all regarded as generic blocks, i.e., without paying attention to the specific
identity of these blocks. The renaming step is illustrated in Figure 6.1. In the
first state, block 1 is moved onto another block while block 3 is moved in the
second state. Therefore, both blocks are given the same name during renaming.
Block 2 in the left state and block 1 in the right hand state are also given the
same name because they are both mentioned as the first block in the goal. For
[Figure 6.1: The renaming step in the computation of the blocks world distance. The blocks of the left state (goal on(2,4)) and the right state (goal on(1,2)) are renamed so that the blocks referred to in the action and the goal receive matching names (a, b, c), while all remaining blocks receive the generic name x.]
block 4 in the left state and block 2 in the right state, both the action and the
goal references to the blocks match, so they can also be given the same name.
Were this not the case, they would have received different names and a penalty
would be added to the distance per mismatched block. In the experiments this
penalty value was set to 2.0. All the other blocks are given a generic name,
because their identity is neither referred to by the action nor by the goal.
After this renaming step, the two bottom states of Figure 6.1 are translated
into the two following sets of strings: {ac, bx} and {acx, b, x}. These sets are
compared using the matching distance between the sets and the edit distance
between the different strings. Computing the edit distances of the available
strings results in the following table:
d(ac,b) = 3.0
d(ac,acx) = 1.0
d(ac,x) = 3.0
d(bx,b) = 1.0
d(bx,acx) = 3.0
d(bx,x) = 1.0
An optimal matching in this case matches ‘ac’ to ‘acx’ and ‘bx’ to ‘b’ (or ‘x’).
The unmatched element of the second set causes a penalty to be added to the
distance. In the experiments, this penalty was set to 2.0, so that the distance
for the two (state, action) pairs of Figure 6.1 is computed to be 4.0. The use
of the matching-distance and the edit-distance enables RRL-rib to train on
blocks worlds of different sizes.
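This computation can be made concrete with a small sketch in Python (the helper names are illustrative and not taken from the RRL implementation). The table above is reproduced when a substitution in the edit distance is charged as a deletion plus an insertion; unmatched stacks are penalised with 2.0, as described above.

    from itertools import permutations

    def edit_distance(a, b):
        # Wagner-Fischer dynamic program; a substitution is charged as a
        # deletion plus an insertion (cost 2), which reproduces the table above.
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 2
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
        return d[len(a)][len(b)]

    def matching_distance(stacks1, stacks2, penalty=2.0):
        # Optimal matching between two small sets of stack-strings; every
        # unmatched stack contributes a fixed penalty.
        if len(stacks1) > len(stacks2):
            stacks1, stacks2 = stacks2, stacks1
        return min(sum(edit_distance(s, t) for s, t in zip(stacks1, perm))
                   + penalty * (len(stacks2) - len(stacks1))
                   for perm in permutations(stacks2, len(stacks1)))

    print(matching_distance(["ac", "bx"], ["acx", "b", "x"]))   # 4.0, as computed above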
6.4 The rib Algorithm
This section describes a number of different techniques that can be used with
relational instance based regression to limit the number of examples stored in
memory. As stated before, none of these techniques will require the use of
vector representations. Some of these techniques are designed specifically to
work well with Q-learning.
The rib algorithm will use c-nearest-neighbor prediction as a regression
technique, i.e., the predicted Q-value \hat{q}_i will be computed as follows:

\hat{q}_i = \frac{\sum_j q_j / dist_{ij}}{\sum_j 1 / dist_{ij}}    (6.1)
where distij is the distance between example i and example j and the sum
is computed over all examples stored in memory. To prevent division by 0, a
small amount δ can be added to this distance.
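As an illustration, the prediction of Equation 6.1, including the δ term, can be sketched as follows (the memory is assumed to be a list of (example, Q-value) pairs and distance a user-supplied function; the names are hypothetical):

    def predict_q(new_example, memory, distance, delta=1e-6):
        # Equation 6.1, with a small delta added to every distance
        # to avoid division by zero.
        weights = [1.0 / (distance(new_example, x) + delta) for x, _ in memory]
        return sum(w * q for w, (_, q) in zip(weights, memory)) / sum(weights)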
6.4.1 Limiting the Inflow
In IB2 (Aha et al., 1991) the inflow of new examples into the database is limited
by only storing examples that are classified incorrectly by the examples already
stored in the database. However, when predicting a continuous value, one can
not expect to predict a value correctly very often. A certain margin for error
in the predicted value will have to be tolerated. Comparable techniques used
in regression context (Forbes and Andre, 2002) allow an absolute error when
making predictions as well as limit the density of the examples stored in the
database.
To translate the idea of IB2 towards regression in a more adaptive manner,
instead of adopting an absolute error-margin it is better to use an error-margin
which is proportional to the standard deviation of the values of the examples
closest to the new example. This will make the regression engine more robust
against large variations in the values that need to be predicted. Translated to
a formula, this means that examples will be stored if
|q − q̂| > σlocal · Fl
(6.2)
with q the real Q-value of the new example, q̂ the predicted Q-value, σlocal the
standard deviation of the Q-value of a representative set of the closest examples
(the rib algorithm uses the 30 closest examples) and Fl a suitable parameter.
The idea of limiting the number of examples which occupy the same region
of the example space is also adopted, but without the rigidity that a global
maximum density imposes. Equation 6.2 will limit the number of examples
stored in a certain area. However, when trying to approximate a function such
as the one shown in Figure 6.2, it seems natural to store more examples of
region A than region B in the database. Unfortunately, region A will yield a
large σlocal in Equation 6.2 and will not cause the algorithm to store as many
examples as it should.
Figure 6.2: To predict the shown function correctly, an instance based learner
should store more examples from area A than area B.
rib therefore adopts an extra strategy that stores examples in the database
until the local standard-deviation (i.e., of the 30 closest examples) is only a
fraction of the standard deviation of the entire database, i.e., an example will
be stored if
\sigma_{local} > \frac{\sigma_{global}}{F_g}    (6.3)
with σlocal the standard deviation of the Q-value of the 30 closest examples,
σglobal the standard deviation of the Q-value of all stored examples and Fg a
suitable parameter.
This will result in more stored examples in areas with large variance of the
function value and fewer in areas with small variance. An example will be stored
by the RRL system if it meets one of the two criteria.
Both Equation 6.2 and Equation 6.3 can be tuned by varying the parameters
Fl and Fg .
6.4.2 Forgetting Stored Examples
The techniques described in the previous section might not be sufficient to
limit the growth of the database appropriately. When memory limitations are
reached, or when computation times grow too large, one might have to place a
hard limit on the number of examples that can be stored. The algorithm then
has to decide which examples it can remove from the database.
IB3 (Aha et al., 1991) uses a classification record for each stored example
and removes the examples that perform worse than others. In IB3, this removal
of examples is added to allow the instance based learner to deal with noise in
the training data. Because Q-learning has to deal with moving target regression and therefore inevitably with noisy data, rib will probably benefit from a
similar strategy. However, because the rib algorithm has to deal with continuous values, keeping a classification record which lists the number of correct
and incorrect classifications is not feasible. Two separate scores are proposed
that can be computed for each example and that will indicate which example
should be removed from the database.
6.4.2.1 Error Contribution
Since rib is in fact trying to minimize the prediction error, it is possible to
compute for each example what the cumulative prediction error is with and
without the example.
The cumulative prediction error with example i included is computed as
follows:

error_i = \sum_{j \neq i} (q_j - \hat{q}_j)^2
with q̂j the prediction of the Q-value of example j by the database with example
i included.
The cumulative prediction error without example i is:

error_i^{-i} = \sum_{j} (q_j - \hat{q}_j^{-i})^2
with q̂j−i the prediction of the Q-value of example j by the database without
example i. Note that this time a term is included to represent the loss of
information about the Q-value of example i.
The resulting score for example i, obtained by taking the difference between
the two cumulative prediction errors, looks as follows:

EC\text{-}score_i = (q_i - \hat{q}_i^{-i})^2 + \sum_{j \neq i} \left[ (q_j - \hat{q}_j^{-i})^2 - (q_j - \hat{q}_j)^2 \right]    (6.4)
The lowest scoring example is the example that has the lowest influence on the
cumulative prediction error and thus is the example that should be removed.
6.4.2.2 Error Proximity

A score that is simpler to compute is based on the proximity of an example to
the examples in the database that are predicted with large errors. Since the influence of stored examples is
inversely proportional to the distance, it makes sense to presume that examples
which are close to the examples with large prediction errors are also causing
these errors. The score for example i can be computed as:
EP\text{-}score_i = \sum_j \frac{|q_j - \hat{q}_j|}{dist_{ij}}    (6.5)
where q̂j is the prediction of the Q-value of example j by the database and
distij the distance between example i and example j. In this case, the example
with the highest score is the one that should be removed.
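A direct, unoptimised sketch of the two scores is given below, reusing the predict_q helper sketched after Equation 6.1. It assumes that a stored example does not contribute to its own prediction (leave-one-out), and that the j = i term is skipped in the EP-score; both are modelling choices of this sketch, not necessarily of the original implementation.

    def loo_prediction(j, database, distance):
        # Predict stored example j from all other stored examples.
        x_j, _ = database[j]
        return predict_q(x_j, database[:j] + database[j + 1:], distance)

    def ec_score(i, database, distance):
        # Equation 6.4: growth of the cumulative prediction error when
        # example i is dropped; remove the example with the LOWEST score.
        without_i = database[:i] + database[i + 1:]
        x_i, q_i = database[i]
        score = (q_i - predict_q(x_i, without_i, distance)) ** 2
        for j, (x_j, q_j) in enumerate(database):
            if j == i:
                continue
            k = j if j < i else j - 1          # index of example j inside without_i
            score += ((q_j - loo_prediction(k, without_i, distance)) ** 2
                      - (q_j - loo_prediction(j, database, distance)) ** 2)
        return score

    def ep_score(i, database, distance, delta=1e-6):
        # Equation 6.5: proximity of example i to badly predicted examples;
        # remove the example with the HIGHEST score.
        x_i, _ = database[i]
        return sum(abs(q_j - loo_prediction(j, database, distance))
                   / (distance(x_i, x_j) + delta)
                   for j, (x_j, q_j) in enumerate(database) if j != i)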
Another scoring function is used by Forbes and Andre (2002). In this
work, the authors also suggest not just forgetting examples, but using instance
averaging instead. Applying the instance averaging technique would imply the
use of first-order prototypes, which is complex; it is therefore not used in
the rib algorithm.
6.4.3 A Q-learning Specific Strategy: Maximum Variance
The major problem encountered while using instance based learning for regression is that it is impossible to distinguish high function variation from actual
noise. It seems impossible to do this without prior knowledge about the behavior of the function that the algorithm is trying to approximate. If one could
impose a limit on the variation of the function to be learned, this limit might
allow us to distinguish at least part of the noise from function variation. For
example in Q-learning, an application expert could know that

\frac{|q_i - q_j|}{dist_{ij}} < M    (6.6)
or one could have some other bound that limits the difference in Q-value as a
function of the distance between the examples.
Since the instance based regression algorithm is used in a Q-learning setting,
it can try to exploit some properties of this setting to its advantage. In a
deterministic application and with the correct initialization of the Q-values
(i.e., to values that are underestimations of the correct Q-value), the Q-values
of tabled Q-learning will monotonically increase during computation. This
means that the values in the Q-table will always be an underestimation of the
real Q-values.
When Q-learning is performed with the help of a generalization technique,
this behavior will normally disappear. The Q-value of a new example is normally given by
Q(s, a) = r + \max_{a'} \hat{Q}(s', a')    (6.7)

where s' is the state that is reached by performing action a in state s and
\hat{Q}(s', a') is the estimate of the Q-value of the (state, action) pair (s', a'). This
estimate, however, when computed by Equation 6.1, might not be an underestimate.
By using the following formula for Q-value prediction

\hat{q}_i = \frac{\sum_j (q_j - M \cdot dist_{ij}) / dist_{ij}}{\sum_j 1 / dist_{ij}}    (6.8)
where M is the same constant as the M in Equation 6.6, a generalization which
is an underestimate is ensured, because by Equation 6.6 q_j − (M · dist_ij) is a lower
bound on the Q-value of example i (and q_j + (M · dist_ij) an upper bound).
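A sketch of this lower-bound prediction, in the same hypothetical representation used in the earlier sketches:

    def predict_q_lower_bound(new_example, memory, distance, M, delta=1e-6):
        # Equation 6.8: a distance-weighted average of the lower bounds
        # q_j - M*dist_ij, which is itself an underestimate of the Q-value.
        dists = [distance(new_example, x) + delta for x, _ in memory]
        numerator = sum((q - M * d) / d for (_, q), d in zip(memory, dists))
        return numerator / sum(1.0 / d for d in dists)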
With all the Q-value predictions being underestimates, Equation 6.6 can
eliminate examples from the database. Figure 6.3 shows the forbidden regions
that result from the combination of the domain knowledge represented by the
maximum derivative and the Q-learning property of having underestimations
of the values to be predicted. These forbidden regions can be used to eliminate
examples from the database. In the example of Figure 6.3 this would mean
that examples b and f can be removed. Example d will stay in the database.
Figure 6.3: Using Maximum Variance to select examples.
The applicability of this approach is not limited to deterministic environments only. Since the algorithm will compute the highest possible Q-value
for each example it can also be used in stochastic environments where actions
may fail. If accidental results of actions are of lesser quality than the normal
results, the algorithm will still find the optimal strategy. If actions can have
better than normal results, this approach cannot be used.
6.4.4 The Algorithm

To summarize the example selection strategies, Algorithm 6.1 shows the
complete data selection strategy for the rib algorithm using the separate inflow filters and either the error-contribution (EC) or the error-proximity (EP)
forgetting strategy.
Algorithm 6.1 The rib data selection algorithm.
  for each data point that becomes available do
    predict the Q-value of the new data point
    compute σlocal and σglobal
    if (|q − q̂| > σlocal · Fl) or (σlocal > σglobal / Fg) then
      store the new example in the data base
      if (number of stored examples > maximum allowed examples) then
        compute the EC- or EP-score for each stored example
        remove the example with the worst score
      end if
    end if
  end for
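A runnable rendering of this selection loop (reusing the predict_q, ec_score and ep_score sketches from earlier in this chapter, and taking σlocal over the 30 nearest stored examples as in the text) could look as follows; the function and parameter names are again only illustrative.

    import statistics

    def process_example(x, q, database, distance, Fl, Fg, max_size, score):
        # Inflow filters of Equations 6.2 and 6.3, followed by forgetting.
        q_hat = predict_q(x, database, distance) if database else 0.0
        all_qs = [qj for _, qj in database]
        sigma_global = statistics.pstdev(all_qs) if len(all_qs) > 1 else 0.0
        nearest = sorted(database, key=lambda e: distance(x, e[0]))[:30]
        local_qs = [qj for _, qj in nearest]
        sigma_local = statistics.pstdev(local_qs) if len(local_qs) > 1 else 0.0
        if not database or abs(q - q_hat) > sigma_local * Fl \
                or sigma_local > sigma_global / Fg:
            database.append((x, q))
            if len(database) > max_size:
                scores = [score(i, database, distance) for i in range(len(database))]
                # EC: drop the lowest score; EP: drop the highest score.
                pick = min if score is ec_score else max
                database.pop(pick(range(len(scores)), key=scores.__getitem__))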
When a new learning example becomes available, the rib system will try
to predict its Q-value. If this predicted value sufficiently differs from the real
value, or if the new example is from a region of the state-action space where
rib has not yet collected enough information, the new example is stored in
the data-base. If this brings the number of stored examples over the allowed
maximum, the EC- or EP-score is used to select the best example for removal.
When the maximum variance strategy is used, all examples are stored when
they arrive (independent of the predictive accuracy on these examples). However, after each addition to the data base, Equation 6.6 is used to select which
examples stay in the data-base and which are removed.
Prediction in the rib system is performed by evaluating Equation 6.1.
6.5 Experiments

6.5.1 A Simple Task
To test the different data management approaches, experiments are performed
using a very simple (non-relational) Q-learning task. A reinforcement learning
agent walks through a corridor of length 10, shown in Figure 6.4. The agent
starts at one end of the corridor and receives a reward of 1.0 when it reaches
the other end.

Figure 6.4: The corridor application.

Figure 6.5: Prediction errors for varying inflow limitations (average prediction
error per episode for different combinations of Fl and Fg).

The distance between two state-action pairs is related to the
number of steps it takes to get from one state to the other, slightly increased
if the chosen actions differ.
The Q-function related to this problem is a very simple, monotonically
increasing function, so that it only takes two (well chosen) examples for the Q-learner to learn the optimal policy. This being the case, the average prediction error on all (state, action) pairs is compared for the different suggested approaches.
6.5.1.1 Inflow Behavior
To test the two inflow-filters of section 6.4.1 several experiments varying the
Fl and Fg values separately were performed. Figure 6.5 shows the average prediction errors over 50 test trials. Figure 6.6 shows the corresponding database
sizes.
The influence of Fg is exactly what one would expect. A larger value for Fg
forces the algorithm to store more examples but lowers the average prediction
error. It is worth noticing that in this application the influence on the size of
the database and therefore on the computation time is quite large with respect
to the relatively small effect this has on the prediction errors.
The influence of Fl is not so predictable. First of all, the influence of this
parameter on the size of the database seems limited. Second, one would expect
that an increase of the value of Fl would cause an increase in the prediction
error as well. Although the differences measured were not significant enough
to make any claims, this does not seem to be the case.
Figure 6.6: Database sizes for varying inflow limitations (average number of
stored examples for the same Fl and Fg settings as in Figure 6.5).
6.5.1.2 Adding an Upper Limit
The two scoring functions from section 6.4.2 were tested in the same setting by
adding an upper limit to the database size that the rib algorithm is allowed
to use. The two parameters Fl and Fg were set to 5.0 — values that gave both
an average prediction error and an average database size — and the number of
examples that rib could store to make predictions was varied.
Figure 6.7 shows the average prediction-error as a function of the number of
learning episodes when using the error-contribution-score (EC-score) of Equation 6.4 for different maximum database sizes. The ’no limit’ curve in the graph
shows the prediction error when no examples are removed.
Figure 6.8 shows the average prediction-error when managing the database
size with the error-proximity-score (EP-score) of Equation 6.5. Although differences with the EC-score are small, EP-score management performs at least
as well and is easier to compute.
A rather disturbing feature of both graphs is that the prediction error at
first seems to reach a minimum, whereafter the prediction error goes up again.
The reasons for this behavior are not entirely clear although the order in which
Q-values are usually discovered in Q-learning might offer a possible explanation.
The typical exponentially decreasing shape of the Q-function in relation to the
distance to the goal is shown in Figure 6.9. The first learning examples with
a non zero Q-value are usually discovered close to the goal, as it is easier to
stumble onto a reward when already close to the goal. These examples lie
in the region of the state-action space where differences between Q-values are
large and making good predictions in this area will greatly reduce the overall
prediction error. Later in the learning experiment, examples from all over
the state-action space will be stored to keep the local standard deviation low.
Figure 6.7: The effect of example selection by Error Contribution (average
prediction error for database limits of 50, 100 and 200 examples, and with no
limit).
Figure 6.8: The effect of example selection by Error Proximity (same database
limits as in Figure 6.7).
Figure 6.9: The shape of a Q-function in relation to the distance to the goal
(the range of early examples lies close to the goal, where the Q-values differ
most).
However, the influence of good predictions in the right part of the state space
as presented in Figure 6.9 on the prediction error is a lot smaller. Balancing
the examples stored in the data-base over the entire state-space therefore might
result in a small increase of the prediction errors.
6.5.1.3 The Effects of Maximum Variance

Figure 6.10: The effect of example selection by Maximum Variance (compared
with EC- and EP-selection limited to 100 examples and with no limit).
Figure 6.10 shows the prediction-error when the maximum variance (or mv)
strategy is used to manage the database. The M -value of Equation 6.6 was set
to 0.1, the maximum difference in Q-values in this environment. The prediction
errors are a lot larger than with the other strategies, but RRL is still able to
find the optimal strategy. The advantage of the mv-strategy lies in the number
of examples stored in the database. With this particular application, only 20
examples are stored, one for each possible Q-value.
6.5.2 The Blocks World
The rib algorithm was also tested in the same blocks world environment as
the tg system (described in the previous chapter). To apply the rib algorithm
to work in the blocks world, a distance has to be defined between different
(state, action) pairs. The following experiments use the blocks world distance
as defined in Section 6.3.
6.5.2.1 The Influence of Different Data Base Sizes
Since the behavior of the error-proximity score and the error-contribution score
is very similar as shown in the previous section, the rib system is only tested
on the blocks world with the error-proximity score and the maximum variance
selection strategy.
First, the influence of the maximum size of the data base is tested with the
use of the error-proximity selection strategy. The data base size is limited to
a number of different values ranging from 25 to 200 examples. 25 examples
will cause the rib system to keep all new examples that arrive and only select
examples by forgetting others. The low number of stored examples causes both
the global and local variance to be computed using all the stored examples and
will cause Equation 6.3 to accept all new examples. The high limit of 200
examples to be stored will cause rib to take longer to select the right examples
to generalize nicely over the entire state-action space.
Figure 6.11: The performance of rib with different data base sizes (25, 50, 100
and 200 examples) on the Stacking task.
Figure 6.11 shows the results for the stack goal. Although 25 examples is
clearly too few for rib to be able to build a well performing Q-function, with
50 examples it already builds a Q-function that translates in a well performing
policy. With approximately 3 possible actions per state, there are over 1500
different (state, action) pairs. Using 200 as an upper limit for the database
slows the performance improvement of rib , as a higher number of examples
makes it harder to select the ones that help to generalize well.
Figure 6.12: The performance of rib with different data base sizes on the
Unstacking task.
Figure 6.12 shows the results for the rib regression algorithm on the Unstacking task, again for different sizes of the stored example set. Here, rib is
not able to build a well performing policy using only 50 examples.
When rib is allowed to store 200 examples, there is a bump in the learning
curve where at first, rib learns a well performing policy, then seems to forget
what it has learned for a while and then starts to recover and learn a better
policy once again. Remember that the Unstacking task generates a lot of
uninformative examples, making it hard for the regression engine to filter out
the informative examples, as they can appear to be nothing but noise. This
is exactly what is happening in this case. At first, rib is allowed to store all
the examples that it receives. It will recognize the fact that some examples
are non-informative and not accept these and will store the examples that
yield large Q-values as interesting. Since the translation of the Q-function into
a policy only looks at the maximum Q-values of a state, this will lead to a
largely incorrect but relatively well performing Q-function. When the data
base gets filled up with examples, rib will try to remove the ones that seem
noisy and mistakes the few high-yielding (state, action) pairs for inaccurate ones.
When it removes these examples, the overall performance of the constructed
policy degrades. Only when the discovered rewards start to spread to the
computed Q-values of other (state, action) pairs does rib recover and does the
performance start to increase again.
Figure 6.13: The performance of rib with different data base sizes on the
On(A,B) task.
The performance of the rib regression engine on the On(A,B) task is shown
in figure 6.13. In this task, the differences between the different data-base sizes
are small. None of the sizes allows rib to find an optimal policy. The largest
data base size again causes the slowest learning rate.
6.5.2.2 Comparing rib and tg

Choosing a database size of 100 for the example selection using error-proximity, and
also using the maximum variance selection, the performance of rib was compared to the performance of tg as a regression engine
for RRL in the blocks world. The M value of Equation 6.6 was set to 0.1, the
maximum difference between two different Q-values in these tests. Figure 6.14
shows the results for the three regression algorithms for the Stacking goal.
rib seems to have the upper hand when it comes to the learning rate per
episode. This is partly due to the fact that tg needs to collect a larger number
of (state, action, q-value) examples before it can make use of them by generating new nodes and leaves in the Q-tree. rib will use the predictive power
of examples as soon as they become available. This said, it needs to be considered that rib needs a lot more processing power (and thus processing time)
to make Q-value predictions. This causes RRL-rib to be a lot slower than
RRL-tg when it comes to computation times. However, since interaction with
the environment usually yields the largest learning cost, the learning rate per
episode is the fairest comparison between the different systems.
Figure 6.14: Performance comparison between the tg and rib algorithms for
the Stacking task.
Between the two rib systems, rib-mv seems to learn a little quicker than
rib-ep but rib-ep has the advantage of yielding a better policy. This difference
is more explicit for the Unstacking tasks. The results for the Unstacking task
are shown in Figure 6.15.
Figure 6.15: Performance comparison between the tg and rib algorithms for
the Unstacking task.
For the Unstacking task, rib-mv learns its policy very quickly, but then
gets stuck at the reached performance level. The performance graph shows
this to be at ±90%, but actual levels in the different experiments ranged from
the optimal policy to ±73%. Each time rib-mv reached its eventual level of
performance quickly, but was unable to escape from that local maximum. rib-ep
learns more slowly than rib-mv but was able to learn the optimal policy
in each of the 10 test-runs.
Table 6.1: The average number of examples stored in the rib database using
Maximum Variance Example Selection at the end of the learning experiment.

    Task          Average Final Database Size
    Stacking       35
    Unstacking     10
    On(A,B)       103
Table 6.1 shows the average number of examples that rib-mv had in its
data-base at the end of the learning task. For the Stacking and Unstacking
tasks, these numbers are a lot lower than the 100 examples used by rib-ep
in its experiments. This lower number also makes rib-mv a lot more efficient
computationally. Considering each test-run separately, there was no correlation
between the number of stored examples and performance level reached by the
rib-mv system.
Figure 6.16: Performance comparison between the tg and rib algorithms for
the On(A,B) task.
Figure 6.16 shows the learning curves for the On(A,B) task. This is the first
task in which rib-mv outperforms the rib-ep policy. The number of examples
remembered by rib-mv is also a little higher than the 100 allowed examples
for the rib-ep approach. tg is again a little slower to reach high performance
levels than both rib versions.
Table 6.2: The execution times for RRL-tg, rib-mv and rib-ep on a Pentium
III 1.1 GHz machine, in seconds. The first number in each cell indicates the
learning time, the second indicates the testing time.

    Task          RRL-tg    rib-mv      rib-ep
    Stacking      14 – 1    27 – 27     920 – 90
    Unstacking    15 – 2     7 – 10    1500 – 200
    On(A,B)       16 – 1    44 – 100   1800 – 200
When applicable (see Section 6.4.3), the large reduction of the number of
stored examples together with the automatic adaptation to the difficulty of
the learning task, makes rib-mv the preferred technique for instance based
regression. rib-mv largely outperforms rib-ep with respect to computation
times. Table 6.2 shows the execution times of the two rib systems next to
the RRL-tg system. The first number in each cell is the time spent by the
RRL system processing the 1000 learning episodes, the second number is the
time spent on the 2000 testing trials. This second number gives an indication
of the prediction efficiency of the system. To be fair, it must be stated that
the rib-mv implementation was optimized with respect to computation speed,
while rib-ep was not. However, this optimization can not be held responsible
for the large differences observed.
Compared to the RRL-tg system, the instance based regression technique
of RRL-rib results in a smoother learning progression. RRL-tg relies on
finding the correct split for each node in the regression tree. When this node is
found, this results in a large improvement in the learned policy. Instance-based
RRL does not rely on such key-decisions and therefore can be expected to be
more robust than the tree based RRL-tg system. As can be seen in Table 6.2
the policy learned by tg can be evaluated much more efficiently.
6.6 Possible Extensions
The major drawback of the rib regression system is the computational complexity of the system and the slow performance that results from this. Often, a
strict constraint on the number of examples stored in the data-base is needed
to reduce computation times.
There are multiple ways of dealing with this problem. The most simple
solution, i.e., with the least intrusive change to the rib system itself, would
be to use incrementally computable distances. An incrementally computable
distance would first compute a rough estimate of the distance between different
(state, action) pairs, which can later be refined if necessary. When the use of
approximate distances is limited to distant (state, action) pairs, the influence
of the approximations would be limited or non-existent, but computation times
might be reduced significantly.
A more complex, but also more promising change to the rib system, would
be the combination of a model building regression technique (such as the tg
system) and instance based learning. Such an approach would use the model
building regression algorithm to make a coarse division of the state-action space
and use instance based learning locally to improve the predictive accuracy.
From the rib viewpoint, this would remove the need to compute the distance
from the new (state, action) pair to every stored (state, action) pair. The
instance based part of the regression system would only need to consider the
stored examples that are classified to the same group as the new example.
From the model building algorithm’s viewpoint, this would allow for better
local function fitting.
6.7 Conclusions
This chapter introduced relational instance based regression, a new regression
technique that can be used when instances can not be represented as vectors.
Several database management approaches were developed that limit the
memory requirements and computation times by limiting the number of examples that need to be stored in the database. The behavior of these different
approaches was shown and discussed using a simple example application and
was compared with the regression tree based RRL-tg algorithm. Empirical
results show that instance-based RRL outperforms RRL-tg considering the
learning speed per training episode. RRL-tg on the other hand has the advantage of producing a Q-function in a comprehensible format.
Chapter 7
Gaussian Processes and
Graph Kernels
“Unfortunately, no one can be told what the Matrix is.
You have to see it for yourself.”
The Matrix
7.1 Introduction
Kernel methods are among the most successful recent developments within the
field of machine learning. Used together with the recently developed kernels
for structured data they yield powerful classification and regression methods
that can be used for relational applications.
This chapter introduces an incremental regression algorithm based on Gaussian processes and graph kernels. This algorithm is integrated into the RRL
system to create RRL-kbr and several system parameters are discussed and
their influence is evaluated empirically. Afterwards, it is compared with the
previous regression techniques. After a brief introduction of kernel methods and
kernel functions, this chapter introduces Gaussian processes as an incrementally learnable regression technique that uses kernels as the covariance function
between learning examples. Section 7.4 introduces graph kernels that can be
used as kernels between (state, action) pairs. Next, Section 7.5 illustrates how
to apply a graph kernel in the blocks world environment. In Section 7.6 the
behavior of Gaussian processes as a regression system for RRL is empirically
evaluated and compared to the RRL-tg and RRL-rib systems. Section 7.7
discusses a few improvements that can still be made to the RRL-kbr system.
Related work on reinforcement learning with kernel methods is very limited
so far. In the work by Ormoneit and Sen (2002) the term ‘kernel’ is not used
to refer to a positive definite function but to a probability density function.
Dietterich and Wang (2002) and Rasmussen and Kuss (2004) do not use Q-learning as in the RRL system but model the reinforcement learning task
in a closed form. Dietterich and Wang use support vector machines, while
Rasmussen and Kuss use Gaussian processes.
The kernel based regression system was designed and implemented in collaboration with Thomas Gärtner and Jan Ramon and was first introduced in
(Gärtner et al., 2003a).
7.2 Kernel Methods
Kernel methods work by embedding the data into a vector space and then
looking for (often linear) relations between the data in that space. If the
mapping to the vector space is well chosen, complex relations can be simplified
and more easily discovered. These relations can then be used for classification,
regression, etc.
Based on the fact that all generalization requires some form of similarity
measure, all kernel methods are in principle composed of 2 parts:
1. A general purpose machine learning algorithm.
2. A problem specific kernel function.
The kernel function is employed to avoid the need for an explicit mapping
to the (often high dimensional) vector space. Technically, a kernel k computes an inner product in some feature space which is, in general, different
from the representation space of the instances. The computational attractiveness of kernel methods comes from the fact that quite often a closed form of
these ‘feature space inner products’ exists. Instead of performing the expensive
transformation step φ explicitly, a kernel k(x, x′) = ⟨φ(x), φ(x′)⟩ computes the
inner product directly and performs the feature transformation only implicitly.
Whether, for a given function k : X × X → R, a feature transformation
φ : X → H into some Hilbert space H exists, such that k(x, x′) = ⟨φ(x), φ(x′)⟩
for all x, x′ ∈ X , can be checked by verifying that the function is positive
definite (Aronszajn, 1950). This means that any set, whether a linear space or
not, that admits a positive definite kernel can be embedded into a linear space.
Thus, throughout this text, ‘valid’ means ‘positive definite’. Here then is the
definition of a positive definite kernel. (N is the set of positive integers.)
Definition 7.1 Let X be a set. A symmetric function k : X × X → R is a
positive definite kernel on X if, ∀n ∈ N, x_1, ..., x_n ∈ X , and c_1, ..., c_n ∈ R

\sum_{i,j \in \{1,\dots,n\}} c_i c_j k(x_i, x_j) \geq 0
While it is not always easy to prove positive definiteness for a given kernel, positive definite kernels do have some nice closure properties. In particular,
they are closed under sum, direct sum, multiplication by a scalar, product, tensor product, zero extension, point-wise limits, and exponentiation (Cristianini
and Shawe-Taylor, 2000; Haussler, 1999).
Kernels for Structured Data
The best known kernel for representation spaces that are not mere attribute-value tuples is the convolution kernel proposed by Haussler (1999). The basic
idea of convolution kernels is that the semantics of composite objects can often
be captured by a relation R between the object and its parts. The kernel on
the composed object is then a combination of kernels defined on its different
parts.
Let x, x′ ∈ X be the objects and (x_1, x_2, ..., x_D), (x′_1, x′_2, ..., x′_D) ∈ X_1 ×
··· × X_D be tuples of parts of these objects. Given the relation R : (X_1 × ··· ×
X_D) × X , a decomposition R^{-1} can be defined as R^{-1}(x) = {(x_1, x_2, ..., x_D) :
R((x_1, x_2, ..., x_D), x)}. With positive definite kernels k_d : X_d × X_d → R for all
d ∈ {1, ..., D} the convolution kernel is defined as

k_{conv}(x, x') = \sum_{(x_1,\dots,x_D) \in R^{-1}(x),\; (x'_1,\dots,x'_D) \in R^{-1}(x')} \; \prod_{d=1}^{D} k_d(x_d, x'_d)
The term convolution kernel refers to a class of kernels that can be formulated in the above way. The advantage of convolution kernels is that they
are very general and can be applied in many different problems. However,
because of that generality, they require a significant amount of work to adapt
them to a specific problem, which makes choosing the composition relation R
in ‘real-world’ applications a non-trivial task.
There are other kernel definitions for structured data in the literature. However, they usually focus on a very restricted syntax and are more or less domain
specific. Examples are string and tree kernels. Traditionally, string kernels
(Lodhi et al., 2002) have focused on applications in text mining and measure
similarity of two strings by the number of common (not necessarily contiguous) substrings. These string kernels have not been applied in other domains.
However, other string kernels have been defined for other domains, e.g., recognition of translation initiation sites in DNA and mRNA sequences (Zien et al.,
2000). Again, these kernels have not been applied in other domains. Tree
kernels (Collins and Duffy, 2002) can be applied to “ranked” trees , i.e., trees
where the number of children of a node is determined by the label of the node.
They compute the similarity of trees based on their common subtrees. Tree
kernels have been applied in natural language processing tasks. A kernel for
instances represented by terms in a higher-order logic is presented by Gärtner
et al. (2003c).
For an extensive overview of these and other kernels on structured data,
the reader is referred to the overview paper by Gärtner (2003). None of these
kernels, however, can be easily applied to the kind of state-action representations encountered in relational reinforcement learning problems. Kernels that
can be applied there have independently been introduced by Gärtner (2002)
and Kashima and Inokuchi (2002) and will be presented in Section 7.4.
7.3 Gaussian Processes for Regression
The exact mathematics in the description of the Gaussian processes here, are
kept to a minimum. Readers interested in a more rigorous explanation of
Gaussian processes and their properties may want to consult MacKay (1997).
Parametric regression techniques use a parameterized function as a hypothesis. The learning algorithms use the observed data to tune the parameter
vector w. In some cases, a single function is chosen and used for predictions
of unseen examples. In other cases, a combination of functions is used. Examples of parametric regression techniques are neural networks and radial basis
functions .
Bayesian regression techniques assume a prior distribution over the parameter vector and calculate a posterior distribution over parameter vectors using
Bayes rule and the available learning data. Predictions for new, unseen data
can be made by marginalizing over the parameters.
Gaussian processes implement a non-parametric Bayesian technique. Instead of assuming a prior over the parameter vectors, a prior is assumed over
the target function itself.
Assume that a set of data points {[x_i | t_i]}_{i=1}^{N} is observed, with x_i the description of the example and t_i the target value. The regression task in a
Bayesian approach is to find the predictive distribution of the value t_{N+1} given
the example description x_{N+1}, i.e.:

P(t_{N+1} | [x_1 ··· x_N], [t_1 ··· t_N], x_{N+1})
To model this task as a Gaussian process it is assumed (Gibbs, 1997) that
the target values t_N = [t_1 ··· t_N] have a joint distribution:

P(t_N | [x_1 \cdots x_N], C_N) = \frac{1}{Z} \exp\left( -\frac{1}{2} (t_N - \mu)^T C_N^{-1} (t_N - \mu) \right)    (7.1)
where µ is the mean vector of the target values, C is a covariance matrix (Cij =
C(xi , xj ), 1 ≤ i, j ≤ N ) and Z is an appropriate normalization constant. The
C_{N+1} = \begin{pmatrix} C_N & k_{N+1} \\ k_{N+1}^T & \kappa \end{pmatrix}

Figure 7.1: The relationship between the covariance matrices C_N and C_{N+1}.
choice of covariance functions is restricted to positive definite kernel functions
(See Section 7.2). For simplicity reasons1 , assume the mean vector µ = 0.
Because P(A ∧ B | C) = P(A | B ∧ C) · P(B | C), the predictive distribution of t_{N+1}
can be written as the conditional distribution:

P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}) = \frac{P(t_{N+1} | [x_1 \cdots x_N], x_{N+1})}{P(t_N | [x_1 \cdots x_N])}
and, using Equation 7.1, as the following Gaussian distribution:
P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}, C_{N+1}) = \frac{Z_N}{Z_{N+1}} \exp\left( -\frac{1}{2} \left( t_{N+1}^T C_{N+1}^{-1} t_{N+1} - t_N^T C_N^{-1} t_N \right) \right)    (7.2)
with ZN and ZN +1 appropriate normalizing constants and CN and CN +1
as in Figure 7.1. The vector k_{N+1} and scalar κ are defined as:

k_{N+1} = [C(x_1, x_{N+1}) \cdots C(x_N, x_{N+1})]
\kappa = C(x_{N+1}, x_{N+1})
By grouping the terms that depend on t_{N+1} (Gibbs, 1997), Equation 7.2
can be rewritten as:

P(t_{N+1} | [x_1 \cdots x_N], t_N, x_{N+1}, C_{N+1}) = \frac{1}{Z} \exp\left( -\frac{(t_{N+1} - \hat{t}_{N+1})^2}{2 \sigma_{\hat{t}_{N+1}}^2} \right)    (7.3)
1 Although this may seem as a leap of faith, assuming 0 as an apriori Q-value is standard practice in Q-learning. This assumption was also used in the case of the tg and rib
algorithms, albeit less explicitly.
with

\hat{t}_{N+1} = k_{N+1}^T C_N^{-1} t_N    (7.4)

\sigma_{\hat{t}_{N+1}}^2 = \kappa - k_{N+1}^T C_N^{-1} k_{N+1}    (7.5)
and kN +1 and κ as previously defined. This expression maximizes at t̂N +1 ,
and therefore the value t̂N +1 is the one that will be predicted by the regression
algorithm. \sigma_{\hat{t}_{N+1}} gives the standard deviation of the predicted value. Note
that, to make predictions, C_N^{-1} is used, so there is no need to invert the new
matrix C_{N+1} for each prediction.
Although using Gaussian processes for Q-function regression might seem
like overkill, this technique has some properties that make it very well
suited for reinforcement learning. First, the probability distribution over the
target values can be used to guide the exploration of the state-space during
the learning process comparable to interval based exploration (Kaelbling et
al., 1996). Secondly, the inverse of the covariance matrix can be computed
incrementally, using the partitioned inverse equations of Barnett (1979):
C_{N+1}^{-1} = \begin{pmatrix} M & m \\ m^T & \mu \end{pmatrix}

with

M = C_N^{-1} + \mu \left( C_N^{-1} k_{N+1} \right) \left( C_N^{-1} k_{N+1} \right)^T
m = -\mu C_N^{-1} k_{N+1}
\mu = \left( \kappa - k_{N+1}^T C_N^{-1} k_{N+1} \right)^{-1}
While matrix inversion is of cubic complexity, computing the inverse of a
matrix incrementally after expansion is only of quadratic time complexity. As
stated before, no additional inversions need to be performed to make multiple
predictions.
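The following sketch shows how such an incremental Gaussian process regressor could be organised (an arbitrary positive definite kernel is assumed as covariance function, and a small noise term is added to the diagonal for numerical stability; the class and variable names are illustrative only):

    import numpy as np

    class IncrementalGP:
        def __init__(self, kernel, noise=1e-6):
            self.kernel, self.noise = kernel, noise
            self.X, self.t = [], np.zeros(0)         # examples and target values
            self.C_inv = np.zeros((0, 0))            # C_N^{-1}

        def add_example(self, x, target):
            k = np.array([self.kernel(xi, x) for xi in self.X])
            kappa = self.kernel(x, x) + self.noise
            if not self.X:
                self.C_inv = np.array([[1.0 / kappa]])
            else:
                # Partitioned-inverse update: quadratic cost per new example.
                Ck = self.C_inv @ k
                mu = 1.0 / (kappa - k @ Ck)
                m = -mu * Ck
                M = self.C_inv + mu * np.outer(Ck, Ck)
                self.C_inv = np.block([[M, m[:, None]],
                                       [m[None, :], np.array([[mu]])]])
            self.X.append(x)
            self.t = np.append(self.t, target)

        def predict(self, x):
            # Mean and variance of the predictive distribution (Eqs. 7.4 and 7.5).
            k = np.array([self.kernel(xi, x) for xi in self.X])
            mean = k @ self.C_inv @ self.t
            var = self.kernel(x, x) + self.noise - k @ self.C_inv @ k
            return mean, var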
7.4 Graph Kernels
Graph kernels are an important means to extend the applicability of kernel
methods to structured data. To be able to use Gaussian processes as a regression technique in the RRL system, a covariance function needs to be defined
between different (state, action) pairs. This covariance function can be provided by the use of a graph kernel.
This section gives a brief overview of graphs and graph kernels. For a more
in-depth discussion of graphs the reader is referred to the work of Diestel (2000)
and Korte and Vygen (2002). For a discussion of different graph kernels see
(Gärtner et al., 2003b).
7.4.1 Labeled Directed Graphs
Before the graph kernel can be introduced, there are some concepts that need
to be defined.
Definition 7.2 (Graph) A graph G is described by a finite set of vertices V,
a finite set of edges E and a function Ψ that denotes which vertices belong to
which edge.
Definition 7.3 (Labeled Graph) For labeled graphs there is additionally a
set of labels L along with a function label assigning a label to each edge and
vertex.
Definition 7.4 (Directed Graph) For directed graphs the function Ψ : E →
{V × V} maps each edge to the tuple consisting of its initial and terminal node.
Edges e in a directed graph for which Ψ(e) = (v, v) are called loops. Two edges
e, e′ are parallel if Ψ(e) = Ψ(e′).
Frequently, only graphs without parallel edges are considered. For application within the RRL setting however, it is important to also consider graphs
with parallel edges.
Sometimes some enumeration of the vertices and labels in a graph is assumed, i.e., V = {ν_i}_{i=1}^{n} where n = |V| and L = {ℓ_r}_{r∈N}.²
To refer to the vertex and edge set of a specific graph, the notation V(G) and
E(G) can be used. Wherever two graphs are distinguished by their subscript
(Gi ) the same notation will be used to distinguish their vertex and edge sets.
Figure 7.2 shows examples of a graph, a labeled graph and a directed graph.
For all these graphs,
V = {ν1 , ν2 , ν3 , ν4 , ν5 }
and
E = {(ν1 , ν1 ), (ν1 , ν3 ), (ν2 , ν1 ), (ν2 , ν5 ), (ν3 , ν2 ), (ν3 , ν4 ), (ν4 , ν2 ), (ν4 , ν3 ), (ν4 , ν5 )}
Some special graphs, relevant for the description of graph kernels are walks,
paths, and cycles.
Definition 7.5 (Walk) A walk w (sometimes called an ‘edge progression’) is
a sequence of vertices vi ∈ V and edges ei ∈ E with w = v1 , e1 , v2 , e2 , . . . en , vn+1
and Ψ(ei ) = (vi , vi+1 ). The length of the walk is equal to the number of edges
in this sequence, i.e., n in the above case.
² While ℓ_1 will always be used to denote the same label, l_1 is a variable that can take
different values, e.g., ℓ_1, ℓ_2, .... The same holds for vertex ν_1 and variable v_1.

Figure 7.2: From left to right, a graph, a labeled graph and a directed graph,
all with the same node and edge set.
Definition 7.6 (Path) A path is a walk in which v_i ≠ v_j ⇔ i ≠ j and e_i ≠ e_j ⇔ i ≠ j.
Definition 7.7 (Cycle) A cycle is a path followed by an edge en+1 where
Ψ(en+1 ) = (vn+1 , v1 ).
Figure 7.3 gives an illustration of a walk, a path and a cycle in the directed
graph of Figure 7.2. Note that when a cycle exists in a graph, there are infinitely
many walks in that graph and walks of infinite length.

Figure 7.3: From left to right, a walk, a path and a cycle of a graph.
7.4.2 Graph Degree and Adjacency Matrix
Some functions describing the neighborhood of a vertex v in a graph G also
need to be defined.
Definition 7.8 (Outdegree) δ + (v) = {e ∈ E | Ψ(e) = (v, u)} is the set
of edges that start from the vertex v. The outdegree of a vertex v is defined
as |δ + (v)|. The maximal outdegree of a graph G is denoted by ∆+ (G) =
max{|δ + (v)|, v ∈ V}.
Definition 7.9 (Indegree) δ − (v) = {e ∈ E | Ψ(e) = (u, v)} is the set of edges
that arrive at the vertex v. The indegree of a vertex v is defined as |δ − (v)|. The
maximal indegree of a graph G is denoted by ∆− (G) = max{|δ − (v)|, v ∈ V}.
For example, the outdegree of vertex ν_3 in the directed graph of Figure
7.3 is |δ⁺(ν_3)| = 2, while the maximal outdegree of the graph is ∆⁺(G) = 3. The
indegree of the same node is |δ⁻(ν_3)| = 2, which is also the graph's maximal
indegree.
For a compact representation of the graph kernel the adjacency matrix E
of a graph will be used.
Definition 7.10 (Adjacency Matrix) The adjacency matrix E of a graph
G is a square matrix where component [E]ij of the matrix corresponds to the
number of edges between vertex νi and νj .
The adjacency matrix of the directed graph in Figure 7.3 is the following:

E = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}
Parallel edges in the graph would give rise to components with values greater
than 1. Replacing the adjacency matrix E by its n-th power (n ∈ N, n ≥ 0),
the interpretation is quite similar. Each component [E n ]ij of this matrix gives
the number of walks of length n from vertex νi to νj .
It is clear that the maximal indegree equals the maximal column sum of
the adjacency matrix and that the maximal outdegree equals the maximal row
sum of the adjacency matrix. For a ≥ ∆⁺(G)∆⁻(G), a^n is an upper bound on
each component of the matrix E^n. This is useful to determine the convergence
properties of some graph kernels.
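As a quick illustration of these walk counts (using the adjacency matrix listed above; the NumPy usage is only for illustration and is not part of the thesis implementation):

    import numpy as np

    # Adjacency matrix E of the directed graph in Figure 7.3, as given above.
    E = np.array([[1, 0, 1, 0, 0],
                  [1, 0, 0, 0, 1],
                  [0, 1, 0, 1, 0],
                  [0, 1, 1, 0, 1],
                  [0, 0, 0, 0, 0]])

    # [E^n]_{ij} counts the walks of length n from vertex nu_i to vertex nu_j.
    E2 = np.linalg.matrix_power(E, 2)
    print(E2[0, 1])                                    # 1: the walk nu_1 -> nu_3 -> nu_2
    print(E.sum(axis=1).max(), E.sum(axis=0).max())    # maximal outdegree 3, indegree 2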
7.4.3 Product Graph Kernels
This section briefly reviews one of the graph kernels defined in (Gärtner et al.,
2003b). Technically, this kernel is based on the idea of counting the number
of walks in a product graph. Note that the definitions given here are more
complicated than those given in (Gärtner et al., 2003b) as parallel edges have
to be considered here.
Product graphs (Imrich and Klavžar, 2000) are a very interesting tool in discrete mathematics. The four most important graph products are the Cartesian,
the strong, the direct, and the lexicographic product. While the most fundamental one is the Cartesian product, in this context the direct graph product
is the most important one. However, the definition needs to be extended to labelled directed graphs. For that consider a function match(l1 , l2 ) that is ‘true’
if the labels l1 and l2 ‘match’. In the simplest case match(l1 , l2 ) ⇔ l1 = l2 .
Using this function, the direct product of two graphs is defined as follows:
Definition 7.11 (Direct Product Graph) The direct product of the two
graphs G1 = (V1 , E1 , Ψ1 ) and G2 = (V2 , E2 , Ψ2 ) is denoted by G1 × G2 . The
vertex set of the direct product is defined as:
V(G1 × G2 ) = {(v1 , v2 ) ∈ V1 × V2 : match(label (v1 ), label (v2 ))}
The edge set is then defined as:
E(G1 × G2 ) = {(e1 , e2 ) ∈ E1 × E2 : ∃ (u1 , u2 ), (v1 , v2 ) ∈ V(G1 × G2 )
∧ Ψ1 (e1 ) = (u1 , v1 ) ∧ Ψ2 (e2 ) = (u2 , v2 )
∧ match(label (e1 ), label (e2 ))}
Given an edge (e1 , e2 ) ∈ E(G1 × G2 ) with Ψ1 (e1 ) = (u1 , v1 ) and Ψ2 (e2 ) =
(u2 , v2 ) the value of ΨG1 ×G2 is:
ΨG1 ×G2 ((e1 , e2 )) = ((u1 , u2 ), (v1 , v2 ))
The graphs G1 , G2 are called the factors of graph G1 × G2 . The labels of the
vertices and edges in graph G1 × G2 correspond to the labels in the factors.
Figure 7.4 shows two directed labelled graphs and their direct product. The
edges are presumed to be unlabelled. Intuitively, higher levels of similarity
between two graphs leads to a higher number of nodes and edges in their
product graph.
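Under the simple match(l1, l2) ⇔ l1 = l2 used above (and ignoring edge labels, as in Figure 7.4), the adjacency matrix of a direct product graph can be assembled as follows; this is an illustrative sketch, not the RRL implementation:

    import numpy as np

    def direct_product_adjacency(V1, E1, V2, E2):
        # V1, V2 map vertex -> label; E1, E2 are lists of directed edges (u, v).
        pairs = [(v1, v2) for v1 in V1 for v2 in V2 if V1[v1] == V2[v2]]
        index = {p: i for i, p in enumerate(pairs)}
        A = np.zeros((len(pairs), len(pairs)), dtype=int)
        for (u1, v1) in E1:
            for (u2, v2) in E2:
                if (u1, u2) in index and (v1, v2) in index:
                    A[index[(u1, u2)], index[(v1, v2)]] += 1   # parallel edges add up
        return pairs, A

The returned matrix is the adjacency matrix of the direct product used by the product graph kernel defined next.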
Having introduced product graphs, finally, the product graph kernel can be
defined.
Figure 7.4: Two labelled directed graphs and their direct product graph at the
bottom right. (For simplicity reasons, no labels are used on the edges in this
example.)
Definition 7.12 (Product Graph Kernel) Let G1 , G2 be two graphs, let
E× denote the adjacency matrix of their direct product E× = E(G1 × G2 ),
and let V× denote the vertex set of the direct product V× = V(G1 × G2 ). With
a sequence of weights λ = λ0 , λ1 , . . . (λi ∈ R; λi ≥ 0 for all i ∈ N) the product
graph kernel is defined as
k_\times(G_1, G_2) = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{n=0}^{\infty} \lambda_n E_\times^n \right]_{ij}    (7.6)
if the limit exists.
For the proof that this kernel is positive definite, see (Gärtner et al., 2003b)3 .
There it is shown that this product graph kernel corresponds to the inner product in a feature space made up by all possible contiguous label sequences in
the graph. Each feature value corresponds to the number of walks with such a
label sequence, weighted by \sqrt{\lambda_n}, where n is the length of the sequence.

³ The extension to parallel edges is straightforward.

7.4.4 Computing Graph Kernels
To compute this graph kernel, it is necessary to compute the limit of the above
matrix power series. Two possibilities immediately present themselves: the
exponential weight setting (λ_i = β^i/i!), for which the limit of the above matrix
power series always exists, and the geometric weight setting (λ_i = γ^i), for which
the limit exists if γ < 1/a, where a = ∆⁺(G)∆⁻(G) as above.
Exponential Series   Similar to the exponential of a scalar value (e^b = 1 +
b/1! + b²/2! + b³/3! + ...), the exponential of the square matrix E is defined as

e^{\beta E} = \lim_{n \to \infty} \sum_{i=0}^{n} \frac{(\beta E)^i}{i!}    (7.7)

where β^0/0! = 1 and E^0 = I. Feasible exponentiation of matrices in general
requires diagonalizing the matrix. If the matrix E can be diagonalized such
that E = T^{-1}DT, arbitrary powers of the matrix can be easily calculated as
E^n = (T^{-1}DT)^n = T^{-1}D^nT, and for a diagonal matrix the power can be
calculated component-wise: [D^n]_{ii} = [D_{ii}]^n. Thus e^{βE} = T^{-1}e^{βD}T, where e^{βD}
can be calculated component-wise. Once the matrix is diagonalized, computing
the exponential matrix can be done in linear time. Matrix diagonalization
is a matrix eigenvalue problem and such methods have roughly cubic time
complexity.

Geometric Series   The geometric series Σ_i γ^i is known to converge if and
only if |γ| < 1. In that case the limit is given by lim_{n→∞} Σ_{i=0}^{n} γ^i = 1/(1 − γ).
Similarly, the geometric series of a matrix is defined as

\lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i E^i    (7.8)
if γ < 1/a, where a = ∆+ (G)∆− (G). Feasible computation of the limit of a
geometric series is then possible by inverting the matrix I − γE.
To see this, suppose (I − γE)x = 0 and thus γEx = x and (γE)^i x = x. Now,
note that, given the limitations on γ, (γE)^i → 0 as i → ∞. Therefore x = 0 and
I − γE is regular and can be inverted. Then (I − γE)(I + γE + γ²E² + ···) = I
and (I − γE)^{-1} = I + γE + γ²E² + ··· is obvious. Matrix inversion is roughly
of cubic time complexity.
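Both closed forms can be sketched in a few lines (assuming, as in the text, that the adjacency matrix is diagonalizable; scipy.linalg.expm would be an alternative for the exponential case; all names are illustrative):

    import numpy as np

    def exponential_limit(E, beta):
        # e^{beta E} via diagonalization E = T D T^{-1}, as described above.
        eigvals, T = np.linalg.eig(E)
        return (T @ np.diag(np.exp(beta * eigvals)) @ np.linalg.inv(T)).real

    def geometric_limit(E, gamma):
        # (I - gamma E)^{-1}, valid when gamma < 1/a.
        return np.linalg.inv(np.eye(E.shape[0]) - gamma * E)

    def product_graph_kernel(E_product, beta=None, gamma=None):
        # Equation 7.6: sum all entries of the weighted power series of E_x.
        limit = exponential_limit(E_product, beta) if beta is not None \
            else geometric_limit(E_product, gamma)
        return float(limit.sum())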
Weight Influences When comparing Equation 7.6 with the two equations
7.7 and 7.8, it can be seen that the role of the weight λ_i is taken by β^i/i! in the
exponential series and by γ^i in the geometric series. Each of these two forms

Figure 7.5: The weight distribution of the geometric series for different values
of the parameter γ (1/2, 1/5 and 1/10) after normalization.
have a parameter (β and γ respectively) that can be used to tune the weights
of walks of a given length. However, the shape of these weights is different for
each form.
For the geometric series, Figure 7.5 shows the resulting weights for a number
of different values of the parameter γ. Although the weights do change for
varying parameters, the relative importance does not change. The highest
weights are always for the shortest walks. For the exponential series however,
different values of the parameter β shift the highest importance to walks of
different lengths, as shown in Figure 7.6. This allows for better fine-tuning of
the kernel towards different applications. Therefore, the exponential series will
be used to compute kernels for the RRL system.
In relatively sparse graphs, it is often more practical to actually count the
number of walks rather than using the closed forms presented.
7.4.5 Radial Basis Functions
In finite state-action spaces Q-learning is guaranteed to converge if the mapping between (state, action) pairs and Q-values is represented explicitly. One
advantage of Gaussian processes is that for particular choices of the covariance
function, the representation is explicit.
To see this, consider the matching kernel k_δ : X × X → R defined as:

k_\delta(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}
as the covariance function between examples. Let the predicted Q-value be
Figure 7.6: The weight distribution of the exponential series for different values
of the parameter β (1, 5 and 10) after normalization.
the mean of the distribution over target values, i.e., t̂_{n+1} = k_{N+1}^T C_N^{-1} t_N, where
the variables are used as defined in Section 7.3. Assume the training examples
are distinct and the test example is equal to the j-th training example. It
then turns out that C_N = I = C_N^{-1}, where I denotes the identity matrix. As
furthermore k_{N+1} is then the vector with all components equal to 0 except the
j-th, which is equal to 1, it is obvious that t̂_{n+1} = t_j and the representation is
thus explicit.
A frequently used kernel function for instances that can be represented by
vectors is the Gaussian radial basis function kernel (RBF). Given the bandwidth
parameter ρ the RBF kernel is defined as: k_{rbf}(x, x′) = exp(−ρ ||x − x′||²). For
large enough ρ the RBF kernel behaves like the matching kernel. In other
words, the parameter ρ can be used to regulate the amount of generalization
performed in the Gaussian process algorithm: For very large ρ all instances
are very different and the Q-function is represented explicitly; for small enough
ρ all examples are considered very similar and the resulting function is very
smooth.
7.5 Blocks World Kernels
This section first shows how the states and actions in the blocks world can
be represented as a graph. Then it discusses the kernel that is used as the
covariance function between blocks worlds (state, action) pairs.
Figure 7.7: The graph representation of the blocks world state (‘block’, ‘floor’
and ‘clear’ vertices connected by ‘on’ edges).
7.5.1 State and Action Representation
To be able to apply the graph kernel to the blocks world, the (state, action)
pairs of the blocks world need to be represented as a graph.
Figure 7.7 shows a blocks world state and the graph that will be used to
represent this state.
To also represent the action of the (state, action) pair to which a Q-value
belongs, an edge with the label ’action’ is added between the two blocks that
are manipulated, as well as the extra labels ‘a1 ’ and ‘a2 ’ which identify the
moving block and the target block.
For the On(A,B) goal, the graph representation needs to be extended with
an indication of the two blocks that need to be stacked. This is represented
both by adding extra labels ‘g1’ and ‘g2’ to the blocks and by adding an extra edge
labeled ‘goal’ between the two blocks that need to be stacked. This addition of edges
and the labels connected to these edges allow for the representation of an
arbitrary goal. Figure 7.8 shows an example of a full (state, action) pair with
Figure 7.8: The graph representation of the blocks world (state, action) pair with On(3,2) as the goal. [Diagram: a four-block state and its graph, with vertices v0 to v5; besides {floor}, {block} and {clear}, the involved blocks carry the extra labels a1, a2, g1 and g2.]
Figure 7.8 shows an example of a full (state, action) pair with the representation of the On(3,2) goal included.
A more complete description of the blocks world to graph translation is
included in Appendix A.
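To make the translation concrete, the following sketch (a hypothetical four-block configuration with illustrative vertex names, not the exact state of Figure 7.8; the exact label sets are given in Appendix A) encodes a (state, action) pair and an On(A,B) goal as a labeled graph in the way described above:

# Vertices carry label sets and edges carry a label, as in Figures 7.7 and 7.8.
# Hypothetical configuration: block 4 on 3 on 1 on the floor, block 2 on the
# floor, action = move(3, 2), goal = on(3, 2).
vertices = {
    "floor": {"floor"},
    "b1": {"block"},
    "b2": {"block", "a2", "g2"},   # target of the action and of the goal
    "b3": {"block", "a1", "g1"},   # the block that is moved
    "b4": {"block"},
    "clear": {"clear"},
}
edges = [
    ("b1", "floor", "on"), ("b2", "floor", "on"),   # the two stacks
    ("b3", "b1", "on"), ("b4", "b3", "on"),
    ("clear", "b4", "on"), ("clear", "b2", "on"),   # 'clear' points at the stack tops
    ("b3", "b2", "action"),                         # the move that is evaluated
    ("b3", "b2", "goal"),                           # the on(3,2) goal
]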
7.5.2
A Blocks World Kernel
In order to have a means to regulate the amount of generalization in the blocks
world setting, the graph kernel is not used directly, but ‘wrapped’ in a Gaussian
RBF function.
Of the two settings used to compute the graph kernel, the exponential
setting allows tuning the graph kernel to consider different lengths of walks as
most important.
Thus, let k be the graph kernel with exponential weights; the kernel used in the blocks world is then given by
\[ k^*(x, x') = \exp\bigl[-\rho\,\bigl(k(x,x) - 2\,k(x,x') + k(x',x')\bigr)\bigr]. \]
This choice introduces two parameters into the regression system: β, which allows the user to shift the focus to different lengths of walks in the direct product graph, and ρ, which allows the level of generalization to be tuned.
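A minimal sketch of this wrapping (graph_kernel is a placeholder for the exponential-weight graph kernel of the previous sections; its name and signature are illustrative):

import math

def wrapped_kernel(x, y, graph_kernel, rho):
    # Gaussian RBF wrapper around a graph kernel:
    #   k*(x, y) = exp(-rho * (k(x, x) - 2 k(x, y) + k(y, y)))
    # i.e. an RBF on the distance that the graph kernel induces in feature
    # space; the beta parameter is assumed to be fixed inside graph_kernel.
    squared_distance = graph_kernel(x, x) - 2.0 * graph_kernel(x, y) + graph_kernel(y, y)
    return math.exp(-rho * squared_distance)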
7.6
Experiments
To test the kernel based regression system (kbr), RRL once again tackled the
same blocks world tasks as in the two previous chapters.
As the system has two different parameters whose influence needs to be investigated, an array of experiments was run to find good values of these
parameters. In this text, only a small number of these experiments are shown,
as the influence of one parameter is best illustrated when using the best value
of the other parameter. In the experimental setup used, the best performance
was reached with β = 10 and ρ = 0.01. Therefore, the influence of the β
parameter is shown in tests where ρ = 0.01 and likewise for the tests on the
influence of the parameter ρ.
7.6.1
The Influence of the Series Parameter β
As already stated, the exponential series parameter β influences the weights given to the numbers of walks of a certain length, as shown in Figure 7.6. Although worlds with only 3 to 5 blocks are considered, the maximum length of a walk in the two subgraphs in the Stacking and Unstacking experiments is 6, because of the 'clear' and 'floor' nodes. Because of the extra 'goal' edge in the On(A,B) experiments, there is a possibility of a cycle in the graph and therefore also of walks of infinite length.
Figure 7.9: The performance of kbr when focussing on walks of different lengths for the Stacking task. [Plot: average total reward versus number of episodes (0 to 1000) for β = 1, 5, 10, 50 and 100.]
The different values of the β parameter tested were 1 (which would be basically the same as using the geometric setting), 5, 10, 50 and 100. For the Stacking and Unstacking tasks, there should be little difference between the values larger than 10; for the On(A,B) task, these large values could be significant.
As expected, Figure 7.9 shows that there is little difference between the
higher values of the β parameter for the Stacking task. The performance graph
does show a difference between the low values. The value of 1, corresponding
to the geometric kernel computation, performs less well than the rest. Even
the value of 5 seems a little worse than the rest, indicating that the longest
walks should be given the highest weights. Intuitively, this is consistent with
what one would expect, as the Stacking task, when considered in the graph
notation, basically consists of building the longest walk possible.
Figure 7.10 shows that the behavior of the β parameter for the Unstacking task is largely similar. The only difference is that the value of 5 works as well as the higher values. This makes sense as the goal of Unstacking will cause most of the encountered (state, action) graphs to have many short walks.
For the On(A,B) task, where it is possible for the (state, action) graph to contain cycles and walks of infinite length, there is a larger difference between the different parameter values. The best performance is reached for the values 5 and 10, i.e., focussing on walks of length 5 to 6 or 10 to 11. The performance drops for lower values, which yield a weight distribution comparable to that of the geometric setting, and for higher values, which make the kernel focus on the cycles in the graph.
Figure 7.10: The performance of kbr when focussing on walks of different lengths for the Unstacking task. [Plot: average total reward versus number of episodes (0 to 1000) for β = 1, 5, 10, 50 and 100.]
Figure 7.11: The performance of kbr when focussing on walks of different lengths for the On(A,B) task. [Plot: average total reward versus number of episodes (0 to 1000) for β = 1, 5, 10, 50 and 100.]
Figure 7.12: The performance of kbr for different levels of generalization on the Stacking task. [Plot: average total reward versus number of episodes (0 to 1000) for ρ = 0.00001, 0.001, 0.01, 0.1, 10 and 1000.]
7.6.2
The Influence of the Generalization Parameter ρ
The parameter ρ allows tuning of the amount of generalization by controlling the width of the radial basis functions that are used to wrap the graph kernel. Low values, which correspond to a large amount of generalization, will cause kbr to learn more quickly, but possibly lead to lower accuracy of the resulting policy. High values, resulting in little generalization, will cause RRL to learn slowly and might prevent it from learning a policy for the entire state space.
Figure 7.12 shows the influence of different generalization levels for the
Stacking goal. High generalization leads to the lowest resulting performance
(although the ρ value of 0.00001 works well in the beginning of the experiment),
but overall the differences are quite small. Even for very high ρ values (1000),
kbr succeeds in learning a good policy.
The Unstacking goal causes a larger delay when kbr uses very little generalization. The Unstacking goal is harder to reach than the Stacking goal, so a
smaller percentage of the set of learning examples contains useful information.
Higher levels of generalization cause these values to spread more and help RRL
generate a better performing policy. At very low values of ρ, overgeneralization
causes the performance to drop again.
For the On(A,B) goal, there is again little difference between the different values of the parameter, although very little generalization causes kbr to learn more slowly.
Figure 7.13: The performance of kbr for different levels of generalization on the Unstacking task. [Plot: average total reward versus number of episodes (0 to 1000) for ρ = 0.00001, 0.001, 0.01, 0.1, 10 and 1000.]
Figure 7.14: The performance of kbr for different levels of generalization on the On(A,B) task. [Plot: average total reward versus number of episodes (0 to 1000) for ρ = 0.00001, 0.001, 0.01, 0.1, 10 and 1000.]
Figure 7.15: Performance comparison between the tg, rib and kbr algorithms for the Stacking task. [Plot: average total reward versus number of episodes (0 to 1000).]
7.6.3
Comparing kbr , rib and tg
For the comparison of kbr to the two other regression engines tg and rib ,
the parameter values β = 5 and ρ = 0.01 were chosen. The architecture of
the kbr regression engine is most comparable to the rib algorithm, as each
new learning example is stored and influences the prediction of new examples
according to their similarity. tg, on the other hand, uses the learning examples
to build an explicit model of the Q-function.
Figure 7.15 shows the performances of the three algorithms on the Stacking
goal. tg needs more learning episodes to reach the same level of performance
as the two other algorithms, of which kbr is a little faster. It needs to be
pointed out, though, that the tg algorithm is computationally a lot faster and that it can handle a lot more episodes than rib and kbr with the same computational capacity. However, this is only advantageous for environments that can react quickly to the agent's decisions and have a low exploration cost, such as completely simulated environments.
For the Unstacking goal, the difference in the learning rate between tg and
the two others is even more apparent, as tg does not learn the optimal policy
during the available learning episodes.
The performance curves of Figure 7.17 show that none of the three regression engines allows RRL to learn the optimal policy for the On(A,B) goal. Remarkably, the three systems perform very comparably; tg is again a little slower at the start, but it quickly catches up.
As shown in Table 7.1, the current implementation of the RRL-kbr system is quite slow, and comparable in learning speed to the rib-ep system (see Table 6.2). This is largely due to the fact that no example selection strategy has been implemented for RRL-kbr so far.
Figure 7.16: Performance comparison between the tg, rib and kbr algorithms for the Unstacking task. [Plot: average total reward versus number of episodes (0 to 1000).]
Figure 7.17: Performance comparison between the tg, rib and kbr algorithms for the On(A,B) task. [Plot: average total reward versus number of episodes (0 to 1000).]
Table 7.1: The execution times for RRL-tg, RRL-rib and RRL-kbr on a Pentium III 1.1 GHz machine in seconds. The first number in each cell indicates the learning time, the second indicates the testing time.

Task          RRL-tg     RRL-rib     RRL-kbr
Stacking      14 – 1     27 – 27     650 – 900
Unstacking    15 – 2     7 – 10      1200 – 2300
On(A,B)       16 – 1     44 – 100    1500 – 3000
As such, the size of the covariance matrix used by the Gaussian processes algorithm grows with each visited (state, action) pair. This also greatly increases the time needed to make predictions on new (state, action) pairs, as indicated by the second number in each cell of Table 7.1.
7.7
Future Work
So far, no work has been done to select the examples that are accepted by the kbr system. All presented examples are used to build the covariance matrix C, which (as a consequence) tends to grow very large and greatly influences the computational efficiency of the kbr regression algorithm. To reduce the size of this matrix (and the related vectors), measures similar to those used for the rib algorithm could be used.
However, the use of kernels allows for other selection mechanisms as well,
such as the removal of examples that give rise to parallel covariance vectors
(i.e., (state, action) pairs that are identical with respect to the feature space).
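Such a test could look as follows (a sketch only, not part of the current implementation; it flags a candidate example whose covariance vector is nearly parallel to that of an already stored example):

import numpy as np

def is_redundant(k_new, K_stored, tolerance=1e-6):
    # k_new: covariances between a candidate example and the stored examples.
    # K_stored: covariance matrix of the stored examples (one row per example).
    # The candidate is redundant if its covariance vector is (nearly) parallel
    # to the row of an already stored example, i.e. the two (state, action)
    # pairs are identical with respect to the feature space.
    for row in K_stored:
        cosine = np.dot(k_new, row) / (np.linalg.norm(k_new) * np.linalg.norm(row))
        if abs(cosine) > 1.0 - tolerance:
            return True
    return False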
Another item that needs to be addressed in the future to take full advantage of Gaussian processes for regression is the use of the probability estimates that can be predicted. An obvious use for them would be to guide exploration. The probability estimates can be used to build confidence intervals for the predicted Q-value. Such confidence intervals are useful for the interval based exploration that was discussed in Section 2.3.3.2.
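For standard Gaussian process regression such an interval follows from the predictive variance. The sketch below uses the usual formulas (the argument names are illustrative; this is not yet part of the RRL-kbr implementation):

import numpy as np

def gp_confidence_interval(k_vec, k_self, C, t, width=1.96):
    # Predictive mean and a confidence interval for one new (state, action) pair,
    # using the standard Gaussian process formulas:
    #   mean     = k^T C^-1 t
    #   variance = k(x, x) - k^T C^-1 k
    # k_vec : covariances between the new pair and the stored pairs
    # k_self: covariance of the new pair with itself
    # C     : covariance matrix of the stored pairs
    # t     : stored target (Q-)values
    C_inv_k = np.linalg.solve(C, k_vec)
    mean = C_inv_k @ t
    variance = max(k_self - k_vec @ C_inv_k, 0.0)
    deviation = np.sqrt(variance)
    return mean - width * deviation, mean + width * deviation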
Possibly, the probability estimates can also be used for learning example
selection, much like the use of the standard deviation in the rib system.
So far, the parameters used in the kernel function (such as the β and ρ parameters in the blocks world experiments) have been fixed throughout a single learning experiment. Gaussian processes, however, allow the use of parameters in the covariance function — referred to as hyper-parameters — and the tuning of these parameters to fit the data more precisely. The application of these parameter learning techniques to the RRL setting seems non-trivial, though.
7.8
Conclusions
This chapter introduced Gaussian processes and graph kernels as a regression
technique for the RRL system. The graph kernels act as the covariance function
needed for the Gaussian processes and are based on the number of walks in
the direct product graph of the two (state, action) graphs. With the weights
of the exponential series setting, the kernels can be tuned to walks of different
lengths. Wrapping this kernel in a radial basis function allows one to control
the amount of generalization.
Experimental results show that the behavior of kernel based regression is comparable to, and maybe even better than, that of instance based regression. Both of these outperform tg with respect to the learning rate per episode.
Part III
On Larger Environments
Chapter 8
On adding Guidance to
Relational Reinforcement
Learning
“How do you explain school to a higher intelligence?”
E.T. - The Extra Terrestrial
8.1
Introduction
In structural domains, the state space is typically very large, and although the
relational regression algorithms introduced before can provide the right level of
abstraction to learn in such domains, the problem remains that rewards may be
distributed very sparsely in the state space. Using random exploration through
the search space, rewards may simply never be encountered. In some of the
application domains mentioned above this prohibits RRL from finding a good
solution.
While plenty of exploration strategies exist (Wiering, 1999), few deal with
the problems of exploration at the start of the learning process. It is exactly
this problem that occurs often in the RRL setting. There is, however, an
approach which has been followed with success, and which consists of guiding
the Q-learner with examples of “reasonable” strategies, provided by a teacher
(Smart and Kaelbling, 2000). Thus a mix between the classical unsupervised
Q-learning and (supervised) behavioral cloning is obtained. It is the suitability
of this mixture in the context of RRL that is explored in this chapter.
This chapter introduces guidance as a way to help RRL (and other reinforcement learning techniques) tackle large environments with sparse rewards.
After arguing the need for guidance in relational reinforcement learning, several modes of guidance are suggested and empirically evaluated on larger versions of the blocks world problems. This substantial array of test cases also provides a view of the individual characteristics of the three different regression engines. The chapter concludes by discussing related work and presenting a large array of possible directions for further work.
The idea of using guidance was developed together with Sašo Džeroski and
published in (Driessens and Džeroski, 2002a; Driessens and Džeroski, 2002b;
Driessens and Džeroski, 2004).
8.2
Guidance and Reinforcement Learning
8.2.1
The Need for Guidance
In the early stages of learning, the exploration strategy used in Q-learning is
pretty much random and causes the learning system to perform poorly. Only
when enough information about the environment is discovered, i.e., when sufficient knowledge about the reward function is gathered, can better exploration
strategies be used. Gathering knowledge about the reward function can be
hard when rewards are sparse and especially if these rewards are hard to reach
using a random strategy. A lot of time is usually spent exploring regions of
the state-action space without learning anything because no rewards (or only
similar rewards) are encountered.
Relational applications often suffer from this problem because they deal
with very large state-spaces when compared to attribute-value problems. First,
the size of the state-space grows exponentially with regard to the number of
objects in the world, the number of properties of each object and the number of
possible relations between objects. Second, when actions are related to objects
— such as moving one object to another — the number of actions grows equally
fast.
For example, when the number of blocks in the blocks world increases, the
goal states — and as a consequence the (state, action) pairs that yield a reward
— become very sparse. To illustrate this, Figure 8.1 shows the success-rate of
random policies in the blocks world. The agent with the random policy starts
from a randomly generated state (which is not a goal state) and is allowed to
take at most 10 actions. For each of the three goals (i.e., Stacking, Unstacking
and On(A,B)) the graph shows the percentage of trials that end in a goal state
and therefore with a reward, with respect to the number of blocks in the world.
As shown in the graph, the Unstacking goal in the blocks world with 10 blocks would almost never be reached by random exploration. Not only is there a single goal state out of 59 million states, but the number of possible actions increases as one gets closer to the goal state: in a state from which a single action leads to the goal state, there are 73 actions possible.
Figure 8.1: Success rate for the three goals in the blocks world with a random policy. [Plot: percentage of succeeding episodes versus number of blocks (3 to 10) for the stack, unstack and on(A,B) goals.]
The graph of Figure 8.2 shows the percentage of learning examples with a non-zero Q-value that is presented to the regression algorithm. Since all examples with a zero Q-value can be regarded as noise for the regression algorithm, it is clear that learning the correct Q-function from these examples is very hard.
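Success rates like those of Figure 8.1 can be approximated by simple simulation. The sketch below (an illustration with a minimal blocks world model and the Stacking goal only; it does not reproduce the state and action distributions used in the actual experiments) estimates the fraction of random episodes of at most 10 actions that reach a goal state:

import random

def random_state(num_blocks):
    # A random (not uniformly distributed) blocks world state,
    # represented as a list of stacks, each stack listed bottom first.
    stacks = []
    for block in random.sample(range(1, num_blocks + 1), num_blocks):
        if stacks and random.random() < 0.5:
            random.choice(stacks).append(block)
        else:
            stacks.append([block])
    return stacks

def is_stacked(stacks):
    # Stacking goal: all blocks on a single stack.
    return len(stacks) == 1

def random_episode(num_blocks, max_actions=10):
    stacks = random_state(num_blocks)
    while is_stacked(stacks):                       # never start in a goal state
        stacks = random_state(num_blocks)
    for _ in range(max_actions):
        source = random.choice(stacks)
        block = source.pop()                        # move the clear block on top of 'source'
        targets = [s for s in stacks if s and s is not source]
        target = random.choice(targets + [None])    # None means: put it on the floor
        if target is None:
            target = []
            stacks.append(target)
        target.append(block)
        stacks = [s for s in stacks if s]           # drop the stack if it became empty
        if is_stacked(stacks):
            return True
    return False

trials = 10000
print("estimated success rate:", sum(random_episode(10) for _ in range(trials)) / trials)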
8.2.2
Using “Reasonable” Policies for Guidance
Although random policies can have a hard time reaching sparsely spread rewards in a large world, it is often relatively easy to reach these rewards by
using “reasonable” policies. While optimal policies are certainly “reasonable”,
non-optimal policies are often easy (or easier) to implement or generate than
optimal ones. One obvious candidate for an often non-optimal, but reasonable,
controller would be a human expert.
To integrate the guidance that these reasonable policies can supply with
our relational reinforcement learning system, the given policy is used to choose
the actions instead of a policy derived from the current Q-function hypothesis
(which will not be informative in the early stages of learning). The episodes
created in this way can be used in exactly the same way as normal episodes
in the RRL algorithm to create a set of examples which is presented to the
relational regression algorithm. In case of a human controller, one could log
the normal operation of a system together with the corresponding rewards and
generate the learning examples from this log.
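Schematically, generating learning examples from such a guided episode looks as follows (a sketch: environment, reasonable_policy and q_hat are placeholders for the world model, the teacher's policy and the current Q-function approximation, and the discount factor is an assumed value; the backup mirrors the standard Q-learning update used by RRL):

def guided_episode(environment, reasonable_policy, q_hat, gamma=0.9, max_steps=100):
    # Generate one episode with the teacher's ("reasonable") policy and turn it
    # into (state, action, q-value) learning examples for the regression engine.
    trace, state = [], environment.reset()
    for _ in range(max_steps):
        action = reasonable_policy(state)        # the teacher chooses, not the learner
        next_state, reward, done = environment.step(action)
        trace.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    examples = []
    for state, action, reward, next_state in reversed(trace):
        next_actions = environment.actions(next_state)
        target = reward
        if next_actions:
            target += gamma * max(q_hat(next_state, a) for a in next_actions)
        examples.append((state, action, target))
    # These examples are presented to the relational regression algorithm
    # exactly like the examples of a normal exploration episode.
    return examples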
Since tabled Q-learning is exploration insensitive — i.e., the Q-values will converge to the optimal values, independent of the exploration strategy used (Kaelbling et al., 1996) — the non-optimality of the used policy will have no negative effect on the convergence of the Q-table.
Figure 8.2: The percentage of informative examples presented to the regression algorithm for the three goals in the blocks world with a random policy. [Plot: percentage of informative examples versus number of blocks (3 to 10) for the stack, unstack and on(A,B) goals.]
While Q-learning with generalization is not exploration insensitive, the experiments will demonstrate that
the “guiding policy” helps the learning system to reach non-obvious rewards
and that this results in a two-fold improvement in learning performance. In
terms of learning speed, the guidance is expected to help the Q-learner to discover higher yielding policies earlier in the learning experiment. Through the
early discovery of important states and actions and the early availability of
these (state, action) pairs to the generalization engine, it should also be possible for the Q-learner to reach a higher level of performance — i.e., a higher
average reward — in the available time.
While the idea of supplying guidance or another initialization procedure
to increase the performance of a tabula rasa algorithm such as reinforcement
learning is not new (see Section 8.4), it is under-utilized. With the emergence
of new reinforcement learning approaches, such as the RRL system, that are
able to tackle larger problems, this idea is gaining importance and could provide
the leverage necessary to solve really hard problems.
8.2.3
Different Strategies for Supplying Guidance
When supplying guidance by creating episodes and presenting the resulting
learning examples to the used regression engine, different strategies can be
used to decide when to supply this guidance.
One option that will be investigated is supplying the guidance at the beginning of learning, when the reinforcement learning agent is forced to use a
random policy to explore the state-space. This strategy also makes sense when
using guidance from a human expert. After logging the normal operations of a
human controlling the system, one can translate these logs into a set of learning
examples and present this set to the regression algorithm. This will allow the
regression engine to build a partial Q-function which can later be used to guide
the further exploration of the state-space. This Q-function approximation will
neither represent the correct Q-function, nor will it cover the entire state-action
space, but it might be suitable for guiding RRL towards more rewards with
the use of Q-function based exploration. The RRL algorithm explores the
state space using Boltzmann exploration (Kaelbling et al., 1996) based on the
values predicted by the partial Q-function. This strikes a compromise between
exploration and exploitation of the partial Q-function.
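Boltzmann exploration selects each action with a probability proportional to an exponential of its predicted Q-value. A generic sketch (the temperature schedule used in the experiments is not shown):

import math
import random

def boltzmann_action(state, actions, q_hat, temperature=1.0):
    # Sample an action with probability proportional to exp(Q(s, a) / T).
    # A high temperature gives near-random exploration, a low temperature
    # exploits the (partial) Q-function approximation q_hat.
    preferences = [math.exp(q_hat(state, a) / temperature) for a in actions]
    threshold = random.random() * sum(preferences)
    cumulative = 0.0
    for action, preference in zip(actions, preferences):
        cumulative += preference
        if cumulative >= threshold:
            return action
    return actions[-1]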
Another strategy is to interleave the guidance with normal exploration
episodes. In analogy with human learning, this mixture of perfect and more
or less random examples can make it easier for the regression engine to distinguish beneficial actions from other ones. The influence of guidance when it is
supplied with different frequencies will be compared.
One benefit of interleaving guided traces with exploration episodes is that
the reinforcement learning system can remember the episodes or starting states
that did not lead to a reward. It can then ask for guidance starting from the
states in which it failed. This will allow the guidance to zoom in on areas of
the state-space which are not yet covered correctly by the regression algorithm.
This type of guidance will be called active guidance.
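The bookkeeping needed for active guidance can be as simple as the following sketch (illustrative names; an episode counts as failed when it ends without the goal reward):

import random

class ActiveGuidance:
    # Remember the starting states of failed episodes and hand one back
    # whenever a guided trace is requested.
    def __init__(self):
        self.failed_starts = []

    def record(self, start_state, total_reward):
        # In the blocks world tasks an episode fails when no reward of 1 is received.
        if total_reward == 0:
            self.failed_starts.append(start_state)

    def starting_state(self, random_start):
        if self.failed_starts:
            index = random.randrange(len(self.failed_starts))
            return self.failed_starts.pop(index)
        return random_start()    # fall back to a randomly generated starting state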
8.3
Experiments
The guidance added to exploration should have a two-fold effect on learning performance. In terms of learning speed, the guidance should help the
Q-learner to discover better policies earlier. Through the early discovery of
important states and actions and the early availability of these (state, action)
pairs to the generalization engine, it should also be possible for the Q-learner
to reach a higher level of performance — i.e., a higher average reward — for
a given amount of learning experience. The experiments will test the effects
of guidance at the start of the learning experiment as well as guidance that is
spread throughout the experiment and active guidance, and will illustrate the
differences between them.
The influence of guided traces will be tested on each of the three different
regression algorithms discussed previously. All have different strategies for
dealing with incoming learning examples and as such will react differently to
the presented guidance. Although the chosen setup will not allow for a fair
comparison between the three algorithms, the different learning behaviors will
provide a view on the consequences of the different inner workings of the three
regression engines and possibly an indication of the kind of problems that are
best handled by each algorithm.
8.3.1
Experimental Setup
To test the effects of guidance, RRL (without guidance) will be compared with
G-RRL (with guidance) in the following setup: first RRL is run in its natural
form, giving it the possibility to train for a certain number of episodes; at
regular time intervals the learned policy is extracted from RRL and tested on
a number of randomly generated test problems. To compare with G-RRL some
of the exploration is substituted with guided traces. These traces are generated
by either a hand-coded policy, a previously learned policy or a human controller.
In between these traces, G-RRL is allowed to explore the state-space further on
its own. Note that in the performance graphs, the traces presented to G-RRL
will count as episodes.
Using a blocks world with 10 blocks provides a learning environment which
is large enough for the rewards to become sparse (see Table 5.1 and Figures 8.1
and 8.2). For the blocks world it is easy to write optimal policies for the three
goals. Thus it is easy to supply RRL with a large amount of optimal example
traces.
The tg algorithm needs a higher number of learning examples compared to
the rib and kbr algorithms. On the other hand, the tg implementation is a
lot more efficient than the other implementations, so tg is able to handle more
training episodes for a given amount of computation time. Since the goal of the
experiments is to investigate the influence of guidance for the two systems and
not to compare their performance, the tg algorithm will be supplied with a
lot of training episodes (as it has little difficulty handling them) and the other
algorithms with fewer training episodes (given the fact that they usually don't
need them).
8.3.2
Guidance at the Start of Learning
Reaching a state that satisfies the Stacking goal is not really all that hard,
even with 10 blocks and random exploration: approximately one of every 17
states is a goal state. Even so, some improvement can be obtained by using a
small amount of guidance as shown in Figure 8.3. The tg based algorithm is
quite good at learning a close to optimal policy by itself. However, the added
help from the guided traces helps it to decrease the number of episodes needed
to obtain a certain performance level. (The strange behavior of tg when it is
supplied with 100 guided traces will be further investigated in Section 8.3.3.)
RRL-rib has a harder time with this goal. It does not come close to reaching the optimal policy, but the help it receives from the guided traces does allow it both to reach better performance earlier during learning and to reach a higher level of performance overall.
Figure 8.3: Guidance at the start for the Stacking goal in the blocks world. [Three panels: average total reward versus number of episodes for tg, rib and kbr, each comparing no guidance with 5, 20 or 100 guided traces at the start.]
The strangest behavior is exhibited by the kbr algorithm. Using the guided traces it quickly improves its strategy —
with 100 guided traces it even reaches optimal behavior — but then starts to
decrease its performance when it is allowed to explore the state space on its
own. The graphs show the average performance of RRL over 10 test runs.
As already stated, in a world with 10 blocks, it is almost impossible to
reach a state satisfying the Unstacking goal at random. This is illustrated by
the graphs on the left side of Figure 8.4. It also shows the average performance
over 10 test runs. RRL never learns anything useful on its own, because it
doesn’t succeed in reaching the reward often enough (if ever). Because tg
does not make any decisions until enough evidence has been collected, small
amounts of guidance do not help tg . Even when supplied with 100 optimal
traces, tg does not learn a useful policy. Supplied with 500 optimal traces, tg
has collected enough examples to learn something useful, but the difficulties
with exploring such a huge state-space with so little reward still show in the
fact that tg is able to do very little extra with this information. This is caused
by the fact that tg throws out the statistics it collected when it chooses a suitable splitting criterion and generates two new and empty leaves.
Figure 8.4: Guidance at the start for the Unstacking (left) and On(A,B) (right) goals in the blocks world. [Six panels: average total reward versus number of episodes for tg, rib and kbr; the Unstacking tg panel compares no guidance with 100 and 500 guided traces at the start, the other panels compare no guidance with 5, 20 or 100 guided traces at the start.]
Thus, when a split is made, tg basically forgets all the guidance it received.
The rib algorithm was designed to remember high scoring (state, action)
pairs. Once an optimal Q-value is encountered, it will never be deleted from the
stored database. This makes rib perform very well on the Unstacking goal,
reaching a close to optimal strategy with little guidance. Figure 8.4 does show
that even 5 guided traces are sufficient, although a little extra guidance helps
to reach high performance sooner.
The kbr algorithm shows the same behavior as it did for the Stacking goal.
Using the guided traces it quickly develops a high performance policy, but when
it is left to explore on its own, the performance of the learned Q-function starts
to decrease. The kbr algorithm bases its Q-value estimation on an estimated
probability density. Since it has no example selection possibilities, the large
number of uninformative (state, action) pairs (i.e., with 0 Q-value) generated
during exploration overwhelm the informative ones and cause the probability
estimate to degenerate.
The On(A,B) goal has always been a hard problem for RRL (Džeroski et
al., 2001; Driessens et al., 2001). The right side of Figure 8.4 shows the learning
curves for each of the algorithms. Every data point is the average reward over
10 000 randomly generated test cases in the case of the tg curve, 1 000 for the
rib and kbr graphs, all collected over 10 separate test runs. Although the
optimal policy is never reached, the graph clearly shows the improvement that
is generated by supplying RRL with varying amounts of guidance. Only in the
kbr case, when supplied with limited amounts of guidance, is the performance comparable to that of RRL without guidance.
8.3.3
A Closer Look at RRL-tg
An interesting feature of the performance graphs of tg is the performance of
the experiment with the Stacking goal where it was supplied with 100 (or more)
optimal traces in the beginning of the learning experiment. Not only does this
experiment take longer to converge to a high performance policy, but during
the first 100 episodes, there is no improvement at all. rib and kbr do not
suffer from this at all.
This behavior becomes worse when tg is supplied with even more optimal
traces. Figure 8.5 shows the learning curves when tg is supplied with 500 optimal traces. The reason for tg ’s failing to learn anything during the first part of
the experiment (i.e., when being supplied with optimal traces) can be found in
the specific inner workings of the generalization engine. Trying to connect the
correct Q-values with the corresponding (state, action) pairs, the generalization engine tries to discover significant differences between (state, action) pairs
with differing Q-values. In the ideal case, the tg -engine is able to distinguish
between states that are at different distances from a reward producing state,
and between optimal and non-optimal actions in these states.
Figure 8.5: Half optimal guidance in the blocks world for tg. [Three panels: average total reward versus number of episodes for the Stacking, Unstacking and On(A,B) goals, comparing optimal traces with half optimal traces.]
However, when tg is supplied with only optimal (state, action, Q-value) examples, overgeneralization occurs. The generalization engine never encounters
a non-optimal action and therefore, never learns to distinguish optimal from
non-optimal actions. It will create a Q-tree that separates states which are at
different distances from the goal-state. Later, during exploration, it will expand this tree to account for optimal and non-optimal actions in these states.
These trees are usually larger than they should be, because in the normal case,
when supplied with both optimal and non-optimal examples, tg is often able
to generalize in one leaf of its tree over both non-optimal actions in states that
are close to the goal and optimal actions in states that are a little further from
the goal.
To illustrate this behavior, tg was supplied with 500 half-optimal guidance
traces in which the used policy alternates between a random and an optimal
action. Figure 8.5 shows that, in this case, G-RRL does learn during the guided
traces. Most noticeable is the behavior of tg with half optimal guidance when
it has to deal with the Unstacking goal. Even though it is not trivial to reach
the goal state when using a half optimal policy, it is reached often enough for
G-RRL to learn a correct policy. Figure 8.5 shows that G-RRL is able to
learn quite a lot during the 500 supplied traces and then is able to reach the
optimal policy after some extra exploration.
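The half-optimal guidance policy simply alternates between the hand-coded optimal policy and a random choice, as in the following sketch (optimal_action is a placeholder for the hand-coded policy):

import random

def half_optimal_policy(optimal_action):
    # Build a policy that alternates between an optimal and a random action,
    # so that guided traces also contain non-optimal (state, action) pairs.
    step = 0
    def policy(state, actions):
        nonlocal step
        step += 1
        if step % 2 == 1:
            return optimal_action(state)
        return random.choice(actions)
    return policy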
This experiment (although artificial) shows that G-RRL can be useful even
in domains where it is easy to hand-code a reasonable policy. G-RRL will use
the experience created by that policy to construct a better (possibly optimal)
one. The sudden leaps in performance are characteristic of tg : whenever a new (well-chosen) test is added to the Q-tree, the performance jumps to a
higher level.
rib and kbr do not suffer from this overgeneralization. Since the Q-value
estimation is the result of a weighted average of neighboring examples in the
rib case, the rib algorithm is able to make more subtle distinctions between
(state, action) pairs. Since the used weights in the average calculation are
based on the distance between two (state, action) pairs, and since this distance
has to include information about the resemblance of the two actions, there is
almost no chance of overgeneralization of the Q-values over different actions
in the same state. The same holds for the covariance function in the kbr
case. The covariance function or kernel is based on the (state, action) graph
which includes the information on the chosen action. Where tg is left on its
own to decide which information about the (state, action) pair to use in its
Q-function model, rib and kbr are forced to use the complete (state, action)
pair description.
8.3.4
Spreading the Guidance
Instead of supplying all the guidance at the beginning of learning, it is also
possible to spread the guidance throughout the entire experiment. In the following experiments, guidance is supplied either as 1 guided trace every 10 learning episodes or in batches of 10 guided traces every 100 learning episodes.
Compared to presenting an equal total amount of guidance at the beginning of learning, spreading the guidance throughout the entire learning experiment avoids the overgeneralization problem that occurred when using the tg algorithm.
The top left graph of Figure 8.6 clearly shows that tg does not suffer the same
learning delay. The influence of spread guidance for the Unstacking goal (top
right of Figure 8.6) is remarkable. With initial (and optimal) guidance, tg was
not able to learn any useful strategy. In this case however, the mix of guided
traces and explorative traces allows tg to build a well performing Q-tree. It
still is less likely to find the optimal policy than with the (artificial) half-optimal
guidance but performs reasonably well.
The rib algorithm did not suffer from overgeneralization and, as a consequence, there is little difference between the obtained results with initial and spread guidance. rib is designed to select and store examples with high Q-values. It does not matter when during the learning experiment these examples are encountered. The only noticeable difference is that the performance increase becomes somewhat slower but smoother.
Figure 8.6: Guidance at start and spread for the Stacking (left) and Unstacking (right) goals in the blocks world. [Six panels: average total reward versus number of episodes for tg, rib and kbr, comparing no guidance, guidance at the start, 1 guided trace every 10 episodes and 10 guided traces every 100 episodes.]
With spread guidance, the kbr algorithm does not get the same jump-start as with all the guidance at the beginning of learning. As expected, the same total amount of guidance leads to approximately the same level of behavior, regardless of when this guidance is supplied.
Figure 8.7: Guidance at start and spread for the On(A,B) goal in the blocks world. [Three panels: average total reward versus number of episodes for tg, rib and kbr; the tg panel additionally compares 1000 guided traces at the start, 1000 delayed traces and 100 guided traces every 1000 episodes.]
For the On(A,B) goal, the influence of the spread guidance on the performance of tg is large, both in terms of learning speed and in the overall level of performance reached, as shown in the top left graph of Figure 8.7. Again, for rib and kbr, there is little difference in the performance of the resulting policies.
All graphs, but especially the tg case of Figure 8.7, show the influence of
different frequencies used to provide guidance. Note that in all cases, an equal
amount of guidance was used. Although the results show little difference, there
seems to be a small advantage for thinly spread guidance. Intuitively, it seems best to spread the available guidance as thinly as possible, and the performed experiments do not show any negative results for doing so. However, spreading out the guidance when there is only a small amount available (e.g., 1 guided
trace every 10 000 episodes) might prevent the guidance from having any effect.
Another possibility for dealing with scarce guidance is to provide all the
guidance after RRL has had some time to explore the environment. Although
Figure 8.7 shows inferior results for this approach when compared to the spread
out guidance, this is probably due to the large size of the presented batch.
Note also that learning here is faster than when all guidance is provided at the
beginning of the learning experiment.
8.3.5
Active Guidance
As stated at the end of Section 8.2.3, in the blocks world where each episode
is started from a randomly generated starting position, RRL can be given the
opportunity to ask for guided traces starting from some of these starting states
where it failed. In planning problems like the ones of the blocks world, RRL
can discover whether or not it succeeded by checking whether it received a
reward of 1 or not. This will allow RRL to receive information about parts of
the state space where it does not yet have enough knowledge and to supply the
generalization algorithm with examples which are not yet correctly predicted.
Figures 8.8 and 8.9 show the results of this active guidance. In these learning experiments, guidance was spread as in the previous section, but replaced with active guidance. Two kinds of behavior can be distinguished. In the first, G-RRL succeeds in finding an almost optimal strategy, and the active guidance
succeeds in pushing G-RRL to even better performance at the end of the
learning experiment. This is the case for all goals using tg for regression, for
the Unstacking goal with rib and for the Stacking and Unstacking goals with
the kbr algorithm, often leading to an optimal policy that was not reached
before, or greatly reducing the number of cases in which the goal was not
reached. For example, the percentage of episodes where RRL does not reach
the goal state is reduced from 11% to 3.9% using the tg algorithm in the
On(A,B) experiment.
This behavior is completely consistent with what one would expect. In the
beginning, both modes of guidance provide enough new examples to increase
the accuracy of the learned Q-functions. However, when a large part of the
state-space is already covered by the Q-functions, the specific examples provided by active guidance allow for the Q-function to be extended to improve
its coverage of the outer reaches of the state-space.
In the second kind of behavior, G-RRL does not succeed in reaching a sufficiently high level of performance. This happens for rib on the tasks of Stacking
and On(A,B) and for kbr with the On(A,B) goal. There is little difference here
between the help provided by normal and active guidance. Active guidance is
not able to focus onto critical regions of the state-space and improve upon the
examples provided by regular guidance.
Figure 8.8: Active guidance in the blocks world for the Stacking (left) and the Unstacking (right) goals. [Six panels: average total reward versus number of episodes for tg, rib and kbr, comparing 1 guided trace every 10 episodes with and without active guidance.]
Figure 8.9: Active guidance in the blocks world for the On(A,B) goal. [Three panels: average total reward versus number of episodes for tg, rib and kbr, comparing 1 guided trace every 10 episodes with and without active guidance.]
8.3.6
An "Idealized" Learning Environment
Because RRL is able to generalize over different environments (as already
shown in Chapters 5 to 7) it is possible to create an “idealized” learning environment. In a small environment, RRL can create a set of learning examples
with a variation of optimal and non-optimal (state, action) pairs by simply exploring on its own. In a large environment, G-RRL can be used to avoid large
numbers of uninformative learning examples. To combine both of these ideas,
the following experiments allow RRL to explore on its own in environments
with 3 to 5 blocks, and guidance is provided in 10% of the cases in a world with
10 blocks. This kind of learning environment is comparable to human learning
environments where often, a teacher will both show students how they solve
difficult problems and make the students practice solving easier problems on
their own.
To test whether RRL is able to generalize over different environments, the
learned Q-function and its resulting policy are tested in environments where
the number of blocks ranges from 3 to 10. This means that RRL will have to
handle worlds with for example 8 blocks without ever being allowed to train
itself in such a world.
Figure 8.10: Performance comparison between the tg, rib and kbr algorithms for the Stacking task in an idealized learning environment. [Plot: average total reward versus number of episodes (0 to 500).]
Figure 8.10 shows the learning curves for all three regression engines for
the Stacking goal in the described setup. The kbr algorithm is the only one
which builds an optimal Q-function in this case. Although the tg algorithm
seems to be the slowest to improve, i.e., it needs the most learning episodes to
yield a comparable policy, it must be said that the tg algorithm is much more
efficient and is able to handle a much larger set of learning examples given a
fixed amount of computational power. However, this is only beneficial when
the learning environment is fast to interact with and the exploration costs of
the environment are low. In real-world applications this will often not be the
case, causing the additional exploration cost needed by tg to dominate the
computational cost of the rib and kbr algorithms.
The “idealized” learning environment works extremely well for the Unstacking task. All three regression algorithms succeed in building an optimal policy
quickly as shown in Figure 8.11. The mixed set of optimal and non-optimal
examples resulting from the combination of small world exploration and large
world guidance turns the difficult task of Unstacking in large worlds into an
easy generalization problem.
For the On(A,B) task, tg is again slower than the two other algorithms in
reaching a comparable level of performance. However, when tg is presented
with 5 times more episodes than the two other algorithms, the performance
becomes remarkably similar. (The “RRL-tg (episodes ∗5)” performance curve
shows the average reward reached by tg after 5 times as many learning episodes
as indicated on the x-axis.)
Figure 8.11: Performance comparison between the tg, rib and kbr algorithms for the Unstacking task in an idealized learning environment. [Plot: average total reward versus number of episodes (0 to 1000).]
Figure 8.12: Performance comparison between the tg, rib and kbr algorithms for the On(A,B) task in an idealized learning environment. [Plot: average total reward versus number of episodes (0 to 1000), including the RRL-tg (episodes * 5) curve.]
8.4
Related Work
The idea of incorporating guidance in automated learning of control is not new.
Chambers and Michie (1969) discuss three kinds of cooperative learning. In
the first, the learning system just accepts the offered advice. In the second, the
expert has the option of not offering any advice. In the third, some criterion
decides whether the learner has enough experience to override the human decision. Roughly speaking, the first corresponds to behavioral cloning, the second
to reinforcement learning and the third to guided reinforcement learning.
The link between this work and behavioral cloning (Bain and Sammut, 1995;
Urbancic et al., 1996) is not very hard to make. If the used regression algorithm were supplied with low Q-values for (state, action) pairs that were not encountered,
RRL would learn to imitate the behavior of the supplied traces. Because of
this similarity of the techniques, it is not surprising that similar problems are
encountered as in behavioral cloning.
Scheffer et al. (1997) discuss some of these problems. The differences between learning by experimentation and learning with “perfect guidance” (behavioral cloning) and the problems and benefits of both approaches are highlighted. At first sight, behavioral cloning seems to have the advantage, as it
sees precisely the optimal actions to take. However, this is all that it is given.
Learning by experimentation, on the other hand, receives imperfect information
about a wider range of (state, action) pairs. While some of the problems Scheffer mentions are solved by the combination of the two approaches as suggested
in this chapter, other problems resurface in the presented experiments. Scheffer
states that learning from guidance will experience difficulties when confronted
with memory constraints, so that it cannot simply memorize the ideal sequence
of actions but has to store associations instead. This is very closely related to
the problems of the tg generalization engine when it is supplied with only
perfect (state, action) pairs.
Wang (1995) combines observation and practice in the OBSERVER learning
system which learns STRIPS-like planning operators (Fikes and Nilsson, 1971).
The system starts with learning from “perfect guidance” and improves on the
planning operators (pre- and post-conditions) through practice. There is no
reinforcement learning involved.
Lin’s work on reinforcement learning, planning and teaching (Lin, 1992)
and the work of Smart and Kaelbling on reinforcement learning in continuous
state-spaces (Smart and Kaelbling, 2000) are closely related to the work presented
in this chapter in terms of combining guidance and experimentation. Lin uses a
neural network approach for generalization and uses a human strategy to teach
the agent. The reinforcement learning agent is then allowed to replay each
teaching episode to increase the amount of information gained from a single
lesson. However, the number of times that one lesson can be replayed has to
be restricted to prevent over-learning. This behavior is strongly related to the
over-generalization behavior of tg when only perfect guidance is presented.
Smart’s work, dealing with continuous state-spaces, uses a nearest neighbor
approach for generalization and uses example training runs to bootstrap the
Q-function approximation. The use of nearest neighbor and convex hulls to
select the examples for which predictions are made successfully prevents overgeneralization. It is not clear how to translate the convex hull approach to the
relational setting.
Another technique that is based on the same principles as our approach
is used by Dixon et al. (2000). They incorporate prior knowledge into the
reinforcement learning agent by building an off-policy exploration module in
which they include the prior knowledge. They use artificial neural networks as
a generalization engine.
Other approaches to speed up reinforcement learning by supplying it with
non-optimal strategies include the work of Shapiro et al. (2001). There the
authors embed hierarchical reinforcement learning within an agent architecture.
The agent is supplied with a “reasonable policy” and learns the best options
for this policy through experience. This approach is complementary to and can
be combined with the G-RRL ideas.
8.5
Conclusions
This chapter addressed the problem of integrating guidance and experimentation in reinforcement learning, and in particular relational reinforcement learning, as the problem of finding rewards that are sparsely distributed is more
severe in large relational problem domains. It was shown that providing guidance to the reinforcement learning agent does help improve the performance
in such cases. Guidance in this case takes the form of traces of execution of a
“reasonable policy” that provides sufficiently dense rewards.
The utility of guidance was demonstrated through experiments in the blocks
world domain with 10 blocks. The 10 blocks world is characterized by a huge
state space and the three chosen tasks are characterized by their sparse rewards.
The effect of using guidance was studied in a number of settings, characterized along two dimensions: the mode of providing guidance and the generalization engine used within relational reinforcement learning.
Two modes of using guidance were investigated: providing all guidance at
the start and spreading guidance, i.e., providing some guided episodes followed
by several exploration episodes, and repeating this. A variation on the latter mode is active learning, where the agent asks for guided traces starting
from initial states that it selects itself rather than receiving guided traces from
randomly chosen initial states.
Overall, the use of guidance in addition to experimentation improves performance over using experimentation only, for all considered combinations of
the dimensions mentioned above. Improvements in terms of the overall performance level achieved, the convergence speed, or both were observed. The
improvements result from using the best of both worlds: guidance provides
perfect or reasonably good information about the optimal action to take in a
narrow range of situations, while experimentation can obtain imperfect information about the optimal action to take in a wide range of situations. The
actual magnitude of performance improvement does depend on the considered
combination of the mode of providing guidance and generalization engine.
While both guidance at the start and spread guidance improve the performance of the RRL system, spread guidance often yields higher, but more
importantly, never lower performance. This is especially the case when the
regression engine is vulnerable to over-generalization, as the tg algorithm is. Providing all the guidance up front does not quite work well in this case, for several reasons. Namely, making a split on examples generated by "perfect" guidance
only distinguishes between the regions of equal state-values, but not the actions
that allow movement between them. This can be corrected by splits further
down the tree, but this requires lots of extra examples, and therefore more
learning episodes. This problem is aggravated by the fact that after making a
split, the guidance received so far is lost. These problems do not appear when
instance-based regression is used as a generalization engine as the rib algorithm
is designed to remember high yielding examples and the bias towards action
difference of the used distance prevents the over-generalization that occurs with
tg . The kbr algorithm also does not suffer from over-generalization and succeeds in translating only optimal examples into the optimal policy. However,
since it has no example selection strategy like the rib algorithm, large amounts
of uninformative examples can cause the optimal policy to degenerate during
later exploration.
Active learning with spread guidance helps improve performance in the later
stages of learning, by enabling fine tuning of the learned Q-function by focusing
on problematic regions of the state space. This often results in a significant
reduction of the cases where RRL exhibits non-optimal behavior. Experiments
show that a sufficiently high level of performance has to be reached by G-RRL
for the active guidance to have any effect. If performance is too low to allow
fine-tuning, active guidance does not improve on normal guidance.
8.6 Further Work
Since each of the three regression algorithms responds differently to the supplied
guidance, one possible direction for further work is the tighter integration of
guidance and the used generalization engine. For example, when dealing with
a model building regression algorithm like the tg system, one could supply
more and possibly specific guidance when the algorithm is bound to make an
important decision. When tg is used, this would be when tg is ready to choose
a new test to split a leaf. Even when this guidance is not case specific, it could
be used to check whether a reasonable policy contradicts the proposed split.
Alternatively, one might decide to store (some of) the guided traces and re-use
them: at present, all statistics, and thus all the information received from the
guided traces, are forgotten once tg chooses a split.
When looking for a more general solution, one could provide a larger batch of guidance after RRL has had some time to explore the state space on its own. This is related to a human teaching strategy: providing the student with the perfect strategy at the start of learning is less effective than providing that strategy after he or she has had some time to explore the system's behavior.
Another route of investigation that could yield interesting results would be
to have a closer look at the relations of our approach to the human learning
process. In analogy to human learner–teacher interaction, one could have a
teacher look at the behavior of RRL or — given the declarative nature of the
policies and Q-functions that are generated by tg — at the policy that RRL
has constructed itself and adjust the advice it wants to give. In the long run,
because RRL-tg uses an inductive logic programming approach to generalize
its Q-function and policies, this advice does not have to be limited to traces, but
could include feedback on which part of the constructed Q-function is useless
and has to be rebuilt, or even constraints that the learned policy has to satisfy.
Although the idea of active guidance seems very attractive both intuitively
and in practice, it is not easy to extend this approach to applications with
stochastic actions or a fixed starting state, such as most games. In the popular computer game Tetris, for example (see also Chapter 9), the starting state always has an empty playing field and the next block to be dropped is chosen randomly. For stochastic applications, one could try to remember all the stochastic elements and recreate the episode. For the Tetris game, this would mean remembering the entire sequence of blocks and asking the guidance strategy for a game with that given sequence. However, given how strongly the Tetris state can diverge as a consequence of only a few different actions, the effect of this approach is expected to be small.
Another step towards active guidance in stochastic environments would be
to keep track of actions (and states) with a large negative effect. For example
in the Tetris game again, one could notice a large increase of the height of the
wall of the playing field. These remembered states could then be used to ask
for guidance. However, this approach not only requires a large amount of bookkeeping inside the learning system, but also needs some a priori indication of which action outcomes are good and which are bad.
Chapter 9
Two Computer Games
“Shall we play a game?”
War Games
9.1 Introduction
To test the applicability of the RRL system to larger problems than the ones
already tested in the blocks world, computer games offer a cheap solution.
Computer games offer a well defined world for the learning system to interact
with and can often be tuned in complexity to make the learning task easier or
more difficult when needed. A computer game environment is often filled with
a varying number of objects that exhibit several relational properties and are
often only defined by their properties and relations to other objects. This kind
of environment is exactly what the RRL system was designed for.
Computer games also offer cheap learning experience as the environment
is completely simulated. Real-world applications are often hampered by long interaction times or high exploration costs: they often have slow response times, either because they have to wait for human interactions or simply because of physical limitations. Given access to the computer game code, the interaction of the learning system with the game can often be accelerated to speed up the learning experiments. Real-world systems are also often not suited for random exploration, as wrong choices might have destructive consequences.
In a completely simulated environment, the agent can be punished for making
bad mistakes by the appropriate reward value without imposing any real cost.
It is therefore not surprising that some of the best showcases of reinforcement
learning are based on games (Tesauro, 1992).
This chapter will show the learning possibilities of relational reinforcement learning in two computer games: Digger and Tetris. The Digger game will be tackled using the tg regression algorithm as well as with a new hierarchical reinforcement learning technique for concurrent goals. After a brief introduction of the Digger game and a short overview of hierarchical reinforcement learning, a new approach to hierarchical reinforcement learning for concurrent goals is presented. Section 9.5 reports on a number of experiments with the Digger game.

Figure 9.1: A snapshot of the DIGGER game.
The behavior of the RRL system in the Tetris Game is discussed in Section
9.6. All three regression systems are used for the Tetris task. Instead of learning a Q-function directly, the Tetris game is handled using "afterstates", for which a utility function is learned.
The work on learning the Digger game was started together with Jeroen
Nees in the context of his master’s thesis. Parts of this work were published
in (Driessens and Blockeel, 2001), where the new hierarchical approach was introduced, and in (Driessens and Džeroski, 2002a), which reported on some of the experiments.
9.2 The Digger Game
Digger (http://www.digger.org) is a computer game created in 1983 by Windmill Software. It is one of a few old computer games that still enjoy a fair amount of popularity. In this
game, the player controls a digging machine or “Digger” in an environment that
contains emeralds, bags of gold, two kinds of monsters (nobbins and hobbins)
and tunnels. The goal of the game is to collect as many emeralds and as much
gold as possible while avoiding or shooting monsters.
In the tests, the hobbins and the bags of gold were removed from the game. Hobbins are more dangerous than nobbins for human players, because they can dig their own tunnels and reach Digger faster, as well as increase the mobility of the nobbins. However, they are less interesting for learning purposes, because they reduce the implicit penalty for digging new tunnels (and thereby increasing the mobility of the monsters) when trying to reach certain rewards. The bags of gold were removed from the game to reduce the complexity. Although they are still shown during the game, and consequently in the screen shots, Digger does not interact with them. (Bags of gold have to be pushed to and dropped down cliffs so that they burst open before the gold can be collected. The game is already sufficiently complex for reinforcement learning purposes without them.)

Figure 9.2: The possible paths that can be travelled on in the Digger game.
The digging machine is not completely free to move anywhere through the game field. The screen is divided into 10 by 15 squares, and although it takes several steps for Digger to get from one square to the next and the player can make Digger return to where it came from before the next square is reached, Digger can only change direction in the middle of these squares. In other words, Digger can only travel on the lines shown in Figure 9.2, but it can turn back at practically any given time.
The most elementary step in the Digger Game lasts about 100 milliseconds
during game-play. It takes Digger from 4 to 5 basic steps to go from one square
to the next. To reduce the number of steps between events (and rewards) and
therefore the complexity of learning the game through reinforcement learning,
the game was discretized so that a single step would make Digger move to the
next square. Although this does remove one possible strategy from the game
— a human player will often try to reach an emerald without connecting two
separate tunnels to limit the mobility of the monsters — it will reduce the
number of steps between rewards and the total number of steps taken during
a learning episode.
9.2.1 Learning Difficulties
The Digger computer game offers quite a challenge to a reinforcement learning
system. Although the game is quite simple according to human standards, it
is too large for normal reinforcement learning techniques. Even with the used
discretization, the number of possible states is very large.
Indeed, the number of available emeralds and present monsters varies during
the game as does the tunnel structure. Although it is possible to represent the
Digger game with a feature vector, it would be very large and very impractical.
The different levels available in the game add to the difficulties of learning this
game with non-relational reinforcement learning.
Because the identity of a particular emerald or monster is unimportant and only their relative properties matter, a small relational Q-function can yield a well-performing policy.
9.2.2 State Representation
The representation used for the Digger game state consists of the following components:
• the coordinates of Digger, e.g., digPos(6,9),
• information on Digger itself, supplied in the format digInf(digger_dead, time_to_reload, level_done, pts_scored, steps_taken), e.g., digInf(false,63,false,0,17),
• information on the tunnels as seen by Digger, i.e., the range of view in each direction (up, down, left and right), e.g., tunnel(4,0,2,0); this information is relative to Digger, and since there is only one Digger there is no need for a Digger index argument,
• the list of emeralds (e.g., [em(14,9), em(14,8), em(14,5), . . .]), holding the absolute coordinates of all the emeralds,
• the list of monsters (e.g., [mon(10,1,down), mon(10,9,down), . . .]), also using absolute coordinates, and
• information on the fireball fired by Digger (x-coordinate, y-coordinate and travelling direction), e.g., fb(7,9,right).
The use of lists removes the limitations of fixed size feature vectors and
the lossless representation of the game-state allows for the computation of all
possible (relational) state properties.
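To make this representation concrete, the sketch below mirrors the terms above as a plain Python structure. It is purely illustrative: the class and field names (DiggerState, Fireball, and so on) are hypothetical and do not correspond to the actual RRL implementation, which uses a relational (Prolog-style) encoding.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Fireball:
        x: int
        y: int
        direction: str  # 'up', 'down', 'left' or 'right'

    @dataclass
    class DiggerState:
        dig_pos: Tuple[int, int]            # digPos(6,9)
        # digInf(digger_dead, time_to_reload, level_done, pts_scored, steps_taken)
        digger_dead: bool
        time_to_reload: int
        level_done: bool
        pts_scored: int
        steps_taken: int
        tunnel: Tuple[int, int, int, int]   # free range of view up/down/left/right
        # variable-length lists with absolute coordinates, instead of a fixed feature vector
        emeralds: List[Tuple[int, int]] = field(default_factory=list)
        monsters: List[Tuple[int, int, str]] = field(default_factory=list)
        fireball: Optional[Fireball] = None

    example = DiggerState(
        dig_pos=(6, 9),
        digger_dead=False, time_to_reload=63, level_done=False,
        pts_scored=0, steps_taken=17,
        tunnel=(4, 0, 2, 0),
        emeralds=[(14, 9), (14, 8), (14, 5)],
        monsters=[(10, 1, 'down'), (10, 9, 'down')],
        fireball=Fireball(7, 9, 'right'),
    )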
The actions available to the learning system are moveOne(X) and shoot(X) with X ∈ {up, down, left, right}. These actions are implemented in such a way
that Digger moves an entire row or column. Shooting in a direction will also
move Digger in that direction, as it is impossible to make Digger stand still
during the game.
To let the tg algorithm use relational information to build the Q-function, instead of only the state and action representation described above, the tg algorithm is presented with the following predicates to use as splitting criteria (a sketch of how a few of these might be computed from the state representation follows the list):
• actionDirection/2: gets the direction of the chosen action.
• moveAction/1: succeeds when the chosen action is a moving action and
returns the direction.
• shootAction/1: succeeds when the chosen action is a shooting action and
returns the direction.
• emerald/2: returns the relative direction of a given emerald.
• nearestEmerald/2: computes the nearest emerald and its direction.
• monster/2: returns the relative direction of a given monster.
• visibleMonster/2: computes whether there is a monster that is connected
to Digger with a straight tunnel and returns the monster and its direction.
• monsterDir/2: returns the travelling direction of a given monster.
• distanceTo/2: computes the distance from Digger to a given emerald or
monster.
• canFire/0: succeeds if Digger’s weapon is charged and ready to fire.
• lineOfFire/1: succeeds if a fireball is already travelling in the direction
of the given monster.
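As an illustration of how such predicates relate to the state description (and reusing the hypothetical DiggerState sketch above), two of them could be computed roughly as follows. This is only a sketch; the actual predicates are defined as relational background knowledge in the RRL system, and the coordinate and distance conventions shown here are assumptions.

    def distance_to(state, obj_xy):
        # Manhattan distance from Digger to the position of an emerald or monster.
        dx, dy = state.dig_pos
        ox, oy = obj_xy
        return abs(dx - ox) + abs(dy - oy)

    def nearest_emerald(state):
        # Returns the closest emerald and its rough relative direction.
        if not state.emeralds:
            return None
        em = min(state.emeralds, key=lambda e: distance_to(state, e))
        dx, dy = em[0] - state.dig_pos[0], em[1] - state.dig_pos[1]
        if abs(dx) >= abs(dy):
            direction = 'right' if dx > 0 else 'left'
        else:
            direction = 'down' if dy > 0 else 'up'   # assumes y grows downwards
        return em, direction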
None of these predicates use object specific or level specific information.
This allows RRL to learn on the different levels of the Digger game at the
same time. Learning experience on easy levels can be mixed with hard levels
to help RRL to learn a representative Q-function.
9.2.3 Two Concurrent Subgoals
A player controlling the Digger bot will have to collect as many emeralds as possible and avoid encounters with the nobbins. These two tasks can be regarded as two separate goals that have to be handled concurrently. An interesting feature of the Digger game is that it is often non-optimal to regard the two tasks as competing, as the optimal action when both tasks are considered is often different from the optimal action for either of the subtasks, as shown in the example of Figure 9.3. The left grey arrow indicates the action that is optimal for avoiding monsters, while the right grey arrow shows the action that would be optimal for collecting emeralds. The white action is the one chosen by most human players in the shown game state, as it combines the avoidance of the nearby monster with the path to the closest "safe" emerald.

Figure 9.3: The dilemma of concurrent goals in the Digger game. The two grey actions are optimal for the subgoals, while the white action is optimal when both are considered at the same time.
The fact that the Digger game consists of trying to solve these two rather distinct subgoals makes it a suitable application for hierarchical reinforcement learning. However, since the two subgoals have to be handled concurrently, not all hierarchical reinforcement learning approaches are well suited.
9.3 Hierarchical Reinforcement Learning
Hierarchical reinforcement learning is often used for reinforcement learning problems that are too large to be handled directly. By dividing the learning problem into smaller subproblems and learning these separately, the so-called "curse of dimensionality" (the unfortunate fact that the number of parameters that need to be learned grows exponentially with the size of the problem) can be partially circumvented.
The translation of a learning problem into multiple smaller or simpler learning tasks can take different forms:
Sequential Division : When the original problem can be translated into a
sequence of (higher-level) steps that have to be performed, each of these subtasks can be treated as an individual learning task. The complete
problem is solved by executing each policy until the local goal for which
it was learned is reached. An obvious example problem that can be
handled by this technique is that of construction work, e.g. a house. To
build a house, first the foundation has to be laid, then the walls have to
be erected and finally a roof has to be put on top.
Temporal Abstraction : When the solution to the learning problem consists
of (higher level) steps which have to be repeated or interleaved a certain
number of times, the correct execution of each of these steps can be regarded as a learning task. The solution to the original problem will then
be a high level policy that makes use of the low level abilities that were
learned to reduce the number of steps to be taken. The term “temporal
abstraction” stems from the fact that each of the subtasks may take a different amount of time to be executed and decisions are no longer required
at each step. An example of this would be robot navigation where navigating from one location to another consists of following several corridors,
making a number of turns and avoiding the necessary objects.
Concurrent Goal Selection : When the complete learning task is made up
of different goals that have to be addressed concurrently, the learning
system could first be allowed to train on each of the subgoals separately.
An example of this kind of task would be animal survival. A rabbit has to
handle the goal of collecting food concurrently with the goal of avoiding
predators.
The existing work on hierarchical reinforcement learning is extensive and a
full overview of the topic will not be presented. Interested readers are referred
to overviews given by Kaelbling et al. (1996) or Barto and Mahadevan (2003).
Most of the work on hierarchical reinforcement learning was done on sequential division and temporal abstraction (Sutton et al., 1999; Parr and Russell,
1997; Dietterich, 2000). Both of these approaches require the subgoals to have
a procedural nature, i.e., that one can be achieved after another has been completed (not necessarily in a strict order) and that each subgoal has a clear
termination condition. Having the Q-learning agent build a policy for each
subgoal leads to a number of “macro-actions” which are an abstract extension
of single step actions and which can be used to simplify the policy of the agent
for the complete problem.
However, this does impose a restriction on the types of problems where
these techniques can be applied. The concurrent goal setting includes learning
problems where subgoals do not have termination conditions, i.e., where all
subgoals have to be pursued during the entire execution.
W-learning (Humphrys, 1995) is one of the few hierarchical reinforcement
learning techniques that does deal with multiple parallel goals. The multiple
goals are handled by generating a separate Q-learning agent for each goal. The
policies of the different agents are combined by attributing a weight factor to each (state, agent) combination that determines the importance of following a certain agent's advice in the given state. The weight values are computed based on which agent is likely to suffer the most if its advice is disregarded.
Figure 9.4: The optimal action for the combination of both goals (shown in
white) is different from either of the two optimal actions for each goal considered
separately (shown in grey).
The downside of W-learning is that only actions that are optimal for one of
the subgoals will ever be chosen. Consider the robot in Figure 9.4. To survive,
the robot has to accomplish two tasks: it wants to reach the oil can on the
left of the picture, but it needs to avoid contact with the magnet that can trap
it when it gets too close. Two reinforcement learning agents inside the robot
will suggest the two grey arrows as optimal actions. To reach the oil can, the
robot will select the left grey action; to avoid the magnet, it will prefer the action represented by the right grey arrow. No matter which subagent
is chosen to be most important, the actual optimal action represented by the
white arrow will never be chosen.
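The winner-takes-all nature of this scheme can be made explicit with a small sketch. The code below only illustrates the selection mechanism described above, using hypothetical q and w functions; it is not an implementation of the actual W-learning algorithm. It shows that the executed action is always the greedy action of a single subagent, so an action that is merely second best for every subgoal (the white arrow) can never be selected.

    def w_learning_style_action(state, agents, actions):
        # Each agent exposes q(state, action) and a weight w(state) expressing
        # how much it would suffer if its advice were ignored (hypothetical API).
        leader = max(agents, key=lambda agent: agent.w(state))
        # Only the leader's own greedy action can ever be executed.
        return max(actions, key=lambda a: leader.q(state, a))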
9.4 Concurrent Goals and RRL-tg
The result of relational reinforcement learning (comparable to other Q-learning techniques) is some form of Q-function that is usually translated into a policy.
However, letting RRL learn on a subgoal of the original problem will result in a
Q-function that gives an indication of the quality of a (state, action) pair for the
used subgoal. This Q-function not only holds an indication about which actions
are optimal, but also includes information about non-optimal but reasonable
actions. To make use of the information contained in the subgoal Q-functions,
the predictions made by these Q-functions could be used in the description of
the Q-function for the entire problem.
Through the use of background knowledge and an adjusted language that defines the available splitting criteria, tg can make use of almost any information that is available about the states or actions. This can include Q-value
information learned for the different subgoals. Simple use of the predictions
of Q-values for each of the subgoals can be done by comparing these values
to constant values or to each other. On top of this, tg can be allowed to
add, negate or perform other computations with these values. Simple addition
of Q-values predicted for different subgoals could for example emphasize the
actions that are reasonable for most subgoals, catastrophic for none and as a
consequence, close to optimal for the complete task.
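As an illustration of the kind of tests this makes available, the following sketch generates a few candidate splitting criteria from two hypothetical subgoal Q-estimates. It is a schematic rendering of the idea, not the language definition actually used by tg; the function names and threshold constants are assumptions.

    def subgoal_tests(state, action, q_emeralds, q_monsters, thresholds=(0.0, 0.5, 1.0)):
        # Predictions of the two previously learned subgoal Q-functions.
        qe = q_emeralds(state, action)
        qm = q_monsters(state, action)
        tests = {
            # compare the two subgoal predictions to each other
            'emeralds_preferred': qe > qm,
            # a simple arithmetic combination, emphasizing actions that are
            # reasonable for both subgoals and catastrophic for neither
            'combined_value_high': qe + qm > max(thresholds),
        }
        # compare each prediction to a set of user-defined constants
        for t in thresholds:
            tests['q_emeralds_gt_%s' % t] = qe > t
            tests['q_monsters_gt_%s' % t] = qm > t
        return tests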
9.5 Experiments in the Digger Game
The actual coupling between the RRL system and the Digger game was done
by Jeroen Nees in the context of his master’s thesis. For reasons beyond his
(and my) control, this coupling remained unstable and unfortunately quite
slow, which made it impossible to run an extensive set of experiments.
The rib and kbr algorithms will not be used in the Digger experiments.
rib requires a first order distance and kbr a kernel to be defined on the
(state, action) pairs of the presented learning problem. The definition of such a
distance or kernel is non-trivial and, as a consequence, none was designed for the Digger game.
The rewards in the Digger-game are distributed as follows: 25 points for
eating an emerald (and an extra 250 points for eating 8 emeralds in a row),
250 points for shooting a monster and -200 points for dying. RRL is trained
and tested on all 8 standard Digger-levels.
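Expressed as code, the reward signal amounts to a simple lookup; the sketch below is only a restatement of the numbers above, with hypothetical event names.

    def digger_reward(event):
        rewards = {
            'emerald_eaten': 25,
            'eight_emeralds_in_a_row': 250,   # bonus on top of the per-emerald reward
            'monster_shot': 250,
            'digger_died': -200,
        }
        return rewards.get(event, 0)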
Figure 9.5 shows the average learning results for RRL over 10 test runs.
The y-axis displays the average reward obtained by the learned strategies over
640 Digger test-games divided over the first 8 different Digger levels. In tabula
rasa form, RRL reaches an average performance of almost 600 points per level.
It is hard to give the maximum number of points that can be earned, as this
amount differs per level. It can be said that RRL performs worse than human players, although it does succeed in finishing the easiest level. Human players will nevertheless score many more points: RRL finishes the level simply by eating all available emeralds, while human players who want to maximize their score will eat all but one of the emeralds and then finish the level by shooting all the monsters. This behavior is never exhibited by the RRL system, although it
does occasionally succeed in killing a monster (and even makes a detour to do
this).
[Plot: The Digger Game. Average total reward versus number of episodes for 'tabula rasa learning', 'with 5 guided traces' and 'with 20 guided traces'.]
Figure 9.5: The performance of RRL on the Digger game, both with and
without guidance.
9.5.1 Bootstrapping with Guidance
In a second step, the learned strategy was used to provide RRL with a few
guided episodes. The point of this experiment was to see whether RRL could
improve on its own strategy. Figure 9.5 also shows the learning performance of
RRL with 5 and 20 guided traces and illustrates the fact that RRL is indeed
capable of improving its own learned strategy, although the resulting increase of
performance is limited. A second iteration did not result in any further improvements. To reach higher levels of performance, a better set of splitting criteria
is needed together with, most importantly, a substantially larger set of learning
episodes.
Care should be taken when comparing the learning graphs of these approaches. While the system that is given guidance could be regarded as having more episodes to learn from, because it profits from earlier experience, this is not entirely fair, as it also has to start building the Q-function from scratch. For tg, this means that a sufficient number of (state, action) pairs needs to be collected to build a Q-tree. However, these extra learning episodes could induce an extra cost when experimentation is not free.
9.5.2 Separating the Subtasks
For the experiment shown in Figure 9.6, RRL was first given the opportunity
to learn to collect emeralds and to avoid or shoot monsters separately using
the ideas for hierarchical reinforcement learning for concurrent goals. RRL
was allowed to train for 100 episodes on each of the subtasks. Each of the
learned Q-functions was then added to the background knowledge used by the tg algorithm, and tg was permitted to compare the predicted Q-values to constant values and to each other. It was also allowed to add the two predicted values together or to negate one of the values before comparing it. As shown in the learning curve, the extra information allows RRL to increase its performance faster than without it. However, the resulting level of performance is only comparable to the level reached without the added features; it does not improve on it.

[Plot: The Digger Game. Average total reward versus number of episodes for 'Regular RRL' and 'Hierarchical RRL'.]

Figure 9.6: The performance of RRL both with and without hierarchical learning.
The same remark as for the guided experiment holds here. While the hierarchical learner profits from previous experience and could be regarded as having
an extra 200 learning episodes, this is not entirely fair because tg needs to
build a Q-function from scratch when the learning experiment is started. However, the extra learning episodes should be considered when the learning curves
are compared.
9.6 The Tetris Game
Tetris (owned by The Tetris Company and Blue Planet Software) is probably one of the most famous computer games around. It was designed by Alexey Pazhitnov in 1985 and has been ported to almost every platform available, including most consoles. Tetris is a puzzle video game played on a two-dimensional grid, as shown in Figure 9.7. Differently shaped blocks fall
from the top of the game field and fill up the grid. The object of the game
is to score points while keeping the blocks from piling up to the top of the
game field. To do this, one can move the dropping blocks right and left or
rotate them as they fall. When one horizontal row is completely filled, that line disappears and the player scores a point. When the blocks pile up to the top of the game field, the game ends. The fallen blocks on the playing field will be referred to as the wall.

Figure 9.7: A snapshot of the Tetris game.
A playing strategy for the Tetris game consists of two parts. Given the
shape of the dropping block, one has to decide on the optimal orientation and
location of the block in the game field. This can be seen as the strategic part
of the game and deals with the uncertainty about the shape of the blocks
that will follow the present one. The other part consists of using the low level
actions — turn, moveLeft, moveRight, drop — to reach this optimal placement.
This part is completely deterministic and can be viewed as a rather simple
planning problem. The RRL system will only be tested on the first, and
most challenging subtask. Finding the optimal placement of a given series of
falling blocks is an NP-complete problem (Demaine et al., 2002). Although
the optimal placement is not required to play a good game of Tetris, the added
difficulty of dealing with an unknown series of blocks makes it quite challenging
for reinforcement learning, and Q-learning in particular, and a suitable application to test the limitations of the RRL system.
There exist very good artificial Tetris players, most of which are hand-built. The best of these algorithms score about 500,000 lines on average when they only include information about the falling block, and more than 5 million lines when the next block is also considered. The results in this chapter will be nowhere near this high and are even low by human standards. However, the experiments shown will illustrate the capabilities of the RRL system and the difficulties it still faces, and possibly those of Q-learning algorithms in general.
Compared to the Digger game, Tetris is a lot harder for the RRL system.
This is largely due to the shape of the Q-function in Tetris. Where the Digger game was very object- and relation-oriented (having to deal with emeralds and monsters and the distances between them), Tetris players need to focus on the shape of the wall in the game field, which evolves much more chaotically. Although certain relational features exist (such as canyons, i.e., deep and narrow depressions in the wall on the playing field), the exact value of a Tetris state and/or action seems hard to predict.

Figure 9.8: Greedily taking the scoring action and dropping the block in the canyon on column 1 might lead to problems later.
9.6.1 Q-values in the Tetris Game
The stochastic nature of the game (i.e., the unknown shape of the next falling
block) in combination with the chaotic nature of the Tetris dynamics makes it
very hard to connect a Q-value to a given (state, action) pair.
It is very hard to predict the future rewards starting from a given Tetris
state. The state shown in Figure 9.8 for example can quickly lead to a reward
of 2, but the creation of a hole in column 1 could eventually lead to problems,
depending on the blocks to come, for example when no block can be found to
fill the small canyon on column 6.
The development of the Tetris wall is also quite chaotic. The resulting shape
of the wall when dropping the block of Figure 9.8 in the canyon on column 1
is barely related to the shape after the block is dropped almost anywhere else
on the game field. The height of the wall decreases by deleting two lines, but
also the resulting shape of the top of the wall is quite different, having lost the
canyon on column 1, which was one of its most defining features. This chaotic
behavior of the shape of the Tetris wall will also make the Q-values in the
Tetris game very hard to predict.
9.6.2 Afterstates
The Tetris game can easily benefit from the use of afterstates as discussed in
the intermezzo on page 26. It is very natural for human players to decide where
to place a Tetris block by quickly predicting what the resulting game-field state
would look like after the chosen action is performed. This allows the player to
evaluate a number of features of the resulting state such as:
• Are any lines erased?
• What is the shape of the top of the wall after the block has landed? Will
it fit a large variety of blocks that might follow?
• Are any new holes created or are some old ones deleted?
• How many canyons will (still) exist and what is their depth or width?
The state resulting from dropping a block can only be partially computed,
because of the random selection of the next falling block. Regarding this selection of the next block as a counter move of the environment casts the problem into the afterstates setting.
Calculating the resulting state (without the next block) and predicting the
reward that accompanies the chosen action are fairly easy and allow RRL to
calculate the Q-value of a (state, action) pair as
Q_aft(s, a) = r_pred(s, a) + V̂(δ_pred(s, a))
The use of the afterstate prediction therefore reduces the need for regression
using (state, action) pairs to just predicting the value of a state. For the two
(state, action) pairs in Figure 9.9, this means that they will both use the same
state for the utility prediction by the chosen regression algorithm.
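To make the afterstate computation concrete, the following sketch shows how action evaluation could look when a utility function over afterstates is used instead of a Q-function. The helper names (predict_afterstate, predict_reward, V) are hypothetical placeholders for the deterministic Tetris simulation and the learned value approximation.

    def afterstate_q(state, action, predict_afterstate, predict_reward, V):
        # Q_aft(s, a) = r_pred(s, a) + V(delta_pred(s, a)): simulate the
        # deterministic part of the transition (block placement and line
        # removal); the random choice of the next block is the environment's
        # "counter move" and is not part of the afterstate.
        afterstate = predict_afterstate(state, action)
        return predict_reward(state, action) + V(afterstate)

    def greedy_action(state, actions, predict_afterstate, predict_reward, V):
        # Two actions leading to the same afterstate (as in Figure 9.9)
        # receive the same value from the regression algorithm.
        return max(actions,
                   key=lambda a: afterstate_q(state, a, predict_afterstate,
                                              predict_reward, V))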
The use of afterstates does not, however, remove the unpredictability of the long-term reward in the Tetris environment, and the associated difficulties in estimating the utility of states remain.
9.6.3 Experiments
All three regression systems were tested on the Tetris application. tg was given
a language that included tests such as:
• What is the maximum, average and minimum height of the wall?
• What is the number of holes?
• What are the differences in height between adjacent columns?
• Are there canyons (of width 1 or width 2)? How many are there?
• Does the next block fit? What is the lowest fitting row for the next block?
• How many points can the next block score?
• What is the lowest row the next block can be dropped on?

Figure 9.9: Both (state, action) pairs at the top of the figure result in the same afterstate shown at the bottom. No next block is shown, as the shape of this block is stochastic and considered as the counter move of the Tetris environment.
To turn the numerical values into binary tests, as needed for the tg system, the values were compared to a number of user-defined constants or to other related values, e.g., the average height to the lowest row where the next block can be dropped. With this language and 10% guided learning episodes (1 guided episode every 10), RRL-tg learned to delete around 10 rows per game after about 5000 learning games, averaged over 10 learning experiments. The resulting Q-trees were quite large, with an average of approximately 800 leaves. These results
do not improve on previously reported results of RRL-tg in the Tetris game
(Driessens and Džeroski, 2002a; Driessens and Džeroski, 2002b; Driessens and
Džeroski, 2004), where afterstates were not used.
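Purely as an illustration, turning such numerical values into binary tests could look like the sketch below; the feature names and constants are hypothetical and do not reproduce the language actually supplied to tg.

    def binary_tests(max_height, avg_height, n_holes, lowest_drop_row,
                     height_constants=(4, 8, 12), hole_constants=(0, 2, 5)):
        # Each entry is a boolean test that a tree learner could use as a split.
        tests = {}
        for c in height_constants:
            tests['max_height_gt_%d' % c] = max_height > c
        for c in hole_constants:
            tests['holes_gt_%d' % c] = n_holes > c
        # Comparison of two related values, e.g. the average height against the
        # lowest row where the next block can be dropped.
        tests['avg_height_gt_lowest_drop_row'] = avg_height > lowest_drop_row
        return tests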
rib was given a Euclidean distance over a number of state features, thus basically working with a propositional representation of the Tetris states (a minimal sketch of such a distance follows the list below). These features were:
• The maximum, average and minimum height of the wall and the differences between the extremes and the average.
• The height differences between adjacent columns.
• The number of holes and canyons of width 1.
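As a minimal sketch of such a propositional distance (the actual feature definitions and any weighting used by RRL-rib may differ), assume the wall is represented by its column heights and a hole count:

    import math

    def tetris_features(heights, n_holes):
        # heights: column heights of the wall; n_holes: number of covered holes.
        diffs = [abs(a - b) for a, b in zip(heights, heights[1:])]
        # Count width-1 canyons: columns whose neighbours (or the field border)
        # are markedly higher; the depth threshold of 3 is an arbitrary choice.
        canyons = sum(1 for i, h in enumerate(heights)
                      if (i == 0 or heights[i - 1] - h >= 3)
                      and (i == len(heights) - 1 or heights[i + 1] - h >= 3))
        mx, mn, avg = max(heights), min(heights), sum(heights) / len(heights)
        return [mx, avg, mn, mx - avg, avg - mn] + diffs + [n_holes, canyons]

    def euclidean_distance(f1, f2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))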
Also presented with 10% guided learning episodes, RRL-rib learned to remove an average of 12 lines per game after around 50 learning episodes, i.e., with only 5 guided traces. It could not, however, improve on that policy during another 450 learning games. This behavior was observed in each of the 10 learning experiments.
The covariance function needed for RRL-kbr was supplied by an inner product of feature vectors such as the ones used in the rib distance, thereby again working with a propositional representation of the problem. The RRL-kbr system very quickly learned to delete an average of 30 to 40 lines per game after only 10 to 20 learning games, but un-learned this strategy after another 20
to 30 learning games. To predict the value of a new example, the covariance
matrix C needs to be inverted. This inversion can in theory be impossible as
there is no guarantee that the covariance matrix is non-singular. This is in
particular a problem when many learning examples reside in a low-dimensional
subspace of the feature space related to the used kernel, which is likely to
happen in the Tetris application, as the used feature space has a rather low
dimension. The current solution for this problem is to add a multiple εI of the identity matrix to the covariance matrix to ensure that it is of full rank. This means using C + εI in the computations instead of C. Unfortunately, this is not always sufficient: if ε is large, the matrix C + εI differs too much from the real covariance matrix C, while if ε is small, computing (C + εI)⁻¹ becomes numerically very unstable. It is this numerical instability that causes RRL-kbr to un-learn its policy.
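The following numpy sketch illustrates the trade-off on generic data. It is not the kbr implementation: the linear kernel, the jitter value eps and the data are placeholders, chosen only to show where the regularized inverse of C + εI enters a Gaussian-process prediction.

    import numpy as np

    def gp_predict(X, y, x_new, kernel, eps=1e-6):
        # Covariance matrix of the training examples, regularized with eps * I
        # so that it stays invertible even for (nearly) dependent examples.
        C = np.array([[kernel(a, b) for b in X] for a in X]) + eps * np.eye(len(X))
        k = np.array([kernel(a, x_new) for a in X])
        # Mean prediction of the Gaussian process: k^T (C + eps I)^{-1} y.
        return k @ np.linalg.solve(C, y)

    # Example with a simple inner-product (linear) kernel on feature vectors.
    linear = lambda a, b: float(np.dot(a, b))
    X = [np.array([1.0, 2.0]), np.array([2.0, 4.0])]   # nearly dependent examples
    y = np.array([0.5, 1.0])
    print(gp_predict(X, y, np.array([1.5, 3.0]), linear))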
9.6.4 Discussion
None of the three regression algorithms fully exploited its relational capabilities.
All representations used were in essence propositional. In contrast to the Digger
game, Tetris does not seem like a real relational problem. Most existing work
on artificial Tetris players uses numerical attributes to describe the shape of
the game field.
The results on the Tetris game with RRL in its current form are a bit
disappointing. For example, a strategy which always selects the lowest possible
dropping position for the falling block already succeeds in deleting 19 lines on
average per game. Only RRL-kbr succeeds in doing better (before it runs into
problems).
These results are presumed to be caused by the difficulty of predicting the
future reward in the Tetris game. Even human Tetris experts, who often agree
on the action that should be chosen in a given Tetris state, will have a hard
time predicting how many lines will be deleted using the next 10 blocks. These
difficulties are illustrated by the RRL-tg experiments and the large sizes of
the learned Q-trees. Since at least a good estimate of the future cumulative
reward is necessary for any Q-learning algorithm, Q-learning (or any derived
learning algorithm such as the RRL system) is probably not the appropriate
learning technique for the Tetris game.
This illustrates one of the problems still faced by the RRL system and
indicates some directions for further work. The RRL system still strongly
relies on learning a good Q-function approximation. The use of a regression
algorithm allows RRL to generalize over different (state, action) pairs, but not
over different Q-values. Other reinforcement learning techniques applied to the
Tetris game (Bertsekas and Tsitsiklis, 1996; Lagoudakis et al., 2002) use a form
of approximate policy iteration. These techniques perform much better than
the RRL system, generating policies that delete 1000 lines per game or more.
The advantage of these algorithms seems to lie in the iterative improvement
nature of the policy iteration algorithm. Instead of having to learn a correct
utility or Q-value, the policy improvement step only relies on an indication of
which action is better than other actions to build the next policy. Therefore,
these approaches seem to perform some kind of advantage learning. Applying
an appropriate version of the relational approximate policy iteration technique
of Fern et al. (2003) will probably yield comparable results. Also, the use
of policy learning as described by Džeroski et al. (2001) generates a Q-value
independent description of a policy and might yield better results on tasks such
as the Tetris game.
9.7 Conclusions
Digger was the first “larger” application that the RRL system was tested on.
Although Digger is still a rather simple game according to human standards,
with the large number of objects present on the game field, the total number
of different possible states and the availability of different levels, the Digger
game would be a very difficult task for non-relational reinforcement learning.
Although the RRL learning system never reached a human playing standard,
it was able to learn “decent” behavior and even to finish the first (and easiest) level. This illustrates that RRL is able to handle large problem domains
compared to regular reinforcement learning.
This chapter also introduced a new hierarchical reinforcement learning technique that can be used to learn in an environment with concurrent goals. The
technique relies on the ability of the tg algorithm to include background information on top of the (state, action) pair description to build a Q-function. The
suggested approach first lets RRL build a Q-function for each of the present
subgoals and then allows tg to use the Q-value predictions during the construction of the Q-function for the entire learning problem.
RRL was able to bootstrap itself by using a previously learned policy as
guidance and was able to improve upon its own behavior. Using the hierarchical
learning technique for concurrent goals leads to faster learning, but not to a
higher level of performance.
On the Tetris game, the performance of RRL was a bit disappointing.
This seems to be caused by the difficulty of predicting the Q-values connected to the Tetris game. While other reinforcement learning techniques that apply some kind of advantage learning are probably better suited to deal with this kind of application, the behavior of the RRL system on the Tetris task provided some insights into the usability of RRL and of Q-learning in general.
In further work, a more extended language could be written for tg or even
a relational distance or kernel could be designed for rib and kbr which could
possibly lead to better performance. However, it does not seem feasible for
RRL to reach human expert level performance in the current setup. More
external help will be necessary to build a complex strategy such as the one
used by humans to maximize the number of points earned per level.
Part IV
Conclusions
Chapter 10
Conclusions
“I’m sorry, if you were right, I would agree with you.”
Awakenings
10.1 The RRL System
This work presented the first relational reinforcement learning system. Through
the adaptation of a standard Q-learning algorithm with Q-function generalization and the use of different incremental, relational regression algorithms, an applicable relational reinforcement learning technique was constructed.
Three new incremental and relational regression techniques were developed.
A first order regression tree algorithm tg was designed as the combination of
two existing tree algorithms, i.e., the Tilde algorithm and the G-tree algorithm. The tree building algorithm was made incremental through the use of
performance statistics for all possible extensions in each of the leaves of intermediate trees.
Based on the ideas of instance based learning, a relational instance based
regression algorithm was developed. Different data management techniques
were designed, based on different error-related example selection criteria. One
example selection technique — based on maximum Q-value variation — was
built on the specific dynamics of Q-learning algorithms.
A third regression algorithm was based on Gaussian processes and graph
kernels. Through the use of graph representations for states and actions and a kernel based on the number of walks in the product graph, the well-defined statistical properties of Gaussian processes can be used as a regression algorithm
in the RRL system.
Although the three discussed regression algorithms were developed with
their application in the RRL system in mind, most of the developed systems can
be used for other supervised relational learning problems as well, especially in tasks with a continuous prediction target.
Two additions were made to the RRL system to increase the applicability of
the system to large applications. Adding guidance to the exploration phase of
the RRL system, based on an available reasonable policy, can greatly increase the performance of the RRL system in applications with sparse and hard-to-reach rewards. Several modes of guidance were tested, ranging from supplying
all the guidance at the beginning of learning to spreading out the guidance
and interleaving it with normal exploration. Although the different regression
algorithms react differently to different kinds of guidance, overall, guidance
improves the performance of the RRL system significantly.
A second addition to the RRL system was a new hierarchical reinforcement learning approach that can be used for concurrent goals with competing
actions. In this hierarchical approach, the RRL system is allowed to first train
on subtasks of a complete reinforcement learning task. When RRL is confronted with the original task, it is supplied with the learned Q-functions as
background knowledge and given the ability to compare the predictions made
by the functions to constants and to each other. Through the use of this information, the RRL system can increase its performance on the complete problem more quickly.
10.2 Comparing the Regression Algorithms
The three regression algorithms were compared on different applications, varying from different tasks in the blocks world to the computer games Digger and
Tetris. Although there are differences in the performance of the three systems,
it is remarkable how comparable their performance is.
The differences in performance between the three algorithms are a result
of the general characteristics of the family of algorithms the three engines
belong to, i.e., model building algorithms and more specifically tree building
algorithms for the tg algorithm, instance based algorithms for RRL-rib and
statistical methods for the kernel based system.
In general, decision trees perform well when dealing with large amounts of
data. Through the use of their divide and conquer strategy, they can process
large amounts of learning examples fast. They generalize well when enough
training data is available to build an elaborate tree. However, when only a
small number of learning examples is available or in the case where the learning
examples are not evenly distributed over the space of values that need to be
predicted, decision trees can perform poorly and are likely to overgeneralize.
These characteristics are of course amplified in an incremental implementation
of a tree learning algorithm. The need for large amounts of learning examples
can be recognized in the fact that the tg algorithm usually needs more learning
episodes to reach the same level of performance as the other two algorithms.
However, when exploration costs are not an issue, this is made up greatly by
the much higher processing speed of the tg algorithm. The vulnerability of
decision trees to skewed learning data distributions surfaces in the difficulties
that the tg algorithm has with the guidance mode where all guidance is provided at the start of learning. The tg algorithm’s greatest advantages lie in
its construction of a declarative Q-function and its ability to easily incorporate extra domain knowledge in its Q-function description. The hierarchical
approach for concurrent goals for example, relies on the fact that tg is used
as the regression engine. Although it is not entirely infeasible to use RRL-rib
or RRL-kbr in this case, it will not be as convenient and elegant as with the
tg algorithm. The declarative Q-function that is the result of the tg algorithm will also aid in the introduction of reinforcement learning into software
development and real world (or industrial) problems, as software designers can
study and comprehend the Q-function that is learned and will be used as a
policy in the software system. This will facilitate verification of the behavior
of the resulting system in extraordinary cases.
When small amounts of learning examples are given or when cautious generalization is needed (for example in cases where the distribution of the learning
data is skewed with respect to the space of values that need to be predicted),
instance based and statistical methods perform better than decision trees.
Through the use of the appropriate example selection method, instance based
regression can be tuned to take full advantage of learning data with a very low
percentage of informative content. This can be a great advantage in tasks that
supply noisy data such as Q-learning at the start of a learning experiment.
When the Q-values of the examples are based on very inaccurate predictions,
instance based regression can be made to remember mostly examples with a
high information content. This behavior results in very good performance of
the RRL-rib system in the early stages of experiments with applications with
sparse and hard to reach rewards. RRL-rib performs extremely well when it is
supplied with guidance at the start of learning, as it will try to remember high
yielding examples and can exploit that knowledge when it generates a policy.
The major drawbacks of the instance based regression are computational. The
evaluation of a relational distance is usually computationally very intensive and
when the shape of the Q-function forces the rib system to store a high number
of examples, this can drastically increase the time needed for learning (through
the computation of the example selection criteria) and prediction. Also, the
RRL-rib system relies on the availability of an appropriate relational distance
between different state-action combinations. The design and implementation
of such a distance can be non-trivial.
The statistical properties of the kernel based system make it the most informative regression algorithm of the three. Its basis in Bayesian statistics allows
it to make accurate predictions with limited amounts of learning data as well as
to supply informative meta data such as the confidence of the made predictions.
This additional data can be very useful for the reinforcement learning algorithm
in the RRL system. The high prediction accuracy of the kbr regression algorithm can be witnessed in the overall high performance of the RRL-kbr
system. However, it must be noted that it suffers more from uninformative
data than rib as shown in the experiments where guidance is provided at the
start of learning. In these experiments kbr is the quickest of all regression
algorithms to build a well performing policy, but is prone to forget the learned
strategy during further uninformative exploration. Comparable to the rib system, RRL-kbr relies on the availability of a kernel between two state-action
combinations. As with relational distances, the definition of such a kernel can
be quite complex. Also, since currently the kbr system has not been supplied
with any appropriate example selection criterion, processing large numbers of
examples becomes a problem and can drastically increase the time complexity
of learning and prediction.
The fact that both rib and kbr require fewer training examples to generate good predictions makes them well suited for applications with high exploration costs. The resulting Q-functions are, however, not directly suited for later interpretation.
10.3 RRL on the Digger and Tetris Games
The RRL system was tested on the computer games Digger and Tetris. The
Digger game displays an environment that would be very difficult to handle with
standard Q-learning or Q-learning with propositional function generalization.
The Digger game field is filled with a large number of objects (emeralds and
monsters) and has a structure (tunnels) that changes during game-play. It also
offers 8 different levels on which RRL was able to learn and perform using
a single Q-function generalization. Although the results of the RRL system
on the Digger game do not reach human level performance, they do show the
increased applicability that relational reinforcement learning has brought to
the Q-learning approach.
The results on the Tetris game were a bit disappointing and illustrated
the limitations of the current RRL system. None of the regression algorithms
were able to handle the difficulties in learning the Q-function connected to the
Tetris game very well. The Q-learning approach used in the RRL system does
not seem fit to learn to play a chaotic game like Tetris. Other reinforcement
learning techniques that do not rely on an accurate modeling of the Q-function
will probably be more appropriate.
10.4 The Leuven Methodology
The development of a relational reinforcement learning system was only a small
next step in the general philosophy of the machine learning research group in
Leuven which is to upgrade propositional learners to a relational setting. This
upgrading methodology is based on the learning from interpretations setting
introduced by De Raedt and Džeroski (1994). The same strategy had already
been followed to develop a series of first order versions of machine learning algorithms including a decision tree induction algorithm called Tilde (Blockeel
and De Raedt, 1998; Blockeel, 1998), a first order frequent subset and association rule discovery system Warmer (King et al., 2001; Dehaspe and Toivonen,
1999; Dehaspe, 1998), a first order rule learner ICL (Van Laer, 2002) and first
order clustering and instance based techniques (Ramon and Bruynooghe, 2001;
Ramon, 2002).
The followed methodology has also been formalized into a process presented
in (Van Laer and De Raedt, 1998; Van Laer, 2002). A comparable approach
was followed to develop Bayesian Logic Programs (Kersting and De Raedt,
2000) and Logical Markov Decision Programs (Kersting and De Raedt, 2003).
Ongoing research includes first order sequence discovery (Jacobs and Blockeel, 2001) and relational neural networks (Blockeel and Bruynooghe, 2003).
10.5 In General
In their invited talks at IJCAI’97 (in Nagoya, Japan), both Richard Sutton
and Leslie Pack Kaelbling challenged machine learning researchers to study
the combination of relational learning and reinforcement learning. This thesis
presents a first answer to these challenges.
The work presented has very little theoretical content, and instead focussed
on the development and application of an applicable relational reinforcement
learning system. Although the applications of the RRL system have been limited to toy examples and computer games so far, the demonstrated possibilities
of the relational reinforcement learning approach have sparked a large interest in the research field. This has given rise to research into the theoretical
foundations of relational reinforcement learning and relational extensions of
approaches of reinforcement learning other than Q-learning (see Section 4.6).
In a larger context, the emergence of the relational reinforcement learning
research field has helped to renew the interest in relational learning and related
topics such as stochastic relational learning.
Chapter 11
Future Work
“They made us too smart, too quick, and too many.”
A.I.
This chapter discusses some directions for future work. The field of relational reinforcement learning is still very young and a great variety of topics
are waiting to be investigated. However, this chapter will focus on some of
the extensions and improvements that can be made to the RRL system.
11.1 Further Work on Regression Algorithms
In each of the regression algorithm chapters, a number of ideas were presented
for further development of the different systems and will be briefly repeated
here.
The tg algorithm currently lacks a way of restructuring the tree once it
discovers that the initial choices it made, i.e., the tests that were chosen at the
top of the tree, were non-optimal. The first order setting of the tg algorithm
prevents the simple approach used in propositional algorithms of storing statistics on all possible tests in each internal tree node and restructuring the tree
according to these when necessary. However, it does seem possible to exploit at
least parts of the previously built tree when the regression algorithm discovers
it made a mistake earlier. Other extensions of the tg algorithm include building forests instead of single trees, where later trees try to model the prediction
errors made by the previously built trees. Also interesting is the addition of
aggregation functions into the language bias used by the tg system.
The improvements suggested for the rib regression algorithm are mainly
focussed on the reduction of the computational requirements. These range from incrementally computable distances to using a partitioning of the state-action
space to reduce the number of stored examples that a new, to-be-predicted example needs to be compared with.
The kbr system can also benefit from the same example selection or the
state-action space partitioning as the rib system. However, the probability
distributions predicted by the Gaussian processes also allow for the design of
other example selection methods. The probabilities predicted by the Gaussian
processes can also be used to guide exploration.
One more elaborate change to the suggested regression algorithms would
be the combination of the model building approach of the tg system with the
example driven approaches of the rib and kbr systems. This would allow the
tg system (or any other model building approach) to make a coarse partitioning
of the state-action space, and the example driven approaches to build a well
fitting local Q-function approximation. The partitioning made by the model
building algorithm will reduce the number of examples that need to be handled
simultaneously by the example driven approach.
A problem that remains open for all regression algorithms is the handling of uncertainty in the state and action descriptions. None of the three proposed algorithms is currently able to handle probabilistic state or action information. One possible direction to investigate in this context is first order extensions of neural networks. Neural networks are well suited to handling numerical values, which can be used to represent the probabilities connected to state and action features. A few preliminary ideas on first order neural networks can be
found in (Blockeel and Bruynooghe, 2003).
For the tg algorithm, it is also possible to interpret the resulting Q-function,
i.e., the resulting regression tree. Due to the declarative nature of the tg
algorithm, it might be possible to allow the user of the RRL system to intervene
in the tree building process when he or she discovers a tree structure that
contradicts his or her intuition. This intervention can in the first place be
made by changing the language bias that is used by the tg algorithm but can
also be extended to forcing tg to rebuild certain parts of the regression tree.
It is possible to use the maximum variance parameter of the rib system to limit the number of examples stored in the database, regardless of the exact value of the maximum variance of the Q-function with respect to the defined distance between (state, action) pairs. It might also be possible to automatically tune this parameter for a given quality of the Q-function approximation or a given performance level that should be reached.
Although the regression algorithms were designed with their use in the RRL system in mind, their applicability is not limited to regression for Q-learning alone. Almost all of them could be used for regular relational regression tasks without any changes. Future work will certainly include the evaluation of the three systems on regular relational regression problems. The fact that the algorithms can deal with incremental data might simply be unnecessary in these applications.
11.2
Integration of Domain Knowledge
The integration of domain knowledge into the RRL system in its most obvious form has been limited to the use of guidance to help exploration and the division of the learning problem into subgoals. Of course, the definition of the language bias used by tg and the distance used by rib are also influenced by the domain knowledge of the user of the RRL system.
The use of guidance to help exploration allows for many different ways of
integrating the knowledge of a domain expert. An idea that still needs to be investigated is the closer integration of the guidance strategies and the regression algorithm that is used.
For example, when using the tg algorithm, the guidance could be adjusted
to the next split that the tg system intends to make. This might alleviate
some of the problems caused by early decisions made by the tg system. In line with observations made in human learning, the use of guidance could be delayed until RRL has had some time to explore, so that the guidance can be tuned to the parts of the state-space that the learning agent has the most problems with. Active guidance is a first step in this direction, but more work is needed to make it applicable to stochastic or highly chaotic environments.
However, the use of domain information in the RRL system can (and probably should) be extended beyond this. For the RRL-tg system, for example, this could include inspection of the constructed regression tree, as already mentioned.
Recent work on relational sequence discovery tries to find frequent sequences of occurrences, possibly with gaps of varying length between the defining occurrences. Using this kind of discovery technique on a number of successful training episodes, it might be possible to discover important sequential subgoals, as these will appear in each (or at least a large number) of the episodes.
11.3
Integration of Planning, Model Building
and Relational Reinforcement Learning
Further research will certainly include the integration of (partial) model building and planning under uncertainty into the RRL system. In this approach,
the learning agent tries to build a partial model of its environment in the form
of a set of rules that describe parts of the mechanics of the world as well as
make predictions about the reward that will be received. This model can then
be used to make predictions about the consequences of actions and thus add a
planning component to the learning agent. The combination of this technique
with the RRL system seems a straightforward next step. While the planning
component allows the agent to look a few steps ahead instead of just using
the information of the current state to choose which action to perform, the
information supplied by a learned Q-function will reduce the needed planning
depth.
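As a rough illustration of how the two components could interact, the following Prolog sketch performs a depth-limited lookahead in which the learned Q-function cuts off the search. The predicates model_predict/4 (the learned partial model), legal_action/2 and q_approx/3 (the learned Q-function approximation) are hypothetical placeholders, and the value 0.9 is only an example discount factor.

% lookahead_value(+State, +Action, +Depth, -Q): expand the learned model for
% Depth steps and use the learned Q-function at the leaves of the search.
lookahead_value(State, Action, 0, Q) :-
    q_approx(State, Action, Q).
lookahead_value(State, Action, Depth, Q) :-
    Depth > 0,
    model_predict(State, Action, NextState, Reward),
    NewDepth is Depth - 1,
    findall(V, ( legal_action(NextState, NextAction),
                 lookahead_value(NextState, NextAction, NewDepth, V) ),
            Values),
    max_list(Values, Best),
    Q is Reward + 0.9 * Best.

A deeper lookahead should need a less accurate Q-function at the leaves, while a more accurate Q-function allows the planning depth to be kept small.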
11.4
Policy Learning
Since the Q-values for a given task implicitly encode both the distance to and the amount of the received rewards, the policy that is generated from such a Q-function is often much easier to represent. It might even be easier to learn. A policy learning approach was already suggested in the introductory work on the RRL algorithm (Džeroski et al., 2001), but it was not considered further in the context of this dissertation.
Preliminary results with learning a policy from examples generated from a learned Q-function turn out to be very promising. The simpler representation of a policy compared to a Q-function makes the policy easier to learn and helps it generalize better. This could solve the problems that the current RRL system has with applications such as Tetris, whose Q-function is very difficult to predict.
Using the tg algorithm to build a policy allows the user to guide the search and inspect the resulting policy afterwards. The ability to steer the learned policy in certain directions will ease the introduction of machine learning techniques in software agent applications.
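To give an idea of how compact such a policy can be compared to the corresponding Q-function, the following hand-written Prolog rule (an illustration, not a learned result) expresses a complete policy for the Unstacking task using the blocks world predicates of Appendix A: move any clear block that is not yet on the floor onto the floor.

% policy(-Action): suggest an action for the Unstacking task in the current state.
policy(move(X, floor)) :-
    clear(X),
    on(X, Y),
    Y \= floor.

A learned relational policy would consist of a small set of such rules, which is typically far easier to inspect, and to steer, than a numerical Q-function approximation.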
11.5
Applications
Although the application of RRL to the Digger game was novel compared to other Q-learning applications, other, more practically useful applications still need to be investigated.
A first possibility is web-spidering. A web-spider is a software agent that is used to gather web-pages on specific topics, or simply previously unseen web-pages, for storage in the database of an internet search engine. In this task, the goal of the agent is to limit the amount of data it needs to download and, as a consequence, the number of links it needs to follow to reach the wanted information. The relational structure of the internet might make it a well suited domain for relational reinforcement learning.
Since relational reinforcement learning was designed to handle worlds with objects, another type of application that might be looked at is one situated in the real world, for example through robotics. The RRL system is better suited to handling high level tasks, such as planning the movement of packages, than low level tasks, such as object avoidance. Therefore, a suitable starting platform with which to interface will have to be found. For this, the work on Golog and IndiGolog seems promising (De Giacomo et al., 2002).
11.6
Theoretical Framework for Relational Reinforcement Learning
Last but not least is the development of a theoretical framework for relational reinforcement learning, like the one that exists for regular reinforcement learning. Although the RRL system described above has illustrated the usefulness and applicability of relational reinforcement learning, very little is known about its theoretical foundations. Very recently, a lot of attention has gone to relational representations of Markov Decision Processes (MDPs), which might yield a theoretical framework in which relational reinforcement learning can be studied. Such a study would allow a better comprehension of why, and more importantly when, relational reinforcement learning works. Eventually, such an understanding can ease the introduction of relational reinforcement learning into applications or can be used to build better learning systems.
Part V
Appendices
Appendix A
On Blocks World
Representations
A.1
The Blocks World as a Relational Interpretation
To be able to describe the representation used in this work, a few concepts
have to be introduced. The same conventions are used in this text as in the
work of Flach (1994). These are standard in the Inductive Logic Programming
community.
A.1.1
Clausal Logic
Names of individual entities, often objects, are called constants, and can be
recognized by the fact that they are written starting with a lowercase character.
Variables, which are used to denote arbitrary individuals, are written starting with an uppercase character. A term is either a constant, a variable, or a functor symbol followed by a number of terms. The number of terms behind
the functor symbol is called the arity. An atom is a predicate symbol followed
by a number of terms. That number is again referred to as the predicate’s arity
and a predicate p with arity n is denoted as p/n. A ground atom is an atom
without any variables. A literal is either an atom or a negated atom.
A clause is a disjunction of literals. By grouping positive and negative
literals, a clause can be written in the following format:
h1; . . . ; hn ← b1, . . . , bm
where h1, . . . , hn are the positive literals of the clause, called the head of the clause, and b1, . . . , bm are the negative literals, also called the body of the clause.
The “;” symbol should be read as an or, the “,” as an and, and the ← as an
if. A clause with a single positive literal is called a definite clause. A definite
clause with no negative literals is called a fact.
A.1.2
The Blocks World
To represent the blocks world as a relational interpretation, the predicates on/2
and clear/1 are used. The ground fact
on(a,b).
represents the fact that block a is on top of block b, while
clear(c).
signifies that block c has no blocks on top of it. The action is represented by
the move/2 predicate, i.e.,
move(c,a).
represents the action of moving block c onto block a.
Figure A.1: An example state of a blocks world with 5 blocks. The action is indicated by the dotted arrow.
The example (state, action) pair of Figure A.1 is represented as follows:
on(1,5).
on(2,floor).
on(3,1).
on(4,floor).
on(5,floor).
clear(2).
clear(3).
clear(4).
move(3,2).
Through the use of definite clauses, this representation can be augmented
by some derivable predicates. The predicates that were used throughout the
experiments in the blocks world are the following:
eq/2 defines the equality of two blocks.
above/2 specifies whether the first argument is a block that is part of the stack
on top of the second argument.
height/2 computes the height above the table of a given block.
number of blocks/1 gives the number of blocks in the world as a result.
number of stacks/1 computes the number of stacks in the blocks world state.
The definitions of these predicates are trivial. Although the clear/1 predicate could also be derived from the on/2 facts, it is included explicitly in the state representation.
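One possible encoding of these derived predicates, assuming the on/2 facts of the current state are available, is the following; the definitions actually used in the experiments may differ in small details, such as whether the height of a block on the floor is counted as 0 or 1.

eq(X,X).
above(X,Y) :- on(X,Y), Y \= floor.
above(X,Y) :- on(X,Z), Z \= floor, above(Z,Y).
% a block that is on the floor is given height 0
height(X,0) :- on(X,floor).
height(X,H) :- on(X,Y), Y \= floor, height(Y,H1), H is H1 + 1.
% each block appears exactly once as the first argument of an on/2 fact
number_of_blocks(N) :- findall(B, on(B,_), Blocks), length(Blocks, N).
% every stack has exactly one block that is on the floor
number_of_stacks(N) :- findall(B, on(B,floor), Bottoms), length(Bottoms, N).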
When the goal the agent has to reach refers to specific blocks, this goal can
be represented by adding a goal predicate and a definition of when this goal
is reached. For example, the goal of putting block 3 on top of block 1 can be
represented and defined as follows:
goal :- goal_on(3,1).
goal_on(X,Y) :- on(X,Y).
Other, more general goals can be represented in similar ways, but can also be left out of the representation altogether, because no specific blocks need to be referenced. For example, the task of Stacking all blocks could be represented
and defined as follows:
goal :- stack.
stack :- not (on(X,floor),on(Y,floor),X\=Y).
and Unstacking as:
goal :- unstack.
unstack :- not (on(X,Y),Y\=floor).
For implementation purposes, the blocks world state is sometimes represented as a single term with three arguments. The first argument is then the list of on/2 facts, the second argument the list of clear/1 facts and the third argument represents the goal configuration. This representation is completely equivalent to the one discussed above.
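For the (state, action) pair of Figure A.1 and the goal of putting block 3 on top of block 1, such a term could, for example, look as follows (the exact functor names are implementation details and may differ):

state([on(1,5), on(2,floor), on(3,1), on(4,floor), on(5,floor)],
      [clear(2), clear(3), clear(4)],
      goal_on(3,1)).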
A.2
The Blocks World as a Graph
Figure A.2 shows the graph representation of the blocks world (state, action)
pair of Figure A.1. The vertices of the graph correspond either to a block, the
floor, or ‘clear’; where ‘clear’ basically denotes ‘no block’. This is reflected in
the labels of the vertices.
An edge labelled ‘on’ (solid arrows) between two vertices labelled ’block’
denotes that the block corresponding to its initial vertex is on top of the block
corresponding to its terminal vertex. The edges labelled ‘on’ that start from
the vertex labelled ’clear’ represent the fact that the block corresponding to
its terminal vertex has no blocks on top of it. The edges labelled ‘on’ that
terminate in the vertex labelled ‘floor’ signify that the block corresponding to
its initial vertex is on the floor.
The edge labelled ‘action’ (dashed arrow) denotes the action of putting the
block corresponding to its initial vertex on top of the block corresponding to
its terminal vertex; in the example “move block 3 onto block 2”. The labels
‘a1 ’ and ‘a2 ’ denote the initial and terminal vertex of the action, respectively,
representing the fact that the block corresponding to the ‘a1 ’ label is moved
onto the block represented by the vertex with the ‘a2 ’ label.
Figure A.2: The graph representation of the blocks world (state, action) pair of Figure A.1.
To represent an arbitrary blocks world as a labelled directed graph one
proceeds as follows. Given the set of blocks numbered 1, . . . , n and the set of
stacks 1, . . . , m:
1. The vertex set V of the graph is {ν0, . . . , νn+1}.
2. The edge set E of the graph is {e1, . . . , en+m+1}.
The vertex ν0 is used to represent the floor; νn+1 indicates which blocks are
clear. Since each block is on top of something and each stack has one clear
block, n + m edges are needed to represent the blocks world state. Finally, one
extra edge is needed to represent the action.
For the representation of a state it remains to define the function Ψ:
3. For 1 ≤ i ≤ n, define Ψ(ei) = (νi, ν0) if block i is on the floor, and Ψ(ei) = (νi, νj) if block i is on top of block j.
4. For n < i ≤ n + m, define Ψ(ei) = (νn+1, νj) if block j is the top block of stack i − n.
and the function label:
5. Define L = 2^{floor, clear, block, on, a1, a2}, and:
• label(ν0) = {floor},
• label(νn+1) = {clear},
• label(νi) = {block} (1 ≤ i ≤ n),
• label(ei) = {on} (1 ≤ i ≤ n + m).
All that is left now is to represent the action in the graph:
6. Define:
• Ψ(en+m+1) = (νi, νj) if block i is moved to block j,
• label(νi) = label(νi) ∪ {a1},
• label(νj) = label(νj) ∪ {a2},
• label(en+m+1) = {action}.
It is clear that this mapping from blocks worlds to graphs is injective.
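The construction above can be summarised in a small Prolog sketch. It assumes that the state is given as a list of on/2 facts and the action as a move/2 term whose two arguments are blocks; vertices are returned as Vertex-Labels pairs and edges as edge(Label, From, To) terms. This is only an illustration of the mapping, not the representation used by the actual implementation.

% blocks_world_graph(+OnFacts, +Action, -Vertices, -Edges)
blocks_world_graph(OnFacts, move(A1,A2), Vertices, Edges) :-
    findall(block(B)-[block], member(on(B,_), OnFacts), BlockVs0),
    add_label(BlockVs0, A1, a1, BlockVs1),
    add_label(BlockVs1, A2, a2, BlockVs),
    Vertices = [floor-[floor], clear-[clear] | BlockVs],
    % one 'on' edge per block (step 3)
    findall(edge(on, block(B), Target),
            ( member(on(B,On), OnFacts), on_target(On, Target) ),
            OnEdges),
    % one 'on' edge from the 'clear' vertex per stack (step 4)
    findall(edge(on, clear, block(B)),
            ( member(on(B,_), OnFacts), \+ member(on(_,B), OnFacts) ),
            ClearEdges),
    append(OnEdges, ClearEdges, StateEdges),
    % the action edge (step 6)
    Edges = [edge(action, block(A1), block(A2)) | StateEdges].

on_target(floor, floor).
on_target(B, block(B)) :- B \= floor.

% add_label(+Vertices, +Block, +Label, -NewVertices): add Label to the label
% set of the vertex corresponding to Block (steps 5 and 6).
add_label([], _, _, []).
add_label([block(B)-Ls | Vs], B, L, [block(B)-[L|Ls] | Vs]) :- !.
add_label([V | Vs], B, L, [V | NewVs]) :- add_label(Vs, B, L, NewVs).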
In some cases the ‘goal’ of a blocks world problem is to stack blocks in a
given configuration (e.g. “put block 3 on top of block 4”). This then needs to
be represented in the graph. This is handled in the same way as the action representation, i.e., by an extra edge together with extra ‘g1’, ‘g2’, and ‘goal’ labels for the initial block, the terminal block, and the new edge, respectively. Note that by using more than one ‘goal’ edge, arbitrary goal configurations can be modelled, e.g., “put block 3 on top of block 4 and block 2 on top of block 1”. More general goals such as “build a single stack” are not represented in the graph, as no specific blocks need to be referenced.
Bibliography
[Aha et al., 1991] D.W. Aha, D. Kibler, and M.K. Albert. Instance-based
learning algorithms. Machine Learning, 6(1):37–66, January 1991.
[Aronszajn, 1950] N. Aronszajn. Theory of reproducing kernels. Transactions
of the American Mathematical Society, 68:337–404, 1950.
[Asimov, 1976] I. Asimov. The Bicentennial Man. 1976.
[Atkeson et al., 1997] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally
weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[Bain and Sammut, 1995] M. Bain and C. Sammut. A framework for behavioral cloning. In S. Muggleton, K. Furukawa, and D. Michie, editors, Machine
Intelligence 15. Oxford University Press, 1995.
[Barnett, 1979] S. Barnett. Matrix Methods for Engineers and Scientists. McGraw-Hill, 1979.
[Barto and Duff, 1994] A. Barto and M. Duff. Monte Carlo matrix inversion
and reinforcement learning. In J.D. Cowan, G. Tesauro, and J. Alspector,
editors, Advances in Neural Information Processing Systems, volume 6, pages
687–694. Morgan Kaufmann Publishers, Inc., 1994.
[Barto and Mahadevan, 2003] A. Barto and S. Mahadevan. Recent advances
in hierarchical reinforcement learning. Discrete Event Systems, 13:41–77,
2003.
[Barto et al., 1995] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to
act using real-time dynamic programming. Artificial Intelligence, 72:81–138,
1995.
[Bellman, 1961] R. Bellman. Adaptive Control Processes: a Guided Tour.
Princeton University, 1961.
[Bertsekas and Tsitsiklis, 1996] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[Blockeel and Bruynooghe, 2003] H. Blockeel and M. Bruynooghe. Aggregation versus selection bias, and relational neural networks. In IJCAI-2003
Workshop on Learning Statistical Models from Relational Data, SRL-2003,
Acapulco, Mexico, August 11, 2003.
[Blockeel and De Raedt, 1998] H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1-2):285–
297, June 1998.
[Blockeel et al., 1998] H. Blockeel, L. De Raedt, and J. Ramon. Top-down
induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, pages 55–63, 1998.
[Blockeel et al., 1999] H. Blockeel, L. De Raedt, N. Jacobs, and B. Demoen.
Scaling up inductive logic programming by learning from interpretations.
Data Mining and Knowledge Discovery, 3(1):59–93, 1999.
[Blockeel et al., 2000] H. Blockeel, B. Demoen, L. Dehaspe, G. Janssens, J. Ramon, and H. Vandecasteele. Executing query packs in ILP. In J. Cussens
and A. Frisch, editors, Proceedings of the 10th International Conference in
Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial
Intelligence, pages 60–77, London, UK, July 2000. Springer.
[Blockeel, 1998] H. Blockeel. Top-down induction of first order logical decision
trees. Phd, Department of Computer Science, K.U.Leuven, Leuven, Belgium,
1998.
[Boutilier et al., 1999] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic
planning: Structural assumptions and computational leverage. Journal of
AI Research, 11:1–94, 1999.
[Boutilier et al., 2001] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic
programming for first order MDP’s. In Proceedings of the 16th International
Joint Conference on Artificial Intelligence, 2001.
[Chambers and Michie, 1969] R. A. Chambers and D. Michie. Man-machine
co-operation on a learning task. Computer Graphics: Techniques and Applications, pages 179–186, 1969.
[Chapman and Kaelbling, 1991] D. Chapman and L.P. Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance
comparisons. In Proceedings of the 12th International Joint Conference on
Artificial Intelligence, pages 726–731, 1991.
[Clarke, 1968] A.C. Clarke. 2001: A Space Odyssey. 1968.
[Collins and Duffy, 2002] M. Collins and N. Duffy. Convolution kernels for
natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors,
Advances in Neural Information Processing Systems, volume 14, Cambridge,
MA, 2002. The MIT Press.
[Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An
Introduction to Support Vector Machines (and Other Kernel-Based Learning
Methods). Cambridge University Press, 2000.
[De Giacomo et al., 2002] G. De Giacomo, Y. Lespérance, H. Levesque, and
S. Sardiña. On the semantics of deliberation in IndiGolog — from theory
to implementation. In Proceedings of the 8th Conference on Principles of
Knowledge Representation and Reasoning, 2002.
[De Raedt and Džeroski, 1994] L. De Raedt and S. Džeroski. First order jk-clausal theories are PAC-learnable. Artificial Intelligence, 70:375–392, 1994.
[Dehaspe and Toivonen, 1999] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(1):7–36,
1999.
[Dehaspe, 1998] L. Dehaspe. Frequent Pattern Discovery in First-Order Logic.
Phd, Department of Computer Science, K.U.Leuven, Leuven, Belgium, 1998.
[Demaine et al., 2002] E.D. Demaine, S. Hohenberger, and D. Liben-Nowell.
Tetris is hard, even to approximate. Technical Report MIT-LCS-TR-865,
Massachusetts Institute of Technology, Boston, 2002.
[Diestel, 2000] R. Diestel. Graph Theory. Springer-Verlag, 2000.
[Dietterich and Wang, 2002] T.G. Dietterich and X. Wang. Batch value function approximation via support vectors. In T. G. Dietterich, S. Becker, and
Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. The MIT Press.
[Dietterich, 2000] T.G. Dietterich. Hierarchical reinforcement learning with
the MAXQ value function decomposition. Journal of Artificial Intelligence
Research, 13:227–303, 2000.
[Dixon et al., 2000] K.R. Dixon, R.J. Malak, and P.K. Khosla. Incorporating prior knowledge and previously learned information into reinforcement
learning agents. Technical report, Institute for Complex Engineered Systems,
Carnegie Mellon University, 2000.
[Driessens and Blockeel, 2001] K. Driessens and H. Blockeel. Learning digger
using hierarchical reinforcement learning for concurrent goals. European
Workshop on Reinforcement Learning, EWRL, Utrecht, the Netherlands,
October 5-6, 2001.
[Driessens and Džeroski, 2002a] K. Driessens and S. Džeroski. Integrating experimentation and guidance in relational reinforcement learning. In C. Sammut and A. Hoffmann, editors, Proceedings of the Nineteenth International
Conference on Machine Learning, pages 115–122. Morgan Kaufmann Publishers, Inc, 2002.
[Driessens and Džeroski, 2002b] K. Driessens and S. Džeroski. On using guidance in relational reinforcement learning. In Proceedings of Twelfth BelgianDutch Conference on Machine Learning, pages 31–38, 2002. Technical report
UU-CS-2002-046.
[Driessens and Džeroski, 2004] K. Driessens and S. Džeroski. Integrating guidance into relational reinforcement learning. Machine Learning, 2004. Accepted.
[Driessens and Ramon, 2003] K. Driessens and J. Ramon. Relational instance
based regression for relational reinforcement learning. In Proceedings of the
Twentieth International Conference on Machine Learning, pages 123–130.
AAAI Press, 2003.
[Driessens et al., 2001] K. Driessens, J. Ramon, and H. Blockeel. Speeding up
relational reinforcement learning through the use of an incremental first order
decision tree learner. In L. De Raedt and P. Flach, editors, Proceedings of
the 13th European Conference on Machine Learning, volume 2167 of Lecture
Notes in Artificial Intelligence, pages 97–108. Springer-Verlag, 2001.
[Driessens, 2001] K. Driessens. Relational reinforcement learning. In MultiAgent Systems and Applications, volume 2086 of Lecture Notes in Artificial
Intelligence, pages 271–280. Springer-Verlag, 2001.
[Džeroski et al., 2001] S. Džeroski, L. De Raedt, and K. Driessens. Relational
reinforcement learning. Machine Learning, 43:7–52, 2001.
[Džeroski and Lavrac, 2001] S. Džeroski and N. Lavrac, editors. Relational
Data Mining. Springer, Berlin, 2001.
[Džeroski et al., 1998] S. Džeroski, L. De Raedt, and H. Blockeel. Relational
reinforcement learning. In Proceedings of the 15th International Conference
on Machine Learning, pages 136–143. Morgan Kaufmann, 1998.
[Emde and Wettschereck, 1996] W. Emde and D. Wettschereck. Relational
instance-based learning. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning, pages 122–130. Morgan Kaufmann,
1996.
[Fern et al., 2003] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In S. Thrun, L. Saul, and B. Schölkopf, editors, Proceedings of the Seventeenth Annual Conference
on Neural Information Processing Systems. The MIT Press, 2003.
[Fikes and Nilsson, 1971] R.E. Fikes and N.J. Nilsson. Strips: A new approach
to the application of theorem proving to problem solving. In Advance Papers
of the Second International Joint Conference on Artificial Intelligence, pages
608–620, Edinburgh, Scotland, 1971.
[Finney et al., 2002] S. Finney, N. H. Gardiol, L. P. Kaelbling, and T. Oates.
The thing that we tried didn’t work very well: Deictic representation in
reinforcement learning. In Proceedings of the 18th International Conference
on Uncertainty in Artificial Intelligence, Edmonton, 2002.
[Flach, 1994] P. Flach. Simply Logical. John Wiley, Chicester, 1994.
[Forbes and Andre, 2002] J. Forbes and D. Andre. Representations for learning
control policies. In E. de Jong and T. Oates, editors, Proceedings of the
ICML-2002 Workshop on Development of Representations, pages 7–14. The
University of New South Wales, Sydney, 2002.
[Gärtner et al., 2003a] T. Gärtner, K. Driessens, and J. Ramon. Graph kernels
and Gaussian processes for relational reinforcement learning. In Inductive
Logic Programming, 13th International Conference, ILP 2003, Proceedings,
volume 2835 of Lecture Notes in Computer Science, pages 146–163. Springer,
2003.
[Gärtner et al., 2003b] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels:
Hardness results and efficient alternatives. In B. Schölkopf and M.K. Warmuth, editors, Proceedings of the 16th Annual Conference on Computational
Learning Theory and 7th Kernel Workshop, volume 2777 of Lecture Notes in
Computer Science. Springer, 2003.
[Gärtner et al., 2003c] T. Gärtner, J.W. Lloyd, and P.A. Flach. Kernels for
structured data. In Inductive Logic Programming, 12th International Conference, ILP 2002, Proceedings, volume 2583 of Lecture Notes in Computer
Science. Springer, 2003.
[Gärtner, 2002] T. Gärtner. Exponential and geometric kernels for graphs. In
NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data,
2002.
[Gärtner, 2003] T. Gärtner. A survey of kernels for structured data. SIGKDD
Explorations, 5(1):49–58, 2003.
[Gibbs, 1997] M.N. Gibbs. Bayesian Gaussian Processes for Regression and
Classification. PhD thesis, University of Cambridge, 1997.
[Goldberg, 1989] D.E. Goldberg. Genetic Algorithms in Search, Optimization
and Machine Learning. Addison-Wesley, 1989.
[Guestrin et al., 2003] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia.
Generalizing plans to new environments in relational MDP’s. In Proceedings
of the 18th International Joint Conference on Artificial Intelligence, 2003.
[Haussler, 1999] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at
Santa Cruz, 1999.
[Humphrys, 1995] M. Humphrys. W-learning: Competition among selfish Q-learners. Technical Report 362, University of Cambridge, Computer Laboratory, 1995.
[Imrich and Klavžar, 2000] W. Imrich and S. Klavžar. Product Graphs: Structure and Recognition. John Wiley, 2000.
[Jaakkola et al., 1993] T. Jaakkola, M. Jordan, and S.P. Singh. Convergence
of stochastic iterative dynamic programming algorithms. In J.D. Cowan,
G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 703–710. Morgan Kaufmann Publishers,
Inc., 1993.
[Jacobs and Blockeel, 2001] N. Jacobs and H. Blockeel. From shell logs to
shell scripts. In C. Rouveirol and M. Sebag, editors, Proceedings of the 11th
International Conference on Inductive Logic Programming, volume 2157 of
Lecture Notes in Artificial Intelligence, pages 80–90. Springer-Verlag, 2001.
[Jennings and Wooldridge, 1995] N.R. Jennings and M. Wooldridge. Intelligent
agents and multi-agent systems. Applied Artificial Intelligence, 9:357–369,
1995.
[Kaelbling et al., 1996] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–
285, 1996.
[Kaelbling et al., 2001] L.P. Kaelbling, T. Oates, N. Hernandez, and S. Finney.
Learning in worlds with objects. In Working Notes of the AAAI Stanford
Spring Symposium on Learning Grounded Representations, 2001.
[Karalič, 1995] A. Karalič. First-order regression: Applications in real-world
domains. In Proceedings of the 2nd International Workshop on Artificial
Intelligence Techniques, 1995.
[Kashima and Inokuchi, 2002] H. Kashima and A. Inokuchi. Kernels for graph
classification. In ICDM Workshop on Active Mining, 2002.
[Kersting and De Raedt, 2000] K. Kersting and L. De Raedt. Bayesian logic
programs. In Proceedings of the Tenth International Conference on Inductive
Logic Programming, work in progress track, 2000.
[Kersting and De Raedt, 2003] K. Kersting and L. De Raedt. Logical Markov
Decision Programs. In Proceedings of the IJCAI’03 Workshop on Learning
Statistical Models of Relational Data, pages 63–70, 2003.
[Kersting et al., 2004] K. Kersting, M. van Otterlo, and L. De Raedt. Bellman
goes relational. In Proceedings of the Twenty-First International Conference
on Machine Learning, 2004. Accepted.
[Kibler et al., 1989] D. Kibler, D. W. Aha, and M.K. Albert. Instance-based
prediction of real-valued attributes. Computational Intelligence, 5:51–57,
1989.
[King et al., 2001] R. D. King, A. Srinivasan, and L. Dehaspe. Warmr: a data
mining tool for chemical data. Journal of Computer-Aided Molecular Design,
15(2):173–181, feb 2001.
[Kobsa, 2001] A. Kobsa, editor. User Modeling and User-Adapted Interaction,
Ten Year Anniversary Issue, volume 1-2. Kluwer Academic Publishers, 2001.
[Korte and Vygen, 2002] B. Korte and J. Vygen. Combinatorial Optimization:
Theory and Algorithms. Springer-Verlag, 2002.
[Kramer and Widmer, 2000] Stefan Kramer and Gerhard Widmer. Inducing classification and regression trees in first order logic, pages 140–156.
Springer-Verlag New York, Inc., 2000.
[Lagoudakis et al., 2002] M.G. Lagoudakis, R. Parr, and M.L. Littman. Leastsquares methods in reinforcement learning for control. In Proceedings of the
2nd Hellenic Conference on Artificial Intelligence (SETN-02), pages 249–
260. Springer, 2002.
[Langley, 1994] P. Langley. Elements of Machine Learning. Morgan Kaufmann,
1994.
[Lin, 1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement
learning, planning and teaching. Machine Learning, 8:293–321, 1992.
[Lodhi et al., 2002] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini,
and C. Watkins. Text classification using string kernels. Journal of Machine
Learning Research, 2:419–444, 2002.
[MacCallum, 1999] A. MacCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1999.
[MacKay, 1997] D.J.C. MacKay. Introduction to Gaussian processes. available
at http://wol.ra.phy.cam.ac.uk/mackay, 1997.
[Mahadevan et al., 1997] S. Mahadevan, N. Marchalleck, T.K. Das, and
A. Gosavi. Self-improving factory simulation using continuous-time averagereward reinforcement learning. In Proc. 14th International Conference on
Machine Learning, pages 202–210. Morgan Kaufmann, 1997.
[Mitchell, 1996] M. Mitchell. An Introduction to Genetic Algorithms. The MIT
Press, 1996.
[Mitchell, 1997] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Morales, 2003] E. Morales. Scaling up reinforcement learning with a relational
representation. In Proc. of the Workshop on Adaptability in Multi-agent
Systems, pages 15–26, 2003.
[Nilsson, 1980] N.J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing Company, 1980.
[Ormoneit and Sen, 2002] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.
[Parr and Russell, 1997] R. Parr and S. Russell. Reinforcement learning with
hierarchies of machines. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors,
Advances in Neural Information Processing Systems, volume 10. The MIT
Press, 1997.
[Puterman, 1994] M. L. Puterman. Markov Decision Processes. J. Wiley &
Sons, 1994.
[Ramon and Bruynooghe, 2001] J. Ramon and M. Bruynooghe. A polynomial
time computable metric between point sets. Acta Informatica, 37:765–780,
2001.
[Ramon, 2002] J. Ramon. Clustering and instance based learning in first order
logic. PhD thesis, Department of Computer Science, K.U.Leuven, 2002.
[Rasmussen and Kuss, 2004] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.
[Rennie and McCallum, 1999] J. Rennie and A.K. McCallum. Using reinforcement learning to spider the web efficiently. In Proceedings of the 16th International Conf. on Machine Learning, pages 335–343. Morgan Kaufmann,
San Francisco, CA, 1999.
[Russell and Norvig, 1995] S. Russell and P. Norvig. Artificial Intelligence: A
Modern Approach. Prentice-Hall, 1995.
[Schaal et al., 2000] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Real-time
robot learning with locally weighted statistical learning. In Proceedings of the
IEEE International Conference on Robotics and Automation, pages 288–293.
IEEE Press, Piscataway, N.J., 2000.
[Scheffer et al., 1997] T. Scheffer, R. Greiner, and C. Darken. Why experimentation can be better than “perfect guidance”. In Proceedings of the
14th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann, 1997.
[Sebag, 1997] M. Sebag. Distance induction in first order logic. In N. Lavrač
and S. Džeroski, editors, Proceedings of the Seventh International Workshop
on Inductive Logic Programming, volume 1297 of Lecture Notes in Artificial
Intelligence, pages 264–272. Springer, 1997.
[Shapiro et al., 2001] D. Shapiro, P. Langley, and R. Shachter. Using background knowledge to speed reinforcement learning in physical agents. In
Proceedings of the 5th International Conference on Autonomous Agents. Association for Computing Machinery, 2001.
[Slaney and Thiébaux, 2001] J. Slaney and S. Thiébaux. Blocks world revisited. Artificial Intelligence, 125:119–153, 2001.
[Smart and Kaelbling, 2000] W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the 17th International Conference on Machine Learning, pages 903–910. Morgan Kaufmann,
2000.
[Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement Learning:
An Introduction. The MIT Press, Cambridge, MA, 1998.
[Sutton et al., 1999] R. Sutton, D. Precup, and S.P. Singh. Between MDP’s
and semi-MDP’s: A framework for temporal abstraction in reinforcement
learning. Artificial Intelligence, 112(1-2):181–211, 1999.
[Tesauro, 1992] G. Tesauro. Practical issues in temporal difference learning. In
J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural
Information Processing Systems, volume 4, pages 259–266. Morgan Kaufmann Publishers, Inc., 1992.
[Urbancic et al., 1996] T. Urbancic, I. Bratko, and C. Sammut. Learning models of control skills: Phenomena, results and problems. In Proceedings of the
13th Triennial World Congress of the International Federation of Automatic
Control, pages 391–396. IFAC, 1996.
[Utgoff et al., 1997] P. Utgoff, N Berkman, and J. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, 29:5–44,
1997.
[Van Laer and De Raedt, 1998] W. Van Laer and L. De Raedt. A methodology for first order learning: a case study. In F. Verdenius and W. van den
Broek, editors, Proceedings of the Eighth Belgian-Dutch Conference on Machine Learning, volume 352 of ATO-DLO Rapport, 1998.
[Van Laer, 2002] W. Van Laer. From Propositional to First Order Logic in
Machine Learning and Data Mining. Induction of First Order Rules with
ICL. PhD thesis, Department of Computer Science, K.U.Leuven, 2002.
[van Otterlo, 2002] M. van Otterlo. Relational representations in reinforcement
learning: Review and open problems. In E. de Jong and T. Oates, editors,
Proc. of the ICML-2002 Workshop on Development of Representations, pages
39–46. University of New South Wales, 2002.
[van Otterlo, 2004] M. van Otterlo. Reinforcement learning for relational
MDP’s. In Proceedings of the Machine Learning Conference of Belgium and
the Netherlands 2004, 2004.
[Wagner and Fischer, 1974] R.A. Wagner and M.J. Fischer. The string to
string correction problem. Journal of the ACM, 21(1):168–173, January
1974.
[Wang, 1995] X. Wang. Learning by observation and practice: An incremental approach for planning operator acquisition. In Proceedings of the 12th
International Conference on Machine Learning, pages 549–557, 1995.
[Watkins, 1989] Christopher Watkins. Learning from Delayed Rewards. PhD
thesis, King’s College, Cambridge., 1989.
[Wiering, 1999] M. Wiering. Explorations in Efficient Reinforcement Learning.
PhD thesis, University of Amsterdam, 1999.
[Witten and Frank, 1999] I.A. Witten and E. Frank. Data Mining: Practical
Machine Learning Tools and Techniques with Java Implementations. Morgan
Kaufmann, 1999.
[Yoon et al., 2002] S. Yoon, A. Fern, and R. Givan. Inductive policy selection
for first-order MDP’s. In Proceedings of the 18th Conference on Uncertainty
in Artificial Intelligence, 2002.
[Zien et al., 2000] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize
translation initiation sites. Bioinformatics, 16(9):799–807, 2000.
Index
action, 12
adjacency matrix, see graphs
afterstates, 26
agent, 5, 12
arity, 175
artificial intelligence, 3
atom, 175
automated driving, 73
background knowledge, 39
Bayesian regression, 96
behavioral cloning, 6, 137
Bellman equation, 18–20
blackjack, 26
blocks world, 33
distance, 74
episode, 41
graph, 178
kernel, 106
relational interpretation, 176
state space size, 62
tasks, 61
Boltzmann exploration, 23
c-nearest neighbor, 76
CARCASS, 47
clause, 175
concurrent goals, 145, 148
convex hull, 73
convolution kernel, 95
corridor application, 81
covariance function, 98
covariance matrix, 96
curse of dimensionality, 21
data-mining, 5
deictic, 46
deictic representations, 28–29, 46
Digger, 142–146
direct policy search, 17
discount factor, see γ
dynamic programming, 18
edit distance, 74
environment, 12
episode, 20
error contribution, 78, 83
error management, 83
error margin, 76
error proximity, 78
exploration, 22
exponential series, 104
fact, 176
feature vector, 27
focal point, 28, 46
function generalization, see regression
functor, 175
G-algorithm, 54
γ, 14
Gaussian processes, 96
genetic programming, 17
geometric series, 104
graphs, 99–101
adjacency matrix, 101
cycle, 100
definition, 99
direct product, 102
directed, 99
indegree, 101
labelled, 99
outdegree, 101
walk, 100
ground, 175
guidance, 120–123
active, 123, 132
hierarchical reinforcement learning,
146–148
incremental distance, 91
incremental regression, 42
inflow limitations, 76, 82
instance averaging, 79
instance based learning, 72
instance based regression, 72
ITI algorithm, 54, 57
kernel, 95
for graphs, 103
for structured data, 95
kernel methods, 94–98
language bias, 56
learning graphs, 63
literal, 175
locally linear regression, 72
Logical Markov Decision Processes,
46
machine learning, 4
Markov Decision Processes, 15
relational, 46
Markov property, 15
matching distance, 74
matrix inversion
incremental, 98
maximum variance, 79, 86
MDP, see Markov Decision Processes
minimal sample size, 60, 68
Monte-Carlo methods, 18
moving target regression, 43
nearest neighbor, see instance based
learning
neural networks, 22, 28, 96
On(A,B), see blocks world tasks
planning, 16
reward, 16
policy, 13, 17
policy evaluation, 18
policy improvement, 18
policy iteration, 18
approximate, 46
positive definite kernel, see kernel
predicate, 175
probabilistic policies, 15
Prolog, 56
propositional representations, 27–28
Q-learning, 6, 19–23
algorithm, 20
Q-value, 19
query packs, 58
radial basis function, 96, 105
ranked tree, 95
reachable goal state, 62
receptive field, 72
refinement operator, 56
regression, 21, 28, 42
task definition, 22
reinforcement learning, 5, 6, 11–16
task definition, 13
RRL, 39
algorithm, 40
problem definition, 38
prototype, 43
regression task, 42
relational distances, 73
relational regression, 42
relational regression tree, 55
sizes, 64
relational reinforcement learning, 6
relational representations, 29
graphs, 32
interpretations, 31
reward, 12, 16
reward function, 13
reward backpropagation, 20
rib algorithm, 76
rmode, 56
rQ-learning, 47
SARSA, 18
software agents, see agent
stacking, see blocks world tasks
state, 12
state utility, see value function
stochastic environments, 15–16, 80
state utility, 16
stochastic policies, 15
supervised learning, 6, 11
support vector machines, 19
temporal difference learning, 18
term, 175
tg algorithm, 55
Tilde, 43
transition function, 13
transition probability, 15
U-trees, 54
unstacking, see blocks world tasks
user-modelling, 5
utility, see value function
value function, 13, 14
class-based, 48
discounted cumulative reward,
14
value iteration, 18
web-spider, 5
world, see environment