Behavior Adaptation for a Socially Interactive Robot

CHRISTIAN SMITH

TRITA-NA-E05044
Master's Thesis in Computer Science (20 credits)
at the School of Engineering Physics,
Royal Institute of Technology, year 2005

Department of Numerical Analysis and Computer Science
Royal Institute of Technology (KTH)
SE-100 44 Stockholm, Sweden

Supervisor at Nada was Henrik Christensen
Examiner was Henrik Christensen
Abstract
This report addresses the problem of making a humanoid robot learn a human
partner’s preferences regarding personal space and adapt to these in real-time. An
adaptive system using policy gradient reinforcement learning (PGRL) is proposed,
implemented and evaluated in an experiment using human subjects. The experiment
shows that this is a viable solution to the problem, but that there are some issues
that remain to be resolved.
Beteendeanpassning för en socialt interaktiv robot
Degree Project
Summary
This report addresses the problem of getting a humanoid robot to learn, in real time, a
human's preferences regarding personal space in two-party conversation and to adapt to
them. An adaptation system based on a machine learning method called policy gradient
reinforcement learning (PGRL) is proposed. Implementation and evaluation of the method
in an experiment with human subjects show that the method is applicable to the problem,
although some minor issues remain.
Foreword and acknowledgements
This report is presented as a graduation thesis at a master’s level at the School
of Engineering Physics at the Royal Institute of Technology (KTH) in Stockholm,
Sweden. The assumed reader is a master's student in the field of computer science
(or equivalent).
The research project presented in this report was suggested and commissioned by
the Intelligent Robotics and Communication Laboratories (IRC) of the Advanced
Telecommunications Research Institute International (ATR) in Kyoto, Japan. I wish to
thank Prof. Henrik Christensen for being my supervisor at KTH and for introducing me
to this project, as well as Prof. Hiroshi Ishiguro and Dr. Takayuki Kanda of IRC
for providing me with this opportunity, and Dr. Noriaki Mitsunaga of IRC for
supervising the research and giving invaluable feedback and support throughout the
project.
This research was supported by the Ministry of Internal Affairs and Communications
of Japan.
Christian Smith, February 2005
Contents

1 Introduction
  1.1 Background
  1.2 Current state of the art
  1.3 Problem
  1.4 Report outline

2 Interaction Theory
  2.1 Proxemics
  2.2 Personal space
  2.3 Interaction with robots
  2.4 Summary

3 Method
  3.1 Problem
    3.1.1 Requirements
  3.2 Choice of method
    3.2.1 Available methods
    3.2.2 Method analysis
  3.3 PGRL
    3.3.1 General Ideas
    3.3.2 The algorithm
    3.3.3 Properties
  3.4 Summary

4 Implementation
  4.1 The behavior adaptation system
    4.1.1 Behavior parameterization
  4.2 Algorithm
    4.2.1 Reward Function
  4.3 Robot platform
  4.4 Sensing and measurements
  4.5 Summary

5 Experiment
  5.1 Experimental goals
  5.2 Experimental design
    5.2.1 Experimental setup
    5.2.2 Experimental procedure
  5.3 Conducted measurements
    5.3.1 Logged data
    5.3.2 Postexperimental measurements
  5.4 Summary

6 Results
  6.1 Measured results
  6.2 Results for the different parameters
  6.3 Result evaluation
  6.4 Suggested improvements
  6.5 Summary

7 Summary and conclusions
  7.1 Summary
  7.2 Conclusions

References

A Unabridged result measurements and plots
  A.1 Complete results from PGRL experiments on Robovie
    A.1.1 Completely successful experimental runs
    A.1.2 Partial success - content subjects
    A.1.3 Successful runs - discontent subjects
    A.1.4 Partially successful runs
    A.1.5 Unsuccessful run
Chapter 1
Introduction
This chapter gives the background for the subject matter, describes the problem
that is to be solved and what research has previously been done in the field.
1.1 Background
Since robots were first introduced, one of their main applications has been to perform
tasks that humans are not able or not willing to do, or to simply perform them
faster and cheaper than humans would. Traditionally, these have been industrial
tasks, but also dangerous tasks in both military and civilian operations. There are
unmanned scout planes, mine sweepers, and Mars explorers. Lately, there has been
work done to apply robotics even to what have traditionally been very human fields,
like health care or teaching. These latter fields require not only the classical robot
traits of strength and durability, but social skills as well.
The field is known as social robotics and is a fairly young academic discipline. A
survey of the field compiled by Fong [11] sets the emergence date to the late 1940s.
However, the field took its modern shape with the rapid development of artificial
intelligence in various forms in the 1990s. There are several subdivisions of the
field, but the ones that are of interest here are the ones that deal with robots
operating in human environments.
One line of study uses social behavior or appearance on the interface level in order to
facilitate interaction with humans. An example is a study by DiSalvo et al. [9], which
examines the impact the design of a robot head has on a human’s impression of the
robot. They show that a robot’s head should be human-like enough to have clear
facial expressions that humans can project perceived emotions on, while retaining
enough robot-like qualities so that there is no ambiguity as to whether it is a person
or a machine. This results in a robot that most users feel most comfortable with.
Another line of study looks at robotic behavior. Nakajima [31] studies distances
in human-robot interaction, in order to determine how humans perceive robots as
a function of robot appearance, movement speed and distance. Other studies, e.g.
Imai et al. [15], show that joint attention and gaze-meeting seem to have the same
importance in facilitating communication in the human-robot case as in the
human-human case.
In recent years, there have been full-scale trials, where robots have been immersed in
strictly human contexts. One example is a study by Kanda et al. [21], where a robot
played the part of a foreign exchange student in order to help elementary school
students learn English. Another example is by Burgard et al. [7], who describe a
robot that works as a museum tour-guide.
1.2 Current state of the art
A major field of study in social robotics is constructing robots and systems that can
operate freely in normal human societies. A large effort has been put into using
social interactions in order to gain information that the robot needs to perform its
tasks. Asoh et al. [3] describe a system where a robot finds its way around an office
by asking questions when it does not have enough information to navigate on its
own. Inamura et al. [16] [17] also propose a similar system where the robot actively
asks questions as soon as it does not fully understand what it should do. As it
gathers statistics on its user, it can make more and more educated guesses as to
what behavior it should execute, becoming increasingly autonomous. In a study
by Ogata et al. [34], a human and a robot cooperate in solving a navigational
task, supplying each other with sensory information. A more indirect approach
to information sharing comes from Nicolescu and Mataric [33], who propose a
system where the robot first learns a task by imitating a human, and then, if it
fails to accomplish the task, demonstrates its failure to the human, thus enticing the
human to help it. Mataric [27] also presents a study that gives the details of how
a robot learns by imitation by mapping observed motions to motion primitives that
the system recognizes as plausible, much like a human would do when mimicking.
Another direction of study uses notions from human interaction in order to produce robots that can act as humans. An example of this is a study by Nakauchi et
al. [32] that uses the concept of personal space in order to identify which persons
are standing in line and determining an appropriate distance for the robot to use
when positioning itself last in line. Tasaki et al. [44] use proxemic theory to control
robot behavior, where different types of communication are used at different distances. Breazeal and Scassellati [6] describe a robot that uses interaction context to
control robot behaviors. Althaus et al. [2] propose a system that lets a robot keep
appropriate distance and orientation to a group of people in multi-party conversation.
However, all of these studies use static social models that only fit one pre-specified
social type.
There are also some studies that incorporate learning in order to achieve better interaction with humans. These studies include more basic concepts like the system
proposed by Bennewitz et al. [5] that learns typical human trajectories in a room
in order to be able to predict probable future actions, or the system by Shiomi
et al. [36] that identifies emotions from facial expressions in order to control conversation. There are also more direct implementations, like Isbell et al. [18], who
use reinforcement learning to improve the behavior of a virtual computer character.
Their system is not autonomous, as the reinforcement signal is explicitly provided
by the user. This is a weakness, as Zrehen et al. [47] [46] discuss the importance
of autonomy, and argue that learning systems for robots need to be autonomous if
they are to be used by the general public.
1.3 Problem
When robots interact with humans in a social environment, they need to at least
partially follow the same rules of engagement as humans would. One of these rules is
to respect personal space (see section 2.1). With current technology, there is no system
that allows a robot to adapt in real-time to different persons’ notions of personal
space. Therefore, the aim of this report is to find a solution to this problem by
proposing a system by which such an adaptation can be done.
This report will be limited to addressing the strictly technological solution to this
problem. The aspects of behavioral psychology that are involved will be briefly
mentioned where needed to clarify the discussion, but these matters generally lie
beyond the scope of this project.
1.4 Report outline
This report aims to solve the aforementioned problem in the following order.
Chapter 2 - Interaction Theory will start by describing the theories of proxemics and
personal space that are necessary for this approach. Both interhuman interaction
and human-robot interaction will be discussed.
Chapter 3 - Method will restate the problem in the light of the interaction theories,
and examine what methods are available to solve it. These methods are analyzed,
and the choice of Policy Gradient Reinforcement Learning is argued for. This algorithm is also presented in more detail.
Chapter 4 - Implementation will describe the actual implementation of the algorithm, as well as the humanoid robot RobovieII that it will be tested on. The
system used for input and sensing is also described.
Chapter 5 - Experiment will describe the design and execution of an experiment
used to evaluate the implementation.
Chapter 6 - Results will present the results obtained from experiments with a total
of 16 test subjects. The results will be evaluated and the strengths and weaknesses
of the system will be identified.
Chapter 7 - Conclusions will summarize the study and present the conclusions
drawn.
Appendix A will give a detailed description of all experimental runs.
The results found in this report will be presented at the 2005 Robotics Symposia
in Hakone, Japan [39], and have been submitted to IROS2005 [29] and to Nihon
Robotto Gakkaishi (Journal of the Robotics Society of Japan) [30].
Chapter 2
Interaction Theory
This chapter gives a brief introduction to the theory of proxemics and personal space,
that is, the study of what distances humans prefer in different types of interaction.
There is also a shorter presentation of how humans react to having their personal
space invaded.
2.1 Proxemics
When two humans interact with each other, the distance between them is based on
several factors. Relationships and the type of activity are two of the more important
ones according to Daibo [8]. He presents a study that shows how perceived comfort
or discomfort correlates to distance and relationship. As Figure 2.1 shows, between
total strangers that do not partake in a common action, the comfort is greater with
greater distance. When there is a common action, the optimal comfort is at a certain
distance that is closer for friends than strangers. This distance is also dependent on
orientation. Most people prefer a farther distance to someone in front of them than
to someone beside or behind them.
Hall [14] describes interpersonal distance in more detail, and classifies distances into
four different types, depending on the type of interaction that is taking place. These
types are intimate, personal, social, and public.
• Intimate distance is the closest distance. At this distance, smell and touch
give the main sensory input from the partner. Visual input is limited to a very
narrow field. Most people feel discomfort if forced to be at this distance.
• Personal distance is the distance where one can easily touch the other by
extending an arm. Vision provides the main sensory input, and the face is
perceived in very high detail. This is the distance used between close friends
in normal conversation in most cultures.

Figure 2.1. The correlations between relationships, distances and perceived comfort
(comfort versus distance for a friend with a common action, a stranger with a common
action, and a stranger with no common action). Adapted from Daibo [8].
• Social distance is the distance where casual conversation can easily take place,
but touch is not possible even by extending the arms. This is the distance
used in casual social gatherings, or when two coworkers interact in performing
a common task.
• Public distance is the farthest distance. At this distance, facial features and
nuances in the tone of voice are not conveyed. This distance is not comfortably
used in common interaction, except for public speeches or lectures.
Another factor that has a large impact on interpersonal distances is culture. Hall
studied several American, European and Asian cultures and found significant differences between most. As an example, people from Arab cultures tend to prefer
much closer interaction than most Americans. These cultural differences are one of
the main factors defining the need for an adaptive system, but this report will not
be able to address the problem thoroughly, as the experiments (see Chapter 5) were
carried out in a monocultural setting.
2.2 Personal space
When intrusions are made into an individual's personal space, there are several common reactions. Hall [14] and Daibo [8] report repositioning to be the most common.
In circumstances where available physical space does not allow repositioning, the
normal reaction is to avert gaze, as this weakens the perceived intrusion. Sundstrom and Altman [40] also report on studies where gaze-averting has been used
even when there is enough physical space. These cases are mostly those where the
conversational partner has made obvious intrusions, and it would seem impolite to
back off.
However, gaze-meeting is not strictly hit or miss, but takes place on a continuous
scale. As research by Duncan and Fiske [19] shows, there are several factors that
influence the level of gaze-meeting in a two-party conversation, such as who is speaking, the sexes of those involved, and the topic of speech, as well as a large amount of
individual variation.
Kendon [25] presents a study of group dynamics that shows that status is also very
important in determining the amount of perceived intrusion. For example, many
people do not experience any discomfort when their personal space is invaded by
small children or animals. Thus, it is possible that a person who does not perceive
a robot as an equal being will not have any notions of personal space regarding the
robot.
2.3 Interaction with robots
Since social robotics is a young field of study, the literature on the concepts of
personal space and proxemics regarding human-robot interaction is limited. There
are studies of some of the aspects, however.
One of the studies that looks at interaction preferences for human-robot interactions, done by Nakajima [31], shows that the faster the robot moves, the farther the
preferred distance. The study used pulse measurements to show that the subjects
felt considerable stress when speed was raised without increasing distance. This was
deemed to be more of an issue of personal safety than one of personal space, but the
reactions were similar.
On the other hand, a study by Imai et al. [15] shows that gazing and joint attention
can play the same role in human-robot interaction as in strictly human interaction. They show that their human test subjects project the human meaning of gaze
direction on a robot partner.
Further studies into human-robot interaction conducted by Kanda et al. [23] describe
correlations between perceived quality of interaction and human reactions. This
study shows that subjects who are interacting with a robot show some of the same
reactions as in human-human interaction. One example of this is a strong correlation
between perceived quality of interaction and the amount of eye contact. This study
also shows a slight negative correlation between perceived quality and the amount
of movement, that is, subjects tend to move more when discontent with the robot’s
behavior.
2.4 Summary
Humans have notions of personal space—preferred distances to others—that are of
great importance when interacting with other humans. The volume of an individual’s
personal space is dependent on several factors, including type of interaction, relationship to counterpart, cultural background, motion speed, etc. In most cases there is
an optimal distance. If the counterpart attempts to interact at some other
distance, either closer or farther, humans tend to feel discomfort. The perception of
personal space is also altered by the extent to which gaze is met. If gaze is averted,
most persons will feel comfortable at a closer distance than if gaze is met.
Studies show that most notions of personal space also apply to human-robot interaction. Humans show the same types of discomfort signals when a robot interacts
without regard to their personal space as they would in strictly human interaction.
These discomfort signals include body movement and gaze averting.
Chapter 3
Method
This chapter restates the problem in more detail and argues for the choice of method
by explaining what requirements must be fulfilled and why the proposed method
fulfills these better than other methods. The proposed method is also described in
detail.
3.1 Problem
Given the description of proxemics and personal space in the previous chapter, the
problem can now be restated. In order to achieve a robot that can interact socially
with a human without causing the human discomfort, the robot must respect the
same rules of interaction as a human would. This means that the robot must be able
to adjust its interaction distances according to relation, type of action, the situation,
and personal preferences of the human counterpart. The last factor is very complex
and depends on culture, mood, and several other ill-defined factors.
Furthermore, as the quality of interaction and the perceived discomfort are not only
a function of distance, the robot must be able to take gaze and movement speed into
account as well. Also, the robot must adapt its behavior to different individuals,
and to different occasions or situations with a certain individual.
The problem can thus be restated as finding a system that allows a robot to adjust its
interaction speeds and distances, as well as its gaze-meeting, to different individual
preferences and to different situations.
3.1.1 Requirements
There are several requirements that have to be met by the adaptive system. Some
are of course more vital than others, but an optimal system should fulfill as many
as possible. The last three requirements in particular are vital for a robot operating in
a public space.
• The method must generate an acceptable behavior. The system, when implemented, should result in a final behavior that makes smooth interaction as
easy as possible for a human partner.
• The system must not behave in an excessively strange manner during the adaptation phase. This
can either mean that strange or discomforting behavior does not last for very
long, or that the behavior is not far from acceptable or optimal behavior if the
adaptation requires a longer time.
• The adaptive system must not require special training or help from trained
staff. This could also be stated as a requirement that the system is completely
autonomous.
• The adaptation must be general and work with a wide range of human counterparts, possibly having a wide range of behavioral patterns or preferences.
3.2 Choice of method
Given the requirements stated above, a suitable method that fulfills as many of these
as possible has to be found. This section will start by listing some of the available
approaches and then analyze these to find the best one.
3.2.1 Available methods
There are several different approaches to controlling robot behavior. A short overview of the more common methods will be given here. These descriptions can be
found in any good textbook on the subject, and vary little between different authors.
Direct control methods
Most simpler methods for automated control are geared towards keeping some measurable quantity of a system at a predefined level. A typical example is the PID
(Proportional Integral Derivative) controller. This is a feedback controller that sets
the control signal(s) according to the value, integral, and derivative of the target
function. For most simpler cases, a controller like this will perform very well if
the system parameters are set appropriately. This is widely used in many different
industrial process applications (for further descriptions, see for instance Franklin et al. [12]).
Machine learning methods
More advanced systems can be constructed with some sort of inherent intelligence.
Methods that improve by using data from the environment are collectively known
as machine learning methods. In a wide sense, these can be divided into supervised
methods, where an operator has to provide the system with some sort of feedback
or support, unsupervised methods, that receive no external guidance, and reinforcement learning (RL) methods that given a specified reward function that should be
maximized2 .
Supervised learning in most forms uses a set of training examples that contain pairs
of input signals and desired output signals. Given this, the system learns a mapping
function from input to output that not only correctly maps the training data, but
in the successful case also generalizes to yet unseen input signals. Unsupervised learning is not
provided with anything but the input data. It is then left to the algorithm to find
meaning or relationships in the data to classify or cluster it in some way. These two
classes of methods will not be considered in this report, as there is not enough comprehensive training data for supervised learning, and unsupervised learning would
not have any information as to what it should achieve. They are listed here to
complete the presentation.
Systems using reinforcement learning are provided with some target function, known
as the reward function, that is to be maximized. There are several different methods
for reinforcement learning, but they can roughly be divided into action value methods and policy methods. The action value methods learn the (long-term) value of
taking a certain action in a certain state, and try to choose the action that has the
highest value. Policy methods, on the other hand, learn which policy (in this context,
a function that maps states to actions for all possible states) will result in
the greatest reward.
Sutton and Barto [41] describe the following RL methods:
Dynamic Programming. This is the name of a collection of methods that compute the
optimal policy directly. This requires the environment to be completely described as
a Markov decision process. These methods are often computationally very expensive,
but the optimal policy only needs to be calculated once, and when this is done, following
the policy is simple.
Temporal Difference. These methods do not require a complete model of the environment, as they empirically explore different paths through the state space to find
the reward outcomes for different actions and sequences of actions. Methods of this
class include TD(λ), SARSA and Q-learning. They test different actions in different
states, and use the measured reward to evaluate the value of taking this action.
They differ slightly in how they model the policy functions and how they represent
and update the action values, but they have in common that they use information about the values
of possible actions in the next state to calculate the long-term rewards.
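To make the action value idea concrete, the one-step Q-learning update adjusts the stored value of a state-action pair towards the observed reward plus the discounted value of the best action in the next state. The following is a minimal illustrative sketch in Python; the tabular representation and the parameter values are assumptions for illustration only and are not taken from any of the cited implementations.

    def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
        # Q is a dictionary mapping (state, action) pairs to estimated values.
        # alpha is the learning rate and gamma the discount factor (illustrative values).
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in next_actions)
        td_error = r + gamma * best_next - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

As noted in the method analysis below, a tabular scheme of this kind typically needs a large number of such updates to converge, which is the main reason it is not used here.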
Aberdeen [1] describes policy learning methods as methods that have a set of parameters that define the policy. The policy is repeatedly evaluated and improved,
which can be done in a number of different ways. For the case with continuous
states and parameters, Kohl and Stone [26] propose to sample the results of different policies in a neighborhood of a given policy, estimate the gradient of the reward function from these samples, and perform gradient ascent. This is called policy gradient reinforcement learning
(PGRL).
3.2.2 Method analysis
A number of these methods were considered, but only one was found to meet the
criteria. The following section describes the selection process. Some methods considered were evaluated solely on a theoretical basis, while others were also tried in
simpler test implementations.
Direct control
The first method that comes to mind should be the easiest solution. Using direct
control, such as a generic PID controller, it should be very easy to keep the robot’s
behavior stable at some certain level. Several versions of this were considered and
tried.
The first thought was to keep the robot immobile initially, measure what distances
the human would choose for himself, and then when measurements were completed,
set the target value for a PID controller to be this distance. This is very fast and
seems to generate an acceptable distance-keeping behavior. It does not, however,
solve the problems of gaze-meeting or movement speeds. Also, it does not address
the problem that humans tend to prefer farther distances for moving robots than
for stationary ones, or that preferences tend to change with time. When a system
of this type was implemented and tried, it tended to result in the human chasing or
being chased by the robot.
Another approach to direct control is to let the robot continuously read the human’s
response and adapt its behavior accordingly. Thus, if the system notices that the
human tends to move towards or away from the robot, the target distance is changed
accordingly. There is however an inherent problem with this approach. If the change
is too rapid, the robot will stand still and let the human perform all the movement.
This will not make interaction uncomfortable, but the robot will be a passive part
in the interaction process.
Slowing down the recalibration of distances will let the robot move more actively, but
the slower the recalibration, the stronger the tendency to chase becomes. When the
human takes a step away from the robot, the robot will tend to follow. It might be a
more natural movement if the robot aids the human by also stepping away. Another
possibility would therefore be to let the robot mirror the human movement by letting
the robot keep the same distance from a given point in the room as the human. The
centroid of the human-robot system would thus be locked in place in the room. No
literature supports this kind of behavior as normal in human interaction, however,
and when tested, neither of these systems was appreciated by a fair number of
subjects.
Reinforcement learning
The results of the above-mentioned approaches suggest that there is no easy solution
to the problem. Keeping a specific distance, a specific position or specific centroid
all seem to work in certain cases, but the third requirement is not met, as different
subjects tended to approve of different behavioral systems. There was no obvious
correlation between measurable quantities and it seemed difficult to construct a
model for interactive behaviors.
As in many cases where the model is unknown, reinforcement learning (RL) seems
to be a viable approach. If we can specify criteria for successful interaction, this can
be used as a reinforcement signal, and the system should be able to learn a separate
acceptable behavior for each individual subject.
Since the main goal is to construct a system that makes interaction as smooth and
comfortable as possible, comfort or discomfort in the human subject should be a
good reinforcement signal. As stated in 2.1, two major signals used by humans
to signal discomfort in interaction are repositioning and gaze-averting. A system
that minimizes the amount of repositioning and gaze-averting of the human subject
should therefore also minimize the underlying discomfort.
Some RL methods, like dynamic programming, require a complete knowledge of the
dynamics of the model, and are therefore difficult to apply to this problem. Other
methods, like TD(λ), SARSA, or Q-learning therefore seem more appropriate. The
common versions of these methods, as described by Kaelbling and Moore [20], and
Sutton and Barto [41], use a discrete state space with discrete action alternatives.
The problem investigated in this report is not discrete in essence, but most parameters involved can easily be discretized to appropriate intervals. If these intervals
are made small enough, this should not cause any problems. Otherwise, function
approximators could be used, as described by e.g. Smart and Kaelbling [37].
The two major inherent problems with these RL methods are that they need to
explore low-scoring routes through the state space in order to find optimal routes,
and that they tend to need a high number of iterations to converge. Even simple
gridworld problems typically need thousands, if not tens of thousands, of iterations
to converge. Since actions and reactions in robot-human interaction typically are on
the scale of seconds, a system based on this approach would literally take hours to
converge at best.
These theoretical problems were confirmed by an implementation of a simpler Q-learning system. It could adapt distances well in a very simplified setting with only
one type of action, but it required several hours to converge even when the human
subject behaved optimally. The problem of the system exploring poor policies is
addressed by Smart and Kaelbling [38], who suggest that during the first phase of
learning, another, acceptable, system is in control of the robot while the learning
system simply observes. This was tried in the context of this study by implementing
a version where the Q-table was initialized to values that made the robot move
towards roughly acceptable distances. This showed promising performance, but the
adaptation to new situations was still much too slow, taking 10–20 minutes even in
the most simplified case.
Some studies have been done to speed up these RL methods. Shapiro et al. [35] show
how background knowledge can be used to decrease the number of iterations needed
for learning by limiting the state-action pairs to avoid those that the designer knows
to be wrong. This would only require a limited knowledge of the interaction model.
Other authors, like Drummond [10] or Sutton et al. [43], propose different ways to
use temporal or subtask grouping to speed up learning. None of these, however,
show results that are nearly fast enough for real-time adaptation.
Another problem with methods that learn action values is that they, according to
Aberdeen [1], perform poorly in non-deterministic cases. The suggested solution to
this problem is to use methods that directly learn a policy. Others, like Baxter [4],
also support this choice by explaining that policy methods outperform action value
methods in the partially observable case. Since human behavior observed by a robot
at best can be described as a partially observable stochastic process, the choice of a
policy learning method seems well supported.
A method that combines the real-time speed of direct control and the flexibility of RL
is policy gradient reinforcement learning, PGRL. Tedrake et al. [45] show that PGRL
is fast enough for on-line adaptation to changing environments when implemented to
learn a gait for a bipedal robot. Grudic et al. [13] state that PGRL is fast and stable
in solving problems where other RL methods do not reach convergence for millions
of iterations. Kohl and Stone [26] show how PGRL—if started close enough to
the optimum—does not explore low-scoring alternatives to any higher extent. Thus,
PGRL seems to be an appropriate method for this problem, as practical experiments
later will support.
3.3 PGRL
This section gives a more detailed description of the policy gradient reinforcement
learning algorithm. The general ideas as well as a pseudocode version are presented.
3.3.1 General Ideas
The main idea in PGRL, as described by Kohl and Stone [26], Aberdeen [1], Baxter [4], Grudic et al. [13], Sutton et al. [42], or Tedrake et al. [45], is to parameterize
the system behavior, calculate an approximation of the gradient of the reward function in this parameter space, and then ascend to the local optimum. The algorithm
is started at a given point in the space. The reward function is sampled at random
points in the vicinity of the original point, and the gradient is calculated from these
samples. The system then chooses a new point by moving in the direction of the
gradient, and the process is repeated.
3.3.2 The algorithm
Figure 3.1 shows the algorithm in pseudocode. This description follows the description given by Kohl and Stone [26], with the exception of the second to last step (line
20), which has been added here to get correct scaling of the gradient. The current
policy is parameterized into a parameter vector Θ, and a total of T perturbations
R_i of Θ are generated by randomly adding ε_j, 0, or −ε_j to each element θ_j in Θ.
The step sizes ε_j are set independently for each parameter.
For each parameter set R_i, the system is run and the reward function evaluated.
When all T sets have been run, the gradient A of the reward function in the parameter space is approximated by calculating the partial derivatives for each parameter. This is done by in turn calculating the average reward for the perturbed
and unperturbed cases. The value Avg+,j is the average reward obtained in the
cases where ε_j was added to parameter j, Avg−,j is the average reward when ε_j
was subtracted from parameter j, and Avg0,j is the average reward obtained in the
cases where parameter j was left unperturbed.
The gradient in dimension j, a_j, is then regarded as 0 if the reward is greatest for the
unperturbed parameter, and regarded as the difference between the average rewards
for the perturbed parameters, Avg+,j − Avg−,j, otherwise. When the gradient A
 1  Θ ← initial parameter set vector of size n
 2  ε ← parameter step size vector of size n
 3  η ← overall step size
 4  while (not done)
 5      for i = 1 to T
 6          for j = 1 to n
 7              r ← unbiased random choice from {−1, 0, 1}
 8              R_i,j ← Θ_j + ε_j ∗ r, where R_i is a perturbed parameter set of the same size as Θ
 9      for i = 1 to T
10          run system using parameter set R_i, evaluate rewards
11      for j = 1 to n
12          Avg+,j ← average reward for all R_i with positive perturbation in dimension j
13          Avg0,j ← average reward for all R_i with zero perturbation in dimension j
14          Avg−,j ← average reward for all R_i with negative perturbation in dimension j
15          if (Avg0,j > Avg+,j) AND (Avg0,j > Avg−,j)
16              a_j ← 0
17          else
18              a_j ← (Avg+,j − Avg−,j)
19      A ← (A / |A|) ∗ η
20      a_j ← a_j ∗ ε_j , ∀ j
21      Θ ← Θ + A

Figure 3.1. PGRL Algorithm
has thus been calculated for all dimensions, it is normalized to the overall step size η
and scaled by the individual step sizes ε_j in each dimension. The parameter set Θ is then
adjusted by adding A.
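For concreteness, one iteration of the algorithm in Figure 3.1 can be sketched in Python. This is an illustrative sketch only, not the implementation used on the robot (which was written in C, see Chapter 4); run_and_evaluate stands in for executing one behavior with a parameter set and measuring the reward, and the sketch assumes that each perturbation direction is sampled at least once per dimension (Section 4.2 describes how missing cases were handled in the implementation).

    import numpy as np

    def pgrl_iteration(theta, eps, eta, T, run_and_evaluate):
        # theta: current parameter vector Theta (size n)
        # eps:   per-parameter step sizes epsilon_j (size n)
        # eta:   overall step size
        # T:     number of perturbed parameter sets per iteration
        theta = np.asarray(theta, dtype=float)
        eps = np.asarray(eps, dtype=float)
        n = len(theta)
        # Lines 5-8: generate T perturbations by adding -eps, 0 or +eps to each parameter.
        signs = np.random.randint(-1, 2, size=(T, n))
        R = theta + signs * eps
        # Lines 9-10: run the system once per perturbed parameter set and record the reward.
        rewards = np.array([run_and_evaluate(R[i]) for i in range(T)])
        # Lines 11-18: approximate the partial derivative of the reward in each dimension.
        a = np.zeros(n)
        for j in range(n):
            avg_plus = rewards[signs[:, j] == 1].mean()
            avg_zero = rewards[signs[:, j] == 0].mean()
            avg_minus = rewards[signs[:, j] == -1].mean()
            if avg_zero > avg_plus and avg_zero > avg_minus:
                a[j] = 0.0
            else:
                a[j] = avg_plus - avg_minus
        # Lines 19-21: normalize to the overall step size, rescale per dimension, and ascend.
        norm = np.linalg.norm(a)
        if norm > 0.0:
            a = a / norm * eta
        return theta + a * eps

With the settings later used in Chapter 4 (T = 10 and an average behavior length of roughly 10 seconds), one call to such a function corresponds to roughly 100 seconds of interaction.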
3.3.3 Properties
One of the main weaknesses of the PGRL algorithm is that it is not suited for
finding an explicit behavior. What it does find is a locally optimal parameter set
for a given parameterization of a behavior. This requires that some general outline
of the desired behavior is known beforehand.
On the other hand, if such a parameterization can be found, there are several advantages to this method as compared to other RL methods. One of these is that it
is normally considerably faster in finding the optimal behavior. Kohl and Stone [26]
show optimization within 10 to 15 iterations for a problem with a 12-dimensional
search space. Given that T is approximately 10, this gives convergence on the order
of 100–150 evaluations, which is feasible for a robot system.
The PGRL algorithm will, as stated, only find local optima. Local will in this case
mean on the scale of the step length ε × η. As shown in Section 2.1, however, it is probable
that local optima will be global optima for most factors involved in interaction.
Thus, it is probable that PGRL will be able to find global optima, and thus
result in optimal behavior for a specific human subject.
A strength of the PGRL algorithm is that since it does gradient ascent on the reward
function, it should continuously improve. Thus, if started at somewhat acceptable
values, a system based on this algorithm should not explore unacceptable behaviors.
While evaluating the gradient, the system does, however, need to explore in all directions, including the suboptimal ones. The step size will therefore need to be made
sufficiently small so that these suboptimal tries do not explore significantly suboptimal parameter settings. The step size will at the same time need to be sufficiently
large so that the algorithm does not get lost in local optima caused by random
noise in the reward function.
3.4 Summary
The problem can be restated as designing a system that enables a robot to keep
appropriate distances, speeds, and gaze-meeting frequencies for different people with
different preferences. The system should also adapt itself to different situations and
interaction types.
Since the task requires adaptation to a stochastic process with unknown dynamics,
reinforcement learning methods were considered. The reinforcement signal should
be set so as to minimize discomfort signals from the human, i.e. movement and gaze-averting, as this should result in a behavior that also minimizes the perceived discomfort.
Policy gradient reinforcement learning (PGRL) was chosen as it offers a good balance
between adaptivity and convergence speed. The chosen algorithm works by searching
through a parameter space for higher reward values. An initial parameter set is
generated, and a random set of slight perturbations to this set are tried in order
to sample the reward function. These samples are used to estimate the gradient of
the reward function, and the parameters are adjusted in order to perform a gradient
ascent. This is iterated indefinetely or until some predefined condition is fulfilled.
Chapter 4
Implementation
This chapter describes the implementation of the behavior adaptation system. The
hardware and software platforms used are also presented briefly.
4.1 The behavior adaptation system
The final implementation of the system uses a parameterized behavior that is
optimized using PGRL. Using this approach means that the adaptation system can
be applied to a currently existing behavior program.
4.1.1 Behavior parameterization
Keeping the number of parameters low will decrease the size of the search space, thus
giving faster adaptation. Too few parameters, however, will result in a system that
is not flexible enough. The exact number and type of parameters that result in the
optimal system is difficult to find without thorough testing, so the choice is heuristic
in this report. Two rational ways of finding a good parametrization are either to
start with all conceivable parameters and gradually decrease the number, or to start
with the fewest possible number and gradually increase the number of parameters.
In this report, the choice falls on the latter method, with the heuristic motivation
that it is reasonable to start with the simplest possible system as a proof of concept
before making it more complex. This approach should minimize the sources of error,
making implementation simpler and faster.
The factors that need to be simultaneously adapted are, as stated in section 2.1,
distances, gaze-meeting, and motion speed. Since the distances are also dependent
on the type of interaction, there is a need for at least three distance parameters,
pertaining to the intimate, personal, and social distances. The public distance is not
considered, as this study is limited to interaction, and the public distance is for non-interactive behaviors. The correlations between behavior type and gaze-meeting and
speed are much weaker, so it is possible that the simplest version only needs a single
parameter for each of these. The decision finally fell on having two parameters
for speed, as no simple heuristic was found to make the decision on whether the
actual speed of motions or the time waited between taking different actions was
more appropriate. Therefore, both of these were chosen, resulting in a total of six
parameters (see Table 4.1).
Table 4.1. The parameters used in the adaptive system

• Intimate distance: the distance in cm that the robot's P controller tries to
  maintain during intimate behavior.
• Personal distance: the distance in cm that the robot's P controller tries to
  maintain during personal behavior.
• Social distance: the distance in cm that the robot's P controller tries to
  maintain during social behavior.
• Gaze ratio: the proportion of the time the robot meets the human's gaze.
• Waiting time: the time in seconds the robot waits between speech and action.
• Speed factor: a factor by which the robot's default speed for executing movements
  is multiplied.
The distance parameters are fed to a P controller (a proportional controller that sets the control signal proportional to the deviation from the target value) that chooses the parameter that
corresponds to the current action type. This controller is limited in a number of
ways. First, there is a lower limit to distance. The robot will never be allowed to
get closer to the human than 15 cm. This is to avoid collisions. Second, there are
upper limits to speed and acceleration, also set for safety reasons. If the speed is too
high, there is a risk that the robot is not able to stop in time when coming close to
the human, and if the acceleration is set too high, there is a risk of the robot falling
over.
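A minimal sketch of such a limited proportional controller is given below. Only the 15 cm minimum distance and the 30 Hz control rate (see Section 4.3) come from the text; the gain, the speed limit, and the acceleration limit are illustrative placeholder values, and the function name is hypothetical.

    def distance_control(distance, target_distance, current_speed,
                         kp=1.0, min_distance=0.15, max_speed=0.5,
                         max_accel=0.3, dt=1.0 / 30.0):
        # distance and target_distance in metres, speeds in m/s.
        # Never aim closer than 15 cm, to avoid collisions.
        target = max(target_distance, min_distance)
        # P term: command a speed proportional to the distance error
        # (positive means moving towards the human).
        desired = kp * (distance - target)
        # Safety limit on speed.
        desired = max(-max_speed, min(max_speed, desired))
        # Safety limit on acceleration, relative to the current speed.
        step = max_accel * dt
        return max(current_speed - step, min(current_speed + step, desired))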
The gaze-meeting ratio is based on a study by Duncan and Fiske [19] which shows
that the average time of the cycle of gazing and gaze-averting in normal human
interaction is approximately 5 seconds. The adaptation system is set so that the
time the robot takes to complete a cycle is drawn from a uniform random distribution ranging
from 0 to 10 seconds. The gaze ratio parameter controls what portion of each cycle
is spent meeting gaze.
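In code form, one such gaze cycle could be scheduled roughly as follows (an illustrative sketch; the function name is hypothetical).

    import random

    def next_gaze_cycle(gaze_ratio):
        # One gaze cycle: its length is drawn uniformly from 0-10 seconds, and
        # gaze_ratio (the adapted parameter) decides how much of it is spent
        # meeting the human's gaze.
        cycle_length = random.uniform(0.0, 10.0)
        return gaze_ratio * cycle_length, (1.0 - gaze_ratio) * cycle_length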
The waiting time parameter controls a waiting loop that is inserted into the robot’s
behavior modules (see section 4.3). Most robot behaviors used in this study consist
of an utterance from the robot followed by a motion. The wait is inserted between
these. For example, for the hug module, the parameter would control the time
between the robot uttering the phrase “please hug me!” and the robot extending its
open arms toward the human.
The speed parameter is a multiplicative factor that controls the motion execution
speed. Thus, a speed factor of for example 1.1 would result in a movement that is
10% faster than the default speed. The default speeds for the motions are the speeds
that the motion designer found to be appropriate, which means that they are not
originally fitted to personal preferences, but that all actions have similar speeds.
4.2 Algorithm
A general description of the PGRL algorithm can be found in section 3.3.2. Before
the algorithm was implemented on the robot, a simple MATLAB simulation was used
to try different parameter settings. When the simulation gave satisfying results, the
algorithm was implemented on the robot, where some further testing gave the final
parameter settings.
In order to achieve as fast iterations as possible, each evaluation should be as short
as possible. It was decided that the shortest possible unit to use for one evaluation
would be one module (see section 4.3). Using shorter units than this would not allow
for measurable reactions from the human. Also, since the gaze cycle is approximately
the same length as a module execution, more frequently measured reactions would
not render meaningful evaluations.
The number of different parameter sets evaluated in each iteration, T , was set to
10. The more sets evaluated, the better the approximation of the gradient, but also
the longer the time for each iteration. 10 sets seemed to give the smallest total
number of evaluations needed for convergence. As the average execution time for a
behavioral module is 10 seconds, this means that an iteration of the algorithm took
1 minute and 40 seconds on average.
Since the perturbations are chosen at random, there is always a certain risk that,
for a particular parameter j in a particular iteration, one perturbation (i.e. +ε_j,
−ε_j, or 0) is not tried. In the case that the unperturbed case is not evaluated,
the gradient is calculated as if the unperturbed case would have given the lowest
score. In the case that one of the perturbed options is missing, the gradient in that
dimension is set to 0 if the unperturbed case gives the highest value, and it is set to
half the difference between the perturbed and unperturbed case otherwise.
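One possible reading of these fallback rules, in the same illustrative Python form as the sketch in Chapter 3, is given below. The sign convention for the half-difference case is an interpretation, since the text does not spell it out; the actual handling was part of the C implementation.

    def gradient_component(avg_plus, avg_zero, avg_minus):
        # Each argument is the average reward for the +eps, unperturbed and -eps
        # cases in one dimension, or None if that case was not sampled this iteration.
        if avg_zero is None:
            # Unperturbed case missing: treat it as if it had given the lowest score.
            return (avg_plus if avg_plus is not None else 0.0) - \
                   (avg_minus if avg_minus is not None else 0.0)
        if avg_plus is None or avg_minus is None:
            # One perturbed option missing: zero if the unperturbed case is best,
            # otherwise half the difference between the perturbed and unperturbed case.
            present = avg_plus if avg_plus is not None else avg_minus
            if avg_zero >= present:
                return 0.0
            direction = 1.0 if avg_plus is not None else -1.0
            return direction * 0.5 * (present - avg_zero)
        # All three cases sampled: the ordinary rule from Figure 3.1.
        if avg_zero > avg_plus and avg_zero > avg_minus:
            return 0.0
        return avg_plus - avg_minus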
In concrete terms, the algorithm was implemented on a device driver level, written
in C. This is because all movements of the robot are controlled by device drivers for the
hardware. The existing device drivers were lightly modified to become parametrized
according to the above description. A separate PGRL controller was then implemented that measured the reward function and adjusted the parameters accordingly.
4.2.1 Reward Function
The reward function was designed to minimize human discomfort. Since it is virtually impossible to measure this directly in real-time, some measurable quantity
that correlates well is needed. As is explained in Section 2.1, humans tend to show
discomfort in face-to-face interaction by readjusting position or averting gaze. There
are no studies that show which of these is dominant or to what extent. Therefore,
in this report these two factors are taken to be of equal importance. This heuristic
seems to work reasonably well, as will be shown in Section 6.1, but can of course be
improved.
In the formulation of this implementation, the gazing factor was calculated as the
percentage of time that the subject’s face was turned towards the robot. Using face
direction and not actual eye direction is motivated by the fact that gaze and face
orientation tend to correlate well, as most people will turn their head to follow gaze.
Also, face direction is much easier to measure accurately than eye direction.
The movement was measured as the translation of the subject’s forehead in the
horizontal plane. Head movements were chosen both on the grounds of being
easily measured and because they include the small fidgeting movements that tend to
signal discomfort. The movement measure was first filtered with a low-pass filter in
order to get rid of high-frequency noise. This means that only movements on the
scale of 5 Hz or slower were considered. This value was given a negative weight for
the reward signal. For a schematic of the reward function, see Figure 4.1.
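A sketch of how such a reward could be evaluated for one behavior module is given below (Python, illustrative only). The relative weighting of the two terms and the exact filter used in the actual implementation are not specified in the text; the first-order low-pass filter and movement_weight below are assumptions.

    import numpy as np

    def module_reward(gazing, head_positions, dt, movement_weight=1.0, cutoff_hz=5.0):
        # gazing: one boolean per sample, True when the subject's face is turned
        #         towards the robot (see Section 4.4).
        # head_positions: one (x, y) forehead position per sample, in metres.
        # dt: time between samples in seconds.
        gaze_ratio = float(np.mean(gazing))
        positions = np.asarray(head_positions, dtype=float)
        # Simple first-order low-pass filter to suppress components faster than ~5 Hz.
        alpha = dt / (dt + 1.0 / (2.0 * np.pi * cutoff_hz))
        filtered = positions.copy()
        for k in range(1, len(filtered)):
            filtered[k] = filtered[k - 1] + alpha * (positions[k] - filtered[k - 1])
        # Total horizontal translation of the (filtered) forehead position.
        movement = float(np.sum(np.linalg.norm(np.diff(filtered, axis=0), axis=1)))
        # Gazing is rewarded, movement is penalized.
        return gaze_ratio - movement_weight * movement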
4.3 Robot platform
For the purpose of this study, the humanoid communication robot RobovieII was
used. This robot was developed at ATR, and it is thoroughly described by Kanda
et al. [24] [22]. A short description is given here.
RobovieII is a humanoid robot standing approximately 120 cm tall (see figure 4.2).
It is designed to be smaller than the humans it interacts with in order to appear less
threatening. It is mounted on a Pioneer2 base that uses differential drive on the two
main wheels and is statically balanced by a third caster wheel. The robot’s arms
are mounted with four degrees of freedom, enabling most arm motions possible for a
human. The “hands” are rudimentary half-spheres with no motion capabilities, but
the wrists are spring-loaded and can be moved by an outer force.

Figure 4.1. Reward function

Figure 4.2. The interactive robot RobovieII

The head is mounted
with three degrees of freedom, enabling most human head movements. The eyes are
mounted with two degrees of freedom each and can mimic human eye movements,
but cannot blink.
The robot’s torso, arms and head are equipped with a total of 16 binary touch
sensors that can register touch or forced bending of the wrists. The robot’s head
is fitted with a speaker and a microphone for spoken communication. The eyes are
each fitted with a video camera, and there is an omnidirectional camera mounted on
a short metal bar protruding from the robot’s back to approximately 25 cm above
the head. There are 24 sonar range finders mounted on the robot's base and chest.
The robot is fitted with a main control computer (a 900MHz PC) and a secondary
vision and speech processing computer (a 2400MHz PC), both running Red Hat
Linux. The main control program runs at 30 Hz. The secondary computer runs
programs for real-time face and eye detection. The robot communicates with the
outside world using a normal ethernet connection. The onboard processing power
is more than adequate to run the relatively light PGRL algorithm simultaneously
with the main control programs.
The behavior control platform on RobovieII is implemented as a set of situated modules. Each module consists of a set of one or several utterances and actions that
together form a single behavior. For the purpose of this experiment, a subset of
behaviors that can be performed autonomously, i.e. do not require special circumstances like the presence of certain objects or situations, and that are interactive
were chosen. These were classified into three groups according to their interaction
type (see Section 2.1). A description of the modules used and their classification is
found in Table 4.2. Photos of the robot performing the different actions are found
in Figure 4.3.
Figure 4.3. RobovieII and the behaviors
The classification was done using a simpler prestudy where eight subjects were exposed to the behaviors and asked to interact at a preferred distance. These distances
were measured, and the behaviors grouped accordingly. In normal human interaction, casual conversation is commonly classed as social, but the prestudy showed
that most subjects preferred a closer distance, equaling that of the touch-based
behaviors found in the personal group. This is mainly due to limitations of the
robot's speech recognition system, which requires subjects to move closer in order
to be correctly understood by the robot.

Table 4.2. Behavior Classes

• Hug (intimate): The robot says “please hug me” and extends its open arms. If a
  person is detected directly in front of the robot, the arms are closed in a slow
  embrace.
• Handshake (personal): The robot says “let's shake hands” and extends its right
  arm. It waits for a certain time or until a handshake has been registered by the
  wrist sensor, then lowers the arm.
• Ask where from (personal): The robot asks “where are you from?” and awaits a
  reply. If the answer is recognized, the robot comments on the distance; if not,
  the robot complains that it could not understand.
• Ask if cute (personal): The robot asks “do you think I'm cute?” and comments on
  the answer. It responds positively to touching on the head.
• Ask for touch (personal): The robot says “please touch me” and responds to
  different kinds of touch.
• Paper-scissors-stone (social): The robot asks the subject to play the game of
  paper-scissors-stone. It extends its right arm after a short wait. The robot
  recognizes whether the person played or not, but not the outcome of the game.
• Pointing game (social): The robot says “let's play the pointing game” and asks
  the subject to point in a certain direction. The robot then tries to look in the
  opposite direction. (Note: this is a well-known Japanese children's game.)
• Exercise (social): The robot says “let's exercise” and demonstrates an
  arm-swinging motion.
• Monologue (social): The robot holds a short monologue, thanking the subject for
  participating in the games, and asks the subject to play some more.
• Just looking (social): The robot does not do anything in particular, but the
  gaze-meeting and distance-keeping systems are run as usual, resulting in the
  robot silently observing the subject at a certain distance.
The original version of the RobovieII software uses an advanced system that chooses
different modules in a manner that aims to emulate relationships developing over
time. Since this system was to be tested in shorter runs (see Section 5.2), another
simpler system had to be implemented. This simpler system chose modules at
random, with the exception that the same module was never executed twice in
succession.
4.4 Sensing and measurements
The first attempted implementation used the robot's onboard sensors. These proved not to be accurate enough: there was a tendency for the systems to lose track of the subject for periods of up to 2 seconds. Different interpolation and filtering techniques were tried in order to solve this, but the robot was still left with very slow reactions and no capability to accurately detect smaller movements and shorter gaze aversions.
This problem was solved by using external sensors. A 12-camera motion capture system running on an external workstation was used. This system is, when properly calibrated, capable of making accurate millimeter-precision measurements at 120 Hz. A real-time identification program was also engaged, so that correctly labeled numerical position data was transmitted directly to the robot. A total of 23 points on the robot and the human subject were tracked in this manner. Most points were not directly used by the robot, but tracking several points allows the system to interpolate the positions of temporarily occluded points. The positioning data was forwarded to the robot over the Ethernet connection. The connection lag was logged and never exceeded 100 ms, resulting in real-time sensing (the lag exceeded 33 ms on only two occasions).
The positions of both human and robot were defined in two dimensions as the x and y coordinates of the forehead. A total of four points measured on the head enabled accurate measurement of the direction in which the head was turned. The angle between the vector from the subject to the robot and the vector pointing in the direction of the subject's face was used to determine gazing. If the absolute value of this angle was less than 10 degrees (Figure 4.4), the subject was considered to be gazing at the robot.
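A minimal sketch of this gaze test, assuming 2-D forehead positions and a face-direction vector derived from the four head markers (all names are illustrative, not the actual implementation):

import numpy as np

GAZE_THRESHOLD_DEG = 10.0   # angular threshold used in this study

def is_gazing_at_robot(subject_xy, robot_xy, face_direction_xy):
    # subject_xy, robot_xy: 2-D forehead positions from the motion capture data.
    # face_direction_xy: 2-D vector in the direction the subject's face points.
    to_robot = np.asarray(robot_xy, float) - np.asarray(subject_xy, float)
    to_robot /= np.linalg.norm(to_robot)
    face = np.asarray(face_direction_xy, float)
    face /= np.linalg.norm(face)
    # angle between the facing direction and the subject-to-robot vector
    angle = np.degrees(np.arccos(np.clip(np.dot(face, to_robot), -1.0, 1.0)))
    return abs(angle) < GAZE_THRESHOLD_DEG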
The robot’s normal face-tracking software was replaced with a function that used
this motion capture position data instead, resulting in very accurate face-tracking.
Figure 4.4. The angular interval determined as human gazing at robot
4.5 Summary
The PGRL algorithm was implemented on a humanoid robot whose behavior was parametrized by six parameters: distances for the three different interaction types, two speed parameters, and a gaze-meeting factor. The reward function used to evaluate the parameter settings was defined to give positive rewards when the human counterpart was gazing at the robot, and negative rewards for human locomotion, as movement and gaze aversion are signs of discomfort in human interaction.
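A minimal sketch of a reward of this kind, assuming it combines the fraction of the module's duration the subject spent gazing (positive term) with the distance the subject repositioned (negative term); the weights are illustrative, as the exact formula is not restated here:

W_GAZE = 1.0    # weight of the gazing term (illustrative)
W_MOVE = 0.1    # weight of the locomotion penalty, per cm (illustrative)

def reward(gaze_fraction, locomotion_cm):
    # gaze_fraction: share of the module's duration the subject gazed at the robot.
    # locomotion_cm: distance the subject repositioned during the module.
    return W_GAZE * gaze_fraction - W_MOVE * locomotion_cm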
The humanoid robot performs different interactive behaviors, so-called modules, that are chosen at random. After each module has been performed, the reward function is evaluated. After every 10 modules, the gradient is estimated and the parameters are adjusted accordingly. All sensing and measurement is done with a motion capture system on an external computer.
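A simplified sketch of one such update batch, in the spirit of the finite-difference policy gradient methods cited in this report (e.g. Kohl and Stone [26]); the details below are illustrative and not the exact implementation:

import random

def average(values):
    return sum(values) / len(values) if values else 0.0

def pgrl_batch(theta, step_sizes, run_module_and_get_reward, batch_size=10):
    # One PGRL batch: execute batch_size modules, each with every parameter
    # randomly perturbed by -step, 0 or +step, then move each parameter one
    # step towards the perturbation direction that gave the higher reward.
    perturbations, rewards = [], []
    for _ in range(batch_size):
        eps = [random.choice((-1, 0, 1)) for _ in theta]
        candidate = [t + e * s for t, e, s in zip(theta, eps, step_sizes)]
        perturbations.append(eps)
        rewards.append(run_module_and_get_reward(candidate))

    new_theta = []
    for i, (t, s) in enumerate(zip(theta, step_sizes)):
        plus = average([r for e, r in zip(perturbations, rewards) if e[i] > 0])
        zero = average([r for e, r in zip(perturbations, rewards) if e[i] == 0])
        minus = average([r for e, r in zip(perturbations, rewards) if e[i] < 0])
        if zero >= max(plus, minus):     # leaving the parameter unchanged scored best
            new_theta.append(t)
        elif plus > minus:               # higher reward with a positive perturbation
            new_theta.append(t + s)
        else:
            new_theta.append(t - s)
    return new_theta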
Chapter 5
Experiment
This chapter describes the design and execution of the experiments used to evaluate
the system.
5.1 Experimental goals
The goal of the experiment was to evaluate the system. A first, simpler prestudy was designed as a proof of concept and, if the system seemed promising, to find appropriate parameter settings. A second, more rigorous experiment was then designed for a more thorough evaluation. The questions that this second experiment was to answer were:
• Does the system provide a good adaptation to personal preferences?
• Does the system adapt well to different people?
• What are the inherent strengths of the system?
• What are the inherent weaknesses of the system?
• What modifications should be done to the system to enhance performance?
There are some important questions that will be left unanswered with this approach,
like whether the adaptive system actually makes interaction easier. This question is
not targeted since it is of a more behavioral sciences nature, and hence lies outside
the scope of this study.
5.2 Experimental design
In the first prestudy, used as a proof of concept, eight subjects were asked to interact with the robot in an open space for approximately 15 minutes. The values of the adapted parameters were logged and checked to see whether they seemed to converge. The subjects were asked if they thought that the robot's behavior improved during the experimental run. Most subjects were content with the behavior, and the system was deemed promising enough to warrant a more thorough study.
The first study showed that convergence seemed to take approximately 10–15 minutes,
so the second experiment needed to be longer in order to ensure that ample time
was given for convergence. Three subjects were asked to interact with the robot for
as long as possible, and the upper limit without getting tired or bored enough to
hamper performance was 30–40 minutes. Thus the length of the experiment was set
to 30 minutes.
In order to get a wide range of subjects, a total of 16 subjects were asked to take part in the experiment. None of the subjects from the prestudy were used, so all subjects were new to the system. All subjects except one were Japanese, and all could understand the utterances made by the robot. The subjects were in the age span of 20–35, with the majority being 20–25 years old. Six of the subjects were female, and of these, five were non-technical staff working at ATR. All the remaining subjects were technical staff at ATR. Both technical and non-technical staff members had some prior experience with robots in general and RobovieII in particular.
5.2.1 Experimental setup
Since a motion capture system was used for all the sensing, the experiment had
to be carried out in a special room. The cameras had a limited view of the room,
so all interaction had to take place in a 3.5 m × 4.5 m space in the middle of the
room. Markers showing these limits on the floor were pointed out to the subjects.
The setup can be seen in Figure 5.1.
The robot was connected to the motion capture system using a network cable, since
a wireless connection had proved to drop too many packets to be reliable. A person
was designated to hold this cable away from the subject and the robot’s wheels.
The PGRL algorithm was primed with parameter values in the general vicinity of what were thought to be acceptable values, based on the prestudy. The values were deliberately set close to good values, but not exactly at the average of the converged values from the prestudy. This choice was made in order to see whether the system could find the personal preferences even when these differed from the anticipated values. The initial parameter values and the stepsizes used can be found in Table 5.1.

Figure 5.1. Experimental setup

Table 5.1. Parameter values and stepsizes

#   Parameter           Initial value   Step size
1   intimate distance   50 cm           15 cm
2   personal distance   80 cm           15 cm
3   social distance     100 cm          15 cm
4   gazing ratio        0.6             0.1
5   waiting time        0.17 s          0.3 s
6   speed factor        1.0             0.1
5.2.2 Experimental procedure
The subjects were asked to play with the robot in a relaxed manner for 30 minutes,
and to try to stay within the boundaries of the motion capture system. They were
told that the experiment was a study in human-robot interaction, but were not told
any details of the system. Before the experiment could start, a 5 minute motion
capture calibration session was conducted.
If a subject left the boundaries, he or she was instructed to re-enter. Apart from
this, no instructions were given during the experimental run.
5.3 Conducted measurements
For the purpose of evaluating the performance of the system, measurements of both the robot's behavior and the subjects' responses were taken. Subjective observations were also made by an observer in order to capture data that cannot easily be measured, such as interaction patterns, subject reactions, and perceived quality of performance.
5.3.1 Logged data
During the experimental runs, all internal states of the robot were continuously
logged. These data include position, speed, and active behavior of the robot, as well
as the states of all the robot’s onboard sensors, logged once every iteration of the
main control algorithm, i.e. once every 0.033 seconds. All variables concerning the
learning algorithm were also logged in order to be able to reconstruct every part of
every learning session afterwards.
Complete logs of all motion capture data were also made, so that all positions of both robot and human subject can be reconstructed to millimeter precision with a temporal resolution of 30 Hz. All measurements were time-stamped with a sync signal generated by the robot, enabling cross-referencing between different measurements in later reconstructions.
In addition to these numerical measurements, the experiments were observed and
commented by an observer, and recorded on video in order to allow for a more
qualitative or subjective analysis.
5.3.2 Postexperimental measurements
After each experimental run, the subjects were asked their impression of the robot’s
movements and general behavior. They were given a questionnaire where they were
asked to record their first impressions of the robot, as well as how their impressions changed during the experiment and what their final impressions were. They
were asked to make special note of how they perceived the robot’s movements, its
distances and the way it met their gaze.
More detailed measurements followed, in which the subjects were asked to stand
in front of the robot, at the distance they felt was the most comfortable while the
robot executed one representative action for each of the three interaction distances
studied: intimate, personal, and social. The indicated distance was measured using
the same motion capture system as was used in the earlier experiment.
Since most people do not have a single distance that is the only comfortable one, but
rather have an allowance for a preferred distance interval, further measurements were
conducted in order to find this interval. For each of the representative actions, each subject was also asked to stand as close to the robot as possible without the interaction becoming awkward or uncomfortable, and as far away as possible without feeling discomfort.
The preferences for the remaining three parameters of the adaptation system (gaze-meeting, waiting, and speed) had to be measured differently. For these, each subject was shown the robot's behavior performed in turn with three different values (one low, one average, and one high) for each of the parameters, while the parameters not being tested were set to default values. Table 5.2 shows the values used for each parameter in this test.
Table 5.2. Parameter values used to measure subject preferences.

test #   gazing   waiting   speed
1a       1.0      0.5 s     1.0
1b       0.75     0.5 s     1.0
1c       0.5      0.5 s     1.0
2a       0.75     0 s       1.0
2b       0.75     0.5 s     1.0
2c       0.75     1.0 s     1.0
3a       0.75     0.5 s     1.2
3b       0.75     0.5 s     1.0
3c       0.75     0.5 s     0.8
For each set (1, 2, and 3), the subjects were asked to indicate which parameter
setting (a, b, or c) was preferable. There was a possibility to indicate multiple
settings within each set.
5.4 Summary
An experiment was designed to evaluate the implementation. In the experiment,
a total of 16 subjects were asked to interact freely with the robot for 30 minutes.
The system was continuously monitored during the experiment, and all relevant data was logged. After the 30 minutes had passed, the subjects were asked to indicate their actual preferences for the interaction parameters, and these were compared to the values that the system had learned. The subjects were also asked to give their subjective impressions of the system's performance.
Chapter 6
Results
This chapter presents and discusses the results obtained in the experiment.
6.1 Measured results
Out of the total of 16 subjects used in the experiment, only 12 produced usable
results. The runs that failed did so for different reasons. There were two runs that
failed due to technical problems. One of these failed when the Windows computer running the motion capture system crashed as a result of a conflict between the real-time labeling system and the logging function. Since both of these are parts of a proprietary software package that could not easily be fixed, the problem was solved by doing all subsequent logging on the robot's onboard computer, which runs Linux. The crash occurred after 21 minutes, so the results from this run are not completely unusable, as convergence was already visible.
The second technical mishap was due to a different kind of weakness in the real-time labeling system. When markers were temporarily occluded, the system would use statistics to make reasonable guesses as to their positions, and update these as the markers became visible again. However, if too many markers were occluded simultaneously, for example during the hugging action or if the robot moved close to the boundaries of the motion capture area, the markers would sometimes be classified incorrectly. The system would correct these errors after a couple of seconds, but until corrected, the robot would not navigate correctly, resulting in behaviors that seriously disrupted interaction. After this effect had ruined an experimental run, a human-operated correction system was added. Using this, a human operator could correct the classification errors manually until the automatic corrections took effect.
The remaining failures were due to subjects taking unanticipated actions. For example, there was one subject who tried to convey her opinions of the robot using
only verbal means. Since the system did not account for this, the robot’s behavior
became unacceptable, as it tried to execute fast motions at very close distance. This
run was aborted as it was not deemed safe for the subject to continue. Another
failure was by a subject who did not interact with the robot as with another person,
but treated it as an object for the whole run. Since he did not interact with the
robot at all, once again the system did not find any acceptable values, and this time
the distances were extended to where the robot was no longer able to stay within the
boundaries, and the motion capture system stopped generating usable data. This
experiment also had to be aborted prematurely.
The remaining experimental runs all resulted in different degrees of success. For most
subjects, the parameters reached reasonable convergence to stated preferences within
15-20 minutes. A complete analysis of all these runs can be found in Appendix A.
Since the algorithm continues searching the parameter space for optimal settings indefinitely with a constant stepsize, convergence in the true sense of the word is not obtainable. Therefore, the measure used to show the learned values of the system was defined as the average value of the parameters over the last quarter of each experimental run, i.e. the last 7.5 minutes.
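As a small illustration, the learned value of a parameter could be computed from its logged trace as follows (a sketch, not the actual analysis code):

def learned_value(parameter_trace):
    # Mean of the logged parameter values over the last quarter of a run
    # (for a 30-minute run, roughly the last 7.5 minutes).
    n = len(parameter_trace)
    last_quarter = parameter_trace[3 * n // 4:]
    return sum(last_quarter) / len(last_quarter)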
Figure 6.1. Learned distance parameters and preferences for 12 subjects (panels: intimate, personal, and social distance in cm, plotted against subject #).
Figure 6.1 shows the learned values for the distances compared to the stated preferences for the 12 subjects. The bars in the graph show the interval of acceptable distances and the optimal value, and the asterisks are the learned values. Figure 6.2 shows the remaining three parameters. The circles show which values the subjects indicated as preferred, while the x's show values not indicated. Some subjects indicated several values, while others indicated preferences between two values; these latter cases are indicated with a triangle showing the preferred value. The asterisks again show the learned values as the mean values over the last quarter of the experimental runs.
Figure 6.2. Learned gaze and speed parameters and indicated preferences for 12 subjects (panels: gaze-meeting ratio, waiting time in seconds, and motion speed factor, plotted against subject #). The circles show which parameter settings the subject indicated as preferable, the x's show the non-indicated settings. In the cases where no given setting was indicated, a triangle shows the preferred value.
6.2 Results for the different parameters
As can be seen in the plots in the previous section, there is a large difference in
the success rate for different parameters. This is partially due to the stochastic
nature of the reward function, but more to the fact that different parameters are of
different importance. A typical trait of the PGRL algorithm is that it will adjust the
parameters with the largest impact on the reward function faster, while parameters
that have a lesser impact will be adjusted at a slower rate.
When the experiment was observed, the intimate distance parameter seemed to be highly significant for successful interaction, as the hug behavior is very awkward if executed at a faulty distance. Nevertheless, there are three runs of the twelve where this distance is significantly off from the indicated preference. For two of these, subjects 1 and 9, the value never enters the region of acceptable values at all. It is therefore plausible that all values tried for this parameter were unacceptable, thus not resulting in a measurable gradient. It is also notable that subject 9 was an overall failure for the system (see Section A.1.5). In the case of subject 4, the parameter value enters the acceptable interval twice, only to make a sharp jump out of the acceptable region almost instantly. This is most probably a sign of the system's instability due to the stochastic nature of the reward function. Since it seemed to be a problem that the values started too far away from acceptable intervals for this parameter, one run was done with a different initial value. This trial was done with subject 11, and though one try is too little to show anything significant, this try was very successful in adapting this parameter.
The personal distance parameter converged to the higher bound of the preference interval for most subjects. There were two cases where the value is significantly higher than the measured preferences. Both of these are a bit puzzling, as both subjects commented that they thought that the robot was coming too close. It is possible that the system correctly evaluated their distance preference but that the post-experiment measure did not. This would probably be caused by a poor choice of representative action for the measurements: if the post-experimental measurement was done with a behavior that most subjects prefer at a closer distance than the rest of the personal behaviors, this phenomenon would be explained.
The social distance parameter converged to acceptable values for all subjects except
number 9. As stated above, the ninth subject was an overall failure. The main cause
for this seemed to be that this subject was very unclear in showing his approval or
disapproval for different behaviors, but unfortunately, there is no numerical data to
support this, only impressions by the observer.
As can be seen in Figure 6.2, the results for the gaze-meeting parameter show a somewhat odd pattern. The fifth subject seems to be a complete failure, whereas the results for subjects 1–4 are slightly weak. The results for subjects 6–12 are overall very good. The number of subjects is too small for a reliable statistical analysis, but it should still be noted that subjects 1–6 and 9 are technical staff from the robotics research project, while subjects 7–12 (except 9) are secretarial staff from the planning department. It cannot be ruled out that the non-technical staff had a different interaction pattern than the technical staff, one that is only visible in the gaze-meeting parameter. The correlation coefficient between successful adaptation of this parameter and class of employee is 0.73 with p = 0.009, which is normally considered significant. Here, both success and employee class are coded as 0 or 1.
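For reference, a correlation of this kind between two 0/1-coded variables can be computed as a Pearson (phi) coefficient; the sketch below uses the subject classes stated above, but the success vector is hypothetical and would in practice be read off Figure 6.2:

from scipy.stats import pearsonr

# 1 = non-technical staff, 0 = technical staff, for subjects 1-12 (from the text).
employee_class = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
# 1 = successful gaze-meeting adaptation; this vector is a placeholder.
gaze_success = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1]

r, p = pearsonr(employee_class, gaze_success)
print(f"correlation = {r:.2f}, p = {p:.3f}")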
For the waiting parameter, acceptable convergence can be seen for most subjects,
with the exceptions of 1, 9, and possibly 6. Subject 1 did not show much interest
in speed issues. He indicated all possible speeds as equally acceptable, and was
somewhat hesitant to indicate a waiting time. It is therefore difficult to take this
run as a failure. For subject 6, the parameters converged to acceptable values fairly quickly, and thus started to oscillate, as the algorithm does not decrease the stepsize. This oscillatory behavior repeatedly forced the values away from stated preferences and then back again. This is an inherent weakness of this implementation.

Table 6.1. Average deviation from preferred value (normalized to stepsize units)

parameter   average deviation   initial deviation
intimate    0.9                 1.8
personal    1.3                 1.6
social      1.3                 1.2
gaze        1.0                 2.4
wait        0.8                 1.5
speed       1.0                 1.4
The speed parameter, finally, showed acceptable convergence for 8 of the 12 subjects.
It performed very poorly for subjects 3 and 9, and fairly poorly for subjects 5 and 8.
Also, there were two subjects who did not specify any preference for this parameter,
so it can only be said to be completely successful for 6 of the subjects. It is possible
that this weak performance indicates that this parameter had the least influence
on the reward function. In the case of subject 3, however, it was the subject’s
unanticipated reactions that caused the failure. This subject reacted to overly fast movements by becoming very wary of the robot, freezing up and staring intently at it. Since the reward function gives high scores to this type of reaction, the gradient was inverted for this parameter, causing the system to seek out the worst possible behavior. This is also an inherent weakness of the implemented system.
In order to get an overview of the general performance, the average deviation for each
parameter over all subjects was calculated. These values can be found in Table 6.1.
The rightmost column of this table shows how much the initial values deviated from
the stated preference, as a reference. All values have been normalized for stepsize.
As can be seen, most parameters converged to within one stepsize, the exceptions
being the personal and social distance parameters. It should be noted that for these
parameters the average stated tolerance (the difference between the closest comfortable distance and the farthest) was of a size corresponding to several stepsizes. For
example, for personal distance, the average stated tolerance was 3.02 stepsizes and
for social distance it was 5.01. Thus, the adaptation of the social distance is not
as great a failure as it first seems, even though the average case actually becomes
worse than the initial settings.
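As an illustration of how the entries in Table 6.1 can be read, the deviation of a learned value from the preferred value, normalized to stepsize units, is simply:

def normalized_deviation(learned, preferred, step_size):
    # Deviation between the learned value and the subject's preferred value,
    # expressed in stepsize units (as in Table 6.1).
    return abs(learned - preferred) / step_size

# Hypothetical example for the personal distance (stepsize 15 cm):
print(normalized_deviation(learned=95.0, preferred=75.0, step_size=15.0))  # ~1.33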
6.3 Result evaluation

This section answers the questions posed in Section 5.1, using the results from the experiment.
• Does the system provide a good adaptation to personal preferences?
For most subjects, the system was able to find their personal preferences and adapt the behavior to these. This is limited to the aspects of the behavior that are controlled by the parameters, and although most subjects expressed satisfaction with these aspects, several subjects complained about parts of the behavior that were not controlled by the parameters.
• Does the system adapt well to different people?
As is shown in the previous section, the system adapts well to different preferences, but is not good at adapting to different behavioral patterns. When subjects behaved in unanticipated ways, the system could not even find good approximations of their preferences. If the subject whose experiment was disrupted by unrelated technical problems is excluded, 10 of the 15 subjects received an acceptable adaptation, while for three subjects the adaptation was so poor that the experiment had to be aborted.
• What are the inherent strengths of the system?
The system uses a model-free reinforcement learning algorithm. This means that
preferences that were completely unanticipated, like those of subject 5 (see A.1.2),
can easily be adapted to.
Also, since the system works by adapting parameters, it was very easy to integrate with an existing system. This means that the system should be easy to port to other platforms, as long as similar parameters can be isolated in their control systems.
• What are the inherent weaknesses of the system?
The reinforcement signal is inherently stochastic. This not only makes convergence slower, but also means that the system sometimes temporarily changes parameters to worse values. Also, in the current implementation, no information gained from interacting with one subject is used to facilitate interaction with another. This means that the system has to start “from scratch” with each new subject.
The largest weakness is that the current reinforcement signal assumes a specific
behavior pattern that most, but not all, subjects exhibit. Since the system is completely dependent on this reinforcement function, the performance is very poor when
subjects do not fit the reaction model.
• What modifications should be done to the system to enhance performance?
The most necessary improvement must be to ensure that the system does not make
the parameter settings worse when subjects behave in unanticipated ways. It must
be made more robust and able to adapt to a wider range of behavior types. This
will be more thoroughly discussed in the next section.
6.4 Suggested improvements
There are two main problems with the implemented system. The first problem is that
the system is not capable of handling unexpected reaction patterns. This means that
there are several subjects for whom the system does not improve its performance but
instead performs a random walk through the parameter space until the parameters
reach unacceptable values. There are several possible ways to address this problem,
however.
A very simple solution would be to use the preference data gained from this experiment to find outer limits for the parameters. For example, for the intimate distance parameter, no subject stated a preference farther than 35 cm, and no subject expressed that distances over 50 cm were acceptable. Thus, it would be possible to insert an upper limit for this parameter at 50 cm, guaranteeing that it does not move too far from generally stated preferences. The same could be applied to the other parameters. Gathering more data, and perhaps trying a version with tighter limits than the largest acceptable value, would eliminate the risk of the system behaving completely unacceptably even when it cannot gain any meaningful data from the subject's reactions. The preference data from the experiment could also be used to set the starting values closer to the average preferred values; for example, the intimate distance should start at a value much lower than the 50 cm of the current experiment.
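A minimal sketch of such a clamping safeguard; only the 50 cm upper bound for the intimate distance and the 15 cm safety minimum mentioned in Appendix A are taken from the text, the other limits are hypothetical:

# Clamp each adapted parameter to limits derived from collected preference data.
PARAMETER_LIMITS = {
    "intimate_distance": (15.0, 50.0),    # cm; 15 cm is the robot's safety minimum
    "personal_distance": (30.0, 120.0),   # cm, hypothetical
    "social_distance": (60.0, 200.0),     # cm, hypothetical
}

def clamp_parameter(name, value):
    lower, upper = PARAMETER_LIMITS[name]
    return min(max(value, lower), upper)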
A slightly more difficult solution to the problem is to construct a better reward function. One possible approach is to utilize a reward function that takes more modes of reaction into account. Examples that would be relatively simple to implement on the RobovieII platform are speech cues and facial expressions. There are already implemented systems that can recognize emotion from facial images (see Shiomi et al. [36]), and RobovieII is equipped with a system for speech recognition. A problem that has to be overcome when using this approach is weighting the input signals. Finding a good balance may be very difficult, as there is no guarantee that there exists a balance that works well with all subjects.
This leads into another possible solution, which needs even more study before it can be implemented and evaluated. It could be a feasible approach to use different reward functions for different persons. If the system could identify the type of reaction pattern that a person displays, it could then switch between different types of reward functions accordingly, or possibly disengage the learning system entirely when non-supported reactions are detected.
The other main problem with the implemented system that needs solving is that even with subjects who react in an anticipated manner, some parameters do not converge to preferred values. There seem to be two major reasons for this, which have to be addressed differently. One reason is that some parameters seem to start too far away from preferred values, so that small perturbations do not result in measurable differences in the reward function. This is easily corrected by using statistical data gained from the experiments in this study to make sure that all values are started as close as possible to the average preference, and to correct the stepsizes to better values.
The other reason is inherent in the PGRL algorithm in its current formulation. Since
the stepsize is left unaltered throughout the entire run, the reaction speed of the
adaptive system is constant, but parameter values tend to deviate or oscillate away
from their optimum values. In order to solve this problem, the system would need
some measure by which to recognize convergence, so that stepsize could be reduced
accordingly. In order to keep the response speed of the system and allow for fast
adaptation to changes in a subject’s preferences, the system would also have to be
able to recognize when it needs to increase the stepsize back to larger values.
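A sketch of what such a stepsize adaptation rule might look like; the convergence test, thresholds, and factors are illustrative and not part of the implemented system:

def adapt_stepsize(step, recent_values, recent_rewards,
                   shrink=0.5, grow=2.0, min_step=0.1, max_step=1.0):
    # Shrink the stepsize when the parameter only oscillates around one value
    # (a crude convergence test), and grow it again when the reward drops,
    # so the system can follow a change in the subject's preferences.
    spread = max(recent_values) - min(recent_values)
    if spread <= 2.0 * step:
        step = max(min_step, step * shrink)
    if recent_rewards and recent_rewards[-1] < 0.8 * max(recent_rewards):
        step = min(max_step, step * grow)
    return step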
Some other smaller problems and possible developments that remain include finding more parameters that should be adapted in the system. It is possible that all
parameters should be divided into proxemic classes, or that there are other parametrizable factors in the robot’s behavior that have a significant impact on how
a person perceives interaction quality. This needs thorough testing and statistical
analysis to determine.
Finally, there is a need to implement the system without the high accuracy obtained
with the motion capture system. If the adaptation is to be put to “real-world” use,
it has to work with the much lower accuracies that can be obtained with on-board
sensors on the robot.
6.5 Summary
Of the 16 experimental runs, one had technical malfunctions and 3 had to be stopped early since the subjects reacted in ways that were not anticipated. In short, these subjects did not treat the robot as they would a human, and the system was unable to cope with their non-social interaction patterns. Among the remaining 12 subjects, there was one for whom there was a significant discrepancy between the values learned by the system and the preferences indicated by the subject. The remaining 11 runs resulted in varying degrees of success, with most parameters converging to acceptable values within 10 to 15 minutes.

The main strength of the system is that it can adapt well to different persons' preferences as long as they treat the robot in a social manner. The main weakness is that the values of the parameters diverge in an unacceptable manner when the robot is not treated in a social manner.
Chapter 7
Summary and conclusions
This chapter summarizes the entire study and discusses the conclusions that can be
drawn from it.
7.1 Summary
Using theories of proxemics and personal space that describe how humans interact
with one another, and the assumption that humans will interact in a similar way with
a humanoid robot, the problem was formulated as designing a system that allows
a robot to keep appropriate distances to its human counterpart when engaged in
social interaction. Since the appropriate distance is not an independent quantity, the system should adjust it in accordance with movement speeds, gaze-meeting and type of interaction.
A policy gradient reinforcement learning (PGRL) algorithm was used to address
this problem. The interaction behavior was parameterized to a policy based on six
different variables. These were adjusted in order to maximize a reward function that
used gaze-meeting as a positive signal and repositioning as a negative one. Thus,
the system tried to find the parameter set that would result in a robot behavior that
minimizes the human counterpart’s repositioning and maximizes the time he spends
gazing at the robot.
This system was implemented on the humanoid robot RobovieII, which is capable of performing several different interactive behaviors. The behaviors were classified according to their proxemic class, i.e. according to the distance at which they are performed: intimate, personal, or social (see Section 2.1). A distance parameter was associated with each class. The robot was set to perform these behaviors at random, evaluating the reward function after the execution of each behavior. After every ten behavior executions, the parameter set was adjusted in order to achieve the maximum reward.
This implementation was evaluated in an experiment where 16 subjects were asked
to interact with the robot in a natural manner for 30 minutes. After each run, the
subject’s actual preferences were measured and compared to those obtained by the
system. One of the runs had to be aborted due to a technical malfunction, and
three of the runs resulted in the policy parameters diverging to unacceptable values
due to non-social behavior patterns in the subjects. Of the remaining 12 runs, one was significantly unsuccessful. The remaining 11 were successful to varying degrees. On average, most parameters converged to within one algorithm stepsize of the preferred values, showing that the stepsize is possibly the main limiting factor.

The main strength of the system is that it is able to autonomously adapt to personal preferences without any explicit instructions from the human counterpart. The main weakness is that it lacks robustness when confronted with subjects who exhibit unanticipated interaction patterns.
7.2 Conclusions
This study shows that it is possible to create a system that autonomously adapts
robot behavior to personal preferences by using minute discomfort signals. This
system requires no special operational skills from the human counterpart. The
human subject does not even need explicit knowledge of his own preferences for
the system to be able to find and adapt to them.
Whether or not this adaptiveness results in easier interaction with a robot cannot be answered by this study. For this, an experiment evaluating human-robot interaction both with and without the system would be necessary. Some subjects
made the remark that they felt that the robot was made more interesting by its
ability to adapt, and that it made it seem more intelligent. However, these are very
subjective remarks, and should not be taken as evidence that the system enhances
interaction.
The question that is answered by the study is whether or not a system can be designed that lets a robot handle individual variations in preferences concerning proxemics and personal space. It is also shown that this can be achieved with reinforcement learning.
Since the experiment was conducted in a monocultural environment, with subjects from very similar backgrounds, the results are inconclusive as to whether the system would adapt well to cultural differences. Since Hall's theory of proxemics [14] takes these differences into consideration, and the rest of the model used for the system is based on both American and Japanese findings about personal space, there is no particular reason to believe that this method and implementation should not work in different cultural settings. The interactive behaviors would have to be modified to fit a different culture, however.
The system as described in this report is still only at a proof-of-concept level of development, and much remains to be done before it can be practically implemented. It does, however, demonstrate the power of reinforcement learning in solving problems that are only defined by some diffuse goal, e.g. “minimizing discomfort”. As the field of social robotics progresses and develops in the future, there will most likely be several similar behavioral or social interaction problems that will need solving. A method such as the one suggested in this report may be a viable course of action in solving some of these.
References
[1] Douglas Alexander Aberdeen. Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University,
Mar 2003.
[2] Philipp Althaus, Hiroshi Ishiguro, Takayuki Kanda, Takahiro Miyashita, and
Henrik I Christensen. Navigation for human-robot interaction tasks. In IEEE
International Conference on Robotics and Automation - 04, pages 1894–1900,
New Orleans, April 2004.
[3] Hideki Asoh, Satoru Hayamizu, Isao Hara, Yoichi Motomura, Shotaro Akaho,
and Toshihiro Matsui. Socially embedded learning of the office-conversant mobile robot Jijo-2. In Proc. of International Joint Conference on Artificial Intelligence, pages 880–885, 1997.
[4] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] Maren Bennewitz, Wolfram Burgard, and Sebastian Thrun. Using em to
learn motion behaviors of persons with mobile robots. In Proceedings of the
2002 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems, Lausanne,
Switzerland. EPFL, Oct 2002.
[6] Cynthia Breazeal and Brian Scasselati. A context-dependent attention system
for a social robot. In Proceedings of the International Joint Confererence on
Artificial Intelligence, pages 1146–1151, 1999.
[7] Wolfram Burgard, Armin B. Cremers, Dieter Fox, Dirk Hahnel, Gerhard Lakemeyer, Dirk Schulz, Walter Steiner, and Sebastian Thrun. The interactive
museum tour-guide robot. In AAAI/IAAI, pages 11–18, 1998.
[8] Ikuo Daibo. Sigusa no communication - hito ha sitasimi wo dou tutaeau ka.
Saiensu-sha, 1998.
[9] Carl F DiSalvo, Francine Gemperle, Jodi Forlizzi, and Sara Kiesler. All robots
are not created equal: The design and perception of humanoid robot heads. In
44
Proceedings of the conference on Designing interactive systems, pages 321–326,
2002.
[10] Chris Drummond. Accelerating reinforcement learning by composing solutions
of automatically identified subtasks. Journal of Artificial Intelligence Research,
16:59–104, 2002.
[11] Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. A survey of socially
interactive robots. Robotics and Autonomous Systems, 42:143–166, 2003.
[12] Gene F. Franklin, J. David Powell, and Abbas Emami-Naeini. Feedback Control
of Dynamic Systems. Prentice Hall, 2002.
[13] Gregory Z. Grudic, R Vijay Kumar, and Lyle H. Ungar. Using policy gradient
reinforcement learning on autonomous robot controllers. In Proceedings of the
2003 IEEE/RSJ International Conference on Intelligent Robots and Systems,
volume 1, pages 406–411, 2003.
[14] Edward T. Hall. The Hidden Dimension. DoubleDay Publishing, 1966.
[15] Michita Imai, Tetsuo Ono, and Hiroshi Ishiguro. Physical relation and expression: Joint attention for human-robot interaction. IEEE Transactions on
Industrial Electronics, 50(4):636–643, Aug 2003.
[16] Tetsunari Inamura, Masayuki Inaba, and Hirochika Inoue. Integration model
of learning mechanism and dialogue strategy based on stochastic experience
representation using bayesian network. In Proceedings of the Int’l Workshop on
Robot and Human Interactive Communication, pages 247–252. ROMAN, 2000.
[17] Tetsunari Inamura, Masayuki Inaba, and Hirochika Inoue. toukeiteki keikenhyougen ni motoduku pa-sonarurobotto to no tekiouteki intarakushonsisutemu.
densi jouhou tuusin gakkai ronbunsi, 6(6):867–877, Jun 2000.
[18] Charles L. Isbell, Christian R. Shelton, Michael Kearns, Satinder Singh, and
Peter Stone. A social reinforcement learning agent. In Fifth International
Conference on Autonomous Agents, pages 377–384, 2001.
[19] Starkey Duncan jr. and Donald W. Fiske. Face-to-Face Interaction: Research,
Methods, and Theory. Lawrence Erlbaum Associates, Inc., Publishers, 1977.
[20] Leslie P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement
learning: A survey. Journal of Artificial Intelligence Research, 1996.
[21] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. A
practical experiment with interactive humanoid robots in a human society. In
Third IEEE International Conference on Humanoid Robots (Humanoids 2003),
2003.
[22] Takayuki Kanda, Hiroshi Ishiguro, Michita Imai, Tetsuo Ono, and Kenji Mase.
A constructive approach for developing interactive humanoid robots. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots
and Systems, pages 1265–1270, Oct 2002.
[23] Takayuki Kanda, Hiroshi Ishiguro, Michita Imai, and Tetsuya Ono. Bodymovement analysis of human-robot interaction. In Proceedings of the International Joint Conferences on Artificial Intelligence, pages 177–182. IJCAI, 2003.
[24] Takayuki Kanda, Hiroshi Ishiguro, Michita Imai, Tetsuya Ono, and Ryohei
Nakatsu. Development and evaluation of an interactive humanoid robot "Robovie". In Proceedings of International Conference on Robotics and Automation, ICRA, pages 1848–1855. IEEE, 2002.
[25] Adam Kendon. Conducting interaction - Patterns of behavior in focused encounters. Cambridge University Press, 1990.
[26] Nate Kohl and Peter Stone. Policy gradient reinforcement learning for fast
quadrupedal locomotion. In Proceedings of International Conference on Robotics and Automation, volume 3, pages 2619–2624. IEEE, 2004.
[27] Maja Mataric. Getting humanoids to move and imitate. IEEE Intelligent
Systems, pages 18–24, jul 2000.
[28] Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
[29] Noriaki Mitsunaga, Christian Smith, Takayuki Kanda, Hiroshi Ishiguro, and
Norihiko Hagita. Robot behavior adaptation for human-robot interaction based
on policy gradient reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2005) (submitted to), 2005.
[30] Noriaki Mitsunaga, Christian Smith, Takayuki Kanda, Hiroshi Ishiguro, and
Norihiko Hagita. Robot behavior adaptation for human-robot interaction
based on policy gradient reinforcement learning (in japanese). Nihon Robotto
Gakkaishi (Journal of the Robotics Society of Japan) (submitted to), 2005.
[31] Kouji Nakajima. Research regarding personal distances between people and a
moving robot (in Japanese). PhD thesis, Kyuushuu Institute of Design, 1998.
[32] Yasushi Nakauchi and Reid Simmons. A social robot that stands in line. Autonomous Robots, 12:313–324, 2002.
[33] Monica N. Nicolescu and Maja J. Mataric. Learning and interacting in humanrobot domains. Special Issue of IEEE Transactions on Systems, Man, and
Cybernetics, Part A: Systems and Humans, 31(5):419–430, sep 2001.
[34] Tetsuya Ogata, Shigeki Sugano, and Jun Tani. Open-end human robot interaction from the dynamical systems perspective: Mutual adaptation and incremental learning. In Proceedings of the 17th Int. Conf. on Industrial and
Engineering Applications of Artificial Intelligence and Expert Systems, pages
435–444. Springer Verlag, 2004.
[35] Daniel Shapiro, Pat Langley, and Ross Shachter. Using background knowledge
to speed reinforcement learning in physical agents. In Proceedings of the fifth
international conference on Autonomous agents, pages 254–261. ACM Press,
2001.
[36] Masahiro Shiomi, Takayuki Kanda, Nicolas Miralles, Takahiro Miyashita, Ian
Fasel, Javier Movellan, and Hiroshi Ishiguro. Face-to-face interactive humanoid
robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2004), pages 1340–1346, 2004.
[37] William D. Smart and Leslie Pack Kaelbling. Practical reinforcement learning in
continuous spaces. In Proceedings of the Seventeenth International Conference
on Machine Learning (ICML-2000), pages 903–910, 2000.
[38] William D. Smart and Leslie Pack Kaelbling. Effective reinforcement learning
for mobile robots. In Proceedings of the International Conference on Robotics
and Automation (ICRA-2002), volume 4, pages 3404–3410, May 2002.
[39] Christian Smith, Noriaki Mitsunaga, Takayuki Kanda, Hiroshi Ishiguro, and
Norihiko Hagita. Adaptation of an interactive robot’s behavior using policy
gradient reinforcement learning. In Proceedings of the tenth Robotics Symposia,
pages 319–324. RSJ, SICE, JSME, 2005.
[40] Eric Sundstrom and Irving Altman. Interpersonal relationships and personal
space: Research review and theoretical model. Human Ecology, 4(1), 1976.
[41] Richard S Sutton and Andrew G Barto. Reinforcement Learning. MIT Press,
Cambridge, MA, 1998.
[42] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.
[43] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement learning.
Artificial Intelligence, 112:181–211, 1999.
[44] Tsuyoshi Tasaki, Shohei Matsumoto, Hayato Ohba, and Mitsuhiko Toda. Dynamic communication of humanoid robot with multiple people based on interaction distance. In Proceedings of International Conference on Spoken Language
Processing, Oct 2004.
[45] Russ Tedrake, Teresa Weirui Zhang, and H Sebastian Seung. Stochastic policy
gradient reinforcement learning on a simple 3d biped. In Proceedings of the
IEEE International Conference on Intelligent Robots and Systems (IROS).
IEEE, 2004.
[46] Stephane Zrehen. Psychoanalytic concepts for the control of emotions in robots.
In AAAI Fall symposium on "Emotional and Intelligent: The tangled Knot of
Cognition", Oct 1998.
[47] Stephane Zrehen, Hiroaki Kitano, and Masahiro Fujita. Learning in psychologically plausible conditions: The case of an imaginary pet robot. In From
Animals to Animats 5: Proceedings of the Fifth International Conference on
Simulation of Adaptive Behavior, 1998.
Appendix A
Unabridged result measurements and
plots
A.1 Complete results from PGRL experiments on Robovie
This section contains a subject-by-subject description and analysis of the results from the experiment. The results of the runs that had to be aborted early due to severe failures are not included, as those plots show little of interest. The 12 complete runs have been divided into different groups according to their success rate, and are numbered in the chronological order in which they were performed.
A.1.1 Completely successful experimental runs
There were three subjects for whom the system performed very well. Not only were the subjects themselves content with the performance, but all parameters converged well to their stated preferences. These were the second, tenth, and twelfth subjects of the study. Common to all of them was a tendency to be very interested in interacting with the robot, and they had a very positive interaction pattern, much as when interacting with another human.
The second subject (Figure A.1) stated overall contentment with the system's performance, and found the distances to improve as the experiment progressed. Studying the plots, one can see that most values converge to stated preferences, apart from the gaze-meeting parameter, and this run should therefore be considered successful.
Figure A.1. Results achieved for subject 2. The dotted lines represent the measured preferences of the subject.

The tenth subject (Figure A.2) was impressed by the robot's behavior and said that it quickly became much better. The plots support this, as all parameters are adjusted to well within stated preference, except personal distance, which is but slightly farther. This subject stated a preference for an interval of waiting times, hence the two lines in the plot showing the borders of this interval.

Figure A.2. Results achieved for subject 10. The dotted lines represent the measured preferences of the subject.
The twelfth subject (Figure A.3) was overall content with the performance, especially the gaze-meeting. As can be seen in the plots, the parameters converge acceptably towards the stated preferences. Again, the plot shows intervals where the subject stated a preference for more than one value.

Figure A.3. Results achieved for subject 12. The dotted lines represent the measured preferences of the subject.
A.1.2 Partial success - content subjects
The next group contains two subjects who were content with the robot's behavior, even though analysis of the results shows that some parameters were far from stated preference.
The experimental run for the fifth subject (Figure A.4) resulted in good adaptation
for the distance parameters, but less successful adaptation for the remaining parameters. This subject stated that all shown values for waiting were equally good, so
this parameter cannot be evaluated in this case.
Interestingly though, this subject expressed contentment with the gaze-meeting results, even though it is obvious from the plots that these were far from his stated preference. He was also satisfied with the speed parameter, which is as much as 20% off from the specified preference. It is possible that this discrepancy can be explained by different conditions during the experimental run and when measuring the preferences afterward, or simply by the fact that the subject actually accepted a fairly wide range of parameter values.
This subject preferred to touch the robot during all actions, even the monologue and other social actions. This resulted in a much closer distance for this parameter than for other subjects.

Figure A.4. Results achieved for subject 5. The dotted lines represent the measured preferences of the subject.
Figure A.5. Results achieved for subject 8. The dotted lines represent the measured preferences of the subject.

The eighth subject (Figure A.5) was initially only discontent with the gaze-meeting ratio, but noted that this quickly improved. She also commented that the robot didn't seem to improve after the first few minutes, something that is supported by the plots.
It is notable that though the speed is slower than the stated preference, the waiting time is shorter. It is possible that these two effects cancel each other out, but this has not been tested. Both values show a slight tendency to correct themselves in the last few minutes, but this period is too short to rule out random fluctuations.
A.1.3 Successful runs - discontent subjects
Figure A.6. Results achieved for subject 7. The dotted lines represent the measured preferences of the subject.
The seventh subject (Figure A.6) described her first impression of the robot's behavior as “tentative”, but said that it became more active as time passed. She also stated that she thought that it tended to get too close, which is a bit surprising when the actual distances are compared to her preferences in the plots.
Most parameters can be said to converge acceptably, though the different distance parameters tend to stay close to the outer limits of the specified intervals.
A.1.4 Partially successful runs
There were five subjects for whom the system only performed partially well. These subjects were content with the aspects that worked, and discontent with the ones that did not.
Figure A.7. Results achieved for subject 1.
The experiment with the first subject (Figure A.7) was aborted after 21 minutes due to technical problems, and as such might be less significant than the other runs that lasted the entire planned 30 minutes.
The distances for personal and social interaction stay well within the stated preferences, whereas the distance for intimate interaction never enters the stated preference interval. Since all distances tested in this case are outside the preferred interval, the algorithm gains little or no information on the reward gradient for this parameter. The subject stated discontent with the intimate distance.
The gaze-meeting parameter stays around 90%, which should be deemed acceptable, as the subject indicated 100% as being preferable to 50% or 75%. The waiting parameter is, however, fairly far from the indicated preference. This particular subject did not show much interest in speed and timing issues, and could not indicate any speed as being preferable to any other. It is therefore difficult to conclude whether this is a failure of the adaptive system or not.
The third subject (Figure A.8) stated discontent with the personal distance during the first half of the experiment, which correlates well with the plot. She was also
content with the gaze-meeting, even though the actual values achieved were closer
to 75% than her specified preference of 100%.
The plot of the intimate distance shows the lower limit of 15 cm that the system
has for safety reasons. This subject stated a fairly wide preference interval for the
waiting parameter, thus making the results for this more difficult to evaluate.
The only other remarkable result of this experimental run is that the speed parameter is far away from the stated preference, something the subject also complained about. Observations of the actual experiment showed that as the robot increased its movement speed, the subject tended to fix her gaze at it, being wary of the sudden movements. This is clearly a weakness of the system itself.

Figure A.8. Results achieved for subject 3. The dotted lines represent the measured preferences of the subject.
Figure A.9. Results achieved for subject 4. The dotted lines represent the measured preferences of the subject.
For the fourth subject (Figure A.9), the results were mediocre at best. The two farther distances were adapted acceptably well, but the intimate distance showed no signs of converging to the stated preference. This is probably due to the parameter value accidentally leaving the acceptable domain and not finding its way back. Since the behaviors are picked at random in this experiment, long series that do not contain a certain distance can produce such results; this is an inherent weakness of the system.
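Purely as an illustration of how easily such gaps can occur (assuming, hypothetically, that one of four behavior classes is drawn uniformly and independently in each episode), the probability that k consecutive episodes contain no intimate-distance interaction is

    P = (3/4)^k, i.e. P ≈ 0.056 for k = 10,

so runs of ten or more episodes without any feedback on the intimate distance are not unlikely during a 30-minute experiment.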
As for the other parameters, timing and speed move in the desired direction, while
gazing does not. It is notable that this particular subject showed one of the lowest
preferences for gazing, not wanting the robot to meet his gaze much more than 50%
of the time.
[Figure A.10: six panels plotting the intimate, personal and social distances [cm], the gaze-meeting proportion, the waiting time [s] and the speed factor against time [min].]
Figure A.10. Results achieved for subject 6. The dotted lines represent the measured preferences of the subject.
The sixth subject (Figure A.10) was overall content with the robot's behavior, apart from the speed, which he thought was too fast. This correlates very well with the results in the plots. This particular subject had stated his preference for the speed parameter as "any speed but the very fastest one", hence the interval shown for the preference in the plot.
As most values stay in the vicinity of the optimal one for most of the experimental run, the values show an oscillating behavior. This is unavoidable, as the algorithm does not reduce the step size as time proceeds, but must keep making adjustments of a fixed total size.
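A minimal, purely illustrative sketch of this effect (the function name and numbers are assumptions, not taken from the thesis implementation): with a fixed step size, every update moves the value by the full step even once it has reached the neighbourhood of the optimum, so it keeps overshooting back and forth instead of settling.

def fixed_step_updates(start, optimum, eta, steps):
    """Move a single parameter towards 'optimum' using a fixed step size 'eta'."""
    value, trace = start, []
    for _ in range(steps):
        direction = 1.0 if optimum > value else -1.0  # sign of the estimated gradient
        value += eta * direction                      # the step size never shrinks
        trace.append(value)
    return trace

print(fixed_step_updates(start=100.0, optimum=55.0, eta=10.0, steps=8))
# -> [90.0, 80.0, 70.0, 60.0, 50.0, 60.0, 50.0, 60.0]: oscillation around 55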
Since the system had shown some problems finding the preferred value for the intimate distance when started too far away, this eleventh experimental run (Figure A.11) was started with a value closer to what most earlier subjects had preferred.

[Figure A.11: six panels plotting the intimate, personal and social distances [cm], the gaze-meeting proportion, the waiting time [s] and the speed factor against time [min].]
Figure A.11. Results achieved for subject 11. The dotted lines represent the measured preferences of the subject.

The plot shows that the system tends to be reasonably good at
maintaining a good value when started close to it.
As for the other parameters, all were started at the same values as for the other subjects, and all but the personal distance seem to converge to the preferred values. The subject said he was content with all aspects but the social distance, which he found to be too close.
A.1.5 Unsuccessful run
The results attained for the ninth subject (Figure A.12) are not very good. As can be seen, apart from the personal distance and gazing parameters, the results are far from the stated preferences. There were no observable problems with this subject's behavior, so the reason for these poor results is still unclear. It is possible that this subject was not clear in showing his dislike when the robot behaved in an unwanted manner. It is also noteworthy that the subject said he "felt as if the robot didn't like him, but forced itself to be somewhat polite and talk to him anyway".
[Figure A.12: six panels plotting the intimate, personal and social distances [cm], the gaze-meeting proportion, the waiting time [s] and the speed factor against time [min].]
Figure A.12. Results achieved for subject 9. The dotted lines represent the measured preferences of the subject.