
Robot Behavior Adaptation for Human-Robot Interaction
based on Policy Gradient Reinforcement Learning∗
Noriaki Mitsunaga1, Christian Smith1,2, Takayuki Kanda1, Hiroshi Ishiguro1,3, Norihiro Hagita1
1 ATR Intelligent Robotics and Communication Laboratories
2 Royal Institute of Technology, Stockholm
3 Graduate School of Engineering, Osaka University
[email protected]
Abstract—In this paper we propose an adaptation mechanism for robot behaviors that makes human-robot interaction run more smoothly. The mechanism is based on reinforcement learning: it reads minute body signals from a human partner and uses this information to adjust interaction distances, gaze meeting, and motion speed and timing. An experiment with twelve subjects shows that this enables autonomous adaptation to individual preferences.
Index Terms—policy gradient reinforcement learning
(PGRL), human-robot interaction, behavior adaptation,
proxemics.
I. INTRODUCTION
! "
# $ %& ' (
! !
) )
* +
( +
(
! *
! )
)
∗ This research was supported by the National Institute of Information and Communications Technology of Japan.
II. THE BEHAVIOR ADAPTATION SYSTEM
A. Behavior adaptation
The system measures the amount of body movement and gazing displayed by the human. These values are then used as the reward that
is to be minimized by the system. The system consists of
the policy gradient reinforcement learning (PGRL) algorithm, which minimizes this reward by changing its policy; the policy in turn determines the robot's behavior. In PGRL, the policy is
directly adjusted to minimize the reward. Other reinforcement
learning techniques, such as Q-learning, instead learn an action-value function from which the policy is extracted. We used a policy that consists of six adapted parameters: the interaction distances for three classes of interaction and three further interaction parameters.
The following sections will explain in detail the adapted
parameters, the PGRL algorithm, and the reward function.
B. Adapted Parameters
In this experiment, we used six different parameters. These
were the interaction distance for each of three classes of
interaction (detailed explanation in III-A), the extent to which
the robot would meet a human's gaze, waiting time between
utterance and action, and the speed at which motions were
carried out. We chose these parameters because they appear to have a large impact on interaction while being cheap to implement, and because keeping their number small keeps the dimensionality of the search space at a minimum.
The distances were measured as the horizontal distance
between robot and human foreheads. For gaze-meeting, the
robot would meet human gaze, and then look away, with
a cycle length of 5 seconds, which is the average cycle
length in human-human interaction, according to [3]. The
gaze-meeting parameter is the proportion of the cycle spent
meeting the human subject's gaze. The speed of the robot
was divided into two parameters. One controlled how long
the robot would wait between utterance and action (e.g. the
waiting time between saying “let's shake hands” and reaching
out with its right arm), and the other controlled the actual
speed of the motion itself.
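For concreteness, the six-parameter policy and its per-parameter step sizes (initial values and step sizes as listed in Table II in Section III) can be written down as a simple parameter vector. The following Python snippet is only an illustrative sketch; the variable names are ours, not taken from the original implementation.

# Illustrative encoding of the six adapted parameters and their step sizes
# (values as in Table II); names are assumptions made for this sketch.
INITIAL_POLICY = {
    "intimate_distance_cm": 50.0,   # interaction distance for intimate behaviors
    "personal_distance_cm": 80.0,   # interaction distance for personal behaviors
    "social_distance_cm": 100.0,    # interaction distance for social behaviors
    "gazing_ratio": 0.7,            # proportion of the 5 s cycle spent meeting the gaze
    "waiting_time_s": 0.17,         # pause between utterance and motion
    "speed_factor": 1.0,            # multiplier on motion speed
}

STEP_SIZES = {
    "intimate_distance_cm": 15.0,
    "personal_distance_cm": 15.0,
    "social_distance_cm": 15.0,
    "gazing_ratio": 0.1,
    "waiting_time_s": 0.3,
    "speed_factor": 0.1,
}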
C. The PGRL Algorithm
The learning algorithm used in this experiment is policy
gradient reinforcement learning (PGRL), a reinforcement
learning method that directly adjusts the policy without calculating action value functions (detailed descriptions can be
found in [1] and [5]). Figure 1 shows the algorithm [5] in
pseudo code. The Θ indicates the current policy which has
n parameters. A total of T perturbations of Θ are generated,
tested with a person, and the reward function is evaluated.
Perturbation Θt of Θ is generated by randomly adding +εj, 0, or −εj to each element θj in Θ. The step sizes εj are set independently for each parameter.
When all T perturbations have been run, the gradient A of
the reward function in the parameter space is approximated
by calculating the partial derivatives for each parameter.
Thus, for each parameter θj, the average rewards when εj is added, when no change is made, and when εj is subtracted are
calculated. The gradient in dimension j is then regarded as
0 if the reward is greatest for the unperturbed parameter, and
regarded as the difference between the average rewards for
Θ = {θj} ← initial parameter-set vector of size n
Θt = {θjt} : perturbed vector derived from Θ
E = {εj} ← parameter step-size vector of size n
η ← overall step size
while (not done)
    for t = 1 to T
        for j = 1 to n
            r ← unbiased random choice from {−1, 0, 1}
            θjt ← θj + εj · r
        run system using parameter set Θt, evaluate reward
    for j = 1 to n
        Avg+,j ← average reward for all Θt with positive perturbation in dimension j
        Avg0,j ← average reward for all Θt with zero perturbation in dimension j
        Avg−,j ← average reward for all Θt with negative perturbation in dimension j
        if (Avg0,j > Avg+,j) AND (Avg0,j > Avg−,j)
            aj ← 0
        else
            aj ← Avg+,j − Avg−,j
    A ← (A / |A|) · η
    aj ← aj · εj, ∀ j
    Θ ← Θ + A
Fig. 1. The PGRL algorithm.
the perturbed parameters otherwise. When the gradient A has
been calculated, it is normalized to the overall step size η and scaled by the individual step sizes in each dimension. The parameter
set Θ is then adjusted by adding A.
This method is guaranteed to converge towards a local
optimum given a static reward function (for a more stringent
analysis of the algorithm, see [1]). Given a stochastic reward function that possibly evolves over time, PGRL will continuously move towards the vicinity of the current local optimum, but of course no convergence guarantee can be given. The
main advantage over other reinforcement learning methods,
is that little knowledge of the dynamic model behind the
system is required (as compared to dynamic programming),
and that since large tables, e.g. those of Q-learning, need not
be calculated, learning should be considerably quicker.
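To make the procedure of Fig. 1 concrete, the following is a minimal Python sketch of a single PGRL update step. It is only an illustration under our own naming assumptions: theta and eps are lists holding the current parameter values and step sizes (e.g. the values of Table II in a fixed order), run_and_evaluate stands for running the robot with a perturbed parameter set and measuring the reward of Section II-D, and the default number of perturbations T is arbitrary.

import random

def pgrl_step(theta, eps, eta, run_and_evaluate, T=10):
    """One PGRL policy update following the structure of Fig. 1 (sketch only)."""
    n = len(theta)
    perturbations, rewards = [], []
    for _ in range(T):
        # Randomly perturb every parameter by -eps[j], 0, or +eps[j].
        signs = [random.choice((-1, 0, 1)) for _ in range(n)]
        theta_t = [theta[j] + eps[j] * signs[j] for j in range(n)]
        perturbations.append(signs)
        rewards.append(run_and_evaluate(theta_t))

    # Approximate the gradient from the average reward per perturbation direction.
    a = [0.0] * n
    for j in range(n):
        avg = {}
        for s in (-1, 0, 1):
            rs = [r for p, r in zip(perturbations, rewards) if p[j] == s]
            avg[s] = sum(rs) / len(rs) if rs else 0.0
        if not (avg[0] > avg[1] and avg[0] > avg[-1]):
            a[j] = avg[1] - avg[-1]      # otherwise the gradient component stays 0

    # Normalize to the overall step size and scale by the per-parameter step sizes.
    norm = sum(x * x for x in a) ** 0.5
    if norm > 0.0:
        a = [eta * x / norm for x in a]
    a = [a[j] * eps[j] for j in range(n)]
    return [theta[j] + a[j] for j in range(n)]

A full adaptation run would call pgrl_step repeatedly, feeding the returned parameter vector back in as the next theta.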
D. Reward Function
The reward function was based on the amount of movement
of the subject and the proportion of the time spent gazing
directly at the robot in one interaction,
R = 0.2 × (movement[mm]) + 500 × (gazing prop.). (1)
Figure 2 shows the block diagram. These two features were
chosen as they are typical discomfort signals in interaction,
which are easy to measure. When we feel uncomfortable with a conversational partner, we tend to shift our position and to watch the partner more closely; both of these reactions raise the reward value that the system tries to minimize. The gazing proportion was determined by whether the subject's head direction fell within a fixed angular interval around the direction toward the robot (see Fig. 2).
Fig. 2. The reward function and the angular interval determined as human gazing at the robot.
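As a minimal sketch of Eq. (1), the reward could be computed from the two measured signals as follows; the function and argument names are ours, chosen for illustration.

def interaction_reward(movement_mm, gazing_proportion):
    """Reward of Eq. (1), to be minimized by the adaptation system."""
    # movement_mm: body movement of the subject during one interaction, in mm
    # gazing_proportion: fraction of the interaction spent gazing directly at the robot
    return 0.2 * movement_mm + 500.0 * gazing_proportion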
III. THE EXPERIMENT
A. The environment and the robot
The robot used in the experiment was Robovie II. It performed ten interactive behaviors, each belonging to one of three classes of interaction (intimate, personal, or social), as listed in Table I. The experimental setup is shown in Fig. 3.
Fig. 3. Experimental setup.
Fig. 4. Robovie II and the behaviors.
TABLE I
BEHAVIOR CLASSES
Class      Name
intimate   Hug
personal   Shake hands
personal   Ask where person comes from
personal   Ask if robot is cute
personal   Ask person to touch robot
social     Play paper-scissors-stone
social     Play pointing game
social     Perform arm-swinging exercise
social     Hold “thank you” monologue
social     Look at human without speaking

TABLE II
PARAMETER VALUES AND STEP SIZES
#   Parameter           Initial value   Step size
1   intimate distance   50 cm           15 cm
2   personal distance   80 cm           15 cm
3   social distance     100 cm          15 cm
4   gazing ratio        0.7             0.1
5   waiting time        0.17 s          0.3 s
6   speed factor        1.0             0.1
B. Experimental procedure
IV. EXPERIMENTAL RESULTS
A. The results
Fig. 5. Learned distance parameters and preferences for 12 subjects. Asterisks show the learned parameters. The shorter bars show the interval for acceptable distance and the longer bars show the preferred value.
! "#$%& "& '() ! # "% * +# , $
- - ! . /
$ - 0 1 0 '() 2 $
2
4
6
8
10
Motion speed
1.5
12
subject#
1
0.5
0
2
4
6
8
10
12
subject#
Fig. 6. Learned gaze and speed parameters and indicated preferences for
12 subjects. Circles show what parameter settings the subject indicated as
preferable, x's show the non-indicated settings. In the cases where values
outside the given settings were indicated, a triangle shows the preferred value.
TABLE III
AVERAGE DEVIATION FROM PREFERRED VALUE (NORMALIZED TO STEP-SIZE UNITS)
parameter   average deviation   initial deviation
intimate    0.9                 1.8
personal    1.3                 1.6
social      1.3                 1.2
gaze        1.0                 2.4
wait        0.8                 1.5
speed       1.1                 1.4
B. Successful experimental runs
Fig. 7. Results achieved for subject 10. The dotted lines represent the measured preferences of the subject. The dashed lines represent the most preferred values.
C. Partial success - content subject
Fig. 8. Results achieved for subject 5. The lines represent the measured preferences of the subject as in Fig. 7.
D. Successful run - discontent subject
Fig. 9. Results achieved for subject 7. The lines represent the measured preferences of the subject as in Fig. 7.
E. Partially successful runs
Fig. 10. Results achieved for subject 1.
Fig. 11. Results achieved for subject 3. The lines represent the measured preferences of the subject as in Fig. 7.
+ F. Unsuccessful run
! "$ , personal
gazing - Intimate
! Gazemeeting
Proportio n
Dist. [cm]
80
60
40
20
) 1
0.8
0.6
0
0
10
20
30
Time [min]
0
10
20
Waiting
$ * 30
Time [min]
+
1.5
100
Time [s]
Dist [cm]
Personal
50
+
1
0.5
0
0
0
10
30
Time [min]
Social
0
150
100
50
10
20
1.6
10
20
1.2
1
30
Time [min]
1.4
" 0.8
0
+
+
+ * 30
Time [min]
Speed
Speed factor
200
Dist [cm]
20
0
10
20
30
Time [min]
# ! *
!
" # ! #
$ %
& # ' # # ! ! ! #
( ! !
, - ! "!# !$ !
*
% ! ! . % %%!%
&' ( )* + $ )
,-. /- 0 12343150 55
&' % 6
#
#/ +
0 77
&3' # # 8 9 $: %
"0 0 +
0
&;' <0 6 0 = 0 0 !9 #
- >> !
0 ?;?4?11
%%%0 55
&1' ! <
+ +
/ @
!
0 30 747; %%%0 55;
&7' 0 = 0 6 0 "@ +
) # =
=
"### $ !
0
1341?0 &' < !9 % !& %%
$
! '
%( +# 0 </ #0 ?
&?' A !9 " !0 23343;0 55
&' % "
2 : )0 ;BC0
7
&5' 90 =0 < <0 0 6 D 9
#/ : *+% !
',--.(0 ?-?70 %%%0 55;
&' <0 6 0 = 0 )/ = "
/
6- ' ,--/(0 -?0 553