Robot Behavior Adaptation for Human-Robot Interaction based on Policy Gradient Reinforcement Learning∗

Noriaki Mitsunaga1, Christian Smith1,2, Takayuki Kanda1, Hiroshi Ishiguro1,3, Norihiro Hagita1
1 ATR Intelligent Robotics and Communication Laboratories
2 Royal Institute of Technology, Stockholm
3 Graduate School of Engineering, Osaka University
[email protected]

Abstract— In this paper we propose an adaptation mechanism for robot behaviors that makes robot-human interactions run more smoothly. The mechanism is based on reinforcement learning: it reads minute body signals from a human partner and uses this information to adjust interaction distances, gaze meeting, and motion speed and timing in human-robot interaction. An experiment with twelve subjects shows that this enables autonomous adaptation to individual preferences.

Index Terms— policy gradient reinforcement learning (PGRL), human-robot interaction, behavior adaptation, proxemics.

I. INTRODUCTION

[The text of the introduction is illegible in the source.]

∗ This research was supported by the National Institute of Information and Communications Technology of Japan.

II. THE BEHAVIOR ADAPTATION SYSTEM

A. Behavior adaptation

The adaptation system reads the discomfort signals, body movement and gazing, displayed by the human. These values are then used as the reward that is to be minimized by the system. The system consists of the policy gradient reinforcement learning (PGRL) algorithm, which minimizes this reward by changing its policy; the policy in turn determines the robot's behavior. In PGRL, the policy is directly adjusted to minimize the reward, whereas other reinforcement learning techniques, such as Q-learning, learn an action-value function from which the policy is then extracted. We used a policy consisting of six adapted parameters: three classes of interaction distances and three interaction parameters. The following sections explain in detail the adapted parameters, the PGRL algorithm, and the reward function.

B. Adapted Parameters

In this experiment we used six different parameters: the interaction distance for each of three classes of interaction (explained in detail in III-A), the extent to which the robot would meet a human's gaze, the waiting time between utterance and action, and the speed at which motions were carried out. We chose these parameters because they appear to have a large impact on interaction, have low implementation costs, and keep the number of parameters, and thereby the dimensionality of the search space, at a minimum. The distances were measured as the horizontal distance between the robot's and the human's foreheads. For gaze meeting, the robot would meet the human's gaze and then look away, with a cycle length of 5 seconds, which is the average cycle length in human-human interaction according to [3]. The gaze-meeting parameter is the proportion of the cycle spent meeting the human subject's gaze. The speed of the robot was divided into two parameters: one controlled how long the robot would wait between utterance and action (e.g. the waiting time between saying "let's shake hands" and reaching out with its right arm), and the other controlled the actual speed of the motion itself.
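As a concrete illustration, the six-parameter policy and its step sizes can be held in a small table. The initial values and step sizes below are those of Table II; the Python representation itself is only an illustrative sketch, not code from the paper:

    # The six adapted parameters (policy Theta), initialized as in Table II,
    # together with the per-parameter step sizes epsilon_j used for perturbation.
    policy = {
        "intimate_distance_cm": 50.0,   # step size 15 cm
        "personal_distance_cm": 80.0,   # step size 15 cm
        "social_distance_cm":  100.0,   # step size 15 cm
        "gazing_ratio":          0.7,   # step size 0.1
        "waiting_time_s":        0.17,  # step size 0.3 s
        "speed_factor":          1.0,   # step size 0.1
    }
    step_sizes = [15.0, 15.0, 15.0, 0.1, 0.3, 0.1]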
C. The PGRL Algorithm

The learning algorithm used in this experiment is policy gradient reinforcement learning (PGRL), a reinforcement learning method that directly adjusts the policy without calculating action-value functions (detailed descriptions can be found in [1] and [5]). Figure 1 shows the algorithm [5] in pseudo code.

The vector Θ denotes the current policy, which has n parameters. A total of T perturbations of Θ are generated, each is tested with a person, and the reward function is evaluated. A perturbation Θ_t of Θ is generated by randomly adding ε_j, 0, or −ε_j to each element θ_j in Θ. The step sizes ε_j are set independently for each parameter. When all T perturbations have been run, the gradient A of the reward function in the parameter space is approximated by calculating the partial derivatives for each parameter: for each parameter θ_j, the average rewards over the runs in which ε_j was added, in which no change was made, and in which ε_j was subtracted are computed. The gradient in dimension j is then regarded as 0 if the reward is greatest for the unperturbed parameter, and otherwise as the difference between the average rewards for the perturbed parameters.

     1  Θ = {θ_j} ← initial parameter set vector of size n
     2  Θ_t = {θ_{j,t}} : perturbed vector derived from Θ
     3  ε ← parameter step size vector of size n
     4  η ← overall step size
     5  while (not done)
     6      for t = 1 to T
     7          for j = 1 to n
     8              r ← unbiased random choice from {−1, 0, 1}
     9              θ_{j,t} ← θ_j + ε_j · r
    10          run system using parameter set Θ_t, evaluate reward
    11      for j = 1 to n
    12          Avg_{+,j} ← average reward for all Θ_t with positive perturbation in dimension j
    13          Avg_{0,j} ← average reward for all Θ_t with zero perturbation in dimension j
    14          Avg_{−,j} ← average reward for all Θ_t with negative perturbation in dimension j
    15          if (Avg_{0,j} > Avg_{+,j}) AND (Avg_{0,j} > Avg_{−,j})
    16              a_j ← 0
    17          else
    18              a_j ← Avg_{+,j} − Avg_{−,j}
    19      A ← (A / |A|) · η
    20      a_j ← a_j · ε_j , ∀j
    21      Θ ← Θ + A

Fig. 1. The PGRL algorithm.

When the gradient A has been calculated, it is normalized to the overall step size η and scaled by the individual step sizes in each dimension. The parameter set Θ is then adjusted by adding A. This method is guaranteed to converge towards a local optimum given a static reward function (for a more stringent analysis of the algorithm, see [1]). Given a stochastic reward function, possibly one that evolves over time, PGRL will continuously move towards the vicinity of the current local optimum, but of course no convergence guarantees can then be given. The main advantages over other reinforcement learning methods are that little knowledge of the dynamic model behind the system is required (as compared to dynamic programming), and that since large tables, such as those of Q-learning, need not be calculated, learning should be considerably quicker.

D. Reward Function

The reward function was based on the amount of movement of the subject and the proportion of the time spent gazing directly at the robot during one interaction,

    R = 0.2 × (movement [mm]) + 500 × (gazing proportion).    (1)

Figure 2 shows the block diagram. These two features were chosen because they are typical discomfort signals in interaction and are easy to measure: when we feel uncomfortable with a conversational partner, we tend to shift our bodies and monitor the partner with our gaze, so large values of these signals indicate discomfort that the system should reduce.

Fig. 2. The reward function and the angular interval determined as the human gazing at the robot.
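To make the loop of Figure 1 concrete, the following Python sketch implements one PGRL update for the six parameters above. This is our illustrative reconstruction, not the authors' code: run_interaction stands in for one interaction episode with the measurement system, the number of perturbations T is not recoverable from the source, and since reward (1) is a discomfort score to be minimized, we negate it so that the ascent step of Figure 1 reduces discomfort (an assumption on our part).

    import random

    def discomfort(movement_mm, gazing_prop):
        # Eq. (1): the discomfort score the system tries to minimize.
        return 0.2 * movement_mm + 500.0 * gazing_prop

    def pgrl_update(theta, eps, eta, run_interaction, T=6):
        # One iteration of the while-loop in Fig. 1. T=6 is a placeholder;
        # the paper's actual number of perturbations is not given here.
        n = len(theta)
        signs_list, rewards = [], []
        for _ in range(T):
            signs = [random.choice((-1, 0, 1)) for _ in range(n)]
            theta_t = [theta[j] + eps[j] * signs[j] for j in range(n)]
            movement_mm, gazing_prop = run_interaction(theta_t)  # one episode
            signs_list.append(signs)
            # Negated so that moving towards higher "reward" lowers discomfort.
            rewards.append(-discomfort(movement_mm, gazing_prop))
        overall = sum(rewards) / len(rewards)
        a = [0.0] * n
        for j in range(n):
            avg = {}
            for s in (-1, 0, 1):
                rs = [r for p, r in zip(signs_list, rewards) if p[j] == s]
                avg[s] = sum(rs) / len(rs) if rs else overall  # fallback if a sign was never drawn
            if avg[0] > avg[1] and avg[0] > avg[-1]:
                a[j] = 0.0                     # unperturbed value already best in this dimension
            else:
                a[j] = avg[1] - avg[-1]
        norm = sum(x * x for x in a) ** 0.5
        if norm > 0.0:
            a = [eta * x / norm for x in a]    # normalize to overall step size eta
        a = [a[j] * eps[j] for j in range(n)]  # scale by per-parameter step sizes
        return [theta[j] + a[j] for j in range(n)]

Starting from theta = [50, 80, 100, 0.7, 0.17, 1.0] and eps = [15, 15, 15, 0.1, 0.3, 0.1] as in Table II, repeated calls to pgrl_update would walk the policy towards a local optimum of the subject's comfort.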
III. THE EXPERIMENT

A. The environment and the robot

[The description of the environment and the robot is largely illegible in the source; the robot used was Robovie-II, and its interactive behaviors were grouped into the intimate, personal, and social classes listed in Table I.]

Fig. 3. Experimental setup.

Fig. 4. Robovie-II and the behaviors.

TABLE I
BEHAVIOR CLASSES

    Class      Name
    intimate   Hug
    personal   Shake hands
    personal   Ask where person comes from
    personal   Ask if robot is cute
    personal   Ask person to touch robot
    social     Play paper-scissors-stone
    social     Play pointing game
    social     Perform arm-swinging exercise
    social     Hold thank-you monologue
    social     Look at human without speaking

TABLE II
PARAMETER VALUES AND STEP SIZES

    #   Parameter           Initial value   Step size
    1   intimate distance   50 cm           15 cm
    2   personal distance   80 cm           15 cm
    3   social distance     100 cm          15 cm
    4   gazing ratio        0.7             0.1
    5   waiting time        0.17 s          0.3 s
    6   speed factor        1.0             0.1

B. Experimental procedure

[Illegible in the source.]

IV. EXPERIMENTAL RESULTS

A. The results

[The text of this subsection is largely illegible in the source; Figures 5 and 6 and Table III summarize the learned parameters against the subjects' measured preferences.]

Fig. 5. Learned distance parameters and preferences for 12 subjects. Asterisks show the learned parameters; the shorter bars show the interval of acceptable distances and the longer bars show the preferred values.

Fig. 6. Learned gaze and speed parameters and indicated preferences for 12 subjects. Circles show the parameter settings the subject indicated as preferable, 'x's show the non-indicated settings. Where a value outside the given settings was indicated, a triangle shows the preferred value.

TABLE III
AVERAGE DEVIATION FROM PREFERRED VALUE (NORMALIZED TO STEP-SIZE UNITS)

    Parameter   Average deviation   Initial deviation
    intimate    0.9                 1.8
    personal    1.3                 1.6
    social      1.3                 1.2
    gaze        1.0                 2.4
    wait        0.8                 1.5
    speed       1.1                 1.4

B. Successful experimental runs

[Illegible in the source.]

Fig. 7. Results achieved for subject 10. The dotted lines represent the measured preferences of the subject; the dashed lines represent the most preferred values.

C. Partial success - content subject

[Illegible in the source.]

Fig. 8. Results achieved for subject 5. The lines represent the measured preferences of the subject, as in Figure 7.

D. Successful run - discontent subject

[Illegible in the source.]
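Table III normalizes each deviation to step-size units. A minimal sketch of how such a figure could be computed, assuming the straightforward reading |learned − preferred| / step size (the numeric values in the example are hypothetical, not data from the paper):

    # Deviation from the preferred value, in step-size units (our reading of Table III).
    def normalized_deviation(learned, preferred, step_size):
        return abs(learned - preferred) / step_size

    # Hypothetical example: a learned personal distance of 95 cm against a
    # preferred 75 cm, with the 15 cm step size of Table II, is ~1.3 steps off.
    print(normalized_deviation(95.0, 75.0, 15.0))  # 1.33...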
E. Partially successful runs

[The text of this subsection is largely illegible in the source; it discusses the personal, social, intimate, gaze-meeting, and waiting parameters for the subjects shown in Figures 9-11.]

Fig. 9. Results achieved for subject 7. The lines represent the measured preferences of the subject, as in Figure 7.

Fig. 10. Results achieved for subject 1.

Fig. 11. Results achieved for subject 3. The lines represent the measured preferences of the subject, as in Figure 7.

F. Unsuccessful run

[The text of this subsection is largely illegible in the source; it discusses the personal distance and gazing parameters.]

[The concluding discussion is illegible in the source.]

REFERENCES

[The reference entries are illegible in the source and are not reproduced; the text above cites [1], [3], and [5].]