
Development of Self-Learning Vision-Based Mobile Robots for Acquiring Soccer Robots Behaviors
Takayuki Nakamura
Nara Inst. of Science and Technology, Dept. of Information Systems,
8916-5, Takayama-cho, Ikoma, Nara 630-01, Japan. Email: [email protected]
Abstract.
The input generalization problem is one of the most important issues in applying reinforcement learning to real robot tasks. To cope with this problem, we propose a self-partitioning state space algorithm which makes a non-uniform quantization of a multidimensional continuous state space. The method recursively splits the continuous state space into coarse spaces called tentative states, based on a relevance test applied to the immediate reward r and the discounted future reward Q collected during the Q-learning process. When it finds that a tentative state is relevant according to a statistical test based on a minimum description length (hereafter, MDL) criterion, it partitions this coarse space into finer spaces. To show that our algorithm has generalization capability, we apply it to two tasks in which a soccer robot shoots a ball into a goal and prevents a ball from entering a goal. To show the validity of the method, experimental results for computer simulation and a real robot are presented.
Key Words. Self-organizing algorithm, Reinforcement learning, Vision-based mobile robots, Soccer robots
1 Introduction
Recently, many researchers in robotics, e.g., (Connel and Mahadevan, 1993), have paid much attention to reinforcement learning methods, by which adaptive, reflexive and purposive behaviors of robots can be acquired without modeling the environment or the robot's kinematic parameters.
A problem in applying reinforcement learning methods to real robot tasks is that a value function (a prediction of the return available from each state, which the robot can use to decide its next action; see (Kaelbling, 1993) for more details) should be able to represent the value of infinitely many state-action pairs, because the state space is generally represented by real-valued variables. For this reason, function approximators are used to represent the value function when a closed-form solution for the optimal policy is not available.
One approach that has been used to represent the value function is to quantize the state and action spaces into a finite number of cells and collect reward and punishment in terms of all states and actions. This is one of the simplest forms of generalization, in which all the states and actions within a cell have the same value. In this way, the value function is approximated as
a table in which each cell has a specific value. Chapman et al. (Chapman and Kaelbling, 1991) proposed an input generalization method which splits an input vector consisting of a bit sequence of states, based on already structured actions such as "shoot a ghost" and "avoid an obstacle." However, the original states have already been abstracted, and therefore the method seems difficult to apply to the continuous raw sensor space of the real world. Moore et al. (Moore and Atkeson, 1995) proposed a method to resolve the problem of learning to achieve given tasks in deterministic high-dimensional continuous spaces. It divides the continuous state space into cells such that, in each cell, the available actions aim at the neighboring cells. This aiming is accomplished by a local controller, which must be provided in advance as prior knowledge of the given task. The graph of cell transitions is solved for shortest paths in an online incremental manner, and a minimax criterion is used to detect when a group of cells is too coarse to prevent movement between obstacles or to avoid limit cycles. The offending cells are split to higher resolution. Eventually, the environment is divided up just enough to choose appropriate actions for achieving the goal. However, the restriction of this method to deterministic environments might limit its applicability, since the real environment is often non-deterministic.
This paper proposes a new method for incrementally dividing a multidimensional continuous state space into discrete states. The method recursively splits the continuous state space into coarse spaces called tentative states. It begins by regarding these tentative states as the states for Q-learning. It collects immediate and discounted future rewards and their statistical evidence within this tentative state space. When it finds that a tentative state is relevant according to a statistical test based on an MDL criterion (Rissanen, 1989), it partitions this coarse space into finer spaces. These procedures yield a non-uniform quantization of the state space. Our method can be applied to non-deterministic domains because Q-learning is used to find the optimal policy for accomplishing the given task.
2 Self-Partitioning State Space Algorithm
2.1 Function Approximator with Non-uniform Resolution Model
There are several reasons why designing non-uniform function approximators may be more beneficial than designing uniform ones.
- If the designers have, to a certain degree, prior knowledge of the system (for example, which regions of the state-action space will be used more often), it may be efficient to design the function approximator so that it uses more resources in heavily visited regions than in regions of the state space that are known to be visited rarely.
- If the amount of resources is limited, a non-uniform function approximator may achieve better performance and learning efficiency than a uniform one, simply because the former can exploit the resources more efficiently than the latter.
- It may be possible to design function approximators that dynamically allocate more resources to certain regions of the state-action space and increase the resolution in such regions as required online.
2.2 Details of Our Algorithm
Here, we define the sensor inputs, actions and rewards as follows:
- A sensor input d is described by an N-dimensional vector d = (d_1, d_2, ..., d_N), each component d_i (i = 1, ..., N) of which represents the measurement provided by sensor i. The continuous value d_i is provided by sensor i, and its range Range(d_i) is known in advance. Based on Range(d_i), the measurement d_i is normalized so that it takes values in the semi-open interval [0, 1) (a minimal sketch of this normalization follows this list).
- The agent has a set A of possible actions a_j, j = 1, ..., M. This set is called the action space.
- One of the discrete rewards r = r_k, k = 1, ..., C is given to the agent depending on the evaluation of the action taken at a state.
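The normalization itself is straightforward. The following minimal Python sketch (not the paper's code; Range(d_i) is assumed to be available as a (low, high) pair) illustrates the mapping into [0, 1):

```python
# Minimal sketch of mapping raw measurements into the semi-open interval [0, 1).
def normalize(d, ranges):
    """Map raw sensor readings d = (d_1, ..., d_N) into [0, 1)^N."""
    out = []
    for value, (low, high) in zip(d, ranges):
        x = (value - low) / (high - low)
        # Clip so that the result stays strictly inside the semi-open interval.
        out.append(min(max(x, 0.0), 1.0 - 1e-9))
    return out

# Example: a ball position of 128 pixels with Range = (0, 256) maps to 0.5.
print(normalize([128.0], [(0.0, 256.0)]))   # -> [0.5]
```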
Our method utilizes a hierarchical segment tree in order to represent the non-uniform partitioning of the state space spanned by the N-dimensional input vector. This representation has an advantage for approximating a non-uniform distribution of sample data. The inner node at the i-th depth in the j-th level keeps the range b_i(j) = [t_i^low, t_i^high) of the measurement provided by sensor i. (Actually, j corresponds to the number of iterations of this algorithm.) At each inner node in the j-th level, the range of a measurement is partitioned into two equal intervals b_i^0(j) = [t_i^low, (t_i^low + t_i^high)/2) and b_i^1(j) = [(t_i^low + t_i^high)/2, t_i^high). For example, initially (j = 0), the range of each dimension i is divided into two equal intervals b_i^0(0) = [0.0, 0.5) and b_i^1(0) = [0.5, 1.0). When the sensor input vector d has N dimensions, a segment tree whose depth is N is built (see Fig. 1). A leaf node corresponds to the result of classification of an observed sensor input vector d. As a result, 2^N leaf nodes are generated. These leaf nodes represent the situations in the agent's environment. The state space represented by the leaf nodes is called the "tentative state space" TS. Let ts_k, k = 1, ..., 2^N, be the components of the tentative state space, which are called "tentative states."
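As an illustration only, the following Python sketch (an assumed implementation, not the paper's) represents each tentative state as a box of half-open intervals, classifies a normalized input vector into its leaf box, and bisects every dimension of a box into 2^N finer boxes, mirroring the construction of Fig. 1:

```python
# Sketch: non-uniform partitioning as boxes of half-open intervals [low, high).
from itertools import product

def initial_boxes(n_dims):
    """j = 0: every dimension is split into [0, 0.5) and [0.5, 1.0)."""
    halves = [(0.0, 0.5), (0.5, 1.0)]
    return [list(b) for b in product(halves, repeat=n_dims)]

def split_box(box):
    """Bisect each interval of a box -> 2^N finer boxes."""
    halves_per_dim = []
    for low, high in box:
        mid = (low + high) / 2.0
        halves_per_dim.append([(low, mid), (mid, high)])
    return [list(b) for b in product(*halves_per_dim)]

def classify(d, boxes):
    """Return the index of the tentative state (box) that contains d."""
    for k, box in enumerate(boxes):
        if all(low <= x < high for x, (low, high) in zip(d, box)):
            return k
    raise ValueError("d is outside [0, 1)^N")

boxes = initial_boxes(3)                    # 2^3 = 8 tentative states
k = classify([0.1, 0.6, 0.4], boxes)        # leaf containing the observation
boxes = boxes[:k] + boxes[k + 1:] + split_box(boxes[k])   # refine one state
```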
Our algorithm works as follows:
1. It starts by assuming that the entire environment is a single state. Initially, the total numbers of states S and tentative states TS are one and 2^N, respectively.
2. Based on TS, our algorithm begins Q-learning. In parallel with this process, it gathers statistics in terms of r(a_i | ts_k = on), r(a_i | ts_k = off), Q(a_i | ts_k = on) and Q(a_i | ts_k = off), which indicate the immediate rewards r and discounted future rewards Q when the individual state is "on" or "off," respectively. In this work, if the N-dimensional sensor vector d is classified into a leaf node ts_k, the condition of this node ts_k is regarded as "on"; otherwise (i.e., d is classified into a leaf node other than ts_k), it is regarded as "off."
3. After Q-learning based on TS has converged, our algorithm asks whether there are some states in TS such that the r and Q for the "on" case are significantly different from the corresponding values for the "off" case. When the distributions of the statistics of ts_k in the "on" and "off" cases are different, ts_k is determined to be relevant to the given task. In order to discover the difference between the two distributions, our algorithm performs a statistical test based on an MDL criterion. These procedures are explained in Sections 3.1 and 3.2.
4. (a) If there is a state ts_k' adjoining the state ts_k that has been shown to be relevant, such that the statistical characteristics of Q and the actions assigned at the adjoining state are the same, these two states are merged into one state.
   (b) Otherwise, this step is skipped.
5. Each leaf node ts_k is represented by a combination of intervals, each of which corresponds to the range of a measurement provided by sensor i. The intervals of a ts_k that has been shown to be relevant are bisected. As a result, for one such ts_k, 2^N leaf nodes are generated and correspond to new tentative states. These tentative states are regarded as the states in Q-learning at the next iteration.
6. The procedure from step 2 is repeated until our algorithm cannot find any relevant leaf nodes. Finally, a hierarchical segment tree is constructed to represent the partitioning of the state space for the achievement of a given task. (A schematic sketch of this loop is given after this list.)
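The outer loop might be organized as in the following schematic Python sketch. It is a sketch under assumptions: env is a hypothetical environment exposing reset()/step(), classify and split_box stand for the box helpers sketched above, is_relevant stands for the MDL relevance test of Section 3, and the merging of step 4 is omitted for brevity:

```python
# Schematic sketch of the overall loop: Q-learning on tentative states,
# MDL-based relevance test, then splitting of relevant leaves.
import random
from collections import defaultdict

def learn_state_space(env, boxes, actions, classify, split_box, is_relevant,
                      episodes=1000, alpha=0.25, gamma=0.9, epsilon=0.1):
    """Alternate Q-learning on the tentative states with MDL-based splitting."""
    while True:
        Q = defaultdict(float)                       # Q[(state_index, action)]
        stats = defaultdict(list)                    # per-state (action, r, Q) samples
        for _ in range(episodes):                    # step 2: Q-learning on TS
            d, done = env.reset(), False
            while not done:
                s = classify(d, boxes)
                a = (random.choice(actions) if random.random() < epsilon
                     else max(actions, key=lambda b: Q[(s, b)]))
                d_next, r, done = env.step(a)
                s_next = classify(d_next, boxes)
                target = r + gamma * max(Q[(s_next, b)] for b in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                stats[s].append((a, r, Q[(s, a)]))   # evidence for the relevance test
                d = d_next
        relevant = [k for k in range(len(boxes)) if is_relevant(k, stats)]
        if not relevant:                             # step 6: stop when nothing to split
            return boxes, Q
        kept = [b for k, b in enumerate(boxes) if k not in relevant]
        split = [sub for k in relevant for sub in split_box(boxes[k])]  # step 5
        boxes = kept + split
```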
[Figure 1: Representation of the state space by a hierarchical segment tree (example for N = 3).]

[Figure 2: Criterion for determining the relevance of the state: the frequency distributions of r or Q for the "on" and "off" cases have clearly separated peaks when the state is relevant and overlapping peaks when it is irrelevant.]
After the learning, the agent takes actions for accomplishing the given task based on the Q values stored at the leaf nodes.
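As a minimal sketch (an assumption, not the paper's code) of how the learned policy could be executed, the current sensor vector is classified into a leaf and the greedy action with respect to Q is taken:

```python
# Sketch: greedy execution of the learned policy over the leaf-node states.
def act(d, boxes, Q, actions, classify):
    s = classify(d, boxes)                               # leaf node containing d
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```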
3 The relevance test based on an MDL criterion
Here, we explain how to determine whether a state is relevant to the task or not. Fig. 2 shows the difference between the distributions of r or Q for the state ts_k in the cases where ts_k is relevant or irrelevant. As shown in the upper part of this figure, when the two peaks of the distributions of r or Q, which correspond to the pair r(a_i | ts_k = on) and r(a_i | ts_k = off), or the pair Q(a_i | ts_k = on) and Q(a_i | ts_k = off), can be clearly discriminated, ts_k is supposed to be relevant to the given task, because such a state affects the value of the state more heavily than the other states do, and therefore it affects how the robot should act at the next time step. On the contrary, when the two peaks are ambiguous, as shown in the bottom part of the figure, ts_k is considered to be irrelevant to the given task.
Actually, we perform the statistical test with respect to r and Q based on an MDL criterion (Rissanen, 1989) in order to distinguish the distributions of these reinforcement values.
3.1 The statistical test for r
Since the immediate reward r_j given at each trial is one of C mutually exclusive rewards r_j, j = 1, 2, ..., C, the distribution of r_j in the state ts_k follows a multinomial distribution.
Let n be the number of independent trials, k_i the number of occurrences of event E_i, and p_i the probability that event E_i occurs, where the p_i sum to one. The probability that E_1 occurs k_1 times, E_2 occurs k_2 times, ..., E_C occurs k_C times can be represented by the multinomial distribution as follows:
\[
p(k_1, \ldots, k_C \mid p_1, \ldots, p_C) = \frac{n!}{k_1! \cdots k_C!}\, p_1^{k_1} \cdots p_C^{k_C},
\]
where $0 \le k_i \le n$ $(i = 1, \ldots, C)$ and $\sum_{i=1}^{C} k_i = n$.
Supposing that Tab. 1 shows the distribution of immediate rewards in the cases where the state ts_k is "on" or "off," our algorithm tries to find the difference between the distributions of rewards in the two cases, "on" and "off," based on this table.
Table 1: The distribution of r_j in ts_k

           On            Off
  r_1      n(On, r_1)    n(Off, r_1)
  r_2      n(On, r_2)    n(Off, r_2)
  ...      ...           ...
  r_C      n(On, r_C)    n(Off, r_C)
  total    n(On)         n(Off)

Here, n(i) is the frequency of sample data in the state i (i = 1, ..., S), n(i, r_j) is the frequency of reward r_j (j = 1, ..., C) given in the state i, and p(r_j | i) is the probability that reward r_j is given in the state i, where
\[
\sum_{j=1}^{C} p(r_j \mid i) = 1, \qquad \sum_{j=1}^{C} n(i, r_j) = n(i) \qquad (i = 1, \ldots, S).
\]
The probability $P(\{n(i, r_j)\} \mid \{p(r_j \mid i)\})$ that the distribution of r_j is acquired as shown in Tab. 1, i.e., n(i, r_j) for i = 1, ..., S and j = 1, ..., C, can be described as follows:
\[
P(\{n(i, r_j)\} \mid \{p(r_j \mid i)\}) = \prod_{i=1}^{S} \left\{ \frac{n(i)!}{\prod_{j=1}^{C} n(i, r_j)!} \prod_{j=1}^{C} p(r_j \mid i)^{n(i, r_j)} \right\}.
\]
The likelihood function L of this multinomial distribution can be written as follows:
\[
L(\{p(r_j \mid i)\}) = \sum_{i=1}^{S} \sum_{j=1}^{C} n(i, r_j) \log p(r_j \mid i) + \log \frac{\prod_{i=1}^{S} n(i)!}{\prod_{i=1}^{S} \prod_{j=1}^{C} n(i, r_j)!}. \qquad (1)
\]
When the two multinomial distributions for ts_k = on and ts_k = off can be considered to be the same, the probability p(r_j | i) that r_j is given in the state i can be modeled with a parameter θ as follows:
\[
\mathrm{M1:}\quad p(r_j \mid i) = \theta(r_j), \qquad i = 1, \ldots, S,\ j = 1, \ldots, C.
\]
On the contrary, when the two multinomial distributions for "on" and "off" can be considered to be different, the probability p(r_j | i) can be modeled as follows:
\[
\mathrm{M2:}\quad p(r_j \mid i) = \theta(r_j \mid i), \qquad i = 1, \ldots, S,\ j = 1, \ldots, C.
\]
Based on Eq. (1), we can derive the likelihood function and maximum likelihood for each of the models M1 and M2 (see (Nakamura, 1998(to appear)) for the detailed derivation). The MDL principle is a very powerful and general approach which can be applied to any inductive learning task. It appeals to Occam's razor: the intuition that the simplest model which explains the data is the best one. The simplicity of a model is judged by its description length, and its ability to explain the data is measured by the number of bits required to describe the data given the model. Based on the MDL principle, we can calculate the description lengths l_MDL(M1) and l_MDL(M2) for M1 and M2, respectively (see (Nakamura, 1998(to appear)) for the detailed derivation).
We can suppose that discovering the difference between the two distributions is equivalent to determining which model is appropriate for representing the distribution of the data. Therefore, the difference between the distributions is found as follows: if l_MDL(M1) > l_MDL(M2), the two distributions are different; otherwise, the two distributions are the same.
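For illustration, the following Python sketch applies the decision rule l_MDL(M1) > l_MDL(M2) using the generic two-part MDL form (negative maximized log-likelihood plus (k/2) log n for k free parameters). The exact code lengths used in the paper are derived in (Nakamura, 1998(to appear)), so the penalty terms here are an assumption of this sketch:

```python
# Sketch of the relevance test for r with a generic two-part MDL score.
import math

def _log_lik(counts):
    """Maximized multinomial log-likelihood: sum_j k_j * log(k_j / n)."""
    n = sum(counts)
    return sum(k * math.log(k / n) for k in counts if k > 0)

def reward_distributions_differ(n_on, n_off):
    """n_on[j], n_off[j]: frequency of reward r_j when ts_k is on / off."""
    C = len(n_on)
    n = sum(n_on) + sum(n_off)
    pooled = [a + b for a, b in zip(n_on, n_off)]
    mdl_m1 = -_log_lik(pooled) + (C - 1) / 2.0 * math.log(n)          # one shared distribution
    mdl_m2 = (-(_log_lik(n_on) + _log_lik(n_off))
              + 2 * (C - 1) / 2.0 * math.log(n))                      # one distribution per case
    return mdl_m1 > mdl_m2           # True -> ts_k is judged relevant

# Example: rewards concentrate on r_1 when "on" and on r_2 when "off".
print(reward_distributions_differ([40, 5, 5], [5, 40, 5]))            # -> True
```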
3.2 The statistical test for Q
In order to distinguish the distributions of the sampled data of Q, we perform a statistical test based on an MDL criterion. Let x^n and y^m be the sample data (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_m), respectively; x^n and y^m indicate the histories of Q(a_i | ts_k = on) and Q(a_i | ts_k = off), respectively. We would like to know whether these two samples x^n and y^m come from two different distributions or from the same distribution. Here, we assume the following two models for the distribution of the sampled data: M1, based on one normal distribution N(μ, σ²), and M2, based on two normal distributions N(μ_1, σ_0²) and N(μ_2, σ_0²). The normal distribution with mean μ and variance σ² is defined by
\[
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}.
\]
When x^n and y^m follow the model M1 or M2, the probability density function for each model can be written as
\[
\mathrm{M1:}\ \prod_{i=1}^{n} f(x_i; \mu, \sigma^2) \prod_{i=1}^{m} f(y_i; \mu, \sigma^2), \qquad
\mathrm{M2:}\ \prod_{i=1}^{n} f(x_i; \mu_1, \sigma_0^2) \prod_{i=1}^{m} f(y_i; \mu_2, \sigma_0^2).
\]
Based on these equations, we can derive the likelihood function, maximum likelihood and description lengths l_MDL(M1) and l_MDL(M2) for each of the models M1 and M2 (see (Nakamura, 1998(to appear)) for the detailed derivation). Based on the MDL criterion, we can recognize the difference between the distributions as follows: if l_MDL(M1) > l_MDL(M2), x and y arise from different normal distributions; otherwise, x and y arise from the same normal distribution.
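As a hedged illustration of this test, the following Python sketch fits both models by maximum likelihood and compares two-part MDL scores; the parameter counts (2 for M1, 3 for M2) and the penalty form are assumptions of this sketch, not the paper's exact derivation:

```python
# Sketch of the relevance test for Q: one normal vs. two normals with a shared variance.
import math

def _gauss_max_loglik(sq_residual_sum, n):
    var = max(sq_residual_sum / n, 1e-12)         # MLE of the variance
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def q_distributions_differ(x, y):
    """x: history of Q(a_i | ts_k = on), y: history of Q(a_i | ts_k = off)."""
    n, m = len(x), len(y)
    all_q = x + y
    mu = sum(all_q) / (n + m)
    mu1, mu2 = sum(x) / n, sum(y) / m
    ss_all = sum((q - mu) ** 2 for q in all_q)                         # residuals under M1
    ss_within = (sum((q - mu1) ** 2 for q in x)
                 + sum((q - mu2) ** 2 for q in y))                     # residuals under M2
    mdl_m1 = -_gauss_max_loglik(ss_all, n + m) + 2 / 2.0 * math.log(n + m)
    mdl_m2 = -_gauss_max_loglik(ss_within, n + m) + 3 / 2.0 * math.log(n + m)
    return mdl_m1 > mdl_m2            # True -> ts_k is judged relevant

print(q_distributions_differ([0.9, 0.8, 0.95, 0.85], [0.1, 0.2, 0.05, 0.15]))  # -> True
```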
4 Experimental Results
The experiment consists of two phases: first, learn the optimal policy through computer simulation, then apply the learned policy to a real situation. To show that our algorithm has generalization capability, we apply it to acquire two different behaviors: a shooting behavior and a defending behavior for soccer robots. In this work, we assume that our robot does not know the location and the size of the goal, the size and the weight of the ball, any camera parameters such as focal length and tilt angle, or its own kinematics/dynamics.
4.1 Simulation
We performed the computer simulation with parameters such as the ball, goal and robot sizes, the camera parameters, the friction between the floor and the wheels and the bouncing factor between the robot and the ball chosen to simulate the real world (see (Nakamura, 1998(to appear)) for the specifications of the simulation).
The robot is driven by two independent motors and steered by front and rear wheels driven by one motor. Since we can send motor control commands such as "move forward or backward in the given direction," we have 10 actions altogether in the action primitive set A. The robot continues to take one action primitive at a time until the current state changes. This sequence of action primitives is called an action. Since a stop motion does not cause any change in the environment, we do not take this action primitive into account.
The size of the image taken by the camera is 256 × 240 pixels. An input vector x to our algorithm consists of:
- x_1: the horizontal position of the ball in the image, which ranges from 0 to 256 pixels,
- x_2: the horizontal position of the goal in the image, ranging from 0 to 256 pixels,
- x_3: the area of the goal region in the image, which ranges from 0 to 256 × 240 pixels².
After the ranges of these values are normalized into the semi-open interval [0, 1), they are used as inputs to our method.
A discounting factor γ is used to control the degree to which rewards in the distant future affect the total value of a policy. In our case, we set the value slightly less than 1 (γ = 0.9). In this work, we set the learning rate α = 0.25.
When the shooting behavior is to be acquired by our method, as a reward value, 1 is given when the robot succeeds in shooting the ball into the goal, 0.3 is given when the robot just kicks the ball, -0.01 is given when the robot goes out of the field, and 0 is given otherwise. In the same way, when the defending behavior is to be acquired by our method, as a reward value, -0.7 is given when the robot fails to prevent the ball from entering the goal, 1.0 is given when the robot just kicks the ball, -0.01 is given when the robot goes out of the field, and 0 is given otherwise.
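As a small illustration, the following Python sketch combines the reward values quoted above for the shooting behavior with the standard tabular Q-learning update using α = 0.25 and γ = 0.9 (the event names are hypothetical):

```python
# Sketch: reward scheme for the shooting behavior plus a one-step Q-learning update.
ALPHA, GAMMA = 0.25, 0.9

def shooting_reward(event):
    """Reward values quoted in the text for the shooting behavior."""
    return {"goal": 1.0, "kick": 0.3, "out_of_field": -0.01}.get(event, 0.0)

def q_update(Q, s, a, r, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)
    return Q

# Example: the robot kicked the ball (r = 0.3) and moved from state 5 to state 6.
Q = q_update({}, s=5, a="forward", r=shooting_reward("kick"),
             s_next=6, actions=["forward", "backward"])
print(Q)    # {(5, 'forward'): 0.075}
```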
In the learning process, Q-learning continues until the sum of the estimated Q values appears to have almost converged. When our algorithm acquired the shooting behavior, it ended after iterating the process (Q-learning + statistical test) 8 times. In this case, about 160K trials were required for our algorithm to converge, and the total number of states was 246. When our algorithm acquired the defending behavior, it ended after iterating the process (Q-learning + statistical test) 5 times. In this case, about 100K trials were required for our algorithm to converge, and the total number of states was 141.
Fig. 3 shows the success ratio versus the number of trials in the two learning processes, one for acquiring the shooting behavior and the other for the defending behavior. We define the success rate as (# of successes)/(# of trials) × 100(%). As can be seen, the larger the iteration number, the higher the success ratio at the final step of each iteration. This means that our algorithm gradually made a better segmentation of the state space for accomplishing the given task.
[Figure 3: The success ratio versus the number of trials in the learning process. (a) In case of shooting behavior (iterations No. 1-8). (b) In case of defending behavior (iterations No. 1-5).]
Fig. 4 shows the partitioned state spaces obtained by our method: (a) shows the state space for the shooting behavior, and (b) shows the one for the defending behavior. In each figure, Dim. 1, Dim. 2 and Dim. 3 indicate the position of the ball, the position of the goal and the area of the goal region, respectively. Furthermore, in each figure, Action 2 and Action 7 correspond to "moving forward" and "moving backward," respectively.
For the sake of the readers' understanding, one cube in the partitioned state space corresponds to one state. For example, the left figure in Fig. 4 (a) shows the group of cubes where action 2 is assigned as the optimal action. As shown in this figure, many cubes to which forward actions are assigned concentrate around the center of the entire state space. This means that the robot will take a forward action if the ball and the goal are observed around the center of its field of view. This is a very natural behavior for shooting a ball into a goal.
In the right figure of Fig. 4 (b), there is one large cube. This means that the robot will take a backward action if a large goal is observed around the center of its field of view. This strategy is a plausible behavior for preventing a ball from entering the goal, because the robot has to go back in front of its own goal after it has moved away from there in order to kick a ball.

[Figure 4: The partitioned state space. (a) The state space for shooting behavior (Action 2, Action 7). (b) The state space for defending behavior (Action 2, Action 7).]
4.2 Real Robot Experiments
We developed our real robot system to take part in the RoboCup-97 competition, in which several robotic teams compete on a field. The system includes two robots with the same structure: one is a shooter and the other is a defender. An off-board computer, an SGI ONYX (R4400/250MHz), perceives the environment through the on-board cameras, performs the decision making based on the learned policy, and sends motor commands to each robot. A CCD camera is set at the bottom of each robot in an off-centered position. Each robot is controlled by the SGI ONYX through a radio RS232C link. The maximum vehicle speed is about 5 cm/s. The images taken by the CCD camera on each robot are transmitted to a video signal receiver. In order to process two images (one sent from the shooter robot, the other from the defender robot) simultaneously, the two video signals are combined into one video signal by a video combiner on a PC. Then, the video signal is sent to the SGI ONYX for image processing. A color-based visual tracking routine is implemented for tracking and finding the ball and the goal in the image. In our current system, it takes 66 ms to do this image processing for one frame.
In the real robot experiments, our robots succeeded in shooting a ball into the goal and in preventing a ball from entering the goal based on the state space obtained by our method. Although the robot often failed to shoot the ball, it moved backward so as to find a position from which to shoot, and finally succeeded in shooting. Note that this backward motion for a retry is purely a result of learning and is not hand-coded. When the robot tried to prevent a ball from entering the goal, it always moved back in front of its own goal so as to detect a shot ball as soon as possible. Note that this behavior is also a result of our learning algorithm and is not hand-coded (see (Nakamura, 1998(to appear)) for more details).
5 Concluding Remarks
We have proposed a method for constructing the state space based on experiences, and have shown the validity of the method with computer simulations and real robot experiments. We can regard the problem of state space construction as a "segmentation" problem. In computer vision, the "segmentation problem" has been attacked since the early stages as the "image segmentation problem." Since the evaluation of the results is left to the programmers, the validity and limitations of such methods have remained ambiguous. From the viewpoint of robotics, the segmentation of sensory data from experience depends on the purpose (task), the capabilities (sensing, acting, and processing) of the robot, and its environment, and its evaluation should be based on the robot's performance. The state space obtained by our method (Fig. 4 indicates a projection of such a space) corresponds to the subjective representation the robot uses to accomplish a given task. Although it seems very limited, such a representation, an inside view of the world for the robot, shows how the robot segments the world. This view is intrinsic to the robot; based on it, the robot might make subjective decisions when facing different environments, and furthermore the robot might develop its view through its experiences (interactions with its environment). That is, the robot might acquire a subjective criterion, and as a result, the emerged behavior can be observed as "autonomous" and/or "intelligent."
6 Acknowledgments
The main idea of this paper was conceived while I stayed at the AI Lab of the Comp. Sci. Dept. of Brown University. I would like to thank Prof. L. P. Kaelbling for her helpful comments during my stay. I would also like to thank Prof. M. Imai (NAIST) for providing research funds and S. Morita (Japan SGI Cray Corp.) for lending the SGI ONYX to me.
7 REFERENCES
Chapman, D. and L. P. Kaelbling (1991). "Input generalization in delayed reinforcement learning: An algorithm and performance comparisons". In: Proc. of IJCAI-91. pp. 726-731.
Connel, J. H. and S. Mahadevan (Eds.) (1993). Robot Learning. Kluwer Academic Publishers.
Kaelbling, L. P. (1993). "Learning to achieve goals". In: Proc. of IJCAI-93. pp. 1094-1098.
Moore, A. W. and C. G. Atkeson (1995). "The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces". Machine Learning 21, 199-233.
Nakamura, T. (1998, to appear). "Development of Self-Learning Vision-Based Mobile Robots for Acquiring Soccer Robots Behaviors". In: RoboCup-97: The First Robot World Cup Soccer Games and Conferences. Springer-Verlag.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific.