Development of Self-Learning Vision-Based Mobile Robots for Acquiring Soccer Robots Behaviors

Takayuki Nakamura
Nara Inst. of Science and Technology, Dept. of Information Systems, 8916-5, Takayama-cho, Ikoma, Nara 630-01, Japan. Email: [email protected]

Abstract. The input generalization problem is one of the most important issues in applying reinforcement learning to real robot tasks. To cope with this problem, we propose a self-partitioning state space algorithm which produces a non-uniform quantization of a multidimensional continuous state space. The method recursively splits the continuous state space into coarse spaces called tentative states, based on a relevance test for the immediate reward r and the discounted future reward Q, which are collected during the Q-learning process. When it finds that a tentative state is relevant by a statistical test based on the minimum description length (hereafter, MDL), it partitions this coarse space into finer spaces. To show that our algorithm has generalization capability, we apply it to two tasks in which a soccer robot shoots a ball into a goal and prevents a ball from entering a goal. The validity of the method is demonstrated with experimental results from computer simulation and a real robot.

Key Words. Self-organizing algorithm, Reinforcement learning, Vision-based mobile robots, Soccer robots

1 Introduction

Recently, many researchers in robotics, e.g., (Connel and Mahadevan, 1993), have paid much attention to reinforcement learning methods by which adaptive, reflexive and purposive robot behaviors can be acquired without modeling the environment or the robot's kinematic parameters. A problem in applying reinforcement learning methods to real robot tasks is that the value function (a prediction of the return available from each state, which the robot uses to decide its next action; see (Kaelbling, 1993) for details) must represent values for infinitely many state-action pairs, because the state space is generally described by real-valued variables. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. One approach that has been used to represent the value function is to quantize the state and action spaces into a finite number of cells and collect rewards and punishments for all states and actions. This is one of the simplest forms of generalization, in which all the states and actions within a cell have the same value. In this way, the value function is approximated as a table in which each cell has a specific value. Chapman and Kaelbling (Chapman and Kaelbling, 1991) proposed an input generalization method which splits an input vector consisting of a bit sequence of the states based on already structured actions such as "shoot a ghost" and "avoid an obstacle." However, the original states have already been abstracted, and therefore it seems difficult to apply this method to the continuous raw sensor space of the real world. Moore and Atkeson (Moore and Atkeson, 1995) proposed a method to resolve the problem of learning to achieve given tasks in deterministic high-dimensional continuous spaces. It divides the continuous state space into cells such that in each cell the available actions aim at the neighboring cells. This aiming is accomplished by a local controller, which must be provided in advance as prior knowledge of the given task.
The graph of cell transitions is solved for shortest paths in an online, incremental manner, and a minimax criterion is used to detect when a group of cells is too coarse to prevent movement between obstacles or to avoid limit cycles. The offending cells are split to a higher resolution. Eventually, the environment is divided just enough to choose appropriate actions for achieving the goal. However, the restriction of this method to deterministic environments might limit its applicability, since the real environment is often non-deterministic.

This paper proposes a new method for incrementally dividing a multidimensional continuous state space into discrete states. The method recursively splits the continuous state space into coarse spaces called tentative states. It begins by treating these tentative states as the states for Q-learning, and collects immediate and discounted future rewards, together with their statistics, within this tentative state space. When it finds that a tentative state is relevant by a statistical test based on an MDL criterion (Rissanen, 1989), it partitions this coarse space into finer spaces. These procedures produce a non-uniform quantization of the state space. Our method can be applied to non-deterministic domains because Q-learning is used to find the optimal policy for accomplishing the given task.

2 Self-Partitioning State Space Algorithm

2.1 Function Approximator with a Non-uniform Resolution Model

There are several reasons why designing non-uniform function approximators may be more beneficial than designing uniform ones. If the designers have, to a certain degree, prior knowledge of the system (for example, which regions of the state-action space will be used more often), it may be efficient to design the function approximator so that it uses more resources in heavily visited regions than in regions of the state space that are known to be visited rarely. If the amount of resources is limited, a non-uniform function approximator may achieve better performance and learning efficiency than a uniform one, simply because the former can exploit the resources more efficiently than the latter. It may also be possible to design function approximators that dynamically allocate more resources in certain regions of the state-action space and increase the resolution in such regions as required on-line.

2.2 Details of Our Algorithm

Here, we define the sensor inputs, actions and rewards as follows. The sensor input $d$ is described by an $N$-dimensional vector $d = (d_1, d_2, \ldots, d_N)$, each component $d_i$ ($i = 1, \ldots, N$) of which represents the continuous measurement provided by sensor $i$. Its range $Range(d_i)$ is known in advance, and based on $Range(d_i)$ the measurement $d_i$ is normalized so that it takes values in the semi-open interval $[0, 1)$. The agent has a set $A$ of possible actions $a_j$, $j = 1, \ldots, M$; such a set is called the action space. One of the discrete rewards $r = r_k$, $k = 1, \ldots, C$, is given to the agent depending on the evaluation of the action taken in a state.

Our method utilizes a hierarchical segment tree in order to represent the non-uniform partitioning of the state space spanned by the $N$-dimensional input vector. This representation has an advantage for approximating the non-uniform distribution of sample data. The inner node at the $i$-th depth in the $j$-th level keeps the range $b_i(j) = [t_i^{low}, t_i^{high})$ of the measurement provided by sensor $i$.
(Actually, $j$ corresponds to the number of iterations of this algorithm.) At each inner node in the $j$-th level, the range of a measurement is partitioned into two equal intervals $b_i^0(j) = [t_i^{low}, (t_i^{low} + t_i^{high})/2)$ and $b_i^1(j) = [(t_i^{low} + t_i^{high})/2, t_i^{high})$. For example, initially ($j = 0$), the range of each dimension $i$ is divided into two equal intervals $b_i^0(0) = [0.0, 0.5)$ and $b_i^1(0) = [0.5, 1.0)$. When the sensor input vector $d$ has $N$ dimensions, a segment tree of depth $N$ is built (see Fig. 1). A leaf node corresponds to the result of the classification of an observed sensor input vector $d$. As a result, $2^N$ leaf nodes are generated. These leaf nodes represent the situations in the agent's environment. The state space represented by the leaf nodes is called the "tentative state space" $TS$. Let $ts_k$, $k = 1, \ldots, 2^N$, be a component of the tentative state space, called a "tentative state."

Our algorithm works as follows:

1. It starts by assuming that the entire environment is one state. Initially, the total numbers of states $S$ and tentative states $TS$ are one and $2^N$, respectively.

2. Based on $TS$, our algorithm begins Q-learning. In parallel with this process, it gathers statistics of $r(a_i \mid ts_k = \text{on})$, $r(a_i \mid ts_k = \text{off})$, $Q(a_i \mid ts_k = \text{on})$ and $Q(a_i \mid ts_k = \text{off})$, which are the immediate rewards $r$ and discounted future rewards $Q$ when the individual state is "on" or "off," respectively. Here, if an $N$-dimensional sensor vector $d$ is classified into a leaf node $ts_k$, the condition of this node $ts_k$ is regarded as "on"; otherwise (i.e., $d$ is classified into a leaf node other than $ts_k$), it is regarded as "off."

3. After Q-learning based on $TS$ has converged, our algorithm asks whether there are states in $TS$ such that the $r$ and $Q$ values for the "on" case are significantly different from those for the "off" case. When the distributions of the statistics of $ts_k$ in the "on" and "off" cases are different, $ts_k$ is determined to be relevant to the given task. In order to discover the difference between the two distributions, our algorithm performs a statistical test based on an MDL criterion. These procedures are explained in Sections 3.1 and 3.2.

4. (a) If there is a state $ts'_k$ adjoining a state $ts_k$ shown to be relevant, such that the statistical characteristics of $Q$ and the actions assigned to the adjoining state are the same, merge these two states into one state. (b) Otherwise, skip this step.

5. Each leaf node $ts_k$ is represented by a combination of intervals, each of which corresponds to the range of the measurement provided by a sensor $i$. The intervals of a $ts_k$ shown to be relevant are bisected. As a result, for one such $ts_k$, $2^N$ new leaf nodes are generated, which become tentative states. These tentative states are regarded as the states for Q-learning at the next iteration.

6. The procedure from step 2 is repeated until the algorithm cannot find any relevant leaf nodes. Finally, a hierarchical segment tree is constructed which represents the partitioning of the state space for achieving the given task.
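To make the classification and splitting steps concrete, the following minimal Python sketch represents each tentative state by one half-open interval per sensor dimension, maps a normalized sensor vector $d \in [0,1)^N$ to the leaf whose intervals contain it, and refines a relevant leaf by bisecting every interval into the $2^N$ finer tentative states of step 5. The class and function names, and the flat list of leaves, are illustrative assumptions rather than the paper's implementation.

```python
from itertools import product

class TentativeState:
    """A leaf of the segment tree: one half-open interval [low, high) per sensor."""
    def __init__(self, intervals):
        self.intervals = intervals          # list of (low, high), one per dimension
        self.q = {}                         # Q-values indexed by action

    def contains(self, d):
        """True if the normalized sensor vector d falls inside this leaf."""
        return all(lo <= x < hi for x, (lo, hi) in zip(d, self.intervals))

    def split(self):
        """Bisect every interval, yielding the 2^N finer tentative states."""
        halves = [((lo, (lo + hi) / 2), ((lo + hi) / 2, hi))
                  for lo, hi in self.intervals]
        return [TentativeState(list(combo)) for combo in product(*halves)]

def classify(d, leaves):
    """Return the tentative state ts_k into which the sensor vector d is classified."""
    for leaf in leaves:
        if leaf.contains(d):
            return leaf
    raise ValueError("d must be normalized into [0, 1) in every dimension")

# Initial tentative state space for N = 3 sensors: 2^3 = 8 leaves (iteration j = 0).
root = TentativeState([(0.0, 1.0)] * 3)
leaves = root.split()
ts = classify((0.31, 0.72, 0.05), leaves)   # the leaf that is "on" for this observation

# After the relevance test marks a leaf as relevant, refine only that leaf.
leaves.remove(ts)
leaves.extend(ts.split())
```

Only leaves that pass the relevance test are split, so resolution grows exactly where the reward statistics demand it, which is the non-uniform quantization the algorithm aims for.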
"Irrelevant" r or Q Figure 1: Representation of state space by a hierarchical segment tree Figure 2: Criterion for determining the relevance of the state After the learning, based on Q stored at leaf nodes, the agent takes actions for accomplishing the given task. 3 The relevance test based on a MDL criterion Here, we explain how to determine whether a state is relevant to the task or not. Fig. 2 shows the dierence between the distributions of r or Q regarding to the state tsk in case that tsk is relevant or irrelevant. As shown in upper part of this gure, when two peaks of the distributions of r or Q, which correspond to pair of r(ai tsk = on) and r(ai tsk = off ), or pair of Q(ai tsk = on) and Q(ai tsk = off )), can be clearly discriminated, it is supposed that tsk is relevant to the given task because the such state aects the value of state more heavily than the other states does, therefore, it aects how the robot should act at next time step. On the contrary, in case that two peaks are ambiguous as shown in bottom part of this gure, it is considered that tsk is irrelevant to the given task. Actually, we perform the statistical test with respect to r and Q based on a MDL criterion (Rissanen, 1989) in order to distinguish the distribution of such reinforcement values. j j 3.1 j j The statistical test for r Since the immediate reward rj is given at each trial among one of C mutually exclusive rewards rj j = 1; 2; C , the distribution of rj in the state tsk follows a multinominal distribution. Let n be the number of independent trials, ki be the number of event Ei and pi be the probability that the event Ei occurs where Pci=1 pi = 1. The probability that E1 occurs k1 times, E2 occurs k2 times, ... Ec occurs kc times, can be represented by the multinominal distribution as follows: n! p(k1 ; ; kc p1 ; ; pc ) = pk1 pkc c k1 ! k c ! 1 where,0 ki n(i = 1 c); Pci=1 = n: Supposing Tab. 1 shows the distribution of immediate rewards in case that the state tsk is \on" or \o," our algorithm tries to nd the dierence between the distributions of rewards in two cases of tsk "on" and "o" based on this table. Table 1: The distribution of rj in n(i): the frequency of sample data tsk in the state i (i = 1; ; S ) On O n(i; rj ): the frequency of reward rj r1 n(On; r1 ) n(Off; r1) (j = 1; ; C ) given in r2 n(On; r2 ) n(Off; r2) the state i : : : p(rj i): the probability that reward rC n(On; rC ) n(Off; rC ) rj is given in the state i C n(On) n(Off ) X PC p(r i) = 1; n(i; rj ) = n(i); j j =1 111 111 j 111 111 111 111 111 j j j =1 (i = 1; S ): The probability P ( n(i; rj ) p(rj i) ) that the distribution of rj are acquired as shown in Tab. 1 n(i; rj ) ; (i = 1 ; S; j = 1; ; C ), can be described as follows: 111 f gjf j f g g 111 111 9 8 C S < = Y Y n(i)! p(rj i)n i;rj P ( n(i; rj ) p(rj i) ) = Q C ; : j n(i; rj )! j i f gjf j j g =1 =1 =1 ( ) likelihood function L of this multinominal distribution can be written asThe follows: ( QS ) C S X X n ( i )! n(i; rj ) log p(rj i) + log QS Qi=1 L( p(rj i) ) = : (1) C i=1 j =1 i=1 j =1 n(i; rj )! 
When the two multinomial distributions for $ts_k = \text{on}$ and $ts_k = \text{off}$ can be considered the same, the probability $p(r_j \mid i)$ that $r_j$ is given in the state $i$ can be modeled as

M1: $p(r_j \mid i) = \theta(r_j)$,  $i = 1, \ldots, S$, $j = 1, \ldots, C$.

On the contrary, when the two multinomial distributions for "on" and "off" can be considered different, the probability $p(r_j \mid i)$ can be modeled as

M2: $p(r_j \mid i) = \theta(r_j \mid i)$,  $i = 1, \ldots, S$, $j = 1, \ldots, C$.

Based on Eq. (1), we can derive the likelihood function and the maximum likelihood for each of the models M1 and M2 (see (Nakamura, 1998 (to appear)) for the detailed derivation). The MDL principle is a very powerful and general approach which can be applied to any inductive learning task. It appeals to Occam's razor: the intuition that the simplest model which explains the data is the best one. The simplicity of a model is judged by its description length, and its ability to explain the data is measured by the number of bits required to describe the data given the model. Based on the MDL principle, we can calculate the description lengths $l_{MDL}(M1)$ and $l_{MDL}(M2)$ for M1 and M2, respectively (see (Nakamura, 1998 (to appear)) for the detailed derivation). Discovering the difference between the two distributions is equivalent to determining which model is more appropriate for representing the distribution of the data. Therefore, the difference between the distributions is detected as follows: if $l_{MDL}(M1) > l_{MDL}(M2)$, the two distributions are different; otherwise, they are the same.

3.2 The statistical test for Q

In order to distinguish the distributions of the sampled data of $Q$, we perform a statistical test based on an MDL criterion. Let $x^n$ and $y^m$ be the sample data $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_m)$, respectively; $x^n$ and $y^m$ are histories of $Q(a_i \mid ts_k = \text{on})$ and $Q(a_i \mid ts_k = \text{off})$, respectively. We would like to know whether these two samples come from two different distributions or from the same distribution. Here, we assume two models for the distribution of the sampled data: M1, based on one normal distribution $N(\mu, \sigma^2)$, and M2, based on two normal distributions $N(\mu_1, \sigma_0^2)$ and $N(\mu_2, \sigma_0^2)$. The normal distribution with mean $\mu$ and variance $\sigma^2$ is defined by $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$. When $x^n$ and $y^m$ follow model M1 or M2, the probability density function for each model can be written as

M1: $\prod_{i=1}^{n} f(x_i; \mu, \sigma^2) \cdot \prod_{i=1}^{m} f(y_i; \mu, \sigma^2)$,
M2: $\prod_{i=1}^{n} f(x_i; \mu_1, \sigma_0^2) \cdot \prod_{i=1}^{m} f(y_i; \mu_2, \sigma_0^2)$.

Based on these equations, we can derive the likelihood function, the maximum likelihood, and the description lengths $l_{MDL}(M1)$ and $l_{MDL}(M2)$ for each of the models M1 and M2 (see (Nakamura, 1998 (to appear)) for the detailed derivation). Based on the MDL criterion, we recognize the difference between the distributions as follows: if $l_{MDL}(M1) > l_{MDL}(M2)$, $x$ and $y$ arise from different normal distributions; otherwise, $x$ and $y$ arise from the same normal distribution.

4 Experimental Results

The experiment consists of two phases: first, learn the optimal policy through computer simulation; then, apply the learned policy to a real situation. To show that our algorithm has a generalization capability, we apply it to acquire two different behaviors: a shooting behavior and a defending behavior for soccer robots.
In this work, we assume that our robot does not know the location and size of the goal, the size and weight of the ball, any camera parameters such as focal length and tilt angle, or its own kinematics/dynamics.

4.1 Simulation

We performed computer simulations in which parameters such as the ball, goal and robot sizes, the camera parameters, the friction between the floor and the wheels, and the bouncing factor between the robot and the ball were chosen to simulate the real world (see (Nakamura, 1998 (to appear)) for the specifications of the simulation). The robot is driven by two independent motors and steered by front and rear wheels driven by one motor. Since we can send motor control commands such as "move forward or backward in the given direction," altogether we have 10 action primitives in the set A. The robot continues to take one action primitive at a time until the current state changes; this sequence of action primitives is called an action. Since a stop motion does not cause any change in the environment, we do not take this action primitive into account. The size of the image taken by the camera is 256 × 240 pixels. An input vector x to our algorithm consists of: x1, the horizontal position of the ball in the image, which ranges from 0 to 256 pixels; x2, the horizontal position of the goal in the image, ranging from 0 to 256 pixels; and x3, the area of the goal region in the image, which ranges from 0 to 256 × 240 pixels^2. After these values are normalized so that their range becomes the semi-open interval [0, 1), they are used as inputs to our method.

A discounting factor γ is used to control the degree to which rewards in the distant future affect the total value of a policy. In our case, we set it slightly less than 1 (γ = 0.9). We set the learning rate to α = 0.25. When the shooting behavior is to be acquired by our method, a reward of 1 is given when the robot succeeds in shooting the ball into the goal, 0.3 when the robot just kicks the ball, -0.01 when the robot goes out of the field, and 0 otherwise. In the same way, when the defending behavior is to be acquired, a reward of -0.7 is given when the robot fails to prevent the ball from entering the goal, 1.0 when the robot just kicks the ball, -0.01 when the robot goes out of the field, and 0 otherwise.

In the learning process, Q-learning continues until the sum of the estimated Q values appears to have almost converged. When our algorithm tried to acquire the shooting behavior, it terminated after iterating the process (Q-learning + statistical test) 8 times. In this case, about 160K trials were required for convergence and the total number of states was 246. When our algorithm tried to acquire the defending behavior, it terminated after iterating the process (Q-learning + statistical test) 5 times. In this case, about 100K trials were required for convergence and the total number of states was 141.

Fig. 3 shows the success ratio versus the number of trials in the two learning processes, one for acquiring the shooting behavior and the other for the defending behavior. We define the success rate as (# of successes)/(# of trials) × 100 (%). As can be seen, the larger the iteration number, the higher the success ratio at the final step of each iteration. This means that our algorithm gradually made a better segmentation of the state space for accomplishing the given task.
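For reference, the tabular Q-learning update used over the tentative states can be sketched as follows with the parameters reported above (γ = 0.9, α = 0.25) and the shooting-task rewards. The state encoding as leaf indices, the epsilon-greedy exploration rule, and all function names are illustrative assumptions; the paper does not specify these details.

```python
import random

GAMMA = 0.9    # discount factor reported in the paper
ALPHA = 0.25   # learning rate reported in the paper

def shooting_reward(event):
    """Shooting-task rewards as described above; any other event yields 0."""
    return {"goal": 1.0, "kick": 0.3, "out_of_field": -0.01}.get(event, 0.0)

def select_action(q_table, state, actions, epsilon=0.1):
    """Epsilon-greedy action selection (an assumed exploration strategy)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def q_update(q_table, state, action, reward, next_state, actions):
    """One tabular Q-learning step over the current tentative state space."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

# Usage: states are leaf indices of the segment tree; actions are the 10 primitives.
actions = list(range(10))
q = {}
state = 3                                   # hypothetical leaf index
a = select_action(q, state, actions)
q_update(q, state, a, shooting_reward("kick"), next_state=5, actions=actions)
```

The immediate rewards and the resulting Q values gathered during these updates are exactly the statistics fed to the MDL relevance tests of Section 3 at the end of each iteration.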
Figure 3: The success ratio versus the number of trials. (a) In the case of the shooting behavior; (b) in the case of the defending behavior.

Fig. 4 shows the partitioned state spaces obtained by our method: (a) shows the state space for the shooting behavior, and (b) shows the one for the defending behavior. In each figure, Dim. 1, Dim. 2 and Dim. 3 show the position of the ball, the position of the goal and the area of the goal region, respectively. Furthermore, Action 2 and Action 7 correspond to "moving forward" and "moving backward," respectively. To aid the reader's understanding, one cube in the partitioned state space corresponds to one state. For example, the left figure in Fig. 4 (a) shows the group of cubes to which action 2 is assigned as the optimal action. As shown in this figure, many cubes to which forward actions are assigned concentrate around the center of the entire state space. This means that the robot will take a forward action if the ball and goal are observed around the center of its field of view, which is very natural behavior for shooting a ball into a goal.

Figure 4: The partitioned state space. (a) The state space for the shooting behavior; (b) the state space for the defending behavior.

In the right figure of Fig. 4 (b), there is one large cube. This means that the robot will take a backward action if a large goal is observed around the center of its field of view. This is a plausible strategy for preventing a ball from entering the goal, because the robot has to go back in front of its own goal after moving away from there in order to kick the ball.

4.2 Real Robot Experiments

We have developed our real robot system to take part in the RoboCup-97 competition, where several robotic teams compete on a field. The system therefore includes two robots with the same structure: one is a shooter, the other a defender. An off-board computer, an SGI ONYX (R4400/250MHz), perceives the environment through the on-board cameras, performs decision making based on the learned policy, and sends motor commands to each robot. A CCD camera is set at the bottom of each robot in an off-centered position. Each robot is controlled by the SGI ONYX through a radio RS232C link. The maximum vehicle speed is about 5 cm/s. The images taken by the CCD camera on each robot are transmitted to a video signal receiver. In order to process two images (one sent from the shooter robot, the other from the defender robot) simultaneously, the two video signals are combined into one video signal by a video combiner on a PC. Then, the combined video signal is sent to the SGI ONYX for image processing. A color-based visual tracking routine is implemented for tracking and finding the ball and the goal in the image. In our current system, this image processing takes 66 ms per frame.

In the real robot experiments, our robots succeeded in shooting a ball into the goal and in preventing a ball from entering the goal, based on the state space obtained by our method. Although the robot often failed to shoot the ball, it moved backward so as to find a position from which to shoot, and finally succeeded in shooting. Note that the backward motion for a retry is just the result of learning and is not hand-coded.
When the robot tried to prevent a ball from entering the goal, it always moved in front of its own goal so as to detect a shot ball as soon as possible. Note that this behavior is also just the result of our learning algorithm and is not hand-coded (see (Nakamura, 1998 (to appear)) for more details).

5 Concluding Remarks

We have proposed a method for constructing the state space based on experiences, and have shown the validity of the method with computer simulations and real robot experiments. We can regard the problem of state space construction as a "segmentation" problem. In computer vision, the segmentation problem has been attacked since the early stage as the "image segmentation problem." Since the evaluation of the results is left to the programmers, the validity and limitations of such methods have remained ambiguous. From the viewpoint of robotics, the segmentation of sensory data from experiences depends on the purpose (task), the capabilities (sensing, acting, and processing) of the robot, and its environment, and its evaluation should be based on the robot's performance. The state space obtained by our method (Fig. 4 shows a projection of such a space) corresponds to the subjective representation the robot uses to accomplish a given task. Although it seems very limited, such a representation, an inside view of the world for the robot, shows how the robot segments the world. This view is intrinsic to the robot; based on it, the robot might make subjective decisions when facing different environments, and furthermore the robot might develop its view through its experiences (interactions with its environment). That is, the robot might acquire a subjective criterion, and as a result, an emergent behavior can be observed as "autonomous" and/or "intelligent."

6 Acknowledgments

The main idea of this paper was conceived while I stayed at the AI Lab of the Computer Science Department of Brown University. I would like to thank Prof. L. P. Kaelbling for her helpful comments during my stay. I also would like to thank Prof. M. Imai (NAIST) for providing research funds and S. Morita (Japan SGI Cray Corp.) for lending the SGI ONYX to me.

7 References

Chapman, D. and L. P. Kaelbling (1991). "Input generalization in delayed reinforcement learning: An algorithm and performance comparisons". In: Proc. of IJCAI-91. pp. 726-731.
Connel, J. H. and S. Mahadevan (Eds.) (1993). Robot Learning. Kluwer Academic Publishers.
Kaelbling, L. P. (1993). "Learning to achieve goals". In: Proc. of IJCAI-93. pp. 1094-1098.
Moore, A. W. and C. G. Atkeson (1995). The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. Machine Learning 21, 199-233.
Nakamura, T. (1998, to appear). "Development of Self-Learning Vision-Based Mobile Robots for Acquiring Soccer Robots Behaviors". In: RoboCup-97: The First Robot World Cup Soccer Games and Conferences. Springer-Verlag.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific.