Robot Oriented State Space Construction

Hiroshi ISHIGURO, Ritsuko SATO & Toru ISHIDA
Department of Information Science, Kyoto University
Sakyo-ku, Kyoto 606-01, JAPAN
E-mail: ishiguro/ritsuko/[email protected]

Abstract

The state space of a sensor-based robot in most previous works has been determined based on human intuitions; however, a state space constructed from the human viewpoint is not always appropriate for the robot. The robot has a different body, sensors, and tasks; therefore, we consider that the robot should have an original internal state space determined by its actions, sensors, and tasks. This paper proposes an approach to construct such a robot oriented state space by statistically analyzing the actions, sensor patterns, and rewards given as results of task executions. In the state space construction, the robot creates sensor pattern classifiers called Empirically Obtained Perceivers (EOPs), whose combinations represent internal states of the robot. We have confirmed that the robot can construct original state spaces through its vision sensor and achieve navigation tasks with the obtained state spaces in a complicated simulated world.

1 Introduction

Robot learning is one of the important topics in AI and Robotics. The key issue is how to apply machine learning algorithms developed in previous AI research to physical robots behaving in a real world. This requires focusing on the problem of how to construct an internal state space of the robot through its real sensors.

As a learning algorithm, Q-learning is often utilized since it does not require any domain knowledge. Mahadevan and Connell [6] have demonstrated that a robot can obtain programs for finding a box, pushing it, and unwedging itself in a real world by Q-learning [10]. The important point of their work is how to reduce the huge state space represented with discrete sensor inputs. They have proposed a method to reduce the size based on similarities between states. Nakamura and Asada [2] have recently proposed a method using a motion sketch, which closely relates robot actions to sensor patterns, for reducing the size of the state space. The motion sketch is acquired by statistically analyzing optical flows; therefore, the method can only be applied to tasks which are related to optical flows, such as obstacle avoidance.

The approach of Mahadevan and Connell reduces the size of the state space after executing Q-learning, and the approach of Nakamura and Asada uses features selected by humans, such as optical flows. Therefore, it is not guaranteed that the state space is appropriate for the robot. A robot generally has a body which is different from a human body, and thus its internal state space should also be different from a human's. It is natural to consider that different internal state spaces are obtained from different sensors. Another important problem is how to reduce the state space to a size with which the Q-learning algorithm can be performed. The size reduction algorithms proposed in the above-mentioned works are for general tasks of the robot; therefore, the robot needs to keep states that are unnecessary for its tasks. We consider that the robot should keep the minimum number of states needed for the tasks in order to behave in real time.

This paper proposes an approach to construct a robot oriented state space which does not depend on human intuitions but on the constraints of the robot's actions, sensors, and tasks. The research approach discussed in this paper is based on the following assumptions.

(1) The external world and the internal representation should be discriminated. An interesting idea of the behavior-based approach proposed by Brooks [4] is "The world is an environmental model itself". That is, a robot which consists of reactive behavior modules refers to the real world through its sensors as an environmental model and does not have any explicit internal representations. Although the behavior-based approach enables reactive and robust behaviors, explicit internal representations are also needed in order to perform more intelligent behaviors such as planning. This paper focuses upon how the robot builds internal representations by observing the external world through its sensors.

(2) A robot has original internal representations which are convenient for itself and its tasks. As discussed earlier, a human and a robot have different bodies; therefore, their internal representations are also different. Further, if two robots exist in an environment and their tasks differ from each other, it can be considered that their internal representations also differ. The robot should acquire original internal representations which are suitable for its tasks.

(3) A human and a robot can communicate only through the external world. A human and a robot have different internal representations; therefore, human intuitions, which result from human internal representations, may not be proper for the robot. They should communicate through external events which both of them can observe; then, the robot should understand the human intentions and acquire its internal representations.

Based on the above-mentioned concepts, a robot autonomously constructs its internal state space for accomplishing given tasks, such as avoiding obstacles and going toward a goal. The robot, in this paper, behaves in a complicated simulated world. Tasks of the robot are given as embodied rewards in the simulated world. For example, the robot receives a negative reward when colliding with a wall. While moving randomly or moving along paths taught by a human, the robot collects data sets of an action, sensor patterns, and a reward. Then, by statistically analyzing the data sets, the robot obtains Empirically Obtained Perceivers (EOPs), which classify sensor patterns according to the rewards and actions. Combinations of the obtained EOPs represent an internal state space of the robot. We have confirmed that the robot can construct state spaces for two behaviors, avoiding obstacles and going toward a goal.

A few works similar to our research approach have already been reported. The G algorithm [5] picks out the appropriate sensors to describe a state space. The UDM method [8] splits perceptually aliased states according to their predecessors. Both of them use statistics; however, they assume that sensors which generate abstract information already exist and do not mention how to acquire them. What they call a sensor corresponds to an EOP in our research approach; in this paper, the word sensor refers to a physical sensor such as a vision sensor. We, rather, focus upon how to obtain the state space through physical sensors by creating EOPs.

[Figure 1: Process flow. Collecting data sets (random movement, taught paths, Q-learning), classifying data sets, compressing data, acquiring EOPs, dividing states, integrating states, internal state spaces.]

Figure 1 shows the process flow for incrementally constructing state spaces.
Data sets collected while moving randomly, moving along paths, and executing Q-learning (discussed in the Discussion and Conclusion section) are classified into two data sets. After data compression, an EOP is obtained and the state is divided into two states. By iterating these processes, state spaces are incrementally constructed. This paper does not deal with the integration of states.

2 State Space Construction

A fundamental issue in robotics is sensor-action mapping. In order to perform the mapping without human intuitions, we employ a statistical method. If classes of sensor patterns to be discriminated exist, a discriminant function can be obtained using example patterns. This function maps patterns into symbols since it determines to which class each sensor pattern belongs. We call this function an Empirically Obtained Perceiver (EOP). An EOP plays a role like a sensor since it generates a bit which carries information regarding the external world. This sensor, the EOP, is more abstract and adaptable than a physical sensor, which is innate and hardly variable. Internal states of the robot are defined as combinations of EOPs. Because an EOP itself is just a classifier, the sensor pattern classes discriminated by the EOP carry the semantics. Therefore, combinations of EOPs represent state symbols.

2.1 Empirically obtained perceiver

Assuming that two classes of sensor patterns, class 0 and class 1, are given, a discriminant function can be obtained in the following manner. The discriminant function takes a sensor pattern x, represented as an N-vector, as input and outputs a scalar value. It is represented as a sum of weighted components of the vector x:

    f(x) = w · x + c,    x, w ∈ R^N, c ∈ R

where c is decided so that f outputs a positive value for a sensor pattern in class 1 and a negative value for a sensor pattern in class 0. w is decided so as to minimize the error rate [1] as follows:

    w = Σ^(-1) (μ1 − μ0)

where Σ is the variance-covariance matrix of class 0 and class 1, and μ1 and μ0 are the mean vectors of class 1 and class 0, respectively. The error rates of f are given by the following formulae:

    e0 = m0 / n0,    e1 = m1 / n1

where n0 and n1 are the numbers of data in class 0 and class 1, respectively, m0 is the number of data such that f(x) ≥ 0 in class 0, and m1 is the number of data such that f(x) < 0 in class 1.

[Figure 2: Acquisition of a discriminant function. Raw sensor data are reduced by early visual processing to z ∈ R^L, by the Relief algorithm to y = Az ∈ R^M (A ∈ M(M, L)), and by principal component analysis to x = By ∈ R^N (B ∈ M(N, M)); discriminant analysis then yields f(x) = w · x + c ∈ R with w ∈ R^N, c ∈ R.]

An EOP is defined by the discriminant function f as follows:

    EOP(x) = 1  if f(x) ≥ 0
    EOP(x) = 0  if f(x) < 0
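The discriminant analysis above is standard linear discrimination, so a minimal sketch of how such an EOP could be acquired and evaluated is shown below. The function and variable names (make_eop, patterns0, patterns1) are illustrative rather than taken from the paper, the covariance is estimated as a pooled within-class matrix (one reading of "the variance-covariance matrix of class 0 and class 1"), and the bias c is simply placed midway between the projected class means, which is a reasonable choice but not necessarily the authors' exact rule.

```python
import numpy as np

def make_eop(patterns0, patterns1, reg=1e-6):
    """Acquire an EOP (linear discriminant) from two classes of sensor patterns.

    patterns0, patterns1: arrays of shape (n0, N) and (n1, N).
    Returns (w, c) such that EOP(x) = 1 if w @ x + c >= 0, else 0.
    """
    mu0 = patterns0.mean(axis=0)
    mu1 = patterns1.mean(axis=0)
    n0, n1 = len(patterns0), len(patterns1)
    # Pooled within-class variance-covariance matrix (regularized for stability).
    s0 = np.cov(patterns0, rowvar=False)
    s1 = np.cov(patterns1, rowvar=False)
    cov = ((n0 - 1) * s0 + (n1 - 1) * s1) / (n0 + n1 - 2) + reg * np.eye(patterns0.shape[1])
    w = np.linalg.solve(cov, mu1 - mu0)      # w = Sigma^(-1) (mu1 - mu0)
    c = -0.5 * (w @ mu0 + w @ mu1)           # boundary midway between the class means
    return w, c

def eop(x, w, c):
    """EOP(x): 1 if f(x) >= 0, 0 otherwise."""
    return 1 if (w @ x + c) >= 0 else 0

def error_rates(patterns0, patterns1, w, c):
    """e0: fraction of class-0 patterns with f(x) >= 0; e1: fraction of class-1 patterns with f(x) < 0."""
    f0 = patterns0 @ w + c
    f1 = patterns1 @ w + c
    return float((f0 >= 0).mean()), float((f1 < 0).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(200, 10))   # stand-in for class-0 sensor patterns
    b = rng.normal(1.5, 1.0, size=(200, 10))   # stand-in for class-1 sensor patterns
    w, c = make_eop(a, b)
    print(error_rates(a, b, w, c))
```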
In the case of a visual sensor, the dimension of the data is so large that two preliminary processings are performed. First of all, raw sensor data are filtered by early visual processings (Figure 2). Then, the dimension is cut down by the Relief algorithm [7] and compressed by principal component analysis [1]. The Relief algorithm calculates the relevance of each component in terms of its contribution to the classification and selects components with high relevance. Principal component analysis rotates and selects axes of the parameter space in order of the variance of the data.

2.2 State division

We have discussed how to obtain an EOP when two classes are given. Next, we discuss how to divide a class consisting of various sensor patterns into two classes. Sensor patterns should be discriminated according to robot tasks. However, our robot does not have any motive to classify sensor patterns by itself. The tasks are given by a human as rewards embodied in the external world. A human can convey his intention to the robot through the rewards, but cannot directly describe its internal states. The robot constructs its internal state space by itself, referring to the rewards. Imagine that the robot, with an incomplete state space, moves in a real world where rewards are embodied. If the robot receives different rewards when executing an action at a state, the robot finds that the state should be divided into two proper states.

The procedures for the state division are as follows.

1. Take actions in the real world and collect quadruplet data <v, s, a, r>, where v is the physical sensor output, s is the internal state corresponding to v, a is the action taken at the situation (s, v), and r is the reward received as a result of a. Classify them into disparate sets R(si, aj) = {<v, s, a, r> | s = si, a = aj}.

2. For each R(si, aj), check whether si should be divided. If no state can be divided, then end.

3. If R(si, aj) should be divided by a threshold θ as shown in Figure 3, define two classes:
   class 0 = {v | <v, s, a, r> ∈ R(si, aj), r < θ}
   class 1 = {v | <v, s, a, r> ∈ R(si, aj), r ≥ θ}

4. Look for an EOP which can discriminate class 0 and class 1 with sufficiently small error rates e0 and e1. If no proper EOP is found, then make a new EOP using class 0 and class 1. If it still has errors above the allowed rate, then abort the division and go to 2.

5. Refer to the EOP selected at step 4 as EOPk. si is divided into new states si0 and si1 by EOPk. Divide R(si, aj) into R(si0, aj) and R(si1, aj) for each aj and go to 2:
   R(si0, aj) = {<v, s, a, r> | s = si, a = aj, EOPk(v) = 0}
   R(si1, aj) = {<v, s, a, r> | s = si, a = aj, EOPk(v) = 1}

[Figure 3: Divide condition. Frequency histogram of the rewards in R(si, aj), with a sparsely populated interval [α, β] between Min r and Max r and the division threshold θ inside it.]

Let g(r) be the frequency of r ∈ {r | <v, s, a, r> ∈ R(si, aj)} and n be the number of samples in R(si, aj). Ka, Kb, Kc, and Kd are constant values. The conditions to divide si are given by the following equations:

    g(r) ≤ Ka n    for all r ∈ [α, β]        (1)
    β − α ≥ Kb (Max r − Min r)               (2)
    n0 ≥ Kc n,  n1 ≥ Kc n                    (3)
    n0 ≥ Kd,    n1 ≥ Kd                      (4)

where n0 is the number of samples such that r < θ, θ = (α + β)/2, and n1 is the number of samples such that r ≥ θ. If several intervals [α, β] are found, the longest one is selected. Condition (3) means that rare events are neglected, and condition (4) guarantees enough data to make an EOP.

A state is divided by iterating the procedures discussed earlier, and then each state is represented by a logical combination of EOPs. For example, if two EOPs are obtained, we can represent 4 states as logical combinations of the EOPs. These logical combinations are the state symbols represented by the EOPs. Although they are symbolic representations, they are grounded in the real world.

Delayed rewards can be used in place of immediate rewards. The rewards are propagated along action sequences in the external world (the robot temporarily memorizes the action sequence together with the sensor data). If rt represents the immediate reward at time t, the delayed reward at time t is:

    r~_t = Σ_{i≥0} γ^i r_{t+i} = r_t + γ r~_{t+1}

For example, even if the goal situation and a situation far from the goal are mapped into the same internal state, the goal reward is propagated only when the robot is near the goal. The delayed rewards are propagated just like a potential field whose center is the reward source. The robot detects the peculiarity of the potential values (the propagated rewards) and tries to discriminate the sensor patterns. The distribution of the rewards decides which sensor pattern classes should be discriminated, and EOPs distinguish these classes.
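As a concrete illustration, the sketch below first computes the delayed rewards r~_t for one recorded reward sequence and then applies the divide conditions (1)-(4) to the rewards collected for a single (state, action) pair. It is a minimal reading of the conditions, assuming illustrative values for Ka, Kb, Kc, and Kd and a simple binned histogram for g(r); the constants and histogram actually used by the authors are not given in the paper.

```python
import numpy as np

def delayed_rewards(immediate, gamma=0.5):
    """r~_t = r_t + gamma * r~_{t+1}, computed backwards over one recorded sequence."""
    out = np.zeros(len(immediate))
    acc = 0.0
    for t in range(len(immediate) - 1, -1, -1):
        acc = immediate[t] + gamma * acc
        out[t] = acc
    return out

def find_division(rewards, Ka=0.05, Kb=0.2, Kc=0.1, Kd=30, bins=20):
    """Check divide conditions (1)-(4) on the rewards of one R(s_i, a_j).

    Returns the threshold theta = (alpha + beta) / 2 of the longest sparse
    reward interval satisfying all four conditions, or None if s_i should
    not be divided.
    """
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    lo, hi = r.min(), r.max()
    if hi <= lo:
        return None
    counts, edges = np.histogram(r, bins=bins, range=(lo, hi))  # binned g(r)
    sparse = counts <= Ka * n                                   # condition (1), per bin
    best = None
    start = None
    for k, ok in enumerate(list(sparse) + [False]):             # scan runs of sparse bins
        if ok and start is None:
            start = k
        elif not ok and start is not None:
            alpha, beta = edges[start], edges[k]                # candidate interval [alpha, beta]
            if beta - alpha >= Kb * (hi - lo):                  # condition (2)
                theta = 0.5 * (alpha + beta)
                n0, n1 = int((r < theta).sum()), int((r >= theta).sum())
                if n0 >= Kc * n and n1 >= Kc * n and n0 >= Kd and n1 >= Kd:  # (3) and (4)
                    if best is None or (beta - alpha) > best[0]:
                        best = (beta - alpha, theta)            # keep the longest interval
            start = None
    return None if best is None else best[1]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Stand-in rewards for one (state, action) pair: a mix of "collision" and "safe" outcomes.
    rewards = np.concatenate([rng.normal(-1.0, 0.05, 200), rng.normal(0.0, 0.05, 300)])
    print(find_division(rewards))           # a threshold near -0.5 is expected
    print(delayed_rewards([0, 0, 0, 1]))    # goal reward propagated backwards
```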
3 Experimental Results

We have developed a robot simulator and used it for experimentation. A robot in the simulator has an omnidirectional vision sensor. The robot can rotate to any of 12 directions and go forward; in other words, it has 12 actions. The configuration of the sensor and the actions is shown in Figure 4. An action ai consists of a rotation of 30 × i degrees and a forward movement of 15 pixels in the simulated world shown in Figure 6, with some errors (the size of the simulated world is 640 × 480 pixels). The actual direction and distance of ai follow normal distributions whose means are 30 × i degrees and 15 pixels, respectively. The sensor generates a picture of 360 × 100 pixels at each time step.

[Figure 4: Robot in the simulator. (a) Structure: a cylindrical image plane with r = 18, d = 20, v = 100, h = 20, s = 100. (b) Actions: the 12 directions 0-11.]

There is no need to use the same preliminary processing for all EOPs; however, this experiment uses only one, which decreases the resolution to 45 × 20 pixels and concatenates the outputs taken before and after the action. x in Figure 2 thereby has 45 × 20 × 2 dimensions. The reason for using consecutive images is to let the robot extract temporal differential information. Figure 5 shows an example of sensor patterns after the preliminary processing; the upper image corresponds to time t, the lower to time t − 1.

[Figure 5: Input patterns]

Figure 6 shows the top view of the simulated external world. The robot receives 1 as a reward when it enters the hatched region without collision, −1 when it collides with a wall, and 0 otherwise.

[Figure 6: Simulated world. A room bounded by walls and doors; S marks the start position.]

The method proposed in the previous section does not indicate how to collect data. It is necessary to make the robot move around in the external world by some means. In this paper, the robot moves randomly in the environment and practices given action sequences.

Avoiding obstacles by random movement

At the beginning, the robot knows nothing about the external world: it has no EOP, and its internal state space consists of only one state s0. With 23,100 data sets collected by random movements, a new state space has been constructed. The discount factor γ of the delayed reward was set to 0 here, that is, only immediate rewards were taken into consideration, since acquiring reactive behaviors does not require delayed rewards. As a result, 17 EOPs have been created and the single state has been divided into 19 states, as shown in Figure 8. For example, states s3 and s18 are represented as logical combinations of EOP0, EOP1, EOP2, EOP10, and EOP16; those two states are discriminated by EOP16.

[Figure 8: State tree. The 17 EOPs divide s0 into the 19 states s0-s18.]

[Figure 7: Weight vectors for EOP16. (a) Positive components. (b) Negative components.]

Figure 7 shows the weight vectors for EOP16, whose components are arranged in the same way as in Figure 5. The intensities in these figures are proportional to the absolute values of the weights; Figure 7(a) corresponds to the positive components, and Figure 7(b) corresponds to the negative components. We have not analyzed the meanings of the weight vectors yet; however, the robot must at least have found some meaning in them. All of the EOPs created here discriminate negative rewards. They can be considered to detect obstacles in a certain direction.
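Since a state symbol is nothing more than a combination of EOP outputs, identifying the robot's current state amounts to evaluating each learned EOP on the current preprocessed pattern. The following is a minimal sketch of that lookup under the simplifying assumption that every newly observed combination is registered as a state; in the paper the 17 EOPs yield only the 19 states of the division tree, not all possible combinations. The (w, c) pairs follow the linear EOP of Section 2.1, while the StateSpace class and its dictionary are illustrative devices, not the authors' data structures.

```python
import numpy as np

def eop_signature(x, eops):
    """Evaluate every EOP on pattern x; the resulting bit tuple is the state symbol."""
    return tuple(1 if (w @ x + c) >= 0 else 0 for (w, c) in eops)

class StateSpace:
    """Maps EOP signatures to state indices (only combinations actually observed get a state)."""
    def __init__(self):
        self.states = {}

    def state_of(self, x, eops):
        sig = eop_signature(x, eops)
        if sig not in self.states:
            self.states[sig] = len(self.states)   # register a newly observed combination
        return self.states[sig]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    dim = 45 * 20 * 2                                           # dimensionality used in the experiment
    eops = [(rng.normal(size=dim), 0.0) for _ in range(17)]     # stand-ins for 17 learned EOPs
    space = StateSpace()
    x = rng.normal(size=dim)                                    # stand-in preprocessed sensor pattern
    print(space.state_of(x, eops))                              # index of the observed state symbol
```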
While the robot moves along the path shown in Figure 9(a), its internal state changes in the way shown in Figure 9(b). It can be seen that the state changes correspond to changes in appearance. No EOP to discriminate the reward at the goal is created, because the robot seldom reaches the goal by random movements. The robot learns to avoid walls but never heads for the goal.

[Figure 9: State transition along a robot path. (a) Path in the simulated world. (b) State transition.]

Going toward a goal by practice

Now the robot discriminates situations in the external world to some degree, but it does not find out the goal state. The robot can seldom arrive at the goal by random walk. In order to teach the existence of a goal, we have given the robot example action sequences that arrive at the goal and made the robot practice the sequences. The action sequences are shown by manually controlling the robot. The robot memorizes the action sequences given by the human operator and repeats them. The robot expects the same sensor outputs while iterating a sequence; however, the sensor outputs are not exactly the same because of errors in the actions. Therefore, the robot can collect various data along the example path. 2,904 more data sets have been collected by practicing 4 action sequences (the robot reached the goal 325 times in 372 trials). This time, the robot considers the delayed reward in order to acquire procedural behaviors. With γ = 0.5, 9 more EOPs and 10 states have been made. 6 of those EOPs have positive threshold values. Because the states discriminated by them are never classified without rewards from the goal, they are regarded as detecting the goal by similarity of sensor patterns.

4 Discussion and Conclusion

In this paper, we have proposed an approach to construct robot oriented state spaces based on statistical analysis of sensor data. Compared with other research approaches, the most distinctive feature of our method is the EOP, which enables flexible interpretation of sensor patterns. Each EOP dichotomously classifies sensor patterns and indicates to which class a given pattern belongs. EOPs are the way of mapping sensor patterns to state symbols. States are represented by logical combinations of EOPs, and their similarity can be evaluated by logical operations. They are tied to situations observed by the sensor of the robot; consequently, they ground well in the real world. The symbol grounding realized by the EOPs is important when applying machine learning algorithms which work with symbolic representations of the world to real world robots.

From the viewpoint of active vision [3], EOPs address the focus of attention. Each of the EOPs weights parts of a sensor pattern for discriminating states. That is, those parts are where the robot should pay its attention for identifying the state.

Finally, we discuss the remaining problems. In the data collection, we can utilize Q-learning in addition to moving randomly and moving along taught paths. The purposes of utilizing Q-learning in the data collection are to efficiently collect data sets along a path toward a goal and to find states which should be divided. In this paper, data collection and state space construction have been executed alternately, but it is possible to do them simultaneously. The construction process watches the progression of the searching process (not restricted to Q-learning) and divides aliased states if they are found. The searching process tries to find an optimal path toward the goal in the constructed state space.
This can be expected to improve the results of Q-learning and to increase the learning speed.

EOPs are made under the assumption that actions are already defined, but some tasks may require more precise actions and others can be done with coarser actions. In order to change the granularity, it is necessary to detect coarse actions or to identify similar actions. The state space is obtained by dividing states with EOPs. On the other hand, a process to integrate states is also required. To combine both processes is our next step.

Since EOPs classify sensor patterns, it may be possible to classify the environment from the robot's point of view. Imagine taking our robot to another room and reconstructing the state spaces. It is expected that newly created EOPs will indicate the differences between the current room and the previous room, and that EOPs which are commonly used indicate more general information. The final goal of our research approach is to obtain general internal representations of a robot behaving in a class of environments.

References

[1] T. W. Anderson, An introduction to multivariate statistical analysis, John Wiley & Sons, Inc., 1958.

[2] T. Nakamura and M. Asada, Motion sketch: Acquisition of visual motion guided behaviors, Proc. Int. Joint Conf. Artificial Intelligence, pp. 126-132, 1995.

[3] D. H. Ballard, Reference frames for animate vision, Proc. Int. Joint Conf. Artificial Intelligence, pp. 1635-1641, 1989.

[4] R. A. Brooks, Intelligence without representation, Artificial Intelligence, Vol. 47, pp. 139-159, 1991.

[5] D. Chapman and L. P. Kaelbling, Input generalization in delayed reinforcement learning: An algorithm and performance comparisons, Proc. IJCAI, pp. 726-731, 1991.

[6] S. Mahadevan and J. Connell, Automatic programming of behavior-based robots using reinforcement learning, Artificial Intelligence, Vol. 55, pp. 311-365, 1992.

[7] K. Kira and L. A. Rendell, A practical approach to feature selection, Proc. Int. Conf. Machine Learning, pp. 249-256, 1992.

[8] R. A. McCallum, Overcoming incomplete perception with utile distinction memory, Proc. Int. Conf. Machine Learning, pp. 190-196, 1993.

[9] M. J. Swain and M. Stricker (eds), Promising directions in active vision, Univ. Chicago Tech. Report CS 91-27, 1991.

[10] C. J. C. H. Watkins and P. Dayan, Technical note: Q-learning, Machine Learning, Vol. 8, pp. 279-292, 1992.