Robot Oriented State Space Construction

Hiroshi ISHIGURO, Ritsuko SATO & Toru ISHIDA
Department of Information Science
Kyoto University
Sakyo-ku, Kyoto 606-01, JAPAN
E-mail: ishiguro/ritsuko/[email protected]
Abstract

The state space of a sensor-based robot in most previous works has been determined based on human intuitions; however, the state space constructed from human viewpoints is not always appropriate for the robot. The robot has a different body, sensors, and tasks; therefore, we consider that the robot should have an original internal state space determined by its actions, sensors, and tasks. This paper proposes an approach to construct such a robot oriented state space by statistically analyzing the actions, sensor patterns, and rewards given as results of task executions. In the state space construction, the robot creates sensor pattern classifiers called Empirically Obtained Perceivers (EOPs), combinations of which represent the internal states of the robot. We have confirmed that the robot can construct original state spaces through its vision sensor and achieve navigation tasks with the obtained state spaces in a complicated simulated world.

1 Introduction

Robot learning is one of the important topics in AI and Robotics. The key issue is how to apply machine learning algorithms developed in previous AI research to physical robots behaving in a real world. This requires focusing on the problem of how to construct an internal state space of the robot through its real sensors.

As a learning algorithm, Q-learning is often utilized since it does not require any domain knowledge. Mahadevan and Connell [6] have demonstrated that a robot can obtain programs for finding a box, pushing it, and unwedging itself in a real world by Q-learning [10]. The important point of their work is how to reduce the huge size of the state space represented with discrete sensor inputs. They have proposed a method to reduce the size based on similarities between states.

Nakamura and Asada [2] have recently proposed a method using a motion sketch, which closely relates robot actions to sensor patterns, for reducing the size of the state space. The motion sketch is acquired by statistically analyzing optical flows; therefore, the method can be applied only to tasks which are related to optical flows, such as obstacle avoidance.

The approach of Mahadevan and Connell reduces the size of the state space after executing Q-learning,
and the approach of Nakamura and Asada uses features selected by humans, such as optical flows. Therefore, it is not guaranteed that the state space is appropriate for the robot. A robot generally has a body which is different from a human body; thus its internal state space should also be different from a human's. It is natural to consider that different internal state spaces are obtained from different sensors.

Another important problem is how to reduce the state space to a size with which the Q-learning algorithm can be performed. The size reduction algorithms proposed in the above-mentioned works are intended for general tasks of the robot; therefore, the robot needs to keep states that are unnecessary for its actual tasks. We consider that the robot should keep the minimum number of states needed for its tasks in order to behave in real time.

This paper proposes an approach to construct a robot oriented state space which does not depend on human intuitions but on the constraints of the robot's actions, sensors, and tasks.

The research approach discussed in this paper is based on the following assumptions.

(1) The external world and the internal representation should be discriminated. An interesting idea of the behavior-based approach proposed by Brooks [4] is "the world is an environmental model itself". That is, a robot which consists of reactive behavior modules refers to the real world through its sensors as an environmental model and does not have any explicit internal representation. Although the behavior-based approach enables reactive and robust behaviors, explicit internal representations are also needed in order to perform more intelligent behaviors such as planning. This paper focuses on how the robot builds internal representations by observing the external world through its sensors.

(2) A robot has original internal representations which are convenient for itself and its tasks. As discussed earlier, a human and a robot have different bodies; therefore, their internal representations are also different. Further, if two robots exist in an environment and their tasks are different, it can be considered that their internal representations are also different. The robot should acquire
original internal representations which are suitable for
the tasks.
(3) A human and a robot can communicate only through the external world. A human and a robot have different internal representations; therefore, human intuitions, which result from the human internal representations, may not be proper for the robot. They should communicate through external events which both of them can observe; the robot should then understand the human's intentions and acquire its internal representations.
Based on the above-mentioned concepts, a robot autonomously constructs its internal state space for accomplishing given tasks, such as avoiding obstacles and going toward a goal. The robot, in this paper, behaves in a complicated simulated world. Tasks of the robot are given as embodied rewards in the simulated world; for example, the robot receives a negative reward when colliding with a wall. While moving randomly or moving along paths taught by a human, the robot collects data sets of an action, sensor patterns, and a reward. Then, by statistically analyzing the data sets, the robot obtains Empirically Obtained Perceivers (EOPs), which classify sensor patterns according to the rewards and actions. Combinations of the obtained EOPs represent an internal state space of the robot. We have confirmed that the robot can construct state spaces for two behaviors: avoiding obstacles and going toward a goal.
A few works similar to our research approach have already been reported. The G algorithm [5] picks out the appropriate sensors to describe a state space. The UDM method [8] splits perceptually aliased states according to their predecessors. Both of them use statistics; however, they assume that sensors which generate abstract information already exist and do not mention how to acquire them. What they call a sensor corresponds to an EOP in our research approach; in this paper, the word sensor refers to a physical sensor such as a vision sensor. We, rather, focus on how to obtain the state space through physical sensors by creating EOPs.
2 State Space Construction

A fundamental issue in robotics is sensor-action mapping. In order to perform the mapping without human intuitions, we employ a statistical method. If classes of sensor patterns to be discriminated exist, a discriminant function can be obtained using example patterns. This function maps patterns into symbols since it determines to which class each sensor pattern belongs. We call this function an Empirically Obtained Perceiver (EOP). An EOP plays a role like a sensor since it generates a bit which carries information regarding the external world. This sensor, the EOP, is more abstract and adaptable than a physical sensor, which is innate and hardly variable.

Internal states of the robot are defined as combinations of EOPs. Because an EOP itself is just a classifier, the sensor pattern classes discriminated by the EOP contain semantics. Therefore, combinations of EOPs represent state symbols.

[Figure 1: Process flow — collecting data sets (by random movement, taught paths, and Q-learning), classifying data sets, compressing data, acquiring an EOP, dividing states, and integrating states yield the internal state spaces.]

Figure 1 shows the process flow for incrementally constructing state spaces. Data sets collected while moving randomly, moving along paths, and executing Q-learning (discussed in the section of Discussion and Conclusion) are classified into two data sets. After data compression, an EOP is obtained and the state is divided into two states. By iterating these processes, state spaces are incrementally constructed. This paper does not deal with the integration of states.

2.1 Empirically obtained perceiver

Assuming that two classes of sensor patterns, class 0 and class 1, are given, a discriminant function can be obtained in the following manner.

The discriminant function takes a sensor pattern $x$, represented as an $N$-vector, as input and outputs a scalar value. It is represented as a sum of weighted components of the vector $x$:

$$f(x) = w \cdot x + c, \qquad x, w \in R^N, \; c \in R$$
where $c$ is decided so that $f$ outputs a positive value for a sensor pattern in class 1 and a negative value for a sensor pattern in class 0, and $w$ is decided to minimize the error rate [1] as follows:

$$w = \Sigma^{-1}(\mu_1 - \mu_0)$$

where $\Sigma$ is the variance-covariance matrix of class 0 and class 1, and $\mu_1$ and $\mu_0$ are the mean vectors of class 1 and class 0, respectively.
The error rates of $f$ are given by the following formulae:

$$e_0 = m_0 / n_0, \qquad e_1 = m_1 / n_1$$

where $n_0$ and $n_1$ are the numbers of data in class 0 and class 1, respectively, $m_0$ is the number of class-0 data such that $f(x) \ge 0$, and $m_1$ is the number of class-1 data such that $f(x) < 0$.
The EOP is defined by the discriminant function $f$ as follows:

$$EOP(x) = \begin{cases} 1 & \text{if } f(x) \ge 0 \\ 0 & \text{if } f(x) < 0 \end{cases}$$

In the case of a visual sensor, the dimension of the data is so large that two preliminary processings are performed. First, the raw sensor data are filtered by early visual processing (Figure 2). Then, the dimension is cut down by the Relief algorithm [7] and compressed by principal component analysis [1].

The Relief algorithm calculates a relevance value for each component according to its contribution to the classification and selects the components with high relevance. Principal component analysis rotates the parameter space and selects axes in order of the variance of the data.

[Figure 2: Acquisition of a discriminant function — sensing yields raw data; early visual processing gives $z \in R^L$; the Relief algorithm selects components, $y = Az$ with $A \in M(M, L)$; principal component analysis compresses them, $x = By \in R^N$ with $B \in M(N, M)$; discriminant analysis gives $f(x) = w \cdot x + c$ with $w \in R^N$, $c \in R$.]
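As a concrete illustration, the following Python/NumPy sketch builds an EOP from two classes of preprocessed sensor patterns. It is a minimal sketch under stated assumptions, not the authors' implementation: the pooled covariance estimate, the small regularization term, and the placement of the offset $c$ halfway between the projected class means are choices made here for illustration only.

    import numpy as np

    def fit_eop(class0, class1):
        """Fit an Empirically Obtained Perceiver (EOP) from two classes of
        preprocessed sensor patterns (each argument: samples x N array).

        Returns (w, c, e0, e1): weight vector, offset, and the empirical
        error rates on class 0 and class 1."""
        mu0, mu1 = class0.mean(axis=0), class1.mean(axis=0)
        n0, n1 = len(class0), len(class1)
        # Pooled variance-covariance matrix of the two classes (an assumption;
        # the paper only says "the variance-covariance matrix of class 0 and 1").
        cov = ((n0 - 1) * np.cov(class0, rowvar=False)
               + (n1 - 1) * np.cov(class1, rowvar=False)) / (n0 + n1 - 2)
        cov += 1e-6 * np.eye(cov.shape[0])      # regularize in case Sigma is singular
        w = np.linalg.solve(cov, mu1 - mu0)     # w = Sigma^{-1} (mu_1 - mu_0)
        # Offset chosen halfway between the projected class means, so that
        # f(x) >= 0 for class 1 and f(x) < 0 for class 0 (another assumption).
        c = -float(w @ (mu0 + mu1) / 2.0)
        e0 = float(np.mean(class0 @ w + c >= 0))   # class-0 patterns with f(x) >= 0
        e1 = float(np.mean(class1 @ w + c < 0))    # class-1 patterns with f(x) < 0
        return w, c, e0, e1

    def eop(x, w, c):
        """EOP(x) = 1 if f(x) >= 0, else 0."""
        return 1 if float(w @ x) + c >= 0 else 0

In the setting of this paper, the input vectors would already have been reduced by the Relief algorithm and principal component analysis before this step.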
2.2 State division
We have discussed how to obtain an EOP when
two classes are given. Next, we discuss how to divide
a class consisting of various sensor patterns into two
classes.
Sensor patterns should be discriminated according
to robot tasks. However, our robot does not have any
motives to classify sensor patterns by itself. The tasks
are given by a human as rewards embodied in the external world. A human can inform his intention to
the robot through the rewards, but cannot directly
describe its internal states. The robot constructs its
internal state space for itself referring to the rewards.
Imagine that the robot, with an incomplete state space, moves in a real world where rewards are embodied. If the robot receives different rewards when executing an action in a state, the robot finds that the state should be divided into two proper states.
Let $g(r)$ be the frequency of $r \in \{r \mid \langle v, s, a, r \rangle \in R(s_i, a_j)\}$ and $n$ be the number of samples in $R(s_i, a_j)$. $K_a$, $K_b$, $K_c$, and $K_d$ are constant values. The conditions to divide $s_i$ are given by the following equations:

$$g(r) \le K_a n \quad \forall r \in [\alpha, \beta] \qquad (1)$$
$$\beta - \alpha \ge K_b (\max r - \min r) \qquad (2)$$
$$n_0 \ge K_c n, \quad n_1 \ge K_c n \qquad (3)$$
$$n_0 \ge K_d, \quad n_1 \ge K_d \qquad (4)$$

where $n_0$ is the number of samples such that $r < \theta$, $n_1$ is the number of samples such that $r \ge \theta$, and $\theta = (\alpha + \beta)/2$. If several values of $\theta$ are found, the one with the longest interval $[\alpha, \beta]$ is selected. Condition (3) means that rare events are neglected, and condition (4) guarantees that there are enough data to make an EOP.

[Figure 3: Divide condition — the frequency distribution of rewards between Min r and Max r, with a sparse interval [α, β] and the threshold θ at its center.]
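A minimal sketch of how the divide condition might be checked on the rewards collected in one $R(s_i, a_j)$ is given below. The histogram binning and the constant values $K_a$ through $K_d$ are illustrative assumptions, not values from the paper.

    import numpy as np

    def divide_condition(rewards, Ka=0.05, Kb=0.3, Kc=0.1, Kd=10, bins=20):
        """Check conditions (1)-(4) on the rewards of one R(s_i, a_j).

        Returns the dividing threshold theta, or None if the state should
        not be divided for this action."""
        r = np.asarray(rewards, dtype=float)
        n = len(r)
        lo, hi = r.min(), r.max()
        if hi <= lo:
            return None
        hist, edges = np.histogram(r, bins=bins, range=(lo, hi))

        best = None
        i = 0
        while i < bins:
            # Condition (1): a run of sparse bins forms the interval [alpha, beta].
            if hist[i] <= Ka * n:
                j = i
                while j + 1 < bins and hist[j + 1] <= Ka * n:
                    j += 1
                alpha, beta = edges[i], edges[j + 1]
                if beta - alpha >= Kb * (hi - lo):                 # condition (2)
                    theta = (alpha + beta) / 2.0
                    n0, n1 = np.sum(r < theta), np.sum(r >= theta)
                    if n0 >= Kc * n and n1 >= Kc * n:              # condition (3)
                        if n0 >= Kd and n1 >= Kd:                  # condition (4)
                            if best is None or beta - alpha > best[0]:
                                best = (beta - alpha, theta)       # keep the longest gap
                i = j + 1
            else:
                i += 1
        return None if best is None else best[1]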
The procedure for the state division is as follows.

1. Take actions in the real world and collect quadruplet data $\langle v, s, a, r \rangle$, where $v$ is the physical sensor output, $s$ is the internal state corresponding to $v$, $a$ is the action taken in the situation $(s, v)$, and $r$ is the reward received as a result of $a$. Classify the data into disjoint sets $R(s_i, a_j) = \{\langle v, s, a, r \rangle \mid s = s_i, a = a_j\}$.

2. For each $R(s_i, a_j)$, check whether $s_i$ should be divided. If no state can be divided, then end.

3. If $R(s_i, a_j)$ should be divided by the threshold $\theta$ as shown in Figure 3, define two classes:

class 0 = $\{v \mid \langle v, s, a, r \rangle \in R(s_i, a_j), r < \theta\}$
class 1 = $\{v \mid \langle v, s, a, r \rangle \in R(s_i, a_j), r \ge \theta\}$

4. Look for an existing EOP which can discriminate class 0 and class 1 with error rates $e_0$ and $e_1$ below a given threshold. If no such EOP is found, then make a new EOP using class 0 and class 1. If the new EOP still has errors above the threshold ($e_0$ or $e_1$), then abort the division and go to 2.

5. Refer to the EOP selected in step 4 as $EOP_k$. $s_i$ is divided into the new states $s_{i0}$ and $s_{i1}$ by $EOP_k$. Divide $R(s_i, a_j)$ into $R(s_{i0}, a_j)$ and $R(s_{i1}, a_j)$ for each $a_j$ and go to 2:

$R(s_{i0}, a_j) = \{\langle v, s, a, r \rangle \mid s = s_i, a = a_j, EOP_k(v) = 0\}$
$R(s_{i1}, a_j) = \{\langle v, s, a, r \rangle \mid s = s_i, a = a_j, EOP_k(v) = 1\}$

A state is divided by iterating the procedure discussed above, and each state is then represented by a logical combination of EOPs. For example, if two EOPs are obtained, we can represent four states as logical combinations of the EOPs. These logical combinations are the state symbols represented by the EOPs. Although they are symbolic representations, they are grounded in the real world.
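To make the loop concrete, the sketch below strings together the pieces defined earlier (fit_eop, eop, and divide_condition): a state symbol is simply the tuple of EOP outputs, and each pass looks for a set $R(s_i, a_j)$ whose rewards satisfy the divide condition. Reusing an existing EOP (step 4) is omitted, and adding a new EOP refines the whole space rather than only $s_i$; this is an illustrative simplification, not the authors' implementation.

    import numpy as np

    def state_of(v, eops):
        """A state symbol is the tuple of EOP outputs for the sensor pattern v."""
        return tuple(eop(v, w, c) for (w, c) in eops)

    def divide_states(data, eops, err_max=0.1):
        """One pass of the state-division procedure (steps 1-5, simplified).

        data : list of (v, a, r) samples, where v is the preprocessed sensor
               pattern, a the action taken, and r the (possibly delayed) reward.
        eops : list of (w, c) pairs; empty at the beginning (a single state s0).
        """
        # Step 1: classify the samples into the sets R(s_i, a_j).
        R = {}
        for v, a, r in data:
            R.setdefault((state_of(v, eops), a), []).append((v, r))

        # Steps 2-5: look for a set whose rewards satisfy the divide condition,
        # build an EOP for it, and thereby split the state.
        for (s_i, a_j), samples in R.items():
            theta = divide_condition([r for _, r in samples])
            if theta is None:
                continue
            class0 = np.array([v for v, r in samples if r < theta])
            class1 = np.array([v for v, r in samples if r >= theta])
            w, c, e0, e1 = fit_eop(class0, class1)
            if e0 < err_max and e1 < err_max:
                eops.append((w, c))     # every state is now refined by the new EOP
        return eops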
Delayed rewards can be used in place of immediate rewards. The rewards are propagated along action sequences in the external world (the robot temporarily memorizes the action sequence together with the sensor data). If $r_t$ represents the immediate reward at time $t$, the delayed reward at time $t$ is:

$$\tilde{r}_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i} = r_t + \gamma \tilde{r}_{t+1}$$

For example, even if the goal situation and a situation far from the goal are mapped into the same internal state, the goal reward is propagated only when the robot is near the goal. The delayed rewards are propagated just like a potential field whose center is the reward source. The robot detects peculiarities in the potential values (the propagated rewards) and tries to discriminate the corresponding sensor patterns. The distribution of the rewards decides which sensor pattern classes should be discriminated, and the EOPs distinguish these classes.
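The delayed reward can be computed backwards over a memorized sequence with the recursion above; a small sketch:

    import numpy as np

    def delayed_rewards(immediate, gamma=0.5):
        """r~_t = r_t + gamma * r~_{t+1}, computed backwards over a memorized
        sequence of immediate rewards (gamma = 0 keeps only immediate rewards)."""
        r_tilde = np.zeros(len(immediate))
        acc = 0.0
        for t in reversed(range(len(immediate))):
            acc = immediate[t] + gamma * acc
            r_tilde[t] = acc
        return r_tilde

    # Example: only the last step reaches the goal (reward 1).
    # delayed_rewards([0, 0, 0, 1], gamma=0.5) -> [0.125, 0.25, 0.5, 1.0]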
3 Experimental Results

We have developed a robot simulator and used it for the experiments. The robot in the simulator has an omnidirectional vision sensor. The robot can rotate in 12 directions and go forward; in other words, it has 12 actions. The configuration of the sensor and the actions is shown in Figure 4. An action $a_i$ consists of a rotation of $30 \times i$ degrees and a forward movement of 15 pixels in the simulated world shown in Figure 6, with some errors (the size of the simulated world is $640 \times 480$ pixels). The actual direction and distance of $a_i$ follow normal distributions whose means are $30 \times i$ degrees and 15 pixels, respectively.

[Figure 4: Robot in the simulator — (a) structure: a cylindrical image plane (r = 18, d = 20, v = 100, h = 20, s = 100) mounted on the robot; (b) actions: the 12 movement directions, numbered 0-11.]

The sensor generates a picture of $360 \times 100$ pixels at each time step. There is no need to use the same preliminary processing for all EOPs; however, this experiment uses only one, which decreases the resolution to $45 \times 20$ pixels and concatenates the outputs taken before and after the action. $x$ in Figure 2 therefore has $45 \times 20 \times 2$ dimensions. The reason for using consecutive images is the expectation that the robot extracts information about temporal differentials. Figure 5 shows an example of a sensor pattern after preliminary processing; the upper half corresponds to time $t$ and the lower half to time $t - 1$.

[Figure 5: Input patterns — an example sensor pattern after preliminary processing.]

Figure 6 shows the top view of the simulated external world. The robot receives 1 as a reward when it enters the hatched region without collision, -1 when it collides with a wall, and 0 otherwise.

[Figure 6: Simulated world — top view with walls, two doors, and the start position S.]

The method proposed in the previous section does not indicate how to collect data. It is necessary to make the robot move around in the external world by some means. In this paper, the robot moves randomly in the environment and practices given action sequences.
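For reference, the robot's noisy action model ($a_i$: rotate about $30 \times i$ degrees, then move forward about 15 pixels) might be simulated as follows. The noise standard deviations are assumptions, since the paper only states that the direction and distance are normally distributed around those means; collision handling with the walls is omitted.

    import numpy as np

    def step(pose, i, rng, sigma_deg=5.0, sigma_px=1.5):
        """Execute action a_i: rotate by about 30*i degrees, then move forward
        about 15 pixels.  The noise standard deviations are illustrative only."""
        x, y, heading = pose
        heading += np.deg2rad(rng.normal(30.0 * i, sigma_deg))
        dist = rng.normal(15.0, sigma_px)
        return (x + dist * np.cos(heading), y + dist * np.sin(heading), heading)

    rng = np.random.default_rng(0)
    pose = (320.0, 240.0, 0.0)     # somewhere inside the 640 x 480 world
    pose = step(pose, 3, rng)      # action a_3: turn about 90 degrees, move about 15 px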
[Figure 8: State tree — the initial state $s_0$ is divided step by step by $EOP_0$ through $EOP_{16}$ into the states $s_1$ through $s_{18}$.]
Avoiding obstacles by random movement
At the beginning, the robot knows nothing about the external world; it has no EOP, and its internal state space consists of only one state $s_0$.

With 23,100 data sets collected by random movements, a new state space has been constructed. The discount factor $\gamma$ of the delayed reward was set to 0 here, that is, only immediate rewards were taken into consideration, since acquiring reactive behaviors does not require the delayed rewards.
As a result, 17 EOPs have been created and the single initial state has been divided into 19 states, as shown in Figure 8. For example, the states $s_3$ and $s_{18}$ are both represented as logical combinations of $EOP_0$, $EOP_1$, $EOP_2$, $EOP_{10}$, and $EOP_{16}$, and differ only in the value of $EOP_{16}$; that is, these two states are discriminated by $EOP_{16}$. Figure 7 shows the weight vectors for $EOP_{16}$, whose components are arranged in the same way as in Figure 5. The intensities in these figures are proportional to the absolute values of the weights; Figure 7(a) corresponds to the positive components and Figure 7(b) to the negative components. We have not analyzed the meanings of the weight vectors yet; however, the robot must at least have found some meaning in them.

[Figure 7: Weight vectors for EOP16 — (a) positive components, (b) negative components.]
All of the EOPs created here discriminate negative rewards. They can be considered to detect obstacles in certain directions. While the robot moves along the path shown in Figure 9(a), its internal state changes in the way shown in Figure 9(b). It can be seen that the state changes correspond to changes in appearance.

[Figure 9: State transition along a robot path — (a) path in the simulated world; (b) state transition.]

No EOP that discriminates the reward at the goal is created, because the robot seldom reaches the goal by random movements. The robot learns to avoid walls but never heads for the goal.
Going toward a goal by practice

Now the robot discriminates situations in the external world to some degree, but it has not found the goal state. The robot can seldom arrive at the goal by a random walk.

In order to teach the existence of the goal, we have given the robot examples of action sequences that arrive at the goal and made the robot practice the sequences. The action sequences are shown by manually controlling the robot. The robot memorizes the action sequences given by the human operator and repeats them.

The robot expects the same sensor outputs while iterating a sequence; however, the sensor outputs are not exactly the same because of the errors in the actions. Therefore, the robot can collect various data along the example path.

2,904 more data sets have been collected by practicing 4 action sequences (the robot reached the goal 325 times in 372 trials). This time, the robot considers the delayed reward in order to acquire procedural behaviors. With $\gamma = 0.5$, 9 more EOPs and 10 more states have been made. Six of those EOPs have positive threshold values. Because the states discriminated by them would never be classified without the rewards from the goal, these EOPs are regarded as detecting the goal by the similarity of sensor patterns.
4 Discussion and Conclusion

In this paper, we have proposed an approach to construct robot oriented state spaces based on statistical analysis of sensor data. Compared with other research approaches, the most distinctive feature of our method is the EOP, which enables flexible interpretation of sensor patterns.

Each EOP dichotomously classifies sensor patterns and indicates to which class a given pattern belongs. EOPs are a way of mapping sensor patterns to state symbols. States are represented by logical combinations of EOPs, and their similarity can be evaluated by logical operations. They are tied up with
situations observed by the sensor of the robot; consequently, they are grounded well in the real world. The symbol grounding realized by the EOPs is important when applying machine learning algorithms which operate on symbolic representations of the world to real-world robots.
From the viewpoint of active vision [3], EOPs address the focus of attention. Each of the EOPs weights parts of a sensor pattern for discriminating states. That is, those parts are where the robot should pay attention in order to identify the state.
Finally, we discuss the remaining problems. In the data collection, we can utilize Q-learning in addition to moving randomly and moving along taught paths. The purposes of utilizing Q-learning in the data collection are to efficiently collect data sets along a path toward the goal and to find states which should be divided. In this paper, data collection and state space construction have been executed alternately, but it is possible to do them simultaneously. The construction process watches the progress of the search process (not restricted to Q-learning) and divides aliased states if they are found. The search process tries to find an optimal path toward the goal in the constructed state space. This can be expected to improve the results of Q-learning and to increase the learning speed.
EOPs are made under the assumption that the actions are already defined, but some tasks may require more precise actions and others can be done with coarser actions. In order to change the granularity, it is necessary to detect coarse actions or to identify similar actions.
The state space is obtained by dividing states with EOPs. On the other hand, a process to integrate states is also required. Combining both processes is our next step.
Since EOPs classify sensor patterns, it may be possible to classify environments from the robot's point of view. Imagine taking our robot to another room and reconstructing the state spaces. It is expected that newly created EOPs will indicate the differences between the current room and the previous room, while EOPs which are commonly used will indicate more general information. The final goal of our research approach is to obtain general internal representations of a robot behaving in a class of environments.
References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, 1958.

[2] T. Nakamura and M. Asada, Motion sketch: Acquisition of visual motion guided behaviors, Proc. Int. Joint Conf. on Artificial Intelligence, pp. 126-132, 1995.

[3] D. H. Ballard, Reference frames for animate vision, Proc. Int. Joint Conf. on Artificial Intelligence, pp. 1635-1641, 1989.

[4] R. A. Brooks, Intelligence without representation, Artificial Intelligence, Vol. 47, pp. 139-159, 1991.

[5] D. Chapman and L. P. Kaelbling, Input generalization in delayed reinforcement learning: An algorithm and performance comparisons, Proc. Int. Joint Conf. on Artificial Intelligence, pp. 726-731, 1991.

[6] S. Mahadevan and J. Connell, Automatic programming of behavior-based robots using reinforcement learning, Artificial Intelligence, Vol. 55, pp. 311-365, 1992.

[7] K. Kira and L. A. Rendell, A practical approach to feature selection, Proc. Int. Conf. on Machine Learning, pp. 249-256, 1992.

[8] R. A. McCallum, Overcoming incomplete perception with utile distinction memory, Proc. Int. Conf. on Machine Learning, pp. 190-196, 1993.

[9] M. J. Swain and M. Stricker (eds.), Promising Directions in Active Vision, Univ. of Chicago Tech. Report CS 91-27, 1991.

[10] C. J. C. H. Watkins and P. Dayan, Technical note: Q-learning, Machine Learning, Vol. 8, pp. 279-292, 1992.