To appear in “Autonomous Robots”, 2006.

APES: ATTENTIVELY PERCEIVING ROBOT

Ç. Soyer†‡   H.I. Bozma‡   Y. İstefanopulos†‡
[email protected]   [email protected]   [email protected]
† Institute of Biomedical Engineering, Boğaziçi University, 80015 Bebek, İstanbul, Turkey
‡ Intelligent Systems Laboratory, Department of Electrical and Electronic Engineering, Boğaziçi University, 80015 Bebek, İstanbul, Turkey

Abstract

Robot vision systems inspired by human-like vision are required to employ mechanisms similar to those that have proven to be crucial in human visual performance. One of these mechanisms is attentive perception. Findings from vision science research suggest that attentive perception requires a multitude of properties: a retina with fovea-periphery distinction, an attention mechanism that can be manipulated both mechanically and internally, an extensive set of visual primitives that enable different representation modes, an integration mechanism that can infer the appropriate visual information in spite of eye, head, body and target motion, and finally memory for guiding eye movements and modeling the environment. In this paper we present an attentively “perceiving” robot called APES. The novelty of this system stems from the fact that it incorporates all of these properties simultaneously. As is explained, original approaches have to be taken to realize each of the properties so that they can be integrated together in an attentive perception framework.

Keywords: Attention, selective perception, active vision, robot vision, mobile robots, attentional sequences, visual memory, bubble model, temporal recognition, sequence based recognition.

1 Introduction

Biological systems are capable of seeing without running into problems such as the computational complexity that must be confronted when endowing robots with vision. It is argued that if human technology can mimic or copy nature, then biological-like performance can follow. Obviously, blind copying – if at all possible – may not work best, since optimal designs are not necessarily the end product of evolution and, more importantly, the underlying constraints are different. The trick may lie in understanding the essential biological features well enough to develop counterpart analogies, and this has been a motivation for this work.

Consider the scene of Figure 1 (left) taken from our laboratory. If a robot is to understand that it is part of our laboratory, traditionally it would look at the whole image at once and try to process it. Findings of vision science suggest that biological systems do not quite work this way. Instead of looking at the scene all at once and therefore being bombarded with massive amounts of data, biological vision systems attentively view their scene – thus collecting a sequence of spatio-temporal visual data as shown in Figure 1 (right).

Figure 1: (Left) A scene from our laboratory. (Right) A sequence of fixations.

Although the exact mechanisms of attentive perception are still unknown, recent work on humans and monkeys has revealed some integral properties:

1. Fovea-Periphery Distinction: Biological vision systems process only a small part of their visual field in detail [13,16].
Unlike traditional cameras used by man-made imaging systems, the distribution of receptor cells on the retina is like a Gaussian with a very small variance, resulting in a dramatic loss of resolution as we move away from the optical axis of the eye. The small region of highest acuity around the optical axis is called the fovea, and the rest of the retina is called the periphery.

2. Attention: Attentive processing via mechanical eye and head motions and cognitive mechanisms generates a stream of spatio-temporally related visual data [12,21]. As a consequence of the fovea-periphery distinction, saccades – fast eye movements – are used to bring images of chosen objects onto the fovea, where resolution is at its best. This physical attention mechanism is called overt attention. There is strong evidence suggesting that saccades are voluntary and require the computation of the relative position of a visual feature of interest with respect to the fovea in order to determine the direction and amplitude of the saccade. A second type of attention, called covert attention, refers to unconscious attentional effects. These include poorly understood complex cognitive processes, which determine the attention behavior of the system.

3. Visual Primitives & Representation Modes: Findings suggest that attentional deployment is based on a rich set of representation modes [6,11,41]. Cells in the visual path from the retina to the primary and other cortical regions respond to increasingly more complex stimuli, accompanied by larger receptive fields on the retina. For example, in the primary visual cortex, simple cells respond to lines of a particular orientation, the more common complex cells respond to motion, and some cells, both simple and complex, respond to specific corners and curvatures.

4. Serial Processing: Although the human visual system is massively parallel in structure, most visual tasks also require serial processing, as the oculomotor activity results in the perception of a series of images in time [9,14,21,24,25,32,49]. Especially in counting or comparison experiments, more complex scenes lead to longer processing times in human subjects because of the increased number of fixations or eye movements required to solve the task. This implies that information is collected and somehow combined after each fixation until there is enough information to make a decision.

5. Memory: Human vision also relies heavily on short and long term memory [16,19,21,23,49]. Some cognitive effects during attention control, like inhibition of return or negative priming, require a short-term memory mechanism. Long-term memory is used to accumulate visual information during fixations and to build abstract models of the environment that can last for years.

The use of human-like mechanisms in robot vision systems has rapidly increased in recent years, starting with the introduction of the active vision paradigm [1,3,4,5]. Most of the work in biologically motivated robot vision systems has concentrated on the realization of the first three properties discussed above. Earlier research focused on the construction and control of camera heads that can replicate eye motions [10,26,45]. Later on, the "where to look" and "how to look" problems were also studied, and various models of attention, eye movements and visual search have been developed by both the robot vision and biological vision communities [28,29,30,31,40,42].
The attention mechanism – unlike those of classical computer vision systems – requires new approaches that make use of the spatio-temporal visual data thus generated [7,17,31,39]. The use of various memory mechanisms has been widely discussed in the cognitive science literature [2,8,20,27,46,47]. Let us remark that most of these features have been studied previously, however in general separately from each other. To be of use in real world tasks, the vision system of an attentive robot needs to implement all of the above properties simultaneously. There are only a few studies in the literature where all these properties are addressed together in a comprehensive manner [17,22,38].

This paper describes the attentive visual processing of APES – a mobile robot whose novelty comes from integrating all of the above properties of biological vision in a single system. In the remainder of this paper, we explain the different components of this integrated active vision system. We first give a general overview of APES’ hardware and software in Section 2. As presented in Section 3, the fovea-periphery distinction is realized with a two-camera system. The control of the focus of attention through a pre-attention and short-term memory mechanism is explained in Section 4. In this framework, the focus of attention can also be changed “mentally” by utilizing appropriate attention criteria and applying different types of processing. Next, attentive processing and representation modes are discussed in Section 5. Serial processing of the spatio-temporal data thus generated is based on evidential reasoning, as explained in Section 6. Finally, the incorporation of long-term memory is discussed in Section 7. Section 8 presents the complete system. Experimental results from a variety of scenarios are discussed in Section 9. The paper concludes with a brief summary and remarks about future directions.

2 APES Hardware and Software

APES, shown in Figure 2, is a mobile robot developed in our laboratory for attentive vision research. Its body is driven by two conventional wheels. Using four stepping motors, it can translate and rotate its body and direct its cameras to visual stimuli by pan and tilt motions. The body rotation and camera pan axes have been designed to be co-centered in order to simplify transformations during combined body and camera motions; for mechanical stability reasons, they do not coincide with the centerline of the cylindrical body. Table 1 and Figure 3 present the technical specifications and hardware configuration of APES respectively. The main visual processing module, running on a workstation, performs vision processor setup, frame grabbing, pre-attentive and attentive processing and serial communications. The on-board PC104 computer is responsible for serial communications, motor control, and camera control. All camera features including zoom angle can be controlled by the on-board computer.

Table 1: Technical specifications of APES.
Height: 60 cm. Radius: 37 cm. Wheel span: 52 cm. Wheel radius: 15 cm.
Drive method: stepping motors. Power: 12 V battery.
Pan accuracy: 1.8 degrees. Tilt accuracy: 1.8 degrees.
Video format: CCIR composite. Image size: 512x512 pixels. Camera lens: 4-47 degree zoom.

Figure 2: APES robot and its 2 dof camera base.
Figure 3: Schematic of APES.
Figure 4: APES main software snapshot.

The two degrees of freedom step motor based head assembly and camera motions of APES cannot be compared to the highly developed oculomotor system.
However, APES can effectively control the optical axis of its cameras with an accuracy of 1.8 degrees thanks to its step motor based drive system. Camera motions correspond to the large and fast saccadic motions of the eye, which are used for fixating different spatial targets. During operation, the saccade system determines the new fixation point in the periphery and the corresponding saccade vector. This information is sent to the on-board computer, which moves the camera accordingly. The new visual field is then processed by the vision system.

In Figure 4, a snapshot of APES’ main software is shown. The two large image boxes can display raw or processed images from the two cameras, which are used to simulate the fovea-periphery distinction of the human eye as explained in the next section. The tiny fovea image is also shown on the left, below the large image. A control window is used to select operating modes and settings, and a separate data window displays all computations, including fovea saliencies, attentive features, saccade vectors, bubble points and fixation numbers. Its simple hardware and flexible software libraries enable easy integration of different oculomotor and retina models as well as memory and recognition modules to build a biologically motivated vision system.

3 Fovea-Periphery Distinction: APES Retina Model

Biological vision systems process only a small region of their visual field in detail. The small region of highest acuity around the optical axis is referred to as the fovea. It is thought to provide information regarding the scene or the current visual task. The rest of the visual field, called the periphery, is much lower in resolution and is used in finding the next fixation point [13,16]. The retina model of APES incorporates such a fovea-periphery distinction, as shown in Figure 5 (left). Since this is not possible with a single fixed resolution camera and spatially variant cameras are still under development [15,34], APES uses a two-camera retina model in order to realize such a model. As seen in Figure 5 (right), a wide angle camera is used to get peripheral visual data in low resolution, while a narrow angle camera is used to get foveal data. The two cameras are fixed together such that their optical axes are parallel and as close as possible. There is a horizontal separation of about 5 centimeters between the optical axes, resulting in a fovea image which is not exactly at the center of the peripheral image. There is also an error caused by the accuracy of the stepping motors. These errors are corrected in software by shifting the center of the fovea in the acquired periphery image to better match the actual fovea for inhibition purposes. The periphery camera has a 46 degree wide angle lens and generates an angular pixel density of approximately 11 pixels/degree. The fovea camera, on the other hand, has a narrow angle lens with a 4 degree viewing angle. It dedicates all of its resolution accordingly and therefore obtains an angular pixel density of 128 pixels/degree. The resulting photoreceptor distribution has a Gaussian shape whose variance is very small, creating a steep peak around the optical axis – thus endowing the robot with a high resolution fovea and a low resolution periphery as shown in Figure 6.
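The geometry of this retina model can be summarized in a few lines of code. The Python sketch below is only an illustration: the pixel densities, the 512x512 image size and the 1.8 degree step size are taken from the text above, whereas the function names, the Gaussian acuity width sigma_deg and the sign conventions are our own assumptions.

```python
import numpy as np

# Values taken from the text; everything else is illustrative.
PERIPHERY_PX_PER_DEG = 512 / 46.0   # wide-angle camera: ~11 pixels/degree
STEP_DEG = 1.8                      # pan/tilt stepping motor resolution

def saccade_steps(fix_row, fix_col, img_size=512):
    """Convert a fixation point chosen in the periphery image (pixels) into
    the number of pan and tilt motor steps needed to center the narrow-angle
    fovea camera on it, assuming the two optical axes are roughly aligned."""
    pan_deg = (fix_col - img_size / 2.0) / PERIPHERY_PX_PER_DEG
    tilt_deg = -(fix_row - img_size / 2.0) / PERIPHERY_PX_PER_DEG  # rows grow downwards
    return int(round(pan_deg / STEP_DEG)), int(round(tilt_deg / STEP_DEG))

def acuity_map(img_size=512, sigma_deg=2.0):
    """Gaussian acuity profile over the visual field: close to 1 near the
    optical axis (the fovea) and falling off quickly in the periphery."""
    rows, cols = np.mgrid[0:img_size, 0:img_size]
    r_deg = np.hypot(rows - img_size / 2.0, cols - img_size / 2.0) / PERIPHERY_PX_PER_DEG
    return np.exp(-0.5 * (r_deg / sigma_deg) ** 2)
```

In the real system the 5 cm baseline between the two optical axes and the finite motor accuracy make such a mapping only approximate, which is why the fovea center is shifted in software as described above.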
Figure 5: (Left) APES’ visual field with fovea-periphery distinction. (Right) Two camera retina model of APES.
Figure 6: (Left) Periphery. (Center) Fovea with uniform resolution. (Right) Fovea with the two camera model.

4 Attention Mechanism and Short-Term Fixation Memory

Attentive vision works in a loop of pre-attention and attention. In pre-attention, the next fovea is determined by simple, fast calculations in the periphery. In attention, a fixated fovea is processed to extract more complex features, which are used to build a high level cognitive model of the scene. APES’ visual processing is based on a similar loop, as shown in Figure 7. Let $I_v^t$ and $I_f^t \subset I_v^t$ denote the visual field and the fovea at time $t$ respectively. In pre-attention, all the fovea sized regions in the periphery, with some overlapping, constitute the set $C(I_v^t)$ of candidate foveae as shown in Figure 5 (left). Each $I_f^c \in C(I_v^t)$ is then considered and its saliency is computed. The saliency measure is an attention function $a : I_f^c \to [0,\infty)$, which is determined by the current task under the constraint that it must be simple to compute. In the literature, many different computational models of pre-attention have been proposed, as summarized in [18]. As such, one conceives attention simply as the facilitation of a certain set of neurons and thus the highlighting of particular features at a particular position in the visual field.

As expected, the fixation behavior can change depending on the selected attention function. We have experimented with different functions. For most of our experiments, the attention criterion is simply defined as

$$a(I_f^c) \triangleq \sum_{p \in I_f^c} \left\| \nabla I(p) \right\|$$

that is, the total sum of the gradient magnitudes of all the pixels within the candidate fovea. However, with just such an attention function, short-term fixation loops that cycle between only two or three foveae may occur. Hence the attention function response must be modified by mechanisms that inhibit such cyclic behavior. Visual findings indicate that there are two types of memory present here [18]:

1. Inhibition of return – the process by which the currently attended location is prevented from being attended again.
2. Short-term memory – the process by which the last few fixations are recalled and prevented from being attended again.

APES has two such built-in mechanisms – inhibition and short-term fixation memory – for this purpose. First, the next fovea is forced to be away from the currently fixated fovea $I_f^t$ using an inhibition region. This is achieved by defining an $H \times H$ pixel region $I_h^t$ around $I_f^t$ as the inhibition region, as shown in Figure 5 (left). All candidate foveae $I_f^c \in C(I_h^t)$ falling within the inhibition region are inhibited. Note that the inhibition mechanism also enables the control of saccade magnitudes. Secondly, a short-term fixation memory mechanism $C_d$ is implemented. This mechanism works by keeping track of previously fixated foveae and inhibiting them even if they are not within the current inhibition region. For this, we use a first-in-first-out memory $C_d = \{ I_f^t, I_f^{t-1}, \ldots, I_f^{t-D} \}$ of size $D$. All foveae in this memory are inhibited during pre-attention. Obviously, the value of $D$ puts a lower bound on the permissible length of fixation loops.
At the end of each new fixation, $I_f^{t-D}$ is removed from this memory while $I_f^{t+1}$ is added to it. Pre-attentive processing together with the inhibition and short-term memory mechanisms are merged to form an augmented attention function $\tilde{a} : I_f^c \to \mathbb{R}^+$ as:

$$\tilde{a}(I_f^c) = \begin{cases} 0 & \text{if } I_f^c \in C(I_h^t) \\ 0 & \text{if } I_f^c \in C_d \\ a(I_f^c) & \text{if } I_f^c \in C(I_v^t),\ I_f^c \notin C(I_h^t) \cup C_d \end{cases}$$

APES determines its next fovea $I_f^{t+1}$ by finding the most salient candidate fovea in its periphery using this augmented attention function as:

$$I_f^{t+1} \in \arg\max_{I_f^c \in C(I_v^t)} \tilde{a}(I_f^c)$$

It then moves its camera so as to fixate on its center. After a fixation point is found, the fixation point image coordinates are converted to camera coordinates and the amount of motion required for fixation is calculated. Using the results of this calculation, the fovea camera is directed to the new fixation point and a fovea image is grabbed. APES can effectively control the optical axis of its camera with an accuracy of 1.8 degrees due to its step motor based drive system. Camera motions correspond to the large and fast saccadic motions of the eye. Although this system cannot be compared to the highly developed and poorly understood oculomotor system of mammals, it nevertheless suffices for implementing a physical attention mechanism. While this image is being analyzed for higher level features, the periphery camera is free to look for the next fixation point if there is parallel processing.

Figure 7: Attention process in the two camera retina model.

In Figure 8, APES is looking at the curved metal object on the left and generating the sequence of fixation images on the right. In this experiment, a gradient based attention criterion is used within an augmented attention function as described above.

Figure 8: Snapshots from a recognition experiment using selective attention (right) on a curved metal object (left).

Note that although the augmented attention function can potentially result in rather complex and unpredictable attention behavior, the basic selective attention mechanism of APES is relatively simple when compared to some other work in the literature. For example, the early works of Ballard, Rimey and Brown introduce various attention control mechanisms integrated into a 6 dof robotic arm [4,5,29,30]. Many aspects of animate vision, including reference frames, gaze control, vergence and depth, and foveal and peripheral features, are introduced and investigated in [4] and [5]. While that work focused on selective attention and fixation control, our work on APES is focused on both generating fixations and using the information collected during the selective attention process. As explained in Section 6, regardless of its attention control mechanisms, the output of an attentive vision system can be characterized by a sequence of fixations and a sequence of feature vectors computed during each fixation. An integrated system should have the mechanisms to process this information in a timely manner within its attention loop shown in Figure 7.
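To make the pre-attention loop concrete, the following Python sketch scores all candidate foveae with the gradient criterion and applies the two inhibition mechanisms. The 40x40 fovea, 50% overlap and memory depth D = 10 are the values used in the experiments of Section 9; the size of the inhibition region and all names are our own illustrative assumptions.

```python
import numpy as np
from collections import deque

def gradient_saliency(periphery):
    """Pixelwise gradient magnitude of the periphery image (the basis of a(.))."""
    gy, gx = np.gradient(periphery.astype(float))
    return np.hypot(gx, gy)

def next_fixation(periphery, current=None, memory=(), fovea=40, overlap=0.5, inhibit=80):
    """One pre-attentive step: evaluate the augmented attention function over
    all candidate foveae and return the center of the most salient one."""
    grad = gradient_saliency(periphery)
    step = max(1, int(fovea * (1.0 - overlap)))           # candidate spacing
    best, best_score = None, -1.0
    for r in range(0, periphery.shape[0] - fovea + 1, step):
        for c in range(0, periphery.shape[1] - fovea + 1, step):
            center = (r + fovea // 2, c + fovea // 2)
            # Inhibition of return: HxH region around the current fixation.
            if current is not None and max(abs(center[0] - current[0]),
                                           abs(center[1] - current[1])) < inhibit // 2:
                continue
            # Short-term fixation memory: last D fixated foveae are inhibited.
            if any(max(abs(center[0] - m[0]), abs(center[1] - m[1])) < fovea // 2
                   for m in memory):
                continue
            score = grad[r:r + fovea, c:c + fovea].sum()  # a(I_f^c)
            if score > best_score:
                best, best_score = center, score
    return best

# Usage sketch: a FIFO of depth D = 10 realizes C_d.
# memory, fix = deque(maxlen=10), None
# fix = next_fixation(periphery_image, current=fix, memory=memory)
# memory.append(fix)
```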
5 Visual Primitives & Representation Modes

In the attentive stage, APES applies more complex visual processing to its current fovea $I_f^t$ in order to extract its visual properties. This choice will vary depending on the task at hand and the representation mode. The results of this processing are encoded as an observation vector $o^t$. Currently, in the attentive stage, APES processes its fovea using visual primitives such as edge magnitude, edge orientation, and saccade direction, shown in Figure 9. This set has recently been extended to include Cartesian and non-Cartesian filters, which have been experimentally confirmed to be used in primate vision [11]. Different visual primitives may also be conjoined together.

The ability to use different representation modes in both the pre-attentive and attentive stages enables APES to explore and internally represent its environment in different ways. For example, by using a gradient based attention criterion APES can be made to prefer focusing on image regions with high frequency content such as contours, or by using a brightness feature it can be made to fixate on light sources, reflective objects, etc. Similarly, the object whose contours are focused upon can be modeled by APES using the sequence of edge types or by using the sequence of saccade vectors, which gives shape information.

Figure 9: (Left) Chain coded saccade directions. (Center) Edge types. (Right) Cartesian and non-Cartesian filters used by APES as attentive features.

6 Attentional Sequence and Temporal Recognition

Visual attention systems are widely studied as feature search mechanisms and as models of human attention [18,29,30,41,42,45]. However, there are few cases in the literature where an attentive system and its output – the attentional sequence described below – are used for object or scene recognition. One example is Rybak’s work on attention-guided visual recognition [31]. In this approach, the fovea image or features are compared to a library and then further analyzed when there is a close match. While selective attention is used as an efficient search mechanism in this case, this approach still relies on classical methods at the recognition stage and does not model recognition by an attentive system where information is accumulated in time, over a number of fixations on different points in space. In APES we aim to model this behavior, which is assumed to exist in humans and other animals. Rimey and Brown also studied attentional sequence modeling using Hidden Markov Models and Bayes nets [29,30]. Their work focused on another interesting problem, where the models were used to help attentional sequencing by a robot. In earlier work we used Rimey and Brown’s HMM approach for modeling and recognition of attentional sequences and compared it to evidential reasoning as an alternative method [37]. The latter method is described below as applied in the APES robot.

As APES cycles through the pre-attention attention loop, it generates a spatio-temporally related stream of fixations and thus observations – which we refer to as an “attentional sequence”. After $T$ fixations, the attentional sequence can be denoted as $O^T = (o^1, \ldots, o^T)$ where $o^k$ is the observation at fixation $k$. APES can then use this information in a cumulative manner. One method of accomplishing this is based on evidential reasoning [33,35,37]. In this approach, for a given visual decision task, all the candidate alternatives are considered as competing propositions. Each body of evidence $o^k$, $k = 1, \ldots, T$ in $O^T$ is found to support the competing propositions to different degrees. Spatio-temporally related bodies of evidence are then combined in a cumulative manner to find the proposition which is most supported at an instant $t$. Decisions can then be made accordingly. Let $k^*$ be the correct classification of a scene.
Suppose the set of its possible values is given by $K$. Then the propositions of interest are precisely those of the form “the true value of $k^*$ is in $A$”, and hence they are in 1-1 correspondence with the subsets $2^K$ of $K$. Thus, we use $A \in 2^K$ to denote a proposition. In classification, we are in particular interested in propositions of the form $A_k = \{k\}$, $k = 1, \ldots, N_K$, where $N_K = |K|$. $A_k$ is taken to mean “the object under view is $k$”. Now suppose that for each proposition $A_k$ we have a transition frequency matrix $T_k : \Omega \times \Omega \to [0, \infty)$. Each entry $T_k(o_i, o_j)$ represents the weight of evidence attested to observing $o_j$ after having observed $o_i$. Recall that $o^t \in \Omega$ is the observation at time $t$, and note the use of feature transitions rather than the features themselves. Then each observation attests evidence for each proposition $A_k$ as follows. Let $\omega : 2^K \times \Omega \to [0, \infty)$ represent the weight of evidence function. Then,

$$\omega(A_k, o^t) = T_k(o^{t-1}, o^t)$$

In evidential reasoning, the degrees of support for the various propositions discerned by $K$ are determined by the weights of evidence attesting to these propositions. Let $s_k : 2^K \times \Omega \to [0,1]$ define a simple support function focused on $A_k$. Then $s_k$ can be defined as

$$s_k(A, o^t) = \begin{cases} 0 & \text{if } A_k \not\subset A \\ s_k(A_k, o^t) & \text{if } A_k \subset A,\ A \neq K \\ 1 & \text{if } A = K \end{cases}$$

where $s_k(A_k, o^t) = 1 - e^{-c\,\omega(A_k, o^t)}$. Note that $s_k$ is a belief function with basic probability numbers $m(A_k) = s_k(A_k, o^t)$, $m(K) = 1 - s_k(A_k, o^t)$ and $m(A) = 0$ for all other $A \subset K$ that do not contain $A_k$.

Each piece of evidence supports each proposition $A_k$, $k = 1, \ldots, N_K$, with strength $s_k(A_k, o^t)$. As each proposition conflicts with the others, the effect of each should be diminished by the others, and an instantaneous support $s_k^i$ for each proposition $A_k$ should be calculated. The instantaneous support $s_k^i : 2^K \times \Omega \to [0,1]$ can be computed as the orthogonal sum of the simple support functions $s_k$ focused on $A_k$, given by the basic probability numbers

$$m(A_k, o^t) = \frac{s_k(A_k, o^t) \prod_{j=1,\, j \neq k}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)}
\qquad
m(K, o^t) = \frac{\prod_{j=1}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)}$$

and

$$s_k^i(C, o^t) = \begin{cases}
0 & \text{if } C \text{ contains none of the } A_k,\ k = 1, \ldots, N_K \\[4pt]
\dfrac{s_k(A_k, o^t) \prod_{j=1,\, j \neq k}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)} & \text{if } C \text{ contains } A_k \text{ but no } A_j,\ j \neq k \\[8pt]
\dfrac{\sum_{k :\, A_k \subset C} s_k(A_k, o^t) \prod_{j \neq k} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)} & \text{if } C \text{ contains some of the } A_k,\ C \neq K \\[8pt]
1 & \text{if } C = K
\end{cases}$$

Figure 10: Calculation of instantaneous support for a two hypotheses case.

In Figure 10, the calculation of the instantaneous support is shown graphically for a two hypotheses case. Note the abstraction of information from the fovea image to the extracted feature and feature transition. In this case, ‘Evidence 1’ and ‘Evidence 2’ refer to bodies of evidence supporting two different hypotheses. The effect of $s_k^i$ is to provide instantaneous support for each proposition $A_k$. The total support $s_k^t$ for each proposition $A_k$ can then be accumulated by combining the total support $s_k^{t-1}$ accumulated so far with the instantaneous support $s_k^i$. This is the case of homogeneous evidence – evidence strictly supporting a single proposition. The cumulative support function $s_k^t : 2^K \times \Omega^t \to [0,1]$ for proposition $A_k$ attested by the attentional sequence $O^t$ can be computed using Bernoulli's rule.
Bernoulli's rule of combination provides an iterative rule for updating $s_k^{t-1}$ focused on $A_k$ with support $s_k^{t-1}(A_k)$ using the instantaneous information $s_k^i$ focused on $A_k$ with support $s_k^i(A_k)$. It is defined recursively as the orthogonal sum $s_k^t = s_k^{t-1} \oplus s_k^i$:

$$s_k^t(C, O^t) = \begin{cases} 0 & \text{if } C \text{ does not contain } A_k \\ 1 - \left( 1 - s_k^i(A_k, o^t) \right)\left( 1 - s_k^{t-1}(A_k, O^{t-1}) \right) & \text{if } C \text{ contains } A_k,\ C \neq K \\ 1 & \text{if } C = K \end{cases}$$

Figure 11 explains the calculation of the temporal support for a two hypotheses case. In this case, ‘Evidence 1’ and ‘Evidence 2’ refer to bodies of evidence supporting the same hypothesis at different times or fixations.

Figure 11: Calculation of temporal support for a two hypotheses case.

Many different strategies can then be used in order to make a decision about the current visual task. A simple strategy is choosing the maximally supported proposition $A_{k^*}$ where:

$$k^* = \arg\max_{k \in K} s_k^t(A_k, O^t)$$

In creating a model for each proposition $A_k$, which may correspond to an object image or a complex scene, APES starts observing the respective scene or object in an attentive manner. As it is consecutively fixating and forming observations, the transition $T_k(o^{t-1}, o^t)$ between two consecutive observations in this scan path is recorded by incrementing the frequency of that particular transition by 1. Hence, for any library model, the number of transitions between any pair of features forms an $|\Omega| \times |\Omega|$ feature transition matrix. These matrices serve directly as weights of evidence.

The modeling stage is critical to recognition performance. In order to obtain a reliable model, all parts of a scene must be observed equally during the learning fixations. Therefore, the learning period, as determined by the length of the attentional sequence, must be long enough to allow different scan paths to be taken. A partial model that does not include all possible scan paths, and thus all possible feature transitions, will mean that the scene is incompletely modeled. However, due to the attention mechanisms involved, this does not necessarily imply poor recognition performance. As discussed in the experiments section of this paper, the system is sometimes able to model the most characteristic features of a scene during a short learning phase and therefore perform successful recognition. Although it can be considered a special case, this property of attentional sequence based recognition is successfully employed in biological systems and needs to be studied in more detail. We speculate that this special case may in fact be the key to human-like visual performance.
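The recognition machinery of this section reduces to a few array operations once only the singleton propositions $A_k$ are tracked. The Python sketch below is our own illustration of it: the constant c, the handling of a zero denominator and all names are assumptions, and only the supports $s_k^t(A_k, O^t)$ actually used in the decision rule are computed.

```python
import numpy as np

def learn_transition_matrix(sequence, n_features=8):
    """Build a feature transition frequency matrix T_k from a learning
    attentional sequence of observation symbols in {0, ..., n_features-1}."""
    T = np.zeros((n_features, n_features))
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        T[prev, cur] += 1.0
    return T

def cumulative_supports(sequence, models, c=1.0):
    """Accumulate the evidential support of an attentional sequence for each model.

    models: list of transition matrices T_k, one per proposition A_k.
    Returns the supports s_k^t(A_k, O^t) after the last fixation."""
    n = len(models)
    total = np.zeros(n)                                   # no support initially
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        omega = np.array([T[prev, cur] for T in models])  # weights of evidence
        s = 1.0 - np.exp(-c * omega)                      # simple supports
        # Orthogonal sum of the simple supports -> instantaneous supports
        denom = 1.0 - np.prod(s)
        inst = np.zeros(n)
        if denom > 0:
            for k in range(n):
                inst[k] = s[k] * np.prod(np.delete(1.0 - s, k)) / denom
        # Bernoulli's rule: combine with the support accumulated so far
        total = 1.0 - (1.0 - inst) * (1.0 - total)
    return total

# Decision rule: the maximally supported proposition wins.
# T1, T2 = learn_transition_matrix(seq_scene1), learn_transition_matrix(seq_scene2)
# k_star = int(np.argmax(cumulative_supports(observed_seq, [T1, T2])))
```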
7 Long Term Memory Model: Bubbles

Biological systems are known to create abstract long-term memories of their environment for navigation and self-localization using visual data. Recent environment mapping techniques for mobile robots use the local sensory experiences of the robot to build a Cartesian representation of the environment [8,46,47]. The sensory egosphere was first used to project images onto a spherical surface placed around the robot [2]. This idea was later used as a database structure where each point on the sphere points to a data structure representing the sensory inputs from that direction [27,20] – thus facilitating the storage and retrieval of the sensory experience of a mobile robot in a natural and efficient way. However, the egosphere itself does not contain any sensory information. A different approach – called a bubble memory – was proposed in [36]. Bubble memory is based on the idea of a surrounding spherical surface that can be deformed to represent the robot’s sensory experience, while maintaining the same indexing capability as the ego-sphere. While the ego-sphere is used to map information from the 3D world onto a database of sensory experiences, the bubble itself is deformed using sensory information to become a special 3D surface which represents the robot’s sensory experience. In this section we revisit the bubble memory and explain its integration into APES’ vision system.

As APES looks around with its pan-tilt type head, it records the observations thus gathered in a long-term memory. Let $\theta$, $\varphi$ denote the pan and tilt angles respectively. APES can direct the optical axis of its camera in any direction $(\theta, \varphi)$ within its physical limits. For each fixation direction, an observation is made as explained previously and, accordingly, a quantitative measure $\rho$ can be assigned to each fixation direction. In this manner, a surface is defined implicitly by $\rho : (\theta, \varphi) \to \mathbb{R}^+$. We refer to this deformable surface – hypothetically placed around the robot – as the bubble, as shown in Figure 12 (left). Each bubble is initialized to a sphere. As APES starts to look around, for each fixation direction $(\theta, \varphi)$ the bubble is deformed at the corresponding bubble point. The amount of deformation of the bubble is determined by the attentive processing made on the fovea, as shown in Figure 12 (right). Since APES has finite precision in the pan and tilt directions – around 1.8 degrees – the bubble surface is discrete and we can represent it by a finite set of equally spaced bubble points:

$$\left\{ \beta = (\rho, \theta, \varphi) \in \mathbb{R}^3 \mid \theta \equiv i \cdot \Delta\theta,\ \varphi \equiv j \cdot \Delta\varphi \right\} \quad \text{where } i \in [0, n),\ j \in [0, m)$$

As there can be a plethora of visual primitives being simultaneously extracted, a set of bubbles can be formed, each corresponding to one such visual primitive. These bubbles provide a compact representation of the spatio-temporal visual data generated while APES looks around from a fixed viewpoint. Thus, the bubble memory model provides a mechanism for the integration of spatially distinct features in time to obtain a model of the environment. The bubble enables long-term memory – a recollection of which feature was seen where. APES can use bubbles for vision based environment modeling. The integration property of bubbles enables them to store foveal features observed from a single point in space. For each new viewpoint, a new set of bubbles is generated.
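A bubble can be held as nothing more than a 2D array of radii indexed by pan and tilt steps. The sketch below, with illustrative names, array sizes, deformation gain and spherical-coordinate convention of our own choosing, shows how such a discrete bubble could be deformed at each fixation and converted back to 3D points.

```python
import numpy as np

class Bubble:
    """Discrete bubble surface: one radius rho per (pan, tilt) motor step."""
    def __init__(self, n_pan=200, n_tilt=100, rho0=1.0):
        self.rho = np.full((n_pan, n_tilt), rho0)       # initialized to a sphere

    def deform(self, pan_step, tilt_step, feature_value, gain=0.1):
        """Inflate the bubble point in the fixated direction by an amount
        determined by the attentive feature computed on the fovea."""
        self.rho[pan_step % self.rho.shape[0],
                 tilt_step % self.rho.shape[1]] += gain * feature_value

    def points(self, step_deg=1.8):
        """Return the bubble points beta = (rho, theta, phi) as Cartesian xyz."""
        i, j = np.mgrid[0:self.rho.shape[0], 0:self.rho.shape[1]]
        theta, phi = np.deg2rad(i * step_deg), np.deg2rad(j * step_deg)
        x = self.rho * np.cos(phi) * np.cos(theta)
        y = self.rho * np.cos(phi) * np.sin(theta)
        z = self.rho * np.sin(phi)
        return np.stack([x, y, z], axis=-1)

# One bubble per visual primitive, e.g. edge content and brightness:
# bubbles = {"edges": Bubble(), "brightness": Bubble()}
# bubbles["edges"].deform(pan, tilt, edge_energy)
```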
Figure 12: (Left) APES, bubble points and potential fixation points. (Right) Inflated bubble in 2D.
Figure 13: (Left) Table scene from our laboratory. (Center) Complete 1.8 degree resolution bubble representation for 30 fixations. (Right) Coordinates of the fixations made during the experiments, in terms of the number of pan and tilt steps from the starting position.

A number of experiments to test bubble formation in different situations have been conducted. In Figure 13 (center), a full resolution bubble surface composed of 40,000 control points generated in one experiment is shown. On the right of this figure, the numbers of pan and tilt steps made by the stepping motors during these fixations are shown in Cartesian coordinates. This representation can be used to model 3D environments – which we have investigated in another set of experiments using our Table scene. It is found that the bubbles after 30 fixations fall into one of two categories, as shown in Figure 14. Figure 15 shows sample bubbles from each category. The bubbles formed in these and other experiments suggest that the attentive system takes a limited number of different paths for each scene. Interestingly, the system has a tendency to converge to one of these preferred paths, even if the first fixation point is different. Note that unlike the attentional sequence, which is a short-term buffer, the bubble memory is retained and integrated over long periods of time to enable environment mapping. Our recent work with bubbles involves using them for self-localization by modeling and comparing bubble surfaces. Furthermore, through the use of 3D elliptic Fourier methods, bubble data can be stored in a compact mathematical form and at different levels of detail, as described in [36].

Figure 14: Two types of bubbles formed on the Table scene.
Figure 15: Sample bubbles formed in experiments.

8 An Integrated Model of APES’ Vision

All five features – fovea-periphery distinction, attentive processing with inhibition of return and short-term memory, visual primitives, serial processing and long-term memory – are integrated within a single model of attentive vision on APES, as shown in Figure 16. Blue lines indicate information flow common to all modes of operation. Red and green lines indicate information flow during learning and run-time respectively. Note that only the saccadic eye movement system is realized on APES. This framework also proposes models for certain poorly understood mechanisms such as temporal recognition, visual integration over saccades, and environment modeling.

First, periphery and fovea images are obtained by the two-camera sensor system simulating the human retina. The periphery image is input into the pre-attention system, which also receives pre-attentive interest criteria, inhibition settings, fixation memory contents and any other higher-level cognitive effects. The new fixation point is selected, from which the saccade vector is generated and sent to the head motor controller. In humans, saccades are also known to be controlled in a predictive manner, based on expectations about a scene. This top-down saccade control mechanism is not implemented on APES; however, inputs from sequence processing and the bubble memory can be used for this purpose.

Figure 16: An integrated model of APES’ vision.

At each fixation, the fovea is processed to extract attentive features. As with the pre-attentive interest criteria, the system enables any feature that can be computed on the fovea image to be used as an attentive feature. These features are sequentially processed by the attentional sequence modeling and recognition algorithms, which enable temporal recognition. The results of attentive processing are also saved in a bubble memory, so that they can be recalled by the attentive processing module when part of the same peripheral field needs to be processed at a later time, or when visual information is not available. The 3D bubble surfaces store viewpoint dependent visual models of the environment using different attentive features. For each viewpoint, a number of bubbles can be formed to represent different properties like edge or color content.
Furthermore, the spherical bubble surfaces can be modeled by Fourier techniques to create more abstract representations, which can be used for rapid comparison of bubbles to those stored in a bubble model memory.

9 Scene Recognition Experiments

An extensive study of APES’ attentional sequence based recognition capability in a variety of recognition tasks has been conducted. In this section, we present highlights from these results; the interested reader is referred to [37] for a more comprehensive discussion. In these experiments, APES is set to use a 200x200 pixel visual field and a 40x40 pixel fovea. The overlap between candidate foveae is 50% and a short-term memory depth of D = 10 is used to inhibit the last 10 fixated foveae. Simple pre-attentive and attentive features are employed, with the intention of removing any ambiguity from the feature extraction stages and underpinning the exact capability of this attentive vision system in recognition tasks. The pre-attentive attention criterion for each candidate fovea $I_f^c$ is as described in Section 4. In the attentive stage, the feature space consists of $\Omega = \Omega_1$, corresponding to 8 different orientations of a simple edge feature computed by the operator

$$f_1 = \arg\max_{i \in \Omega_1} S_i(I_f^t)$$

where $S_i(I_f^t)$ is the response of the 3x3 operator for detecting edges with an orientation of $i$ degrees. All experiments are performed under ceiling mounted fluorescent lights and daylight from windows, without any special lighting. Typically, two fixation sequences generated by APES while looking at the same scene are never identical, even if there is no variation in the scene. This is caused by 1) slight variations in the first fixation point; 2) small positioning errors in the camera head assembly; 3) frame grabber noise; 4) variations in lighting conditions. Even a one pixel wide difference in the fixation point can lead to a new visual field image for the next fixation, which results in a completely different attentional sequence as fixation goes on.
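The paper does not spell out the 3x3 operators S_i, so the sketch below stands in for them with a Sobel-like kernel rotated in 45 degree steps and a rectified response so that opposite edge polarities map to different symbols; it is only meant to illustrate how a single observation symbol o^t in {0,...,7} could be computed from a fovea image (scipy is assumed to be available).

```python
import numpy as np
from scipy.ndimage import convolve

# Eight 3x3 oriented edge kernels, one per 45-degree step (our own stand-in
# for the operators S_i referred to in the text).
BASE = np.array([[-1.0, 0.0, 1.0],
                 [-2.0, 0.0, 2.0],
                 [-1.0, 0.0, 1.0]])

def rotate45(kernel, times):
    """Rotate a 3x3 kernel by 45-degree steps by shifting its 8 border cells."""
    border = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    vals = [kernel[r, c] for r, c in border]
    out = kernel.copy()
    for idx, (r, c) in enumerate(border):
        out[r, c] = vals[(idx - times) % 8]
    return out

KERNELS = [rotate45(BASE, i) for i in range(8)]

def edge_orientation_feature(fovea):
    """o^t = f_1(I_f^t): index of the orientation whose operator gives the
    strongest (rectified) response summed over the fovea image."""
    f = fovea.astype(float)
    responses = [np.maximum(convolve(f, k), 0.0).sum() for k in KERNELS]
    return int(np.argmax(responses))
```

The symbol returned by edge_orientation_feature is the kind of observation that would be fed into the feature transition matrices of Section 6.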
Figure 17: Simple scenes containing “rectangle” and “hexagon”.

9.1 Simple Scenes

The first set of experiments is performed on simple 2D shapes hanging on a black background, as shown in Figure 17. APES has to decide which scene is being viewed as it attentively fixates and uses evidential reasoning. Learning is based on attentional sequences of length 10. The observed feature transition frequencies are shown in Figure 18 and Figure 19. Even with such short attentional sequences, these matrices start to become distinguishable. The matrix for Scene 1 favors no transitions between the diagonal features 4, 5, 6, and 7, as compared to that of Scene 2.

Figure 18: Scene 1 (rectangle) – Learning using attentional sequences of length 10.
Figure 19: Scene 2 (hexagon) – Learning using attentional sequences of length 10.

For recognition, 20 experiments are conducted and the support values after 10 fixations are considered. Figure 20 and Figure 21 show the generated sequences $O^{10}$ and the recognition results. Using as few as 10 fixations during learning and classification, different feature sequences can be recognized as belonging to the correct shape at a fairly good rate. Note that as the robot’s cameras are not following a pre-defined boundary or trajectory, all twenty sequences generated during these experiments are completely different. Sequences which include highly favored transitions are immediately recognized with a high margin. Those which do not are either incorrectly classified or return only a slightly better result compared to the competing model. Another reason for incorrect classification is the possibility of generating very similar or even identical sequences on two different scenes. However, the correct classification rates indicate that this intersection region is small.

Figure 20: Results after 10 fixations on Scene 1 with 10 fixation learning on Scene 1 and 2. Recognition rate is 90%.
Figure 21: Results after 10 fixations on Scene 2 with 10 fixation learning on Scene 1 and 2. Recognition rate is 90%.

9.2 Complex scenes

In the next set of experiments, the 3 complex scenes from our laboratory shown in Figure 22 are used. Figure 23 and Figure 24 show two sample fixation sequences generated by APES as it is looking at Scene 1. The complexity of our problem can be observed in these sample sequences. For example, in the fifth fovea a boundary caused by a shadow is fixated, and in some foveae, like those numbered 4, 8, 9, and 10, the image is distorted by small camera or body motion, making edge based features quite hard to detect correctly. Note that these are problems common to any practical implementation outside controlled environments. Our methods are expected to cope with such distortions. Also note that in the two sequences, although the starting points are close and the first visual fields are almost identical, the two sequences are quite different – as noted earlier on. However, the spatial and temporal relations of the observed features remain the same.

Figure 22: (Left-right) Wide-angle images of Scene 1, Scene 2 and Scene 3. Squares represent the visual field and fovea.
Figure 23: A sample sequence of visual field images $I_v = (I_v^1, \ldots, I_v^{10})$ on Scene 1.
Figure 24: A sample sequence of visual field images $I_v = (I_v^1, \ldots, I_v^{10})$ on Scene 1.

During learning, APES constructs its models (feature transition frequency matrices) of all three scenes using attentional sequences of length 30. These are presented in Figure 25, Figure 26 and Figure 27 respectively. The Scene 3 model is different from those of the other two scenes, although it is closer to that of Scene 2. Therefore any sequence generated on Scene 3 is likely to be identified correctly in general.
On the other hand, the similarity between the models of the first two scenes hints that these two scenes may be confused with each other. These expectations are borne out by the results shown in Figure 28 to Figure 33. Scene 3 is recognized at a rate of 100% at all times, while Scene 1 and Scene 2 have lower recognition rates.

Figure 25: Scene 1 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 26: Scene 2 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 27: Scene 3 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 28: Results after 30 fixations on Scene 1 with 30 fixation learning on Scene 1 and 3. Recognition rate is 100%.
Figure 29: Results after 30 fixations on Scene 3 with 30 fixation learning on Scene 1 and 3. Recognition rate is 100%.
Figure 30: Results after 30 fixations on Scene 1 with 30 fixation learning on Scene 1 and 2. Recognition rate is 50%.
Figure 31: Results after 30 fixations on Scene 2 with 30 fixation learning on Scene 1 and 2. Recognition rate is 100%.
Figure 32: Results after 30 fixations on Scene 2 with 30 fixation learning on Scene 2 and 3. Recognition rate is 70%.
Figure 33: Results after 30 fixations on Scene 3 with 30 fixation learning on Scene 2 and 3. Recognition rate is 100%.

9.3 Complex and similar scenes

APES was also made to look at three similar scenes with small variations and one unrelated scene, as shown in Figure 34. The changes in the three similar scenes are not very small at all – chairs are missing, for example – but a human viewer tends to overlook these changes. APES is expected to perform similarly and “understand” that the three scenes belong to the same part of the world and Scene 2 to a different part.

Figure 34: (Left-right) Wide-angle images of Scene 1, Scene 2, Scene 3 and Scene 4.
Figure 35: Results of 30 fixations on Scene 1 (top) and Scene 2 (bottom) after 30 fixation learning on Scene 1 and Scene 2. Recognition rates are 100% and 80% respectively.

In Figure 35, the results of the experiments on the original training scenes are shown. Scene 1 can be recognized easily with a high margin, while Scene 2 is recognized in 80% of the experiments with a very low margin. In Figure 36, the results of the experiments on the two variants of Scene 1, namely Scene 3 and Scene 4, are shown. Both scenes can easily be recognized as Scene 1 except in a few experiments.

Figure 36: Results of 30 fixations on Scene 3 (top) and Scene 4 (bottom) after 30 fixation learning on Scene 1 and Scene 2. Recognition rates are 100% and 80% respectively.

Although these experiments show that scene recognition based on attentional sequences can compensate for small changes in the environment, the low margins in the Scene 2 recognition results in Figure 35 are confusing. This result may suggest that the model of Scene 1 may be dominating over that of Scene 2, and that the correct classification of Scene 3 and Scene 4 may be a result of this dominance.
9.4 Multiple object recognition

Temporal recognition using attentional sequences is also suitable for multiple object recognition. This can be demonstrated by an experiment where two parts of a scene containing different objects are modeled separately. For example, in the scene shown in Figure 37, fixations concentrate on two distinct objects, namely the switch and an old fuse board mounted on the wall. The bubble formed while observing this scene is also shown in Figure 37. During learning, feature transition matrices are generated using the fixations on each object separately. In this case we obtain two models from a single sequence of fixations made on two distinct objects. During recognition, the system makes fixations on the same scene and the cumulative support values for the two models are computed. In Figure 38, only fixations after the 10th are considered, so that enough information is accumulated before recognition decisions are started. Initially, after the 10th fixation (shown as 1 in the figure), Model 1 is dominating. Starting with the 20th fixation, the robot starts looking at the parts of the scene learned as Model 2 and, after a transition period where no decision is possible, Model 2 is activated and Model 1 goes down starting with the 35th fixation. As the robot attends to the two areas of the scene, the support values change to favor the corresponding models.

Figure 37: Switch and fuse board scene (left) and the bubble inflated by fixations (right).
Figure 38: Supports for Model 1 and Model 2 vs. fixation number (starting from the 10th).

9.5 Experiments’ Summary and Discussion

In summary, our experiments on simple and complex scenes revealed the following important results about the use of attentional sequences for scene classification: 1) Evidential reasoning is a promising method for the classification of attentional sequences. 2) Even by using very simple edge based features, we can deduce invariant relations from the seemingly varying fovea image sequences generated while looking at the same scene. 3) Using as few as 10 fixations during learning and recognition, good classification performance can be achieved. 4) Results on complex real world scenes, which are hard to classify using classical methods, show that attentional sequence based classification is a promising way of solving such problems. 5) Increasing the learning period does not necessarily improve performance. Good performance with a short learning period is possible, depending on the learning and recognition fixations. In order to achieve good performance, the models (feature transition frequency matrices) need to represent unique features of the scene. How to generate fixation models with this property and how to compute their representation capability are open problems that we are working on.

10 Conclusion

APES is developed as an experimental robot platform for biologically motivated attentive vision. Its novelty stems from the fact that it simultaneously mimics some of the key properties of biological vision – including fovea-periphery distinction, attention, different representation modes, temporal processing and memory. Let us remark that most of these features have been studied previously, however in general separately from each other. There has been little work on their realization altogether in an integrative framework. It turns out that such integration requires the development of new original approaches in addition to the utilization of the more classical solutions.
Our approaches to integrating each feature have proven to be simple but relatively successful approximations of their biological counterparts. For example, the two camera retina model is a very realistic approximation of the biological system. We hope that some of our mechanisms may provide insight into the associated aspects of biological vision where much is still unknown. For example, the bubble model has many practical advantages as a visual memory mechanism, and it may also be interesting as a model to explain visual integration mechanisms in humans. In the future, this approach may lead to a functional memory mechanism which will enable the robot to recall a previously visited environment and to detect changes in it without forming a 3D geometric model or recording a large number of images.

APES is continuing to be developed further in our laboratory for studying attention. There is much room for further work regarding both its physical and visual capabilities. The physical properties of the system need to be improved for increased positioning accuracy and faster mechanical response during fixations. Similarly, its visual processing software is currently being expanded to include a very rich set of visual primitives which may be turned on or off depending on the task. Another focus is on determining the relation between attentional sequences and bubbles in real-time scene exploration tasks.

Acknowledgements

This work is supported by Boğaziçi University Research Fund project 01A202D.

References

1. Abbott, A.L. et al. "Promising directions in active vision". International Journal of Computer Vision, 11:2, 109-126, 1993.
2. Albus, J.S. "Outline for a theory of intelligence". IEEE Trans. Syst., Man and Cybernetics, Vol. 21, No. 3, 1991.
3. Aloimonos, J. "Purposive and qualitative active vision". In Proceedings of the Image Understanding Workshop, September 1990.
4. Ballard, D.H. "Animate Vision". Artificial Intelligence, 48: 57-86, 1991.
5. Ballard, D.H. and C.M. Brown. "Principles of Animate Vision". CVGIP: Image Understanding, 56(1), July 1992.
6. Ballard, D.H. "On the function of visual representations". In K. Akins, editor, Perception, pp. 111-131. Oxford University Press, 1996.
7. Bozma, H.I. and Ç. Soyer. "Shape identification using probabilistic models of attentional sequences". In Proceedings of the Workshop on Machine Vision Applications. IAPR, 1994.
8. Chown, E., Kaplan, S. and Kortenkamp, D. "Prototypes, Location, and Associative Networks (PLAN): Towards a Unified Theory of Cognitive Mapping". Cognitive Science 19, pp. 1-51, 1995.
9. Clark, J. "Spatial attention and latencies in saccadic eye movements". Vision Research, Vol. 39, pp. 585-602, 1999.
10. Fiala, J.C. et al. "TRICLOPS: A Tool for Studying Active Vision". International Journal of Computer Vision, 12:2/3, 231-250, 1994.
11. Gallant, J.L., C.E. Connor, S. Rakshit, J.W. Lewis and D.C. Van Essen. "Neural Responses to Polar, Hyperbolic and Cartesian Gratings in Area V4 of the Macaque Monkey". Journal of Neurophysiology, Vol. 76, No. 4, pp. 2718-2739, 1996.
12. Gouras, P. "Oculomotor system". In J.H. Schwartz and E.R. Kandel, editors, Principles of Neural Science. Elsevier, 1986.
13. Gouras, P. and C.H. Bailey. "The retina and phototransduction". In J.H. Schwartz and E.R. Kandel, editors, Principles of Neural Science. Elsevier, 1986.
14. Greene, H.H. "Temporal relationships between eye fixations and manual reactions in visual search". Acta Psychologica, 101:105-123, 1999.
15. Grosso, E., E. Manzotti, R. Tiso and G. Sandini. "A Space-Variant Approach to Oculomotor Control". In Proceedings of the International Symposium on Computer Vision, pp. 509-514, 1995.
16. Hubel, D.H. Eye, Brain and Vision. Scientific American Library, 1988.
17. Huber, E. and Kortenkamp, D. "A behavior-based approach to active stereo vision for mobile robots". Artificial Intelligence, 11, pp. 229-243, 1998.
18. Itti, L. and Koch, C. "Computational Modeling of Visual Attention". Nature Reviews Neuroscience, Vol. 2, February 2001.
19. Julesz, B. Dialogues on Perception. MIT Press, Cambridge, MA, 1995.
20. Keskinpala, T. et al. "Knowledge-Sharing Techniques for Egocentric Navigation". In Proceedings of the IEEE Conference on Systems, Man and Cybernetics, pp. 2469-2476, 2003.
21. Kowler, E. "Eye movements". In S.M. Kosslyn and D.N. Osherson, editors, Visual Cognition, pp. 215-266. MIT Press, 1995.
22. Lago-Fernandez, L.F., Sanchez-Montanes, M.A. and Cobacho, F. "A biologically inspired visual system for an autonomous robot". Neurocomputing, 38-40:1385-1391, 2001.
23. McGaugh, J.L., N.M. Weinberger and G. Lynch, editors. Brain and Memory. Oxford University Press, 1995.
24. Noton, D. and L. Stark. "Scan paths in eye movements during pattern recognition". Science, Vol. 171, pp. 308-311, January 1971.
25. Palmer, J., Verghese, P. and Pavel, M. "The psychophysics of visual search". Vision Research, 40:1227-1268, 2000.
26. Papanikolopoulos, N.P. "Adaptive control, visual servoing, and controlled active vision". In Proceedings of the IEEE International Conference on Robotics and Automation, 1994.
27. Peters, R.A. et al. "The Sensory Ego-Sphere as a Short-Term Memory for Humanoids". In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, 2001.
28. Rao, R.P.N. et al. "Modeling Saccadic Targeting in Visual Search". In D. Touretzky, M. Mozer and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS*95). MIT Press, 1996.
29. Rimey, R.D. and C.M. Brown. "Selective attention as sequential behaviour: Modelling eye movements with an augmented hidden Markov model". Technical Report, Computer Science Department, The University of Rochester, February 1990.
30. Rimey, R.D. and C.M. Brown. "Control of Selective Perception Using Bayes Nets and Decision Theory". International Journal of Computer Vision, 12:2/3, 173-207, 1994.
31. Rybak, I.A., V.I. Gusakova, A.V. Golovan, L.N. Podladchikova and N.A. Shevtsova. "A Model of Attention-Guided Visual Perception and Recognition". Vision Research, Special Issue: Models of Recognition, 1998.
32. Schlingensiepen, K.H. et al. "The importance of eye movements in the analysis of simple patterns". Vision Research, Vol. 26, No. 7, pp. 1111-1117, 1986.
33. Shafer, G. A Mathematical Theory of Evidence. Princeton University Press, 1976.
34. Shin, C.W., S. Inokuchi and K.I. Kim. "Retina-like visual sensor for fast tracking and navigation robots". Machine Vision and Applications, Vol. 10, pp. 1-8, 1997.
35. Soyer, Ç. and H.I. Bozma. "Further experiments in classification of attentional sequences: Combining instantaneous and temporal evidence". In Proceedings of the IEEE 8th International Conference on Advanced Robotics (ICAR), 1997.
36. Soyer, Ç., H.I. Bozma and Y. Istefanopulos. "A New Memory Model for Selective Perception Systems". In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2000.
37. Soyer, Ç. and H.I. Bozma. "Attentional Sequence Based Recognition: Markovian and Evidential Reasoning". IEEE Transactions on Systems, Man and Cybernetics, Vol. 33, No. 6, pp. 937-950, December 2003.
38. Soyer, Ç. "A model of active and attentive vision". PhD dissertation, Boğaziçi University, 2002.
39. Stark, L. and S.R. Ellis. "Scan paths Revisited: Cognitive Models Direct Active Looking". In Fisher, Monty and Senders, editors, Eye Movements: Cognition and Visual Perception, pp. 193-226. Erlbaum, NJ, 1981.
40. Tagare, H., K. Toyama and J.G. Wang. "A Maximum Likelihood Strategy for Directing Attention During Visual Search". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 5, pp. 491-500, May 2001.
41. Treisman, A. and G. Gelade. "A feature integration theory of attention". Cognitive Psychology, 12, pp. 97-136, 1980.
42. Tsotsos, J.K. et al. "Modeling visual attention via selective tuning". Artificial Intelligence, 78:507-545, 1995.
43. Viviani, P. "Eye movements in visual search: Cognitive, perceptual and motor control aspects". In E. Kowler, editor, Eye Movements and Their Role in Visual and Cognitive Processes, pp. 71-112. Elsevier, 1990.
44. Wasson, G., Kortenkamp, D. and Huber, E. "Integrating active perception with an autonomous robot architecture". Robotics and Autonomous Systems, 29:175-186, 1999.
45. Westin, C. et al. "Attention control for robot vision". In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 726 ff., 1996.
46. Yeap, W.K. "Towards a computational theory of cognitive maps". Artificial Intelligence, 34:297-360, 1988.
47. Yeap, W.K. and Jefferies, M.E. "Computing a representation of the local environment". Artificial Intelligence, 107, pp. 265-301, 1999.
48. Zeki, S. "The visual image in mind and brain". Scientific American, Vol. 267, No. 3, September 1992.
49. Zingale, C.M. and Kowler, E. "Planning Sequences of Saccades". Vision Research, Vol. 27, No. 8, pp. 1327-1341, 1987.