To appear in “Autonomous Robots”, 2006.

APES: ATTENTIVELY PERCEIVING ROBOT

Ç. Soyer†‡   H.I. Bozma‡   Y. İstefanopulos†‡
[email protected]   [email protected]   [email protected]
† Institute of Biomedical Engineering, Boğaziçi University, 80015 Bebek, İstanbul, Turkey
‡ Intelligent Systems Laboratory, Department of Electrical and Electronic Engineering, Boğaziçi University, 80015 Bebek, İstanbul, Turkey

Abstract

Robot vision systems inspired by human-like vision are required to employ mechanisms similar to those that have proven to be crucial in human visual performance. One of these mechanisms is attentive perception. Findings from vision science research suggest that attentive perception requires a multitude of properties: a retina with fovea-periphery distinction, an attention mechanism that can be manipulated both mechanically and internally, an extensive set of visual primitives that enable different representation modes, an integration mechanism that can infer the appropriate visual information in spite of eye, head, body and target motion, and finally memory for guiding eye movements and modeling the environment. In this paper we present an attentively “perceiving” robot called APES. The novelty of this system stems from the fact that it incorporates all of these properties simultaneously. As is explained, original approaches have to be taken to realize each of the properties so that they can be integrated together in an attentive perception framework.

Keywords: Attention, selective perception, active vision, robot vision, mobile robots, attentional sequences, visual memory, bubble model, temporal recognition, sequence based recognition.

1 Introduction

Biological systems are capable of seeing without running into problems such as the computational complexity that must be confronted when endowing robots with vision. It is argued that if human technology can mimic or copy nature, then biological-like performance can follow. Obviously, blind copying – if at all possible – may not work best, since optimal designs are not necessarily the end product of evolution and, more importantly, the underlying constraints are different. The trick may lie in understanding the essential biological features well enough to develop counterpart analogies, and this has been a motivation for this work.

Consider the scene of Figure 1 (left) taken from our laboratory. If a robot is to understand that it is part of our laboratory, traditionally it would look at the whole image at once and try to process it. Findings of vision science suggest that biological systems do not quite work this way. Instead of looking at the scene all at once and therefore being bombarded with massive amounts of data, biological vision systems attentively view their scene – thus collecting a sequence of spatio-temporal visual data as shown in Figure 1 (right).

Figure 1: (Left) A scene from our laboratory. (Right) A sequence of fixations.

Although the exact mechanisms of attentive perception are still unknown, recent work on humans and monkeys has revealed some integral properties:

1. Fovea-Periphery Distinction: Biological vision systems process only a small part of their visual field in detail [13,16].
Unlike traditional cameras used by man-made imaging systems, the distribution of receptor cells on the retina is like a Gaussian with a very small variance, resulting in a dramatic loss of resolution as we move away from the optical axis of the eye. The small region of highest acuity around the optical axis is called the fovea, and the rest of the retina is called the periphery.

2. Attention: Attentive processing via mechanical eye and head motions and cognitive mechanisms generates a stream of spatio-temporally related visual data [12,21]. As a consequence of the fovea-periphery distinction, saccades – fast eye movements – are used to bring images of chosen objects onto the fovea, where resolution is at its best. This physical attention mechanism is called overt attention. There is strong evidence suggesting that saccades are voluntary and require the computation of the relative position of a visual feature of interest with respect to the fovea in order to determine the direction and amplitude of the saccade. A second type of attention, called covert attention, refers to unconscious attentional effects. These include poorly understood complex cognitive processes, which determine the attention behavior of the system.

3. Visual Primitives & Representation Modes: Findings suggest that attentional deployment is based on a rich set of representation modes [6,11,41]. Cells in the visual path from the retina to the primary and other cortical regions respond to increasingly more complex stimuli, accompanied by larger receptive fields on the retina. For example, in the primary visual cortex, simple cells respond to lines of a particular orientation, the more common complex cells respond to motion, and some cells, both simple and complex, respond to specific corners and curvatures.

4. Serial Processing: Although the human visual system is massively parallel in structure, most visual tasks also require serial processing, as the oculomotor activity results in the perception of a series of images in time [9,14,21,24,25,32,49]. Especially in counting or comparison experiments, more complex scenes lead to longer processing times in human subjects because of the increased number of fixations or eye movements required to solve the task. This implies that information is collected and somehow combined after each fixation until there is enough information to make a decision.

5. Memory: Human vision also relies heavily on short and long term memory [16,19,21,23,49]. Some cognitive effects during attention control, like inhibition of return or negative priming, require a short-term memory mechanism. Long-term memory is used to accumulate visual information during fixations and to build abstract models of the environment that can last for years.

The use of human-like mechanisms in robot vision systems has rapidly increased in recent years, starting with the introduction of the active vision paradigm [1,3,4,5]. Most of the work in biologically motivated robot vision systems has concentrated on the realization of the first three properties discussed above. Earlier research focused on the construction and control of camera heads that can replicate eye motions [10,26,45]. Later on, the "where to look" and "how to look" problems were also studied, and various models of attention, eye movements and visual search have been developed by both the robot vision and biological vision communities [28,29,30,31,40,42].
The attention mechanism – unlike those of classical computer vision systems – requires new approaches that make use of the spatio-temporal visual data thus generated [7,17,31,39]. The use of various memory mechanisms has been widely discussed in the cognitive science literature [2,8,20,27,46,47]. Let us remark that most of these features have been studied previously, however in general separately from each other. To be of use in real world tasks, the vision system of an attentive robot needs to implement all of the above properties simultaneously. There are only a few studies in the literature where all these properties are addressed together in a comprehensive manner [17,22,38].

This paper describes the attentive visual processing of APES – a mobile robot whose novelty comes from integrating all of the above properties of biological vision in a single system. In the remainder of this paper, we explain the different components of this integrated active vision system. We first give a general overview of APES’ hardware and software in Section 2. As presented in Section 3, the fovea-periphery distinction is realized with a two-camera system. The control of the focus of attention through a pre-attention and short-term memory mechanism is explained in Section 4. In this framework, the focus of attention can also be changed “mentally” by utilizing appropriate attention criteria and applying different types of processing. Next, attentive processing and representation modes are discussed in Section 5. Serial processing of the spatio-temporal data thus generated is based on evidential reasoning, as explained in Section 6. Finally, the incorporation of long-term memory is discussed in Section 7. Section 8 presents the complete system. Experimental results from a variety of scenarios are discussed in Section 9. The paper concludes with a brief summary and remarks about future directions.

2 APES Hardware and Software

APES, shown in Figure 2, is a mobile robot developed in our laboratory for attentive vision research. Its body is driven by two conventional wheels. Using four stepping motors, it can translate and rotate its body and direct its cameras to visual stimuli by pan and tilt motions. The body rotation and camera pan axes have been designed to be co-centered in order to simplify transformations during combined body and camera motions; for mechanical stability reasons, they do not coincide with the centerline of the cylindrical body. Table 1 and Figure 3 present the technical specifications and hardware configuration of APES respectively. The main visual processing module, running on a workstation, performs vision processor setup, frame grabbing, pre-attentive and attentive processing and serial communications. The on-board PC104 computer is responsible for serial communications, motor control, and camera control. All camera features including zoom angle can be controlled by the on-board computer.

Table 1: Technical specifications of APES.
Height: 60 cm. Radius: 37 cm. Wheel span: 52 cm. Wheel radius: 15 cm.
Drive method: stepping motors. Power: 12 V battery.
Pan accuracy: 1.8 degrees. Tilt accuracy: 1.8 degrees.
Video format: CCIR composite. Image size: 512x512 pixels. Camera lens: 4-47 degree zoom.

Figure 2: APES robot and its 2 dof camera base.
Figure 3: Schematic of APES.
Figure 4: APES main software snapshot.

The two degrees of freedom step motor based head assembly and camera motions of APES cannot be compared to the highly developed oculomotor system.
However, APES can effectively control the optical axis of its cameras with an accuracy of 1.8 degrees thanks to its step motor based drive system. Camera motions correspond to the large and fast saccadic motions of the eye, which are used for fixating different spatial targets. During operation, the saccade system determines the new fixation point in the periphery and the corresponding saccade vector. This information is sent to the on-board computer, which moves the camera accordingly. The new visual field is then processed by the vision system.

In Figure 4, a snapshot of APES’ main software is shown. The two large image boxes can display raw or processed images from the two cameras, which are used to simulate the fovea-periphery distinction of the human eye as explained in the next section. The tiny fovea image is also shown on the left, below the large image. A control window is used to select operating modes and settings, and a separate data window displays all computations, including fovea saliencies, attentive features, saccade vectors, bubble points and fixation numbers. Its simple hardware and flexible software libraries enable easy integration of different oculomotor and retina models as well as memory and recognition modules to build a biologically motivated vision system.

3 Fovea-Periphery Distinction: APES Retina Model

Biological vision systems process only a small region of their visual field in detail. The small region of highest acuity around the optical axis is referred to as the fovea. It is thought to provide information regarding the scene or the current visual task. The rest of the visual field, called the periphery, is much lower in resolution and is used in finding the next fixation point [13,16]. The retina model of APES incorporates such a fovea-periphery distinction, as shown in Figure 5 (left). Since this is not possible with a single fixed resolution camera and spatially variant cameras are still under development [15,34], APES uses a two-camera retina model in order to realize such a model. As seen in Figure 5 (right), a wide angle camera is used to get peripheral visual data in low resolution, while a narrow angle camera is used to get foveal data. The two cameras are fixed together such that their optical axes are parallel and as close as possible. There is a horizontal separation of about 5 centimeters between the optical axes, resulting in a fovea image which is not exactly at the center of the peripheral image. There is also an error caused by the accuracy of the stepping motors. These errors are corrected in software by shifting the center of the fovea in the acquired periphery image to better match the actual fovea for inhibition purposes. The periphery camera has a 46 degree wide angle lens and generates an angular pixel density of approximately 11 pixels/degree. The fovea camera, on the other hand, has a narrow angle lens with a 4 degree viewing angle. It dedicates all of its resolution accordingly and therefore obtains an angular pixel density of 128 pixels/degree. The resulting photoreceptor distribution has a Gaussian shape whose variance is very small, creating a steep peak around the optical axis – thus endowing the robot with a high resolution fovea and a low resolution periphery as shown in Figure 6.
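The geometry of this retina model can be summarized in a few lines of code. The Python sketch below is only an illustration: the pixel densities, the 512x512 image size and the 1.8 degree step size are taken from the text above, whereas the function names, the Gaussian acuity width sigma_deg and the sign conventions are our own assumptions.

```python
import numpy as np

# Values taken from the text; everything else is illustrative.
PERIPHERY_PX_PER_DEG = 512 / 46.0   # wide-angle camera: ~11 pixels/degree
STEP_DEG = 1.8                      # pan/tilt stepping motor resolution

def saccade_steps(fix_row, fix_col, img_size=512):
    """Convert a fixation point chosen in the periphery image (pixels) into
    the number of pan and tilt motor steps needed to center the narrow-angle
    fovea camera on it, assuming the two optical axes are roughly aligned."""
    pan_deg = (fix_col - img_size / 2.0) / PERIPHERY_PX_PER_DEG
    tilt_deg = -(fix_row - img_size / 2.0) / PERIPHERY_PX_PER_DEG  # rows grow downwards
    return int(round(pan_deg / STEP_DEG)), int(round(tilt_deg / STEP_DEG))

def acuity_map(img_size=512, sigma_deg=2.0):
    """Gaussian acuity profile over the visual field: close to 1 near the
    optical axis (the fovea) and falling off quickly in the periphery."""
    rows, cols = np.mgrid[0:img_size, 0:img_size]
    r_deg = np.hypot(rows - img_size / 2.0, cols - img_size / 2.0) / PERIPHERY_PX_PER_DEG
    return np.exp(-0.5 * (r_deg / sigma_deg) ** 2)
```

In the real system the 5 cm baseline between the two optical axes and the finite motor accuracy make such a mapping only approximate, which is why the fovea center is shifted in software as described above.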
Figure 5: (Left) APES’ visual field with fovea-periphery distinction. (Right) Two camera retina model of APES.
Figure 6: (Left) Periphery. (Center) Fovea with uniform resolution. (Right) Fovea with the two camera model.

4 Attention Mechanism and Short-Term Fixation Memory

Attentive vision works in a loop of pre-attention and attention. In pre-attention, the next fovea is determined by simple, fast calculations in the periphery. In attention, a fixated fovea is processed to extract more complex features, which are used to build a high level cognitive model of the scene. APES’ visual processing is based on a similar loop, as shown in Figure 7. Let $I_v^t$ and $I_f^t \subset I_v^t$ denote the visual field and the fovea at time $t$ respectively. In pre-attention, all the fovea sized regions in the periphery, with some overlapping, constitute the set $C(I_v^t)$ of candidate foveae as shown in Figure 5 (left). Each $I_f^c \in C(I_v^t)$ is then considered and its saliency is computed. The saliency measure is an attention function $a : I_f^c \to [0,\infty)$, which is determined by the current task under the constraint that it must be simple to compute. In the literature, many different computational models of pre-attention have been proposed, as summarized in [18]. As such, one conceives attention simply as the facilitation of a certain set of neurons and thus the highlighting of particular features at a particular position in the visual field.

As expected, the fixation behavior can change depending on the selected attention function. We have experimented with different functions. For most of our experiments, the attention criterion is simply defined as

$$a(I_f^c) \triangleq \sum_{p \in I_f^c} \left\| \nabla I(p) \right\|$$

that is, the total sum of the gradient magnitudes of all the pixels within the candidate fovea. However, with just such an attention function, short-term fixation loops that cycle between only two or three foveae may occur. Hence the attention function response must be modified by mechanisms that inhibit such cyclic behavior. Visual findings indicate that there are two types of memory present here [18]:

1. Inhibition of return – the process by which the currently attended location is prevented from being attended again.
2. Short-term memory – the process by which the last few fixations are recalled and prevented from being attended again.

APES has two such built-in mechanisms – inhibition and short-term fixation memory – for this purpose. First, the next fovea is forced to be away from the currently fixated fovea $I_f^t$ using an inhibition region. This is achieved by defining an $H \times H$ pixel region $I_h^t$ around $I_f^t$ as the inhibition region, as shown in Figure 5 (left). All candidate foveae $I_f^c \in C(I_h^t)$ falling within the inhibition region are inhibited. Note that the inhibition mechanism also enables the control of saccade magnitudes. Secondly, a short-term fixation memory mechanism $C_d$ is implemented. This mechanism works by keeping track of previously fixated foveae and inhibiting them even if they are not within the current inhibition region. For this, we use a first-in-first-out memory $C_d = \{ I_f^t, I_f^{t-1}, \ldots, I_f^{t-D} \}$ of size $D$. All foveae in this memory are inhibited during pre-attention. Obviously, the value of $D$ puts a lower bound on the permissible length of fixation loops.
At the end of each new fixation, $I_f^{t-D}$ is removed from this memory while $I_f^{t+1}$ is added to it. Pre-attentive processing together with the inhibition and short-term memory mechanisms are merged to form an augmented attention function $\tilde{a} : I_f^c \to \mathbb{R}^+$ as:

$$\tilde{a}(I_f^c) = \begin{cases} 0 & \text{if } I_f^c \in C(I_h^t) \\ 0 & \text{if } I_f^c \in C_d \\ a(I_f^c) & \text{if } I_f^c \in C(I_v^t),\ I_f^c \notin C(I_h^t) \cup C_d \end{cases}$$

APES determines its next fovea $I_f^{t+1}$ by finding the most salient candidate fovea in its periphery using this augmented attention function as:

$$I_f^{t+1} \in \arg\max_{I_f^c \in C(I_v^t)} \tilde{a}(I_f^c)$$

It then moves its camera so as to fixate on its center. After a fixation point is found, the fixation point image coordinates are converted to camera coordinates and the amount of motion required for fixation is calculated. Using the results of this calculation, the fovea camera is directed to the new fixation point and a fovea image is grabbed. APES can effectively control the optical axis of its camera with an accuracy of 1.8 degrees due to its step motor based drive system. Camera motions correspond to the large and fast saccadic motions of the eye. Although this system cannot be compared to the highly developed and poorly understood oculomotor system of mammals, it nevertheless suffices for implementing a physical attention mechanism. While this image is being analyzed for higher level features, the periphery camera is free to look for the next fixation point if there is parallel processing.

Figure 7: Attention process in the two camera retina model.

In Figure 8, APES is looking at the curved metal object on the left and generating the sequence of fixation images on the right. In this experiment, a gradient based attention criterion is used within an augmented attention function as described above.

Figure 8: Snapshots from a recognition experiment using selective attention (right) on a curved metal object (left).

Note that although the augmented attention function can potentially result in rather complex and unpredictable attention behavior, the basic selective attention mechanism of APES is relatively simple when compared to some other work in the literature. For example, the early works of Ballard, Rimey and Brown introduce various attention control mechanisms integrated into a 6 dof robotic arm [4,5,29,30]. Many aspects of animate vision, including reference frames, gaze control, vergence and depth, and foveal and peripheral features, are introduced and investigated in [4] and [5]. While that work focused on selective attention and fixation control, our work on APES is focused on both generating fixations and using the information collected during the selective attention process. As explained in Section 6, regardless of its attention control mechanisms, the output of an attentive vision system can be characterized by a sequence of fixations and a sequence of feature vectors computed during each fixation. An integrated system should have the mechanisms to process this information in a timely manner within its attention loop shown in Figure 7.
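To make the pre-attention loop concrete, the following Python sketch scores all candidate foveae with the gradient criterion and applies the two inhibition mechanisms. The 40x40 fovea, 50% overlap and memory depth D = 10 are the values used in the experiments of Section 9; the size of the inhibition region and all names are our own illustrative assumptions.

```python
import numpy as np
from collections import deque

def gradient_saliency(periphery):
    """Pixelwise gradient magnitude of the periphery image (the basis of a(.))."""
    gy, gx = np.gradient(periphery.astype(float))
    return np.hypot(gx, gy)

def next_fixation(periphery, current=None, memory=(), fovea=40, overlap=0.5, inhibit=80):
    """One pre-attentive step: evaluate the augmented attention function over
    all candidate foveae and return the center of the most salient one."""
    grad = gradient_saliency(periphery)
    step = max(1, int(fovea * (1.0 - overlap)))           # candidate spacing
    best, best_score = None, -1.0
    for r in range(0, periphery.shape[0] - fovea + 1, step):
        for c in range(0, periphery.shape[1] - fovea + 1, step):
            center = (r + fovea // 2, c + fovea // 2)
            # Inhibition of return: HxH region around the current fixation.
            if current is not None and max(abs(center[0] - current[0]),
                                           abs(center[1] - current[1])) < inhibit // 2:
                continue
            # Short-term fixation memory: last D fixated foveae are inhibited.
            if any(max(abs(center[0] - m[0]), abs(center[1] - m[1])) < fovea // 2
                   for m in memory):
                continue
            score = grad[r:r + fovea, c:c + fovea].sum()  # a(I_f^c)
            if score > best_score:
                best, best_score = center, score
    return best

# Usage sketch: a FIFO of depth D = 10 realizes C_d.
# memory, fix = deque(maxlen=10), None
# fix = next_fixation(periphery_image, current=fix, memory=memory)
# memory.append(fix)
```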
5 Visual Primitives & Representation Modes

In the attentive stage, APES applies more complex visual processing to its current fovea $I_f^t$ in order to extract its visual properties. This choice will vary depending on the task at hand and the representation mode. The results of this processing are encoded as an observation vector $o^t$. Currently, in the attentive stage, APES processes its fovea using visual primitives such as edge magnitude, edge orientation, and saccade direction, shown in Figure 9. This set has recently been extended to include Cartesian and non-Cartesian filters, which have been experimentally confirmed to be used in primate vision [11]. Different visual primitives may also be conjoined together.

The ability to use different representation modes in both the pre-attentive and attentive stages enables APES to explore and internally represent its environment in different ways. For example, by using a gradient based attention criterion APES can be made to prefer focusing on image regions with high frequency content such as contours, or by using a brightness feature it can be made to fixate on light sources, reflective objects, etc. Similarly, the object whose contours are focused upon can be modeled by APES using the sequence of edge types or by using the sequence of saccade vectors, which gives shape information.

Figure 9: (Left) Chain coded saccade directions. (Center) Edge types. (Right) Cartesian and non-Cartesian filters used by APES as attentive features.

6 Attentional Sequence and Temporal Recognition

Visual attention systems are widely studied as feature search mechanisms and as models of human attention [18,29,30,41,42,45]. However, there are few cases in the literature where an attentive system and its output – the attentional sequence described below – are used for object or scene recognition. One example is Rybak’s work on attention-guided visual recognition [31]. In this approach, the fovea image or features are compared to a library and then further analyzed when there is a close match. While selective attention is used as an efficient search mechanism in this case, this approach still relies on classical methods at the recognition stage and does not model recognition by an attentive system where information is accumulated in time, over a number of fixations on different points in space. In APES we aim to model this behavior, which is assumed to exist in humans and other animals. Rimey and Brown also studied attentional sequence modeling using Hidden Markov Models and Bayes nets [29,30]. Their work focused on another interesting problem, where the models were used to help attentional sequencing by a robot. In earlier work we used Rimey and Brown’s HMM approach for modeling and recognition of attentional sequences and compared it to evidential reasoning as an alternative method [37]. The latter method is described below as applied in the APES robot.

As APES cycles through the pre-attention attention loop, it generates a spatio-temporally related stream of fixations and thus observations – which we refer to as an “attentional sequence”. After $T$ fixations, the attentional sequence can be denoted as $O^T = (o^1, \ldots, o^T)$ where $o^k$ is the observation at fixation $k$. APES can then use this information in a cumulative manner. One method of accomplishing this is based on evidential reasoning [33,35,37]. In this approach, for a given visual decision task, all the candidate alternatives are considered as competing propositions. Each body of evidence $o^k$, $k = 1, \ldots, T$ in $O^T$ is found to support the competing propositions to different degrees. Spatio-temporally related bodies of evidence are then combined in a cumulative manner to find the proposition which is most supported at an instant $t$. Decisions can then be made accordingly. Let $k^*$ be the correct classification of a scene.
Suppose the set of its possible values is given by $K$. Then the propositions of interest are precisely those of the form “the true value of $k^*$ is in $A$”, and hence they are in 1-1 correspondence with the subsets $2^K$ of $K$. Thus, we use $A \in 2^K$ to denote a proposition. In classification, we are in particular interested in propositions of the form $A_k = \{k\}$, $k = 1, \ldots, N_K$, where $N_K = |K|$. $A_k$ is taken to mean “the object under view is $k$”. Now suppose that for each proposition $A_k$ we have a transition frequency matrix $T_k : \Omega \times \Omega \to [0, \infty)$. Each entry $T_k(o_i, o_j)$ represents the weight of evidence attested to observing $o_j$ after having observed $o_i$. Recall that $o^t \in \Omega$ is the observation at time $t$, and note the use of feature transitions rather than the features themselves. Then each observation attests evidence for each proposition $A_k$ as follows. Let $\omega : 2^K \times \Omega \to [0, \infty)$ represent the weight of evidence function. Then,

$$\omega(A_k, o^t) = T_k(o^{t-1}, o^t)$$

In evidential reasoning, the degrees of support for the various propositions discerned by $K$ are determined by the weights of evidence attesting to these propositions. Let $s_k : 2^K \times \Omega \to [0,1]$ define a simple support function focused on $A_k$. Then $s_k$ can be defined as

$$s_k(A, o^t) = \begin{cases} 0 & \text{if } A_k \not\subset A \\ s_k(A_k, o^t) & \text{if } A_k \subset A,\ A \neq K \\ 1 & \text{if } A = K \end{cases}$$

where $s_k(A_k, o^t) = 1 - e^{-c\,\omega(A_k, o^t)}$. Note that $s_k$ is a belief function with basic probability numbers $m(A_k) = s_k(A_k, o^t)$, $m(K) = 1 - s_k(A_k, o^t)$ and $m(A) = 0$ for all other $A \subset K$ that do not contain $A_k$.

Each piece of evidence supports each proposition $A_k$, $k = 1, \ldots, N_K$, with strength $s_k(A_k, o^t)$. As each proposition conflicts with the others, the effect of each should be diminished by the others, and an instantaneous support $s_k^i$ for each proposition $A_k$ should be calculated. The instantaneous support $s_k^i : 2^K \times \Omega \to [0,1]$ can be computed as the orthogonal sum of the simple support functions $s_k$ focused on $A_k$, given by the basic probability numbers

$$m(A_k, o^t) = \frac{s_k(A_k, o^t) \prod_{j=1,\, j \neq k}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)}
\qquad
m(K, o^t) = \frac{\prod_{j=1}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)}$$

and

$$s_k^i(C, o^t) = \begin{cases}
0 & \text{if } C \text{ contains none of the } A_k,\ k = 1, \ldots, N_K \\[4pt]
\dfrac{s_k(A_k, o^t) \prod_{j=1,\, j \neq k}^{N_K} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)} & \text{if } C \text{ contains } A_k \text{ but no } A_j,\ j \neq k \\[8pt]
\dfrac{\sum_{k :\, A_k \subset C} s_k(A_k, o^t) \prod_{j \neq k} \left( 1 - s_j(A_j, o^t) \right)}{1 - \prod_{j=1}^{N_K} s_j(A_j, o^t)} & \text{if } C \text{ contains some of the } A_k,\ C \neq K \\[8pt]
1 & \text{if } C = K
\end{cases}$$

Figure 10: Calculation of instantaneous support for a two hypotheses case.

In Figure 10, the calculation of the instantaneous support is shown graphically for a two hypotheses case. Note the abstraction of information from the fovea image to the extracted feature and feature transition. In this case, ‘Evidence 1’ and ‘Evidence 2’ refer to bodies of evidence supporting two different hypotheses. The effect of $s_k^i$ is to provide instantaneous support for each proposition $A_k$. The total support $s_k^t$ for each proposition $A_k$ can then be accumulated by combining the total support $s_k^{t-1}$ accumulated so far with the instantaneous support $s_k^i$. This is the case of homogeneous evidence – evidence strictly supporting a single proposition. The cumulative support function $s_k^t : 2^K \times \Omega^t \to [0,1]$ for proposition $A_k$ attested by the attentional sequence $O^t$ can be computed using Bernoulli's rule.
Bernoulli's rule of combination provides an iterative rule for updating $s_k^{t-1}$ focused on $A_k$ with support $s_k^{t-1}(A_k)$ using the instantaneous information $s_k^i$ focused on $A_k$ with support $s_k^i(A_k)$. It is defined recursively as the orthogonal sum $s_k^t = s_k^{t-1} \oplus s_k^i$:

$$s_k^t(C, O^t) = \begin{cases} 0 & \text{if } C \text{ does not contain } A_k \\ 1 - \left( 1 - s_k^i(A_k, o^t) \right)\left( 1 - s_k^{t-1}(A_k, O^{t-1}) \right) & \text{if } C \text{ contains } A_k,\ C \neq K \\ 1 & \text{if } C = K \end{cases}$$

Figure 11 explains the calculation of the temporal support for a two hypotheses case. In this case, ‘Evidence 1’ and ‘Evidence 2’ refer to bodies of evidence supporting the same hypothesis at different times or fixations.

Figure 11: Calculation of temporal support for a two hypotheses case.

Many different strategies can then be used in order to make a decision about the current visual task. A simple strategy is choosing the maximally supported proposition $A_{k^*}$ where:

$$k^* = \arg\max_{k \in K} s_k^t(A_k, O^t)$$

In creating a model for each proposition $A_k$, which may correspond to an object image or a complex scene, APES starts observing the respective scene or object in an attentive manner. As it is consecutively fixating and forming observations, the transition $T_k(o^{t-1}, o^t)$ between two consecutive observations in this scan path is recorded by incrementing the frequency of that particular transition by 1. Hence, for any library model, the number of transitions between any pair of features forms an $|\Omega| \times |\Omega|$ feature transition matrix. These matrices serve directly as weights of evidence.

The modeling stage is critical to recognition performance. In order to obtain a reliable model, all parts of a scene must be observed equally during the learning fixations. Therefore, the learning period, as determined by the length of the attentional sequence, must be long enough to allow different scan paths to be taken. A partial model that does not include all possible scan paths, and thus all possible feature transitions, will mean that the scene is incompletely modeled. However, due to the attention mechanisms involved, this does not necessarily imply poor recognition performance. As discussed in the experiments section of this paper, the system is sometimes able to model the most characteristic features of a scene during a short learning phase and therefore perform successful recognition. Although it can be considered a special case, this property of attentional sequence based recognition is successfully employed in biological systems and needs to be studied in more detail. We speculate that this special case may in fact be the key to human-like visual performance.
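The recognition machinery of this section reduces to a few array operations once only the singleton propositions $A_k$ are tracked. The Python sketch below is our own illustration of it: the constant c, the handling of a zero denominator and all names are assumptions, and only the supports $s_k^t(A_k, O^t)$ actually used in the decision rule are computed.

```python
import numpy as np

def learn_transition_matrix(sequence, n_features=8):
    """Build a feature transition frequency matrix T_k from a learning
    attentional sequence of observation symbols in {0, ..., n_features-1}."""
    T = np.zeros((n_features, n_features))
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        T[prev, cur] += 1.0
    return T

def cumulative_supports(sequence, models, c=1.0):
    """Accumulate the evidential support of an attentional sequence for each model.

    models: list of transition matrices T_k, one per proposition A_k.
    Returns the supports s_k^t(A_k, O^t) after the last fixation."""
    n = len(models)
    total = np.zeros(n)                                   # no support initially
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        omega = np.array([T[prev, cur] for T in models])  # weights of evidence
        s = 1.0 - np.exp(-c * omega)                      # simple supports
        # Orthogonal sum of the simple supports -> instantaneous supports
        denom = 1.0 - np.prod(s)
        inst = np.zeros(n)
        if denom > 0:
            for k in range(n):
                inst[k] = s[k] * np.prod(np.delete(1.0 - s, k)) / denom
        # Bernoulli's rule: combine with the support accumulated so far
        total = 1.0 - (1.0 - inst) * (1.0 - total)
    return total

# Decision rule: the maximally supported proposition wins.
# T1, T2 = learn_transition_matrix(seq_scene1), learn_transition_matrix(seq_scene2)
# k_star = int(np.argmax(cumulative_supports(observed_seq, [T1, T2])))
```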
7 Long Term Memory Model: Bubbles

Biological systems are known to create abstract long-term memories of their environment for navigation and self-localization using visual data. Recent environment mapping techniques for mobile robots use the local sensory experiences of the robot to build a Cartesian representation of the environment [8,46,47]. The sensory egosphere was first used to project images onto a spherical surface placed around the robot [2]. This idea was later used as a database structure where each point on the sphere points to a data structure representing the sensory inputs from that direction [27,20] – thus facilitating the storage and retrieval of the sensory experience of a mobile robot in a natural and efficient way. However, the egosphere itself does not contain any sensory information. A different approach – called a bubble memory – was proposed in [36]. Bubble memory is based on the idea of a surrounding spherical surface that can be deformed to represent the robot’s sensory experience, while maintaining the same indexing capability as the ego-sphere. While the ego-sphere is used to map information from the 3D world onto a database of sensory experiences, the bubble itself is deformed using sensory information to become a special 3D surface which represents the robot’s sensory experience. In this section we revisit the bubble memory and explain its integration into APES’ vision system.

As APES looks around with its pan-tilt type head, it records the observations thus gathered in a long-term memory. Let $\theta$, $\varphi$ denote the pan and tilt angles respectively. APES can direct the optical axis of its camera in any direction $(\theta, \varphi)$ within its physical limits. For each fixation direction, an observation is made as explained previously and, accordingly, a quantitative measure $\rho$ can be assigned to each fixation direction. In this manner, a surface is defined implicitly by $\rho : (\theta, \varphi) \to \mathbb{R}^+$. We refer to this deformable surface – hypothetically placed around the robot – as the bubble, as shown in Figure 12 (left). Each bubble is initialized to a sphere. As APES starts to look around, for each fixation direction $(\theta, \varphi)$ the bubble is deformed at the corresponding bubble point. The amount of deformation of the bubble is determined by the attentive processing made on the fovea, as shown in Figure 12 (right). Since APES has finite precision in the pan and tilt directions – around 1.8 degrees – the bubble surface is discrete and we can represent it by a finite set of equally spaced bubble points:

$$\left\{ \beta = (\rho, \theta, \varphi) \in \mathbb{R}^3 \mid \theta \equiv i \cdot \Delta\theta,\ \varphi \equiv j \cdot \Delta\varphi \right\} \quad \text{where } i \in [0, n),\ j \in [0, m)$$

As there can be a plethora of visual primitives being simultaneously extracted, a set of bubbles can be formed, each corresponding to one such visual primitive. These bubbles provide a compact representation of the spatio-temporal visual data generated while APES looks around from a fixed viewpoint. Thus, the bubble memory model provides a mechanism for the integration of spatially distinct features in time to obtain a model of the environment. The bubble enables long-term memory – a recollection of which feature was seen where. APES can use bubbles for vision based environment modeling. The integration property of bubbles enables them to store foveal features observed from a single point in space. For each new viewpoint, a new set of bubbles is generated.
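A bubble can be held as nothing more than a 2D array of radii indexed by pan and tilt steps. The sketch below, with illustrative names, array sizes, deformation gain and spherical-coordinate convention of our own choosing, shows how such a discrete bubble could be deformed at each fixation and converted back to 3D points.

```python
import numpy as np

class Bubble:
    """Discrete bubble surface: one radius rho per (pan, tilt) motor step."""
    def __init__(self, n_pan=200, n_tilt=100, rho0=1.0):
        self.rho = np.full((n_pan, n_tilt), rho0)       # initialized to a sphere

    def deform(self, pan_step, tilt_step, feature_value, gain=0.1):
        """Inflate the bubble point in the fixated direction by an amount
        determined by the attentive feature computed on the fovea."""
        self.rho[pan_step % self.rho.shape[0],
                 tilt_step % self.rho.shape[1]] += gain * feature_value

    def points(self, step_deg=1.8):
        """Return the bubble points beta = (rho, theta, phi) as Cartesian xyz."""
        i, j = np.mgrid[0:self.rho.shape[0], 0:self.rho.shape[1]]
        theta, phi = np.deg2rad(i * step_deg), np.deg2rad(j * step_deg)
        x = self.rho * np.cos(phi) * np.cos(theta)
        y = self.rho * np.cos(phi) * np.sin(theta)
        z = self.rho * np.sin(phi)
        return np.stack([x, y, z], axis=-1)

# One bubble per visual primitive, e.g. edge content and brightness:
# bubbles = {"edges": Bubble(), "brightness": Bubble()}
# bubbles["edges"].deform(pan, tilt, edge_energy)
```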
Figure 12: (Left) APES, bubble points and potential fixation points. (Right) Inflated bubble in 2D.
Figure 13: (Left) Table scene from our laboratory. (Center) Complete 1.8 degree resolution bubble representation for 30 fixations. (Right) Coordinates of the fixations made during the experiments, in terms of the number of pan and tilt steps from the starting position.

A number of experiments to test bubble formation in different situations have been conducted. In Figure 13 (center), a full resolution bubble surface composed of 40,000 control points generated in one experiment is shown. On the right of this figure, the numbers of pan and tilt steps made by the stepping motors during these fixations are shown in Cartesian coordinates. This representation can be used to model 3D environments – which we have investigated in another set of experiments using our Table scene. It is found that the bubbles after 30 fixations fall into one of two categories, as shown in Figure 14. Figure 15 shows sample bubbles from each category. The bubbles formed in these and other experiments suggest that the attentive system takes a limited number of different paths for each scene. Interestingly, the system has a tendency to converge to one of these preferred paths, even if the first fixation point is different. Note that unlike the attentional sequence, which is a short-term buffer, the bubble memory is retained and integrated over long periods of time to enable environment mapping. Our recent work with bubbles involves using them for self-localization by modeling and comparing bubble surfaces. Furthermore, through the use of 3D elliptic Fourier methods, bubble data can be stored in a compact mathematical form and at different levels of detail, as described in [36].

Figure 14: Two types of bubbles formed on the Table scene.
Figure 15: Sample bubbles formed in experiments.

8 An Integrated Model of APES’ Vision

All five features – fovea-periphery distinction, attentive processing with inhibition of return and short-term memory, visual primitives, serial processing and long-term memory – are integrated within a single model of attentive vision on APES, as shown in Figure 16. Blue lines indicate information flow common to all modes of operation. Red and green lines indicate information flow during learning and run-time respectively. Note that only the saccadic eye movement system is realized on APES. This framework also proposes models for certain poorly understood mechanisms such as temporal recognition, visual integration over saccades, and environment modeling.

First, periphery and fovea images are obtained by the two-camera sensor system simulating the human retina. The periphery image is input into the pre-attention system, which also receives pre-attentive interest criteria, inhibition settings, fixation memory contents and any other higher-level cognitive effects. The new fixation point is selected, from which the saccade vector is generated and sent to the head motor controller. In humans, saccades are also known to be controlled in a predictive manner, based on expectations about a scene. This top-down saccade control mechanism is not implemented on APES; however, inputs from sequence processing and the bubble memory can be used for this purpose.

Figure 16: An integrated model of APES’ vision.

At each fixation, the fovea is processed to extract attentive features. As with the pre-attentive interest criteria, the system enables any feature that can be computed on the fovea image to be used as an attentive feature. These features are sequentially processed by the attentional sequence modeling and recognition algorithms, which enable temporal recognition. The results of attentive processing are also saved in a bubble memory, so that they can be recalled by the attentive processing module when part of the same peripheral field needs to be processed at a later time, or when visual information is not available. The 3D bubble surfaces store viewpoint dependent visual models of the environment using different attentive features. For each viewpoint, a number of bubbles can be formed to represent different properties like edge or color content.
Furthermore, the spherical bubble surfaces can be modeled by Fourier techniques to create more abstract representations, which can be used for rapid comparison of bubbles to those stored in a bubble model memory.

9 Scene Recognition Experiments

An extensive study of APES’ attentional sequence based recognition capability in a variety of recognition tasks has been conducted. In this section, we present highlights from these results; the interested reader is referred to [37] for a more comprehensive discussion. In these experiments, APES is set to use a 200x200 pixel visual field and a 40x40 pixel fovea. The overlap between candidate foveae is 50% and a short-term memory depth of D = 10 is used to inhibit the last 10 fixated foveae. Simple pre-attentive and attentive features are employed, with the intention of removing any ambiguity from the feature extraction stages and underpinning the exact capability of this attentive vision system in recognition tasks. The pre-attentive attention criterion for each candidate fovea $I_f^c$ is as described in Section 4. In the attentive stage, the feature space consists of $\Omega = \Omega_1$, corresponding to 8 different orientations of a simple edge feature computed by the operator

$$f_1 = \arg\max_{i \in \Omega_1} S_i(I_f^t)$$

where $S_i(I_f^t)$ is the response of the 3x3 operator for detecting edges with an orientation of $i$ degrees. All experiments are performed under ceiling mounted fluorescent lights and daylight from windows, without any special lighting. Typically, two fixation sequences generated by APES while looking at the same scene are never identical, even if there is no variation in the scene. This is caused by 1) slight variations in the first fixation point; 2) small positioning errors in the camera head assembly; 3) frame grabber noise; 4) variations in lighting conditions. Even a one pixel wide difference in the fixation point can lead to a new visual field image for the next fixation, which results in a completely different attentional sequence as fixation goes on.
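The paper does not spell out the 3x3 operators S_i, so the sketch below stands in for them with a Sobel-like kernel rotated in 45 degree steps and a rectified response so that opposite edge polarities map to different symbols; it is only meant to illustrate how a single observation symbol o^t in {0,...,7} could be computed from a fovea image (scipy is assumed to be available).

```python
import numpy as np
from scipy.ndimage import convolve

# Eight 3x3 oriented edge kernels, one per 45-degree step (our own stand-in
# for the operators S_i referred to in the text).
BASE = np.array([[-1.0, 0.0, 1.0],
                 [-2.0, 0.0, 2.0],
                 [-1.0, 0.0, 1.0]])

def rotate45(kernel, times):
    """Rotate a 3x3 kernel by 45-degree steps by shifting its 8 border cells."""
    border = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    vals = [kernel[r, c] for r, c in border]
    out = kernel.copy()
    for idx, (r, c) in enumerate(border):
        out[r, c] = vals[(idx - times) % 8]
    return out

KERNELS = [rotate45(BASE, i) for i in range(8)]

def edge_orientation_feature(fovea):
    """o^t = f_1(I_f^t): index of the orientation whose operator gives the
    strongest (rectified) response summed over the fovea image."""
    f = fovea.astype(float)
    responses = [np.maximum(convolve(f, k), 0.0).sum() for k in KERNELS]
    return int(np.argmax(responses))
```

The symbol returned by edge_orientation_feature is the kind of observation that would be fed into the feature transition matrices of Section 6.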
Figure 17: Simple scenes containing “rectangle” and “hexagon”.

9.1 Simple Scenes

The first set of experiments is performed on simple 2D shapes hanging on a black background, as shown in Figure 17. APES has to decide which scene is being viewed as it attentively fixates and uses evidential reasoning. Learning is based on attentional sequences of length 10. The observed feature transition frequencies are shown in Figure 18 and Figure 19. Even with such short attentional sequences, these matrices start to become distinguishable. The matrix for Scene 1 favors no transitions between the diagonal features 4, 5, 6, and 7, as compared to that of Scene 2.

Figure 18: Scene 1 (rectangle) – Learning using attentional sequences of length 10.
Figure 19: Scene 2 (hexagon) – Learning using attentional sequences of length 10.

For recognition, 20 experiments are conducted and the support values after 10 fixations are considered. Figure 20 and Figure 21 show the generated sequences $O^{10}$ and the recognition results. Using as few as 10 fixations during learning and classification, different feature sequences can be recognized as belonging to the correct shape at a fairly good rate. Note that as the robot’s cameras are not following a pre-defined boundary or trajectory, all twenty sequences generated during these experiments are completely different. Sequences which include highly favored transitions are immediately recognized with a high margin. Those which do not are either incorrectly classified or return only a slightly better result compared to the competing model. Another reason for incorrect classification is the possibility of generating very similar or even identical sequences on two different scenes. However, the correct classification rates indicate that this intersection region is small.

Figure 20: Results after 10 fixations on Scene 1 with 10 fixation learning on Scene 1 and 2. Recognition rate is 90%.
Figure 21: Results after 10 fixations on Scene 2 with 10 fixation learning on Scene 1 and 2. Recognition rate is 90%.

9.2 Complex scenes

In the next set of experiments, the 3 complex scenes from our laboratory shown in Figure 22 are used. Figure 23 and Figure 24 show two sample fixation sequences generated by APES as it is looking at Scene 1. The complexity of our problem can be observed in these sample sequences. For example, in the fifth fovea a boundary caused by a shadow is fixated, and in some foveae, like those numbered 4, 8, 9, and 10, the image is distorted by small camera or body motion, making edge based features quite hard to detect correctly. Note that these are problems common to any practical implementation outside controlled environments. Our methods are expected to cope with such distortions. Also note that in the two sequences, although the starting points are close and the first visual fields are almost identical, the two sequences are quite different – as noted earlier on. However, the spatial and temporal relations of the observed features remain the same.

Figure 22: (Left-right) Wide-angle images of Scene 1, Scene 2 and Scene 3. Squares represent the visual field and fovea.
Figure 23: A sample sequence of visual field images $I_v = (I_v^1, \ldots, I_v^{10})$ on Scene 1.
Figure 24: A sample sequence of visual field images $I_v = (I_v^1, \ldots, I_v^{10})$ on Scene 1.

During learning, APES constructs its models (feature transition frequency matrices) of all three scenes using attentional sequences of length 30. These are presented in Figure 25, Figure 26 and Figure 27 respectively. The Scene 3 model is different from those of the other two scenes, although it is closer to that of Scene 2. Therefore any sequence generated on Scene 3 is likely to be identified correctly in general.
On the other hand, the similarity between the models of the first two scenes hints that these two scenes may be confused with each other. These expectations are borne out by the results shown in Figure 28 to Figure 33. Scene 3 is recognized at a rate of 100% at all times, while Scene 1 and Scene 2 have lower recognition rates.

Figure 25: Scene 1 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 26: Scene 2 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 27: Scene 3 – feature transition frequency matrix learned with attentional sequences of length 30.
Figure 28: Results after 30 fixations on Scene 1 with 30 fixation learning on Scene 1 and 3. Recognition rate is 100%.
Figure 29: Results after 30 fixations on Scene 3 with 30 fixation learning on Scene 1 and 3. Recognition rate is 100%.
Figure 30: Results after 30 fixations on Scene 1 with 30 fixation learning on Scene 1 and 2. Recognition rate is 50%.
Figure 31: Results after 30 fixations on Scene 2 with 30 fixation learning on Scene 1 and 2. Recognition rate is 100%.
Figure 32: Results after 30 fixations on Scene 2 with 30 fixation learning on Scene 2 and 3. Recognition rate is 70%.
Figure 33: Results after 30 fixations on Scene 3 with 30 fixation learning on Scene 2 and 3. Recognition rate is 100%.

9.3 Complex and similar scenes

APES was also made to look at three similar scenes with small variations and one unrelated scene, as shown in Figure 34. The changes in the three similar scenes are not very small at all – chairs are missing, for example – but a human viewer tends to overlook these changes. APES is expected to perform similarly and “understand” that the three scenes belong to the same part of the world and Scene 2 to a different part.

Figure 34: (Left-right) Wide-angle images of Scene 1, Scene 2, Scene 3 and Scene 4.
Figure 35: Results of 30 fixations on Scene 1 (top) and Scene 2 (bottom) after 30 fixation learning on Scene 1 and Scene 2. Recognition rates are 100% and 80% respectively.

In Figure 35, the results of the experiments on the original training scenes are shown. Scene 1 can be recognized easily with a high margin, while Scene 2 is recognized in 80% of the experiments with a very low margin. In Figure 36, the results of the experiments on the two variants of Scene 1, namely Scene 3 and Scene 4, are shown. Both scenes can easily be recognized as Scene 1 except in a few experiments.

Figure 36: Results of 30 fixations on Scene 3 (top) and Scene 4 (bottom) after 30 fixation learning on Scene 1 and Scene 2. Recognition rates are 100% and 80% respectively.

Although these experiments show that scene recognition based on attentional sequences can compensate for small changes in the environment, the low margins in the Scene 2 recognition results in Figure 35 are confusing. This result may suggest that the model of Scene 1 may be dominating over that of Scene 2, and that the correct classification of Scene 3 and Scene 4 may be a result of this dominance.
9.4 Multiple object recognition

Temporal recognition using attentional sequences is also suitable for multiple object recognition. This can be demonstrated by an experiment where two parts of a scene containing different objects are modeled separately. For example, in the scene shown in Figure 37, fixations concentrate on two distinct objects, namely the switch and an old fuse board mounted on the wall. The bubble formed while observing this scene is also shown in Figure 37. During learning, feature transition matrices are generated using the fixations on each object separately. In this case we obtain two models from a single sequence of fixations made on two distinct objects. During recognition, the system makes fixations on the same scene and the cumulative support values for the two models are computed. In Figure 38, only fixations after the 10th are considered, so that enough information is accumulated before recognition decisions are started. Initially, after the 10th fixation (shown as 1 in the figure), Model 1 is dominating. Starting with the 20th fixation, the robot starts looking at the parts of the scene learned as Model 2 and, after a transition period where no decision is possible, Model 2 is activated and Model 1 goes down starting with the 35th fixation. As the robot attends to the two areas of the scene, the support values change to favor the corresponding models.

Figure 37: Switch and fuse board scene (left) and the bubble inflated by fixations (right).
Figure 38: Supports for Model 1 and Model 2 vs. fixation number (starting from the 10th).

9.5 Experiments’ Summary and Discussion

In summary, our experiments on simple and complex scenes revealed the following important results about the use of attentional sequences for scene classification: 1) Evidential reasoning is a promising method for the classification of attentional sequences. 2) Even by using very simple edge based features, we can deduce invariant relations from the seemingly varying fovea image sequences generated while looking at the same scene. 3) Using as few as 10 fixations during learning and recognition, good classification performance can be achieved. 4) Results on complex real world scenes, which are hard to classify using classical methods, show that attentional sequence based classification is a promising way of solving such problems. 5) Increasing the learning period does not necessarily improve performance. Good performance with a short learning period is possible, depending on the learning and recognition fixations. In order to achieve good performance, the models (feature transition frequency matrices) need to represent unique features of the scene. How to generate fixation models with this property and how to compute their representation capability are open problems that we are working on.

10 Conclusion

APES is developed as an experimental robot platform for biologically motivated attentive vision. Its novelty stems from the fact that it simultaneously mimics some of the key properties of biological vision – including fovea-periphery distinction, attention, different representation modes, temporal processing and memory. Let us remark that most of these features have been studied previously, however in general separately from each other. There has been little work on their realization altogether in an integrative framework. It turns out that such integration requires the development of new original approaches in addition to the utilization of the more classical solutions.
Our approaches to integrating each feature have proven to be simple but relatively successful approximations of their biological counterparts. For example, the two camera retina model is a very realistic approximation of the biological system. We hope that some of our mechanisms may provide insight into the associated aspects of biological vision where much is still unknown. For example, the bubble model has many practical advantages as a visual memory mechanism, and it may also be interesting as a model to explain visual integration mechanisms in humans. In the future, this approach may lead to a functional memory mechanism which will enable the robot to recall a previously visited environment and to detect changes in it without forming a 3D geometric model or recording a large number of images.

APES is continuing to be developed further in our laboratory for studying attention. There is much room for further work regarding both its physical and visual capabilities. The physical properties of the system need to be improved for increased positioning accuracy and faster mechanical response during fixations. Similarly, its visual processing software is currently being expanded to include a very rich set of visual primitives which may be turned on or off depending on the task. Another focus is on determining the relation between attentional sequences and bubbles in real-time scene exploration tasks.

Acknowledgements

This work is supported by Boğaziçi University Research Fund project 01A202D.

References

1. Abbott, A.L. et al. "Promising directions in active vision". International Journal of Computer Vision, 11:2, 109-126, 1993.
2. Albus, J.S. "Outline for a theory of intelligence". IEEE Trans. Syst., Man and Cybernetics, Vol. 21, No. 3, 1991.
3. Aloimonos, J. "Purposive and qualitative active vision". In Proceedings of the Image Understanding Workshop, September 1990.
4. Ballard, D.H. "Animate Vision". Artificial Intelligence, 48: 57-86, 1991.
5. Ballard, D.H. and C.M. Brown. "Principles of Animate Vision". CVGIP: Image Understanding, 56(1), July 1992.
6. Ballard, D.H. "On the function of visual representations". In K. Akins, editor, Perception, pp. 111-131. Oxford University Press, 1996.
7. Bozma, H.I. and Ç. Soyer. "Shape identification using probabilistic models of attentional sequences". In Proceedings of the Workshop on Machine Vision Applications. IAPR, 1994.
8. Chown, E., Kaplan, S. and Kortenkamp, D. "Prototypes, Location, and Associative Networks (PLAN): Towards a Unified Theory of Cognitive Mapping". Cognitive Science 19, pp. 1-51, 1995.
9. Clark, J. "Spatial attention and latencies in saccadic eye movements". Vision Research, Vol. 39, pp. 585-602, 1999.
10. Fiala, J.C. et al. "TRICLOPS: A Tool for Studying Active Vision". International Journal of Computer Vision, 12:2/3, 231-250, 1994.
11. Gallant, J.L., C.E. Connor, S. Rakshit, J.W. Lewis and D.C. Van Essen. "Neural Responses to Polar, Hyperbolic and Cartesian Gratings in Area V4 of the Macaque Monkey". Journal of Neurophysiology, Vol. 76, No. 4, pp. 2718-2739, 1996.
12. Gouras, P. "Oculomotor system". In J.H. Schwartz and E.R. Kandel, editors, Principles of Neural Science. Elsevier, 1986.
13. Gouras, P. and C.H. Bailey. "The retina and phototransduction". In J.H. Schwartz and E.R. Kandel, editors, Principles of Neural Science. Elsevier, 1986.
14. Greene, H.H. "Temporal relationships between eye fixations and manual reactions in visual search". Acta Psychologica, 101:105-123, 1999.
15. Grosso, E., E. Manzotti, R. Tiso and G. Sandini. "A Space-Variant Approach to Oculomotor Control". In Proceedings of the International Symposium on Computer Vision, pp. 509-514, 1995.
16. Hubel, D.H. Eye, Brain and Vision. Scientific American Library, 1988.
17. Huber, E. and Kortenkamp, D. "A behavior-based approach to active stereo vision for mobile robots". Artificial Intelligence, 11, pp. 229-243, 1998.
18. Itti, L. and Koch, C. "Computational Modeling of Visual Attention". Nature Reviews Neuroscience, Vol. 2, February 2001.
19. Julesz, B. Dialogues on Perception. MIT Press, Cambridge, MA, 1995.
20. Keskinpala, T. et al. "Knowledge-Sharing Techniques for Egocentric Navigation". In Proceedings of the IEEE Conference on Systems, Man and Cybernetics, pp. 2469-2476, 2003.
21. Kowler, E. "Eye movements". In S.M. Kosslyn and D.N. Osherson, editors, Visual Cognition, pp. 215-266. MIT Press, 1995.
22. Lago-Fernandez, L.F., Sanchez-Montanes, M.A. and Cobacho, F. "A biologically inspired visual system for an autonomous robot". Neurocomputing, 38-40:1385-1391, 2001.
23. McGaugh, J.L., N.M. Weinberger and G. Lynch, editors. Brain and Memory. Oxford University Press, 1995.
24. Noton, D. and L. Stark. "Scan paths in eye movements during pattern recognition". Science, Vol. 171, pp. 308-311, January 1971.
25. Palmer, J., Verghese, P. and Pavel, M. "The psychophysics of visual search". Vision Research, 40:1227-1268, 2000.
26. Papanikolopoulos, N.P. "Adaptive control, visual servoing, and controlled active vision". In Proceedings of the IEEE International Conference on Robotics and Automation, 1994.
27. Peters, R.A. et al. "The Sensory Ego-Sphere as a Short-Term Memory for Humanoids". In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, 2001.
28. Rao, R.P.N. et al. "Modeling Saccadic Targeting in Visual Search". In D. Touretzky, M. Mozer and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS*95). MIT Press, 1996.
29. Rimey, R.D. and C.M. Brown. "Selective attention as sequential behaviour: Modelling eye movements with an augmented hidden Markov model". Technical Report, Computer Science Department, The University of Rochester, February 1990.
30. Rimey, R.D. and C.M. Brown. "Control of Selective Perception Using Bayes Nets and Decision Theory". International Journal of Computer Vision, 12:2/3, 173-207, 1994.
31. Rybak, I.A., V.I. Gusakova, A.V. Golovan, L.N. Podladchikova and N.A. Shevtsova. "A Model of Attention-Guided Visual Perception and Recognition". Vision Research, Special Issue: Models of Recognition, 1998.
32. Schlingensiepen, K.H. et al. "The importance of eye movements in the analysis of simple patterns". Vision Research, Vol. 26, No. 7, pp. 1111-1117, 1986.
33. Shafer, G. A Mathematical Theory of Evidence. Princeton University Press, 1976.
34. Shin, C.W., S. Inokuchi and K.I. Kim. "Retina-like visual sensor for fast tracking and navigation robots". Machine Vision and Applications, Vol. 10, pp. 1-8, 1997.
35. Soyer, Ç. and H.I. Bozma. "Further experiments in classification of attentional sequences: Combining instantaneous and temporal evidence". In Proceedings of the IEEE 8th International Conference on Advanced Robotics (ICAR), 1997.
36. Soyer, Ç., H.I. Bozma and Y. Istefanopulos. "A New Memory Model for Selective Perception Systems". In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2000.
37. Soyer, Ç. and H.I. Bozma. "Attentional Sequence Based Recognition: Markovian and Evidential Reasoning". IEEE Transactions on Systems, Man and Cybernetics, Vol. 33, No. 6, pp. 937-950, December 2003.
38. Soyer, Ç. "A model of active and attentive vision". PhD dissertation, Boğaziçi University, 2002.
39. Stark, L. and S.R. Ellis. "Scan paths Revisited: Cognitive Models Direct Active Looking". In Fisher, Monty and Senders, editors, Eye Movements: Cognition and Visual Perception, pp. 193-226. Erlbaum, NJ, 1981.
40. Tagare, H., K. Toyama and J.G. Wang. "A Maximum Likelihood Strategy for Directing Attention During Visual Search". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 5, pp. 491-500, May 2001.
41. Treisman, A. and G. Gelade. "A feature integration theory of attention". Cognitive Psychology, 12, pp. 97-136, 1980.
42. Tsotsos, J.K. et al. "Modeling visual attention via selective tuning". Artificial Intelligence, 78:507-545, 1995.
43. Viviani, P. "Eye movements in visual search: Cognitive, perceptual and motor control aspects". In E. Kowler, editor, Eye Movements and Their Role in Visual and Cognitive Processes, pp. 71-112. Elsevier, 1990.
44. Wasson, G., Kortenkamp, D. and Huber, E. "Integrating active perception with an autonomous robot architecture". Robotics and Autonomous Systems, 29:175-186, 1999.
45. Westin, C. et al. "Attention control for robot vision". In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 726 ff., 1996.
46. Yeap, W.K. "Towards a computational theory of cognitive maps". Artificial Intelligence, 34:297-360, 1988.
47. Yeap, W.K. and Jefferies, M.E. "Computing a representation of the local environment". Artificial Intelligence, 107, pp. 265-301, 1999.
48. Zeki, S. "The visual image in mind and brain". Scientific American, Vol. 267, No. 3, September 1992.
49. Zingale, C.M. and Kowler, E. "Planning Sequences of Saccades". Vision Research, Vol. 27, No. 8, pp. 1327-1341, 1987.