UNIVERSITÀ DEGLI STUDI DI FIRENZE
FACOLTÀ DI INGEGNERIA - DIPARTIMENTO DI SISTEMI ED INFORMATICA
Dottorato di Ricerca in Ingegneria Informatica e dell'Automazione - XV Ciclo

Depth Perception in a Robot that Replicates Human Eye Movements

Candidate: Fabrizio Santini
Advisor: Marco Gori
ANNO ACCADEMICO 2003

"Perceptual activity is exploratory, probing, searching; percepts do not simply fall onto sensors as rain falls onto ground. We do not just see, we look." (Ruzena Bajcsy, 1988)

Preface

One of the most important goals of a visual system is to recover the distance at which objects are located. This process is inherently ambiguous, as it relies on projections of a three-dimensional scene onto the two-dimensional surface of a sensor. Humans rely on a variety of monocular and binocular cues. These cues, either dynamic or static, can be categorized as primary depth cues, which provide direct depth information, such as the convergence of the optical axes of the two eyes; or secondary depth cues, which may also be present in monocularly viewed images, such as motion parallax, shading, and shadows. The visual cortex and other high-level brain functions combine these cues with strategies that depend on the visual circumstances [16] in order to obtain a reliable estimate of distance.

An important contribution to depth information is supplied by the relative motion between the observer and the scene, which appears in the form of motion parallax [49]. While motion parallax is evident for large movements of the observer, it also occurs during eye movements. Because the focal points are misaligned with the center of rotation when the eye rotates, the shift in the retinal projection of a point in space depends not only on the amplitude of the eye movement but also on the distance of the point from the observer.

It is known that when analyzing a 3-D scene, human beings tend to allocate several successive fixations at nearby spatial locations. In addition, within each of these fixations, small fixational eye movements shift the retinal image by a relatively large amount. Both types of eye movements provide 3-D information because eye movements themselves cause changes in the pattern of retinal motion and, therefore, modify depth cues. However, their role in human depth perception has long been uncertain and neglected [19]. Only recently have studies in psychophysics (see [28, 66, 24, 67, 33, 68] for further details) begun to consider monocular cues and eye movements as integral parts of the depth perception process. Scientists are weighing emerging evidence that eye movements influence the magnitude of perceived depth derived from motion.

While it is not clear if or how humans use eye movement parallax in the perception of depth, this cue can be used in a robotic active vision system. In robotics, several studies have taken an active approach to the visual evaluation of distance. In most of them, motion of the sensor caused by the movement of the robot (or of the camera itself) produces a corresponding motion of the projection of the observed scene, which is computed with different approaches such as optical flow, feature correspondence, and spatio-temporal gradients. However, no previous study has attempted to exploit the parallax that emerges from a rotation of the cameras when eye movements are replicated.
In this thesis, we present a novel mathematical model of the projection of a 3-D scene onto a 2-D surface which exploits human eye movements to obtain depth information. This model brings new features with it. The estimates it produces are reliable and rich in information. Even with very small movements (i.e., fixational eye movements), the model is able to supply depth information. Moreover, the model's simple formulation bounds the number of required parameters. Another contribution of this work is the design and assembly of an oculomotor robotic system which replicates, as accurately as possible, the geometrical and mobile characteristics of the human eye. Combining the mathematical model of parallax with recorded human eye movements, we show that the robot is able to provide depth estimates of natural 3-D scenes.

Chapter 1 of this thesis summarizes general considerations on eye anatomy, its possible aberrations, and its possible models. We also introduce basic elements of optical physics as applied to the eye and to the characterization of human eye movements. Chapter 2 presents possible depth information cues, such as static and dynamic monocular cues, and their possible types of interaction. Chapter 3 reviews the research literature in robotics and computer vision in order to summarize the latest advances in this field. Chapter 4 describes the novel mathematical model for the extraction of parallax information. Finally, Chapter 5 presents the results obtained using the robotic system, the mathematical model, and human eye movement recordings on generic 3-D scenes under different kinds of conditions.

Acknowledgments

I have been told once that a Ph.D. thesis is something that one parents, alone, with love and attention. But in the end, one realizes that the dissertation has many grandparents, uncles and aunts, relatives and friends. These are my small acknowledgments to this enlarged family.

First, I would like to thank my advisor, Professor Marco Gori. This thesis would not have been possible without his support and trust. My most special thanks go to Professor Michele Rucci, who has been able to contain the false vauntings and drive the shy successes of my work. A special thanks to Professor Marco Maggini, as well, for his friendship. He made all this possible.

I thank the wonderful people of the Active Perception Lab: Gaelle Desbordes and Bilgin Ersoy have been great friends and supporters, as well as providers of precious comments for this thesis. I will not easily forget the Machine Learning Group within the "Dipartimento di Ingegneria dell'Informazione" at the University of Siena: Monica Bianchini, Michelangelo Diligenti, Ciro de Mauro, Marco Ernandes, Luca Rigutini, Lorenzo Sarti, Franco Scarselli, Marco Turchi, and Edmondo Trentin and his "children's tails". How can I forget all my dearest friends who supported me during this work? Michele degli Angeli, Cinzia Gentile, Thomas Giove, Robert Mabe, Emily Marquez, Bruce Miller, Paolo M. Orlandini, Michela Pisà, Christian Schlesinger, Andrea Valeriani, Sergio M. Ziliani: to all of you... forever grateful.

In the end, I would like to dedicate this work to my parents, Bruno and Lelia, and to my siblings Federica and Francesco - their love and support endless, their confidence in me steadfast, and their patience limitless.

March 2004

Contents

1 Human vision
1.1 General anatomy of the eye
1.2 Optics applied to the eye
1.3 Aberrations in the eye
1.4 Optical models of the eyes
1.5 Eye movements
2 Monocular depth vision system
2.1 Monocular static depth cues
2.2 Monocular dynamic depth cues
2.3 Types of cue interaction

3 Monocular depth perception in robotic
3.1 Passive depth finding
3.2 Active depth finding

4 The model
4.1 Eye movement parallax on a semi-spherical sensor
4.2 Eye movement parallax on a planar sensor
4.3 Theoretical parallax of a plane
4.4 Replicating fixational eye movements in a robotic system

5 Discussion of the experiment results
5.1 The oculomotor robotic system
5.2 Recording human eye movements
5.3 Experimental results with artificial movements
5.4 Experimental results with human eye movements

6 Conclusion and future work

Bibliography

List of Figures

1.1 The human eye and its constitutive elements.
1.2 Detailed structure of the retina.
1.3 Inner segments of a human foveal photoreceptor mosaic in a strip extending from the foveal center (indicated by the upper left arrow) along the temporal horizontal meridian. Arrowheads indicate the edges of the sampling windows. The large cells are cones and the small cells are rods.
1.4 Refraction for light passing through two different mediums where n1 < n2.
1.5 Simple graphical method for imaging reconstruction in thin lenses.
1.6 Behavior of light in a thick lens according to the nodal points.
1.7 Simple graphical method for imaging reconstruction in thick lenses.
1.8 Spherical aberration in a converging lens.
1.9 Coma aberration in a converging lens.
1.10 Images of distortion aberration: (a) a square; (b) pincushion distortion; (c) barrel distortion.
1.11 Emsley's standard reduced 60-diopter eye.
1.12 Gullstrand's three-surface reduced schematic eye.
1.13 General form of the Chromatic Eye.
1.14 General form of the Indiana Eye.
1.15 Human eye movements (saccades) recorded during an experiment in which a subject was requested to freely watch a natural scene. A trace of eye movements recorded by a DPI eyetracker (see Paragraph 5.2) is shown superimposed on the original image. The panel on the bottom right shows a zoomed portion of the trace in which small fixational eye movements are present. The color of the trace represents the velocity of eye movements (red: slow movements; yellow: fast movements). Blue segments mark periods of blink.
1.16 Fixational eye movements and drifts recorded during an experiment in which a subject was requested to look at a specific point on the screen. The figure shows magnified details of fixational eye movements around a target whose size is only 30 pixels.
1.17 Axis systems for specifying eye movements.
1.18 (a) In the Helmholtz system the horizontal axis is fixed to the skull, and the vertical axis rotates gimbal fashion; (b) In the Fick system the vertical axis is fixed to the skull.
2.1 Binocular disparity according to the displacement of the two eyes from the cyclopean axis.
2.2 Monocular depth cues.
2.3 Example of depth perception using perspective cues.
2.4 Object, Object space, Display, Observer.
2.5 (a) one-point; (b) two-point; (c) three-point perspective.
2.6 Example of linear perspective.
2.7 The depth cue of height in the field.
2.8 Texture gradients and perception of inclination: (a) Texture size scaling; (b) Texture density gradient; (c) Aspect-Ratio perspective.
2.9 Example of an object which does not respond to Jordan's Theorem.
2.10 (a) Amodal completion in which the figures appear in a specific order. (b) Modal completion effects using a Kanizsa triangle in which a complete white triangle is seen. Each disc appears complete behind the triangle (amodal completion).
2.11 A Kanizsa triangle is not evident in each static frame but becomes visible in the dynamic sequence.
2.12 Variations of convexity and concavity from shading.
2.13 The gray squares appear to increase in depth above the background because of the positions of the shadows.
2.14 Differential transformations for the optical flow fields: (a) translation; (b) expansion; (c) rotation; (d) shear of type one; (e) shear of type two.
4.1 Parallax can be obtained when a semi-spherical sensor (like the eye) rotates. The projection of the points A and B depends both on the rotation angle and on the distance to the nodal point. The circle on the right highlights how a misalignment between the focal point (for the sake of clarity, we consider a simple lens with a single nodal point) and the center of rotation yields a projection movement that varies not only with the rotation but also with the distance of A and B.
4.2 (a) Gullstrand's three-surface reduced schematic eye. (b) Geometrical assumptions and axis orientation for a semi-spherical sensor. An object A is positioned in space at distance da from the center of rotation C of the sensor, at an angle α from axis y; N1 and N2 represent the positions of the two nodal points in the optical system. The object A projects an image xa on the sensor, and its position is indicated by θ.
4.3 (a) Variations of the projection point on a semi-spherical sensor obtained by (4.7), considering an object at distance da = 500 mm, N1 = 6.07 mm and N2 = 6.32 mm; (b) detail of the rectangle in Figure (a): two objects at distances da = 500 mm and da = 900 mm produce two slightly different projections on the sensor.
4.4 (a) Schematic representation of the right eye of the oculomotor system with its cardinal points. The L-shaped support places N1 and N2 before the center of rotation C of the robotic system. (b) Geometrical assumptions and axis orientation for a planar sensor. An object A is positioned in space at a distance da from the center of rotation C of the sensor, at an angle α from axis y; N1 and N2 represent the positions of the two nodal points in the optical system. The object A projects an image xa on the sensor; x̂a represents the projection of A when the distance between N1 and C is considered null.
4.5 Curve A represents variations of the projection xa on a planar sensor obtained by (4.9), considering an object at distance da = 500 mm, N1 = 49.65 mm and N2 = 5.72 mm (see Paragraph 5.1.3 for further details); Curve B represents the same optical system configuration but with the object at distance da = 900 mm. Curve C represents variations of x̂a using (4.10) when the nodal points N1 and N2 collapse into the center of rotation of the sensor. This curve does not change with the distance.
4.6 The clockwise rotation of the system around the point C leaves the absolute distance da unchanged, and allows a differential measurement of the angle α (the angle before the rotation) to be obtained using only ∆α and the two projections xa (before the rotation) and x′a (after the rotation).
4.7 Geometrical assumptions and axis orientation needed to calculate the theoretical parallax. Considering the plane P at distance D from the center of rotation C of the camera, it is possible to calculate the expected parallax.
4.8 Horizontal theoretical parallax for planes at different distances using an anticlockwise rotation of ∆α = 3.0°. Although asymmetry, due to the rotation, is present on all the curves, it is more evident for shorter distances.
4.9 Geometrical assumptions and axis orientation for a small movement. The sensor S is no longer considered as a single continuous semi-spherical surface, but is quantized according to the dimension of the virtual photoreceptors.
5.1 The robotic oculomotor system. Aluminum wings and sliding bars mounted on the cameras made it possible to maintain the distance between the center of rotation C and the sensor S.
5.2 Pinpoint light source (PLS) structure.
5.3 (a) Factory specifications for the optical system used in the robotic head (not to scale). The point C = 7.8 ± 0.1 mm represents the center of rotation of the camera-lens system, and it was measured directly on the robot. (b) Estimation of the positions of N1 and N2 with respect to C as a function of the focal length (focus at infinity).
5.4 Experimental setup of the calibration procedure. The PLS was positioned on the visual axis of the right camera at distances depending on each trial.
5.5 Data indicated with circles represent the estimated positions of the nodal points N1p and N2p for focal length values of 11.5, 15.0, 25.0, 50.0 mm (focus at infinity). Solid lines in the graph represent their interpolated trajectory. It can be noticed that the positions of the cardinal points expressed by the graph are compatible with the trajectory reported by the factory in Figure 5.3b.
5.6 Model fitting (indicated with a solid line) obtained by interpolating N1p = 49.65 mm and N2p = 5.72 mm from the data (indicated with circles). The underestimation of the projections below da = 270 mm is due to defocus, aberrations and distortions introduced by the PLS being too close to the lens. In fact, since da is measured from the center of rotation C, the light source is actually less than 160 mm from the front glass of the lens. This position would be problematic even for a human being, who cannot focus objects closer than a distance called the near point, equal to 250 mm [74].
5.7 Results of the model fitting of the parallax obtained at different focal lengths (and different nodal points).
5.8 The Dual-Purkinje-Image eyetracker. This version of the DPI eyetracker has a temporal resolution of approximately 1 ms and a spatial accuracy of approximately 1 arcmin.
5.9 Basic principle of the Dual-Purkinje-Image eyetracker. Purkinje images are formed by light reflected from surfaces in the eye. The first reflection takes place at the anterior surface of the cornea, while the fourth occurs at the posterior surface of the lens of the eye at its interface with the vitreous humor. Both the first and fourth Purkinje images lie in approximately the same plane in the pupil of the eye and, since eye rotation alters the angle of the collimated beam with respect to the optical axis of the eye, and since translations move both images by the same amount, eye movement can be obtained from the spatial position of and distance between the two Purkinje images.
5.10 Distance estimation calculated from projection data obtained by rotating the right camera 3°.
5.11 Error introduced by quantization in the depth perception performance of the model. Distance estimation using a sensor with a 9 µm cell size (thick blue line) is compared with a sensor with a 0.3 µm cell size. The error introduced by the higher resolutions is evident only after numerous meters.
5.12 Semi-natural scene consisting of a background placed at a distance of 900 mm with a rectangular object placed before it at a distance of 500 mm. This object is an exact replica of the covered area of the background. (a) The scene before a rotation; (b) the same semi-natural scene after an anticlockwise rotation of ∆α = 3.0°.
5.13 The rectangular object hidden in the scene in Figure 5.12 is visible only because it has been exposed through an amodal completion effect with a sheet of paper.
5.14 Horizontal parallax produced by a rotation of ∆α = 3° for the object in Figure 5.12a. It can be noticed how the general tendency of the surface corresponds to the theoretical parallax behavior predicted by equation (4.14).
5.15 Distance estimation of the object in Figure 5.12a. The horizontal parallax matrix obtained by positioning an object at 500 mm from the point C of the robotic oculomotor system was post-processed using (4.11) and (4.12), thereby obtaining the distance information in the figure.
5.16 PLS estimated distance using fixational eye movements of the subject MV.
5.17 PLS estimated distance using fixational eye movements of the subject AR.
5.18 Natural image representing a set of objects placed at different depths: object A at 430 mm; object B at 590 mm; object C at 640 mm; object D at 740 mm; object E at 880 mm.
5.19 Reconstruction of the distance information according to the oculomotor parallax produced by the subject MV's saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.20a and 5.20b.
5.20 (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.19. (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.
5.21 Filtered 3-D reconstruction of the distance information produced by the subject MV's saccades.
5.22 Reconstruction of the distance information according to the oculomotor parallax produced by the subject AR's saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.23a and 5.23b.
5.23 (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.22. (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.
5.24 Filtered 3-D reconstruction of the distance information produced by the subject AR's saccades. It can be noticed that there is no object at the extreme right. This is because AR's eye movements were focused more on the left side of the scene. No parallax information concerning this object was available, and therefore no distance cues.

List of Tables

1.1 Various refraction indexes for different optical parts of the eye. It can be noticed that these values for the interior components of the eye are similar to each other, creating very little refraction phenomena.
5.1 Characteristics of the pinpoint source semiconductor light.

Chapter 1
Human vision

Contents
1.1 General anatomy of the eye
1.1.1 The retina
1.1.2 Hyperacuity
1.1.3 Accommodation
1.2 Optics applied to the eye
1.2.1 Thin lenses
1.2.2 Thick lenses
1.3 Aberrations in the eye
1.3.1 Spherical Aberrations
1.3.2 Coma
1.3.3 Astigmatism
1.3.4 Distortion
1.4 Optical models of the eyes
1.4.1 Gullstrand's schematic model
1.4.2 Chromatic and Indiana schematic eye
1.5 Eye movements
1.5.1 Abrupt movements
1.5.2 Smooth movements
1.5.3 Fixational eye movements
1.5.4 Coordinates systems for movements

Since the times of the ancient Egyptians (the Ebers Papyrus, 1550 BC), people have tried to penetrate the riddle of vision. It is in the Arabian literature that figures illustrating the anatomy of the eye made their first appearance. The earliest drawing appears in the recently discovered Book of the ten treatises on the eye by Hunain ibn Ishaq (about 860 AD).

Figure 1.1 – The human eye and its constitutive elements.

Centuries of discoveries and the advancement of technology have led us toward a deep understanding of how living beings see the world and how they learn from it. We acknowledge the striking evidence that the human eye is an elegant visual system that outclasses in many ways most modern optical instruments. We can focus on objects as close as a few centimeters or as far away as a few kilometers, as well as watch the surrounding environment without being blinded by sunlight. With the eye we are able to differentiate colors, textures, sizes, and shapes at very high resolution. Speculation, study, and research on its internal functioning and its features have uncovered many aspects of the human eye. These discoveries have consequently been employed in our everyday life. Simulation of eye surgery techniques, prescription of corrective lenses, and basic emulation of eye functions in robotics and computer vision are only some of the many considerable examples.

1.1 General anatomy of the eye

There are many parts of the eye that serve different optical functions (see Figure 1.1). Light enters the cornea, a transparent bulge on the front of the eye. Behind the cornea is a cavity called the anterior chamber, which is filled with a clear liquid called the aqueous humor. Next, light passes through the pupil, a variably sized opening in the externally colored, opaque iris. Just behind the iris, light passes through the crystalline lens, whose shape is controlled by ciliary muscles attached to its edge. Traveling through the central chamber of the eye, filled with vitreous humor, light finally reaches its destination and strikes the retina, the curved surface at the bottom of the eye. The retina is densely covered with over 100 million light-sensitive photoreceptors, which convert photons into neural activity that is then sent to the visual cortex in the brain.

Figure 1.2 – Detailed structure of the retina.
1.1.1 The retina

The optic nerve enters the globe of the eye at a point 3 mm on the nasal side of the posterior pole, and about 1 mm below it. It spreads out into the fine network of nerve fibers that constitute the retina. The retina contains the sensitive elements, cells that form part of the visual pathway from receptors to cortex (bipolar and ganglion cells), and cells which interconnect branches of the visual path (horizontal and amacrine cells). The interconnections formed by fibers from the bipolar and ganglion cell bodies are very complicated even when seen under an optical microscope. Only recently have electron microscopes shown that the number and extent of the horizontal connections are much larger than those revealed by the light microscope. The function of these interconnections is not understood in detail, but it is certain that selection and recording of information take place within the retina itself.

The fine structure of the retina is depicted in Figure 1.2. This structure was first revealed by Ramón y Cajal using the Golgi staining method, and described in a series of papers between 1888 and 1933. The retina is a multi-layered membrane with an area of about 1,000 mm². It is about 250 µm thick at the fovea, diminishing to about 100 µm at the periphery. The regions containing ganglion and bipolar cells highlighted in Figure 1.2 are known as the cerebral layer of the retina.

The receptors are densely packed in the outer layer, and are considered to be about 70% of all receptors in the human body. There are two main types of receptors: rods and cones. Rods have high sensitivity, an exclusively peripheral distribution, and broad spectral tuning. Cones have lower sensitivity, a high concentration in the fovea with decreasing concentration in the peripheral retina, and three types of spectral tuning, peaking at around 450 nm (S-cones), 535 nm (M-cones), and 565 nm (L-cones). S-cones constitute 5 to 10% of all cones. There are about equal numbers of M- and L-cones, although there is considerable variation between people. The normal range of luminance sensitivity of the human eye extends from roughly 10^-7 cd/m² to 10^-4 cd/m².

The fovea is a specialized avascular region of the retina where cones are very tightly packed together in order to maximize spatial resolution in the approximate center of our visual field. Outside the fovea, the spatial resolution of the neural retina falls substantially. The adult human retina has between 4 and 6 million cones. A peak density of between 100,000 and 320,000 per mm² occurs at the fovea but declines to about 6,000 per mm² at an eccentricity of 10°. The fovea of a primate is a centrally placed pit about 1.5 mm in diameter, which contains a regular hexagonal mosaic of cones with a mean spacing of between 2 and 3 µm. The central fovea, which is about 0.27 mm in diameter and subtends about 1°, contains at least 6,000 cones. Each foveal cone projects to a single ganglion cell, giving rise to both a disproportionate number of foveal nerve fibers in the visual pathway (compared with the rest of the retina) and a disproportionate representation in the visual cortex. The human retina has 100 million or more rods, which are absent in the fovea and reach a peak density of about 160,000 per mm² at an eccentricity of about 20°. In the region where the fibers of the optic nerve converge and ultimately leave the globe there are no rods or cones. This region is blind and is called the optic disc.
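As a quick check on the foveal figures above, the angular subtense of a retinal patch and the number of cones it can hold follow from simple geometry. The sketch below is only illustrative: the 16.7 mm posterior nodal distance is a standard textbook value (not a figure stated in this thesis), the 150,000 cones/mm² density is one value inside the range quoted above, and the function names are invented for the example.

```python
import math

# Assumed posterior nodal distance of a schematic human eye (~16.7 mm);
# this is an assumption for illustration, not a parameter from the thesis.
NODAL_DISTANCE_MM = 16.7

def retinal_mm_to_deg(extent_mm: float) -> float:
    """Visual angle (degrees) subtended by a retinal extent of `extent_mm`."""
    return math.degrees(2 * math.atan(extent_mm / (2 * NODAL_DISTANCE_MM)))

def cones_in_disc(diameter_mm: float, density_per_mm2: float) -> int:
    """Approximate cone count in a circular retinal patch of the given diameter."""
    area_mm2 = math.pi * (diameter_mm / 2) ** 2
    return round(area_mm2 * density_per_mm2)

# The 0.27 mm central fovea should subtend roughly 1 degree...
print(retinal_mm_to_deg(0.27))        # ~0.93 deg
# ...and, at ~150,000 cones/mm^2, hold on the order of 6,000-9,000 cones.
print(cones_in_disc(0.27, 150_000))   # ~8,600
```

Both outputs are consistent with the "subtends about 1°" and "at least 6,000 cones" statements above.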
When the eye is viewed with an ophthalmoscope, large retinal blood vessels can be seen in and near the optic disc, though the exact appearance varies considerably even in healthy subjects. From this region the retinal blood vessels spread out into finer and finer branches.

1.1.2 Hyperacuity

Figure 1.3 – Inner segments of a human foveal photoreceptor mosaic in a strip extending from the foveal center (indicated by the upper left arrow) along the temporal horizontal meridian. Arrowheads indicate the edges of the sampling windows. The large cells are cones and the small cells are rods.

In the retinal image of the world, contours or objects are characterized in the simplest case by changes in luminance or color. Accordingly, the capabilities of the visual system are often investigated by using stimuli in which patterns are well defined in shape, luminance, and color. In this context, spatial resolution refers to the ability of the visual system to determine that two points are separated in space. Theoretically (for a more exhaustive treatment of the theoretical limits of retinal resolution see [46, 108, 26, 90, 45]), for two points in space to be resolved, their images must fall on two separate receptors with at least one receptor between them. Hence the images of the two points must be separated by at least 4 µm in order for the two points to be resolved. In a typical eye this corresponds to an angular resolution of approximately 50 arcsec.

The wave theory of light predicts that the image of a point object formed by an optical system with a circular aperture stop will be a circular disc surrounded by fainter rings. The central disc contains about 85% of the light in the diffraction pattern and is called the Airy disc. The diameter of the Airy disc depends on the pupil size and the wavelength of light, and is of the order of 90 arcsec in a typical eye. Clearly this spreading of the light affects resolution. In a typical eye the spatial resolution limit set by diffraction is of the order of 45 arcsec. In practice, a number of other factors, including scatter and various aberrations, degrade the retinal image further.

However, despite the theory, the visual system is remarkably good at resolving details that are smaller than the size predicted by the spacing of receptors and by diffraction. Research has shown that the eye is capable of resolving changes in position that are nearly an order of magnitude smaller than the 50 arcsec diameter of a foveal photoreceptor. Westheimer [105] coined the term hyperacuity to describe the high levels of performance observed in such tasks relative to those obtained in more conventional spatial-acuity tasks, such as grating resolution, which yields thresholds of the order of 30-60 arcsec (the approximate diameter of the inner segment of a foveal cone).
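The receptor-spacing and diffraction limits quoted above can be reproduced with a few lines of arithmetic. The sketch below is a back-of-envelope check, not a computation from the thesis: the 16.7 mm nodal distance, 3 mm pupil, and 555 nm wavelength are assumed typical values.

```python
import math

ARCSEC_PER_RAD = 180 * 3600 / math.pi

def receptor_limited_resolution(separation_m=4e-6, nodal_distance_m=16.7e-3):
    """Visual angle subtended by the minimum resolvable retinal separation."""
    return (separation_m / nodal_distance_m) * ARCSEC_PER_RAD

def airy_disc_diameter(wavelength_m=555e-9, pupil_diameter_m=3e-3):
    """Angular diameter of the Airy disc (to the first dark ring): 2 * 1.22 * lambda / D."""
    return 2 * 1.22 * wavelength_m / pupil_diameter_m * ARCSEC_PER_RAD

print(f"receptor-limited resolution: {receptor_limited_resolution():.0f} arcsec")  # ~49
print(f"Airy disc diameter:          {airy_disc_diameter():.0f} arcsec")           # ~93
```

The two results (~50 arcsec and ~90 arcsec) match the values stated in the text for a typical eye.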
Westheimer and others (for further investigation and experimental data refer to [42, 57, 50, 110, 61, 11, 37, 100, 63, 17]) identified many different hyperacuity tasks:

• Point acuity: the separation required for the subject to notice that two points are separate
• Line acuity: the separation required for the subject to resolve two lines
• Grating acuity: the separation of the lines required for the subject to just see the grating (rather than a uniform gray field)
• Vernier acuity: the minimum separation required for the subject to see that two lines are not collinear (in a straight line)
• Center acuity: the minimum displacement of a dot from the center of a circle that the subject can detect
• Displacement threshold: the minimum displacement of two lines turned on and off sequentially that gives rise to the sensation of side-to-side movement

For example, observers can reliably detect an 8-10 arcsec instantaneous displacement of a short line or bar (see [36]), a 2-4 arcsec misalignment in a vernier-acuity task, a 6 arcsec difference in the separation of two lines, and a 2-4 arcsec difference in the separation of lines in a stereoacuity task. Many of the hyperacuity tasks undoubtedly measure the resolution limits of fundamental visual processes. The exquisite sensitivities obtained in hyperacuity tasks are remarkably resistant to changes in the spatial configuration of the stimuli. Thus hyperacuity is a robust phenomenon of some generality. For example, highly sensitive displacement (small-movement) detection and separation discrimination are probably of prime importance for the mechanisms that extract the orientation and motion of the observer and the orientation and relative distances of surfaces in the environment. Similarly, excellent stereoacuity must be crucial for the extraction of information regarding surface orientation and depth from stereopsis. Since stereo and optical flow information decline rapidly with distance from the observer (the inverse-square law), it follows that every additional arcsecond of resolution that can be obtained significantly increases the volume of space around the observer from which reliable distance and surface orientation can be extracted.

1.1.3 Accommodation

One of the main activities of the eye is to correctly focus light onto the retina. This ability is called accommodation. For this purpose, there are two principal focusing elements of the eye: the cornea, which is the main component responsible for focusing, and the crystalline lens, which makes fine adjustments. The crystalline lens has a diameter of about 9 mm and a thickness of about 4 mm. It is not like a typical glass convex lens, but is composed of different layers which have different refractive indexes. The lens is responsible for adjustments between near and far points of vision. According to the equations of optics (see Paragraph 1.2 for further details), the lens of the eye would need a wide range of focal lengths to focus objects which are close to the eye versus objects that are far away. Instead of having hundreds of different lenses, ciliary muscles (see Figure 1.1) attached to the crystalline lens contract and relax, causing a change in the lens curvature and a subsequent change in the focal length.
To focus distant objects, the ciliary muscles relax and the lens flattens, increasing its radius of curvature and therefore decreasing its refractive power. As an object moves closer, the ciliary muscles contract, making the lens fatter, which produces a higher-power lens. A healthy young adult is capable of increasing the refractive power of the crystalline lens from about 20 D to 34 D. Unfortunately, as humans age, the lens begins to harden and its ability to accommodate deteriorates.

Figure 1.4 – Refraction for light passing through two different mediums where n1 < n2.

Part of the Eye      Index of Refraction
Water                1.33
Cornea               1.34
Aqueous humor        1.33
Lens cover           1.38
Lens center          1.41
Vitreous humor       1.34

Table 1.1 – Various refraction indexes for different optical parts of the eye. It can be noticed that these values for the interior components of the eye are similar to each other, creating very little refraction phenomena.

1.2 Optics applied to the eye

Light travels through air at a velocity of approximately 3 × 10^8 m/s. When it travels through mediums such as transparent solids or liquids, the speed at which the light travels can be considerably lower. This ratio (velocity of light in air over velocity in the medium) changes from medium to medium and is known as the refractive index n. The refractive index for air is taken as 1.00, while for glass it is about 1.50. When light travels from one medium to another medium with a different refractive index, it deviates from its original path. The bending of light caused by a refractive index mismatch is called refraction. The relationship between the two mediums and the angles of incidence and refraction, as illustrated in Figure 1.4, is known as Snell's Law:

n1 sin θ1 = n2 sin θ2    (1.1)

In the eye, the cornea refracts incident light rays toward a focus point on the retina (refractive indexes of the cornea and other optical parts of the eye are listed in Table 1.1).

Figure 1.5 – Simple graphical method for imaging reconstruction in thin lenses.

1.2.1 Thin lenses

Lenses are usually made of glass or plastic and are used to refract light rays in a desired direction. When parallel light passes through a convex lens, the beams converge at a single point. On the other hand, when parallel light passes through a concave lens, the beams diverge, or spread apart. The distance beyond a convex lens at which parallel light rays converge is called the focal length of the lens. The relationship between the focal length of a lens, the object position, and the location where a sharp, focused image of the object will appear is (see Figure 1.5):

1/f = 1/s + 1/s′    (1.2)

This equation is called Gauss's law for thin lenses, and it relates the focal length of the lens f, the distance s of the object to the left of the lens in the object space medium, and the distance s′ of the focused image to the right of the lens in the image space medium. A convex lens produces a real, inverted image to the right of the lens. Similarly, since the cornea is convex, the image formed on the retina is inverted. It is the responsibility of the visual cortex in the brain to invert the image back to normal. The magnification of an image is calculated from the following equation:

m = −s′/s    (1.3)

The optical power of a lens is commonly measured in diopters (D). A diopter is a measure of optical power equivalent to the inverse of the focal length of a lens measured in meters. The advantage of using diopters for lenses is their additivity.
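A minimal numerical sketch of equations (1.2) and (1.3) and of the additivity of powers follows; the specific focal lengths and object distances are illustrative assumptions, not values from the thesis, and the sign convention matches the form of (1.2) used above.

```python
def image_distance(f_m: float, s_m: float) -> float:
    """Solve Gauss's thin-lens law 1/f = 1/s + 1/s' for the image distance s'."""
    return 1.0 / (1.0 / f_m - 1.0 / s_m)

def magnification(s_m: float, s_prime_m: float) -> float:
    """Lateral magnification m = -s'/s (negative: the image is inverted)."""
    return -s_prime_m / s_m

# Illustrative values: a 50 mm thin lens imaging an object 200 mm away.
f, s = 0.050, 0.200
s_prime = image_distance(f, s)
print(f"s' = {s_prime * 1000:.1f} mm, m = {magnification(s, s_prime):.2f}")
# s' = 66.7 mm, m = -0.33

# Additivity: two thin lenses in contact simply add their powers in diopters.
P_total = 1 / 0.050 + 1 / 0.100          # 20 D + 10 D
print(f"combined power = {P_total:.0f} D, focal length = {1000 / P_total:.1f} mm")
# combined power = 30 D, focal length = 33.3 mm
```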
The thin lens equation (1.2), expressed in diopters, can be rewritten as:

P = S + S′    (1.4)

where P is the power of the lens in diopters, and S and S′ are the reciprocals of the object and image distances, respectively, measured in inverse meters, i.e., diopters.

1.2.2 Thick lenses

Figure 1.6 – Behavior of light in a thick lens according to the nodal points.

When the thickness of a lens cannot be considered small with respect to the focal length, the lens is called thick. These kinds of lenses can also be considered a model for composite optical systems. A well-corrected optical system, in fact, can be treated as a black box whose characteristics are defined by its cardinal points. There are six cardinal points (F1, F2, H1, H2, N1, and N2) on the axis of a thick lens from which its imaging properties can be deduced. They consist of the front and back focal points (F1 and F2), the front and back principal points (H1 and H2), and the front and back nodal points (N1 and N2). An incident ray coming from the front focal point F1 will exit the lens parallel to the axis, and an incident ray parallel to the axis will be refracted by the lens onto the back focal point F2 (see Figures 1.6a and 1.6b). The extensions of the incident and emerging rays in each case intersect, by definition, the principal planes, which cross the axis at the principal points H1 and H2 (see Figures 1.6a and 1.6b). The nodal points are two axial points such that a ray directed at the first nodal point appears (after passing through the system) to emerge from the second nodal point parallel to its original direction (see Figure 1.6c). When an optical system is bounded on both sides by air (true in the majority of applications, but not for the eye), the nodal points coincide with the principal points.

A simple graphical method can be used to determine the image location and the magnification of an object. This graphical approach relies on a few simple properties of an optical system: when the cardinal points of an optical system are known, the location and size of the image formed by the optical system can be readily determined.

Figure 1.7 – Simple graphical method for imaging reconstruction in thick lenses.

In Figure 1.7, the focal points F1 and F2 and the principal points H1 and H2 are shown. The object which the system is to image is shown as the arrow AO. Ray OB, parallel to the system axis, will pass through the second focal point F2; the refraction will appear to have occurred at the second principal plane. The ray OC, passing through the first focal point F1, will emerge from the system parallel to the axis. The intersection of these rays at the point O′ locates the image of point O. A similar construction for other points on the object would locate additional image points, which would lie along the indicated arrow O′A′. A third ray could be constructed from O to the first nodal point; this ray would appear to emerge from the second nodal point parallel to the entering ray.

The law for thin lenses specified in (1.2) can be extended to thick lenses in the following form:

1/f1 = 1/f2 = 1/s + 1/s′    (1.5)

This modification of the law is valid only if both the object space medium and the image space medium of the lens have the same refraction coefficient n.
Otherwise, the relation between f and f′ for different mediums is:

f′/f = n′/n    (1.6)

With a real lens of finite thickness, the image distance, object distance, and focal length are all referenced to the principal points, not to the physical center of the lens. By neglecting the distance between the lens's principal points, known as the hiatus, s + s′ becomes the object-to-image distance. This simplification, called the thin-lens approximation, leads back, to a first approximation, to the thin lens configuration.

Figure 1.8 – Spherical aberration in a converging lens.

1.3 Aberrations in the eye

By assuming that all the rays that hit lenses are paraxial (the paraxial region of an optical system is a thin, thread-like region about the optical axis, so small that all the angles made by the rays may be considered equal to their sines and tangents), we have deliberately ignored the dispersive effects caused by lenses. Saying that spherical lenses and mirrors produce perfect images is not completely correct. Even when perfectly polished, lenses do not produce perfect images. The deviations of rays from what is ideally expected are called aberrations. There are four main types of aberrations: spherical, coma, astigmatism, and distortion.

1.3.1 Spherical Aberrations

Rays that pass through the edges of a spherical lens do not come to the same focus as paraxial rays (see Figure 1.8). The blurred image formed by the deviation of the rays passing through the edges of the lens is called spherical aberration. As we might expect, spherical aberrations are a great problem for optical systems with large apertures, and the production of accurate wide-surface lenses is a continuing technological challenge. Under normal conditions, the human pupil is small enough that spherical aberrations do not significantly affect vision. Under low-light conditions or when the pupil is dilated, however, spherical aberrations become important. Visual acuity becomes affected by this aberration when the pupil is larger than about 2.5 mm in diameter. There are other factors of the eye that reduce the effect of spherical aberrations. The outer portions of the cornea have a flatter shape and therefore refract less than the central areas. The central part of the crystalline lens also refracts more than its outer portions due to a slightly higher refractive index at its center.

Figure 1.9 – Coma aberration in a converging lens.

1.3.2 Coma

Coma is an off-axis modification of spherical aberration. It produces a blurred image of off-axis objects that is shaped like a comet (as shown in Figure 1.9). Coma is an aberration dependent on the eye's pupil size. An optical system free of both coma and spherical aberration is called aplanatic.

1.3.3 Astigmatism

Astigmatism is the difference in focal length of rays coming in from different planes of an off-axis object. Like coma, astigmatism is non-symmetric about the optical axis. The eye defect called astigmatism is slightly different from the optical aberration: it refers to a cornea that is not spherical but more curved in one plane than in another. That is to say, the focal length of the astigmatic eye is different for rays in one plane than for those in the perpendicular plane.

1.3.4 Distortion

Distortion is a variation in the lateral magnification for object points at different distances from the optical axis (see Figure 1.10).
If the magnification increases with the distance of the object point from the optical axis, the image of the square in Figure 1.10a will look like Figure 1.10b. This is called pincushion distortion. Conversely, if the magnification decreases with object point distance, the image shows barrel distortion (see Figure 1.10c).

Figure 1.10 – Images of distortion aberration: (a) a square; (b) pincushion distortion; (c) barrel distortion.

1.4 Optical models of the eyes

Some of the greatest scientists have tried to explain and quantify the underlying optical principles of the human eye. According to Helmholtz (1867), Kepler (1602) was the first to understand that the eye formed an inverted image on the retina, and Scheiner (1625) demonstrated this experimentally in human eyes. (The "Treatise on Physiological Optics" of Hermann Ludwig Ferdinand von Helmholtz (1821-94) is widely recognized as the greatest book on vision. This classic work is one of the most frequently cited books on the physiology and physics of vision. His summa, in three volumes, transformed the study of vision by integrating its physical, physiological and psychological principles.) Newton (1670) was the first to consider that the human eye suffered from chromatic aberration, and Huygens (1702) actually built a model eye to demonstrate the inverted image and the beneficial effects of spectacle lenses. Astigmatism was observed by Thomas Young (1801) in his own eye, and spherical aberration was measured by Volkman (1846).

As we said above, there are circumstances which require a model that includes all the eye's main characteristics without being either computationally expensive or too complex to be utilized. Numerous model eyes, referred to as schematic eyes, have been developed during the last 150 years to satisfy a variety of needs. Schematic eye models which can reproduce optical properties from anatomy are especially useful. They can be employed to help design ophthalmic or visual optics, to simulate experiments, or to better understand the role of the different optical components. Furthermore, schematic eye models are necessary to estimate basic optical properties of the eye, such as focal length. In some cases, the model of the eye is aimed primarily at anatomical accuracy, whereas other models have made no attempt to be anatomically correct or even to have any anatomical structure at all. There are many reasons why anatomically correct models are useful. As a biological organ, it would be impossible to understand the physiology, development, and biological basis of the eye's optical properties without an accurate anatomical model. Moreover, in the world of refractive surgery, an anatomically correct model is a vital tool for planning operations designed to modify the eye's optical properties before operating on the patient.

1.4.1 Gullstrand's schematic model

Figure 1.11 – Emsley's standard reduced 60-diopter eye.

The first quantitative paraxial model eye was developed by Listing (1851), whose model employed a series of spherical surfaces separating homogeneous media with fixed refractive indices. Emsley (1952) [44] produced a variant of the Listing model, which has been widely cited. The Emsley standard reduced 60-diopter eye is one of the simplest models of the eye and is most often used in ophthalmic education (see Figure 1.11). It contains a single refracting surface and only one index of refraction mismatch, between the air and the vitreous humor. The axial distance from the cornea to the retina is 22.22 mm.
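As a sanity check, the 22.22 mm figure is simply the second focal length of a 60 D single-surface eye, assuming the conventional reduced-eye index n′ = 4/3 for the single ocular medium (a standard choice for Emsley's model, not stated explicitly here):

\[
  f' \;=\; \frac{n'}{P} \;=\; \frac{4/3}{60\ \mathrm{D}} \;\approx\; 0.02222\ \mathrm{m} \;=\; 22.22\ \mathrm{mm}
\]

so parallel rays entering the reduced eye come to a focus exactly at the retina.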
Due to its simple structure, accommodation calculations cannot be done with this model. The two principal points and two nodal points are combined into single principal and nodal points (H and N). The corneal surface of 60 D power encompasses the separate refractions of the corneal and lens interfaces, representing the total focusing power of the eye.

Figure 1.12 – Gullstrand's three-surface reduced schematic eye.

Currently, the most widely used model eye is the one designed by Gullstrand [1] (Gullstrand received the Nobel Prize for Physiology in 1911 for his investigations on the dioptrics of the eye). In this simplified model of the human eye, shown in Figure 1.12, the lens is considered to have an average refraction index of 1.413, and the axial distance from the cornea to the retina is 23.89 mm. These two parameters ensure that entering parallel rays will focus perfectly on the retina. Gullstrand's three-surface model allows for changes in the power of the lens by adjusting the curvatures of its front and back surfaces. This attribute is very useful for calculating image positions when the natural lens is removed, as it is during cataract surgery, and replaced by an artificial fixed lens.

1.4.2 Chromatic and Indiana schematic eye

There is general agreement that chromatic aberration is one of the more serious aberrations of the human eye. Despite the fact that ocular chromatic aberration was first described over 300 years ago by Newton when he explored chromatic dispersion by glass prisms, and despite the extensive studies of the last 50 years, there have been few attempts to model this aberration. Few models have included dispersive media, even though it has been known since Helmholtz that, as far as chromatic aberration is concerned, the eye behaves as if it were a volume of water separated from air by a single refracting surface.

The most recent work in this direction is by Thibos et al. [96]. In this research the authors introduce a new model called the Chromatic Eye. It is a reduced schematic eye containing a pupil and a single, aspheric refracting surface separating air from a chromatically dispersive ocular medium. This model was designed to accurately describe the eye's transverse and longitudinal chromatic aberration while at the same time being free from spherical aberration. Comparisons can be made between Emsley's reduced eye in Figure 1.11 and the proposed Chromatic Eye shown in Figure 1.13. The main innovation in Thibos' model is the new refracting surface: what is a sphere in Emsley's model is a prolate spheroid in the Chromatic Eye, corresponding, respectively, to a circle and an ellipse when viewed in cross section.

Figure 1.13 – General form of the Chromatic Eye.

Figure 1.14 – General form of the Indiana Eye.

The results of the experiments conducted by Thibos [96] (and subsequently confirmed by Doshi et al. [25]) show that the data are accurately described by a reduced-eye optical model whose refractive index changes more rapidly with wavelength than that of a model filled with water. A comparison with all other chromatic models clearly shows that the description of chromatic focus error is improved simply by changing the ocular medium while leaving all other aspects of the model unchanged, especially at shorter wavelengths. As we can see in Figure 1.13, there are no particular relationships assumed between the optical axis, the visual axis, the fixation axis, or the achromatic axis of the eye.
Angle α specifies the location of the fovea relative to the model's axis of symmetry, which will affect the magnitude of off-axis aberrations present at the fovea. For polychromatic light, angle ψ and the closely related angle φ are even more important for assessing the quality of foveal images. This is because ψ and φ are sensitive to the misalignment of the pupil relative to the visual axis of maximum neural resolution. However, for many engineering and scientific applications, the Chromatic Eye has too many degrees of freedom to be useful. What is needed is an even simpler model constrained by empirical data from typical human eyes. A simplified model, which Thibos et al. subsequently introduced and called the Indiana Eye [94], is shown in Figure 1.14. The primary simplifying assumption of the Indiana Eye is that, on average, the pupil is well centered on the visual axis (i.e., the achromatic and visual axes coincide, so ψ = φ = 0). This assumption is based on studies which have shown that, although individual pupils may be misaligned from the visual axis by as much as 1 mm, the statistical mean of angle ψ in the population is not significantly different from zero in either the horizontal or vertical meridian [87, 89, 95].

1.5 Eye movements

Our eyes are never completely at rest. Even when we are fixating upon a visual target, small involuntary eye movements continuously perturb the projection of the image on the retina. (A striking demonstration of how small eye movements affect visual fixation is available on the world wide web at http://www.visionlab.harvard.edu/Members/ikuya/html/memorandum/VisualJitter.html (Murakami and Cavanagh, 1998). It is surprising that we are able to perceive a stable world despite such movements. Does visual perception rely on some type of stabilization mechanism that discounts visual changes induced by fixational eye movements, or, on the contrary, are these movements an intrinsic functional component of visual processes?) Eye movements can be grouped into three classes: abrupt, smooth, and fixational. They serve three basic functions: stabilization of the image on the retina as the head moves, fixation and pursuit of particular objects, and convergence of the visual axes on a particular object.

Image stabilization is achieved by the vestibulo-ocular response (VOR), a conjugate eye movement evoked by stimuli arising in the vestibular organs as the head moves. When the eyes are open, the VOR is supplemented by optokinetic nystagmus (OKN), which is evoked by the motion of the image of the visual scene. These involuntary responses were the first types of eye movements to evolve. Eye movements for fixation and voluntary pursuit of particular objects evolved in animals with foveas. Voluntary rapid eye movements allow the gaze to move quickly from one part of the visual scene to another. Voluntary pursuit maintains the image of a particular object on the fovea.

Figure 1.15 – Human eye movements (saccades) recorded during an experiment in which a subject was requested to freely watch a natural scene. A trace of eye movements recorded by a DPI eyetracker (see Paragraph 5.2) is shown superimposed on the original image. The panel on the bottom right shows a zoomed portion of the trace in which small fixational eye movements are present. The color of the trace represents the velocity of eye movements (red: slow movements; yellow: fast movements). Blue segments mark periods of blink.
In the third basic type of eye movement, the eyes move through equal angles in opposite directions to produce a disjunctive movement, or vergence. In horizontal vergence, each visual axis moves within a plane containing the interocular axis. In vertical vergence, each visual axis moves within a plane that is orthogonal to the interocular axis. The eyes can also rotate in opposite directions around the two visual axes. Horizontal vergence occurs when a person changes fixation (under voluntary control) from an object in one depth plane to an object in another depth plane.

1.5.1 Abrupt movements

Abrupt movements, also called saccades, are the fastest eye movements and are used to bring new details of the visual field onto the fovea. They can be either aimed or suppressed according to instructions (voluntary), or triggered by the sudden appearance of a new target in the visual field (involuntary). These movements are so fast that there is usually no time for visual feedback to guide the eye to its final position. Saccades are therefore also called ballistic movements, because the oculomotor system must calculate the muscular activation pattern in advance to throw the eye exactly onto the selected position. Most saccades are of short duration (20−100 ms) and high velocity (20−600° · s−1). Under natural conditions, with the subject moving freely in his normal surroundings, Bahill et al. [9] observed that 85% of natural saccades have amplitudes of less than 15°.

The activity pattern which drives the eye to the requested position, also called the pulse, has to be followed by a lower, steady activity that will hold the eye in the final position. This activity is called the step signal. If pulse and step do not match perfectly (the end of the first with the beginning of the second), a slow movement, called a glissade, will bring the eye to the final position. This sort of movement can be quite long, taking up to one second. If the pulse is too long with respect to the step, the requested rotation of the eye will be followed by an overshoot and a subsequent glissadic movement which will hold the eye in the new position. If, on the other hand, the pulse is too small in relation to the step, then an undershoot with glissadic behavior will follow the saccade.

1.5.2 Smooth movements

Smooth movements have considerably lower velocity and longer duration. Usually these movements last more than 100 ms and have a velocity of up to 30−100° · s−1. They can be distinguished according to how the information coming from the retina is used: conjugate movements occur when both eyes are guided in the same direction, while disjunctive smooth eye movements are guided in opposite directions. Even though no criterion allows direct separation of a single aspect of smooth movement, conjugate smooth eye movements have been classified in humans. Smooth pursuit conjugate movements are performed during voluntary tracking of a small stimulus. They have a maximum velocity of 20−30° · s−1, and they are considered voluntary because the task requires visual feedback. Vestibular smooth conjugate movements are due to a short reflex arc in the part of the cortex devoted to the vestibular afference. They can be observed by rotating the head in darkness, that is, without any visual reference. These movements can have a latency (the delay between head rotation and eye movement) of 15 ms and a velocity which can reach up to 500° · s−1.
Optokinetic smooth conjugate movements are involuntary smooth eye movements produced by rotating very large stimuli around the stationary subject, who is instructed to stare ahead and not to track. These kinds of movements can reach a velocity of 80° · s−1.

Figure 1.16 – Fixational eye movements and drifts recorded during an experiment in which a subject was requested to look at a specific point on the screen. The figure shows a magnified detail of the fixational eye movements around a target whose size is only 30 pixels.

1.5.3 Fixational eye movements

During visual fixation, there are three main types of eye movements in humans: microsaccades, drifts, and tremor. Microsaccades are small, fast (up to 100° · s−1), jerk-like eye movements that occur during voluntary fixation. They move the retinal image across a range of several dozen to several hundred photoreceptors, and are about 25 ms in duration. Microsaccades cannot be defined on the basis of amplitude alone, as the amplitude of voluntary saccades can be as small as that of fixational microsaccades. Microsaccades have been reported in many species. However, they seem to assume an important role in species with foveal vision (such as monkeys and humans). The role of microsaccades in vision is not at all clear; scientists even argue about their very existence [58]. Recently, Rucci [82] showed that the presence of this kind of movement improves discrimination with short stimulus presentations (500 ms), while under conditions of visual stabilization (obtained with recording instruments able to eliminate the retinal consequences of fixational eye movements) discrimination is significantly poorer.

Drifts occur simultaneously with tremor and are slow motions of the eye that take place during the epochs between microsaccades. During drifts, the image of the object being fixated can move across a dozen photoreceptors. Initially, drifts were thought to be random motions of the eye generated by the instability of the oculomotor system. However, drifts were later found to have a compensatory role in maintaining accurate visual fixation in the absence of microsaccades, or when compensation by microsaccades was relatively poor.

Tremor is an aperiodic motion of the eyes with frequencies of about 90 Hz. Being the smallest of all the eye movements (tremor amplitudes are about the diameter of a cone in the fovea), visual tremor is difficult to record accurately because it falls within the range of the recording system's noise. The purpose of tremor in the vision process is unclear. It has been argued that tremor frequencies are much greater than the flicker fusion frequencies in humans7, so the tremor of the visual image might be ineffective as a stimulus. But recent studies indicate that tremor frequencies can be quite low, below the flicker fusion limit. However, early visual neurons can follow high-frequency flicker that is above the perceptual threshold for flicker fusion. So, it is possible that even high-frequency tremor is adequate to maintain activity in the early visual system, which might then lead to visual perception. Tremor is generally thought to be independent in the two eyes. This imposes a physical limit on the ability of the visual system to match corresponding points in the two retinas during stereovision.

7 The flicker fusion threshold is a concept in the psychophysics of vision. It is defined as the frequency at which all flicker of an intermittent light stimulus disappears. Like all psychophysical thresholds, the flicker fusion threshold is a statistical rather than an absolute quantity: there is a range of frequencies within which flicker sometimes will be seen and sometimes will not, and the threshold is the frequency at which flicker is detected on 50% of trials. The flicker fusion threshold varies with brightness (it is higher for brighter lights) and with the location on the retina where the light falls: the rods have a faster response than the cones, so flicker can be seen in peripheral vision at higher frequencies than in foveal vision. Flicker fusion is important in all technologies for presenting moving images, nearly all of which depend on presenting a rapid succession of static images (e.g. the frames in a film or digital video file). If the frame rate falls below the flicker fusion threshold for the given viewing conditions, flicker will be apparent to the observer, and movements of objects on the film will appear jerky. For the purposes of presenting moving images, the human flicker fusion threshold is usually taken as 16 Hz. The frame rate used in cine projection is 24 Hz, and video displays refresh at up to 160 Hz.
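The relative scales of the three fixational components can be pictured with a toy simulation. This is only an illustrative sketch, not a physiological model: the drift step size, microsaccade rate and amplitude, and tremor frequency and amplitude are all assumed values chosen to echo the ranges quoted above.

```python
import numpy as np

# Toy simulation of fixational eye movements: drift as a 2-D random walk,
# microsaccades as occasional small jumps, tremor as a tiny ~90 Hz oscillation.
# All parameters are illustrative assumptions, not measured values.

def simulate_fixation(duration_s=2.0, fs=1000.0, seed=0):
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    t = np.arange(n) / fs
    drift = np.cumsum(rng.normal(0.0, 0.002, size=(n, 2)), axis=0)   # deg per step
    tremor = 0.001 * np.sin(2.0 * np.pi * 90.0 * t)[:, None]         # deg, ~90 Hz
    gaze = drift + tremor
    # Microsaccades: roughly 1-2 per second, a fraction of a degree, instantaneous here.
    for onset in np.flatnonzero(rng.random(n) < 1.5 / fs):
        gaze[onset:] += rng.normal(0.0, 0.2, size=2)
    return t, gaze

t, gaze = simulate_fixation()
print("final gaze offset (deg):", gaze[-1])
```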
1.5.4 Coordinate systems for movements

The center of rotation of an eye is not at the center of the eye and is not fixed with respect to the orbit [76]. In other words, an eye translates a little as it rotates. For most purposes, however, it can be assumed that the human eye rotates about a fixed center 13.5 mm behind the front surface of the cornea. The direction of gaze is specified with respect to the median and transverse planes of the head. The straight-ahead position, or primary position, of an eye is not easy to define precisely because the head and eye lack clear landmarks. For most purposes, the primary position of an eye may be defined as the direction of gaze when the visual axis is at right angles to the plane of the face. An eye moves from the primary position into a secondary position when the visual axis moves from the primary position in either a sagittal or a transverse plane of the head. An eye moves into a tertiary position when the visual axis moves into an oblique position.

Figure 1.17 – Axis systems for specifying eye movements.

Three different coordinate systems are employed to report eye movements:

• Helmholtz's system: In this system the horizontal axis about which vertical eye movements occur is fixed to the skull. The vertical axis about which horizontal movements occur rotates gimbal fashion about the horizontal axis and does not retain a fixed angle to the skull. The direction of the visual axis is expressed in terms of elevation (λ) and azimuth (µ) (see Figure 1.18a). Torsion is a rotation of the eye about the visual axis with respect to the vertical axis of eye rotation.

• Fick's system: In the Fick system, the vertical axis is assumed to be fixed to the skull, and the direction of the visual axis is expressed in terms of latitude (θ) and longitude (φ). Torsion is a rotation of the eye about the visual axis with respect to the horizontal axis of eye rotation. The Fick system is the Helmholtz system turned on its side through 90° (see Figure 1.18b).

• Perimeter system: The perimeter system uses polar coordinates based on the primary axis of gaze (the axis straight out from the eye socket and fixed to the head).
Eye positions are expressed in terms of the angle of eccentricity of the visual axis (π) with respect to the primary axis, and of the meridional direction (κ) of the plane containing the visual and primary axes with respect to the horizontal meridian of head-fixed polar coordinates.

Figure 1.18 – (a) In the Helmholtz system the horizontal axis is fixed to the skull, and the vertical axis rotates gimbal fashion; (b) In the Fick system the vertical axis is fixed to the skull.

These three systems are the same coordinate system, simply anchored to the head in different ways. A specification of eye position can be transformed between the three systems by the following equations (a small numerical sketch is given at the end of this section):

tan λ = tan θ cos φ = sin κ tan π
sin µ = sin φ cos θ = sin π cos κ

Listing proposed another coordinate system in which any rotation of an eye occurs about an axis lying in a plane known as Listing's plane. Helmholtz called this Listing's law. Listing's plane is fixed with respect to the head and coincides with the equatorial plane of the eye when the eye is in its primary position. Elevations and depressions of the eye occur about a horizontal axis in Listing's plane, lateral movements occur about a vertical axis, and oblique movements occur about intermediate axes. More precisely, any unidirectional movement of an eye can be described as occurring about an axis in Listing's plane that is orthogonal to the plane within which the visual axis moves. The extent of an eye movement is the angle between the initial and final directions of gaze (the change of the angle of eccentricity π). The direction of an eye movement is the angle between the meridian along which the visual axis moves and a horizontal line in Listing's plane (δ or its supplement κ).
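As an example of the conversions above, the following sketch maps a gaze direction given in the perimeter system (eccentricity π, meridian κ) into Helmholtz elevation and azimuth using the two relations just quoted. The function name and the sign conventions are assumptions made for this illustration.

```python
import math

# Perimeter coordinates (eccentricity pi_e, meridian kappa) to Helmholtz
# elevation/azimuth, using the relations quoted above:
#   tan(lambda) = sin(kappa) * tan(pi_e)
#   sin(mu)     = sin(pi_e) * cos(kappa)
# Axis and sign conventions are assumptions made for this illustration.

def perimeter_to_helmholtz(ecc_deg: float, meridian_deg: float):
    ecc, mer = math.radians(ecc_deg), math.radians(meridian_deg)
    elevation = math.degrees(math.atan(math.sin(mer) * math.tan(ecc)))  # lambda
    azimuth = math.degrees(math.asin(math.sin(ecc) * math.cos(mer)))    # mu
    return elevation, azimuth

# Example: 10 degrees of eccentricity along a 30-degree meridian.
print(perimeter_to_helmholtz(10.0, 30.0))
```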
Chapter 2

Monocular depth vision system

Contents
2.1 Monocular static depth cues
    2.1.1 Perspective: Size, Linear perspective, Height, Texture perspective
    2.1.2 Interposition: Interposition using parts of the same object, Interposition using different objects, Accretion and deletion
    2.1.3 Lighting: Shading, Shadows
    2.1.4 Accommodation using image blur
    2.1.5 Aerial effects
2.2 Monocular dynamic depth cues
    2.2.1 Motion parallax
    2.2.2 Optical flow
2.3 Types of cue interaction

Humans use a wide variety of depth cues that can be combined with cognitive strategies depending upon the visual circumstances. Without a sense of depth, we would be greatly impaired, since this perception allows us to perceive and represent many aspects of our environment.

Figure 2.1 – Binocular disparity according to the displacement of the two eyes from the cyclopean axis.

However, the perceptual problem of recovering the distance to a surface is that depth perception from 2-D images is inherently ambiguous. This is because the optics of surface reflection project light from a 3-D world onto a 2-D surface at the back of the eye, and such a projection can be inverted in an infinite number of ways. We must therefore understand what sort of information from the environment allows us to perceive such a vital piece of information as distance. These sources of information are called depth or distance cues.

Nature addressed the problem in living beings by creating a stereo rather than a monocular structure of visual information. This vision process is called stereopsis, and it is the process of perceiving the relative distance to objects based on their lateral displacement in the two retinal images. Stereopsis is possible because we have two laterally separated eyes whose visual fields overlap in the central region of vision. Because the two eyes differ in position, the two retinal images are slightly different. This relative lateral displacement is called binocular disparity, and it arises when a given point in the external world does not project to corresponding positions on the left and right retinas (see Figure 2.1; a minimal numerical sketch of this relationship is given at the end of this introduction).

Obviously, stereoscopic vision poses some problems, the main one being the measurement of the direction and the amount of disparity between corresponding image features in the two retinal images. But we cannot measure this disparity if we cannot determine which features in the left retinal image correspond to which features in the right one. This is called the correspondence problem, for obvious reasons. Many solutions to this problem have been proposed. The first interesting and well-known theory was devised by David Marr and Tomaso Poggio in 1977 [70].

Clearly, binocular depth perception is not the only way to perceive distance. Even though a classification of such a complex system is only partially possible, three families of cues may be distinguished (see Figure 2.2):

• Primary depth cues that provide direct depth information, such as convergence of the optical axes of the two eyes, accommodation, and unequivocal disparity cues

• Secondary depth cues that may also be present in monocularly viewed images. These include shading, shadows, texture gradients, motion parallax, occlusion, 3-D interpretation of line drawings, and the structure and size of familiar objects

• Cues to flatness, inhibiting the perception of depth. Examples are frames surrounding pictures, or the uniform texture of a poorly resolved CRT monitor

All these cues can be monocular or binocular, depending on the number of visual sources involved. Some of them, called dynamic depth cues, are obtained only when the projection of the environment on the retina is moving. Others, called static depth cues, can be extracted even from static scenes.

2.1 Monocular static depth cues

As mentioned before, depth cues depend strictly on how many visual sources are involved in the process and on how they change on the retina. Considering the nature of this work, we will focus mainly on monocular static depth cues: a family of cues that can be extracted from a static scene (like a picture) and that carry an enormous quantity of information to the visual system.
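Before moving on to the monocular cues, the binocular geometry described above can be made concrete with a minimal sketch: under a simplified pinhole model, the depth of a point is inversely proportional to the horizontal disparity between its two retinal projections. The baseline, focal length, and disparity values below are hypothetical.

```python
# Depth from binocular disparity under a simplified pinhole model: two eyes
# separated by a baseline b view the same point, and the horizontal shift
# (disparity) between its two retinal projections is inversely proportional to
# its distance, Z ~ b * f / d. All numbers are hypothetical.

def depth_from_disparity(baseline_m: float, focal_m: float, disparity_m: float) -> float:
    return baseline_m * focal_m / disparity_m

# Interocular distance ~6.5 cm, a nominal "focal length" of 17 mm for the eye,
# and a retinal disparity of 0.1 mm give a distance of roughly 11 m.
print(f"{depth_from_disparity(0.065, 0.017, 0.0001):.1f} m")
```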
2.1.1 Perspective

In general, the surface on which objects are represented in perspective is a vertical plane. Nevertheless, representations can be obtained on slanting planes (in certain architectural representations), on horizontal surfaces (ceilings), or on cylindrical surfaces (panoramas).

Figure 2.2 – Monocular depth cues.

Figure 2.3 – Example of depth perception using perspective cues.

Figure 2.4 – Object, object space, display, and observer.

Figure 2.5 – (a) one-point; (b) two-point; (c) three-point perspective.

Geometrically, perspective is a conic projection that involves four principal elements: the object space, the object to be represented (lying in the object space), the display, and the observer (see Figure 2.4). Usually the object space is represented by X, Y, and Z Cartesian coordinates. The display, presumed vertical, transparent, and parallel to the X, Y plane, is interposed between the object and the observer, who is imagined standing erect on a horizontal plane and viewing monocularly. Light rays are represented by straight lines emanating from the points to be represented. In this context, the representation of a point is the intersection of the light ray from that point with the display.

Perspective is normally classified into one-point, two-point, or three-point perspective. Let the object be a rectangular solid with three axes A, B, and C. When each object axis is parallel to a space coordinate (see Figure 2.5a), we have one-point perspective. When only one object axis is parallel to a space coordinate, we have two-point perspective (as shown in Figure 2.5b). When no object axis is parallel to a space coordinate, we have three-point perspective (Figure 2.5c). Perspective in general is not a cue in itself, but it collects a group of cues that use perspective principles as a source of information.

Size

Size information is one of the most controversial cues in the science of vision. Many works have debated which size features are important in this cue, but the answers remain controversial and far from exhaustive. The size signal of an image can be either dynamic or static depending on the kind of information extracted. This information can be used as a depth cue in three ways:

• An image changing in size can give the impression of an approaching or receding object (see Paragraph 2.2.1). Naturally, the observer assumes that the surface of the object is not actually varying in size

• The relative sizes of the images of simultaneously or successively presented objects indicate their relative depths if the observer assumes that they are images of the same object and, therefore, of the same size

• The size of the image of an isolated familiar object indicates the distance of the object

Ittelson [51] presented playing cards of half, normal, and double size to subjects, one card at a time, at the same distance, and in dark surroundings. He concluded that the perceived depth of a familiar object is "[...] that distance at which an object of physical size equal to the assumed-size would have to be placed in order to produce the given retinal image". Hochberg and Hochberg [47] proved that Ittelson had not properly controlled the effects of the relative size of the three playing cards; three blank cards may have produced the same result. Ono [43] found that when photographs of a golf ball and a baseball were presented at the same angular size, the baseball appeared more distant. Therefore, it seems that estimating the distance of single familiar objects is in accordance with cognitive processes rather than only with perceptual afferents [2].
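Ittelson's definition quoted above has a simple geometric reading: the familiar-size cue assigns to an object the distance at which something of the assumed physical size would subtend the observed visual angle. The sketch below assumes a small-angle pinhole geometry; the card size and visual angle are hypothetical.

```python
import math

# The familiar-size cue as a formula: assuming a physical size S for an object
# that subtends a visual angle theta, the implied distance is the one at which
# an object of size S would produce that retinal image, D = S / tan(theta).
# The card height and visual angle below are hypothetical.

def distance_from_assumed_size(assumed_size_m: float, visual_angle_deg: float) -> float:
    return assumed_size_m / math.tan(math.radians(visual_angle_deg))

# A playing card assumed to be 8.9 cm tall and subtending 2 degrees of visual
# angle is perceived at roughly 2.5 m.
print(f"{distance_from_assumed_size(0.089, 2.0):.2f} m")
```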
Gogel [104] used a procedure designed to tap a purely perceptual mechanism. In his experiments he argued that the partial effect of familiar size under reduced conditions is due to the tendency to perceive objects at a default distance, which he called the specific-distance tendency. However, Hastorf [4] found that distance estimates of a disc were influenced by whether subjects were told that it was a ping-pong ball or a billiard ball. Epstein and Baratz [30] found that only long-term familiarity is effective as a relative depth cue. In other experiments, Gogel and Mertens [39] used a wide variety of familiar objects and also produced evidence of distance scaling by familiar size.

Given these findings, depth perception using size cues can be a very complex process in which not only pure visual perception is involved but also high-level cognitive processes which constrain information about size and the dynamics of size. For instance, it is quite common to have the experience of playing cards being different sizes, but rather uncommon to experience human figures that can shrink or enlarge.

Linear perspective

Linear perspective is one of the basic perspective cues used by the brain to perceive depth. Planar figures can be interpreted as the projection on the display of objects which lie in the 3-D object space. For instance, the impression of inclination produced by a 2-D trapezoid (Figure 2.6) as the projection of a floor in the object space can be very strong.

Figure 2.6 – Example of linear perspective.

Figure 2.7 – The depth cue of height in the field.

Obviously, the evaluation of linear perspective depends on the observer's assumptions. In the case of the above example, the observer assumes that the shape of the image represents an inclined rectangle (a floor) rather than a frontal trapezoid. The impression of depth produced by linear cues can easily be overridden as soon as the observer uses other cues to obtain more depth information. For example, the production of a motion parallax cue by moving the head laterally a sufficient distance destroys the linear cue percept (for more details see [5]), because the perceived 2-D shape of the object should change according to the motion applied.

Height

The image of a single object located on a textured surface creates a strong impression of depth. The effect of height in the field on perceived distance is stronger when the stimulating objects are placed on a frontal-plane background containing lines converging to a vanishing point (see [27]). A frontal background with a texture gradient produces a stronger effect than a simple tapered black frame [101]. An object suspended above a horizontal textured surface appears as if it were sitting on the point of the surface that is optically adjacent to the base of the object. The nearer the object is to the horizon, the more distant the object appears (see Figure 2.7).

Figure 2.8 – Texture gradients and perception of inclination: (a) Texture size scaling; (b) Texture density gradient; (c) Aspect-ratio perspective.

Texture perspective

Depth perception using texture perspective supplies strong distance information when the reference surface of the object space is filled with some sort of texture.
In these cases, the texture has to be homogeneous (texture elements are identical in shape, size, and density at all positions on the surface) and isotropic (texture elements are similar in shape, size, and density at all orientations). There are three types of texture perspective (see Figure 2.8):

• Texture size scaling: The image of any texture element decreases in size in proportion to the distance of the element from the nodal point of the eye. The more the distance increases, the more the texture elements become graded in size.

• Texture density gradient: Images of texture elements become more densely spaced with increasing distance along the surface. The observer implicitly assumes that the texture on the surface is homogeneous.

• Aspect-ratio perspective: The image of an inclined 2-D shape is compressed in the direction of the projection, changing in this way the aspect ratio of the texture element. This indicates inclination only if the true shape of the single element is known.

A gap in the texture gradient usually indicates the presence of a step; a change in texture gradient associated with aspect ratio produces, instead, a change in slope.

2.1.2 Interposition

Interposition occurs when the whole or part of one object hides part of the same object or of another object. Therefore, this cue provides information about depth order but not about the magnitude of the distance.

Figure 2.9 – Example of an object that violates Jordan's Theorem.

Figure 2.10 – (a) Amodal completion in which the figures appear in a specific order. (b) Modal completion effects using a Kanizsa triangle in which a complete white triangle is seen. Each disc appears complete behind the triangle (amodal completion).

Interposition using parts of the same object

The cue that uses parts of the same object to obtain depth information strictly obeys Jordan's Theorem, which states that a simple closed line divides a plane into an inside and an outside. In order to better understand the application of this theorem to interposition, let us give some definitions. The locus of points where visual lines are tangential to the surface of the object but do not enter the surface is the bounding rim of the object. Holes in the surface of the object also produce a bounding rim. For all objects, the bounding rim projects a bounding contour on the retina. A bounding contour is a closed line that represents the projection of the object on the retina. It has a polarity, meaning that one side is inside the image of the object and one side is outside it. Jordan's Theorem is strictly and implicitly embedded in the mechanism of the perceptual system, which allows us to recognize the difference between the inside and the outside of an object. In fact, we are able to recognize an object that violates Jordan's Theorem, as shown in Figure 2.9, even though we cannot explain why [23].

Interposition using different objects

This kind of interposition can involve either amodal or modal completion:

• Amodal completion: Overlapping objects usually produce images that lack the information of the farther part of the bounding contour. Amodal completion occurs when the object (or more than one) is perceived as occluded by the closer of the shapes. This effect allows the far object to be perceived as having a continuous edge extending behind the nearer object (see Figure 2.10a).
• Modal completion: Many times we are able to perceive an object as complete even when parts of its boundary cannot be seen because the object has the same luminance, color, and texture as the background. In Figure 2.10b, we clearly see the white triangle even though there is no physical contrast between its borders and the surrounding region (see [55] for further investigations). This phenomenon is known as modal completion. The edges implicitly produced by this process are known as cognitive contours. The object to which the cognitive contours belong in modal completion is always perceived in the foreground.

Accretion and deletion

This cue is similar to modal completion, but it appears clearly when the surfaces involved move relative to one another. When the image of an object B moves with respect to the image of a textured object A, texture elements of A are deleted along the leading edge of B and emerge at the opposite edge. Even if the object B has the same luminance, color, and texture as the textured background A, the object B is still perceived as a complete figure, and it is seen in front of the object A, which undergoes accretion and deletion of texture (see Figure 2.11).

Figure 2.11 – A Kanizsa triangle is not evident in each static frame but becomes visible in the dynamic sequence.

Figure 2.12 – Variations of convexity and concavity from shading.

2.1.3 Lighting

Another strong cue to depth information is lighting. There are two different phenomena exploited by the brain to understand 3-D scenes: shading and shadow.

Shading

Shading is the variation in irradiance from a surface due to changes in the orientation of the surface in relation to the incident light, or to variations in specularity1. A smoothly curved surface produces gradual shading. A sharp discontinuity in surface orientation creates shading with a sharp edge (see Figure 2.12). Shading alone is ambiguous as a depth cue. Other information that can help resolve the ambiguity includes the direction of illumination, the presence of other objects, occluding edges, and familiarity with the object under observation.

1 A specular surface reflects light more in one direction than in other directions. The preferred direction is that for which the angle of reflection equals the angle of incidence. Specularity is a parameter which measures the degree of reflection of light from the surface.

Shadows

A shadow is a variation in irradiance from a surface caused by the obstruction of an opaque or semi-opaque object. A shadow cast by an object onto another object is known as a detached shadow. Shadows provide information about the dimensions of the object that casts the shadow only if the observer knows the direction of illumination and the nature and orientation of the surface upon which the shadow is cast. In particular, a detached shadow can be a rich source of information about the structure of 3-D surfaces. For instance, in Figure 2.13 the light gray squares appear to lie at different depths above the background because of the positions of their shadows.

Figure 2.13 – The gray squares appear to increase in depth above the background because of the positions of the shadows.

2.1.4 Accommodation using image blur

Accommodation is the change in the shape of the crystalline lens that the eye carries out in order to bring a gazed object into focus. The absence of this control produces blur on the retina, which in turn causes a lack of sharpness in edge projections. The eye uses this effect to control the magnitude and the sign of the accommodation.
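The geometry behind the blur cue can be sketched with a thin-lens approximation: an object away from the fixated distance forms its sharp image in front of or behind the retina, producing a blur circle whose diameter grows with the focus error. This is not a model of the eye's actual accommodation control, only an illustration of how blur varies with distance; the focal length and pupil diameter are assumed values.

```python
# Thin-lens sketch of how retinal blur depends on object distance. The "eye" is
# a lens of focal length f focused at distance s_focus with a pupil of diameter
# a; an object at another distance s forms its sharp image away from the retina,
# and the resulting blur-circle diameter grows with the focus error. All values
# are assumptions used only to illustrate the blur cue.

def blur_circle_mm(s_m: float, s_focus_m: float, f_mm: float = 17.0, pupil_mm: float = 4.0) -> float:
    f, a = f_mm / 1000.0, pupil_mm / 1000.0
    v_retina = 1.0 / (1.0 / f - 1.0 / s_focus_m)   # image plane for the fixated distance
    v_object = 1.0 / (1.0 / f - 1.0 / s_m)         # sharp-image plane for the object
    return abs(v_retina - v_object) / v_object * a * 1000.0

for s in (0.25, 0.5, 1.0, 2.0):
    print(f"object at {s:.2f} m -> blur circle {blur_circle_mm(s, s_focus_m=1.0):.3f} mm")
```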
The use of accommodation as a cue to absolute distance has had alternating periods of acceptance. Descartes (1664), followed by Berkeley (1709), proposed that the act of accommodation itself aids the perception of depth. Wundt (1862), followed by Hillebrand (1894), found that people cannot judge the distance of an object on the basis of accommodation alone but can use changes in accommodation to judge differences in depth. More recently, Fisher and Ciuffreda [32] found that subjects estimating the distance of monocular targets by pointing to them with a hidden hand tend to overestimate distances of less than 31 cm and to underestimate larger distances. Mon-Williams and Tresilian [72] found some correlation between accommodation, image blur, and vergence; however, responses in their experiments were variable. It is certainly true that if the sharpness of the edges under observation is known, static blur on the retina alone can be used as an unambiguous cue of relative depth. The act of changing accommodation between two objects at different distances may provide information about their relative depth. Dynamic accommodation may be more effective when many objects in different depth planes are presented at the same time.

2.1.5 Aerial effects

It seems that Leonardo da Vinci coined the term aerial perspective to describe the effect produced by the atmosphere on distant objects. The atmosphere affects the visibility of distant objects in two ways: through optical haze, when rising convection currents of warm air refract light and introduce shimmer, and through mist, when light is absorbed and scattered by atmospheric dust and moisture. The effects of these phenomena can be measured with dedicated instruments and then used to calculate the distance between the position of the observer and distant points (for further investigations and applications see [20]).

2.2 Monocular dynamic depth cues

As we have seen previously, some monocular cues can be either static or dynamic, like size modification and accretion and deletion. Pure dynamic cues extract information not from pictures or static images, but rather from physiological movements, voluntary or not, of the image on the retina, capturing further characteristics of the 3-D scene.

2.2.1 Motion parallax

Motion parallax is a cue based on continuous changes of the object's projection on the retina over time. For an object at a given distance and a given motion of the observer, the extent of motion parallax between that object and a second object is proportional to the depth between the two objects. There are mainly three types of motion parallax (a small numerical sketch follows this list):

• Absolute parallax: For a given magnitude of motion of the object or the observer (what matters is the relative motion between observer and object), the change in visual direction of an object is approximately inversely proportional to its distance (for details see Chapter 4)

• Looming: An approaching object, or a set of objects increasing in size, produces a strong perception of distance (see Paragraph 2.1.1). This type of image motion does not occur with parallel projections, for which no change in perspective is produced

• Linear parallax: When an object or a set of objects translates laterally, or is viewed by a person moving laterally, the images of more distant parts move more slowly than those of nearer parts. Obviously, the same motion effect occurs when the two objects translate at the same velocity with respect to a stationary head
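The absolute-parallax relation in the first item of the list can be sketched numerically: for an observer translating laterally with velocity v, the visual direction of a stationary point roughly in front of the observer changes at a rate of approximately v/Z, so two points at different depths produce different image velocities. The speeds and distances below are illustrative.

```python
import math

# Absolute motion parallax: for an observer translating laterally with velocity
# v, the visual direction of a stationary point roughly straight ahead changes
# at a rate of about v / Z (radians per second), i.e. inversely proportional to
# its distance Z. The speeds and distances are illustrative.

def angular_velocity_deg_s(observer_speed_m_s: float, distance_m: float) -> float:
    return math.degrees(observer_speed_m_s / distance_m)

near, far = 2.0, 10.0   # two points at different depths (m)
v = 0.05                # small lateral movement of the observer (m/s)
w_near = angular_velocity_deg_s(v, near)
w_far = angular_velocity_deg_s(v, far)
print(f"near point: {w_near:.2f} deg/s, far point: {w_far:.2f} deg/s")
print(f"relative parallax: {w_near - w_far:.2f} deg/s")
```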
2.2.2 Optical flow

During physiological movements of the eyes, the projection of the object onto the retina forms a motion pattern referred to as optical flow. This process produces visually perceived complex motion patterns, which combine object and retinal motions to provide an incredibly rich source of information regarding the dynamic 3-D structure of the scene. This information can be represented by a vector indicating the velocity and direction of each point of the retinal image that has moved during a fixed interval.

Figure 2.14 – Differential transformations for the optical flow fields: (a) translation; (b) expansion; (c) rotation; (d) shear of type one; (e) shear of type two.

Since the object under observation does not present discontinuity points, but rather surfaces that are usually smooth2, the optical flow field produced by the movements is spatially differentiable over local areas. Any field so generated can be decomposed into five local spatial gradients of angular velocity, which are known as differential velocity components. These components are (see Figure 2.14) translation, expansion or contraction, rotation, and two orthogonal components of deformation (a small sketch of this decomposition is given at the end of this section). The last three components are known as differential invariants because they do not depend on the choice of the coordinate system.

2 Surfaces can be sharply curved, but they never present discontinuity points.

Since the motion of the object's projection on the retina depends directly on the distance of the object from the observer (see Paragraph 2.2.1), and since the optical flow represents this motion, it follows directly that these fields can be used to extract depth information. However, optical flow is not usually used alone to extract distance information, since it is ambiguous in some common situations. Consider, for instance, two objects moving laterally on different parallel planes (parallel to the plane of the observer) with the same linear velocity. Due to linear parallax, the image of the nearer object moves faster than that of the farther object. If the observer assumes that they are moving at the same velocity, the faster object is surely the nearer one. Although this phenomenon is perfectly capable of supplying depth-order information, it cannot be used in cases where the eye tracks one of the two objects, since the image of the tracked object is steady on the retina and does not contribute any information to the optical flow field. This means that the velocity of an image on the retina does not necessarily indicate relative depth.
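The decomposition mentioned above can be written down directly for a locally linear flow field: given the 2 × 2 matrix of spatial velocity gradients, the divergence, curl, and two shear terms are simple combinations of its entries. The example matrix below is arbitrary.

```python
import numpy as np

# Differential components of a locally linear flow field v(x) = A @ x + t:
# divergence (expansion/contraction), curl (rotation), and the two deformation
# (shear) components are simple combinations of the velocity-gradient matrix A.
# The example values are arbitrary.

def flow_components(A: np.ndarray):
    div = A[0, 0] + A[1, 1]      # expansion (+) or contraction (-)
    curl = A[1, 0] - A[0, 1]     # rigid rotation
    shear1 = A[0, 0] - A[1, 1]   # deformation along the coordinate axes
    shear2 = A[0, 1] + A[1, 0]   # deformation along the diagonals
    return div, curl, shear1, shear2

A = np.array([[0.3, -0.1],
              [0.2,  0.1]])      # hypothetical velocity gradients (1/s)
print(dict(zip(("div", "curl", "shear1", "shear2"), flow_components(A))))
```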
2.3 Types of cue interaction

As we have seen, there are a number of cues that the brain can use in order to obtain robust depth information. However, the visual system does not consider single cues as separate sources of information, but implements interactions which can improve the estimate of depth, even balancing out the detection of poor or noisy cues. Since all these sources bear upon the same perceptual interpretation of depth, they must somehow be put together into a coherent, consistent representation. Bülthoff and Mallot [16] suggested a classification of the cue interactions based on five different categories:

• Accumulation: In this interaction, signals coming from different cue systems may be summed, averaged, or multiplied in order to improve the discrimination of depth. According to Bülthoff and Mallot, this interaction occurs at the output level of independent systems. The fundamental idea is that for each point of the image (or of its depth-map representation), different cues produce different estimates which are then integrated3 (a small sketch of this kind of combination is given at the end of this section).

• Cooperation: This type of interaction is similar to accumulation. The main difference is that the cue systems involved interact at an earlier stage. This means that the different modules are not kept separate, but interact at different levels to arrive at a single coherent representation of the distance. In practice, different cues take advantage of specific sub-modules of other cues in order to reduce the amount of circuitry involved in calculating the same function.

• Veto: This interaction occurs when two or more cues provide conflicting information. In this case judgments may be based on only one cue, with the other being suppressed. A surprising example of veto in depth perception can be demonstrated using a pseudoscope4. However, the depth percept produced by this instrument is not inverted as predicted. This is because the distance information can be overridden by monocular cues, such as perspective, texture gradients, and shadows.

• Disambiguation: One cue system can produce depth information that lacks precision or is too noisy to be used. Another cue may resolve the ambiguity by contributing additional information. For example, depth information generated by motion parallax can be ambiguous with respect to the sign of depth. This ambiguity can be resolved using interposition and shadow cues as sources of new information.

• Hierarchy: Information derived from one cue may be used as raw data for another.

3 This kind of interaction was later called weak fusion [18, 59] because it assumes no interaction among the different information sources apart from the combination of the module outputs.

4 This device reverses binocular disparity simply by reversing the images that are projected to the left and right eyes. Since this instrument reverses the horizontal disparities of everything in the image, objects that are closer should appear farther and vice versa.

The question of how different sources of distance information are merged into a single coherent depth image is very complex and not well understood so far. High-level processes, as well as different low-level functions in the brain, can reinforce or even interfere with the integration of the cues. This process happens, but no one yet knows where and how.
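The accumulation (weak fusion) idea referred to in the list above can be sketched as a reliability-weighted average of independent cue estimates. This is only an illustration of the principle, not a claim about the mechanism actually used by the visual system; the cue values and variances are hypothetical.

```python
# Weak-fusion ("accumulation") sketch: independent depth estimates from
# different cue modules are combined with weights proportional to their
# reliabilities (here, inverse variances). The cue values are hypothetical.

def combine_depth_estimates(estimates):
    """estimates: list of (depth_m, variance) pairs from independent cue modules."""
    weights = [1.0 / var for _, var in estimates]
    fused = sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)
    return fused

cues = [
    (2.1, 0.05),   # e.g. a precise motion-parallax estimate
    (2.6, 0.40),   # e.g. a noisy familiar-size estimate
    (1.9, 0.10),   # e.g. a texture-gradient estimate
]
print(f"fused depth estimate: {combine_depth_estimates(cues):.2f} m")
```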
Chapter 3

Monocular depth perception in robotics

Contents
3.1 Passive depth finding
    3.1.1 Depth from camera motion
    3.1.2 Image information extraction
    3.1.3 Depth from camera parameter variations
    3.1.4 Depth from camera parameter variations due to movement
3.2 Active depth finding

As we saw in the earlier chapters, the depth extraction carried out by the visual system is really complex. It is not surprising that the eye, a biological system with all its complexities and its millions of years of evolution, cannot easily be represented by a simple equation or model. Evolutionarily speaking, distance perception and its applications to survival are of crucial importance. It would not be surprising to find that the evolutionary process had performed some of its greatest feats to produce the mechanisms underlying depth perception. Computer and robotic vision tries to imitate, simulate, or be inspired by this extraordinary performance.

Robotic sensory perception is a key issue for immediate or short-range motion planning, that is, reacting only to the free space around a robot without requiring a predefined trajectory plan. Local navigation requires no environment model and relies entirely on sensory data. Many different approaches are being studied. These approaches depend very much on the outcome of the image feature extraction algorithms, which are also computationally demanding. In this context, all types of distance extraction algorithms can be categorized into five different classes according to the type of interaction that they have with the environment to be explored. Active methods, for instance, include depth finders that use laser, ultrasound, or radar waves; passive methods include image information extraction, depth from camera motion, depth from camera parameter variations, and depth from camera parameter variations due to movement.

In general, passive methods, which do not require the production of any sort of beamed energy, have a wider range of applications. This is because the analysis of energy reflections in a composite environment is usually so complex that it produces a degree of ambiguity too high for many applications. Moreover, the possible presence of other kinds of energy sources can invalidate or jam any direct reading.

In robotics, several studies have taken an active approach to the visual evaluation of distance (see for example [75, 84, 14, 34, 92, 93] for some methods). In most of these studies, a motion of the sensor according to the movement of the robot (or of the camera itself) produces a subsequent motion of the projection of the observed scene on the sensor, which is processed with different approaches like optical flow, feature correspondence, spatio-temporal gradients, etc. In this chapter we report on the full range of different approaches that can be found in the robotics literature. However, none of these approaches has taken inspiration from the human eye by considering the combination of the parallax effect and eye movements in order to determine depth information.

3.1 Passive depth finding

This is the class that includes the majority of all algorithms for depth processing or extraction. In this class, no energy is released by the measuring agent.

3.1.1 Depth from camera motion

Mobile robots are among the most challenging problems in robotics. Conceptually, robotic agents that are able to perform complex navigation tasks in an unknown environment are important because their realization implies interactions among different cognitive modules. A fundamental feature is the extraction of distance information from the obstacles present in the agent's path. To address this problem, many approaches have been developed that exploit variations of the camera position and orientation. In this class of algorithms, a motion of the sensor according to the movement of the robot produces a corresponding motion of the projection of the observed scene on the sensor, which can be processed and used to extract distance information. For almost all possible applications, these algorithms have to be performed on-line in order to drive the movements of the camera.
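The geometric principle shared by the depth-from-camera-motion methods reviewed below can be reduced to a single relation: for a pinhole camera translating laterally by a known amount, the image displacement of a static feature behaves like a stereo disparity, so depth is focal length times translation divided by displacement. The function and numbers below are a generic sketch, not any of the specific methods cited.

```python
# Depth from known camera motion, reduced to its simplest form: a pinhole
# camera translates laterally by T between two frames, so a static feature
# shifts on the image by a "disparity" d (in pixels), and Z ~ f_pixels * T / d.
# This is a generic sketch, not any of the specific methods reviewed below.

def depth_from_translation(f_pixels: float, translation_m: float, disparity_px: float) -> float:
    return f_pixels * translation_m / disparity_px

# An 800-pixel focal length, a 5 cm sideways motion, and a 10-pixel shift of the
# feature give a depth of 4 m.
print(f"{depth_from_translation(800.0, 0.05, 10.0):.1f} m")
```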
Murphy [75] proposes a military application that acquires the distance from a moving vehicle to targets in video image sequences. Murphy presents a real-time depth detection system that finds the distances of objects through a monocular vision model, employing a simple geometric triangulation model of the vehicle movements together with some explicative considerations. The algorithm can be used with a camera mounted either at the front or at the side of a moving vehicle, and the corresponding geometric model covers both the front and the side view. To solve the correspondence problem, a real-time matching algorithm is introduced that improves the matching performance by several orders of magnitude. The author uses the intensity feature to match corresponding pixels in two adjacent image frames. In order to obtain accurate and efficient matching, he derives a number of motion heuristics including maximum velocity change, small change in orientation, coherent motion, and continuous motion.

Sandini [84] suggests that a tracking strategy, which is possible in the case of ego-motion, can be used in place of a matching procedure. The basic differences between the two computational problems arise from the fact that the motion of the camera can be actively controlled in the case of ego-motion, whereas in stereo vision it is fixed by the geometry. For Sandini, the continuous nature of motion information is used to derive the component of the two-dimensional velocity vector perpendicular to the contours, while the knowledge of the ego-motion parameters is used to compute its direction. By combining these measures, an estimate of the optical flow can be obtained for each image of the sequence. In order to facilitate the measurement of the navigation parameters within this framework, a constrained ego-motion strategy was adopted in which the position of the fixation point is stabilized during navigation (in an anthropomorphic fashion). This constraint reduces the dimensionality of the parameter space without increasing the complexity of the equations, and it allows the position of objects in space to be derived from image positions and optical flow.

Beß et al. [14] present a clear example of interpolation preprocessing, in which image information is replaced with a less complex representation. The technique is part of a system computing depth from monocular image sequences. Taking a sequence of different views from a camera mounted on a robot hand, each pair of consecutive images is treated as a stereo pair. The advantage of the segmentation approach, which employs sequences of straight line segments and circular arcs, is straightforward: the number of primitives representing the image information is reduced significantly, thereby lowering the computational effort of the correspondence problem as well as the frequency of erroneous matches. Starting with matched pairs of primitives, a disparity image is computed containing the initial disparity values for a subsequent block matching algorithm. The depth images computed from these stereo pairs are then fused into one complete depth map of the object surface.

So far, conventional approaches to recovering depth from images in which there is motion (either induced by ego-motion or produced by moving objects) have involved obtaining two or more images, applying a reduction of information through filtering or skeleton segmentation, and solving the correspondence problem.
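The correspondence/block-matching step that recurs in these approaches can be illustrated in its simplest form: for a patch in one frame, search a horizontal range of displacements in the next frame and keep the one that minimizes the sum of squared differences. This generic sketch is not the matcher used by any of the works cited here.

```python
import numpy as np

# The block-matching step in its simplest form: for a square patch in frame 1,
# find the horizontal displacement in frame 2 that minimises the sum of squared
# differences (SSD). A generic sketch, not the matcher of any cited work.

def match_patch(frame1, frame2, row, col, size=8, max_disp=16):
    patch = frame1[row:row + size, col:col + size].astype(float)
    best_d, best_ssd = 0, np.inf
    for d in range(max_disp + 1):
        if col + d + size > frame2.shape[1]:
            break
        candidate = frame2[row:row + size, col + d:col + d + size].astype(float)
        ssd = np.sum((patch - candidate) ** 2)
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 255, size=(64, 64))
frame2 = np.roll(frame1, 5, axis=1)   # the scene appears shifted 5 pixels to the right
print(match_patch(frame1, frame2, row=20, col=20))   # -> 5
```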
Unfortunately, the computational complexity involved in feature extraction and in solving the correspondence problem makes existing techniques unattractive for many real-world robotics applications, in which real-time operation is often a given constraint. To avoid the correspondence problem completely, Skiftstad [88] developed a depth recovery technique that does not include the computationally intensive feature selection steps required by conventional approaches. The Intensity Gradient Analysis (IGA) technique is a depth recovery algorithm that exploits the properties of the MCSO (moving camera, stationary object) scenario. Depth values are obtained by analyzing temporal intensity gradients arising from the optical flow field induced by known camera motion, which causes objects to displace in known directions. The basic idea is that if an object displaces by exactly one pixel, the intensity perceived at the location the object moves into must equal the intensity perceived at the location the object moved out of before the displacement took place. This 1-pixel displacement can be considered as a disparity equal to one. IGA thus avoids the feature extraction and correspondence steps of conventional approaches and is therefore very fast.

Fujii [34] presents a Hypothesize-and-Test approach to the problem of constructing a depth map from a sequence of monocular images. This approach only requires knowledge of the axial translation component of the moving camera. When the configuration of the camera is known with respect to the mobile platform, this information can be readily obtained by projecting displacement data from a wheel encoder or range sensors onto the focal axis. Instead of directly calculating the depth of the feature points, the algorithm first hypothesizes that there is a pair of feature points which have the same depth. Based on this hypothesis, the depths are calculated for all pairs of feature points found in the image. As the robot moves, the relative location of two points changes in a specific and predictable manner on the image plane if they are actually at the same depth (in other words, if the hypothesis is correct). The motion of each pair of points on the image plane is observed, and if it is consistent with the predicted behavior, the hypothesis is accepted. Accepted pairs create a graph structure which contains the depth relations among the feature points.

3.1.2 Image information extraction

These approaches usually extract distance information from the image structure itself, using, for example, the scattering of light, Fourier analysis, or optical flow. The kind of information extracted in these cases is not the depth itself, but some sort of indirect information that can be related in some way to the distance. These algorithms can be considered pure computer vision, because they do not interact with the environment at all, and they can normally be performed off-line. There are many different examples of this kind of algorithm in the computer vision literature, but in this review we will only consider a few of them. They are mainly characterized by two different sources of information: sequences of pictures, which offer time as an additional dimension, and single shots. A common approach to visual extraction is to apply a transformation that preserves the nature of the information in the image, but that can subsequently be manipulated more easily.
Often this transformation uncovers properties and features that can be directly used to extract the desired information. Torralba and Oliva [98] propose that a source of information for absolute depth estimation can be based on the whole scene structure, without relying on specific objects. They demonstrate that, by recognizing the properties of the structures present in a single image, they can infer the scale of the scene and, therefore, its absolute mean depth. Using a probabilistic learning approach based on Fourier transforms of the image, they show that the general aspect of some spectral signatures changes with depth, along with the spatial configuration of orientation and scale. What they obtain is a system designed to learn the relationship between structure and depth. In this way, pictures can be divided into semantic classes that can be used in specific tasks. A significant example is the estimation of the size of human heads according to the perspective and the scale of the picture.

Ravichandran and Trivedi [81] propose a technique in which motion information is extracted by identifying the temporal signature associated with the textured objects in the scene. The authors present a computational framework for motion perception in which motion analysis is carried out in the spatiotemporal frequency (STF) domain. Their work is based on the fact that a translation in spatiotemporal space causes the corresponding Fourier transform to be skewed in 3-D space, so that it can be expressed in terms of the motion parameters. In this STF analysis approach, a temporal filter with a non-symmetric and periodic transfer function is used to obtain a linear combination of the sequence of image frames with moving patterns. A localized Fourier transform is then computed to extract motion information. When the observer (or the camera) moves, motion is induced in the scene. If the optical parameters of the camera (such as relative position, orientation, and focal length) are known along with the motion characteristics of the image, the calculated disparity value can be used to extract distance information.

There are algorithms, even among those employing the same kinds of transformations mentioned before, that tend to use superimposed heuristics. These heuristics, or rules, have the function of filtering and validating the hypotheses expressed by the underlying system. They are usually related to physical constraints of the system and of the objects in the scene. The work by Guo et al. [83] describes an approach for recovering the structure of a moving target from a monocular image sequence. Assuming that the camera is stationary, the authors first use a motion algorithm to detect moving targets based on four heuristics derived from the properties of moving vehicles: maximum velocity, small velocity changes, coherent motion, and continuous motion. Subsequently, a second algorithm estimates the distance of the moving targets by using an over-constrained approach. They have also applied the approach to monocular image sequences captured by a moving camera to recover the 3-D structure of stationary targets such as trees, telephone poles, etc.

Adaptive algorithms form another important class of algorithms in machine vision. Basically, these approaches try to interpolate an unknown function in the output space, drawing upon features present in the images used during the training period. Which features, and how many, sometimes remain unresolved questions.
Since the training process can also be very expensive in terms of computational power, these kinds of approaches are normally divided into two different phases, the first being training and the second field deployment. An example of an adaptive algorithm is presented by Marshall [52]. This approach describes how a self-organizing neural network (SONN) can learn to perform the high-level vision task of discriminating visual depth by using motion parallax cues. The task is carried out without the assistance of an external teacher supplying the correct answers. A SONN can acquire sensitivity to visual depth as a result of exposure, during a developmental period, to visual input sequences containing motion parallax transformations.

Sometimes the absolute distance information is not available or is not requested by the problem to be solved. In these cases depth information is represented by interposition cues (see Paragraph 2.1.2), in which the depth order of the objects present in the scene must be extracted. Although Bergen et al. [12] focus their attention on the discovery of object boundaries, they also present a technique that can be used to extract the relative depth of the objects present in an image. They introduce a detailed analysis of the behavior of dense motion estimation techniques at object boundaries. Their analysis reveals that a motion estimation error is systematically present and localized in a small neighborhood on the occluded side of the objects. They show how the position of this error can be used as a depth cue by exploiting the density of erroneous motion measurements to determine what type of discontinuity is being observed. Intensity discontinuities have essentially three origins: object boundaries, surface marks (for instance, different reflective properties on an object's surface), and illumination discontinuities. For all these types, the error density is different and thus amenable to classification. The straightforward application of this observation is then the retrieval of object boundaries and occlusions.

Fourier analysis, motion parallax, and focus are not the only sources of depth information. Some types of aberrations of the optical system, as well as natural phenomena, can be exploited to extract distance information intrinsically dependent on the phenomenon. For example, in the human visual system there is a considerable longitudinal chromatic aberration1, so that the retinal image of an edge will have color fringes: red fringes for under-accommodation (focus behind the retina) and blue fringes for over-accommodation (focus in front of the retina). Fincham [29] showed, using achromatizing lenses and monochromatic light, that accommodation is impaired when chromatic aberration is removed, and suggested that a chromatic mechanism, sensitive to the effects of chromatic aberration, could provide a directional cue for accommodation. Garcia [38] presents chromatic aberration as a source of visual depth information. This system takes three images of the same scene, each with a different focal length. Using these images and the suggested mathematical model, the authors show that the spread function of the blur is inversely proportional to the distance of the object.

1 As we have seen, chromatic aberration is the variation in the focal length of a lens with respect to the wavelength of the light that strikes it.

Cozman and Krotkov [20] present the first analysis of the atmospheric scattering2 phenomenon from an image.
They investigate a group of techniques for the extraction of depth cues solely from the analysis of atmospheric scattering effects in images. Depth-from-scattering techniques are discussed for indoor and outdoor environments, and experimental tests with real images are presented. They found that depth cues in outdoor scenes can be recovered with surprising accuracy and can be used as an additional information source for autonomous vehicles (see Paragraph 2.1.5 for further details).

3.1.3 Depth from camera parameter variations

This class of algorithms collects all those systems that require variations of the camera parameters, such as focus, contrast, etc., and that use the resulting variation of the image on the sensor to obtain depth information. These algorithms involve interaction between the camera and the algorithm, and they are often performed in real time.

Honig [48] proposed an example of this class of algorithm. The depth perception presented in this work is based on a method that exploits the physical fact that the imaging properties of an optical system depend on the acquisition parameters and on the object distance. The author suggests an edge-oriented approach which takes advantage of the structure of the visual cortex proposed by Marr and Hildreth [69]. Honig's basic principle is to compare the blur in two defocused images of the same scene taken with different apertures (depth from defocus). Modeling the optics as a linear system, the blurring process can be interpreted as filtering the sharp image with a defocus operator, so that the problem of depth sensing reduces to identifying the defocus operator in the image.
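As a loose illustration of the principle behind depth from defocus (not of Honig's specific operator-identification procedure), the sketch below compares a simple sharpness measure, the variance of a discrete Laplacian, between two registered grayscale images of the same scene taken with different apertures. The function names and the choice of sharpness measure are illustrative assumptions, not part of the methods reviewed here.

    import numpy as np

    def laplacian_variance(img):
        # variance of a discrete Laplacian, used here as a crude sharpness measure
        img = img.astype(float)
        lap = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0) +
               np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1) - 4.0 * img)
        return lap[1:-1, 1:-1].var()          # drop the wrap-around border

    def relative_blur(img_narrow_aperture, img_wide_aperture):
        # ratio > 1 means the wide-aperture image is noticeably more defocused;
        # for a fixed focus setting, this constrains the distance of the scene
        return laplacian_variance(img_narrow_aperture) / laplacian_variance(img_wide_aperture)

Mapping such a blur ratio to an absolute distance requires a calibrated model of the defocus operator of the optical system, which is precisely what the methods described above estimate.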
3.1.4 Depth from camera parameter variations due to movement

This class of on-line algorithms is quite similar to the previous one, the only difference being that the camera parameters are not changed directly but through a movement that modifies them indirectly. Throughout the robotic literature, the most common approach is the measurement of the image blur due to camera movements. One example can be found in the work carried out by Lee [60]. In this approach, a method of constructing 3-D maps is presented based on the relative magnification and blurring of a pair of images, where the images are taken at two camera positions separated by a small displacement. Due to this displacement, the depth information is captured not only by blurring but also by magnification. That is, an image captured by the camera at a certain position can be considered as a blurred version of its well-focused counterpart, i.e., the convolution of the well-focused image with the blurring function. The authors also consider that the two images have different magnification factors, which can be described in terms of the relationship between the two well-focused images. The correspondence between the well-focused images and the two images taken at different camera positions can be expressed by a Taylor series expansion, but only when the distance between the two camera positions is small enough. In this way, the relationship between the two images from the two camera positions, taking into account both blurring and magnification effects, can be formulated in the frequency domain. The ratio between the Fourier transforms of the two images can be expressed in terms of the magnification and blurring factors, which contain the depth information to be extracted. The method, referred to as Depth from Magnification and Blurring, aims to generate a precise 3-D map of local scenes or objects to be manipulated by a robot arm with a hand-eye camera. The method uses a single standard camera with a telecentric lens and assumes neither active illumination nor active control of the camera parameters. By fusing the two disparate sources of depth information, magnification and blurring, the proposed method provides a more accurate and robust depth estimation.

Another approach in which blurred images are used can be found in [56]. This research presents an approach for the determination of depth as a function of blurring for automated visual inspection in VLSI wafer probing. There exists a smooth relationship between the degree of blur and the distance of a probe from a test pad on a VLSI chip; therefore, by measuring the amount of blurring, the distance from contact can be estimated. The effect of blurring on a point object is then studied in the frequency domain, and a monotonic relationship is found between the degree of blur and the frequency content of the image. Fourier feature extraction, with its inherent property of shift invariance, is used to extract significant feature vectors, which carry information on the degree of blur and, hence, on the distance from the probe. In this case, the authors employed ANNs to map these feature vectors onto the actual distances; the network is then used in recall mode to linearly interpolate the distance corresponding to the significant Fourier features of a blurred image.

3.2 Active depth finding

In contrast to passive methods, which extract distance information from single or paired images, active depth-finding strategies include all systems that actively produce a controlled energy beam and measure the reflection of this energy from the environment. These approaches generally include all types of sonic, ultrasonic, and laser systems, and are used for robotic exploration and guidance. The analysis performed by some of them not only extracts distance, but also measures the whole structure (or part of it) of the environment under observation. In these techniques, the scene is lit by a light sheet, usually produced with a laser beam and a device able to spread the beam across the scene (typically a rotating cylindrical lens). The sheet is swept across the scene, producing a single light stripe for each position; each stripe is acquired by a camera, recorded, and then processed either on-line or off-line (see [80] for further details).

The class of time-of-flight algorithms can be considered the oldest and simplest approach to the problem of distance measurement. The three main physical phenomena employed are ultrasonic sources, radar, and laser sources. Using one of these three types of energy, the interval of flight between the emission of an impulse (a pressure wave for ultrasound, electromagnetic radiation for radar and laser) and the recording of its echo by a coaxial sensor is the parameter used to estimate depth information.
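A minimal sketch of the underlying range computation, for both a pulsed emitter and the phase-shift (continuous-wave) scheme used by the laser range finders described below; the only physical parameter involved is the propagation speed (about 343 m/s for ultrasound in air, about 3·10^8 m/s for radar and laser). The function names and default values are illustrative.

    import math

    def pulse_range(round_trip_time_s, speed_m_s=3.0e8):
        # one-way range from the round-trip time of flight of a single pulse
        return speed_m_s * round_trip_time_s / 2.0

    def cw_phase_range(phase_shift_rad, modulation_freq_hz, speed_m_s=3.0e8):
        # range from the phase shift of a continuously modulated beam;
        # unambiguous only within half a modulation wavelength
        return speed_m_s * phase_shift_rad / (4.0 * math.pi * modulation_freq_hz)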
No image analysis is involved, nor are any assumptions required about planarity or other properties of the objects in the scene. Laser exploration of the environment can be performed with two types of laser range finders, both based on the time of flight to and from the object point whose range is sought. The first type measures the phase shift of a continuous-wave modulated laser beam as it leaves the source and returns coaxially to the detector. The second measures the time a laser pulse takes to travel from the source, bounce off a target point, and return coaxially to a detector. Some interesting works are presented in [65, 54, 7, 86]; since their description exceeds the scope of this thesis, we refer the interested reader to them for further details. Many works use ultrasound range finders; one interesting example is proposed by Santos [85], in which ultrasound sensors are coupled with Artificial Neural Networks (ANNs) in order to pre-process range information in a consistent fashion.

Although active depth finding can be, in many cases, more precise and accurate than passive approaches, laser techniques are hindered by fog, rain, and variations in lighting, and ultrasound devices are disrupted by wind, temperature variations, and strong absorption with distance. For this reason, radar techniques for outdoor applications have been introduced. An interesting work can be found in [73]: it is based on Frequency Modulated Continuous Wave (FMCW) radar and is intended for applications in mobile outdoor robotics, such as positioning an agricultural machine to reduce the environmental impact of spreading practices. The authors show that the spectral analysis methods used to process the radar signals provide an accuracy of a few centimeters.

Chapter 4
The model

Contents
4.1 Eye movement parallax on a semi-spherical sensor . . . . . 59
4.2 Eye movement parallax on a planar sensor . . . . . 63
4.3 Theoretical parallax of a plane . . . . . 66
4.4 Replicating fixational eye movements in a robotic system . . . . . 68

Parallax effects observed during sensor rotations (whether of a human eye or of a digital camera) depend on the characteristics of the optical system in which the sensor is incorporated and on the observed object. In fact, the movement of the projection of the object is strictly related not only to the optical system and its inner structure (position of the cardinal points, magnification power, etc.), but also to the relative position of this system with respect to the center of rotation of the whole assembly (sensor and optical system). Figure 4.1 illustrates how parallax emerges during the rotation of a semi-spherical sensor similar to the human eye: the projections of the points A and B depend on both the rotation angle and the distance to the nodal point.

4.1 Eye movement parallax on a semi-spherical sensor

In order to evaluate the projection of the whole object space on the retina, we use a modified version of the eye model proposed by Gullstrand [1]. This model (see Figure 4.2a) is characterized by two nodal points located at 7.08 mm and 7.33 mm from the corneal vertex.
The lens has an average index of refraction of 1.413, and the axial distance from cornea to retina is 24.4 mm (all measurements refer to an unaccommodated eye). These parameters ensure that parallel rays entering the eye focus exactly on the retina. The original model does not include the radius of the ocular bulb; however, a value of 11 mm is the consensus in the vision literature [40].

Figure 4.1 – Parallax can be obtained when a semi-spherical sensor (like the eye) rotates. The projections of the points A and B depend both on the rotation angle and on the distance to the nodal point. The circle on the right highlights how a misalignment between the focal point (for the sake of clarity, a simple lens with a single nodal point is considered) and the center of rotation yields a projection movement that depends not only on the rotation but also on the distances of A and B.

Using Gullstrand's eye approximation, let A be a pinpoint light source (see Figure 4.2b) located in the object space at coordinates (−da sin α, da cos α). The projection x̃a of A on the retina is generated by the ray of light AN1 and satisfies the following condition:

\begin{vmatrix} x & y & 1 \\ -d_a \sin\alpha & d_a \cos\alpha & 1 \\ 0 & N_1 & 1 \end{vmatrix} = 0

which has the following solution:

x (d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_1 \sin\alpha = 0    (4.1)

Figure 4.2 – (a) Gullstrand's three-surface reduced schematic eye. (b) Geometrical assumptions and axis orientation for a semi-spherical sensor. An object A is positioned in space at a distance da from the center of rotation C of the sensor, at an angle α from the y axis; N1 and N2 are the positions of the two nodal points of the optical system. The object A projects an image xa on the sensor, whose position is indicated by θ.

Since, in a two-nodal-point optical system, a ray of light entering the first nodal point at a certain angle α̃ emerges from the second nodal point at the identical angle (see Paragraph 1.2.2 for further details), the actual projection xa of A on the sensor belongs to the line outgoing from the second nodal point N2 and parallel to AN1:

x (d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_2 \sin\alpha = 0    (4.2)

Equation (4.2) can be expressed in polar coordinates:

r = \frac{d}{\cos(\theta - \gamma)}, \quad \text{where} \quad d = \frac{\left| -d_a N_2 \sin\alpha \right|}{\sqrt{N_1^2 + d_a^2 - 2 N_1 d_a \cos\alpha}}    (4.3)

and

\gamma = \frac{\pi}{2} - \tilde{\alpha}    (4.4)

The angle α̃ is generated by the intersection of the line AN1 with the y axis:

\tilde{\alpha} = \arctan\!\left( -\frac{d_a \sin\alpha}{d_a \cos\alpha - N_1} \right)    (4.5)

The projection of A on the retina is given by the intersection of (4.3) with the surface of the sensor, which has equation r = R:

\theta = \arccos\!\left( \frac{d}{R} \right) - \tilde{\alpha} + \frac{\pi}{2}    (4.6)

Substituting (4.4) and (4.5) into (4.6), we obtain:

\theta = \arccos\!\left( \frac{\left| -d_a N_2 \sin\alpha \right|}{R \sqrt{N_1^2 + d_a^2 - 2 N_1 d_a \cos\alpha}} \right) + \arctan\!\left( \frac{d_a \sin\alpha}{d_a \cos\alpha - N_1} \right) + \frac{\pi}{2}    (4.7)

where θ represents the projection angle of the pinpoint light source A on the sensor, given the cardinal points of the optical system and the physical position of A.

Figure 4.3 – (a) Variations of the projection point on a semi-spherical sensor obtained from (4.7), considering an object at distance da = 500 mm, with N1 = 6.07 mm and N2 = 6.32 mm; (b) detail of the rectangle in (a): two objects at distances da = 500 mm and da = 900 mm produce two slightly different projections on the sensor.
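A minimal numerical sketch of (4.7), assuming distances in millimetres and the example values used for Figure 4.3 (N1 = 6.07 mm, N2 = 6.32 mm, R = 11 mm); the function name and the example call are illustrative only.

    import math

    def retinal_projection_angle(d_a, alpha, N1=6.07, N2=6.32, R=11.0):
        # projection angle theta of a pinpoint source at distance d_a (mm)
        # and eccentricity alpha (rad), following equation (4.7)
        chord = math.sqrt(N1 ** 2 + d_a ** 2 - 2.0 * N1 * d_a * math.cos(alpha))
        return (math.acos(abs(d_a * N2 * math.sin(alpha)) / (R * chord))
                + math.atan2(d_a * math.sin(alpha), d_a * math.cos(alpha) - N1)
                + math.pi / 2.0)

    # two objects at the same eccentricity but different distances project to
    # slightly different retinal angles: this is the parallax exploited here
    print(retinal_projection_angle(500.0, math.radians(2.0)) -
          retinal_projection_angle(900.0, math.radians(2.0)))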
According to Cumming [15], the primary visual cortex of primates is more specialized for processing horizontal disparities; for this reason, only horizontal displacements are considered in the model. Nevertheless, since all cardinal points are located on the visual axis, the vertical component of the displacements can be obtained, without loss of correctness, simply by rotating the reference frame around the y axis.

4.2 Eye movement parallax on a planar sensor

When the sensor is planar, the geometry of the problem is shown in Figure 4.4. Let us consider two rays of light outgoing from the pinpoint light source A and hitting the planar sensor in two different points: a projection x̃a generated by the ray passing through the point N1, and a projection x̂a obtained by the ray passing through the point C. The ray AN1 is described as follows:

x (d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_1 \sin\alpha = 0    (4.8)

As mentioned before, in this optical system a ray of light entering the first nodal point at an angle α̃ emerges from the second nodal point at the identical angle. The ray parallel to AN1 emerging from N2 cuts the planar sensor, which has equation y = −dc, at the intersection:

x_a = \frac{d_a f \sin\alpha}{d_a \cos\alpha - N_1}    (4.9)

where f = dc + N2. From Figure 4.4b, we can also calculate x̂a, which is equal to:

\hat{x}_a = d_c \tan\alpha    (4.10)

Figure 4.4 – (a) Schematic representation of the right eye of the oculomotor system with its cardinal points. The L-shaped support places N1 and N2 in front of the center of rotation C of the robotic system. (b) Geometrical assumptions and axis orientation for a planar sensor. An object A is positioned in space at a distance da from the center of rotation C of the sensor, at an angle α from the y axis; N1 and N2 are the positions of the two nodal points of the optical system. The object A projects an image xa on the sensor; x̂a represents the projection of A when the distance between N1 and C is considered null.

Equation (4.10) can be considered the projection of the object A on the sensor without the contribution of the parallax effect, that is, when the point C and the nodal points coincide. It can be noticed that (4.10) supplies no information about the absolute distance of the object, but describes only an infinite locus of possible positions of A. Equation (4.9), instead, can be simply inverted, yielding da as absolute distance information:

d_a = \frac{x_a N_1}{x_a \cos\alpha - f \sin\alpha}    (4.11)

Observing the previous equation, it can be noticed that in order to calculate the correct distance of the pinpoint light source A from the center of rotation C, two parameters are required: the sensor projection xa of A and its angle α from the visual axis (coincident with y). Note that the absolute parallax |xa − x′a| alone cannot be used directly to determine the distance da without taking the angle α into account. In fact, consider two objects A and B at the same distance from the center of rotation C but at different eccentricities αA > αB (smaller values of α correspond to a lower eccentricity). When a rotation occurs, the object projections move on the surface of the sensor not only according to the distance da but also according to the distance between N2 and xa and the angle α: a projection moving in the periphery of the sensor produces a larger parallax than a projection moving in the center.

Figure 4.5 – Curve A represents the variation of the projection xa on a planar sensor obtained from (4.9), considering an object at distance da = 500 mm, with N1 = 49.65 mm and N2 = 5.72 mm (see Paragraph 5.1.3 for further details); Curve B represents the same optical configuration with the object at distance da = 900 mm. Curve C represents the variation of x̂a given by (4.10) when the nodal points N1 and N2 collapse into the center of rotation of the sensor; this curve does not change with distance.
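A direct transcription of (4.9) and (4.11), useful mainly as a sanity check that the two are inverses of each other. The parameter values in the example are the calibration values quoted for Figure 4.5 (and obtained in Paragraph 5.1.3), used here purely for illustration; the function names are illustrative.

    import math

    def planar_projection(d_a, alpha, N1, N2, d_c):
        # projection x_a on the planar sensor, equation (4.9); f = d_c + N2
        f = d_c + N2
        return d_a * f * math.sin(alpha) / (d_a * math.cos(alpha) - N1)

    def distance_from_projection(x_a, alpha, N1, N2, d_c):
        # absolute distance d_a recovered from x_a and alpha, equation (4.11)
        f = d_c + N2
        return x_a * N1 / (x_a * math.cos(alpha) - f * math.sin(alpha))

    N1, N2, d_c = 49.65, 5.72, 10.52          # mm, values from Paragraph 5.1.3
    x_a = planar_projection(500.0, math.radians(5.0), N1, N2, d_c)
    print(distance_from_projection(x_a, math.radians(5.0), N1, N2, d_c))   # ~500 mm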
Although the values of α in (4.11) are not directly known, it is possible to perform an indirect measurement of this parameter by considering different projections of the object A on the sensor. Rotating the reference axes around the point C by a known angle Δα, and taking into account that this rotation does not change the absolute distance da (as shown in Figure 4.6), it is possible to use (4.11) to write:

\frac{x_a N_1}{x_a \cos\alpha - f \sin\alpha} = \frac{x'_a N_1}{x'_a \cos(\alpha + \Delta\alpha) - f \sin(\alpha + \Delta\alpha)}

where x′a is the new projection of the object A in the rotated reference frame. Inverting the previous relationship for α, we obtain:

\alpha = \arctan\!\left[ \frac{x'_a (1 - \cos\Delta\alpha) + f \sin\Delta\alpha}{f \left( \frac{x'_a}{x_a} - \cos\Delta\alpha \right) - x'_a \sin\Delta\alpha} \right]    (4.12)

which relates α only to the projections on the sensor before and after a rotation Δα.

Figure 4.6 – The clockwise rotation of the system around the point C leaves the absolute distance da unchanged and allows a differential measurement of the angle α (the angle before the rotation) to be obtained using only Δα and the two projections xa (before the rotation) and x′a (after the rotation).

4.3 Theoretical parallax of a plane

Let us consider, as in Figure 4.7, a plane P at a distance D in front of the sensor. Applying (4.9) to a generic point A on the plane and considering the distance D and the cardinal points of the optical system, it is possible to calculate the theoretical value of the parallax after a rotation Δα of the reference frame. Since da = D / cos α, equation (4.9) becomes:

x_a = \frac{D f \tan\alpha}{D - N_1}    (4.13)

The parallax of a point A on the plane, Δx = xa − x′a, is therefore:

\Delta x = \frac{D f}{D - N_1} \left[ \tan\alpha - \tan(\alpha + \Delta\alpha) \right]    (4.14)

Figure 4.7 – Geometrical assumptions and axis orientation needed to calculate the theoretical parallax. Considering the plane P at distance D from the center of rotation C of the camera, it is possible to calculate the expected parallax.

Figure 4.8 reports the parallax values for planes at different distances D. The asymmetry of the curve for a distance D = 300 mm is due to the points of the plane moving either with or against the rotation of the sensor. In fact, during an anticlockwise rotation, all points on the plane change their value of α but not their da. Points with α > 0 before the rotation decrease this value during the rotation and project onto a more central area of the sensor; this central area, for the same rotation Δα, produces a smaller parallax. Points with α < 0 increase their value of α during the rotation and project onto a more peripheral area of the sensor, with consequently larger parallax values. Therefore, a smaller parallax on the left side of the curve indicates an anticlockwise rotation, while a smaller parallax on the right side indicates a clockwise rotation.

Figure 4.8 – Horizontal theoretical parallax for planes at different distances, using an anticlockwise rotation of Δα = 3.0°. Although the asymmetry due to the rotation is present in all the curves, it is more evident at shorter distances.
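Putting (4.9), (4.12), and (4.11) together gives the complete monocular recovery of distance from two projections of the same point taken before and after a known rotation Δα. The sketch below generates the two projections synthetically with (4.9) and then recovers α and da; the parameter values are again the calibration values of Paragraph 5.1.3, and the whole example is an illustrative transcription of the equations, not the thesis implementation.

    import math

    N1, N2, d_c = 49.65, 5.72, 10.52      # mm (Paragraph 5.1.3)
    f = d_c + N2

    def projection(d_a, alpha):
        # equation (4.9): projection of a point at distance d_a, eccentricity alpha
        return d_a * f * math.sin(alpha) / (d_a * math.cos(alpha) - N1)

    def recover_depth(x_a, x_a_rot, d_alpha):
        # recover (alpha, d_a) from the projections before (x_a) and after
        # (x_a_rot) a known rotation d_alpha, using (4.12) and then (4.11)
        alpha = math.atan(
            (x_a_rot * (1.0 - math.cos(d_alpha)) + f * math.sin(d_alpha)) /
            (f * (x_a_rot / x_a - math.cos(d_alpha)) - x_a_rot * math.sin(d_alpha)))
        d_a = x_a * N1 / (x_a * math.cos(alpha) - f * math.sin(alpha))
        return alpha, d_a

    # synthetic test: a point at 500 mm and 5 deg of eccentricity, camera rotated by 3 deg
    d_alpha = math.radians(3.0)
    x_before = projection(500.0, math.radians(5.0))
    x_after = projection(500.0, math.radians(5.0) + d_alpha)
    print(recover_depth(x_before, x_after, d_alpha))   # alpha ~ 0.087 rad, d_a ~ 500 mm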
4.4 Replicating fixational eye movements in a robotic system

We have seen in Paragraph 1.5 that there are many types of human eye movements, which can be classified by magnitude and speed. However, due to issues such as the relatively small dimensions of the retina, the response of mechanical parts to stress and movement, and the minuteness of the actuators, it is not technologically possible to reproduce the whole spectrum of eye movements. Moreover, the human retina and the robot camera have very different resolutions, and the corresponding optical systems have cardinal points located in considerably different positions. Despite these issues, theoretical considerations offer the possibility of replicating some of these movements, namely saccades and fixational eye movements. We leave to Paragraph 5.4.2 the treatment of the simulation of saccadic movements in the robot. The possibility of replicating fixational eye movements, on the other hand, is offered directly by comparing projections measured in pixels on a planar sensor with the corresponding photoreceptors on the retina.

We have seen before (see the discussion of the hyperacuity phenomenon) that humans are able to resolve points in the retinal projection of the world well beyond the physical dimension of the photoreceptors. In contrast, the resolving power of the robot is directly proportional to the cell size of the sensor grid. Consequently, considering the difference between the actual resolution calculated from the dimension of the physical photoreceptors and the resolution achieved in hyperacuity tasks, we introduce the concept of a virtual photoreceptor, whose dimension is about 30 times smaller than the photoreceptors present in the fovea (7 arcsec [105] for two-point acuity).

Let us consider the projection of a ray on the semi-spherical sensor which hits a specific virtual photoreceptor. A small rotation Δαs moves this projection to the adjacent virtual photoreceptor. Similarly, a rotation Δαp moves a projection from one pixel to the next. It is therefore possible to calculate a constant relating Δαs and Δαp by considering the shift of a single sensitive cell (virtual photoreceptor and pixel, respectively).

Let us consider a fixational eye movement Δα which translates the projection of the pinpoint light source A from the perpendicular position on the sensor to a new position. Since the angle Δα can be considered small for fixational eye movements, we can calculate the variation of the retinal projection using a simpler approximation of (4.7). From Figure 4.9, we can deduce the following geometrical relations:

h \sin\Delta\tilde{\alpha} = R \sin\Delta\theta, \qquad h \cos\Delta\tilde{\alpha} = N_2 + R \cos\Delta\theta \;\;\Rightarrow\;\; \tan\Delta\tilde{\alpha} = \frac{R \sin\Delta\theta}{N_2 + R \cos\Delta\theta}

and

d_a \sin\Delta\alpha = \tilde{d}_a \sin\Delta\tilde{\alpha}, \qquad d_a \cos\Delta\alpha = N_1 + \tilde{d}_a \cos\Delta\tilde{\alpha} \;\;\Rightarrow\;\; \tan\Delta\tilde{\alpha} = \frac{d_a \sin\Delta\alpha}{d_a \cos\Delta\alpha - N_1}

which can be compared:

\frac{R \sin\Delta\theta}{N_2 + R \cos\Delta\theta} = \frac{d_a \sin\Delta\alpha}{d_a \cos\Delta\alpha - N_1}    (4.15)

Figure 4.9 – Geometrical assumptions and axis orientation for a small movement. The sensor S is no longer considered a single continuous semi-spherical surface, but is quantized according to the dimension of the virtual photoreceptors.

Equation (4.15) is valid for every angle, but it cannot easily be inverted to obtain Δθ. However, considering that Δθ and Δα are small, and that da ≫ N1, it becomes:

\Delta\theta \simeq \frac{R + N_2}{R}\, \Delta\alpha    (4.16)

Therefore, the shift on the retina Δxs for a small movement Δαs is:

\Delta x_s \simeq \frac{(R + N_2)^2}{R}\, \Delta\alpha_s    (4.17)

Let us now consider the corresponding approximation of the planar sensor projection described in (4.9).
Under the same assumptions, namely that Δαp is small and da ≫ N1, (4.9) becomes:

\Delta x_a \simeq (d_c + N_2)\, \Delta\alpha_p    (4.18)

Therefore, the geometric factor which allows fixational eye movements to be transformed into movements of the planar sensor can be obtained by equating the shift of one photoreceptor on the camera with the shift of one virtual photoreceptor on the retina:

\left\lfloor \frac{1}{S_s}\, \frac{(R + N_{2s})^2}{R}\, \Delta\alpha_s \right\rfloor = \left\lfloor \frac{1}{S_p}\, (d_c + N_{2p})\, \Delta\alpha_p \right\rfloor    (4.19)

where Ss and Sp are, respectively, the dimension of the virtual photoreceptor on the semi-spherical sensor and the dimension of the photoreceptor cell on the planar sensor; N2p and N2s are the positions of the second nodal point in the optical system of the camera and in Gullstrand's model (see Figure 4.4a); R and dc are the radius of the semi-spherical sensor and the distance of the planar sensor from the center of rotation of the camera; Δαs is the movement of the semi-spherical sensor. Therefore, the relation which binds a rotation Δαs of the semi-spherical sensor to the corresponding Δαp of the planar sensor, here named the geometrical amplification relation, is:

\Delta\alpha_p \simeq G_{af}\, \Delta\alpha_s    (4.20)

where Gaf is called the geometrical amplification ratio and is expressed as:

G_{af} = \frac{S_p}{S_s}\, \frac{(R + N_{2s})^2}{R\, (d_c + N_{2p})}

It can be noticed that only the second nodal points N2p and N2s enter the geometric relation (4.20).
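A one-line transcription of the geometrical amplification ratio. The cell sizes and cardinal-point values in the example are taken from later sections (Sp = 9 µm, Ss = 7.93 µm, R = 11 mm, N2s = 6.32 mm, N2p = 5.72 mm, dc = 10.52 mm) and are used here only as plausible example numbers; with these values the sketch returns a value close to the Gaf = 1.89 reported in Paragraph 5.4.1.

    def geometrical_amplification_ratio(S_p, S_s, R, N2_s, N2_p, d_c):
        # G_af from equation (4.20): maps a rotation of the semi-spherical sensor
        # (eye) into the equivalent rotation of the planar sensor (camera)
        return (S_p / S_s) * (R + N2_s) ** 2 / (R * (d_c + N2_p))

    # micrometres for the cell sizes, millimetres for the geometry
    print(geometrical_amplification_ratio(S_p=9.0, S_s=7.93, R=11.0,
                                          N2_s=6.32, N2_p=5.72, d_c=10.52))   # ~1.9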
Chapter 5
Discussion of the experiment results

Contents
5.1 The oculomotor robotic system . . . . . 72
5.1.1 Pinpoint light source (PLS) . . . . . 74
5.1.2 Nodal points characterization . . . . . 74
5.1.3 System Calibration . . . . . 75
5.2 Recording human eye movements . . . . . 78
5.3 Experimental results with artificial movements . . . . . 81
5.3.1 Depth perception from calibration movements . . . . . 82
5.3.2 Depth perception in a natural scene . . . . . 83
5.4 Experimental results with human eye movements . . . . . 85
5.4.1 Depth and fixational eye movements . . . . . 86
5.4.2 Depth and saccadic eye movements . . . . . 89

An oculomotor robotic system able to replicate, as accurately as possible, the geometrical and mobile characteristics of a human eye, and the novel mathematical model of 3-D projection on a 2-D surface which exploits human eye movements to obtain depth information, were both employed in a set of tasks designed to characterize their properties and prove their reliability in the estimation of depth cues.

5.1 The oculomotor robotic system

Human eye movements were replicated in a robotic head/eye system that we developed expressly for this purpose. The oculomotor system (see Figure 5.1) was composed of an aluminum structure sustaining two pan-tilt units (Directed Perception Inc. - CA), each with 2 degrees of freedom and one center of rotation. These units were digitally controlled by proprietary microprocessors which assured a movement precision of 0.707 arcmin. Specifically designed aluminum wings allowed the cameras to be mounted in such a way that the center of rotation is always maintained between the sensor plane and the nodal points (as shown in Figure 5.1a).

Figure 5.1 – The robotic oculomotor system. Aluminum wings and sliding bars mounted on the cameras made it possible to maintain the distance between the center of rotation C and the sensor S.

A sliding bar furnished with calibrated holes allowed the distance between the center of rotation C and the sensor S to be set at a value dc = 10.50 mm. This configuration recreated the position of the retina in the human eye, which has been estimated at about R = 11.00 mm [40]. Images were captured by one of the two digital cameras (Pulnix Inc. - CA) mounted on the left and right pan-tilt units respectively. Each camera had a 640 × 484 CCD sensor with square 9 µm photoreceptors and a 120 frames/s acquisition rate. The optical system consisted of two TV optical zooms capable of focal length regulation between 11.5 mm and 69 mm (only the 11.5, 15.0, 25.0, 50.0, and 69.0 mm markers were available on the lens body) and a shutter for luminance intensity control. Camera signals and images were acquired by a fast frame grabber (Datacube Inc. - MA) during the movement of the robot, and later saved on a PC for subsequent analysis.

5.1.1 Pinpoint light source (PLS)

For the calibration procedure of the oculomotor system and for the first series of validations of the model, an approximation of a pinpoint light source (called object A in Figure 4.4b, and henceforth PLS) was required. This source was built using an ultra-bright red LED (see its characteristics in Table 5.1). The semiconductor light source was covered with a 3 mm pierced steel mask to guarantee minimum variance of the shape of the PLS projection during rotations and to mask undesired bright aberrations due to LED imperfections.

Table 5.1 – Characteristics of the pinpoint source semiconductor light.
Emitted Color: Ultra red (640 − 660 nm)
Size: 5 mm (T-1 3/4)
Lens Colour: Water Clear
Luminous Intensity: 2000 mcd (typical) − 3000 mcd (max)
Viewing Angle: 15°
Max Power Dissipation: 80 mW (Ta = 25 °C)

Figure 5.2 – Pinpoint light source (PLS) structure.

5.1.2 Nodal points characterization

In this first stage of the experimental setup, a study of the two nodal points of the lens was performed in order to calculate the trajectory of N1p and N2p along the visual axis as a function of the focal length of the optical system. The lens manufacturer provided an estimate of the nodal point positions at focal lengths of 11.5 mm and 69.0 mm (see Figure 5.3a). If the trajectory of the cardinal points along the visual axis inside the lens is assumed to be linear, the positions of N1p and N2p can be summarized as in Figure 5.3b.

A first experiment was performed to verify whether the model expressed by (4.9) was able to predict the position of the cardinal points N1p and N2p as characterized in Figure 5.3b. The PLS was placed at a set of different distances (between 300 mm and 900 mm, in intervals of 100 mm) from the center of rotation of the camera (from now on, all distances in the experimental setups are referred to the center of rotation C of the camera-lens system). The same set of distances was repeated for different focal lengths (11.5, 15.0, 25.0, 50.0, and 69.0 mm) of the optical system (focus at infinity). For each of these distances (and focal lengths), the same camera rotation Δα = 3.0° was applied, and the corresponding PLS centroid was sampled. The shutter of the lens was set to a value of 5.6 in order to eliminate all background details in the image and keep only the PLS projection visible. The N1p and N2p values were estimated by a regularized least square. The results in Figure 5.5 indicate that the model (4.9) was able to predict the correct position of the nodal points given the distance of the object da and the angle of rotation Δα.
Furthermore, the results indicate that the hypothesis of a linear trajectory of N1p and N2p inside the lens was correct. The inversion of the order (N2p, N1p, then C, opposite to the regular configuration N1p, N2p, then C) was consistent with the specifications of the optical system provided by the manufacturer, which predicted an inversion for focal lengths greater than 27.7 mm. From these results, we considered that the position of the points N1p, N2p, and C corresponded to the geometrical configuration of the human eye for focal lengths in the range 11.5 mm to 27.7 mm.

5.1.3 System Calibration

According to (4.9) and (4.7), the nodal point N2p appears to be the most relevant point in the model; in fact, N1 appears in the denominator only together with da, which is much bigger. A value of N2p = 6.03 mm in the robot would generate a position compatible with the human eye structure (see Figure 4.2a) and guarantee a coherent geometrical arrangement of the cardinal points (N1p, N2p, and C). Since, in (4.9), f is defined as dc + N2p, it was also possible to derive that the correct focal length was f = 16.55 mm. In order to set the nodal point position as specified above, the procedure described in the first experiment was modified accordingly, supplying a powerful tool for measuring the exact focal length of the optical system that does not rely on the markers present on the mechanical system of the lens.

Figure 5.3 – (a) Factory specifications for the optical system used in the robotic head (not to scale). The point C = 7.8 ± 0.1 mm represents the center of rotation of the camera-lens system, and it was measured directly on the robot. (b) Estimation of the positions of N1 and N2 with respect to C as a function of the focal length (focus at infinity).

Figure 5.4 – Experimental setup of the calibration procedure. The PLS was positioned on the visual axis of the right camera at distances depending on each trial.

Using the previous experimental setup, the PLS was placed at a set of different distances (between 300 mm and 900 mm, in intervals of 100 mm) with a specified focal length. For each of these distances, the same camera rotation Δα = 3.0° was applied, and the corresponding PLS projection was sampled. The N1p and N2p values were estimated by a regularized least square. This procedure was repeated with incremental focal lengths until a final value of N2p was obtained that was as close as possible to N2s = 6.03 mm. At the end of this preliminary phase of the calibration, the second nodal point was at position N2p = 5.60 mm.

In order to obtain a final, refined characterization of the optical system, the PLS was placed at a set of different distances (between 270 mm and 1120 mm, in intervals of 50 mm) from the robot. The focal length of the optical system had already been set in the previous stage. At each of these distances, a sequence of camera rotations (between 0.5° and 5.0°, in intervals of 0.5°) was applied, and the corresponding PLS projection was sampled. The collected data were used to estimate, by a regularized least square, all the parameters of (4.9). The final values confirmed the preliminary calibration and the geometrical structure of the oculomotor system, returning N1p = 49.65 mm, N2p = 5.72 mm, and dc = 10.52 mm.
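A minimal sketch of the kind of fit used in this calibration, assuming the PLS sits on the visual axis before a rotation Δα, so that the measured projection follows (4.9) with α = Δα. The thesis uses a regularized least square; the sketch below is a plain nonlinear least-squares fit of N1p and N2p (with dc assumed known), with synthetic measurements standing in for the sampled PLS centroids.

    import numpy as np
    from scipy.optimize import least_squares

    d_c = 10.52                                  # mm, assumed known

    def model_projection(params, d_a, d_alpha):
        # equation (4.9) with alpha = d_alpha (PLS initially on the visual axis)
        N1, N2 = params
        f = d_c + N2
        return d_a * f * np.sin(d_alpha) / (d_a * np.cos(d_alpha) - N1)

    def residuals(params, d_a, d_alpha, x_measured):
        return model_projection(params, d_a, d_alpha) - x_measured

    # synthetic "measurements": distances 300..900 mm, a fixed 3 deg rotation,
    # generated with N1 = 49.65 mm, N2 = 5.72 mm plus a little noise
    rng = np.random.default_rng(0)
    d_a = np.arange(300.0, 901.0, 100.0)
    d_alpha = np.full_like(d_a, np.radians(3.0))
    x_meas = model_projection((49.65, 5.72), d_a, d_alpha) + rng.normal(0.0, 0.001, d_a.size)

    fit = least_squares(residuals, x0=(40.0, 6.0), args=(d_a, d_alpha, x_meas))
    print(fit.x)    # estimated (N1, N2), recovering roughly the true values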
Figure 5.6 shows how a single movement of 3.0° affected the projection of the PLS at different distances. The model prediction, expressed by (4.9) and indicated in the figure with a solid line, fitted the experimental data, indicated with circles.

Figure 5.5 – The circles represent the estimated positions of the nodal points N1p and N2p for focal lengths of 11.5, 15.0, 25.0, and 50.0 mm (focus at infinity). The solid lines represent their interpolated trajectory. It can be noticed that the positions of the cardinal points shown in the graph are compatible with the trajectory reported by the manufacturer in Figure 5.3b.

5.2 Recording human eye movements

Eye movements performed by human subjects during the analysis of natural scenes were sampled with a Dual Purkinje Image (DPI) eyetracker (shown in Figure 5.8). The DPI eyetracker, originally designed by Crane and Steele [21, 22] (Fourward Technologies, Inc. - CA), illuminates the eye with a collimated beam of 0.93 µm infrared light and employs a complex combination of lenses and servo-controlled mirrors to continuously locate the positions of the first and fourth Purkinje images. The data returned by the eyetracker regarding the rotation of the eyes are the voltage potentials required to command the mirrors.

Figure 5.6 – Model fit (solid line) obtained by interpolating N1p = 49.65 mm and N2p = 5.72 mm from the data (circles). The underestimation of the projections below da = 270 mm is due to defocus, aberrations, and distortions introduced when the PLS is too close to the lens. In fact, since da is measured from the center of rotation C, the light source is actually less than 160 mm from the front glass of the lens. This position would be problematic even for a human being, who cannot focus on objects closer than a distance called the near point, equal to about 250 mm [74].

The subjects employed in the eye movement recordings were experienced subjects with normal vision (MV, 31 years old, and AR, 22 years old). Informed consent was obtained from the subjects after the nature of the experiment had been explained to them. The procedures were approved by the Boston University Institutional Review Board. Natural images were generated on a PC (Dell Computer Corp.) with a Matrox Millennium G550 graphics card (Matrox Graphics - Canada) and displayed on a 21-inch color Trinitron monitor (Sony Electronics Inc.). The screen had a resolution of 800 × 600 pixels and a vertical refresh rate of 75 Hz. The display was viewed monocularly with the right eye, while the left eye was covered by an opaque eye patch. The subject's right eye was at a distance of 110 cm from the CRT, and the subject's head was positioned on a chin rest.

Each experimental session started with preliminary setups which allowed the subject to adapt to the low luminance level in the room. These preliminary setups included: positioning the subject optimally and comfortably in the apparatus; adjusting the eyetracker until successful tracking was obtained; and running a short calibration procedure that allowed the eyetracker voltage outputs to be converted into degrees of visual angle and pixels on the screen.

Figure 5.7 – Results of the model fitting of the parallax obtained at different focal lengths (and different nodal points).

Figure 5.8 – The Dual-Purkinje-Image eyetracker. This version of the DPI eyetracker has a temporal resolution of approximately 1 ms and a spatial accuracy of approximately 1 arcmin.

Figure 5.9 – Basic principle of the Dual-Purkinje-Image eyetracker. Purkinje images are formed by light reflected from surfaces in the eye. The first reflection takes place at the anterior surface of the cornea, while the fourth occurs at the posterior surface of the lens of the eye, at its interface with the vitreous humor. Both the first and fourth Purkinje images lie in approximately the same plane in the pupil of the eye; since an eye rotation alters the angle of the collimated beam with respect to the optical axis of the eye, while translations move both images by the same amount, eye movement can be obtained from the spatial position of, and the distance between, the two Purkinje images.
This widely used procedure associates the subject's gaze at a specific point on the screen with the voltage potential produced by the eyetracker. A nine-point lattice, equally distributed on the screen, was sufficient to interpolate all pixel positions from the eyetracker voltage responses. After this calibration procedure, the actual experiment was run. In each session, the subject was required to observe the CRT and to move the eyes freely among the objects present in the scene. Each subject was requested to perform five different sessions. Eye movement data were sampled at 1 kHz, recorded by an analog/digital sampling board (Measurement Computing Corp. - MA), and stored on the computer hard drive for off-line analysis. The programs required to acquire the subjects' eye movement data and to perform the subsequent analysis were written in the Matlab environment (The MathWorks Inc. - MA).

5.3 Experimental results with artificial movements

The response of the model to the parallax measured with the robot was first tested with artificial movements, which offered perfect control of the system and supplied initial knowledge of the behavior of the mechanical and digital parts of the oculomotor equipment.

Figure 5.10 – Distance estimation calculated from projection data obtained by rotating the right camera by 3°.

5.3.1 Depth perception from calibration movements

Figure 5.10 shows how distance values were retrieved using the projection data collected during the final calibration procedure of Paragraph 5.1.3. The graph shows the estimated distance (highlighted with a solid line and circles) and the estimation error of the model (solid red line). The results show that the model is capable of calculating depth information, with an error that rises with the real distance of the PLS. This error is mainly due to the quantization introduced by the sensor during the sampling of the PLS centroid. In fact, when the object is distant, the projection of the PLS, moving in accordance with the rotation, falls within the same photoreceptor, thus introducing an error of about half the size of the photoreceptor cell of the cameras mounted on our robot (estimated to be 9 µm). This error is far less evident in human beings (at least for distances below 10 m) because hyperacuity (see Paragraph 1.1.2) introduces a quantization error that is about 30 times smaller than the physical photoreceptor in the retina [11].

To make this evident, Figure 5.11 shows the response of the model to a rotation Δα = 3° computed for two different quantizations: a planar sensor with 9 µm quantization is compared with one having a quantization 30 times smaller. It can be noticed that the estimation error of the higher-resolution sensor at 3 m is only 25.25 mm, whereas, at the same distance, the lower-resolution sensor has an estimation error of 788.1 mm. For this reason, the range over which the lower-resolution sensor has approximately the same estimation error as the higher-resolution sensor is about 30 times smaller.

Figure 5.11 – Error introduced by the quantization on the depth perception performance of the model. Distance estimation using a sensor with a 9 µm cell size (thick blue line) is compared with a sensor with a 0.3 µm cell size. The error introduced with the higher resolution becomes evident only after several meters.
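A short numerical sketch of the quantization argument: synthetic projections of a point are rounded to the cell pitch of the sensor before being fed back through (4.12) and (4.11), and the resulting distance error is compared for a 9 µm and a 0.3 µm cell. The parameter values are those of Paragraph 5.1.3; the exact error figures depend on the eccentricity and rotation chosen, so the printed numbers are indicative only.

    import math

    N1, N2, d_c = 49.65, 5.72, 10.52            # mm
    f = d_c + N2

    def projection(d_a, alpha):
        return d_a * f * math.sin(alpha) / (d_a * math.cos(alpha) - N1)     # eq. (4.9)

    def recover(x_a, x_a_rot, d_alpha):
        alpha = math.atan(
            (x_a_rot * (1 - math.cos(d_alpha)) + f * math.sin(d_alpha)) /
            (f * (x_a_rot / x_a - math.cos(d_alpha)) - x_a_rot * math.sin(d_alpha)))
        return x_a * N1 / (x_a * math.cos(alpha) - f * math.sin(alpha))     # eqs. (4.12), (4.11)

    def quantize(x_mm, cell_mm):
        return round(x_mm / cell_mm) * cell_mm

    d_alpha, alpha0 = math.radians(3.0), math.radians(5.0)
    for d_true in (500.0, 1000.0, 3000.0):
        for cell in (0.009, 0.0003):            # 9 um and 0.3 um cells, in mm
            xa = quantize(projection(d_true, alpha0), cell)
            xar = quantize(projection(d_true, alpha0 + d_alpha), cell)
            print(d_true, cell, recover(xa, xar, d_alpha) - d_true)

As expected from the discussion above, the error of the coarsely quantized sensor grows rapidly with distance, while the finer quantization keeps the estimate usable over a much larger range.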
5.3.2 Depth perception in a natural scene

In this experiment we exposed the model to a natural scene and calculated the parallax produced by an artificial movement. A square-shaped object was positioned in front of a background plane bearing the same texture (Irises, Van Gogh, 1889). The defocus effect produced by the difference in depth between the object and the background was compensated by exposure to a 500 W light and by a reduced aperture of the shutter. This scene composition guaranteed the almost complete invisibility of the object and reduced the number of other visual cues available in the resulting image. Two pictures were taken, before and after a 3.0° rotation (see Figures 5.12a and b), and then divided into 45 × 45 patches. A normalized cross-correlation algorithm was applied to search for the displacement of the patches.

As mentioned before (see Paragraph 4.1), the primary visual cortex of primates is more specialized for processing horizontal disparities; for this reason, we considered only the horizontal displacements of the patches in the parallax matrices. Figure 5.14 shows the parallax matrix obtained by applying the normalized cross-correlation to two images of an object situated 520 mm in front of the camera. Figure 5.15 shows the distances calculated for all the patches by applying the model described in (4.11) and (4.12). The model recovers the real distance of the object with good approximation.

Figure 5.12 – Semi-natural scene consisting of a background placed at a distance of 900 mm with a rectangular object placed in front of it at a distance of 500 mm. The object is an exact replica of the covered area of the background. (a) The scene before the rotation; (b) the same semi-natural scene after an anticlockwise rotation of Δα = 3.0°.

Figure 5.13 – The rectangular object hidden in the scene of Figure 5.12 is made visible here only by inducing an amodal completion effect with a sheet of paper.

5.4 Experimental results with human eye movements

The eye movements recorded during the sessions were analyzed, and the segments within each session were classified as saccades or fixational eye movements according to their velocities. In order to obtain the best performance from the robot, only one of the five sessions was chosen. Session selection was based on the simplicity of the movement paths, the absence of blinks, and the quality of the movements. The selected recordings were processed according to the final objective of each experiment, and the selected traces were used to produce motor commands for the robot. The speed of the eye movements was reduced in order to operate within the range of velocities that the robot could achieve reliably, without jeopardizing the spatial accuracy of the eye movement reproduction.
Figure 5.14 – Horizontal parallax produced by a rotation of Δα = 3° for the object in Figure 5.12a. It can be noticed how the general tendency of the surface corresponds to the theoretical parallax behavior predicted by equation (4.14).

5.4.1 Depth and fixational eye movements

A first experiment with human recordings was performed to verify whether the position of the PLS could be predicted from the fixational eye movements of each subject and the model expressed by (4.9). As mentioned in Paragraph 4.4, we introduced the geometrical amplification ratio Gaf, which can be calculated only once all the geometrical parameters of the optical system have been characterized. Therefore, considering the values reported for Gullstrand's model (see Section 4.1) and the interpolated values of N1p and N2p (see Section 5.1.3), as well as Sp = 9 µm and Ss = 7.93 µm, the geometrical amplification ratio was estimated to be Gaf = 1.89. (Due to the human visual system's remarkable performance in acuity tasks and to the technological limits of our robot, it was impossible to use the virtual photoreceptor size widely accepted by the psychophysics community, which is about 30 times smaller than the actual physical photoreceptor [11]; we therefore lowered the semi-spherical resolution to a more attainable Ss = 7.93 µm, corresponding to an eccentricity on the retina of 4°.)

Figure 5.15 – Distance estimation of the object in Figure 5.12a. The horizontal parallax matrix, obtained by positioning an object at 500 mm from the point C of the robotic oculomotor system, was post-processed using (4.11) and (4.12), thereby obtaining the distance information shown in the figure.

In this experiment, ten fixational eye movement traces, each 3 s long, were separated from the other movements and amplified by Gaf. Each trace included about three different fixational eye movements. The PLS was placed at a set of different distances (between 300 mm and 900 mm, in intervals of 200 mm) from the center of rotation of the camera. For each of these distances, the traces produced camera rotations according to the fixational eye movements, and the corresponding PLS projection centroid was sampled at the end of each movement. After each rotation, a delay of 1 s was applied in order to reduce vibrations and the influence of inertia on the samples. Applying the knowledge of the fixational eye movements and the PLS projections to the model (4.9) resulted in the graphs shown in Figures 5.16 and 5.17. It can be noticed that even small movements, such as fixational eye movements, produce enough parallax to yield rough distance estimations.

Under the experimental conditions described above, the estimated distance is affected by errors due to the sensor quantization. This error tends to produce either an overestimation or an underestimation of the angle α in (4.12), which subsequently causes the error in the perceived distance (see the graph in Figure 5.11 for further details). Furthermore, the graphs show that the standard deviation of the measurement is proportional to the object's distance.

Figure 5.16 – PLS distance estimated using the fixational eye movements of subject MV.

Figure 5.17 – PLS distance estimated using the fixational eye movements of subject AR.

Figure 5.18 – Natural image representing a set of objects placed at different depths: object A at 430 mm; object B at 590 mm; object C at 640 mm; object D at 740 mm; object E at 880 mm.
As the parallax decreases (due to large distances or to small rotations of the camera), it becomes more comparable with the sensor resolution. This creates a situation in which an error of one pixel in the acquisition of the image (introduced by the mechanical system, vibrations, motor position errors, etc.) leads directly to an overestimation of α and a subsequent increase of the standard deviation. For shorter distances, the larger parallax (more than ten pixels) is not influenced by an error of one or two pixels.

5.4.2 Depth and saccadic eye movements

Another series of experiments was performed in order to verify whether the saccadic eye movements of each subject, together with the model expressed by (4.9), could extract depth information from the natural scene shown in Figure 5.18. Even though (4.20) of Paragraph 4.4 supplies a sound theoretical relation which allows us to replicate fixational eye movements, it cannot be used for saccades, because it presupposes assumptions whose correctness cannot always be verified. The first, and probably the most important, concerns the magnitude of the rotations: all the approximations considered in Paragraph 4.4 suppose a rotation Δα so small that the projection lying on the visual axis (α = 0) shifts by only a single photoreceptor. A second assumption is that the resolution in the central part of the retina, the fovea, is constant. For saccadic eye movements this assumption is no longer valid, because the separation between photoreceptors changes linearly with the eccentricity measured from the foveal point [46].

In this experiment, all the fixational eye movements in the session were filtered out, leaving only the saccadic eye movements. This selection produced 29 saccades for subject MV and 28 for subject AR. The selected recordings were then used to produce motor commands for the robot. As seen above, it is not correct to use Gaf to amplify the saccades so that the robot perfectly replicates the gaze of the human subject. It is still possible, however, to find an amplification factor able to provide the same functionality. In fact, the calibration procedure employed in the first stage of the eye movement recording isolated the exact pixel observed by the human subject (see Paragraph 5.2). Movements in which the robot was requested to watch the same pixel gazed at by the human subject were collected, and the corresponding amplification factors (for both the x and y axes) were estimated by a regularized least square. These amplification factors were then used to amplify the saccadic movements and obtain the correct rotation commands for the robot.

Each command was sent to the robot with a delay of 1 s (necessary to minimize errors in the acquisition process caused by vibrations and inertia), after which the corresponding frame was acquired by the camera and stored for subsequent off-line processing. The saccadic traces produced a 30-frame movie for subject MV and a 29-frame movie for subject AR. Each frame of the movie was divided into 45 × 45 patches, and a normalized cross-correlation algorithm was applied to search for the horizontal displacement of the patches between each frame and its predecessor. Knowing the rotation applied to the camera before each frame, and the parallax produced by that rotation, the application of the model is straightforward. Figures 5.19 through 5.23 summarize how the replication of the saccades produced a reconstruction of the real 3-D scene with a good approximation of the distance information.
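A compact sketch of the per-frame processing described above: each frame is divided into square patches (45-pixel patches are assumed here) and the horizontal displacement of every patch with respect to the previous frame is found by maximizing the normalized cross-correlation over an integer search range. The resulting displacement map is the parallax matrix that (4.11) and (4.12) then convert into distances; function names, patch geometry, and the search range are illustrative assumptions.

    import numpy as np

    def ncc(a, b):
        # normalized cross-correlation between two equally sized patches
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 0 else 0.0

    def horizontal_parallax(prev_frame, frame, patch=45, max_disp=20):
        # integer horizontal displacement of each patch of prev_frame inside frame
        rows, cols = prev_frame.shape[0] // patch, prev_frame.shape[1] // patch
        disp = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                y, x = r * patch, c * patch
                template = prev_frame[y:y + patch, x:x + patch].astype(float)
                best_score, best_d = -2.0, 0
                for d in range(-max_disp, max_disp + 1):
                    xs = x + d
                    if xs < 0 or xs + patch > frame.shape[1]:
                        continue
                    score = ncc(template, frame[y:y + patch, xs:xs + patch].astype(float))
                    if score > best_score:
                        best_score, best_d = score, d
                disp[r, c] = best_d
        return disp

In practice the integer displacements can be refined to sub-pixel precision (for example by fitting a parabola around the correlation peak), which matters here because, as discussed above, the depth error grows quickly once the parallax approaches the size of a single pixel.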
Figure 5.19 – Reconstruction of the distance information according to the oculomotor parallax produced by subject MV's saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.20a and 5.20b.

Figure 5.20 – (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.19. (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.

Figure 5.21 – Filtered 3-D reconstruction of the distance information produced by subject MV's saccades.

Figure 5.22 – Reconstruction of the distance information according to the oculomotor parallax produced by subject AR's saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.23a and 5.23b.

Figure 5.23 – (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.22. (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.

Figure 5.24 – Filtered 3-D reconstruction of the distance information produced by subject AR's saccades. It can be noticed that there is no object at the extreme right. This is because AR's eye movements were concentrated on the left side of the scene: no parallax information concerning this object was available, and therefore no distance cues.

Chapter 6
Conclusion and future work

In this thesis we presented a novel approach to depth perception and figure-ground segregation. Although no previous study had considered it before, we have shown that the monocular parallax originating from a rotation of the camera, similar to the movement of the human eye, provides reliable depth information over a near range of distances. By integrating psychophysical data on recorded human eye movements with robotic experiments, this work takes a radically different approach from previous studies of depth perception in robotics and computer vision, focusing primarily on the replication of the computational strategies employed by biological visual systems.

Accurately replicating human eye movements represented a serious challenge for this research. Nevertheless, we showed that the implementation of these movements on the oculomotor robotic system, joined with the mathematical model of parallax, was able to provide depth estimates of natural 3-D scenes by accurately replicating the geometrical and mobile characteristics of the human eye.

One of the main contributions of this thesis is the simple formulation of the model and the restricted number of required parameters, which are measured in preliminary calibrations. This preliminary phase is only apparently a limitation. In fact, in a system that operates in the real world, depth cues are valuable only within the framework of the physical and functional characteristics of the system; for example, depth cues depend on the optical and motor characteristics of the eye. All organisms in nature, from insects with simple nervous systems to humans, tune their sensory cues while interacting with complex environments. Therefore, learning approaches such as Kalman filtering or unsupervised neural networks, both of which can extract the model parameters from the flow of visual information afferent to the system, could be used to adjust these values continuously.
Due to its sole dependence on rotations of the sensor and on object projections, the novel model presented in this research could easily be extended to incorporate egomotion information provided by emulated vestibular sensors. Such a system would integrate proprioceptive sensor knowledge with the movements of the robot itself, exploiting and integrating the errors and drifts of the moving components into a better 3-D perception of the world.

Another advantage is the relatively light computational load generated by the model, which exploits pre-existing basic visual cues. Parallax is a kind of information that can be computed at low levels: the elements required to perceive depth with this model are already present in the visual system and are probably already provided to other modules. This strategy is perfectly suitable for hierarchical systems such as the visual cortex and its higher-level processes.

A major future improvement toward a more robust distance estimation could be obtained by using vertical parallax as a source of depth information, which has not been considered in this thesis. Fusion with the horizontal cues, based on averaging or other criteria, would provide a distance cue less affected by noise. Moreover, in a binocular robotic system, the monocular parallax information extracted by the two cameras could be combined for improved accuracy.

Clearly, the robotic oculomotor system lacks a number of features necessary to produce an exact replica of the appearance of stimuli on the retina during oculomotor activity. For example, the different structure of the sensor surfaces and the spatial distribution of the receptors still represent an open challenge. It is clear, in any case, that biological nervous systems possess far more powerful signal processing capabilities than algorithmic techniques, which are often complex, slow, and unreliable. However, the hardware implementation of biologically inspired signal processing models has advanced in recent years. Neuromorphic engineering, a field based on the design and fabrication of artificial neural systems inspired by biological nervous systems, started in the late 1980s and is slowly maturing. We aim to investigate whether effective hardware for signal processing can be built using the approach of biologically inspired sensing technology. This may lead to new algorithms in computer vision and to the development of radically novel approaches in the design of machine vision applications in which fixational eye movements and saccades are essential constituents.

Bibliography

[1] Gullstrand A. Appendix II.3. The optical system of the eye, pages 350–358. Optical Society of America, 1924.
[2] Higashiyama A. The effect of familiar size on judgments of size and distance: an interaction of viewing attitude with spatial cues. Perception and Psychophysics, 35:305–312, 1984.
[3] Hughes A. A useful table of reduced schematic eyes for vertebrates which includes computed longitudinal chromatic aberrations. Vision Research, 19:1273–1275, 1979.
[4] Hastorf AH. The influence of suggestion on the relationship between stimulus size and perceived distance. Journal of Psychology, 29:195–217, 1950.
[5] Reinhardt-Rutland AH. Detecting orientation of a surface: the rectangularity postulate and primary depth cues. Journal of General Psychology, 117:391–401, 1990.
[6] Robert M. Gray Allen Gersho. Vector Quantization and Signal Compression, chapter 5, pages 151–152.