Depth Perception in a Robot that Replicates Human Eye Movements

UNIVERSITÀ DEGLI STUDI DI FIRENZE
FACOLTÀ DI INGEGNERIA - DIPARTIMENTO DI SISTEMI ED INFORMATICA
Dottorato di Ricerca in
Ingegneria Informatica e dell’Automazione
XV Ciclo
Depth Perception in a Robot that
Replicates Human Eye Movements
Candidate
Fabrizio Santini
Advisor
Marco Gori
ANNO ACCADEMICO 2003
Perceptual activity is exploratory, probing, searching;
percepts do not simply fall onto sensors as rain falls onto ground.
We do not just see, we look.
(Ruzena Bajcsy, 1988)
Preface
One of the most important goals of a visual system is to recover the distance at which objects are located. This process is inherently ambiguous, as it relies on projections of a three-
dimensional scene onto the two-dimensional surface of a sensor.
Humans rely on a variety of monocular and binocular cues. These cues, either dynamic or static, can be categorized as primary depth cues, which provide direct depth
information, such as the convergence of the optical axes of the two eyes, or secondary depth
cues, which may also be present in monocularly viewed images, such as motion parallax,
shading, and shadows. These cues are combined by the visual cortex and other high-level
brain functions with various strategies dependent upon the visual circumstances [16], in order to obtain a reliable estimate of distance.
An important contribution to depth information is supplied by the relative motion between the observer and the scene, which appears in the form of motion parallax [49]. While
motion parallax is evident for large movements of the observer, it also occurs during eye
movements. Due to the misalignment between the focal points and the center of rotation
when the eye rotates, the shift in the retinal projection of a point in space depends not only
on the amplitude of the eye movement but also on the distance of the point with respect to
the observer.
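To make the geometry concrete, the short Python sketch below implements a deliberately simplified, single-nodal-point toy model; it is not the two-nodal-point model developed in Chapter 4, and all numbers in it are arbitrary. It shows that when the center of rotation does not coincide with the nodal point, the same rotation displaces the images of near and far points by slightly different amounts, which is the depth signature exploited in this thesis.

import math

def image_shift(d, rotation_deg, e=6.0, f=17.0):
    """Toy model: the sensor rotates by rotation_deg about a center C placed
    e mm behind a single nodal point N; a point initially on the optical axis
    at distance d mm from C projects through N onto a planar sensor at focal
    distance f mm.  Returns the resulting image displacement in mm."""
    a = math.radians(rotation_deg)
    # Scene point expressed in the rotated sensor frame (y along the optical axis).
    px, py = d * math.sin(a), d * math.cos(a)
    # Direction of the point as seen from the nodal point N = (0, e).
    theta = math.atan2(px, py - e)
    return f * math.tan(theta)

# With e = 0 every distance gives the same shift (no depth information);
# with e > 0 a near point and a far point shift by slightly different amounts.
for d in (500.0, 900.0):
    print(d, round(image_shift(d, 3.0), 4), round(image_shift(d, 3.0, e=0.0), 4))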
It is known that when analyzing a 3-D scene, human beings tend to allocate several successive fixations at nearby spatial locations. In addition, within each of these fixations, small
fixational eye movements displace the retinal image by a relatively large amount. Both types of
eye movements provide 3-D information because eye movements themselves cause changes
in the pattern of retinal motion and, therefore, modify depth cues. However, their role in the
perception of depth in humans has always been uncertain and neglected [19].
Only recently, studies in psychophysics (see [28, 66, 24, 67, 33, 68] for further details)
have begun to consider monocular cues and eye movements as integral parts of the depth
perception process. Scientists are considering emerging evidence that eye movements influence the magnitude of perceived depth derived from motion. While it is not clear if or how
humans use eye movement parallax in the perception of depth, this cue can be used in a
robotic active vision system.
In robotics, several studies have taken an active approach to the visual evaluation of distance. In most of them, the motion of the sensor according to the movement of the robot (or
of the camera itself) produces a consequent motion of the projection of the observed scene,
which is computed with different approaches such as optical flow, feature correspondence,
and spatio-temporal gradients. However, no previous study has attempted to consider the
exploitation of the parallax phenomenon emerging from a rotation of the cameras due to eye
movement replication.
In this thesis, we present a novel mathematical model of 3-D projection on a 2-D surface
which exploits human eye movements to obtain depth information. This model brings new
features with it. The estimates obtained are reliable and rich in information. Even with
very small movements (i.e., fixational eye movements), the model is able to supply depth
information. Moreover, the model's simple formulation limits the number of required
parameters.
Another contribution of this work is the design and assembly of an oculomotor robotic
system which is able to replicate, as accurately as possible, the geometrical and movement characteristics of the human eye. Combining the mathematical model of parallax and recorded
human eye movements, we show that the robot is able to provide depth estimation of natural
3-D scenes.
Chapter 1 of this thesis summarizes general considerations on eye anatomy, its possible
aberrations, and its possible models. We also introduce basic elements of optical physics as
applied to the eye and to the characterization of human eye movement. Chapter 2 presents
possible depth information cues such as static and dynamic monocular cues and their possible types of interaction. Chapter 3 reviews the research literature in robotics and computer
vision in order to summarize the latest advances in this field. Chapter 4 describes the novel
mathematical model for the parallax information extraction. Finally, Chapter 5 presents the
results obtained using the robotic system, the mathematical model, and human eye movement recordings on generic 3-D scenes under different conditions.
Acknowledgments
I was once told that a Ph.D. thesis is something that one parents alone, with love and
attention. But in the end, one realizes that the dissertation has many grandparents, uncles and
aunts, relatives and friends. These are my small acknowledgments to this enlarged family.
First, I would like to thank my advisor Professor Marco Gori. This thesis would not have
been possible without his support and trust. My most special thanks go to Professor Michele
Rucci, who has been able to contain the false vauntings and drive the shy successes of my
work. A special thanks to Professor Marco Maggini, as well, for his friendship. He made all
this possible.
I thank the wonderful people of the Active Perception Lab: Gaelle Desbordes and Bilgin
Ersoy have been great friends and supporters, as well as providers of precious comments for
this thesis. I will not easily forget the Machine Learning Group within the “Dipartimento di
Ingegneria dell’Informazione” at the University of Siena: Monica Bianchini, Michelangelo
Diligenti, Ciro de Mauro, Marco Ernandes, Luca Rigutini, Lorenzo Sarti, Franco Scarselli,
Marco Turchi, and Edmondo Trentin and his “children’s tails”.
How can I forget all my dearest friends who supported me during this work? Michele
degli Angeli, Cinzia Gentile, Thomas Giove, Robert Mabe, Emily Marquez, Bruce Miller,
Paolo M. Orlandini, Michela Pisà, Christian Schlesinger, Andrea Valeriani, Sergio M. Ziliani: to all of you... forever grateful.
In the end, I would like to dedicate this work to my parents Bruno, Lelia, and my siblings
Federica and Francesco - their love and support endless, their confidence in me steadfast,
and their patience limitless.
March 2004
Contents

1 Human vision
  1.1 General anatomy of the eye
  1.2 Optics applied to the eye
  1.3 Aberrations in the eye
  1.4 Optical models of the eyes
  1.5 Eye movements

2 Monocular depth vision system
  2.1 Monocular static depth cues
  2.2 Monocular dynamic depth cues
  2.3 Types of cue interaction

3 Monocular depth perception in robotic
  3.1 Passive depth finding
  3.2 Active depth finding

4 The model
  4.1 Eye movement parallax on a semi-spherical sensor
  4.2 Eye movement parallax on a planar sensor
  4.3 Theoretical parallax of a plane
  4.4 Replicating fixational eye movements in a robotic system

5 Discussion of the experiment results
  5.1 The oculomotor robotic system
  5.2 Recording human eye movements
  5.3 Experimental results with artificial movements
  5.4 Experimental results with human eye movements

6 Conclusion and future work

Bibliography
List of Figures

1.1  The human eye and its constitutive elements.
1.2  Detailed structure of the retina.
1.3  Inner segments of a human foveal photoreceptor mosaic in a strip extending from the foveal
     center (indicated by the upper left arrow) along the temporal horizontal meridian. Arrowheads
     indicate the edges of the sampling windows. The large cells are cones and the small cells are rods.
1.4  Refraction for light passing through two different mediums where n1 < n2.
1.5  Simple graphical method for imaging reconstruction in thin lenses.
1.6  Behavior of light in a thick lens according to the nodal points.
1.7  Simple graphical method for imaging reconstruction in thick lenses.
1.8  Spherical aberration in a converging lens.
1.9  Coma aberration in a converging lens.
1.10 Images of distortion aberration: (a) a square; (b) pincushion distortion; (c) barrel distortion.
1.11 Emsley's standard reduced 60-diopter eye.
1.12 Gullstrand's three-surface reduced schematic eye.
1.13 General form of the Chromatic Eye.
1.14 General form of the Indiana Eye.
1.15 Human eye movements (saccades) recorded during an experiment in which a subject was
     requested to freely watch a natural scene. A trace of eye movements recorded by a DPI
     eyetracker (see Paragraph 5.2) is shown superimposed on the original image. The panel on the
     bottom right shows a zoomed portion of the trace in which small fixational eye movements are
     present. The color of the trace represents the velocity of eye movements (red: slow movements;
     yellow: fast movements). Blue segments mark periods of blink.
1.16 Fixational eye movements and drifts recorded during an experiment in which a subject was
     requested to look at a specific point on the screen. The figure shows the magnified details of
     fixational eye movements around a target whose size is only 30 pixels.
1.17 Axis systems for specifying eye movements.
1.18 (a) In the Helmholtz system the horizontal axis is fixed to the skull, and the vertical axis
     rotates gimbal fashion; (b) in the Fick system the vertical axis is fixed to the skull.
2.1  Binocular disparity according to the displacement of the two eyes from the cyclopean axis.
2.2  Monocular depth cues.
2.3  Example of depth perception using perspective cues.
2.4  Object, Object space, Display, Observer.
2.5  (a) one-point; (b) two-point; (c) three-point perspective.
2.6  Example of linear perspective.
2.7  The depth cue of height in the field.
2.8  Texture gradients and perception of inclination: (a) Texture size scaling; (b) Texture density
     gradient; (c) Aspect-Ratio perspective.
2.9  Example of an object which does not respond to Jordan's Theorem.
2.10 (a) Amodal completion in which the figures appear in a specific order. (b) Modal completion
     effects using a Kanizsa triangle in which a complete white triangle is seen. Each disc appears
     complete behind the triangle (amodal completion).
2.11 A Kanizsa triangle is not evident in each static frame but becomes visible in the dynamic sequence.
2.12 Variations of convexity and concavity from shading.
2.13 The gray squares appear to increase in depth above the background because of the positions
     of the shadows.
2.14 Differential transformations for the optical flow fields: (a) translation; (b) expansion;
     (c) rotation; (d) shear of type one; (e) shear of type two.
4.1  Parallax can be obtained when a semi-spherical sensor (like the eye) rotates. A projection of
     the points A and B depends both on the rotation angle and the distance to the nodal point.
     The circle on the right highlights how a misalignment between the focal point (for the sake of
     clarity, we consider a simple lens with a single nodal point) and the center of rotation yields
     a different projection movement according to the rotation but also according to the distance
     of A and B.
4.2  (a) Gullstrand's three-surface reduced schematic eye. (b) Geometrical assumptions and axis
     orientation for a semi-spherical sensor. An object A is positioned in the space at distance da
     from the center C of rotation of the sensor, and with an angle α from axis y; N1 and N2
     represent the position of the two nodal points in the optical system. The object A projects an
     image xa on the sensor and its position is indicated by θ.
4.3  (a) Variations of the projection point on a semi-spherical sensor obtained by (4.7) considering
     an object at distance da = 500 mm, N1 = 6.07 mm and N2 = 6.32 mm; (b) detail of the
     rectangle in Figure (a): two objects at distance da = 500 mm and da = 900 mm produce two
     slightly different projections on the sensor.
4.4  (a) Schematic representation of the right eye of the oculomotor system with its cardinal points.
     The L-shaped support places N1 and N2 before the center of rotation C of the robotic system.
     (b) Geometrical assumptions and axis orientation for a planar sensor. An object A is positioned
     in the space at a distance da from the center C of rotation of the sensor with an angle α from
     axis y; N1 and N2 represent the position of the two nodal points in the optical system. The
     object A projects an image xa on the sensor; x̂a represents the projection of A considering
     null the distance between N1 and C.
4.5  Curve A represents variations of the projection xa on a planar sensor obtained by (4.9),
     considering an object at distance da = 500 mm, N1 = 49.65 mm and N2 = 5.72 mm (see
     Paragraph 5.1.3 for further details); Curve B represents the same optical system configuration
     but with the object at distance da = 900 mm. Curve C represents variations of x̂a using (4.10)
     when the nodal points N1 and N2 collapse in the center of rotation of the sensor. This curve
     does not change with the distance.
4.6  The clockwise rotation of the system around the point C leaves unchanged the absolute
     distance da, and allows a differential measurement of the angle α (the angle before the
     rotation) to be obtained using only ∆α and the two projections xa (before the rotation) and
     x'a (after the rotation).
4.7  Geometrical assumptions and axis orientation needed to calculate the theoretical parallax.
     Considering the plane P at distance D from the center of rotation C of the camera, it is
     possible to calculate the expected parallax.
4.8  Horizontal theoretical parallax for planes at different distances using a rotation of ∆α = 3.0°
     anticlockwise. Although asymmetry, due to the rotation, is present on all the curves, it is more
     evident for shorter distances.
4.9  Geometrical assumptions and axis orientation for a small movement. The sensor S is no longer
     considered as a single continuous semi-spherical surface, but it is quantized according to the
     dimension of the virtual photoreceptors.
5.1  The robotic oculomotor system. Aluminum wings and sliding bars mounted on the cameras
     made it possible to maintain the distance between the center of rotation C and the sensor S.
5.2  Pinpoint light source (PLS) structure.
5.3  (a) Factory specifications for the optical system used in the robotic head (not to scale). The
     point C = 7.8 ± 0.1 mm represents the center of rotation of the system camera-lens, and it
     was measured directly on the robot. (b) Estimation of the position of N1 and N2 with respect
     to C as a function of the focal length (focus at infinity).
5.4  Experimental setup of the calibration procedure. The PLS was positioned on the visual axis of
     the right camera at distances depending on each trial.
5.5  Data indicated with circles represent the estimated position of the nodal points N1p and N2p
     for values of focal length 11.5, 15.0, 25.0, 50.0 mm (focus at infinity). Solid lines in the graph
     represent their interpolated trajectory. It can be noticed that the position of the cardinal points
     expressed by the graph is compatible with the trajectory reported by the factory in Figure 5.3b.
5.6  Model fitting (indicated with a solid line) obtained interpolating N1p = 49.65 mm and
     N2p = 5.72 mm from the data (indicated with circles). The underestimation of the projections
     under da = 270 mm is due to defocus, aberrations and distortions introduced by the PLS being
     too close to the lens. In fact, since da is considered from the center of rotation C, the light
     source is actually less than 160 mm from the front glass of the lens. This position is
     incompatible even for a human being, who cannot focus objects closer than a distance called
     the near point, equal to 250 mm [74].
5.7  Results of the model fitting of the parallax obtained at different focal lengths (and different
     nodal points).
5.8  The Dual-Purkinje-Image eyetracker. This version of the DPI eyetracker has a temporal
     resolution of approximately 1 ms and a spatial accuracy of approximately 1 arcmin.
5.9  Basic principle of the Dual-Purkinje-Image eyetracker. Purkinje images are formed by light
     reflected from surfaces in the eye. The first reflection takes place at the anterior surface of
     the cornea, while the fourth occurs at the posterior surface of the lens of the eye at its
     interface with the vitreous humor. Both the first and fourth Purkinje images lie in
     approximately the same plane in the pupil of the eye and, since eye rotation alters the angle
     of the collimated beam with respect to the optical axis of the eye, and since translations move
     both images by the same amount, eye movement can be obtained from the spatial position
     and distance between the two Purkinje images.
5.10 Distance estimation calculated from projection data obtained by rotating the right camera 3°.
5.11 Error introduced by the quantization in the model depth perception performance. Distance
     estimation using a sensor with a 9 µm cell size (thick blue line) is compared with a sensor
     with a 0.3 µm cell size. The error introduced by the higher resolutions is evident only after
     numerous meters.
5.12 Semi-natural scene consisting of a background placed at a distance of 900 mm with a
     rectangular object placed before it at a distance of 500 mm. This object is an exact replica of
     the covered area of the background. (a) The scene before a rotation; (b) the same semi-natural
     scene after an anticlockwise rotation of ∆α = 3.0°.
5.13 The rectangular object hidden in the scene in Figure 5.12 is visible only because it has been
     exposed to an amodal completion effect with a sheet of paper.
5.14 Horizontal parallax produced by a rotation of ∆α = 3° for the object in Figure 5.12a. It can
     be noticed how the general tendency of the surface corresponds to the theoretical parallax
     behavior predicted by equation (4.14).
5.15 Distance estimation of the object in Figure 5.12a. The horizontal parallax matrix obtained by
     positioning an object at 500 mm from the point C of the robotic oculomotor system was
     post-processed using (4.11) and (4.12), thereby obtaining the distance information in the figure.
5.16 PLS estimated distance using fixational eye movements of the subject MV.
5.17 PLS estimated distance using fixational eye movements of the subject AR.
5.18 Natural image representing a set of objects placed at different depths: object A at 430 mm;
     object B at 590 mm; object C at 640 mm; object D at 740 mm; object E at 880 mm.
5.19 Reconstruction of the distance information according to the oculomotor parallax produced by
     the subject MV's saccades. Lines A and B identify two specific sections of the perspective,
     which are shown in further detail in Figures 5.20a and 5.20b.
5.20 (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.19.
     (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.
5.21 Filtered 3-D reconstruction of the distance information produced by the subject MV's saccades.
5.22 Reconstruction of the distance information according to the oculomotor parallax produced by
     the subject AR's saccades. Lines A and B identify two specific sections of the perspective,
     which are shown in further detail in Figures 5.23a and 5.23b.
5.23 (a) Detailed distance estimation of the scene at the section A highlighted in Figure 5.22.
     (b) Detailed distance estimation of the scene at the section B highlighted in the same figure.
5.24 Filtered 3-D reconstruction of the distance information produced by the subject AR's saccades.
     It can be noticed how there is no object at the extreme right. This is because AR's eye
     movements were focused more on the left side of the scene. No parallax information
     concerning this object was available, and therefore no distance cues.
List of Tables

1.1  Various refraction indexes for different optical parts of the eye. It can be noticed that these
     values for the interior components of the eye are similar to each other, creating very little
     refraction phenomena.
5.1  Characteristics of the pinpoint source semiconductor light.
Chapter 1
Human vision
Contents

1.1  General anatomy of the eye
     1.1.1  The retina
     1.1.2  Hyperacuity
     1.1.3  Accommodation
1.2  Optics applied to the eye
     1.2.1  Thin lenses
     1.2.2  Thick lenses
1.3  Aberrations in the eye
     1.3.1  Spherical Aberrations
     1.3.2  Coma
     1.3.3  Astigmatism
     1.3.4  Distortion
1.4  Optical models of the eyes
     1.4.1  Gullstrand's schematic model
     1.4.2  Chromatic and Indiana schematic eye
1.5  Eye movements
     1.5.1  Abrupt movements
     1.5.2  Smooth movements
     1.5.3  Fixational eye movements
     1.5.4  Coordinates systems for movements
Since the times of the ancient Egyptians (the Ebers Papyrus, 1550 BC), people have tried to
penetrate the riddle of vision. It is in the Arabian literature that figures illustrating the
anatomy of the eye made their first appearance. The earliest drawing appears in the recently
discovered Book of the ten treatises on the eye by Hunain ibn Ishaq (about 860 AD).

Figure 1.1 – The human eye and its constitutive elements.
Centuries of discoveries and the advancement of technology have led us toward a deep
understanding about how living beings see the world and how they learn from it. We acknowledge the striking evidence that the human eye is an elegant visual system that outclasses in many ways most modern optical instruments. We can focus on objects as close as
a few centimeters or as far away as a few kilometers, as well as watch the surrounding environment without becoming blinded by sunlight. With the eye we are able to differentiate
colors, textures, sizes, and shapes at very high resolutions.
Speculations, studies and research on its internal functioning and its features have uncovered many of the aspects of the human eye. These discoveries have been consequently
employed in our everyday life. Simulation of eye surgery techniques, prescriptions of corrective lenses, or basic eye function emulations in robotics and computer vision are only
some of the many considerable examples.
1.1 General anatomy of the eye
There are many parts of the eye that serve different optical functions (see Figure 1.1). Light
enters the cornea, a transparent bulge on the front of the eye. Behind the cornea is a cavity
called the anterior chamber, which is filled with a clear liquid called the aqueous humor.
Next, light passes through the pupil, a variably sized opening in the externally colored, opaque
iris. Just behind the iris, light passes through the crystalline lens, whose shape is controlled
by ciliary muscles attached to its edge.

Figure 1.2 – Detailed structure of the retina.
Traveling through the central chamber of the eye, filled with vitreous humor, light finally reaches its destination and strikes the retina, which represents the curved surface at
the bottom of the eye. The retina is densely covered with over 100 million light-sensitive
photoreceptors, which convert photons into neural activity that is then sent to the
visual cortex in the brain.
1.1.1 The retina
The optic nerve enters the globe of the eye at a point 3 mm on the nasal side of the posterior
pole, and about 1 mm below. It spreads out into the fine network of nerve fibers that constitute the retina. It contains the sensitive elements, cells that form part of the visual pathway
from receptors to cortex (bipolar and ganglion cells), and cells which interconnect branches
of the visual path (horizontal and amacrine cells). The interconnections formed by fibers
from the bipolar and ganglion cell bodies are very complicated even when seen under an
optical microscope. Only recently, electron microscopes have shown that the number and
extent of the horizontal connections are much larger than those revealed by the light microscope. The function of these interconnections is not understood in detail, but it is certain
that selection and recording of information take place within the retina itself.
The fine structure of the retina is depicted in Figure 1.2. This structure was first revealed by Ramón y Cajal using the Golgi staining method, and described in a series of
papers between 1888 and 1933. The retina is a multi-layered membrane with an area of
about 1,000 mm². It is about 250 µm thick at the fovea, diminishing to about 100 µm at
the periphery.
The regions containing ganglion and bipolar cells highlighted in Figure 1.2 are known
as the cerebral layer of the retina.
The receptors are densely packed in the outer layer, and are considered to be about
70% of all receptors in the human body. There are two main types of receptors: rods and
cones. Rods have high sensitivity, an exclusively peripheral distribution, and broad spectral tuning. Cones have lower sensitivity, high concentration in the fovea with decreasing
concentration in the peripheral retina, and three types of spectral tuning, peaking at around
450 nm (S-cones), 535 nm (M-cones), and 565 nm (L-cones). S-cones constitute 5 to 10%
of all cones. There are about equal numbers of M- and L-cones, although there is considerable variation between people. The normal range of luminance sensitivity of the human eye
extends from roughly 10−7 cd · m−2 to 10−4 cd · m−2 .
The fovea is a specialized avascular region of the retina where cones are very tightly
packed together in order to maximize spatial resolution in the approximate center of our
visual field. Outside the fovea, spatial resolution of the neural retina falls substantially. The
adult human retina has between 4 and 6 million cones. Peak density of between 100,000 and
320,000 per mm² occurs at the fovea but declines to about 6,000 per mm² at an eccentricity
of 10°. The fovea of a primate is a centrally placed pit about 1.5 mm in diameter, which
contains a regular hexagonal mosaic of cones with a mean spacing of between 2 and 3 µm.
The central fovea, which is about 0.27 mm in diameter and subtends about 1°, contains at
least 6,000 cones. Each cone in the retina projects to a single ganglion cell, giving rise to
both a disproportionate number of foveal nerve fibers in the visual pathway (compared with
the rest of the retina) and a disproportionate representation in the visual cortex.
The human retina has 100 million or more rods, which are absent in the fovea and reach
a peak density of about 160,000 per mm² at an eccentricity of about 20°. In the region
where the fibers of the optic nerve converge and ultimately leave the globe there are no rods
or cones. This region is blind and is called the optic disc. When the eye is viewed in an
ophthalmoscope, large retinal blood vessels can be seen in and near this region, though the
exact appearance varies considerably even in healthy subjects. From this region the retinal
blood vessels spread out into finer and finer branches.
1.1.2
Hyperacuity
In the retinal image of the world, contours or objects are characterized in the simplest case
by changes in luminance or color. Accordingly, the capabilities of the visual system are
often investigated by using stimuli in which patterns are well defined in shape, luminance,
and color. In this context, spatial resolution refers to the ability of the visual system to
determine that two points are separated in space.

Figure 1.3 – Inner segments of a human foveal photoreceptor mosaic in a strip extending from
the foveal center (indicated by the upper left arrow) along the temporal horizontal meridian.
Arrowheads indicate the edges of the sampling windows. The large cells are cones and the small
cells are rods.
Theoretically¹, for two points in space to be resolved, the images of the two points must
fall on two separate receptors with at least one receptor between them. Hence the images of
the two points must be separated by at least 4 µm in order for the two points to be resolved.
In a typical eye this corresponds to an angular resolution of approximately 50 arcsec.
The wave theory of light predicts that the image of a point object formed by an optical
system with a circular aperture stop will be a circular disc surrounded by fainter rings. The
central disc contains about 85% of the light in the diffraction pattern and is called the Airy
disc. The diameter of the Airy disc is dependent on the pupil size and the wavelength of
light, and it is of the order of 90 arcsec in a typical eye.
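These orders of magnitude are easy to verify with a back-of-the-envelope calculation. The sketch below assumes a nodal-point-to-retina distance of about 17 mm, a 3 mm pupil, and a 555 nm wavelength; these are typical textbook values adopted here for illustration, not figures taken from this thesis.

import math

ARCSEC_PER_RAD = 180.0 / math.pi * 3600.0

# Receptor-spacing limit: two images at least ~4 um apart on a retina roughly
# 17 mm behind the nodal point subtend an angle of about (4e-3 mm / 17 mm) radians.
receptor_limit = (4e-3 / 17.0) * ARCSEC_PER_RAD            # about 49 arcsec

# Diffraction: the Airy disc diameter for a circular pupil is roughly
# 2.44 * wavelength / pupil diameter (in radians); the resolution limit
# (Rayleigh criterion) is about half of that.
airy_diameter = (2.44 * 555e-9 / 3e-3) * ARCSEC_PER_RAD    # about 93 arcsec
rayleigh_limit = airy_diameter / 2.0                       # about 47 arcsec

print(round(receptor_limit), round(airy_diameter), round(rayleigh_limit))

Both estimates come out close to the 50 arcsec and 90 arcsec figures quoted above.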
Clearly this spreading of the light will affect resolution. In a typical eye the spatial
resolution limit set by diffraction is of the order of 45 arcsec. In practice, a number of other
factors, including scatter and various aberrations, degrade the retinal image further.
However, despite the theory, the visual system is remarkably good at resolving details
that are smaller than the size predicted by the spacing of receptors and diffraction.
Research has shown that the eye is capable of resolving changes in position that are nearly
an order of magnitude smaller than the 50 arcsec diameter of a foveal photoreceptor.
Westheimer [105] coined the term hyperacuity to describe the high levels of performance
observed in the above tasks relative to those obtained in more conventional spatial-acuity
tasks, such as grating resolution which yields thresholds of the order of 30 − 60 arcsec (the
approximate diameter of the inner segment of a foveal cone).
Westheimer and others2 identified many different hyperacuity tasks:
• Point acuity: The separation required for the subject to notice that two points are distinct
• Line acuity: The separation required for the subject to resolve two lines
• Grating acuity: The separation of the lines required for the subject to just see the
grating (rather than a uniform gray field)
• Vernier acuity: The minimum separation required for the subject to see that the two
lines are not collinear (in a straight line)
• Center acuity: The minimum displacement of the dot from the center of the circle
• Displacement threshold: The minimum displacement of two lines turned on and off
sequentially, giving rise to the sensation of side to side movement
¹ For a more exhaustive treatment regarding the determination of the theoretical limits of the retina resolution
see [46, 108, 26, 90, 45].
² For further investigation and experimental data refer to [42, 57, 50, 110, 61, 11, 37, 100, 63, 17].
For example, observers can reliably detect an 8 − 10 arcsec instantaneous displacement of a
short line or bar (see [36]), a 2 − 4 arcsec misalignment in a vernier-acuity task, a 6 arcsec
difference in the separation of two lines, and a 2 − 4 arcsec difference in the separation
of lines in a stereoacuity task. Many of the hyperacuity tasks undoubtedly measured the
resolution limits of fundamental visual processes.
The exquisite sensitivities obtained in hyperacuity tasks are amazingly resistant to changes
in the spatial configuration of the stimuli. Thus hyperacuity is a robust phenomenon of some
generality. For example, highly sensitive displacement (small-movements) detection and
separation discrimination are probably of prime importance for the mechanisms that extract
the orientation and motion of the observer and the orientation and relative distances of surfaces in the environment.
Similarly, excellent stereoacuity must be crucial for the extraction of information regarding surface orientation and depth from stereopsis. Since stereo and optical flow information
decline rapidly with distance from the observer (the inverse-square law), it follows that every
additional arcsecond of resolution that can be obtained will significantly increase the volume
of space around the observer from which reliable distance and surface-orientation information
can be extracted.
1.1.3 Accommodation
One of the main activities of the eye is to correctly focus light onto the retina. This ability
is called accommodation. For this purpose, there are two principal focusing elements of
the eye: the cornea, which is the main component responsible for the focusing, and the
crystalline lens, which makes fine adjustments.
The crystalline lens has a diameter of about 9 mm and a thickness of about 4 mm. It is
not like a typical glass convex lens, but is composed of different layers which have different
refractive indexes.
The lens is responsible for adjustments between near and far points of vision. According
to the equations of optics (see Paragraph 1.2 for further details), we can say that the lens of
the eye would need a wide range of focal lengths to focus objects which are close to the
eye, versus objects that are far away. Instead of having hundreds of different lenses, ciliary
muscles (see Figure 1.1) attached to the crystalline lens contract and relax, causing a change in
the lens curvature and a subsequent change in the focal length.
To focus distant objects, the ciliary muscles relax and the lens flattens, increasing its
radius of curvature, which therefore decreases its refractive power. As an object
moves closer, the ciliary muscles contract, making the lens fatter, which produces a higher
power lens. A healthy young adult is capable of increasing the refractive power of their eye
from 20 D to 34 D. Unfortunately, as humans age, the lens begins to harden and its ability
to accommodate deteriorates.
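The diopter arithmetic introduced in Paragraph 1.2.1 makes it easy to estimate the extra power that accommodation must supply. The following is a minimal sketch under the simplifying assumption that the eye behaves as a single thin lens in air with the retina at a fixed image distance:

# With the retina at a fixed image distance, 1/f = 1/s + 1/s' implies that
# focusing an object at distance s (in meters) requires 1/s extra diopters
# compared with an eye relaxed on a very distant object.
def accommodation_demand(object_distance_m):
    return 1.0 / object_distance_m

print(accommodation_demand(1.00))   # 1 D for an object at 1 m
print(accommodation_demand(0.25))   # 4 D at the conventional 250 mm near point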
Figure 1.4 – Refraction for light passing through two different mediums where n1 < n2 .
Part of the Eye       Index of Refraction
Water                 1.33
Cornea                1.34
Aqueous humor         1.33
Lens cover            1.38
Lens center           1.41
Vitreous humor        1.34
Table 1.1 – Various refraction indexes for different optical parts of the eye. It can be noticed that
these values for the interior components of the eye are similar to each other, creating
very little refraction phenomena.
1.2 Optics applied to the eye
Light travels through air at a velocity of approximately 3 × 10⁸ m/s. When it travels through
mediums, such as transparent solids or liquids, the speed at which the light travels can
be considerably lower. The ratio (velocity of light in air over velocity in the medium)
changes from medium to medium and is known as the refractive index n. The refractive
index of air is considered to be 1.00, while that of glass is 1.50.
When light travels from one medium to another medium with a different refractive index,
it deviates from its original path. The bending of light caused by refractive index mismatch
is called refraction.
The relationship between two mediums and the angle of refraction, as illustrated in
Figure 1.4, is known as Snell’s Law:
n1 sin θ1 = n2 sin θ2                                                (1.1)
In the eye, the cornea refracts incident light rays to a focus point on the retina (refractive
indexes of the cornea and other optical parts of the eye are listed in Table 1.1).
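As a small numerical illustration of (1.1), using the refractive indexes listed in Table 1.1 (this is only a sketch; the real cornea is curved, so the angle of incidence varies across its surface):

import math

def refraction_angle(theta1_deg, n1=1.00, n2=1.34):
    """Snell's law (1.1): n1 sin(theta1) = n2 sin(theta2); returns theta2 in degrees."""
    return math.degrees(math.asin(n1 * math.sin(math.radians(theta1_deg)) / n2))

# A ray meeting the air-cornea interface at 30 degrees is bent toward the normal.
print(round(refraction_angle(30.0), 1))   # about 21.9 degrees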
Figure 1.5 – Simple graphical method for imaging reconstruction in thin lenses.
1.2.1 Thin lenses
Lenses are usually made of glass or plastic and are used to refract light rays in a desired
direction. When parallel light passes through a convex lens, the beams converge at a single
point. On the other hand, when parallel light passes through a concave lens, the beams
diverge, or spread apart.
The distance beyond the convex lens, where parallel light rays converge is called the
focal length of the lens. The relationship between the focal length of a lens, the object
position, and the location where a sharp or focused image of the object will appear is (see
Figure 1.5):
1/f = 1/s + 1/s'                                                     (1.2)
This equation is called Gauss's law for thin lenses, and it describes the relationship between
the focal length of the lens f, the distance s of the object to the left of the lens in the object
space medium, and the distance s' of the focused image to the right of the lens in the image
space medium.
A convex lens produces a real inverted image to the right of the lens. Similarly, since the
cornea is convex shaped, the image formed on the retina is inverted. It is the responsibility
of the visual cortex in the brain to invert the image back to normal.
The magnification of an image is calculated from the following equation:
m = -s'/s                                                            (1.3)
The optical power of a lens is commonly measured in terms of diopters (D). A diopter is
actually a measure of curvature and is equivalent to the inverse of the focal length of a lens,
measured in meters. The advantage of using diopters in lenses is their additivity. The thin
lens equation (1.2), in terms of diopters, can be rewritten as follows:
P = S' + S                                                           (1.4)

where P is the power of the lens measured in diopters, and S and S' are the reciprocals of the
object and image distances, respectively, also expressed in inverse meters, or diopters.
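A short numerical sketch of equations (1.2)-(1.4), with arbitrary example values for a generic converging lens:

# Gauss's law (1.2): 1/f = 1/s + 1/s'; magnification (1.3): m = -s'/s;
# the same relation in diopters (1.4): P = S' + S.
f = 0.050                             # 50 mm focal length
s = 0.500                             # object 500 mm in front of the lens
s_prime = 1.0 / (1.0 / f - 1.0 / s)   # image distance: about 55.6 mm
m = -s_prime / s                      # magnification: about -0.11 (inverted, reduced)
P = 1.0 / s_prime + 1.0 / s           # 18 D + 2 D = 20 D, i.e. 1/f

print(round(s_prime, 4), round(m, 3), round(P, 1))

The additivity in the last line is what makes diopters convenient: object and image vergences simply sum to the power of the lens.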
Figure 1.6 – Behavior of light in a thick lens according to the nodal points.

1.2.2 Thick lenses
When the thickness of a lens cannot be considered small with respect to the focal length, the
lens is called thick. These kinds of lenses can also be considered a model for composite
optical systems. A well-corrected optical system, in fact, can be treated as a black box
whose characteristics are defined by its cardinal points.
There are six cardinal points (F1 , F2 , H1 , H2 , N1 , and N2 ) on the axis of a thick lens
from which its imaging properties can be deduced. They consist of the front and back focal
points (F1 and F2 ), front and back principle points (H1 and H2 ), and the front and back
nodal points (N1 and N2 ).
An incident ray on a lens from the front focal point F1 will exit the lens parallel to the
axis, and an incident ray parallel to the axis refracted by the lens will converge onto the
back focal point F2 (see Figure 1.6a and 1.6b). The extension of the incident and emerging
rays in each case intersects, by definition, the principal planes which cross the axis at the
principal points, H1 and H2 (see Figure 1.6a and 1.6b).
The nodal points are two axial points such that a ray directed at the first nodal point
appears (after passing through the system) to emerge from the second nodal point parallel
to its original direction (see Figure 1.6c). When an optical system is bounded on both sides
by air (usually true in the majority of applications, but not for the eye), the nodal points
coincide with the principal points.
A simple graphical method can be used to determine image location and magnification
of an object. This graphical approach relies on a few simple properties of an optical system.
When the cardinal points of an optical system are known, the location and size of the image
formed by the optical system can be readily determined.

Figure 1.7 – Simple graphical method for imaging reconstruction in thick lenses.

In Figure 1.7, the focal points F1
and F2 and the principal points H1 and H2 are shown. The object which the system is to
image is shown as the arrow AO. Ray OB, parallel to the system axis, will pass through the
second focal point F2 . The refraction will appear to have occurred at the second principal
plane. The ray OC, passing through the first focal point F1 , will emerge from the system
parallel to the axis. The intersection of these rays at the point O0 locates the image of point
O. A similar construction for other points on the object would locate additional image
points, which would lie along the indicated arrow O0 A0 .
A third ray could be constructed from O to the first nodal point. This ray would appear
to emerge from the second nodal point, and it would be parallel to the entering ray.
The law for thin lenses specified in (1.2) can be extended to thick lenses in the following
form:
1/f1 = 1/f2 = 1/s + 1/s'                                             (1.5)
This modification of the law is valid only if both the object space medium and the image space
medium of the lens have the same refractive index n. Otherwise, the relation between
f and f' with different mediums would be:
f'/f = n'/n                                                          (1.6)
With a real lens of finite thickness, the image distance, object distance, and focal length are
all referenced to the principal points, not to the physical center of the lens. By neglecting the
distance between the lens' principal points, known as the hiatus, s + s' becomes the object-to-image distance. This simplification, called the thin-lens approximation, leads back, to a first
approximation, to the thin lens configuration.
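The following sketch applies (1.5) and (1.6) to invented numbers, simply to show how the two relations are used: conjugate distances are referenced to the principal planes, and when the image space has a higher refractive index the back focal length is longer than the front one.

# Thick lens in air on both sides: f1 = f2 = f, and (1.5) reduces to the thin-lens
# form with s and s' measured from the principal points H1 and H2.
f = 0.017                              # 17 mm equivalent focal length
s = 0.500                              # object 500 mm in front of H1
s_prime = 1.0 / (1.0 / f - 1.0 / s)    # image about 17.6 mm behind H2

# Different media on the two sides, the situation of the eye: by (1.6) the two
# focal lengths differ, f' = f * n'/n.
n, n_prime = 1.000, 1.336              # air in front, vitreous-like medium behind
f_prime = f * n_prime / n              # about 22.7 mm

print(round(s_prime * 1000, 1), round(f_prime * 1000, 1))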
Figure 1.8 – Spherical aberration in a converging lens.
1.3 Aberrations in the eye
By assuming that all the rays that hit lenses are paraxial³, we have deliberately ignored the
dispersive effects caused by lenses. Saying that spherical lenses and mirrors produce perfect
images is not completely correct: even when perfectly polished, lenses do not produce
perfect images. The deviations of rays from what is ideally expected are called aberrations.
There are four main types of aberrations: spherical, coma, astigmatism, and distortion.
1.3.1 Spherical Aberrations
When rays pass through the edges of a spherical lens, they do not pass through the
same focus as paraxial rays (see Figure 1.8). A blurred image formed by the deviations of the rays passing
through the edges of the lens is called spherical aberration. As we might expect, spherical
aberrations are a great problem for optical systems with large apertures, and production of
accurate wide surface lenses is a continuing technological challenge.
Under normal conditions, the human pupil is small enough that spherical aberrations
do not significantly affect vision. Under low-light conditions or when the pupil is dilated,
however, spherical aberrations become important. The visual acuity becomes affected by
this aberration when the pupil is larger than about 2.5 mm in diameter.
There are other factors of the eye that reduce the effect of spherical aberrations. The
outer portions of the cornea have a flatter shape and therefore refract less than the central
areas. The central part of the crystalline lens also refracts more than its outer portions due
to a slightly higher refractive index at its center.
³ The paraxial region of an optical system is a thin thread-like region about the optical axis which is so small
that all the angles made by the rays may be considered equal to their sines and tangents.
Figure 1.9 – Coma aberration in a converging lens.
1.3.2 Coma
Coma is an off-axis modification of spherical aberration. It produces a blurred image of
off-axis objects that is shaped like a comet (shown in Figure 1.9). Coma is an aberration dependent on the eye’s pupil size. An optical system free of both coma and spherical aberration
is called aplanatic.
1.3.3 Astigmatism
Astigmatism is the difference in focal length of rays coming in from different planes of an
off-axis object. Like coma, astigmatism is non-symmetric about the optical axis. The eye
defect called astigmatism is slightly different from the optical aberration. Astigmatism refers
to a cornea that is not spherical but is more curved in one plane than in another. That is
to say, the focal length of the astigmatic eye is different for rays in one plane than for those
in its perpendicular plane.
1.3.4 Distortion
Distortion is a variation in the lateral magnification for object points at different distances
from the optical axis (see Figure 1.10). If magnification increases with the object point
distance from the optical axis, the image of Figure 1.10a will look like Figure 1.10b. This
is called pincushion distortion. Conversely, if the magnification decreases with object point
distance, the image has barrel distortion (see Figure 1.10c).
Figure 1.10 – Images of distortion aberration: (a) a square; (b) pincushion distortion; (c) barrel
distortion.
1.4 Optical models of the eyes
Some of the greatest scientists have tried to explain and quantify the underlying optical
principles of the human eye. According to Helmholtz (1867)⁴, Kepler (1602) was the first to
understand that the eye formed an inverted image on the retina, and Scheiner (1625) demonstrated this experimentally in human eyes. Newton (1670) was the first to consider that the
human eye suffered from chromatic aberration, and Huygens (1702) actually built a model
eye to demonstrate the inverted image and the beneficial effects of spectacle lenses. Astigmatism was observed by Thomas Young (1801) in his own eye, and spherical aberration was
measured by Volkman (1846).
As we said above, there are circumstances which require a model that includes all the
eye's main characteristics without being either computationally expensive or too complex to
be utilized. Numerous model eyes, referred to as schematic eyes, have been developed during the last 150 years to satisfy a variety of needs.
Schematic eye models which can reproduce optical properties from anatomy are especially useful. They can be employed to help the design of ophthalmic or visual optics, to
simulate experiments, or to better understand the role of the different optical components.
Furthermore, schematic eye models are necessary to estimate basic optical properties of the
eye like focal length.
In some cases, the model of the eye is aimed primarily at anatomical accuracy, whereas
other models have made no attempt to be anatomically correct or even to have any anatomical structure at all. There are many reasons why anatomically correct models are useful.
As a biological organ, it would be impossible to understand the physiology, development,
and biological basis of the eye’s optical properties without an accurate anatomical model.
Moreover, in the world of refractive surgery, an anatomically correct model is a vital tool
⁴ "Treatise on Physiological Optics" of Hermann Ludwig Ferdinand von Helmholtz (1821-94) is widely
recognized as the greatest book on vision. This classic work is one of the most frequently cited books on the
physiology and physics of vision. His summa (in three volumes) transformed the study of vision by integrating
its physical, physiological and psychological principles.
Figure 1.11 – Emsley’s standard reduced 60-diopter eye.
for planning operations designed to modify the eye’s optical properties before operating on
the patient.
1.4.1 Gullstrand's schematic model
The first quantitative paraxial model eye was developed by Listing (1851), whose model
employed a series of spherical surfaces separating homogeneous media with fixed refractive
indices.
Emsley (1952) [44] produced a variant of the Listing model, which has been widely
cited. The Emsley standard reduced 60-diopter eye is one of the simplest models of the
eye and is most often used in ophthalmic education (see Figure 1.11). It contains a single
refracting surface and only one index of refraction mismatch between the air and the vitreous
humor. The axial distance from the cornea to the retina is 22.22 mm. Due to its simple
structure, accommodation calculations cannot be done with this model. The two principal
points and two nodal points are combined into single principal and nodal points (H and N).
The corneal surface of 60 D power encompasses the separate refractions of the corneal and
lens interfaces, representing the total focusing power for the eye.
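The two figures quoted above are mutually consistent, as a quick check shows (assuming, as in the standard reduced eye, an index of 4/3 for the single medium behind the refracting surface):

# Emsley's reduced eye: a single 60 D surface separating air (n = 1) from a
# uniform medium of index n' = 4/3.  For a distant object, the image forms at
# the back focal distance f' = n'/P behind the surface (see equation (1.6)).
P = 60.0                      # diopters
n_prime = 4.0 / 3.0
f_back_mm = n_prime / P * 1000.0

print(round(f_back_mm, 2))    # 22.22 mm, matching the quoted axial length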
Currently, the most widely used model eye is the one designed by Gullstrand⁵ [1]. In this
simplified model for the human eye, shown in Figure 1.12, the lens is considered to have
an average refraction index of 1.413, and the axial distance from the cornea to retina is
23.89 mm. These two parameters ensure that entering parallel rays will focus perfectly
⁵ Gullstrand received the Nobel Prize for Physiology (1911) for his investigations on the dioptrics of the eye.
Figure 1.12 – Gullstrand’s three-surface reduced schematic eye.
on the retina. Gullstrand’s three-surface model allows for changes in the power of the lens
by adjusting the curvatures of the front and back surfaces. This attribute is very useful for
calculating image positions when the natural lens is removed, as it is during cataract surgery,
and replaced by an artificial fixed lens.
1.4.2 Chromatic and Indiana schematic eye
There is general agreement that chromatic aberration is one of the more serious aberrations
of the human eye. Despite the fact that ocular chromatic aberration was first described over
300 years ago by Newton when he explored chromatic dispersion by glass prisms, and despite the extensive studies of the last 50 years, there have been few attempts to model this
aberration. Few models have included dispersive media even though it has been known,
since Helmholtz, that, so far as chromatic aberration is concerned, the eye behaves as if it
were a volume of water separated from air by a single refracting surface.
The most recent work in this direction is by Thibos et al. [96]. In this research the
authors introduce a new model called Chromatic Eye. It is a reduced schematic eye containing a pupil and a single, aspheric refracting surface separating air from a chromatically
dispersive ocular medium.
This model was designed to accurately describe the eye’s transverse and longitudinal
chromatic aberration, while at the same time being free from spherical aberration. Comparisons can be made between Emsley’s reduced eye in Figure 1.11 and the proposed Chromatic
Eye shown in Figure 1.13. The main innovation in Thibos' model is the new refracting
surface. What is a sphere in Emsley's model is a prolate spheroid in the Chromatic Eye;
this corresponds, respectively, to a circle and an ellipse when viewed as a cross section.

Figure 1.13 – General form of the Chromatic Eye.

Figure 1.14 – General form of the Indiana Eye.
The results of the experiments conducted by Thibos [96] (and subsequently confirmed by
Doshi et al. [25]) show that the data are accurately described by a reduced-eye optical model
which has a refractive index that changes more rapidly with the wavelength than does the
model filled with water. A comparison with all other chromatic models clearly shows that
the description of chromatic focus error is improved simply by changing the ocular medium
while leaving all other aspects of the model unchanged, especially at shorter wavelengths.
As we can see in Figure 1.13, there are no particular relationships assumed between the
optical axis, the visual axis, the fixation axis, or the achromatic axis of the eye. Angle α
specifies the location of the fovea relative to the model’s axis of symmetry, which will affect
the magnitude of off-axis aberrations present on the fovea. For polychromatic light, angle ψ
and the closely related angle φ are even more important for assessing the quality of foveal
images. This is because ψ and φ are sensitive to the misalignment of the pupil relative to the
visual axis of maximum neural resolution.
However, for many engineering and scientific applications, the Chromatic Eye has too
many degrees of freedom to be useful. What is needed is an even simpler model constrained
by empirical data from typical human eyes. A simplified model, which Thibos et al. subsequently introduced and called the Indiana Eye [94], is shown in Figure 1.14. The primary
simplifying assumption of the Indiana Eye is that, on average, the pupil is well centered on
the visual axis (i.e. achromatic and visual axes coincide, so ψ = φ = 0). This assumption is
based on studies which have shown that, although individual pupils may be misaligned from
the visual axis by as much as 1 mm, the statistical mean of angle ψ in the population is not
significantly different from zero in either the horizontal or vertical meridians [87, 89, 95].
1.5 Eye movements
Our eyes are never completely at rest. Even when we are fixating upon a visual target, small
involuntary eye movements continuously perturb the projection of the image on the retina⁶.
Eye movements can be characterized in three classes: abrupt, smooth, and fixational.
They serve three basic functions: stabilization of the image on the retina as the head moves,
fixation and pursuit of particular objects, and convergence of the visual axes on a particular object.
Image stabilization is achieved by the vestibuloocular response (VOR), a conjugate eye
movement evoked by stimuli arising in the vestibular organs as the head moves. When the
eyes are open, the VOR is supplemented by optokinetic nystagmus (OKN) which is evoked
by the motion of the image of the visual scene. These involuntary responses were the first
types of eye movements to evolve.
Eye movements for fixation and voluntary pursuit of particular objects evolved in animals with foveas. Voluntary rapid eye movements allow the gaze to move quickly from one part of the visual scene to another. Voluntary pursuit maintains the image of a particular object on the fovea.
⁶ A striking demonstration of how small eye movements affect visual fixation is available on the world wide web at http://www.visionlab.harvard.edu/Members/ikuya/html/memorandum/VisualJitter.html (Murakami and Cavanagh, 1998). It is surprising that we are able to perceive a stable world despite such movements. Does visual perception rely on some type of stabilization mechanism that discounts visual changes induced by fixational eye movements, or, on the contrary, are these movements an intrinsic functional component of visual processes?
Figure 1.15 – Human eye movements (saccades) recorded during an experiment in which a subject was requested to freely watch a natural scene. A trace of eye movements recorded by a DPI eyetracker (see Paragraph 5.2) is shown superimposed on the original image. The panel on the bottom right shows a zoomed portion of the trace in which small fixational eye movements are present. The color of the trace represents the velocity of eye movements (red: slow movements; yellow: fast movements). Blue segments mark periods of blink.
In the third basic type of eye movement, the eyes move through equal angles in opposite
directions to produce a disjunctive movement, or vergence. In horizontal vergence, each
visual axis moves within a plane containing the interocular axis. In vertical vergence, each
visual axis moves within a plane that is orthogonal to the interocular axis. The eyes also
move in opposite directions around the two visual axes. Horizontal vergence occurs when a
person changes fixation (under voluntary control) from an object in one depth plane to one
in another depth plane.
1.5.1 Abrupt movements
Abrupt movements, also called saccades, are the fastest eye movements and are used to bring new details of the visual field onto the fovea. They can be either aimed or suppressed according to instructions (voluntary), or triggered by the sudden appearance of a new target in the visual field (involuntary). These movements are so fast that there is usually
no time for visual feedback to guide the eye to its final position. Saccades are therefore also called ballistic movements, because the ocular motor system must calculate the muscular activation pattern in advance in order to throw the eye exactly into the selected position.
Most saccades have short duration (20–100 ms) and high velocity (20–600° · s⁻¹). Under natural conditions, with the subject moving freely in normal surroundings, Bahill et al. [9] observed that 85% of natural saccades have amplitudes of less than 15°.
The activity pattern that drives the eye to the requested position, also called the pulse, has to be followed by a lower, steady activity that holds the eye in the final position. This activity is called the step signal. If pulse and step do not match perfectly (the end of the first with the beginning of the second), a slow movement, called a glissade, brings the eye to its final position. This sort of movement can be very long, taking up to one second. If the pulse is too long with respect to the step, the requested rotation of the eye will be followed by an overshoot and a subsequent glissadic movement that holds the eye in the new position. If, on the other hand, the pulse is too small in relation to the step, an undershoot with glissadic behavior will follow the saccade.
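The pulse-step interplay can be illustrated with a minimal numerical sketch. The first-order plant below, its 150 ms time constant, and all command values are illustrative assumptions, not a physiological model of the saccadic system.

```python
import numpy as np

def simulate_saccade(pulse_level, pulse_ms, step_level, tau_ms=150.0,
                     dt_ms=1.0, total_ms=400.0):
    """First-order oculomotor plant driven by a pulse-step command.

    The plant obeys tau * d(theta)/dt = u(t) - theta, with u(t) equal to
    `pulse_level` during the pulse and `step_level` afterwards.  A pulse that
    is too large (or too long) for the step produces an overshoot followed by
    a slow glissadic drift back; a pulse that is too small produces an
    undershoot followed by a slow drift forward.
    """
    n = int(total_ms / dt_ms)
    theta = np.zeros(n)                              # eye position (deg)
    for i in range(1, n):
        u = pulse_level if i * dt_ms <= pulse_ms else step_level
        theta[i] = theta[i - 1] + dt_ms / tau_ms * (u - theta[i - 1])
    return theta

# Matched pulse-step: the eye lands near 10 deg and stays there.
matched = simulate_saccade(pulse_level=60.0, pulse_ms=30.0, step_level=10.0)
# Oversized pulse: overshoot to roughly 22 deg, then a glissadic return to 10 deg.
overshoot = simulate_saccade(pulse_level=120.0, pulse_ms=30.0, step_level=10.0)
print(round(matched[-1], 1), round(overshoot.max(), 1))
```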
1.5.2 Smooth movements
Smooth movements have considerably lower velocity and longer duration. Usually these movements last more than 100 ms and have velocities of up to 30–100° · s⁻¹. They can be distinguished according to how information coming from the retina is used: conjugate movements occur when both eyes are guided in the same direction, while in disjunctive smooth eye movements the eyes are guided in opposite directions.
Even though no single criterion allows a direct separation of the different aspects of smooth movement, conjugate smooth eye movements have been classified in humans. Smooth pursuit conjugate movements are performed during voluntary tracking of a small stimulus. They have a maximum velocity of 20–30° · s⁻¹ and are considered voluntary because the task requires visual feedback. Vestibular smooth conjugate movements are due to a short reflex arc in the part of the cortex devoted to vestibular afferents. They can be observed by rotating the head in darkness, that is, without any visual reference. These movements can have a latency (the delay between head rotation and eye movement) of 15 ms and a velocity of up to 500° · s⁻¹. Optokinetic smooth conjugate movements are involuntary smooth eye movements produced by rotating very large stimuli around a stationary subject who is instructed to stare ahead and not to track. These movements can reach a velocity of 80° · s⁻¹.
Figure 1.16 – Fixational eye movements and drifts recorded during an experiment in which a subject was requested to look at a specific point on the screen. The figure shows magnified details of fixational eye movements around a target whose size is only 30 pixels.
1.5.3 Fixational eye movements
During visual fixation, there are three main types of eye movements in humans: microsaccades, drifts, and tremor.
Microsaccades are small, fast (up to 100° · s⁻¹), jerk-like eye movements that occur during voluntary fixation. They move the retinal image across a range of several dozen to several hundred photoreceptors and are about 25 ms in duration. Microsaccades cannot be defined on the basis of amplitude alone, as the amplitude of voluntary saccades can be as small as that of fixational microsaccades. Microsaccades have been reported in many species; however, they seem to assume an important role in species with foveal vision (such as monkeys and humans). The role of microsaccades in vision is not clear at all; scientists even argue about their very existence [58]. Recently, Rucci [82] showed that the presence of this kind of movement improves discrimination with short stimulus presentations (500 ms), while under conditions of visual stabilization (obtained with recording instruments able to eliminate fixational eye movements) discrimination is significantly lower.
Drifts occur simultaneously with tremor and are slow motions of the eye that occur during the epochs between microsaccades. During drifts, the image of the object being fixated can move across a dozen photoreceptors. Initially, drifts seemed to be random motions of the eye generated by the instability of the oculomotor system. However, drifts were later found to have a compensatory role in maintaining accurate visual fixation in the absence of microsaccades, or when compensation by microsaccades was relatively poor.
Tremor is an aperiodic motion of the eyes with frequencies of about 90 Hz. Being the smallest of all eye movements (tremor amplitudes are about the diameter of a cone in the fovea), tremor is difficult to record accurately because it falls within the range of the recording system's noise. The purpose of tremor in the vision process is unclear. It has been argued that tremor frequencies are much greater than the flicker fusion frequencies in humans⁷, so that the tremor of the visual image might be ineffective as a stimulus. Recent studies, however, indicate that tremor frequencies can be quite low, below the flicker fusion limit. Moreover, early visual neurons can follow high-frequency flicker that is above the perceptual threshold for flicker fusion, so it is possible that even high-frequency tremor is adequate to maintain activity in the early visual system, which might then lead to visual perception. Tremor is generally thought to be independent in the two eyes. This imposes a physical limit on the ability of the visual system to match corresponding points in the two retinas during stereovision.
1.5.4 Coordinate systems for movements
The center of rotation of an eye is not at the center of the eye and is not fixed in reference
to the orbit [76]. In other words, an eye translates a little as it rotates. For most purposes,
however, it can be assumed that the human eye rotates about a fixed center 13.5 mm behind
the front surface of the cornea. The direction of the gaze is specified with respect to the
median and transverse planes of the head. The straight-ahead position, or primary position,
of an eye is not easy to define precisely because the head and eye lack clear landmarks. For
most purposes, the primary position of an eye may be defined as the direction of gaze when
the visual axis is at right angles to the plane of the face. An eye moves from the primary
⁷ The flicker fusion threshold is a concept in the psychophysics of vision. It is defined as the frequency at which all flicker of an intermittent light stimulus disappears. Like all psychophysical thresholds, the flicker fusion threshold is a statistical rather than an absolute quantity: there is a range of frequencies within which flicker sometimes will be seen and sometimes will not, and the threshold is the frequency at which flicker is detected on 50% of trials. The flicker fusion threshold varies with brightness (it is higher for brighter lights) and with the location on the retina where the light falls: the rods have a faster response than the cones, so flicker can be seen in peripheral vision at higher frequencies than in foveal vision. Flicker fusion is important in all technologies for presenting moving images, nearly all of which depend on presenting a rapid succession of static images (e.g. the frames in a film or a digital video file). If the frame rate falls below the flicker fusion threshold for the given viewing conditions, flicker will be apparent to the observer, and movements of objects on the film will appear jerky. For the purposes of presenting moving images, the human flicker fusion threshold is usually taken as 16 Hz; the frame rate used in cine projection is 24 Hz, and video displays run at up to 160 Hz.
Figure 1.17 – Axis systems for specifying eye movements.
position into a secondary position when the visual axis moves from the primary position in
either a sagittal or a transverse plane of the head. An eye moves into a tertiary position when
the visual axis moves into an oblique position.
Three different coordinate systems are employed to report eye movements:
• Helmholtz's system: In this system the horizontal axis about which vertical eye movements occur is fixed to the skull. The vertical axis about which horizontal movements occur rotates gimbal fashion about the horizontal axis and does not retain a fixed angle to the skull. The direction of the visual axis is expressed in terms of elevation (λ) and azimuth (µ) (see Figure 1.18a). Torsion is a rotation of the eye about the visual axis with respect to the vertical axis of eye rotation.
• Fick's system: In the Fick system, the vertical axis is assumed to be fixed to the skull, and the direction of the visual axis is expressed in terms of latitude (θ) and longitude (φ). Torsion is a rotation of the eye about the visual axis with respect to the horizontal axis of eye rotation. The Fick system is the Helmholtz system turned to the side through 90° (see Figure 1.18b).
• Perimeter system: The perimeter system uses polar coordinates based on the primary
axis of gaze (the axis straight out from the eye socket and fixed to the head). Eye
positions are expressed in terms of angle of eccentricity of the visual axis (π) with
respect to the primary axis, and of the meridional direction (κ) of the plane containing
the visual and primary axes with respect to the horizontal meridian of head-fixed polar
coordinates.
Figure 1.18 – (a) In the Helmholtz system the horizontal axis is fixed to the skull, and the vertical
axis rotates gimbal fashion; (b) In the Fick system the vertical axis is fixed to the
skull.
These three systems are the same coordinate system, simply anchored to the head in different
ways. A specification of eye position can be transformed between the three systems by the
following equations:
tan λ = tan θ / cos φ = sin κ · tan π
sin µ = sin φ · cos θ = sin π · cos κ
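As a minimal numerical check of these identities, the sketch below converts a gaze direction from Fick angles to Helmholtz and perimeter coordinates by going through the gaze unit vector. The axis conventions (x to the right, y up, z along the primary axis) are our own assumptions for illustration, not a convention taken from the cited literature.

```python
import numpy as np

def fick_to_all(theta, phi):
    """Convert a gaze direction given in Fick angles (latitude `theta`,
    longitude `phi`, in radians) into Helmholtz and perimeter coordinates.

    The conversion goes through the gaze unit vector (x right, y up,
    z along the primary axis), which makes the identities above easy to verify.
    """
    # Gaze unit vector corresponding to the Fick angles.
    x = np.cos(theta) * np.sin(phi)
    y = np.sin(theta)
    z = np.cos(theta) * np.cos(phi)

    lam = np.arctan2(y, z)       # Helmholtz elevation (lambda)
    mu = np.arcsin(x)            # Helmholtz azimuth (mu)
    ecc = np.arccos(z)           # perimeter eccentricity (pi)
    kappa = np.arctan2(y, x)     # perimeter meridian (kappa)
    return lam, mu, ecc, kappa

theta, phi = np.radians(20.0), np.radians(15.0)
lam, mu, ecc, kappa = fick_to_all(theta, phi)
# Both identities hold up to floating-point error:
assert np.isclose(np.tan(lam), np.tan(theta) / np.cos(phi))
assert np.isclose(np.tan(lam), np.sin(kappa) * np.tan(ecc))
assert np.isclose(np.sin(mu), np.sin(phi) * np.cos(theta))
assert np.isclose(np.sin(mu), np.sin(ecc) * np.cos(kappa))
```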
Listing proposed another coordinate system where any rotation of an eye occurs about an
axis in a plane known as Listing’s plane. Helmholtz called this Listing’s law. Listing’s
plane is fixed with respect to the head and coincides with the equatorial plane of the eye
when the eye is in its primary position. Elevations and depressions of the eye occur about a
horizontal axis in Listing’s plane, lateral movements occur about a vertical axis, and oblique
movements occur about intermediate axes. More precisely, any unidirectional movement of
an eye can be described as occurring about an axis in Listing’s plane that is orthogonal to
the plane within which the visual axis moves. The extent of an eye movement is the angle
between the initial and final directions of gaze (the change of the angle of eccentricity π).
The direction of an eye movement is the angle between the meridian along which the visual
axis moves and a horizontal line in Listing’s plane (δ or its supplement κ).
Chapter 2
Monocular depth vision system
Contents
2.1 Monocular static depth cues
    2.1.1 Perspective
          Size
          Linear perspective
          Height
          Texture perspective
    2.1.2 Interposition
          Interposition using parts of the same object
          Interposition using different objects
          Accretion and deletion
    2.1.3 Lighting
          Shading
          Shadows
    2.1.4 Accommodation using image blur
    2.1.5 Aerial effects
2.2 Monocular dynamic depth cues
    2.2.1 Motion parallax
    2.2.2 Optical flow
2.3 Types of cue interaction
Humans use a wide variety of depth cues that can be combined with cognitive strategies depending upon the visual circumstances. Without a sense of depth we would be greatly impaired, since this perception allows us to perceive and represent many aspects of our environment.
Figure 2.1 – Binocular disparity according to the displacement of the two eyes from the cyclopean axis.
However, the perceptual problem of recovering the distance to a surface is that depth perception from 2-D images is inherently ambiguous. This is because the optics of surface reflection project light from a 3-D world onto a 2-D surface at the back of the eye, and such a projection can be inverted in an infinite number of ways. The question, therefore, is which sources of information in the environment allow us to perceive such a vital piece of information as distance. These sources of information are called depth or distance cues.
Nature addressed the problem in living beings by creating a stereo rather than a monocular structure of visual information. This vision process is called stereopsis: the perception of the relative distance to objects based on their lateral displacement in the two retinal images.
Stereopsis is possible because we have two laterally separated eyes whose visual fields overlap in the central region of vision. Because the two eyes differ in position, the two retinal images are slightly different. This relative lateral displacement is called binocular disparity, and it arises when a given point in the external world does not project to corresponding positions on the left and right retinas (see Figure 2.1).
Obviously, stereoscopic vision poses some problems, the main one being how to measure the direction and the amount of disparity between corresponding image features in the two retinal images.
But we cannot measure this disparity if we cannot determine which features in the left retinal image correspond to which features in the right one. This is called the correspondence problem, for obvious reasons. Many solutions to this problem have been proposed; the first interesting and well-known theory was devised by David Marr and Tomaso Poggio in 1977 [70].
Clearly, binocular depth perception is not the only way to perceive distance. Even
though a classification of such a complex system is only partially possible, three families
of cues may be distinguished (see Figure 2.2):
• Primary depth cues that provide direct depth information, such as convergence of the
optical axes of the two eyes, accommodation, and unequivocal disparity cues
• Secondary depth cues that may also be present in monocularly viewed images. These
include shading, shadows, texture gradients, motion parallax, occlusion, 3-D interpretation of line drawings, structure and size of familiar objects
• Cues to flatness, inhibiting the perception of depth. Examples are frames surrounding
pictures, or the uniform texture of a poorly resolved CRT-monitor
All these cues can be monocular or binocular, depending on the number of visual sources involved. Some of them, called dynamic depth cues, are obtained only when the projection of the environment on the retina is moving. Others, called static depth cues, can be extracted even from static scenes.
2.1 Monocular static depth cues
As mentioned before, depth cues depend strictly on how many visual sources are involved in the process and on how they change on the retina. Considering the nature of this work, we will focus mainly on monocular static depth cues: a family of cues that can be extracted from a static scene (like a picture) and that convey an enormous quantity of information to the visual system.
2.1.1 Perspective
In general, the surface on which objects are represented in perspective is a vertical plane. Nevertheless, representations can also be obtained on slanted planes (in certain architectural representations), horizontal planes (ceilings), or cylindrical surfaces (panoramas).
Figure 2.2 – Monocular depth cues.
Figure 2.3 – Example of depth perception using perspective cues.
Figure 2.4 – Object, Object space, Display, Observer.
Figure 2.5 – (a) one-point; (b) two-point; (c) three-point perspective.
Geometrically, perspective is a conic projection that involves four principal elements: the object space, the object to be represented (lying in the object space), the display, and the observer (see Figure 2.4).
Usually the object space is represented by X, Y, and Z Cartesian coordinates. The display, presumed vertical, transparent, and parallel to the X, Y plane, is interposed between the object and the observer, who is imagined standing erect on a horizontal plane and viewing with monocular vision.
Light rays are represented by straight lines emanating from the points to be shown. In this context, the representation of a point is the intersection of the light ray from that point with the display. Perspective is normally classified into one-point, two-point, or three-point perspective.
Let the object be a rectangular solid with three axes A, B, and C. When each object axis is parallel to one of the space coordinate axes (see Figure 2.5a), we have one-point perspective. When only one object axis is parallel to a space coordinate axis, we have two-point perspective (as shown in Figure 2.5b). When no object axis is parallel to a space coordinate axis, we have three-point perspective (Figure 2.5c).
Perspective in general is not a cue in itself; rather, it collects a group of cues that use perspective principles as a source of information.
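The conic projection described above can be made concrete with a minimal sketch. The pinhole-style construction below (eye at the origin, display plane at Z = d, all numbers purely illustrative) only shows how image size falls off with distance; it is not a model used elsewhere in this work.

```python
import numpy as np

def project_point(p, d=1.0):
    """Conic (perspective) projection of a 3-D point onto the display plane.

    The observer's eye is placed at the origin, looking down the +Z axis, and
    the display is the plane Z = d, parallel to the XY plane, as in the
    construction of Figure 2.4.  The image of a point is the intersection of
    the light ray through the point and the eye with the display.
    """
    X, Y, Z = p
    if Z <= 0:
        raise ValueError("point must lie in front of the observer")
    return np.array([d * X / Z, d * Y / Z])

# Two vertical segments of equal physical height: the farther one projects
# smaller, which is exactly the size-related perspective cue discussed next.
h_near = (project_point((0.5, 1.0, 2.0)) - project_point((0.5, 0.0, 2.0)))[1]
h_far = (project_point((0.5, 1.0, 8.0)) - project_point((0.5, 0.0, 8.0)))[1]
print(h_near, h_far)   # 0.5 versus 0.125: image size falls off as 1/Z
```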
Size
Size information is one of the most controversial cues in the science of vision. Many works have debated which size features are important in this cue, but the answers are still controversial and not exhaustive.
The size signal of an image can be either dynamic or static, depending on the kind of information extracted. This information can be used as a depth cue in three ways:
• An image changing in size can give the impression of an approaching or receding
object (see Paragraph 2.2.1). Naturally, the observer assumes that the surface of the
object is not actually varying in size
• The relative sizes of the images of simultaneously or successively presented objects
indicate their relative depths if the observer considers that they are the images of the
same object and, therefore, of the same size
• The size of the image of an isolated familiar object indicates the distance of the object
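The third case above reduces, under a pinhole approximation, to a single ratio. The sketch below is a hypothetical illustration: the card size, image size, and focal length are invented numbers, not measurements taken from the studies discussed here.

```python
def distance_from_familiar_size(assumed_size_m, image_size_m, focal_length_m):
    """Distance at which an object of the assumed physical size would have to
    be placed to produce the measured image size: d = f * S / s
    (pinhole / thin-lens approximation)."""
    return focal_length_m * assumed_size_m / image_size_m

# A playing card assumed to be 8.9 cm tall, imaged 1.5 mm tall through a
# 17 mm focal length, is judged to lie about one metre away.
print(distance_from_familiar_size(0.089, 0.0015, 0.017))   # ~1.01 m
```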
Ittelson [51] presented playing cards of half, normal, and double size to subjects, one card at a time, at the same distance, and in dark surroundings. He concluded that the perceived depth of a familiar object is “[...] that distance at which an object of physical size equal to the assumed-size would have to be placed in order to produce the given retinal image”. Hochberg and Hochberg [47] showed that Ittelson had not properly controlled the effects of the relative size of the three playing cards: three blank cards might have produced the same result. Ono [43] found that when photographs of a golf ball and a baseball were presented at the same angular size, the baseball appeared more distant. It seems, therefore, that estimating the distance of single familiar objects follows cognitive processes rather than only perceptual afferents [2].
Gogel [104] used a procedure designed to tap a purely perceptual mechanism. In his experiments he argued that the partial effect of familiar size under reduced conditions is due to the tendency to perceive objects at a default distance, which he called the specific-distance tendency. However, Hastorf [4] found that distance estimates of a disc were influenced by whether subjects were told that it was a ping-pong ball or a billiard ball.
Epstein and Baratz [30] discovered that only long-term familiarity is effective as a relative depth cue. In other experiments, Gogel and Mertens [39] used a wide variety of familiar objects and also produced evidence of distance scaling by familiar size.
Under such assumptions, depth perception using size cues can be a very complex process in which not only pure visual perception is involved but also high-level cognitive processes that constrain information about size and the dynamics of size. For instance, it is quite common to experience playing cards of different sizes, but rather uncommon to experience human figures that shrink or enlarge.
Linear perspective
Linear perspective is one of the basic perspective cues used by the brain to perceive depth.
Planar figures can be interpreted as a projection on the display of objects which lie in the
3-D object space. For instance, the impression of inclination produced by a 2-D trapezoid
(Figure 2.6) as a projection of a floor in the object space can be very strong.
Figure 2.6 – Example of linear perspective.
Figure 2.7 – The depth cue of height in the field.
Obviously, the evaluation of linear perspective depends on the observer's assumptions. In the case of the above example, the observer assumes that the shape of the image represents an inclined rectangle (a floor) rather than a frontal trapezoid. The impression of depth from linear cues can be easily overridden as soon as the observer uses other cues to obtain more depth information. For example, producing a motion parallax cue by moving the head laterally over a sufficient distance destroys the linear cue percept (for more details see [5]), because the perceived 2-D shape of the object should change according to the motion applied.
Height
The image of a single object located on a textured surface creates a strong impression of
motion depth. The effect of height in the field on perceived distance is stronger when the
stimulating objects are placed on a frontal-plane background containing lines converging to
a vanishing point. (see [27]).
A frontal background with a texture gradient produces a stronger effect than a simple
tapered black frame [101]. An object suspended above a horizontal textured surface appears
as if it is sitting on a point on the surface that is optically adjacent to the base of the object.
The nearer the object is to the horizon, the more distant the object appears (see Figure 2.7
for further investigations).
Figure 2.8 – Texture gradients and perception of inclination: (a) Texture size scaling; (b) Texture
density gradient; (c) Aspect-Ratio perspective.
Texture perspective
Depth perception using texture perspective supplies strong distance information when the reference surface of the object space is filled with some sort of texture. In these cases, the texture has to be homogeneous (texture elements are identical in shape, size, and density at all positions on the surface) and isotropic (texture elements are similar in shape, size, and density at all orientations). There are three types of texture perspective (see Figure 2.8):
• Texture size scaling: The image of any texture element shrinks in inverse proportion to the distance of the element from the nodal point of the eye. The greater the distance, the more the texture elements become graded in size (a minimal sketch of this relation follows the list).
• Texture density gradient: Images of texture elements become more densely spaced with increasing distance along the surface. The observer implicitly assumes that the texture on the surface is homogeneous.
• Aspect-ratio perspective: The image of an inclined 2-D shape is compressed in the direction of the projection, changing in this way the aspect ratio of the texture element. This indicates inclination only if the true shape of the single element is known. A gap in the texture gradient usually indicates the presence of a step, whereas a change in texture gradient associated with aspect ratio produces, instead, a change in slope.
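A hypothetical sketch of the size-scaling relation in the first item above: under the stated homogeneity assumption, size ratios map directly to depth ratios. The pixel sizes are invented for illustration.

```python
import numpy as np

def relative_depths_from_texture(sizes_px, reference_index=0):
    """Relative depths of texture elements from their image sizes.

    Under the homogeneity assumption (all elements share the same physical
    size), the image size of an element scales as 1/distance, so the ratio of
    a reference element's size to another element's size gives their depth
    ratio.  Depths are expressed relative to the reference element.
    """
    sizes = np.asarray(sizes_px, dtype=float)
    return sizes[reference_index] / sizes

# Elements shrinking from 40 px to 10 px suggest the farthest one is about
# four times as distant as the nearest (all numbers are made up).
print(relative_depths_from_texture([40, 32, 20, 10]))   # [1.  1.25 2.  4. ]
```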
2.1.2 Interposition
Interposition occurs when the whole or part of one object hides part of the same object or
another object. Therefore, this cue provides information about the depth order but not about
the magnitude of the distance.
Figure 2.9 – Example of an object that violates Jordan's Theorem.
Figure 2.10 – (a) Amodal completion in which the figures appear in a specific order. (b) Modal
completion effects using a Kanizsa triangle in which a complete white triangle is
seen. Each disc appears complete behind the triangle (amodal completion).
Interposition using parts of the same object
The cue that uses parts of the same object to obtain depth information strictly obeys Jordan’s
Theorem, which states that a simple closed line divides a plane into an inside and an outside.
In order to better understand the application of this theorem to interposition, let us give some
definitions. The locus of points where visual lines are tangential to the surfaces of the object
but do not enter the surface is the bounding rim of the object. Holes in the surface of the
object also produce a bounding rim. For all objects, the bounding rim projects a bounding contour
on the retina. A bounding contour is a closed line that represents the projection of the object
on the retina. It has a polarity, meaning that one side is inside the image of the object and
one side is outside the image of the object.
Jordan’s Theorem is strictly and implicitly embedded in the mechanism of the perceptual
system, which allows us to recognize the difference between the inside and the outside of
an object. In fact, we are able to recognize an object that violates Jordan's Theorem, as shown in Figure 2.9, even though we cannot explain why [23].
Interposition using different objects
This kind of interposition can involve either amodal or modal completion:
• Amodal completion: Overlapping objects usually produce images that lack information about the farther part of the bounding contour. Amodal completion occurs when an object (or more than one) is perceived as occluded by the closer of the shapes. This effect allows the far object to be perceived as having a continuous edge extending behind the nearer object (see Figure 2.10a).
• Modal completion: Many times we are able to perceive an object as complete even when parts of its boundary cannot be seen because the object has the same luminance, color, and texture as the background. In Figure 2.10b, we clearly see the white triangle even though there is no physical contrast between its borders and the surrounding region (see [55] for further details). This phenomenon is known as modal completion. Edges implicitly produced by this process are known as cognitive contours. The object to which the cognitive contours belong in modal completion is always perceived in the foreground.
Accretion and deletion
This cue is similar to modal completion but appears clearly when the surfaces involved move relative to each other. When the image of a textured object A moves with respect to the image of another textured object B, texture elements of B are deleted along the leading edge of A and emerge at the opposite edge. Even if object A has the same luminance, color, and texture as the textured background of B, object A is still completely visible, and it is seen in front of object B, which undergoes accretion and deletion of texture (see Figure 2.11).
2.1.3 Lighting
Another strong cue of depth information is lighting. There are two different phenomena
caught by the brain to understand 3-D scenes: shading and shadow.
Shading
Shading is the variation in irradiance from a surface due to changes in the orientation of the surface in relation to the incident light or in relation to variations in specularity¹. A smoothly curved surface produces gradual shading, while a sharp discontinuity in surface orientation creates shading with a sharp edge (see Figure 2.12). Shading alone is ambiguous as a depth cue. Other information that can help resolve the ambiguity includes the direction of illumination, the presence of other objects, occluding edges, and familiarity with the gazed object.
¹ A specular surface reflects light more in one direction than in other directions. The preferred direction is that for which the angle of reflection equals the angle of incidence. Specularity is a parameter that measures the degree of specular reflection of light from the surface.
Figure 2.11 – A Kanizsa triangle is not evident in each static frame but becomes visible in the dynamic sequence.
Figure 2.12 – Variations of convexity and concavity from shading.
Shadows
Shadow is a variation in irradiance on a surface caused by the obstruction of light by an opaque or semi-opaque object. A shadow cast by an object onto another object is known as a detached shadow. Shadows provide information about the dimensions of the object that casts the shadow only if the observer knows the direction of illumination and the nature and orientation of the surface upon which the shadow is cast. In particular, a detached shadow can be a rich source of information about the structure of 3-D surfaces. For instance, in Figure 2.13 the light gray squares appear to lie at different depths above the background because of the position of their shadows.
Figure 2.13 – The gray squares appear to increase in depth above the background because of the
positions of the shadows.
2.1.4 Accommodation using image blur
Accommodation is a change in the deformation of the crystalline lens that the eye carries out in order to focus on a gazed object. The absence of this control produces blur on the retina, which in turn causes a lack of sharpness in edge projections.
The eye uses this effect to control the magnitude and the sign of accommodation. The use of accommodation as a cue to absolute distance has gone through alternating periods of acceptance.
Descartes (1664), followed by Berkeley (1709), proposed that the act of accommodation itself aids the perception of depth. Wundt (1862), followed by Hillebrand (1894), found that people cannot judge the distance of an object on the basis of accommodation but can use changes in accommodation to judge differences in depth. More recently, Fisher and Ciuffreda [32] found that subjects estimating the distance of monocular targets by pointing to them with a hidden hand tend to overestimate distances smaller than 31 cm and to underestimate larger ones. Mon-Williams and Tresilian [72] found some correlation between accommodation, image blur, and vergence; however, the responses in their experiments were variable.
It is surely true that, if the sharpness of the edges under observation is known, static blur on the retina alone can be used as an unambiguous cue of relative depth. The act of changing accommodation between two objects at different distances may provide information about their relative depth. Dynamic accommodation may be more effective when many objects in different depth planes are presented at the same time.
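A thin-lens sketch of how defocus blur grows with the mismatch between the fixated plane and an object's distance. The focal length, pupil diameter, and distances below are loosely eye-like but purely illustrative; this is not an accurate optical model of the eye and is not used elsewhere in this work.

```python
def blur_circle_diameter(d_object_m, d_focus_m, focal_m, aperture_m):
    """Diameter of the blur circle for a point at distance `d_object_m` when
    the optics are focused at `d_focus_m` (thin-lens approximation).

    The image of the point converges at v = f*d/(d - f); the sensor sits at
    v_f = f*d_f/(d_f - f), so the defocused cone of light meets the sensor in
    a circle of diameter A * |v_f - v| / v.  A larger focus error gives more
    blur, which is the (sign-ambiguous) cue described above.
    """
    v = focal_m * d_object_m / (d_object_m - focal_m)
    v_f = focal_m * d_focus_m / (d_focus_m - focal_m)
    return aperture_m * abs(v_f - v) / v

# Illustrative numbers (17 mm focal length, 4 mm pupil), focused at 0.5 m:
for d in (0.3, 0.5, 1.0, 2.0):
    print(d, round(blur_circle_diameter(d, 0.5, 0.017, 0.004) * 1e6, 1), "um")
```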
2.1.5 Aerial effects
It seems that Leonardo da Vinci coined the term aerial perspective to describe the effect
produced by the atmosphere on distant objects.
Actually, the atmosphere affects the visibility of distant objects in two ways: through
optical haze, when rising convection currents of warm air refract light and introduce shimmer, and through mist, when light is absorbed and scattered by atmospheric dust and mist.
The effects of these phenomena can be measured with particular instruments and then used
to calculate distances between the position of the observer and distant points (for further
investigations and applications see [20]).
2.2 Monocular dynamic depth cues
As we have seen previously, some monocular cues can be either static or dynamic, like size modifications and accretion and deletion. Pure dynamic cues extract information not just from pictures or static images, but rather exploit the physiological movements, voluntary or not, of the image on the retina to capture other characteristics of the 3-D scene.
2.2.1 Motion parallax
Motion parallax is a cue based on continuous changes of the object's projection on the retina over time. For an object at a given distance and a given motion of the observer, the extent of motion parallax between that object and a second object is proportional to the depth between the objects. There are mainly three types of motion parallax:
• Absolute parallax: For a given magnitude of motion of the object or of the observer (what matters is the relative motion between observer and object), the change in visual direction of an object is approximately inversely proportional to its distance (for details see Chapter 4; a minimal numerical sketch follows this list)
• Looming: An approaching object, or a set of objects increasing in size, produces a strong perception of distance (see Paragraph 2.1.1). This type of image motion does not occur with parallel projections, for which no change in perspective is produced
• Linear parallax: When an object or set of objects translates laterally, or is viewed by a person moving laterally, the images of more distant parts move more slowly than those of nearer parts. Obviously, the same motion effect occurs when the two objects translate at the same velocity with respect to a stationary head
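A minimal numerical sketch of the absolute-parallax relation listed above, under a small-angle, no-eye-rotation assumption: a stationary point roughly straight ahead drifts at about v/d when the observer translates laterally at speed v. All numbers are invented for illustration, and this is not the eye-movement parallax model developed later in this thesis.

```python
import numpy as np

def depth_from_lateral_parallax(observer_speed_m_s, angular_speed_rad_s):
    """Distance estimate from absolute motion parallax.

    For a lateral translation of the observer at speed v, a stationary point
    roughly straight ahead sweeps across the visual field at an angular speed
    of about v / d, so d ~= v / omega.  Tracking the point with the eye would
    cancel this retinal motion, which is the ambiguity noted at the end of
    the optical flow discussion.
    """
    return observer_speed_m_s / angular_speed_rad_s

# An observer walking at 1.5 m/s sees one point drift at 0.5 rad/s and a
# second at 0.1 rad/s: the estimates place them at about 3 m and 15 m.
print(depth_from_lateral_parallax(1.5, np.array([0.5, 0.1])))
```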
2.2.2 Optical flow
During physiological movements of the eyes, the projection of the object onto the retina forms a motion pattern referred to as optical flow. This process produces visually perceived complex motion patterns, which combine object and retinal motions to provide an incredibly rich source of information regarding the dynamic 3-D structure of the scene.
Figure 2.14 – Differential transformations for the optical flow fields: (a) translation; (b) expansion; (c) rotation; (d) shear of type one; (e) shear of type two.
This knowledge is represented in the optical flow by a vector indicating velocity and direction for each point of the retinal image that has moved during a fixed interval. Since the object under observation does not present discontinuity points, but rather surfaces that are usually gradual², the optical flow field produced by the movements is spatially differentiable over local areas. Any field so generated can be decomposed into five local spatial gradients of angular velocity, known as differential velocity components. These components are (see Figure 2.14) translation, expansion or contraction, rotation, and two orthogonal components of deformation. The last three are known as differential invariants because they do not depend on the choice of coordinate system.
² They can be very sharp, but they never present discontinuity points on the surface.
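A minimal sketch of this decomposition: fit a local affine model to a sampled flow field and read the divergence, curl, and shear terms off the fitted gradients. The synthetic expanding field at the end is only a toy example; this does not reproduce any specific algorithm from the cited literature.

```python
import numpy as np

def flow_differential_components(x, y, u, v):
    """Fit a local affine model to a sampled flow field and return its
    differential components.

    Least-squares fit of u ~ u0 + ux*x + uy*y and v ~ v0 + vx*x + vy*y, then
      divergence = ux + vy   (expansion / contraction)
      curl       = vx - uy   (rotation)
      shear      = (ux - vy, uy + vx)   (the two deformation components)
    The last three quantities are the differential invariants of the text.
    """
    A = np.column_stack([np.ones_like(x), x, y])
    (u0, ux, uy), _, _, _ = np.linalg.lstsq(A, u, rcond=None)
    (v0, vx, vy), _, _, _ = np.linalg.lstsq(A, v, rcond=None)
    return {"translation": (u0, v0),
            "divergence": ux + vy,
            "curl": vx - uy,
            "shear": (ux - vy, uy + vx)}

# A purely expanding (looming) field: only the divergence term is non-zero.
xs, ys = np.meshgrid(np.linspace(-1, 1, 9), np.linspace(-1, 1, 9))
x, y = xs.ravel(), ys.ravel()
comps = flow_differential_components(x, y, 0.2 * x, 0.2 * y)
print(comps["divergence"], comps["curl"])   # ~0.4 and ~0.0
```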
Since the motion of the object projection on the retina depends directly on the distance
of the object from the observer (see Paragraph 2.2.1), and since the optical flow represents
this motion, it is straightforward that these fields can be used to extract depth information.
However, optical flow is not usually used alone to extract distance information, since it is relatively ambiguous in some common situations. Consider, for instance, two objects moving laterally on different parallel planes (parallel to the plane of the observer) with the same linear velocity. Due to linear parallax, the image of the nearer object moves faster than that of the farther object. If the observer assumes that they are moving at the same velocity, the faster object is surely the nearer one.
Although this phenomenon is perfectly capable of supplying depth-order information, it cannot be used when the eye tracks one of the two objects, since the image of the tracked object is steady on the retina and does not contribute enough information to the optical flow field. This means that the retinal velocity of an image does not necessarily indicate relative depth.
2.3 Types of cue interaction
As we have seen, there are a number of cues that the brain can use in order to obtain robust depth information. However, the visual system does not treat single cues as separate sources of information, but implements interactions that can improve the estimate of depth, even compensating for cues that are poor or noisy. Since all these sources bear upon the same perceptual interpretation of depth, they must somehow be put together into a coherent, consistent representation. Bülthoff and Mallot [16] suggested a classification of cue interactions based on five different categories:
• Accumulation: In this interaction, signals coming from different cue systems may be summed, averaged, or multiplied in order to improve the discrimination of depth. According to Bülthoff and Mallot, this interaction occurs at the output level of independent systems. The fundamental idea is that, for each point of the image (or of the representation of the image's depth map), different cues produce different estimates which are then integrated together³ (a minimal sketch of such weighted integration follows this list).
• Cooperation: This type of interaction is similar to accumulation. The main difference is that the cue systems involved interact at an earlier stage. This means that the different modules are not kept separate, but interact at several levels to arrive at a single coherent representation of the distance. In practice, different cues take advantage of specific sub-modules of other cues in order to reduce the number of circuits involved in computing the same function.
• Veto: This interaction occurs when two or more cues provide conflicting information. In this case, judgments may be based on only one cue, with the other being suppressed. A surprising example of veto in depth perception can be produced using a pseudoscope⁴. However, the depth perception modified by this instrument is not inverted as predicted, because the distance information can be overridden by monocular cues such as perspective, texture gradients, shadows, etc.
³ This kind of interaction has afterwards been called weak fusion [18, 59] because it assumes no interaction among the different information sources apart from interactions on the outputs of the modules.
⁴ This device reverses binocular disparity simply by swapping the images that are projected to the left and right eyes. Since the instrument reverses the horizontal disparities of everything in the image, objects that are closer should appear farther and vice versa.
• Disambiguation: One cue system can produce depth information that lacks precision or is too noisy to be used. Another cue may resolve the ambiguity by contributing additional information. For example, depth information generated by motion parallax can be ambiguous with respect to the sign of depth; this ambiguity can be resolved using interposition and shadow cues as sources of new information.
• Hierarchy: Information derived from one cue may be used as raw data for another one.
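As promised in the Accumulation item, here is a hypothetical sketch of output-level integration: per-cue estimates combined by inverse-variance weighting. Both the cue values and their variances are invented numbers, and this is only one simple reading of weak fusion, not the specific scheme of [16] or [18, 59].

```python
import numpy as np

def weak_fusion(estimates, variances):
    """Combine independent per-cue depth estimates by inverse-variance
    weighting: each cue module produces its own estimate and only the
    module outputs are merged."""
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(weights * estimates) / np.sum(weights))

# Three hypothetical cues voting on the depth of the same image point:
# texture says 2.0 m (noisy), motion parallax 1.6 m, familiar size 1.8 m.
print(weak_fusion([2.0, 1.6, 1.8], [0.40, 0.05, 0.10]))   # ~1.69 m
```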
The question of how different sources of distance information are merged into a single coherent depth image is very complex and not well understood so far. High-level processes as well as different low-level functions in the brain can reinforce or even interfere with the integration of the cues. This process happens, but no one yet knows where and how.
Chapter 3
Monocular depth perception in robotics
Contents
3.1 Passive depth finding
    3.1.1 Depth from camera motion
    3.1.2 Image information extraction
    3.1.3 Depth from camera parameter variations
    3.1.4 Depth from camera parameter variations due to movement
3.2 Active depth finding
As we considered in earlier chapters, the depth extraction that the visual system performs is really complex. This is not surprising: the eye is a biological system, with all its complexity and its millions of years of evolution, and it cannot easily be captured by a simple equation or model.
Evolutionarily speaking, distance perception is an important capability for survival. It would not be surprising to find that the evolutionary process performed some of its greatest feats to produce the mechanisms underlying depth perception.
Computer and robotic vision tries to imitate, simulate, or be inspired by this extraordinary performance. Robotic sensory perception is a key issue for immediate or short-range motion planning, that is, for reacting to the free space around a robot without requiring a predefined trajectory plan. Local navigation requires no environment model and relies entirely on sensory data.
There are many different approaches being studied. These approaches depend very much on the outcome of the image feature extraction algorithms, which are also computationally demanding. In this context, all types of distance extraction algorithms can be categorized into five classes according to the type of interaction that they have with the environment to be explored.
Active methods, for instance, are depth finders that use laser, ultrasound, or radar waves; passive methods include image information extraction, depth from camera motion, depth from camera parameter variations, and depth from camera parameter variations due to movement.
In general, passive methods, which do not require the production of any sort of beamed energy, have a wider range of applications. This is because the analysis of energy reflections in a composite environment is usually so complex that it produces a degree of ambiguity too high for any application. Moreover, the possible presence of other kinds of energy sources can invalidate or jam any direct reading.
In robotics, several studies have taken an active approach to the visual evaluation of distance (see for example [75, 84, 14, 34, 92, 93]). In most of these studies, a motion of the sensor according to the movement of the robot (or of the camera itself) produces a subsequent motion of the projection of the observed scene on the sensor, which is processed with different approaches such as optical flow, feature correspondence, spatio-temporal gradients, etc.
In this chapter we report on the range of different approaches that can be found in the robotic literature. However, none of these approaches has taken inspiration from the human eye by considering the combination of the parallax effect and eye movements in order to determine depth information.
3.1 Passive depth finding
This class includes the majority of depth processing or extraction algorithms. In this class, no energy is released by the measuring agent.
3.1.1 Depth from camera motion
Mobile robots are among the most challenging problems in robotics. Conceptually, robotic agents that are able to perform complex navigation tasks in an unknown environment are important because their realization implies interactions among different cognitive modules.
A fundamental capability is the extraction of distance information for obstacles present in the agent's path. To solve this problem, many approaches have been developed that exploit variations of the camera position and orientation.
In this class of algorithms, a motion of the sensor according to the movement of the robot consequently produces a motion of the projection of the observed scene on the sensor, which can be processed to extract distance information. For almost all possible applications, these algorithms have to run on-line in order to drive the movements of the camera.
Murphy [75] proposes a military application that acquires the distance from a moving vehicle to targets in video image sequences. Murphy presents a real-time depth detection system that finds the distances of objects through a monocular vision model, employing a simple geometric triangulation model of the vehicle movements and some explanatory considerations. The algorithm can be used with a camera mounted either at the front or at the side of a moving vehicle, and the corresponding geometric model covers both the front and the side view. To solve the correspondence problem, a real-time matching algorithm is introduced that improves the matching performance by several orders of magnitude. The author uses the intensity feature to match corresponding pixels in two adjacent image frames. In order to obtain accurate and efficient matching, he derives a number of motion heuristics, including maximum velocity change, small change in orientation, coherent motion, and continuous motion.
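The bare geometry shared by this family of methods can be sketched as follows. This is a generic pinhole triangulation for a camera translating along its optical axis, not Murphy's actual formulation, and the pixel values are invented.

```python
def depth_from_forward_motion(x_before, x_after, translation_m):
    """Depth of a static point from its image displacement when the camera
    translates forward by `translation_m` along its optical axis.

    With a pinhole model, x = f*X/Z before the move and x' = f*X/(Z - t)
    after it, which gives Z = x' * t / (x' - x) for the depth at the time of
    the first frame.
    """
    dx = x_after - x_before
    if dx <= 0:
        raise ValueError("a static point must move outward as the camera advances")
    return x_after * translation_m / dx

# A feature 100 px from the image centre drifts to 110 px after the vehicle
# advances 1 m: the point was roughly 11 m away (illustrative numbers).
print(depth_from_forward_motion(100.0, 110.0, 1.0))   # 11.0
```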
Sandini [84] suggests that a tracking strategy, which is possible in case of egomotion,
can be used in the place of a matching procedure. The basic differences between the two
computational problems arise from the fact that the motion of the camera can be actively
controlled in case of ego-motion, whereas in stereo vision it is fixed by the geometry. For
Sandini, the continuous nature of motion information is used to derive the component of
the two-dimensional velocity vector perpendicular to the contours, while the knowledge of
the egomotion parameters is used to compute its directions. By combining these measures,
an estimate of the optical flow can be obtained for each image of the sequence. In order to
facilitate the measurement of the navigation parameters within this framework, a constrained egomotion strategy was adopted in which the position of the fixation point is stabilized during navigation (in an anthropomorphic fashion). This constraint reduces the dimensionality of the parameter space without increasing the complexity of the equations, and allows the position of objects in space to be derived from image positions and optical flow.
Beß et al. [14] present a clear example of interpolation preprocessing, in which image information is replaced with a less complex representation. The technique is part of a system computing depth from monocular image sequences. Taking a sequence of different views from a camera mounted on a robot hand, each pair of consecutive images is treated as a stereo pair. The advantage of the segmentation approach, which employs sequences of straight line segments and circular arcs, is straightforward: the number of primitives representing the image information is reduced significantly, lowering the computational effort of the correspondence problem as well as the frequency of erroneous matches. Starting from matched pairs of primitives, a disparity image is computed containing the initial disparity values for a subsequent block matching algorithm. The depth images computed from these stereo pairs are then fused into one complete depth map of the object surface.
So far, conventional approaches to recovering depth from images in which motion is present (either induced by egomotion or produced by moving objects) have involved obtaining two or more images, applying a reduction of information through filtering or skeleton segmentation, and solving the correspondence problem. Unfortunately, the computational complexity involved in feature extraction and in solving the correspondence problem makes existing techniques unattractive for many real-world robotics applications in which real-time operation is often a given constraint.
To avoid the correspondence problem altogether, Skiftstad [88] developed a depth recovery technique that does not include the computationally intensive feature selection steps required by conventional approaches. The Intensity Gradient Analysis (IGA) technique is a depth recovery algorithm that exploits the properties of the MCSO (moving camera, stationary object) scenario. Depth values are obtained by analyzing temporal intensity gradients arising from the optical flow field induced by known camera motion, which causes objects to displace in known directions. The basic idea is that if an object displaces exactly one pixel, the intensity perceived at the location the object moves into must equal the intensity perceived at the location the object moved out of before the displacement took place. This one-pixel displacement can be considered a disparity equal to one. IGA thus avoids the feature extraction and correspondence steps of conventional approaches and is therefore very fast.
Fujii [34] presents a Hypothesize-and-Test approach to the problem of constructing a depth map from a sequence of monocular images. This approach only requires knowledge of the axial translation component of the moving camera. When the configuration of
the camera is known with respect to the mobile platform, this information can be readily
obtained by projecting displacement data from a wheel encoder or range sensors onto the
focal axis. Instead of directly calculating the depth to the feature points, the algorithm first
hypothesizes that there is a pair of feature points which have the same depth. Based on this
hypothesis, the depths are calculated for all pairs of feature points found in the image. As
the robot moves, the relative location of two points changes in a specific and predictable
manner on the image plane if they are actually at the same depth (in other words, if the
hypothesis is correct). The motion of each pair of points on the image plane is observed,
and if it is consistent with the predicted behavior, the hypothesis is accepted. Accepted pairs
create a graph structure which contains depth relations among the feature points.
3.1.2 Image information extraction
These approaches usually extract distance information from the image structure itself, for example through the scattering of light, Fourier analysis, optical flow, etc. The kind of information extracted in these cases is not the depth itself, but some sort of indirect information that can be related in some way to the distance. These algorithms can be considered pure computer vision, because they do not interact with the environment at all, and they can normally be performed off-line.
There are many different examples of this kind of algorithm in the computer vision literature, but in this review we will consider only a few of them. They are mainly characterized by two different sources of information: sequences of pictures, which offer time as an additional dimension, and single shots.
A common approach is to apply a transformation that preserves the information content of the image but can subsequently be manipulated more easily. Often such a transformation uncovers properties and features that can be used directly to extract the desired information.
Torralba and Oliva [98] propose that a source of information for absolute depth estimation can be based on the whole scene structure, without relying on specific objects. They demonstrate that, by recognizing the properties of the structures present in a single image, they can infer the scale of the scene and, therefore, its absolute mean depth. Using a probabilistic learning approach based on Fourier transforms of the image, they show that the general aspect of some spectral signatures changes with depth, along with the spatial configuration of orientation and scale. What they obtain is a system designed to learn the relationship between structure and depth. In this way, pictures can be divided into semantic classes that can be used in specific tasks. A significant example is the estimation of the size of human heads according to the perspective and the scale of the picture.
Ravichandran and Trivedi [81] propose a technique in which motion information is extracted by identifying the temporal signature associated with the textured objects in the scene. The authors present a computational framework for motion perception in which the analysis is carried out using a spatiotemporal frequency (STF) domain approach. Their work is based on the fact that a translation in spatiotemporal space causes the corresponding Fourier transform to be skewed in 3-D space, so that it can be expressed in terms of the motion parameters.
In this STF analysis approach, a temporal filter with a non-symmetric and periodic transfer function is used to obtain a linear combination of the sequence of image frames with
moving patterns. A localized Fourier transform is then computed to extract motion information. When the observer (or the camera) moves, motion is induced in the scene. If the
optical parameters (like relative position, orientation, focal length etc.) of the camera are
known along with the motion characteristics of the image, the disparity value calculated can
be used to extract distance information.
There are algorithms, even among those employing the same kinds of transformations mentioned above, that rely on superimposed heuristics. These heuristics, or rules, serve to filter and validate the hypotheses produced by the underlying system. They are usually related to physical constraints of the system and of the objects in the scene.
The work by Guo et al. [83] describes an approach for recovering the structure of a moving target from a monocular image sequence. Assuming that the camera is stationary, the authors first use a motion algorithm to detect moving targets based on four heuristics derived from the properties of moving vehicles: maximum velocity, small velocity changes, coherent motion, and continuous motion. Subsequently, a second algorithm estimates the distance of the moving targets by using an over-constrained approach. They have applied the approach to monocular image sequences captured by a moving camera to recover the 3-D structure of stationary targets such as trees, telephone poles, etc.
Adaptive algorithms cover another important class of algorithms in machine vision. Essentially, these approaches try to interpolate an unknown function in the output space by drawing upon features present in the images used during the training period. Which features, and how many, sometimes remain unresolved questions. Since the training process can also be very expensive in terms of computational power, these kinds of approaches are normally divided into two distinct phases: the first is the training, while the second is the field deployment.
An example of an adaptive algorithm is presented by Marshall [52]. This approach describes how a self-organizing neural network (SONN) can learn to perform the high-level vision task of discriminating visual depth by using motion parallax cues. This task is carried out without the assistance of an external teacher supplying the correct answers. A SONN can acquire sensitivity to visual depth as a result of exposure, during a developmental period, to visual input sequences containing motion parallax transformations.
Sometimes the absolute distance information is not available or is not required by the problem to be solved. In these cases depth information is represented by interposition cues (see Paragraph 2.1.2), for which the order of the objects present in the scene must be extracted.
Although Bergen et al. [12] focus their attention on the discovery of object boundaries,
they also present a technique that can be used to extract relative depth information of the objects present in an image. They introduce a detailed analysis of the behavior of dense motion
estimation techniques at object boundaries. Their analysis reveals that a motion estimation
error is systematically present and localized in a small neighborhood on the occluded side of
the objects. They show how the position of this error can be used as a depth cue by exploiting the erroneous motion measurement density to determine what type of discontinuity they
are observing. Intensity discontinuities have essentially three origins: object boundaries,
surface marks (for instance, different reflective properties on an object’s surface), and illumination discontinuities. For each of these types the error density is different and can thus be classified. The straightforward application of this observation is then the retrieval of object boundaries and occlusions.
Fourier analysis, motion parallax, and focus are not the only sources of depth information. Some types of aberration of the optical system, as well as natural phenomena, can be exploited to extract distance information that is intrinsically dependent on the phenomenon itself.
For example, in the human visual system, there is a considerable longitudinal chromatic
aberration¹, so that the retinal image of an edge will have color fringes: red fringes for
under-accommodation (focus behind the retina) and blue fringes for over-accommodation
(focus in front of the retina).
Fincham [29] showed, using achromatizing lenses and monochromatic light, that accommodation is impaired when chromatic aberration is removed, and suggested that a chromatic
mechanism, sensitive to the effects of chromatic aberration, could provide a directional cue
for accommodation.
Garcia [38] presents chromatic aberrations as a source of visual depth information. This system takes three images of the same scene, each with a different focal length. Using these images and the proposed mathematical model, the authors found that the spread function of the blur is inversely proportional to the distance of the object.
Cozman and Krotkov [20] present the first analysis of the atmospheric scattering² phenomenon from an image. They investigate a group of techniques for extraction of depth
cues solely from the analysis of atmospheric scattering effects in images. Depth from scattering techniques are discussed for indoor and outdoor environments, and experimental tests
with real images are presented. They found that depth cues in outdoor scenes can be recovered with surprising accuracy and can be used as an additional information source for
autonomous vehicles (see Paragraph 2.1.5 for further details).
3.1.3 Depth from camera parameter variations
This class of algorithms collects all those systems that require variations of the camera parameters, such as focus, contrast, etc. They use the effect that varying these parameters has on the image formed on the sensor in order to obtain depth information. These kinds of algorithms involve interaction between the camera and the algorithm, and they are often performed in real time.
Honig [48] proposed an example of this class of algorithm. The depth perception presented in this work is based on a method that exploits the physical fact that the imaging
properties of an optical system depend upon the acquisition parameters and the object distance. The author suggests an edge-oriented approach which takes advantage of the structure of the visual cortex suggested by Marr and Hildreth [69].

¹As we have seen, chromatic aberration is the variation in the focal length of a lens with respect to the wavelength of the light that strikes it.
²Light power is affected when it crosses the atmosphere; there is a simple, albeit non-linear, relationship between the radiance of an image at any given wavelength and the distance between object and viewer. This phenomenon is called atmospheric scattering and has been extensively studied by physicists and meteorologists.
Honig’s basic principle is to compare blur in two defocused images of the same scene
taken with different apertures (depth from defocus). Modeling the optics as a linear system, the blurring process can therefore be interpreted as filtering the sharp image with a defocus operator. The problem of depth sensing is thus reduced to identifying the defocus operator in the image.
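As a rough illustration of this principle (not Honig's actual implementation), the following Python sketch models defocus as attenuation of high spatial frequencies and compares the local high-frequency energy of two registered images of the same scene taken at different apertures; the resulting map is a proxy for the relative blur that a depth-from-defocus method would calibrate against distance. The Gaussian defocus model and all names are assumptions made purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def relative_blur_map(img_small_aperture, img_large_aperture, patch=16):
    """Compare local high-frequency energy of two registered images of the
    same scene taken with different apertures. Regions where the wide-aperture
    image loses more energy are more defocused, hence farther from the focal
    plane (a proxy a depth-from-defocus method would calibrate)."""
    h, w = img_small_aperture.shape
    blur = np.zeros((h // patch, w // patch))
    # High-pass responses: the Laplacian emphasizes the spatial frequencies
    # most attenuated by the defocus operator.
    hp_sharp = laplace(img_small_aperture.astype(float))
    hp_blurred = laplace(img_large_aperture.astype(float))
    for i in range(h // patch):
        for j in range(w // patch):
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            e_sharp = np.sum(hp_sharp[ys, xs] ** 2) + 1e-9
            e_blurred = np.sum(hp_blurred[ys, xs] ** 2)
            blur[i, j] = 1.0 - e_blurred / e_sharp   # larger -> more defocus
    return blur

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.random((128, 128))                 # synthetic textured scene
    near = gaussian_filter(scene, 0.5)             # almost in focus everywhere
    far = gaussian_filter(scene, 2.5)              # strongly defocused region
    wide = np.hstack([near[:, :64], far[:, 64:]])  # two depth regions
    print(relative_blur_map(near, wide).round(2))  # right half shows more blur
```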
3.1.4 Depth from camera parameter variations due to movement
This class of on-line algorithms is quite similar to the previous one, the only difference being that the camera parameters are not changed directly but through a movement that modifies them indirectly. In the robotic literature, the most common approach is the measurement of the image blur due to camera movements.
One example can be found in the work carried out by Lee [60]. In this approach, a
method of constructing 3-D maps is presented based on the relative magnification and blurring of a pair of images, where the images are taken at two camera positions of a small
displacement. Due to this displacement, the depth information is captured not only by blurring but also by magnification. That is, an image captured by the camera at a certain position
can be considered as a blurred version of its well-focused image or the convolution of the
well-focused image and the blurring function. The authors also consider that these two images have a different magnification factor, which can be described in terms of the relationship between two well-focused images. The correspondence between the well-focused images and the two images taken at different camera positions can be expressed based on the Taylor series expansion, but only when the distance between the two camera positions is small enough. In this way, the relationship between the two images from the two camera
positions, taking into account both blurring and magnification effects, can be formulated
in the frequency domain. The ratio between the Fourier transform of two images can be
represented in terms of the magnification and blurring factors which contain the depth information that can be extracted. The method, referred to here as Depth from Magnification and
Blurring, aims to generate a precise 3-D map of local scenes or objects to be manipulated
by a robot arm with a hand-eye camera. The method uses a single standard camera with a
telecentric lens and assumes neither active illumination nor active control of camera parameters. Fusing the two disparate sources of depth information, magnification and blurring,
the proposed method provides a more accurate and robust depth estimation.
Another approach in which blurred images are used can be found in [56]. This research
presents an approach for the determination of depth as a function of blurring for automated
visual inspection in VLSI wafer probing. There exists a smooth relationship between the
degree of blur and the distance of a probe from a test-pad on a VLSI chip. Therefore, by
measuring the amount of blurring, the distance from contact can be estimated. The effect of
blurring on a point-object is then studied in the frequency domain, and a monotonic
relationship is found between the degree of blur and the frequency content of the image.
Fourier feature extraction, with its inherent property of shift-invariance, is utilized to extract
significant feature vectors, which contain information on the degree of blur and, hence, the
distance from the probe. In this case, the authors employed ANNs to map these feature
vectors onto the actual distances. The network is then used in the recall mode to linearly
interpolate the distance corresponding to the significant Fourier features of a blurred image.
3.2 Active depth finding
In contrast to the passive methods for extracting distance information from single or coupled images, there are the active depth finding strategies. This class of algorithms includes all kinds of
systems that actively produce a controlled energy beam and that measure the reflection of
this energy provoked by the environment.
These approaches generally include all types of sonic, ultrasonic and laser systems, and
are used for robotic exploration and guidance. The analysis performed by some of them not
only extracts distance, but also measures the whole structure (or part of it) of the environment under observation.
In these techniques, the scene is lit by a light usually produced with a laser beam and a device able to spread the beam across the scene (usually a rotating cylindrical lens). This beam, which has the shape of a sheet, is moved across the scene, producing a single light stripe for each position. This stripe is acquired by a camera, recorded and then processed either on-line or off-line (see [80] for further investigations).
The class of time-of-flight algorithms can be considered the oldest and simplest approach to the problem of distance measurement. The three main physical phenomena employed are ultrasonic sources, radar, and laser sources.
Using one of these three types of energy, the time of flight between the emission of an impulse (a pressure wave for ultrasound, electromagnetic radiation for radar and laser) and the recording of its echo by a coaxial sensor is the parameter used to estimate depth information. No image analysis is involved, nor are assumptions concerning the planar or other properties of the objects in the scene relevant.
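The underlying computation is elementary: the measured round-trip time is converted to range using the propagation speed of the chosen energy. A minimal illustrative sketch (with assumed propagation speeds) follows.

```python
# Minimal time-of-flight range computation (illustrative sketch only).
# The emitted pulse travels to the target and back, so the one-way
# distance is half the propagation speed times the measured interval.

SPEED_OF_LIGHT = 299_792_458.0   # m/s (laser, radar)
SPEED_OF_SOUND = 343.0           # m/s in air, roughly at room temperature

def range_from_time_of_flight(round_trip_time_s: float, speed_m_s: float) -> float:
    """Distance to the reflecting surface from the echo round-trip time."""
    return 0.5 * speed_m_s * round_trip_time_s

if __name__ == "__main__":
    print(range_from_time_of_flight(5.83e-3, SPEED_OF_SOUND))   # ~1.0 m (ultrasound echo)
    print(range_from_time_of_flight(6.67e-9, SPEED_OF_LIGHT))   # ~1.0 m (laser pulse)
```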
Laser exploration of the environment can be performed using two types of laser range finder, both dependent upon the time of flight to and from an object point whose range is sought. The first kind
measures the phase shift in a continuous wave modulated laser beam as it leaves the source
and returns to the detector coaxially. The second measures the time a laser pulse takes to go
from the source, bounce off a target point, and return coaxially to a detector. Some interesting works are presented in [65, 54, 7, 86].
Since the description of these works exceeds the purpose of this thesis, we leave any
further detail to the specific interest of the reader. Many works use ultrasound range finders;
one interesting work is proposed by Santos [85] in which ultrasound sensors are coupled
with Artificial Neural Networks (ANNs) in order to pre-process range information in a consistent fashion.
Although active depth finding can be, in many cases, more precise and accurate than the
passive approaches, laser techniques are restricted by fog, rain, and variations in lighting,
and ultrasound devices are disrupted by wind, temperature variations, and severe absorption over distance. For this reason, radar techniques for outdoor applications have been introduced.
An interesting work can be found in [73]. It is based on Frequency Modulated Continuous Wave (FMCW) radar and is intended for applications in mobile outdoor robotics, such as positioning an agricultural machine to reduce the impact of spreading practices on the environment. The authors show that the spectral analysis methods used to process the radar signals provide an accuracy of a few centimeters.
Chapter 4
The model
Contents
4.1 Eye movement parallax on a semi-spherical sensor
4.2 Eye movement parallax on a planar sensor
4.3 Theoretical parallax of a plane
4.4 Replicating fixational eye movements in a robotic system
Parallax effects observed in sensor rotations (either a human eye or a digital camera) depend
on the characteristics of the optical system in which the sensor is incorporated and on the
observed object. In fact, the movement of the projection of the object is strictly related not
only to the optical system and its inner structure (position of the cardinal points, magnification power, etc.), but also to the relative position of this system with respect to the center of
rotation of the whole system (sensor and optical system).
Figure 4.1 illustrates how parallax emerges during the rotation of a semi-spherical sensor
that is similar to the human eye. It can be observed how the projections of the points A and B depend on both the rotation angle and the distance to the nodal point.
4.1 Eye movement parallax on a semi-spherical sensor
In order to evaluate the projection of the whole object space on the retina we use a modified
version of the eye model proposed by Gullstrand [1]. This model (see figure 4.2a) is characterized by two nodal points arranged at 7.08 mm and 7.33 mm from the corneal vertex.
The lens has an average index of refraction of 1.413, and the axial distance from the cornea
to retina is 24.4 mm (all measurements are considered on an unaccommodated eye). These
parameters ensure that parallel rays entering will focus perfectly on the retina. The original
Figure 4.1 – Parallax can be obtained when a semi-spherical sensor (like the eye) rotates. A projection of the points A and B depends both on the rotation angle and the distance to the
nodal point. The circle on the right highlights how a misalignment between the focal
point (for the sake of clarity, we consider a simple lens with a single nodal point) and
the center of rotation yields a different projection movement according to the rotation
but also according to the distance of A and B.
model does not include the radius of the ocular bulb; however, a value of 11 mm is the accepted consensus in the vision literature [40].
Using Gullstrand’s eye approximation, let A be a pinpoint light source (see figure 4.2b)
located in the object space at coordinates (−da sin α, da cos α). The projection x̃a of A on
the retina is generated by the ray of light AN₁ and verifies the following condition:

$$\begin{vmatrix} x & y & 1 \\ -d_a \sin\alpha & d_a \cos\alpha & 1 \\ 0 & N_1 & 1 \end{vmatrix} = 0$$

which has the following solution:

$$x(d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_1 \sin\alpha = 0 \qquad (4.1)$$
Since a ray of light entering in the first nodal point with a certain angle α̃ emerges from
the second nodal point with the identical angle in a two nodal point optical system (see
Figure 4.2 – (a) Gullstrand’s three-surface reduced schematic eye. (b) Geometrical assumptions and
axis orientation for a semi-spherical sensor. An object A is positioned in the space at
distance da from the center C of rotation of the sensor, and with an angle α from axis
y; N1 and N2 represent the position of the two nodal points in the optical system. The
object A projects an image xa on the sensor and its position is indicated by θ.
Paragraph 1.2.2 for further details), the actual projection xa of A on the sensor belongs to
the line outgoing from the second nodal point N2 and parallel to AN₁:

$$x(d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_2 \sin\alpha = 0 \qquad (4.2)$$
Equation (4.2) can be expressed in polar coordinates:
$$r = \frac{d}{\cos(\theta - \gamma)} \quad\text{where}\quad d = \frac{\left| -d_a N_2 \sin\alpha \right|}{\sqrt{N_1^2 + d_a^2 - 2 N_1 d_a \cos\alpha}} \qquad (4.3)$$

and

$$\gamma = \frac{\pi}{2} - \tilde\alpha \qquad (4.4)$$

The angle α̃ is generated by the intersection of the line AN₁ with the axis y:

$$\tilde\alpha = \arctan\left( -\frac{d_a \sin\alpha}{d_a \cos\alpha - N_1} \right) \qquad (4.5)$$
The projection of A on the retina is given by the intersection of (4.3) with the surface of the
sensor having equation r = R:
Figure 4.3 – (a) Variations of the projection point on a semi-spherical sensor obtained by (4.7)
considering an object at distance da = 500 mm, N1 = 6.07 mm and N2 = 6.32 mm;
(b) detail of the rectangle in Figure (a): two objects at distance da = 500 mm and
da = 900 mm produce two slightly different projections on the sensor.
$$\theta = \arccos\left( \frac{d}{R} \right) - \tilde\alpha + \frac{\pi}{2} \qquad (4.6)$$

Substituting in (4.6) the variables from (4.4) and (4.5), we obtain:

$$\theta = \arccos\left( \frac{\left| -d_a N_2 \sin\alpha \right|}{R \sqrt{N_1^2 + d_a^2 - 2 N_1 d_a \cos\alpha}} \right) + \arctan\left( \frac{d_a \sin\alpha}{d_a \cos\alpha - N_1} \right) + \frac{\pi}{2} \qquad (4.7)$$
where θ represents the projection angle of the pinpoint light source A on the sensor, considering the cardinal points of the optical system and the physical position of A.
According to Cumming [15], the primary visual cortex of primates is more specialized for processing horizontal disparities. For this reason we considered only horizontal displacements in the model. Nevertheless, since all cardinal points are located on the visual axis, the vertical component of the displacements can be obtained, without losing correctness, simply by rotating the reference frame around the axis y.
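For reference, equation (4.7) can be transcribed directly into a few lines of code. The following Python sketch is illustrative only: the function name is ours and the numerical values are those quoted in the caption of Figure 4.3. It evaluates the projection angle θ for two sources at different distances, making the small parallax between them explicit.

```python
import numpy as np

def retinal_projection_angle(d_a, alpha, N1, N2, R):
    """Projection angle theta of a pinpoint source on the semi-spherical
    sensor, equation (4.7). Distances in mm, angles in radians."""
    d = abs(-d_a * N2 * np.sin(alpha)) / np.sqrt(
        N1**2 + d_a**2 - 2 * N1 * d_a * np.cos(alpha))
    alpha_tilde = np.arctan(-d_a * np.sin(alpha) /
                            (d_a * np.cos(alpha) - N1))      # equation (4.5)
    return np.arccos(d / R) - alpha_tilde + np.pi / 2        # equation (4.6)

if __name__ == "__main__":
    # Two sources at the same eccentricity but different distances: the small
    # difference between the two theta values is the parallax used by the model.
    for d_a in (500.0, 900.0):
        theta = retinal_projection_angle(d_a, np.radians(5.0),
                                         N1=6.07, N2=6.32, R=11.0)
        print(d_a, np.degrees(theta))
```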
4.2 Eye movement parallax on a planar sensor
When the sensor is planar, the geometry of the problem is shown in figure 4.4. Let us
consider two rays of light outgoing from the pinpoint light source A and hitting the planar
sensor in two different points: a projection x̃a is generated by the ray passing through the
point N1, and a projection x̂a is obtained by the ray passing through the point C. The ray AN₁ is described as follows:
$$x(d_a \cos\alpha - N_1) + y\, d_a \sin\alpha - d_a N_1 \sin\alpha = 0 \qquad (4.8)$$
As mentioned before, a ray of light entering the first nodal point with a certain angle α̃ emerges from the second nodal point with the identical angle in this optical system. The ray parallel to AN₁ emerging from N2 cuts the planar sensor, which has equation y = −dc, at the intersection described as follows:
$$x_a = \frac{d_a f \sin\alpha}{d_a \cos\alpha - N_1} \qquad (4.9)$$
where f = dc + N2 . From Figure 4.4b, we can also calculate x̂a which is equal to:
$$\hat x_a = d_c \tan\alpha \qquad (4.10)$$
Figure 4.4 – (a) Schematic representation of the right eye of the oculomotor system with its cardinal
points. The L-shaped support places N1 and N2 before the center of rotation C of the
robotic system. (b) Geometrical assumptions and axis orientation for a planar sensor.
An object A is positioned in the space at a distance da from the center C of rotation of
the sensor with an angle α from axis y; N1 and N2 represent the position of the two
nodal points in the optical system. The object A projects an image xa on the sensor;
x̂a represents the projection of A when the distance between N1 and C is considered null.
The previous equation can be considered the projection of object A on the sensor without
the contribution of the parallax effect, that is, when the point C and the nodal points are
coincident. It can be noticed that (4.10) supplies no information about the absolute distance
of the object but describes only an infinite locus of the possible positions of A.
Equation (4.9) can be simply inverted obtaining da as absolute distance information:
$$d_a = \frac{x_a N_1}{x_a \cos\alpha - f \sin\alpha} \qquad (4.11)$$
Observing the previous equation, it can be noticed that in order to calculate the correct distance of the pinpoint light source A from the center of rotation C, two parameters are required: the sensor projection xa of A and its angle α from the visual axis (coincident with y)¹. Although the value of α in (4.11) is not directly known, it is possible to perform an indirect measurement of this parameter by considering different projections of the object A on the sensor.

¹In a planar sensor, the absolute parallax |xa − x′a| cannot be directly employed to determine the distance information da without taking into account the angle α. In fact, let us consider two different objects A and B at the same distance from the center of rotation C, but at different eccentricities αA > αB (smaller values of α correspond to a smaller eccentricity). When a rotation occurs, the object projections move on the surface of the sensor not only according to the distance da but also according to the distance N2xa and the angle α′. Therefore, a projection moving in the periphery of the sensor produces a bigger parallax than a projection moving in the center.
Figure 4.5 – Curve A represents variations of the projection xa on a planar sensor obtained
by (4.9), considering an object at distance da = 500 mm, N1 = 49.65 mm and
N2 = 5.72 mm (see Paragraph 5.1.3 for further details); Curve B represents the same
optical system configuration but with the object at distance da = 900 mm. Curve C
represents variations of x̂a using (4.10) when the nodal points N1 and N2 collapse in
the center of rotation of the sensor. This curve does not change with the distance.
Rotating the reference axis around the point C by a known angle ∆α, and taking into account that this rotation of the reference frame does not change the absolute distance da (as shown in Figure 4.6), it is possible to use (4.11) to write:
$$\frac{x_a N_1}{x_a \cos\alpha - f \sin\alpha} = \frac{x'_a N_1}{x'_a \cos(\alpha + \Delta\alpha) - f \sin(\alpha + \Delta\alpha)}$$
where x′a is the new projection of the object A in the rotated reference frame. Inverting the previous relationship for α, we obtain:

$$\alpha = \arctan\left( \frac{x'_a \left[ 1 - \cos\Delta\alpha \right] + f \sin\Delta\alpha}{f \left[ \frac{x'_a}{x_a} - \cos\Delta\alpha \right] - x'_a \sin\Delta\alpha} \right) \qquad (4.12)$$
Figure 4.6 – The clockwise rotation of the system around the point C leaves unchanged the absolute
distance da , and allows a differential measurement of the angle α (the angle before the
rotation) to be obtained using only ∆α and the two projections xa (before the rotation),
and x′a (after the rotation).
which relates α only to the projections on the sensor before and after a rotation ∆α.
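A compact numerical sketch of this two-step inversion, first α from (4.12) and then da from (4.11), could read as follows. The forward model (4.9) is used only to generate synthetic test projections; the parameter values (borrowed from the calibration in Section 5.1.3) and all function names are illustrative assumptions.

```python
import numpy as np

def project(d_a, alpha, N1, f):
    """Forward model (4.9): projection x_a of a source at distance d_a and
    eccentricity alpha on the planar sensor (with f = d_c + N2)."""
    return d_a * f * np.sin(alpha) / (d_a * np.cos(alpha) - N1)

def recover_depth(x_a, x_a_rot, d_alpha, N1, f):
    """Equation (4.12) gives alpha from the projections measured before (x_a)
    and after (x_a_rot) a known rotation d_alpha; equation (4.11) then gives
    the absolute distance d_a."""
    num = x_a_rot * (1.0 - np.cos(d_alpha)) + f * np.sin(d_alpha)
    den = f * (x_a_rot / x_a - np.cos(d_alpha)) - x_a_rot * np.sin(d_alpha)
    alpha = np.arctan(num / den)
    d_a = x_a * N1 / (x_a * np.cos(alpha) - f * np.sin(alpha))
    return alpha, d_a

if __name__ == "__main__":
    N1, N2, d_c = 49.65, 5.72, 10.52          # calibrated values, Section 5.1.3
    f = d_c + N2
    d_true, alpha_true, d_alpha = 500.0, np.radians(4.0), np.radians(3.0)
    x_a = project(d_true, alpha_true, N1, f)
    x_a_rot = project(d_true, alpha_true + d_alpha, N1, f)
    alpha_hat, d_hat = recover_depth(x_a, x_a_rot, d_alpha, N1, f)
    print(np.degrees(alpha_hat), d_hat)       # recovers ~4.0 degrees and ~500 mm
```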
4.3 Theoretical parallax of a plane
Let us consider in Figure 4.7 a plane P at distance D before the sensor. Applying (4.9) to a
generic point A on the plane and considering the distance D and the cardinal points of the
optical system, it is possible to calculate the theoretical value of the parallax after a rotation
∆α of the reference frame. Considering the fact that da = D / cos α, equation (4.9) becomes:
$$x_a = \frac{D f}{D - N_1} \tan\alpha \qquad (4.13)$$
The parallax of a point A on the plane, ∆x = xa − x′a, is therefore:
$$\Delta x = \frac{D f}{D - N_1} \left[ \tan\alpha - \tan(\alpha + \Delta\alpha) \right] \qquad (4.14)$$
Figure 4.7 – Geometrical assumptions and axis orientation needed to calculate the theoretical parallax. Considering the plane P at distance D from the center of rotation C of the
camera, it is possible to calculate the expected parallax.
Figure 4.8 reports different parallax values for planes at different distances D. The asymmetry in the curve for a distance D = 300 mm is due to the movement of the point of the
plane either toward or against the rotation of the sensor. In fact, during an anticlockwise
rotation, all points on the plane change the value of α but not their da . Points which have
α > 0 before the rotation decrease this value during rotation, and they project onto a more
central area of the sensor. This central area, with the same rotation ∆α, produces a smaller
value of parallax. For those points which have α < 0 the value of α increases during rotation and these points project onto a more peripheral area of the sensor, with consequently
greater values of parallax.
Therefore, a smaller value of parallax on the left side of the curve indicates an anticlockwise rotation; on the other hand a smaller value of parallax on the right side of the curve
indicates a clockwise rotation.
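These qualitative observations can be checked by tabulating equation (4.14) directly. The short sketch below is illustrative only: the default optical parameters are the calibrated values reported later in Section 5.1.3, and the chosen eccentricities and distances merely mimic the conditions of Figure 4.8.

```python
import numpy as np

def plane_parallax(D, alpha_deg, delta_alpha_deg, N1=49.65, f=16.24):
    """Theoretical parallax (4.14) of a point of a fronto-parallel plane at
    distance D, seen at eccentricity alpha, after a rotation delta_alpha.
    Defaults: calibrated values of Section 5.1.3 (f = d_c + N2)."""
    a, da = np.radians(alpha_deg), np.radians(delta_alpha_deg)
    return D * f / (D - N1) * (np.tan(a) - np.tan(a + da))

if __name__ == "__main__":
    for D in (300.0, 500.0, 900.0):
        left = plane_parallax(D, -10.0, 3.0)
        right = plane_parallax(D, +10.0, 3.0)
        # The parallax magnitude differs between the two sides of the plane,
        # and the asymmetry is more pronounced for shorter distances D.
        print(D, round(left, 3), round(right, 3))
```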
Figure 4.8 – Horizontal theoretical parallax for planes at different distances using a rotation of
∆α = 3.0◦ anticlockwise. Although asymmetry, due to the rotation, is present on all
the curves, this is more evident for shorter distances.
4.4 Replicating fixational eye movements in a robotic system
We have seen in Paragraph 1.5 that there are many human eye movements which can be
classified by magnitude and speed. However, due to issues such as the relatively small dimensions of the retina, the response of mechanical parts to stress and movements, and the
minuteness of the actuators, it is not possible, technologically, to reproduce the whole spectrum of eye movements. Moreover, the human retina and the camera of the robot have very
different resolutions, and the corresponding optical systems have cardinal points located in
considerably different locations. Despite these issues, theoretical considerations can offer
the chance to replicate some of these movements, namely saccades and fixational eye movements.
We leave the treatment of the simulation of saccadic movements in the robot to Paragraph 5.4.2. On the other hand, the chance to replicate fixational eye movements is directly offered by comparing projections measured in pixels on the planar sensor with the corresponding photoreceptors on the retina.
We have seen before (see discussion of hyperacuity phenomenon) that humans are able
to resolve points on the retinal projection of the world much beyond the physical dimension
of the photoreceptors. In contrast, the resolving power of the robot is directly limited by the cell size of the sensor grid.
Consequently, considering the difference between the actual resolution calculated using
the dimension of physical photoreceptors and the resolution achieved in hyperacuity tasks,
we introduce the concept of a virtual photoreceptor, whose dimension is about 30 times smaller than that of the photoreceptors present in the fovea (7 arcsec [105] for two-point acuity).
Let us consider the projection of a ray on the semi-spherical sensor which hits a specific
virtual photoreceptor. A small rotation ∆αs moves this projection to the very next virtual
photoreceptor.
Similarly, a rotation ∆αp moves a projection from one pixel to the next one. It is possible
therefore to calculate a constant which allows us to relate ∆αs and ∆αp by considering the
difference of a single sensitive cell (both virtual photoreceptor and pixel).
Let us consider a fixational eye movement ∆α which translates the projection of the
pinpoint light source A from the perpendicular position on the sensor to a new position.
Since the angle ∆α can be considered small in fixational eye movements, we can calculate
the variation of the retinal projections considering a simpler approximation of (4.7). In fact,
from Figure 4.9, we can deduce the following geometrical considerations:
$$h \sin\Delta\tilde\alpha = R \sin\Delta\theta, \qquad h \cos\Delta\tilde\alpha = N_2 + R \cos\Delta\theta \;\;\Rightarrow\;\; \tan\Delta\tilde\alpha = \frac{R \sin\Delta\theta}{N_2 + R \cos\Delta\theta}$$

and the following:

$$d_a \sin\Delta\alpha = \tilde d_a \sin\Delta\tilde\alpha, \qquad d_a \cos\Delta\alpha = N_1 + \tilde d_a \cos\Delta\tilde\alpha \;\;\Rightarrow\;\; \tan\Delta\tilde\alpha = \frac{d_a \sin\Delta\alpha}{d_a \cos\Delta\alpha - N_1}$$
Figure 4.9 – Geometrical assumptions and axis orientation for a small movement. The sensor S is
no longer considered as a single continuous semi-spherical surface, but it is quantized
according to the dimension of the virtual photoreceptors.
which can be compared:
$$\frac{R \sin\Delta\theta}{N_2 + R \cos\Delta\theta} = \frac{d_a \sin\Delta\alpha}{d_a \cos\Delta\alpha - N_1} \qquad (4.15)$$
Obviously (4.15) is still valid for every angle, but it cannot easily be inverted to obtain ∆θ. Yet, considering that ∆θ and ∆α are small, and that da ≫ N1, it becomes:
$$\Delta\theta \simeq \left( \frac{R + N_2}{R} \right) \Delta\alpha \qquad (4.16)$$
Therefore, the shift on the retina ∆xs for a small movement ∆αs is:
$$\Delta x_s \simeq \frac{(R + N_2)^2}{R}\, \Delta\alpha_s \qquad (4.17)$$
Let us consider now the approximation of the planar sensor projection described in (4.9). With the same assumptions, that ∆αp is small and da ≫ N1, (4.9) becomes:
$$\Delta x_a \simeq (d_c + N_2)\, \Delta\alpha_p \qquad (4.18)$$
Therefore, the geometric factor which allows the transformation of fixational eye movements into movements of the planar sensor can be obtained by equating the shift of one photoreceptor on the camera with the shift of one virtual photoreceptor on the retina:
$$\left\lfloor \frac{1}{S_s} \frac{(R + N_{2s})^2}{R}\, \Delta\alpha_s \right\rfloor = \left\lfloor \frac{1}{S_p} (d_c + N_{2p})\, \Delta\alpha_p \right\rfloor \qquad (4.19)$$
where Ss and Sp are, respectively, the dimension of the virtual photoreceptor on the semi-spherical sensor and the dimension of the photoreceptor cell on the planar sensor; N2p and N2s are the positions of the second nodal point in the optical system of the camera and in Gullstrand’s model (see Figure 4.4a); R and dc are the radius of the semi-spherical sensor and the distance of the planar sensor from the center of rotation of the camera; ∆αs is the movement of the semi-spherical sensor. Therefore, the relation which binds a rotation ∆αs of the semi-spherical sensor to the corresponding ∆αp of the planar sensor has been named the geometrical amplification relation:
$$\Delta\alpha_p \simeq G_{af}\, \Delta\alpha_s \qquad (4.20)$$
where Gaf is called the geometrical amplification ratio, and is expressed as:

$$G_{af} = \frac{S_p (R + N_{2s})^2}{S_s R (d_c + N_{2p})}$$
It can be noticed that only the secondary nodal points N2p and N2s are involved in the geometric relation (4.20).
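A direct transcription of the geometrical amplification ratio might look as follows (a minimal sketch; the inserted values are those used in Chapter 5 and are repeated here only as an example, so the printed number depends on which calibrated parameters are chosen).

```python
def geometrical_amplification_ratio(S_p, S_s, R, d_c, N2_s, N2_p):
    """Geometrical amplification ratio Gaf relating a rotation of the
    semi-spherical (eye-like) sensor to the equivalent rotation of the
    planar camera sensor, equation (4.20). Any consistent length unit works."""
    return (S_p * (R + N2_s) ** 2) / (S_s * R * (d_c + N2_p))

if __name__ == "__main__":
    # Illustrative values (all in mm): pixel size 9 um, virtual photoreceptor
    # 7.93 um, eye radius 11 mm, sensor-to-rotation-centre distance 10.52 mm.
    Gaf = geometrical_amplification_ratio(S_p=9e-3, S_s=7.93e-3,
                                          R=11.0, d_c=10.52,
                                          N2_s=6.03, N2_p=5.72)
    delta_alpha_eye_deg = 0.05                 # a small fixational movement
    print(round(Gaf, 2), Gaf * delta_alpha_eye_deg)   # amplified camera rotation
```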
Chapter 5
Discussion of the experiment results
Contents
5.1 The oculomotor robotic system
  5.1.1 Pinpoint light source (PLS)
  5.1.2 Nodal points characterization
  5.1.3 System Calibration
5.2 Recording human eye movements
5.3 Experimental results with artificial movements
  5.3.1 Depth perception from calibration movements
  5.3.2 Depth perception in a natural scene
5.4 Experimental results with human eye movements
  5.4.1 Depth and fixational eye movements
  5.4.2 Depth and saccadic eye movements
An oculomotor robotic system able to replicate, as accurately as possible, the geometrical and mobile characteristics of a human eye, and the novel mathematical model of 3-D projection on a 2-D surface which exploits human eye movements to obtain depth information, were both employed in a set of tasks to characterize their properties and prove their reliability in depth cue estimation.
5.1 The oculomotor robotic system
Human eye movements were replicated in a robotic head/eye system that we developed
expressly for this purpose. The oculomotor system (see figure 5.1) was composed of an
aluminum structure sustaining two pan-tilt units (Directed Perception Inc. - CA), each with
Figure 5.1 – The robotic oculomotor system. Aluminum wings and sliding bars mounted on the
cameras made it possible to maintain the distance between the center of rotation C
and the sensor S.
2 degrees of freedom and one center of rotation. These units were digitally controlled by
proprietary microprocessors which assured 0.707 arcmin precision movements.
Specifically designed aluminum wings allowed cameras to be mounted in such a way
that the center of rotation is always maintained between the sensor plane and the nodal
points (as shown in Figure 5.1a). A sliding bar furnished with calibrated holes allowed the distance
between the center of rotation C and the sensor S to be set at a value dc = 10.50 mm. This
configuration recreated the position of the retina on the human eye which has been estimated
to be about R = 11.00 mm [40].
Images were captured by one of the two digital cameras (Pulnix Inc. - CA) mounted
on the left and right pan-tilt units respectively. Each camera had a 640 × 484 CCD sensor
resolution with square 9 µm photoreceptors and a 120 frames/s acquisition rate. The optical system was formed by two TV zoom lenses capable of focal length regulation between 11.5 mm and 69 mm¹, and a shutter for luminance intensity control. Camera signals and
¹Only the 11.5, 15.0, 25.0, 50.0, and 69.0 mm markers were available on the lens body.
Emitted Color: Ultra red (640 − 660 nm)
Size: 5 mm T 1 3/4
Lens Colour: Water Clear
Luminous Intensity: 2000 mcd (Typical) − 3000 mcd (Max)
Viewing Angle: 15°
Max Power Dissipation: 80 mW (Ta = 25°)

Table 5.1 – Characteristics of the pinpoint source semiconductor light.
Figure 5.2 – Pinpoint light source (PLS) structure.
images were acquired by a fast frame-grabber (Datacube Inc. - MA) during the movement
of the robot, and later saved on a PC for subsequent analysis.
5.1.1 Pinpoint light source (PLS)
For the calibration procedure of the oculomotor system and for the first series of validations of the model, an approximation of a pinpoint light source (called object A in Figure
4.4b, and henceforth PLS) was required. This source was built using an ultra bright red
LED (see its characteristics in Table 5.1). This semiconductor light source was covered with a 3 mm pierced steel mask to guarantee a minimum variance of the shape of the PLS projection during rotations and to mask undesired bright aberrations due to the LED
imperfections.
5.1.2 Nodal points characterization
In this first stage of the experimental setup, a study of the two nodal points of the lens was
performed in order to calculate the trajectory of N1p and N2p on the visual axis as a function
of the optical system focal length.
The lens factory provided an estimation of the nodal point position at focal lengths
11.5 mm and 69.00 mm (see Figure 5.3a). If we consider the trajectory of the cardinal
points on the visual axis inside the lens to be linear, the position of N1p and N2p can be
summarized as shown in Figure 5.3b.
A first experiment was performed in order to verify if the model expressed by (4.9) was
able to predict the position of the cardinal points N1p and N2p as characterized in Figure
5.3b.
The PLS was placed at a set of different distances (between 300 mm and 900 mm, with intervals of 100 mm) from the center of rotation of the camera². The same set of distances was repeated for different focal lengths (11.5, 15.0, 25.0, 50.0, and 69.0 mm) of the optical system (focus at infinity). For each of these distances (and focal lengths), the same rotation ∆α = 3.0° of the camera was applied, and the corresponding PLS centroid was sampled. The shutter of the lens was set to a value of 5.6 in order to eliminate all the background details in the image and keep only the PLS projection visible. The N1p and N2p values were estimated by regularized least squares.
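The fitting step can be sketched as follows: the parallax predicted by (4.9) for a source on the visual axis depends on N1p and N2p, and these can be estimated by (regularized) least squares from the sampled displacements. The sketch below uses a generic non-linear least-squares routine on synthetic data; the exact regularization and solver used in the thesis are not specified, so everything beyond equation (4.9) itself is an assumption.

```python
import numpy as np
from scipy.optimize import least_squares

def parallax_model(params, d_a, delta_alpha, d_c):
    """Predicted PLS displacement for a source on the visual axis (alpha = 0)
    after a rotation delta_alpha, from equation (4.9) with f = d_c + N2."""
    N1, N2 = params
    f = d_c + N2
    return d_a * f * np.sin(delta_alpha) / (d_a * np.cos(delta_alpha) - N1)

def fit_nodal_points(d_a, measured_shift, delta_alpha, d_c, ridge=1e-3):
    """Least-squares estimate of (N1, N2) with a mild Tikhonov-like penalty."""
    def residuals(p):
        r = parallax_model(p, d_a, delta_alpha, d_c) - measured_shift
        return np.concatenate([r, ridge * p])        # simple regularization term
    return least_squares(residuals, x0=np.array([40.0, 6.0])).x

if __name__ == "__main__":
    d_c, delta_alpha = 10.50, np.radians(3.0)
    distances = np.arange(300.0, 901.0, 100.0)       # mm, as in the experiment
    true_N1, true_N2 = 49.65, 5.72                   # only to fabricate test data
    shifts = parallax_model((true_N1, true_N2), distances, delta_alpha, d_c)
    shifts += np.random.default_rng(1).normal(0.0, 0.004, shifts.shape)  # ~half-pixel noise
    print(fit_nodal_points(distances, shifts, delta_alpha, d_c))         # estimated (N1, N2)
```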
Results in Figure 5.5 indicate that the model (4.9) was able to predict the correct position of the nodal points considering information such as distance of the object da and angle
of rotation ∆α. Furthermore, the results indicate that the hypothesis about the linearity of
trajectory of N1p and N2p inside the lens was correct. The inversion in the order (N2p , N1p ,
and C, opposite to the regular configuration N1p , N2p , and C) was congruent with the specifications of the optical system provided by the factory, which predicted an inversion for a
focal length greater than 27.7 mm.
From these results, we considered that the position of the points N1p , N2p , and C corresponded to the geometrical configuration of the human eye for values of focal length
included in the range 11.5 mm to 27.7 mm.
5.1.3 System Calibration
According to (4.9) and (4.7), the nodal point N2p appears to be the most relevant point in
the model. In fact, it can be noticed how N1 in the denominator relates only with da , which
is much bigger. A value of N2p = 6.03 mm in the robot would generate a position compatible with the human eye structure (see Figure 4.2a) and guarantee a coherent geometrical
position of the cardinal points (N1p , N2p , and C). Since, in (4.9), f is defined as dc + N2p ,
it was also possible to derive that the correct focal length appeared to be f = 16.55 mm.
In order to set the nodal point position as specified above, the procedure described in
the first experiment was modified accordingly, supplying a powerful tool for measuring the
exact focal length of the optical system which does not rely on the markers present on the
mechanical system of the lens.
Using the previous experimental setup, the PLS was placed at a set of different distances (between 300 mm and 900 mm, with intervals of 100 mm) with a specified focal length.

²From now on, all the distances expressed in the experimental setups are referred to the center of rotation of the camera-lens system C.
Figure 5.3 – (a) Factory specifications for the optical system used in the robotic head (not in scale).
The point C = 7.8 ± 0.1 mm represents the center of rotation of the camera-lens system, and it was measured directly on the robot. (b) Estimation of the position of N1 and N2 with respect to C as a function of the focal length (focus at infinity).
Figure 5.4 – Experimental setup of the calibration procedure. The PLS was positioned on the visual
axis of the right camera at distances depending on each trial.
For each of these distances, the same camera rotation ∆α = 3.0° was applied, and
the corresponding PLS projection was sampled. The N1p and N2p values were estimated by regularized least squares. This procedure was repeated with incremental focal lengths until
a final value for N2p was obtained that was as close as possible to N2s = 6.03 mm. At
the end of this preliminary phase of the calibration, the second nodal point was in position
N2p = 5.60 mm.
In order to obtain a final and refined characterization of the optical system, the PLS
was placed at a set of different distances (between 270 mm and 1120 mm, with intervals
of 50 mm) from the robot. The focal length of the optical system had already been set in
the previous stage. At each of these distances, a sequence of camera rotations (between
0.5◦ and 5.0◦ , with intervals of 0.5◦ ) was applied, and the corresponding PLS projection
was sampled. The collected data were used to estimate, by regularized least squares, all the
parameters of (4.9). Final values confirmed the preliminary calibration and the geometrical
structure of the oculomotor system returning N1p = 49.65 mm, N2p = 5.72 mm, and
dc = 10.52 mm.
In Figure 5.6 it is shown how a single movement of 3.0◦ affected the projection of the
PLS at different distances. The model prediction, expressed in (4.9) and indicated in the
figure with a solid line, fitted the experimental data indicated with circles.
Figure 5.5 – Data indicated with circles represent the estimated position of the nodal points N1p
and N2p for values of focal length 11.5, 15.0, 25.0, 50.0 mm (focus at infinite). Solid
lines in the graph represent their interpolated trajectory. It can be noticed that the
position of the cardinal points expressed by the graph is compatible with the trajectory
reported by the factory in Figure 5.3b.
5.2 Recording human eye movements
Eye movements performed by human subjects during the analysis of natural scenes were
sampled by a Dual Purkinje Image eyetracker (shown in Figure 5.8).
The DPI eyetracker, originally designed by Crane and Steele [21, 22] (Fourward Technologies, Inc. - CA), illuminates the eye with a collimated beam of 0.93 µm infrared light
and employs a complex combination of lenses and servo-controlled mirrors to continuously
locate the positions of the first and fourth Purkinje images. The data returned by the eyetracker regarding the rotation of the eyes are represented by the voltage potentials required to command the mirrors.
Subjects employed in the eye movement recording were experienced subjects with normal vision (MV, 31 years old, and AR, 22 years old).
Figure 5.6 – Model fitting (indicated with a solid line) obtained interpolating N1p = 49.65 mm
and N2p = 5.72 mm from the data (indicated with circles). The underestimation of the projections below da = 270 mm is due to defocus, aberrations and distortions introduced when the PLS is too close to the lens. In fact, since da is measured from the center of rotation C, the light source is actually less than 160 mm from the front glass of the lens. This position is incompatible even with human vision, which cannot focus objects closer than a distance called the near point, equal to about 250 mm [74].
Informed consent was obtained from
these subjects after the nature of the experiment had been explained to them. The procedures
were approved by the Boston University Institutional Review Board.
Natural images were generated on a PC (Dell Computer Corp.) with a Matrox Millenium G550 graphics card on board (Matrox Graphics - Canada), and displayed on a 21-inch
color Trinitron monitor (Sony Electronics Inc.). The screen had a resolution of 800 × 600
pixels and the vertical refresh rate was 75 Hz. The display was viewed monocularly with
the right eye, while the left eye was covered by an opaque eye-patch. The subject’s right eye
was at a distance of 110 cm from the CRT. The subject’s head was positioned on a chin-rest.
Each experimental session started with preliminary setups which allowed the subject to
adapt to the low luminance level in the room. These preliminary setups included:
Figure 5.7 – Results of the model fitting of the parallax obtained at different focal lengths (and
different nodal points).
Figure 5.8 – The Dual-Purkinje-Image eyetracker. This version of DPI eyetracker has a temporal resolution of approximately 1 ms and has a spatial accuracy of approximately
1 arcmin.
Figure 5.9 – Basic principle of the Dual-Purkinje-Image eyetracker. Purkinje images are formed by
light reflected from surfaces in the eye. The first reflection takes place at the anterior
surface of the cornea, while the fourth occurs at the posterior surface of the lens of the
eye at its interface with the vitreous humor. Both the first and fourth Purkinje images
lie in approximately the same plane in the pupil of the eye and, since eye rotation alters
the angle of the collimated beam with respect to the optical axis of the eye, and since
translations move both images by the same amount, eye movement can be obtained
from the spatial position and distance between the two Purkinje images.
positioning the subject optimally and comfortably in the apparatus; adjusting the eyetracker until
successful tracking was obtained; and running a short calibration procedure that allowed the eyetracker voltage outputs to be converted into degrees of visual angle and pixels on the screen. This widely used procedure associates the subject’s gaze at a specific point on the screen with
the voltage potential produced by the eyetracker. A nine point lattice, equally distributed
on the screen, was enough to interpolate all the pixel positions with eyetracker voltage responses.
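A minimal sketch of such a calibration is shown below; it fits a simple affine map from the two eyetracker voltages to screen pixel coordinates by least squares. The affine form, the fabricated voltage data and all names are assumptions for illustration, since the actual laboratory procedure (and any nonlinear terms it may include) is not documented here.

```python
import numpy as np

def fit_affine_calibration(voltages, pixels):
    """Least-squares affine map from eyetracker voltages (N x 2) to screen
    pixel coordinates (N x 2): pixels ~ [vx, vy, 1] @ A."""
    design = np.hstack([voltages, np.ones((voltages.shape[0], 1))])
    A, *_ = np.linalg.lstsq(design, pixels, rcond=None)
    return A

def voltages_to_pixels(voltages, A):
    design = np.hstack([voltages, np.ones((voltages.shape[0], 1))])
    return design @ A

if __name__ == "__main__":
    # Nine-point lattice on an 800 x 600 screen (positions are illustrative).
    xs, ys = np.meshgrid([100, 400, 700], [100, 300, 500])
    pixels = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    # Fabricated voltage responses: an unknown gain/offset plus a little noise.
    rng = np.random.default_rng(0)
    true_A = np.array([[90.0, 2.0], [-3.0, 110.0], [250.0, 280.0]])
    volts = (pixels - true_A[2]) @ np.linalg.inv(true_A[:2])
    volts += rng.normal(0.0, 1e-3, volts.shape)
    A = fit_affine_calibration(volts, pixels)
    print(voltages_to_pixels(volts[:3], A).round(1))   # recovers the pixel positions
```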
After this calibration procedure, the actual experiment was run. In each session, the subject was required to observe the CRT and to freely move the eyes among the objects present
in the scene. Each subject was requested to perform five different sessions.
Eye movement data were sampled at 1 kHz, recorded by an analog/digital sampling board (Measurement Computing Corp. - MA), and stored on the computer hard drive for off-line analysis. The programs required to acquire the subjects’ eye movement data and to perform the subsequent analysis were written in the Matlab environment (MathWorks Ltd. - MA).
5.3 Experimental results with artificial movements
The response of the model to the parallax measured with the robot was first tested on artificial movements, which offered a perfect control of the system and supplied an initial
knowledge of the behavior of mechanical and digital parts of the oculomotor equipment.
Figure 5.10 – Distance estimation calculated from projection data obtained by rotating the right
camera 3◦ .
5.3.1 Depth perception from calibration movements
Figure 5.10 shows how distance values were retrieved using projection data calculated during the final calibration procedure in Paragraph 5.1.3. The graph shows estimated distance
(highlighted with a solid line and circles), and the estimation error of the model (solid red line).
Results show that the model is capable of calculating depth information with a rising error dependent upon the real distance of the PLS. This error is mainly due to the quantization
introduced by the sensor during the sampling process of the PLS centroid.
In fact, when the object is distant, the projection of the PLS, moving in accord with the
rotation, falls within the same photoreceptor, thus introducing an error of about half the size of the photoreceptor cell in the cameras mounted on our robot (estimated to be
9 µm). Clearly, this error is not evident in human beings (at least for distances less than
10 m) because hyperacuity (see Paragraph 1.1.2) introduces a quantization error which is
about 30 times smaller than the physical photoreceptor in the retina [11]. To make this evident, in Figure 5.11 the response of the model to a rotation ∆α = 3◦ is computed for two
different quantizations. In the graph, a planar sensor with 9 µm quantization is compared
with another having a quantization 30 times smaller. It can be noticed that the estimation
error of the higher resolution sensor at 3 m is only 25.25 mm whereas, at the same dis-
Figure 5.11 – Error introduced by the quantization in the model depth perception performance.
Distance estimation using a sensor with a 9 µm cell size (thick blue line) is compared
with a sensor with a 0.3 µm cell size. The error of the higher resolution sensor becomes evident only after several meters.
tance, the lower resolution sensor has an estimation error of 788.1 mm. For this reason, the
area in which the lower resolution sensor has about the same estimation error as the higher
resolution sensor is approximately 30 times smaller.
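The effect can be reproduced with the model itself: quantizing the simulated projections to the cell size before inverting (4.11)–(4.12) shows how the distance error grows with distance and shrinks with the cell size. The sketch below is illustrative; the chosen eccentricity and the calibrated parameters are assumptions, and the printed errors will not match the figures quoted above exactly, because the quantization error depends on where each projection falls within a cell.

```python
import numpy as np

N1, N2, d_c = 49.65, 5.72, 10.52              # calibrated values (Section 5.1.3)
f = d_c + N2

def project(d_a, alpha):                      # forward model (4.9)
    return d_a * f * np.sin(alpha) / (d_a * np.cos(alpha) - N1)

def estimate_distance(x_a, x_r, d_alpha):     # inversion via (4.12) then (4.11)
    num = x_r * (1 - np.cos(d_alpha)) + f * np.sin(d_alpha)
    den = f * (x_r / x_a - np.cos(d_alpha)) - x_r * np.sin(d_alpha)
    alpha = np.arctan(num / den)
    return x_a * N1 / (x_a * np.cos(alpha) - f * np.sin(alpha))

def quantize(x, cell):                        # sensor sampling at a given cell size
    return np.round(x / cell) * cell

if __name__ == "__main__":
    d_alpha, alpha = np.radians(3.0), np.radians(2.0)
    for cell in (9e-3, 0.3e-3):               # 9 um pixel vs 0.3 um "virtual" cell (mm)
        for d_true in (500.0, 1000.0, 3000.0):
            x_a = quantize(project(d_true, alpha), cell)
            x_r = quantize(project(d_true, alpha + d_alpha), cell)
            err = estimate_distance(x_a, x_r, d_alpha) - d_true
            print(f"cell {cell*1e3:4.1f} um  d {d_true:6.0f} mm  error {err:9.1f} mm")
```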
5.3.2 Depth perception in a natural scene
In this experiment we exposed the model to a natural scene and calculated the parallax
produced by an artificial movement. A square-shaped object was positioned in front of a
background plane with the same texture (Irises - Van Gogh 1889). The defocus effect produced by differences in depth between the object and the background was compensated by
exposure to a 500 W light and by a reduced aperture of the shutter. This scene composition
guaranteed the almost complete invisibility of the object and reduced the number of other visual cues available in the resulting image.
Two pictures were taken before and after a 3.0° rotation (see Figure 5.12a and b), and then divided into 45 × 45 patches. A normalized cross-correlation algorithm was applied to search for the displacement of the patches.
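A sketch of this patch-matching step is given below. It implements a plain normalized cross-correlation search over horizontal shifts in pure NumPy; the original processing was done in Matlab, and the search range and other details here are assumptions for illustration.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)

def horizontal_parallax(img0, img1, patch=45, search=30):
    """For each patch-by-patch cell of img0, find the horizontal shift (in
    pixels) that maximizes the NCC with img1, as in Section 5.3.2."""
    h, w = img0.shape
    disp = np.zeros((h // patch, w // patch))
    for r in range(h // patch):
        for c in range(w // patch):
            y, x = r * patch, c * patch
            ref = img0[y:y + patch, x:x + patch]
            best, best_dx = -2.0, 0
            for dx in range(-search, search + 1):
                if 0 <= x + dx and x + dx + patch <= w:
                    score = ncc(ref, img1[y:y + patch, x + dx:x + dx + patch])
                    if score > best:
                        best, best_dx = score, dx
            disp[r, c] = best_dx
    return disp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    before = rng.random((180, 270))
    after = np.roll(before, 7, axis=1)          # synthetic 7-pixel horizontal shift
    print(horizontal_parallax(before, after))   # ~7 everywhere
```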
As mentioned before (see Paragraph 4.1), the primary visual cortex of primates is more
Figure 5.12 – Semi-natural scene consisting of a background placed at a distance 900 mm with
a rectangular object placed before it at a distance 500 mm. This object is an exact
replica of the covered area of the background. (a) The scene before the rotation; (b) the same semi-natural scene after an anticlockwise rotation of ∆α = 3.0°.
Figure 5.13 – The rectangular object hidden in the scene in Figure 5.12 is visible only because it
has been exposed to an amodal completion effect with a sheet of paper.
specialized for processing horizontal disparities. For this reason, we considered only horizontal displacements of the patches in the parallax matrices. Figure 5.14 shows the parallax
matrix obtained by applying the normalized cross-correlation on two images of an object
situated at 520 mm in front of the camera. Figure 5.15 shows the distances calculated for
all the patches obtained by the application of the model described in (4.11) and (4.12). The
model recovers the real distance of the object with good approximation.
5.4 Experimental results with human eye movements
Eye movements recorded during the sessions were analyzed and the segments inside each
session were classified as saccades or fixational eye movements according to their velocities. In order to obtain the best performance of the robot, only one of the five sessions was
chosen. Session selection was based on the simplicity of the movement paths, the absence of blinks, and the overall quality of the movements.
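The velocity-based segmentation can be sketched as follows, using a single speed threshold on the sampled gaze trace. The threshold value, the synthetic trace and the function name are assumptions for illustration; the actual criteria and filtering applied to the recordings are not detailed in this chapter.

```python
import numpy as np

def classify_eye_movements(x_deg, y_deg, fs_hz=1000.0, saccade_threshold_deg_s=30.0):
    """Label each sample of a gaze trace (in degrees, sampled at fs_hz) as
    belonging to a saccade (True) or to fixational movement (False), using
    the instantaneous speed against a single threshold."""
    vx = np.gradient(x_deg) * fs_hz
    vy = np.gradient(y_deg) * fs_hz
    speed = np.hypot(vx, vy)                   # deg/s
    return speed > saccade_threshold_deg_s

if __name__ == "__main__":
    fs = 1000.0
    t = np.arange(0, 1.0, 1.0 / fs)
    # Fabricated trace: slow drift plus one rapid 5-degree shift around t = 0.5 s.
    x = 0.2 * np.sin(2 * np.pi * 1.5 * t) + 5.0 / (1 + np.exp(-(t - 0.5) * 200))
    y = 0.1 * np.cos(2 * np.pi * 2.0 * t)
    is_saccade = classify_eye_movements(x, y, fs)
    print(is_saccade.sum(), "samples classified as saccadic")
```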
The considered recordings were processed according to the final objective of each experiment, and the selected traces were used to produce motor commands for the robot. The
speed of the eye movements was reduced in order to operate within the range of velocities
that the robot could achieve reliably, without jeopardizing the spatial accuracy of the eye movement reproduction.
Figure 5.14 – Horizontal parallax produced by a rotation of ∆α = 3◦ for the object in Figure
5.12a. It can be noticed how the general tendency of the surface corresponds to the
theoretical parallax behavior predicted in the equation (4.14).
5.4.1 Depth and fixational eye movements
A first experiment with human recordings was performed in order to verify whether the position of the PLS could be predicted by both the fixational eye movements of each subject
and by the model expressed by (4.9).
As mentioned in Paragraph 4.4, we introduced the geometrical amplification ratio Gaf ,
which can be calculated only when all the geometrical parameters of the optical system have
been characterized. Therefore, considering both the values reported for Gullstrand’s model (see Section 4.1) and the interpolated values of N1p and N2p (see Section 5.1.3), as well as Sp = 9 µm and Ss = 7.93 µm³, the geometrical amplification ratio has been estimated to be Gaf = 1.89.
³Due to the remarkable performance of the human visual system in acuity tasks, and to the technological limitations of our robot, it was impossible to work with the virtual photoreceptor size widely accepted by the psychophysics community (about 30 times smaller than the actual physical photoreceptor [11]). We therefore decided to lower the semi-spherical resolution to a more obtainable Ss = 7.93 µm, corresponding to an eccentricity on the retina of 4°.
Figure 5.15 – Distance estimation of the object in Figure 5.12a. The horizontal parallax matrix
obtained by positioning an object at 500 mm from the point C of the robotic oculomotor system was post-processed using (4.11) and (4.12), thereby obtaining the
distance information in the figure.
In this experiment, ten fixational eye movement traces, each of length 3 s, were separated from the other movements and amplified by Gaf. Each trace included about three
different fixational eye movements.
The PLS was placed at a set of different distances (between 300 mm and 900 mm, with
intervals of 200 mm) from the center of rotation of the camera. For each of these distances,
the traces produced camera rotations according to the fixational eye movements. The corresponding PLS projection centroid was sampled at the end of each movement. After each
rotation, a delay of 1 s was applied in order to reduce vibrations and the influence of inertia
on the samples. The knowledge of the fixational eye movements and the PLS projections
applied to the model (4.9) resulted in the graphs shown in Figures 5.16 and 5.17. It can
be noticed that even small movements, such as fixational eye movements, produce enough parallax to allow rough distance estimations.
Under the experimental conditions described above, the resultant distance is influenced
by errors due to the sensor quantization. This error tends to produce either an overestimation
or an underestimation of the angle α of (4.12), which causes, subsequently, the error in the
perceived distance (see graph in Figure 5.11 for further details).
Furthermore, the graphs show that the standard deviation of the measure is proportional
Figure 5.16 – PLS estimated distance using fixational eye movements of the subject MV.
Figure 5.17 – PLS estimated distance using fixational eye movements of the subject AR.
Figure 5.18 – Natural image representing a set of objects placed at different depths: object A at
430 mm; object B at 590 mm; object C at 640 mm; object D at 740 mm; object E
at 880 mm.
to the object’s distance. As the parallax decreases (due to large distances or to small rotations of the camera) it becomes comparable with the sensor resolution. This creates a situation
in which the error of one pixel in the acquisition of the image (introduced by the mechanical
system, vibration, motor position errors, etc.) creates a direct overestimation of α and a
subsequent increase of the standard deviation. For shorter distances, a bigger parallax (more
than ten pixels) is not influenced by a one or two pixel error.
5.4.2 Depth and saccadic eye movements
Another series of experiments was performed in order to verify whether the saccadic eye movements of each subject and the model expressed by (4.9) were able to extract depth information from the natural scene shown in Figure 5.18.
Even though (4.20) of Paragraph 4.4 supplies a strong theoretical relation which allows us to replicate fixational eye movements, it cannot be used for saccades because it presupposes assumptions which cannot always be verified. The first, and probably the most important, assumption concerns the magnitude of the rotations. All the approximations considered in Paragraph 4.4 suppose a rotation ∆α so small that the projection lying on the visual axis (α = 0) shifts by only a single photoreceptor. A second assumption is that the resolution in the central part of the retina, the fovea, is constant. For saccadic eye movements, this assumption is no longer valid because the separation among
photoreceptors changes linearly with the eccentricity [46] measured from the foveal point.
In this experiment, all the fixational eye movements in the session were filtered out, leaving
only the saccadic eye movements. This selection produced 29 saccades for the subject MV,
and 28 for the subject AR. The selected recordings were then used to produce motor commands for the robot.
As we have seen above, it is not correct to use Gaf to amplify the saccades in order for
the robot to perfectly replicate the gaze of the human subject. It is still possible, however,
to find an amplification factor able to provide the same functionality. In fact, the calibration procedure employed in the first stage of the eye movement recording isolated the exact
pixel observed by the human subject (see Paragraph 5.2). Movements in which the robot was requested to watch the same pixel gazed at by the human subject were collected, and the relative amplification factors (for both the x and y axes) were estimated by regularized least squares.
These amplification factors were subsequently used to amplify the saccadic movements
and to obtain the correct rotation commands for the robot. After each command was sent
to the robot with a delay of 1 s (necessary to minimize errors in the acquisition process
caused by vibrations and inertia), the relative frame was acquired by the camera and stored
for successive off-line processing. The saccadic traces produced a 30-frame movie for the
subject MV, and a 29-frame movie for the subject AR.
Each frame of the movie was divided into 45 × 45 patches. A normalized cross-correlation algorithm was applied to search for the horizontal displacement of the patches
between each frame and its predecessor. Knowing the rotation applied to the camera before each frame, and the parallax produced by that rotation, the results of the model are
straightforward. Figures 5.19 through 5.23 summarize how the replication of the saccades
produced a reconstruction of the real 3-D scene with a good approximation of the distance
information.
Figure 5.19 – Reconstruction of the distance information according to the oculomotor parallax
produced by the subject MV’s saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.20a and 5.20b.
Figure 5.20 – (a) Detailed distance estimation of the scene at the section A highlighted in Figure
5.19. (b) Detailed distance estimation of the scene at the section B highlighted in the
same figure.
Figure 5.21 – Filtered 3-D reconstruction of the distance information produced by the subject MV’s
saccades.
Figure 5.22 – Reconstruction of the distance information according to the oculomotor parallax
produced by the subject AR’s saccades. Lines A and B identify two specific sections of the perspective, which are shown in further detail in Figures 5.23a and 5.23b.
Figure 5.23 – (a) Detailed distance estimation of the scene at the section A highlighted in Figure
5.22. (b) Detailed distance estimation of the scene at the section B highlighted in the
same figure.
Figure 5.24 – Filtered 3-D reconstruction of the distance information produced by the subject AR’s saccades. Note that no object appears at the extreme right: AR’s eye movements were concentrated on the left side of the scene, so no parallax information concerning this object was available and, therefore, no distance cues.
Chapter 6
Conclusion and future work
In this thesis we presented a novel approach to depth perception and figure-ground segregation. Although no previous study has considered it, we have shown that monocular parallax, which originates from a rotation of the camera similar to the movement of the human eye, provides reliable depth information over a near range of distances.
By integrating psychophysical data on recorded human eye movements with robotic experiments, this work takes a radically different approach from previous studies on depth perception in robotics and computer vision, focusing primarily on the replication of the computational strategies employed by biological visual systems.
Accurately replicating human eye movements represented a serious challenge for this research. However, we showed that the implementation of these movements on the oculomotor robotic system, combined with the mathematical model of parallax, was able to provide depth estimates of natural 3-D scenes by accurately replicating the geometrical and motor characteristics of the human eye.
One of the main contributions of this thesis is the simple model formulation and the small number of required parameters, which are measured in preliminary calibrations. This preliminary phase is only initially a limitation. In fact, in a system that operates in the real world, depth cues are valuable only in the framework of the physical and functional characteristics of the system; for example, depth cues depend on the optical and motor characteristics of the eye. All organisms in nature, from insects with simple nervous systems to humans, tune their sensory cues while interacting with complex environments. Therefore, learning approaches such as Kalman filtering or unsupervised neural networks, both of which extract the model parameters from the flow of visual information afferent to the system, could be used to continuously adjust these values.
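Kalman filtering is named here only as a possible direction; purely as an illustration, the following scalar Python sketch shows how one model parameter could be nudged toward each new noisy measurement derived from the incoming visual data. The noise variances R and Q are assumed values, and the measurement z would have to come from some independent estimate of the parameter (for example, by inverting the parallax model on a patch of known distance).

    def kalman_update(theta, P, z, R=0.05, Q=1e-4):
        """One step of a scalar Kalman filter tracking a model parameter theta.

        theta, P : current estimate of the parameter and its variance
        z        : new noisy measurement of the same parameter
        R, Q     : assumed measurement-noise and process-noise variances
        """
        P = P + Q                        # predict: parameter assumed quasi-constant
        K = P / (P + R)                  # Kalman gain
        theta = theta + K * (z - theta)  # correct toward the new measurement
        P = (1.0 - K) * P
        return theta, P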
Due to its sole dependency on rotations of the sensor and on object projections, the novel model presented in this research could easily be extended to incorporate egomotion information provided by emulated vestibular sensors. Such a system would integrate proprioceptive sensor knowledge with the movements of the robot itself, in effect exploiting the errors and drifts of the moving components to obtain a better 3-D perception of the world.
Another advantage is the relatively light computational load that the model generates by exploiting preexisting basic visual cues. Parallax is a kind of information that can be computed at low levels of processing. The elements required to perceive depth with this model are already present in the visual system and are probably already supplied to other modules. This strategy is well suited to hierarchical systems such as the visual cortex and its higher-level processes.
A major future improvement towards more robust distance estimation could be obtained by using vertical parallax as a source of depth information, which has not been considered in this thesis. Fusion with the horizontal cue, based on averaging or other criteria, would provide a distance estimate less affected by noise. Moreover, in a binocular robotic system, the monocular parallax information extracted by the two cameras could be combined for improved accuracy.
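One simple criterion for such a fusion, sketched below in Python purely as an illustration (the variance values are assumptions, not measurements), is inverse-variance weighting, which gives more weight to the less noisy cue and always yields a fused estimate no noisier than either input.

    def fuse_depth(z_h, var_h, z_v, var_v):
        """Fuse horizontal and vertical parallax depth estimates.

        Inverse-variance weighting: the cue with the smaller variance
        dominates, and the fused variance 1/(1/var_h + 1/var_v) is
        smaller than both inputs.
        """
        w_h, w_v = 1.0 / var_h, 1.0 / var_v
        z = (w_h * z_h + w_v * z_v) / (w_h + w_v)
        return z, 1.0 / (w_h + w_v)

    # Illustrative example: a horizontal estimate of 1.20 m (variance 0.04)
    # and a vertical estimate of 1.35 m (variance 0.09) fuse to about 1.25 m.
    z, var = fuse_depth(1.20, 0.04, 1.35, 0.09)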
Clearly, the robotic oculomotor system lacks a number of features necessary to produce an exact replica of the appearance of stimuli on the retina during oculomotor activity. For example, the different structure of the sensor surfaces and the spatial distribution of the receptors still represent an open challenge.
It is clear, however, that the biological nervous system possesses a far more powerful signal-processing capability than algorithmic techniques, which are often complex, slow, and unreliable. Nevertheless, hardware implementations of biologically inspired signal-processing models have multiplied in recent years. Neuromorphic engineering, a field based on the design and fabrication of artificial neural systems inspired by biological nervous systems, started in the late 1980s and is slowly maturing. We aim to investigate whether effective signal-processing hardware can be built with this biologically inspired sensing technology. This may lead to new algorithms in computer vision and to the development of radically novel approaches to the design of machine vision applications in which fixational eye movements and saccades are essential constituents.
Bibliography
[1] Gullstrand A. Appendix II.3. The optical system of the eye, pages 350–358. Optical
Society of America, 1924.
[2] Higashiyama A. The effect of familiar size on judgments of size and distance: an
interaction of viewing attitude with spatial cues. Perception and Psychophysics,
35:305–312, 1984.
[3] Hughes A. A useful table of reduced schematic eyes for vertebrates which includes
computed longitudinal chromatic aberrations. Vision Research, 19:1273–1275, 1979.
[4] Hastorf AH. The influence of suggestion on the relationship between stimulus size
and perceived distance. Journal of Psychology, 29:195–217, 1950.
[5] Reinhardt-Rutland AH. Detecting orientation of a surface: the rectangularity postulate and primary depth cues. Journal of General Psychology, 117:391–401, 1990.
[6] Robert M. Gray Allen Gersho. Vector Quantization and Signal Compression, chapter 5, pages 151–152. Kluwer Academic Publishers, 1992.
[7] Pertriu E Archibald C. Robot skills development using a laser range finder. In IEEE
Transactions on Instrumentation and Measurement Technology Conference, Proceedings, volume 43, pages 448–452, 1993.
[8] Julesz B. Foundations of Cyclopean Perception. The University of Chicago Press Chicago, 1971.
[9] Stark L Bahill AT. The high-frequency burst of motoneural activity lasts about half
the duration of saccadic eye movements. Mathematical Bioscience, (26):319–323,
1975.
[10] Brown CM Ballard DH. Computer Vision. Englewood Cliffs NJ - Prentice-Hall,
1982.
[11] Skottun BC. Hyperacuity and the estimated positional accuracy of a theoretical simple cell. Vision Research, 40:3117–3120, 2000.
[12] Meyer F Bergen L. A novel approach to depth ordering in monocular image sequences. In IEEE International Conference on Computer Vision and Pattern Recognition, Proceedings, volume 2, pages 536–541, 2000.
[13] Horst Haußecker Bernd Jähne. Computer Vision and Applications, chapter 11, pages 397–438. Academic Press, 2000.
[14] Harbeck M Beß R, Paulus D. Segmentation of lines and arcs and its application for
depth recovery. In IEEE International Conference on Acoustics, Speech and Signal
Processing, Proceedings, volume 4, pages 3165–3168, 1997.
[15] Cumming BG. An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418:633–636, 2002.
[16] Mallot HA Bulthoff HH. Interaction of different modules in depth perception:
Stereo and shading. http://www.ai.mit.edu/publications/pubsDB/pubs.htm (AIM965), 1987.
[17] Klein SA Carey T. Resolution acuity is better than vernier acuity. Vision Research,
37(5):525–539, 1997.
[18] Yuille AL Clark JJ. Data fusion for sensory information processing systems. Kluwer
- Massachusetts, 1990.
[19] Erkelens CJ Collewijn H. Binocular eye movements and the perception of depth. In E. Kowler, editor, volume 4, chapter 4, pages 213–261. Elsevier Science Publisher BV (Biomedical division), 1990.
[20] Krotkow E Cozman F. Depth from scattering. In IEEE Computer society conference
on Computer vision and pattern recognition, Proceedings, pages 801–806, 1997.
[21] Steele CM Crane HD. Accurate three-dimensional eyetracker. Applied Optics,
17(5):691–705, 1978.
[22] Steele CM Crane HD. Generation V dual-Purkinje-image eyetracker. Applied Optics,
24(4):527–537, 1985.
[23] Schuster DH. A new ambiguous figure: a three-stick clevis. American Journal of
Psychology, 77:673, 1964.
[24] Carey DP Dijkerman HC, Milner AD. Motion parallax enables depth processing for
action in a visual form agnosic when binocular vision is unavailable. Neuropsychologia, 37(13):1505–1510, Dec 1999.
[25] Applegate RA Doshi JB, Sarver EJ. Schematic eye models for simulation of patient
visual performance. Journal of Refractive Surgery, 17(4):414–419, 2001.
[26] Williams DR. Seeing through the photoreceptor mosaic. Trends in Neuroscience,
9:193–198, 1986.
[27] Thompson D Dunn BE, Gray GC. Relative height on the picture-plane and depth
perception. Perceptual and Motor Skills, 21(1):227–236, 1965.
[28] et al. Durgin FH. Comparing depth from motion with depth from binocular disparity. Journal of Experimental Psychology: Human Perception and Performance,
21(3):679–699, Jun 1995.
[29] Fincham E. The accommodation reflex and its stimulus. British Journal of Ophthalmology, 35:381–393, 1951.
[30] Baratz SS Epstein W. Relative size in isolation as a stimulus for relative perceived
distance. Journal of Experimental Psychology, 67:507–513, 1964.
[31] Hillebrand F. Das Verhältnis von Akkommodation und Konvergenz zur Tiefenlokalisation.
Zeitschrift fur Psychologie, 7:97–151, 1894.
[32] Ciuffreda KJ Fisher SK. Accommodation and apparent distance. Perception, 17:609–
621, 1988.
[33] Fowler TA Freeman TC. Unequal retinal and extra-retinal motion signals produce
different perceived slants of moving surfaces. Vision Research, 40(14):1857–1868,
2000.
[34] et al. Fujii Y. Robust monocular depth perception using feature pairs and approximate
motion. In IEEE International Conference on Robotic and Automation, Proceedings,
volume 1, pages 33–39, 1992.
[35] Berkeley G. An essay towards a new theory of vision. Dutton - New York, 1709,
Reprinted 1922.
[36] Westheimer G. The spatial sense of the eye. Investigative Ophthalmology, 18:893–
912, 1979.
[37] Westheimer G. The spatial grain of the perifoveal visual field. Vision Research,
22:157–162, 1982.
[38] et al. Garcia J. Chromatic aberration and depth extraction. In International Conference on Pattern Recognition, Proceedings, volume 1, pages 762–765, 2000.
[39] Mertens HW Gogel WC. Perceived depth between familiar objects. Journal of Experimental Psychology, 77(2):206–211, 1968.
[40] Bingham GP. Optical flow from eye movement with head immobilized: "ocular occlusion" beyond the nose. Vision Research, 33(5/6):777–789, 1993.
[41] et al. Greschner M. Retinal ganglion cell synchronization by fixational eye movements improves feature estimation. Nature Neuroscience, 5:341–347, 2002.
[42] Hirsch J Groll SL. Two-dot vernier discrimination within 2.0 degrees of the foveal
center. Journal of Optical Society of America, Series A, 4(8):1535–1542, 1987.
[43] Ono H. Apparent distance as function of familiar size. Journal of Experimental
Psychology, 79:109–115, 1969.
[44] Emsley HH. Visual Optics. Hatton Press Ltd - London, 1952.
[45] Curcio CA Hirsch J. The spatial resolution capacity of human foveal retina. Vision
Research, 29(9):1095–1101, 1989.
[46] Miller WH Hirsch J. Does cone positional disorder limit resolution? Journal of Optical Society of America, Series A, 4(8):1481–1492, 1987.
[47] Hochberg JE Hochberg CB. Familiar size and the perception of depth. Journal of
Psychology, 34:107–114, 1952.
[48] Brenot J Honig J, Heit B. Visual depth perception based on optical blur. In International Conference on Image Processing, Proceedings, volume A, pages 721–724,
1996.
[49] Rogers BJ Howard IP. Seeing in Depth, volume 2, chapter 25, page 413. I. Porteous,
2002.
[50] Wilson HR. Response of spatial mechanisms can explain hyperacuity. Vision Research, 26:453–469, 1986.
[51] Ittelson HW. Size as a cue to distance: static localization. American Journal of
Psychology, 64:54–67, 1951.
[52] Marshall JA. Self-organizing neural network architectures for computing visual depth
from motion parallax neural networks. In International Joint Conference on Neural
Networks, Proceedings, volume 2, pages 227–234, 1989.
[53] George TB Jr. Calculus and Analytic Geometry, volume 1, chapter 11, pages 361–
370. Addison-Wesley Publishing Company, 4 edition, 1972.
[54] Vasan NS Juyal DP, Gaba SP. Design of a high performance laser range finder. In 4th
Pacific Rim Conference on Lasers and Electro-Optics, Proceedings, volume 1, pages
I–146–I–147, 2001.
[55] Cohen MH Kellman PJ. Kinetic subjective contours. Perception and Psychophysics,
35:237–244, 1984.
[56] et al. Khan N. Depth perception from blurring-a neural networks based approach for
automated visual inspection in vlsi wafer probing neural networks. In International
Joint Conference on Neural Networks, Proceedings, volume 3, pages 286–290, 1992.
[57] Levi DM Klein SA. Hyperacuity thresholds of 1 sec: theoretical predictions and
empirical validation. Journal of Optical Society of America, Series A, 2(7):1170–
1190, 1985.
[58] Steinman RM Kowler E. Miniature saccades: eye movements that do not count. Vision
Research, 19:105–108, 1979.
[59] et al. Landy MS. Measurement and modeling of depth cue combination: in defense
of weak fusion. Vision Research, 35(3):389–412, 1995.
[60] Meyyappan A Lee S, Ahn SC. Depth from magnification and blurring. In IEEE
International Conference on Robotics and Automation, Proceedings, pages 137–142,
1997.
[61] Aitsebaomo AP Levi DM, Klein SA. Vernier acuity, crowding and cortical magnification. Vision Research, 25(7):963–977, 1985.
[62] Klein SA Levi DM, McGraw PV. Vernier and contrast discrimination in central and peripheral vision. Vision Research, 40:973–988, 2000.
[63] Brown B Li RWH, Edwards MH. Variation in vernier acuity with age. Vision Research, 40:3775–3781, 2000.
[64] Thibos LN. Calculation of the influence of lateral chromatic aberration on image
quality across the visual field. Journal of Optical Society of America, Series A,
4:1673–1680, 1987.
[65] Houde R Loranger F, Laurendeau D. A fast and accurate 3-d rangefinder using the
biris technology: the trid sensor 3-d. In International Conference on Recent Advances
in Digital Imaging and Modeling, Proceedings, volume 1, pages 51–58, 1997.
[66] Nawrot M. Role of slow eye movements in depth from motion parallax. Investigative
Ophthalmology and Visual Science, 38(4):S694, 1997.
[67] Nawrot M. Viewing distance, eye movements, and the perception of relative depth
from motion parallax. Investigative Ophthalmology and Visual Science, 41(4):S45,
2000.
[68] Nawrot M. Eye movements provide the extra-retinal signal required for the perception
of depth from motion parallax. Vision Research, 43(14):1553–1562, 2003.
[69] Hildreth EC Marr D. Theory of edge detection. Proceedings of the Royal Society
London, Series B, 207(1167):187–217, 1980.
[70] Poggio T Marr D. Cooperative computation of stereo disparity. Science, 194:283–
287, 1977.
[71] Poggio T Marr D. A computational theory of human stereo vision. Proceedings of
the Royal Society London, B204(1979):301–328, 1982.
[72] Tresilian JR Mon-Williams M. Some recent studies on the extraretinal contribution
to distance perception. Perception, 28:167–181, 1999.
[73] et al. Monod MO. Perception of the environment with a low cost radar sensor. In
Radar 97, 14 - 16 October 1997, Proceedings, pages 806–810, 1997.
[74] Macdonald J Mouroulis P. Geometrical Optics and Optical Design. Oxford University Press, 1997.
[75] et al. Murphey YL. A real-time depth detection system using monocular vision. SSGRR - International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, 2000.
[76] Park GE Park RS. The center of ocular rotation in the horizontal plane. Journal of
Physiology, 104:545–552, 1933.
[77] Bajcsy R. Active perception. Proceedings of the IEEE, Special issue on Computer
Vision (invited paper), 76(8):996–1005, 1988.
[78] Descartes R. Treatise of man. Harvard University Press - Cambridge, 1972.
[79] Clement RA. Introduction to Vision Science. Lawrence Erlbaum Associates, Publishers, 1993.
[80] Jarvis RA. A perspective on range finding techniques for computer vision. IEEE
Transaction on Pattern Analysis and Machine Intelligence, 5(2):122–139, 1983.
[81] Trivedi MM Ravichandran G. Motion and depth perception using spatio-temporal
frequency analysis. In IEEE International Conference on Systems, Man, and Cybernetics, Proceedings, volume 1, pages 36–41, 1994.
[82] Desbordes G Rucci M. Contributions of fixational eye movements to the discrimination
of briefly presented stimuli. Journal of Vision, 3(11):852–864, 2003.
[83] Guo H; Yi Lu; Sarka S. Depth detection of targets in a monocular image sequence.
In 18th Digital Avionics Systems Conference, Proceedings, volume 2, pages 8.A.2–1
– 8.A.2–7, 1999.
[84] Tistarelli M Sandini G. Active tracking strategy for monocular depth inference over
multiple frames. IEEE Transactions on pattern analysis and machine intelligence,
12:13–27, 1990.
[85] Vaz F Santos V, Goncalves JGM. Perception maps for the local navigation of a mobile
robot: a neural network approach. In IEEE International Conference on Robotics and
Automation, Proceedings, volume 3, pages 2193–2198, 1994.
[86] et al. Shinohara S. Compact and high-precision range finder with wide dynamic
range and its applications. IEEE Transactions on instrumentation and measurement,
41(1):40–44, 1992.
[87] Campbell MCW Simonet P. The optical transverse chromatic aberration on the fovea
of the human eye. Vision Research, 30:187–206, 1990.
[88] Jain R Skifstad K. Range estimation from intensity gradient analysis. In IEEE International Conference on Robotics and Automation, Proceedings, volume 1, pages
43–48, 1989.
[89] Wolbarsht M Sliney D. Safety with lasers and other optical sources. Plenum Press - New York, 1980.
[90] Miller WH Snyder AW. Photoreceptor diameter and spacing for highest resolving
power. Journal of Optical Society of America, 67(5):696–698, 1977.
[91] Meyyappan A Sukhan L, Sang CA. Depth from magnification and blurring. In IEEE
International Conference on Robotics and Automation Albuquerque, New Mexico April 1997, Proceedings, volume 1, pages 137–142, 1997.
[92] Samy R Sune JL. Pyramidal robust estimation of the depth-from-motion. In SPIE
Conference, volume 2354, pages 380–386, 1994.
[93] Poussart D Taalebinezhaaf MA. Depth map from a sequence of two monocular images. In SPIE Conference, volume 2354, pages 357–368, 1994.
[94] Bradley A. Thibos L.N. Modeling, chapter 4, pages 101–159. McGraw-Hill, 1999.
[95] et al. Thibos LN. Theory and measurement of ocular chromatic aberration. Vision
Research, 30:33–49, 1990.
[96] et al. Thibos LN. The chromatic eye: a new reduced-eye model of ocular chromatic
aberration in humans. Applied Optics, 31(19):3594–3600, 1992.
[97] et al. Thibos LN. Spherical aberration of the reduced schematic eye with elliptical
refracting surface. Optometry and Vision Science, (in press), 1997.
[98] Oliva A Torralba A. Depth estimation from image structure. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 24(9):1226–1238, 2002.
[99] Karen K. De Valois. Seeing, chapter 1, pages 31–33. Academic Press, 2000.
[100] Zanker JM Volz H. Hyperacuity for spatial localization of contrast-modulated patterns. Vision Research, 36(9):1329–1339, 1996.
[101] Epstein W. Perceived depth as a function of relative height under three background
conditions. Journal of Experimental Psychology, 72:335–338, 1966.
[102] Wundt W. Beiträge zur Theorie der Sinneswahrnehmung. Winter - Leipzig, 1862.
[103] Griffin DR Wald G. The change in refractive power of the human eye in dim and
bright light. Journal of Optical Society of America, 37:321–336, 1947.
[104] Gogel WC. An indirect method of measuring perceived distance from familiar size.
Perception and Psychophysics, 20:419–429, 1976.
[105] McKee SP Westheimer G. Spatial configurations for visual hyperacuity. Vision Research, 17:941–947, 1977.
[106] Geld DJ Wilson HR. Modified line element theory for spatial frequency and width
discrimination. Journal of Optical Society of America, A1:124–131, 1984.
[107] Smith WJ. Modern Optical Engineering. McGraw-Hill - New York, 1966.
[108] Geisler WS. Physical limits of acuity and hyperacuity. Journal of Optical Society of
America, Series A, 1(7):775–782, 1984.
[109] Le Grand Y. Form and Space Vision - G.G. Heath and M. Millodot, Editors. Indiana University Press - Bloomington, 1967.
[110] Klein SA Yap YL, Levi DM. Peripheral positional acuity: retinal and cortical constraints on 2-dot separation discrimination under photopic and scotopic conditions.
Vision Research, 29(7):789–802, 1989.