Journal of Vision (2017) 17(2):9, 1–12

Gaze behavior during 3-D face identification is depth cue invariant
Hassan Akhavein
McGill Vision Research Unit,
Department of Ophthalmology, McGill University,
and The BRaIN Program of the RI-MUHC,
Montréal, Canada
Reza Farivar
McGill Vision Research Unit,
Department of Ophthalmology, McGill University,
Montréal, Canada

Citation: Akhavein, H., & Farivar, R. (2017). Gaze behavior during 3-D face identification is depth cue invariant. Journal of Vision, 17(2):9, 1–12, doi:10.1167/17.2.9.
Received July 25, 2016; published February 28, 2017
ISSN 1534-7362. Copyright 2017 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Gaze behavior during scene and object recognition can
highlight the relevant information for a task. For
example, salience maps—highlighting regions that have
heightened luminance, contrast, color, etc. in a scene—
can be used to predict gaze targets. Certain tasks, such
as face recognition, result in a typical pattern of fixations
on high salience features. While local salience of a 2-D
feature may contribute to gaze behavior and object
recognition, we are perfectly capable of recognizing
objects from 3-D depth cues devoid of meaningful 2-D
features. Faces can be recognized from pure texture,
binocular disparity, or structure-from-motion displays
(Dehmoobadsharifabadi & Farivar, 2016; Farivar, Blanke,
& Chaudhuri, 2009; Liu, Collin, Farivar, & Chaudhuri,
2005), and yet these displays are devoid of local salient
2-D features. We therefore sought to determine whether
gaze behavior is driven by an underlying 3-D
representation that is depth-cue invariant or depth-cue
specific. By using a face identification task comprising
morphs of 3-D facial surfaces, we were able to measure
identification thresholds and thereby equate for task
difficulty across different depth cues. We found that gaze
behavior for faces defined by shading and texture cues
was highly comparable, but we observed some
deviations for faces defined by binocular disparity.
Interestingly, we found no effect of task difficulty on
gaze behavior. The results are discussed in the context of
depth-cue invariant representations for facial surfaces,
with gaze behavior being constrained by low-level limits
of depth extraction from specific cues such as binocular
disparity.
Introduction
The inspection of a visual object is accompanied by
complex gaze behavior that has proven difficult to
predict (Jovancevic, Sullivan, & Hayhoe, 2006; Turano,
Geruschat, & Baker, 2003). Aspects of this gaze
behavior are driven by stimulus attributes, but task
demands can also drastically influence gaze behavior
(Navalpakkam & Itti, 2005). Stimulus attributes
include low-level features such as brightness, contrast,
color, or spatial frequency, but may also include high-level information such as depth cues. The perception of
depth can have the effect of overriding some of the
salient 2-D cues. For example, the addition of
perspective information during the viewing of a simple
geometric capsule has the effect of biasing gaze towards
the center of gravity for the object in 3-D, instead of the
center of the 2-D projected form (Vishwanath &
Kowler, 2004). In more complex tasks such as face
perception, there is evidence that the addition of
stereopsis can influence gaze behavior (Chelnokova &
Laeng, 2011), but it is difficult to determine whether
depth per se changed eye movement behavior or depth
simplified the task, resulting in a change in inspection
strategy.
The addition of depth information can influence
complex object recognition. For example, face recognition improves with the addition of binocular disparity
information (Liu & Ward, 2006). We have shown
previously that a number of depth cues, such as
structure-from-motion (SFM) and texture (Farivar et
al., 2009; Liu et al., 2005), can alone support
complex object recognition. Interestingly, not all depth
cues are processed by similar cortical mechanisms. For
example, stereopsis and SFM are processed largely by
the dorsal visual pathway (Backus, Fleet, Parker, &
Heeger, 2001; Heeger, Boynton, Demb, Seidemann, &
Newsome, 1999; Kamitani & Tong, 2006; Tootell et al.,
1995; Tsao et al., 2003; Zeki et al., 1991), a pathway
that begins in V1 and courses dorsally through area
MT before terminating in the posterior parietal lobe,
while texture and shading largely depend on the ventral
visual pathway (Merigan, 2000; Merigan & Pham,
1998), a pathway that also begins in V1, but courses
ventrally through V4 and terminates in the anterior
temporal lobe.
In displays where a single depth cue is used to render
an object, low-level cues such as luminance and
contrast are unavailable, weak, or unrelated to the
surface of the object. This means that saliency models
such as that described by Itti and Koch (2000) would poorly
predict gaze behavior during 3-D object recognition,
independent of task demands, simply because low-level
features in such displays are unrelated to depth.
Therefore, gaze behavior in such displays must be
driven primarily by the perception of depth and not the
manner in which depth is rendered—the depth cue
ought to be unimportant. We call this the depth-cue
invariance hypothesis—that gaze behavior is driven by
the depth profile of an object, independent of low-level
features or task demands. The key prediction of this
view is that the gaze behavior for a given object
recognition task would remain constant across different
depth cues. An alternative hypothesis is that gaze
performance is not invariant to depth cue, but tasks
using each depth cue create different demands for gaze,
resulting in a distinct pattern of gaze behavior that
depends on the depth cue of the stimulus.
Distinguishing between these two hypotheses is
challenging, because it requires comparison of gaze
performance across different depth cues in a manner
that separates the influence of task demands. While
objects can be defined by different depth cues,
recognition performance across depth cues is not
consistent. For example, face recognition performance
with SFM on a 1:8 matching task is approximately 50%
(four times above chance), while the same task with
shaded faces results in performance close to 85%
(Farivar et al., 2009). This means that we cannot
directly compare gaze performance across depth for
such a task—the gaze performance difference could be
due to the difference in the task difficulty, independent
of the depth cue itself (Dehmoobadsharifabadi &
Farivar, 2016).
In the studies described below, we have sought to
overcome this challenge and measure gaze behavior
during face recognition, where faces are defined by
different depth cues. We were able to equate task
difficulty across the different depth cues by measuring
face identification thresholds—the amount of morph
between a given facial surface and the "average" facial
surface needed to result in correct performance. This
allowed us to equate task difficulty across depth cues
and therefore directly compare how gaze performance
was influenced by depth cues for similar levels of
performance on each depth cue.
Methods
Participants
Nine subjects (including one of the authors, HA)
participated in the study. All participants (aged 23 to
38; two women, seven men) had normal or corrected-to-normal vision. All procedures were approved by the
Research Ethics Board of the Montreal Neurological
Institute. Two subjects completed all but the stereo
task.
Stimuli and display
The stimuli consisted of frontal views of faces with
different 3-D surface information (Shading, Texture,
and Stereo). Facial surfaces with four different
identities (two male and two female) were generated
using FaceGen Modeller 3.5. Each identity was
morphed from 100% towards the average face in steps
of 10% using the morph feature in the software. Facial
surfaces were edited using a professional 3-D modelling
package (3D Studio Max 2013, Autodesk) to remove
the identity information from the size and contour of
the morphs by resizing them to the dimension of the
average face and then applying a random surface noise
to the external contour (Figure 1). The sizes of the final
objects were approximately 180 × 280 × 220 in 3ds
Max units. The stimuli were presented on a Samsung
(S23A700D) 3-D screen operating at 1920 × 1080 and
with a refresh rate of 120 Hz, driven by a dual HDMI
cable connected to the stimulus computer. Subjects
viewed the display from a distance of 60 cm. Each
stimulus subtended 16° of visual angle vertically and 10.5° horizontally. As described further below, each
stimulus was presented at a random location in order to
measure the location of the first fixation landing. The
face stimuli were presented against a grey background.
Shading
All the textures were removed from the surface and a
directional light source at a 45° angle from the horizon
was introduced. The frontal view of the face was
rendered in orthographic projection to avoid perspective information (Figure 2a); this was kept consistent across all conditions.

Figure 1. Schematic view of the face space. Four different identities (two male and two female) were generated and morphed from 100% identity towards the average face in steps of 10%. Participants were asked to associate each identity with a specific key on the keyboard. The figure shows only the faces defined by shading.
Texture
The texture stimuli were generated using the method
described by Liu et al. (2005). A random regular noise
texture with 100% self-luminance was applied to the
surface. The maximum and minimum thresholds for
the noise were set to 75% and 25% of the maximum
image luminance range. The noise element size was 12.7 cm, overlaid on the face object, which was about 4 m in width. The
images were rescaled for presentation on the screen.
The self-luminance was required to remove shading
information from the stimuli. The size of the noise
minimized nonuniformity of the procedural random
texture on the surface. The images were rendered in
orthographic projection. In this condition the depth of
the object would be well defined by changes in texture
gradient across the surface (Figure 2b).
Stereo
Using the 3D Studio Max software, two images were
captured from the 3-D surfaces from two virtual
cameras shifted in the horizontal direction to simulate
the position of the two eyes at 60 cm distance from the
full-screen monitor (±2.86°). The surfaces were covered
with a high-density regular noise texture. The maximum and minimum thresholds for the noise were set to
be 75% and 25% of the maximum image luminance
range. The noise element size was 7.62 cm, overlaid on the face object, which was about 4 m in width. The surface was also set
to 100% self-luminance to eliminate shading. The two
images were presented alternately on the screen, and
the subject could detect the surface of the face through
binocular disparity information provided by use of 3-D
shutter eye glasses. In this condition all the shading was
removed from the surface with self-luminance and the
size of the texture was small enough that it did not
allow for monocular detection of changes in the texture
gradient, meaning that depth could be resolved only
when the stimuli were viewed binocularly—monocular viewing resulted in a flat image of random noise (Figure 2c).

Figure 2. An example of (a) shading- and (b) texture-defined faces. Different depth cues were extracted from the same 3-D surface of the face, and the external contours were randomized. The stereo stimulus is not shown as presented; (c) shows the stereo version in color anaglyph form for visualization (the stimuli were presented on a 3-D monitor using shutter glasses).
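As a worked example, the horizontal offset of each virtual camera follows from simple trigonometry. The 6-cm interocular separation below is an assumption chosen to be consistent with the ±2.86° value quoted above; it is not stated in the text.

% Worked example: angular offset of each virtual camera, assuming a
% 6-cm interocular distance (an assumption; only the +/-2.86 deg value
% is quoted in the text) and a 60-cm viewing distance.
ipd = 6;                                   % cm, assumed interocular distance
viewDist = 60;                             % cm, distance from the screen
halfAngle = atand((ipd / 2) / viewDist);   % ~2.86 deg per camera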
Eye movement recording
Two-dimensional movement of the left eye was
recorded using a custom-designed eye tracker sampling at 400 frames per second with a precision of 0.1° of visual angle at a 60-cm distance from the monitor, controlled by the GazeParser
software (Sogo, 2013). The position of the eye was
calculated based on the relative distance of the pupil
and the first Purkinje image. A nine-point linear
transform was used to calibrate the camera coordinates
to the screen coordinates. To further increase the
accuracy, a chin rest and an IR filter on the camera were
used. Participants who required prescription glasses
wore their contact lenses for the studies. 3-D glasses
provided by the display manufacturer were used for the
stereo condition. A data file generated by the eye tracker containing the eye position and stimulus
timings was saved for later analysis.
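The nine-point linear calibration can be sketched as an ordinary least-squares fit of an affine transform from camera coordinates (relative pupil/first-Purkinje position) to screen coordinates. GazeParser's internal procedure may differ; the Matlab sketch below, with illustrative names, only shows the general idea.

% Minimal sketch of a nine-point linear (affine) calibration, assuming an
% ordinary least-squares fit; not necessarily GazeParser's exact method.
function A = fitGazeCalibration(camXY, scrXY)
% camXY: 9 x 2 relative pupil/first-Purkinje positions (camera units)
% scrXY: 9 x 2 known screen positions of the calibration targets (pixels)
X = [camXY, ones(size(camXY, 1), 1)];   % add a constant (offset) term
A = X \ scrXY;                          % 3 x 2 affine transform, least squares
end
% New samples are then mapped with: scrEst = [camNew, ones(size(camNew,1),1)] * A;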
Procedure
The experiment started with the training session
where the subjects became familiar with the identity of
the faces and were asked to associate each identity with a
specific key on the keyboard. Participants were trained
for all types of stimuli (Texture, Shading, and Stereo) in
different sessions. During the training, one of the four
faces was randomly chosen and was presented for 2 s
while the subject had unlimited time to respond. No
fixation points were presented, and the stimuli were
always at the center of the screen. The training session
was repeated until the subject's performance exceeded 90% accuracy at the 60% morph level for all three stimulus types. Each training session lasted about five minutes and, depending on the subject, four to five training sessions were required to reach sufficient accuracy.
During the testing session, a face with one of the four identities at one of the 11 morph levels (0% to 100%) was randomly selected
and presented for 2 s. The position of the stimulus was
random on the screen with no fixation point so as to be
able to record the preparatory fixation of the subject.
After two seconds, the stimulus was removed and the
screen was filled with a gray background. Participants
then made their response without a time limit (Figure
3). Each face and morph level was tested 20 times, for a
total of 2,640 trials per subject (3 depth conditions × 4 identities × 11 morph levels × 20 trials per condition).
Different depth cues were presented in separate sessions
and because of the long duration of each session,
subjects had three breaks and the eye tracker was
recalibrated after each break. Subjects were tested for
each session on separate days, and the overall duration
of the testing was slightly less than three hours combined.

Figure 3. Schematic view of the experimental design for the testing phase. Only the sequence for the shading stimuli is shown here and the procedure was identical for all depth cues. A face with a random identity, morph level, and position on the screen was presented for 2 s. There was no fixation point presented in the entire experiment. After 2 s, the screen was covered with a gray background and the subject was asked to respond with no time limit. The subject was not allowed to respond while the stimulus was still presented. The next stimulus was presented immediately after the response of the subject. Different depth cues were tested in separate sessions and the subject had two breaks within each session.
Analysis of psychophysical data
Weibull functions were fitted to the 4AFC data for
each subject and depth cue using the Palamedes
Toolbox (Prins & Kingdom, 2009). The identification
threshold was defined as the amount of face morph
needed to achieve 72.2% accuracy. The guessing and
the lapse rates were fixed at 0.25 and 0.001, respectively. The recognition thresholds were calculated for
each individual subject for each of the three depth-cues.
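A minimal Matlab sketch of this threshold estimation is given below, assuming a standard Weibull psychometric function with the guess and lapse rates fixed as above and morph levels expressed as proportions (0 to 1). The actual analysis used the Palamedes toolbox; this stand-in and its names are only illustrative.

% Minimal sketch of the Weibull fit for the 4AFC data (illustrative; the
% study used the Palamedes toolbox). morphLevels, nCorrect, and nTrials
% are vectors with one entry per morph level; morphLevels range from 0 to 1.
function threshold = fitWeibull4AFC(morphLevels, nCorrect, nTrials)
gamma  = 0.25;    % guess rate, fixed for the 4AFC task
lambda = 0.001;   % lapse rate, fixed as in the text
% Weibull psychometric function; q = log([alpha, beta]) keeps both positive
psi = @(q, x) gamma + (1 - gamma - lambda) .* ...
      (1 - exp(-(x ./ exp(q(1))).^exp(q(2))));
% Negative log-likelihood of the binomial responses
negLL = @(q) -sum(nCorrect .* log(psi(q, morphLevels)) + ...
                  (nTrials - nCorrect) .* log(1 - psi(q, morphLevels)));
qHat = fminsearch(negLL, log([0.5, 2]));   % starting guesses for alpha, beta
threshold = exp(qHat(1));   % morph level at roughly 72.2% correct (alpha)
end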
Gaze data: Preprocessing and inclusion criteria

We used Matlab for the preprocessing of the eye data to correct for the random positioning of the stimuli on the screen. The eye data were then analyzed using OGAMA 4.3 (Vosskuhler, Nordmeier, Kuchinke, & Jacobs, 2008) to calculate the position and duration of fixations from the raw data. The maximum distance a data point could vary from the average fixation point and still be considered part of that fixation was set to 0.5° of visual angle, and the minimum fixation duration was set to 70 ms. Trials containing fewer than two or more than eight fixations (approximately two standard deviations from the average number of fixations per trial) were excluded from further statistical analysis.

Gaze data: Region-of-interest (ROI) analysis

Eight regions of interest were selected corresponding to the facial features (left eye, right eye, left cheek, right cheek, nose, forehead, mouth, and chin). A schematic view of the facial regions defined in this study is shown in Figure 4. The amount of time spent on each individual facial area is reported as a percentage of the stimulus duration (2 s) for statistical analysis.

Figure 4. Schematic view of the eight facial areas overlaid on a gray-scale, z-scored depth map of the average face. Each area is defined as the rough region covering an individual facial feature.
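The preprocessing and ROI steps can be sketched together as a dispersion-based fixation detector followed by per-ROI dwell-time accumulation. The thresholds match the text (0.5° dispersion, 70-ms minimum duration, 2-s trials), but the rectangular ROI definition and all names below are illustrative assumptions rather than the actual OGAMA/Matlab pipeline.

% Minimal sketch of dispersion-based fixation detection and ROI dwell-time
% percentages (illustrative; OGAMA performed the fixation detection in the study).
function roiPct = fixationDwellPercent(gaze, fs, roiRects)
% gaze:     nSamples x 2 gaze positions in degrees, already corrected for
%           the random stimulus position
% fs:       sampling rate in Hz (400 in this study)
% roiRects: nROI x 4 rectangles [xmin xmax ymin ymax] approximating the
%           eight facial areas
maxDisp  = 0.5;     % deg, maximum distance from the fixation centroid
minDur   = 0.070;   % s, minimum fixation duration
trialDur = 2.0;     % s, stimulus presentation time used for normalization
roiPct = zeros(1, size(roiRects, 1));
i = 1; n = size(gaze, 1);
while i <= n
    j = i;
    % Grow the window while every sample stays within maxDisp of the centroid
    while j < n
        c = mean(gaze(i:j+1, :), 1);
        if max(sqrt(sum((gaze(i:j+1, :) - c).^2, 2))) > maxDisp, break; end
        j = j + 1;
    end
    dur = (j - i + 1) / fs;
    if dur >= minDur
        c = mean(gaze(i:j, :), 1);   % fixation position
        inRoi = c(1) >= roiRects(:,1) & c(1) <= roiRects(:,2) & ...
                c(2) >= roiRects(:,3) & c(2) <= roiRects(:,4);
        roiPct(inRoi) = roiPct(inRoi) + 100 * dur / trialDur;
    end
    i = j + 1;
end
end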
Results
Face identification performance
A repeated-measures ANOVA was performed on the
identification thresholds for each depth cue (Shading,
Texture, and Stereo). As expected, the identification
threshold was significantly different across depth-cues,
F(2, 14) = 17.5, p = 0.001 (Figure 5). Identification
thresholds were lowest for shading and highest for
texture. In general, subjects found the tasks difficult—
although all subjects were trained before the experiment, not all of them were able to perform at 100%
accuracy even with 100% identity strength with texture
and stereo. By measuring the identification performance for each subject and depth cue and analyzing the
eye movement data in relation to the threshold level of
performance, we were then able to remove the effect of
task difficulty on the eye movements during face
identification.
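A minimal sketch of this repeated-measures ANOVA, assuming one threshold per subject and depth cue and the Matlab Statistics and Machine Learning Toolbox, is shown below; the placeholder data and variable names are illustrative.

% Minimal sketch of the repeated-measures ANOVA on identification
% thresholds (illustrative; one row per subject, placeholder values only).
thresholds = rand(8, 3);   % placeholder: replace with the per-subject thresholds
t = array2table(thresholds, 'VariableNames', {'Shading', 'Texture', 'Stereo'});
within = table({'Shading'; 'Texture'; 'Stereo'}, 'VariableNames', {'DepthCue'});
rm = fitrm(t, 'Shading-Stereo ~ 1', 'WithinDesign', within);
ranova(rm)   % F test for the within-subject effect of depth cue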
Preferred fixation locations for faces defined by
3-D depth cues
After the onset of the stimuli, the first saccades were
towards the center of the face and the consecutive
fixations were accumulated in facial regions. The main
facial region of interest for shading and texture was the
nose with a slight shift towards the right side. The
overall amount of time spent on each individual facial
feature is shown in Figure 6. The results are the
percentages of accumulated durations spent on each
area calculated only from fixations, and the times
during saccades have been excluded (Findlay &
Walker, 1999; Thiele, Henning, Kubischik, & Hoffmann, 2002; Volkmann, 1986). Gaze performance for
each subject is reported in relation to their psychophysical performance, and we report gaze performance
below, at, and above face identification thresholds.
Panels a, b, and c of Figure 6 represent the results for
the average face (0% identity), at face-identification
threshold, and at 100% identity for each depth-cue,
respectively. A repeated-measures ANOVA was performed on duration of fixations with depth-cue
(Texture, Shading, and Stereo), difficulty (average face,
threshold, and full identity) and facial features (eight
facial areas) as within-subject factors. There was no
main effect of depth-cue, F(2, 14) = 0.298, p = 0.657, while there were main effects of difficulty, F(2, 14) = 5.291, p = 0.022, and of ROI, F(7, 49) = 45.998, p < 0.0001. There was no interaction between difficulty and ROI, F(14, 98) = 0.278, p = 0.123, suggesting that there was no significant shift in subjects' gaze strategy across ROIs as a function of difficulty. We observed a strong interaction between depth-cues and ROIs, F(14, 98) = 0.343, p < 0.0001, which suggests a significant shift of gaze strategy for different depth cues.

Figure 5. Identification performance for different depth cues. The y axis represents the morph level, and the threshold is defined as the amount of identity required to perform at 72.2% accuracy (4AFC task). The error bars are standard errors across subjects. Thresholds were significantly different across depth cues.
To compare gaze performance across depth cues and
to remove the effect of task difficulty, we only
considered gaze behavior during threshold-level face
identification performance—for each depth cue, we
considered gaze performance during viewing of face
morphs that represented the minimum amount of
identity information required for correct identification.
A post hoc comparison was performed on the duration of
fixation at threshold with depth cues (shading and
texture) and ROI as factors. There was no significant
interaction between ROI and depth cue, F(7, 56) = 3.147, p < 0.08, suggesting that the interaction effect we
detected previously was mainly driven by performance
on the disparity-defined faces. Figure 6d represents the
duration of fixations for each depth cue at threshold
level of face identification performance to facilitate
visualization of between-depth-cue effects.
Effect of task difficulty on gaze performance
during the recognition of faces from pure depth
cues
Previous studies have suggested longer fixation
durations are correlated with the difficulty of the task
(Rayner, 1998; Underwood, Jebbett, & Roberts, 2004).
Figure 6. The overall duration spent on each individual facial area for (a) shading, (b) stereo, and (c) texture at three levels of difficulty
(Average face, 100% identity, and at threshold). The results were calculated based on subjects’ individual identification performance.
The y axis represents the percentage of the time spent on an area over the entire duration of the trial (2 s). The amount of time
during the saccades has been excluded, and only the time spent during fixations has been included. (d) The same data at threshold for all three depth cues, for better comparison of the effect of depth cue on fixation positions. There was no significant
difference in duration of fixation between shading and texture. Error bars are one standard error across subjects.
We tested the effect of the difficulty of the task on the
duration of fixations by using morphed faces. Figure 7
shows the duration of fixation for the first three
fixations. The preparatory fixation is the amount of
time taken for the gaze to move towards the face after
the onset of the stimulus. This duration mainly reflects
the detection of the stimulus position after the onset of
the stimulus and the difficulty of selecting the first
saccade target (Bichot & Schall, 1999; Pomplun, Garaas, & Carrasco, 2013; Pomplun, Reingold, & Shen, 2001), and the subsequent fixation durations reflect the visual processing required for selecting the following fixations (Hooge & Erkelens, 1998). The first two fixations are reported to be the most informative—recognition performance is not affected if the subject is restricted to only two fixations (Hsiao & Cottrell, 2008). Therefore, we did not consider fixations that occurred after the fourth saccade in our analysis.

Figure 7. Duration of fixation as a function of morph level for (a) Texture, (b) Shading, and (c) Stereo. The data are shown only for the first three fixations and the preparatory fixation of the eye. No significant change was detected in the duration of fixation across morph levels within each depth cue.
Repeated-measures ANOVAs were performed on
the duration of fixations with fixation count (preparatory, first, second, and third fixation) and morph levels
(0% to 100% in steps of 10%) as factors. There were no
significant changes detected for duration of fixations
across different morph levels in any of the depth cues—
for Shading, F(10, 80) = 1.911, p = 0.13; for Texture, F(10, 80) = 1.204, p = 0.326; or for Stereo, F(10, 60) = 3.29, p = 0.119. Thus, the amount of facial identity
information provided did not alter the duration of
fixations: Whether subjects were performing the identification task with high accuracy or were completely at
chance, fixation durations remained unchanged.
While fixation durations did not vary as a function
of task difficulty, fixation durations did vary as a
function of the fixation count in a manner that was
consistent across all depth cues. The duration of
fixations significantly varied between consecutive fixations within a trial (Figure 8).
Because the durations of fixations were constant
across all the morphed levels, the pooled data from all
morph levels were used for the analysis of fixation
durations by fixation count. A repeated-measures
ANOVA was performed on the duration of fixations as
a function of fixation count (preparatory, first, second,
and third fixation) and depth-cue. There was a main
effect of fixation count, F(3, 18) = 92.77, p < 0.001, and of depth-cue, F(2, 12) = 14.92, p = 0.004, but no interaction between depth-cue and fixation count, F(6, 36) = 2.57, p = 0.114. This suggests a significant difference in the duration of fixations for each consecutive saccade.
The second fixation was longer than the third for textured, t(8) = 2.85, p = 0.02, and for stereo-defined stimuli, t(8) = 2.6, p = 0.04, though for shaded stimuli the second and the third were comparable, t(8) = 1.115, p = 0.3. The second fixation was significantly longer than the first for all depth cues, all t(8) > 4.8, p < 0.001.
Figure 8. Fixation duration as a function of fixation count. The
error bars are one standard error across subjects. Data are
averaged across all subjects and all morph levels. The durations
of fixations were significantly different for each consecutive
fixation within a trial.
Discussion

In this study, we examined the effect of depth cues on gaze behavior during face identification. By equating task difficulty across depth cues and using precisely controlled 3-D stimuli, we were able to assess how gaze performance during face identification varies across depth cues. First, we found that the amount of time spent on a given facial region was highly comparable between faces defined by shading and texture, but gaze performance for disparity-defined faces differed—more fixations were spent on the cheeks than with the other depth cues. Second, we observed a comparable pattern of fixation durations for all depth cues across the most informative saccades (first, second, and third), suggesting that similar amounts of surface information are extracted in each fixation regardless of depth cue. Third, we did not observe an effect of task difficulty on fixation durations across different facial regions or on the fixation count—fixation durations across facial regions were highly conserved regardless of whether the face being identified was the average face, close to threshold, or well above identification threshold.

These findings extend previous psychophysical studies on face recognition from 3-D depth cues (Dehmoobadsharifabadi & Farivar, 2016; Farivar et al., 2009; Liu et al., 2005) and suggest that depth cue invariant face representations may be formed by comparable (e.g., for shading and texture) as well as divergent (e.g., for disparity) gaze strategies.

Gaze behavior during face recognition

The mechanism of encoding information from different depth cues is believed to be different for texture and shading (Grill-Spector, Kushnir, Edelman, Itzchak, & Malach, 1998; Sereno, Trinath, Augath, & Logothetis, 2002). Our results suggest that although
different mechanisms are involved, the brain tends to gather the information with the same strategy as far as the position of fixations is concerned. Because the only shared attribute of the stimuli used is the depth profile of the object, this suggests that the information from the different processes is aggregated into a comprehensive depth map of the object, and that this depth map acts as a guiding force that dictates where to look.
We observed more fixations around the cheeks in the
stereo task, which was consistent with a previous
report. Chelnokova and Laeng (2011) interpreted the
increased fixation time spent on cheeks during the 3-D
viewing as a preference for more volumetric properties
of the scene during 3-D viewing. But these volumetric properties were identical across depth cues in our stimulus set and therefore cannot account for the change in gaze behavior. Instead, we suggest that the preference for greater fixation on the cheeks may be a feature specific to the use of disparity rather than a general preference for volumetric information.
Whereas our monocular recording conditions preclude a deep discussion of the role of vergence eye
movements during the disparity tasks, we believe
vergence eye movements played very little role in the
pattern described here. The stereo-paired images shown
to the two eyes were almost perfectly colocalized on the
screen, allowing us to relate the monocular recording to
the screen position of the facial features. It will be
interesting to investigate the specific role of vergence
eye movements and their speed in the perception of
objects defined by disparity alone or jointly with other depth cues.
The preferred fixation locations around the nose and
the cheeks were consistent with previous findings of eye movements during face recognition (Bindemann,
Scheepers, & Burton, 2009; Chelnokova & Laeng,
2011; Hsiao & Cottrell, 2008). For shading-defined
stimuli, these facial areas are also more salient—local
gradients of luminance contrasts and the presence of
sharp edges are suggested to be more attractive for the
gaze (Itti & Koch, 2000; Olshausen, Anderson, & Van
Essen, 1993; Treisman, 1998; Wolfe, 1994). These low-level saliencies are absent in texture- and stereo-defined
faces; therefore, our results strongly suggest that the
depth profile of the object overcomes any effect of these
low-level features. Saliency in the area around the nose
could also be due to this area being the facial center of
gravity, which is suggested to be the first target of gaze
in any object recognition task (Bindemann et al., 2009;
He & Kowler, 1991; Kowler & Blaser, 1995; McGowan, Kowler, Sharma, & Chubb, 1998; Melcher &
Kowler, 1999; Vishwanath & Kowler, 2003, 2004;
Vishwanath, Kowler, & Feldman, 2000). We also observed a rightward bias in the position of fixations, which contrasts with previous findings of eye movements during the recognition of face images
(Armann & Bulthoff, 2009; Hsiao & Cottrell, 2008;
Peterson & Eckstein, 2012).
We found that the durations of consecutive fixations within a trial differed significantly. The fixation
pattern starts with a short fixation (first fixation) and
continues with a long fixation (second fixation) and a
fixation with intermediate duration (third fixation).
This effect, which was reported earlier in face
recognition tasks (Hsiao & Cottrell, 2008), was
consistent for all the depth-cues used in this experiment
(Figure 8). One major view holds that the duration of a fixation is determined by the ongoing visual and cognitive processes that take place during that fixation (Henderson & Ferreira, 1990; Morrison, 1984; Rayner, 1998). On this view, the duration of
fixation is the time required for the visual system to
register the input and calculate the best next saccade
target in order to perform the task with maximum
efficiency. In this study, we controlled the effect of one
of the main factors of the cognitive process by equating
the task difficulty. Therefore, the difference between the
duration of fixations is mainly due to the time required
for the visual system to process the facial surface from
the different depth cues and not task difficulty. The
difference in processing time can express itself by
biasing the duration of all the fixations with a constant
value between depth cues. This effect can be seen
clearly as a constant shift of the graphs between depth cues (Figure 8), suggesting that the relative amount of
time needed between subsequent fixations is comparable across all depth cues, but the average total fixation
duration may vary across depth cues.
Effect of task difficulty on gaze behavior
We did not observe an effect of task difficulty (i.e.,
morph level) on fixation durations. This was in contrast
with our prediction that the amount of identity
information (i.e., task difficulty) would affect fixation
durations (Armann & Bulthoff, 2009; Rayner, 1998;
Underwood et al., 2004). Longer fixation durations are
often interpreted as reflecting more extensive processing, such as in the reading of low-frequency words (Underwood et al., 2004). Our results corroborate those of Armann and Bulthoff (2009), who also did not
observe an effect of task difficulty on fixation duration
during a 2AFC face identification task with face
morphs. Therefore, the duration of fixation is more
likely related to the amount of time the brain requires
to extract surface information from the depth-defined
stimuli, rather than identification of those surfaces.
In contrast to Armann and Bulthoff (2009), we did
not observe an interaction between the amount of time
spent on each facial feature and task difficulty,
suggesting task difficulty did not alter fixation patterns
in our task. Whereas Armann and Bulthoff (2009) used
simultaneous presentation of two faces, our task
required identification of a single face image. This
difference in the task may have influenced how
fixations were planned—the simultaneous presentation
of two face images in Armann and Bulthoff (2009) may
have allowed subjects to make an initial guess at the
task difficulty and to adjust their gaze strategy
accordingly. In our task, subjects had no way of
knowing the difficulty of a face before initiating their
inspection for identification and may have therefore
adopted a single inspection strategy for all morph
levels. We speculate that knowledge of task difficulty
may alter fixation patterns, but task difficulty itself
does not.
Generalizability to other depth cues and object
categories
In this study, we were unable to include structure-from-motion, mainly because the motion of the stimulus poses difficulties both for reliably analyzing fixations and for comparing those fixations with stationary stimuli such as shading and texture. Another factor that limits the generalization of our findings to other object categories is the choice of stimuli. In order to exclude task difficulty and obtain comparable fixation patterns, we used faces as stimuli, because facial surfaces and their morphs have comparable depth profiles. While restricting our analysis
of depth-driven eye movements to faces did facilitate
equating for task difficulty across depth cues, it does
reduce the generalizability of the findings to other
object categories.
Depth cue invariance and gaze behavior
Our study sought to determine whether gaze
behavior is invariant with regards to depth cues or
whether it is depth cue specific. We found moderate
support for a depth cue invariant mechanism in gaze
planning but with an important caveat. The fact that
shaded faces and textured faces resulted in highly
comparable patterns of fixations and fixation durations
despite large differences in low-level image features
suggests low-level features did not influence gaze
behavior on our task. Alone, the highly comparable
gaze patterns between the shaded and textured faces
would support a depth cue invariant model of gaze
planning, were it not for the fact that preferred fixation
positions for faces defined by binocular disparity
differed significantly from that of both textured and
shaded faces. While disparity-defined faces on average
caused longer fixations, the pattern of fixation durations across the most informative fixations was highly
comparable across depth cues, suggesting some aspects
of gaze behavior are invariant with regards to depth
cue.
We propose that the divergence of the results with
disparity-defined faces may have more to do with the
limit of stereopsis than depth cue invariance in gaze
planning and execution. The perception of the depth
profile of an object in stereo-defined stimuli is based on
the accumulation of relative disparity information
across the surface—the gradient of disparity change
that defines the curvature of the cheeks, for example, is
effectively the acceleration of disparity over space.
Relative disparity sensitivity decreases drastically as a
function of eccentricity, or distance from fixation (Prince
& Rogers, 1998; Rady & Ishak, 1955). Thus detecting
the curvature of the cheek would be very difficult when
one is fixating on the nose. In order to accurately
perceive the changes in cheek curvature from stereopsis, one would need to fixate on the cheek. The
increased number of fixations to the cheek area for
disparity-defined faces may therefore reflect the limit of
human disparity perception rather than a depth cue
invariant mechanism for directing gaze at objects.
Conclusion
By equating task difficulty across depth cues and
using precisely controlled 3-D stimuli, we were able to
assess how gaze performance during face identification
varies across depth cues. We found that the amount of
time spent on a given facial region was highly
comparable between faces defined by shading and
texture, but gaze performance for disparity-defined
faces did differ. We also observed a comparable pattern
of fixation durations for all depth cues across the most
informative saccades, suggesting that similar amounts
of surface information are extracted in each fixation,
regardless of depth cue. Task difficulty did not affect
fixation durations or fixation count. The results lend
support to a depth cue invariant mechanism for object
inspection and gaze planning.
Keywords: face recognition, depth-cue, eye movement,
shading, texture, disparity, task difficulty, depth perception
Acknowledgments
The research was funded by an NSERC Discovery
grant and an internal start-up award to RF by the
Research Institute of the McGill University Health
Centre.
Commercial relationships: none.
Corresponding author: Reza Farivar.
Email: [email protected].
Address: McGill Vision Research Unit, Department of
Ophthalmology, McGill University, Montréal,
Canada.
References
Armann, R., & Bulthoff, I. (2009). Gaze behavior in
face comparison: The roles of sex, task, and
symmetry. Attention, Perception, & Psychophysics,
71(5), 1107–1126, doi:10.3758/app.71.5.1107.
Backus, B. T., Fleet, D. J., Parker, A. J., & Heeger, D.
J. (2001). Human cortical activity correlates with
stereoscopic depth perception. Journal of Neurophysiology, 86(4), 2054–2068.
Bichot, N. P., & Schall, J. D. (1999). Saccade target
selection in macaque during feature and conjunction visual search. Visual Neuroscience, 16(1), 81–
89.
Bindemann, M., Scheepers, C., & Burton, A. M.
(2009). Viewpoint and center of gravity affect eye
movements to human faces. Journal of Vision, 9(2):
7, 1–16, doi:10.1167/9.2.7. [PubMed] [Article]
Chelnokova, O., & Laeng, B. (2011). Three-dimensional information in face recognition: An eye-tracking study. Journal of Vision, 11(13):27, 1–15,
doi:10.1167/11.13.27. [PubMed] [Article]
Dehmoobadsharifabadi, A., & Farivar, R. (2016). Are
face representations depth cue invariant? Journal of
Vision, 16(8):6, 1–15, doi:10.1167/16.8.6. [PubMed]
[Article]
Farivar, R., Blanke, O., & Chaudhuri, A. (2009).
Dorsal-ventral integration in the recognition of
motion-defined unfamiliar faces. Journal of Neuroscience, 29(16), 5336–5342, doi:10.1523/jneurosci.
4978-08.2009.
Findlay, J. M., & Walker, R. (1999). A model of
saccade generation based on parallel processing
and competitive inhibition. Behavioral and Brain
Science, 22(4), 661–721.
Grill-Spector, K., Kushnir, T., Edelman, S., Itzchak,
Y., & Malach, R. (1998). Cue-invariant activation
in object-related areas of the human occipital lobe.
Neuron, 21(1), 191–202.
He, P. Y., & Kowler, E. (1991). Saccadic localization of
eccentric forms. Journal of the Optical Society of
America A, 8(2), 440–449.
Heeger, D. J., Boynton, G. M., Demb, J. B.,
Seidemann, E., & Newsome, W. T. (1999). Motion
opponency in visual cortex. Journal of Neuroscience, 19(16), 7162–7174.
Henderson, J. M., & Ferreira, F. (1990). Effects of
foveal processing difficulty on the perceptual span
in reading: Implications for attention and eye
movement control. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(3),
417–429.
Hooge, I. T., & Erkelens, C. J. (1998). Adjustment of
fixation duration in visual search. Vision Research,
38(9), 1295–1302.
Hsiao, J. H., & Cottrell, G. (2008). Two fixations
suffice in face recognition. Psychological Science,
19(10), 998–1006, doi:10.1111/j.1467-9280.2008.
02191.x.
Itti, L., & Koch, C. (2000). A saliency-based search
mechanism for overt and covert shifts of visual
attention. Vision Research, 40(10–12), 1489–1506.
Jovancevic, J., Sullivan, B., & Hayhoe, M. (2006).
Control of attention and gaze in complex environments. Journal of Vision, 6(12):9, 1431–1450, doi:
10.1167/6.12.9. [PubMed] [Article]
Kamitani, Y., & Tong, F. (2006). Decoding seen and
attended motion directions from activity in the
human visual cortex. Current Biology, 16(11),
1096–1102, doi:10.1016/j.cub.2006.04.003.
Kowler, E., & Blaser, E. (1995). The accuracy and
precision of saccades to small and large targets.
Vision Research, 35(12), 1741–1754.
Liu, C. H., Collin, C. A., Farivar, R., & Chaudhuri, A.
(2005). Recognizing faces defined by texture
gradients. Perception & Psychophysics, 67(1), 158–
167.
Liu, C. H., & Ward, J. (2006). The use of 3D
information in face recognition. Vision Research,
46(6–7), 768–773, doi:10.1016/j.visres.2005.10.008.
McGowan, J. W., Kowler, E., Sharma, A., & Chubb,
C. (1998). Saccadic localization of random dot
targets. Vision Research, 38(6), 895–909.
Melcher, D., & Kowler, E. (1999). Shapes, surfaces and
saccades. Vision Research, 39(17), 2929–2946.
Merigan, W. H. (2000). Cortical area V4 is critical for
certain texture discriminations, but this effect is not
dependent on attention. Visual Neuroscience, 17(6),
949–958.
Merigan, W. H., & Pham, H. A. (1998). V4 lesions in
macaques affect both single- and multiple-viewpoint shape discriminations. Visual Neuroscience,
15(2), 359–367.
Morrison, R. E. (1984). Manipulation of stimulus onset
delay in reading: Evidence for parallel programming of saccades. Journal of Experimental Psychology: Human Perception & Performance, 10(5),
667–682.
Navalpakkam, V., & Itti, L. (2005). Modeling the
influence of task on attention. Vision Research,
45(2), 205–231, doi:10.1016/j.visres.2004.07.042.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C.
(1993). A neurobiological model of visual attention
and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11), 4700–4719.
Peterson, M. F., & Eckstein, M. P. (2012). Looking just
below the eyes is optimal across face recognition
tasks. Proceedings of the National Academy of
Sciences, USA, 109(48), E3314–E3323, doi:10.1073/
pnas.1214269109.
Pomplun, M., Garaas, T. W., & Carrasco, M. (2013).
The effects of task difficulty on visual search
strategy in virtual 3D displays. Journal of Vision,
13(3):24, 1–22, doi:10.1167/13.3.24. [PubMed]
[Article]
Pomplun, M., Reingold, E. M., & Shen, J. (2001).
Peripheral and parafoveal cueing and masking
effects on saccadic selectivity in a gaze-contingent
window paradigm. Vision Research, 41(21), 2757–
2769.
Prince, S. J., & Rogers, B. J. (1998). Sensitivity to
disparity corrugations in peripheral vision. Vision
Research, 38(17), 2533–2537.
Prins, N., & Kingdom, F.A.A. (2009). Palamedes:
MATLAB routines for analyzing psychophysical
data. Retrieved from www.palamedestoolbox.org
Rady, A. A., & Ishak, I. G. (1955). Relative
contributions of disparity and convergence to
stereoscopic acuity. Journal of the Optical Society of
America, 45(7), 530–534.
Rayner, K. (1998). Eye movements in reading and
information processing: 20 years of research.
Psychological Bulletin, 124(3), 372–422.
Sereno, M. E., Trinath, T., Augath, M., & Logothetis,
N. K. (2002). Three-dimensional shape representation in monkey cortex. Neuron, 33(4), 635–652.
Sogo, H. (2013). GazeParser: An open-source and
multiplatform library for low-cost eye tracking and
analysis. Behavior Research Methods, 45(3), 684–
695, doi:10.3758/s13428-012-0286-x.
Thiele, A., Henning, P., Kubischik, M., & Hoffmann,
K. P. (2002). Neural mechanisms of saccadic
suppression. Science, 295(5564), 2460–2462, doi:10.
1126/science.1068788.
Tootell, R. B., Reppas, J. B., Kwong, K. K., Malach,
R., Born, R. T., Brady, T. J., . . . Belliveau, J. W.
(1995). Functional analysis of human MT and
related visual cortical areas using magnetic resonance imaging. Journal of Neuroscience, 15(4),
3215–3230.
Treisman, A. (1998). Feature binding, attention and
object perception. Philosophical Transactions of the
Royal Society of London: Series B Biological
Sciences, 353(1373), 1295–1306, doi:10.1098/rstb.
1998.0284.
Tsao, D. Y., Vanduffel, W., Sasaki, Y., Fize, D.,
Knutsen, T. A., Mandeville, J. B., . . . Tootell, R. B.
(2003). Stereopsis activates V3A and caudal intraparietal areas in macaques and humans. Neuron,
39(3), 555–568.
Turano, K. A., Geruschat, D. R., & Baker, F. H.
(2003). Oculomotor strategies for the direction of
gaze tested with a real-world activity. Vision
Research, 43(3), 333–346.
Underwood, G., Jebbett, L., & Roberts, K. (2004).
Inspecting pictures for information to verify a
sentence: Eye movements in general encoding and
in focused search. The Quarterly Journal of
Experimental Psychology A, 57(1), 165–182, doi:10.
1080/02724980343000189.
Vishwanath, D., & Kowler, E. (2003). Localization of
shapes: Eye movements and perception compared.
Vision Research, 43(15), 1637–1653.
Vishwanath, D., & Kowler, E. (2004). Saccadic
localization in the presence of cues to threedimensional shape. Journal of Vision, 4(6):4, 445–
458, doi:10.1167/4.6.4. [PubMed] [Article]
Vishwanath, D., Kowler, E., & Feldman, J. (2000).
Saccadic localization of occluded targets. Vision
Research, 40(20), 2797–2811.
Volkmann, F. C. (1986). Human visual suppression.
Vision Research, 26(9), 1401–1416.
Vosskuhler, A., Nordmeier, V., Kuchinke, L., &
Jacobs, A. M. (2008). OGAMA (Open Gaze and
Mouse Analyzer): Open-source software designed
to analyze eye and mouse movements in slideshow
study designs. Behavior Research Methods, 40(4),
1150–1162, doi:10.3758/brm.40.4.1150.
Wolfe, J. M. (1994). Visual search in continuous,
naturalistic stimuli. Vision Research, 34(9), 1187–
1195.
Zeki, S., Watson, J. D., Lueck, C. J., Friston, K. J.,
Kennard, C., & Frackowiak, R. S. (1991). A direct
demonstration of functional specialization in human visual cortex. Journal of Neuroscience, 11(3),
641–649.