Bottom-up and top-down processing in visual perception
Thomas Serre
Brown University
Department of Cognitive & Linguistic Sciences
Brown Institute for Brain Science
Center for Vision Research
The ‘what’ problem
monkey electrophysiology
Willmore, Prenger & Gallant ’10; Yamane, Carlson, Bowman, Wang & Connor ’08; Hung, Kreiman et al. ’05
[Figure: single-unit tuning in monkey visual cortex for three-dimensional shape, sampled combinatorially because of the intractable size of three-dimensional shape space. Panels a–j plot responses (spikes s⁻¹) against light angle (deg), rotation angle (deg), size, depth (deg), x and y position (deg), minimum curvature, and angle on the yz plane (deg), under stereo vs. nonstereo viewing and shading / texture / no-cue conditions; one panel shows a fitted interaction, Response = 0.4A + 0.0B + 49.0AB + 0.0]
The ‘what’ problem
human psychophysics
Thorpe et al. ’96; VanRullen & Thorpe ’01; Fei-Fei et al. ’02, ’05; Evans & Treisman ’05; Serre, Oliva & Poggio ’07
Vision as ‘knowing what is where’
Aristotle; Marr ’82
‘What’ and ‘where’ cortical pathways
Ungerleider & Mishkin ’84
dorsal ‘where’ stream
ventral ‘what’ stream
How does the visual system combine information about the identity and location of objects?
➡ Central thesis: visual attention
(see also Van der Velde & de Kamps ’01; Deco & Rolls ’04)
Part I: A ‘Bayesian’ theory of attention that solves the ‘what is where’ problem
• Predictions for neurophysiology and human eye movement data
(with Sharat Chikkerur, Cheston Tan & Tomaso Poggio)
Part II: Readout of IT population activity
• Attention eliminates clutter
(with Ying Zhang, Ethan Meyers, Narcisse Bichot, Tomaso Poggio & Bob Desimone)
Perception as Bayesian inference
S: visual scene description
I: image measurements
Mumford ’92; Knill & Richards ’96; Dayan & Zemel ’99; Rao ’02, ’04; Kersten & Yuille ’03; Kersten et al. ’04; Lee & Mumford ’03; Dean ’05; George & Hawkins ’05, ’09; Hinton ’07; Epshtein et al. ’08; Murray & Kreutz-Delgado ’07

P(S|I) ∝ P(I|S) P(S)

The scene description comprises object identities and object locations:
S = {O_1, O_2, ..., O_n, L_1, L_2, ..., L_n}
P(S, I) = P(O_1, L_1, O_2, L_2, ..., O_n, L_n, I)
Assumption #1: Attentional spotlight
• To recognize and localize objects in the scene, the visual system selects objects one at a time
The scene description therefore reduces to a single object-location pair: P(O, L, I)
O, L: visual scene description; I: image measurements
Broadbent ’52, ’54; Treisman ’60; Treisman & Gelade ’80; Desimone & Duncan ’95; Wolfe ’97; and many others
Assumption #2: ‘what’ and ‘where’ pathways
• Object location L and identity O are independent:
P(O, L, I) = P(O) P(L) P(I | L, O)
ventral ‘what’ stream: object identity O
dorsal ‘where’ stream: location L
I: image measurements
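To make the factorization concrete, here is a minimal numpy sketch (all probabilities are made-up toy values, purely illustrative): it builds the joint P(O, L, I) from the three factors and recovers the ‘what’ and ‘where’ marginals by Bayes’ rule.

```python
import numpy as np

# Assumption #2 as arrays: P(O, L, I) = P(O) P(L) P(I | L, O).
# 2 object identities, 3 locations, one binary image measurement (toy values).
P_O = np.array([0.5, 0.5])               # prior over object identity O
P_L = np.array([0.2, 0.5, 0.3])          # prior over location L, independent of O
P_I1_given_LO = np.array([[0.9, 0.1],    # P(I = 1 | L, O), rows: L, cols: O
                          [0.6, 0.4],
                          [0.2, 0.8]])

joint = P_O[None, :] * P_L[:, None] * P_I1_given_LO   # P(O, L, I = 1)
posterior = joint / joint.sum()                       # P(O, L | I = 1)
print(posterior.sum(axis=0))   # marginal P(O | I = 1): the 'what'
print(posterior.sum(axis=1))   # marginal P(L | I = 1): the 'where'
```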
Assumption #3: universal features
• Objects are encoded by collections of ‘universal’ features of intermediate complexity
- Either present or absent
- Conditionally independent given the object and its location
Graphical model: object O and location L at the top; F_i, position- and scale-tolerant features (plate of size N; maps onto IT: Serre et al. ’07); X_i, retinotopic features (maps onto V2/V4: Serre et al. ’05; David et al. ’06; Cadieu et al. ’07; Willmore et al. ’10); I, image measurements at the bottom
see Riesenhuber & Poggio ’99; Serre et al. ’05, ’07
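A hedged generative sketch of this graphical model (the two object classes, the feature probabilities, and the false-alarm rate are illustrative assumptions, not fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats, n_locs = 4, 5   # N 'universal' features, |L| retinotopic positions

# Hypothetical feature probabilities P(F_i = 1 | O): each binary feature is
# present or absent given the object, conditionally independent of the others.
P_F_given_O = {'car':        np.array([0.9, 0.8, 0.1, 0.2]),
               'pedestrian': np.array([0.1, 0.3, 0.9, 0.8])}

def sample_scene(obj, rng):
    """Sample (L, F, X) top-down from the graphical model for one object."""
    L = rng.integers(n_locs)                        # where the object is
    F = rng.random(n_feats) < P_F_given_O[obj]      # which features are present
    # X_i: retinotopic map for feature i, active at L when F_i is present,
    # with a small assumed false-alarm rate elsewhere.
    X = rng.random((n_feats, n_locs)) < 0.05
    X[F, L] = True
    return L, F, X

L, F, X = sample_scene('car', rng)
print(L, F.astype(int), X.astype(int), sep='\n')
```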
Predicting IT readout
Serre et al. ’07
[Figure: classification performance (0–1) of a readout classifier trained (TRAIN) on objects of size 3.4° at the center and tested (TEST) at 3.4°, 1.7°, and 6.8° at the center and at 3.4° shifted 2° and 4° horizontally. Model: Serre et al. ’05; experimental data: Hung*, Kreiman* et al. ’05]
Bayesian inference and attention
• Goal of visual perception: to estimate posterior probabilities of visual features, objects and their locations in an image
• Attention corresponds to conditioning on high-level latent variables representing particular features or locations (as well as on sensory input), and doing inference over the other latent variables
Graphical model: object O and location L; position- and scale-tolerant features F_i; retinotopic features X_i (plate of size N); image measurements I; a feedforward (bottom-up) sweep drives behavior
Serre, Oliva & Poggio ’07
Bayesian inference and attention
dorsal / ‘where’: location L; ventral / ‘what’: object O and features F_i
Posterior over a retinotopic feature X^i given the image I:

P(X^i | I) = [ P(I | X^i) · Σ_{F^i,L} P(X^i | F^i, L) P(L) P(F^i) ] / [ Σ_{X^i} P(I | X^i) · Σ_{F^i,L} P(X^i | F^i, L) P(L) P(F^i) ]

- feedforward input: P(I | X^i), the bottom-up sweep
- attentional modulation: Σ_{F^i,L} P(X^i | F^i, L) P(L) P(F^i), where the location prior P(L) implements top-down spatial attention and the feature prior P(F^i) implements top-down feature-based attention
- suppressive drive: the denominator, which normalizes each unit by a pool over the competing units X^k
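The equation translates directly into a few lines of numpy. A minimal sketch assuming discrete variables; the function name, array shapes, and the binary feature variable are my illustrative choices, not the published implementation:

```python
import numpy as np

def posterior_X(P_I_given_X, P_X_given_FL, P_L, P_F):
    """P(X | I) for one feature channel, following the slide's equation:
    P(X|I) = P(I|X) * sum_{F,L} P(X|F,L) P(L) P(F), normalized over X.
    Shapes (illustrative): P_I_given_X (n_x,); P_X_given_FL (n_x, 2, n_l)
    with F in {absent, present}; P_L (n_l,); P_F (2,)."""
    modulation = np.einsum('xfl,f,l->x', P_X_given_FL, P_F, P_L)  # attentional term
    unnorm = P_I_given_X * modulation        # feedforward input x top-down priors
    return unnorm / unnorm.sum()             # denominator: the suppressive drive
```

With uniform P(L) and P(F^i) this reduces to a purely feedforward read of P(I | X^i); concentrating either prior modulates the posterior multiplicatively, which is how the short demos further below use it.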
The Normalization Model of Attention
John H. Reynolds (Salk Institute for Biological Studies) & David J. Heeger (Department of Psychology and Center for Neural Science, New York University), Neuron Review ’09, DOI 10.1016/j.neuron.2009.01.002

R(x, θ) = A(x, θ) E(x, θ) / (S(x, θ) + σ)

[Slide reproduces the first page of the review: a model in which the attention field A(x, θ) multiplies the stimulus drive E(x, θ) before divisive normalization by the suppressive drive S(x, θ) reproduces response-gain, contrast-gain, and tuning-sharpening effects of attention, depending on the stimulus conditions and the spread (or selectivity) of the attention field]
Multiplicative scaling of tuning curves by spatial attention
McAdams & Maunsell ’99 (data) vs. model
P(X^i = x | I) ∝ Σ_{F^i,L} P(X^i = x | F^i, L) P(I | X^i) P(F^i) P(L)
Spatial attention corresponds to a concentrated location prior, P(L = x) ≈ 1 at the attended location; the unattended baseline is the uniform prior P(L = x) = 1/|L|
see also Reynolds & Heeger ’09
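Using the posterior_X sketch above, spatial attention can be mimicked by swapping the location prior; with toy numbers the attended prior boosts the posterior at the cued location relative to the uniform baseline, qualitatively in the spirit of the multiplicative-scaling result (all values below are made up):

```python
import numpy as np  # reuses posterior_X from the sketch above

n_x, n_l = 8, 8
rng = np.random.default_rng(1)
# Random conditional P(X | F, L), shape (n_x, 2, n_l); toy values only.
P_X_given_FL = rng.dirichlet(np.ones(n_x), size=(2, n_l)).transpose(2, 0, 1)
P_I_given_X = np.linspace(0.2, 1.0, n_x)    # hypothetical feedforward drive
P_F = np.array([0.5, 0.5])                  # neutral feature prior

attended = np.eye(n_l)[3] * 0.99 + 0.01 / n_l   # P(L = x) ~ 1 at the attended site
uniform = np.full(n_l, 1.0 / n_l)               # P(L = x) = 1/|L|, unattended

print(posterior_X(P_I_given_X, P_X_given_FL, attended, P_F))
print(posterior_X(P_I_given_X, P_X_given_FL, uniform, P_F))
```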
Contrast vs. response gain
Martinez-Trujillo & Treue ’02; McAdams & Maunsell ’99
The same posterior P(X^i | I) as above accounts for both contrast-gain and response-gain modulation, depending on the stimulus conditions and the spread of the attentional priors
see also Reynolds & Heeger ’09
Feature-based attention
neural data from Bichot et al. ’05
P(X^i = x | I) ∝ Σ_{F^i,L} P(X^i = x | F^i, L) P(I | X^i) P(F^i) P(L)
[Figure: normalized responses over time (0–200 ms) for the four combinations of preferred/non-preferred stimulus and preferred/non-preferred cue, model vs. data]
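Feature-based attention fits the same sketch by biasing the feature prior instead of the location prior (continuing the toy setup defined above; the specific prior values are illustrative):

```python
# Cueing the preferred feature raises P(F^i = present) everywhere in the field.
cued = np.array([0.1, 0.9])      # feature attended
uncued = np.array([0.9, 0.1])    # feature unattended
print(posterior_X(P_I_given_X, P_X_given_FL, uniform, cued))
print(posterior_X(P_I_given_X, P_X_given_FL, uniform, uncued))
```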
Learning to localize cars and pedestrians
The graphical model is extended with a Context node: object priors are learned from training examples, and location priors are learned from global contextual cues (see Torralba, Oliva et al.)
The experiment
• Eye movements as proxy for attention
• Dataset:
- 100 street-scenes images with cars & pedestrians and 20 without
• Experiment:
- 8 participants asked to count the number of cars/pedestrians
- block design for cars and pedestrians
- eye movements recorded using an infrared eye tracker

[Figure: ROC area (0.5–1) for predicting participants’ first three fixations during car and pedestrian search, comparing uniform priors (bottom-up), feature priors, feature + contextual (spatial) priors, and human inter-subject agreement]

Explains 92% of the inter-subject agreement!
*similar (independent) results by Ehinger, Hidalgo-Sotelo, Torralba & Oliva ’10
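One common way to score such fixation predictions, sketched below: treat fixated pixels as positives and compute the area under the ROC curve over the priority map’s values (the experiment’s exact evaluation protocol may differ; this is an illustrative version using scikit-learn, and fixation_auc is a hypothetical helper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(priority_map, fixations):
    """ROC area for predicting fixations from a priority map: fixated pixels
    are positives, all others negatives; map values serve as scores."""
    labels = np.zeros(priority_map.shape, dtype=int)
    ys, xs = zip(*fixations)
    labels[list(ys), list(xs)] = 1
    return roc_auc_score(labels.ravel(), priority_map.ravel())

# Hypothetical example: a smooth map peaking near one of two 'fixations'.
h, w = 48, 64
yy, xx = np.mgrid[0:h, 0:w]
pmap = np.exp(-((yy - 10) ** 2 + (xx - 20) ** 2) / 50.0)
print(fixation_auc(pmap, [(10, 20), (40, 60)]))
```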
Bottom-up saliency and free-viewing
human eye data from Bruce & Tsotsos

Method                ROC area
Bruce & Tsotsos ’06   72.8%
Itti et al. ’01       72.7%
Proposed              77.9%
Summary I
• Goal of vision:
- To solve the ‘what is where’ problem
• Key assumptions:
- ‘Attentional spotlight’ → recognition done sequentially, one object at a time
- Independent ‘what’ and ‘where’ pathways
• Attention as the inference process implemented by the interaction between ventral and dorsal areas
- Integrates bottom-up and top-down (feature-based and context-based) attentional mechanisms
- Seems consistent with known physiology
• Main attentional effect in the presence of clutter
- Spatial attention reduces uncertainty over location and improves object recognition performance over the first bottom-up (feedforward) pass
Part I: A ‘Bayesian’ theory of attention that solves the ‘what is where’ problem
• Predictions for neurophysiology and human eye movement data
(with Sharat Chikkerur, Cheston Tan & Tomaso Poggio)
Part II: Readout of IT population activity
• Attention eliminates clutter
(with Ying Zhang, Ethan Meyers, Narcisse Bichot, Tomaso Poggio & Bob Desimone)
The ‘readout’ approach
Responses of neuron 1, neuron 2, neuron 3, ..., neuron n → Pattern Classifier → Prediction, scored as “Correct” or not
IT readout improves with attention
• Train a readout classifier on the isolated object
• Test generalization in clutter under three conditions: unattended, attended, attend away
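A toy sketch of this train/test protocol (synthetic ‘population’ data and a purely illustrative model of attention as a reduction of the distractor-driven component; the actual analysis uses the recorded IT responses):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_neurons, n_trials, n_objects = 64, 40, 2

# Synthetic pseudo-population: a mean rate pattern per object, plus noise.
prototypes = rng.random((n_objects, n_neurons)) * 20
isolated = np.concatenate([p + rng.normal(0, 2, (n_trials, n_neurons))
                           for p in prototypes])
labels = np.repeat(np.arange(n_objects), n_trials)

# Clutter adds a distractor-driven component; attention is modeled here,
# purely for illustration, as shrinking that component.
distractor = rng.random(n_neurons) * 20
def cluttered(gain):
    return isolated + gain * distractor + rng.normal(0, 2, isolated.shape)

clf = LinearSVC().fit(isolated, labels)              # train on the isolated object
for name, gain in [('unattended', 1.0), ('attended', 0.2)]:
    print(name, clf.score(cluttered(gain), labels))  # generalization in clutter
```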
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
IT readout improves
with attention
+
Zhang Meyers Bichot Serre Poggio Desimone unpublished data
• Robert Desimone (MIT)
• Christof Koch (Caltech)
• Laurent Itti (USC)
• Tomaso Poggio (MIT)
• David Sheinberg (Brown)
Thank you!