What is focal attention for?
The What and Why of perceptual selection

Institute Jean Nicod, Oct 28, 2005
 The central function of focal attention is to select
 We must select because our capacity to process
information is limited
 We must select because we need to be able to mark
certain aspects of a scene and to refer to the marked
tokens individually
 That’s what this talk is principally about:
but first some background

The functions of focal attention
A central notion in vision science is that of “picking out” or
selecting (also referring, tracking). The usual mechanism for
perceptual selection is called selective attention or focal attention.
 Why must we select at all? Overview

We must select because we can’t process all the information available.
This is the resource-limitation reason.
○ But in what way (along what dimensions) is it limited? What happens to
what is not selected? The “filter theory” has many problems.



We need to select because certain patterns cannot be computed
without first marking certain special elements (e.g. in counting)
We need to select in order to track the identity of individual things
e.g., to solve the correspondence problem by identifying tokens in
order to establish the equivalence of this (t=i) and this (t=i+ε)
We need to select because of the way relevant information in the
world is packaged. This leads to the Binding Problem. That’s an
important part of what I will discuss in this talk.
Broadbent’s Filter Theory
(illustrating the resource-limited account of selection)
[Diagram: information flows from the Senses into a Very Short Term Store, through a selective Filter into the Limited Capacity Channel; the channel feeds a Motor planner and Effectors, a rehearsal loop back to the store, and a store of conditional probabilities of past events (in LTM).]
Broadbent, D. E. (1958). Perception and Communication. London: Pergamon Press.

Attention and Selection
 The question of what is the basis for selection has
been at the bottom of a lot of controversy in vision
science. Some options that have been proposed
include:
 We select what can be described physically (i.e., by
“channels”) – we select transducer outputs
 e.g., we select by frequency, color, shape, or location
 We select according to what is important to us (e.g.,
affordances – Gibson), or according to phenomenal
salience (William James)
 We select what we need to treat as special or what we
need to refer to

selecting as “marking”
Consider the options for what is the
basis of visual selection

The most obvious answer to what we select is places or
locations. We can select most other properties by their
location – e.g., we can move our eyes so our gaze lands
on different places
 Must we always move our eyes to change what we attend to?
○ Studies of Covert Attention-Movement: Posner (1980)
○ Other empirical questions about place selection…
• When places are selected, are they selected automatically or can
they be selected voluntarily?
• How does the visual system specify where to move attention to?
• Are there restrictions on what places we can select?
• Are selected places punctate or can they be regions?
• Must selected places be filled or can they be empty places?
• Can places be specifiable in relation to landmark objects (e.g.,
select the place half way between X and Y)?
Covert movement of attention
[Diagram: trial sequence of fixation frame, cue, target-cue interval, then a detection target (*) at either the cued or the uncued location.]
Example of an experiment using a cue-validity paradigm for showing that the
locus of attention moves without eye movements and for estimating its speed.
Posner, M. I. (1980). Orienting of Attention. Quarterly Journal of Experimental Psychology, 32, 3-25.
Extension of Posner’s demonstration of attention switch
Does the improved detection in intermediate locations entail that the “spotlight of
attention” moves continuously through empty space?
Sperling & Weichselgartner argued that this analog
movement is best explained by a quantal mechanism
The theory assumes a quantal jump in attention in which the spotlight
pointed at location -2 is extinguished and, simultaneously, the spotlight
at location +2 is turned on. Because extinction and onset take a
measurable amount of time, there is a brief period when the spotlights
partially illuminate both locations simultaneously.
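The quantal account can be put in numerical form. The sketch below is a minimal illustration with made-up spread and timing parameters (not Sperling & Weichselgartner's fitted values): because the onset of the new spotlight overlaps the slower offset of the old one, an intermediate location transiently gains sensitivity without anything moving through space.

```python
import math

def sensitivity(x, t, old=-2.0, new=2.0, spread=1.5, t_on=0.5, t_off=1.0):
    """Detection sensitivity at location x, time t, during a quantal
    attention switch: the spotlight at `old` fades out over t_off while
    the one at `new` turns on over the shorter t_on, so for a while both
    locations are partially illuminated. All parameters are illustrative
    assumptions."""
    w_old = max(0.0, 1.0 - t / t_off)        # old spotlight being extinguished
    w_new = min(1.0, t / t_on)               # new spotlight turning on
    g = lambda d: math.exp(-d * d / (2 * spread * spread))  # spatial falloff
    return w_old * g(x - old) + w_new * g(x - new)

# Mid-switch (t = 0.5) the intermediate location x = 0 is covered by both
# partially lit spotlights, mimicking a spotlight "passing through" it.
```

On this model the apparent analog movement falls out of the overlap of two stationary states, which is the point of the quantal proposal.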

Could Objects, rather than places,
be the basis for selection?
An independently motivated alternative is that selection
occurs when token perceptual objects are individuated
 Individuation involves distinguishing something from all things
it is not. In general individuation involves appealing to properties
of the thing in question (cf Strawson).
○ But a more primitive type of individuation or perceptual parsing
may be computed in early vision
 Primitive Individuation (PI) may be automatic
○ PI is associated with transients or the appearance of a new object
○ PI is sometimes accompanied by assignment of a deictic reference
or FINST that keeps individuals distinct without encoding their
properties (nonconceptual individuation). This indexing process is,
however, numerically limited (to about 4 objects) [* More later]
○ Individuation is often accompanied by the creation of an Object
File (OF) for that individual, though the OF may remain empty
Some empirical evidence for object-based selection and indexing
 General empirical considerations
 Individuals and patterns – the need for argument-binding
 Examples: subitizing, collinearity and other relational judgments
 Experimental demonstrations
 Single-object advantage in joint judgments
 Evidence that whole enduring objects are selected
 Multiple-Object tracking
 Clinical/neuroscience findings
Individuals and patterns

Vision does not recognize patterns by applying templates but by
parsing the pattern into parts – recognition-by-parts (Biederman)
A pattern is encoded over time (and over eye movements), so the
visual system must keep track of the individual parts and recognize
them as the same objects at different times and stages of encoding
Individuating is a prerequisite for recognition of configurational
properties (patterns) defined among several individual parts
An example of how we can easily detect patterns if they are defined
over a small enough number of parts is in subitizing
In order to recognize a pattern, the visual system must pick out
individual parts and bind them to the representation being constructed
Examples include what Ullman called “visual routines”
 Another area where the concept of an individual has become
important is in cognitive development, where it is clear that
babies are sensitive to the numerosity of individual things in a
way that is independent of their perceptual properties
Are there collinear items (n>3)?
Several objects must be picked out at once
in making relational judgments

The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects
first. Respond: Inside-same contour? On-same contour?
Another example: Subitizing vs Counting.
How many squares are there?
Subitizing is fast, accurate and only slightly dependent on how many
items there are. Only the squares on the right can be subitized.
Concentric squares cannot be subitized because individuating
them requires a curve tracing operation that is not automatic.
Signature subitizing phenomena only appear when
objects are automatically individuated and indexed
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A
limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
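The signature pattern can be captured in a toy piecewise model (parameter values invented for illustration, not Trick & Pylyshyn's fits): response time is nearly flat while items can be preattentively indexed, then climbs steeply once serial counting is required.

```python
def enumeration_rt(n, finst_limit=4, base=0.5,
                   subitize_slope=0.05, count_slope=0.30):
    """Toy RT model (seconds) for reporting how many items are shown.
    Up to finst_limit items each get an index, so enumeration is fast
    and almost independent of n (subitizing). Beyond the limit, items
    must be visited serially (counting). All parameter values are
    illustrative assumptions."""
    if n <= finst_limit:
        return base + subitize_slope * n                 # preattentive indexing
    return (base + subitize_slope * finst_limit
            + count_slope * (n - finst_limit))           # serial counting
```

On this account concentric squares would show no flat region at all, since they cannot be automatically individuated in the first place.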
Some empirical evidence for
object-based selection
 General empirical considerations
 Individuals and patterns – the need for argument-binding
 Examples: subitizing, collinearity and other relational judgments
 Some experimental demonstrations
 Single-object advantage in joint judgments
 Evidence that whole enduring objects are selected
 Multiple-Object tracking
 Clinical/neuroscience findings
Single-object superiority occurs even
when the shapes are controlled
Instruction: Attend to the Red objects
Which vertex is higher, left or right?
(Note: There are now many control studies
that eliminate most obvious confounds)
Attention spreads over perceived objects
[Diagram: points A–D on two rectangles; a prime at A facilitates B, an equally distant point on the same object, and not C, on the other object. When the display is redrawn so that A and C share an object, facilitation spreads to C and not B.]
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of
a prime spreads to other parts of the same visual object more than to equally
distant parts of different objects.
We can select a shape even when it is
intertwined among other similar shapes
Are the green items the same? On a surprise test at the
end, subjects were not able to recall shapes that had
been present but had not been attended in the task
(Rock & Gutman, 1981; DeSchepper & Treisman, 1996)
Further evidence that attention is object-based
comes from the finding that various attention
phenomena move with moving objects

Once an object is selected, the selection appears to
remain with the object as it moves
Inhibition of return appears to be object-based

Inhibition-of-return (IOR) is the phenomenon
whereby attention is slow to go back to an object
that had been attended about 0.7 – 1.0 secs before
 It is thought to help in visual search since it prevents
previously visited objects from being revisited
Tipper, Driver & Weaver (1991) showed that IOR
moves with the inhibited object
IOR appears to be object-based (it travels
with the object that was attended)
Objects endure despite changes in location;
and they carry their history with them!
Object File Theory of Kahneman & Treisman
[Diagram: frames 1–3 of a reviewing trial. Letters (A, B) appear briefly in two boxes, the boxes move while empty, and a probe letter then appears in one of them.]
Letters are faster to read if they appear in the same box where they
appeared initially. Priming travels with the object. According to the theory,
when an object first appears, a file is created for it and the properties of the
object are encoded and subsequently accessed through this object-file.
Some empirical evidence for
object-based selection
 General empirical considerations
 Individuals and patterns – the need for argument-binding
 Examples: subitizing, collinearity and other relational judgments
 Experimental demonstrations
 Single-object advantage in joint judgments
 Evidence that whole enduring objects are selected
 Multiple-Object tracking studies (later)
 Clinical/neuroscience findings
 Visual neglect
 Balint syndrome & simultanagnosia
Visual neglect syndrome is object-based
When a right neglect patient is shown a dumbbell that rotates,
the patient continues to neglect the object that had been on the
right, even though it is now on the left (Behrmann & Tipper, 1999).
Simultanagnosic (Balint Syndrome) patients
attend to only one object at a time
Simultanagnosic patients cannot judge the relative length of two
lines, but they can tell that a figure made by connecting the ends
of the lines is not a rectangle but a trapezoid (Holmes & Horax, 1919).
Balint patients attend to only one object at a time
even if they are overlapping!
Luria, 1959
Some empirical evidence for
object-based selection
 Some general empirical considerations
 Individuals and patterns – the need for argument-binding
 Examples: subitizing, collinearity and other relational judgments
 Some direct experimental demonstrations
 Single-object advantage in joint judgments
 Evidence that whole enduring objects are selected
 Multiple-Object tracking studies
 Clinical/neuroscience findings
Multiple Object Selection
 One of the clearest cases illustrating object-based
selection is Multiple Object Tracking
 Keeping track of individual objects in a scene
requires a mechanism for individuating, selecting,
accessing and tracking the identity of individuals
over time
 These are the functions we have proposed are carried
out by the mechanism of visual indexes (FINSTs)
 We have been using a variety of methods for studying
visual indexing, including subitizing, subset selection
for search, and Multiple Object Tracking (MOT).
Multiple Object Tracking

In a typical experiment, 8 simple identical objects are
presented on a screen and 4 of them are briefly
distinguished in some visual manner – usually by
flashing them on and off.
After these 4 “targets” have been briefly identified, all
objects resume their identical appearance and move
randomly. The subjects’ task is to keep track of which
ones had earlier been designated as targets.
After a period of 5-10 seconds the motion stops and
subjects must indicate, using a mouse, which objects
were the targets.
People are very good at this task (80%-98% correct).
The question is: How do they do it?
Keep track of the objects that flash
How do we do it? Do we encode
and update locations serially?
Keep track of the objects that flash
How do we do it? What properties
of individual objects do we use?
Explaining Multiple Object Tracking
 Basic finding: People (even 5 year old children)
can track 4 to 5 individual objects that have no
unique visual properties. How is it done?
 Can it be done by keeping track of the only
distinctive property of objects – their location?
○ Based on the assumption of finite attention movement
speed, our modeling suggests that this cannot be done by
encoding and updating locations (because of the speed at
which they are moving and the distance between them)
○ If tracking is not done by using the only uniquely
distinguishing property of objects, then it must be done
by tracking their historical continuity as the same
individual object
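Why location updating fails can be shown with a minimal sketch (positions invented for illustration). A serial updater holds the last recorded target locations; if objects cross paths between visits, matching each stale stored location to its nearest current object swaps their identities:

```python
def nearest(p, candidates):
    """Index of the candidate location closest to stored location p."""
    return min(range(len(candidates)),
               key=lambda i: (candidates[i][0] - p[0]) ** 2
                           + (candidates[i][1] - p[1]) ** 2)

# Two targets sampled at the (slow) rate at which a serial process could
# revisit and re-encode their locations (illustrative numbers).
before = [(0.0, 0.0), (4.0, 0.0)]     # last recorded target locations
after  = [(3.0, 0.1), (1.0, -0.1)]    # the objects crossed between updates

match = [nearest(p, after) for p in before]
# Each stale location is now nearest to the *other* object, so a
# location-updating tracker mislabels them; tracking the historical
# continuity of the individuals themselves avoids this failure.
```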
If we are not using objects’ locations,
then how are we tracking them?
 Our independently motivated hypothesis is that a small
number of objects (e.g., 4-5) are individuated and reference
tokens or indexes are assigned to them
 An index keeps referring to the object as the object changes its
properties and its location (that makes it the same object!)
 An object is not selected or tracked by using an encoding of
any of its properties. It is picked out nonconceptually just
the way a demonstrative does in language (i.e., this, that)
 Although some physical properties must be responsible for the
individuation and indexing of an object, we have data showing
that these properties are not encoded, and the properties that
are encoded need not be used in tracking
What has this to do with the
Binding Problem?

First I will introduce the binding problem as it
appears in psychology
The role of selection in encoding conjunctions
of properties (the binding problem)



The binding problem was initially described by Anne
Treisman who showed conditions under which vision
may fail to correctly bind conjunctions of properties
(resulting in conjunction illusions)
 Feature binding requires focal attention (i.e., selection)
The problem has been of interest to philosophers
because it places constraints on how information may
be encoded in early vision (or, as Clark would put it,
‘at the sensory level’ or nonconceptually)
I introduce the binding problem to show how the
object-based view is essential for its solution
Introduction to the Binding Problem:
Encoding conjunctions of properties


Experiments show the special difficulty that vision
has in detecting conjunctions of several properties
It seems that items have to be attended (i.e.,
individuated and selected) in order for their
property-conjunction to be encoded
 When a display is not attended, conjunction errors are
frequent
Read the vertical line of digits in this display
What were the letters and their colors?
This is what you saw briefly …
Under these conditions Conjunction Errors are very frequent
Encoding conjunctions requires selection
 One source of evidence is from search experiments:
 Single feature search is fast and appears to be
independent of the number of items searched through
(suggesting it is automatic and ‘pre-attentive’)
 Conjunction search is slower and the time increases
with the number of items searched through (suggesting
it requires serial scanning of attention)
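These two patterns can be summarized in a toy RT model (slope and intercept invented for illustration, not Treisman's measured values): popout search is parallel and flat over set size, while conjunction search scans items serially and, on target-present trials, self-terminates halfway through on average.

```python
def search_rt(n_items, conjunction, base=0.45, item_time=0.05):
    """Toy RT model (seconds) for target-present visual search.
    A unique feature pops out preattentively, so RT ignores set size;
    a conjunction requires a serial scan of focal attention over the
    items. All parameter values are illustrative assumptions."""
    if not conjunction:
        return base                                # parallel 'popout'
    return base + item_time * (n_items + 1) / 2.0  # serial, self-terminating

# Feature search is flat across display sizes; conjunction search grows.
flat  = search_rt(32, conjunction=False) - search_rt(8, conjunction=False)
slope = search_rt(32, conjunction=True) - search_rt(8, conjunction=True)
```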
Rapid visual search (Treisman)
Find the following simple figure in the next slide:
This case is easy – and the time is independent of
how many nontargets there are – because there is
only one red item. This is called a ‘popout’ search
This case is also easy – and the time is independent of
how many nontargets there are – because there is only
one right-leaning item. This is also a ‘popout’ search.
Rapid visual search (conjunction)
Find the following simple figure in the next slide:
Feature Integration Theory and Feature Binding
Treisman’s attention as glue hypothesis: focal attention
(selection) is needed in order to bind properties together
 We can recognize not only the presence of “squareness” and
“redness”, but we can also distinguish between different ways
they may be conjoined together
•Red square and green circle vs green square and red circle
 The evidence suggests that conjoined properties are encoded
only if they are attended or selected
 Notice that properties are considered to be conjoined if and
only if they are properties of the same object, so it is objects
that must be selected!
Constraints on nonconceptual representation
of visual information (and the binding problem)


Because early (nonconceptual) vision must not lose the
conjunctive grouping of properties, visual properties can’t
just be represented as being present in the scene – because
then the binding problem could not be solved!
What else is required?
 The most common answer is that each property must be
represented as being at a particular location
 According to Peter Strawson and Austin Clark, the basic unit
of sensory representation is
Feature F at location L
 This is the global map or feature placing proposal.
This proposal fails for interesting empirical reasons
 But if feature placing is not the answer, what is?
The role of attention to location in Treisman’s Feature Integration Theory
[Diagram: Feature Integration Theory. Separate color maps (R, Y, G), shape maps, and orientation maps register features from the original input; a master location map collects their locations, and an attention “beam” aimed at one location on the master map binds the features registered there, so the conjunction is detected.]
But in encoding properties, early vision can’t just bind them
together according to their spatial co-occurrence – even their
co-occurrence within the same region. That’s because the relevant
region depends on the object. So the selection and binding must
be according to the objects that have those properties
If location of properties will not give us a way
of solving the binding problem, what will?

This is why we need object-based selection and
why the object-based attention literature is
relevant …
An alternative view of how we
solve the binding problem
 If we assume that only properties of indexed objects (of
which there are about 4-5) are encoded and that these are
stored in object files associated with each object, then
properties that belong to the same object are stored in the
same object file, which is why they get bound together
 This automatically solves the binding problem!
 This is the view exemplified by both FINST Theory (1989)
and Object File Theory (1992)
 The assumption that only properties of indexed objects are
encoded raises the question of what happens to properties of
the other (unindexed) objects or properties in a display
The logical answer is that they are not encoded and therefore
not available to conceptualization and cognition
But this is counter-intuitive!
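The object-file solution sketched above can be made concrete. In this toy sketch (all names hypothetical), a FINST index refers to an object token without encoding any of its properties; whatever properties do get encoded are stored in that index's object file, so properties of the same object are bound simply by sitting in the same file, and only about four objects can be indexed at once.

```python
MAX_FINSTS = 4                 # the numerical limit on visual indexes

object_files = {}              # FINST index -> object file (encoded properties)

def index_object():
    """Assign a FINST to a newly individuated object token, creating an
    initially empty object file for it; fails when no index is free."""
    if len(object_files) >= MAX_FINSTS:
        return None            # object individuated but not indexed
    idx = len(object_files)
    object_files[idx] = {}     # the file may remain empty
    return idx

def encode(idx, prop, value):
    """Binding = storage under the same index, not tagging by location."""
    object_files[idx][prop] = value

a, b = index_object(), index_object()
encode(a, "color", "red");   encode(a, "shape", "square")
encode(b, "color", "green"); encode(b, "shape", "circle")
# A "red circle"/"green square" conjunction error cannot arise here:
# each property sits in the file of the object that bears it.
```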
An intriguing possibility….
 Maybe we see far less than we think we do!
 This possibility has received a great deal of recent
attention with the discovery of various ‘blindnesses’
such as change-blindness and inattentional blindness
The assumption that no properties other than properties
of indexed objects can be encoded is in conflict with
strong intuitions – namely that we see much more than
we conceptualize and are aware of. So what do we do
about the things we “see” but do not conceptualize?
 Some philosophers say they are represented nonconceptually
 But what makes this a nonconceptual representation, as
opposed to just a causal reaction?
○ At the very minimum postulating that something is a
representation must allow generalizations to be captured over
their content, which would otherwise not be available
○ Traditionally representations are explanatory because they
account for the possibility of misrepresentation and they also
enter into conceptualizations and inferences. But unselected
objects and unencoded properties don’t seem to fit this
requirement (or do they?)
Maybe information about non-indexed
objects is not represented at all!!

A possible view (which I am not prepared to fully endorse
yet) is that certain topographical or biological reactions
(e.g., retinal activity) are not representations – because
they have no truth values and so cannot misrepresent
 One must distinguish between causal and represented properties
 Properties that cause objects to be indexed and tracked and result
in object files being created need not be encoded and made
available to cognition
Is this just terminological imperialism?
If we call all forms of patterned reactions representations
then we will need to have a further distinction among
types within this broader class of representation
 We may need to distinguish between personal and subpersonal
types of ‘representation’ with only the former being
representations for our purposes
We may also need to distinguish between patterned states within
an encapsulated module that are not available to the rest of the
mind/brain and those that are available
○ Certain patterned causal properties may be available to motor
control – but does that make them representations?
 An essential diagnostic is whether reference to content – to what
is represented – allows generalizations that would otherwise be
missed and that, in turn, suggests that there is no representation
without misrepresentation
○ We don’t want to count retinal images as representations because
they can’t misrepresent, though they can be misinterpreted later
What next?

This picture leaves many unanswered questions,
but it does provide a mechanism for solving the
binding problem and also explaining how mental
representations could have a nonconceptual
connection with objects in the world (something
required if mental representations are to connect
with actions)
The End

… except for a few loose ends …
Can objects be individuated but not
indexed? A new twist to this story

We have recently obtained evidence that objects
that are not tracked in MOT are nonetheless being
inhibited and the inhibition moves with them
 It is harder to detect a probe dot on an untracked object
than on either a tracked object or empty space!
But how can inhibition move with a nontarget
when the space through which they move is not
inhibited?
 Doesn’t this require the nontargets to be tracked?
The beginnings of the puzzle of clustering prior to
indexing, and what that might mean!



If moving objects are inhibited then inhibition moves along with the
objects. How can this be unless they are being tracked? And if they
are being tracked there must be at least 8 FINSTs!
This puzzle may signal the need for a kind of individuation that is
weaker than the individuation we have discussed so far – a mere
clustering, circumscribing, figure-ground distinction without a
pointer or access mechanism – i.e. without reference!
It turns out that such a circumscribing-clustering process is needed
to fulfill many different functions in early vision. It is needed
whenever the correspondence problem arises – whenever visual
elements need to be placed in correspondence or paired with other
elements. This occurs in computing stereo, apparent motion, and
other grouping situations in which the number of elements does not
affect ease of pairing (or even results in faster pairing when there are
more elements). Correspondence is not computed over continuous
visual manifolds but only over some pre-clustered elements.
Example of the correspondence
problem for apparent motion
The grey disks correspond to the first flash and the black ones
to the second flash. Which of the 24 possible matches will the
visual system select as the solution to this correspondence
problem? What principle does it use?
Curved matches
Linear matches
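One candidate rule (not necessarily the one the visual system actually uses) is a "least total motion" principle. The sketch below, with positions invented for illustration, brute-forces all 24 pairings of four elements; note that its input is already clustered into discrete elements, in line with the claim that correspondence is computed only over pre-clustered tokens, not over the continuous image.

```python
from itertools import permutations

def best_match(frame1, frame2):
    """Among all pairings of frame-1 elements with frame-2 elements,
    return the frame-2 ordering that minimizes total displacement."""
    def total_motion(perm):
        return sum(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
                   for a, b in zip(frame1, perm))
    return min(permutations(frame2), key=total_motion)

grey  = [(0, 0), (1, 0), (2, 0), (3, 0)]   # elements of the first flash
black = [(1, 1), (2, 1), (3, 1), (4, 1)]   # elements of the second flash

# The rule prefers the uniform 'linear' shift over pairings that keep
# some elements nearly in place but force one element to jump far.
match = best_match(grey, black)
```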
Here is how it actually looks
Views of a dome
Structure from Motion Demo
Cylinder Kinetic Depth Effect
The correspondence problem for biological motion
FINST Theory postulates a limited number of pointers in early
vision that are elicited by causal events in the visual field and
that enable vision to refer to things without doing so under a
concept or a description