
Computational Architectures in Biological Vision, USC
Lecture 13. Scene Perception
Reading Assignments:
None
How much can we remember?
Incompleteness of memory:
how many domes does the Taj Mahal have? Most of us cannot say,
despite a conscious experience of picture-perfect, iconic memorization.
Change blindness
Rensink, O’Regan & Clark 1996
See the demo!
But…
We can recognize complex scenes that we have seen before.
So we do have some form of iconic memory.
In this lecture, we examine:
- how we can perceive scenes;
- what the representation is (that can be memorized);
- what the mechanisms are.
Extended Scene Perception
Attention-based analysis: scan the scene with attention, accumulating evidence from detailed local analysis at each attended location.
Main issues:
- what is the internal representation?
- how detailed is memory?
- do we really have a detailed internal representation at all!!?
Gist: we can very quickly (120 ms) classify entire scenes or perform simple
recognition tasks – yet we can only shift attention about twice in that much time!
Accumulating Evidence
Combine information across multiple eye fixations.
Build detailed representation of scene in memory.
Eye Movements
(Yarbus, 1967)
1) Free examination
2) estimate the material circumstances of the family
3) give the ages of the people
4) surmise what the family had been doing before the arrival of the “unexpected visitor”
5) remember the clothes worn by the people
6) remember the positions of the people and objects
7) estimate how long the “unexpected visitor” has been away from the family
Clinical Studies
Studies of patients with visual deficits strongly argue that a tight
interaction between the “where” and “what” visual streams is necessary for
scene interpretation.
Visual agnosia: can see objects, copy drawings of them, etc., but cannot
recognize or name them!
Dorsal agnosia: cannot recognize objects if more than two are presented
simultaneously; a problem with localization.
Ventral agnosia: cannot identify objects.
These studies suggest…
We bind features into objects (feature binding).
We bind objects in space into some arrangement (space binding).
We perceive the scene.
Feature binding = what stream
Space binding = where/how stream
Schema-based Approaches
Schema (Arbib, 1989): describes objects in terms of their physical
properties and spatial arrangements.
Abstract representation of scenes, objects, actions, and other brain
processes. Intermediate level between neural firing and overall
behavior.
Schemas both cooperate and compete in describing the visual world.
VISOR
Leow & Miikkulainen, 1994: low-level features -> sub-schema activity maps (coarse descriptions of object components) -> competition across several candidate schemas -> one schema wins and becomes the percept.
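To make the competition stage concrete, here is a minimal sketch (toy evidence values and schema names of my choosing, not the VISOR implementation): candidate schemas pool sub-schema support cooperatively, then inhibit each other until a single winner remains.

```python
import numpy as np

# Toy sketch of VISOR-style schema competition (hypothetical numbers).
# Rows are candidate schemas; columns are match scores against the
# observed sub-schema activity maps.
evidence = np.array([
    [0.9, 0.8, 0.1],   # schema "arch": strong support from two components
    [0.4, 0.3, 0.6],   # schema "tower"
    [0.2, 0.1, 0.3],   # schema "gate"
])

support = evidence.mean(axis=1)        # cooperative pooling of components
activation = support.copy()
for _ in range(100):                   # competitive dynamics: mutual inhibition
    inhibition = activation.sum() - activation
    activation += 0.1 * (support - 0.5 * inhibition)
    activation = np.clip(activation, 0.0, 1.0)

winner = int(np.argmax(activation))    # the surviving schema is the percept
print(f"winner: schema {winner}, activations: {activation.round(2)}")
```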
Biologically-Inspired Models
Rybak et al, Vision Research, 1998.
What & Where.
Feature-based frame of reference.
Algorithm
- At each fixation, extract the central edge orientation, as well as a number of “context” edges;
- Transform those low-level features into more invariant “second-order” features, represented in a reference frame attached to the central edge (see the sketch after this list);
- Learning: manually select fixation points; store the sequence of second-order features found at each fixation into the “what” memory; also store the vector to the next fixation, based on the context points and expressed in the second-order reference frame.
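A minimal sketch of what such second-order features might look like (my illustration under assumed conventions, not Rybak et al.'s code): each context edge is re-expressed by its distance, bearing, and orientation relative to the central edge, which removes dependence on image rotation; normalizing distances by their mean adds scale invariance.

```python
import numpy as np

def second_order_features(center_xy, center_theta, context):
    """context: list of (x, y, theta) for context edges, in image coords."""
    feats = []
    for (x, y, theta) in context:
        d = np.array([x, y]) - np.asarray(center_xy)
        dist = np.hypot(*d)
        # Bearing of the context point and its edge orientation, both
        # measured relative to the central edge -> rotation invariant.
        bearing = np.arctan2(d[1], d[0]) - center_theta
        rel_orient = theta - center_theta
        feats.append((dist, bearing % (2 * np.pi), rel_orient % np.pi))
    # Normalizing distances by their mean -> scale invariant.
    dists = np.array([f[0] for f in feats])
    return [(d / dists.mean(), b, o) for d, b, o in feats]

# Toy example: one central edge and two context edges.
print(second_order_features((10, 10), 0.3,
                            [(14, 10, 0.8), (10, 16, 1.2)]))
```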
Algorithm
As a result, the sequence of retinal images is stored in the “what” memory, and
the corresponding sequence of attentional shifts in the “where” memory.
Algorithm
- Search mode: look for an image patch that matches one of the patches stored in the “what” memory;
- Recognition mode: reproduce the scanpath stored in memory and determine whether we have a match (see the sketch below).
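A toy sketch of the recognition mode under assumed data structures (not the original implementation): the “what” memory holds the feature vector expected at each fixation, the “where” memory holds the shift to the next fixation; recognition replays the shifts and checks that every fixation matches.

```python
import numpy as np

def recognize(image_features, start, what_mem, where_mem, thresh=0.8):
    """image_features(x, y) -> feature vector observed at that location."""
    pos = np.asarray(start, dtype=float)
    for expected, shift in zip(what_mem, where_mem):
        observed = image_features(*pos)
        sim = observed @ expected / (np.linalg.norm(observed) *
                                     np.linalg.norm(expected) + 1e-9)
        if sim < thresh:                 # a fixation failed to match
            return False
        pos += shift                     # reproduce the stored attention shift
    return True                          # every fixation matched: recognized

# Toy usage with a flat feature field so every fixation trivially matches.
feat = lambda x, y: np.array([1.0, 0.5, 0.2])
what = [np.array([1.0, 0.5, 0.2])] * 3
where = [np.array([5.0, 0.0]), np.array([0.0, 5.0]), np.array([0.0, 0.0])]
print(recognize(feat, (0, 0), what, where))   # -> True
```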
Robust to variations in scale, rotation, and illumination, but not 3D pose.
Schill et al, JEI, 2001
Dynamic Scenes
Extension to moving objects and dynamic environments.
Rizzolatti: mirror neurons in monkey area F5 respond when the monkey
observes an action (e.g., grasping an object) as well as when it executes
the same action.
Computer vision models: decompose complex actions using grammars of
elementary actions and precise composition rules. This resembles a
temporal extension of schema-based systems. Is this what the brain does?
Several Problems…
with the “progressive visual buffer hypothesis:”
Change blindness: attention seems to be required for us to perceive changes
in images, whereas such changes could easily be detected in a visual buffer!
The amount of memory required would be huge!
Interpretation of the buffer contents by high-level vision is very difficult
if the buffer contains very detailed representations (Tsotsos, 1990)!
The World as an Outside Memory
Kevin O’Regan, early 90s:
why build a detailed internal representation of the world?
too complex…
not enough memory…
… and useless?
The world is the memory. Attention and the eyes are a look-up tool!
The “Attention Hypothesis”
Rensink, 2000
No “integrative buffer”
Early processing extracts information up to “proto-object” complexity in a
massively parallel manner.
Attention is necessary to bind the different proto-objects into complete
objects, as well as to bind objects to locations.
Once attention leaves an object, the binding “dissolves.” This is not a
problem: it can be formed again whenever needed, by shifting attention back
to the object.
Only a rather sketchy “virtual representation” is kept in memory; attention
and eye movements are used to gather details as needed.
Back to accumulated evidence!
Hollingworth et al, 2000 argue against the disintegration of coherent visual
representations as soon as attention is withdrawn.
Experiment:
- line drawings of natural scenes;
- change one object (the target) during a saccadic eye movement away from that object;
- instruct subjects to examine the scene; they would later be asked questions about what was in it;
- also instruct subjects to monitor for object changes and press a button as soon as a change is detected.
Hypothesis:
It is known that attention precedes eye movements, so the change occurs outside
the focus of attention. If subjects can notice it, it means that some detailed
memory of the object is retained.
Hollingworth et al, 2000
Subjects can see the change (26% correct overall), even if they only notice it long afterwards, at their next visit to the object.
Hollingworth et al
So, these results suggest that
“the online representation of a scene can contain detailed visual
information in memory from previously attended objects.
Contrary to the proposal of the attention hypothesis (see Rensink, 2000),
the results indicate that visual object representations do not disintegrate
upon the withdrawal of attention.”
Gist of a Scene
Biederman, 1981:
from a very brief exposure to a scene (120 ms or less), we can already
extract a lot of information about its global structure, its category
(indoors, outdoors, etc.), and some of its components.
“riding the first spike:” 120ms is the time it takes the first spike to travel
from the retina to IT!
Thorpe, van Rullen:
very fast classification (down to 27ms exposure, no mask), e.g., for
tasks such as “was there an animal in the scene?”
demo
Gist of a Scene
Oliva & Schyns, Cognitive Psychology, 2000
Investigate effect of color on fast scene perception.
Idea: Rather than looking at the properties of the constituent objects in a
given scene, look at the global effect of color on recognition.
Hypothesis:
diagnostic colors (predictive of scene category) will help recognition.
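One simple way to illustrate the “colored blobs at a coarse spatial scale” idea (my sketch, not Oliva & Schyns' actual method) is a gist-like descriptor that keeps only the mean color of each cell of a very coarse grid, so just the global spatial layout of color survives.

```python
import numpy as np

def color_layout(image, grid=4):
    """image: H x W x 3 array; returns a grid*grid*3 gist-like vector."""
    h, w, _ = image.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = image[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            cells.append(cell.mean(axis=(0, 1)))   # mean color of the cell
    return np.concatenate(cells)

# Toy "beach": blue sky on top, sand below; the descriptor keeps that layout.
img = np.zeros((64, 64, 3))
img[:32] = [0.2, 0.4, 0.9]     # sky
img[32:] = [0.9, 0.8, 0.5]     # sand
print(color_layout(img).round(2)[:6])
```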
Color & Gist
Conclusion from Oliva & Schyns study:
“colored blobs at a coarse spatial scale concur with luminance cues to
form the relevant spatial layout that mediates express scene
recognition.”
Combining saliency and gist
Torralba, JOSA-A, 2003
Idea: when looking for a specific object, gist may combine with saliency
in guiding attention (a toy sketch follows below).
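A toy sketch of this combination (illustrative maps, not Torralba's model): bottom-up saliency is weighted pointwise by a gist-driven spatial prior on where the target is expected, and the maximum of the product is attended first.

```python
import numpy as np

rng = np.random.default_rng(0)
saliency = rng.random((8, 8))                 # toy bottom-up saliency map

# Toy gist-driven prior: when looking for people given a "beach" gist,
# favor the lower half of the image (the sand), not the sky.
prior = np.ones((8, 8))
prior[:4] *= 0.1

guidance = saliency * prior                   # pointwise combination
y, x = np.unravel_index(guidance.argmax(), guidance.shape)
print(f"next attended location: row {y}, col {x}")   # lands in the lower half
```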
Application: Beobots
Outlook
It seems unlikely that we perceive scenes by building a progressive buffer
and accumulating detailed evidence into it: this would take too many
resources and be too complex to use.
Rather, we may only have an illusion of detailed representation, plus the
availability of our eyes/attention to fetch the details whenever they are
needed. The world is an outside memory.
In addition to attention-based scene analysis, we are able to extract the
gist of a scene very rapidly – much faster than we can shift attention
around.
This gist may be constructed by fairly simple processes that operate in
parallel. It can then be used to prime memory and attention.
Goal-oriented scene understanding?
Question: describe what is happening in the video clip shown in the
following slide.
Goal for our algorithms
Extract the “minimal subscene,” that is, the smallest set of actors,
objects, and actions that describes the scene under a given task definition.
E.g.,
if the task is “who is doing what and to whom?”
and the input is the boy-on-scooter video clip,
then the minimal subscene is “a boy with a red shirt rides a scooter
around.”
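A possible data structure for this concept (purely illustrative; the names are mine, not from the lecture):

```python
from dataclasses import dataclass, field

@dataclass
class MinimalSubscene:
    """Smallest task-dependent set of actors, objects, and actions."""
    task: str
    actors: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def describe(self):
        return (f"{' and '.join(self.actors)} "
                f"{' '.join(self.actions)} {' '.join(self.objects)}")

scene = MinimalSubscene(task="who is doing what and to whom?",
                        actors=["a boy with a red shirt"],
                        actions=["rides"],
                        objects=["a scooter around"])
print(scene.describe())   # -> a boy with a red shirt rides a scooter around
```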
Challenge
The minimal subscene in our example has 10 words, but…
The video clip has over 74 million different pixel values (about 1.8
billion bits once uncompressed and displayed – though with high spatial
and temporal correlation).
Starting point
We can attend to salient locations.
Can we identify those locations?
Can we evaluate the task-relevance of those locations, based on some
general symbolic knowledge about how various entities relate to each other?
Task influences eye movements
Yarbus, 1967:
Given one image,
An eye tracker,
And seven sets of instructions given to seven observers, …
… Yarbus observed widely different eye movement scanpaths
depending on task.
Yarbus, 1967: Task influences human eye movements
1) Free examination
2) estimate the material circumstances of the family
3) give the ages of the people
4) surmise what the family had been doing before the arrival of the “unexpected visitor”
5) remember the clothes worn by the people
6) remember the positions of the people and objects
7) estimate how long the “unexpected visitor” has been away from the family
[1] A. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
Towards a computational model
Consider the following scene (next slide)
Let’s walk through a schematic (partly hypothetical, partly
implemented) diagram of the sequence of steps that may be triggered
during its analysis.
Two streams
Not where/what…
But attentional/non-attentional
Attentional: local analysis of details of various objects
Non-attentional: rapid global analysis yields a coarse identification of
the setting (a rough semantic category for the scene, e.g., indoors vs.
outdoors, rough layout, etc.).
[Diagram: attentional pathway vs. setting pathway – Itti 2002; also see Rensink, 2000]
Step 1: eyes closed
Given a task, determine the objects that may be relevant to it, using
symbolic LTM (long-term memory), and store the collection of relevant
objects in symbolic WM (working memory).
E.g., if the task is to find a stapler, symbolic LTM may inform us that a
desk is relevant.
Then, prime the visual system for the features of the most relevant entity,
as stored in visual LTM.
E.g., if the most relevant entity is a red object, boost red-selective
neurons.
Cf. guided search and top-down attentional modulation of early vision
(a toy sketch follows below).
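A minimal sketch of this step with hypothetical LTM contents and feature names: query symbolic LTM for task-relevant entities, store them in WM, then boost the feature gains of the most relevant entity's stored features.

```python
# Sketch of step 1 (toy knowledge base, not the actual implementation).
symbolic_ltm = {                 # task -> relevant entities
    "find stapler": ["stapler", "desk", "office"],
}
visual_ltm = {                   # entity -> learned feature weights
    "stapler": {"red": 0.9, "green": 0.1, "blue": 0.0},   # toy: a red stapler
}

task = "find stapler"
working_memory = list(symbolic_ltm[task])     # relevant objects now in WM

# Top-down priming: boost the feature channels of the most relevant entity
# (cf. guided search); e.g., red-selective units get the largest gain.
gains = {"red": 1.0, "green": 1.0, "blue": 1.0}
for feature, weight in visual_ltm[working_memory[0]].items():
    gains[feature] *= 1.0 + weight

print("WM:", working_memory)
print("feature gains:", gains)
```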
[Diagram: 1. Eyes closed – Navalpakkam & Itti, in press]
Step 2: attend
The biased visual system yields a saliency map (biased for the features of
the most relevant entity); see Itti & Koch, 1998-2003, and Navalpakkam &
Itti, 2003.
The setting yields a spatial prior on where this entity may be, based on
very rapid and very coarse global scene analysis; here we use this prior
as an initializer for our “task-relevance map,” a spatial pointwise filter
that will be applied to the saliency map.
E.g., if the scene is a beach and we are looking for humans, look around
where the sand is, not in the sky!
See Torralba, 2003 for a computer implementation; a toy sketch follows below.
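A toy sketch of this step (hypothetical priors): the gist category selects a spatial prior that initializes the TRM, which then multiplies the biased saliency map pointwise.

```python
import numpy as np

priors = {  # hypothetical gist -> spatial prior (here, over 4 image rows)
    "beach":   np.array([0.05, 0.15, 0.40, 0.40]),  # humans: near the sand
    "indoors": np.array([0.25, 0.25, 0.25, 0.25]),  # no strong constraint
}

gist = "beach"
trm = np.tile(priors[gist][:, None], (1, 4))        # 4x4 TRM initializer
saliency = np.random.default_rng(2).random((4, 4))  # biased saliency (step 1)

# Pointwise filter: attend to the maximum of saliency * relevance.
attended = np.unravel_index((saliency * trm).argmax(), (4, 4))
print("first attended location (row, col):", attended)
```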
[Diagram: 2. Attend]
Step 3: recognize
Once the most (salient × relevant) location has been selected, it is fed
(through Rensink’s “nexus” or Olshausen et al.’s “shifter circuit”) to
object recognition.
If the recognized entity was not in WM, it is added (see the sketch below).
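A minimal sketch of the recognition step with toy feature vectors (the matching rule is my assumption): the attended patch is compared against visual LTM by similarity, and the winner is added to WM if it is new.

```python
import numpy as np

visual_ltm = {                          # entity -> stored feature vector
    "desk":    np.array([0.8, 0.2, 0.1]),
    "stapler": np.array([0.9, 0.1, 0.0]),
}
working_memory = ["stapler", "desk"]

attended_features = np.array([0.9, 0.1, 0.0])      # from the attended patch

# Nearest-neighbor match (attended-vector norm is constant across entries,
# so dividing by the LTM norm suffices for the argmax).
best = max(visual_ltm,
           key=lambda k: visual_ltm[k] @ attended_features /
                         (np.linalg.norm(visual_ltm[k]) + 1e-9))
if best not in working_memory:                     # add newly seen entities
    working_memory.append(best)
print("recognized:", best, "| WM:", working_memory)
```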
[Diagram: 3. Recognize]
Step 4: update
As an entity is recognized, its relationships to the other entities in the
WM are evaluated, and the relevance of all WM entities is updated.
The task-relevance map (TRM) is also updated with the computed relevance of
the currently-fixated entity. This ensures that we will later come back
regularly to that location, if relevant, or largely ignore it, if
irrelevant (a toy sketch follows below).
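A toy sketch of the update step (hypothetical relatedness scores): relevance is derived from symbolic relations to the target and written into the TRM at the fixated location.

```python
import numpy as np

relations = {("desk", "stapler"): 0.9}       # hypothetical relatedness scores

def relevance(entity, target="stapler"):
    """Relevance of an entity to the current target, from symbolic LTM."""
    if entity == target:
        return 1.0
    return relations.get((entity, target), 0.1)

trm = np.full((4, 4), 0.5)                   # current task-relevance map
fixated = (2, 1)
recognized = "desk"

trm[fixated] = relevance(recognized)         # desk is relevant: keep visiting
print(trm)
```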
[Diagram: 4. Update]
Iterate
The system keeps looping through steps 2-4.
The current WM and TRM are a first approximation to what may constitute the “minimal subscene”:
- a set of relevant spatial locations with attached object labels (see “object files”), and
- a set of relevant symbolic entities with attached relevance values.
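Putting steps 2-4 together, a toy end-to-end loop (hypothetical scene and values; for clarity, irrelevant locations are fully inhibited rather than merely down-weighted):

```python
import numpy as np

rng = np.random.default_rng(3)
saliency = rng.random((4, 4))                    # biased saliency map (step 1)
trm = np.full((4, 4), 0.5)                       # task-relevance map
labels = np.full((4, 4), "wall", dtype=object)   # toy ground-truth scene
labels[3, 2] = "stapler"
working_memory = {"stapler": 1.0}                # task-relevant entities

for step in range(16):
    y, x = np.unravel_index((saliency * trm).argmax(), trm.shape)  # 2. attend
    entity = labels[y, x]                                          # 3. recognize
    rel = 1.0 if entity == "stapler" else 0.0    # 4. evaluate relevance
    working_memory[entity] = rel                 #    update WM ...
    trm[y, x] = rel                              #    ... and TRM
    if entity == "stapler":
        print(f"found target at ({y},{x}) after {step + 1} fixations")
        break
```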
Prototype Implementation
Symbolic LTM
Simple hierarchical representation of the visual features of objects.
The visual features of objects in visual LTM are used to bias attention top-down.
Once a location is attended to, its local visual features are matched to those in visual LTM, to recognize the attended object.
Learning object features and using them for biasing.
Naïve: looking for salient objects. Biased: looking for a Coca-Cola can.
Exercising the model by requesting that it find several objects.
Learning the TRM through sequences of attention and recognition
Outlook
Open architecture – the model is not in any way dedicated to a specific
task, environment, knowledge base, etc., just as our brain probably did not
evolve specifically to allow us to drive cars.
Task-dependent learning – in the TRM, the knowledge base, the object
recognition system, etc., guided by an interaction between attention,
recognition, and symbolic knowledge to evaluate the task-relevance of
attended objects.
Hybrid neuro/AI architecture – interplay between rapid/coarse learnable
global analysis (gist), symbolic knowledge-based reasoning, and local/serial
trainable attention and object recognition.
Key new concepts:
Minimal subscene – the smallest task-dependent set of actors, objects, and
actions that concisely summarizes scene contents.
Task-relevance map – a spatial map that helps focus computational resources
on task-relevant scene portions.