Integrating Auditory and Visual Scene Analysis
Audio-Visual Scene Analysis: a brief survey of interesting audio-visual interactions.
Jon Barker
Speech and Hearing Research
University of Sheffield

[Title-slide diagram: A. an acoustic mixture ("woof" + "meow") is analysed by Auditory Scene Analysis into Auditory Streams; B. the Visual Scene is analysed by Visual Scene Analysis into Visual Objects; AV Integration combines the two into AV Objects (1. "woof!", 2. "meow!"). A + V are tightly coupled at an early stage of processing.]

Audio-visual interactions
• Two areas of study:
– The influence of visual input on auditory perception.
– The influence of acoustic input on visual perception.
• Binding problem:
– Spatial correspondence.
– Temporal correspondence.
– Audio-visual schemas.
Overview
• Primitive Audio-Visual interactions
– Visual cues influencing auditory perception
– Acoustic cues influencing visual perception
• Multimodal speech perception
• Auditory Scene Analysis + Vision
• Some questions:
– Conditions necessary for AV interaction?
– At what stage of processing do interactions occur?
– Mechanisms?
Ventriloquist illusion
Bertelson and Radeau (1981)
[Demonstration: speech audio comes from Position 1 while synchronised lip movements are seen at Position 2. The speech audio is mislocated: it is heard as coming from Position 2.]

Ventriloquist illusion
• Often presented as a speech illusion – but it occurs with very simple stimuli.
– A visual flash will cause a synchronized acoustic beep to be mislocalised.
• Occurs even when subjects don't perceive the visual stimulus.
– Experiments with subjects with unilateral visual neglect (Bertelson et al., 2000).
– Seems to imply that auditory and visual information are integrated at a pre-perceptual stage of processing.
• Under the right conditions the localization of the visual stimulus is also affected by the position of the acoustic stimulus…
Auditory Driving
Gebhard and Mowbray (1959)
• Early demonstration of the influence of sound on visual perception.
– Subjects are presented with a flickering light and a fluttering sound.
– Flicker rate held constant, flutter rate increased or decreased.
– Subjects perceive changes in the (constant) flicker rate.
• The reverse V->A effect was not observed.
– i.e. changing the visual flicker rate does not alter perception of the audio flutter rate.
• More detailed recent work showing auditory influence on visual temporal rate perception (e.g. Recanzone, 2003).

Intersensory Temporal Locking
Fendrich and Corballis (2001)
[Stimulus: a circle moves around a dial; a shaded region briefly flashes at the same point on each revolution; a brief 'click' is played either before, during, or after the flash.]
Subjects are asked to judge either where the circle is when the flash occurs or where it is when the click occurs.
Intersensory Temporal Locking
Fendrich and Corballis (2001)
[Results: there is an error in the position judgement both when attending to the flash and when attending to the click.]
The click and flash are judged to be part of the same event, so they share a single temporal reference.
Does this auditory-induced temporal relocation of visual events generalise to non-repetitive stimuli?

Temporal Ventriloquism
Morein-Zamir et al. (2003)
• Employed a Temporal Order Judgement (TOJ) task: either the top LED or the bottom LED comes on first.
Temporal Ventriloquism
Morein-Zamir et al. (2003)
• Employed a Temporal Order Judgement (TOJ) task.
[Stimulus variants: sounds played before and after the lights; sounds played between the first and second light coming on.]
• Sounds seem to change perceived timing of lights.
• Sounds played before and after lights make TOJ easier.
– i.e. as though they are pulling the light timings further apart.
• Playing two sounds between lights coming on makes task harder.
• Playing one sound between lights coming on has no effect.

Shams et al. (2002)
• A single flash is accompanied by a varying number of auditory beeps.
• Subjects asked to report how many flashes are seen.
• Playing 2 beeps consistently induces subjects to mistakenly perceive 2 flashes.
• Not symmetrical – showing two flashes does not make subjects hear a single beep as two beeps.
Audio-Visual Integration
• Modalities contribute to the percept according to the relative reliability of their measurements. Generally,
– Vision: reliable spatial location, unreliable temporal location.
– Audition: unreliable spatial location, reliable temporal location.
– Not a new idea: cf. the 'modality appropriateness hypothesis' – Welch and Warren (1986).
• Alais and Burr (2004) – studied ventriloquist effects using poorly localized (blurred) visual stimuli.
– Judgement of the location of blurred visual stimuli is influenced by sound.
– Successful prediction of modality dominance by simple models based on the maximum likelihood principle.
• Important implications for A-V models:
– When combining A and V estimates, the estimates alone are not enough; we also need a 'confidence measure', i.e. an estimate of the variance as well as the mean.

Audio-Visual Integration
• Integration only occurs if there is sufficient evidence that the audio and visual information are due to a common event.
[Diagram: auditory and visual location estimates feed an Integration stage that produces an AV representation.]
• Location estimates overlap – the modalities are combined to form the perception of a single audio-visual event.
• Location estimates do not overlap – separate audio and visual objects are perceived, each with its own position.
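The maximum-likelihood account can be written down in a few lines of Python. The sketch below is illustrative only: the positions, variances and the common-cause threshold are invented numbers, not values from Alais and Burr (2004). It weights each modality's location estimate by its inverse variance, and keeps two separate percepts when the estimates are too far apart to be attributed to a common event (the switch shown on the next slide).

def fuse_av_location(mu_a, var_a, mu_v, var_v, max_discrepancy=10.0):
    """Maximum-likelihood fusion of auditory and visual location estimates.

    Each estimate is a Gaussian (mean, variance).  If the two means are close
    enough to be attributed to a common event, the fused estimate is the
    inverse-variance-weighted mean; otherwise the modalities are kept as two
    separate percepts.  The fixed threshold is a crude stand-in for a proper
    common-cause test.
    """
    if abs(mu_a - mu_v) > max_discrepancy:
        # Weak evidence for a common event: perceive two separate objects.
        return {"integrated": False, "auditory_pos": mu_a, "visual_pos": mu_v}
    w_a = 1.0 / var_a           # reliability (inverse variance) of audition
    w_v = 1.0 / var_v           # reliability of vision
    mu_av = (w_a * mu_a + w_v * mu_v) / (w_a + w_v)
    var_av = 1.0 / (w_a + w_v)  # the fused estimate is more reliable than either cue
    return {"integrated": True, "position": mu_av, "variance": var_av}

# Sharp visual stimulus: vision dominates (classic ventriloquism).
print(fuse_av_location(mu_a=5.0, var_a=4.0, mu_v=0.0, var_v=0.25))

# Heavily blurred visual stimulus: audition pulls the fused estimate.
print(fuse_av_location(mu_a=5.0, var_a=4.0, mu_v=0.0, var_v=16.0))

Blurring the visual stimulus corresponds to raising var_v, which moves the fused estimate towards the auditory position – the pattern Alais and Burr reported.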
Audio-Visual Integration
[Graph: perceived visual position and perceived acoustic position plotted against increasing separation between the visual position and the acoustic position.]
Beyond a certain separation the interpretation switches from that of a single object to that of two separate objects.

Overview
• Primitive Audio-Visual interactions
– Visual cues influencing auditory perception.
– Acoustic cues influencing visual perception.
• Multimodal speech perception.
• Auditory Scene Analysis + Vision.
Preamble – Is speech special?
• Speech perception exhibits many unusual properties, including many interesting cross-modal effects.
• Two schools of thought:
– Speech engages a 'special mode of perception', e.g. Remez et al. (1994).
– Speech is not processed in a qualitatively different way from non-speech: "speech perception is continuous with the perception of other natural events", e.g. Bregman (1990), Rosenblum (2005).
Audio-Visual Speech Processing
• Sumby and Pollack (1954),
– untrained listeners routinely employ visual information when listening to speech in adverse conditions.
– two-syllable words + stationary noise at varying SNR.
– visual cues equivalent to improving SNR by ~15 dB.
Audio-Visual Speech Intelligibility
• Summerfield (1979),
– Where are the visual cues?
– Visual information is concentrated in the lip region, but there are also cues elsewhere.

Audio-Visual Speech Intelligibility
[Diagram: an acoustic mixture of "m" + noise is processed by Auditory Scene Analysis; energetic masking results in an "m"/"n" ambiguity (Auditory Streams: 1. "m", "n"?; 2. noise). Visual Scene Analysis yields the Visual Objects "m", "p", "b"? – all bilabial consonants are visually ambiguous. AV Integration: "m" is the only phoneme consistent with both the A and the V evidence.]
Note, the increased intelligibility of the AV condition cannot be taken as evidence of visual information being involved in auditory source separation.
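The logic of this diagram can be caricatured in a few lines of Python: each modality supplies the set of phoneme candidates consistent with its evidence, and integration keeps whatever survives in both sets. The candidate sets below are purely illustrative, not a model of real phonetic confusions.

# Toy illustration of combining ambiguous auditory and visual evidence.
auditory_candidates = {"m", "n"}      # "m" and "n" are confusable under energetic masking
visual_candidates = {"m", "p", "b"}   # all bilabials look alike on the lips
av_candidates = auditory_candidates & visual_candidates
print(av_candidates)                  # {'m'}: the only phoneme consistent with both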
Audio-Visual Speech
• Early idea – primacy of auditory information.
– Visual information used as a 'back up' system.
– Employed when there is reason to doubt the acoustic signal.
• This is not the case.
– Visual information can influence AV speech perception even when the acoustic signal seems unambiguous.
– AV integration can occur even when the visual cues contradict seemingly clear acoustic cues.

Audio-Visual Speech Intelligibility
[Diagram: audio "b" (audio alone = an unambiguous "b") is processed by Auditory Scene Analysis into the Auditory Stream 1. "b". Visual "g" – an incongruous visual cue, but temporally synchronised with the audio – is processed by Visual Scene Analysis into the Visual Objects "g", "k"…? AV Integration: "?".]
What do we hear? We'd probably expect to hear "b".
The McGurk Effect
McGurk and MacDonald, "Hearing lips and seeing voices", Nature 264, 746-748 (1976).
In the original set-up a visual "g" is synchronized with an acoustic "b".
The listener hears a "d" or "th".

The McGurk Effect: Demo
P. Kuhl, Department of Speech and Hearing Sciences, University of Washington.
The McGurk Effect
How general is the McGurk effect?
• It works on perceivers of all language backgrounds (e.g. Massaro et al., 1993).
• It works on young infants (Rosenblum, Schmuckler, & Johnson, 1997).
• Visual information can be reduced to a few moving points (Rosenblum & Saldaña, 1996).
• It works when observers touch (rather than look at) the face (Fowler & Dekle, 1991).
• The effect persists even if listeners are instructed to focus attention on the acoustic signal.

Fuzzy Logical Model of Perception
Massaro (1987, 1998)
[Diagram: Evaluation of the acoustic and visual inputs produces degrees of support pa(x) and pv(x) for each audio-visual prototype (e.g. /b/, /m/, /n/); Integration multiplies them (pa × pv); Decision selects the best-supported prototype.]
• Treats A and V as independent channels.
• Describes how A and V integrate, not whether they will integrate.
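A minimal sketch of the FLMP integration rule in Python; the prototype set and support values are invented for illustration, whereas the real model estimates them from identification data.

def flmp(support_a, support_v):
    """Fuzzy Logical Model of Perception, in outline.

    support_a / support_v map each prototype to a degree of support in [0, 1]
    from the auditory and visual channels.  The channels are treated as
    independent, so the supports multiply; the predicted response probability
    for each prototype is the normalised product.
    """
    combined = {p: support_a[p] * support_v[p] for p in support_a}
    total = sum(combined.values())
    return {p: s / total for p, s in combined.items()}

# Invented support values for a McGurk-style stimulus:
# the acoustics favour /b/, the lips look /g/-like, and /d/ is plausible in both.
p_audio = {"b": 0.70, "d": 0.20, "g": 0.10}
p_video = {"b": 0.05, "d": 0.45, "g": 0.50}
print(flmp(p_audio, p_video))   # /d/ ends up best supported: a fused percept

Note that the rule only describes how the supports combine; as the slide says, it does not say whether integration will occur at all.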
AV speech detection
Grant and Seitz (2000), Grant (2001)
• AV speech detection studies:
– Compare the detection threshold for speech in audio-only and audio-visual conditions.
– Found that visual speech reduces the detection threshold by 2-3 dB ("Bimodal Coherence Masking Protection").
– The effect disappears if the audio-visual synchrony is artificially disturbed.

AV speech detection
[Diagram, audio-only condition: an acoustic mixture of (speech) + masking noise is processed by Auditory Scene Analysis, yielding only a noise stream – the speech is below detection threshold.]

AV speech detection
[Diagram, audio-visual condition: the same acoustic mixture is processed by Auditory Scene Analysis while Visual Scene Analysis supplies visual speech cues; after AV Integration the Auditory Streams include the speech. The speech rises above the acoustic detection threshold only if A and V have temporal correlation.]
• Strong correlation between the dynamics of the lip area and the energy envelope of the F2 region of the spectrum.
Visual cues appear to guide auditory attention – the listener can predict when and where to expect the speech energy.
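That correlation is easy to state as a computation: band-pass the audio around the F2 region, take its energy envelope at the video frame rate, and correlate it with a lip-area track. The Python sketch below is a hedged illustration, not Grant's exact analysis: the band edges and frame rate are placeholder values, and it assumes that audio (a mono signal at sample rate sr) and lip_area (one value per video frame) are already time-aligned.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_envelope(audio, sr, lo=800.0, hi=2200.0, frame_rate=30.0):
    """RMS energy envelope of a frequency band, sampled at frame_rate Hz.

    The 800-2200 Hz band is a rough stand-in for the "F2 region"; the exact
    band used by Grant (2001) may differ.
    """
    sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
    band = sosfiltfilt(sos, audio)
    hop = int(sr / frame_rate)
    frames = len(band) // hop
    return np.sqrt(np.mean(band[: frames * hop].reshape(frames, hop) ** 2, axis=1))

def lip_audio_correlation(lip_area, audio, sr, frame_rate=30.0):
    """Pearson correlation between lip-area dynamics and the F2-band envelope."""
    env = band_envelope(audio, sr, frame_rate=frame_rate)
    n = min(len(env), len(lip_area))
    return np.corrcoef(np.asarray(lip_area)[:n], env[:n])[0, 1]

A high positive correlation for matching audio and video, and a lower one for mismatched pairs, is the kind of relationship the detection advantage is thought to exploit.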
AV source separation
Schwartz et al. (2004)
• Identification of voiced versus unvoiced plosives: [gy gu dy du] vs [ty tu ky ku].
• Speech babble added at ~ -9 dB SNR.
• An identical video stimulus is dubbed onto all of the acoustic signals.

AV source separation
[Diagram: an acoustic mixture of "g" + noise is processed by Auditory Scene Analysis; energetic masking results in a "g"/"k" ambiguity (Auditory Streams: 1. "g", "k"?; 2. noise). Visual Scene Analysis of the visual "g" yields the Visual Objects "g", "k"? – "g" and "k" are visually indistinguishable. AV Integration: "?".]
Is the AV "g" more intelligible than the audio-only "g"?
Schwartz et al. (2004)
[Diagram, audio alone: the acoustic mixture of "g" + noise is processed by Auditory Scene Analysis into the streams 1. "g", "k"? and 2. noise.]
Using acoustic cues alone, subjects can't reliably discriminate between "g" and "k".

Schwartz et al. (2004)
[Diagram, audio-visual: the same mixture plus the visual scene; after AV Integration the streams are 1. "g" and 2. noise.]
The visual stimulus aids stream separation, so "g" is heard more reliably – even though the visual stimulus itself is totally ambiguous ("g", "k"?).
Overview
• Primitive Audio-Visual interactions
– Visual cues influencing auditory perception.
– Acoustic cues influencing visual perception.
• Multimodal speech perception.
• Auditory Scene Analysis + Vision.

Ventriloquist illusion with two talkers – Driver (1996)
Condition 1: visual and audio target cues at the same position.
[Set-up: the target lips are displayed next to a loudspeaker playing the target + masker speech; the other loudspeaker positions are off.]
Ventriloquist illusion with two talkers – Driver (1996)
Condition 2: identical audio, but the visual cue is displaced.
[Set-up: the target lips are displayed at a different position from the loudspeaker playing the target + masker speech. The target is heard as coming from the lips' position – an illusion that it is spatially separated from the masker.]
Does the illusory separation increase intelligibility?
Listeners were better able to recognise the target words when the target and masker 'appeared' to be coming from different locations.
Integrating auditory scene analysis and visual speech information
[Diagram: two simultaneous speech sources at Position 1 pass through primitive auditory processing to give spectro-temporal source 'glimpses'. The target lip features (originating at Position 2) integrate with certain glimpses; these AV glimpses take on a new position: AV target = Pos 2, audio masker = Pos 1.]
Can separate the sources by attending to Pos 1 or Pos 2.

Integrating auditory scene analysis and visual speech information
[Diagram: the same two sources at Position 1 give glimpses, but the target lip features now originate at Position 1, so the AV glimpses keep that position: AV target = Pos 1, audio masker = Pos 1.]
Cannot separate the sources by attending to a position.
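As a toy illustration of this glimpse-labelling idea (not an implementation from the literature – the data structures, the correlation test and the threshold are all invented), each spectro-temporal glimpse can be treated as an energy envelope over time, correlated with the lip-opening signal, and either given the visual position (AV glimpse) or left at the acoustic position (audio-only glimpse).

import numpy as np

def assign_glimpse_positions(glimpses, lip_signal, visual_pos, acoustic_pos,
                             corr_threshold=0.5):
    """Label spectro-temporal glimpses as AV or audio-only.

    glimpses   : list of 1-D arrays, each the energy envelope of one glimpse,
                 sampled at the same rate as lip_signal and time-aligned with it.
    lip_signal : 1-D array of lip-opening area over time.
    A glimpse whose envelope correlates strongly with the lip signal is treated
    as an AV glimpse and takes the visual position; the rest keep the acoustic
    position.  Grouping glimpses by position then allows position-based attention.
    """
    labelled = []
    for env in glimpses:
        n = min(len(env), len(lip_signal))
        r = np.corrcoef(np.asarray(env)[:n], np.asarray(lip_signal)[:n])[0, 1]
        if np.isnan(r):
            r = 0.0  # flat envelope: no evidence either way
        if r > corr_threshold:
            labelled.append(("AV", visual_pos, r))
        else:
            labelled.append(("audio-only", acoustic_pos, r))
    return labelled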
Supplying visual cues for masker
Does viewing the masker lips increase intelligibility of the target speech?
[Set-up: target speech and masker speech are presented together; a video of the masker's lips accompanies the masker speech.]
Condition 1: masker lips present.
Condition 2: masker lips occluded.
Supplying visual cues for masker
Small but significant increase in intelligibility when the masker's lips are visible.

Integrating auditory scene analysis and visual speech information
[Diagram: two simultaneous speech sources at Position 1 pass through primitive auditory processing to give spectro-temporal source 'glimpses'; the target and masker glimpses are very hard to separate. The masker lip features (originating at Position 1) integrate with certain glimpses, splitting the set into AV glimpses and audio glimpses: AV masker = Pos 1, audio target = Pos 1.]
Formation of AV glimpses may help separation of the fragments – a reduction in informational masking.
Can the target be separated by focusing on the audio glimpses?

Integrating auditory scene analysis and visual speech information
• Visual information employed during early processing to reduce the effect of energetic masking.
• AV glimpses carry more phonetic information than audio-only glimpses – video used as lip-reading cues.
Summary
• Audio and visual information are fused into a single percept if there is sufficient evidence of a common cause.
– Common spatial location
– Common temporal dynamics
– Fit to models
• Modalities contribute to the common percept according to the relative reliability of their measurements.
• Auditory and visual processing probably interact at several different processing stages, e.g. consider speech:
– The visual system provides reliable spatial information.
– The auditory system provides reliable temporal information.
– Visual information may be employed in sound source separation,
• i.e. release from energetic masking.
– Visual information may guide auditory attention,
• i.e. release from informational masking.
– Visual phonetic information is combined with acoustic phonetic information to better identify the speech sound being uttered.
• The importance of such mechanisms for processing non-speech events remains unclear.
References
• Bertelson and Radeau (1981) Perception and Psychophysics 29, 578-584
• Bertelson et al. (2000) Neuropsychologia 38, 1634-1642
• Bregman (1990) Auditory Scene Analysis, MIT Press
• Driver (1996) Nature 381(6577), 66-68
• Fendrich and Corballis (2001) Perception and Psychophysics 63(4), 719-725
• Fowler and Dekle (1991) J. Experimental Psychology 17, 816-828
• Gebhard and Mowbray (1959) Am. J. Psychol. 72, 521-528
• Grant and Seitz (2000) J. Acoust. Soc. Am. 108, 1197-1208
• Grant (2001) J. Acoust. Soc. Am. 109(5), 2272-2275
• Massaro (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, MIT Press
• Massaro et al. (1993) J. Phonetics 21, 445-478
• McGurk and MacDonald (1976) Nature 264, 746-748
• Morein-Zamir et al. (2003) Cognitive Brain Research 17, 154-163
• Recanzone (2003) J. Neurophysiol. 89, 1078-1093
• Remez et al. (1994) Psychological Review 101, 129-156
• Rosenblum et al. (1997) Perception and Psychophysics 59(3), 347-357
• Rosenblum and Saldaña (1996) J. Experimental Psychology 22(2), 318-331
• Rosenblum (2005) in Handbook of Speech Perception, Blackwell
• Schwartz et al. (2004) Cognition 93, B69-B78
• Shams et al. (2002) Cognitive Brain Research 14, 147-152
• Sumby and Pollack (1954) J. Acoust. Soc. Am. 26, 212-215
• Summerfield (1979) Phonetica 36, 314-331
• Welch and Warren (1986) Handbook of Perception and Human Performance 25, 1-36