Integrating Auditory and Visual Scene Analysis
Audio-visual scene analysis: a brief survey of interesting audio-visual interactions.

Jon Barker
Speech and Hearing Research, University of Sheffield

[Figure: an acoustic mixture ("woof", "meow") feeds Auditory Scene Analysis, producing auditory streams; the visual scene feeds Visual Scene Analysis, producing visual objects; AV Integration combines the two into AV objects: 1. "woof!", 2. "meow!". Audio and vision are tightly coupled at an early stage of processing. The binding problem: spatial correspondence, temporal correspondence, audio-visual schemas.]

Audio-visual interactions
• Two areas of study:
  – The influence of visual input on auditory perception.
  – The influence of acoustic input on visual perception.

Overview
• Primitive audio-visual interactions:
  – Visual cues influencing auditory perception.
  – Acoustic cues influencing visual perception.
• Multimodal speech perception.
• Auditory Scene Analysis + vision.
• Some questions:
  – What conditions are necessary for AV interaction?
  – At what stage of processing do the interactions occur?
  – What are the mechanisms?

Ventriloquist illusion
Bertelson and Radeau (1981)
[Figure: speech audio comes from Position 1 while synchronised lip movements are displayed at Position 2. The speech audio is mislocated: it is heard as coming from Position 2.]

Ventriloquist illusion
• Often presented as a speech illusion, but it occurs with very simple stimuli:
  – A visual flash will cause a synchronised acoustic beep to be mislocalised.
• Occurs even when subjects do not perceive the visual stimulus:
  – Shown in experiments with subjects with unilateral visual neglect (Bertelson et al., 2000).
  – This seems to imply that auditory and visual information are integrated at a pre-perceptual stage of processing.
• Under the right conditions the localisation of the visual stimulus is also affected by the position of the acoustic stimulus.

Auditory driving
Gebhard and Mowbray (1959)
• An early demonstration of the influence of sound on visual perception:
  – Subjects are presented with a flickering light and a fluttering sound.
  – The flicker rate is held constant while the flutter rate is increased or decreased.
  – Subjects perceive changes in the (constant) flicker rate.
• The reverse (V to A) effect was not observed:
  – Changing the visual flicker rate does not alter perception of the auditory flutter rate.
• More detailed recent work shows auditory influence on visual temporal rate perception (e.g. Recanzone, 2003).

Intersensory temporal locking
Fendrich and Corballis (2001)
[Figure: a circle moves around a dial; a shaded region briefly flashes at the same point on each revolution; a brief click is played either before, during, or after the flash.]
• Subjects are asked to judge either where the circle is when the flash occurs, or where it is when the click occurs.
• Judgement errors occur both when attending to the flash and when attending to the click: the click and flash are judged to be part of the same event, so they share a single temporal reference.
• Does this auditory-induced temporal relocation of visual events generalise to non-repetitive stimuli?

Temporal ventriloquism
Morein-Zamir et al. (2003)
• Employed a temporal order judgement (TOJ) task: either the top LED or the bottom LED comes on first.
• Sounds seem to change the perceived timing of the lights:
  – Sounds played before the first light and after the second make the TOJ easier, as though they pull the light timings further apart (see the sketch below).
  – Playing two sounds between the lights coming on makes the task harder; playing one sound between them has no effect.
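A convenient way to make the "pulling apart" account concrete is reliability-weighted fusion on the time axis: audition times events more precisely than vision, so a perceived flash onset is drawn toward a nearby click. The following sketch is illustrative only; the noise variances and stimulus timings are assumptions, not parameters reported by Morein-Zamir et al. (2003).

```python
def fuse(t_visual, t_audio, var_v=30.0**2, var_a=10.0**2):
    """Minimum-variance (maximum-likelihood) fusion of two onset estimates.

    Estimates are weighted by their inverse variances, so the more
    precise auditory timing (small var_a) dominates the fused onset.
    Times in milliseconds; the variances are assumed values.
    """
    w_a, w_v = 1.0 / var_a, 1.0 / var_v
    return (w_a * t_audio + w_v * t_visual) / (w_a + w_v)

# Two lights 40 ms apart; clicks 60 ms before the first light and
# 60 ms after the second (all timings hypothetical).
light1, light2 = 0.0, 40.0
click_before, click_after = -60.0, 100.0

p1 = fuse(light1, click_before)  # first light pulled earlier
p2 = fuse(light2, click_after)   # second light pulled later
print(p2 - p1)                   # ~148 ms: the perceived gap widens
```

With these assumed noise levels the perceived interval grows from 40 ms to roughly 150 ms, which is the direction of effect needed to make the order judgement easier.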
Sound-induced flash illusion
Shams et al. (2002)
• A single flash is accompanied by a varying number of auditory beeps; subjects report how many flashes they see.
• Playing two beeps consistently induces subjects to mistakenly perceive two flashes.
• The effect is not symmetrical: showing two flashes does not make subjects hear a single beep as two beeps.

Audio-visual integration
• Modalities contribute to the percept according to the relative reliability of their measurements. Generally:
  – Vision: reliable spatial location, unreliable temporal location.
  – Audition: unreliable spatial location, reliable temporal location.
• This is not a new idea: cf. the 'modality appropriateness hypothesis' of Welch and Warren (1986).

Alais and Burr (2004)
• Studied ventriloquist effects using poorly localised (blurred) visual stimuli:
  – Judgement of the location of blurred visual stimuli is influenced by sound.
  – Modality dominance was successfully predicted by simple models based on the maximum likelihood principle.
• An important implication for AV models:
  – When combining A and V estimates, the estimates alone are not enough; we also need a 'confidence measure', i.e. an estimate of the variance as well as the mean.

Audio-visual integration
• Integration only occurs if there is sufficient evidence that the audio and visual information are due to a common event:
  – If the location estimates overlap, the modalities are combined to form the percept of a single audio-visual event (the auditory and visual percepts are integrated into one AV representation).
  – If the location estimates do not overlap, separate audio and visual objects are perceived, each with its own position.
[Figure: perceived visual and acoustic positions plotted against increasing audio-visual separation.]
• Beyond a certain separation the interpretation switches from that of a single object to that of two separate objects; the sketch below gives this switch a probabilistic reading.
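The integrate-or-segregate decision can be framed as a comparison of two generative accounts: the estimates arose from one common event, or from two independent events. The sketch below is a minimal causal-inference illustration; the noise levels, the prior spread of event positions, and the assumption of equal prior odds are all mine, not values from the studies above.

```python
import math

def log_gauss(x, var):
    """Log density of a zero-mean Gaussian evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)

def prefers_common_cause(x_a, x_v, var_a, var_v, prior_var=20.0**2):
    """Compare two accounts of the discrepancy between the estimates.

    Common cause:    x_a - x_v ~ N(0, var_a + var_v)
    Separate causes: x_a - x_v ~ N(0, var_a + var_v + 2 * prior_var),
    since each estimate then tracks its own independently drawn position.
    Returns True when the single-event account is more likely (equal priors).
    """
    d = x_a - x_v
    ll_common = log_gauss(d, var_a + var_v)
    ll_separate = log_gauss(d, var_a + var_v + 2 * prior_var)
    return ll_common > ll_separate

# Audition coarsely localised (sd 8 deg), vision finely localised (sd 2 deg).
print(prefers_common_cause(3.0, 0.0, 8.0**2, 2.0**2))   # small gap -> True
print(prefers_common_cause(30.0, 0.0, 8.0**2, 2.0**2))  # large gap -> False
```

This reproduces the switch seen in the figure above: below some separation a single audio-visual object is the better explanation of the data; beyond it, two objects are.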
Overview
• Primitive audio-visual interactions:
  – Visual cues influencing auditory perception.
  – Acoustic cues influencing visual perception.
• Multimodal speech perception.
• Auditory Scene Analysis + vision.

Preamble: is speech special?
• Speech perception exhibits many unusual properties, including many interesting crossmodal effects.
• Two schools of thought:
  – Speech engages a 'special mode of perception', e.g. Remez et al. (1994).
  – Speech is not processed in a qualitatively different way from non-speech: "speech perception is continuous with the perception of other natural events", e.g. Bregman (1990), Rosenblum (2005).

Audio-visual speech processing
Sumby and Pollack (1954)
• Untrained listeners routinely employ visual information when listening to speech in adverse conditions:
  – Two-syllable words were presented in stationary noise at varying SNR.
  – Visual cues were equivalent to improving the SNR by roughly 15 dB.

Audio-visual speech intelligibility
Summerfield (1979)
• Where are the visual cues?
• Visual information is concentrated in the lip region, but there are also cues elsewhere.

Audio-visual speech intelligibility
[Figure: an acoustic "m" is mixed with noise; energetic masking creates an "m"/"n" ambiguity. Auditory scene analysis yields the streams 1. "m" or "n"?, 2. noise. Visual scene analysis yields "m", "p", or "b"?, since all bilabial consonants are visually ambiguous. After AV integration, "m" is the only phoneme consistent with both the A and the V evidence.]
• Note: the increased intelligibility of the AV condition cannot be taken as evidence that visual information is involved in auditory source separation.

Audio-visual speech
• An early idea: the primacy of auditory information.
  – Visual information serves as a 'back-up' system, employed only when there is reason to doubt the acoustic signal.
• This is not the case:
  – Visual information can influence AV speech perception even when the acoustic signal seems unambiguous.
  – AV integration can occur even when the visual cues contradict seemingly clear acoustic cues.
[Figure: an unambiguous acoustic "b" is paired with an incongruous but temporally synchronised visual "g" ("g", "k", ...?). What do we hear? We would probably expect to hear "b".]

The McGurk effect
McGurk and MacDonald, "Hearing lips and seeing voices", Nature 264, 746-748 (1976).
• In the original set-up a visual "ga" is synchronised with an acoustic "ba"; the listener hears "da" or "tha".

The McGurk effect: demo
P. Kuhl, Department of Speech and Hearing Sciences, University of Washington.

How general is the McGurk effect?
• It works on perceivers of all language backgrounds (e.g. Massaro et al., 1993).
• It works on young infants (Rosenblum, Schmuckler and Johnson, 1997).
• Visual information can be reduced to a few moving points (Rosenblum and Saldaña, 1996).
• It works when observers touch, rather than look at, the face (Fowler and Dekle, 1991).
• The effect persists even if listeners are instructed to focus attention on the acoustic signal.

Fuzzy Logical Model of Perception
Massaro (1987, 1998)
[Figure: audio and visual inputs are evaluated independently against prototypes (e.g. /b/, /m/, /n/), yielding degrees of support pa and pv; the supports are combined at an integration stage, followed by a decision.]
• Treats A and V as independent channels.
• Describes how A and V integrate, not whether they will integrate (see the sketch below).
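FLMP's integration step has a simple closed form: the audio and visual degrees of support for each response alternative are multiplied and then normalised over the alternatives, P(r | A, V) = a_r v_r / sum_k a_k v_k. The sketch below applies this rule to made-up support values for a McGurk-style stimulus; the numbers are assumptions chosen only to show the fusion behaviour.

```python
def flmp(audio_support, visual_support):
    """FLMP integration: multiply per-alternative degrees of support
    from the two channels, then normalise to response probabilities."""
    combined = {r: audio_support[r] * visual_support[r] for r in audio_support}
    total = sum(combined.values())
    return {r: v / total for r, v in combined.items()}

# Hypothetical supports for an acoustic "ba" dubbed onto a visual "ga":
audio = {"ba": 0.80, "da": 0.15, "ga": 0.05}   # audition favours "ba"
visual = {"ba": 0.05, "da": 0.35, "ga": 0.60}  # the lips rule out "ba"

print(flmp(audio, visual))  # "da" wins, favoured by neither channel alone
```

Note that the supports are multiplied however discrepant they are, which is exactly the sense in which the model describes how, but not whether, the channels integrate.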
AV speech detection
Grant and Seitz (2000), Grant (2001)
• Compare the detection threshold for speech in audio-only and audio-visual conditions.
• Visual speech was found to reduce the detection threshold by 2-3 dB: 'bimodal coherence masking protection'.
• The effect disappears if the audio-visual synchrony is artificially disturbed.
[Figure: speech in masking noise sits below the auditory detection threshold; it rises above threshold only when the A and V signals are temporally correlated.]

AV speech detection
• There is a strong correlation between the dynamics of the lip area and the energy envelope of the F2 region of the spectrum.
• Visual cues appear to guide auditory attention: the listener can predict when, and where in the spectrum, to expect the speech energy.
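The lip-area/F2 relationship suggests a simple measurement: extract a lip-area trajectory from the video and the energy envelope of an F2-like band from the audio, then correlate the two. The sketch below illustrates that measurement only; the band edges (1-3 kHz), the 40 ms frame rate, and the function names are my assumptions, not Grant and Seitz's pipeline.

```python
import numpy as np

def band_energy_envelope(audio, sr, lo=1000.0, hi=3000.0, frame_s=0.04):
    """Frame-by-frame energy in an assumed F2 band (lo..hi Hz),
    computed from short-time Fourier magnitudes."""
    hop = int(sr * frame_s)
    env = []
    for start in range(0, len(audio) - hop + 1, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + hop]))
        freqs = np.fft.rfftfreq(hop, 1.0 / sr)
        band = (freqs >= lo) & (freqs <= hi)
        env.append(float(np.sum(spec[band] ** 2)))
    return np.asarray(env)

def lip_audio_correlation(lip_area, audio, sr):
    """Pearson correlation between a per-frame lip-area trajectory
    (assumed to share the 40 ms frame rate) and the band envelope."""
    env = band_energy_envelope(audio, sr)
    n = min(len(env), len(lip_area))
    return float(np.corrcoef(lip_area[:n], env[:n])[0, 1])
```

A high correlation for matched audio and video, falling away when the pair is desynchronised, is the pattern that would let the visual stream predict when speech energy is due.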
AV source separation
[Figure: an acoustic "g" is mixed with noise; energetic masking creates a "g"/"k" ambiguity. Auditory scene analysis yields the streams 1. "g" or "k"?, 2. noise. The visual objects are equally ambiguous, since "g" and "k" are visually indistinguishable. Is the AV "g" more intelligible than the audio-only "g"?]

Schwartz et al. (2004)
• Identification of voiced versus unvoiced plosives: [gy gu dy du] vs [ty tu ky ku].
• Speech babble was added at roughly -9 dB SNR.
• An identical video stimulus was dubbed onto all of the acoustic signals.
• Using acoustic cues alone, subjects cannot reliably discriminate between "g" and "k".
• Although the visual stimulus is totally ambiguous, it aids stream separation, so "g" is heard more reliably.

Overview
• Primitive audio-visual interactions:
  – Visual cues influencing auditory perception.
  – Acoustic cues influencing visual perception.
• Multimodal speech perception.
• Auditory Scene Analysis + vision.

Ventriloquist illusion with two talkers
Driver (1996)
• Condition 1: the visual and audio target cues are at the same position (the target lips are shown where the target + masker speech is played).
• Condition 2: the audio is identical but the visual cue is displaced. The target is heard as coming from the lips' position, creating an illusion that it is spatially separated from the masker.
• Listeners were better able to recognise the target words when the target and masker 'appeared' to be coming from different locations.
• Does the illusory separation increase intelligibility?

Integrating auditory scene analysis and vision
• Two simultaneous speech sources at Position 1 undergo primitive auditory processing, yielding spectro-temporal source 'glimpses' (a minimal computational reading of a glimpse is sketched at the end of this section).
• The visual signal integrates with certain glimpses:
  – If the target lip features originate at Position 2, the AV glimpses take on the new position (AV target = Position 2, audio masker = Position 1): the sources can be separated by attending to Position 1 or Position 2.
  – If the target lip features originate at Position 1 (AV target = Position 1, audio masker = Position 1), the sources cannot be separated by attending to a position.

Supplying visual cues for the masker
• Condition 1: the masker's lips are visible alongside the target and masker speech.
• Condition 2: the masker's lips are occluded.
• Does viewing the masker's lips increase the intelligibility of the target speech?
• Result: a small but significant increase in intelligibility when the masker's lips are visible.

Integrating auditory scene analysis and vision
• Target and masker glimpses are very hard to separate.
• The masker lip features (originating at Position 1) integrate with certain glimpses, producing AV glimpses alongside audio-only glimpses (audio target = Position 1, AV masker = Position 1).
• The formation of AV glimpses may help the separation of fragments, a reduction in informational masking: can the target be separated by focusing on the audio-only glimpses?
• Visual information may thus be employed during early processing to reduce the effect of energetic masking.
• AV glimpses carry more phonetic information than audio-only glimpses: the video also serves as a lip-reading cue.
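The 'glimpse' construct used above can be made concrete as a local signal-to-noise mask over a spectro-temporal representation: the time-frequency regions where one source dominates the mixture are the fragments that later grouping, auditory or audio-visual, must assign to sources. A minimal sketch, assuming magnitude spectrograms and a 3 dB local-SNR criterion (both assumptions, not definitions from the lecture):

```python
import numpy as np

def glimpse_mask(target_spec, masker_spec, threshold_db=3.0):
    """Binary spectro-temporal mask marking the cells where the target's
    energy exceeds the masker's by at least threshold_db.
    Inputs are magnitude spectrograms of equal shape (freq x time)."""
    target_db = 20.0 * np.log10(target_spec + 1e-12)
    masker_db = 20.0 * np.log10(masker_spec + 1e-12)
    return (target_db - masker_db) >= threshold_db

def extract_glimpses(mixture_spec, mask):
    """Zero out everything except the cells inside the target's glimpses."""
    return np.where(mask, mixture_spec, 0.0)
```

On this reading, Driver's displaced lips change not the glimpses themselves but the position label attached to the audio-visually bound subset of them, which is what makes attending by position possible.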
Summary
• Audio and visual information are fused into a single percept if there is sufficient evidence of a common cause:
  – common spatial location;
  – common temporal dynamics;
  – fit to models (audio-visual schemas).
• Modalities contribute to the common percept according to the relative reliability of their measurements.
• Auditory and visual processing probably interact at several different processing stages. For speech, for example:
  – the visual system provides reliable spatial information;
  – the auditory system provides reliable temporal information;
  – visual information may be employed in sound source separation, i.e. release from energetic masking;
  – visual information may guide auditory attention, i.e. release from informational masking;
  – visual phonetic information is combined with acoustic phonetic information to better identify the speech sound being uttered.
• The importance of such mechanisms for processing non-speech events remains unclear.

References
• Alais and Burr (2004) Current Biology 14, 257-262
• Bertelson and Radeau (1981) Perception and Psychophysics 29, 578-584
• Bertelson et al. (2000) Neuropsychologia 38, 1634-1642
• Bregman (1990) Auditory Scene Analysis, MIT Press
• Driver (1996) Nature 381(6577), 66-68
• Fendrich and Corballis (2001) Perception and Psychophysics 63(4), 719-725
• Fowler and Dekle (1991) J. Experimental Psychology 17, 816-828
• Gebhard and Mowbray (1959) Am. J. Psychol. 72, 521-528
• Grant and Seitz (2000) J. Acoust. Soc. Am. 108, 1197-1208
• Grant (2001) J. Acoust. Soc. Am. 109(5), 2272-2275
• Massaro (1987) Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Erlbaum
• Massaro (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, MIT Press
• Massaro et al. (1993) J. Phonetics 21, 445-478
• McGurk and MacDonald (1976) Nature 264, 746-748
• Morein-Zamir et al. (2003) Cognitive Brain Research 17, 154-163
• Recanzone (2003) J. Neurophysiol. 89, 1078-1093
• Remez et al. (1994) Psychological Review 101, 129-156
• Rosenblum et al. (1997) Perception and Psychophysics 59(3), 347-357
• Rosenblum and Saldaña (1996) J. Experimental Psychology 22(2), 318-331
• Rosenblum (2005) in Handbook of Speech Perception, Blackwell
• Schwartz et al. (2004) Cognition 93, B69-B78
• Shams et al. (2002) Cognitive Brain Research 14, 147-152
• Sumby and Pollack (1954) J. Acoust. Soc. Am. 26, 212-215
• Summerfield (1979) Phonetica 36, 314-331
• Welch and Warren (1986) Handbook of Perception and Human Performance 25, 1-36