Integration of Acoustic-Articulatory Information: Event Related Potentials to Speech and Non-speech Materials
Jenny Hedberg1, Emma Nilsson1, Cristina Ojeda Alvarez1, Åsa Wolgast1, Eeva Klintfors2,
Marie Markelius2, Johannes Bjerva2 and Petter Kallioinen2
1 Department of Clinical Science, Intervention and Technology, Karolinska Institute, Stockholm
2 Department of Linguistics, Stockholm University, Stockholm

Abstract
Twenty adults participated in an EEG pilot study, with the future aim of assessing the onset of integration of acoustic-articulatory information in infancy. The study examined ERP effects in response to matching vs. non-matching audio-visual speech and non-speech materials in four conditions: Conditions (1) and (2) consisted of speech sounds and visually displayed articulation of speech sounds, either congruent or incongruent; Condition (3) consisted of the sound of hand-clapping and visually displayed articulation of speech sounds; and Condition (4) consisted of the sound of hand-clapping and visual images of hand-clapping. In the third and fourth conditions, all visual materials were presented at different tempos that were either synchronized or unsynchronized with the auditory stimuli. The hypothesis was that all the non-matching materials would elicit a response similar to the N400. The results showed a possible N400 effect in the third condition, when the hand-clapping sound was synchronized with visually displayed articulation of speech sounds.

Introduction
It is well known that adults utilize visual information from a speaker, in the sense that mouth and lip reading are performed to facilitate the perception of speech. However, few studies have examined the age of onset and the development of this ability in infants. In a previous study, young infants (18 to 20 weeks old) showed the ability to pair acoustic and articulatory information (Kuhl & Meltzoff, 1982). In that study, infants watched a split-screen showing two faces articulating different vowels while hearing the pronunciation of one of them. The results showed that infants looked significantly longer at the face articulating the vowel they heard. A similar study was conducted with 25- to 33-week-old infants (Lacerda, Klintfors, Gustavsson, Marklund, & Sundberg, 2005).
The
infants were placed in front of a split-screen
displaying four faces articulating the Swedish
syllables and vowels /by/, /ba/, /a/ or /y/, while
a sound of either /by/ or /a/ was played. The
results showed that infants did not look at the faces consistent with the auditory information; instead, they looked more at the faces with bilabial articulations, interpreted as the most visually prominent ones. The same study examined infants' ability to match the sound of hand-clapping to one of four images displaying different tempos of hand-clapping movements in a split-screen. In addition, it was explored whether infants could match the hand-clapping sound with a face pronouncing the syllable /by/ at the same pace as the clapping sound. In both cases, three of the images showed movements or articulations that were incongruent with the hand-clapping audio, while one image displayed synchronized visual material. The results showed that infants could match the hand-clapping audio with the synchronized visual clapping, but they could not match the temporally synchronized articulation of /by/ with the sound of hand-clapping. Instead, the infants looked more at the film that displayed the most rapid repetition of articulation.
The current investigation examines ERP (event-related potential) components related to speech and non-speech materials. It is a pilot
study for future EEG (electroencephalography)
studies that will examine the onset of
coordination of visual and acoustic information
in infancy.
One potentially relevant ERP component for the current study is the N400, which is related to semantic processing and characterized by a distinct negativity around 400 ms post-stimulus. Semantically incorrect sentences such as "He spread the warm bread with socks" give rise to a more extensive N400 effect than semantically correct sentences, such as "He spread the warm bread with butter" (Kutas & Hillyard, 1980; Kutas & Hillyard, 1984).
The N400 effect has also been observed for nonverbal, pictorial materials. This finding implicates semantic systems that represent conceptual knowledge independently of input modality (Nigam, Hoffman, & Simons, 1992). Furthermore, unanticipated events in video sequences have been found to elicit N400 effects (Reid, Hoehl, Grigutsch, Groendahl, Parise, & Striano, 2009).
In the present study, negativity after 400 ms relative to stimulus onset in the ongoing EEG was anticipated when subjects were exposed to congruent and incongruent auditory-visual materials. The negativity was expected to be more extensive in association with incongruent stimuli, since these are more unanticipated.
Method
Participants
The participants were 20 adults (8 male, 12 female; age range 20 to 53 years; mean age 27.6 years). The subjects were personal acquaintances, first-year speech and language pathology students, and individuals who had been randomly recruited to take part in the experiment. Seventeen participants were native speakers of Swedish; the remaining three were native speakers of Spanish, Portuguese and German, respectively, but fluent in Swedish. All participants reported normal vision and hearing. Nineteen participants were right-handed and one was ambidextrous. The subjects were informed about the purpose of the experiment after their participation. The participants did not receive anything in exchange for their participation.
Procedure
The participant was seated in a sound-attenuated studio at a distance of approximately 60 cm from a computer screen (HP L1940T, 19"). The experiment was run with E-Prime software (ver. 2.0). Loudspeakers were placed on each side of the screen. When the net (HydroCel Geodesic Sensor Net) was in place, the impedance of each electrode was measured against a threshold of 50 kΩ. During the experiment, the participant and the experimenters were separated by an adequately soundproofed wall with an observation window.
All data was collected with the EEG recording software Net Station (ver. 4.2.1). The study adhered to the principles of research ethics: it was conducted in accordance with regulations set by the Data Inspection Board and the Research Ethics Committee at Karolinska Institute (Dnr 2008/3:3), the Personal Data Act (1998:204), and the Act on Ethics Review of Research Involving Humans (2003:460).
Materials
The materials consisted of four animated conditions (Conditions 1 to 4; Table 1), presented in fixed order.
Each condition consisted of six visual stimuli and one audio stimulus. All stimuli within each condition were randomized. Three repetitions of the congruent audio-visual pairing (10 s each), showing matched auditory and visual information, as well as three incongruent audio-visual pairings (10 s each), were shown within each condition. The total duration of each condition was 60 seconds. In addition, a grey box was shown between stimuli (1 s). All videos featured the same female actress against a blue background.
In the first condition, the actress articulated four different speech sounds: /a/, /ba/, /y/ and /by/. The audio consisted of repetitions of the syllable /a/. The second condition consisted of the same visual stimuli, while the audio consisted of repetitions of the syllable /by/. In the third condition, the syllable /by/ was articulated at different speeds (157%, 101%, 63% and 49% of the original recording tempo). The audio played repetitions of the clapping sound at a pace of 101%. Thus, the auditory stimulus was congruent with the synchronized articulation of /by/ at 101% of the original tempo. In the fourth condition, the visual stimuli were video sequences of hand-clapping movements at different speeds (157%, 101%, 63% and 49% of the original recording tempo). The audio consisted of repetitions of the clapping sound at 101% of the original tempo.
Table 1. Schematic overview of the test materials.

Condition 1: Speech-Articulation
  Congruent:   Visual: /a/; Auditory: /a/
  Incongruent: Visual: /ba/, /by/, /y/; Auditory: /a/

Condition 2: Speech-Articulation
  Congruent:   Visual: /by/; Auditory: /by/
  Incongruent: Visual: /a/, /ba/, /y/; Auditory: /by/

Condition 3: Hand clapping-Articulation
  Congruent:   Visual: /by/, pace 101%; Auditory: hand-clapping, pace 101%
  Incongruent: Visual: /by/, paces 49%, 63%, 157%; Auditory: hand-clapping, pace 101%

Condition 4: Hand clapping-Hand clapping movements
  Congruent:   Visual: hand-clapping, pace 101%; Auditory: hand-clapping, pace 101%
  Incongruent: Visual: hand-clapping, paces 49%, 63%, 157%; Auditory: hand-clapping, pace 101%
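The experiment itself was run in E-Prime; purely as an illustration of the trial structure just described, the following Python sketch builds the randomized stimulus list for one condition (the clip file names are hypothetical):

import random

CLIP_SECONDS = 10.0  # duration of each audio-visual clip
GREY_SECONDS = 1.0   # grey box shown between stimuli

def build_condition(congruent_clip, incongruent_clips):
    """Return a randomized trial list for one condition: three repetitions
    of the congruent pairing plus three incongruent pairings."""
    trials = [("congruent", congruent_clip)] * 3
    trials += [("incongruent", clip) for clip in incongruent_clips]
    random.shuffle(trials)  # stimuli are randomized within each condition
    return trials

# Condition 3: clapping audio at 101% pace; /by/ articulated at four tempos,
# of which only the 101% version is synchronized with the audio.
condition3 = build_condition(
    "by_pace101.avi",
    ["by_pace049.avi", "by_pace063.avi", "by_pace157.avi"],
)
for label, clip in condition3:
    print(f"{clip}: {CLIP_SECONDS:.0f} s ({label}), grey box {GREY_SECONDS:.0f} s")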
Analysis of data
The data was analyzed with Net Station (ver. 4.2.1). A band-pass filter was set to 1-40 Hz to remove body-movement artefacts. A period of 200 ms immediately prior to stimulus onset was used as the baseline for the EEG voltage measurements. The data was divided into groups based on congruence/incongruence, as well as speech/non-speech. To exclude eye blinks and eye movements, the rejection threshold was set to 55-1400 µV. To handle bad channels, changes that exceeded 200 µV per time window were excluded and compensated for. Electrodes 100 and 57 (around the processus mastoideus) were used as references. An average ERP waveform was calculated for each test condition over the first three repetitions of the syllable (Conditions 1 and 2) or of the clapping sound (Conditions 3 and 4). The data was first checked for an N1-P2 response (vertex potential) at a number of electrodes. Electrode 55, which lies on the midline and showed one of the most distinct waveforms, was selected for analysis.
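The analysis itself was carried out in Net Station. Purely as an illustration, the pipeline described above could be approximated roughly as follows with the open-source MNE-Python package; the file name, event codes and channel labels are assumptions, not details of the original study:

import mne

# Hypothetical recording from the 128-channel HydroCel net; event codes assumed.
raw = mne.io.read_raw_egi("subject01.mff", preload=True)
raw.filter(l_freq=1.0, h_freq=40.0)                  # 1-40 Hz band-pass filter
raw.set_eeg_reference(ref_channels=["E57", "E100"])  # mastoid-area references

events = mne.find_events(raw)
epochs = mne.Epochs(
    raw, events, event_id={"congruent": 1, "incongruent": 2},  # assumed codes
    tmin=-0.2, tmax=0.9,      # epoch with a 200 ms pre-stimulus segment
    baseline=(-0.2, 0.0),     # reference voltages to the pre-stimulus mean
    reject=dict(eeg=200e-6),  # drop epochs exceeding 200 uV peak-to-peak
    preload=True,
)
evoked_congruent = epochs["congruent"].average()  # average ERP per condition
evoked_incongruent = epochs["incongruent"].average()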
Results
In the first stage of the analysis, a clear N1-P2 brain response was found. The ERP waveforms for Conditions 1, 2, and 4 showed no pattern consistent with a typical N400 effect (Figure 1). The ERP waveforms for Condition 3 displayed a greater negativity (at approximately 400 ms after onset) when the materials were synchronized (Figure 2).
Figure 1. Averaged ERP waveforms for congruent (thick lines) and incongruent (thin lines) audio-visual materials in Condition 1 (vowel /a/ and articulations of speech sounds), Condition 2 (syllable /by/ and articulations of speech sounds), and Condition 4 (sound of hand-clapping and visual hand-clapping movements). No distinct differences between the waves are observed.
Figure 2. Averaged ERP waveforms for congruent (thick line) and incongruent (thin line) audio-visual materials in Condition 3 (sound of hand-clapping and articulations of /by/ at different tempos). Negativity is found for the synchronized stimuli at approximately 400 ms after onset.
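The negativity in Figure 2 was assessed by visual inspection. As a hedged sketch of how it could be quantified, continuing the hypothetical MNE-Python example from the analysis section, the mean amplitude at electrode 55 in a window around 400 ms can be compared between conditions:

def mean_amplitude(evoked, channel="E55", tmin=0.3, tmax=0.5):
    """Mean voltage (converted to microvolts) in a post-stimulus time window."""
    windowed = evoked.copy().pick([channel]).crop(tmin=tmin, tmax=tmax)
    return windowed.data.mean() * 1e6  # MNE stores data in volts

# A more negative congruent value indicates a larger negativity for the
# synchronized (congruent) materials, as in Condition 3.
effect = mean_amplitude(evoked_congruent) - mean_amplitude(evoked_incongruent)
print(f"Congruent minus incongruent, 300-500 ms at E55: {effect:.2f} uV")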
Discussion
Previous studies have concluded that N400 effects are generated by sentences and words (Kutas & Hillyard, 1980; Kutas & Hillyard, 1984), and not by isolated syllables such as the ones presented in the current study. The current results showed no N400 effects for the incongruent stimuli in any of the test conditions. One explanation might be that the stimuli in the current experiment were too short to be perceived as having semantic content. For example, Condition 4, presenting the sound of hand-clapping and visual hand-clapping movements, did not bring much, if any, linguistic information to the context. It is therefore likely that the participants did not perceive these audio-visual materials as semantically incongruous enough to elicit an N400 effect.
In Condition 3, presenting the sound of hand-clapping and visual articulatory movements, an extensive negativity was found for the temporally synchronized stimuli. This implies that the participants experienced that the temporally synchronized articulation did not match the sound of hand-clapping. A reason why an effect similar to the N400 appeared only for the congruent audio-visual materials, and not when the audio-visual materials were unsynchronized, could be that participants bind audio-visual information more automatically when the two streams are presented at the very same time: adults know that mouth and lip movements do not correspond to the sound of clapping. The primary finding of the present experiment could be useful in ERP studies of infants, to determine at what age they begin to perceive that mouth and lip movements do not correspond to the sound of hand-clapping. Such studies would also indicate the age at which infants have acquired the ability to discriminate between speech sounds and non-speech sounds.
To conclude, the results in Condition 3 are in agreement with previous findings and suggest that N400 effects might be related not only to purely semantic stimuli (Nigam et al., 1992), but also to unanticipated events (Reid et al., 2009).
The incongruence effect found in the current experiment could thus be seen as a violation of world knowledge, i.e. interpreted as a semantic incongruence, but it could also be explained as a perceptual mismatch or as a difference related to attentional factors. In fact, the waveform and topographic similarities between the current results and those of Proverbio and Riva (2009), who found N400 effects for strange pictures, give some support to an N400 interpretation. Further investigation along these lines is clearly needed.
The current experimental paradigm was used to validate materials intended to assess the integration of audio-visual information in infancy. Although testing the materials with adults is necessary, some inherent methodological difficulties arise when adults are exposed to child-directed materials. For example, the participants in the current study may not have been fully attentive throughout the experiment, due to the lack of stimulation provided by the content of the materials. To improve the study, adult participants could be given a concurrent pseudo-task to maintain attention to the materials.
Acknowledgements
This research was supported by the Swedish Research Council (no. 2009-2245).
References
Kuhl P. K. and Meltzoff A. N. (1982). The bimodal perception of speech in infancy. Science, 218: 1138-1141.
Kutas M. and Hillyard S. A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207(4427): 203-205.
Kutas M. and Hillyard S. A. (1984). Brain potentials during reading reflect word expectancy and semantic association. Nature, 307: 161-163.
Lacerda F., Klintfors E., Gustavsson L., Marklund E. and Sundberg U. (2005). Emerging linguistic functions in early infancy. 5th International Workshop on Epigenetic Robotics: 55-62.
Nigam A., Hoffman J. E. and Simons R. F. (1992). N400 to semantically anomalous pictures and words. Journal of Cognitive Neuroscience, 4(1): 15-22.
Proverbio A. M. and Riva F. (2009). RP and N400 components reflect semantic violations in visual processing of human actions. Neuroscience Letters, 459: 142-146.
Reid V. M., Hoehl S., Grigutsch M., Groendahl A., Parise E. and Striano T. (2009). The neural correlates of infant and adult goal prediction: Evidence for semantic processing systems. Developmental Psychology, 45(3): 620-629.