
ARTICLE IN PRESS
Speech Communication xxx (2005) xxx–xxx
www.elsevier.com/locate/specom
Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English ☆

Valerie Hazan a,*, Anke Sennema a, Midori Iba b, Andrew Faulkner a

a Department of Phonetics and Linguistics, UCL, Gower Street, London WC1E 6BT, United Kingdom
b Konan University Institute for Language and Culture, 8-9-1 Okamoto, Higashi-nada-ku, Kobe-shi, Hyogo, Japan
Received 24 September 2004; received in revised form 18 February 2005; accepted 7 April 2005
Abstract
This study investigates whether L2 learners can be trained to make better use of phonetic information from visual cues in their perception of a novel phonemic contrast. It also evaluates the impact of audiovisual perceptual training on the learners' pronunciation of a novel contrast. The use of visual cues for speech perception was evaluated for two English phonemic contrasts: the /v/–/b/–/p/ labial/labiodental contrast and the /l/–/r/ contrast. In the first study, 39 Japanese learners of English were tested on their perception of the /v/–/b/–/p/ distinction in audio, visual and audiovisual modalities, and then undertook ten sessions of either auditory ('A training') or audiovisual ('AV training') perceptual training before being tested again. AV training was more effective than A training in improving the perception of the labial/labiodental contrast. In a second study, 62 Japanese learners of English were tested on their perception of the /l/–/r/ contrast in audio, visual and audiovisual modalities, and then undertook ten sessions of perceptual training with either auditory stimuli ('A training'), natural audiovisual stimuli ('AV Natural training') or audiovisual stimuli with a synthetic face synchronized to natural speech ('AV Synthetic training'). Perception of the /l/–/r/ contrast improved in all groups, but learners trained audiovisually did not improve more than those trained auditorily. Auditory perception improved most for 'A training' learners, and performance in the lipreading-alone condition improved most for 'AV Natural training' learners. The learners' pronunciation of /l/–/r/ improved significantly following perceptual training, and a greater improvement was obtained for the 'AV Natural training' group. This study shows that sensitivity to visual cues for non-native phonemic contrasts can be enhanced via audiovisual perceptual training. AV training is more effective than A training when the visual cues to the phonemic contrast are sufficiently salient. Seeing the facial gestures of the talker also leads to a greater improvement in pronunciation, even for contrasts with relatively low visual salience.
© 2005 Elsevier B.V. All rights reserved.

☆ Portions of the work presented in this paper were presented at the 7th International Conference for Spoken Language Processing (16–20 September 2002, Denver, USA), at the International Congress of Phonetic Sciences (3–9 August 2003, Barcelona, Spain) and at the 8th International Conference for Spoken Language Processing (5–9 October 2004, Jeju Island, South Korea).
* Corresponding author. Tel.: +44 20 7679 7069; fax: +44 20 7383 4108.
E-mail addresses: [email protected], [email protected] (V. Hazan), [email protected] (A. Sennema), [email protected] (M. Iba), [email protected] (A. Faulkner).
0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2005.04.007
Keywords: Audiovisual perception; Second language acquisition; Perceptual training
1. Introduction
This study is concerned with the acquisition of
non-native phonemic contrasts by second-language learners and considers whether perceptual
training with audiovisual stimuli is more effective
than training with auditory stimuli in improving
the perception of these contrasts.
Phoneme inventories vary widely across languages, and it is known that infants attune to the
sounds of their own language within the first few
months of life (e.g., Kuhl, 1993; Werker and Tees,
1984). As a result, learning to perceive and produce sounds of a second language can be effortful,
if these sounds do not occur in the native language
or have a different phonological status in the two
languages (e.g., Flege, 1995; Guion et al., 2000).
In addition to segmental errors in perception
(e.g., Takata and Nabelek, 1990), there can be an
increased cognitive load when listening in a non-native language (Tyler, 2001). It has also been
shown that even highly-fluent second-language
learners, who have similar intelligibility to native
listeners for speech in quiet, show disproportionate
reductions in intelligibility when speech is degraded by noise (Mayo et al., 1997).
A number of factors affect the degree of difficulty in acquiring novel phonemic contrasts. One
important factor, which has received considerable attention, is the relation between the phoneme inventories of the first (L1) and second (L2) languages, as
our knowledge of the phoneme inventory of the
L1 interferes with the acquisition of that of the
L2. The Speech Learning Model (Flege, 1995) suggests that second-language learners tend to assimilate L2 sound categories to their native sound
categories if the categories are phonetically similar.
Categories that are phonetically dissimilar from
native sound categories are seen as easier to acquire
as there is no process of L1 interference, whilst categories which are identical in the two languages are
typically perceived and produced without too
much difficulty. Best's Perceptual Assimilation
Model (PAM) (Best, 1995; Best et al., 2001) is
based on a different theoretical framework, that
of the Direct Realist model of speech perception
(e.g., Fowler, 1986). PAM makes detailed predictions of L2/L1 assimilation on the basis of similarities in articulatory gestures between the L2 and L1
sounds. An L2 sound is either 'categorized' as an exemplar of a native phoneme category, 'uncategorized' if it is similar to two or more native categories, or 'nonassimilable' if it is not at all similar to
any native category. Both of these theories predict
that certain phonemic contrasts will be difficult to
acquire within a specific L1/L2 combination. For
example, the /l/–/r/ contrast in English has been
extensively used in studies of L2 speech perception
as it is a difficult contrast to acquire for Japanese
learners of English due to the different phonological status of this contrast in Japanese (see Aoyama
et al., 2004 for a detailed description of the relation
between English /l/–/r/ and Japanese /r/).
A significant factor affecting the acquisition of
novel phoneme categories in an L2 is the age of second language acquisition (for a discussion, see
Flege, 1999). Although there is now little support
for the view that there is a critical period after
which the acquisition of new phoneme categories
would be impossible (Lenneberg, 1967), there does
appear to be a sensitive period for L2 acquisition, at
the phonetic level at least. Indeed, in studies of
immigrant L2 learners from a homogeneous L1 population, the degree of foreign accent in production
(as assessed by native listener ratings) increased
fairly continuously with age (e.g., Flege, 1998).
Another argument in favour of ongoing plasticity in speech perception comes from studies with
adults showing that the perception of novel phonemic contrasts can be improved by auditory training. In a study in which young adult learners
underwent 45 training sessions, Logan et al.
(1991) succeeded in improving the perception of
the English /l/–/r/ contrast in Japanese-L1 talkers
using a 'high variability phonetic training' approach. In this technique, learners are exposed to
multiple tokens of minimal pairs of words containing the novel sounds in different syllable positions
produced by different talkers. These tokens are
presented in a forced-choice identification task,
with immediate feedback as to whether their
choice was correct or incorrect. The aim of this approach is to create 'robust' novel phoneme categories by exposing L2 learners to acceptable
variation within each category. The perceptual
training was shown to be robust as it generalized
to new words and new talkers (Lively et al.,
1993, 1994). Further studies replicated this result
and showed that there was long-term retention of
the learning over a period of three months at least
(Bradlow et al., 1997; Lively et al., 1994). Perceptual training not only improved the perception of
novel speech sounds but also led to improvements
in the pronunciation of these sounds by the L2
learners, improvements that were also retained
for at least three months following the completion
of the training (e.g., Bradlow et al., 1997).
Perceptual training studies have mainly investigated consonant and vowel (e.g., Lambacher et al.,
2002) contrasts, with a strong focus on the English
/l/–/r/ contrast. However, auditory training has
also been shown to be successful for suprasegmental features, as it led to improvements in the perception of tone contrasts (e.g., Wang et al.,
1999); again, improvement in the perception of
these contrasts transferred to tone production,
both for tokens heard during training and for novel tokens (Wang et al., 2003).
A significant source of segmental information
for speech perception that has not hitherto been
fully exploited in perceptual training is the information available via lipreading, which is available
to language learners in traditional face-to-face language learning. For native listeners, visual cues to
consonant and vowel identity are integrated with
acoustic cues early in perceptual processing. This
is shown by the McGurk effect (McGurk and MacDonald, 1976), which is obtained when visual and
acoustic cues to consonant identity are put in conflict (e.g., visual /g/ with auditory /b/). The resulting percept (typically /d/ in the example given)
shows that the information from each modality
is integrated before the sound is categorized by
the listener. Visual cues add to the multiplicity of
cues that is so crucial for making speech robust
to environmental degradation, and are indeed particularly useful for native listeners in difficult listening conditions (e.g., Sumby and Pollack, 1954).
Infants are sensitive to visual information as
they prefer video presentations of talkers with congruent rather than incongruent audio and visual
channels (Kuhl and Meltzoff, 1982). However,
there is a developmental trend in the use of lipreading cues. Indeed, children aged 6–10 years
are less influenced by visual information in their
judgments of consonant identity than are adult
learners (Massaro et al., 1986; Sekiyama et al.,
2003). Just as the sensitivity to certain acoustic
cues increases within the first 10 years of life
(e.g., Hazan and Barrett, 2000; Nittrouer and Miller, 1997; Mayo and Turk, 2004), sensitivity to visual cues also develops. The degree to which this
visual influence develops appears to be language-dependent, as Japanese and Chinese talkers are less
influenced by visual cues than English talkers in
McGurk tasks (Hardison, 1999; Sekiyama and
Tohkura, 1993; Sekiyama, 1997). The degree of
visual influence may be related to the informativeness of visual cues in a particular language
relative to auditory cues. This can depend on the
number of visemes in a specific language (i.e. the
number of 'visual categories' that are identifiable
using lipreading alone), and also on whether the
language is tonal as tone information is not
marked visually. Interestingly, in a study looking
at the development of visual influence in Japanese
and English 6-year-old and adult listeners, it was
shown that the level of visual influence was the
same in 6-year-old English and Japanese listeners
and that it increased significantly in English but
not Japanese adults (Sekiyama et al., 2003). This
suggests that, due to the relatively low informativeness of lipreading information in Japanese,
Japanese listeners become strongly attuned to
auditory rather than to visual information as a
result of language experience.
In view of these factors, L2 learners may therefore be impeded in their use of visual cues for
two reasons. First, if their first language is one in
which visual influence is relatively low (e.g., Japanese or Chinese), they may not naturally attend to
the segmental information available via the visual
channel and may need to be trained to attune more
to the visual channel. Also, just as L2 learners lose
sensitivity to specific acoustic cues to vowel or consonant contrasts that do not occur in their own
language, it is probable that they lose sensitivity
to even salient visual cues that are not relevant
in their L1: they may be able to see the difference
between two visible articulatory gestures but not
be able to consistently associate each with the
appropriate phoneme label. This was found to be
the case in a comparison of the perception of English consonants and vowels presented to Spanish-L1
learners and controls in auditory and audiovisual
conditions with background noise (Ortega-Llebaria et al., 2001). Consonant confusions that were
language-dependent, i.e. which resulted from differences in the phoneme inventories of Spanish
and English, were not reduced by the addition of
visual cues, whereas confusions that were common
to both listener groups and were due to acoustic-phonetic similarities between categories did show
improvements. A wider-ranging study of concordant and discordant audiovisual syllables with listeners from four different L1 backgrounds showed
that the influence of the visual cue was ‘‘dependent
upon its information value, the intelligibility of the
auditory cue and the assessment of similarity between the two cues’’ and was also affected by linguistic experience (Hardison, 1999).
Even if L2 learners are not highly sensitive to
visual cues, an important issue is whether this
sensitivity can be increased by training. We argue that increasing the use of visual information
to consonant identity is likely to be effective in
improving speech perception in L2 learners, as
it would increase the multiplicity of phonetic
information available to these listeners. Few
studies to date have evaluated the relative effectiveness of auditory and audiovisual training on
the acquisition of novel L2 contrasts. In a study
on the training of the American English /l/–/r/
contrast with Japanese and Korean learners of
English, audiovisual training was more effective
than purely auditory training in improving the
identification of the contrast for both learner
populations (Hardison, 2003). Improvements in
perception also led to improvements in native listener ratings of the learners' pronunciation both
for the auditory-trained and audiovisual-trained
learners. Another approach to audiovisual training has been to use three-dimensional computer-animated talking heads that use accurate articulatory information (e.g., Massaro, 1998). These
are potentially powerful tools for language learning as training materials can be easily generated
and as they give greater potential for 'enhancing'
articulatory information. This can be done, for
example, by making the skin transparent so
that articulatory gestures within the oral tract
can be seen, by slowing down articulatory gestures or highlighting usually invisible cues such
as vocal fold vibration (Massaro and Cole,
2000). Such talking heads have already successfully been used for language learning by children
with hearing loss (Massaro and Light, 2004). In
a first study investigating the usefulness of a
talking head for training the /l/–/r/ contrast with
Japanese learners of English, two training
conditions were compared: (a) a 'standard' talking head; (b) the same talking head but with 'transparent' skin and visible articulatory gestures
(Massaro and Light, 2003). Training significantly
improved the identification and production of /l/
and /r/ but the visible articulation condition did
not lead to a greater improvement. This study
does not permit us to compare the relative
effectiveness of natural and synthetic visual
cues.
In our study, our aims were to investigate
whether (a) the effectiveness of auditory training
for language learning could be increased by providing visual cues, (b) the effect of visual cues on
the learning of the new contrast was dependent
on the distinctiveness of the visual information
and (c) perceptual training with audiovisual
tokens would lead to a greater improvement in
the pronunciation of the novel phonemic contrast
than training with auditory tokens. We also integrated the use of talking heads versus natural
faces in our training methodology to investigate
the possible effects of the mode of visual
presentation.
2. Experiment 1: /b/–/p/–/v/ perceptual training
study
The aim of Experiment 1 was to investigate
whether the perception of the contrast between
the English labials /b/ and /p/ and the labiodental
/v/ could be improved for Japanese learners of
English via auditory or audiovisual training. The
English labial/labiodental contrast in place of
articulation is difficult for Japanese learners of
English because the labiodental phonemes /v/
and /f/ do not occur in the consonant inventory
of Japanese, although /v/ does appear in loanwords and is beginning to acquire phonemic status
in Japanese. The labiodental gesture, which is
highly distinctive in English, therefore does not occur in Japanese. The following pattern of assimilation might be expected: the English phoneme /v/
may be assimilated to Japanese phoneme /b/ as
/b/ is sometimes realised as a bilabial approximant
between vowels in Japanese. The English /b/–/p/
voicing contrast may also lead to confusions as
voiceless plosives in Japanese tend to be unaspirated and so English /b/ and Japanese /p/ will be
phonetically similar. Both phonemes are therefore
included in the study although our primary focus
of interest is the perception of the labial/labiodental distinction.
2.1. Subjects

A total of 39 Japanese-L1 learners of English as a Foreign Language (EFL) participated in the training. All were students from a variety of degree programmes at the University of Konan in Japan who were taking an English elective course. They were aged between 19 and 23 years, had already learned English for six years at high school before entering university, but did not often practice listening to English sounds. The training took place in Japan. A further four students participated in the study but were not included in the data analysis because their pre-test performance was above 95% on the Audio condition. In order to get a measure of the visual salience of the contrast, a group of 13 native English controls (university students at UCL) were tested in the visual condition only.

2.2. Speech materials

Speech materials included: (a) test materials that were presented to the learners before and after the course of training and (b) the training materials.

2.2.1. Pre- and post-test materials

Nonsense words were used for the pre/post test to evaluate consonant identification without the influence of lexical knowledge and with only minimal learning effect (as the same items were to be repeated following the course of training sessions). The consonants /p/, /b/ and /v/ were embedded within nonsense words with the following structure: CV, VCV, or VC, where V was one of /i, , u/. Each item was recorded three times, yielding a total of 81 tokens.

2.2.2. Training materials

A number of English minimal pairs (real words) were recorded, chosen to contrast /b/, /p/ and /v/ in a range of vowel environments and in different positions and syllabic contexts. A subset of 75 pairs was chosen for the training sessions. This comprised 15 minimal pairs of words with /b/ and /v/ in initial and medial position and 15 minimal pairs of words with /b/ and /p/ in initial, medial and final position.¹

¹ A list of the minimal pairs used in Experiments 1 and 2 is available from: http://www.phon.ucl.ac.uk/home/val/EPSRCproject.htm.

2.3. Recording procedure

A phonetically-trained female talker recorded the pre/post test items, and three further women and two men recorded the training items. All were native talkers of South Eastern British English. For all talkers, video recordings were made in a soundproof room. The talker's face was set against a blue background and illuminated with a key and a fill light. The talker's head was fully visible within the frame. Video recordings were made to a Canon XL-1 DV camcorder. Audio was recorded from a Bruel and Kjaer type 4165 microphone to both the camcorder and to
a DAT recorder. The video channel was digitally
transferred to a PC and time-aligned with the
DAT audio recording, which was of higher
quality than the audio track of the video. Video
clips were edited so that the start and end frames
of each token showed a neutral facial expression.
Stimuli were down-sampled once the editing had been completed (250 × 300 pixels, 25 frames/s, audio sampling rate 22.05 kHz).
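As a concreteness check on the pre/post-test inventory of Section 2.2.1, the 81 tokens decompose as 3 consonants × 3 vowels × 3 syllable structures, each item recorded three times. The sketch below is our own reconstruction, not the authors' materials; the '?' placeholder stands for the middle vowel, whose symbol is illegible in this copy:

```python
from itertools import product

# Hypothetical reconstruction of the 81-token pre/post test set:
# 3 consonants x 3 vowels x 3 structures = 27 items, each recorded 3 times.
consonants = ["p", "b", "v"]
vowels = ["i", "?", "u"]          # '?' = middle vowel, illegible in this copy
structures = ["CV", "VCV", "VC"]
recordings_per_item = 3

items = list(product(consonants, vowels, structures))
tokens = len(items) * recordings_per_item
assert len(items) == 27 and tokens == 81
```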
2.4. Test conditions
The test and training software were designed
using the CSLU toolkit (e.g., Cole et al., 1999).
The programs were run individually on laptop
computers, with students working in quiet surroundings and stimuli presented via headphones
at a comfortable listening level. In the pre- and
post-test, the nonsense items described above were
presented to all learners in three conditions: 'auditory' (A), 'audiovisual' (AV) and 'visual' (V) in a
3AFC identification task with response choices
of B, P and V. The learner responded by clicking
on the appropriate symbol with a mouse. The test
items were identical in the three conditions, but in
the single-modality tasks, either the auditory or the visual channel was removed. Each block of 81
items was repeated once in each test condition,
yielding a total of 486 responses per listener. The
order of items was randomized within each block
for each listener. Two orders were used for the presentation of the three conditions: AV, A, V or A,
AV, V, and the two orders were counterbalanced
across listeners.
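The testing scheme just described (81 items per block, each block repeated once in each of the three conditions, item order re-randomized within each block, and two counterbalanced condition orders) can be sketched as follows. This is our illustrative reconstruction, not the authors' CSLU-toolkit code, and the item labels are hypothetical:

```python
import random

# 81 items, each block repeated once per condition: 81 * 2 * 3 = 486 responses.
ITEMS = [f"item{i:02d}" for i in range(81)]   # hypothetical item labels

def listener_schedule(listener_id, seed=0):
    """Build the condition order and randomized item blocks for one listener."""
    rng = random.Random(seed * 1000 + listener_id)
    # Two condition orders, counterbalanced across listeners.
    orders = [("AV", "A", "V"), ("A", "AV", "V")]
    conditions = orders[listener_id % 2]
    schedule = []
    for condition in conditions:
        for _ in range(2):                    # each block repeated once
            block = ITEMS[:]
            rng.shuffle(block)                # fresh random order per block
            schedule.extend((condition, item) for item in block)
    return schedule

sched = listener_schedule(0)
assert len(sched) == 486                      # 486 responses per listener
```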
In order to compare the relative effectiveness of
auditory and audiovisual training, a different
group of learners was tested in each of these two
training modes. The training items were identical
for the two learner groups, except that the video
channel was masked for the auditorily-trained
learners. 21 learners carried out the training in
the Auditory training ('A training') condition, and 18 learners in the Audiovisual training ('AV training') condition. Learners were assigned to
each condition on the basis of their scores in the
auditory condition of the pretest, with the aim of
ensuring a reasonable balance across training
groups in terms of their pre-test performance,
although an exact match in terms of baseline performance was not achieved (see Fig. 1).

Fig. 1. Measure of the discriminability of the labial/labiodental place contrast (d′) in the pre-test and post-test for each training group in each modality (Experiment 1).
2.5. Training task
The training approach was a variant of the
High Variability Phonetic Training approach that
was developed by Logan et al. (1991) and used in
numerous training studies. This approach places
emphasis on exposing the learner to normal phonetic variability within a phonemic category, and
on providing immediate feedback on the response
given. Learners are therefore exposed to different
talkers across sessions, and the consonants being
trained are presented in different syllable positions
and vocalic contexts.
In the training software, a conversational agent—the 'Baldi' talking head (Massaro, 1998)—acted as a 'virtual teacher', giving instructions to the learner at the beginning of each training block, telling the learner to 'listen again' to the repeated item if an incorrect training response was given, and telling the listener the percentage of correct identification achieved at the end of each training block. In the training, students were first familiarized with the test consonants uttered by the particular talker of that session, after which the items were presented either auditorily (for the 'A training' learners) or audiovisually (for the
'AV training' learners). After hearing each training word, the learner had to give a response of B, P or V by clicking a label with a mouse. If a correct response was given, a 'smiley' symbol appeared and the item was not repeated. If the response was incorrect, the learner was told to 'listen again' and the word was repeated.

The training consisted of 10 sessions, held over a period of approximately four weeks, each lasting about 40 min. At each training session, listeners either saw and heard (AV trainees), or just heard (A trainees), five blocks of test items produced by one of the five talkers. At each session, each learner heard 60 instances of the consonants /b/ and /v/ in real words that formed minimal pairs for this contrast, and 90 instances of the consonants /b/ and /p/ in real words that formed minimal pairs for that contrast. The blocks with different consonants alternated in order per day, and the order of items was randomized within each block for each listener. Each talker was heard on two different days. After ten sessions of training, a post-test was done, which was identical to the pre-test.

2.6. Results

Although the /b/ vs /p/ contrast had been included in this study because confusions were expected between these two English consonants, we were principally interested in whether the listeners were able to perceive the distinction which is marked visually, i.e. the difference between the bilabial and labiodental consonants. Analyses therefore focused on the perception of the contrast in place of articulation, and voicing confusions (/b/ perceived as /p/ and /p/ perceived as /b/) were ignored. Given that there were three possible response labels (P, B and V), it is important to correct for any potential bias in the responses given by the learners. The proportion of correct /v/ responses for pre-test and post-test results in each modality (A, AV, V) was therefore converted to the signal detectability measure d-prime (d′), which is calculated as the z-value of the hit rate minus that of the false-alarm rate. The mean d′ for the labial/labiodental contrast obtained for each training group for the pre- and post-test in each modality is presented in Fig. 1. Although there were twice as many labial as labiodental stimuli in the test, there was no evidence of bias, as shown by average beta values close to 1.0 for the A, AV and V conditions.

For the native controls, the mean d′ obtained in the V condition was 4.61 (S.D. 1.38), which is substantially higher than the d′ of 1.51 (S.D. 1.04) obtained for the Japanese-L1 group in the pre-test. In order to see whether the perception of the labial/labiodental place contrast improved as a result of training, a repeated-measures analysis of variance (ANOVA) was carried out on the d′ scores obtained in the pre- and post-test for all 39 learners. Overall, correct place perception was higher in the post-test than in the pre-test [F(1,37) = 108.7; p < 0.001]. The effect of test mode (A, V or AV) was significant [F(2,74) = 46.8; p < 0.001], and Bonferroni-adjusted post-hoc tests showed that better discriminability of the place contrast was obtained in the AV test condition than in the A test condition, which itself was better perceived than the V condition. Training therefore significantly increased the discriminability of the place contrast. That higher identification was obtained in the AV test condition suggests that learners were able to integrate information contained in the auditory and visual modalities.

Fig. 2. Pre/post training change in d′ for the labial/labiodental contrast in the three test conditions for A-trained and AV-trained learners (Experiment 1).
The evaluation of the relative effectiveness of
the two training modes was carried out on the
change in d′ between pre-test and post-test scores
(see Fig. 2) so as not to be influenced by any differences in baseline levels of performance across
groups. The main effect of training mode was significant [F(1,37) = 10.3; p = .003], with a greater
improvement in the discriminability of the place
contrast seen for the 'AV training' group than the 'A training' group for all three test conditions.
Post-hoc tests revealed that, taking all subject
groups together, this improvement was greater
for the A and AV test conditions than for the V
test condition.²
In order to get a better idea of the effect of each
training condition on the perception of acoustic
and visual cues to the labial/labiodental contrast,
repeated-measures ANOVAs were carried out separately on the data for each training group to look
at the effect of test modality. For the 'A training'
group, this effect was significant [F(2, 40) = 4.555;
p < 0.0001]; Bonferroni-adjusted post-hoc tests
showed that a greater improvement in perception
was seen in the A than in the V test condition (see
Fig. 2), but that there was no significant difference
in improvement between the A and AV test conditions. For the ÔAV trainingÕ group, the effect of test
mode was not significant: this group showed a similar improvement in all three test conditions, suggesting that sensitivity to visual information
increased for these learners.
It could be argued that AV training is more
likely to be effective for those learners who, before
training, already showed some sensitivity to visual
cues. In order to test this hypothesis, the 'AV training' group was divided into 'high V score' (N = 8)
² In order to check that the comparison between the two training groups was not strongly affected by differences in pre-training performance across groups, statistics were carried out
on a subset of the data which included pairs of subjects across
groups that were matched in terms of their pre-training
performance in the Auditory condition (11 subjects per group).
Results were consistent with those obtained for the larger
group. In a two-way ANOVA for repeated measures carried
out on the pre- and post-training test data, the AV-training
group showed a greater improvement in discriminability of
place of articulation after training (significant interaction
between time of testing and training condition).
and 'low V score' (N = 10) subgroups, based on a criterion of d′ > 1.15 in the V test condition in the pre-test. In the pre-test, mean d′ for the 'high V score' group was 2.15 for the A condition, 2.73 for the AV condition and 2.04 for the V condition. The same means for the 'low V score' group were
1.35 for the A condition, 1.25 for the AV condition
and 0.63 for the V condition. A repeated-measures
ANOVA failed to show any significant difference
between the subgroups in terms of their improvement post-training, although there was a trend for the 'high V score' group to show greater
improvement in the V condition. Similar statistics
were run for learners with high and low V scores
trained auditorily. Again, there was no statistically
significant difference between the two groups in
terms of their improvement.
2.7. Discussion
We know from the pre-test results that the contrast between the English labial (/b/–/p/) and labiodental (/v/) consonants is visually salient for native listeners, who achieved a mean of 92% correct /v/ identification in the lipreading condition and a d′ of 4.61. Perception of the place contrast by L2 learners was significantly lower than that for native controls. These results confirm that L2 learners tend not to make use of visual cues that do not mark phonemic contrasts in their first language, even if there is a clear difference in articulatory gestures. It was of interest to see whether learners could be trained, via audiovisual training, to be more sensitive to visual cues and whether they would then be able to improve their perception of a novel contrast to a greater degree than auditorily-trained learners. Audiovisual training was indeed more effective than auditory training in improving the perception of the labial/labiodental contrast. Whereas the A-trained learners improved more in the auditory test condition and showed little evidence of improved sensitivity to visual cues to the contrast, the AV-trained learners showed improvement in their use of both acoustic and visual cues to the contrast, and became more aware of the distinction between labial and labiodental gestures. The amount of overall improvement was also greater for the AV-trained
listeners. It had been expected that improvements
would be maximized if the modality of training
was adapted to listener biases, and therefore that
we would see greater improvements in the AV
training conditions for learners achieving higher
V scores in the pre-test. This did not appear to
be the case.
These results were encouraging in terms of the effectiveness of audiovisual training, but it is also important to evaluate whether this finding would extend to novel phonemic contrasts that are less visually salient for native listeners. Experiment 2 was therefore devised to examine the relative effectiveness of auditory and AV training for the perception of the British English /l/–/r/ contrast by Japanese learners of English. Also, in order to see whether less detailed visual information could be informative in training, some learners were trained in a condition in which naturally-produced items were synchronized to a talking head (Baldi, the talking head that was used as the 'virtual teacher' in the training programme).
3. Experiment 2: Perceptual training of /l/–/r/
contrast
The aim of Experiment 2 was to investigate the effect of audiovisual presentation in training for the /l/–/r/ contrast, which is less visually distinct than the labial/labiodental contrast presented in
Experiment 1. In Japanese, lateral approximants do not occur, but there is an alveolar flap [ɾ] within the phonemic inventory. The English phonemes /l/ and /r/ both tend to be assimilated to the native alveolar flap, leading to problems in the discrimination and identification of these English phonemes. The /l/–/r/ contrast is particularly difficult for Japanese learners because it is primarily marked by differences in third-formant (F3) transitions, which Japanese listeners are not particularly sensitive to, presumably because F3 transitions do not act as primary acoustic cues for Japanese sound distinctions (Iverson et al., 2003; Lotto et al., 2004; Yamada, 1995). Previous studies involving American English have suggested that, although both English consonants are assimilated to the same native Japanese consonant category, the goodness of fit of the new consonants to the native category varies, with /l/ being more perceptually similar to the Japanese alveolar flap [ɾ] than English /r/ (e.g. Aoyama et al., 2004). According to Flege's Speech Learning Model (Flege, 1995), L2 sounds which are more perceptually dissimilar to L1 sounds are more likely to be accurately perceived and produced, because interference from the native category is weaker. This hypothesis was confirmed in previous auditory training studies (e.g. Bradlow et al., 1997), which found better /r/ than /l/ perception and production in Japanese learners of English.
3.1. Speech materials
3.1.1. Pre- and post-test materials
The two consonants /l/ and /r/ were embedded in initial and medial position in nonsense words in the context of the vowels /i, , u/. The consonants were presented as singletons or as the second consonant of a cluster (the additional consonant being /k/ or /f/) and appeared in initial (CV, cCV) or intervocalic (VCV, VcCV) positions. Three items were produced for each consonant in each syllabic and vowel context (27 initial and 27 medial items per consonant), yielding a total of 108 items.
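As a sanity check on these counts, the item set can be enumerated directly. The vowel symbols and context labels below are illustrative assumptions (one vowel glyph was lost in the source), not the study's exact transcriptions:

```python
from itertools import product

# Per consonant: 3 vowels x 3 tokens in each of 3 initial frames
# (CV, kCV, fCV) and 3 intervocalic frames (VCV, VkCV, VfCV).
consonants = ["l", "r"]
vowels = ["i", "a", "u"]                           # assumed vowel set
contexts = ["C", "kC", "fC", "VC", "VkC", "VfC"]   # initial + medial frames
tokens = range(3)                                  # three productions each

items = list(product(consonants, vowels, contexts, tokens))
print(len(items))  # 108
```

This reproduces the reported totals: 27 initial items per consonant (3 vowels × 3 frames × 3 tokens) and 108 items overall.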
3.1.2. Training materials
For the training sessions, materials included 132 minimal pairs of real words containing /l/ or /r/. The sounds /l/ and /r/ appeared in different vowel contexts and positions: 100 pairs with the consonant in initial position (55 singleton and 45 clustered) and 32 pairs with the consonant in medial position (28 singleton and 4 clustered).
3.2. Talkers and recording procedure
To prepare the test items, a female talker of
South Eastern British English was recorded for
the pre- and post-test, and two further women
and three men for the training materials. As most of the talkers recorded for Experiment 1 were no longer available, new talkers were recorded, apart from one talker who was used in training in both experiments. Although the talkers varied in lipreadability, as shown subsequently by differences in learners' performance for individual training sessions (one talker per session), talkers were not selected according to any specific criteria as regards their articulatory gestures.
The same recording and signal processing procedures were used as in Experiment 1. For audiovisual stimuli with a synthetic face, the digital speech signal that was used for the auditory and audiovisual (natural face) training was carefully synchronized with the articulatory gestures of the talking head using the 'Baldi Sync' software provided as part of the CSLU Toolkit (Cole et al., 1999). This software carries out an automatic alignment of the acoustic signal to the facial animation; hand-corrections to the alignment were made where necessary.
3.3. Listeners
A total of 62 Japanese-L1 learners of English as a Foreign Language (EFL) participated in the study: 40 were students in the English Department of the University of Kochi and were tested in Japan, whilst the rest were attending a short-term English language or English phonetics course in the UK. ANOVAs carried out on the data did not show any difference in performance between learners tested in these two locations. Learners were approximately at a lower to lower-intermediate level of English proficiency, were aged between 17 and 32 years, had started learning English after the age of 13, and none had lived in the UK for a period of more than four months. They reported normal hearing and normal or corrected vision. A further six students participated in the study but were excluded from the analysis, either because their pre-test performance was above 95% in the A condition (N = 4) or because of incomplete post-test data (N = 2).
3.4. Experimental task
The same testing and training software package was used as for Experiment 1. For the pre-/post-test, the items were presented to all listeners in three conditions (A, V and AV presentation), with two blocks of 108 items per condition. Each listener therefore heard 108 repetitions of each consonant (across vowels and positions) in each test condition. The order of items was randomized within each block, and two orders of presentation of test conditions were counterbalanced across listeners. Items were presented to both ears at a comfortable listening level via headphones.

The pre-test was followed by 10 sessions of training, each lasting about 40 min. In the training, students were first familiarized with the two test consonants uttered by the particular talker of that session, after which the items were presented either auditorily ('A training' group), audiovisually with a natural face ('AV natural' training group) or audiovisually with a synthetic face ('AV synthetic' training group). Learners were assigned to a training group on the basis of their scores in the Auditory condition of the pre-test, with the aim of ensuring a reasonable balance across training groups (A: 18 learners; AV natural: 25 learners; AV synthetic: 19 learners). The ten sessions were held over a period of two to three weeks. The training program was run individually on laptops and all sessions were carried out under similar conditions, with students working in quiet surroundings, and stimuli presented via headphones and visually on the computer screen. At each training session, listeners were trained on two blocks of test items produced by one of the five talkers: 200 tokens with /l/–/r/ in initial position and 64 items in medial position. The blocks with initial and medial consonants alternated in order across days. The order of items was randomized within each block for each listener. In sessions 6–10, listeners repeated sessions 1–5. After the ten days of training, a post-test identical to the pre-test was administered.
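The presentation logic just described (randomization within blocks, two counterbalanced condition orders) can be sketched as follows; the function name, the seeding scheme and the two specific condition orders are illustrative assumptions, not details reported in the study:

```python
import random

# Assumed counterbalanced orders of the three test conditions.
CONDITION_ORDERS = [("A", "V", "AV"), ("AV", "V", "A")]

def presentation_plan(listener_index, items, seed=0):
    """Build a per-listener test plan: two blocks of all items per
    condition, item order randomized within each block, condition
    order alternating across listeners."""
    rng = random.Random(seed + listener_index)
    order = CONDITION_ORDERS[listener_index % 2]
    plan = []
    for condition in order:
        for _block in range(2):          # two blocks per condition
            block = items[:]             # every block contains all 108 items
            rng.shuffle(block)           # randomize order within the block
            plan.append((condition, block))
    return plan
```

A listener therefore receives 6 blocks (3 conditions × 2 blocks), i.e. 216 trials, with 108 repetitions of each consonant per condition.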
3.5. Results
The difference between pre- and post-test scores for individual learners ranged from −5% to +48% in the auditory test condition and from −6.5% to +48% in the AV test condition. In order to gauge the degree of visual salience of the contrast for native listeners, a group of 14 native controls was tested in the visual condition only, and achieved a mean score of 72.1% correct (S.D. 10.9). Note that the same controls had achieved scores of 92% correct /v/ identification in Experiment 1. This confirms that the /l/–/r/ contrast produced by the talker in Experiment 2 was less visually distinctive than the labial/labiodental contrast produced by a different talker in Experiment 1.

Table 1
Means and standard deviations for the score representing the discriminability of the /l/–/r/ contrast (d′) in the pre- and post-test in each test condition (A, AV, V) for the Auditory, AV (natural face) and AV (synthetic face) training groups (Experiment 2)

Training group                 Condition   Pre-test   S.D.   Post-test   S.D.
Auditory (N = 18)              A           0.56       0.70   2.16        1.40
                               AV          0.79       0.87   2.05        1.26
                               V           0.47       0.45   0.67        0.46
AV (natural face) (N = 25)     A           0.40       0.78   1.63        1.36
                               AV          0.57       0.88   1.89        1.32
                               V           0.18       0.42   0.69        0.44
AV (synthetic face) (N = 19)   A           0.61       0.61   1.82        1.36
                               AV          0.73       0.88   1.82        1.49
                               V           0.09       0.27   0.31        0.37
As for Experiment 1, d-prime (d′) scores were calculated for each test condition carried out before and after training to obtain a measure of the discriminability of the /l/–/r/ contrast (see Table 1). A repeated-measures ANOVA was carried out on discriminability scores obtained in the pre- and post-test to evaluate the within-group effect of time of testing. Overall, training was effective in improving the discriminability of the /l/–/r/ contrast [F(1, 59) = 145.5; p < .001]. The evaluation of the relative effectiveness of the three training modes was carried out on the difference in discriminability between pre-test and post-test scores (see Fig. 3) so as not to be influenced by any differences in overall levels of performance across groups. The across-group effect of training mode was not significant, although there was a trend for less improvement in the 'AV synthetic' condition: the mean increase in d′ was 1.02 for both the 'A training' and 'AV natural' training groups, and 0.84 for the 'AV synthetic' training group. The within-group effect of test condition was significant [F(2, 118) = 58.25; p < 0.001]: the increase in d′ for the V test condition was lower than that for the A and AV test conditions. The interaction between test condition and training mode failed to reach significance, but there was a trend for a greater increase in discriminability in the V test condition in the 'AV natural' trained group than in the other two groups (see Fig. 3).

Fig. 3. Pre/post-training change in d′ for the /l/–/r/ contrast in the A, AV and V test conditions for the three training groups (Auditory training, AV training with natural face, AV training with synthetic face) (Experiment 2).
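For reference, the d′ discriminability measure reported in Table 1 and Fig. 3 is the standard signal-detection index: the difference between the z-transformed hit and false-alarm rates. A minimal sketch (the example rates are illustrative, not the study's data):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# e.g. treating /l/ responses to /l/ stimuli as hits and
# /l/ responses to /r/ stimuli as false alarms:
print(round(d_prime(0.79, 0.21), 2))  # prints 1.61
```

Unlike raw percent correct, d′ is unaffected by a listener's overall bias toward one response category, which is why it is used here to compare discriminability across conditions.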
It is of interest to consider whether those learners who were not exposed to visual cues during training would improve their perception in the V test condition post-training, as this would provide support for cross-modal enhancement. Prior to training, learners in the 'A training' group achieved significantly higher scores in the V test condition than learners in both AV groups. However, a repeated-measures ANOVA carried out on the pre- and post-test scores in the V condition for the 'A training' group showed that the change in discriminability in the visual-alone condition was not significant for this group.
In order to check previous claims that English /r/ is easier to perceive than English /l/ due to its greater dissimilarity to the native phoneme category, the percentage of correct /r/ and correct /l/ identifications was calculated for each test condition (A, V, AV) before and after training. It must be noted that these scores need to be considered with some caution, as differences in /r/ and /l/ identification could be influenced by response bias on the part of the learners. The percentage of correctly identified /l/ and /r/ averaged over all listeners and over the A and AV test conditions was 59.3% (S.D. 14.1) for /l/ and 61.2% (S.D. 13.3) for /r/ in the pre-test, and 79.0% (S.D. 14.4) for /l/ and 76.2% (S.D. 16.5) for /r/ in the post-test. A repeated-measures ANOVA with consonant, test condition and time of testing as within-subject factors and training condition as a between-subject factor was carried out. Neither the main effect of consonant, nor the interactions between consonant and time of testing or consonant and training mode, were statistically significant. This suggests that Japanese learners did not find /r/ easier to identify than /l/ and that the training did not have a differential effect on the perception of each of these consonants.
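The per-consonant percent-correct scores reported above can be derived from a stimulus–response confusion table. The counts below are hypothetical, chosen only to approximate the reported pre-test /l/ average over 108 trials:

```python
# Hypothetical confusion counts: keys are (stimulus, response) pairs.
confusion = {
    ("l", "l"): 64, ("l", "r"): 44,   # responses to /l/ stimuli
    ("r", "l"): 42, ("r", "r"): 66,   # responses to /r/ stimuli
}

def percent_correct(confusion, consonant):
    """Percentage of trials with this stimulus that were answered correctly."""
    total = sum(n for (stim, _), n in confusion.items() if stim == consonant)
    correct = confusion[(consonant, consonant)]
    return 100.0 * correct / total

print(round(percent_correct(confusion, "l"), 1))  # prints 59.3
```

Note the caveat in the text: a listener biased toward answering "r" would inflate /r/ percent correct and deflate /l/ percent correct without any true sensitivity difference, which is why the d′ analyses are the primary measure.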
Fig. 4. Results obtained in the pre-test, post-test and generalization tests for a subset of learners from Experiment 2. The data are shown for the test condition corresponding to the learners' training modality (A condition for learners trained auditorily, AV condition for learners from the AV Natural or AV Synthetic training groups).

3.6. Generalization test

In order to evaluate the degree to which the effects of training would generalize to new materials, a further test was completed by 28 of the learners. This generalization test included 66 new words produced by a 'known' talker (the talker heard in training sessions 1 and 6) that were not heard in training. The generalization test was carried out in the modality used in training (auditory test for 'A training' learners, audiovisual test for 'AV natural' and 'AV synthetic' learners). Repeated-measures ANOVAs were carried out to compare scores obtained in the post-test and the generalization test in the modality in which learners were trained. The comparison is not an ideal one, as the post-test materials are nonsense words whereas the generalization materials are unknown real words. However, post-hoc tests showed that the generalization scores were not significantly lower than the post-test scores, and that there was no significant effect of training mode. The type of training undertaken therefore did not affect the way in which the training generalized to new items (see Fig. 4).

3.7. Discussion
All three training conditions led to significant improvements in the perception of the /l/–/r/ contrast by the L2 learners but, contrary to the findings of Experiment 1 and to those of Hardison (2003) for the American English /l/–/r/ contrast, we did not find audiovisual training to be more effective than auditory training. In fact, there was a trend for the greatest increase in identification scores to be seen in the A-trained group. The contrast was less visually salient than the contrast tested in Experiment 1, as evidenced by the rather low 72.1% correct identification score obtained for native controls in the visual-alone condition. It was also probably less salient than the /l/–/r/ tokens used in Hardison (2003), where means of around 85% correct identification were obtained for
native controls in a visual alone condition (for
real words). A number of factors may explain the
discrepancy in results between our study and that
of Hardison. First, there may be learner-related effects, as Hardison's learners were ESL rather than EFL learners based in the USA, with greater exposure to native talkers than the learners tested in our experiments. Also, there are differences
both in the visual and acoustic characteristics of
/l/ and /r/ in British and American English that
will affect the salience of the contrast. Finally, there are talker-related effects: the test talker used in Experiment 2 may well have had less clear articulation than the talker used in Experiment 1, and possibly also than the talker used in the Hardison study.
This study did not replicate previous findings that Japanese listeners achieve higher identification rates for English /r/ than English /l/ tokens, which had been taken to support Flege's SLM, predicting greater L1 interference in the acquisition of the /l/ phoneme category (Aoyama et al., 2004; Bradlow et al., 1997; Ingram and Park, 1998). However, differences in the acoustic–phonetic characteristics of American and British English may also be implicated, as they will affect the goodness of fit of the L2 stimuli to the L1 category. Differences in stimulus (e.g., consonant position, vowel context), talker and listener characteristics across studies may also play a role.
Although the difference between training groups was not found to be statistically significant, there was a trend for learners trained with natural audiovisual tokens to become more sensitive to the visual cues marking the contrast, as shown by their performance in the V test condition. There was also a trend for those trained with the synthetic face (synchronized to a natural audio channel) to show less improvement in training than those trained auditorily or with natural audiovisual stimuli. This is not too surprising: in intelligibility tests with auditory and audiovisual stimuli presented to normally-hearing and hearing-impaired listeners, talking heads were less effective at improving intelligibility than natural audiovisual stimuli (e.g., Ouni et al., 2003; Siciliano et al., 2003). However, this difference is likely to decrease as the quality of visual cue information provided by talking heads further improves.
4. Experiment 3: Effect of perceptual training on
speech production
A number of previous studies have shown that perceptual improvements resulting from auditory training can transfer to improvements in the pronunciation of the trained sounds. The aim of Experiment 3 was to evaluate whether this effect could be replicated here, and whether audiovisual training of a contrast (i.e. seeing the articulatory gestures of the talker) would lead to a greater improvement in pronunciation, even if it did not lead to a greater improvement in perception for the less visually-salient /l/–/r/ contrast.
4.1. Talkers and listeners
Twenty-five trainees from Experiment 2 were recorded before and after their course of training, after having carried out the perceptual testing. Ten had completed the training study in the 'A training' condition, ten in the 'AV natural' condition and five in the 'AV synthetic' condition.
The listeners were 12 native speakers of British English tested in London. The majority were phoneticians (PhD students) or trained EFL teachers; four were PhD students from another university department who all reported some experience with L2 learners.
4.2. Speech materials and recording conditions
At the time of the pre- and post-test, the Japanese trainees were asked to read a list of 25 minimal pairs of words which included the sounds /l/ and /r/ in a variety of phonetic environments and positions. The audio recordings were made either in a soundproof room using a Sony TCD-D10 digital audio recorder with an ECM-959DT microphone or, in the case of students tested in Japan, in a classroom using a Sony MZ-N707 minidisk recorder with an ECM-T15 microphone. A subset of five minimal pairs of words was selected in which the sounds /l/ and /r/ appeared in initial and medial
positions, as singletons and in clusters. The minimal pairs were: lake–rake, long–wrong, miller–mirror, blight–bright, complies–comprise. A total of 20 pre/post tokens per trainee were therefore included, yielding a total of 500 tokens for the 25 trainees. The tokens were normalized for level by equating their peak amplitudes.
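Normalizing tokens by equating peak amplitudes, as described above, amounts to scaling each waveform so that its absolute peak reaches a common target value; a minimal sketch (the target value of 0.9 is an arbitrary assumption):

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale a waveform so that its absolute peak equals target_peak.

    `samples` is a sequence of floats; a scaled copy is returned, so
    all tokens processed this way share the same peak level.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent token: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak equalization only matches the loudest instant of each token; it is simpler than (and distinct from) RMS-based loudness equalization.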
4.2.1. Experimental tasks
Two independent perceptual evaluation tests were carried out in which native speakers of British English were asked to judge the tokens produced by the learners. The tests comprised a minimal-pair identification task and a quality rating task. For each test, the listeners evaluated the productions of all learners. They were asked to focus on the /l/–/r/ realizations in the word rather than on the correct pronunciation of the word itself. The experiment was run on a laptop and items were presented to both ears at a comfortable listening level via headphones. The identification test took about 45 min, the rating task about 55 min.

4.2.2. Minimal-pair identification task
Tokens were presented in a two-alternative forced-choice task. On each trial, the minimal pair appeared in writing on two buttons on the screen; listeners heard the token produced by a learner and indicated their choice of word by pressing the corresponding button. The order of the items was randomized across trials and across talkers.

4.2.3. Consonant rating task
Listeners were asked to judge the realizations of /l/ and /r/. The intended word appeared on the screen, followed by the auditory prompt. Listeners rated the /l/ or /r/ in the word on a scale from 1 to 7 (1 = bad, 7 = excellent). The order of presentation was randomized across trials and across talkers.

4.3. Results

4.3.1. Minimal pair identification test
For each of the 12 English listeners, mean production accuracy scores were calculated for /r/ and /l/ for each of the 25 Japanese learners. The percentage of /l/ and /r/ productions that were correctly identified by native listeners is shown in Table 2. It can be seen that the correct identification of /r/ tokens produced by the learners is generally much better than the identification of /l/ tokens, which is consistent with previous findings on the production of the /r/–/l/ contrast by Japanese learners (Bradlow et al., 1997).

Table 2
Means and standard deviations for the percentage of /l/–/r/ consonants in words produced by L2 learners prior to and following training that were correctly identified by native English listeners (Experiment 3)

Training mode                  Measure   /l/ Pre   /l/ Post   /r/ Pre   /r/ Post   Mean Pre   Mean Post
Auditory (N = 120)             Mean      54.8      59.8       86.0      83.5       70.4       71.7
                               S.D.      23.3      28.5       17.7      22.7       14.7       12.7
AV (natural face) (N = 120)    Mean      60.7      67.3       76.2      82.7       68.4       75.0
                               S.D.      30.2      24.6       24.5      24.5       17.4       14.0
AV (synthetic face) (N = 60)   Mean      60.7      65.3       96.0      96.0       78.3       80.7
                               S.D.      34.1      25.2       8.1       8.9        16.3       14.0

From the data means, it appeared that learners in the 'AV synthetic' group had a significantly better pronunciation prior to training than the other groups, as /r/ productions for this group were almost perfectly identified (mean of 96% correct identification). Because of this ceiling effect for /r/ identification, we removed the data for the 'AV synthetic' learners before subsequent analysis and compared the A and 'AV natural' groups only. A repeated-measures ANOVA was carried out on the production accuracy scores for /r/ and /l/ obtained before and after training. Identification scores improved significantly post-training [F(1, 238) = 20.0; p < 0.001]. Evaluations of the effect of training mode were carried out on the scores reflecting the difference in production accuracy pre- and post-training for /r/ and /l/, to normalize for differences in pre-training scores. The increase in scores was significantly greater for the tokens produced by the 'AV natural' training learners than for those produced by the 'A training' learners [F(1, 238) = 9.271; p < 0.005]. The within-group effect of consonant was not significant.

The following analysis focused on individual differences between learners in terms of the effect of perceptual training on production. The change in mean production accuracy scores from pre- to post-training tokens produced by individual learners ranged from −11% to +20%. A univariate ANOVA showed that this effect was significant [F(19, 220) = 8.538; p < 0.001], suggesting that some learners improved their pronunciation significantly more than others.

4.3.2. Consonant rating task
For each of the 12 English listeners, consonant rating scores were calculated for /r/ and /l/ produced by each of the 20 learners from the A and AV (natural face) groups before and after training (see Table 3). The data obtained for the rating task closely mirror those obtained for the consonant identification task. Repeated-measures ANOVAs revealed that ratings were higher for the post-training tokens than for the pre-training tokens [F(1, 238) = 33.6; p < .001]. The effect of training mode was assessed from the difference in ratings before and after training. The effect of training mode was not significant, but there was a significant consonant by training interaction [F(1, 238) = 6.48; p < 0.02]. Post hoc tests showed that /r/ ratings improved more for the AV training group than for the A training group, but that /l/ ratings did not.

Table 3
Consonant ratings on a scale of 1–7 (1 = bad, 7 = excellent) given by native English listeners for words produced prior to and following training by learners in the Auditory, AV (natural face) and AV (synthetic face) training groups (Experiment 3)

Training                 Measure   /l/ Pre   /l/ Post   /r/ Pre   /r/ Post   Mean Pre   Mean Post
Auditory                 Mean      3.8       4.1        5.0       5.0        4.4        4.6
                         S.D.      1.1       1.2        0.9       1.1        0.8        0.7
AV (natural face)        Mean      4.2       4.4        4.5       4.9        4.4        4.6
                         S.D.      1.3       1.1        1.1       1.1        0.9        0.8
AV (synthetic face)      Mean      4.0       4.3        5.5       5.2        4.7        4.8
                         S.D.      1.4       1.1        0.8       0.9        0.9        0.9

4.4. Discussion of Experiment 3
The finding that perceptual training results in improvements to the pronunciation of the trained consonants in L2 learners is consistent with many other studies of perceptual training. A key finding of our study was that, even for this less visually-distinctive contrast, the use in perceptual training of audiovisual stimuli with a natural face led to a greater improvement in /l/–/r/ pronunciation compared with training with auditory stimuli, even though there was no difference between the training groups in terms of the effect of training on the perception of the /l/–/r/ contrast. Exposure to the articulatory gestures involved in the production of /l/ and /r/ therefore seems to have been effective in improving pronunciation, even without specific pronunciation training. This finding needs to be verified with a wider range of phonemic contrasts and talkers. The planned comparison between the effect of natural and synthetic articulatory gesture information was not achieved, due to the small number of learners in the AV synthetic group and a ceiling effect in pre-training production performance.
5. General discussion
Previous auditory training studies had shown
that learners could be successfully trained to
redirect their attention to acoustic cues that they
do not naturally attend to because they do not
mark phonemic contrasts in their native language. This study aimed to see whether learners
could also be trained to redirect their attention
to visual cues that mark segmental contrasts.
This may be considered to be a more difficult
task, at least for learners for whom visual cues
are not particularly informative in their L1, as
such listeners will tend to give greater weighting
to acoustic cues. For these learners, audiovisual
training has to be successful both in attracting
attention to the visual modality and in attracting
attention to the specific visual cues marking the
contrast.
Results of Experiment 1 did show that learners
could be trained to attend more to visual cues.
Learners trained audiovisually improved their
perception of the labial/labiodental place contrast
in the visual condition significantly more than
those trained auditorily, and showed greater
improvements in all three conditions post-training. However, certain visual cues appear to be
more difficult to acquire than others as shown
by the lack of an audiovisual training effect in
Experiment 2. For /l/–/r/, although there was a trend for a greater improvement in /l/–/r/ identification in the visual condition for those trained with natural audiovisual stimuli, this difference did not reach significance. Overall, our lack of an audiovisual effect for the /l/–/r/ contrast appears to contradict the results obtained by Hardison (2003), who obtained greater improvements in /l/–/r/ perception in Japanese and Korean learners of English trained audiovisually. However, these apparently contradictory results fit our general hypothesis, which is that the effectiveness of AV training will depend on the visual distinctiveness of the contrast, and this is in turn likely to depend on a range of inter-related factors. First, it will depend on whether the two phonemes belong to the same or different visemes (Summerfield, 1983) but, in experiments of this kind, it will also be related to the clarity of the talkers used in the training and testing (Gagné et al., 2002). It will also depend on the specific mapping of L2 phonemes onto L1 categories. Other factors may affect how well audiovisual training can increase sensitivity to visual cue information. Just as acoustic enhancement can attract listeners' attention to specific acoustic cues (e.g., Hazan and Simpson, 2000), the way in which the visual information is presented (e.g., whether the full face is shown or the focus is on the lips/jaw only) may also affect the degree to which learners' attention is attracted to the relevant visual cues. It must be noted, though, that previous studies did not find differences in word intelligibility for native listeners between these two types of visual presentation (Marassa and Lansing, 1995).
Finally, learner-related factors are likely to be
important. First, it is well known that individuals
vary significantly in terms of their lipreading skills
(e.g., Demorest et al., 1996) and also in their ability to integrate auditory and visual information
(Grant and Seitz, 1998). In our study, learners varied quite significantly in their sensitivity to visual
cues before training (Sennema et al., 2003). It
was expected that a greater 'natural' awareness of visual cues would lead to greater success in the audiovisual training, but this was not found to be the case in either Experiment 1 or 2. The reason for this is unclear, but it is possible that other learner-related factors, such as individual learning strategies, are the cause. Indeed, it is plausible that, for some learners at least, perceptual training will be more successful when focused on a single modality, with the visual modality acting as a distractor. Further, as mentioned above, the language background of the learner might influence the relative effectiveness of audiovisual training, as learners from an L1 which is informative in terms of its visual distinctions may attend more to visual information (e.g., Sekiyama et al., 1996), even if the
visemes being learnt in the L2 are not in their L1
inventory. Finally, whether learners are acquiring English as a second language (ESL) or English as a foreign language (EFL) could also affect the effectiveness of training, as it influences both the degree of exposure to native English speakers, and hence to the visual cues these speakers provide, and the likely motivation of the learners. It should be noted that Hardison's subjects were ESL learners based in the USA, whilst our learners were EFL learners living in Japan or on short visits to the UK.
The fact that perceptual training can result in improvements in the pronunciation of the sounds being trained has important practical implications for computer-assisted language learning. Indeed, computer-based perceptual training programmes are more reliable than computer-based pronunciation training programmes, which need to provide accurate automatic ratings of the learner's productions. Contrary to what might be expected from speech perception studies with native talkers, audiovisual presentation of stimuli may not automatically produce gains in perception relative to auditory stimuli, especially for speech contrasts that are visible but not highly salient. However, further work is needed to investigate whether training that further enhances visual cues may be more successful in attracting language learners' attention to this useful source of segmental information.
Acknowledgements
Funded by Grant GR/N11148 from the Engineering and Physical Sciences Research Council
of Great Britain. We acknowledge Dr Marta Ortega-Llebaria's contribution to the development of
the stimuli and test design for Experiment 1. We
thank Prof. Ron Cole and colleagues for their help
in developing the CSLU toolkit-based testing and
training software. We also thank Prof. Masaki
Taniguchi (Kochi University), Jo Thorp (London
Bell School) and the UCL Language Centre for
their substantial help in organizing the testing of
Japanese students.
References
Aoyama, K., Flege, J.E., Guion, S.G., Akahane-Yamada, R.,
Yamada, T., 2004. Perceived phonetic dissimilarity and L2
speech learning: the case of Japanese /r/ and English /l/ and
/r/. J. Phonetics 32, 233–250.
Best, C., 1995. A direct realist view of cross-language speech
perception. In: Strange, W. (Ed.), Speech Perception and
Linguistic Experience: Theoretical and Methodological
Issues. York Press, Baltimore, pp. 171–204.
Best, C., McRoberts, G., Goodell, E., 2001. Discrimination
of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological
system. J. Acoust. Soc. Amer. 109, 775–794.
Bradlow, A., Pisoni, D., Akahane-Yamada, R., Tohkura, Y.,
1997. Training Japanese listeners to identify English /r/
and /l/: IV. Some effects of perceptual learning on
speech production. J. Acoust. Soc. Amer. 101, 2299–
2310.
Cole, R., Massaro, D.W., de Villiers, J., Rundle, B.,
Shobaki, K., Wouters, J., Cohen, M., Beskow, J., Stone,
P., Connors, Tarachow, A., Solcher, D., 1999. New tools for
interactive speech and language training: using animated
conversational agents in the classroom of profoundly deaf
children. In: Proc. ITRW Methods and Tools in
Speech Science Education (MATISSE), London, pp. 45–
52.
Demorest, M.E., Bernstein, L.E., DeHaven, G.P., 1996. Generalizability of speechreading performance on nonsense
syllables, words, and sentences: subjects with normal
hearing. J. Speech Hear. Res. 39, 697–713.
Flege, J.E., 1995. Second-language speech learning: theory,
findings, and problems. In: Strange, W. (Ed.), Speech
Perception and Linguistic Experience: Theoretical and
Methodological Issues. York Press, Baltimore, pp. 229–
273.
Flege, J.E., 1998. Second-language learning: the role of subject
and phonetic variables. In: Proc. StiLL ESCA Workshop,
Stockholm, May 1998, pp. 1–8.
Flege, J.E., 1999. Age of learning and second-language speech.
In: Birdsong, D.P. (Ed.), Second Language Acquisition and
the Critical Period Hypothesis. Lawrence Erlbaum, Hillsdale, NJ, pp. 101–132.
Fowler, C., 1986. An event approach to the study of speech
perception from a direct-realist perspective. J. Phonetics 14,
3–28.
Gagné, J.-P., Rochette, A.-J., Charest, M., 2002. Auditory,
visual and audiovisual clear speech. Speech Comm. 37, 213–
230.
Grant, K.W., Seitz, P.F., 1998. Measures of auditory–visual
integration in nonsense syllables and sentences. J. Acoust.
Soc. Amer. 104, 2438–2450.
Guion, S.G., Flege, J.E., Akahane-Yamada, R., Pruitt, J.C.,
2000. An investigation of current models of second language
speech perception: the case of Japanese adults' perception of
English consonants. J. Acoust. Soc. Amer. 107, 2711–
2725.
Hardison, D., 1999. Bimodal speech perception by native and
nonnative talkers of English: factors influencing the
McGurk effect. Lang. Learn. 49, 213–283.
Hardison, D., 2003. Acquisition of second-language speech:
effects of visual cues, context and talker variability. Appl.
Psycholinguist. 24, 495–522.
Hazan, V., Barrett, S., 2000. The development of phonemic
categorisation in children aged 6–12. J. Phonetics 28, 377–
396.
Hazan, V., Simpson, A., 2000. The effect of cue-enhancement
on consonant intelligibility in noise: talker and listener
effects. Lang. Speech 43, 273–294.
Ingram, J.C.L., Park, S.-G., 1998. Language, context, and
speaker effects in the identification and discrimination of
English /r/ and /l/ by Japanese and Korean listeners. J.
Acoust. Soc. Amer. 103, 1161–1174.
Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesch, E.,
Tohkura, Y., Kettermann, A., Siebert, C., 2003. A perceptual interference account of acquisition difficulties for nonnative phonemes. Cognition 87, B47–B57.
Kuhl, P.K., 1993. Early linguistic experience and phonetic
perception: implications for theories of developmental
speech perception. J. Phonetics 21, 125–139.
Kuhl, P.K., Meltzoff, A.N., 1982. The bimodal perception of
speech in infancy. Science 218, 1138–1141.
Lambacher, S., Martens, W., Kakeki, K., 2002. The influence
of identification training on identification and production of
the American English mid and low vowels by native talkers
of Japanese. In: Proc. 7th Internat. Conf. on Spoken Lang.
Proc., Denver, pp. 245–248.
Lenneberg, E.H., 1967. Biological Foundations of Language.
John Wiley and Sons Inc, New York.
Lively, S., Logan, J., Pisoni, D., 1993. Training Japanese
listeners to identify English /r/ and /l/. II: the role of phonetic
environments and talker variability in learning new perceptual categories. J. Acoust. Soc. Amer. 94, 1242–1255.
Lively, S.E., Logan, J.S., Pisoni, D.B., 1994. Training Japanese
listeners to identify English /r/ and /l/. III: long-term
retention of new phonetic categories. J. Acoust. Soc. Amer.
96, 2076–2087.
Logan, J.S., Lively, S.E., Pisoni, D.B., 1991. Training Japanese
listeners to identify English /r/ and /l/: a first report. J.
Acoust. Soc. Amer. 89, 874–886.
Lotto, A.J., Sato, M., Diehl, R.L., 2004. Mapping the task for
the second language learner: the case for Japanese acquisition of /r/ and /l/. In: Slifka, J., Manuel, S., Matthies, M.
(Eds.), From Sound to Sense: 50+ Years of Discoveries in
Speech Communication, pp. C381–C386.
Marassa, L.K., Lansing, C.R., 1995. Visual word recognition in
2 facial motion conditions—full face versus lips-plus-mandible. J. Speech Hear. Res. 38, 1387–1394.
Massaro, D.W., 1998. Perceiving Talking Faces: from Speech
Perception to a Behavioral Principle. MIT Press, Cambridge, MA.
Massaro, D.W., Cole, R., 2000. From 'Speech is special' to
talking heads in language learning. In: Proc. INSTIL,
Dundee, Scotland, pp. 153–161.
Massaro, D.W., Light, J., 2003. Read my tongue movements:
bimodal learning to perceive and produce non-native speech
/r/ and /l/. In: Proc. Eurospeech 2003, Geneva, Switzerland,
pp. 2249–2252.
Massaro, D.W., Light, J., 2004. Using visible speech to train
perception and production of speech for individuals with
hearing loss. J. Speech Lang. Hear. Res. 47, 304–320.
Massaro, D.W., Thompson, L.A., Barron, B., Laren, E., 1986.
Developmental changes in visual and auditory contribution
to speech perception. J. Exp. Child Psychol. 41, 93–113.
Mayo, L.H., Florentine, M., Buus, S., 1997. Age of second-language acquisition and perception of speech in noise. J.
Speech Lang. Hear. Res. 40, 686–693.
Mayo, C., Turk, A., 2004. Adult-child differences in acoustic
cue weighting are influenced by segmental context: Children
are not always perceptually biased toward transitions. J.
Acoust. Soc. Amer. 115, 3184–3194.
McGurk, H., MacDonald, J., 1976. Hearing lips and seeing
voices. Nature 264, 746–748.
Nittrouer, S., Miller, M.E., 1997. Predicting developmental
shifts in perceptual weighting schemes. J. Acoust. Soc.
Amer. 101, 2253–2266.
Ortega-Llebaria, M., Faulkner, A., Hazan, V., 2001. Auditory-visual L2 speech perception: effects of visual cues and
acoustic–phonetic context for Spanish learners of English.
In: Proc. AVSP-2001, pp. 149–154.
Ouni, S., Massaro, D.W., Cohen, M.M., Young, K., Jesse, A.,
2003. Internationalization of a talking head. In: Proc.
ICPhS, Barcelona, Spain, August 2003, pp. 2569–2572.
Sekiyama, K., Tohkura, Y., Umeda, M., 1996. A few factors
which affect the degree of incorporating lip-read information into speech perception. In: Proc. ICSLP1996, Denver,
USA, pp. 1481–1484.
Sekiyama, K., Tohkura, Y., 1993. Inter-language differences in
the influence of visual cues in speech perception. J.
Phonetics 21, 427–444.
Sekiyama, K., 1997. Cultural and linguistic factors in audiovisual speech processing: the McGurk effect in Chinese
subjects. Percept. Psychophys. 59, 73–80.
Sekiyama, K., Burnham, D., Tam, H., Erdener, D., 2003.
Auditory-visual speech perception development in Japanese
and English talkers. In: Proc. AVSP 2003, St. Jorioz,
France, September 2003, pp. 43–47.
Sennema, A., Hazan, V., Faulkner, A., 2003. The role of visual
cues in L2 consonant perception. In: Proc. 15th ICPhS,
Barcelona, Spain, August 2003, pp. 135–138.
Siciliano, C., Faulkner, A., Williams, G., 2003. Lipreadability of a synthetic talking face in normal hearing and hearing-impaired listeners. In: Proc. AVSP 2003, St. Jorioz, France,
September 2003, pp. 205–208.
Sumby, W.H., Pollack, I., 1954. Visual contribution to speech
intelligibility in noise. J. Acoust. Soc. Amer. 26, 212–
215.
Summerfield, Q., 1983. Audio-visual speech perception, lipreading and artificial stimulation. In: Lutman, M.E., Haggard, M.P. (Eds.), Hearing Science and Hearing Disorders.
Academic Press, London, pp. 131–182.
Takata, Y., Nabelek, A.K., 1990. English consonant recognition in noise and in reverberation by Japanese and American listeners. J. Acoust. Soc. Amer. 88, 663–666.
Tyler, M.D., 2001. Resource consumption as a function of topic
knowledge in nonnative and native comprehension. Lang.
Learn. 51, 257–280.
Wang, Y., Spence, M.M., Jongman, A., Sereno, J.A., 1999.
Training American listeners to perceive Mandarin tones. J.
Acoust. Soc. Amer. 106, 3649–3658.
Wang, Y., Jongman, A., Sereno, J.A., 2003. Acoustic and
perceptual evaluation of Mandarin tone productions before
and after perceptual training. J. Acoust. Soc. Amer. 113,
1033–1043.
Werker, J.F., Tees, R.C., 1984. Cross-language speech perception: evidence for perceptual reorganization during the first
year of life. Infant Behavior Develop. 7, 47–63.
Yamada, R.A., 1995. Age and acquisition of second language
speech sounds: perception of American English /r/ and /l/ by
native talkers of Japanese. In: Strange, W. (Ed.), Speech
Perception and Linguistic Experience: Theoretical and
Methodological Issues. York Press, Baltimore, pp. 305–
320.