ARTICLE IN PRESS Speech Communication xxx (2005) xxx–xxx www.elsevier.com/locate/specom

Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English

Valerie Hazan a,*, Anke Sennema a, Midori Iba b, Andrew Faulkner a

a Department of Phonetics and Linguistics, UCL, Gower Street, London WC1E 6BT, United Kingdom
b Konan University Institute for Language and Culture, 8-9-1 Okamoto, Higashi-nada-ku, Kobe-shi, Hyogo, Japan

Received 24 September 2004; received in revised form 18 February 2005; accepted 7 April 2005

Abstract

This study investigates whether L2 learners can be trained to make better use of phonetic information from visual cues in their perception of a novel phonemic contrast. It also evaluates the impact of audiovisual perceptual training on the learners' pronunciation of a novel contrast. The use of visual cues for speech perception was evaluated for two English phonemic contrasts: the /v/–/b/–/p/ labial/labiodental contrast and the /l/–/r/ contrast. In the first study, 39 Japanese learners of English were tested on their perception of the /v/–/b/–/p/ distinction in audio, visual and audiovisual modalities, and then undertook ten sessions of either auditory ('A training') or audiovisual ('AV training') perceptual training before being tested again. AV training was more effective than A training in improving the perception of the labial/labiodental contrast. In a second study, 62 Japanese learners of English were tested on their perception of the /l/–/r/ contrast in audio, visual and audiovisual modalities, and then undertook ten sessions of perceptual training with either auditory stimuli ('A training'), natural audiovisual stimuli ('AV Natural training') or audiovisual stimuli with a synthetic face synchronized to natural speech ('AV Synthetic training'). Perception of the /l/–/r/ contrast improved in all groups, but learners trained audiovisually did not improve more than those trained auditorily.
Auditory perception improved most for 'A training' learners, and performance in the lipreading-alone condition improved most for 'AV Natural training' learners. The learners' pronunciation of /l/–/r/ improved significantly following perceptual training, and a greater improvement was obtained for the 'AV Natural training' group. This study shows that sensitivity to visual cues for non-native phonemic contrasts can be enhanced via audiovisual perceptual training. AV training is more effective than A training when the visual cues to the phonemic contrast are sufficiently salient. Seeing the facial gestures of the talker also leads to a greater improvement in pronunciation, even for contrasts with relatively low visual salience.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Audiovisual perception; Second language acquisition; Perceptual training

Portions of the work presented in this paper were presented at the 7th International Conference for Spoken Language Processing (16–20 September 2002, Denver, USA), at the International Congress of Phonetic Sciences (3–9 August 2003, Barcelona, Spain) and at the 8th International Conference for Spoken Language Processing (5–9 October 2004, Jeju Island, South Korea).
* Corresponding author. Tel.: +44 20 7679 7069; fax: +44 20 7383 4108. E-mail addresses: [email protected], [email protected] (V. Hazan), [email protected] (A. Sennema), [email protected] (M. Iba), [email protected] (A. Faulkner).
0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2005.04.007

1. Introduction

This study is concerned with the acquisition of non-native phonemic contrasts by second-language learners and considers whether perceptual training with audiovisual stimuli is more effective than training with auditory stimuli in improving the perception of these contrasts.
Phoneme inventories vary widely across languages, and it is known that infants attune to the sounds of their own language within the first few months of life (e.g., Kuhl, 1993; Werker and Tees, 1984). As a result, learning to perceive and produce sounds of a second language can be effortful if these sounds do not occur in the native language or have a different phonological status in the two languages (e.g., Flege, 1995; Guion et al., 2000). In addition to segmental errors in perception (e.g., Takata and Nabelek, 1990), there can be an increased cognitive load when listening in a non-native language (Tyler, 2001). It has also been shown that even highly fluent second-language learners, who have similar intelligibility to native listeners for speech in quiet, show disproportionate reductions in intelligibility when speech is degraded by noise (Mayo et al., 1997). A number of factors affect the degree of difficulty in acquiring novel phonemic contrasts. One important factor, which has received considerable attention, is the relation between the phoneme inventories of the first (L1) and second (L2) language, as knowledge of the phoneme inventory of the L1 interferes with the acquisition of that of the L2. The Speech Learning Model (Flege, 1995) suggests that second-language learners tend to assimilate L2 sound categories to their native sound categories if the categories are phonetically similar. Categories that are phonetically dissimilar from native sound categories are seen as easier to acquire as there is no process of L1 interference, whilst categories which are identical in the two languages are typically perceived and produced without much difficulty. Best's Perceptual Assimilation Model (PAM) (Best, 1995; Best et al., 2001) is based on a different theoretical framework, that of the Direct Realist model of speech perception (e.g., Fowler, 1986).
PAM makes detailed predictions of L2/L1 assimilation on the basis of similarities in articulatory gestures between the L2 and L1 sounds. An L2 sound is either 'categorized' as an exemplar of a native phoneme category, 'uncategorized' if it is similar to two or more native categories, or 'nonassimilable' if it is not at all similar to any native category. Both of these theories predict that certain phonemic contrasts will be difficult to acquire within a specific L1/L2 combination. For example, the /l/–/r/ contrast in English has been extensively used in studies of L2 speech perception, as it is a difficult contrast to acquire for Japanese learners of English due to the different phonological status of this contrast in Japanese (see Aoyama et al., 2004 for a detailed description of the relation between English /l/–/r/ and Japanese /r/). A significant factor affecting the acquisition of novel phoneme categories in an L2 is the age of second language acquisition (for a discussion, see Flege, 1999). Although there is now little support for the view that there is a critical period after which the acquisition of new phoneme categories would be impossible (Lenneberg, 1967), there does appear to be a sensitive period for L2 acquisition, at the phonetic level at least. Indeed, in studies of immigrant L2 learners from a homogeneous L1 population, the degree of foreign accent in production (as assessed by native listener ratings) increased fairly continuously with age (e.g., Flege, 1998). Another argument in favour of ongoing plasticity in speech perception comes from studies with adults showing that the perception of novel phonemic contrasts can be improved by auditory training. In a study in which young adult learners underwent 45 training sessions, Logan et al.
(1991) succeeded in improving the perception of the English /l/–/r/ contrast in Japanese-L1 learners using a 'high variability phonetic training' approach. In this technique, learners are exposed to multiple tokens of minimal pairs of words containing the novel sounds in different syllable positions, produced by different talkers. These tokens are presented in a forced-choice identification task, with immediate feedback as to whether the choice was correct or incorrect. The aim of this approach is to create 'robust' novel phoneme categories by exposing L2 learners to acceptable variation within each category. The perceptual training was shown to be robust as it generalized to new words and new talkers (Lively et al., 1993, 1994). Further studies replicated this result and showed that there was long-term retention of the learning over a period of at least three months (Bradlow et al., 1997; Lively et al., 1994). Perceptual training not only improved the perception of novel speech sounds but also led to improvements in the pronunciation of these sounds by the L2 learners, improvements that were also retained for at least three months following the completion of the training (e.g., Bradlow et al., 1997). Perceptual training studies have mainly investigated consonant and vowel (e.g., Lambacher et al., 2002) contrasts, with a strong focus on the English /l/–/r/ contrast. However, auditory training has also been shown to be successful for suprasegmental features, as it led to improvements in the perception of tone contrasts (e.g., Wang et al., 1999); again, improvement in the perception of these contrasts transferred to tone production, both for tokens heard during training and for novel tokens (Wang et al., 2003).
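The forced-choice-with-feedback procedure at the heart of high variability phonetic training can be sketched as a simple trial loop. This is a minimal sketch, not the software used in the studies cited: the word list, the `respond` callback, the replay-once behaviour and the scoring are illustrative assumptions.

```python
import random

def run_training_block(items, respond, seed=0):
    """One hypothetical training block.

    items: list of (word, correct_label) minimal-pair tokens.
    respond: callback taking a word and returning the learner's label.
    Returns percent correct, as reported to the learner at block end.
    """
    rng = random.Random(seed)
    order = items[:]
    rng.shuffle(order)          # order randomized within each block
    correct = 0
    for word, label in order:
        answer = respond(word)  # forced-choice identification
        if answer == label:
            correct += 1        # positive feedback, item not repeated
        else:
            respond(word)       # 'listen again': item replayed once
    return 100.0 * correct / len(order)

# Hypothetical minimal pairs for a /b/-/v/ contrast block, with the
# response labels used as choices in a study of this kind:
block = [("berry", "B"), ("very", "V"), ("ban", "B"), ("van", "V")]
print(run_training_block(block, lambda w: "V" if w.startswith("v") else "B"))
```

In the full procedure the blocks cycle through multiple talkers, syllable positions and vocalic contexts across sessions; here a single tiny block stands in for one session.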
A significant source of segmental information for speech perception that has not hitherto been fully exploited in perceptual training is the information available via lipreading, to which learners have access in traditional face-to-face language teaching. For native listeners, visual cues to consonant and vowel identity are integrated with acoustic cues early in perceptual processing. This is shown by the McGurk effect (McGurk and MacDonald, 1976), which is obtained when visual and acoustic cues to consonant identity are put in conflict (e.g., visual /g/ with auditory /b/). The resulting percept (typically /d/ in the example given) shows that the information from each modality is integrated before the sound is categorized by the listener. Visual cues add to the multiplicity of cues that is so crucial for making speech robust to environmental degradation, and are indeed particularly useful for native listeners in difficult listening conditions (e.g., Sumby and Pollack, 1954). Infants are sensitive to visual information, as they prefer video presentations of talkers with congruent rather than incongruent audio and visual channels (Kuhl and Meltzoff, 1982). However, there is a developmental trend in the use of lipreading cues. Indeed, children aged 6–10 years are less influenced by visual information in their judgments of consonant identity than are adults (Massaro et al., 1986; Sekiyama et al., 2003). Just as the sensitivity to certain acoustic cues increases within the first 10 years of life (e.g., Hazan and Barrett, 2000; Nittrouer and Miller, 1997; Mayo and Turk, 2004), sensitivity to visual cues also develops. The degree to which this visual influence develops appears to be language-dependent, as Japanese and Chinese listeners are less influenced by visual cues than English listeners in McGurk tasks (Hardison, 1999; Sekiyama and Tohkura, 1993; Sekiyama, 1997).
The degree of visual influence may be related to the informativeness of visual cues in a particular language relative to auditory cues. This can depend on the number of visemes in a specific language (i.e. the number of 'visual categories' that are identifiable using lipreading alone), and also on whether the language is tonal, as tone information is not marked visually. Interestingly, in a study looking at the development of visual influence in Japanese and English 6-year-old and adult listeners, it was shown that the level of visual influence was the same in 6-year-old English and Japanese listeners and that it increased significantly in English but not Japanese adults (Sekiyama et al., 2003). This suggests that, due to the relatively low informativeness of lipreading information in Japanese, Japanese listeners become strongly attuned to auditory rather than to visual information as a result of language experience. In view of these factors, L2 learners may therefore be impeded in their use of visual cues for two reasons. First, if their first language is one in which visual influence is relatively low (e.g., Japanese or Chinese), they may not naturally attend to the segmental information available via the visual channel and may need to be trained to attune more to the visual channel. Second, just as L2 learners lose sensitivity to specific acoustic cues to vowel or consonant contrasts that do not occur in their own language, it is probable that they lose sensitivity to even salient visual cues that are not relevant in their L1: they may be able to see the difference between two visible articulatory gestures but not be able to consistently associate each with the appropriate phoneme label.
This was found to be the case in a comparison of the perception of English consonants and vowels presented to Spanish-L1 learners and controls in auditory and audiovisual conditions with background noise (Ortega-Llebaria et al., 2001). Consonant confusions that were language-dependent, i.e. which resulted from differences in the phoneme inventories of Spanish and English, were not reduced by the addition of visual cues, whereas confusions that were common to both listener groups and were due to acoustic-phonetic similarities between categories did show improvements. A wider-ranging study of concordant and discordant audiovisual syllables with listeners from four different L1 backgrounds showed that the influence of the visual cue was ''dependent upon its information value, the intelligibility of the auditory cue and the assessment of similarity between the two cues'' and was also affected by linguistic experience (Hardison, 1999). Even if L2 learners are not highly sensitive to visual cues, an important issue is whether this sensitivity can be increased by training. We argue that increasing the use of visual cues to consonant identity is likely to be effective in improving speech perception in L2 learners, as it would increase the multiplicity of phonetic information available to these listeners. Few studies to date have evaluated the relative effectiveness of auditory and audiovisual training on the acquisition of novel L2 contrasts. In a study on the training of the American English /l/–/r/ contrast with Japanese and Korean learners of English, audiovisual training was more effective than purely auditory training in improving the identification of the contrast for both learner populations (Hardison, 2003). Improvements in perception also led to improvements in native listener ratings of the learners' pronunciation, both for the auditory-trained and the audiovisual-trained learners.
Another approach to audiovisual training has been to use three-dimensional computer-animated talking heads that use accurate articulatory information (e.g., Massaro, 1998). These are potentially powerful tools for language learning, as training materials can be easily generated and as they give greater potential for 'enhancing' articulatory information. This can be done, for example, by making the skin transparent so that articulatory gestures within the oral tract can be seen, by slowing down articulatory gestures, or by highlighting usually invisible cues such as vocal fold vibration (Massaro and Cole, 2000). Such talking heads have already been used successfully for language learning by children with hearing loss (Massaro and Light, 2004). In a first study investigating the usefulness of a talking head for training the /l/–/r/ contrast with Japanese learners of English, two training conditions were compared: (a) a 'standard' talking head, and (b) the same talking head with 'transparent' skin and visible articulatory gestures (Massaro and Light, 2003). Training significantly improved the identification and production of /l/ and /r/, but the visible-articulation condition did not lead to a greater improvement. This study does not permit us to compare the relative effectiveness of natural and synthetic visual cues. In our study, our aims were to investigate whether (a) the effectiveness of auditory training for language learning could be increased by providing visual cues, (b) the effect of visual cues on the learning of the new contrast was dependent on the distinctiveness of the visual information, and (c) perceptual training with audiovisual tokens would lead to a greater improvement in the pronunciation of the novel phonemic contrast than training with auditory tokens. We also compared the use of talking heads versus natural faces in our training methodology to investigate the possible effects of the mode of visual presentation.
2. Experiment 1: /b/–/p/–/v/ perceptual training study

The aim of Experiment 1 was to investigate whether the perception of the contrast between the English labials /b/ and /p/ and the labiodental /v/ could be improved for Japanese learners of English via auditory or audiovisual training. The English labial/labiodental contrast in place of articulation is difficult for Japanese learners of English because the labiodental phonemes /v/ and /f/ do not occur in the consonant inventory of Japanese, although /v/ does appear in loanwords and is beginning to acquire phonemic status in Japanese. The labiodental gesture, which is highly distinctive in English, therefore does not occur in Japanese. The following pattern of assimilation might be expected: the English phoneme /v/ may be assimilated to the Japanese phoneme /b/, as /b/ is sometimes realised as a bilabial approximant between vowels in Japanese. The English /b/–/p/ voicing contrast may also lead to confusions, as voiceless plosives in Japanese tend to be unaspirated and so English /b/ and Japanese /p/ will be phonetically similar. Both phonemes are therefore included in the study, although our primary focus of interest is the perception of the labial/labiodental distinction.

2.1. Subjects

A total of 39 Japanese-L1 learners of English as a Foreign Language (EFL) participated in the training. All were students from a variety of degree programmes at Konan University in Japan who were taking an English elective course. They were aged between 19 and 23 years and had already learned English for six years at high school before entering university, but did not often practise listening to English sounds. The training took place in Japan. A further four students participated in the study but were not included in the data analysis because their pre-test performance was above 95% in the Audio condition. In order to obtain a measure of the visual salience of the contrast, a group of 13 native English controls (university students at UCL) was tested in the visual condition only.

2.2. Speech materials

Speech materials included: (a) test materials that were presented to the learners before and after the course of training and (b) the training materials.

2.2.1. Pre- and post-test materials

Nonsense words were used for the pre/post-test to evaluate consonant identification without the influence of lexical knowledge and with only a minimal learning effect (as the same items were to be repeated following the course of training sessions). The consonants /p/, /b/ and /v/ were embedded within nonsense words with the following structure: CV, VCV, or VC, where V was one of /i, , u/. Each item was recorded three times, yielding a total of 81 tokens.

2.2.2. Training materials

A number of English minimal pairs (real words) were recorded, chosen to contrast /b/, /p/ and /v/ in a range of vowel environments and in different positions and syllabic contexts. A subset of 75 pairs was chosen for the training sessions. This comprised 15 minimal pairs of words with /b/ and /v/ in initial and medial position and 15 minimal pairs of words with /b/ and /p/ in initial, medial and final position.1

2.3. Recording procedure

A phonetically-trained female talker recorded the pre/post-test items, and three further women and two men recorded the training items. All were native talkers of South Eastern British English. For all talkers, video recordings were made in a soundproof room. The talker's face was set against a blue background and illuminated with a key and a fill light. The talker's head was fully visible within the frame. Video recordings were made with a Canon XL-1 DV camcorder. Audio was recorded from a Bruel and Kjaer type 4165 microphone to both the camcorder and a DAT recorder. The video channel was digitally transferred to a PC and time-aligned with the DAT audio recording, which was of higher quality than the audio track of the video. Video clips were edited so that the start and end frames of each token showed a neutral facial expression. Stimuli were down-sampled once the editing had been completed (250 * 300 pixels, 25 f/s, audio sampling rate 22.05 kHz).

1 A list of the minimal pairs used in Experiments 1 and 2 is available from: http://www.phon.ucl.ac.uk/home/val/EPSRCproject.htm.

2.4. Test conditions

The test and training software were designed using the CSLU toolkit (e.g., Cole et al., 1999). The programs were run individually on laptop computers, with students working in quiet surroundings and stimuli presented via headphones at a comfortable listening level. In the pre- and post-test, the nonsense items described above were presented to all learners in three conditions, 'auditory' (A), 'audiovisual' (AV) and 'visual' (V), in a 3AFC identification task with response choices of B, P and V. The learner responded by clicking on the appropriate symbol with a mouse. The test items were identical in the three conditions, but in the single-modality tasks, either the auditory or the visual channel was removed. Each block of 81 items was repeated once in each test condition, yielding a total of 486 responses per listener. The order of items was randomized within each block for each listener. Two orders were used for the presentation of the three conditions, AV, A, V or A, AV, V, and the two orders were counterbalanced across listeners. In order to compare the relative effectiveness of auditory and audiovisual training, a different group of learners was tested in each of these two training modes. The training items were identical for the two learner groups, except that the video channel was masked for the auditorily-trained learners. Twenty-one learners carried out the training in the Auditory training ('A training') condition, and 18 learners in the Audiovisual training ('AV training') condition.
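The token and response counts implied by this design can be checked with a short calculation (a sketch; the variable names are ours, and a placeholder stands in for the vowel symbol that did not survive transcription):

```python
# Pre/post-test design of Experiment 1: 3 consonants x 3 syllable
# structures x 3 vowel contexts x 3 recordings per item.
consonants = ["p", "b", "v"]
structures = ["CV", "VCV", "VC"]
vowels = ["i", "V?", "u"]   # "V?" marks the vowel lost in transcription
recordings_per_item = 3

tokens = len(consonants) * len(structures) * len(vowels) * recordings_per_item
print(tokens)  # 81 test tokens

# Each 81-item block is presented twice in each of the three test
# conditions (A, AV, V), giving the per-listener response count.
conditions = ["A", "AV", "V"]
responses_per_listener = tokens * 2 * len(conditions)
print(responses_per_listener)  # 486 responses per listener
```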
Learners were assigned to each condition on the basis of their scores in the auditory condition of the pre-test, with the aim of ensuring a reasonable balance across training groups in terms of their pre-test performance, although an exact match in terms of baseline performance was not achieved (see Fig. 1).

Fig. 1. Measure of the discriminability of the labial/labiodental place contrast (d') in the pre-test and post-test for each training group in each modality (Experiment 1).

2.5. Training task

The training approach was a variant of the High Variability Phonetic Training approach that was developed by Logan et al. (1991) and used in numerous training studies. This approach places emphasis on exposing the learner to normal phonetic variability within a phonemic category, and on providing immediate feedback on the response given. Learners are therefore exposed to different talkers across sessions, and the consonants being trained are presented in different syllable positions and vocalic contexts. In the training software, a conversational agent, the 'Baldi' talking head (Massaro, 1998), acted as a 'virtual teacher', giving instructions to the learner at the beginning of each training block, telling the learner to 'listen again' to the repeated item if an incorrect training response was given, and telling the listener the percentage of correct identification achieved at the end of each training block. In the training, students were first familiarized with the test consonants uttered by the particular talker of that session, after which the items were presented either auditorily (for the 'A training' learners) or audiovisually (for the 'AV training' learners). After hearing each training word, the learner had to give a response of B, P or V by clicking a label with a mouse. If a correct response was given, a 'smiley' symbol appeared and the item was not repeated. If the response was incorrect, the learner was told to 'listen again' and the word was repeated. The training consisted of 10 sessions, held over a period of approximately four weeks, each lasting about 40 min. At each training session, listeners either saw and heard (AV trainees), or just heard (A trainees), five blocks of test items produced by one of the five talkers. At each session, each learner heard 60 instances of the consonants /b/ and /v/ in real words that formed minimal pairs for this contrast, and 90 instances of the consonants /b/ and /p/ in real words that formed minimal pairs for that contrast. The blocks with different consonants alternated in order per day, and the order of items was randomized within each block for each listener. Each talker was heard on two different days. After ten sessions of training, a post-test was done, which was identical to the pre-test.

2.6. Results

Although the /b/ vs /p/ contrast had been included in this study because confusions were expected between these two English consonants, we were principally interested in whether the listeners were able to perceive the distinction that is marked visually, i.e. the difference between the bilabial and labiodental consonants. Analyses therefore focused on the perception of the contrast in place of articulation, and voicing confusions (/b/ perceived as /p/ and /p/ perceived as /b/) were ignored. Given that there were three possible response labels (P, B and V), it is important to correct for any potential bias in the responses given by the learners. The proportion of correct /v/ responses for pre-test and post-test results in each modality (A, AV, V) was therefore converted to the signal detectability measure d-prime (d'), which is calculated as the z-value of the hit rate minus that of the false-alarm rate. The mean d' for the labial/labiodental contrast obtained by each training group in the pre- and post-test in each modality is presented in Fig. 1. Despite the fact that there were twice as many labial as labiodental stimuli in the test, there was no evidence of bias, as shown by average beta values close to 1.0 for the A, AV and V conditions. For the native controls, the mean d' obtained in the V condition was 4.61 (S.D. 1.38), which is substantially higher than the d' of 1.51 (S.D. 1.04) obtained for the Japanese-L1 group in the pre-test. In order to see whether the perception of the labial/labiodental place contrast improved as a result of training, a repeated-measures analysis of variance (ANOVA) was carried out on the d' scores obtained in the pre- and post-test for all 39 learners. Overall, correct place perception was higher in the post-test than in the pre-test [F(1,37) = 108.7; p < 0.001]. The effect of test mode (A, V or AV) was significant [F(2,74) = 46.8; p < 0.001], and Bonferroni-adjusted post-hoc tests showed that better discriminability of the place contrast was obtained in the AV test condition than in the A test condition, which itself was better perceived than the V condition. Training therefore significantly increased the discriminability of the place contrast. That higher identification was obtained in the AV test condition suggests that learners were able to integrate information contained in the auditory and visual modalities.

Fig. 2. Pre/post training change in d' for the labial/labiodental contrast in the three test conditions for A-trained and AV-trained learners (Experiment 1).

The evaluation of the relative effectiveness of the two training modes was carried out on the change in d' between pre-test and post-test scores (see Fig. 2) so as not to be influenced by any differences in baseline levels of performance across groups. The main effect of training mode was significant [F(1,37) = 10.3; p = .003], with a greater improvement in the discriminability of the place contrast seen for the 'AV training' group than the 'A training' group for all three test conditions. Post-hoc tests revealed that, taking all subject groups together, this improvement was greater for the A and AV test conditions than for the V test condition.2 In order to get a better idea of the effect of each training condition on the perception of acoustic and visual cues to the labial/labiodental contrast, repeated-measures ANOVAs were carried out separately on the data for each training group to look at the effect of test modality. For the 'A training' group, this effect was significant [F(2,40) = 4.555; p < 0.0001]; Bonferroni-adjusted post-hoc tests showed that a greater improvement in perception was seen in the A than in the V test condition (see Fig. 2), but there was no significant difference in improvement between the A and AV test conditions. For the 'AV training' group, the effect of test mode was not significant: this group showed a similar improvement in all three test conditions, suggesting that sensitivity to visual information increased for these learners. It could be argued that AV training is more likely to be effective for those learners who, before training, already showed some sensitivity to visual cues.
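The d'-based scoring used in these analyses treats correct /v/ responses to /v/ stimuli as hits and /v/ responses to labial stimuli as false alarms, with d' = z(hit rate) - z(false-alarm rate). A minimal sketch follows; the correction for extreme rates and the response counts are illustrative assumptions, not values from the study:

```python
from statistics import NormalDist

def d_prime(hits, signal_trials, false_alarms, noise_trials):
    """d' = z(hit rate) - z(false-alarm rate), from raw counts."""
    z = NormalDist().inv_cdf
    # A standard correction keeps rates of exactly 0 or 1 finite.
    hit_rate = (hits + 0.5) / (signal_trials + 1)
    fa_rate = (false_alarms + 0.5) / (noise_trials + 1)
    return z(hit_rate) - z(fa_rate)

# Per condition block the test contains 27 /v/ (labiodental) and 54
# labial stimuli; the response counts below are invented for
# illustration:
print(round(d_prime(24, 27, 6, 54), 2))
```

Converting response proportions to d' in this way separates sensitivity from response bias, which is why the beta values near 1.0 reported above matter: they indicate that the learners were not simply favouring one response label.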
In order to test this hypothesis, the 'AV training' group was divided into 'high V score' (N = 8) and 'low V score' (N = 10) subgroups based on a criterion of d' > 1.15 in the V test condition in the pre-test. In the pre-test, mean d' for the 'high V score' group was 2.15 for the A condition, 2.73 for the AV condition and 2.04 for the V condition. The corresponding means for the 'low V score' group were 1.35 for the A condition, 1.25 for the AV condition and 0.63 for the V condition. A repeated-measures ANOVA failed to show any significant difference between the subgroups in terms of their improvement post-training, although there was a trend for the 'high V score' group to show greater improvement in the V condition. Similar statistics were run for learners with high and low V scores trained auditorily. Again, there was no statistically significant difference between the two groups in terms of their improvement.

2 In order to check that the comparison between the two training groups was not strongly affected by differences in pre-training performance across groups, statistics were carried out on a subset of the data which included pairs of subjects across groups that were matched in terms of their pre-training performance in the Auditory condition (11 subjects per group). Results were consistent with those obtained for the larger group. In a two-way repeated-measures ANOVA carried out on the pre- and post-training test data, the AV-training group showed a greater improvement in discriminability of place of articulation after training (significant interaction between time of testing and training condition).

2.7. Discussion

We know from the pre-test results that the contrast between the English labial (/b/–/p/) and labiodental (/v/) consonants is visually salient for native listeners, who achieved a mean of 92% correct /v/ identification in the lipreading condition and a d' of 4.61.
Perception of the place contrast by L2 learners was significantly poorer than that of native controls. These results confirm that L2 learners tend not to make use of visual cues that do not mark phonemic contrasts in their first language, even if there is a clear difference in articulatory gestures. It was of interest to see whether learners could be trained, via audiovisual training, to become more sensitive to visual cues and whether they would then improve their perception of a novel contrast to a greater degree than auditorily-trained learners. Audiovisual training was indeed more effective than auditory training in improving the perception of the labial/labiodental contrast. Whereas the A-trained learners improved more in the auditory test condition and showed little evidence of improved sensitivity to visual cues to the contrast, the AV-trained learners showed improvement in their use of both acoustic and visual cues to the contrast, and became more aware of the distinction between labial and labiodental gestures. The amount of overall improvement was also greater for the AV-trained learners. It had been expected that improvements would be maximized if the modality of training was adapted to listener biases, and therefore that we would see greater improvements with AV training for learners achieving higher V scores in the pre-test. This did not appear to be the case. These results were encouraging in terms of the effectiveness of audiovisual training, but it is also important to evaluate whether this finding would extend to novel phonemic contrasts that are less visually salient for native listeners. Experiment 2 therefore examined the relative effectiveness of auditory and AV training for the perception of the British English /l/–/r/ contrast by Japanese learners of English.
Also, in order to see whether less detailed visual information could be informative in training, some learners were trained in a condition in which naturally-produced items were synchronized to a talking head (Baldi, the talking head used as the 'virtual teacher' in the training programme).

3. Experiment 2: Perceptual training of the /l/–/r/ contrast

The aim of Experiment 2 was to investigate the effect of audiovisual presentation in training for the /l/–/r/ contrast, which is less visually distinct than the labial/labiodental contrast of Experiment 1. In Japanese, lateral approximants do not occur, but the phonemic inventory includes an alveolar flap [ɾ]. The English phonemes /l/ and /r/ both tend to be assimilated to the native alveolar flap, leading to problems in the discrimination and identification of these English phonemes. The /l/–/r/ contrast is particularly difficult for Japanese learners because it is primarily marked by differences in third formant (F3) transitions, to which Japanese listeners are not particularly sensitive, presumably because F3 transitions do not act as primary acoustic cues for Japanese sound distinctions (Iverson et al., 2003; Lotto et al., 2004; Yamada, 1995). Previous studies involving American English have suggested that, although both English consonants are assimilated to the same native Japanese consonant category, the goodness of fit of the new consonants to the native category varies, with /l/ being more perceptually similar to the Japanese alveolar flap [ɾ] than English /r/ (e.g. Aoyama et al., 2004). According to Flege's Speech Learning Model (Flege, 1995), L2 sounds which are more perceptually dissimilar to L1 sounds are more likely to be accurately perceived and produced, because interference from the native category is weaker. This hypothesis was confirmed in previous auditory training studies (e.g.
Bradlow et al., 1997) which found better /r/ than /l/ perception and production in Japanese learners of English.

3.1. Speech materials

3.1.1. Pre- and post-test materials

The two consonants /l/ and /r/ were embedded in initial and medial position in nonsense words in the context of the vowels /i, , u/. The consonants appeared as singletons or as the second consonant of a cluster (the additional consonant being /k/ or /f/), in initial (CV, cCV) or intervocalic (VCV, VcCV) position. Three items were produced for each consonant in each syllabic and vowel context (27 initial and 27 medial items for each of /l/ and /r/), yielding a total of 108 items.

3.1.2. Training materials

For the training sessions, materials included 132 minimal pairs of real words containing /l/ or /r/. The sounds /l/ and /r/ appeared in different vowel contexts and positions: 100 pairs with the consonant in initial position (55 singleton and 45 clustered) and 32 pairs with the consonant in medial position (28 singleton and 4 clustered).

3.2. Talkers and recording procedure

To prepare the test items, a female talker of South Eastern British English was recorded for the pre- and post-test, and two further women and three men for the training materials. As most of the talkers recorded for Experiment 1 were no longer available, new talkers were recorded, apart from one talker who was used in training in both experiments. Although the talkers varied in lipreadability, as shown subsequently by differences in learners' performance across individual training sessions (one talker per session), talkers were not selected according to any specific criteria as regards their articulatory gestures. The same recording and signal processing procedures were used as in Experiment 1.
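The pre/post-test item counts in Section 3.1.1 can be cross-checked by enumeration. A minimal sketch, assuming the stated design (singleton plus /k/- and /f/-clusters, three vowels, two positions, three items per context); the vowel list is only illustrative, since one vowel symbol did not survive in the source:

```python
from itertools import product

consonants = ["l", "r"]
onsets = ["", "k", "f"]            # singleton, /k/-cluster, /f/-cluster
vowels = ["i", "a", "u"]           # middle vowel is a placeholder (illegible in source)
positions = ["initial", "medial"]  # CV/cCV vs. VCV/VcCV
items_per_context = 3              # three recorded items per context

# One entry per recorded token: (position, onset, consonant, vowel, item index)
stimuli = list(product(positions, onsets, consonants, vowels,
                       range(items_per_context)))

initial_l = sum(1 for pos, _, c, _, _ in stimuli
                if pos == "initial" and c == "l")
print(len(stimuli), initial_l)  # -> 108 27
```

The totals reproduce the reported inventory: 27 initial and 27 medial items per consonant, 108 items overall.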
For the audiovisual stimuli with a synthetic face, the digital speech signal used for the auditory and audiovisual (natural face) training was carefully synchronized with the articulatory gestures of the talking head using the 'Baldi Sync' software provided as part of the CSLU Toolkit (Cole et al., 1999). This software carries out an automatic alignment of the acoustic signal to the facial animation; hand-corrections to the alignment were made where necessary.

3.3. Listeners

A total of 62 Japanese-L1 learners of English as a Foreign Language (EFL) participated in the study: 40 were students in the English Department of the University of Kochi and were tested in Japan, whilst the rest were attending a short-term English language or English phonetics course in the UK. ANOVAs carried out on the data did not show any difference in performance between learners tested in these two locations. Learners were at approximately a low to lower-intermediate level of English proficiency, were aged between 17 and 32 years, had started learning English after the age of 13, and none had lived in the UK for more than four months. They reported normal hearing and normal or corrected vision. A further six students participated in the study but were excluded from the analysis, either because their pre-test performance was above 95% in the A condition (N = 4) or because of incomplete post-test data (N = 2).

3.4. Experimental task

The same testing and training software package was used as for Experiment 1. For the pre/post-test, the items were presented to all listeners in three conditions (A, V and AV presentation), with two blocks of 108 items per condition. Each listener therefore heard 108 repetitions of each consonant (across vowels and positions) in each test condition. The order of items was randomized within each block, and two orders of presentation of the test conditions were counterbalanced across listeners.
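The audio-to-animation synchronization described above can be illustrated generically: an automatic aligner yields phone start/end times that drive the facial animation, with hand-corrections overriding individual segments where the automatic alignment fails. A sketch of that data flow (the function names and viseme mapping are hypothetical illustrations, not the CSLU Toolkit API):

```python
# Toy phone-to-viseme mapping; real systems use a full inventory.
PHONE_TO_VISEME = {"l": "L", "r": "R", "a": "A", "k": "KG"}

def keyframes(alignment, corrections=None):
    """Turn a phone-level alignment into viseme keyframes.
    alignment: list of (start_s, end_s, phone) from an automatic aligner;
    corrections: optional {segment index: (new_start, new_end)} hand-fixes
    applied before the animation timings are generated."""
    corrections = corrections or {}
    frames = []
    for i, (start, end, phone) in enumerate(alignment):
        start, end = corrections.get(i, (start, end))
        frames.append((start, end, PHONE_TO_VISEME.get(phone, "NEUTRAL")))
    return frames

# A two-segment alignment, with the first boundary hand-corrected:
align = [(0.00, 0.12, "l"), (0.12, 0.30, "a")]
print(keyframes(align, corrections={0: (0.00, 0.10)}))
```

The separation between automatic output and manual overrides mirrors the procedure reported: align first, hand-correct only where necessary.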
Items were presented to both ears at a comfortable listening level via headphones. The pre-test was followed by 10 sessions of training, each lasting about 40 min. In the training, students were first familiarized with the two test consonants uttered by the talker of that session, after which the items were presented either auditorily ('A training' group), audiovisually with a natural face ('AV natural' training group) or audiovisually with a synthetic face ('AV synthetic' training group). Learners were assigned to a training group on the basis of their scores in the Auditory condition of the pre-test, with the aim of ensuring a reasonable balance across training groups (A = 18 learners, AV natural = 25 learners, AV synthetic = 19 learners). The ten sessions were held over a period of two to three weeks. The training program was run individually on laptops and all sessions were carried out under similar conditions, with students working in quiet surroundings and stimuli presented via headphones and visually on the computer screen. At each training session, listeners were trained on two blocks of items produced by one of the five talkers: 200 tokens with /l/–/r/ in initial position and 64 items in medial position. The order of the initial- and medial-consonant blocks alternated across days. The order of items was randomized within each block for each listener. On days 6–10, listeners repeated sessions 1–5. After the ten days of training, a post-test identical to the pre-test was administered.

3.5. Results

The difference between pre- and post-test scores for individual learners ranged from −5% to +48% in the auditory test condition and from −6.5% to +48% in the AV test condition. In order to gauge the degree of visual salience of the contrast for native listeners, a group of 14 native controls was tested in the visual condition only, and achieved
a mean score of 72.1% correct (S.D. 10.9). Note that the same controls had achieved scores of 92% correct /v/ identification in Experiment 1. This confirms that the /l/–/r/ contrast produced by the talker in Experiment 2 was less visually distinctive than the labial/labiodental contrast produced by a different talker in Experiment 1. As in Experiment 1, d-prime (d′) scores were calculated for each test condition carried out before and after training as a measure of the discriminability of the /l/–/r/ contrast (see Table 1).

Table 1
Means and standard deviations of the score representing the discriminability of the /l/–/r/ contrast (d′) in the pre- and post-test in each test condition (A, AV, V) for the Auditory, AV (natural face) and AV (synthetic face) training groups (Experiment 2)

Training group              Condition   Pre-test (S.D.)   Post-test (S.D.)
Auditory (N = 18)           A           0.56 (0.70)       2.16 (1.40)
                            AV          0.79 (0.87)       2.05 (1.26)
                            V           0.47 (0.45)       0.67 (0.46)
AV natural face (N = 25)    A           0.40 (0.78)       1.63 (1.36)
                            AV          0.57 (0.88)       1.89 (1.32)
                            V           0.18 (0.42)       0.69 (0.44)
AV synthetic face (N = 19)  A           0.61 (0.61)       1.82 (1.36)
                            AV          0.73 (0.88)       1.82 (1.49)
                            V           0.09 (0.27)       0.31 (0.37)

Fig. 3. Pre/post-training change in d′ for the /l/–/r/ contrast in the A, AV and V test conditions for the three training groups (Auditory training, AV training with natural face, AV training with synthetic face) (Experiment 2).

A repeated-measures ANOVA was carried out on the discriminability scores obtained in the pre- and post-test to evaluate the within-group effect of time of testing. Overall, training was effective in improving the discriminability of the /l/–/r/ contrast [F(1, 59) = 145.5; p < .001].
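The d′ scores used throughout are standard signal-detection measures of discriminability; a minimal sketch of the computation (the clamping of perfect hit or false-alarm rates is an assumption, as the paper does not report which correction was used):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), where z is the inverse
    of the standard normal CDF. Rates of exactly 0 or 1 are clamped
    to 1/(2N) from the boundary (a common convention, assumed here)."""
    n_sig = hits + misses
    n_noise = false_alarms + correct_rejections
    h = min(max(hits / n_sig, 0.5 / n_sig), 1 - 0.5 / n_sig)
    f = min(max(false_alarms / n_noise, 0.5 / n_noise), 1 - 0.5 / n_noise)
    z = NormalDist().inv_cdf
    return z(h) - z(f)

# e.g. treating /l/ as the "signal": 90 of 108 /l/ items identified
# as /l/, and 30 of 108 /r/ items also labelled /l/ (invented counts):
print(round(d_prime(90, 18, 30, 78), 2))  # -> 1.56
```

Unlike percent correct, d′ is unaffected by a listener's overall preference for one response label, which is why it is used here to compare conditions.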
The evaluation of the relative effectiveness of the three training modes was carried out on the difference in discriminability between pre-test and post-test scores (see Fig. 3), so as not to be influenced by any differences in overall levels of performance across groups. The across-group effect of training mode was not significant, although there was a trend for less improvement in the 'AV synthetic' condition: the mean increase in d′ was 1.02 for both the 'A training' and 'AV natural' training groups, and 0.84 for the 'AV synthetic' training group. The within-group effect of test condition was significant [F(2, 118) = 58.25; p < 0.001]: the increase in d′ for the V test condition was lower than that for the A and AV test conditions. The interaction between test condition and training mode failed to reach significance, but there was a trend for a greater increase in discriminability in the V test condition for the 'AV natural' trained group than for the other two groups (see Fig. 3). It is of interest to consider whether those learners who were not exposed to visual cues during training would improve their perception in the V test condition post-training, as this would provide support for cross-modal enhancement. Prior to training, learners in the 'A training' group achieved significantly higher scores in the V test condition than learners in both AV groups. However, a repeated-measures ANOVA carried out on the pre–post scores in the V condition for the 'A training' group showed that the change in discriminability in the visual-alone condition was not significant for this group.
In order to check previous claims that English /r/ is easier to perceive than English /l/ due to its greater dissimilarity to the native phoneme category, the percentage of correct /r/ and correct /l/ identifications was calculated for each test condition (A, V, AV) before and after training. It must be noted that these scores need to be considered with some caution, as differences in /r/ and /l/ identification could be influenced by response bias on the part of the learners. The percentage of correctly identified /l/ and /r/, averaged over all listeners and over the A and AV test conditions, was 59.3% (S.D. 14.1) for /l/ and 61.2% (S.D. 13.3) for /r/ in the pre-test, and 79.0% (S.D. 14.4) for /l/ and 76.2% (S.D. 16.5) for /r/ in the post-test. A repeated-measures ANOVA with consonant, test condition and time of testing as within-subject factors and training condition as a between-subject factor was carried out. Neither the main effect of consonant, nor the interactions between consonant and time of testing or consonant and training mode, were statistically significant. This suggests that Japanese learners did not find /r/ easier to identify than /l/, and that the training did not have a differential effect on the perception of these two consonants.

Fig. 4. Results obtained in pre-test, post-test and generalization tests for a subset of learners from Experiment 2. The data are shown for the test condition corresponding to the learners' training modality (A condition for learners trained auditorily, AV condition for learners from the AV Natural or AV Synthetic training groups).

3.6. Generalization test

In order to evaluate the degree to which the effects of training would generalize to new materials, a further test was completed by 28 of the learners. This generalization test included 66 new words produced by a 'known' talker (the talker heard in training sessions 1 and 6) that had not been heard in training. The generalization test was carried out in the modality used in training (auditory test for 'A training' learners, audiovisual test for 'AV natural' and 'AV synthetic' learners). Repeated-measures ANOVAs were carried out to compare scores obtained in the post-test and the generalization test in the modality in which learners were trained. The comparison is not an ideal one, as the post-test materials are nonsense words whereas the generalization materials are unknown real words. However, post-hoc tests showed that the generalization scores were not significantly lower than the post-test scores, and that there was no significant effect of training mode. The type of training undertaken therefore did not affect the way in which the training generalized to new items (see Fig. 4).

3.7. Discussion

All training conditions led to significant improvements in the perception of the /l/–/r/ contrast by the L2 learners, but contrary to the findings of Experiment 1 and to those of Hardison for an American English /l/–/r/ contrast, we did not find audiovisual training to be more effective than auditory training. In fact, there was a trend for the greatest increase in identification scores to be seen in the A-trained group. The contrast was less visually salient than the contrast tested in Experiment 1, as evidenced by the rather low 71.2% correct identification score obtained for native controls in the visual-alone condition. It was also probably less salient than the /l/–/r/ tokens used in Hardison (2003), for which means of around 85% correct identification were obtained for
native controls in a visual-alone condition (for real words). A number of factors may explain the discrepancy between our results and those of Hardison. First, there may be learner-related effects: Hardison's learners were ESL rather than EFL learners, based in the USA, with greater exposure to native talkers than the learners tested in our experiments. Also, there are differences in both the visual and acoustic characteristics of /l/ and /r/ in British and American English that will affect the salience of the contrast. Finally, there are talker-related effects: the test talker used in Experiment 2 may well have had less clear articulation than the talker used in Experiment 1, and possibly than the talker used in the Hardison study. This study did not replicate previous findings that Japanese listeners achieve higher identification rates for English /r/ than for English /l/ tokens, findings taken to support Flege's SLM, which predicts greater L1 interference in the acquisition of the /l/ phoneme category (Aoyama et al., 2004; Bradlow et al., 1997; Ingram and Park, 1998). However, differences in the acoustic–phonetic characteristics of American and British English may also be implicated, as they will affect the goodness of fit of the L2 stimuli to the L1 category. Differences in stimulus characteristics (e.g., consonant position, vowel context), talker and listener characteristics across studies may also play a role. Although the difference between training groups was not statistically significant, there was a trend for learners trained with natural audiovisual tokens to become more sensitive to the visual cues marking the contrast, as shown by their performance in the V test condition. There was also a trend for those trained with the synthetic face (synchronized to a natural audio channel) to show less improvement in training than those trained auditorily or with natural audiovisual stimuli.
This is not too surprising: in intelligibility tests with normal-hearing and hearing-impaired listeners using auditory and audiovisual stimuli, talking heads have been less effective at improving intelligibility than natural audiovisual stimuli (e.g., Ouni et al., 2003; Siciliano et al., 2003). However, this difference is likely to decrease as the quality of visual cue information provided by talking heads improves.

4. Experiment 3: Effect of perceptual training on speech production

A number of previous studies have shown that perceptual improvements resulting from auditory training can transfer to improvements in the pronunciation of the trained sounds. The aim of Experiment 3 was to evaluate whether this effect could be replicated here, and whether audiovisual training of a contrast (i.e. seeing the articulatory gestures of the talker) would lead to a greater improvement in pronunciation, even if it did not lead to a greater improvement in perception for the less visually-salient /l/–/r/ contrast.

4.1. Talkers and listeners

Twenty-five trainees from Experiment 2 were recorded before and after their course of training, after having carried out the perceptual testing. Ten had completed the training study in the 'A training' condition, 10 in the 'AV natural' condition and five in the 'AV synthetic' condition. The listeners were 12 native speakers of British English tested in London. The majority were phoneticians (PhD students) or trained EFL teachers. Four were PhD students from another university department who all reported some experience with L2 learners.

4.2. Speech materials and recording conditions

At the time of the pre- and post-test, the Japanese trainees were asked to read a list of 25 minimal pairs of words which included the sounds /l/ and /r/ in a variety of phonetic environments and positions.
The audio recordings were made either in a soundproof room using a Sony TCD-D10 digital audio recorder with an ECM-959DT microphone or, in the case of students tested in Japan, in a classroom using a Sony MZ-N707 minidisc recorder with an ECM-T15 microphone. A subset of five minimal pairs was selected in which the sounds /l/ and /r/ appeared in initial and medial positions, as singletons and in clusters. The minimal pairs were: lake–rake, long–wrong, miller–mirror, blight–bright, complies–comprise. A total of 20 pre/post tokens per trainee was therefore included, yielding a total of 500 tokens for the 25 trainees. The tokens were normalized for level by equating their peak amplitudes.

4.2.1. Experimental tasks

Two independent perceptual evaluation tests were carried out in which native speakers of British English were asked to judge the tokens produced by the learners: a minimal-pair identification task and a quality rating task. For each test, the listeners evaluated the productions of all learners. They were asked to focus on the /l/–/r/ realizations in the word rather than on the correct pronunciation of the word itself. The experiment was run on a laptop and items were presented to both ears at a comfortable listening level via headphones. The identification test took about 45 min, the rating task about 55 min.

4.2.2. Minimal-pair identification task

Tokens were presented in a two-alternative forced-choice task. On each trial, the minimal pair appeared in writing on two buttons on the screen; listeners heard the token produced by a learner and indicated their choice of word by pressing the corresponding button. The order of the items was randomized across trials and across talkers.

4.2.3. Consonant rating task

Listeners were asked to judge the realizations of /l/ and /r/. The intended word appeared on the screen, followed by the auditory prompt. Listeners rated the /l/ or /r/ in the word on a scale from 1 to 7 (1 = bad, 7 = excellent). The order of presentation was randomized across trials and across talkers.

4.3. Results

4.3.1. Minimal pair identification test

For each of the 12 English listeners, mean production accuracy scores were calculated for /r/ and /l/ for each of the 25 Japanese learners. The percentage of /l/ and /r/ productions that were correctly identified by native listeners is shown in Table 2. The correct identification of /r/ tokens produced by the learners was generally much better than the identification of /l/ tokens, which is consistent with previous findings on the production of the /r/–/l/ contrast by Japanese learners (Bradlow et al., 1997). From the data means, it appeared that learners in the 'AV synthetic' group had markedly better pronunciation prior to training than the other groups, as /r/ productions for this group were almost perfectly identified (mean of 96% correct identification). Because of this ceiling effect for /r/ identification, we removed the data for the 'AV synthetic' learners before subsequent analysis and compared the A and 'AV natural' groups only.

Table 2
Means and standard deviations for the percentage of /l/ and /r/ consonants in words produced by L2 learners prior to and following training that were correctly identified by native English listeners (Experiment 3)

Training mode               /l/ pre      /l/ post     /r/ pre      /r/ post     Mean intelligibility pre / post
Auditory (N = 120)          54.8 (23.3)  59.8 (28.5)  86.0 (17.7)  83.5 (22.7)  70.4 (14.7) / 71.7 (12.7)
AV natural face (N = 120)   60.7 (30.2)  67.3 (24.6)  76.2 (24.5)  82.7 (24.5)  68.4 (17.4) / 75.0 (14.0)
AV synthetic face (N = 60)  60.7 (34.1)  65.3 (25.2)  96.0 (8.1)   96.0 (8.9)   78.3 (16.3) / 80.7 (14.0)

A repeated-measures ANOVA was carried out on the production accuracy scores for /r/ and /l/ obtained before and after training. Identification scores improved significantly post-training [F(1, 238) = 20.0; p < 0.001]. Evaluations of the effect of training mode were carried out on scores reflecting the pre- to post-training difference in production accuracy for /r/ and /l/, to normalize for differences in pre-training scores. The increase in scores was significantly greater for the tokens produced by the 'AV natural' training learners than for those produced by the 'A training' learners [F(1, 238) = 9.271; p < 0.005]. The within-group effect of consonant was not significant. A further analysis focused on individual differences between learners in terms of the effect of perceptual training on production. The change in mean production accuracy scores from pre- to post-training tokens produced by individual learners ranged from −11% to +20%. A univariate ANOVA showed that this effect was significant [F(19, 220) = 8.538; p < 0.001], suggesting that some learners improved their pronunciation significantly more than others.

4.3.2. Consonant rating task

For each of the 12 English listeners, consonant rating scores were calculated for /r/ and /l/ produced by each of the 20 learners from the A and AV (natural face) groups before and after training (see Table 3). The data obtained for the rating task closely mirror those obtained for the consonant identification task. Repeated-measures ANOVAs revealed that ratings were higher for the post-training tokens than for the pre-training tokens [F(1, 238) = 33.6; p < .001]. The effect of training mode was assessed from the difference in ratings before and after training. The effect of training mode was not significant, but there was a significant consonant by training interaction [F(1, 238) = 6.48; p < 0.02]. Post-hoc tests showed that /r/ ratings improved more for the AV training group than for the A training group, but that /l/ ratings did not.

Table 3
Consonant ratings on a scale of 1–7 (1 = bad, 7 = excellent) given by native English listeners for words produced prior to and following training by learners in the Auditory, AV (natural face) and AV (synthetic face) training groups (Experiment 3)

Training            /l/ pre    /l/ post   /r/ pre    /r/ post   Mean rating pre / post
Auditory            3.8 (1.1)  4.1 (1.2)  5.0 (0.9)  5.0 (1.1)  4.4 (0.8) / 4.6 (0.7)
AV natural face     4.2 (1.3)  4.4 (1.1)  4.5 (1.1)  4.9 (1.1)  4.4 (0.9) / 4.6 (0.8)
AV synthetic face   4.0 (1.4)  4.3 (1.1)  5.5 (0.8)  5.2 (0.9)  4.7 (0.9) / 4.8 (0.9)

4.4. Discussion of Experiment 3

The finding that perceptual training results in improvements to the pronunciation of the trained consonants in L2 learners is consistent with many other studies of perceptual training. A key finding of our study was that, even for this less visually-distinctive contrast, the use in perceptual training of audiovisual stimuli with a natural face led to a greater improvement in /l/–/r/ pronunciation than training with auditory stimuli, even though there was no difference between the training groups in terms of the effect of training on the perception of the /l/–/r/ contrast. Exposure to the articulatory gestures involved in the production of /l/ and /r/ therefore seems to have been effective in improving pronunciation, even without specific pronunciation training. This finding needs to be verified with a wider range of phonemic contrasts and talkers. The planned comparison between the effects of natural and synthetic articulatory gesture information was not achieved, owing to the small number of learners in the AV synthetic group and the ceiling effect in their pre-training production performance.

5.
General discussion

Previous auditory training studies had shown that learners could be successfully trained to redirect their attention to acoustic cues that they do not naturally attend to because these cues do not mark phonemic contrasts in their native language. This study aimed to establish whether learners could also be trained to redirect their attention to visual cues that mark segmental contrasts. This may be a more difficult task, at least for learners for whom visual cues are not particularly informative in the L1, as such listeners will tend to give greater weight to acoustic cues. For these learners, audiovisual training has to succeed both in attracting attention to the visual modality and in attracting attention to the specific visual cues marking the contrast. Results of Experiment 1 did show that learners could be trained to attend more to visual cues. Learners trained audiovisually improved their perception of the labial/labiodental place contrast in the visual condition significantly more than those trained auditorily, and showed greater improvements in all three conditions post-training. However, certain visual cues appear to be more difficult to acquire than others, as shown by the lack of an audiovisual training advantage in Experiment 2. For /l/–/r/, although there was a trend for a greater improvement in identification in the visual condition for those trained with natural audiovisual stimuli, this difference did not reach significance. Overall, the lack of an audiovisual effect for the /l/–/r/ contrast appears to contradict the results of Hardison (2003), who obtained greater improvements in /l/–/r/ perception for Japanese and Korean learners of English trained audiovisually.
However, these apparently contradictory results fit our general hypothesis, namely that the effectiveness of AV training depends on the visual distinctiveness of the contrast, which is itself likely to depend on a range of inter-related factors. First, it will depend on whether the two phonemes belong to the same or different visemes (Summerfield, 1983), but, in experiments of this kind, it will also be related to the clarity of articulation of the talkers used in training and testing (Gagné et al., 2002). It will further depend on the specific mapping of L2 phonemes onto L1 categories. Other factors may affect how well audiovisual training can increase sensitivity to visual cue information. Just as acoustic enhancement can attract listeners' attention to specific acoustic cues (e.g., Hazan and Simpson, 2000), the way in which the visual information is presented (e.g., full face versus lips/jaw only) may affect the degree to which learners' attention is attracted to the relevant visual cues. It must be noted, though, that previous studies did not find differences in word intelligibility for native listeners between these two types of visual presentation (Marassa and Lansing, 1995). Finally, learner-related factors are likely to be important. It is well known that individuals vary significantly in their lipreading skills (e.g., Demorest et al., 1996) and in their ability to integrate auditory and visual information (Grant and Seitz, 1998). In our study, learners varied quite significantly in their sensitivity to visual cues before training (Sennema et al., 2003). It was expected that a greater 'natural' awareness of visual cues would lead to greater success in audiovisual training, but this was not found to be the case in either Experiment 1 or Experiment 2. The reason for this is unclear, but it is possible that other learner-related factors, such as individual learning strategies, may be the cause.
Indeed, it is plausible that, for some learners at least, perceptual training is more successful when focused on a single modality, with the visual modality otherwise acting as a distractor. As mentioned above, the language background of the learner might also influence the relative effectiveness of audiovisual training, as learners from an L1 which is informative in terms of its visual distinctions may attend more to visual information (e.g., Sekiyama et al., 1996), even if the visemes being learnt in the L2 are not in their L1 inventory. Finally, the issue of whether learners are acquiring English as a second language (ESL) or English as a foreign language (EFL) could affect the effectiveness of training, as it will impact the degree of exposure to native English speakers, and hence to the visual cues provided by these speakers, as well as the likely motivation of the learners. It should be noted that Hardison's subjects were ESL learners based in the USA, whilst our learners were EFL learners living in Japan or on short visits to the UK. The fact that perceptual training can result in improvements in the pronunciation of the sounds being trained has important practical implications for computer-assisted language learning. Indeed, computer-based perceptual training programmes are more reliable than computer-based pronunciation training programmes, which need to provide accurate automatic ratings of the learner's productions. Contrary to what might be expected from speech perception studies with native talkers, audiovisual presentation of stimuli may not automatically produce gains in perception relative to auditory stimuli, especially for speech contrasts that may be visible but not highly salient.
However, further investigation is needed to establish whether training which further enhances visual cues may be more successful in attracting language learners' attention to this useful source of segmental information.

Acknowledgements

Funded by Grant GR/N11148 from the Engineering and Physical Sciences Research Council of Great Britain. We acknowledge Dr Marta Ortega-Llebaria's contribution to the development of the stimuli and test design for Experiment 1. We thank Prof. Ron Cole and colleagues for their help in developing the CSLU toolkit-based testing and training software. We also thank Prof. Masaki Taniguchi (Kochi University), Jo Thorp (London Bell School) and the UCL Language Centre for their substantial help in organizing the testing of Japanese students.

References

Aoyama, K., Flege, J.E., Guion, S.G., Akahane-Yamada, R., Yamada, T., 2004. Perceived phonetic dissimilarity and L2 speech learning: the case of Japanese /r/ and English /l/ and /r/. J. Phonetics 32, 233–250.
Best, C., 1995. A direct realist view of cross-language speech perception. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues. York Press, Baltimore, pp. 171–204.
Best, C., McRoberts, G., Goodell, E., 2001. Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological system. J. Acoust. Soc. Amer. 109, 775–794.
Bradlow, A., Pisoni, D., Akahane-Yamada, R., Tohkura, Y., 1997. Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. J. Acoust. Soc. Amer. 101, 2299–2310.
Cole, R., Massaro, D.W., de Villiers, J., Rundle, B., Shobaki, K., Wouters, J., Cohen, M., Beskow, J., Stone, P., Connors, Tarachow, A., Solcher, D., 1999. New tools for interactive speech and language training: using animated conversational agents in the classroom of profoundly deaf children. In: Proc.
ITRW Methods and Tools in Speech Science Education (MATISSE), London, pp. 45–52.
Demorest, M.E., Bernstein, L.E., DeHaven, G.P., 1996. Generalizability of speechreading performance on nonsense syllables, words, and sentences: subjects with normal hearing. J. Speech Hear. Res. 39, 697–713.
Flege, J.E., 1995. Second-language speech learning: theory, findings, and problems. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues. York Press, Baltimore, pp. 229–273.
Flege, J.E., 1998. Second-language learning: the role of subject and phonetic variables. In: Proc. StiLL ESCA Workshop, Stockholm, May 1998, pp. 1–8.
Flege, J.E., 1999. Age of learning and second-language speech. In: Birdsong, D.P. (Ed.), Second Language Acquisition and the Critical Period Hypothesis. Lawrence Erlbaum, Hillsdale, NJ, pp. 101–132.
Fowler, C., 1986. An event approach to the study of speech perception from a direct-realist perspective. J. Phonetics 14, 3–28.
Gagné, J.-P., Rochette, A.-J., Charest, M., 2002. Auditory, visual and audiovisual clear speech. Speech Comm. 37, 213–230.
Grant, K.W., Seitz, P.F., 1998. Measures of auditory–visual integration in nonsense syllables and sentences. J. Acoust. Soc. Amer. 104, 2438–2450.
Guion, S.G., Flege, J.E., Akahane-Yamada, R., Pruitt, J.C., 2000. An investigation of current models of second language speech perception: the case of Japanese adults' perception of English consonants. J. Acoust. Soc. Amer. 107, 2711–2725.
Hardison, D., 1999. Bimodal speech perception by native and nonnative talkers of English: factors influencing the McGurk effect. Lang. Learn. 49, 213–283.
Hardison, D., 2003. Acquisition of second-language speech: effects of visual cues, context and talker variability. Appl. Psycholinguist. 24, 495–522.
Hazan, V., Barrett, S., 2000. The development of phonemic categorisation in children aged 6–12. J.
Phonetics 28, 377–396.
Hazan, V., Simpson, A., 2000. The effect of cue-enhancement on consonant intelligibility in noise: talker and listener effects. Lang. Speech 43, 273–294.
Ingram, J.C.L., Park, S.-G., 1998. Language, context, and speaker effects in the identification and discrimination of English /r/ and /l/ by Japanese and Korean listeners. J. Acoust. Soc. Amer. 103, 1161–1174.
Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., Siebert, C., 2003. A perceptual interference account of acquisition difficulties for nonnative phonemes. Cognition 87, B47–B57.
Kuhl, P.K., 1993. Early linguistic experience and phonetic perception: implications for theories of developmental speech perception. J. Phonetics 21, 125–139.
Kuhl, P.K., Meltzoff, A.N., 1982. The bimodal perception of speech in infancy. Science 218, 1138–1141.
Lambacher, S., Martens, W., Kakehi, K., 2002. The influence of identification training on identification and production of the American English mid and low vowels by native talkers of Japanese. In: Proc. 7th Internat. Conf. on Spoken Lang. Proc., Denver, pp. 245–248.
Lenneberg, E.H., 1967. Biological Foundations of Language. John Wiley and Sons Inc., New York.
Lively, S., Logan, J., Pisoni, D., 1993. Training Japanese listeners to identify English /r/ and /l/. II: the role of phonetic environments and talker variability in learning new perceptual categories. J. Acoust. Soc. Amer. 94, 1242–1255.
Lively, S.E., Logan, J.S., Pisoni, D.B., 1994. Training Japanese listeners to identify English /r/ and /l/. III: long-term retention of new phonetic categories. J. Acoust. Soc. Amer. 96, 2076–2087.
Logan, J.S., Lively, S.E., Pisoni, D.B., 1991. Training Japanese listeners to identify English /r/ and /l/: a first report. J. Acoust. Soc. Amer. 89, 874–886.
Lotto, A.J., Sato, M., Diehl, R.L., 2004. Mapping the task for the second language learner: the case for Japanese acquisition of /r/ and /l/. In: J. Slifka, S. Manuel, M.
Matthies (Eds.), From Sound to Sense: 50+ Years of Discoveries in Speech Communication, pp. C381–C386.
Marassa, L.K., Lansing, C.R., 1995. Visual word recognition in 2 facial motion conditions—full face versus lips-plus-mandible. J. Speech Hear. Res. 38, 1387–1394.
Massaro, D.W., 1998. Perceiving Talking Faces: from Speech Perception to a Behavioral Principle. MIT Press, Cambridge, MA.
Massaro, D.W., Cole, R., 2000. From Speech is special to talking heads in language learning. In: Proc. INSTIL, Dundee, Scotland, pp. 153–161.
Massaro, D.W., Light, J., 2003. Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/. In: Proc. Eurospeech 2003, Geneva, Switzerland, pp. 2249–2252.
Massaro, D.W., Light, J., 2004. Using visible speech to train perception and production of speech for individuals with hearing loss. J. Speech Lang. Hear. Res. 47, 304–320.
Massaro, D.W., Thompson, L.A., Barron, B., Laren, E., 1986. Developmental changes in visual and auditory contribution to speech perception. J. Exp. Child Psychol. 41, 93–113.
Mayo, L.H., Florentine, M., Buus, S., 1997. Age of second-language acquisition and perception of speech in noise. J. Speech Lang. Hear. Res. 40, 686–693.
Mayo, C., Turk, A., 2004. Adult–child differences in acoustic cue weighting are influenced by segmental context: children are not always perceptually biased toward transitions. J. Acoust. Soc. Amer. 115, 3184–3194.
McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature 264, 746–748.
Nittrouer, S., Miller, M.E., 1997. Predicting developmental shifts in perceptual weighting schemes. J. Acoust. Soc. Amer. 101, 2253–2266.
Ortega-Llebaria, M., Faulkner, A., Hazan, V., 2001. Auditory-visual L2 speech perception: effects of visual cues and acoustic–phonetic context for Spanish learners of English. In: Proc. AVSP-2001, pp. 149–154.
Ouni, S., Massaro, D.W., Cohen, M.M., Young, K., Jesse, A., 2003. Internationalization of a talking head. In: Proc.
ICPhS, Barcelona, Spain, August 2003, pp. 2569–2572.
Sekiyama, K., Tohkura, Y., Umeda, M., 1996. A few factors which affect the degree of incorporating lip-read information into speech perception. In: Proc. ICSLP 1996, Denver, USA, pp. 1481–1484.
Sekiyama, K., Tohkura, Y., 1993. Inter-language differences in the influence of visual cues in speech perception. J. Phonetics 21, 427–444.
Sekiyama, K., 1997. Cultural and linguistic factors in audiovisual speech processing: the McGurk effect in Chinese subjects. Percept. Psychophys. 59, 73–80.
Sekiyama, K., Burnham, D., Tam, H., Erdener, D., 2003. Auditory-visual speech perception development in Japanese and English talkers. In: Proc. AVSP 2003, St. Jorioz, France, September 2003, pp. 43–47.
Sennema, A., Hazan, V., Faulkner, A., 2003. The role of visual cues in L2 consonant perception. In: Proc. 15th ICPhS, Barcelona, Spain, August 2003, pp. 135–138.
Siciliano, C., Faulkner, A., Williams, G., 2003. Lipreadability of a synthetic talking face in normal hearing and hearing-impaired listeners. In: Proc. AVSP 2003, St. Jorioz, France, September 2003, pp. 205–208.
Sumby, W.H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Amer. 26, 212–215.
Summerfield, Q., 1983. Audio-visual speech perception, lipreading and artificial stimulation. In: Lutman, M.E., Haggard, M.P. (Eds.), Hearing Science and Hearing Disorders. Academic Press, London, pp. 131–182.
Takata, Y., Nabelek, A.K., 1990. English consonant recognition in noise and in reverberation by Japanese and American listeners. J. Acoust. Soc. Amer. 88, 663–666.
Tyler, M.D., 2001. Resource consumption as a function of topic knowledge in nonnative and native comprehension. Lang. Learn. 51, 257–280.
Wang, Y., Spence, M.M., Jongman, A., Sereno, J.A., 1999. Training American listeners to perceive Mandarin tones. J. Acoust. Soc. Amer. 106, 3649–3658.
Wang, Y., Jongman, A., Sereno, J.A., 2003. Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. J. Acoust. Soc. Amer. 113, 1033–1043.
Werker, J.F., Tees, R.C., 1984. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behavior Develop. 7, 47–63.
Yamada, R.A., 1995. Age and acquisition of second language speech sounds: perception of American English /r/ and /l/ by native talkers of Japanese. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues. York Press, Baltimore, pp. 305–320.