Using a Computer in Foreign Language Pronunciation Training:
What Advantages?

Maxine Eskenazi
Carnegie Mellon University
ABSTRACT
This paper looks at how speech-interactive CALL can help the classroom teacher carry out recommendations from immersion-based approaches to language instruction. Emerging methods for pronunciation tutoring are demonstrated from Carnegie Mellon University's FLUENCY project, addressing not only phone articulation but also speech prosody, responsible for the intonation and rhythm of utterances. New techniques are suggested for eliciting freely constructed yet specifically targeted utterances in speech-interactive CALL. In addition, pilot experiments are reported that demonstrate new methods for detecting and correcting errors by mining the speech signal for information about learners' deviations from native speakers' pronunciation.
KEYWORDS
Speech Recognition, Pronunciation, CALL, Immersion, Direct Approach,
Phonemes, Prosody, Pitch, Intensity, Duration
INTRODUCTION
The ever growing speed and memory of commercially available computers, coupled with decreasing prices, are making feasible the idea of creating computer-assisted language learning (CALL) that is speech-interactive. Even though the hardware conditions for an ideal automatic training system exist, can the same be said of state-of-the-art automatic speech recognition (ASR) and of our knowledge of the variability of the speech signal, the main stumbling block to higher quality speech recognition? Has the technology come far enough for systems to be able to teach pronunciation effectively?
© 1999 CALICO Journal
Volume 16 Number 3
447
To answer these questions, we will first specify what is believed to contribute to successful language learning under a direct approach, drawing
largely from principles described by Celce Murcia and Goodwin (1991).
We will then list pedagogical recommendations following from this approach, such as providing language samples from many different speakers. Next, we will look at what speech-interactive CALL can do to help the
classroom teacher carry out these recommendations. We illustrate with
emerging methods for pronunciation tutoring from the FLUENCY project
at Carnegie Mellon University (CMU) (Eskenazi, 1996), methods that support both articulation of phonemes and use of prosody—the intonation
and rhythm of speech. The emphasis here is on pronunciation in the context of overall language learning. Proficient pronunciation is essential to
language learning because below a certain level of pronunciation, even if
grammar and vocabulary have been mastered, communication obviously
cannot take place.
WHAT CONTRIBUTES TO SUCCESS IN TARGET LANGUAGE
PRONUNCIATION?
Conditions for Success and Pedagogical Recommendations Based on Immersion
Many foreign language instructors agree (Celce Murcia & Goodwin,
1991) that living in a country where the target language is spoken is the
best way to become fluent—a total immersion situation. They also generally agree (Kenworthy, 1987; Laroy, 1995; Richards & Rodgers, 1986) on
which conditions of living abroad are critical to effective language learning:
• Learners hear large quantities of speech.
• Learners hear many different native speakers.
• Learners produce large quantities of utterances on their own.
• Learners receive pertinent feedback.
• The context in which the language is practiced has significance.
These conditions cover the external environment of language learning.
From each we can extract recommendations for how to learn language
under less than total immersion conditions.1 These recommendations cannot always be carried out in classroom contexts, thus presenting opportunity and motivation for ASR technology to complement teaching.
Recommendation 1
Learners hear large quantities of speech. For language learners
who are not living in the country of the target language, immersion courses consisting of six to eight hours daily are often the
best alternative for exposing learners to the language. An ideal
ratio of one student-one teacher would provide maximum speaking and feedback time. This situation is not always feasible. On
the one hand, most students have other daily activities and, on
the other, employing human teachers for eight hours a day is expensive (Bernstein, 1994). Moreover, immersion classes usually
have five to ten students, and attending to individual needs reduces the amount of time the teacher speaks to the class.
Recommendation 2
Learners hear many different native speakers. This recommendation implies employing many native teachers with a diversity of
voice types and dialects. However, the variety of native speakers
available locally is limited, as is the number of people that a school
can afford to hire. Traditional educational materials that promote
wider exposure, such as audio and video cassettes, tend to be
non-interactive, and their audio quality can degrade over time.
Recommendation 3
Learners produce large quantities of utterances on their own.
Ideally, the student is in a one-on-one setting where the teacher
encourages short conversations, constantly eliciting the student’s
speech. In reality, students in the classroom share the teacher’s
attention. The amount of time they spend individually producing
speech and participating in conversation is thus reduced.
Recommendation 4
Learners receive pertinent feedback. In immersion contexts feedback that leads to correction of form or content may occur in two
ways. Implicit feedback comes when speaker and listener realize
that the message did not get across. A clarification dialogue usually takes place (“I beg your pardon?” “What did you say?”), ending with a corrected message that is understood. Less often, when
culture and interpersonal context permit, the listener offers explicit correction, such as pointing out the error or repeating what
the speaker said but with correction. In the ideal classroom, teachers offer implicit and explicit feedback at just the right times,
keeping a balance between not intervening too often, to avoid
discouraging the student, and intervening often enough to keep
an error from becoming a hard-to-break habit. Expert teachers
adapt the pace of correction—how often they intervene—to fit
the student’s personality. In reality, however, not all teachers use
the same techniques and, in the classroom, are not always able to
adapt these techniques to individuals. When class size increases,
the amount of feedback to the individual student decreases.
Recommendation 5
The context in which the language is practiced has significance.
Living in the country where the target language is spoken gives
learners the practical need to speak. Their utterances have immediate significance. To accommodate this recommendation, the
ideal language classroom includes fast-paced games and everyday conversations that create meaningful contexts (Bowen, 1975;
Brumfit, 1984; Crookall & Carpenter, 1990). The student has to
respond rapidly and utter new terms in these contexts. In reality,
classroom size again reduces the individual learner’s time for participating in such activities.
Conditions for Success and Pedagogical Recommendations Based on Structured Intervention
There are two additional conditions that appear critical for learning pronunciation but that do not follow from immersion—indeed, they follow
from an assumption of structured intervention that departs from pure
immersion: 1) Learners feel at ease in the language learning situation.
Whereas the very young language learner perceives and tries out new sounds
easily, older learners lose this ability. Embarrassment or fear may inhibit
the learner from trying new sounds or even from speaking, whether in a
total immersion or a classroom environment (Laroy, 1995). 2) There is
ongoing assessment of learners’ progress. Language learning appears most
efficient when the teacher constantly monitors progress to guide appropriate remediation or advancement.
These conditions lead to pedagogical recommendations that may be particularly hard to carry out in the classroom.
Recommendation 6
Learners feel at ease. A key dimension of the learner’s “internal”
environment is self-confidence and motivation. Although there
are techniques to boost student confidence in the classroom
(Laroy, 1995; Krashen, 1982)—such as correcting only when
necessary, reinforcing good pronunciation, and avoiding negative
feedback—these may not overcome learners’ inhibitions. Laroy
(1995) finds that when students are asked in front of peers to
make sounds that do not exist in their native language, these students tend to feel ill at ease. As a result, they may stop trying
completely or may only make sounds from their native language.
One-on-one teaching is important at this point, allowing students
to “perform” in front of the teacher alone, not in front of a whole
class, until they are comfortable with the newly learned sounds.
In reality, there is often little time for such one-on-one sessions.
When correction must keep pace with a whole class, the poorer
and less confident speakers suffer.
Recommendation 7
There is ongoing assessment. To adapt training to individual needs,
the teacher ideally monitors each student’s moment-by-moment
progress, assessing strong and weak points, and judges where to
focus effort next. The effective teacher takes into account what
the student feels is useful, thus keeping students involved in their
own progress (Celce Murcia & Goodwin, 1991; Laroy, 1995). In
reality, classroom teachers cannot maintain steady monitoring of
each student at this level of detail.
WHERE CAN SPEECH-INTERACTIVE CALL MAKE A CONTRIBUTION?
It is not feasible to carry out these seven recommendations fully in the
traditional language classroom, given constraints on teaching time and
materials. The ideal CALL system could help toward realizing these recommendations by providing individualized practice and feedback in a safe
environment and sending back regular progress reports to the teacher
(Wyatt, 1988). The human teacher must still do the high-level, subtle work
of creating a positive atmosphere for the production of new sounds and
stress patterns, explaining fine conceptual differences between a student’s
native language and the target language, and exploring cultural differences (Bernstein, 1994).
For each of our recommendations we will consider where automatic
functions, in the form of both ASR and CD-ROM, can support the classroom. We draw examples from the FLUENCY project and from other
systems featured in this volume.
CALL Can Help Learners Hear Large Quantities of Speech
With the decreasing cost and increasing capacity of computer memory
and storage, CALL can offer users a choice of many prerecorded utterances. CD-ROMs afford high-quality sound and video clips of speakers,
giving learners a chance to see articulatory movements used in producing
new sounds (e.g., LaRocca, 1994). The teacher no longer has to find or
record native speakers, although tools can be provided for teachers to add
new speakers to the data set. The highly available digitized speech supplements the teacher’s speech without incurring additional cost at each use.
It also allows individualized access to particular samples of speech.
CALL Can Help Learners Hear Many Different Native Speakers
Increased memory enables presentation of a variety of different speakers from different regions and dialects. Different speakers can be sampled
to find one “golden” voice that the learner would like to imitate. The choice
can, for example, center on finding a voice that has characteristics closest
to the learner’s, as suggested by Wachowicz and Scott (this issue). Speakers’ voices can be sped up or slowed down if students wish. Many speakers can be made to repeat utterances over and over.
The Learn to Speak Spanish course (Duncan, Bruno, & Rice, 1995)
takes advantage of CD-ROM storage to present speech utterances from a
variety of speakers. Videos of different speakers pop up as the course
exercises go along. Each utterance can be heard as many times as desired
although for a given sentence, only one native speaker is available. Rypa
and Price (this issue) demonstrate advances for exploiting recorded speech
from a variety of speakers in the service of listening practice.
ASR-Based CALL Can Help Learners Produce Large Quantities of Utterances on Their Own
LIMITATIONS OF TRADITIONAL ASR-BASED CALL
A major problem in speech-interactive CALL, in commercial products
especially, is that learners remain relatively passive (Wachowicz & Scott,
this issue). Although learners may be asked to voice an answer to a question, this by design involves either parroting an utterance just presented
or reading one of a small set of written choices (Bernstein, 1994; Bernstein
& Franco, 1995). Learners get no practice in constructing their own utterances (i.e., choosing vocabulary and assembling syntax). The commercially available AuraLang package (Auralog, 1995), for example, is an
appealing language teaching system that feeds to ASR the user’s pronunciation of one of three written sentences. Each choice leads the dialogue
along a different path. A certain degree of realism is attained, but students
do not actively construct utterances.
Constructing an utterance means putting it together at many levels; the
syntax and lexicon are being readied at the same time as pronunciation.
Readying pronunciation alone (as in minimal pair exercises) is only one
step toward the end goal of being able to participate actively in a conversation. Current speech-interactive language tutors do not let learners freely
create utterances because current ASR requires a high degree of predictability to recognize reliably what is said. CALL developers look for ways
to palliate imperfect recognition for two reasons: so that the system does
not often interrupt students to tell them they were wrong when, in fact,
they were right, and so that errors are not overlooked and allowed to go
uncorrected.
TECHNIQUES FOR EXTENDING THE LIMITATIONS: SENTENCE ELICITATION
The FLUENCY project has developed a technique that enables users of
speech-interactive CALL to participate more actively in constructing utterances (Eskenazi, 1996). In traditional speech-interactive CALL, ASR
works well because the system “knows” what a speaker will say and matches
exemplars of the phones it expects (pre-stored in memory) against the
incoming signal (what was actually said). The technique developed in
FLUENCY, by contrast, makes it possible to predict enough of what the
speaker will say to satisfy the needs of the recognizer while giving speakers apparent freedom to construct utterances on their own. The technique
is based on sentence elicitation, modeled on the drills used in the once
prevalent Audio-Lingual Method (Modern Language Materials, 1964) and
the British Broadcasting Company tutorials (Allen, 1968).
Several studies have addressed whether specifically targeted speech data
can be collected using sentence elicitation (Hansen, Novick, & Sutton,
1996; Isard & Eskenazi, 1991; Pean, Williams, & Eskenazi, 1993). Results confirm that a given prompt sentence in a carefully constructed exercise elicits at most one to three distinct response sentences from normal
speakers (if speakers are cooperative and follow the examples given). Below is a sample exercise from the FLUENCY project designed for automatic tutoring of sentence structure and prosody:
System:  When did you meet her? (yesterday) - I met her yesterday.
         When did you find it?
Student: I found it yesterday.
System:  Last Thursday.
Student: I found it last Thursday.
System:  When did they find it?
Student: They found it last Thursday.
System:  When did they introduce him?
Student: They introduced him last Thursday.
The exercise screen is free of written prompts and practice is completely
oral, with students doing the sentence construction work themselves. The
technique provides large amounts of fast-moving practice, making students active rather than passive speakers. Later, when they need to build
an utterance during a real conversation, they will have acquired some of
the necessary speaking experience and automatic reflexes. The goal is to
enable them to speak rapidly, in pace with the conversation.
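The transformation drill above can be sketched in code. The helper below is a hypothetical illustration, not part of FLUENCY: it builds the single expected response for each "When did SUBJECT VERB OBJECT?" prompt plus a time cue, which is exactly the predictability the recognizer relies on.

```python
# A minimal sketch (hypothetical, not the FLUENCY implementation) of the
# transformation drill: each "When did SUBJECT VERB OBJECT?" prompt plus
# a time cue yields exactly one expected response.
import re

# Past-tense forms for the drill's verbs (hand-listed for this sketch).
PAST = {"meet": "met", "find": "found", "introduce": "introduced"}

def expected_response(prompt, time_cue):
    """Build the single expected answer to a 'When did ... ?' prompt."""
    m = re.match(r"When did (\w+) (\w+) (\w+)\?", prompt)
    if not m:
        raise ValueError("prompt does not match the drill pattern")
    subject, verb, obj = m.groups()
    # First-person response to a second-person question.
    subject = "I" if subject == "you" else subject.capitalize()
    return f"{subject} {PAST[verb]} {obj} {time_cue}."
```

Because the drill constrains both structure and vocabulary, the system can enumerate the small set of acceptable responses in advance and hand them to the recognizer.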
A database called ICY was created to study speakers’ strategies when
changing speaking styles (Isard & Eskenazi, 1991; Eskenazi, 1992). In
studies that led up to ICY, a set of specific, pretargeted syntactic structures in French, British English, and German was successfully elicited
without telling students ahead of time what to say. In the French studies,
using both oral sentence prompts and nonambiguous visual cues for elicitation, we succeeded in provoking the chosen structures over 85% of the
time (varying from 70% for one sentence to 100% for several sentences).
In vocabulary choice, too, a careful search for non-ambiguous target nouns
and adjectives with few synonyms yielded a highly predictable answer over
95% of the time. For example, we targeted manche bleue ‘blue sleeve’
and succeeded in eliciting that structure (noun followed by adjective) as
opposed to (a) noun followed by verb and adjective (manche est bleue
‘sleeve is blue’) or (b) noun followed by preposition, article, noun, verb,
and adjective (manche de la robe est bleue ‘sleeve of the dress is blue’).
These studies can be considered pretests in the context of FLUENCY. They
allow us to forecast which elicitation sentences and visual cues will evoke
the most predictable responses and to adopt these sentences and cues for
ASR-based tutoring exercises.
In FLUENCY, given that we can predict what will be said, we can use
the method of “forced alignment” in ASR. In other words, we can automatically align the predicted text to the incoming speech signal, as is done
in systems that impose multiple choice responses. Once the recognition
results are obtained, the system can correct pronunciation errors immediately, breaking into the rhythm of the exercise, or it can hold correction
until the end of the exercise. We have observed that waiting until the end
of the exercise ensures a higher level of success in elicitation; however,
our design will allow teachers to intervene earlier if a student’s level and
personality warrant it.
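As a rough illustration of what forced alignment computes, the toy dynamic-programming sketch below (not the actual SPHINX II algorithm, and with invented frame scores standing in for acoustic model likelihoods) assigns each expected phone a contiguous span of signal frames so that the total frame-by-phone score is maximized:

```python
# Toy forced alignment: given per-frame scores for each candidate phone
# and the phone sequence known in advance, find the segmentation that
# maximizes the total score. Real recognizers do this with HMMs; this
# is only a sketch of the idea.

def forced_align(frame_scores, phones):
    """frame_scores: list over frames of {phone: score} dicts.
    phones: the expected phone sequence (known in advance).
    Returns [(phone, first_frame, last_frame), ...]."""
    T, N = len(frame_scores), len(phones)
    NEG = float("-inf")
    # best[t][i]: best total score for frames 0..t ending in phone i
    best = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, T):
        for i in range(N):
            stay = best[t - 1][i]                       # remain in phone i
            adv = best[t - 1][i - 1] if i > 0 else NEG  # advance from i-1
            prev, prev_i = (stay, i) if stay >= adv else (adv, i - 1)
            best[t][i] = prev + frame_scores[t][phones[i]]
            back[t][i] = prev_i
    # Trace back the phone boundaries.
    spans, i, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        prev_i = back[t][i]
        if prev_i != i:
            spans.append((phones[i], t, end))
            end, i = t - 1, prev_i
    spans.append((phones[0], 0, end))
    return spans[::-1]
```

The per-phone spans (and scores) this produces are what the tutoring system then compares against native-speaker values to locate errors.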
Students can practice constructing answers to the same elicitation sentences as often as they wish, at no additional cost in teacher time and
materials. Availability and patience are other qualities that enable the system to support our recommendation of having learners produce large quantities of utterances on their own.
ASR-Based CALL Can Provide Learners With Pertinent Corrective Feedback
Teachers often ask what type of corrective feedback speech recognition
can furnish. This section will address two aspects of the question: whether
and what types of errors can be detected successfully, and what methods
are effective in telling students about errors and showing them how to
make corrections.
CAN ERRORS BE DETECTED? PHONE ERRORS VERSUS PROSODY ERRORS
Language learners make pronunciation errors of two types: those involving the articulation of phones (phonemes) and those involving the use
of prosody. Prosody is represented by three distinct components in the
acoustic signal: (a) fundamental frequency (pitch), (b) duration (speaking
rate and timing), (c) intensity (amplitude or loudness). These components
underlie the rhythm and intonation of speech. Phone correction is important during the first year of language study because proper articulatory
habits enhance the intelligibility of students’ speech. But intelligible speech
does not rest solely on correct phones. After the first year of study, pronunciation correction typically shifts to prosody. Appropriate prosody
guides the flow of speech in a way that improves intelligibility even when
phone targets are not reached (Celce Murcia & Goodwin, 1991). As discussed below, the two types of pronunciation errors differ in origin, and
their detection and correction imply different procedures.
Learners make phone errors because the number and nature of phonemes differ between native language (L1) and target language (L2) or
because the acceptable pronunciation space of a given phone may differ
between L1 and L2. In prosody, by contrast, the components are the same
in all languages; speakers vary fundamental frequency, duration, and intensity along the same dimensions. However, the relative importance of
each component, the meanings linked to each, and how they vary may
differ from language to language. For example, variations of intensity are
used much less often and with less contrast in French than in Spanish.
Error detection procedures differ as follows. Phone-based errors are
identified in forced alignment mode. Given an expected utterance, the
recognizer takes the actual utterance and returns the placement in time of
phones and words on the speech signal. By this method the learner’s recognition scores can be compared to the mean recognition scores for native
speakers—all uttering the same sentence in the same speaking style—and
the learner’s errors can thereby be identified and located (Bernstein &
Franco, 1995). For prosodic errors, however, only duration can be obtained from the output of the recognizer. That is, when the recognizer
returns the phones and their scores, it can also return the duration of the
phones. Frequency and intensity, on the other hand, are measured on the
speech signal before it is sent to the recognizer but after it is preprocessed.
Intensity is usually obtained by using a technique known as cepstral analysis. Fundamental frequency is obtained from an algorithm that detects
peaks in the signal and measures the distance between them. Speakers as
individuals vary greatly on the three components of prosody. For example,
some people speak louder or faster in general than do others. Thus, it is
important that measures of the three be expressed in relative terms, such
as the duration of one syllable compared to the next.
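The comparison described here, a learner's per-phone score against the cluster of native-speaker scores for the same phone in the same sentence, can be sketched as a simple outlier test. The data layout, threshold, and score values below are illustrative assumptions, not the FLUENCY implementation:

```python
# Sketch: flag a learner's phones whose recognition scores fall far
# from the cluster of native-speaker scores for the same phone.
from statistics import mean, stdev

def flag_phone_outliers(native_scores, learner_scores, n_sd=2.0):
    """native_scores: {phone: [score, ...]} from native recordings.
    learner_scores: {phone: score} for one learner utterance.
    Returns phones whose learner score lies more than n_sd standard
    deviations from the native mean."""
    flagged = []
    for phone, score in learner_scores.items():
        ref = native_scores.get(phone)
        if not ref or len(ref) < 2:
            continue  # not enough native data to define a cluster
        mu, sd = mean(ref), stdev(ref)
        if sd > 0 and abs(score - mu) > n_sd * sd:
            flagged.append(phone)
    return flagged
```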
PHONE ERROR DETECTION: A PILOT STUDY OF ASR-BASED COMPARISONS OF NATIVE AND NONNATIVE SPEAKERS
Although researchers have been cautious about using ASR to pinpoint
phone errors, recent work in the FLUENCY project shows that the recognizer can be used in this task if the context is well chosen (Eskenazi, 1996).
Demonstrating this is a pilot study of native and nonnative speakers uttering responses in elicitation exercises.
METHOD
Ten native speakers of American English (5 male and 5 female) and 20 speakers of other languages were recorded (one male and one female from each of the following L1s: French, German, Hebrew, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, Spanish).2 Expert language teachers were asked to listen to the sentences recorded by each speaker and to
judge where there was an error, what it was, and how (and when) they
would intervene to correct it. Teachers marked these judgments on
phonemically labeled copies of the target sentences. The agreement between human teachers and ASR detection was used as a preliminary indication of the validity of automatic error detection.
Figure 1
SPHINX II Recognition Scores for Native and Nonnative Speakers

[Figure omitted in this version: line plot of normalized phone scores per speaker across the phones of the target utterance.]

Note. The recognition scores for native speakers are represented by solid lines and for nonnative speakers by dotted lines for each phone in the utterance, "I did want to have extra insurance." The horizontal axis shows the sequence of phones in the target utterance using CMU phone notation. Individual speakers are referenced by four-letter labels in which the first letter indicates gender (m or f) and the second letter indicates language of origin: French, German, Hebrew, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, Spanish.
RESULTS
Figure 1 shows the recognition results for native and nonnative male
speakers when the speech was processed by CMU’s SPHINX II automatic
speech recognizer (Ravishankar, 1996) in forced alignment mode. The
sentence recorded was, “I did want to have extra insurance,” elicited from
“You didn’t want to have extra insurance!” Phones for this utterance are
listed in sequence on the horizontal axis using CMU phone notation (e.g.,
the first phone /AY/ represents American English pronunciation of “I”).
Native speakers’ values are traced with solid lines, nonnatives with dotted
lines. The speaker labels indicate gender and language of origin for each
speaker. (The following phonological variants, common to native speakers of English, were taken into account: for /TD/ of “want” and /AXR/ of
“insurance,” not all speakers show values since “want” can be pronounced
/W AA N/ in this context (geminate); “sur” of “insurance” can be /SH
AXR/ or / SH AO R/.)
The vertical axis in Figure 1 represents the normalization of the phone
score given by SPHINX II over the total duration of the phone. As past
experience has shown, information based on absolute thresholds of phone
scores is of little use. However, a nonnative’s input can have a score (for a
given phone in a given context) that is noticeably distant from the cluster
of scores for native speakers in the same context, thus defining a deviation
from normal pronunciation. By this index, Figure 1 shows that the scores
of nonnatives and natives sometimes coincide. That is, nonnatives sometimes sound like natives, especially when the phone in L2 is similar to the
one in L1. In other cases, nonnatives are consistent outliers, for example,
in the case of /DD/ (final stop in “did”). Underlying the case of /DD/ is
the fact that natives did not release the stop whereas nonnatives did. Failure to release final stops contributes to perceived accent although it is not
a feature that teachers noted as needing correction. This feature can be
considered a minor deviation, one that causes listeners to “hear an accent” but not to misunderstand what was said.
The phone scores also indicate noticeable distance between natives and
certain of the nonnatives for /HH AE V/ (“have”). These nonnatives tend
to say /EH/ or /EHF/ instead of /AEV/ because their L1 does not contain
the /AE/ sound. For /IH/ of “insurance,” the German speaker’s score is
very far from the rest. For these two examples, the teachers noted the
same outliers as indicated by the phone scores.
Therefore, for this small sample of nonnative speakers, diverse in terms
of L1, SPHINX II confirms human observations of incorrectly pronounced
phones and does so independently of the speaker’s L1. More work is now
being done to validate these measures over a larger population of speakers and utterances. The potential ability of ASR to spot outliers could
serve to guide ASR-based CALL in deciding which phones need training.3
PROSODY ERROR DETECTION: FREQUENCY, DURATION, INTENSITY
In speech-interactive CALL we posit that correcting prosody is at least
as important as correcting phones. When listening to a foreign speaker, it
is not uncommon to hear a sentence with correct phones and syntax that
is hard to understand because of prosody errors. Yet we also hear sentences with correct prosody and faulty phones or syntax that we understand perfectly well.
Automatic detection of prosodic features is starting to be used successfully in speech-interactive CALL. The SPELL foreign language teaching
system (Rooney, Hiller, Laver, & Jack, 1992) addresses both fundamental
frequency, or pitch, and duration. Pitch detection, like speech recognition, is by no means a perfected technique. But Bagshaw, Hiller, & Jack’s
(1993) work on better pitch detectors for SPELL shows that algorithms
can be made more precise within a specific application. This work compared the student’s pitch contours to those of native speakers to demonstrate the informativeness of pitch detection. Pitch detection was incorporated into SPELL and the output interpreted in visual and auditory feedback for the student. SPELL assumes that suprasegmental (prosodic) aspects of speech should be tied to segmental (phonemic) information—for
example, by showing pitch trajectories (contours over segments) and pitch
anchor points (centers of stressed vowels). SPELL also addresses speech
rhythm, showing segmental duration and acoustic features of vowel quality (predicting strong vs. weak vowels).
Tajima, Port, and Dalby (1994) and Tajima, Dalby, and Port (1996)
have addressed duration. They studied how timing changes in speech affect the intelligibility of nonnative speakers and created remedial training
supported by ASR. By using speech that is practically devoid of segmental
content (ma ma ma ...), they separate the segmental and suprasegmental
aspects of the speech signal to focus on one aspect—temporal pattern
training.
The FLUENCY project has looked at how to detect changes in duration,
pitch, and intensity to find where a nonnative speaker deviates from acceptable native values. Prosody training in FLUENCY is linked to segmental aspects, with students producing meaningful phones. We aim to
detect deviations independently of L1 and L2 so that if a learning system
is ported to a new target language, its prosody detection does not have to
be changed fundamentally. We have promising results from a pilot study,
reported below, using hand-labeled features of the spectrogram.
PROSODY ERROR DETECTION: A PILOT STUDY OF ASR-BASED COMPARISONS OF NATIVE AND NONNATIVE SPEAKERS
METHOD
For the English sentence data recorded in the pilot study on phones, we
additionally asked human teachers to mark the location and type of prosodic
errors of each speaker on transcriptions of the sentences. We first examined the speech signal to determine whether the information used by teachers to detect errors could be characterized in the spectrogram. After examining phone-, syllable-, and word-sized segments, we developed three
measures, one for each component of prosody. We compared these with
human teachers’ judgments of places where prosody needed improvement
in each sentence and refined the measures until they showed close agreement with human judgments. These measures then define the features we
want to extract automatically from the speech signal to diagnose where
students need improvement.
DURATION RESULTS
The first measure was duration of the speech signal, measured on the
waveform. The results of the duration comparisons are given in Figure 2.
The duration of one voiced segment was compared to the duration of the
preceding one (“ratio of seg1/seg2” on the vertical axis) to make the observations independent of individual variations in speaking rate. Note that
a voiced segment starts at the onset of voicing after silence or after an
unvoiced consonant; it ends when voicing stops at the onset of silence or
at the onset of an unvoiced consonant (independently of the number or
nature of phones the voiced segment contains).
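The duration measure just described can be sketched directly. Segment boundaries (in seconds) are assumed to come from forced alignment; the labels and times below are invented for illustration:

```python
# Sketch of the duration measure: the duration of each voiced segment
# is compared to the preceding one, so the result is a ratio that is
# independent of the speaker's overall speaking rate.

def duration_ratios(segments):
    """segments: ordered list of (label, start_sec, end_sec) voiced
    segments. Returns [(label_pair, ratio_of_seg1_to_seg2), ...] for
    each pair of neighboring segments."""
    ratios = []
    for (lab1, s1, e1), (lab2, s2, e2) in zip(segments, segments[1:]):
        d1, d2 = e1 - s1, e2 - s2
        ratios.append((f"{lab1}/{lab2}", d1 / d2))
    return ratios
```

A learner whose ratio for a given segment pair lies far from the native cluster (as with /EHK/ versus /STRA/ in Figure 2) would be flagged for duration training.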
Figure 2
Two-by-Two Comparison of Duration of Voiced Segments

[Figure omitted in this version: plot of the duration ratio of each voiced segment to the preceding one, per speaker.]

Note. The comparison is expressed as the ratio of segment 1 to segment 2 (same sentence and speakers as in Figure 1). Notations on the vertical axis include neighboring unvoiced consonants for clarity.
The spectrographic measures point to outliers that matched the judgments of the teachers. For example, take two cases of outliers in Fig. 2.
First, for the word “extra” in “I did want to have extra insurance,” the
segment /EHK/ is unusually long compared to the following vocalic segment (/STRA/) for the speakers labeled mfrc and miwa (the quality of these speakers’ vowels was closer to /IY/ here). This deviation may be due to a poor attempt to pronounce the lax vowel /EH/ and was noted independently by teachers. Indeed, tense/lax vowel quality differences do not exist in French and Italian and are a pervasive problem for speakers of those languages learning English. Second, mkjp’s “to” is about equal in length
to his “have,” departing from the other speakers. This deviation was also
noted by the teachers. It is interesting to observe the extremely small spread
of native and nonnative values at “want”/“to.” “Want” may be longer than
“to” for everyone, in the same relative proportions, because function words
are marked by shortened duration cues in most languages.
PITCH RESULTS
The second measure we developed was the total number of pitch peaks
present in the speech signal, calculated for each segment.4 Again, results
were compared between neighboring segments. We were able to detect
pitch deviations related to duration as well as independent of it. For example, mfrc raised pitch much higher on /EHK/ in “extra” than on the
following vocalic segment /STRA/, probably because /EHK/ is also longer
(see Figure 2). However, the speaker mpeg varied pitch independently of
duration.
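The pitch measure can be sketched as counting local maxima in the fundamental-frequency contour within each voiced segment, then comparing neighboring segments as with duration. This is an illustrative sketch, not the study's implementation; the frame-by-frame F0 values below are invented, with 0 marking unvoiced frames as a real pitch tracker typically would.

```python
def count_pitch_peaks(f0):
    """Count local maxima in a list of per-frame F0 values (Hz),
    treating frames with value 0 as unvoiced and never peaks."""
    peaks = 0
    for i in range(1, len(f0) - 1):
        # A peak is a voiced frame strictly higher than both neighbors.
        if f0[i] > 0 and f0[i - 1] < f0[i] > f0[i + 1]:
            peaks += 1
    return peaks

# One segment whose contour rises and falls twice:
contour = [0, 110, 130, 120, 140, 125, 0]
print(count_pitch_peaks(contour))  # 2
```

Comparing the peak counts of neighboring segments then parallels the duration ratios, and a segment with far more (or fewer) peaks than in native productions would surface as an outlier.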
INTENSITY RESULTS
The third measure developed, for intensity, was the average of all the
cepstral values over a given vocalic segment. To address relative rather
than absolute intensity, we compared these values segment-to-segment with
those of neighboring vocalic segments, as with duration and pitch. The
resulting curves and spread of speaker space, shown in Figure 3, differ in
general aspect from the results in Figures 1 and 2. Outliers were indicated
that matched teachers’ judgments about relative stress centers in utterances. For example, msjh shows stress displaced within the “I/did/want” region, mbob within “did/want/to,” and msjh, among others, within the “ex/tra/in” region. The speakers’ changes in amplitude
appeared to be independent of duration and pitch.
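The intensity measure can likewise be sketched as averaging a per-frame value over each vocalic segment and then forming segment-to-segment ratios. The paper averages cepstral values; in this hedged sketch a generic per-frame energy list stands in for them, and all numbers are illustrative.

```python
def mean_segment_energy(frames, segments):
    """Average a per-frame energy value over each (start, end) frame range."""
    return [sum(frames[a:b]) / (b - a) for a, b in segments]

def neighbor_ratios(values):
    """Ratio of each segment's value to the preceding segment's value."""
    return [values[i] / values[i - 1] for i in range(1, len(values))]

# Two vocalic segments of four frames each; the second is markedly softer.
energy = [2.0, 4.0, 4.0, 2.0, 1.0, 1.0, 1.0, 3.0]
means = mean_segment_energy(energy, [(0, 4), (4, 8)])
print(means)                   # [3.0, 1.5]
print(neighbor_ratios(means))  # [0.5]
```

As with duration and pitch, using relative rather than absolute values makes the measure robust to differences in recording level and overall loudness across speakers.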
Figure 3
Two-by-Two Comparison of Average Intensity (Amplitude) on Voiced Segments

[Figure 3: plot of segment-to-segment intensity ratios; vertical axis runs from 0.5 to 1.5.]

Note. The data in this figure are for the same sentence and speakers as in Figure 1.
IMPLICATIONS
Our pilot study suggests that the spectrogram can be mined for measures of speech prosody that have diagnostic value and are consistent with
what expert teachers say they would detect and correct. We are now rendering these measures automatically detectable. Being separate from one
another, the three measures of prosody, once analyzed in an utterance,
could be expressed in visual displays for the learner that show pitch, duration, or amplitude. A learner’s utterance could then be compared with a
native speaker’s utterance on each dimension to illustrate differences. Our
results suggest that the components of prosody are not totally independent of each other. We saw this particularly in the dependency of pitch on
duration. We suggest that correction first address the three components
separately, then address their combined effect. Instruction could begin by
exercising pitch and duration changes independently, then give practice
on changing pitch and duration together.
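The comparison of a learner's utterance with a native speaker's on each dimension could be sketched as flagging segment pairs where the learner's neighbor-to-neighbor ratio deviates from a native reference by more than some tolerance. The ratios and the threshold below are illustrative assumptions, not values from the study.

```python
def flag_deviations(learner, native, tolerance=0.5):
    """Return indices of segment pairs where the learner's ratio
    differs from the native reference ratio by more than tolerance."""
    return [i for i, (l, n) in enumerate(zip(learner, native))
            if abs(l - n) > tolerance]

# Hypothetical per-pair duration ratios for one sentence:
learner_ratios = [0.9, 2.6, 1.0]
native_ratios = [1.0, 1.4, 1.1]
print(flag_deviations(learner_ratios, native_ratios))  # [1]
```

The flagged indices are exactly the places a visual display would highlight, leaving the decision of what to practice next to the learner or teacher.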
AN ARGUMENT FOR EARLY PROSODY INSTRUCTION
Early prosody instruction, starting the first year of language study, could
be a boon to learning both syntax and phone articulation. Because speakers prepare the syntax of a sentence they want to say at about the same
point as they prepare prosody, incorrect word order will not fit the “song”
that it is to be sung to. Self-correction then comes into play as students
rearrange syntax to give a better fit to prosody. (Because the “song” is
considered as a whole and the syntax as a concatenation of elements, the
student should tend to rearrange syntax and not prosody.) Phones may
benefit from early prosody training, for example, in the case of stressed
and unstressed vowels in English. If a target vowel is unstressed and the
Spanish speaker uses a tense (stressed) vowel that is close to the target in
articulatory space, self-correction should follow because the speaker’s
longer tense vowel will not “fit the song” well. For example, the vowel of the unstressed “this” in the sentence “I want this present” is shorter and softer than the surrounding vowels. Practice of correct prosody in this sentence should
aid pronunciation of “this” by lessening emphasis on and shortening the /IH/ sound. Follow-up exercises could put “this” into new contexts, such
as “This is yours,” where the word is not so short and the speaker must
make more effort to retain the shortened form just learned.
EFFECTIVE CORRECTION IN SPEECH-INTERACTIVE CALL
Learners’ difficulties with phones and prosody, which our pilot studies
suggest can be readily detected in the speech waveform, become targets
for focused correction in CALL. The system that only detects pronunciation errors (e.g., parts of TriplePlayPlus by Syracuse Language Systems,
1994) is of limited aid. Learners will make random, trial-and-error attempts to correct the reported error. There may be little true amelioration
and even negative effects if learners make a series of poor attempts at a
sound. Such unsupervised repetitions could reinforce poor pronunciation
to the point of becoming a hard-to-correct habit (Morley, 1994).
Effective correction requires that recognizer results be interpreted, as
by putting them into a visually comprehensible form and comparing them
to native speech. Our work in FLUENCY suggests that how recognizer
results are best interpreted for instruction differs between phone correction and prosody correction. This suggestion stems from the fact that
phones are different from one language to another while prosody is produced in the same way across languages. Whereas students must be guided
as to tongue and teeth placement for a new phone, they don’t need instruction on how to increase pitch if they have normal hearing: They only
need to be shown when to increase and decrease it, and by how much.
CORRECTING PHONE ERRORS
There has been some success in using minimal pairs—contrasting sounds
in context in the target language, such as “I want a beet”/”I want a bit”
(see Dalby & Kewley-Port, this issue; Wachowicz & Scott, this issue).
Effective teachers often go further, with instructions on how to change
articulator position and duration. This kind of instruction is important
because if a sound does not already belong to a learner’s phonetic repertory, the learner will associate it with a close speech sound that is in the
repertory. For example, anglophones beginning to speak French typically
hear and pronounce the French sound /y/ (in tu) as the English sound /u/
(in “too”); but they can be taught to use liprounding to approximate French
/y/.
Automatic systems can teach articulator placement for new sounds,
adding graphical views, for example, of the inside of the mouth (LaRocca,
1994). This instruction can be likened to gymnastics; the learner “feels”
when the articulators are correctly in place and practices with the recognizer to confirm this. Learners can train their ears to recognize the new
sounds and relate them to what they feel their muscles doing. Akhane-Yamada et al. (1996) suggest that learning to perceive sound distinctions
helps in their production.
Phone articulation training can be L1-independent. A target vowel, for
example, can be taught by starting with a close cardinal vowel (e.g., /a/, /i/, and /u/ have a high probability of existing in most L1s). A better solution, requiring more computer memory and linguistic knowledge, is to
start with a close vowel in the learner’s particular L1. Taking into account
the learner’s L1 can help anticipate errors and point to pertinent articulatory hints (Kenworthy, 1987). Thus, knowing that French has no lax vowels lets those teaching English to French speakers focus on how to move from a tense vowel to the nearby lax vowel (“peat” to “pit”).
CORRECTING PROSODY ERRORS
Based on work in FLUENCY, we propose that visual displays, more than oral instructions, will be critical to prosody correction. The key is for
learners to see where the curve representing their production differs from
the native speaker’s curve. Prosody displays can benefit from the wealth
of work on automated systems that teach the deaf to speak. For example,
Video Voice (Micro Video, 1989) uses histograms to represent intensity
(over time) and xy curves for pitch (over time). Duration is implicit in the
time axis of the intensity histogram. Video Voice compares what the student says to a native speaker’s prerecorded exemplar. For pitch, the student sees the two frequency curves and, guided by hints, tries to increase
or decrease pitch at relevant points to come closer to the exemplar. Trials
within the FLUENCY project confirm the importance of visual details to
help learners understand the display, for example, using a continuous line
as opposed to a divided contour for pitch.
ASR-Based CALL Can Provide Significant Contexts for Language Practice
CALL can simulate authentic contexts using multimedia and multimodal
displays in ways discussed elsewhere in this volume (e.g., Rypa & Price;
Wachowicz & Scott). Learners can participate in one-to-one conversations with one or more simulated or videotaped interlocutors. The cue for
the student to speak can be realistic, such as having a character on the
screen turn head and eyes toward the user (or the camera).
ASR-Based CALL Can Put Learners at Ease
The computer can prove the ideal partner for putting a language learner
at ease in speaking. Whereas the human teacher judges the student’s production, the computer can be viewed as neutral. It can support continual
practice of unusual sounds until students have enough confidence to go
before others. The system becomes what Wyatt (1988) calls a collaborative tool rather than a facilitative one, with students assuming the role of
judges of their own productions. This role not only has pedagogical backing (Celce Murcia & Goodwin, 1991) but can also benefit system performance. For example, if an exercise requires making a fine phonetic distinction that the recognizer detects poorly, the system can mislead and
frustrate the student by giving errant pronunciation scores and, on that
basis, deciding what to present next. However, if the system simply displays recognition results without pronunciation scores and allows students
to decide whether they did well or need further practice, then ASR-derived error is less problematic. The student gains a sense of control over
the chain of events but the teacher can still intervene to insist on more
practice.
ASR-Based CALL Can Provide Ongoing Assessment
CALL today can enable rapid, constant assessment of the learner. The
system can provide more details more rapidly than a teacher grading tests
(Bernstein & Franco, 1995). The feedback given to the teacher can go
beyond pronunciation scoring. In traditional computer-aided instruction,
learners are scored right or wrong on a given question and the scores
tallied at the end of the session. But for a system that gives visual data to
help learners decide where to correct themselves, feedback to the teacher
can include learners’ own decisions as to their strong and weak points.
For example, in a lesson on how to emphasize content words in utterances, if the learner decides to work on duration rather than pitch or amplitude, we can assume either that duration presented more of a problem
or that the learner did not have time for the other two aspects. In any case,
the teacher who receives the system’s report can immediately test progress
in the aspect the learner worked on and recommend what to work on in
the next session.
Latency of response can also be measured (Bernstein & Franco, 1995)
to obtain an even clearer view of where learners are having difficulties.
Responses that took more time to formulate can be noted, as can progress
in decreasing latencies over a session.
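Latency tracking of this kind can be sketched as recording the interval between each prompt and the onset of the learner's response, and checking whether those intervals shrink over the session. The timestamps below are hypothetical; a real system would log them from the dialogue manager and the speech detector.

```python
def response_latencies(events):
    """Given (prompt_time, response_onset) pairs in seconds, return
    each latency and whether latencies never increase over the session."""
    latencies = [resp - prompt for prompt, resp in events]
    improving = all(latencies[i] <= latencies[i - 1]
                    for i in range(1, len(latencies)))
    return latencies, improving

# Three prompts; the learner answers faster each time.
session = [(0.0, 2.4), (10.0, 11.8), (20.0, 21.2)]
lats, improving = response_latencies(session)
print(improving)  # True
```

A report of the per-item latencies alongside the pronunciation results would give the teacher the "clearer view" of difficulty described above.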
CONCLUSION
Speech-interactive CALL brings to pronunciation instruction a wealth
of new, sometimes unforeseen, techniques. Increases in computer memory
and storage for expanded exposure to many speakers and for multimedia
corrective feedback can reproduce some of the advantages of total immersion learning. There is still much to be done. Teachers and computer scientists need to collaborate more closely to refine ASR-based tools and to
invent and validate new teaching methods to build on the advantages of
the new medium.
NOTES
1. Although there is not, to our knowledge, quantitative proof of the effectiveness of these recommendations, they are important in teaching methodologies based on immersion (Celce Murcia & Goodwin, 1991; Krashen, 1982).
2. The nonnative speakers in this study had varying degrees of proficiency in English.
3. If knowledge of the difference between the speaker’s L1 and L2 were also used in the form of post-processing heuristics, the system could home in on only the errors that would be relevant to correct for the given speaker.
4. The total number of pitch peaks is defined as the number of maxima in the fundamental frequency contour over the whole utterance.
REFERENCES
Akhane-Yamada, R., Tohkura, Y., Bradlow, A., & Pisoni, D. (1996). Does training
in speech perception modify speech production? In Proceedings of the
international conference on spoken language processing. Philadelphia,
PA.
Allen, W.S. (1968). Walter and Connie, parts 1-3. British Broadcasting Corporation.
Auralog (1995). AURA-LANG user manual. Voisins le Bretonneux, France: Author.
Bagshaw, P., Hiller, S., & Jack, M. (1993). Computer aided intonation teaching. In
Proceedings of Eurospeech 93.
Bernstein, J. (1994). Speech recognition in language education. In F. L. Borchardt
& E. Johnson (Eds.), Proceedings of the 1994 annual CALICO symposium: Human factors (pp. 37-41). Durham, NC: CALICO.
Bernstein, J., & Franco, H. (1995). Speech recognition by computer. In N. Lass
(Ed.), Principles of experimental phonetics. St. Louis: Mosby.
Bowen, J. D. (1975). Patterns of English pronunciation. Rowley, MA: Newbury
House.
Brumfit, C. (1984). Communicative methodology in second language teaching.
Cambridge, UK: Cambridge University Press.
Celce Murcia, M., & Goodwin, J. (1991). Teaching pronunciation. In M. Celce
Murcia (Ed.), Teaching English as a second language. Boston: Heinle &
Heinle.
Crookall, D., & Carpenter, R. (Eds.). (1990). Simulation, gaming, and language
learning. New York: Harper Collins.
Duncan, C., Bruno, C., & Rice, M. (1995). Learn to speak Spanish: Text and workbook. Hyperglot Software Co. Inc.
Eskenazi, M. (1992). Changing speech styles, speakers’ strategies in read speech
and careful and casual spontaneous speech. Proceedings of the international conference on spoken language processing. Banff.
Eskenazi, M. (1996). Detection of foreign speakers’ pronunciation errors for second language training—preliminary results. Proceedings of the international conference on spoken language processing, ‘96.
Hansen, B., Novick, D., & Sutton, S. (1996). Systematic design of spoken prompts.
Proceedings of computer human interaction (CHI) ‘96 (pp. 157-164).
Isard, A., & Eskenazi, M. (1991). Characterizing the change from casual to careful
style in spontaneous speech. Journal of the Acoustical Society of America, 89 (4), pt. 2.
Kenworthy, J. (1987). Teaching English pronunciation. New York: Longman.
Krashen, S. (1982). Principles and practice in second language acquisition. New
York: Pergamon.
LaRocca, S. (1994). Exploiting strengths and avoiding weaknesses in the use of
speech recognition for language learning. CALICO Journal, 12 (1), 102-105.
Laroy, C. (1995). Pronunciation. In Resource books for teachers. Oxford: Oxford
University Press.
Micro Video Corporation. (1989). Getting started with Video Voice: A follow-along
tutorial. Ann Arbor, MI: Author.
Modern Language Materials Development Center. (1964). French 8, audio-lingual
materials. New York: Harcourt, Brace and World.
Morley, J. (Ed.). (1994). Pronunciation pedagogy and theory: New views, new directions. Alexandria, VA: TESOL.
Omaggio, A. (1993). Teaching language in context (2nd ed.). Boston: Heinle &
Heinle.
Pean, V., Williams, S., & Eskenazi, M. (1993). The design and recording of ICY, a
corpus for the study of intraspeaker variability and the characterization
of speaking styles. Proceedings of Eurospeech ‘93 (pp. 627 -630).
Ravishankar, M. (1996). Efficient algorithms for speech recognition. (Doctoral dissertation, Carnegie Mellon University, 1996). Technical Report CMU-CS-96-143.
Richards, J., & Rodgers, T. (1986). Approaches and methods in language teaching.
Cambridge: Cambridge University Press.
Rooney, E., Hiller, S., Laver, J., & Jack, M. (1992). Prosodic features for automated
pronunciation improvement in the SPELL system. Proceedings of the
international conference on spoken language processing (pp. 413-416).
Syracuse Language Systems. (1994). TriplePlayPlus! User’s Manual. Random House.
Tajima, K., Dalby, J., & Port, R. (1996). Foreign-accented rhythm and prosody in
reiterant speech. Journal of the Acoustical Society of America, 99, 2493.
Tajima, K., Port, R., & Dalby, J. (1994). Influence of timing on intelligibility of
foreign-accented English. Journal of the Acoustical Society of America
(paper 5pSP2).
Wyatt, D. (1988). Applying pedagogical principles to CALL. In W. F. Smith (Ed.),
Modern media in foreign language education. Lincolnwood, IL: National
Textbook Company.
AUTHOR’S BIODATA
Dr. Maxine Eskenazi is a Systems Scientist at Carnegie Mellon University.
She has had the dual experience of working in the field of automatic speech
processing and extensively teaching both French and English as foreign
languages, having obtained foreign language teaching accreditation from
the state of Pennsylvania. She obtained her Doctorate in Computer Science from the University of Paris 11 and worked for over 15 years at the
LIMSI-CNRS laboratory in France as a Chargée de Recherche.
AUTHOR’S ADDRESS
Maxine Eskenazi
Language Technologies Institute
206 Cyert Hall
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213
Phone: 412/268-3858
Fax: 412/268-6298
E-Mail: [email protected]