MA Thesis
Automatic pronunciation error detection
in Dutch as a second language:
an acoustic-phonetic approach
Khiet Truong
First Supervisor: Helmer Strik (University of Nijmegen)
Second Supervisor: Gerrit Bloothooft (Utrecht University)
Submitted to the Faculty of Arts of Utrecht University
The Netherlands
MA (doctoraal) thesis by Khiet Truong
(Dutch title: Automatische detectie van uitspraakfouten bij NT2-leerders: een akoestisch-fonetische aanpak)
Faculty of Arts, Utrecht University
Programme: General Linguistics
Specialisation: Computational Linguistics
First supervisor: Helmer Strik (University of Nijmegen)
Second supervisor: Gerrit Bloothooft (Utrecht University)
June 2004
Acknowledgements
The research for my MA thesis was carried out at the Department of Language and Speech at the University of Nijmegen, where I participated in the PROO project from September 2003 to June 2004. I would like to take this opportunity to thank everyone who helped me carry out the research and write my MA thesis at this department.
I would like to thank Lou Boves and Gerrit Bloothooft for making this traineeship possible. I would also like to thank my supervisors, who guided me and helped me complete this thesis: Helmer Strik, Catia Cucchiarini, Ambra Neri (University of Nijmegen) and Gerrit Bloothooft (Utrecht University). Thank you, I have learned so much from you. The other members of the PROO group are also thanked for their help. Finally, the members of the Department of Language and Speech at the University of Nijmegen and the “scriptiegroep” (thesis group) at Utrecht University are thanked for sharing their knowledge and giving feedback on my work and presentations.
Khiet Truong
Apeldoorn, June 2004
Contents

Acknowledgements

1 Introduction
  1.1 Background: CAPT within CALL
  1.2 The aim of the present study
  1.3 Structure of the thesis

2 Automatic detection of pronunciation errors: a small literature study
  2.1 Introduction
  2.2 Why do L2 learners produce pronunciation errors?
  2.3 What kind of pronunciation errors do L2-learners make?
  2.4 Possible goals in pronunciation teaching
  2.5 Overview of automatic pronunciation error detection techniques in the literature
    2.5.1 Overview of ASR-based techniques
    2.5.2 Adding extra knowledge to acoustic models and ASR-based techniques
  2.6 Automatic pronunciation error detection techniques employed in real-life applications
    2.6.1 Employing ASR-based techniques in real-life CALL applications
    2.6.2 Using acoustic-phonetic information in real-life CALL applications

3 The approach adopted in the present study
  3.1 Introduction
  3.2 An acoustic-phonetic approach to automatic pronunciation error detection
  3.3 Selecting pronunciation errors
    3.3.1 Goal of pronunciation teaching adopted in this study
    3.3.2 The pronunciation errors addressed in this study

4 Material & Method
  4.1 Introduction
  4.2 Material
  4.3 Algorithms used in this study
    4.3.1 Linear Discriminant Analysis
    4.3.2 Decision tree-based

5 The pronunciation error detectors /A/-/a:/ and /Y/-/u,y/
  5.1 Introduction
  5.2 Acoustic characteristics of /A/, /a:/, /Y/, /u/ and /y/
    5.2.1 General acoustic characteristics of vowels
    5.2.2 Acoustic differences between /A/ and /a:/
    5.2.3 Acoustic differences between /Y/ and /u,y/
    5.2.4 Acoustic features for vowel classification: experiments in the literature
  5.3 Method & acoustic measurements
  5.4 Experiments and results for /A/-/a:/ and /Y/-/u,y/
    5.4.1 Organization of experiments
    5.4.2 Experiments and results /A/-/a:/
    5.4.3 Experiments and results /Y/-/u,y/
    5.4.4 Experiments and results /Y/-/u/-/y/
  5.5 Discussion of results
    5.5.1 Discussion of the results of /A/-/a:/
    5.5.2 Discussion of the results of /Y/-/u,y/

6 The pronunciation error detector /x/-/k,g/
  6.1 Introduction
  6.2 Acoustic characteristics of /x/, /k/ and /g/
    6.2.1 General acoustic characteristics of consonants
    6.2.2 Acoustic differences between /x/ and /k,g/
    6.2.3 Acoustic features for fricatives versus plosives classification: experiments in the literature
  6.3 Methods & acoustic measurements
    6.3.1 Method I & acoustic measurements
    6.3.2 Method II & acoustic measurements
  6.4 Experiments and results for /x/-/k,g/
    6.4.1 Organization of experiments
    6.4.2 Experiments and results method I
    6.4.3 Experiments and results method II
  6.5 Discussion of results

7 Conclusions and summary of results
  7.1 Introduction
  7.2 Summary of /A/-/a:/
  7.3 Summary of /Y/-/u,y/
  7.4 Summary of /x/-/k,g/
  7.5 Conclusions

References

A List of abbreviations
B List of phonetic symbols
C Scripts
D Sentences
E Amount of speech data
F Tables with classification scores
G How to read Whisker's Boxplot
Chapter 1
Introduction
1.1 Background: CAPT within CALL
Traditionally, pronunciation training received less attention than writing and grammar in foreign language teaching. Many language teachers believed that pronunciation did not deserve as much attention as other linguistic aspects such as grammar, mainly because they considered accent-free pronunciation a myth (Scovel, 1988), and thus an impossible goal to achieve. Among other factors, this view has had a rather negative influence on the amount of information available on how pronunciation can best be taught. Nowadays, it is generally agreed that a reasonably intelligible pronunciation is more important than accent-free pronunciation. Unfortunately, pronunciation training is still often neglected in traditional classroom instruction, the main reason being that it is time-consuming: it requires a lot of practice time from students and a lot of time from teachers for providing feedback. Computer Aided Language Learning (CALL) systems can offer a solution. More specifically, a Computer Aided Pronunciation Training (CAPT) module within such a CALL system can tackle the problems associated with pronunciation training in a classroom environment, and offers many other advantages.
Technology is nowadays increasingly integrated in teaching, and more specifically in foreign language teaching: many software applications that teach users foreign languages are available on the market. CAPT and CALL applications provide a solution to the problems mentioned above. First of all, computers are more patient than human teachers, and are usually available without any time constraints. Secondly, computers provide a more individual way of learning, which allows students to practise their own pronunciation difficulties and work at their own pace, whereas in a traditional classroom environment it is difficult to focus on the needs of individual students. Moreover, student profiles can be logged by the system, so that improvement or problems can be monitored by the teacher or by the student him/herself. Finally, a classroom environment can cause anxiety or stress for students; a CALL environment, which offers more privacy, can reduce this phenomenon, known as foreign language anxiety (Young, 1990).
These collective advantages have led to an increasing interest in CALL, and more specifically CAPT, in the language teaching community. Developing CALL and CAPT systems offers challenges and new interdisciplinary areas of interest in the field of language teaching: technology has to be integrated into a language teaching system in such a way that it meets pedagogical requirements. Neri et al. (2002) describe the relationship between pedagogy and technology in CAPT courseware more closely.
CAPT can be integrated into a CALL system by using Automatic Speech Recognition (ASR)
technology. An (ideal) ASR-based CAPT system can be described by a sequence of three phases:
1) Speech Recognition: the first and most important phase, because the subsequent phases depend on its accuracy. In interactive dialogues with multiple-choice answers, the correct answer should be recognized by the system and all other answers should be discarded. Furthermore, the ASR time-aligns the spoken signal with phone labels; phases 2) and 3) are based on this time alignment.

2) Scoring and error detection/diagnosis: the system evaluates the pronunciation quality and can give a global score. Pronunciation errors are located and the type of error is determined for phase 3).

3) Feedback: with the diagnosis of the pronunciation error, correct feedback can be given that meets the pedagogical requirements.
Ideally, such a CAPT system should mimic the tasks of a human teacher and give the same
judgements about the student’s pronunciation as a human teacher would do. CALL and CAPT
systems are therefore usually evaluated by how well judgements from the machine agree/correlate
with human judgements (human-machine correlations) of the same speech material.
1.2 The aim of the present study
The focus of this study is on automatic pronunciation error detection (phase 2 in the previous
scheme) in speech of foreign language learners. In our case, the foreign language is Dutch which
is learned by second language (L2) learners. In general, automatic pronunciation error detection
techniques usually involve measures that are obtained by means of automatic speech recognition
technology, generalized under the term “confidence scores” (see section 2.5), which in some way represent how certain the system is that a signal X belongs to a pattern Y: a low confidence (score) may indicate poor pronunciation quality. These measures have the advantage that they can be obtained fairly easily, and that they can be calculated in similar ways for all speech sounds. However, ASR confidence measures also have the disadvantage that they are not very accurate: the average human-machine correlations they yield are rather low and, consequently, their power to predict pronunciation quality is also rather low (see e.g. Kim et al., 1997). This lack of accuracy might be related to the fact that confidence scores are computed in the same way for all speech sounds, without focusing on the specific acoustic-phonetic features of individual sounds. These disadvantages of methods based on confidence measures have led to the present study, in which we investigate an alternative approach that might yield higher detection accuracy. In this study, we present an acoustic-phonetic approach to the detection of pronunciation errors at phone level. The
goal of this study is formulated as:
Goal: to develop automatic acoustic-phonetic-based classification techniques for automatic
pronunciation error detection in speech of L2 learners of Dutch.
Related to this goal is the question of how well these automatic classification techniques perform
in detecting pronunciation errors at phone level. This is the main question addressed in this study
and is formulated as the thesis question:
Thesis question: How effective are automatic acoustic-phonetic-based classification techniques
in detecting pronunciation errors of L2 learners of Dutch?
In this context, “effective” means that ideally, the techniques should be able to detect pronunciation errors just as humans do: machine judgements should resemble human judgements. For this purpose, the non-native Dutch speech used in this study was checked and annotated on pronunciation errors by human listeners, so that these human annotations (judgements) could be compared to machine judgements.
The acoustic-phonetic approach (chapter 3) enables us to be more specific in developing pronunciation error detection techniques. First, we selected three pronunciation errors by carrying out a survey on an annotated
survey on an annotated non-native speech database. We found that the following three speech
sounds were often mispronounced by non-native speakers and decided to address these three
pronunciation errors in this study:
/A/ mispronounced as /a:/
/Y/ mispronounced as /u/ or /y/
/x/ mispronounced as /k/ or /g/
For each pronunciation error, the acoustic differences between a correctly pronounced phone and an incorrectly pronounced phone are examined, and these acoustic differences, translated into acoustic-phonetic features, are used to develop a pronunciation error detector. Classification experiments and statistical analyses can show which specific features are most reliable for the detection of a particular pronunciation error (Q1).
Another interesting issue to be examined in this study is the use of native or non-native speech as training material: it is still not clear whether a detector should be trained on native or on non-native speech material to achieve the highest detection accuracy, without degrading the performance for native speakers (Q2).
For the pronunciation error of /x/, we will examine two different methods: one that uses Linear Discriminant Analysis and one that uses a decision tree to classify a sound as either correct or incorrect. Is one method to be preferred over the other (Q3)?
Thus, in addition to the main question, three other questions are posed that are related to this
acoustic-phonetic approach and the thesis question:
Q1. What are reliable discriminative acoustic-phonetic features of phonemes for pronunciation
errors of /A/, /Y/ and /x/?
Q2. How do detectors trained under different conditions (on native or on non-native speech) cope with non-native speech?
Q3. What are the advantages of a Linear Discriminant Analysis (LDA) method as opposed
to a decision tree-based method for automatic pronunciation error detection?
The following chapters describe how the goal of this study is achieved and how we try to find
the answers to the questions posed above.
1.3 Structure of the thesis
Chapter 2 reports on a small literature study on automatic pronunciation error detection. First,
we examine why L2 learners produce pronunciation errors (section 2.2) and show some examples
of these errors (section 2.3). A description of possible goals of pronunciation teaching is given in
section 2.4. Finally, overviews of different kinds of automatic pronunciation error detection techniques are given in sections 2.5 and 2.6.
Chapter 3 describes the approach adopted in this study (section 3.2). Part of this approach is the selection procedure for the pronunciation errors addressed in this study (section 3.3).
Chapter 4 gives a description of the material and the different classification algorithms that were
used in this study. The speech material is described in section 4.2 and two different classification
algorithms are described in section 4.3.
Chapter 5 reports on the development of the pronunciation error detectors for errors of /A/ and /Y/. First, the acoustic characteristics of the sounds involved are examined (section 5.2) to determine
potential discriminative features. A description of the procedure for acoustic feature extraction
is given in section 5.3 and in section 5.4, the results of the classification experiments are shown.
Finally, a discussion of the results is given in section 5.5.
Chapter 6 reports on the development of the pronunciation error detectors for errors of /x/.
The acoustic properties of this pronunciation error are examined in section 6.2. Two classification
methods for this error are introduced in section 6.3. Finally, in section 6.4 the results of the
classification experiments are shown, and discussed in section 6.5.
Chapter 7 gives a summary of the results. Summaries of the classification experiments of /A/
vs /a:/, /Y/ vs /u,y/ and /x/ vs /k,g/ are given in sections 7.2, 7.3 and 7.4, respectively. Finally,
in section 7.5, we try to answer the questions posed at the beginning of this thesis and give some
suggestions for further research.
Some practical remarks:
Throughout this work, phonetic symbols will be used in SAMPA notation: for a list of phonetic symbols in IPA and SAMPA notation, see appendix B. A list of abbreviations used throughout this work is given in appendix A.
Chapter 2
Automatic detection of pronunciation errors: a small literature study
2.1 Introduction
This chapter reports on a small literature study and consists of two parts. The first part gives some background information on pronunciation errors: why L2 learners produce pronunciation errors and what kind of errors they make is described in sections 2.2 and 2.3, and section 2.4 describes some possible goals in pronunciation teaching. The second part gives an overview of automatic pronunciation error detection techniques (sections 2.5 and 2.6).
2.2 Why do L2 learners produce pronunciation errors?
Pronunciation errors exist because L2 sounds are not correctly produced by the L2 learner. How, then, are L2 sounds learned by L2 learners, or, to put it differently: why are L2 sounds not properly learned? Various studies that have investigated this issue have also paid attention to the relationship between production and perception of L2 sounds. The main question seems to be whether production precedes perception, or perception precedes production, in the process of acquiring an L2 (Llisterri, 1995). Or, in other words, is an L2 learner able to produce an L2 sound accurately if the same sound is not correctly perceived? This relationship between production and perception of L2 sounds is related to factors such as age of learning and knowledge of the L2.
Some researchers have proposed that neurological maturation might lead to a diminished ability
to add or modify sensorimotor programs that configure the movements of the articulators for
producing sounds in an L2 (e.g. McLaughlin, 1977). Many researchers believe that when a certain
age is passed, new sounds in speech cannot be learned perfectly. For instance, it was found that the
later non-native speakers began learning English, the more strongly foreign-accented their English
sentences were judged to be (Flege, Munro and MacKay, 1995). The existence of this so-called
“critical period” is often explained by neurological maturation.
Knowledge of L2 may also affect the relationship between production and perception. Bohn &
Flege (1990) investigated this factor by examining the production and perception of the English
vowels /e:/ and /{/ (IPA /æ/) in two groups of German learners of English: an experienced group
and an inexperienced group. The results showed that there are clear differences between the two
groups of speakers: the inexperienced group did not produce the contrast between the two vowels,
but was able to differentiate them in a labeling task and thus was able to perceive them correctly;
the experienced group did produce the contrast and was better in the labeling task. Furthermore,
they found that the two groups relied on different acoustic cues in the labeling task. They concluded
that perception may lead production in the early stages of L2 speech learning and that production
might be improved by experience.
Evidence supporting the view “production precedes perception” can be found in Borrell (1990),
Neufeld (1988) and Briere (1966), who pointed out that, when learning an L2, it is common that not all sounds that are correctly perceived are also correctly pronounced. Furthermore, an experiment carried out by Sheldon & Strange (1982) with Japanese speakers of English showed that the production of the English contrast between /r/ and /l/ was more accurate than the perception of it.
The view “perception precedes production” is supported by evidence from many more studies,
which seems to imply that generally, perception does precede production, at least for vowels.
Already in 1939, Trubetzkoy proposed that bilinguals tend to perceive L2 sounds with their own
L1 phonology, which may lead to wrong productions or accentedness of L2 sounds. Borden, Gerber
& Milsark (1983) examined the relationship between perception and production of English /l/ and
/r/ in Korean learners of English. They found that perceptual judgments of /r/ and /l/ improved
before production and that self-perception develops earlier and may be a prerequisite for accurate
production. Flege (1993) examined vowel duration as a cue to voicing in English words produced
and perceived by Chinese speakers of English. The study revealed correlations between differences
in perceived vowel duration and degree of foreign accent, and Flege (1993) concluded that “[...] non-natives will resemble native speakers more closely in perceiving than in producing vowel duration
differences [...]”.
Numerous studies have proven that listeners have difficulty perceiving and making phonetic
distinctions that do not exist in their native language. A common view in the 1970s was that
interference from the L1 is the primary phonological cause of non-native productions: 1) an L2
sound that is identified with a sound in the L1 will be replaced by the L1 sound; 2) contrasts
between sounds in the L2 that do not exist in the L1 will not be honored; 3) contrasts in the L1
that are not found in the L2 may nevertheless be produced in the L2 (e.g. Weinreich, 1953; Lehiste,
1988).
Two more recent working models that focus on phonological contrasts in L1/L2 and support the
perceptive view on L2-learning are Flege’s Speech Learning Model (SLM) and Best’s Perceptual
Assimilation Model (PAM).
SLM (Flege, 1995) claims that “[...] without accurate perceptual targets to guide the sensorimotor learning of L2 sounds, production of the L2 sounds will be inaccurate [...]”. The model
makes the assumptions that the phonetic systems used in the production and perception of vowels
and consonants remain adaptive over the life span and that new phonetic categories are added or
old ones are modified in the phonetic systems when L2 sounds are encountered. It hypothesizes
that many (but not all!) L2 production errors have a perceptual basis. Learners perceptually
relate positional allophones in the L2 to the closest positionally defined allophone in the L1 in
acoustic-phonetic terms, such as the F1/F2 formant space for vowels. L2 learners can establish a
new phonetic category for an L2 sound that differs from the closest L1 sound. The greater the
perceived distance of an L2 sound from the closest L1 sound, the more likely that a new phonetic
category will be established.
According to PAM (Best, 1995), non-native sounds are perceptually assimilated to native phonetic categories according to their articulatory-phonetic (gestural) similarity to native gestural
constellations (Browman & Goldstein, 1989), where gestures are defined by the articulators, place
of articulation and manner of articulation. The model states that non-native speech perception is
strongly affected by the linguistic experience of the listener with phonological contrasts and that
listeners perceptually assimilate non-native phones to native phones whenever possible. In PAM, a
given non-native phone may be perceptually assimilated to the native system of phonemes in one
of three ways:
• as a Categorized exemplar of some native phoneme: if the contrasting phones are both assimilated as good exemplars of a single native phoneme, then perceptual differentiation is difficult; if the contrasting phones differ in their “goodness of fit” as exemplars of a single native phoneme, then perceptual differentiation is somewhat easier
• as an Uncategorized sound that falls somewhere in between native phonemes: the non-native phone is roughly similar to two or more phonemes, and perceptual differentiation is easy
• as a Nonassimilable non-speech sound that bears no detectable similarity to any native phoneme: the non-native phone will be perceptually differentiated on the basis of its auditory or phonetic characteristics.
The main difference between the two models is that SLM places the emphasis on an acoustic-phonetic specification of phonetic similarity, whereas PAM assumes an articulatory specification of phonetic similarity.
2.3 What kind of pronunciation errors do L2-learners make?
A distinction can be drawn between pronunciation errors that are made on a segmental level and
errors that are made on a suprasegmental level. On a segmental level, errors may concern vowel
and consonant quality and may be explained by differences between language systems.
An example of a segmental pronunciation error is the pronunciation of the Dutch /i/ in “vies”
and /I/ in “vis”: Japanese and Italian L2 learners of Dutch do not know the difference between /i/
and /I/ because this distinction does not exist in their L1. The same applies to the mispronunciation of /A/ as /a:/: length is a distinctive feature in Dutch, whereas in e.g. Italian this distinctive feature
does not exist.
Another example of a segmental pronunciation error is the mispronunciation of /x/, a very common error in Dutch, which again might be due to the fact that /x/ does not occur in many other languages.
Pronunciation errors may be mispronunciations of L2 sounds, also called substitutions because an L2 sound is substituted with another sound, but deletions and insertions of sounds also occur in the L2. The latter two error types may be due to differences in syllable structure between L1 and L2. Japanese and Arabic do not allow branching onsets or codas, so an L2 word may be modified so that it fits the L1 syllable structure, which results in vowel epenthesis (see fig. 2.1 and fig. 2.2; both examples were taken from O’Grady et al., 1996). In Turkish, a word cannot
begin with two consonants and Spanish does not allow an /s/ word-initially followed by a sequence
of consonants (O’Grady et al., 1996).
Figure 2.1: English target word with its syllable structure.
Figure 2.2: Non-native speaker’s version of the English target word.
In addition to having deviant intonational contours and deviant lexical stress patterns, L2 learners tend to have lower speech rates and a higher number of disfluencies such as stops, repetitions,
and pauses, which result in lower fluency (suprasegmental errors).
An example of a suprasegmental pronunciation error is incorrect stress placement. L2 learners
have to acquire the stress patterns of the language they are trying to learn, which is difficult because
the stress patterns of the L1 interfere. Consider Polish, in which word-level stress is assigned to the penultimate (next-to-last) syllable regardless of syllable weight, whereas in English stress can also fall on the antepenultimate (third-from-last) syllable, depending on the weight of
the syllable. The tendency of Polish speakers to place stress on the penultimate syllable regardless
of syllable weight is a common pronunciation error in English (see table 2.1).
English target    Non-native form
as'tonish         as'tonish
main'tain         'maintain
'cabinet          ca'binet

Table 2.1: Example of a non-native stress pattern in which the next-to-last syllable is always stressed (this example was taken from O’Grady et al., 1996).
Researchers have examined the spectral differences between native and non-native speech and
found that one of the largest differences between these two types of speech are the patterns of
the second and higher formants (Arslan, 1996; Flege, 1987). This finding can be explained by
Fant (1960), who showed that small changes in the configuration of the tongue position can lead
to large shifts in the frequency location of F2 and F3, while the frequency location of F1 only
changes if the overall shape of the vocal tract changes. To improve intelligibility of L2 learners and
methods of pronunciation teaching, researchers have tried to establish pronunciation error gravity
hierarchies, so that priority can be given to those errors that have the most damaging effect on
the intelligibility of speech (e.g. Van Heuven et al., 1981; Anderson-Hsieh et al., 1992; Derwing &
Munro, 1997). Although the answer to this issue is still not clear, it appears that both segmental
aspects and suprasegmental aspects play important roles. Both aspects can be measured separately,
but they do influence each other as the case of lexical stress illustrates. A stressed syllable is usually
characterized by a clearer pronunciation (which may cause spectral differences, segmental), a higher
amplitude (segmental), a higher pitch (suprasegmental) and a longer duration (suprasegmental).
2.4 Possible goals in pronunciation teaching
Studies have shown that foreign accents may have negative consequences for non-native speakers.
Listeners detect divergences between the phonetic norms of their L1 and those of the non-native
speaker, and may for instance misjudge the non-native speaker’s affective state (e.g. Holden &
Hogan, 1993). Although several studies have shown that a general bias against foreign accentedness in speech exists and that native listeners tend to downgrade non-native speakers because of
their foreign accent, these observations do not directly mean that language teachers should aim
at teaching accent-free speech. Abercrombie (1956) argued that “most language learners need no
more than a comfortably intelligible pronunciation”. Witt (1999) agrees with Abercrombie and
defined comfortable intelligibility as “[...] a level of pronunciation quality, where words are correctly pronounced to their phonetic transcription, but there are still subtle differences in how these
phonemes sound in comparison with native speakers [...] the speech of comfortably-intelligible
non-native speakers might differ from native speakers with regard to intonation and rhythm, but
on overall their speech is understandable without requiring too much effort from a listener [...]”.
Comfortable intelligibility seems to be a widely accepted goal in pronunciation teaching: Munro &
Derwing (1995) describe intelligibility as “[...] the extent to which a speaker’s message is actually
understood by a listener, but there is no universally accepted way of assessing it [...]”. The goal
of Munro & Derwing’s study was to examine the interrelationships among accentedness, perceived
comprehensibility and intelligibility in the speech of second language learners. Foreign accent and
intelligibility are related, but it is still not clear how foreign accent affects intelligibility. The most
important finding of their research is that “[...] although strength of foreign accent is indeed correlated with comprehensibility and intelligibility, a strong foreign accent does not necessarily cause
second language speech to be low in comprehensibility or intelligibility [...]”. Thus their study
suggests that existing programs and second language instructors aiming at foreign accent reduction
or accent-free speech do not necessarily improve the intelligibility of a second language learner’s
speech.
In the present study, we aim at teaching intelligible speech rather than accent-free speech (see
also section 3.3.1). We agree with Abercrombie’s view that most language learners do not need
more than comfortable intelligibility.
2.5 Overview of automatic pronunciation error detection techniques in the literature
2.5.1 Overview of ASR-based techniques
In this section, the focus is on different techniques for automatic detection of pronunciation errors that have already been examined and described in the literature. These techniques should
be built in such a way that they match as closely as possible the judgments of human listeners:
in order to be valid, automatic pronunciation error detection techniques or machine scores should
correlate with scores or judgments given by humans. Measures that seem to correlate well with
human judgments are temporal measures (which are acoustic measures); they are strongly correlated with human ratings of pronunciation and fluency (Cucchiarini et al., 2000; Neumeyer et al.,
2000). Cucchiarini et al. (2000) showed that expert fluency ratings can be predicted on the basis
of automatically calculated temporal measures such as rate of speech or articulation rate (timing
scores). Another finding was that temporal measures for native and non-native speakers differed
significantly and indicated that native speakers are more fluent than non-native speakers and that
non-natives normally speak more slowly than natives. Fluency is often used in tests to evaluate
non-native speakers’ pronunciation. Consequently, other temporal measures that are related to
rate of speech or articulation rate, such as duration scores (relative phone durations) and timing
scores (rhythm), also correlate well with human listeners’ judgments (see also Neumeyer et al., 2000). Thus the above-mentioned temporal (acoustic) measures all function as good predictors
of pronunciation quality because they correlate strongly with human judgments. Therefore, in
principle, machine scores based on temporal measures can suffice for good native and non-native
pronunciation assessment, but not for pronunciation training in a CALL application. With temporal measures alone, feedback can only be given on temporal aspects of non-native pronunciation.
Unfortunately, telling the student to speak faster or to make fewer pauses does not help the student
a lot to improve his/her pronunciation. Therefore, temporal measures should be supplemented with
other measures that are able to evaluate segmental or other suprasegmental aspects of non-native
speech.
These other measures and techniques have been developed by researchers to detect segmental
pronunciation errors and to evaluate non-native pronunciation quality by using parameters from the
ASR system. Nowadays, many CALL applications use measures which are generalized under the
term “confidence measures”: these measures represent in some way how confident the ASR system is that a given signal X belongs to pattern Y. ASR confidence
measures have the advantage that they can be obtained fairly easily, and that they can be calculated
in similar ways for all speech sounds. These measures based on spectral match can be combined with
temporal measures to compute a combined score to increase human-machine correlation. Because
a good deal of statistics is involved in these scores and methods, I will first briefly explain some statistical terms.
The recognition problem in ASR can be reduced to the following statistical problem: given a set of measurements, vector X, what is the probability that it belongs to a word sequence W? In other words, compute P(W|X). The posterior probability P(W|X) cannot be computed directly: it can only be estimated after the data has been seen (hence the term “posterior”). Therefore Bayes’ rule is used to estimate the posterior probability:

P(W|X) = (P(X|W) × P(W)) / P(X)

In the above formula, P(X|W) represents the probability density function: given a word sequence W, what is the probability of vector X belonging to that word sequence? This is often called the data likelihood. P(W) is the probability that the word sequence W was uttered: this represents the language model, which is independent of the observation vectors and is based on prior knowledge. P(X) is a fixed probability: the average probability that the vector X was observed.
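A small numeric illustration of this formula, with invented likelihoods and priors for two candidate word sequences W1 and W2, may help to see how the posterior probability is obtained:

    likelihood = {"W1": 0.020, "W2": 0.005}   # P(X|W): data likelihood from the acoustic models
    prior      = {"W1": 0.30,  "W2": 0.70}    # P(W): language model probability

    # P(X) is the same for every candidate, so it can be obtained by summing over all of them
    evidence = sum(likelihood[w] * prior[w] for w in likelihood)

    posterior = {w: likelihood[w] * prior[w] / evidence for w in likelihood}
    print(posterior)   # W1 ~ 0.63, W2 ~ 0.37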
These statistical measures, the likelihoods and posterior probabilities derived from the formula just presented, are used to supplement duration scores and timing scores in scoring the pronunciation quality of non-native speech. Log-likelihood is assumed to be a good measure of the similarity between native and non-native speech; therefore, Neumeyer et al. (1996) compared log-likelihood scores to segment duration scores (relative phone duration normalized by rate of speech) and timing scores (speaking rate, rhythm) by computing correlations between machine and human scores at sentence and speaker level. The correlations in Neumeyer et al. (1996) showed that HMM-based log-likelihoods are poor predictors of pronunciation ratings. The timing scores resulted in
acceptable speaker level correlations, but normalized segment duration scores produced the best
results. So the duration-based scores outperformed the HMM-based log-likelihood scores. This
study was extended in Franco et al. (1997) by examining other HMM-based scores, namely average
phone segment posterior probabilities, and comparing them to log-likelihood and duration scores.
This time, the HMM-based posterior probabilities produced higher human-machine correlations
than log-likelihood and duration scores.
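The notion of human-machine correlation used in these comparisons is simply a correlation coefficient between the machine scores and the human ratings for the same items. A minimal sketch, with invented scores, could look as follows:

    import numpy as np

    human_ratings  = np.array([2.0, 3.5, 4.0, 1.5, 5.0, 3.0])        # e.g. expert ratings per sentence
    machine_scores = np.array([-6.1, -4.8, -4.2, -7.0, -3.5, -5.2])  # e.g. posterior-based scores

    r = np.corrcoef(human_ratings, machine_scores)[0, 1]             # Pearson correlation
    print("human-machine correlation: %.2f" % r)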
The two previous approaches (Neumeyer et al., 1996; Franco et al., 1997) focused on rating
an entire sentence rather than targeting specific phone segments. Kim et al. (1997) extended
their work by assessing the pronunciation quality of individual phone segments within a sentence.
Probabilistic measures given in Franco et al. (1997) were compared to each other and again the
score based on posterior probability was the best at phone and sentence level. Duration scores
that previously showed high human-machine correlations (Neumeyer et al., 1996) now turned out
to be poor measures at phone level. However, the results of duration scores improved and showed
the strongest improvement when the amount of training data increased. This is not surprising
since it is generally known that adding more training data can improve performance. Human-machine correlations at phone level were always lower than correlations at sentence level, so rating
a single phone by machines is still problematic. The techniques presented in this study aim at
evaluating a single phone; by adopting the approach presented in chapter 3 we hope to achieve
higher human-machine agreement at segment level.
Another ASR-based method that focuses on rating a phone rather than a word or sentence is
Witt & Young’s Goodness of Pronunciation (GOP) method (Witt & Young, 2000). Their GOP
score is primarily based on the posterior probability of an uttered phoneme; a phoneme is judged correct or incorrect by comparing its GOP score to a predetermined threshold.
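A much simplified sketch of such a GOP-style score is given below; the actual formulation in Witt & Young (2000) is defined in terms of HMM likelihoods accumulated over the frames of the segment, and the numbers and threshold used here are purely illustrative.

    def gop_score(logliks, target_phone, n_frames):
        # Approximate GOP: log-likelihood of the target phone model minus that of the
        # best-scoring model, normalized by the number of frames in the segment.
        # Values close to 0 mean the target model explains the segment about as well
        # as any competitor; large negative values suggest a mispronunciation.
        return (logliks[target_phone] - max(logliks.values())) / n_frames

    logliks = {"x": -210.0, "k": -195.0, "g": -205.0}   # invented segment log-likelihoods
    score = gop_score(logliks, target_phone="x", n_frames=30)
    is_correct = score > -0.6                           # threshold would be tuned on annotated data
    print(score, is_correct)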
Thus posterior probabilities and temporal measures individually produced good results at sentence level. Therefore, combining these scores might result in even higher human-machine correlations. A combination of such scores was examined in several studies (Franco et al., 1997; Franco
et al., 2000) and indeed showed that a combination of scores in almost every case produced higher
human-machine correlations than a single posterior probability score. Linear and nonlinear regression methods, which were used to predict the human grade from a set of machine scores, were
investigated and it appeared that a nonlinear combination of machine scores produced better results than a linear combination of scores. In the best case, an increase of 11% in correlation was
obtained by using nonlinear regression with a neural network combining posterior, duration and
timing scores (Franco et al., 2000). Thus these studies have shown that some optimal confidence
scores can be combined to achieve higher human-machine correlations at sentence or speaker level.
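The score-combination step can be sketched as an ordinary regression problem: the machine scores are the predictors and the human grade is the target. The toy example below, with randomly generated data and scikit-learn purely for illustration, contrasts a linear combination with a small neural network, in the spirit of Franco et al. (2000).

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))   # columns: posterior, duration and timing scores (fake data)
    # fake human grades with a mildly nonlinear dependence on the posterior score
    y = X @ np.array([0.6, 0.3, 0.1]) + 0.2 * np.tanh(X[:, 0]) + rng.normal(0, 0.1, 200)

    linear = LinearRegression().fit(X[:150], y[:150])
    nonlin = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X[:150], y[:150])
    print("linear R^2:   ", linear.score(X[150:], y[150:]))
    print("nonlinear R^2:", nonlin.score(X[150:], y[150:]))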
2.5.2 Adding extra knowledge to acoustic models and ASR-based techniques
The measures described above were all obtained from HMM models trained on native speech only.
Several methods have been introduced where confidence scores were obtained from adapted acoustic
models trained on both native and non-native speech. Furthermore, different methods have been
proposed to integrate knowledge about the expected set of mispronunciations in the phone models
or pronunciation networks. HMM models trained with native speech data only can be expanded
to form a network with alternative pronunciations, where models trained on native and non-native
speech are used. In the MisPronunciation (MP) network by Ronen et al. (1997) each phone can be
optionally pronounced as a native or as a non-native sound. This network is then searched using
the Viterbi algorithm. To evaluate the overall pronunciation quality, a mispronunciation score can
be computed as the ratio of the number of non-native phones to the total number of
phones in the sentence. The human-machine correlations obtained with the new MP models were
almost equal to those of the previous native models.
Similarly to Ronen et al. (1997), Franco et al. (1999) used two different acoustic models for
each phone, one trained on acceptable, native speech and another trained on incorrect, strongly
non-native speech to detect mispronunciations at phone level. For each phone, a log-likelihood
ratio score was computed using the correct and incorrect pronunciation models and compared to
a posterior probability score (we have seen that posterior scores correlate well with human scores,
Franco et al., 1997; Kim et al., 1997) computed from models based only on native speech. Results
showed that the method using both native and non-native models (the log-likelihood ratio score) had higher human-machine correlations than the method using only native models (the posterior score).
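The decision rule behind such a log-likelihood ratio score can be sketched very simply: score each phone segment with the “correct” (native) model and the “incorrect” (non-native) model, and compare the difference of log-likelihoods to a threshold. The values below are invented; in practice they come from the two sets of acoustic models.

    def llr_score(loglik_correct, loglik_incorrect):
        # positive: the native ("correct") model fits better; negative: the non-native model fits better
        return loglik_correct - loglik_incorrect

    segment_scores = [llr_score(-180.0, -195.0),    # native model fits much better
                      llr_score(-202.0, -198.0)]    # non-native model fits slightly better
    threshold = 0.0
    labels = ["correct" if s > threshold else "mispronounced" for s in segment_scores]
    print(list(zip(segment_scores, labels)))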
Deroo et al. (2000) also used correct (native-like) and incorrect (strongly non-native-like) speech to train the acoustic models, but this time using a hybrid system combining HMMs and ANNs (Artificial Neural Networks) to detect mispronunciations at phone level. Unfortunately,
their phoneme models trained with native or non-native speech were very similar to each other,
so the system was not able to discriminate between wrong and right pronunciations. A second
approach produced better results. This time, knowledge about expected mispronunciations was
used: phoneme graphs were built taking all wrong pronunciations of each phoneme into account. A disadvantage of this approach is that it requires knowing in advance all the mistakes
that can be uttered by non-native speakers.
2.6 Automatic pronunciation error detection techniques employed in real-life applications
2.6.1 Employing ASR-based techniques in real-life CALL applications
Some of the methods and scores that have been discussed in the above sections are applied in
real-life CALL systems, such as the SRI EduSpeak System (Franco et al., 2000), the ISLE system
(Menzel et al., 2000) and the PLASER system (Mak et al., 2003).
The EduSpeak toolkit uses acoustic models trained with Bayesian adaptation techniques that
optimally combine native and non-native training data, so that both types of speakers can be handled by the same models with good recognition performance. In this way, improvement in recognition
for the non-native speakers was achieved without degrading the recognition performance on the
native speakers. The score used in this system is a combination of previously discussed machine
scores: the logarithm of posterior probability, phone duration and speech rate.
In the ISLE system, which focuses on Italian and German learners of English, the development of
the pronunciation training is divided into two components: automatic localization of pronunciation
errors and correction of pronunciation errors (Menzel et al., 2000). Localization of pronunciation
errors is done by identifying the areas of an utterance that are likely to contain pronunciation
errors. Only the most severe errors are selected by the error localization component that assigns
confidence scores to each speech segment. A speech segment with a low confidence score represents
a mispronounced segment. These scores are based on probabilistic measures such as the acoustic
likelihood of the recognized path. After localizing areas that are likely to contain errors, specific
pronunciation errors are detected and diagnosed for correction. Pronunciation errors that a student
might make are predicted by rules that describe how a pronunciation is altered. This results in a set
of alternative pronunciations for each entry in the dictionary; this set of course includes the correct pronunciation. Again, all the mistakes that could be made by non-native speakers
should be known in advance. Unfortunately, the system performed poorly at finding and explaining
pronunciation errors (Menzel et al., 2000).
The PLASER system (Mak et al., 2003), designed to teach English pronunciation to speakers
of Cantonese Chinese, computes a confidence-based score for each phoneme of a given word. An
English corpus and a Cantonese corpus were both used to develop Cantonese-accented English
phoneme HMMs. To assess the pronunciation accuracy of a phoneme, the Goodness of Pronunciation
measure (GOP) is used. Evaluation of the system showed that the pronunciation accuracy of about
75% of the students improved after using the system for a period of 2-3 months.
2.6.2 Using acoustic-phonetic information in real-life CALL applications
The acoustic-phonetic approach, which is the approach adopted in this study, is not frequently
used as a technique to detect pronunciation errors. Most of the existing methods use scores such
as those described above to evaluate non-native speech. Some projects or systems that adopt approaches resembling the acoustic-phonetic approach use raw acoustic data to provide feedback by displaying waveforms, spectrograms, energy or intonation contours. However, a substantial difference with our acoustic-phonetic approach is that, in those methods, no actual assessment based on acoustic-phonetic data is carried out.
The VICK system (Nouza, 1998) displays user-friendly visual patterns formed from the student’s speech (single words and short phrases) and compares them to reference utterances. Different types of parameters of the same signal are available for visualization, e.g. the time waveform, the spectrogram, the energy or F0 contours, vowel plots, diagrams or phonetic labels. Feedback on the student’s pronunciation is given by showing and pointing out deviations in a difference panel that indicates the parts of the utterance with major differences between the trainee’s attempt and the reference. The VICK system uses two classifiers for the automatic evaluation of speech: primarily
a DTW (Dynamic Time Warping) classifier is used (Nouza, 1998). The distance between the
utterance and the reference is evaluated for the whole set of features or for a specific feature subset
such as log energy or F0. The evaluation is based on means and variances computed from the
scores achieved with the reference speakers.
In the SPELL project (Hiller et al., 1994), different modules teaching consonants, vowel quality, rhythm and intonation are characterized by an acoustic similarity metric used to evaluate the
pronunciation of a student. For instance, for the rhythm module, duration and vowel quality are
used as acoustic parameters. The vowel teaching module uses a set of acoustically-based vowel
targets which are derived from a set of vowel tokens produced by a group of native speakers.
First, a student’s vowel token is analyzed to produce estimates of the formants and pitch. After a
normalization procedure, these acoustic parameters are then used to provide feedback in a graphical display for the student.

Figure 2.3: An example of the VICK screen (from Nouza, 1998).

In the display, an elliptic vowel target for the vowel and the position of
the user’s attempt is shown. The vowel similarity metric decides whether the user’s vowel token
falls within this target vowel space. The consonant module uses a rather different analysis. A
list of pronunciation errors in consonant production by non-native speakers of English was first
made and ranked according to their expected effect on intelligibility. Substitutions were one of
the most frequent consonantal errors. These errors are detected in SPELL by using a simplified
speech recognition technique. Each utterance has a specified phonetic sequence containing the
desired sequence of segments and the likely substitutions (errors) which the student might make.
The errors produced by the student are then detected by the choices the speech recognizer made
in recognizing the utterance.
WinPitch LTL (Germain-Rutherford & Martin, 2000) is another system that provides feedback
by visualizing acoustic data. Learners can visualize the pitch curve, the intensity curve and the
waveform of their own recorded speech. A useful feature of this system is speech synthesis: for
instance, students can hear the correct prosodic contours produced with the students’ own voice
and comparisons of prosodic patterns can be made between the students’ recorded and synthesized
segments. The system offers other user-friendly functions as well, such as many edit-functions to
facilitate the learning process. A major disadvantage of this system, however, is that it does not include ASR: thus no automatic check of the contents of the student’s utterance is available.
Therefore, a teacher is required to do this (e.g. produce the phonetic transcription of the utterance)
and to explain to the students what the meaning is of the various acoustic analyses.

Figure 2.4: Examples of the SPELL screen (from Hiller et al., 1994).
A general problem with CAPT systems that visualize acoustic data to give feedback to language learners is that some training in reading and understanding the displays is required beforehand, and that in some cases a teacher is still needed. Furthermore, matching visual displays is not always recommended; for instance, it is known that matching acoustic waveforms is not very helpful. Consequently, visualizing acoustic data can be tricky, and this kind of data should therefore be used with care. Although these applications use acoustic information, no actual assessment of pronunciation based on acoustic information is carried out. The acoustic-phonetic approach adopted in this study, described in the next chapter (chapter 3), will use specific acoustic-phonetic information to evaluate non-native pronunciation.
Chapter 3
The approach adopted in the present study
3.1 Introduction
In this chapter, the approach adopted in this study is presented. Section 3.2 presents our acoustic-phonetic approach to automatic detection of pronunciation errors. Section 3.3 describes our goal
of pronunciation teaching (section 3.3.1) and explains how the pronunciation errors addressed in
the present study were selected (section 3.3.2).
3.2 An acoustic-phonetic approach to automatic pronunciation error detection
In this study we propose an acoustic-phonetic approach to automatic pronunciation error detection
(at phoneme level) that differs from the approaches illustrated so far. Earlier, I motivated the
choice for an acoustic-phonetic approach by pointing out two related disadvantages of the more
frequently used ASR-based techniques: 1) the average human-machine correlation of ASR-based
techniques is low, especially at phone level, and furthermore 2) the scores are all computed in the
same way for each phone without taking into consideration specific acoustic properties.
The acoustic-phonetic approach enables us to be more specific and thereby, we hope to achieve
higher error detection accuracy and higher human-machine agreement (at phone level). More
specificity is achieved by developing individual classifiers for each pronunciation error by examining
specific acoustic-phonetic differences between the correct and the mispronounced sound, and by
using these specific acoustic-phonetic features to develop classifiers. Moreover, these classifiers will
be gender-dependent so that each classifier is optimally adapted to male or female voices.
The classifiers were developed in three phases:
First, a decision had to be made on which pronunciation errors to target (section 3.3). There
are different types of pronunciation errors such as misplaced stress errors or intonation errors
(section 2.3). In this study we address segmental pronunciation errors (phoneme level). A survey of
segmental pronunciation errors was carried out on a non-native speech database that was annotated
on pronunciation errors, to determine the frequency of each error. The pronunciation errors that
are addressed in this study were selected by taking into consideration the criteria proposed by Neri
et al. (2002).
Second, a small acoustic study on these selected pronunciation errors was conducted to find
reliable acoustic-phonetic features that are able to discriminate between a specific correct native-like sound and an incorrect non-native sound. These acoustic-phonetic features form the basis of the
classifier that will be trained and tested in the third phase.
Third, a classification technique was used to train and test the classifiers. The classifiers were
trained to classify a sound as either correctly or incorrectly pronounced (i.e. a binary decision). For
all selected pronunciation errors a statistical classification technique, Linear Discriminant Analysis
(LDA), was used. In addition to LDA, another classification method, which was developed by
Weigelt et al. (1990) and rewritten by myself into a decision tree, was used for one of our selected
pronunciation errors (/x/ vs /k,g/). However, the main focus was on the LDA classifiers. Experiments with these LDA classifiers showed which acoustic features are most powerful. Furthermore,
experiments were carried out to examine how the classifiers performed under different training and
testing conditions: native and non-native speech material from different speech databases (section
4.2) was available for training and testing. The non-native speech material is annotated on pronunciation errors; therefore the level of classification accuracy achieved with this material represents
the level of agreement with human judgements. Finally, the collected pronunciation errors were fed
into the classifiers to see whether these classifiers were able to detect the errors.
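To make the third phase more concrete, the sketch below shows what an LDA-based classifier for one error (here /A/ mispronounced as /a:/) might look like using scikit-learn. The feature set (duration, F1, F2) and the training data are placeholders only; the actual features and speech material are described in chapters 4 and 5.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    # placeholder feature vectors [duration (s), F1 (Hz), F2 (Hz)]:
    # class 0 = correctly pronounced /A/, class 1 = realizations mispronounced as /a:/
    X_A  = np.column_stack([rng.normal(0.09, 0.02, 100), rng.normal(700, 60, 100), rng.normal(1100, 90, 100)])
    X_aa = np.column_stack([rng.normal(0.16, 0.03, 100), rng.normal(750, 60, 100), rng.normal(1300, 90, 100)])
    X = np.vstack([X_A, X_aa])
    y = np.array([0] * 100 + [1] * 100)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print("training accuracy:", lda.score(X, y))
    print("feature weights:  ", lda.coef_)   # weights of the linear discriminant function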
This acoustic-phonetic approach enables us to develop more specific pronunciation error detection techniques, and thereby we hope to gain higher error detection accuracy and higher agreement
between machine scores and human judgements than is the case with ASR-based pronunciation
error detection techniques.
3.3 Selecting pronunciation errors
Before developing the classifiers, a decision had to be made on what pronunciation errors to focus
on in this study.
3.3.1 Goal of pronunciation teaching adopted in this study
The literature reports on different goals of pronunciation teaching (see section 2.4), but what is the
goal of pronunciation teaching adopted in the present study? Although the relationship between
foreign accent and comprehensibility or intelligibility is still not clear, some studies (e.g. Munro &
Derwing, 1995) have shown that foreign accent does not necessarily cause second language speech
to be low in comprehensibility or intelligibility. Therefore, we do not aim at teaching accent-free
speech; rather, we aim at intelligibility of speech, following Abercrombie (1949), who argued that
most language learners need no more than comfortable intelligibility.
3.3.2 The pronunciation errors addressed in this study
This study concentrates on pronunciation errors made on a segmental level: large deviations
in the segmental quality of vowels and consonants between native and non-native speech will be
examined. How does one decide which pronunciation errors to detect in a CAPT application?
Neri et al. (2002) have suggested at least four criteria for selecting pronunciation errors in CAPT
applications:
1. Error frequency: addressing frequent errors is more likely to improve communication significantly.
2. Error persistence: one should not put effort into pronunciation errors that simply disappear
through exposure to the L2.
3. Perceptual relevance: only those errors that are perceptually relevant according to native
speakers should be targeted; errors that are perceptually less “disturbing” are less important.
4. Robustness of error detection: only errors that can be reliably detected with the current
technology should be addressed, otherwise the CAPT system is more likely to do harm than
good.
A part of a non-native speech database (see subsection 4.2) was annotated for pronunciation
errors. This annotated non-native speech material (referred to as DL2N1-NN) is part of a larger
non-native speech database that consists of 60 non-native speakers with varying L1 (this database
is described in more detail in subsection 4.2). In total, the speech of 31 speakers (12 male and 19
female) was annotated, most of whom had received a low score on pronunciation proficiency from
expert listeners and were therefore more likely to produce a relatively high number of pronunciation
errors. The counts of the pronunciation errors were based on this annotated part of the non-native
speech database. Each speaker read aloud sets of five sentences (26 read sets 1 and 2, 3 read only
set 2, 2 read only set 1), resulting in a total of 285 read sentences.
All errors were counted and sorted by frequency to determine which errors were most frequent.
Some of the pronunciation errors most frequently made by the non-native speakers in our annotated
database can be seen in table 3.1 (vowels) and table 3.2 (consonants).1
rank   vowel   error                                 #errors/total of all pron. errors
1.     /A/     mispronounced as /a:/                 10.9%
2.     /Y/     mispronounced as /u/ or /y/           7.7%
3.     /@/     mispronounced as /e/ or /E/ or /A/    6.9%

Table 3.1: Ranking of vowel pronunciation errors, based on our survey
rank   consonant   error                                     #errors/total of all pron. errors
1.     /x/         mispronounced as /k/ or /g/ or /h/        6.5%
2.     /N/         mispronounced as /Nk/ or /Ng/ or /Nx/     6.2%
3.     /r/         mispronounced as /l/ or /6/               5.9%

Table 3.2: Ranking of consonant pronunciation errors, based on our survey
Furthermore, the survey showed that non-native speakers frequently produced schwa-insertions
following a consonant at the end of the syllable.
1 The last column is computed as follows: the number of occurrences of the error in question is divided
by the total of all counted pronunciation errors.
error                                               #errors/total of all pron. errors
/@/-insertion after consonant at end of syllable    6.8%

Table 3.3: Frequency of illegal schwa-insertions.
Many of these pronunciation errors in Dutch were also found in a study by De Graaf (1986), who
did a survey on pronunciation errors that were made by Japanese learners of Dutch. He also found
pronunciation errors that concerned vowel quantity, illegal schwa-insertions and mispronunciations
of /x/ as /k/ or /g/.
Our survey showed that in general more errors were made in the pronunciation of vowels than
in the pronunciation of consonants. This may be explained by the fact that Dutch has a relatively
rich vowel system: Dutch has 13 monophthongal vowels as opposed to, e.g., Japanese, which has
only 5 vowels. Many of these pronunciation errors can be explained by differences between language
systems, as is described in section 2.3.
Based on this survey, we selected three pronunciation errors according to the four criteria proposed
by Neri et al. (2002): two vowel errors (table 3.4) and one consonant error (table 3.5).2
For a more detailed description of this survey on pronunciation errors, see Neri et al. (2004).
   Pronunciation errors: vowels
1. /A/ mispronounced as /a:/
2. /Y/ mispronounced as /u/ or /y/

Table 3.4: The pronunciation errors addressed in this study: vowels.
   Pronunciation errors: consonants
1. /x/ mispronounced as /k/ or /g/ (or /h/)

Table 3.5: The pronunciation errors addressed in this study: consonants.
2 The mispronunciation of /x/ as /h/ was considered to be of less importance because native speakers of some
regional varieties of Dutch also produce this “error”; it was therefore left out of the analysis.
Chapter 4
Material & Method
4.1 Introduction
In this chapter, a description of the speech databases used in this study is given. Descriptions of
three different corpora are given in section 4.2. Subsequently, the two classification techniques used
in this study are described in section 4.3: the Linear Discriminant Analysis method in subsection
4.3.1 and the decision tree-based method in subsection 4.3.2.
4.2 Material
In this study we use three different speech corpora; a description of each corpus will be given below.
1. IFA corpus (IFA)
The IFA corpus contains hand-segmented speech from 8 Dutch speakers, 4 men and 4 women, in a
variety of speaking styles, e.g. informal story telling, a narrative story and lists of selected words
(for a more extended description of the IFA corpus, see Van Son et al., 2001). The speech data was
first automatically segmented by an HMM-based automatic speech recognizer that time-aligned the speech
files with a canonical phonemic transcription by using the Viterbi algorithm. These automatically
generated phoneme labels and boundaries were then checked and, if necessary, adjusted by human
transcribers. The audio recordings are available in AIFC format (44.1 kHz sample rate) and the
segmentation results are stored in the label-file format of the PRAAT program. The corpus is
freely available and can be accessed online: http://www.fon.hum.uva.nl/IFAcorpus.
For our purpose, we used the speaking style Sentence, where a random list of isolated sentences
of the narrative stories was read aloud from a cueing screen. The sentences were taken from texts
that were based on a known story and a fairy tale. This speaking style was chosen because this
type of speech was similar to the speaking styles of the other corpora. We selected 6 speakers:
the 3 oldest male and the 3 oldest female speakers, because the youngest male and female speakers
were not considered to fall within the age group of our CAPT system. The audio and label files were
then downloaded for each selected phoneme (see table 4.1 and appendix E for numbers of used
phonemes). The selected speech data from the IFA corpus used in this study will be referred to as
IFA .
2. DL2N1 corpus (DL2N1)
The DL2N1 corpus (Dutch as second language, Nijmegen corpus 1) contains recorded speech from
60 non-native speakers and 20 native speakers of Dutch (Cucchiarini et al., 2000). The 60 non-native speakers (40 female and 20 male) all lived in The Netherlands and had attended (or were
attending) a course in Dutch as a second language. This group was sufficiently varied with respect
to mother tongue1 , proficiency level and gender. The group of 20 native speakers consisted of 4
speakers of Standard Dutch and 16 speakers of a regional variety of Dutch. Each speaker read
aloud ten different phonetically rich sentences (see appendix D) over the telephone: subjects called
from their homes and were recorded by the recording system that was connected to an ISDN line.
The audio files are available as A-law files with an 8 kHz sample rate. Since the speech is recorded via a
telephone line and the sample rate is 8 kHz, the settings for the automatic measurements had to
be adjusted (for settings in scripts, see appendix C).
An automatic phone segmentation was obtained from the HTK speech recognizer (which was
trained on Polyphone data, see Den Os et al., 1995) by the Viterbi algorithm. Forced recognition
was applied: a given canonical phoneme transcription is time-aligned with the utterance. The
phoneme segmentations were then converted to a format (TextGrid) that is readable in PRAAT.
Annotations of pronunciation errors were made by phonetically trained expert human listeners
(by auditory analyses); in total, the speech of 12 male (a total of 200 phonetically rich sentences)
1 E.g. Arabic, Turkish, Chinese, Spanish, Italian, Russian, English, German, French, Swedish.
and 19 female non-native speakers (a total of 245 phonetically rich sentences) was annotated. For
this study we only used this annotated speech of the non-native part of the DL2N1 speech data
(which will be referred to as DL2N1-NN ).
The speech of all native speakers of Dutch was used: 8 male (a total of 80 sentences) and
12 female speakers (a total of 120 sentences). Those speakers were assumed to have a correct
pronunciation of Dutch and their speech was therefore not annotated (this native part of DL2N1
speech data will be referred to as DL2N1-Nat ).
3. TRIEST corpus (TRIEST)
The TRIEST corpus consists of speech recordings of 22 speakers, of whom 5 are male and 17 are
female. All speakers were Italian students at the University of Trieste, in Italy, and were learning
or had learned Dutch as a foreign language. Each speaker read aloud the same 20 phonetically
rich sentences that were used in the CITO corpus. The recordings consist of wav-files that are
sampled at 16 kHz. The automatic segmentation was obtained from the HTK speech recognizer
trained with acoustic models (how this HTK speech recognizer was trained can be read in Van
Bael et al., 2003) by using the Viterbi algorithm. Forced recognition was applied to a given phone
transcription of the utterance to obtain the automatic phone segmentation. The segmentations
were converted to a format (TextGrid) that was readable in PRAAT.
The speech in this corpus (10 sentences for each speaker) was annotated by two or three
phoneticians, which gave a total of 50 annotated sentences for the male speakers and 170 annotated
sentences for the female speakers.
Ideally, one would like to train classifiers with correctly pronounced and mispronounced
sounds to discriminate between a correct and an incorrect sound. However, since in the non-native
annotated material the number of realizations of /a:/, /u/, /y/, /k/ and /g/ that result from
pronunciation errors was too low to train and test acoustic-phonetic classifiers, we decided to
study how well the classifiers can discriminate /A/, /Y/ and /x/ from correct realizations of /a:/,
/u, y/ and /k/, respectively. Thus, all classifiers investigated in this study were trained on tokens
that were considered as pronounced correctly (for total numbers of tokens used in this study, see
table 4.1). We did not include the /g/, since this sound is uncommon in Dutch and therefore we
did not have enough training material.
        DL2N1-Nat                               DL2N1-NN
        Male                Female              Male                Female
        Training  Test      Training  Test      Training  Test      Training  Test
/A/     110       36        170       57        199       66        279       93
/a:/    70        23        107       36        172       57        241       80
/Y/     23        8         36        12        53        17        50        17
/u/     30        10        45        15        143       48        59        20
/y/     24        8         36        12        57        17        56        19
/x/     84        28        127       42        116       39        195       65
/k/     89        30        126       42        122       40        187       62

        IFA                                     TRIEST
        Male                Female              Male                Female
        Training  Test      Training  Test      Training  Test      Training  Test
/A/     212       71        324       108       61        20        215       72
/a:/    140       47        210       70        40        13        136       45
/Y/     61        20        93        31        21        7         77        26
/u/     143       48        205       68        17        6         59        20
/y/     33        11        55        18        23        7         43        14
/x/     213       71        333       111       45        15        149       49
/k/     181       60        270       90        45        15        156       52

Table 4.1: Numbers of phonemes used for training and testing classifiers
4.3 Algorithms used in this study

4.3.1 Linear Discriminant Analysis
In this section, I will describe in short how Linear Discriminant Analysis (referred to as LDA )
works and how it is used in this study to separate /A/ from /a:/, /Y/ from /u,y/ and /x/ from
/k/. LDA is a statistical analysis which is often used to investigate whether and how differences
between several groups exist. For example, if one would like to know how the distinction is made
between readers of different newspapers or magazines based on e.g. educational level, age, income,
profession, then LDA can be used to analyze these variables and to predict what newspaper or
magazine is read by a person based on the variables mentioned above. LDA transforms the variables
to new variables, which are called discriminant scores, that are linear combinations of the old ones
(see formula below), in such a way that the distance between these groups is maximized. Or, to
put it differently, the method maximizes the ratio of the between-class variance (which ideally should
be very large) to the within-class variance (which ideally should be very small). The old variables
are projected orthogonally on a discriminant axis (a rotated axis in the original space) by the
discriminant function Z. For n number of groups, the LDA produces n − 1 number of discriminant
functions of which the first one is always the one with the highest discriminative power.
Z = a + W1·X1 + W2·X2 + ... + Wk·Xk

where
Z  = discriminant score
a  = discriminant constant
Wk = discriminant weight or coefficient
Xk = predictor variable
For each new case that has to be classified, a discriminant score is computed and group membership is determined based on the obtained discriminant score. One way of classifying new cases
into groups is to check whether their discriminant scores are above or below a certain threshold
value, the cutting point, to determine to which group the new case belongs. If the groups on which
the LDA is trained are of the same size, then the cutting point is simply the average of the mean
discriminant scores (group centroids) of the groups. If the groups are unequal, the optimal cutting
point is the weighted average of the two mean discriminant scores.
Example for two groups:

when n1 = n2:   Zcutting = (Z̄1 + Z̄2) / 2
when n1 ≠ n2:   Zcutting = (n1·Z̄1 + n2·Z̄2) / (n1 + n2)

where Z̄j = mean discriminant score for group j
The working of LDA can best be illustrated by an example. Suppose we would like to discriminate two groups, A and B, from each other based on two variables.
group   X1     X2     disc. score        group   X1     X2     disc. score
A      -2.5    1.2     3.98              B      -1.5   -2.8    -2.41
A      -2.5    0.2     2.61              B       0.5   -0.8    -1.56
A       0.5    2.2     2.53              B       0.5   -1.8    -2.92
A       0.5    1.2     1.17              B       1.5   -0.8    -2.49
A      -0.5    0.2     0.74              B       3.5    1.2    -1.64
Because there are only two groups to separate, the LDA will deliver one discriminant function
if the LDA function in SPSS (or PRAAT) is run. The discriminant scores were obtained by the
following discriminant function.
Z = −0.936X1 + 1.362X2
In this case, the discriminant coefficients define a transformation: they determine a line (vector)
onto which the cases are projected orthogonally (see fig. 4.1). It is easy to see that a cutting point
somewhere on this line is able to separate the two groups from each other.
Figure 4.1: An example of a discriminant axis (the cases of groups A and B plotted in the (x1, x2)
plane, with the discriminant axis onto which they are projected).
Finding an optimal cutting score is one way of classifying new cases in LDA, but there are
other ways to determine group membership. Group membership can be determined by calculating
the probability of a case being in one group or the other. This is accomplished by calculating the
posterior probability of group membership using Bayes’ Rule in combination with the calculation
of the distances (Mahalanobis distances) between the discriminant scores and the mean discriminant
score of a group (group centroid).
P(Gi | D) = P(D | Gi) · P(Gi) / Σj P(D | Gj) · P(Gj)

where:
D = discriminant score Z
P(Gi | D) = posterior probability that a case is in group i, given a specific discriminant score D
P(D | Gi) = conditional probability that a case has a discriminant score of D, given that it is in group i
P(Gi) = prior probability that a case is in group i
Given two groups i and j, the classification rule can be formulated as follows: if P(Gi | D) >
P(Gj | D), then the case is classified in group i. This classification method is used in PRAAT
(http://www.praat.org) and SPSS.
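To make the above concrete, the following is a minimal sketch (not part of this study, which used the LDA implementations of PRAAT and SPSS) that trains an LDA with scikit-learn on the toy data of groups A and B shown earlier and classifies the cases via posterior probabilities; note that scikit-learn scales its discriminant scores differently from the coefficients given above.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data from the example above: two predictor variables X1 and X2.
X = np.array([
    [-2.5, 1.2], [-2.5, 0.2], [0.5, 2.2], [0.5, 1.2], [-0.5, 0.2],    # group A
    [-1.5, -2.8], [0.5, -0.8], [0.5, -1.8], [1.5, -0.8], [3.5, 1.2],  # group B
])
y = np.array(["A"] * 5 + ["B"] * 5)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With two groups, LDA yields a single discriminant function.
print("discriminant scores:", lda.transform(X).ravel())
# Classification via the posterior probabilities P(Gi | D), as in PRAAT and SPSS.
print("posteriors:\n", lda.predict_proba(X))
print("predicted groups:", lda.predict(X))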
How can we determine the degree of separation between the groups? LDA offers several ways of
determining how well the model discriminates between groups. One way to determine the degree
of separation between groups is to compute the mean discriminant score for each group: the group
centroid. If the absolute difference between the group centroids is large, the degree of separation
is high. The eigenvalue (a term from matrix algebra) is also used to determine the discriminative
power of the model: the larger the eigenvalue, the higher the discriminative power of the model.
Wilks’ Lambda can be used as well to measure how well each function separates cases into groups:
smaller values of Wilks’ Lambda (Wilks’ Lambda ranges from 0 to 1) indicate larger discriminative
ability of the function.
One advantage of LDA is that the discriminative power of each predictive variable can be
examined. We can examine how the performance of the classifier changes if certain predictor
variables are added or removed step-wise. Another possibility is to use the statistics that LDA
provides: the weight of each predictor variable is reflected in the standardized discriminant
function coefficients and in the Wilks’ Lambda for each variable. The higher the absolute value of the
standardized coefficient, the higher the discriminative power of this variable. The lower the Wilks’
Lambda (which is computed before the model is created), the higher the potential discriminative
power of the variable and the stronger the group differences; if its significance value is > 0.10,
the variable probably does not contribute to the model (eigenvalues, standardized discriminant
function coefficients and Wilks’ Lambda can all be provided by SPSS).
Furthermore, a stepwise LDA can be carried out to reveal which variables are useful in the
model and which ones are not. The stepwise process starts with a single variable and at each step
enters a new variable that minimizes the overall Wilks’ Lambda. This stepwise analysis guarantees
that only significant variables are entered into the model and that all variables in the model are
checked to assure that they remain significant as new variables are added.
When all data is processed by the LDA and decision regions have been established by feeding
the LDA data of the several groups (training phase), the accuracy of this trained LDA object
can be tested by feeding the object with new data that has not been used to train the classifier
(testing phase). The ratio for training and test data is often 75% training data and 25% test data.
One could also train and test with the same data, but then the classification results would be less
realistic. Another way of obtaining more realistic classification results by feeding the trained LDA
object with new data is known as the “jack knife” method (also known as the “leave-one-out”
method). Each time, the ith observation is removed from the training data and used as test data.
This procedure is repeated N times, where N is the total number of observations.
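As an illustration, the sketch below (again assuming scikit-learn rather than the tools actually used in this study) implements the leave-one-out procedure just described; X is a numpy array of feature vectors and y an array of labels.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

def leave_one_out_hit_ratio(X, y):
    # Each observation is removed once, the LDA is trained on the remaining
    # N-1 observations and tested on the held-out observation.
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        lda = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
        hits += int(lda.predict(X[test_idx])[0] == y[test_idx][0])
    return 100.0 * hits / len(y)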
The correct classification percentage (also called hit ratio) is based on how well the LDA has
classified all cases (phonemes). There are 4 types of classifications.
1. Correct Acceptance (CA): A phone was pronounced correctly and is classified as correct.
2. False Acceptance (FA): A phone was pronounced incorrectly and is classified as correct.
3. Correct Rejection (CR): A phone was pronounced incorrectly and is classified as incorrect.
4. False Rejection (FR): A phone was pronounced correctly and is classified as incorrect.
                            classified as correct    classified as incorrect
pronounced correctly        CA                       FR
pronounced incorrectly      FA                       CR
The correct classification percentage (hit ratio) is then computed as follows:
correct classification percentage = ((CA + CR) / (CA + FR + FA + CR)) × 100
To determine whether the model predicts any better than chance, we can use the Maximum
Chance Criterion (MCC). The MCC predicts that all cases are classified in the group with the
largest number of cases.
MCC = (nL / NL) × 100

where
nL = number of subjects in the larger of the two groups
NL = total number of subjects in the combined groups
Another criterion that is used is the Proportional Chance Criterion (Cpro), which classifies the
cases randomly, proportionately to the number of cases in either group:

Cpro = p² + (1 − p)²

where
p = proportion of subjects in one group
1 − p = proportion of subjects in the other group
If the hit ratio of the model surpasses the higher of these two criteria (MCC and Cpro), then the
model predicts better than chance. In appendix E, MCC and Cpro are shown for some classification
experiments carried out in this study.
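For reference, a minimal sketch of these evaluation measures (the counts in the usage example are hypothetical):

def hit_ratio(ca, fr, fa, cr):
    """Correct classification percentage."""
    return 100.0 * (ca + cr) / (ca + fr + fa + cr)

def maximum_chance_criterion(n_group1, n_group2):
    """MCC: percentage obtained by always choosing the larger group."""
    return 100.0 * max(n_group1, n_group2) / (n_group1 + n_group2)

def proportional_chance_criterion(n_group1, n_group2):
    """Cpro = p^2 + (1 - p)^2, expressed as a percentage."""
    p = n_group1 / (n_group1 + n_group2)
    return 100.0 * (p ** 2 + (1.0 - p) ** 2)

# Example: 170 correctly pronounced and 57 mispronounced test tokens
# (hypothetical counts), of which CA=150, FR=20, FA=15 and CR=42.
print(hit_ratio(150, 20, 15, 42))              # about 84.6%
print(maximum_chance_criterion(170, 57))       # about 74.9%
print(proportional_chance_criterion(170, 57))  # about 62.4%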
4.3.2 Decision tree-based

This type of classification algorithm is only used for the /x/-/k/ distinction. The algorithm was
developed by Weigelt et al. (1990) to distinguish voiceless plosives from voiceless fricatives, and was
reformulated by us as a classification tree. In fig. 4.2 an example is given of what a classification
tree can look like.
New cases are classified by starting at the top of the tree, evaluating the criteria one by one, and
ending at a leaf of the tree: this leaf indicates the group to which the case belongs. The values
for the “... > ...” thresholds in the criteria are determined by training the classification tree with
cases (thus phonemes in our case).
The classification accuracy is expressed in a correct classification percentage that is computed
the same way as in the LDA method: all correct classified cases divided by the total number of
cases gives the correct classification percentage.
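To illustrate the idea, the sketch below implements a hand-written binary classification tree of the kind shown in fig. 4.2. The feature names and threshold values are purely illustrative and are not the actual criteria of Weigelt et al. (1990): each node tests one acoustic measure against a threshold learned from training tokens; a failed test assigns class A, and only a case that passes all tests ends in B.

def classify(case, t1=0.5, t2=0.2, t3=0.8):
    # t1-t3 are thresholds that would be determined from training tokens.
    if not (case["feature_1"] > t1):
        return "A"
    if not (case["feature_2"] < t2):
        return "A"
    if not (case["feature_3"] > t3):
        return "A"
    return "B"

# Example usage with an illustrative feature vector.
print(classify({"feature_1": 0.7, "feature_2": 0.1, "feature_3": 0.9}))  # -> "B"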
Enter the tree
  ... > ... ?
    No:  It is A.
    Yes: ... < ... ?
           No:  It is A.
           Yes: ... > ... ?
                  No:  It is A.
                  Yes: It is B.

Figure 4.2: An example of a simple binary classification tree
Chapter 5
The pronunciation error detectors /A/-/a:/ and /Y/-/u,y/

5.1 Introduction
This chapter describes how the pronunciation error detectors for /A/ and /Y/ were developed.
First, a small acoustic study was carried out on vowels (section 5.2). How the selected acoustic
features were extracted from the signal is described in section 5.3. Subsequently, the classifiers
were trained and tested under different conditions; the experiments and their results are presented
in section 5.4. Finally, these results are discussed in section 5.5.
5.2 Acoustic characteristics of /A/, /a:/, /Y/, /u/ and /y/

5.2.1 General acoustic characteristics of vowels
Vowels distinguish themselves from other speech sounds in that they are produced by an air stream
which flows freely through the vocal tract without any obstruction. The air stream from the larynx
passes the vocal cords and produces air pressure which blows the vocal cords apart and let them fall
together again; this causes the vocal cords to vibrate (=voicing). In the vocal tract the air stream
coming from the larynx is (further) modified by the articulators, shaping the vocal tract so as to
produce the different (speech) sounds. The shape of the vocal tract determines which frequencies
are amplified and which are not. These amplified frequencies are visible in the spectrum as peaks
and are referred to as formants.
                                  unrounded lips     rounded lips
       tongue height              front              central    back
       high        tense(a)       /i/                /y/        /u/
       mid         lax            /I/                /Y/
                   tense          /e:/               /2:/       /o:/
                   lax            /E/                           /O/
       low                                           /a:/       /A/
                            ← tongue advancement →

Table 5.1: Description of Dutch monophthongal vowels by phonetic parameters
(from Rietveld & Van Heuven, 2001).

(a) Tense vowels are articulated somewhat more constricted than lax vowels. In Dutch, the not-high
[+tense] vowels are long and the [+lax] vowels are short, just like the high vowels (Rietveld & Van
Heuven, 2001).
Since the classical paper by Peterson and Barney (1952), the first three formants have been
regarded as the acoustic parameters to describe vowels. The first two formants are considered to
be the most important perceptually (Fant, 1960), while the third formant plays a supporting role.
It appears that formants convey important information about the identity of the vowel. Formant
frequencies relate to vowel articulation (table 5.1): a rough rule of thumb is that F1 varies mostly
with tongue height (low vowels have a high F1 and high vowels have a low F1) and F2 varies mostly
with tongue advancement (back vowels have a low F2 and front vowels have a relatively higher F2).
Economy is another advantage of formant patterns as descriptors of vowels. In most cases, only the
first three formants are sufficient to achieve good results in discriminating vowels from each other
(as can be seen in section 5.2.4). Another advantage of formant description is that the formants
are relatively easily visible in the acoustic analysis of speech.
Duration is often used as an additional feature to the formants. Dutch vowels have their own
intrinsic durations. In general, a long Dutch vowel /a:, e:, o:, i/ is approximately 50% longer than
its shorter counterpart /A, E, O, I/. Also, it takes more time to produce an open vowel than a
closed vowel: the jaw has to cover a longer distance. Other factors that influence vowel duration are
e.g. syllable stress, voicing of the following consonant, number of syllables in a word, and speaking
rate. It is known that stressed syllables tend to have a longer duration than unstressed syllables
and obviously, a higher speaking rate generally shortens vowel durations.
Figure 5.1: Vowel plot with mean formant values (F1 against F2) taken from the male IFA corpus,
speaking style Read Sentences.

Figure 5.2: Vowel plot with mean formant values (F1 against F2) taken from the female IFA corpus,
speaking style Read Sentences.

Fundamental frequency (F0) is another additional acoustic feature that is often included in the
set of descriptors. It is influenced by e.g. speaker emotion and intonation. Fundamental frequency
determines the distances between the spectral lines and influences the perceived vowel quality. A
general rule is that fundamental frequency varies with vowel height: high vowels have a somewhat
higher fundamental frequency than low vowels. This feature may play a secondary role in vowel
recognition.
Some researchers have proposed other descriptors as opposed to the more traditional formants:
the spectral shape as a whole can be used to distinguish vowels from each other because spectra
contain information in addition to formants. Bladon (1982) advanced several arguments against
a formant representation of speech and favored a representation based on gross spectral shape.
One of his objections to a formant-based representation was that a formant representation is an
incomplete spectral description.
5.2.2 Acoustic differences between /A/ and /a:/
In Dutch, the /A/-/a:/ pair belongs to a group of pairs in which a difference in duration, alongside
other differences, is supposed to be important for keeping the vowels apart (other short-long pairs
include e.g. /O/-/o:/ and /I/-/i/). These other differences may be vowel quality or
diphthongization of the long vowels (Nooteboom, 1972).
Small differences in F1 and F2 could also play a role: the vowel plots in fig. 5.1 and 5.2 show that
F1 is slightly lower for /A/ than for /a:/ and that F2 is also slightly lower for /A/ than for /a:/.
Duration differences are shown in the histograms in fig. 5.3 and 5.4. Although the means of
the two vowels show some overlap in duration, the histograms show that /A/ is generally shorter
than /a:/. So length is expected to be a distinctive feature for discriminating between /A/ and
/a:/. Differences in F1 and F2 may play a secondary role.
Figure 5.3: Left: histogram of durations of /A/ (N=283, mean=74.3, sd=28.5). Right: histogram
of durations of /a:/ (N=187, mean=115.2, sd=51.0). Durations (raw) taken from male IFA corpus,
speaking style Read Sentences.
5.2.3 Acoustic differences between /Y/ and /u,y/
The distinction between /Y/ and /u,y/ is somewhat less obvious than the distinction between /A/
and /a:/. In Dutch, /u/ and /y/ are both high vowels whereas /Y/ is a mid vowel (see table 5.1)
which could result in a slightly higher F1 (see fig. 5.1 and 5.2) for /Y/ than for /u,y/. A factor
that may complicate the distinction between /Y/ and /u,y/ is that /u/ and /y/ differ greatly from
each other on the F2 dimension: F2 is much higher for /y/ than for /u/ (see fig. 5.1 and 5.2).
Therefore, grouping /u/ and /y/ together as opposed to /Y/ to train the classifier might lower
the effectiveness of the classifier (the experiments in section 5.4.4 examine how the classifier performs
if /u/ and /y/ are not grouped together).
Figure 5.4: Left: histogram of durations of /A/ (N=432, mean=82.2, sd=25.5). Right: histogram
of durations of /a:/ (N=280, mean=139.4, sd=51.1). Durations (raw) taken from female IFA corpus,
speaking style Read Sentences.

Another dimension on which /Y/ can differ from /u/ and /y/ is length. Phonologically, /Y/ is
categorized as a short vowel and /u/ and /y/ are both categorized as long vowels. Phonetically, /u/
and /y/ are short except before /r/, but distributionally they behave as long vowels. Koopmans-Van Beinum (1980) seems to have found a compromise: she categorized /Y/ as short and both /u/
and /y/ as half-long. The mean durations in the IFA corpus of /Y, u, y/ (see fig. 5.5 and 5.6) seem
to fit the phonological categorization and the categorization proposed by Koopmans-Van Beinum
(1980).
5.2.4 Acoustic features for vowel classification: experiments in the literature
Numerous experiments have been carried out to classify a set of monophthongal vowels of a specific
language with acoustic features such as those described above: formants, duration and fundamental
frequency.
In Hillenbrand et al. (1993) a quadratic discriminant classification technique was used to classify
English vowels (spoken in a hVd context) with spectral measurements consisting of F0, F1, F2
and F3. Results showed that error rates were relatively high (correct classification 74.9%) when
classification was based on F1 and F2 alone. The addition of either F0 or F3 to [F1 F2] resulted
in a substantial improvement in performance (correct classification approximately 86.6%). Also,
they concluded that there was no advantage for any of the linear transforms over a linear frequency
scale.

Figure 5.5: Left: histogram of durations of /Y/ (N=81, mean=61.2, sd=16.9). Middle: histogram
of durations of /u/ (N=191, mean=85.1, sd=48.3). Right: histogram of durations of /y/ (N=44,
mean=92.3, sd=29.7). Durations (raw) taken from male IFA corpus, speaking style Read Sentences.

Figure 5.6: Left: histogram of durations of /Y/ (N=124, mean=69.5, sd=17.8). Middle: histogram
of durations of /u/ (N=273, mean=113.5, sd=48.3). Right: histogram of durations of /y/ (N=73,
mean=103.2, sd=34.8). Durations (raw) taken from female IFA corpus, speaking style Read Sentences.
In 1995, Hillenbrand et al. replicated and extended the classical study of vowel acoustics by Peterson and Barney (1952) by adding duration and information about the pattern of spectral change
over time to the measurements. There is evidence that dynamic properties such as duration and
spectral change play an important role in vowel classification (Assmann et al., 1982; Nearey et al.,
1986). Assmann et al. (1982) used formant frequencies (converted to natural log values), formant
slopes and vowel durations to classify isolated vowels. The slopes of F1 and F2 were calculated by
measuring initial and final F1 and F2 values, and dividing the difference between the initial and
final measurements by the time interval between these measurements. These formant slopes are
especially useful in identifying diphthongs since they are associated with formant change over the
course of a vowel. The classification results obtained with discriminant analyses improved strongly
when formant slopes and duration were included as features; this demonstrated that measures of
dynamic information (formant slopes and duration) reduce the degree of overlap between vowel
categories. Nearey et al. (1986) extended the study by Assmann et al. (1982) and took initial (as
early as possible in the vowel) and final (as late as possible in the vowel) formant measurements
for F1 and F2 to represent the spectral change in isolated vowels. It appeared that not only diphthongs showed significant formant movements; monophthongs showed formant movements as well.
Discriminant analysis was carried out on log frequency values of initial and final measurements of
F1 and F2, including F0. The overall classification rate was very high (94%) showing that isolated
vowels could be adequately specified by formant frequency measurements from two brief sections
near the beginning and end of the vowel.
As a result of these findings, Hillenbrand et al. (1995) added vowel duration and formant
measurements, sampled at 20%, 50% and 80% of vowel duration, to the measurements obtained
from the Peterson and Barney (1952) study. A quadratic discriminant analysis was applied to the
data to determine how well the American English vowels could be separated from each other based
on various combinations of the acoustic measurements. Classification rates for analyses based on
static measures of F1 and F2 were low (68.2%), but improved when vowel duration was included
(76.1%). In fact, including vowel duration resulted in a consistent improvement for every feature
set. However, the biggest effect was seen when two samples (at 20% and 80% of vowel duration) of
the formant pattern were used instead of one sample (at 50% of vowel duration). Adding a third
sample produced little or no improvement in the classification results.
Some researchers have studied the use of features other than formants in vowel classification
experiments: Zahorian et al. (1993) studied the use of spectral shape features. They assumed
(just like Bladon, 1982) that spectral shape features (cepstral coefficients), which encode the global
smoothed spectrum, provide a more complete spectral description than formants do. Their conclusion was that both global-shape spectral features and formants are adequate parameters for vowels.
The advantage of formants is that a large amount of information is contained in a few features and
the advantage of using spectral shape features is that classification accuracy is better than using
formants.
In summary, it seems clear that vowels can be correctly classified by their formants. In general,
F1 and F2 alone are not sufficient; F1 and F2 should be supplemented with either F0 or F3 to achieve
a substantial improvement in classification (Hillenbrand et al., 1993). Large improvements were
seen when dynamic information was included: adding multiple samples of formant measurements
and adding vowel duration resulted in high classification accuracy (Hillenbrand et al., 1995; Nearey
et al., 1986).
We will use the results from these studies to carry out our own vowel classification experiments:
formants, fundamental frequency and dynamic information as acoustic features of vowels will be
examined in the present study as well. However, the vowel classification experiments that have
been presented in this section, differ from the experiments in the present study which in some
cases might make the classification task more difficult: a) we do not need to classify a set of
monophthongal vowels, but we need to separate the vowels in each vowel pair from each other, b)
the material used in this study does not only contain native speech but also non-native speech,
c) the vowels in this study were taken from read sentences in different contexts (rather than fixed
contexts such as /hVd/ or isolated phonemes), and d) we rely on automated procedures such as
automatic measurements and automatic phone segmentations.
5.3 Method & acoustic measurements
Linear Discriminant Analysis is used as a detection algorithm for the two vowel pronunciation error
detectors. The inputs to the LDA are the acoustic features that were discussed earlier (section 5.2.4)
and which are expected to be able to separate the vowels in each vowel pair from each other.
How were these acoustic features extracted from the speech signal? For the IFA database, acoustic measurements were already available on the internet (http://www.fon.hum.uva.nl/IFAcorpus)
and these had been extracted automatically by PRAAT (for parameter settings of the script, see appendix C). These acoustic measurements provided by IFA were used, so new acoustic measurements
were only taken for the DL2N1 and TRIEST material.
The acoustic measurements for vowels from DL2N1 and TRIEST were all carried out automatically by PRAAT. A PRAAT-script (based on a script by Katherine Crosswhite, adjusted by
Joop Kerkhoff and myself) first localizes the specified vowels by retrieving the automatic phone
segmentation obtained from the speech recognizer and then carries out the acoustic measurements.
Pitch, F1, F2 and F3 were all measured at 50% of vowel duration. This point was chosen because
the midpoint of the vowel is often seen as a point in time where the vowel is said to behave most
characteristically, assuming that the phone segmentation is correct. Additionally, F1, F2 and F3
were measured at multiple points, 25% and 75%, of vowel duration to capture dynamic characteristics which have proven to be useful in many studies (see Hillenbrand et al., 1995; Nearey et al.,
1986).
Some problems were encountered when pitch was extracted at the midpoint: in some cases
PRAAT would not return a pitch value. These cases were examined and it was found that a bad
segmentation or a too short duration of the segmented vowel (<35 ms) could cause these pitch
values to be undefined (for the precise working of the pitch algorithm, see the manual of PRAAT,
http://www.praat.org). This problem was partly solved by letting PRAAT shift the measurement
point from 50% (the midpoint) to 55%, 60% etc. of vowel duration (and to 45%, 40% etc.). If pitch
was still undefined, PRAAT calculated the mean pitch of the vowel segment. Finally, if the pitch
value was still undefined, the vowel segment was excluded from further analysis, which occurred
less than ten times for each database.
F1, F2 and F3 were computed by an LPC-analysis (Burg-method) with different parameter
values depending on gender and dataset. It is known that female speakers have higher formants
than male speakers; therefore the script is set to analyze five formants in the region up to 5000 Hz
for males and up to 5500 Hz for females (following the advice given in the manual of PRAAT).
However, this was only valid for the TRIEST material where the recorded speech material consists
of broadband speech. In the case of the DL2N1 material, which is recorded over the telephone,
adjustments had to be made in the parameter settings: the script is set to analyze four formants in
the region up to 4000 Hz for both female and male speakers. This setting seemed to work best. For
instance, we also tried to analyze three formants in the region up to 3300 Hz, but this seemed to
work less well: distortions were visible in the plotted formant tracks and PRAAT returned many
undefined values for F3 measurements.
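As a rough illustration of this kind of measurement, the sketch below uses the parselmouth Python interface to PRAAT; this is an assumption made for illustration only, since the actual measurements in this study were taken with a PRAAT script (see appendix C). The file name and segment times are hypothetical.

import parselmouth
from parselmouth.praat import call

def measure_vowel(wav_path, t_start, t_end, gender="female", telephone=False):
    snd = parselmouth.Sound(wav_path)
    # Formant ceiling: 5000 Hz (male) / 5500 Hz (female) for broadband speech,
    # four formants up to 4000 Hz for 8 kHz telephone speech (DL2N1-like data).
    if telephone:
        n_formants, ceiling = 4, 4000
    else:
        n_formants, ceiling = (5, 5000) if gender == "male" else (5, 5500)
    formant = call(snd, "To Formant (burg)", 0.0, n_formants, ceiling, 0.025, 50)
    pitch = call(snd, "To Pitch", 0.0, 75, 600)

    duration = t_end - t_start
    points = {p: t_start + p * duration for p in (0.25, 0.50, 0.75)}
    feats = {"duration": duration}
    for p, t in points.items():
        for i in (1, 2, 3):
            feats["F%d_%d" % (i, int(p * 100))] = call(
                formant, "Get value at time", i, t, "hertz", "Linear")
    feats["F0"] = call(pitch, "Get value at time", points[0.50], "Hertz", "Linear")
    return feats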
For pitch, F1, F2 and F3 no further normalization procedures for e.g. gender or speaker (such
as transformations to log or Bark scales) were carried out because all female and male speech was
kept apart from each other during all analyses so the detectors could be trained optimally on male
or female voices. Also, in the final application the gender of the user is first identified.
Duration is simply the raw duration of the vowel segment in seconds as segmented by the speech
recognizer. In this case, a normalization procedure was applied to correct for speech rate, which is
not unimportant because we are dealing with native and non-native speech. Durations of vowels
are longer if the speech rate is low and speech rate is generally lower for non-native speakers than
for native speakers (Cucchiarini et al., 2000) because they are less fluent than native speakers.
Therefore to prevent our analyses from being biased (namely that non-native vowels are always
longer than native vowels) a normalization procedure was applied. Articulation rate per speaker
as defined in Cucchiarini et al. (2000) was used to normalize vowel durations.
articulation rate = number of phonemes / total duration of speech without internal pauses
For now, articulation rates per speaker were used. An alternative is to use articulation rate per
utterance, but this was not tested. Normalized vowel duration is then obtained by multiplying the
articulation rate with the raw vowel duration.
normalized vowel duration = articulation rate × raw vowel duration
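A minimal sketch of this normalization (with hypothetical numbers):

def articulation_rate(n_phonemes, speech_duration_without_pauses):
    # Phonemes per second of speech, internal pauses excluded.
    return n_phonemes / speech_duration_without_pauses

def normalized_duration(raw_vowel_duration, rate):
    return rate * raw_vowel_duration

# Example: a speaker producing 300 phonemes in 25 s of speech (12 phonemes/s),
# and a vowel segment of 80 ms.
rate = articulation_rate(300, 25.0)       # 12.0
print(normalized_duration(0.080, rate))   # 0.96 (dimensionless)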
To sum up, a total of 12 variables was made available for the LDA: F0, F1 25 (measured at 25%
of vowel duration), F1 (measured at 50% of vowel duration), F1 75 (measured at 75% of vowel
duration), F2 25, F2, F2 75, F3 25, F3, F3 75, duration and normalized duration.
Automatic phone segmentation by HTK
  ↓
Feature extraction by PRAAT
  ↓
Feature vector: {F0, F1, F2, F3, duration, etc.}
  ↓
Train LDA classifier with these feature vectors in PRAAT
  ↓
Trained LDA classifier classifies new cases in PRAAT

Figure 5.7: Method for the /A/-/a:/ and /Y/-/u,y/ classifiers.
5.4 Experiments and results for /A/-/a:/ and /Y/-/u,y/

5.4.1 Organization of experiments
For all corpora, the data was randomly separated into training and test data: 75% training data and
25% test data. All speech material from male and female speakers was kept separate to train
gender-dependent classifiers. The experiments for the classifiers /A/-/a:/ and /Y/-/u,y/ were organized
as follows:
       75% TRAINING    25% TEST
A.1    DL2N1-Nat       DL2N1-Nat
A.2    DL2N1-NN        DL2N1-NN
A.3    IFA             IFA
A.4    TRIEST          TRIEST

Table 5.2: Experiments A
The A-experiments were carried out mainly to examine the separability of the vowel pairs
in different corpora and to examine the use of some acoustic-phonetic features. For vowels, we
considered the addition of duration (either raw or normalized), the addition of F0 and/or F3, and
the addition of dynamic information to be issues of interest. The addition of these features was
of interest because: firstly, we expected duration to be discriminative because in both vowel pairs,
length may play a discriminative role. Duration is normalized for articulation rate; the effect of
using normalized duration instead of raw duration can also be examined. To examine whether the
addition of duration significantly improved classification accuracy, we compared in some cases two
hit ratios to each other: one hit ratio obtained without duration was compared to the other hit
ratio obtained with duration, and we calculated whether this difference was statistically significant.
This was done by computing z-scores in the following way:
p1 = k1/n1, where k1 is the number of cases classified correctly in the first experiment (e.g. with
duration) and n1 is the total number of cases in the first experiment

p2 = k2/n2, where k2 is the number of cases classified correctly in the second experiment (e.g.
without duration) and n2 is the total number of cases in the second experiment

p* = (k1 + k2) / (n1 + n2)

z = (p1 − p2) / √( p* · (1 − p*) · (1/n1 + 1/n2) )

An improvement of a classification percentage (hit ratio) is statistically significant if |z| > 1.65
(significant at 0.95, one-tailed).
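A minimal sketch of this significance test (the counts in the example are hypothetical):

from math import sqrt

def z_two_proportions(k1, n1, k2, n2):
    # z-score for the difference between two hit ratios.
    p1, p2 = k1 / n1, k2 / n2
    p_star = (k1 + k2) / (n1 + n2)
    return (p1 - p2) / sqrt(p_star * (1 - p_star) * (1 / n1 + 1 / n2))

# Example: 80/93 correct with duration vs 70/93 correct without duration.
z = z_two_proportions(80, 93, 70, 93)
print(z, "significant" if abs(z) > 1.65 else "not significant")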
Secondly, in Hillenbrand et al. (1995), it was shown that the addition of either F0 or F3 led
to a consistent improvement in classification accuracy. Thirdly, also in Hillenbrand et al. (1995),
classification accuracy increased when dynamic information was used. Dynamic information is
added by taking more samples of F1, F2 and F3: either 1 sample at 50% (the 1 sample condition),
or 2 samples at 25% and 75% (the 2 samples condition), or 3 samples at 25%, 50% and 75% (the
3 samples condition) of vowel duration were taken. Does adding dynamic information by taking 2
or 3 samples in the present study increase classification accuracy?
       75% TRAINING    25% TEST    Compare with
B.1    DL2N1-Nat       DL2N1-NN    A.2
B.2    IFA             TRIEST      A.4

Table 5.3: Experiments B
The B-experiments were carried out in addition to the A-experiments mainly to investigate how
differently trained classifiers cope with non-native speech and to examine how a pronunciation error
detector can best be trained, with native or non-native speech. Generally, applying native models to
non-native speech, which may be less accurately pronounced than native speech, can be problematic.
This means that when the classifier is natively trained, there probably will be a high number of
FalseRejections and a low number of FalseAcceptances on non-native speech: correctly pronounced
non-native speech will often be classified as incorrect by a natively trained classifier and incorrect
non-native speech will be less often classified as correct. When classifiers are trained on non-native
speech, these classifiers might not be able to discern correct non-native speech from incorrect non-native speech: they might produce a high number of FalseAcceptances (incorrect non-native speech
will more often be classified as correct by a non-natively trained classifier) and a low number of
FalseRejections (correct non-native speech will not often be classified as incorrect) on non-native
speech. For the L2 learner, one can imagine that a high number of FalseRejections can lead to
frustration: his correct pronunciation is falsely rejected by the pronunciation teaching module. So
to reduce the number of FalseRejections on non-native speech one could use a non-natively trained
classifier. But a non-natively trained classifier might be less able to detect pronunciation errors in
non-native speech (high number of FalseAcceptances). As can be seen in table 5.3 no experiments
such as [Training = IFA, Test = DL2N1-NN] were carried out because the DL2N1 speech database
consists of telephone speech and both IFA and TRIEST contain broadband speech; so hybrid
combinations such as training on telephone speech and testing on broadband speech were not
applied.
       75% TRAINING    TEST
C.1    DL2N1-Nat       DL2N1 mispronounced
C.2    DL2N1-NN        DL2N1 mispronounced
C.3    IFA             TRIEST mispronounced
C.4    TRIEST          TRIEST mispronounced

Table 5.4: Experiments C
The difference between the B-experiments and the C-experiments is that in the latter case, the
test data consists of the collected mispronunciations of the vowels made by the non-native speakers.
This is to examine how well the differently trained detectors perform on detecting pronunciation
errors, which is the actual goal of this study.
Subsets of the following variables obtained from the acoustic measurements were used to
train and test the classifier: F0, F1 25, F1, F1 75, F2 25, F2, F2 75, F3 25, F3, F3 75, dur (raw
duration) and normdur (normalized duration); “nodur” denotes the condition in which no duration
feature is used. For all experiments, the basic sets of features consist of: [F1 F2], [F0 F1 F2],
[F1 F2 F3] and [F0 F1 F2 F3]. In this way, the
contribution of F0 and F3 could be examined. Raw duration or normalized duration and dynamic
information (2 or 3 samples) were also added to each set to examine whether the use of dynamic
information is useful and whether using normalized duration instead of raw duration increases
classification accuracy.
For a quick overview of numbers of phonemes used in these experiments, see table 4.1. For a
more detailed overview, see appendix E, where MCC and Cpro are also shown to see whether the
classifiers predict better than chance.
5.4.2 Experiments and results /A/-/a:/

A-experiments

Adding duration, F0/F3 and dynamic information
We expected duration to be important in the distinction between /A/-/a:/ and according to fig. 5.8
- 5.11, duration, either raw or normalized, seems to be indeed an important feature. The addition
of raw or normalized duration resulted in every case in an improvement in classification accuracy,
irrespective of the use of dynamic information (therefore fig. 5.8 - 5.11 only show results obtained
with one sample). We examined the addition of ’dur’ to two feature sets, [F1 F2] and [F0 F1 F2 F3],
to find out whether these improvements in classification accuracy were also statistically significant:
in table 5.5, we see that in many cases this improvement is indeed significant (at 0.95), except in
exp. A.2.
The figs. 5.8 - 5.11 also show that the differences in accuracy between the use of raw duration
and normalized duration are rather small and do not indicate a preference for one over the other.
The increase in accuracy when F0, F3 or both are added, which was always observed in previous
classification experiments by Hillenbrand et al. (1993), was not encountered in our classification
experiments; adding either F0, F3 or both F0 and F3 to [F1 F2] resulted in many cases in a minor
improvement or even a small decrease in accuracy (small increase in accuracy: fig. 5.11 male
speakers, small decrease in accuracy: fig. 5.10 male speakers). This shows that in some cases a
small number of features, [F1 F2 dur/normdur], is sufficient to obtain high classification results:
e.g. in fig. 5.10 [F1 F2 dur] Male speakers, and fig. 5.8 [F1 F2 dur] Female speakers, classification
results are even slightly higher than when [F0 F1 F2 F3 dur] is used.

                                     Does the addition of ‘dur’ to        Does the addition of ‘dur’ to
                                     [F1 F2] improve classification       [F0 F1 F2 F3] improve classification
                                     accuracy significantly?              accuracy significantly?
                                     Male         Female                  Male         Female
A.1  Training&Test = DL2N1-Nat       0.46         3.65*                   0.46         2.87*
A.2  Training&Test = DL2N1-NN        1.43         -0.11                   1.57         0.45
A.3  Training&Test = IFA             3.62*        1.22                    1.67*        0.85
A.4  Training&Test = TRIEST          0.27         3.59*                   0.57         3.72*

* significant at 0.95

Table 5.5: Significance of the addition of ‘dur’ to [F1 F2] and [F0 F1 F2 F3] for /A/ vs /a:/
(z-scores computed for different experiments)
The influence of dynamic information was investigated by Hillenbrand et al. (1995), who found
that classification accuracy improved when dynamic information about the formant contour was
added by taking more samples of F1, F2 and F3. Dynamic information was examined in our
experiments the same way as was done in Hillenbrand et al. (1995) by adding more samples
(measurements at 25%, 50% and 75% of vowel duration) of F1, F2, and F3. Because no preference
for the use of raw over normalized duration was observed in fig.5.8 - 5.11, the figures below (fig.
5.12 - 5.15) show results obtained with raw duration.
Our results show that adding more samples only sometimes helps to improve classification accuracy (and not always, as was found by Hillenbrand et al., 1995). Usually, results obtained with
one sample (at 50% of vowel duration) are better than results obtained with two (at 25%
and 75% of vowel duration) or three samples (at 25%, 50% and 75% of vowel duration) in
fig. 5.12 - 5.15. Intuitively, it seems better to take more samples, because in this way, more
information is added about the dynamic characteristics of the vowel, and many studies have
proven that dynamic information improves classification accuracy, but it seems that our results
deviate from the results of these studies. Possible explanations for this can be found in section 5.5.1.
Other observations
Figure 5.8: A.1 Training=DL2N1-Nat, Test=DL2N1-Nat, 1 sample (/A/-/a:/ correct classification
percentages per feature set, male and female, without duration, with raw duration and with
normalized duration).

Figure 5.9: A.2 Training=DL2N1-NN, Test=DL2N1-NN, 1 sample (/A/-/a:/).

Figure 5.10: A.3 Training=IFA, Test=IFA, 1 sample (/A/-/a:/).

Figure 5.11: A.4 Training=TRIEST, Test=TRIEST, 1 sample (/A/-/a:/).

Figure 5.12: A.1 Training=DL2N1-Nat, Test=DL2N1-Nat, with raw duration (/A/-/a:/ correct
classification percentages per feature set with 1, 2 or 3 formant samples).

Figure 5.13: A.2 Training=DL2N1-NN, Test=DL2N1-NN, with raw duration (/A/-/a:/).
Figure 5.14: A.3 Training=IFA, Test=IFA, with raw duration (/A/-/a:/).

Figure 5.15: A.4 Training=TRIEST, Test=TRIEST, with raw duration (/A/-/a:/).

Other secondary observations that can be made are that in most cases, the vowels /A/-/a:/ of female
speakers are better distinguished from each other than the vowels of male speakers, especially in
native corpora, e.g. fig. 5.8 and 5.10. It is often said that female speakers speak more precisely
and accurately than male speakers, and therefore, vowels from female speech might be better
distinguishable from each other than vowels from male speech. The vowel space is also larger for
female speakers, due to the fact that female speakers have higher formant frequencies. The fact
that this gender-dependent result was not found in non-native corpora might be explained by the
fact that both male and female non-native speakers have the same difficulty in pronouncing sounds
of their L2; therefore it is not very likely that female non-native speakers speak more accurately
than male non-native speakers.
Also, it seems that vowels from native speech are better distinguishable than vowels from non-native speech; compare the classification percentages from the native corpora (fig. 5.8 and fig.
5.10) to the non-native corpora (fig. 5.9 and fig. 5.11). An explanation for this observation could
be that non-native speech differs in many ways from native speech, which causes non-native speech
possibly to be much more “blurred” and less precisely pronounced than native speech.
B-experiments
In these experiments (see table 5.3) the focus is on the use of different, native or non-native speech,
training material. Since using multiple samples (dynamic information) did not lead to better results in these experiments either, the figures below show results that were obtained with one sample.
Adding duration, F0/F3 and dynamic information
Figure 5.16: B.1 Training=DL2N1-Nat, Test=DL2N1-NN, 1 sample (/A/-/a:/).

Figure 5.17: B.2 Training=IFA, Test=TRIEST, 1 sample (/A/-/a:/).
The results of fig. 5.16 show clearly that the use of normalized duration instead of raw duration has a consistent positive effect on the classification results. This positive effect was not
encountered in experiment B.2, see fig. 5.17: in these cases, using raw duration led to higher
classification accuracy than using normalized duration. A possible explanation for this observation
is given in section 5.5.1. Normalizing duration for articulation rate is especially important in the B-experiments, where natively trained classifiers are applied to non-native speech: non-natives have lower articulation rates and consequently longer segment durations than natives, so without normalization non-native vowels will systematically appear longer than native ones.
Furthermore, adding F0 and/or F3 to [F1 F2] does not lead to consistent improvements in accuracy here either: results of [F1 F2] are usually as high as those of [F0 F1 F2] or [F1 F2 F3] (see fig. 5.16).
Compare results
To see which classifier is better in coping with non-native speech, results from experiment B.1
(DL2N1-Nat trained, fig. 5.16) can be compared to results from experiment A.2 (DL2N1-NN
trained, fig. 5.9), and B.2 (IFA trained, fig. 5.17) can be compared to A.4 (TRIEST trained, fig.
5.11). No large differences are seen when the DL2N1-Nat trained classifier B.1 is compared to the
DL2N1-NN trained classifier A.2; both types of classifiers are roughly equally capable of coping with
non-native speech. This is somewhat surprising because one would expect that non-native speech
is not very well recognized by a classifier trained on native speech. When the IFA trained B.2 is
compared to the TRIEST trained A.4 these differences in classification accuracy do become visible:
the natively trained IFA classifier B.2 performs worse on non-native speech than the non-natively
trained TRIEST A.4, especially for female speakers. But this observation could also be due to the
fact that IFA and TRIEST are speech databases that differ in many respects from each other, as
discussed in more detail in section 5.5.1.
C-experiments
[Figure 5.18: C.1 Training=DL2N1-Nat, Test=mispronounced /A/ CITO; correct classification % for /A/-/a:/, 1-3 samples, male and female speakers]
[Figure 5.19: C.2 Training=DL2N1-NN, Test=mispronounced /A/ CITO (/A/-/a:/, 1-3 samples)]
Adding duration, F0/F3 and dynamic information
These experiments (see table 5.4) are similar to the B-experiments except for the test material which
in this case consists of the pronunciation errors actually observed where /A/ was mispronounced
as /a:/. The results were somewhat surprising: the addition of duration did not always improve accuracy and sometimes even lowered it, and the number of samples seemed to play a role in some cases.
[Figure 5.20: C.3 Training=IFA, Test=mispronounced /A/ TRIEST (/A/-/a:/, 1-3 samples)]
[Figure 5.21: C.4 Training=TRIEST, Test=mispronounced /A/ TRIEST (/A/-/a:/, 1-3 samples)]
Surprisingly, from fig. 5.18 and also fig. 5.19 it is clear that the addition of duration (either raw
or normalized) in many cases negatively affects classification accuracy. Furthermore, the number
of samples now does have a small effect on classification accuracy (e.g. the female speakers in fig. 5.18 and the male speakers in fig. 5.20), and adding F0 and/or F3 to [F1 F2] does not always improve accuracy.
Compare results
On the whole, the natively trained classifiers are better than the non-natively trained classifiers in
detecting mispronounced /A/s (compare fig. 5.18 to 5.19 and fig. 5.20 to 5.21); so the native norm
is suitable for detecting pronunciation errors and also not too strict for correct non-native speech (see fig. 5.16). The same observations apply to the results from experiment C.3 (fig. 5.20) and C.4
(fig. 5.21). For a more detailed discussion on the classification results of /A/-/a:/, see section 5.5.1.
5.4.3 Experiments and results /Y/-/u,y/
A-experiments
Adding duration, F0/F3 and dynamic information
In the A-experiments we again examined the discriminative power of several features. The results show that the addition of duration (either raw or normalized) does not lead to large improvements in classification accuracy (see fig. 5.23 - 5.25), except for the DL2N1-Nat corpus (fig. 5.22); for the other corpora these improvements are not statistically significant (see table 5.6).
z-scores computed for different experiments
                                   Does the addition of ‘dur’ to        Does the addition of ‘dur’ to
                                   [F1 F2] improve classification       [F0 F1 F2 F3] improve classification
                                   accuracy significantly?              accuracy significantly?
                                   Male        Female                   Male        Female
A.1 Training&Test = DL2N1-Nat      1.83∗       2.07∗                    2.35∗       2.37∗
A.2 Training&Test = DL2N1-NN       0.00        -0.37                    0.20        0.00
A.3 Training&Test = IFA            0.00        0.60                     -0.37       0.21
A.4 Training&Test = TRIEST         -0.49       -0.18                    0.17        -0.18
∗ significant at 0.95
Table 5.6: Significance of the addition of ‘dur’ to [F1 F2] and [F0 F1 F2 F3] for /Y/ vs /u,y/
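The z-scores in table 5.6 (and later in table 6.6) test whether adding ‘dur’ significantly changes the percentage of correctly classified tokens. The sketch below is not necessarily the exact test statistic used in this thesis; it only illustrates one standard way of comparing two classification accuracies, a pooled two-proportion z-test, with hypothetical counts.

from math import sqrt

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    # Pooled two-proportion z-test comparing two correct-classification percentages.
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se

# Hypothetical example: 78% versus 74% correct on 250 test vowels each.
print(round(two_proportion_z(195, 250, 185, 250), 2))  # |z| > 1.96 is significant at the 0.95 level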
The improvements in classification accuracy that were seen in Hillenbrand et al. (1995) when dynamic information was added were not observed in these experiments (these results are therefore not shown here), except for the IFA and TRIEST corpora (see fig. 5.26 and 5.27). In the TRIEST corpus, relatively large improvements are found when more than 1 sample is taken, especially for the male speakers (see fig. 5.27).
The addition of F0 and/or F3 to [F1 F2] did not result in any consistent large improvements (e.g. fig. 5.23 and 5.24), unlike what was observed in Hillenbrand et al. (1993, 1995).
[Figure 5.22: A.1 Training=DL2N1-Nat, Test=DL2N1-Nat, 1 sample; correct classification % for /Y/-/u,y/ per feature set (NODUR/DUR/NORMDUR), male and female speakers]
[Figure 5.23: A.2 Training=DL2N1-NN, Test=DL2N1-NN, 1 sample (/Y/-/u,y/)]
[Figure 5.24: A.3 Training=IFA, Test=IFA, 1 sample (/Y/-/u,y/)]
[Figure 5.25: A.4 Training=TRIEST, Test=TRIEST, 1 sample (/Y/-/u,y/)]
[Figure 5.26: A.3 Training=IFA, Test=IFA; correct classification % for /Y/-/u,y/ with 1, 2 and 3 samples]
Other observations
A further observation is that again, the vowels /Y/-/u,y/ are better distinguished from each other in
native speech than in non-native speech (fig. 5.22 - 5.25), which was also observed in the distinction
/A/-/a:/ (see section 5.4.2): classification accuracy is 15%-20% higher for native than non-native
speech (compare fig. 5.22 to 5.23 and fig. 5.24 to 5.25). Differences in classification results due
to gender vary for different corpora: e.g. in the DL2N1-Nat corpus (fig. 5.22) higher classification
accuracy is found for male speech than for female speech, whereas in the IFA corpus (fig. 5.24),
the opposite is found.
B-experiments
Adding duration, F0/F3 and dynamic information
The results of the B-experiments show that duration usually plays an important role: in exp. B.1 (fig. 5.28) and B.2 (fig. 5.29), the addition of duration usually affects classification accuracy positively. The use of normalized duration instead of raw duration is usually disadvantageous (fig. 5.28 and 5.29), although we found earlier, in the B-experiments of the distinction /A/-/a:/, that normalized duration can be very effective (see fig. 5.16). Furthermore, the addition of F0 and/or F3 and the addition of dynamic information (taking more samples of F1, F2 and F3) did not result in large improvements in accuracy (therefore fig. 5.28 and 5.29 only show results obtained with 1
sample).
[Figure 5.27: A.4 Training=TRIEST, Test=TRIEST; correct classification % for /Y/-/u,y/ with 1, 2 and 3 samples]
[Figure 5.28: B.1 Training=DL2N1-Nat, Test=DL2N1-NN, 1 sample (/Y/-/u,y/)]
[Figure 5.29: B.2 Training=IFA, Test=TRIEST, 1 sample (/Y/-/u,y/)]
Compare results
The main goal of the B-experiments was to examine how differently trained classifiers, trained on
native or non-native speech, would cope with non-native speech. If we compare the results of the
classifier trained on native speech from the DL2N1 corpus (fig. 5.28) with the classifier trained on
non-native speech (fig. 5.23), we can see that there is almost no difference in performance. The same
can be observed in fig. 5.29 and 5.25. Thus, in this case there is almost no loss in performance when
non-native speech is applied to a classifier that is trained on native speech instead of non-native
speech.
C-experiments
Adding duration, F0/F3 and dynamic information
Some observations in the C-experiments are that a higher number of samples (dynamic information) can affect classification accuracy either positively or negatively: see fig. 5.30 and 5.32 (therefore results obtained with dynamic information are also shown in fig. 5.30 - 5.33). Finally,
adding F0 and/or F3 to [F1 F2] does not always improve the classification results; it sometimes
even lowers the performance of the classifier (see fig. 5.31 and 5.33).
Compare results
The natively and non-natively trained classifiers are tested on the pronunciation errors which consist
of mispronunciations of /Y/ as /u/ or /y/. First, the pronunciation errors are tested on classifiers
that are trained on native speech, to see whether these natively trained classifiers are able to detect
pronunciation errors.
The results of exp. C.1 (fig. 5.30) and C.3 (fig. 5.31) differ considerably from each other: in fig. 5.30
we can see that classification accuracy ranges from approximately 60%-70%, and in fig. 5.31 from
70%-100% for male speech and from 20%-25% for female speech. It appears that natively trained
classifiers are sometimes able to detect pronunciation errors of /Y/.
The pronunciation errors were also tested on classifiers trained on non-native speech, to compare
the two differently trained (natively or non-natively) classifiers with each other. When we compare
the results of fig. 5.32 (trained on non-native speech) to 5.30, we find that the classifier trained
on non-native speech (fig. 5.32) performs better than the classifier trained on native speech (fig.
5.30) in detecting pronunciation errors of /Y/. The same observation is found when fig. 5.33
(non-natively trained) is compared to fig. 5.31 (natively trained). Thus, for pronunciation errors of /Y/, the classifiers trained on non-native speech turned out to work better as pronunciation error detectors (according to our results in fig. 5.30 - 5.33). This is the opposite of what was found for the pronunciation error of /A/, where classifiers trained on native speech turned out to work better as pronunciation error detectors (see fig. 5.18 - 5.21).
[Figure 5.30: C.1 Training=DL2N1-Nat, Test=mispronounced /Y/ DL2N1; correct classification % for /Y/-/u,y/, 1-3 samples, male and female speakers]
[Figure 5.31: C.3 Training=IFA, Test=mispronounced /Y/ TRIEST (/Y/-/u,y/, 1-3 samples)]
5.4.4 Experiments and results /Y/-/u/-/y/
In all previous experiments of /Y/-/u,y/ the classifier had to make a binary choice, namely to
classify a vowel as /Y/ or /u,y/; /u/ and /y/ were grouped together and the classifiers were trained
to discriminate /Y/ from /u,y/. In section 5.2.3, I remarked that /u/ and /y/ differ acoustically a great deal from each other and that grouping them together might therefore not be a good idea. For that reason, this section describes some pilot experiments which aimed at establishing whether the classifier is able to make a ternary choice, namely to classify a vowel as either /Y/ or /u/ or /y/. This time, the classifier is trained to discriminate /Y/ from /u/ and from /y/. For this study, the binary choice of the classifier is sufficient because the task of this classifier is not to give
feedback on how to improve the pronunciation, but simply to classify a vowel as either correct or incorrect.
[Figure 5.32: C.2 Training=DL2N1-NN, Test=mispronounced /Y/ DL2N1 (/Y/-/u,y/, 1-3 samples)]
[Figure 5.33: C.4 Training=TRIEST, Test=mispronounced /Y/ TRIEST (/Y/-/u,y/, 1-3 samples)]
One can imagine that developing a pronunciation error detector that is also able to tell
L2 learners how to improve their pronunciation is a more difficult task. In this case, developing a
classifier that is able to make a ternary (or N-ary) choice might be more useful. For instance, if an L2 learner pronounces an incorrect /Y/ that is correctly classified by the /Y/-/u/-/y/ classifier as /y/, this implies that the L2 learner has made an error in vowel length, since the primary difference between /Y/ and /y/ is length. The error is diagnosed and the system can tell the learner to pronounce the vowel again, but now with a shorter duration.
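Such a diagnosis step could be implemented as a simple mapping from the label assigned by the ternary classifier to a feedback message. The sketch below is only an illustration; the messages and the function are hypothetical and are not part of the detector developed in this thesis.

# Hypothetical mapping from the label that a ternary /Y/-/u/-/y/ classifier assigns
# to a realization of /Y/, to corrective feedback for the learner.
FEEDBACK_FOR_Y = {
    "Y": "Pronunciation accepted.",
    "y": "The vowel was too long: keep the same vowel quality, but make it shorter.",
    "u": "The vowel quality was closer to /u/ than to /Y/: adjust the vowel quality.",
}

def feedback_for_y(predicted_label):
    return FEEDBACK_FOR_Y.get(predicted_label, "Unknown classifier output.")

print(feedback_for_y("y"))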
We only tested the A-conditions in these pilot experiments and did not add dynamic information.
We can observe in fig. 5.34 - 5.37 that the results are comparable to the results of the /Y/-/u,y/
classifier (fig. 5.22 - 5.25), and that these results are sometimes even better than those of the
/Y/-/u,y/ classifier (compare e.g. fig. 5.23 with 5.35).
[Figure 5.34: A.1 Train=DL2N1-Nat, Test=DL2N1-Nat, 1 sample; correct classification % for /Y/-/u/-/y/ per feature set, male and female speakers]
[Figure 5.35: A.2 Train=DL2N1-NN, Test=DL2N1-NN, 1 sample (/Y/-/u/-/y/)]
Thus, this /Y/-/u/-/y/ classifier, trained to make a ternary choice, manages to categorize /Y/ as either /Y/ or /u/ or /y/, and its performance is almost the same as that of the /Y/-/u,y/ classifier which was trained to make a binary choice. The ternary classifier was not examined further in this study, but the results obtained are promising for further research.
[Figure 5.36: A.3 Train=IFA, Test=IFA, 1 sample (/Y/-/u/-/y/)]
[Figure 5.37: A.4 Train=TRIEST, Test=TRIEST, 1 sample (/Y/-/u/-/y/)]
5.5 Discussion of results
In this section, we examine the classification results obtained more carefully and mainly try to
explain those results that appeared to be surprising or at least remarkable.
5.5.1 Discussion of the results of /A/-/a:/
Two findings of the vowel classification study by Hillenbrand et al. (1995) are that 1) the addition
of F0 and/or F3 to [F1 F2] results in a consistent improvement in classification accuracy, and 2)
the addition of dynamic information (taking more samples of F1, F2 and F3) results in a consistent
improvement in classification accuracy. These findings were not always found in our A, B and C
experiments: the addition of F0 and/or F3 to [F1 F2] and the addition of dynamic information were
not always helpful. This discrepancy might be explained by the fact that our vowel classification
experiments differ from those of Hillenbrand et al. (1995) in 4 aspects: 1) our acoustic measurements
were taken automatically and were based on an automatic segmentation which were not checked
by hand, whereas in Hillenbrand et al. (1995), all measurements were done by hand, 2) we used
vowels taken from different contexts, whereas in Hillenbrand et al. (1995), vowels taken from a fixed
context (h-V-d syllables) were used, 3) Hillenbrand et al. (1995) aimed at classifying the whole
set of monophthongal American-English vowels, whereas our vowel classification experiments are
restricted to classifying vowel pairs, and 4) we use both native and non-native speech.
Furthermore, the observation was made that the distinction between /A/ and /a:/ was better
made in speech of native female speakers than in speech of native male speakers. It might be
interesting to see how this difference between male and female speakers is visible in the discriminant
spaces of LDA. In fig. 5.38 and 5.39, we can see that the group centroids of the male speakers are closer together than the group centroids of the female speakers; this is an indication that /A/ and /a:/ are better distinguished from each other in female speech than in male speech.
Another observation was that /A/ is better distinguished from /a:/ in native speech than in
non-native speech which might be caused by non-native speech being less clearly and accurately
pronounced than native speech. In fig. 5.40 and 5.41 we can see that the group centroids are
very close to each other, which makes LDA discrimination more difficult. The non-native group
centroids are closer to each other (fig. 5.40 and 5.41) than the native group centroids (fig. 5.38 and
5.39); this indicates that indeed the discrimination of /A/ from /a:/ is more difficult for non-native
sounds.
[Figure 5.38: Whisker's box plot, distribution of discriminant scores of /A/-/a:/ with [F0 F1 F2 F3 dur], male IFA speakers]
[Figure 5.39: Whisker's box plot, distribution of discriminant scores of /A/-/a:/ with [F0 F1 F2 F3 dur], female IFA speakers]
[Figure 5.40: Whisker's box plot, distribution of discriminant scores of /A/-/a:/ with [F0 F1 F2 F3 dur], male TRIEST speakers]
[Figure 5.41: Whisker's box plot, distribution of discriminant scores of /A/-/a:/ with [F0 F1 F2 F3 dur], female TRIEST speakers]
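The discriminant scores and group centroids plotted in fig. 5.38 - 5.41 follow directly from the LDA analysis described earlier, which was carried out with the tools mentioned in the previous chapters. Purely as an illustration of how such scores and centroids can be computed, a scikit-learn sketch is given below; the feature values are random placeholders, not data from the corpora used here.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: rows are vowel tokens, columns are [F0, F1, F2, F3, dur].
rng = np.random.default_rng(0)
X_A  = rng.normal([200, 750, 1300, 2500, 0.9], [30, 80, 120, 200, 0.2], (50, 5))
X_aa = rng.normal([200, 850, 1400, 2500, 1.8], [30, 80, 120, 200, 0.3], (50, 5))
X = np.vstack([X_A, X_aa])
y = np.array(["A"] * 50 + ["a:"] * 50)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)   # one discriminant score per token (2 classes -> 1 dimension)

# Group centroids: the mean discriminant score per vowel category.
for label in ("A", "a:"):
    print(label, float(scores[y == label].mean()))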
We saw in the B-experiments that the classifier trained on native IFA speech (fig. 5.17) performed worse than the classifier trained on non-native TRIEST speech (fig. 5.11) when non-native
speech was applied. This was not surprising since applying non-native speech to a classifier trained
on native speech is usually problematic. But in this case, another factor could have affected the
performance of the natively trained classifier in fig. 5.17: the IFA and TRIEST corpus differ in
many respects from each other, e.g. different recording conditions and different stimuli may have
led to more acoustic variation.
Furthermore, using normalized duration instead of raw duration affected classification accuracy
positively in exp. B.1 (fig. 5.16), but affected classification accuracy slightly negatively in exp. B.2
(fig. 5.17). If our native speakers indeed have higher articulation rates than non-native speakers,
then normalized duration would work better than raw duration, see fig. 5.16. But if the articulation rates are almost the same for both native and non-native speakers, then probably both raw
and normalized duration perform equally well (since normalization is achieved by multiplying raw
duration by articulation rate), see fig. 5.17. In table 5.7, we can see the mean articulation rates
for each corpus. Native speakers from DL2N1-Nat have a higher mean articulation rate than non-native speakers from DL2N1-NN (table 5.7); this might explain why normalized duration works
better than raw duration in exp. B.1 (fig. 5.16). Native speakers from IFA have almost the same
mean articulation rate as non-native speakers from TRIEST (table 5.7); this might explain why
both raw and normalized duration perform almost equally well in exp. B.2 (fig. 5.17).
             Database    Mean    Standard Dev.   Minimum   Maximum
native       DL2N1-Nat   13.66   1.19            11.92     15.89
             IFA         12.37   0.84            11.77     13.83
non-native   DL2N1-NN    11.61   1.37            9.27      15.21
             TRIEST      12.53   1.07            11.01     14.53
Table 5.7: Mean articulation rates of all speakers (both male and female) per database
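Since normalized duration is obtained by multiplying the raw segment duration by the speaker's articulation rate (section 5.3), the rate differences in table 5.7 translate directly into how far the native and non-native duration distributions are pulled together. A minimal sketch, using the mean rates of table 5.7 as example values:

def normalize_duration(raw_duration_s, articulation_rate):
    # Normalized duration = raw segment duration * speaker articulation rate (section 5.3).
    return raw_duration_s * articulation_rate

# Example: the same raw /a:/ duration of 150 ms for a DL2N1-Nat speaker (mean rate 13.66)
# and a DL2N1-NN speaker (mean rate 11.61), cf. table 5.7.
print(normalize_duration(0.150, 13.66))   # ~2.05
print(normalize_duration(0.150, 11.61))   # ~1.74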
Results of the C-experiments have shown that natively trained classifiers are better in detecting
mispronunciations of /A/ than non-natively trained classifiers: correct classification ranges from
70% to 90% (see fig. 5.18 and 5.20). This implies that the decision regions based on non-native
speech and drawn by LDA are too “relaxed”: the classifier produces too many FalseAcceptances
instead of CorrectRejections. In this case, a classifier trained with native speech is more suitable
for detecting pronunciation errors of /A/.
Surprisingly, duration did not play a discriminative role in the C-experiments, where mispronunciations of /A/ were applied to the classifiers. From the A- and B-experiments, it was clear
that duration did play a discriminative role. However, duration did not always have the same discriminative weight: for the /A/-/a:/ classifier trained on native DL2N1-Nat speech (exp. A.1, fig.
5.8), duration is more discriminative for female speech than for male speech and for the classifier
trained on native IFA speech (exp. A.3, fig. 5.10), duration is more discriminative for male speech
than for female speech.
[Figure 5.42: Histograms of duration (normalized) from DL2N1-Nat, male speakers; /A/ (mean=1.03, st.dev=0.42, n=146), mispronounced /A/ (mean=1.53, st.dev=0.65, n=31), /a:/ (mean=1.87, st.dev=0.74, n=93)]
[Figure 5.43: Histograms of duration (normalized) from DL2N1-Nat, female speakers; /A/ (mean=1.15, st.dev=0.41, n=227), mispronounced /A/ (mean=1.56, st.dev=0.62, n=56), /a:/ (mean=1.95, st.dev=0.67, n=143)]
The same classifiers that are trained in the A-experiments are used in the
C-experiments and a possible consequence of this is that duration will also be less important in
the C-experiments in those cases where the discriminant function puts less weight on duration in
the A-experiments. In the C-experiments, this is visible in fig. 5.18 (compare to fig. 5.8 where
duration was less discriminative for male speech) and 5.20 (compare to fig. 5.10 where duration
was less discriminative for female speech).
What else could have caused duration to be less effective, or not effective at all, in the C-experiments? Possibly, the mispronounced /A/s deviated not only in length but also spectrally from the correct /A/, which would make duration a less effective discriminative feature. However, as we can see in the histograms (fig. 5.42 and 5.43), the correct /A/s do differ somewhat in duration from the mispronounced /A/s in male and female speech of the DL2N1-Nat corpus; possibly this length difference is not large enough, since duration turned out to be not very helpful in exp. C.1 (fig. 5.18) for male speech and since the discriminant function possibly puts less weight on duration.
For the LDA classifiers to predict better than chance and to have any utility at all, their correct classification percentages (hit ratios) must be higher than the Maximum Chance Criterion (MCC) and the Proportional Chance Criterion (Cpro). In appendix E, the MCC and Cpro are calculated for the A- and B-experiments: these values are approximately 50%. We have seen in our results that the accuracy of our /A/-/a:/ classifiers is much higher than 50% and thus surpasses the MCC and Cpro, which means that the classifiers predict better than chance.
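Both chance criteria can be computed directly from the group sizes. The sketch below uses the standard two-group definitions (MCC: the proportion of the largest group; Cpro: the sum of the squared group proportions); the group sizes in the example are hypothetical.

def chance_criteria(n_group1, n_group2):
    # Maximum Chance Criterion and Proportional Chance Criterion for a two-group classifier.
    n = n_group1 + n_group2
    p1, p2 = n_group1 / n, n_group2 / n
    mcc = max(p1, p2)            # accuracy of always predicting the largest group
    c_pro = p1 ** 2 + p2 ** 2    # proportional chance criterion
    return mcc, c_pro

# Hypothetical example: 230 /A/ tokens versus 210 /a:/ tokens in the test set.
mcc, c_pro = chance_criteria(230, 210)
print(f"MCC = {mcc:.1%}, Cpro = {c_pro:.1%}")   # approximately 52.3% and 50.1%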
5.5.2 Discussion of the results of /Y/-/u,y/
The A-experiments showed that no specific feature was particularly important for discriminating between /Y/ and /u,y/, except for the DL2N1-Nat corpus, where duration was very strong (see fig. 5.22). Again, it was observed that /Y/ was better distinguished from /u,y/ in speech of native speakers than in speech of non-native speakers. We can examine this in the same way as was done in the previous section for /A/-/a:/, by looking at the discriminant scores (fig. 5.44 - 5.47):
We can see in fig. 5.46 and 5.47 that the group centroids of the non-native speakers lie closer to
each other than the group centroids of the native speakers, see fig. 5.44 and 5.45, indicating that
the discrimination between /Y/ and /u,y/ is more difficult for non-native speakers.
[Figure 5.44: Whisker's box plot, distribution of discriminant scores of /Y/-/u,y/ with [F0 F1 F2 F3 dur], male IFA speakers]
[Figure 5.45: Whisker's box plot, distribution of discriminant scores of /Y/-/u,y/ with [F0 F1 F2 F3 dur], female IFA speakers]
[Figure 5.46: Whisker's box plot, distribution of discriminant scores of /Y/-/u,y/ with [F0 F1 F2 F3 dur], male TRIEST speakers]
[Figure 5.47: Whisker's box plot, distribution of discriminant scores of /Y/-/u,y/ with [F0 F1 F2 F3 dur], female TRIEST speakers]
The results of the C-experiments show that classifiers trained on non-native speech are better
in detecting pronunciation errors of /Y/ than classifiers trained on native speech (compare fig. 5.30
with 5.32). This is the opposite of what was found for the /A/-/a:/ pair, where natively trained
classifiers performed better in detecting pronunciation errors of /A/. Further research is needed to examine why the pronunciation errors of /Y/ and of /A/ are each best detected by classifiers trained on a different type of speech data (non-native and native, respectively). However, it confirms that being more specific is useful: the effectiveness of individual features and the choice between a native and a non-native reference model can differ for each vowel pair.
MCC and Cpro were calculated to see whether the classifiers could predict better than chance. It
appears that some of the /Y/-/u,y/ classifiers do not predict better than chance: e.g. the classifier
trained on non-native DL2N1-NN speech in fig. 5.23 has an accuracy of around 70%, whereas the
MCC is 72.6% and 74.2% for male and female speech, respectively. In appendix E, more MCC and Cpro values are given.
Chapter 6
The pronunciation error detector /x/-/k,g/
6.1 Introduction
This chapter describes the development of the pronunciation error detector for /x/. First, a small
acoustic study was carried out to determine potential discriminative acoustic-phonetic features
(section 6.2). In section 6.3, two different methods for classification of /x/ vs /k/ are presented and the extraction procedure for the acoustic features is described. The results of the classification
experiments are presented in section 6.4 and discussed in section 6.5.
6.2 Acoustic characteristics of /x/, /k/ and /g/
6.2.1 General acoustic characteristics of consonants
Consonants and vowels are both produced by an airstream which flows through the vocal tract, but
what distinguishes consonants from vowels is that in the case of consonants, the airstream flows
through a narrowed vocal tract. The description of the acoustic characteristics of consonants is more
complex than that of vowels. Vowels can be described with essentially the same acoustic features,
such as formant pattern and duration (as was described in the previous chapter). Consonants, on
the other hand, differ significantly among themselves in their acoustic properties, and therefore it
is difficult to describe all of them with a single set of features. Some consonants involve significant
noise generation, whereas others have no noise components. Some consonants are produced with
a period of complete obstruction of the vocal tract, whereas others are produced with only a
narrowing of the vocal tract.
Consonants are often categorized by articulatory parameters, such as manner of articulation
and place of articulation (see table 6.1).
                                                     place of articulation
manner of articulation             bilabial   labiodental   alveolar   palatal   velar   uvular   glottal
plosives                           p  b                     t  d       c         k (g)
nasals                             m                        n                    N
fricatives                                    f  v          s  z       S         x
liquid (lateral)                                             l
liquid (approximant/retroflex)                               r                           R
glide                                                        w          j                          h
Table 6.1: Traditional way of categorizing Dutch consonants (within each cell: right=voiced, left=voiceless)
Because of the large acoustic differences between groups of consonants, I will only focus on
the groups of consonants that are of importance for the pronunciation error detector in question:
namely the fricatives and the plosives.
Fricatives are produced with a narrow constriction maintained somewhere in the vocal tract.
When air passes through the constriction at a high rate of flow, turbulence results and this turbulence is associated with the generation of turbulence noise (aperiodic energy) in the acoustic signal.
Also, the amplitudes of fricatives are usually lower than those of surrounding speech sounds. In
short, fricatives are produced a) by the formation of a narrow constriction somewhere in the vocal
tract, b) by the development of turbulent air flow and c) by the generation of turbulence noise.
Fricatives have relatively long durations of noise, compared to other classes of sounds involving
noise generation.
Some studies that aimed at distinguishing fricatives among themselves have concentrated on
four attributes: spectral properties of the frication noise (distribution of energy in the frequency
range), amplitude of the noise, duration of the noise, and spectral properties of the transition from
the fricative into the following vowel (Jongman et al., 2000).
The production of plosives differs greatly from the production of fricatives: a plosive involves a)
a complete closure of the vocal tract, b) a release of the closure and c) a movement toward another
vocal tract configuration. The closure is associated with acoustic silence, although weak voicing
can be present if the plosive is voiced (prevoicing). During the closure interval, air pressure is built
up in the mouth and abruptly released. The acoustic evidence of this release is a burst or transient.
The burst is a noise segment similar (acoustically) to the noise segment for a fricative, but much
shorter. Sometimes, the plosive is not released, i.e. the closure is maintained and no burst appears
(usually word-final plosives). During the articulatory movement from a plosive to another sound,
the transition is associated with a brief interval of changing formant pattern.
To distinguish different plosives from each other, studies have been carried out to investigate
two main features: the burst and the formant transitions. It seems that the acoustic properties
of the bursts convey information about the place of articulation of plosives (Stevens & Blumstein,
1978). The formant transitions also convey important information about plosives (and consonants
in general): the F1 transition appears to be a cue to manner of articulation and the F2 and F3
transitions may be cues to place of articulation.
In the next section, the more specific acoustic differences between /x/ and /k,g/ will be examined.
6.2.2 Acoustic differences between /x/ and /k,g/
In the previous section (section 6.2.1) we saw some general acoustic properties of fricatives and
plosives. In this section we will examine some specific acoustic differences between the voiceless
velar fricative /x/ and the voiceless or voiced velar plosive /k/ or /g/. First, what the three
consonants have in common is their place of articulation; they are all velar consonants. Manner of
articulation is the difference between /x/ and /k,g/; thus in our case it might be sufficient to make
the distinction between fricatives and plosives. The most important difference between fricatives
and plosives does not concern their spectral properties, but the course of their amplitude over time. The most important acoustic characteristic of a plosive is the burst, an abrupt rise in amplitude
at consonant onset which is absent in fricatives. Fricatives have a gradual rise in amplitude at
consonant onset. It is this difference in rapid and gradual rise in amplitude that can be used to
distinguish fricatives from plosives.
Notice that /k/ and /g/ also differ from each other: because the /g/ is voiced, the silent interval
preceding the burst might not be completely silent because of some weak voicing. Also, the burst
of a voiced plosive might not be as explosive as that of a voiceless one.
Duration might also play a role in this distinction: a fricative sound can be prolonged considerably, whereas a plosive is usually produced in a relatively shorter period of time because the pressure built up in the vocal tract has to be released.
Therefore, differences in amplitude envelope and duration might serve as acoustic cues for the distinction between fricatives and plosives.
6.2.3 Acoustic features for fricatives versus plosives classification: experiments in the literature
Some of the above-mentioned acoustic differences between fricatives and plosives have been employed to develop automatic algorithms to distinguish between these two groups. Since the gross
spectral shapes of fricatives and plosives are similar, several other cues have been suggested, all
based on properties of the amplitude envelope. Cues such as consonant duration (Weinstein et
al., 1975), burst amplitude (Dorman et al., 1980) and rise time (e.g. Stevens, 1980) have been
examined to distinguish between voiceless fricatives and voiceless plosives.
Weigelt et al. (1990) describe in their study an algorithm that is able to distinguish voiceless
fricatives from voiceless plosives with only three measures. I will describe this study here in more
detail, because it forms the basis of our pronunciation error detector for /x/-/k,g/: we have developed two classification methods that are based on the algorithm by Weigelt et al. (1990). They
use a variable related to the rise time as the primary measure for the distinction: the rate of rise of the log Root-Mean-Square (RMS) energy of the waveform (ROR). This ROR value can be seen
as the derivative of the log RMS energy of the waveform; it indicates how rapid the rise (or fall) in
energy (amplitude) is.
Plosives have an abrupt rise in energy at consonant onset and therefore a high ROR value (see
figures 6.4 and 6.2). Fricatives have a gradual rise in energy at consonant onset and therefore have
a relatively low ROR value (see figures 6.3 and 6.1). Since the difference between the two ROR
values is large (compare fig. 6.4 to 6.3), an absolute ROR threshold can be set to evaluate peaks in
the ROR contour: consonants with a value above the threshold are labeled as plosives and those
with a peak ROR value below the threshold are labeled as fricatives. Usually large peaks in the
ROR contour correspond to the consonant onset, but these large peaks can also be the result of
non-speech sounds such as “lip smacks”.
[Figure 6.1: Log RMS contour (E_begin...end) of a /x/]
[Figure 6.2: Log RMS contour (E_begin...end) of a /k/]
[Figure 6.3: ROR contour (ROR_begin...end) of a /x/]
[Figure 6.4: ROR contour (ROR_begin...end) of a /k/]
In this algorithm, these peaks are rejected using methods
similar to those commonly used for signal endpoint detection, such as setting thresholds for relative
energy magnitude, energy duration and zero-crossing rate of the signal. Four criteria have been
set up to establish whether a peak is significant (a peak belonging to the consonant’s onset) or
not (a peak belonging to a non-speech sound or vowel onset). The duration of an energy pulse
is examined to ensure that it is large enough in both amplitude and duration to be considered a
speech sound (criterion 1). A relative increase in energy aids in rejection of vowel onsets following
plosives or fricatives and biases toward classifying low-valued peaks as not significant (criterion
2). By evaluating a period of time around each ROR peak to find out whether the zero-crossing
rate is constant or increasing, thus correlated to consonant onset, and by finding out whether the
zero-crossing rate has reached a voiceless speech threshold, peaks that are related to non-speech sounds or to vowel onsets are discarded (criteria 3 and 4).
(1) For the 49-ms period following the peak, the value of log rms energy must never fall below the value of log rms energy at the peak.
(2) The maximum value of log rms for the following 49 ms must be at least 12 dB above the value of log rms energy at the peak.
(3) The maximum zero-crossing rate over the 49-ms period after the peak must be greater than 2000 zero crossings per second.
(4) The zero-crossing rate, exactly 40 ms after the peak, must be no more than 100 crossings per second below the zero-crossing rate 20 ms before the peak.
Table 6.2: Four criteria to discard spurious peaks (ROR peaks not belonging to the consonant onset) (from Weigelt et al., 1990)
If no significant peak is found, then the consonant is labeled as a fricative. If a significant peak
is found and the peak is above the ROR threshold (the main criterion), then the consonant is
labeled as a plosive. All thresholds were set heuristically, including the ROR threshold which was
set in Weigelt et al. (1990) at 2240 dB/s.
This algorithm by Weigelt et al. (1990) seems to work very well since their correct classification
percentages range from approximately 91.1% to 100.0%. Other advantages of this algorithm are
that the acoustic features are relatively easy to compute and the number of features is small. We
used this algorithm to discriminate /x/ from /k,g/ and carried out different experiments. We also
developed an alternative statistical method that uses the ROR feature (see section 6.3.2 and section
6.3.1).
6.3 Methods & acoustic measurements
Very few instances of the /g/ were found in the Dutch corpora, due to the fact that the /g/ is not
a Dutch phoneme. Therefore, the classifiers are trained on correct realizations of the following two
phonemes: /x/ and /k/.
6.3.1 Method I & acoustic measurements
Method I uses the algorithm presented in Weigelt et al. (1990) and described in section 6.2.3.
This method uses three of the four criteria described in table 6.2 and one main criterion (the ROR
threshold) to decide whether a consonant sound is a fricative or a plosive. The algorithm is partly
performed by a PRAAT-script, and is supplemented by a Perl script (see appendix C).
[Figure 6.5: Left: histogram of (highest) ROR values of /k/ (mean=11149.74, st.dev=8944.62, n=241). Right: histogram of (highest) ROR values of /x/ (mean=1472.94, st.dev=1775.47, n=284). ROR values taken from IFA corpus, male speakers.]
[Figure 6.6: Left: histogram of (highest) ROR values of /k/ (mean=14099.49, st.dev=7642.62, n=360). Right: histogram of (highest) ROR values of /x/ (mean=2259.47, st.dev=5213.46, n=444). ROR values taken from IFA corpus, female speakers.]
We computed the ROR value for each consonant (fricative or plosive) in the same way as was
done in Weigelt et al. (1990). First, the signal was pre-emphasized (from frequency 50Hz). To
calculate the ROR value in PRAAT, a window of 24 milliseconds was shifted in 1 millisecond steps
over the segment and the short-term log RMS (Root-Mean-Square) energy was measured for each
window: E_n = 20 × log10(RMS_n / 0.00002). The ROR value is computed by subtracting the E of the previous window from the E of the current window and dividing this difference by the time step of 1 ms:
ROR_n = (E_n − E_{n−1}) / ∆t,
where ∆t is the separation of the energy measurements (in our case 1 millisecond).
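A minimal numpy sketch of these two measurements (24-ms window shifted in 1-ms steps; pre-emphasis is omitted here) is given below. It only illustrates the formulas above and is not the PRAAT script that was actually used.

import numpy as np

def log_rms_energy(signal, fs, win_ms=24.0, step_ms=1.0):
    # Short-term log RMS energy E_n = 20 * log10(RMS_n / 0.00002), one value per step.
    win = int(round(fs * win_ms / 1000.0))
    step = int(round(fs * step_ms / 1000.0))
    energies = []
    for start in range(0, len(signal) - win + 1, step):
        frame = signal[start:start + win]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12   # avoid log(0)
        energies.append(20.0 * np.log10(rms / 0.00002))
    return np.array(energies)

def rate_of_rise(energy_db, step_ms=1.0):
    # ROR_n = (E_n - E_{n-1}) / delta_t, expressed in dB per second.
    return np.diff(energy_db) / (step_ms / 1000.0)

# Example with a synthetic signal (16 kHz): low-level noise followed by an abrupt onset.
fs = 16000
sig = np.concatenate([0.001 * np.random.randn(800), 0.2 * np.random.randn(800)])
ror = rate_of_rise(log_rms_energy(sig, fs))
print(ror.max())   # abrupt onsets give large positive ROR peaks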
For each consonant, each of the three highest ROR peaks is evaluated in descending order (largest peak
first) for its significance (either belonging to the consonant onset or to a non-speech sound or vowel
onset) by the four criteria (table 6.2). If a candidate peak fails any of the criteria, then the peak
is discarded as being non-significant and the segment is labeled as a fricative. If a candidate peak
passes each of the four criteria (table 6.2), then a significant peak is found. To be classified as a
plosive, this significant peak has to have an ROR value above the predetermined ROR threshold.
A significant peak that has an ROR value below the predetermined ROR threshold is classified as
a fricative. As we can see in fig. 6.5 and 6.6, the difference between the mean ROR value of /x/
and /k/ is large. Figure 6.7 summarizes the decision-tree algorithm that is used in this study. It differs from the original algorithm by Weigelt et al. (1990) in some points.
First, one criterion for discarding spurious peaks has not been used (criterion 4, see table 6.2), because pilot experiments showed that this criterion was too strict and did not work well on our material. Secondly, the pilot experiments showed that the other heuristically determined thresholds in the criteria were too strict as well. Therefore, we trained this classifier with varying thresholds and heuristically determined less strict but optimized threshold values, suited to our material, to be used in the criteria.
Three criteria from table 6.2 were reformulated into thresholds which were varied in the following
way:
crit. 1:        E_peak+1...peak+49 > {0.5; 0.6; 0.7; 0.8; 0.9; 1.0} × E_peak
crit. 2:        max E_peak+1...peak+49 > {4, 6, 8, 10, 12} dB + E_peak
crit. 3:        max zcr_peak+1...peak+49 > {1600, 1700, 1800, 1900, 2000} zcr/s
ROR threshold:  {1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400} dB/s
Trying out all possible combinations of thresholds takes a lot of time; therefore this process was
automated by a Perl script that tries all possible combinations in a given range.
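The exhaustive search over threshold combinations (automated in the thesis by a Perl script) amounts to a grid search. A hedged Python equivalent is sketched below; the accuracy function and the token list are simplified stand-ins, since the real evaluation runs the full decision tree of fig. 6.7 over the training material.

from itertools import product

# Threshold grids for criteria 1-3 and the ROR threshold (section 6.3.1).
C1 = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
C2 = [4, 6, 8, 10, 12]
C3 = [1600, 1700, 1800, 1900, 2000]
ROR_T = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400]

# Toy training tokens (peak ROR in dB/s, true label); purely illustrative values.
TOKENS = [(900, "x"), (1500, "x"), (2100, "x"), (5200, "k"), (9800, "k"), (15000, "k")]

def accuracy(c1, c2, c3, ror_t):
    # Stand-in evaluation: only the ROR threshold is applied here; the real script
    # also checks criteria 1-3 on the energy and zero-crossing contours.
    correct = sum(("k" if ror > ror_t else "x") == label for ror, label in TOKENS)
    return correct / len(TOKENS)

best = max(product(C1, C2, C3, ROR_T), key=lambda combo: accuracy(*combo))
print("best thresholds (c1, c2, c3, ROR):", best)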
Automatic phone segmentation by HTK
Feature extraction by PRAAT
Get the three highest ROR peaks; check whether one of these peaks is significant, starting with the highest peak:
  crit. 1: E_peak+1...peak+49 > E_peak?              No: it is a fricative.  Yes: continue.
  crit. 2: max E_peak+1...peak+49 > 12 dB + E_peak?  No: it is a fricative.  Yes: continue.
  crit. 3: max zcr_peak+1...peak+49 > 2000 zcr/s?    No: it is a fricative.  Yes: it is a significant peak.
  ROR value of this significant peak > 2240 dB/s?    No: it is a fricative.  Yes: it is a plosive.
Figure 6.7: Algorithm from Weigelt et al. (1990) translated into a decision tree.
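A compact sketch of this decision tree, assuming the log RMS energy contour E (dB per 1-ms frame), the zero-crossing-rate contour zcr (crossings per second per 1-ms frame), the frame index of a candidate ROR peak and its ROR value are already available (cf. section 6.3.1). The thresholds default to the original values of Weigelt et al. (1990); the criterion-1 factor can be lowered as in the tuned version above.

import numpy as np

def classify_x_vs_k(E, zcr, peak_idx, ror_peak,
                    c1_factor=1.0, c2_db=12.0, c3_zcr=2000.0, ror_threshold=2240.0):
    # Decision tree of fig. 6.7: returns 'fricative' (/x/) or 'plosive' (/k/).
    window = slice(peak_idx + 1, peak_idx + 50)            # the 49 ms following the peak
    # crit. 1: the energy must stay above (a fraction of) the energy at the peak
    if np.min(E[window]) <= c1_factor * E[peak_idx]:
        return "fricative"
    # crit. 2: the energy must rise at least c2_db dB above the energy at the peak
    if np.max(E[window]) <= E[peak_idx] + c2_db:
        return "fricative"
    # crit. 3: the zero-crossing rate must reach a voiceless-speech level
    if np.max(zcr[window]) <= c3_zcr:
        return "fricative"
    # main criterion: a significant peak above the ROR threshold indicates a plosive
    return "plosive" if ror_peak > ror_threshold else "fricative"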
6.3.2 Method II & acoustic measurements
Method II involves an LDA classification that uses a feature from the Weigelt et al. (1990) algorithm, namely the ROR feature. In method II, the peak in the ROR contour is not examined for its significance (i.e. the peak does not have to pass the three criteria): the highest peak is simply taken as the ROR value and assumed to correspond to the consonant onset. The ROR value is supplemented with four amplitude measurements
and duration (either raw or normalized). Normalized duration can be used alternatively instead of
raw duration and is computed the same way as explained in section 5.3 (by multiplying articulation
rate per person with raw duration).
Automatic phone segmentation by HTK
    ↓
Feature extraction by PRAAT
    ↓
Feature vector: {highest ROR_i, E_i−5, E_i+5, E_i+10, E_i+20, dur}
    ↓
Train LDA classifier with these feature vectors in PRAAT
    ↓
Trained LDA classifier classifies new cases in PRAAT
Figure 6.8: Method II for /x/ pronunciation error detector
All six features were extracted automatically by a PRAAT-script (see appendix C) that performs
the measurements based on the automatic phone segmentations and the sound files. The four energy
measurements consist of Ei−5 , Ei+5 , Ei+10 and Ei+20 , where i is the number of the frame with
the highest ROR value. So, in other words, one energy measurement was taken 5ms before the
highest ROR peak and three energy measurements were taken at 5ms, 10ms and 20 ms after the
highest ROR peak. The energy measurements were taken to model the amplitude envelope of the consonant, and to examine whether the energy pulse is large enough in both amplitude and duration to be considered a speech sound. Finally, duration, expressed in number of frames,
was used as a feature because fricatives are generally longer than plosives.
So, in total, six features were used to train the /x/-/k,g/ LDA-classifier: RORi , Ei−5 , Ei+5 ,
Ei+10 , Ei+20 (where i is the number of frame with the highest ROR value) and nodur, dur or
normdur (duration expressed in number of frames, either raw or normalized).
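Given the same E and ROR contours as above (1-ms frames), assembling the six-feature vector of method II is a matter of indexing around the highest ROR peak. A minimal sketch, assuming the peak does not lie at the very edge of the segment:

import numpy as np

def method2_features(E, ror, duration_frames):
    # Six-feature vector of method II: {ROR_i, E_i-5, E_i+5, E_i+10, E_i+20, dur},
    # where i is the frame index of the highest ROR peak and dur is the segment
    # duration in frames (multiply by the speaker's articulation rate to obtain
    # normalized duration, cf. section 5.3).
    i = int(np.argmax(ror))
    return np.array([ror[i], E[i - 5], E[i + 5], E[i + 10], E[i + 20], duration_frames])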
6.4 Experiments and results for /x/-/k,g/
6.4.1 Organization of experiments
The organization of the experiments is similar to the organization of the vowel classification experiments described in section 5.4.1, so we have carried out the same A-, B- and C-experiments.
For method I, the experiments consisted of tuning the values for the thresholds in the algorithm
and testing the algorithm with these tuned values.
There are six features available to test in method II: an ROR value (ROR), four energy measurements (i1, i2, i3, i4) and duration (nodur, dur or normdur). We have experimented with different
combinations of features to see which feature set performs best. Also, LDA provides statistical
tests to examine the discriminative power of each feature set or each specific feature.
The use of duration was again examined, in a way similar to that applied in the vowel classification experiments: each feature set was tested either without duration, with raw duration or with normalized duration. For some feature sets, the addition of duration was tested for statistical significance.
The feature set with the most features, which therefore might be most complete in describing
plosives and fricatives, was [ROR i1 i2 i3 i4]; and because we wanted to examine the effectiveness of
ROR, we also tested [i1 i2 i3 i4]. The number of energy measurements was chosen rather arbitrarily;
the idea behind this was to model the amplitude envelope with these four energy measurements
around the ROR peak. In the case of a plosive, i1 would differ greatly from i2, i3 and i4; in the case
of a fricative the levels of i1, i2, i3 and i4 would differ less from each other. To examine whether /x/
was distinguishable from /k,g/ with fewer features than presented, we replaced the three energy
measures after the ROR peak [i2 i3 i4] by one energy measure i3 and tested [ROR i1 i3]. Again, the
effectiveness of ROR could be examined by testing [i1 i3] and [ROR i3] as well. Besides determining
the relative importance of ROR and the amplitude measurements by looking at the results of the
different feature sets, LDA offers many other ways of determining the relative importance of each
feature (section 4.3.1), such as Stepwise-LDA or by looking at Wilks’ Lambda of each feature.
6.4.2 Experiments and results method I
We experimented with method I by varying the thresholds, as described in section 6.3.1, which were
heuristically determined in Weigelt et al. (1990). All possible combinations of varying thresholds
were tested automatically by a Perl script and it was immediately clear that the thresholds used
in the paper by Weigelt et al. (1990) were too strict for our material. Criterion 4 was not used
because it was too strict and all the other criteria were made less strict by using other values than
those given in Weigelt et al. (1990).
A-experiments
The A-experiments show that method I is able to discriminate between /x/ and /k/ rather well;
correct classifications range from 75.3% to 91.7% (see table 6.3). Furthermore, there is no loss of
performance in non-native speech (table 6.3, A.2 and A.4).
                                Male                                  Female
                                c1    c2   c3     ROR    %            c1    c2   c3     ROR    %
A.1 Training&Test=DL2N1-Nat     0.6   4    1700   2400   81.0%        0.6   4    1700   2400   75.3%
A.2 Training&Test=DL2N1-NN      0.6   5    1700   2400   80.0%        0.6   5    1700   2400   91.7%
A.3 Training&Test=IFA           0.8   4    1700   1400   89.7%        0.5   4    1700   2200   83.5%
A.4 Training&Test=TRIEST        0.5   4    1700   2200   90.0%        0.5   4    1700   2200   82.3%
Table 6.3: Criteria with which the highest results were achieved in the A-experiments; classification results are in the column %.
B-experiments
The results of the B-experiments with natively trained classifiers (B.1 and B.2) are lower than those
with the non-natively trained classifiers, at least for male speakers: applying non-native speech to native models is known to be problematic and apparently this has affected speech from male
speakers more than speech from female speakers. For female speech, the classification percentage
is still very high, which is rather surprising, because this large difference between male and female
speakers was not found in the A-experiments.
                                          Male     Female
B.1 Training=DL2N1-Nat Test=DL2N1-NN      75.0%    91.7%
B.2 Training=IFA Test=TRIEST              76.7%    82.4%
Table 6.4: Classification results of the B-experiments, achieved with method I with thresholds from table 6.3
C-experiments
The results of the C-experiments are very poor. Similar to the results from method II, no firm
conclusions can be drawn yet because of the small number of pronunciation errors. More non-native
speech data, especially pronunciation errors made by L2 learners of Dutch, is needed for further
examination.
                                                     Male   Female
C.1 Training=DL2N1-Nat Test=DL2N1 mispronounced      -      10%
C.2 Train=DL2N1-NN Test=DL2N1 mispronounced          -      10%
C.3 Train=IFA Test=TRIEST mispronounced              -      41.7%
C.4 Train=TRIEST Test=TRIEST mispronounced           -      41.7%
Table 6.5: Classification results from the C-experiments, achieved with method I and thresholds from table 6.3
6.4.3 Experiments and results method II
A-experiments
We started with the A-experiments: they show high classification results; for each corpus the highest correct classification percentages lie between 85.0% and 95.0% (see fig. 6.9-6.12). The addition of
duration (either raw or normalized) results in almost every case in an increase in performance; this
is most visible in the TRIEST corpus (fig. 6.12). However, many improvements in performance
due to the addition of ‘dur’ are not statistically significant as table 6.6 shows (we added ‘dur’ to
[i1 i3] and [ROR i1 i2 i3 i4] for comparison).
The results of the A-experiments show that the height of the ROR peak, which is the main feature in method I, appears to be less informative than the amplitude measurements, and that not all amplitude measurements are needed.
z-scores computed for different experiments
                                   Does the addition of ‘dur’ to        Does the addition of ‘dur’ to
                                   [i1 i3] improve classification       [ROR i1 i2 i3 i4] improve classification
                                   accuracy significantly?              accuracy significantly?
                                   Male        Female                   Male        Female
A.1 Training&Test = DL2N1-Nat      0.32        0.56                     0.28        -0.29
A.2 Training&Test = DL2N1-NN       0.00        0.00                     0.24        0.34
A.3 Training&Test = IFA            0.52        -0.29                    0.17        0.00
A.4 Training&Test = TRIEST         0.76        0.45                     0.40        0.90
∗ significant at 0.95
Table 6.6: Significance of the addition of ‘dur’ to [i1 i3] and [ROR i1 i2 i3 i4] for /x/ vs /k/
[Figure 6.9: A.1 Training=DL2N1-Nat, Test=DL2N1-Nat; correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), male and female speakers]
[Figure 6.10: A.2 Training=DL2N1-NN, Test=DL2N1-NN (/x/-/k,g/)]
ROR is not always needed for high classification
results, compare [i1 i3] to [ROR i1 i3] in fig. 6.9 - 6.12: the two feature sets almost always produce
approximately the same results, sometimes the accuracy is even better for [i1 i3] (fig. 6.10 male
speakers) than for [ROR i1 i3]. Compare [i1 i2 i3 i4] to [ROR i1 i2 i3 i4] (fig. 6.9 - 6.12) and we
can see again that the results of the two feature sets in many cases do not differ greatly from each
other.
Furthermore, when the results of [i1 i3] and [ROR i3] are compared to each other, we can
examine which feature is more discriminative: i1 or ROR? In almost every case (e.g. fig 6.9 and
6.10), [i1 i3] produced better results than [ROR i3], implying that ROR is less discriminative than
i1.
[Figure 6.11: A.3 Training=IFA, Test=IFA (/x/-/k,g/)]
[Figure 6.12: A.4 Training=TRIEST, Test=TRIEST (/x/-/k,g/)]
The tests with ROR and optionally duration (either raw or normalized) show that ROR should be supplemented with one or more amplitude measurements, since the classification results of [ROR
nodur/dur/normdur] are still at a level where approximately 79%-85% of all cases are correctly
classified as /x/ or /k/.
Another observation is that, sometimes, the classifier is slightly better able to discriminate
between /x/ and /k/ in female speech than in male speech (e.g. fig. 6.11). Remarkably, the
performance of the classifier does not decrease strongly in non-native speech (e.g. compare fig. 6.9
to 6.10) as we have seen in the case with vowels (compare fig. 5.8 to 5.9, and compare 5.22 to
5.23). It seems that fricatives and plosives pronounced by both non-native speakers and native
speakers are equally well discriminated from each other, whereas vowels from non-native speech
are clearly less well discriminated from each other, and, therefore, they are possibly less accurately
pronounced than vowels from native speech (this is also discussed in section 6.5).
B-experiments
The B-experiments also showed relatively high classification results: percentages ranging from 81%
to 95% were observed in the B.1 condition (Training=DL2N1-Nat, Test=DL2N1-NN, fig. 6.13), where
the classifier is natively trained; this is more or less equal to the non-natively trained classifier in fig.
6.10. For the B.2 condition (Training=IFA, Test=TRIEST, fig. 6.14), somewhat lower percentages
ranging from 68% to 85% were found compared to fig. 6.12 (possible explanations for this can be
found in section 6.5). The addition of duration (either raw or normalized) resulted in a consistent
improvement in exp. B.1 but not in exp. B.2. In general, these natively trained classifiers are
able to cope with non-native speech at a level of approximately 80% to 95% correct classification.
[Figure 6.13: B.1 Training=DL2N1-Nat, Test=DL2N1-NN. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), male and female speakers.]
[Figure 6.14: B.2 Training=IFA, Test=TRIEST. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), male and female speakers.]
C-experiments
The results of the C-experiments were not very good (see fig. 6.15-6.18), but no firm conclusions
can be drawn from these figures due to the small (absolute) numbers of mispronunciations of /x/
as /k,g/ (see table 6.7).
          CITO-NN    TRIEST
Male      2          0
Female    10         12

Table 6.7: Absolute numbers of mispronunciations of /x/ as /k,g/
Only the results from female speakers are shown here because the numbers of mispronunciations
are too small for male speakers. The non-natively trained classifiers (fig. 6.17 and 6.18) performed
better than the natively trained classifiers (fig. 6.15 and 6.16).
[Figure 6.15: C.1 Training=DL2N1-Nat, Test=DL2N1 mispronounced. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), female speakers only.]
[Figure 6.16: C.3 Training=IFA, Test=TRIEST mispronounced. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), female speakers only.]
[Figure 6.17: C.2 Training=DL2N1-NN, Test=DL2N1 mispronounced. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), female speakers only.]
[Figure 6.18: C.4 Training=TRIEST, Test=TRIEST mispronounced. Correct classification % for /x/-/k,g/ per feature set (NODUR/DUR/NORMDUR), female speakers only.]
6.5 Discussion of results
Discussion of the results of method I (decision tree)
By setting only four absolute thresholds it is possible to distinguish /x/ from /k/ with high accuracy.
These thresholds differed from the original thresholds set by Weigelt et al. (1990) in that our
thresholds were made less strict. For example, the 12 dB in criterion 2 was lowered to 4 or 6
dB, and the zero-crossing rate in criterion 3 was lowered from 2000 to 1700. The ROR threshold
was set at around 2000-2400 dB/s, which is close to the original threshold. The highest
classification results were obtained with minimal differences in thresholds between the corpora.
The thresholds can thus be set and optimized individually for each corpus, so that the
algorithm is made optimally suitable for specific material.
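In essence, method I is a small cascade of ‘if ... > threshold then ...’ rules over these measures. The Python fragment below is only a schematic sketch of that structure, not the thesis’ implementation: it keeps just the ROR criterion and a single amplitude-rise criterion, and the default threshold values are merely indicative of the ranges mentioned above. The full method uses four criteria (ROR, zero-crossing rate and two amplitude criteria) with corpus-specific thresholds, following Weigelt et al. (1990).

def classify_velar(ror_peak, amp_rise_db, ror_threshold=2200.0, rise_threshold=6.0):
    # Toy two-threshold cascade in the spirit of method I.
    # ror_threshold is in dB/s (within the 2000-2400 dB/s range used here);
    # rise_threshold is in dB (the relaxed version of the original 12 dB).
    # Returns 'k' (plosive-like) or 'x' (fricative-like).
    if ror_peak > ror_threshold and amp_rise_db > rise_threshold:
        # a steep and large amplitude rise points to a burst, hence a plosive
        return "k"
    # otherwise the amplitude onset is too gradual for a burst: velar fricative
    return "x"

print(classify_velar(ror_peak=2600.0, amp_rise_db=9.0))   # -> k
print(classify_velar(ror_peak=1500.0, amp_rise_db=3.0))   # -> x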
Duration, which sometimes appeared to be a useful feature in method II, was not used in
method I at all, and still the performance of method I was equally high. Duration was left out of
method I because it is difficult to set an absolute threshold for it. The two methods thus show
that the choice of features also depends on the algorithm that is going to be used.
The results of the C-experiments were low; a similar explanation as in section 6.4.3 applies
here as well: the number of mispronunciations of /x/ is rather low, so no clear-cut conclusions can
be drawn from the C-experiments, and here the /g/ was not included in the training data either.
Discussion of the results of method II (LDA)
The results of the A-experiments in section 6.4.3 show that all feature sets, with varying numbers of
features, gave relatively high classification results ranging from 79%-95%. How well the distinction
between /x/ and /k/ can be made is illustrated by fig. 6.19 and 6.20, where the discriminant scores
and group centroids of the feature set [ROR i1 i2 i3 i4 dur] are plotted: the group centroids of
the two classes of phonemes differ strongly from each other, which indicates that the discriminant
function performs rather well.
To maximize efficiency, one would like to use as few features as possible while still achieving good
results. Not all features ([ROR i1 i2 i3 i4 dur/normdur]) might be needed; in particular, the sets
[ROR i1 i3], [ROR i3] and [i1 i3] also produced high classification results, sometimes even higher
than [ROR i1 i2 i3 i4 dur] (see e.g. fig. 6.11).
[Figure 6.19: Whisker’s box plot, distribution of discriminant scores of /x/-/k,g/ with [ROR i1 i2 i3 i4 dur], male IFA speakers.]
[Figure 6.20: Whisker’s box plot, distribution of discriminant scores of /x/-/k,g/ with [ROR i1 i2 i3 i4 dur], female IFA speakers.]
A correlation matrix of the features was examined to investigate whether some features were
strongly correlated with each other, so that one of these strongly correlated features might be
considered superfluous.
       ROR       i1        i2       i3       i4       dur
ROR    1∗        -0.78∗    -0.08    -0.01    0.06     0.01
i1     -0.78∗    1∗        0.60∗    0.53∗    0.45∗    0.01
i2     -0.08     0.60∗     1∗       0.98∗    0.94∗    0.04
i3     -0.01     0.53∗     0.98∗    1∗       0.98∗    0.03
i4     0.06      0.45∗     0.94∗    0.98∗    1∗       0.03
dur    0.01      0.01      0.04     0.03     0.03     1∗
∗ significant at 0.99

Table 6.8: Correlation matrix (Pearson) of features used in the /x/-/k,g/ distinction, from the male IFA corpus
In the correlation matrix (table 6.8; all other correlation matrices are quite similar, so only one
is shown here) we can see that ROR is strongly (negatively) correlated with i1 (|r| > 0.70)
and not correlated with the other variables. The fairly high correlation between ROR and i1
can be explained by the fact that i1 is measured 5 ms before the ROR peak: peaks are logically
preceded by relatively low amplitude levels, otherwise there would be no peak. Furthermore, the three
energy measurements i2, i3 and i4 are highly correlated with each other (>0.90) and somewhat
less correlated with i1 (0.40-0.60). This suggests that not all three energy measurements
i2, i3 and i4 are necessary and that, for example, i3 could replace all three. Duration
is not correlated at all with the other variables, which suggests that duration is independent and can
be a useful feature. So, the correlation matrices show that, in principle, three measurements are
sufficient to achieve good results: {ROR or i1}, {i2 or i3 or i4}, and duration. The A-experiments
have shown this to be true: [i1 i3] and [ROR i1 i3] (with or without duration) perform very well
(>85%), sometimes even better than [ROR i1 i2 i3 i4].
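Such a redundancy check is easy to reproduce outside SPSS. The fragment below (Python/pandas, not the tooling used in this study) computes the same kind of Pearson correlation matrix, assuming the per-segment features have been collected into a table; the file name and column names are assumptions made for this sketch.

import pandas as pd

# Hypothetical feature table: one row per /x/ or /k/ segment, with the features
# produced by the extraction scripts (column names are assumed for this sketch).
feats = pd.read_csv("xk_features_male_ifa.csv")
feats = feats[["ROR", "i1", "i2", "i3", "i4", "dur"]]

# Pearson correlation matrix, comparable to table 6.8: highly correlated columns
# (such as i2, i3 and i4) are candidates for pruning, whereas a weakly correlated
# column such as dur is likely to add independent information.
print(feats.corr(method="pearson").round(2))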
Another way to determine which features are most effective in the /x/-/k,g/ distinction is to
examine the output of the LDA classification given in SPSS. In section 4.3.1 I explained
that low Wilks’ Lambda values and high (absolute) standardized discriminant function coefficients
indicate large discriminative power of a particular feature. Furthermore, a stepwise LDA can
reveal which features contribute significantly to the discriminant model and therefore should be
used, and which features do not contribute significantly and can therefore be left out of the model.
Several stepwise analyses were carried out and revealed that the model could indeed be
significant with combinations of a small number of features, which in almost every case included
i1, dur/normdur, i4 and ROR. The stepwise analyses confirm that some features used in method II
are superfluous, and that with just 2-3 features (e.g. [i1 i3] or [i1 i3/i4 dur/normdur]) the classifier
is able to discriminate /x/ from /k/ at a high level of accuracy (>85%, see fig. 6.9-6.12).
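The stepwise analyses mentioned here were run in SPSS. Purely as an illustration of the same idea, the sketch below uses scikit-learn’s LDA with forward feature selection, which selects on cross-validated accuracy rather than on a Wilks’ Lambda criterion and is therefore only a rough analogue; the file name, column names and label column are hypothetical.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Hypothetical data: acoustic features plus a label column ('x' or 'k').
data = pd.read_csv("xk_features_male_ifa.csv")
X = data[["ROR", "i1", "i2", "i3", "i4", "dur"]]
y = data["label"]

lda = LinearDiscriminantAnalysis()

# Forward selection keeps adding the feature that improves cross-validated
# accuracy the most, until three features have been selected.
selector = SequentialFeatureSelector(lda, n_features_to_select=3, direction="forward", cv=5)
selector.fit(X, y)
chosen = list(X.columns[selector.get_support()])
print("selected features:", chosen)

# Accuracy of the pruned model, for comparison with the full feature set.
print("cross-validated accuracy:", cross_val_score(lda, X[chosen], y, cv=5).mean())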
Duration was useful in the B-experiments for the DL2N1 material, but not for the IFA-TRIEST
material. The same explanation that was given in section 5.5.1 for the /A/-/a:/ classification
experiments might be applied here as well: the two databases might be too different from each
other in several respects.
Furthermore, the B.2 experiment (fig. 6.14), where the classifier is natively trained, shows that
performance has decreased compared to the non-natively trained classifier in A.4 (fig. 6.12). This
decrease in performance probably has two causes, which were also discussed in the vowel discussion
section: it is known that applying a natively trained model to non-native speech can be problematic,
and the training and test material originated from two different databases, recorded under different
conditions, which can lead to more acoustic variation. The second cause does not apply to the
results of B.1 (fig. 6.13), where both training and test material originate from the same database.
When we compare B.1 (fig. 6.13) to A.2 (fig. 6.10), we see that the first cause does not apply
either: there is no loss of performance whether non-native speech is evaluated by a natively or a
non-natively trained classifier. Apparently, in this case non-native /x/s and /k/s do not deviate
strongly from native /x/s and /k/s. Moreover, this might also be due to the fact that the relation
between the steepness of the onset of fricatives and plosives is to a large degree language independent.
This is an example of a case where acoustic-phonetic features are more powerful than ‘blind’
confidence measures.
According to the results of the C-experiments, it seems that the /x/-/k,g/ classifier is not able
to detect mispronounced /x/s very well. This could be due to the fact that we did not have enough
occurrences of /g/, which is not a common Dutch sound, with which to train the classifier.
Not much can be concluded yet about the results of the C-experiments obtained with methods
I and II, because the number of /x/ mispronounced as /k/ or /g/ is too small. But the
results from the C-experiments do indicate that, although both /k/ and /g/ are velar plosives, the
acoustic difference caused by the voicing distinction is possibly large enough for the /g/ not to be
classified correctly by a classifier that is trained only on the /x/-/k/ distinction. The /g/,
a voiced velar plosive, differs in some ways from the voiceless velar /k/: the silent interval
preceding the burst might not be completely silent due to prevoicing, and the burst might not be
as explosive as in the case of a /k/. These two factors influence the height of the ROR peak
of /g/, and it is exactly this kind of information that is missing for the /g/ in the classifier. Still,
/k/ and /g/ are acoustically very similar sounds, except for some minor differences that make /g/
a somewhat “milder” version of /k/. Apparently, these minor differences are large enough to
make modelling of /g/ necessary. For the C-experiments we need more pronunciation errors to test
the algorithm: the absolute numbers are now too small to draw clear-cut conclusions from.
Although both methods predict better than chance (the chance level, MCC, of approximately 50%
lies well below the 70%-95% accuracy of the two methods, at least for the A- and B-experiments),
in general method II seems to perform better than method I. Evidence for this claim is given in
table 6.9, where the results of method I are compared to the results of method II. In almost every
case, the results of method II with only 3 features [i1 i3 normdur] are higher than those of method I.
We also see in table 6.9 that, at least for method II with the features [i1 i3 normdur], the
difference between the classification results of male and female speakers is small. The effect of
using gender-dependent models is thus not large for the /x/-/k,g/ classifier (for method II), whereas
for the vowels we have seen that gender-dependent models are preferred, since there are large
differences between the results of male and female speakers (fig. 5.10 and 5.24).
Experiment                        Method I           Method II, [i1 i3 normdur]
                                  M        F         M        F
A.1 Training&Test = DL2N1-Nat     81.0%    75.3%     93.1%    94.1%
A.2 Training&Test = DL2N1-NN      80.0%    91.7%     95.9%    95.9%
A.3 Training&Test = IFA           89.7%    83.5%     86.4%    86.6%
A.4 Training&Test = TRIEST        90.0%    82.3%     90.0%    90.2%

Table 6.9: Comparison of the results of method I and method II (feature set = [i1 i3 normdur])
Is there a preference for one method over the other? Table 6.10 lists some advantages and
disadvantages of the two methods. It seems that method II is to be preferred to method I.
Method I (Weigelt et al. (1990), classification tree):
-  costs relatively much computing time to establish the threshold values with which the best
   classification results are achieved (one has to look for the boundaries oneself)
-  establishing the best combination of threshold values is automated, with a possible loss of
   insight into the discriminative weight of features (a sort of “trial & error”)
+  only 4 criteria: ROR (1 criterion), zero-crossing rate (1 criterion), amplitude (2 criteria)
-  not all variables are easy to fit into a classification tree

Method II (LDA classifier):
+  LDA computes the discriminant function automatically; the training process is relatively fast
   (LDA computes the boundaries for you)
+  LDA offers ways of determining the discriminative weight of individual features, and thus can
   give more insight into their discriminative weight
+  only 3 measures: i1, i3, duration
+  many different kinds of (numeric) discriminative variables can easily be used

Table 6.10: Advantages and disadvantages of the two methods for /x/-/k,g/
Both classifiers for /x/-/k,g/ perform better than the classifiers for /A/-/a:/ and /Y/-/u,y/
in non-native speech. The performance of the vowel classifiers degraded strongly when non-native
speech was used instead of native speech, whereas the performance of the consonant classifiers
remained roughly equal. This might be because vowels are more difficult to classify: the distinction
between vowels is projected on a more continuous scale, whereas the distinction between consonants
is less vague. In non-native speech the boundaries between classes of sounds become even more
“blurred”, especially in the case of vowels. Apparently, this “blurriness” caused by non-native
speech affects consonants less than vowels, since in the case of the /x/-/k,g/ classifiers performance
does not decrease in non-native speech.
Chapter 7
Conclusions and summary of results
7.1 Introduction
In this final chapter, summaries of all the results are given and, more importantly, answers are
given to the questions posed at the beginning of this thesis. Summaries of all the classification
experiments can be found in sections 7.2, 7.3 and 7.4. Answers to the thesis question and three
subordinate questions are given in section 7.5; this section also includes suggestions for further
research.
7.2 Summary of /A/-/a:/
Features that were used for training the /A/-/a:/ classifier include: the three lowest formants,
pitch and duration. Dynamic information was added by taking more samples at 25% and 75% of
vowel duration. Furthermore, duration was normalized by multiplying the articulation rate per
person with raw vowel duration. It was expected that duration would be an important feature
since the primary acoustic difference between /A/ and /a:/ is length. Classification experiments
were carried out under different training conditions to examine how differently trained classifiers
would cope with non-native speech. A summary of the results of these LDA vowel classification
experiments is given below:
• Duration is in most cases a discriminative feature (differences in length between /A/ and
/a:/ are shown in fig. 5.4). We have seen that when duration (either raw or normalized)
was added as a feature, the classification percentages increased strongly (e.g. fig. 5.8 (A.1)
- 5.11 (A.4), fig. 5.16 (B.1) and 5.17 (B.2)). This was expected because length is one of
the primary acoustic distinctions between /A/ and /a:/ (see the length difference in fig. 5.4).
However, duration was most of the time not very helpful in discriminating between a correctly
pronounced /A/ and an /A/ wrongly pronounced as /a:/ (fig. 5.18 (C.1) and 5.19 (C.2)).
• No significant increase in classification accuracy, as described in Hillenbrand et al. (1995),
was seen when dynamic information was added (see fig. 5.12 - 5.15). A possible explanation
of the discrepancy between the results of Hillenbrand et al. (1995) and our results is that
Hillenbrand et al. (1995) used vowels all taken from the same /hVd/ context, whereas the
vowels used in our study are taken from different contexts. Furthermore, our classification
process is fully automated (including segmentation and feature extraction), and we aimed at
classifying vowel pairs instead of a set of vowels.
• Since no significant, consistent increase in classification accuracy, as described in Hillenbrand
et al. (1995), was seen when F0 and/or F3 were added to [F1 F2 nodur/dur/normdur] (fig.
5.8 - 5.11), F1, F2 and duration (either raw or normalized) seem to be the most important
acoustic features to discriminate /A/ from /a:/. Adding F0 and F3 sometimes improves
classification accuracy.
• The distinction /A/-/a:/ is better made in native female speech than in native male speech
(e.g. fig. 5.8 and 5.10), possibly because female speakers pronounce sounds more accurately
and their vowel space is larger than that of male speakers. This gender-dependent accuracy
difference was not found for non-native speakers; probably due to the fact that the two
possible causes mentioned above do not apply to non-native speakers, since they do not fully
master the L2 in question yet.
• The distinction between /A/-/a:/ is generally better made in native speech than in non-native
speech (compare fig. 5.8 to 5.9 and compare fig. 5.10 to 5.11). Non-native speech deviates
from native speech: non-native speech is generally less clearly and accurately pronounced
than native speech, which causes the distinction between /A/-/a:/ to be projected on an even
more continuous scale than was already the case for native speech.
• The natively trained /A/-/a:/ classifiers are able to cope with non-native speech with a
classification accuracy ranging from 55%-73% (fig. 5.16 (B.1) and 5.17 (B.2)).
• The use of normalized duration, which is computed by taking the product of vowel duration
and articulation rate, instead of raw duration, can be very helpful (see fig. 5.16 and 5.17).
Normalization can be especially helpful when vowels from slow speakers (non-native speakers)
have to be compared to vowels from relatively fast speakers (native speakers); this kind of
duration normalization prevents all vowels from non-native speakers from automatically being
longer than those of native speakers.
• The /A/-/a:/ classifier is able to detect /A/s mispronounced as /a:/ in non-native speech at
a reasonable level (70%-80%) of accuracy with natively trained LDA-classifiers (fig. 5.18 and
5.20); LDA-classifiers trained on non-native speech performed worse.
7.3 Summary of /Y/-/u,y/
The /Y/-/u,y/ classifier was trained and tested with the same features that were used for the /A/-/a:/ distinction: the lowest three formants (with the addition of dynamic information), pitch and
duration (either raw or normalized). The task of the classifier was to classify a new case in either
a group that was labelled as correct (/Y/) or a group that was labelled as incorrect (/u/ and /y/).
Vowel classification experiments were carried out under different training and testing conditions.
Additional tests were carried out to determine whether it was possible to let the classifier make a
tertiary choice, i.e. to classify a new case as either /Y/ or /u/ or /y/, instead of making the binary
choice between /Y/ and /u,y/. A summary of the results obtained from the vowel classification
experiments is shown below:
• The addition of duration does not lead to large improvements in classification accuracy in
most cases (fig. 5.23 (A.2) - 5.25 (A.4)). Differences in length between /Y/ and /u,y/ are
shown in the histograms in figs. 5.5 and 5.6.
• In most cases, no increase in classification accuracy, as described in Hillenbrand et al. (1995),
was observed when dynamic information was added (for possible explanations see section 7.2).
• Since no significant increase in classification accuracy, as described in Hillenbrand et al.
(1995), was observed when F0 and/or F3 was added (fig. 5.22 (A.1) - 5.25 (A.4)), F1 and F2
seem to be the two most important acoustic features for discriminating /Y/ from /u,y/.
• The distinction between /Y/-/u,y/ is generally better made in native speech than in non-native
speech (compare fig. 5.22 (A.1) to 5.23 (A.2) and compare fig. 5.24 (A.3) to 5.25 (A.4)).
• The classifiers trained on native speech are able to cope with non-native speech with a
classification accuracy ranging from 60%-73% (see fig. 5.28 and 5.29).
• The use of normalized duration, which is computed by taking the product of raw vowel
duration and articulation rate, is most of the time not more effective than the use of raw
duration (fig. 5.22 - 5.25 and fig. 5.28 and 5.29).
• The /Y/-/u,y/ classifier is able to detect /Y/s mispronounced as either /u/ or /y/ in non-native
speech at a higher level of accuracy when the classifiers are non-natively trained (fig.
5.32 and 5.33) than when they are natively trained (fig. 5.30 and 5.31).
• For feedback purposes, the classifier can be trained to make a tertiary distinction among
/Y/, /u/ and /y/ instead of a binary distinction between /Y/ and /u,y/, without a big loss
in performance (fig. 5.34 - 5.37).
7.4 Summary of /x/-/k,g/
For the /x/-/k,g/ distinction we first tested an existing algorithm written by Weigelt et al. (1990).
This algorithm uses features such as ROR (Rate Of Rise, i.e. derivative of amplitude), amplitude
and zero-crossing rate. We rewrote this algorithm into a simple decision tree that uses (absolute)
thresholds to classify new cases as either /x/ or /k,g/.
The second method uses LDA classification and features that were based on the algorithm
by Weigelt et al. (1990). We chose features that could model the amplitude envelope of the velar
fricative and the velar plosive since acoustic differences between these two sounds are mostly present
in their amplitude envelopes. Therefore, we used the derivative of amplitude (ROR, Weigelt et al.
1990), four amplitude measurements and duration (either raw or normalized) as features. Duration
was added because, in general, fricatives are slightly longer than plosives. The LDA classifier was
trained and tested with these features and the importance of individual features was examined as
well. Training and test conditions were the same as in /A/-/a:/ and /Y/-/u,y/.
Summaries of the results of the classification experiments carried out with method I and
method II are given below:
For method I:
• It is possible to develop a decision tree (with absolute thresholds) that is able to distinguish
/x/ from /k,g/ at a level of performance that is slightly worse than the level of performance
of the statistical LDA method (compare table 6.3 to fig. 6.9 - 6.12).
• Generally, the performance of the classifier does not decrease when non-native speech is
applied (see table 6.3). However, in some cases the performance did decrease in the
B-experiments, when non-native speech was evaluated by a classifier that was trained with
native speech (see table 6.4).
• The natively trained classifier is able to cope with female non-native speech better than with
male non-native speech (table 6.3 and 6.4).
For method II:
• The addition of duration (either raw or normalized) results in most cases in a small increase
in classification performance (e.g. fig. 6.9 and 6.12). However, for some feature sets, this
increase was not significant (see table 6.6).
• i1 is in most cases a stronger discriminative feature than ROR: when i1 is used ([i1 i3]) instead
of ROR ([ROR i3]), the classification results are usually higher (compare [i1 i3] to [ROR i3]
in fig. 6.9 - 6.12).
• In many cases, ROR as an additional feature slightly improves classification accuracy (compare [i1 i3] to [ROR i1 i3] and [i1 i2 i3 i4] to [ROR i1 i2 i3 i4] in fig. 6.9 - 6.12). ROR adds
information about the explosiveness of the burst and contributes in this way to the distinction
between /x/ and /k,g/.
• Not all features introduced are needed to discriminate /x/ from /k,g/: classification results show that high classification accuracy (>90%) can also be achieved with [i1 i3
nodur/dur/normdur] (see e.g. fig. 6.9 and 6.10).
• There is almost no loss in performance for the /x/-/k,g/ classifier in non-native speech (compare fig. 6.9 with 6.10, and compare 6.11 with 6.12). Furthermore, there is almost no loss in
performance when non-native speech is evaluated by the /x/-/k,g/ classifier that is trained
with native speech (fig. 6.13) instead of non-native speech (fig. 6.10). Accuracy did decrease
slightly in exp. B.2, where non-native speech from TRIEST is applied to a classifier trained
with native IFA speech (fig. 6.14) instead of a classifier trained with non-native TRIEST
speech (fig. 6.12).
In both methods, C-experiments were carried out to test whether the /x/-/k,g/ classifiers could
detect pronunciation errors of /x/. Unfortunately, the amount of test data is too small to draw
conclusions from these results. However, preliminary results already seem to show that the training
of /g/ should not be omitted; although both /g/ and /k/ are velar plosives, modelling only /k/ to
detect mispronunciations of /x/ as /k/ or /g/ is not sufficient. Again, this confirms the fact that
we should employ more specific methods. Furthermore, the two methods were compared in the
discussion section (6.5) and it seems that using method II is more advantageous than using method
I (see table 6.9 and 6.10). The most important advantage is that method II, with only 3 features
[i1 i3 normdur], works slightly better than method I.
7.5 Conclusions
The goal of this study was to develop automatic classification techniques based on acoustic features
to detect pronunciation errors in L2 speech. We have presented two types of automatic classification
methods with which detection of pronunciation errors is possible: a method that uses Linear
Discriminant Analysis (statistical method) and a classification tree-based method (non-statistical
method). For the two vowel pairs /A/-/a:/ and /Y/-/u,y/ we used LDA and for the consonant
pair /x/-/k,g/, both LDA and classification tree-based methods were used. These classifiers were
all tested under different conditions to answer the thesis question: How effective are automatic
acoustic-phonetic-based classification techniques in detecting pronunciation errors of L2 learners of
Dutch? On the whole, the classifiers detected our annotated (by human listeners) selection of
pronunciation errors at a poor level for consonants and at a good level for vowels. The /A/-/a:/
classifier was able to detect more than 80% of the DL2N1 annotated pronunciation errors, and 80%-90%
of all pronunciation errors of /A/ in the TRIEST corpus were detected correctly. These high
classification results indicate that the machine judgments agree well with the human judgments. The
/Y/-/u,y/ classifier also performed well at detecting pronunciation errors; percentages between
80% and 100% were observed. But these high percentages were obtained under different
training conditions than those of /A/-/a:/: for all corpora, these high percentages were observed
when the /Y/-/u,y/ classifiers were trained with non-native speech (as opposed to native speech
in the case of /A/-/a:/). When the /Y/-/u,y/ classifiers were trained on native speech to
detect pronunciation errors, percentages decreased to approximately 70% for CITO material and
even to 20% for female TRIEST material. Intuitively, it may seem that natively trained classifiers
should be used to detect pronunciation errors, because one would expect the native norm to be
better at discriminating wrong from correct pronunciations, and L2 learners should preferably learn
the L2 according to a native norm. But adopting the “native norm” does not mean that non-native
speakers should learn “accent-free” speech; the native norm must not be too strict. It seems that
the question of how to train the classifiers, with native or non-native speech (Q2), remains open
and needs to be investigated further.
The /x/-/k,g/ classifier detected around 40%-60% of all annotated pronunciation errors when
natively trained; when the classifier was non-natively trained it performed slightly better. But the
number of annotated mispronunciations of /x/ as /k,g/ is too small to allow clear-cut conclusions to
be made about the results. The /x/-/k,g/ classifier, developed by method II, shows similar results.
It is clear that more non-native speech data with annotated pronunciation errors is needed before
any conclusions can be drawn about the performance of these /x/-/k,g/ classifiers in detecting
pronunciation errors.
Overall, the vowel classifiers were able to detect more than 50% of our selection of annotated
pronunciation errors: between 50%-80% of all pronunciation errors of /A/ and between 60%-100%
of all pronunciation errors of /Y/ were detected. Nevertheless, to examine the real accuracy of the
classifiers, real-time experiments with both native and non-native subjects should be carried out.
Some preliminary real-time tests have already been carried out with non-native subjects;
generally, the classifiers did not perform very well and produced many False Rejections. This indicates
that the work carried out in this study has to be continued and that improvements are necessary;
in the next subsection I will elaborate on how these classifiers can be improved. But first, I will
try to answer the more detailed questions that were also investigated in this study.
In addition to the thesis question, three other questions that were closely related to the thesis
question were investigated. The first question Q1. What are reliable discriminative acoustic-phonetic
features of phonemes for pronunciation errors of /A/, /Y/ and /x/? was examined by carrying out
classification experiments with different feature sets. We saw that for the /A/-/a:/ discrimination
duration was an important feature and thus must be included in the feature set. Adding either
F0 or F3 did not increase performance strongly; therefore from the four measures F0, F1, F2
and F3, F1 and F2 appear to be sufficient, although adding F0 and F3 sometimes did increase
accuracy slightly. Adding dynamic information (thus taking 2 or 3 samples) did not improve
the classification percentages strongly and sometimes even lowered the percentages. We expected
dynamic information to be of importance because studies by e.g. Hillenbrand et al. (1995) showed
that dynamic information indeed played an important role in their vowel classification experiments
(which were carried out under other conditions than ours). For the /Y/-/u,y/ distinction almost
the same was observed: on the whole, no large improvements were observed when F0 or F3, or
dynamic information, were added. Duration carried less discriminative weight in the
/Y/-/u,y/ distinction.
It seems that for /A/-/a:/, a feature set such as [F1 F2 dur/normdur] is sufficient for classification, where duration (either raw or normalized) is a very important feature. For /Y/-/u,y/,
the same set [F1 F2 dur/normdur] is sufficient for classification. In both cases, F0 and F3 can
optionally be added to the feature sets: the addition of these two features sometimes improves
accuracy slightly.
For vowels, the choice of acoustic features is rather limited: it was established long ago
(Peterson & Barney, 1952) that formants are good predictors of vowels. For consonants, the
variety of acoustic features is somewhat larger. One of the acoustic features we used for the
/x/-/k,g/ distinction was the rate of amplitude rise (ROR), which was the main criterion for method I. In
method II, the ROR was supplemented with one energy measurement before the ROR peak, three
energy measurements after the peak, and duration. In the results of the classification experiments
we saw that classification performance was high for almost every feature set. In the correlation
matrices it was observed that ROR, i1 (the energy measurement before the peak), i3 (one of the three
energy measurements after the peak) and duration could potentially be discriminative features.
The results and LDA analyses (such as stepwise LDA) have confirmed this, although it
seems that i1 has more discriminative power than ROR in many cases. To maximize efficiency, a
reliable /x/-/k,g/ classifier can be built with just three simple acoustic features: i1, i3 and raw
or normalized duration, i.e. [i1 i3 dur/normdur]. ROR can optionally be added, since accuracy
sometimes improved with the addition of ROR. One can also opt for method I, a classification tree
that uses no statistics, but simple if . . . > . . . then . . . rules and four simple measures to make the
distinction between /x/ and /k,g/. The classification accuracy of method I is in many cases slightly
lower than that of the LDA method. After weighing the advantages and disadvantages of
both methods, we came to the conclusion that method II is to be preferred to method I.
The second question, Q2. How do the detectors trained under different conditions (trained on native
or non-native speech) cope with non-native speech?, was investigated by training the classifiers with
native and non-native speech and testing them on non-native speech. A small loss of performance
was observed when non-native speech was evaluated by natively trained classifiers, especially when
TRIEST material was tested on classifiers trained with IFA material. This is not surprising for
two reasons: 1) non-native speech is generally less clearly and less accurately pronounced than
native speech, so a natively trained classifier might have difficulty classifying “blurry” non-native
sounds, and 2) the two corpora were recorded under different conditions, which could cause more
acoustic variation and loss of performance.
In general, classification of non-native sounds can be difficult: the results of the vowel
classification experiments show that classification accuracy for non-native speech ranges from
approximately 55% to 83% for /A/-/a:/ and 60%-70% for /Y/-/u,y/, for both natively and
non-natively trained classifiers. In both cases, accuracy decreased (sometimes strongly) when
non-native speech was evaluated by a classifier that was trained with native speech. However, this
decrease in accuracy was not observed in the case of /x/-/k,g/. Possibly, vowel pronunciation is
more affected by and susceptible to non-native influences than consonant pronunciation: this agrees
with one of the findings of the survey on pronunciation errors described in section 3.3.2, where we
found that pronunciation errors more often concerned vowels than consonants.
The third and last question, Q3. What are the advantages of a Linear Discriminant Analysis (LDA)
method as opposed to a decision tree-based method for automatic pronunciation error detection?,
involves a comparison between an LDA classification method (e.g. method II in the /x/-/k,g/
distinction) and a decision tree-based method (e.g. method I in the /x/-/k,g/ distinction). A
comparison between these two methods was already made in section 6.5 (table 6.10), where we found that
using method II was more advantageous than using method I. One of the important advantages of
the LDA method is that one can start the LDA training process with a number of discriminative
features (e.g. [ROR i1 i2 i3 i4 dur]), and by carrying out different LDA analyses, prune away
features that are considered superfluous and do not significantly contribute to the discriminant
model (e.g. [ROR i1 i2 i3 i4 dur] becomes [i1 i3 dur]). A second advantage of the LDA method
is that all kinds of numeric variables can be used for classification. For instance, in the case of
vowels, duration might play an important role. However, it is difficult to formulate duration as a
threshold to fit into the decision tree. Furthermore, to build a decision tree one should already have
a clear idea of what features are discriminative, since one would have to formulate the classification
rules (boundaries, thresholds) oneself, whereas in the case of LDA, during the training phase of the
classifier, these boundaries (the discriminant function) are computed automatically. Consequently,
it takes relatively less effort and time to train and test an LDA classifier than to train and test a
classification tree.
The acoustic-phonetic approach appears to be a promising method for detection of pronunciation
errors in speech of L2-learners. We have shown that with a small number of acoustic-phonetic
features, speech sounds can be automatically separated from each other, in both native and non-native speech, with an acceptable level of accuracy. We have also shown that these acoustic-phonetic
classifiers can be used to detect pronunciation errors. However, there is room for improvement. In
the next section, some suggestions are given on topics for further research.
Further research
The developed classifiers have already been tested in real-time during preliminary tests with both
native and non-native speakers and it was observed that they produced many False Rejections. In
other words, the classifiers need to be improved. I will briefly discuss some ways of improving the
performance and effectiveness of the classifiers that have not been examined in this study and
therefore need further investigation.
One could start improving accuracy by looking at the first layer of the classification process (see
fig. 5.7 and 6.8), on which all further processing is based: the automatic phoneme segmentation.
The segmentation is important because all the information that is extracted from the acoustic
signal, and which forms the basis of the classifier, depends on this time alignment. If the phoneme
segmentation is not correct, then information is extracted from the wrong phoneme. We have
only randomly inspected some of these segmentations to get a very global impression of their
quality; we never adjusted segment boundaries by hand.
Another option is to train more or different mispronounced phonemes. For each pronunciation
error, several phoneme models can be trained as “incorrect” realizations, as opposed to the single
correct phoneme, in the correct-incorrect distinction. For instance, in the /A/-/a:/ distinction we have
only modelled the /a:/ as a mispronounced /A/. Although this was a frequently made error
(based on a survey carried out on a part of the non-native speech material, section 3.3.2), other
mispronunciations of /A/ do occur, like /E/ or /{/ (IPA /æ/). Perhaps instead of making the
distinction between /A/ and /a:/, the distinction can be made between /A/ and /a:, E, {, . . . /. An
additional problem that arises then is that /{/ is not a Dutch sound. Therefore, training material
for that particular sound should be collected from foreign (non-native) speech databases. This was
also a problem for the /x/-/k,g/ distinction, where /g/ is not a common Dutch sound either;
consequently, this classifier was trained with only /k/ as an incorrect sound as opposed to the
correct one (/x/). This could explain the low performance in pronunciation error detection for our
/x/-/k,g/ classifier.
We have now trained the classifiers with only correct realizations; it would be better to train
the classifiers on “real” pronunciation errors made by L2 learners. For this purpose, one would
need much more non-native speech data that not only contains many pronunciation errors, but is
also comparable with the native speech data that is (perhaps) going to be used for training.
Furthermore, for feedback purposes the pronunciation error detector may also have to diagnose
the error: the classifier can be trained to make n-ary classifications instead of binary
classifications. We have tried this type of classifier in the tertiary /Y/-/u/-/y/ distinction, where /Y/
was mispronounced as either /u/ or /y/. Preliminary experiments showed that the /Y/-/u/-/y/
classifier performed almost as well as the /Y/-/u,y/ classifier. But these were preliminary
results, and therefore classifiers that are able to diagnose the error, such as the /Y/-/u/-/y/
classifier, need to be investigated further.
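To make the idea concrete: for LDA, moving from the binary correct/incorrect decision to an n-ary decision is simply a matter of training on more than two labels, after which the predicted label can double as the diagnosis. The sketch below (Python/scikit-learn, not the SPSS setup used in this study) illustrates this for the tertiary /Y/-/u/-/y/ case; the file name, column names and feature values are hypothetical.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training table with one row per vowel token; 'label' is the
# annotated realization: 'Y' (correct), or 'u'/'y' (the two mispronunciations).
train = pd.read_csv("Y_u_y_train.csv")
features = ["F1", "F2", "F0", "F3", "normdur"]

lda = LinearDiscriminantAnalysis().fit(train[features], train["label"])

# For a new token, the predicted class is both the detection and the diagnosis
# on which feedback could be based (all feature values below are made up).
new_token = pd.DataFrame([{"F1": 430.0, "F2": 1650.0, "F0": 210.0, "F3": 2900.0, "normdur": 0.55}])
predicted = lda.predict(new_token[features])[0]
if predicted == "Y":
    print("accepted as /Y/")
else:
    print("flagged: realized as /" + predicted + "/ instead of /Y/")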
More investigation is also needed on the question of how to adapt native models to non-native
speech. We have seen that the performance of the natively trained classifiers in some cases slightly
degrades when non-native speech is used. If one would like the system to work for non-native
speakers without degrading the performance for native speakers, then it might be necessary to
adapt the native models to non-native speech. Perhaps a middle way can be found where “mixed”
models are trained partly on native and partly on non-native speech.
Finally, the more traditional ASR-based methods should be compared to acoustic-phonetic
classifiers, to find out which method performs better in detecting pronunciation errors in speech
of L2 learners: ASR-based methods should be tested with the same speech material that was used
to test the acoustic-phonetic classifiers in this study to make the results comparable.
References
Abercrombie, D. (1991[1956]). Teaching pronunciation. In A. Brown (Ed.), Teaching English
pronunciation. London: Routledge, 87-95.
Anderson-Hsieh, J., Johnson, R. and Koehler, K. (1992). “The relationship between native speaker
judgments of nonnative pronunciation and deviance in segmentals, prosody, and syllable structure”,
Language Learning 42, 529-555.
Arslan, L.M. (1996). Foreign accent classification in American English. Doctoral dissertation, Dept.
of Elec. And Comp. Engineering, Duke University, USA.
Assmann, P.F., Nearey, T.M., and Hogan, J.T. (1982). “Vowel identification: orthographic, perceptual, and acoustic aspects”, Journal of the Acoustical Society of America 71, 975-989.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.),
Speech Perception and Linguistic Experience: Issues in Cross-Language Research, Baltimore: York
Press, 171-204.
Bladon, R.A.V. (1982). Arguments against formants in the auditory representation of speech. In
R. Carlson and B. Granstrom (Eds.), The Representation of Speech in the Peripheral Auditory
System, Amsterdam: Elsevier, 95-102.
Bohn, O.S. and Flege, J.E. (1990). Perception and Production of a New Vowel Category by Adult
Second Language Learners. In J. Leather and A. James, (Eds.), New Sounds 90: Proceedings
of the 1990 Amsterdam Symposium on the Acquisition of Second-Language Speech, Amsterdam:
University of Amsterdam Press, 37-56.
Borden, G., Gerber, A., and Milsark, G. (1983). “Production and Perception of the /r/-/l/ Contrast
in Korean Adults Learning English”, Language Learning 33, 499-526.
Borrell, A. (1990). “Perception et (re)production dans l’apprentissage des langues étrangères.
Quelques réflexions sur les aspects phonético-phonologiques”, Revue de Phonétique Appliquée
95-96-97, 107-114.
Brière, E. (1966). “An Investigation of Phonological Interference”, Language 42, 769-796.
Browman, C.P. and Goldstein, L. (1989). “Articulatory gestures as phonological units”, Phonology
6, 201-251.
Cucchiarini, C., Strik, H., Boves, L. (2000). “Quantitative assessment of second language learners’
fluency by means of automatic speech recognition technology”, Journal of the Acoustical Society
of America 107, 989-999.
De Graaf, T. (1986). “De uitspraak van het Nederlands door buitenlanders”, Logopedie en Foniatrie
58, 343-347.
Den Os, E.A., Boogaart, T.I., Boves, L. and Klabbers, E. (1995). “The Dutch Polyphone corpus”,
Proceedings of Eurospeech ’95, Madrid, Spain, 825-828.
Deroo, O., Ris, C., Gielen, S. and Vanparys, J. (2000). “Automatic detection of mispronounced
phonemes for language learning tools”, Proceedings of the 6th International Conference on Spoken
Language Processing (ICSLP) 2000, Beijing, China, 681-684.
Derwing, T.M. and Munro, M.J. (1997). “Accent, intelligibility, and comprehensibility”, Studies in
Second Language Acquisition 20, 1-16.
Dorman, M.F., Raphael, L.J., and Isenberg, D. (1980). “Acoustic cues for a fricative-affricate
contrast in word-final position”, Journal of Phonetics 8, 397-405.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Flege, J.E. (1987). Effects of equivalence classification on the production of foreign language
speech sounds. In A. James and J. Leather (eds.), Sound patterns in second language acquisition,
Dordrecht: Foris, 9-39.
Flege, J.E. (1993). “Production and perception of a novel, second-language phonetic contrast”,
Journal of the Acoustical Society of America 93, 1589-1608.
Flege, J.E., Munro, M., and Mackay, I. (1995). “Factors affecting degree of perceived foreign accent
in a second language”, Journal of the Acoustical Society of America 97, 3125-3134.
Flege, J.E. (1995). Second Language Speech Learning, Theory, Findings, and Problems, in Strange,
W. (ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues.,
Timonium, MD: York Press, 233-273.
Franco, H., Neumeyer, L., Kim, Y., and Ronen, O. (1997). “Automatic Pronunciation Scoring for
Language Instruction”, Proceedings of the International Conference on Acoustics, Speech, and
Signal Processing (ICASSP) ’97, Munich, Germany, 1471-1474.
Franco, H., Neumeyer, L., Ramos, M., and Bratt, H. (1999). “Automatic Detection of Phone-Level
Mispronunciation for Language Learning”, Proceedings Eurospeech ’99, Budapest, Hungary,
851-854.
Franco, H., Neumeyer, L., Digilakalis, V. and Ronen, O. (2000). “Combination of machine scores
for automatic grading of pronunciation quality”, Speech Communication 30, 121-130.
Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., and Butzberger, J. (2000). “The
SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring for Language Learning”,
Proceedings of InSTILL 2000, Dundee, Scotland, 123-128.
Friel, C.M. “Notes on discriminant analysis”, Sam Houston State University, available online here
(http://www.shsu.edu/~icc_cmf/newCJ742/7DISCRIMINANTANALYSISnew.doc ), last consulted
24/05/04.
Germain-Rutherford, A., and Martin, P. (2000). “Presentation d’un logiciel de visualization pour
l’apprentissage de l’oral en langue seconde”, ALSIC 3, 61-76.
Hillenbrand, J.M. and Gayvert, R.T. (1993). “Vowel classification based upon fundamental frequency
and formant frequencies”, Journal of Speech and Hearing Research 36, 694-700.
Hillenbrand, J.M., Getty, L.A., Clark, M.J., and Wheeler, K. (1995). “Acoustic characteristics of
American English vowels”, Journal of the Acoustical Society of America 97, 3099-3111.
Hiller, S., Rooney, E., Vaughan, R., Eckert, M., Laver, J. and Jack, M. (1994). “An automated
system for computer-aided pronunciation learning”, Computer Assisted Language Learning 7,
51-63.
Holden, K. and Hogan, J. (1993). “The emotive impact of foreign intonation: an experiment in
switching English and Russian intonation”, Language and Speech 36, 67-88.
Jongman, A., Wayland, R., and Wong, S. (2000). “Acoustic characteristics of English fricatives”,
Journal of the Acoustical Society of America 108, 1252-1263.
Kent, R.D. and Read, C. (1992). The Acoustic Analysis of Speech. San Diego: Singular Publishing
Group, Inc.
Kim, Y., Franco, H. and Neumeyer, L. (1997). “Automatic pronunciation scoring of specific phone
segments for language instruction”, Proceedings Eurospeech ’97, Rhodes, Greece, 645-648.
Koopmans-van Beinum, F.J. (1980). Vowel contrast reduction. An acoustical and perceptual study
of Dutch vowels in various speech conditions. Doctoral dissertation, University of Amsterdam.
Lehiste, I. (1988). Lectures on Language Contact. Cambridge, MA:MIT Press.
Llisteri, J. (1995). “Relationships between speech production and speech perception in a second
language”, Proceedings of the International Conference of Phonetic Sciences, Stockholm, Sweden,
92-99.
Mak, B., Siu, M.H., Ng, M., Tam, Y.C., Chan, Y.C., Chan, K.W., Leung, K.Y., Ho, S., Chong, F.H.,
Wong, J., Lo, J. (2003). “PLASER: Pronunciation Learning via Automatic Speech Recognition”,
Proceedings of HLT-NAACL 2003, Edmonton, Canada, 23-29.
Menzel, W., Herron, D., Bonaventura, P., and Morton, R. (2000). “Automatic detection and
correction of non-native English pronunciations”, Proceedings of InSTILL 2000, Dundee, Scotland,
49-56.
McLaughlin, B. (1977). “Second language learning in childhood”, Psychological Bulletin 84, 438-459.
Munro, M.J. and Derwing, T.M. (1995). “Foreign accent, Comprehensibility, and Intelligibility in
the Speech of Second Language Learners”, Language Learning 45, 74-97.
Nearey, T.M. and Assmann, P.F. (1986). “Modeling the role of inherent spectral change in vowel
identification”, Journal of the Acoustical Society of America 80, 1297-1308.
Neri, A., Cucchiarini, C. and Strik, H. (2002). “The pedagogy-technology interface in Computer
Assisted Pronunciation Training”, Computer Assisted Language Learning 15, 441-447.
Neri, A., Cucchiarini, C. and Strik, H. (2004). “Segmental errors in Dutch as a second language:
how to establish priorities for CAPT”, to appear in Proceedings of InSTILL 2004.
Neufeld, G.G. (1988). “Phonological asymmetry in second language learning and performance”,
Language Learning 38, 531-559.
Neumeyer, L., Franco, H., Weintraub, M., and Price, P. (1996). “Automatic text-independent
pronunciation scoring of foreign language student speech”, Proceedings of the International
Conference on Spoken Language Processing (ICSLP) 1996, Philadelphia, USA, 1457-1460.
Neumeyer, L., Franco, H., Digilakis, V. and Weintraub, M. (2000). “Automatic scoring of pronunciation quality”, Speech Communication 30, 83-93.
Nooteboom, S.G. (1972). Production and perception of vowel duration: a study of durational
properties of vowels in Dutch. Doctoral dissertation, Utrecht University.
Nooteboom, S.G. and Cohen, A. (1995). Spreken en verstaan. Assen: Van Gorcum.
Nouza, J. (1998). “Training speech through visual feedback patterns”, Proceedings of the International Conference of Spoken Language Processing ’98, Sydney, Australia.
O’Grady, W., Dobrovolsky, M. and Katamba, F. (1996). Contemporary Linguistics, An Introduction.
London: Addison Wesley Longman.
Peterson, G.E., and Barney, H.L. (1952). “Control methods used in a study of the vowels”, Journal
of the Acoustical Society of America 24, 175-184.
Praat, http://www.praat.org
Rietveld, A.C.M. and Van Heuven, V.J.J.P.M. (2001). Algemene Fonetiek. Bussum: Coutinho.
Ronen, O., Neumeyer, L. and Franco, H. (1997). “Automatic detection of mispronunciation for
language instruction”, Proceedings Eurospeech 97, Rhodes, Greece, 649-652.
Scovel, T. (1998). A Time to Speak. A Psycholinguistic Inquiry into the Critical Period for Human
Speech. Rowley, Mass.: Newbury House.
Sheldon, A., and Strange, W. (1982). “The acquisition of /r/ and /l/ by Japanese learners of English:
Evidence that speech production can precede speech perception”, Applied Psycholinguistics 3,
243-261.
Stevens, K.N. and Blumstein, S.E. (1978). “Invariant cues for place of articulation in stop consonants”, Journal of the Acoustical Society of America 64, 1358-1368.
Stevens, K.N. (1980). “Acoustic correlates of some phonetic categories”, Journal of the Acoustical
Society of America 68, 836-842.
Van Bael, C.P.J., Strik, H. and Van den Heuvel, H. (2003). “Application-oriented validation
of phonetic transcriptions: preliminary results”, Proceedings of 15th ICPhS, Barcelona, Spain,
1161-1164.
Van Heuven, V.J.J.P.M., Kruyt, J.G. and De Vries, J.W. (1981). “Buitenlandsheid en begrijpelijkheid in het Nederlands van buitenlandse arbeiders: een verkennende studie”, Forum der Letteren
22, 171-178.
Van Son, R.J.J.H., Binnenpoorte, D., Van den Heuvel, H. and Pols, L.C.W. (2001). “The IFA corpus:
a phonemically segmented Dutch Open Source speech database”, Proceedings of EUROSPEECH
2001, Aalborg, Denmark, 2051-2054.
Weigelt, L.F., Sadoff, S.J. and Miller, J.D. (1990). “The plosive/fricative distinction: The voiceless
case”, Journal of the Acoustical Society of America 87, 2729-2737.
Weinreich, U. (1953). Languages in Contact, Findings and Problems. The Hague: Mouton.
Weinstein, C.J., McCandless, S.S., Mondshein, L.F., and Zue, V.W. (1975). “A system for
acoustic-phonetic analysis of continuous speech”, IEEE Trans. Acoust. Speech Signal Process. 23,
54-67.
Witt, S.M. (1999). Use of Speech Recognition in Computer-Assisted Language Learning. Doctoral
dissertation, University of Cambridge.
Witt, S.M. and Young, S.J. (2000). “Phone-level Pronunciation Scoring and Assessment for
Interactive Language Learning”, Speech Communication 30, 95-108.
Young, D.J. (1990). “An investigation of students’ perspectives on anxiety and speaking”, Foreign
Language Annals 23, 539-553.
Zahorian, S.A. and Jagharghi, A.J. (1993). “Spectral-shape features versus formants as acoustic
correlates for vowels”, Journal of the Acoustical Society of America 94, 1966-1982.
Appendix A
List of abbreviations
These are the abbreviations used in this work.
CALL         Computer Aided Language Learning
CAPT         Computer Aided Pronunciation Training
DL2N1        speech database Dutch as L2, Nijmegen corpus 1
DL2N1-Nat    native part of corpus DL2N1
DL2N1-NN     non-native part of corpus DL2N1
IFA          native speech database of the Institute of Phonetic Sciences of Amsterdam
L1           First Language, mother tongue
L2           Second Language
LDA          Linear Discriminant Analysis
ROR          Rate Of Rise: an acoustic-phonetic feature that is used for distinguishing /x/ from /k/
RMS          Root Mean Square: a mean amplitude measure
TRIEST       non-native speech database which contains speech of speakers from the University of Triest
Appendix B
List of phonetic symbols
The list is shown below.
IPA    SAMPA    Example in Dutch word
p      p        pak
b      b        bak
t      t        tak
d      d        dak
k      k        kap
ɡ      g        goal
c      c        matje
f      f        fel
s      s        sok
v      v        vel
z      z        zak
x      x        toch
ʃ      S        sjaal
m      m        man
n      n        non
ŋ      N        bang
l      l        lam
ʀ      R        rand
r      r        rand
w      w        ruw
j      j        ja
h      h        hond
IPA    SAMPA    Example in Dutch word
ɪ      I        pit
ɛ      E        pet
ɑ      A        pat
ɔ      O        pot
ʏ      Y        put
ə      @        gedoe
i      i        piet
y      y        fuut
e      e:       veel
a      a:       paal
ø      2:       beuk
o      o:       boot
u      u        voet
æ      {        bat (English word)
Appendix C
Scripts
Scripts for /A/ and /Y/
I will only show the most important parts of the script, which show how the features were extracted
and how the parameter values were set. For a detailed description of the parameters, see the manual
of Praat (http://www.praat.org).
(...)
% If the script encounters a particular phoneme, then perform the acoustic analysis
for b from 1 to number_of_intervals
interval_label$ = Get label of interval... 1 ’b’
if ((interval_label$ = "A") or (interval_label$ = "a:")
... or (interval_label$ = "Y") or (interval_label$ = "u") or (interval_label$ = "y"))
% Calculate duration and points at 25%, 50% and 75% of vowel duration for
% acoustic measurements
begin_vowel = Get starting point... 1 ’b’
end_vowel = Get end point... 1 ’b’
duration = end_vowel - begin_vowel
midpoint = begin_vowel + ((duration) / 2)
point_25 = begin_vowel + ((duration)/4)
point_75 = end_vowel - ((duration)/4)
% Perform the formant analysis (Burg method): a short-term spectral analysis.
% Extracts 4 formants per frame and searches for formants up to 4000 Hz
% => telephone speech, e.g. DL2N1 corpus
To Formant (burg)... 0.001 4 4000 0.025 50
% In broadband speech (e.g. TRIEST corpus), formants are searched for up to 5000 Hz for
% male speech, and up to 5500 Hz for female speech
if gender$ = "f"
To Formant (burg)... 0.001 5 5500 0.025 50
else
To Formant (burg)... 0.001 5 5000 0.025 50
endif
% The formant values of the IFA corpus were already provided by
% the University of Amsterdam.
% We used their values, which were computed in the following way (no distinction
% between male and female speech, scripts provided by Rob van Son):
To Formant (burg)... 0.001 5 5500 0.025 50
% Get the formants
f_one = Get value at time... 1 ’midpoint’ Hertz Linear
f_two = Get value at time... 2 ’midpoint’ Hertz Linear
f_three = Get value at time... 3 ’midpoint’ Hertz Linear
(... etc.)
% Perform the pitch analysis: candidates below 75 Hz will not be recruited, and
% candidates above 600 Hz will be ignored
To Pitch... 0.0025 75 600
Kill octave jumps
Interpolate
% Get pitch
f_zero = Get value at time... ’midpoint’ Hertz Linear
% If pitch is undefined at 'midpoint', shift the measurement point 5% of the vowel
% duration away from the 'midpoint'
percent5 = duration/20
timevalue = midpoint
while ((f_zero = undefined) and (timevalue<end_vowel))
timevalue = timevalue + percent5
f_zero = Get value at time... ’timevalue’ Hertz Linear
endwhile
% If pitch is still undefined, then calculate the mean pitch over the segment
if (f_zero = undefined)
f_zero = Get mean... ’begin_vowel’ ’end_vowel’ Hertz
endif
% All information is written to a single log-file that logs all the
% extracted information.
% This file can be used for training and testing the LDA-classifier.
(...)
endif
endfor
Scripts for /x/
I will only show the most important parts of the script, which show how the features were extracted
and how the parameter values were set. For a detailed description of the parameters, see the Praat
manual (http://www.praat.org).
for b from 1 to number_of_intervals
interval_label$ = Get label of interval... 1 ’b’
if ((interval_label$ = "x") or (interval_label$ = "k"))
begin_segm = Get starting point... 1 ’b’
end_segm = Get end point... 1 ’b’
duration = (end_segm - begin_segm) * 1000
% Create an object with zero-crossings, before pre-emphasis
To PointProcess (zeroes)... yes yes
dt = 0.001
old_rms = 0
Pre-emphasize (in-line)... 50
% The analysis window of 0.024 seconds is shifted every 0.001 seconds
% over the acoustic signal.
% Calculate the Root-Mean-Square and ROR for each window (frame),
% and calculate the zero-crossing rate for each window.
while (begin_segm < end_segm)
endwindow = begin_segm + 0.024
rms = Get root-mean-square... ’begin_segm’ ’endwindow’
logrms = 20 * log10 (rms/0.00002)
ror = (logrms - old_rms)/dt
old_rms = logrms
begin_segm = begin_segm + 0.001
select PointProcess ’object_name$’
first = Get nearest index... ’begin_segm’
second = Get nearest index... ’endwindow’
zerocross = second - first
zerocrossrate = zerocross/0.024
framenr = framenr + 1
% Write this information for each phoneme to a file that contains four columns
% of information: number of the frame, ROR of this frame, log RMS of this frame,
% and zero-crossing rate of this frame. Duration can now be expressed as the
% number of frames.
endwhile
endfor
% Now you have a whole bunch of unique files containing the extracted information
% for each specified phoneme.
% What we now need from each file for the LDA-classifier is the highest ROR
% value (ROR peak), i1 (amplitude 5 ms before the peak), i2 (5 ms after the peak),
% i3 (10 ms after the peak) and i4 (20 ms after the peak).
% Duration is the number of frames and normalized duration is computed outside
% this script.
for j from 1 to number_files % For each file we do:
% Get maximum ROR and remember which frame number corresponds to the maximum ROR
nframes = Get number of rows
for y from 1 to nframes
value = Get value... ’y’ 2 % second column is ROR
if (value>=max)
max = value
max_place = y
endif
endfor
% Get all the amplitude measurements around the peak
if (max_place - 5 <= 0)
i1 = Get value... 1 3
else
i1 = Get value... ’max_place’-5 3
endif
if (max_place + 5 > nframes)
i2 = Get value... ’nframes’ 3
else
i2 = Get value... ’max_place’+5 3
endif
if (max_place + 10 > nframes)
i3 = Get value... ’nframes’ 3
else
i3 = Get value... ’max_place’+10 3
endif
if (max_place + 20 > nframes)
i4 = Get value... ’nframes’ 3
else
i4 = Get value... ’max_place’+20 3
endif
endfor
% Write this information away in a single file where each line contains:
% the phoneme (x or k) with its highest ROR value, the four amplitude measurements
% and duration. This file can be used for training and testing the
% LDA-classifier (method II).
% The bunch of files to which I referred earlier is also used in method I,
% i.e. the algorithm by Weigelt et al. (1990). This algorithm was rewritten
% into a decision tree in Perl. Three criteria and a ROR threshold were formulated:
% criterion 1
if ($min < ($var1 * $logrms[$i]))
{
$ans1 = 0;}
else
{
$ans1 = 1;}
% where $min is the lowest value of E for the following 49ms after the peak
% $logrms[$i] is the value of E at the peak and $var1 is originally 1 but is
% now one of the parameters that can be tuned
% criterion 2
if ($max >= $var2 + $logrms[$i])
{
$ans2 = 1;}
else
{
$ans2 = 0;}
% where $max is the maximum value for the 49ms following the peak, $var2
% is originally 12 but can now be tuned
% criterion 3
if ($max2 > $var3)
{
$ans3 = 1;}
else
{
$ans3 = 0;}
% where $max2 is the maximum zero-crossing rate over the 49ms period after the peak
% $var3 is originally 2000 but can now be varied
% If any of the 3 criteria fails, then it is a fricative.
% If the case passes all 3 criteria and its ROR peak is above the ROR
% cutoff ($RORcutoff, which can be varied) then it is a plosive.
if ($ans1 and $ans2 and $ans3 and $H{$i}>$RORcutoff)
{
print "It is a plosive\n";}
else
{
print "It is a fricative\n";}
Appendix D
Sentences
Sentences used in DL2N1
• Vitrage is heel ouderwets en past niet bij een modern interieur.
• De Nederlandse gulden is al lang even hard als de Duitse mark.
• Een bekertje warme chocolademelk moet je wel lusten.
• Door jouw gezeur zijn we nu al meer dan een uur te laat voor die afspraak.
• Met een flinke garage erbij moet je genoeg opbergruimte hebben.
• Een foutje van de stuurman heeft het schip doen kapseizen.
• Gelokt door een stukje kaas liep het muisje keurig in de val.
• Het ziet er naar uit dat het deze week bij ons opnieuw gaat regenen.
• Na die grote lekkage was het dure behang aan vervanging toe.
• Geduldig hou ik de deur voor je open.
Sentences used in TRIEST
• Ik wou al om half drie hier zijn om alles in de etalage te zetten.
• De voetballer belooft zijn contractuele verplichtingen na te komen.
• De juffrouw rust een middagje uit en doet een dutje.
• De chauffeur tracht met wilde bewegingen de kuilen in de weg te omzeilen.
• De huiseigenaar kwam aan de deur om de huur op te halen.
• Vitrage is heel ouderwets en past niet bij een modern interieur.
• De Nederlandse gulden is al lang even hard als de Duitse mark.
• Een bekertje warme chocolademelk moet je wel lusten.
• Door jouw gezeur zijn we nu al meer dan een uur te laat voor die afspraak.
• Met een flinke garage erbij moet je genoeg opbergruimte hebben.
Appendix E
Amount of speech data
Numbers of all tokens used in this study
                    Male                  Female
              /A/     /a:/          /A/     /a:/
DL2N1-Nat     146      93           227     143
DL2N1-NN      265     229           372     321
IFA           283     187           432     280
TRIEST         81      53           287     181

Table E.1: All tokens of /A/ and /a:/
                        Male                          Female
              /Y/     /u/     /y/           /Y/     /u/     /y/
DL2N1-Nat      31      40      32            48      60      48
DL2N1-NN       70      96      76            67     119      75
IFA            81     191      44           124     273      73
TRIEST         28      23      30           103      79      57

Table E.2: All tokens of /Y/, /u/, and /y/
                    Male                  Female
              /x/     /k/           /x/     /k/
DL2N1-Nat     112     119           169     168
DL2N1-NN      155     162           260     249
IFA           284     241           444     360
TRIEST         60      60           198     208

Table E.3: All tokens of /x/ and /k/
Number of tokens of /A/ and /a:/ used in A- and B-experiments:
training and test.
All data are now divided into a training set and a test set. Maximum Chance Criterion (MCC) and Proportional Chance Criterion (Cpro) are also shown.
An example of how MCC and Cpro were calculated:

                      Male
            Training        Test
            /A/   /a:/    /A/   /a:/    MCC      Cpro
DL2N1-Nat   110    70      36    23     61.0%    52.4%

MCC is calculated as: [ 36/(36+23) ] × 100 = 61.0%,
Cpro is calculated as: [ 36/(36+23) ]² + [ 1 - 36/(36+23) ]² = 52.4%.
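For reference, the same computation can be expressed as a small Perl sketch (Perl being the language
already used for the decision tree in Appendix C). The variable names are chosen freely here; MCC is
taken as the proportion of the larger class in the test set, in line with the worked example above.

#!/usr/bin/perl
use strict;
use warnings;

# Test-set counts of the two classes, here /A/ and /a:/ for the male
# DL2N1-Nat data (36 and 23 tokens, as in the example above).
my ($n_class1, $n_class2) = (36, 23);

my $p = $n_class1 / ($n_class1 + $n_class2);

# Maximum Chance Criterion: proportion of the larger class.
my $mcc = 100 * ($p >= 0.5 ? $p : 1 - $p);

# Proportional Chance Criterion: p^2 + (1 - p)^2.
my $cpro = 100 * ($p**2 + (1 - $p)**2);

printf "MCC = %.1f%%, Cpro = %.1f%%\n", $mcc, $cpro;   # MCC = 61.0%, Cpro = 52.4%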
Male          Training        Test
              /A/   /a:/    /A/   /a:/    MCC      Cpro
DL2N1-Nat     110    70      36    23     61.0%    52.4%
DL2N1-NN      199   172      66    57     53.7%    50.3%
IFA           212   140      71    47     60.2%    52.1%
TRIEST         61    40      20    13     60.6%    52.2%

Female        Training        Test
              /A/   /a:/    /A/   /a:/    MCC      Cpro
DL2N1-Nat     170   107      57    36     61.3%    52.6%
DL2N1-NN      279   241      93    80     53.5%    50.3%
IFA           324   210     108    70     60.7%    52.3%
TRIEST        215   136      72    45     61.5%    52.7%

Table E.4: A-experiments /A/-/a:/
Male          Training        Test
              /A/   /a:/    /A/   /a:/    MCC      Cpro
B.1           110    70      66    57     53.7%    50.3%
B.2           212   140      20    13     60.6%    52.2%

Female        Training        Test
              /A/   /a:/    /A/   /a:/    MCC      Cpro
B.1           170   107      93    80     53.8%    50.3%
B.2           324   210      72    45     61.5%    52.7%

Table E.5: B-experiments /A/-/a:/
Number of tokens of /Y/, /u/, and /y/ used in A- and B-experiments: training and test.
All data are now divided into a training set and a test set. Maximum Chance Criterion (MCC) and Proportional Chance Criterion (Cpro) are also shown.
                  Training               Test
              /Y/   /u/   /y/       /Y/   /u/   /y/     MCC      Cpro
DL2N1-Nat      23    30    24         8    10     8     69.2%    57.4%
DL2N1-NN       53    70    57        17    26    17     72.6%    60.2%
IFA            61   143    33        20    48    11     74.7%    62.2%
TRIEST         21    17    23         7     6     7     65.0%    54.5%

Table E.6: Male A-experiments /Y/-/u,y/
                  Training               Test
              /Y/   /u/   /y/       /Y/   /u/   /y/     MCC      Cpro
DL2N1-Nat      36    45    36        12    15    12     69.2%    57.4%
DL2N1-NN       50    89    56        17    30    19     74.2%    61.8%
IFA            93   205    55        31    68    18     73.5%    61.0%
TRIEST         77    59    43        26    20    14     56.7%    50.0%

Table E.7: Female A-experiments /Y/-/u,y/
          Training               Test
      /Y/   /u/   /y/       /Y/   /u/   /y/     MCC      Cpro
B.1    23    30    24        17    26    17     72.6%    60.2%
B.2    61   143    33         7     6     7     65.0%    54.5%

Table E.8: Male B-experiments /Y/-/u,y/
          Training               Test
      /Y/   /u/   /y/       /Y/   /u/   /y/     MCC      Cpro
B.1    36    45    36        17    30    19     74.2%    61.8%
B.2    93   205    55        26    20    14     56.7%    50.0%

Table E.9: Female B-experiments /Y/-/u,y/
Number of tokens of /x/ and /k/ used in A- and B-experiments:
training and test.
All data are now divided into a training set and a test set. Maximum Chance Criterion (MCC) and Proportional Chance Criterion (Cpro) are also shown.
Male          Training        Test
              /x/   /k/     /x/   /k/     MCC      Cpro
DL2N1-Nat      84    89      28    30     51.7%    50.1%
DL2N1-NN      116   122      39    40     50.6%    50.0%
IFA           213   181      71    60     54.2%    50.4%
TRIEST         45    45      15    15     50.0%    50.0%

Female        Training        Test
              /x/   /k/     /x/   /k/     MCC      Cpro
DL2N1-Nat     127   126      42    42     50.0%    50.0%
DL2N1-NN      195   187      65    62     51.2%    50.0%
IFA           333   270     111    90     55.2%    50.5%
TRIEST        149   156      49    52     51.5%    50.0%

Table E.10: A-experiments /x/-/k/.
Male      Training        Test
          /x/   /k/     /x/   /k/     MCC      Cpro
B.1        84    89      39    40     50.6%    50.0%
B.2       213   181      15    15     50.0%    50.0%

Female    Training        Test
          /x/   /k/     /x/   /k/     MCC      Cpro
B.1       127   126      65    62     51.2%    50.0%
B.2       333   270      49    52     51.5%    50.0%

Table E.11: B-experiments /x/-/k/
Number of mispronunciations of /A/, /Y/, and /x/ used in C-experiments.
         /A/ as /a:/             /Y/ as /u,y/            /x/ as /k,g/
      DL2N1-NN   TRIEST       DL2N1-NN   TRIEST       DL2N1-NN   TRIEST
       M    F     M    F       M    F     M    F       M    F     M    F
      31   56    12   35      10   37    12   30       2    9     0   12

Table E.12: Pronunciation errors
Appendix F
Tables with classification scores
Classification scores /A/-/a:/
/A/ vs /a:/ Exp. A.1 Training & Test = DL2N1-Nat
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
78.0
81.4 79.7
79.7
81.4 79.7
78.0
81.4 79.7
78.0
81.4 83.1
Male
2 samples
nodur dur normdur
67.8
78.0 79.7
66.1
78.0 79.7
69.5
78.0 79.7
67.8
78.0 79.7
3 samples
nodur dur normdur
71.2
79.7 76.3
71.2
79.7 78.0
74.6
79.7 84.8
74.6
79.7 84.8
1 sample
dur normdur
95.7 93.6
95.7 93.6
92.6 90.4
92.6 90.4
Female
2 samples
nodur dur normdur
75.5
91.5 88.3
74.5
91.5 88.3
79.8
88.3 90.4
79.8
88.3 90.4
nodur
77.7
77.7
76.6
76.6
nodur
77.7
77.7
77.7
77.7
3 samples
dur normdur
90.4 90.4
90.4 90.4
91.5 89.4
91.5 89.4
/A/ vs /a:/ Exp. A.2 Training & Test = DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
56.8
65.6 67.2
58.4
69.6 68.0
58.4
66.4 68.0
58.4
68.0 68.0
Male
2 samples
nodur dur normdur
59.2
66.4 65.6
62.4
66.4 69.6
64.0
64.8 66.4
66.4
72.0 68.0
3 samples
nodur dur normdur
61.6
67.2 68.0
63.2
69.6 67.2
62.4
67.2 64.0
63.2
70.4 64.8
1 sample
dur normdur
60.3 63.8
60.3 63.2
63.2 65.5
64.9 66.7
Female
2 samples
nodur dur normdur
56.9
60.3 63.2
58.1
60.3 66.1
58.6
62.1 64.9
57.5
61.5 65.5
nodur
59.8
60.9
59.8
60.3
nodur
60.9
60.9
59.2
62.6
3 samples
dur normdur
60.3 63.8
61.5 67.2
62.6 64.4
61.5 66.1
/A/ vs /a:/ Exp. A.3 Training & Test = IFA
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
56.8
64.4
55.9
62.7
1 sample
dur normdur
78.8 78.8
77.1 76.3
73.7 71.2
72.9 77.1
Male
2 samples
nodur dur normdur
62.7
78.0 78.8
71.2
78.0 78.0
67.0
75.4 77.1
68.6
77.1 76.3
nodur
66.1
69.5
67.8
68.6
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
83.7
88.2 88.8
83.7
88.2 88.8
89.3
92.7 92.7
92.1
94.4 94.4
Female
2 samples
nodur dur normdur
83.2
88.8 88.8
85.4
88.8 88.8
88.2
90.5 89.9
87.6
92.1 92.1
3 samples
nodur dur normdur
86.5
89.3 89.3
85.4
89.9 90.5
89.3
93.8 93.8
93.3
93.8 94.9
3 samples
dur normdur
78.0 80.5
78.8 80.5
75.4 77.1
77.1 78.0
/A/ vs /a:/ Exp. A.4 Training & Test = TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
71.4
74.3 74.3
65.7
74.3 71.4
74.3
82.7 85.7
74.3
80.0 82.9
Male
2 samples
nodur dur normdur
65.7
71.4 71.4
68.6
71.4 77.1
77.1
77.1 77.1
74.3
77.1 77.1
3 samples
nodur dur normdur
74.3
74.3 74.3
74.3
80.0 80.0
77.1
82.9 82.9
74.3
82.9 82.9
1 sample
dur normdur
84.8 80.5
85.6 83.1
85.5 83.1
84.8 82.2
Female
2 samples
nodur dur normdur
65.3
78.8 77.1
63.6
78.8 75.4
67.8
78.0 76.3
65.3
77.1 76.3
nodur
71.2
67.0
66.1
66.1
nodur
64.4
61.9
65.3
63.6
3 samples
dur normdur
78.8 77.1
77.1 75.4
78.0 75.4
78.0 76.3
/A/ vs /a:/ Exp. B.1 Training = DL2N1-Nat & Test = DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
59.9
59.5
59.9
60.5
1 sample
dur normdur
63.4 68.6
63.8 68.6
62.2 68.6
63.0 68.6
Male
2 samples
nodur dur normdur
60.9
64.8 69.0
61.3
64.8 69.2
60.7
65.0 68.2
61.3
64.8 68.2
nodur
59.3
60.1
58.5
59.5
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
59.3
63.8 66.2
59.5
63.1 67.0
61.6
64.7 68.3
61.0
63.9 68.1
Female
2 samples
nodur dur normdur
60.8
63.6 66.7
59.7
63.6 66.8
60.3
63.9 66.1
60.6
64.9 65.8
3 samples
nodur dur normdur
59.5
64.5 67.2
59.0
64.1 66.5
60.6
64.7 67.0
60.8
64.5 66.7
3 samples
dur normdur
63.0 69.4
62.8 68.8
62.4 67.8
61.3 67.8
/A/ vs /a:/ Exp. B.2 Training = IFA & Test = TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
59.0
72.4 70.2
63.4
71.6 70.2
61.2
74.6 73.9
61.2
69.4 69.4
Male
2 samples
nodur dur normdur
56.7
69.4 67.2
63.4
69.4 68.7
58.2
62.7 60.5
56.7
65.7 64.2
3 samples
nodur dur normdur
58.2
68.7 67.2
60.5
68.7 67.9
57.5
61.2 59.7
55.2
66.4 64.2
1 sample
dur normdur
54.5 53.6
56.4 55.8
51.7 50.4
53.4 52.6
Female
2 samples
nodur dur normdur
47.0
49.6 49.8
47.7
49.6 50.9
45.7
47.7 47.4
47.0
50.4 49.8
nodur
47.2
49.4
45.7
48.3
nodur
48.5
51.3
46.5
49.6
3 samples
dur normdur
51.7 51.7
54.7 54.1
50.2 49.2
51.9 51.1
/A/ vs /a:/ Exp. C.1 Training = DL2N1-Nat & Test = mispronounced /A/ DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
74.2
80.7
74.2
83.9
1 sample
dur normdur
64.5 58.1
67.7 58.1
67.7 58.1
67.7 58.1
Male
2 samples
nodur dur normdur
70.8
58.1 48.4
71.0
58.1 48.4
67.7
58.1 45.2
67.7
58.1 45.2
nodur
83.9
83.9
77.4
77.4
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
64.3
64.3 44.6
64.3
64.3 44.6
66.1
73.2 60.7
66.1
73.2 62.5
Female
2 samples
nodur dur normdur
69.6
71.4 62.5
69.6
71.4 62.5
66.1
76.8 64.3
66.1
76.8 66.1
3 samples
nodur dur
normdur
67.9
69.6 60.7
67.9
73.2 60.7
69.6
80.4 69.6
69.6
80.4 69.6
3 samples
dur
normdur
67.7 54.8
67.7 54.8
64.5 48.4
64.5 51.6
/A/ vs /a:/ Exp. C.2 Training = DL2N1-NN & Test = mispronounced /A/ DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
64.5
61.3 61.3
64.5
61.3 58.1
64.5
61.3 61.3
61.3
61.3 58.1
Male
2 samples
nodur dur normdur
58.1
58.1 58.1
58.1
58.1 61.3
61.3
61.3 71.0
61.3
67.7 71.0
3 samples
nodur dur normdur
61.3
64.5 64.5
64.5
67.7 64.5
61.3
71.0 67.7
64.5
71.0 71.0
1 sample
dur normdur
39.3 42.9
37.5 39.3
48.2 48.2
44.6 51.8
Female
2 samples
nodur dur normdur
51.8
42.9 44.6
51.8
42.9 44.6
57.1
48.2 48.2
58.9
46.4 46.4
nodur
51.8
50.0
57.1
57.1
nodur
53.6
51.8
57.1
55.4
3 samples
dur normdur
41.1 42.9
41.1 41.1
50.0 50.0
50.0 48.2
/A/ vs /a:/ Exp. C.3 Training = IFA & Test = mispronounced /A/ TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
50.0
58.5
66.7
50.0
1 sample
dur normdur
50.0 58.3
58.3 50.0
75.0 66.7
66.7 75.0
Male
2 samples
nodur dur normdur
66.7
83.3 66.7
66.7
83.3 66.7
75.0
83.3 75.0
75.0
83.3 75.0
nodur
75.0
75.0
75.0
83.3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
94.3
88.6 88.6
85.7
82.9 80.0
88.6
88.6 85.7
82.9
74.3 77.1
Female
2 samples
nodur dur normdur
94.3
91.4 91.4
88.6
91.4 88.6
91.4
88.6 88.6
91.4
85.7 85.7
3 samples
nodur dur normdur
91.4
91.4 94.3
91.4
82.9 88.6
91.4
88.6 88.6
88.6
82.9 82.9
3 samples
dur normdur
75.0 66.7
66.7 75.0
83.3 75.0
83.3 75.0
/A/ vs /a:/ Exp. C.4 Training = TRIEST & Test = mispronounced /A/ TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
25.0
33.3 33.3
16.7
33.3 33.3
50.0
50.0 50.0
41.7
50.0 50.0
Male
2 samples
nodur dur normdur
8.3
0
0
8.3
0
0
25.0
8.3
0
25.0
8.3
0
3 samples
nodur dur normdur
8.3
8.3
8.3
8.3
8.3
8.3
33.3
25.0 25.0
33.3
16.7 25.0
1 sample
dur normdur
8.6
8.6
8.6
5.7
14.3 8.6
11.4 8.6
Female
2 samples
nodur dur normdur
17.1
11.4 17.1
14.3
1.4
8.6
14.3
14.3 17.1
8.6
8.6
11.4
nodur
17.1
14.3
17.1
11.4
nodur
14.3
5.7
14.3
5.7
3 samples
dur normdur
11.4 17.1
8.6
8.6
14.3 17.1
14.3 11.4
Classification scores /Y/-/u,y/
/Y/ vs /u,y/ Exp. A.1 Training & Test = DL2N1-Nat
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
73.1
80.8
73.1
80.8
1 sample
dur normdur
92.3 96.2
100 100
92.3 96.2
100 100
Male
2 samples
nodur dur normdur
84.6
92.3 92.3
84.6
92.3 92.3
80.8
84.6 88.5
80.8
88.5 92.3
nodur
76.9
80.8
76.9
80.8
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
64.1
84.6 87.2
64.1
84.6 87.2
64.1
87.2 87.2
64.1
87.2 87.2
Female
2 samples
nodur dur normdur
59.0
89.7 89.7
59.0
89.7 89.7
61.5
89.7 94.9
61.5
89.7 94.9
3 samples
nodur dur normdur
61.5
89.7 89.7
61.5
89.7 89.7
61.5
89.7 89.7
61.5
89.7 89.7
3 samples
dur normdur
96.2 96.2
100 100
100 100
100 100
/Y/ vs /u,y/ Exp. A.2 Training & Test = DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
70.5
70.5 72.1
70.5
70.5 70.5
70.5
72.1 72.1
70.5
72.1 72.1
Male
2 samples
nodur dur normdur
70.5
70.5 72.1
70.5
70.5 70.5
70.5
72.1 70.5
70.5
72.1 72.1
3 samples
nodur dur normdur
70.5
72.1 70.5
70.5
70.5 70.5
70.5
72.1 73.8
70.5
72.1 73.8
1 sample
dur normdur
66.7 69.7
69.7 71.2
69.7 71.2
69.7 71.2
Female
2 samples
nodur dur normdur
69.7
68.2 69.7
69.7
68.2 69.7
69.7
66.7 69.7
69.7
65.2 69.7
nodur
68.2
69.7
69.7
69.7
nodur
69.7
66.7
71.2
69.7
3 samples
dur normdur
68.2 69.7
68.2 69.7
69.7 69.7
68.2 71.2
/Y/ vs /u,y/ Exp. A.3 Training & Test = IFA
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
74.7
74.7 74.7
73.4
74.7 74.7
76.0
72.2 73.4
77.2
74.7 76.0
Male
2 samples
nodur dur normdur
76.0
77.2 77.2
73.4
77.2 77.2
76.0
77.2 79.8
77.2
82.3 82.3
3 samples
nodur dur normdur
77.2
77.2 77.2
74.7
77.2 77.2
78.5
77.2 77.2
81.0
77.2 77.2
1 sample
dur normdur
88.9 88.9
89.7 89.7
88.9 88.9
89.7 89.7
Female
2 samples
nodur dur normdur
81.2
81.2 81.2
84.6
81.2 84.6
81.2
81.2 81.2
84.6
84.6 84.6
nodur
85.5
87.2
87.2
89.7
nodur
86.3
88.0
88.9
88.9
3 samples
dur normdur
86.3 86.3
88.9 88.9
87.2 87.2
90.6 90.6
/Y/ vs /u,y/ Exp. A.4 Training & Test = TRIEST
Male*
1 sample
2 samples
nodur dur normdur nodur dur normdur
F1 F2
65.4
61.7 63.0
74.1
70.4 71.6
F0 F1 F2
65.4
64.2 64.2
74.1
70.4 72.8
F1 F2 F3
65.4
64.2 64.2
77.8
79.0 82.7
F0 F1 F2 F3 65.4
66.7 64.2
77.8
81.5 80.3
* trained and tested on the same material,
since there was not enough training material
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
59.0
57.4 57.4
60.7
55.7 55.7
57.4
60.7 60.7
57.4
55.7 55.7
Female
2 samples
nodur dur normdur
62.3
63.9 63.9
63.9
63.9 62.3
62.3
63.9 63.9
63.9
62.3 62.3
nodur
76.5
79.0
80.3
84.0
3 samples
dur normdur
72.8 72.8
75.3 75.3
84.0 82.7
81.5 81.5
3 samples
nodur dur normdur
62.3
65.6 65.6
60.7
62.3 60.7
63.9
63.9 63.9
63.9
60.7 60.7
/Y/ vs /u,y/ Exp. B.1 Training = DL2N1-Nat & Test = DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
60.3
71.5 68.2
62.0
71.9 69.0
60.3
71.5 67.8
62.0
71.9 69.0
Male
2 samples
nodur dur normdur
56.6
64.9 66.5
60.3
64.9 66.1
55.0
66.9 64.9
56.2
67.4 65.3
3 samples
nodur dur normdur
58.7
69.0 67.4
61.2
69.4 68.2
57.4
69.8 66.5
60.7
70.3 66.1
1 sample
dur normdur
75.9 75.1
75.1 75.7
74.3 74.3
73.2 74.3
Female
2 samples
nodur dur normdur
70.5
70.9 72.4
69.0
70.9 72.8
69.7
72.0 72.4
69.0
71.7 72.8
nodur
70.1
69.4
68.2
68.6
nodur
73.2
74.0
72.4
73.2
3 samples
dur normdur
70.9 73.2
71.3 73.2
70.9 70.9
70.1 72.8
/Y/ vs /u,y/ Exp. B.2 Training = IFA & Test = TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
63.0
63.0
65.4
64.2
1 sample
dur normdur
65.4 65.4
65.4 65.4
63.0 63.0
65.4 65.4
Male
2 samples
nodur dur normdur
64.2
64.2 64.2
64.2
64.2 64.2
59.3
65.4 65.4
67.9
67.9 70.4
nodur
64.2
64.2
58.0
66.7
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
58.2
56.1 56.5
59.8
58.2 58.2
57.3
56.5 55.6
56.9
56.9 57.3
Female
2 samples
nodur dur normdur
59.0
57.3 57.3
58.6
57.3 59.4
57.3
57.3 57.3
58.6
59.0 59.0
3 samples
nodur dur normdur
53.6
54.0 54.0
54.0
54.0 54.0
54.0
53.1 53.6
54.0
54.0 54.0
3 samples
dur normdur
64.2 64.2
64.2 64.2
65.4 65.4
66.7 69.1
/Y/ vs /u,y/ Exp. C.1 Training = DL2N1-Nat & Test = mispronounced /Y/ DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
40.0
30.0 20.0
50.0
60.0 50.0
40.0
30.0 20.0
50.0
60.0 40.0
Male
2 samples
nodur dur normdur
40.0
30.0 20.0
70.0
30.0 30.0
70.0
60.0 60.0
70.0
60.0 60.0
3 samples
nodur dur normdur
20.0
20.0 10.0
50.0
20.0 10.0
30.0
20.0 10.0
90.0
90.0 90.0
1 sample
dur normdur
54.1 54.1
56.8 51.4
56.8 46.0
56.8 46.0
Female
2 samples
nodur dur normdur
73.0
59.5 48.7
73.0
59.5 48.7
73.0
56.8 51.4
73.0
56.8 51.4
nodur
70.3
70.3
64.9
67.6
nodur
70.3
70.3
61.2
62.2
3 samples
dur normdur
56.8 48.7
54.1 48.7
56.8 51.4
54.1 51.4
/Y/ vs /u,y/ Exp. C.2 Training = DL2N1-NN & Test = mispronounced /Y/ DL2N1-NN
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
nodur
100
100
100
100
1 sample
dur normdur
100 100
100 100
100 100
100 100
Male
2 samples
nodur dur normdur
100
100 100
100
100 100
100
100 100
100
100 100
nodur
100
100
100
90.0
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
81.1
75.7 73.0
81.1
73.0 73.0
81.1
75.7 70.3
83.8
75.7 73.0
Female
2 samples
nodur dur normdur
89.2
78.4 78.4
89.2
78.4 78.4
89.2
78.4 75.7
89.2
78.4 78.4
3 samples
nodur dur normdur
81.2
73.0 67.6
78.4
73.0 67.6
81.1
73.0 67.6
81.1
73.0 70.3
3 samples
dur normdur
100 100
100 100
100 100
90.0 90.0
/Y/ vs /u,y/ Exp. C.3 Training = IFA & Test = mispronounced /Y/ TRIEST
F1
F0
F1
F0
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
100
100 100
100
100 100
100
91.7 91.7
91.7
83.3 83.3
Male
2 samples
nodur dur normdur
100
100 100
100
100 100
66.7
66.7 66.7
58.3
66.7 66.7
3 samples
nodur dur normdur
100
100 100
100
100 100
58.3
66.7 66.7
58.3
66.7 66.7
1 sample
dur normdur
20.0 20.0
26.7 26.7
23.3 23.3
23.3 23.3
Female
2 samples
nodur dur normdur
13.3
20.0 23.3
23.3
20.0 26.7
16.7
23.3 23.3
23.3
26.7 26.7
nodur
16.7
23.3
16.7
23.3
nodur
20.0
26.7
20.0
26.7
3 samples
dur normdur
16.7 20.0
16.7 16.7
23.3 23.3
20.0 16.7
/Y/ vs /u,y/ Exp. C.4 Training = TRIEST & Test = mispronounced /Y/ TRIEST
Male*
1 sample
2 samples
nodur dur normdur nodur dur normdur
F1 F2
100
100 100
66.7
66.7 66.7
F0 F1 F2
100
100 100
58.3
66.7 50.0
F1 F2 F3
100
83.3 83.3
83.3
66.7 66.7
F0 F1 F2 F3 100
83.3 83.3
66.7
66.7 66.7
* trained and tested on the same material,
since there was not enough training material
F1
F0
F1
F0
F2
F1 F2
F2 F3
F1 F2 F3
1 sample
nodur dur normdur
93.3
96.7 96.7
80.0
83.3 83.3
86.7
86.7 86.7
73.3
76.7 76.7
Female
2 samples
nodur dur normdur
76.7
80.0 80.0
66.7
80.0 66.7
70.0
70.0 70.0
63.3
63.3 63.3
nodur
66.7
58.3
66.7
58.3
3 samples
dur normdur
58.3 66.7
41.7 41.7
66.7 66.7
58.3 58.3
3 samples
nodur dur normdur
76.7
76.7 80.0
66.7
66.7 66.7
70.0
70.0 70.0
63.3
63.3 63.3
Classification scores /x/-/k/
/x/ vs /k/ Exp. A.1 Train & Test = DL2N1-Nat

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               81.0    79.3   77.6         82.4    82.4   84.7
ROR i3            86.2    86.2   86.2         89.4    91.8   91.8
i1 i3             89.7    91.4   93.1         90.6    92.9   94.1
ROR i1 i3         89.7    91.4   93.1         92.9    91.8   94.1
i1 i2 i3 i4       86.2    89.7   91.4         91.8    91.8   91.8
ROR i1 i2 i3 i4   86.1    87.9   91.4         92.9    91.8   94.1
/x/ vs /k/ Exp. A.2 Train & Test = DL2N1-NN

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               83.8    82.5   82.5         91.7    90.9   90.9
ROR i3            86.3    85.0   85.0         93.4    93.4   93.4
i1 i3             95.9    95.9   95.9         95.9    95.9   95.9
ROR i1 i3         87.5    90.0   90.0         96.7    96.7   96.7
i1 i2 i3 i4       88.8    87.5   88.8         94.2    95.9   95.9
ROR i1 i2 i3 i4   87.5    88.8   88.8         95.9    96.7   96.7
/x/ vs /k/ Exp. A.3 Train & Test = IFA

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               78.0    78.8   78.8         90.1    90.6   90.6
ROR i3            79.6    78.8   79.6         90.6    90.6   89.6
i1 i3             84.1    86.4   86.4         87.1    86.1   86.6
ROR i1 i3         85.6    86.4   86.4         89.6    88.6   88.6
i1 i2 i3 i4       84.1    84.9   85.6         88.1    87.1   86.6
ROR i1 i2 i3 i4   84.9    85.6   85.6         89.1    89.1   89.1
/x/ vs /k/ Exp. A.4 Training & Test = TRIEST

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               83.3    86.7   86.7         87.3    86.3   87.3
ROR i3            80.0    86.7   86.7         88.2    91.2   90.2
i1 i3             83.3    90.0   90.0         88.2    90.2   90.2
ROR i1 i3         83.3    90.0   90.0         88.2    92.2   93.1
i1 i2 i3 i4       86.7    90.0   90.0         86.3    89.2   89.2
ROR i1 i2 i3 i4   86.7    90.0   90.0         87.3    91.2   91.2
/x/ vs /k/ Exp. B.1 Training = DL2N1-Nat & Test = DL2N1-NN

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               82.5    82.5   83.8         92.6    90.9   91.7
ROR i3            85.0    88.8   88.8         94.2    95.9   95.9
i1 i3             88.8    93.8   92.5         91.7    95.9   95.0
ROR i1 i3         90.0    93.8   92.5         93.4    95.9   95.9
i1 i2 i3 i4       88.8    92.5   91.3         90.9    95.0   95.0
ROR i1 i2 i3 i4   88.8    92.5   91.3         93.4    95.9   95.9
/x/ vs /k/ Exp. B.2 Training = IFA & Test = TRIEST

                           Male                        Female
                  nodur   dur    normdur      nodur   dur    normdur
ROR               83.3    70.3   73.3         68.6    69.6   69.6
ROR i3            80.0    73.3   76.7         68.6    68.6   68.6
i1 i3             86.7    80.0   80.0         78.4    81.4   80.4
ROR i1 i3         83.3    80.0   80.0         76.5    77.5   75.5
i1 i2 i3 i4       80.0    80.0   80.0         75.5    78.4   77.5
ROR i1 i2 i3 i4   83.3    80.0   80.0         73.5    76.5   73.5
/x/ vs /k/ Exp. C.1 Training = DL2N1-Nat & Test = mispronounced /x/ DL2N1-NN

                           Female
                  nodur   dur    normdur
ROR               20.0    20.0   20.0
ROR i3            30.0    40.0   40.0
i1 i3             60.0    60.0   60.0
ROR i1 i3         30.0    50.0   50.0
i1 i2 i3 i4       30.0    50.0   50.0
ROR i1 i2 i3 i4   20.0    50.0   50.0
/x/ vs /k/ Exp. C.2 Training = DL2N1-NN & Test = mispronounced /x/ DL2N1-NN

                           Female
                  nodur   dur    normdur
ROR               20.0    20.0   20.0
ROR i3            20.0    20.0   20.0
i1 i3             20.0    50.0   30.0
ROR i1 i3         20.0    20.0   20.0
i1 i2 i3 i4       30.0    40.0   40.0
ROR i1 i2 i3 i4   30.0    30.0   30.0
/x/ vs /k/ Exp. C.3 Training = IFA & Test = mispronounced /x/ TRIEST

                           Female
                  nodur   dur    normdur
ROR               16.7    33.3   33.3
ROR i3            25.0    41.7   41.7
i1 i3             41.7    41.7   41.7
ROR i1 i3         33.3    41.7   41.7
i1 i2 i3 i4       41.7    41.7   41.7
ROR i1 i2 i3 i4   41.7    41.7   41.7
/x/ vs /k/ Exp. C.4 Training = TRIEST & Test = mispronounced /x/ TRIEST

                           Female
                  nodur   dur    normdur
ROR               41.7    100    100
ROR i3            50.0    75.0   75.0
i1 i3             50.0    83.3   83.3
ROR i1 i3         50.0    83.3   75.0
i1 i2 i3 i4       50.0    83.3   83.3
ROR i1 i2 i3 i4   50.0    83.3   75.0
Appendix G
How to read a box-and-whisker plot
A box plot provides a simple graphical summary of data. The description and plot originate from
the Praat manual, http://www.praat.org.
[Schematic box plot from the Praat manual: the box runs from q25 to q75 with the median (q50) and the mean marked inside; whiskers extend to the lower and upper whisker values; outliers beyond the inner fence are drawn with '*', outliers beyond the outer fence with '◦'.]
q25 = lower quartile, 25% of the data lie below this value
q50 = median, 50% of the data lie below this value
q75 = upper quartile, 25% of the data lie above this value
hspread = |q75 - q25| (50% interval)
upper Inner Fence = q75 + 1.5 * hspread
upper Outer Fence = q75 + 3.0 * hspread
lower Inner Fence = q25 - 1.5 * hspread
lower Outer Fence = q25 - 3.0 * hspread
lower Whisker = smallest data value larger than the lower Inner Fence
upper Whisker = largest data value smaller than the upper Inner Fence
• the dotted line corresponds to the mean
• the outliers beyond the Outer Fences are drawn with an '◦'
• the outliers between the Inner and Outer Fences are drawn with an '*'
• with no outliers present, the whiskers mark the minimum and/or maximum of the data
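As a rough illustration (not part of the thesis), the quantities above can be computed for a small
list of made-up numbers with a few lines of Perl; the quartiles are taken here as simple order
statistics, which may differ slightly from the interpolation Praat uses.

#!/usr/bin/perl
use strict;
use warnings;

# Example data (made-up numbers), sorted in ascending order.
my @data = sort { $a <=> $b } (3.1, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 7.0, 12.5);

my $q25 = $data[int(0.25 * $#data)];   # lower quartile
my $q50 = $data[int(0.50 * $#data)];   # median
my $q75 = $data[int(0.75 * $#data)];   # upper quartile
my $hspread = $q75 - $q25;             # 50% interval

my $upper_inner = $q75 + 1.5 * $hspread;
my $upper_outer = $q75 + 3.0 * $hspread;
my $lower_inner = $q25 - 1.5 * $hspread;
my $lower_outer = $q25 - 3.0 * $hspread;

# Whiskers: the most extreme data values that still lie inside the inner fences.
my ($lower_whisker) = grep { $_ > $lower_inner } @data;
my ($upper_whisker) = grep { $_ < $upper_inner } reverse @data;

printf "q25=%.1f  q50=%.1f  q75=%.1f  whiskers=[%.1f, %.1f]\n",
       $q25, $q50, $q75, $lower_whisker, $upper_whisker;

# Values outside the inner fences would be drawn as outliers ('*' or 'o') in the plot.
foreach my $x (@data) {
    print "$x lies beyond the inner fences (outlier)\n"
        if $x > $upper_inner or $x < $lower_inner;
}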