IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 1, JANUARY 2011
Dialect Classification via Text-Independent Training
and Testing for Arabic, Spanish, and Chinese
Yun Lei, Student Member, IEEE, and John H. L. Hansen, Fellow, IEEE
Abstract—Automatic dialect classification has emerged as an important area in the speech research field. Effective dialect classification is useful in developing robust speech systems, such as
speech recognition and speaker identification. In this paper, two
novel algorithms are proposed to improve dialect classification for
text-independent spontaneous speech in Arabic and Spanish languages, along with probe results for Chinese. The problem considers the case where no transcripts but dialect labels are available for training and test data, and speakers are speaking spontaneously, which is defined as text-independent dialect classification.
The Gaussian mixture model (GMM) is used as the baseline system
for text-independent dialect classification. The major motivation is
to suppress confused/distractive regions from the dialect language
space and emphasize discriminative/sensitive information of the
available dialects. In the training phase, a symmetric version of
the Kullback–Leibler divergence is used to find the most discriminative GMM mixtures (KLD-GMM), where the confused acoustic
GMM region is suppressed. For testing, the more discriminative frames are detected and used based on their location in the GMM mixture feature space, which is termed frame selection decoding (FSD-GMM). The first (KLD-GMM) and second (FSD-GMM) techniques are shown to improve dialect classification
performance for three-way dialect tasks. The two algorithms and
their combination are evaluated on dialects of Arabic and Spanish
corpora. Measurable improvement is achieved in both cases over a generalized maximum-likelihood estimation GMM baseline (MLE-GMM).
Index Terms—Arabic dialects, dialect classification, frame selection, Gaussian mixture, Kullback–Leibler divergence, Spanish dialects.
I. INTRODUCTION
DIALECT classification, sometimes also referred to as dialect identification, is an emerging research topic in
the speech recognition community because dialect is one of the
most important factors, next to gender, that influence speech recognition performance [1]–[4]. Automatic dialect classification is important for characterizing speaker traits [5] and knowledge estimation, which can then be employed to build dynamic
lexicons by selecting alternative pronunciations [6], generate
pronunciation modeling via dialect adaptation [7], or train [8]
and adapt [9] dialect dependent acoustic models. Dialect knowledge is also helpful for data mining and spoken document retrieval [10], [11].

Manuscript received September 30, 2008; revised November 21, 2009; accepted January 10, 2010. Date of publication March 11, 2010; date of current version October 1, 2010. This work was supported in part by the AFRL under a subcontract to RADC, Inc., under Grant FA8750-09-C-0067 and in part by the University of Texas at Dallas under Project EMMITT. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Richard C. Rose.

The authors are with the Center for Robust Speech Systems (CRSS), University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2010.2045184

In this paper, the definition employed for the
term dialect is: a pattern of pronunciation and/or vocabulary of
a language used by the community of native speakers belonging
to some geographical region.1 For example, Cuban Spanish and
Peruvian Spanish are two dialects of Spanish; American English
and U.K. English are two dialects of English. Here, we refer to
American English and U.K. English as parent family tree dialects, while dialects such as Cambridge, Belfast, or Cardiff are
represented as subclasses under the U.K. family tree. It is noted
that slight differences in definition of dialect exist across research studies, depending on their perspective of the problem,
linguistics, or speech engineering goals.
In previous studies, it has been shown that isolated words as
well as individual phonemes can be successfully used for dialect classification [13], [14]. Utterance-based dialect classification presents two different text scenarios: constrained and unconstrained. If transcripts are available, supervised word-based
dialect classification is suggested. The method turns the text-independent dialect classification problem into a text-dependent
dialect classification problem by comparing a range of given
words which are the output of an automatic speech recognizer
(ASR), and has been shown to obtain very high accuracy [15]. A
context-adaptive training (CAT) algorithm has also been applied
for cases where the training data set size is very small [16]. In
general, most conversational dialect data is unconstrained since
transcript information is expensive to produce. In the present
framework, typically no text, speaker, or gender information except the dialect label is available for the data, and therefore a text-independent algorithm must be formulated. Alternatively,
a Gaussian mixture model (GMM)-based classifier can be applied for unconstrained data [17]. Several successful methods
have also been proposed based on reducing model confusion
to achieve better performance for dialect classification. For example, training data selection and Gaussian mixture selection
[18] based on the training corpus attempts to exclude or balance the confusion region; minimum classification error (MCE)
[19] training, as a common discriminative training method, can
also be applied to reduce model confusion. In a manner similar
with MCE, maximum mutual information (MMI) has been applied for language and accent identification successfully [20],
[21]. Factor analysis [22]–[24], constrained maximum-likelihood linear regression (CMLLR) [25] and vocal tract length
normalization (VTLN), as methods for variability compensation, have all been successfully applied for language identification. While they are all general compensation techniques, they could also be applied for dialect classification. Factor analysis, especially based on the eigenchannel model, can be used to describe the channel variability, which can influence dialect classification; CMLLR can also be used to compensate for the channel, but under the assumption that the mean and covariance parameters are governed by one transform per class; VTLN, as an approach to normalize speaker characteristics, can suppress the influence of speakers in dialect classification. In addition to the acoustic phase, the vocabulary and grammar differences of dialects can also be studied and applied for dialect classification [26].

1 Dialect in this context refers to regional dialects of a language. Social, as well as economic-based, dialects also exist in languages/countries. Such studies consider problems of the origin and diffusion of linguistic change, the nature of stylistic variation in language use, and the effect of class structure on linguistic variation within a speech community. Such issues are not addressed in the present study, but we note the existence of such work in the field of sociolinguistics [12].

1558-7916/$26.00 © 2010 IEEE
The focus in this study is to identify and emphasize those
traits that are most discriminative across dialects of a common
language. A GMM is used to represent the acoustic space of the
dialects. The hypothesis considered here is that some mixtures
are significantly different among the dialects, which will help
us to classify the dialects, while others possess information that
is dialect neutral. In the training phase, the symmetrized KL divergence (KL2) [27] based algorithm is employed to assess the
dialect dependent mixtures in order to enhance overall dialect
discrimination (KLD), while suppressing dialect neutral mixtures. The training phase, however, is not the only phase which
can benefit from improved dialect modeling for classification.
Along a similar concept used in the mixture division, in the decoding phase the frames can also be divided into two classes:
dialect dependent and dialect neutral frames based on the importance of the frames for dialect classification performance. Effective selection of dialect dependent frames while setting aside
dialect neutral frames will have a similar impact as seen for mixtures [frame selection decoding (FSD)].
This paper is organized as follows. The next section begins
with a brief introduction of the GMM-based dialect classification system (Section II), followed by a discussion of
training and testing techniques in Section III. The proposed
training technique—KLD, is presented in Section III-A; the
test technique—FSD, is proposed in Section III-B. Section IV
presents a series of experimental results with a comparison of
the proposed methods to the traditional maximum-likelihood
(ML) method. Finally, research findings are summarized along
with a discussion of the impact in Section V.
II. GMM-BASED CLASSIFICATION ALGORITHM
In this paper, only text-independent classification is considered since it is assumed that no transcripts are available for
either training or test data. The GMM classifier, employing a
soft Bayes classifier, has been successfully applied for speech
related classification such as text-independent speaker recognition [28] and dialect classification [17]. Here, a GMM-based
dialect classification algorithm is employed as the baseline
system. Fig. 1 shows the flow diagram of the baseline GMM training process, where a closed set of dialects is considered. The dialect GMM model is trained with spontaneous data
from each speech dialect. The training method is generalized
maximum-likelihood estimation (MLE) employing the expectation–maximization (EM) algorithm [29], [30]. In the training
phase, silence frames are first removed from the input audio stream using an energy threshold, followed by MFCC feature extraction. For each dialect, gender-dependent GMM models are constructed. The test phase is shown in Fig. 2, where silence removal and feature extraction steps are applied prior to dialect classification. The details of model formulation are described in the experimental section. To avoid the influence of gender and to emphasize dialect classification, gender information is assumed known, so gender classification is not considered here.

Fig. 1. Baseline MLE-GMM text-independent dialect training system.

Fig. 2. Baseline GMM text-independent dialect testing system.
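As a rough illustration of this baseline pipeline, the sketch below removes low-energy frames and then scores a feature stream against per-dialect diagonal-covariance GMMs. This is a minimal, assumption-laden simplification, not the authors' implementation; all function and parameter names here are hypothetical.

```python
import numpy as np

def remove_silence(frames, energy, threshold):
    """Drop frames whose energy falls below a fixed threshold,
    mirroring the energy-based silence removal of Fig. 1."""
    return frames[energy >= threshold]

def gmm_loglike(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x (N, d) under a
    diagonal-covariance GMM with K mixtures."""
    d = x.shape[1]
    # log N(x; mu_k, diag(var_k)) for every frame/mixture pair
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    diff = x[:, None, :] - means[None, :, :]               # (N, K, d)
    expo = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + expo
    m = log_comp.max(axis=1, keepdims=True)                # log-sum-exp
    return float(np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))

def classify_dialect(x, dialect_models):
    """Pick the dialect whose GMM gives the highest average log-likelihood."""
    scores = [gmm_loglike(x, *model) for model in dialect_models]
    return int(np.argmax(scores))
```

The soft Bayes decision reduces to this likelihood argmax under equal dialect priors, which matches the closed-set setup described above.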
In general, dialect classification is considered to be similar in some respects to language identification. A number
of successful techniques for language identification could be
applicable for dialect classification. For example, there are
many methods based on phone recognition, such as Phone
Recognition and Language Modeling (PRLM), parallel PRLM
(PPRLM), and language dependent Parallel Phone Recognition
(PPR) [31]. Also, support vector machine (SVM), SVM phone
recognition, or an SVM using a GMM super-vector kernel
could also be applied to achieve good language ID performance [32]. Maximum mutual information (MMI), as a general
discriminative learning method, also achieves significant improvement for language identification. The scope of research
work in language identification is quite large, compared with
that in dialect identification. As such, more focused studies
have been applied in language identification which explore the
minimization of issues such as microphone (factor analysis,
CMLLR), vocal tract length/speaker differences (VTLN), etc.
For the field of dialect identification, it is more important to
first establish competitive solutions before such non-dialect
dependent variability can be effectively addressed. The lack
of extensive dialect corpora in the field is one reason for the
lack of research progress. The focus of this study is to develop
better algorithms than the standard MLE for text-independent
dialect classification by emphasizing dialect-specific traits.
Therefore, this paper uses the MLE-GMM algorithm as the
baseline system. Also, the focus is on dialect classification, and
not minimizing non-dialect dependent variability. As such, it is possible to further improve actual classification scores if such additional processing is also included. This issue is suggested for
future work.
III. TRAINING AND TESTING FOR TEXT-INDEPENDENT
DIALECT CLASSIFICATION
Although dialect classification is similar to language identification, there are some differences. Language identification
attempts to determine the language in which the speech was
spoken. Normally, different languages have different phonemes,
vocabulary, grammar, as well as different pronunciations. Also,
boundaries between languages are generally quite distinct, and
easier to recognize perceptually. Dialects, especially subclass
dialects (e.g., Cardiff, Belfast, Cambridge), are more subtle and
less perceptually recognized. For the dialect case, differences
among dialects of a language are usually smaller than between
languages in terms of grammar, pronunciation, and vocabulary
selection. Also, there is less formal documentation available
(i.e., it is easy to obtain dictionaries of the English, German, and Spanish languages; but it is very difficult to obtain a Belfast U.K. English dictionary versus a Cardiff U.K. English dictionary). In
the acoustic space, it is suggested that the acoustic/linguistic distance between dialects is usually much closer than the distance
between languages, and therefore there should be more overlap
among dialects. In fact, the study by Mehrabani and Hansen [33] has illustrated this for dialect and language separation.
The proposed method here employs a two-step process. In the
first technique, the focus is to find and remove the confusing region of the dialect model in the training phase. The technique,
as a training method, is called Gaussian Mixture Selection by
KL2 Divergence (KLD-GMM). Here, a GMM model is used
to represent the acoustic space, where the individual mixtures
of the GMM are employed to represent different regions of the
acoustic space. It is assumed that some mixtures contribute to
effective dialect classification, while other mixtures distract the
model from effective dialect classification. The technique therefore classifies mixtures as contributing or distracting, and retains only the contributing mixtures. The second technique is the testing method, entitled Frame Selection Decoding (FSD). The technique finds and removes the confused region of the frames. In testing, the frames are classified into contributing and distracting parts, with only the contributing frames retained for classification.
A. Training Algorithm: Gaussian Mixture Selection by KL2
Divergence
Assuming a single GMM model is employed to describe one
dialect, each Gaussian mixture component is expected to contribute to the individual parts of the dialect acoustic space. Here
we suggest individual parts since the covariance matrix is diagonal for each mixture component. Although there is no direct
one-to-one mapping from the individual mixtures to the individual phones, we employ the following example mapping of mixtures to phones to explain why and how the mixtures are classified, since the number of mixtures is typically chosen to approximate the number of phonemes in the system. However, when the mixtures are actually classified, only the distances among the mixtures are measured, which is not based on phone labels (since they are not known); this means the mixtures can represent phones, vocal tract configurations, or even more general speaker/speech properties.
In this example, the pronunciation of some phonemes in different dialects can be similar from both an MFCC feature perspective as well as perceptually. If pronunciations of such a phone are similar in different dialects, then the mixtures which represent this phone will not contribute to dialect classification, if we assume individual phones will be represented via particular
GMM mixtures. Alternatively, if the pronunciations of a particular phone are very different across dialects, then these mixtures will contribute to dialect classification. Clearly, a portion
of the mixtures which represent the same phone will be similar
across different dialects. These mixtures therefore do not contribute to improving overall dialect classification. However, the
portion of the mixtures representing phones that are different,
will emphasize the separation between dialects. Fig. 3 illustrates these two scenarios. In Fig. 3(a), two phones are shown, where there are limited changes for the first phone between dialect and anti-dialect models. However, notable differences exist for the second phone from dialect to anti-dialect models. Here, the anti-dialect model can either represent another dialect, or a composition of dialects expected to compete with the target dialect. Therefore, almost all mixtures representing the second phone have changed. In Fig. 3(b), a portion of the mixtures are similar between the dialect and anti-dialect models, while some mixtures of the first phone change from dialect to anti-dialect models. Similarly, part of the mixtures of the second phone also change from dialect to anti-dialect model. The algorithm therefore results in two sets of mixtures: dialect sensitive mixtures (e.g., Discriminative Mixtures) and neutral mixtures (e.g., General Mixtures), which are suppressed to decrease confusion between
dialects. A method, however, is needed to measure the similarity
between these mixtures. A symmetric version of the KL divergence (KL2) is appropriate for this task, since it is often used as
a measure of similarity between two density distributions. The
KL2 divergence between two probability density functions $f$ and $g$ is defined as [34]

$$\mathrm{KL2}(f, g) = \mathrm{KL}(f \| g) + \mathrm{KL}(g \| f) \tag{1}$$

where $\mathrm{KL}(f \| g)$ is the KL divergence from probability density function (pdf) $f$ to $g$, and $\mathrm{KL}(g \| f)$ is the KL divergence from pdf $g$ to $f$. In general, it is difficult to calculate the KL2 divergence between two arbitrary distributions. For the case of GMMs, however, all mixtures are typically Gaussian distributions. Fortunately, the KL divergence between two Gaussian distributions has a closed-form expression

$$\mathrm{KL}(f \| g) = \frac{1}{2}\left[\log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{tr}\!\left(\Sigma_g^{-1}\Sigma_f\right) + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - d\right] \tag{2}$$

where $\Sigma_f$ and $\Sigma_g$ are the covariance matrices of $f$ and $g$, $\mu_f$ and $\mu_g$ are the means of $f$ and $g$, $|\cdot|$ is the determinant, $\mathrm{tr}(\cdot)$ is the trace, and $d$ is the feature dimension. The previous equations describe the KL2 divergence between any two Gaussian distributions. Assuming the covariance matrices are
Fig. 3. (a) One phoneme shows similar structure between the dialect and anti-dialect models, while the other shows key mixture differences from the dialect to the anti-dialect model. (b) Subregions of the mixture space of both phones differ from the dialect to the anti-dialect model.
diagonal, the KL2 divergence can be calculated using only the
mean and variance of the Gaussian distributions. Since MFCC
features are employed for the dialect system, and the cross correlation values between MFCC feature dimensions can be assumed to be zero, the diagonal assumption employed here both
is valid and eases overall computational analysis. It is noted that the complete GMM model is described by three parameters: mixture weight, mean, and variance:

$$f(x) = \sum_{i=1}^{M} w_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i) \tag{3}$$

where $M$ is the number of mixtures, $w_i$ is the mixture weight, $\mu_i$ is the mean vector, and $\Sigma_i$ is the (diagonal) covariance of the $i$th pdf in the GMM model. Therefore, it is reasonable to add the mixture weight into the KL2 divergence calculation for the distance measurement in the GMM models. Here, the mixture weight is attached to the Gaussian distributions, with a redefinition of the functions $f$ and $g$ as

$$\hat{f}(x) = w_f\, \mathcal{N}(x;\, \mu_f, \Sigma_f) \tag{4}$$

$$\hat{g}(x) = w_g\, \mathcal{N}(x;\, \mu_g, \Sigma_g) \tag{5}$$
The KL divergences from function $\hat{f}$ to function $\hat{g}$, and from function $\hat{g}$ to function $\hat{f}$, are updated and recalculated with the three GMM parameters (mixture weight, mean, and variance) as follows:

$$\mathrm{KL}(\hat{f} \| \hat{g}) = w_f \log\frac{w_f}{w_g} + \frac{w_f}{2}\sum_{j=1}^{d}\left[\log\frac{\sigma_{g,j}^2}{\sigma_{f,j}^2} + \frac{\sigma_{f,j}^2}{\sigma_{g,j}^2} + \frac{(\mu_{f,j}-\mu_{g,j})^2}{\sigma_{g,j}^2} - 1\right] \tag{6}$$

where $d$ is the feature dimension. The new KL2 divergence between $\hat{f}$ and $\hat{g}$ can be recalculated by (1). Here, the Gaussian mixture of the dialect model or anti-dialect model is defined to be the same function $\hat{f}$ or $\hat{g}$ in (4) and (5).
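The weighted divergence computation can be sketched as below for diagonal-covariance mixtures. The exact weighted form used by the authors was lost in extraction; this sketch assumes the common scaled-density form of (6), where the per-dimension Gaussian KL is scaled by the weight and a weight-ratio term is added, and symmetrizes it per (1). Function names are illustrative only.

```python
import numpy as np

def kl_weighted_diag(w_f, mu_f, var_f, w_g, mu_g, var_g):
    """KL divergence from weighted diagonal Gaussian f-hat to g-hat,
    in the spirit of (6): per-dimension Gaussian KL scaled by the
    mixture weight, plus a weight-ratio term."""
    per_dim = (np.log(var_g / var_f)
               + var_f / var_g
               + (mu_f - mu_g) ** 2 / var_g
               - 1.0)
    return w_f * np.log(w_f / w_g) + 0.5 * w_f * np.sum(per_dim)

def kl2_weighted_diag(w_f, mu_f, var_f, w_g, mu_g, var_g):
    """Symmetric KL2 divergence per (1): KL(f||g) + KL(g||f)."""
    return (kl_weighted_diag(w_f, mu_f, var_f, w_g, mu_g, var_g)
            + kl_weighted_diag(w_g, mu_g, var_g, w_f, mu_f, var_f))
```

Note that the log-determinant terms cancel under symmetrization, so for equal weights KL2 depends only on the variance ratios and the mean offsets.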
Since each GMM will have multiple Gaussian mixtures, all KL2 divergences between any individual Gaussian mixture of the dialect model and any single anti-dialect model mixture are calculated, resulting in an $M \times M$ KL2 divergence matrix, where $M$ is the number of mixtures in the GMM. Here, let the element $D_{ij}$ in the matrix be defined as the KL2 divergence between mixture $i$ from the dialect model and mixture $j$ from the anti-dialect model. First, the mixture pair $(i, j)$ with the minimal KL2 divergence in the matrix is considered. Next, all elements in row $i$ and column $j$ of the matrix are set aside. The process is repeated for each remaining mixture pair, from $1$ to $M$; the resulting sequence represents the KL2 divergence values for the mixture pairs ranked in ascending order. The proposed method designates those mixtures at the beginning of this ranking as general mixtures, with all others higher in the list tagged as discriminative mixtures. To determine whether mixture $i$ is general or discriminative, define $T_i$ as

$$T_i = \begin{cases} 1, & \text{if } r_i > \alpha M \\ 0, & \text{if } r_i \le \alpha M \end{cases} \tag{7}$$

where $1$ signifies a discriminative mixture, $0$ a general mixture, and $r_i$ represents the rank of the $i$th mixture in the ascending ordering. The value $\alpha$ is the relative threshold which represents the upper bound on the fraction of general mixtures, where the range for $\alpha$ is $[0, 1]$. For testing, the probabilities of the general mixtures are not calculated, since from a dialect perspective these do not contribute to dialect discrimination. In addition, an upper bound represented by a constant $\theta$ is needed to ensure that mixtures with sufficient divergence are retained as discriminative for the case where the dialect difference is very significant: any mixture with KL2 divergence greater than the upper bound constant $\theta$ is tagged as discriminative.
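The greedy pairing and tagging steps above can be sketched as follows. This is a simplified reading of the procedure, assuming the reconstructed notation of (7): pairs are extracted in ascending KL2 order, and a dialect mixture is tagged general only if it both ranks within the first $\alpha M$ pairs and its divergence does not exceed $\theta$. The name `tag_mixtures` is hypothetical.

```python
import numpy as np

def tag_mixtures(kl2_matrix, alpha, theta):
    """Greedily pair dialect/anti-dialect mixtures by minimal KL2,
    then tag each dialect mixture: 0 = general, 1 = discriminative."""
    work = np.array(kl2_matrix, dtype=float)
    M = work.shape[0]
    order = []   # dialect-mixture indices, in ascending KL2 of their pair
    divs = []    # the paired KL2 divergence for each
    for _ in range(M):
        i, j = np.unravel_index(np.argmin(work), work.shape)
        order.append(i)
        divs.append(work[i, j])
        work[i, :] = np.inf   # set aside row i and column j
        work[:, j] = np.inf
    tags = np.ones(M, dtype=int)
    n_general = int(alpha * M)
    for rank, (i, d) in enumerate(zip(order, divs)):
        if rank < n_general and d <= theta:
            tags[i] = 0       # general: low divergence and early rank
    return tags
```

A small worked case: with one clearly separated mixture and two near-identical pairs, only the separated mixture survives as discriminative when $\theta$ is loose, while tightening $\theta$ reclaims borderline mixtures as discriminative.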
Fig. 4 shows the flow diagram of the KL2 divergence-based discriminative training processing—KLD. For each dialect, the dialect model and anti-dialect model are trained. The function of
the processing block “KLD SELECTION” in Fig. 4 is to designate mixtures of the dialect model with the mixture selection algorithm formulated above, with results saved in a “TAG FILE.”
In this file, all mixtures are tagged as one of two classes: discriminative mixtures and general mixtures. If the discriminative
testing process formulated in the next subsection is not included,
then the testing process is equivalent to the baseline with the exception that only discriminative mixture parts are used instead
of the entire GMM models.
Next, since the size of the dialect corpus is typically small,
model adaptation is generally considered to address this
problem. To apply adaptation for the GMM, development of
Fig. 4. Training strategy based on Gaussian mixture selection by KL2 divergence (KLD-GMM).
a universal background model (UBM)2 must first be trained.
The UBM can be trained from another corpus, or from the
entire multiple dialect corpus. If the UBM is trained from a
separate corpus, it must be the same language as the dialects
under evaluation, and should include a sampling of dialects. In
this paper, all data in the dialect corpus is used to train the base
UBM. In this case, no new or parallel corpus is needed to train
the UBM, and the dialect corpus will typically be balanced
across the dialects of interest, assuming a balance in the original training corpus. After training the UBM, the dialect model
can be adapted from the UBM using data from the particular
dialect. Here, MAP adaptation is considered to generate a
dialect dependent model from the UBM. The proposed KLD
algorithm can also be applied for the dialect and anti-dialect
models, which are adapted from the UBM. However, since the
dialect and anti-dialect models are derived from the UBM, the
projection between dialect model and anti-dialect model can be
simplified. If the mixtures of the UBM are tagged with an index from 1 to $M$, it is possible to record and track the index during model adaptation, so that mixtures with the same index in the dialect and anti-dialect models can be paired directly instead of computing the pairing from the KL2 matrix. In this case, the calculation of the KL2 matrix is removed and the meaning of the pairs between dialect and anti-dialect models becomes clearer. The simplified KLD-GMM algorithm can only be used for models which have been adapted
from the UBM.
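The MAP step described above can be sketched as the standard relevance-MAP adaptation of the UBM means (weights and variances kept from the UBM). This is a generic sketch of relevance MAP, not necessarily the exact variant the authors used; the relevance factor `r` and all names are assumptions.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, x, r=16.0):
    """MAP-adapt UBM means toward dialect data x (N, d).
    Each adapted mean interpolates between the UBM mean and the
    posterior-weighted data mean, controlled by relevance factor r."""
    K, d = ubm_means.shape
    # posterior responsibility of each mixture for each frame
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(ubm_vars).sum(axis=1))
    diff = x[:, None, :] - ubm_means[None, :, :]
    log_comp = (np.log(ubm_weights)[None, :] + log_norm[None, :]
                - 0.5 * (diff ** 2 / ubm_vars[None, :, :]).sum(axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)            # (N, K)
    n_k = post.sum(axis=0)                             # soft counts
    x_bar = (post.T @ x) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + r))[:, None]
    # mixture index k is preserved, so dialect and anti-dialect models
    # adapted from the same UBM stay paired mixture-by-mixture
    return alpha * x_bar + (1.0 - alpha) * ubm_means
```

Because adaptation only moves each UBM mixture, the index-based pairing between dialect and anti-dialect models described above holds by construction.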
In the KLD-GMM algorithm, it is important to note that the
pdf weights in the discriminative part are not re-normalized after
removal of the general mixtures. The reason for this is that the
discriminative mixtures more accurately represent the target dialect in the discriminative acoustic space, while the general mixtures represent the confused portion with the competing dialects.
2 A UBM is a standard GMM for representing a large number of speakers, typically outside subjects, in open-set speaker recognition.
At some level, the resulting sum of the discriminative mixture
weights reflects the true separation of the dialect against its
neighbors. Since the a priori dialect probabilities are unknown, they are assumed to be equal, and therefore it is appropriate to employ the likelihood instead of the posterior probabilities. With this, the sum of the weights in each discriminative part can be considered to be the prior probability of the discriminative part.
B. Testing Algorithm: Frame Selection Decoding
Along a similar concept as the proposed mixture based KLD
method, a frame selection decoding (FSD) scheme is also developed here for dialect classification. It is known that for accent
classification, there is a reasonable expectation that a nonnative speaker will systematically substitute sounds (phones) from
his/her native language for those foreign sounds which are less
familiar. Here, accent can be defined as a pattern of pronunciation and/or vocabulary of a language used by the community
of nonnative speakers belonging to some geographical region.
For example, English spoken by native Chinese or German speakers represents two accents of English. In this case, the phones from the native language and the substituted phones from the foreign accent
are very similar. On the other hand, for dialect classification, it
is expected that all speakers are native, so, we believe, it is not necessary for the speakers to purposely substitute phonemes between two dialects in order to maintain mutual understanding. However, the interaction effect among dialects is significant due to the communication among people with
different dialects. In general, since different dialects originate
from the same language, they will have similarities with each
other. In addition, increased communication between populations of neighboring dialect regions is expected to contribute
to some degree of dialect transference/fusion. This would more
likely occur in word selection, versus phonemes, pronunciation,
grammar, or prosodic (intonation, timing, etc.) structure. The
Fig. 5. Testing strategy based on frame selection decoding (FSD-GMM).
more opportunities a speaker has to hear a dialect, the more likely the speaker is to acquire those dialect traits. A speaker is more likely to hear phones with a high frequency of occurrence in a dialect than phones with a low frequency of occurrence. It is believed that the speaker will unintentionally imitate the pronunciations of phones with high frequency of occurrence, and therefore be closer to the pronunciations in the dialect, although some phonemes that are only used in a particular dialect of the same language could be difficult to learn
by speakers from other dialect regions. Alternatively, there is a
lower probability that the speaker will hear phones with low frequency of occurrence, which suggests that these phones would
maintain true traits of the dialect better.
Therefore, motivated by the arguments above, frames in the
test data sequence will carry a nonuniform range of dialect-dependent information. The frames are classified into two classes
based on locations in the acoustic space. In the decoding phase,
the frames that represent phones with high frequency of occurrence, and are acoustically closer to each other, are believed to
be more dialect confusable and less dialect dependent, while
frames representing phones with lower occurrence frequency
will reflect more dialect dependent information. In a manner
consistent with the proposed mixture-based approach, frames
which are less dialect dependent are tagged as General Frames,
since they can decrease dialect classification performance and
should be suppressed. Frames which are more dialect dependent are tagged as Discriminative Frames, since they enhance
discrimination and should therefore improve classification performance. If the probabilities from all dialects under test for that
frame are much greater, this would mean that the phone represented by this frame is used frequently in the language, and the
frame is identified as a General frame and set aside, otherwise
the frame is retained for classification. In practice, the sum of
the probabilities from all dialects of the frame are used, instead
of their probabilities to identify the frames, since the assumption
that probabilities from all dialects of that frame are very similar.
In the tagging process, the sum of the probabilities from all dialects of every frame in the test file is calculated as

$$S_t = \sum_{k=1}^{K} p(x_t \mid \lambda_k) \tag{8}$$

where $K$ is the number of dialects under test (typically three for our scenario), $\lambda_k$ is the GMM of the $k$th dialect, and $S_t$ is the sum of the probabilities of the $t$th frame $x_t$. All $S_t$ are ranked in ascending order. To determine whether frame $t$ is a dialect discriminative or general frame, define $1$ to be a discriminative sample and $0$ a general sample, where the $t$th frame is classified as

$$F_t = \begin{cases} 1, & \text{if } r_t \le (1-\gamma)N \\ 0, & \text{if } r_t > (1-\gamma)N \end{cases} \tag{9}$$

where $r_t$ represents the rank of the $t$th frame in the ascending ordering and $N$ represents the total number of frames in the test file. The value $\gamma$ is the relative threshold which represents the upper bound on the fraction of dialect-recessive (general) frames, where the range for $\gamma$ is $[0, 1]$. The discriminative test process is
shown in Fig. 5. Dialect-discriminative frames are selected by
the block “Frame Credibility Analysis.” Dialect-general frames
are set aside and only dialect-dominant frames will contribute
to the overall classification score.
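The frame credibility analysis above can be sketched as follows, assuming the reconstructed notation of (8) and (9): frames whose summed likelihood across dialects is among the top $\gamma$ fraction are tagged general and set aside, and only the remaining discriminative frames contribute to the score. Names here are illustrative, not the authors'.

```python
import numpy as np

def select_frames(frame_likelihoods, gamma):
    """frame_likelihoods: (N, K) array of each frame's likelihood under
    each of K dialect models.  Returns 1 for discriminative (kept)
    frames, 0 for general (set-aside) frames."""
    s = frame_likelihoods.sum(axis=1)        # eq. (8): sum over dialects
    n = len(s)
    n_general = int(gamma * n)               # frames to set aside
    ranks = np.argsort(np.argsort(s))        # ascending rank of each frame
    keep = ranks < (n - n_general)           # low-sum frames are retained
    return keep.astype(int)

def classify(frame_loglikes, keep):
    """Score each dialect over the retained frames only."""
    mask = keep.astype(bool)
    return int(np.argmax(frame_loglikes[mask].sum(axis=0)))
```

Frames that every dialect model scores highly (frequent, shared phones) are exactly the ones this rule discards, matching the argument above.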
IV. EXPERIMENTS
In the experimental evaluation, the proposed algorithms are
evaluated on two dialect corpora: Arabic and Spanish dialects.
It is emphasized here that no associated transcripts for training or test data are employed for any of the dialects. All data used in
these experiments are spontaneous speech, since earlier evaluations on dialect ID showed that read speech contains limited
dialect dependent structure [16], [19]. Mel-frequency cepstral
coefficients (MFCCs) are used here, consisting of a total of 26
coefficients, with log energy and 12 MFCCs plus their delta coefficients for each frame. The frame length is 20 ms, with an
overlap skip rate of 50% (10 ms). During test, all audio files are
partitioned into short utterances. The length of each test utterance is 10 s in duration. The final classification performance is
the average over all test utterances. Finally, 256 mixtures were
used in the GMM dialect model, as well as anti-dialect model in
all evaluations since the number of the mixtures does not influence the performance significantly in dialect classification [19].
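As a rough sketch of the MLE-GMM baseline decision rule (not the authors' implementation), a diagonal-covariance GMM can score a test utterance by its average frame log-likelihood and pick the best dialect. The function names and the (weights, means, variances) model representation are assumptions for illustration.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of frames X (T, F) under a diagonal-covariance GMM."""
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2           # (T, K, F)
    log_comp = -0.5 * (np.log(2 * np.pi * variances)[None]
                       + diff2 / variances[None]).sum(axis=-1)  # (T, K)
    log_comp += np.log(weights)[None]                           # add log mixture weights
    m = log_comp.max(axis=1, keepdims=True)                     # log-sum-exp over mixtures
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def classify(X, dialect_models):
    """Pick the dialect whose GMM yields the highest average frame log-likelihood."""
    scores = {name: diag_gmm_loglik(X, *gmm).mean() for name, gmm in dialect_models.items()}
    return max(scores, key=scores.get)
```

`dialect_models` here maps a dialect label to its trained `(weights, means, variances)` arrays; each 10-s utterance is classified independently.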
A. Dialect Corpora: Arabic and Spanish
The first corpus employed consists of Arabic dialect data
from five different regions, including United Arab Emirates
TABLE I
A SUMMARY OF MALE ARABIC DIALECT DATA EMPLOYED IN THIS STUDY (AFTER SILENCE FRAME REMOVAL)

TABLE II
A SUMMARY OF FEMALE SPANISH DIALECT DATA EMPLOYED IN THIS STUDY (AFTER SILENCE FRAME REMOVAL)

TABLE III
A SUMMARY OF COMBINED MALE AND FEMALE CHINESE CORPUS EMPLOYED IN THIS STUDY (AFTER SILENCE FRAME REMOVAL)
Since several combinations of parameters are considered in the following experimental results, a further development data set would be needed to assess the best parameter settings in a real application, though the results here identify reasonable parameter values. No development data is set aside for parameter optimization, in order to employ all available data for either train or test, due to the size limitation of the corpora.
(UAE), Egypt, Iraq, Palestine, and Syria. The full set of 250
sessions (500 speakers) make up the corpus, which represents
the larger of the two dialect corpora considered. There are 100
speakers in each dialect, balanced between male/female gender.
Each session consists of two speakers completing four combined conversational recordings. Each conversational recording
contains four selected topics from a list of 12 preselected topics,
such as the weather, shopping, travel, and other common topics.
The topics were chosen according to the ease for speakers in
conversation, but also with the aim of achieving, as much as
possible, an equal distribution across all topics for the final
database. The majority of speakers were asked to discuss
weather as one of their four topics. A lapel microphone is used
in conversational recording for each speaker per conversation.
The gender, topic and signal-to-noise ratio (SNR) are labeled
for each recording per conversation. To avoid the influence of noise in this study on dialect ID, all recordings not meeting a minimum SNR were set aside, so that only a defined range (e.g., "clean") of dialect data would be used in the study. This was done to minimize non-dialect variability, since addressing such differences is suggested for future research. Since long periods of silence do exist within the audio, a silence remover based on an overall energy measure is applied to eliminate low-energy silence frames. Three dialects are selected for use in the study based on geographical origins: UAE (AE), Egypt (EG), and Syria (SY). To avoid the influence of gender, only male data is used for train and test. Table I summarizes the train and test data employed after silence removal (all speakers in Table I are male).
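The energy-based silence remover described above can be sketched as follows. The frame sizes match the 20-ms/10-ms setup used in this study, while the relative threshold `rel_db` and the function name are assumed for illustration, not values from the paper.

```python
import numpy as np

def remove_silence(signal, sr, frame_ms=20, skip_ms=10, rel_db=-30.0):
    """Drop frames whose short-time energy falls below a threshold set
    relative to the file's peak frame energy (rel_db, in dB, is assumed)."""
    flen, fskip = int(sr * frame_ms / 1000), int(sr * skip_ms / 1000)
    starts = range(0, len(signal) - flen + 1, fskip)
    frames = np.stack([signal[s:s + flen] for s in starts])
    # Short-time log energy per frame; small constant avoids log(0)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    keep = energy_db > energy_db.max() + rel_db
    return frames[keep]
```

Frames that survive the energy test are then passed on to feature extraction; everything below the relative floor is treated as silence.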
The second corpus employed is based on the MIAMI corpus described in [3], and also employed in [19]. Dialect speech from Cuba, Peru, and Puerto Rico (PR) is used in this paper. Here, the spontaneous speech portion, which consists of spoken answers to questions prompted by interviewers, serves as the material for both training and testing (i.e., only open test results are reported). All subjects used a close-talk headset microphone to record speech. In a manner similar to the Arabic corpus evaluation, silence removal is performed prior to training. Only female speakers are considered in the evaluation. Table II summarizes the number of speakers and speech data duration for train and test after silence removal.

B. Additional Corpus Evaluation: Chinese Corpus
To further evaluate the FSD algorithm, a Chinese corpus is also used in the present study. The corpus includes three Chinese sub-languages: Mandarin, Cantonese, and Xiang. All data in this corpus consist of spontaneous noise-free speech. Both male and female subjects are used in this phase of the study. Table III summarizes the train and test data employed after silence removal. One reason for using this corpus in the evaluation of the FSD algorithm is that, although these sub-languages are not true dialects of a common language, all three have similar grammar and text, and all speakers from the three sub-languages live in one country, which potentially allows more contact with country-specific content (e.g., weather, politics, etc.). Although the native sub-languages of the Chinese speakers are different, most can partially understand each other. As such, communication among these speakers with different sub-languages is common, and helps assess the improved discrimination of the FSD algorithm for the dialect case. Another reason to employ this Chinese corpus for the evaluation of the FSD algorithm is that the size of the corpus is sufficiently large that performance can be assessed with greater confidence.
C. Distribution of KL2 Divergence Between Mixtures
In the KLD algorithm, the key step is the ability to correctly tag discriminative and general mixtures. The tagging process is based on the KL2 divergence range of the matched mixture pairs, i.e., the KL2 distance between each mixture of the dialect model and its paired mixture in the anti-dialect model. The mixture pairs represent a form of projection map from the mixtures of the dialect model to the mixtures of the anti-dialect model; for each pair, the mixture of the dialect model and the mixture of the anti-dialect model are considered to be a matched pair. For example, the KL2 divergence range of the dialect and anti-dialect models from the Arabic corpus is illustrated in Fig. 6. Note that the final 15 element pairs between dialect and anti-dialect pdf pairs are too large to be included in the figure. The x-axis represents the pdf index of the GMM, ranging from 1 to 256. Here, the KL2 divergence values are small or close to zero for the first half of the pairs (i.e., index ranging from 1 to 150), which suggests that these pdfs are in fact linked and similar. Alternatively, the KL2 divergence values become progressively
Fig. 6. Distribution of the KL2 divergence range of the dialect and anti-dialect models from the Arabic corpus.
larger, with a rapid increase as the index approaches 256. Therefore, it is suggested that pairs with large divergences (e.g., pairs 175-256) in the tail of the distribution range are mismatched. All these mismatched mixtures (e.g., pairs 175-256) should be included as discriminative mixtures in this case. In addition, the KL2 divergence of each matched pair is compared with all elements in the i-th row of the KL2 divergence matrix defined in Section III-A. Fig. 7 shows, for the UAE dialect, the ratio of the matched-pair divergence to the maximum value in the i-th row of the matrix, with the exception of the self-test value. It is noted that similar distribution results are achieved for the other dialects of this language. If the ratio is greater than 1, then the matched-pair divergence is the largest entry in the i-th row of the matrix and the pair will be the most matched pair among all candidates; the larger the ratio, the more matched the mixture pair is. Alternatively, if the ratio is much smaller than 1, then there exists another mixture better matched to the dialect mixture than its paired anti-dialect mixture, which means the pair is a mismatch. In Fig. 7, the left portion of the distribution shows ratios which are large and should be considered matched pairs, while the right side of the range has lower ratios and is therefore mismatched. To retain the majority of the mismatched mixtures, a relative threshold N_r in the range from 0 to 0.3 is used in the following experiments, where at least 70% of the mixtures in the GMM are retained as discriminative mixtures.
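A minimal sketch of the mixture-tagging step, assuming diagonal-covariance mixtures with the closed-form symmetric KL divergence: matched pairs with the smallest KL2 values are set aside as general, as described above. The function names and the `n_r` parameterization are assumptions, not the paper's notation.

```python
import numpy as np

def kl2_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Symmetric KL (KL2) divergence between two diagonal-covariance Gaussians."""
    def kl(mu_a, var_a, mu_b, var_b):
        # Closed-form KL(N_a || N_b) for diagonal covariances
        return 0.5 * np.sum(np.log(var_b / var_a)
                            + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)
    return kl(mu_p, var_p, mu_q, var_q) + kl(mu_q, var_q, mu_p, var_p)

def tag_mixtures(pairs, n_r):
    """Tag matched mixture pairs: set aside the n_r fraction with the smallest
    KL2 values as general (0); keep the rest as discriminative (1).

    pairs: list of (mu_dialect, var_dialect, mu_anti, var_anti) matched pairs.
    """
    d = np.array([kl2_diag_gauss(*p) for p in pairs])
    ranks = np.argsort(np.argsort(d))                # ascending rank of each pair
    return (ranks >= n_r * len(pairs)).astype(int)   # 1 = discriminative
```

With `n_r` in [0, 0.3], at most 30% of the pairs (those most similar between dialect and anti-dialect model) are tagged general, mirroring the threshold range used in the experiments.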
Fig. 7. The distribution of the ratio of the matched-pair KL2 divergence to the maximum value in the i-th row of the matrix, except the self term, for the UAE dialect.
D. Differences Between Language and Dialect in the Distribution of KL2 Divergence

Compared to languages, dialects are much closer to each other in general. Fig. 8 illustrates the assessed difference between languages and dialects using the distribution of the KL2 divergence of the mixtures. Here, a direct comparison is performed for the KL2 divergence of the mixture pairs, where GMMs from any two of the three Arabic dialects are first employed, followed by individual Arabic dialects compared with Spanish. The number of discriminative mixtures is expected to be higher for the Arabic-dialect-to-Spanish comparison than for pairwise Arabic dialect pairs. Arabic dialects are used as the dialect samples, with a total of 58 male speakers and 99 minutes, along with spontaneous Spanish (SP) data from the MIAMI corpus as the comparative language sample. A representative sampling of all three dialects from the MIAMI corpus was used to train the Spanish GMM model. MLE-GMM models with 256 mixtures are trained for the Arabic dialects as well as the Spanish language. The KL2 divergence pair ranges, with their matched mixture pairs, are generated from any two Arabic dialect pairs (AE and EG, EG and SY, AE and SY), or from any single Arabic dialect and the overall Spanish data (AE and SP, EG and SP, SY and SP). Since the dialect mismatch is in the tail of the mixture pairs illustrated in Fig. 6, only the first 200 elements of the KL2 divergence ranges are shown in Fig. 8. The elements of the KL2 divergence for dialect pairs are smaller, with a gradual and smooth increase, as shown in Fig. 8, while the elements of the KL2 divergence between individual Arabic dialects and Spanish are larger, with a more significant rate of increase as the index approaches 200. Therefore, mixture pairs between dialects are a better match, while a larger mismatch exists for most mixture pairs between a dialect entry and a separate language. Equivalently, more confusion, or overlap regions, exist between dialect pairs versus a dialect-new-language pair. Based on these dialect-versus-language traits, it is more important to reduce the impact of neutral mixtures in dialect pair classification as opposed to a dialect-new-language pair.

E. Evaluation on Arabic and Spanish Dialect Corpora

Having illustrated the distribution of the KL2 divergence range between dialects, as well as the distance between a single dialect and an alternative language, the focus now shifts to an evaluation of the proposed algorithms over two dialect corpora: Arabic and Spanish. Fig. 9 shows the classification accuracy of the proposed KLD-GMM algorithm, varying the relative divergence threshold from 0 to 30%. The upper bound constant of the KL2 divergence is set to 0.006 in both dialect evaluations. The x-axis of both subfigures is the relative threshold N_r, which reflects the percentage of general mixtures which are
Fig. 8. Distribution of the KL2 divergence ranges of dialects and languages.
The ranges from two Arabic dialects and the ranges from one Arabic dialect
and Spanish are shown to illustrate the different distribution between dialects
and languages.
set aside. When the relative threshold N_r is 0, all mixtures are tagged as discriminative mixtures and the KLD-GMM reduces to the baseline GMM (MLE-GMM) framework. The y-axis of all subfigures is the dialect classification accuracy (%), with 33.3% as chance for the three-way dialect task. The best performance is achieved at the N_r settings indicated in Fig. 9 for the Arabic and Spanish dialects. Tables IV and V summarize the best performance of KLD-GMM from Fig. 9, where the performance of the baseline (MLE-GMM) system is 67.7% for the three-way Arabic dialects and 75.8% for the Spanish dialects. The best KLD-GMM performance yields relative error reductions over the baseline of 12.4% for Arabic and 12.0% for Spanish dialects. In Fig. 9(a),
there is some performance variability when the threshold N_r is in the range 0%-6%, suggesting that a small reduction in the existing GMM mixture space may not suppress a particular dialect-related pdf from the model. This is expected, since the particular dialect GMM begins with 256 mixtures, so a 1%-5% mixture reduction effectively removes 2-12 pdfs, while each phone is expected to be modeled by 2-5 mixtures. When the threshold is greater than 20%, performance begins to decrease, suggesting that setting aside too many mixtures, which are more likely discriminative mixtures, will impact performance; a balance must therefore be achieved. As determined from these evaluations, this balance should lie between 6% and 20%, where the performance increases with only slight fluctuations. In Fig. 9(b), similar fluctuations appear for small thresholds (less than 14%).
Since the size of the Spanish corpus is smaller, model adaptation is employed and the simplified KLD-GMM algorithm is compared to the KLD-GMM algorithm (note, the simplified KLD-GMM method was presented at the end of Section II). First, the UBM is trained using all available training data in the Spanish dialect corpus, which includes 77 minutes from 44 speakers. The dialect and anti-dialect models, including the weight, mean, and variance parameters, are adapted from the UBM via MAP adaptation (shown also in Table V). A new evaluation is performed for Spanish dialect ID. Fig. 10 shows the
Fig. 9. Dialect classification accuracy of the KLD-GMM algorithm for a three-way test with two corpora: Arabic and Spanish. The x-axis is the relative threshold N_r; the y-axis is classification accuracy, with 33.3% as chance. (a) Three-way dialect classification accuracy over the Arabic corpus. (b) Three-way dialect classification accuracy over the Spanish corpus.
Fig. 10. Classification accuracy of the KLD-GMM and simplified KLD-GMM algorithms on the three-way Spanish dialect corpus. The x-axis is the relative threshold N_r; the y-axis is the classification accuracy. All models used are adapted from the UBM model by using MAP adaptation.
classification accuracy of the proposed KLD-GMM algorithm on the Spanish dialect corpus, varying the relative divergence threshold N_r from 0% to 30%. All models used in the evaluation, including dialect and anti-dialect models, are adapted from the UBM. The baseline system is the MAP-GMM instead of the MLE-GMM. The simplified KLD-GMM algorithm is also evaluated, with the resulting performance virtually the same as the KLD-GMM algorithm, which suggests that the simplified KLD-GMM can achieve the same performance as the KLD-GMM when the models are adapted from the UBM, and that the pair sequence from the KLD-GMM algorithm is the same as that from the simplified KLD-GMM algorithm. From Fig. 10, Spanish dialect ID performance fluctuates over the N_r range from 0% to 6%, the same as observed for the Arabic corpus. The best performance of 83.3% is achieved at the N_r value shown in Fig. 10. From Table V, the performance of the baseline system (MAP-GMM) is 76.4%, with a best absolute performance improvement of 6.9% and a relative error reduction over the baseline (MAP-GMM) of 29.2%.
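The MAP adaptation step can be sketched with the standard mean-only relevance-MAP recipe of Reynolds et al. [28]. This paper adapts the weights and variances as well, which this sketch omits, and the relevance factor `r = 16` is an assumed, commonly used setting, not a value from the paper.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """Relevance-MAP adaptation of UBM means toward data X (mean-only sketch)."""
    # Posterior responsibility of each mixture for each frame
    diff2 = (X[:, None, :] - means[None]) ** 2
    logp = -0.5 * (np.log(2 * np.pi * variances)[None]
                   + diff2 / variances[None]).sum(axis=-1)
    logp += np.log(weights)[None]
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (T, K)
    n_k = gamma.sum(axis=0)                                   # soft counts per mixture
    ex_k = (gamma.T @ X) / np.maximum(n_k[:, None], 1e-12)    # data-driven means
    alpha = (n_k / (n_k + r))[:, None]                        # adaptation coefficient
    return alpha * ex_k + (1.0 - alpha) * means               # interpolate with UBM
```

Mixtures with many assigned frames move toward the dialect data, while rarely observed mixtures stay close to the UBM, which is what makes MAP adaptation attractive for a small corpus.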
Having illustrated mixture selection in the KLD-GMM
algorithm, frame selection is now considered. Fig. 11 shows
TABLE IV
SUMMARY OF THREE-WAY DIALECT CLASSIFICATION PERFORMANCE
ON THE ARABIC DIALECT CORPUS
TABLE V
SUMMARY OF THREE-WAY DIALECT CLASSIFICATION PERFORMANCE
ON THE SPANISH DIALECT CORPUS
the classification accuracy of the frame selection-based FSD-GMM algorithm, varying the frame relative threshold M_r. For the Spanish dialect corpus, both MLE-GMM and MAP-GMM systems are used as baselines since the size of the corpus is small. The x-axis of all subfigures is the frame relative threshold M_r, while the y-axis shows classification accuracy (%). When the frame relative threshold M_r is 0, the FSD-GMM reduces to the baseline methods. The best performance is achieved at the M_r settings indicated in Fig. 11 for the Arabic and Spanish dialects with each baseline. Tables IV and V summarize the best performance of FSD-GMM from Fig. 11: measurable absolute improvements are obtained for Arabic dialects versus the MLE-GMM baseline and for Spanish dialects versus both the MLE-GMM and MAP-GMM baseline systems. The relative error reductions over the baselines are 12.7% on Arabic dialects, and 35.5% and 36.4% on Spanish dialects versus the MLE-GMM and MAP-GMM baseline systems, respectively. To assess which frames are actually being set aside to achieve this performance improvement, an example is illustrated in Fig. 12, which shows frames set aside in a test file sample from the Arabic dialect corpus. The x-axis shows a group of 150 frames set aside from an input 10-s test file (1000 total frames), which represents an FSD relative frame threshold of M_r = 15%; the y-axis is the time index/location of the frames. Here, vertical movement reflects frames that are retained. If there are longer contiguous frame sets in this plot, it suggests that the frame rejection is not random, but associated with speech types over time. From Fig. 12, it is clear that the frames which are set aside are divided into six major groups which represent six phones over time. Since the likelihoods of these frames are high, it is also known that the frequency of occurrence of these phones is high, which means the frames representing these phones are dialect-recessive frames.
In addition, the combination of the KLD-GMM and FSD-GMM algorithms, named KLD-FSD-GMM, is evaluated on the Arabic dialect corpus. The system parameters N_r and M_r are set to the values which achieved the best individual performances for KLD-GMM and FSD-GMM. Dialect ID performance is 76.4%, representing an 8.7% absolute improvement over the baseline system (this is also shown in Table IV). This confirms that the two algorithms address the suppression of
Fig. 11. Classification accuracy of FSD-GMM on the Arabic and Spanish corpora. The x-axis is the frame relative threshold M_r for the FSD algorithm. The baseline is the MLE-GMM algorithm on the Arabic corpus, and both MLE-GMM and MAP-GMM are used as baselines on the Spanish corpus. (a) Classification accuracy on the Arabic corpus with the MLE-GMM baseline. (b) Classification accuracy on the Spanish corpus with the MLE-GMM baseline. (c) Classification accuracy on the Spanish corpus with the MAP-GMM baseline.
Fig. 12. Example of the FSD-GMM algorithm on the Arabic dialect corpus. These are the frames which are set aside from a 10-s test file. Six phones are set aside, and the frames representing these phones are identified as dialect-recessive frames.
the common acoustic space between dialects in different ways,
since the combination outperforms the individual algorithms
for Arabic.
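A sketch of how the two selections combine at scoring time, under the assumption that per-frame, per-mixture weighted log-likelihoods are available for a dialect model; the function name and argument layout are illustrative, not the authors' implementation.

```python
import numpy as np

def kld_fsd_score(frame_mix_loglik, mixture_tags, frame_tags):
    """Score one dialect model using discriminative mixtures (KLD) on
    discriminative frames (FSD) only.

    frame_mix_loglik: (T, K) array of log(w_k) + log N(x_t | mixture k).
    mixture_tags: (K,) 1 = discriminative mixture, 0 = general.
    frame_tags:   (T,) 1 = discriminative frame, 0 = general.
    """
    sub = frame_mix_loglik[np.ix_(frame_tags.astype(bool),
                                  mixture_tags.astype(bool))]
    # log-sum-exp over retained mixtures, averaged over retained frames
    m = sub.max(axis=1, keepdims=True)
    return float((m.ravel() + np.log(np.exp(sub - m).sum(axis=1))).mean())
```

The dialect with the highest combined score is selected, so the general mixtures and general frames are excluded from the decision in one pass.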
Fig. 13. Classification accuracy of FSD-GMM on the Chinese language corpus. The x-axis is the frame relative threshold M_r for the FSD algorithm. The baseline is the MLE-GMM algorithm.
F. Probe Evaluation of FSD Algorithm on Chinese Material
For a final evaluation, a three-way classification task is considered using data from a Chinese language corpus (Mandarin, Cantonese, and Xiang). Fig. 13 shows the classification accuracy of the frame selection-based FSD-GMM algorithm, varying the frame relative threshold M_r, on the three-way Chinese language corpus. The MLE-GMM system is used as the baseline system. The x-axis of the figure shows the frame relative threshold M_r, and the y-axis shows classification accuracy (%). When the frame relative threshold M_r is 0, the FSD-GMM system reduces to the baseline. From Fig. 13, the performance of the baseline system is 81.2%, and the resulting improvement achieved is, as expected, significant. The best performance occurs at the M_r setting shown in Fig. 13, with a relative error reduction over the baseline of 18.1%, confirming that the proposed algorithms are also appropriate for classification of related sub-languages.
V. CONCLUSION
Speech from distinct dialects of a language can be separated
into dialect sensitive and dialect neutral parts, which are represented by discriminative mixtures and general mixtures in
the GMM model, as well as discriminative frames and general
frames in test files. Due to the similarity of dialects, there will
be more neutral content between dialects versus languages.
The neutral content, represented as the distractive/confusing
region of the dialects, can be reduced or excluded via frame
or mixture selection. In this paper, a training algorithm (the
Gaussian mixture selection by KL2 divergence) and a testing
algorithm (frame selection decoding algorithm) have been
proposed and developed for text-independent dialect classification, which means the dialect label for the data is known
but no text transcripts are available. The algorithms have
focused on emphasizing those mixtures (KLD-GMM) and
frames (FSD-GMM) which are more dialect sensitive, and
de-emphasizing those which are dialect neutral. The three-way
dialect classification algorithms were evaluated on two different
size corpora from two languages, with an MLE trained GMM
system used as baseline. In addition, a MAP trained GMM
system was also used as an alternative baseline for the Spanish
corpus due to the limited size of that corpus. The KLD and FSD
algorithms achieved measurable and significant performance
improvement over the baseline system. The combination of
KLD and FSD achieves further performance improvement for
the Arabic corpus, but no additional gain for Spanish dialects.
In conclusion, the proposed algorithms achieve an 8.7% absolute improvement and a 26.9% relative error reduction (from KLD-FSD-GMM) on the Arabic dialect corpus against the MLE-GMM baseline, and an 8.6% absolute improvement and a 36.4% relative error reduction (from FSD-GMM) on the Spanish dialect corpus against the MAP-GMM baseline system.
Therefore, the proposed algorithms have been shown to be effective for dialects of Arabic and Spanish, and are promising for generalization to dialects of other languages. Another strength of the proposed algorithms is their low implementation complexity, which makes it easy to fall back to more traditional operating conditions without changing the fundamental algorithm structure or existing dialect models. These findings confirm the effectiveness of improved mixture selection within the GMM and frame selection during decoding for GMM-based dialect classification. The premise of suppressing common dialect acoustic subareas, while maintaining discriminative regions, represents the primary advancement shown.
The algorithms developed in this study have led to advancements in the area of dialect identification. While these algorithms have resulted in effective performance for Arabic, Spanish, and Chinese dialects (or related sub-languages), we do not claim that the resulting solution is the optimal or final contribution in dialect classification. Other strategies could be
considered in order to improve overall dialect classification
rates, such as, factor analysis [23], CMLLR [25], MMI, and
VTLN. It is suggested that these be considered in future studies,
in the context of the algorithms developed in the current study.
Dialect ID is a challenging research topic, with issues relating
to uniqueness, knowledge of ground truth of the speakers,
and separation of dialects/languages. Future studies could also
leverage further knowledge of linguistics across the dialect
under evaluation as well.
REFERENCES
[1] V. Gupta and P. Mermelstein, “Effect of speaker accent on the performance of a speaker-independent, isolated word recognizer,” J. Acoust.
Soc. Amer., vol. 71, pp. 1581–1587, 1982.
[2] C. Huang, T. Chen, S. Li, E. Chang, and J. L. Zhou, “Analysis of
speaker variability,” in Interspeech’01, Aalborg, Denmark, 2001, pp.
1377–1380.
[3] M. A. Zissman, T. P. Gleason, D. M. Rekart, and B. L. Losiewicz, “Automatic dialect identification of extemporaneous conversational, Latin
American Spanish speech,” in Proc. ICASSP’96, Atlanta, GA, 1996,
vol. 2, pp. 777–780.
[4] L. M. Arslan and J. H. L. Hansen, “Language accent classification in
American English,” Speech Commun., vol. 18, pp. 353–367, 1996.
[5] L. M. Arslan and J. H. L. Hansen, “A study of temporal features and frequency characteristics in American English foreign accent,” J. Acoust.
Soc. Amer., vol. 102, pp. 28–40, 1997.
[6] W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky,
and W. Byrne, “Lexicon adaptation for LVCSR: Speaker idiosyncracies, non-native speakers, and pronunciation choice,” in Proc. ISCA
Workshop Pronunciat. Modeling Lexicon Adaptat., 2002, pp. 83–88.
[7] M. K. Liu, B. Xu, T. Y. Huang, Y. G. Deng, and C. R. Li, “Mandarin accent adaptation based on context-independent/ context-dependent pronunciation modeling,” in Proc. ICASSP’00, Istanbul, Turkey, 2000, vol.
2, pp. 1025–1028.
[8] J. J. Humphries and P. C. Woodland, “The use of accent-specific
pronunciation dictionaries in acoustic model training,” in Proc.
ICASSP’98, Seattle, WA, 1998, vol. 1, pp. 317–320.
[9] V. Diakoloukas, V. Digalakis, L. Neumeyer, and J. Kaja, “Development of dialect-specific speech recognizers using adaptation methods,”
in Proc. ICASSP’97, Munich, Germany, 1997, vol. 2, pp. 1455–1458.
[10] B. Zhou and J. H. L. Hansen, “Speechfind: An experimental on-line
spoken document retrieval system for historical audio archives,”
in Proc. Interspeech-02/ICSLP-02, Denver, CO, 2002, vol. 2, pp.
1969–1972.
96
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 1, JANUARY 2011
[11] S. Gray and J. H. L. Hansen, “An integrated approach to the detection
and classification of accents/dialects for a spoken document retrieval
system,” in Proc. IEEE Workshop Autom. Speech Recognition Understanding, 2005, vol. 2, pp. 35–40.
[12] A. S. Kroch, "Toward a theory of social dialect variation," in Language in Society. Cambridge, U.K.: Cambridge Univ. Press, 1978, vol. 7, pp. 17–36.
[13] L. Arslan and J. H. L. Hansen, “Selective training for hidden Markov
models with applications to speech classification,” IEEE Trans. Speech
Audio Process., vol. 7, no. 1, pp. 46–54, Jan. 1999.
[14] L. R. Yanguas, G. C. O’Leary, and M. A. Zissman, “Incorporating linguistic knowledge into automatic dialect identification of Spanish,” in
ICSLP’98, Sydney, Australia, 1998.
[15] R. Huang and J. H. L. Hansen, “Dialect/accent classification via
boosted word modeling,” in ICASSP’05, Philadelphia, PA, 2005, vol.
1, pp. 585–588.
[16] R. Huang and J. H. L. Hansen, “Advances in word based dialect/accent
classification,” in Proc. Interspeech’05, Lisbon, Portugal, 2005, vol. 1,
pp. 2241–2244.
[17] P. A. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds, “Dialect identification using Gaussian mixture models,” in Proc. Odyssey:
Speaker Lang. Recog. Work., Toledo, Spain, 2004.
[18] R. Huang and J. H. L. Hansen, “Gaussian mixture selection and data
selection for unsupervised Spanish dialect classification,” in Proc. Interspeech’06, Pittsburgh, PA, 2006, pp. 445–448.
[19] R. Huang and J. H. L. Hansen, “Unsupervised discriminative training
with application to dialect classification,” IEEE Trans. Audio, Speech,
Lang. Process., vol. 15, no. 8, pp. 2444–2453, Nov. 2007.
[20] G. Choueiter, G. Zweig, and P. Nguyen, “An empirical study of automatic accent classification,” in Proc. ICASSP’08, Las Vegas, NV, 2008,
vol. 1, pp. 4265–4268.
[21] P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, “Brno university
of technology system for NIST 2005 language recognition evaluation,”
in Proc. IEEE Odyssey 2006: Speaker Lang. Recognition Workshop,
2006, vol. 1, pp. 1–7.
[22] C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, “Channel
factors compensation in model and feature domain for speaker recognition,” in Proc. IEEE Odyssey 2006: Speaker Lang. Recognition Workshop, 2006, vol. 4, pp. 1–6.
[23] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling
with sparse training data,” IEEE Trans. Speech Audio Process., vol. 13,
no. 3, pp. 345–354, May 2005.
[24] R. Vogt, B. Baker, and S. Sridharan, "Modeling session variability in text-independent speaker verification," in Proc. Interspeech'05, Lisbon, Portugal, 2005, pp. 3117–3120.
[25] W. Shen and D. Reynolds, “Improved GMM-based language recognition using constrained MLLR transforms,” in Proc. ICASSP’08, Las
Vegas, NV, 2008, pp. 4149–4152.
[26] R. Huang and J. H. L. Hansen, “Dialect classification on printed text
using perplexity measure and conditional random fields,” in Proc.
ICASSP’07, Honolulu, HI, 2007, vol. 4, pp. 993–996.
[27] S. Kullback, Information Theory and Statistics. Mineola, NY: Dover.
[28] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Process., vol.
10, pp. 72–83, 2000.
[29] A. Dempster, M. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., vol. 39, pp.
1–38, 1977.
[30] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech
Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[31] M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Trans. Speech Audio
Process., vol. 4, no. 1, pp. 31–44, Jan. 1996.
[32] W. M. Campbell, F. Richardson, and D. A. Reynolds, “Language
recognition with word lattices and support vector machines,” in Proc.
ICASSP’07, Honolulu, HI, 2007, pp. 989–992.
[33] M. Mehrabani and J. H. L. Hansen, “Dialect separation assessment
using log-likelihood score distributions,” in Proc. Interspeech’08, Brisbane, Australia, 2008, pp. 747–750.
[34] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler
divergence between Gaussian mixture models,” in Proc. ICASSP’07,
Honolulu, HI, 2007, pp. 317–320.
Yun Lei (S’07) received the B.S. degree in electrical
engineering from Nanjing University, Jiangsu, China,
in 2003 and the M.S. degree in electrical engineering
from Institute of Acoustics, Chinese Academy of Science (CAS), Beijing, China, in 2006. He is currently
pursuing the Ph.D. degree in electrical engineering at
the University of Texas at Dallas, Richardson.
He has been a Research Engineer at the Center for
Robust Speech Systems (CRSS), University of Texas
at Dallas.
John H.L. Hansen (S’81–M’82–SM’93–F’07)
received the B.S.E.E. degree from the College of
Engineering, Rutgers University, New Brunswick,
NJ, in 1982 and the M.S. and Ph.D. degrees in
electrical engineering from the Georgia Institute of
Technology, Atlanta, in 1983 and 1988.
He joined the Erik Jonsson School of Engineering
and Computer Science, University of Texas at Dallas
(UTD), Richardson, in the fall of 2005, where he is
Professor and Department Head of Electrical Engineering and holds the Distinguished University Chair
in Telecommunications Engineering. He also holds a joint appointment as Professor in the School of Behavioral and Brain Sciences (Speech and Hearing).
At UTD, he established the Center for Robust Speech Systems (CRSS) which
is part of the Human Language Technology Research Institute. Previously, he
served as Department Chairman and Professor in the Department of Speech,
Language, and Hearing Sciences (SLHS) and Professor in the Department of
Electrical and Computer Engineering, at the University of Colorado, Boulder
(1998–2005), where he cofounded the Center for Spoken Language Research.
In 1988, he established the Robust Speech Processing Laboratory (RSPL) and
continues to direct research activities in CRSS at UTD. His research interests
span the areas of digital speech processing, analysis, and modeling of speech and
speaker traits, speech enhancement, feature estimation in noise, robust speech
recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human–computer interaction. He has supervised
50 (22 Ph.D., 28 M.S./M.A.) thesis candidates. He is author/coauthor of 352
journal and conference papers and eight textbooks in the field of speech processing and language technology, coauthor of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), coeditor of DSP for In-Vehicle
and Mobile Systems (Springer, 2004), Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2006), and lead author
of the report "The Impact of Speech Under 'Stress' on Military Speech Technology" (NATO RTO-TR-10, 2000).
Prof. Hansen was named IEEE Fellow in 2007 for contributions in "Robust Speech
Recognition in Stress and Noise," and is currently serving as a Member of
the IEEE Signal Processing Society Speech Technical Committee (2005–2008;
2010–2013; elected Chair-elect in 2010), and Educational Technical Committee
(2005–2008; 2008–2010). Previously, he has served as Technical Advisor to a
U.S. Delegate for NATO (IST/TG-01), IEEE Signal Processing Society Distinguished Lecturer (2005/06), Associate Editor for the IEEE TRANSACTIONS ON
SPEECH AND AUDIO PROCESSING (1992–1999), Associate Editor for the IEEE
SIGNAL PROCESSING LETTERS (1998–2000), and Editorial Board Member for
the IEEE Signal Processing Magazine (2001–2003). He has also served as a
Guest Editor of the October 1994 special issue on Robust Speech Recognition for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He has
served on the Speech Communication Technical Committee for the Acoustical
Society of America (2000–2003), and is serving as a member of the International Speech Communication Association (ISCA) Advisory Council. He was
recipient of the 2005 University of Colorado Teacher Recognition Award as
voted by the student body. He also organized and served as General Chair for
ICSLP/Interspeech-2002: International Conference on Spoken Language Processing, September 16–20, 2002, and has served as Co-Organizer and Technical
Program Chair for the IEEE ICASSP-2010, Dallas, TX.