
Speech Communication 78 (2016) 19–33
Unsupervised accent classification for deep data fusion of accent and
language information
John H.L. Hansen∗, Gang Liu
Center for Robust Speech Systems (CRSS), University of Texas at Dallas, Richardson, TX, USA
Received 26 November 2014; received in revised form 13 December 2015; accepted 14 December 2015
Available online 11 January 2016
Abstract
Automatic Dialect Identification (DID) has recently gained substantial interest in the speech processing community. Studies have shown
that the variation in speech due to dialect is a factor which significantly impacts speech system performance. Dialects differ in various ways
such as acoustic traits (phonetic realization of vowels and consonants, rhythmical characteristics, prosody) and content based word selection
(grammar, vocabulary, phonetic distribution, lexical distribution, semantics). The traditional DID classifier is usually based on Gaussian Mixture Modeling (GMM), which is employed as the baseline system here. We investigate various methods of improving DID based on acoustic and text language sub-systems to further boost performance. For the acoustic approach, we propose to use an i-Vector system. For text language based dialect classification, a series of natural language processing (NLP) techniques are explored to address word selection and grammar factors, which cannot be modeled using an acoustic modeling system. These NLP techniques include two traditional approaches, N-Gram modeling and Latent Semantic Analysis (LSA), and a novel approach based on Term Frequency–Inverse Document Frequency (TF-IDF) and logistic regression classification. Due to the sparsity of training data, the traditional text approaches do not offer superior performance. However, the proposed TF-IDF approach shows performance comparable to the i-Vector acoustic system, and when fused with the i-Vector system it results in a final audio-text combined solution that is more discriminative. Compared with the GMM baseline system, the proposed audio-text DID system provides a relative improvement in dialect classification performance of +40.1% and +47.1% on the self-collected corpus (UT-Podcast) and NIST LRE-2009 data, respectively. The experimental results validate the feasibility of leveraging both acoustic and textual information to achieve improved DID performance.
© 2015 Elsevier B.V. All rights reserved.
Keywords: NLP; TF-IDF; Accent classification; Dialect identification; UT-Podcast.
1. Introduction
Automatic Dialect Identification (DID)/Classification has
recently gained significant interest in the speech processing
community (Hansen et al., 2004; Torres-Carrasquillo, 2004;
Ma et al., 2006; Li et al., 2007; Biadsy et al., 2009; Hansen
et al., 2010; Liu et al., 2011; Liu et al., 2012; Sangwan and
Hansen, 2012; William et al., 2013; Zhang et al., 2014). For
dialects of a language, using related material such as lexicons, audio, and text can help, but ground truth knowledge is
critical, especially if there is a potential for code-switching between dialects (in linguistics, code-switching is the practice of alternating between two or more languages, or language varieties, within a single conversation; see Muysken, 1995). The ability to leverage additional signal-dependent information within the speech audio stream can help
improve overall speech system performance (i.e., the use of
“Environmental Sniffing” to characterize noise (Akbacak and
Hansen, 2007); similarities between classes such as in-set/out-of-set recognition (Angkititrakul and Hansen, 2007); or content based text structure based on latent semantic analysis
(Bellegarda, 2000)). In a related domain, DID is important
for characterizing speaker traits (Arslan and Hansen, 1997)
and can help improve speaker verification systems as well.
In general, Dialect/Accent is one of the most important factors that influence automatic speech recognition (ASR) performance next to gender (Gupta and Mermelstein, 1982; Huang
et al., 2001). Research has shown that traditional ASR systems are not robust to variations due to speaker dialect/accent
(Huang et al., 2004). Therefore, the formulation of effective
dialect classification for selection of dialect dependent acoustic models is one solution to improve ASR performance. Dialect knowledge could also be used in various components of
an ASR system such as pronunciation modeling (Liu et al.,
2000), lexicon adaptation (Ward et al., 2002), and acoustic
model training (Humphries and Woodland, 1998) or adaptation (Diakoloukas et al., 1997). Dialect classification techniques have also been used for rich indexing of historical speech corpora as well as for providing dialect information to spoken document retrieval systems (Gray and Hansen, 2005). Dialect knowledge could also be directly applied in automatic call center and directory lookup services (Zissman et al., 1996). Effective methods for accent modeling and detection have also
been developed, which can contribute to improving speech
systems (Angkititrakul and Hansen, 2006).
We note there are some subtle differences in the definition of accent versus dialect. To keep the scope of this study manageable, accent and dialect are used interchangeably here. The term dialect is defined as: a pattern
of pronunciation and/or vocabulary of a language used by the
community of native speakers belonging to some geographical region (Lei and Hansen, 2011). Dialects can be further
classified into family-tree and sub-tree dialects, all of which
are part of the “language forest” (see Fig. 1). Family-tree dialects are the family sub-branches in the dialect tree, where
their parent node is the actual language. As an example, for
the English language it is possible to consider broad groups of
American, Australian, and United Kingdom branches within
the overall family tree. Beneath each main partition would
be sub-classification (i.e., Belfast, Bradford, Cardiff, etc. for
UK English). Moving upwards in the “language forest”, it is
possible to realize an English forest, a Spanish forest (which may include the Cuban Spanish, Peruvian Spanish, and Puerto Rican Spanish family-trees), etc. The general speech processing community does not have a well-defined definition
of the language space relating to dialects, accents, and languages. In general, a more detailed focused level below
family-tree dialects would be called “sub-tree dialects” which
would reflect the sub-branches of the family-trees. For example, American English is estimated to contain 56 regional sub-dialects, spanning geographical regions such as Boston/New England, New York City/New Jersey/Philadelphia, New Orleans, Texan, etc. There is much
effort dedicated to investigating language identification at the
higher language forest level. For example, National Institute
of Standards and Technology (NIST) has conducted a number of automatic language recognition evaluations (LRE) since
1996. This has resulted in the introduction of successful algorithms, such as Parallel Phone Recognition and Language
Modeling (PPRLM) (Zissman, 1996), Vector Space Modeling
(VSM) (Li et al., 2007), and others. However, this research has primarily focused at the language forest level, with only limited or no attention paid to the lower family-tree dialect level (NIST LRE, 2007, 2009, 2011, 2015), where the classification is usually more difficult than at the language forest
level (for example, English vs. Spanish). In the NIST LREs,
some closely related language pairs have been considered
(i.e., Russian vs. Ukrainian, Urdu vs. Hindi), as well as dialects (i.e., Arabic Iraqi, Arabic Levantine, Arabic Maghrebi,
and modern standard Arabic). Researchers have also explored
the differences between automatic versus human assisted classification for speaker recognition (Hansen and Hasan, 2015),
and language ID (Zissman and Berkling, 2001). Therefore,
this study is positioned to focus at the family-tree dialect level
to further research in this domain. Research advancements at
this level would not only help boost overall performance of
language identification, but shed new light on more subtle
challenges stemming from the sub-tree dialect level.
In order to achieve good performance in English dialect
classification, it is first necessary to understand how dialects
differ. Fortunately, there are numerous studies on English dialectology (Purnell et al., 1999; Trudgill, 1999; Wells, 1982).
English dialects differ in the following areas (Wells, 1982):
1. Phonetic realization of vowels and consonants
2. Phonotactic distribution (e.g., rhotic in farm: /farm/ vs. /fa:m/)
3. Phonemic system (the number or identity of phonemes used)
4. Lexical distribution
5. Rhythmical characteristics
6. Semantics
The first four areas/items are visible at the word level from
both production and perception levels. From a linguistic point
of view, a word may be the best unit to classify dialects. However, for an automatic classification system, it is impossible
to build models for all words from different dialects. Therefore, many researchers focus on identifying pronunciation differences for dialect classification (Huang and Hansen, 2005;
Huang and Hansen, 2006) to address items 1, 2 and 3. Huang
and Hansen (2005) addressed dialect classification using word
level based modeling, which was termed Word based Dialect
Classification, converting the text independent decision problem into a text dependent problem, producing multiple combination decisions at the word level rather than making a single overall decision at the utterance level. Gray and Hansen
(2005) considered temporal and spectral based features including the Stochastic Trajectory Model (STM), pitch structure, formant location and voice onset time (VOT) for dialect
classification to address items 1 and 5. In order to make this
process unsupervised, Huang and Hansen (2006) proposed the
use of frame-based selection via Gaussian Mixture Models
(GMM) for unsupervised dialect classification. One challenge
is that most research studies are based on in-house data and
more traditional acoustic modeling approaches. Only recently have some groups begun to employ state-of-the-art technology (i.e., i-Vectors) to perform acoustic
Fig. 1. World Language Tree: illustrated as highest level language / Dialect Space, followed by Family-Tree Dialects (many exist), followed by Sub-Tree
Dialects and individual sub-tree entries. For Sub-Tree Dialect level, several examples of UK English and American English are shown, but many more exist
within each sub-tree category.
modeling based English, Finnish, and Swiss French accent identification (Bahari et al., 2013; Behravan et al., 2013; Lazaridis et al., 2014). However, the corpora used in those studies are relatively homogeneous and small. This study extends that direction by validating on two different, larger corpora.
Alternatively, limited emphasis has been placed on the lexical distribution for dialect classification. In particular, word
selection and grammar are additional factors, which cannot be
modeled directly using frame based GMM or i-Vector acoustic information. For example, word selection differences between UK and US dialects can be distinguished with examples
such as - “lorry” vs. “truck”, “lift” vs. “elevator”, “dickie” vs.
“shirt”, etc. Australian English has its own lexical terms such
as tucker (food), outback (wilderness), etc. (Layer, 1994). It
is also common to observe the use of present perfect (UK)
vs. simple past tense (US) which reflects differences at the
grammar level - “I’ve already eaten.” vs. “I already ate.”.
One additional factor in which dialects differ is semantics. For example, "momentarily" means "for a moment" (UK) vs. "in a minute" or "any minute now" (US). The sentence "This flight will be leaving momentarily" could therefore represent different time periods in US vs. UK dialects (Layer, 1994). In this study, some traditional natural language
processing (NLP) techniques (such as N-Gram and LSA), as
well as novel NLP techniques (TF-IDF logistic regression) are
employed to address these two factors (highlighted as items
4 and 6 above).
It is also noted that the lack of extensive dialect corpora
in the field is one reason for the slow research progress, so in
this study we also wish to explore a potential solution for this
challenge. As such, this paper is organized as follows. The
next section begins with a description of dialect corpora and
their statistics, followed by a baseline system description in
Section 3. The proposed system includes three Sub-Systems
described in Section 4. In Section 5, an experimental eval-
uation is reported and analyzed. The study is concluded in
Section 6 with a summary and directions for future research.
2. Database description
It has previously been shown that it is more probable to
observe semantic differences in spontaneous text and speech
rather than formal written newspapers or prompted/read
speeches (Antoine, 1996; Hansen, 2004; Hasegawa-Johnson
and Levinson, 2006). Simply reading prepared text does not
in general convey actual dialect content of a language (Huang
and Hansen, 2007a; Liu et al., 2010b). There is no publicly available corpus in the speech community of audio across dialects of common languages, and in particular one containing spontaneous speech, to address the problems discussed
in Section 1. Therefore, the first challenge in this study is the
collection of a database which is both spontaneous and realistic. We propose a cost-effective solution by using web based
online podcasts of interviews where subjects speak spontaneously, and for which it is possible to determine the ground
truth dialect in the family-tree from web content and/or metadata. Since this audio is cost free, and easy to find through
web based podcasts from various parts of the world, this is
an effective way of obtaining a diverse range of dialect specific data. Hence, a process was initiated to crawl podcasts of
interviews from various websites. Although there are many
English dialects, the focus here was on three broad familytree dialects of English: American English (denoted as US),
Australian English (denoted as AU), and United Kingdom
English (denoted as UK). Since most subjects in these interviews are well-known, it is possible to identify the speaker’s
dialect based on their demographic information. This corpus is named as UT-Podcast in this study for convenience
(see Chitturi and Hansen, 2007; Chitturi and Hansen, 2008;
Chitturi, 2008 for preliminary corpus discussion).
Text transcriptions are decoded from audio with the
CMUsphinx ASR engine (Walker et al., 2004). The pretrained US English HUB4 based language model and acoustic model are used in the ASR engine (Sourceforge, 2015). It
is noted that the pre-trained ASR has a non-negligible word error rate, which means the ASR engine cannot transcribe all English audio files 100% correctly. Our focus here instead is to employ a reasonable ASR engine to extract the corresponding text as an input feature, which should allow for some text based noise in a practical deployment. Another potential question is why an ASR engine based on a specific English dialect (i.e., US English in this study) is chosen. The reason is that this pre-trained ASR engine can be taken as a common audio-to-text transformation system, and the resulting text is used as a feature for further processing. During the transformation, different dialects, owing to the unique pronunciation characteristics mentioned in Section 1, are expected to exhibit different error patterns during audio-to-text projection, which can potentially contribute to dialect identification after proper modeling. Also, due to the high similarity among English dialects, we expect many words to be decoded correctly no matter which dialect-specific English ASR engine is actually employed; for example, "lorry" vs. "truck" can safely be taken as a good indicator of UK English vs. American English. Although ASR engines for other English dialects are possible, building them would entail considerably more effort in language modeling and acoustic modeling, which would severely reduce the practicality of this approach. The experimental results from the text approach later in this study (Section 5.1.2) also support this claim.
The acoustic and text statistics of the assembled podcast
database are described in Section 2.1. To make this study more general, a second database based on NIST LRE-2009 is introduced in Section 2.2.
2.1. Podcast dataset: UT-Podcast
In an effort to demonstrate actual performance for open test
conditions, training and test data are collected from entirely
different websites within each dialect region. To increase the
representativeness of the data, we collect data from 12, 5,
and 8 websites for AU, UK, and US English, respectively.
One benefit of this approach is that bias caused by channel/noise/recording conditions is dramatically mitigated in the resulting dialect classification performance. Also, to make the data more realistic, we include podcasts covering a wide range of topics, such as news reports, scientific reports, religion, and social life. Compared with a sound booth collection, this approach produces more spontaneous and representative data.
We note that data collected from online podcasts is spontaneous and in general not well structured. Therefore, it is expected that some non-speech audio content such as music, silence, etc. will be present in the audio streams. To address
this, energy based voice activity detection (VAD) is employed
(Sohn et al., 1999). Since the collected audio files range in
duration from 30–60 min, after pre-processing of the audio
as described above, the audio data is subsequently segmented
into much smaller audio segment files for either training or test. With the help of the previous VAD processing, the segmentation is performed so as to avoid abrupt cuts in the middle of a conversation, following the same approach used by LDC in the NIST LRE data preparation. In general, each segmented audio file is about 17 s in duration and contains on average 46 words. Further details, such as file numbers, duration, word count, etc. are summarized in Table 1, where both acoustic and text statistics are provided.

Table 1
UT-Podcast Database Description. AU, UK and US correspond to Australian English, United Kingdom English, and American English, respectively. Note, '# File' refers to the smaller files after segmentation.

Metric                   Train AU   Train UK   Train US   Test AU   Test UK   Test US
# File                   449        246        406        332       89        240
Total duration (hour)    2.1        1.2        1.9        1.6       0.4       1.2
Duration/file (sec.)     16.8       17.6       16.8       17.3      15.4      18.0
Word count               22235      11845      18437      16772     3386      10838
Word/file                49.5       48.2       45.4       50.5      38.0      45.2
Dictionary size          3923       2025       3224       3178      917       2337
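The energy-based VAD step can be illustrated with a very small NumPy sketch. Note this is only a simplified stand-in for the statistical VAD of Sohn et al. (1999) cited above (the frame sizes match Section 3, but the threshold value and decision rule are assumptions for illustration only):

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=20, hop_ms=10, threshold_db=-35.0):
    """Boolean speech / non-speech decision per frame.

    A crude frame-energy threshold relative to the loudest frame,
    standing in for the statistical VAD of Sohn et al. (1999).
    """
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    db = 10.0 * np.log10(energy + 1e-12)
    return db > (db.max() + threshold_db)

# toy usage: 3 s of "speech" (noise) followed by 2 s of silence
sr = 8000
audio = np.concatenate([np.random.randn(3 * sr), np.zeros(2 * sr)])
flags = energy_vad(audio, sr)
print(flags.mean())   # fraction of frames kept as speech (about 0.6 here)
```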
It is noted that the corpus is released for public research usage under a free license (http://crss.utdallas.edu/corpora/UT-Podcast/). To give an idea concerning the structure of these podcasts, sample text extracts are provided here:
Extract 1: “He’s laying on the ground in front of the hotel
next to my office. And there’s a truck parked right there. In
the picture that I showed, there’s a truck, pickup truck, parked
right there. He was stood up in front of that pickup truck and
shot in the back five more times. It was shown on CNN”.
Extract 2: “Jackie: ‘Well, according to the report, alcohol is the main cause of antisocial behavior in Britain. …
Let’s listen to Elena. Is she worried about anti-social behavior?’ Elena: ‘My name is Elena and I live in North West
London…’”.
From these two examples, it is apparent that some keywords may carry more dialect-dependent information than others. For example, some words are more American English dialect dependent — "truck", "apartment" — versus others which are more likely UK dialect — "lorry", "flat". It should be noted that both word selection and pronunciation/prosody reflect dialect. Extracts 1 and 2 were drawn from American English and UK English, respectively. Clearly, this salient information is not directly reflected in
the acoustic modeling approach. However, a language model
based approach that can weigh the dialect-sensitive word selection information properly can be more helpful in dialect
identification.
2.2. LRE-2009 dataset
In addition to UT-Podcast, the NIST LRE-2009 dataset is
also considered. While this is a rich, diverse and extensive
corpus, one drawback is that it is only available at no cost to
those members who agreed to and participated in the NIST LREs. In spite of this limitation, we still wish to consider it in order to validate the proposed algorithms in comparison with other sites. For the LRE-2009 dataset, we limit the language set to American English (denoted as US) and Indian English (denoted as IN), since the focus of this study is on dialect/accent identification. The training data for American English include broadcast news programs from Voice of America (VOA) and conversational telephone speech (CTS). The training data for Indian English are based purely on conversational telephone speech. The American test data include both VOA and CTS audio, while the Indian test data consist only of CTS audio. VOA data are pre-processed using the data purification procedure proposed in Liu et al. (2012). Audio files are segmented into blocks, which on average are 19 s in duration and contain 49 words. Further details such as file numbers, duration, word count, etc. are summarized in Table 2.

Table 2
LRE-2009 Database Description (only American English and Indian English data from LRE-2009 are used). US and IN denote American English and Indian English, respectively.

Metric                   Train US   Train IN   Test US   Test IN
# File                   500        500        2615      1624
Total duration (hour)    3.8        2.8        9.3       6.2
Duration/file (sec.)     27.4       20.2       12.8      13.7
Word count               39796      24126      93365     54631
Word/File                79.6       48.3       35.7      33.6
Dictionary Size          4909       2918       7402      4282
3. Baseline: GMM-based DID system
A number of past research studies have explored the use of
spectral based features such as MFCC (Hansen et al., 2004;
Huang and Hansen, 2006) for the purpose of dialect classification. As such, MFCC features are employed here. Before feature extraction, VAD must be performed to set aside silence
and music in all audio files. Next, acoustic features are extracted with an analysis window size of 20 ms and a skip-rate
of 10 ms, resulting in a frame rate of 100 frames per second.
The entire feature set includes MFCCs and their shifted delta
cepstrum (SDC) features (Bielefeld, 1994) to represent the
audio sequence. The SDC features incorporate additional temporal information into the feature vector, which has proven effective in utterance based identification tasks such as DID and emotion identification (Liu et al., 2010a). The classical
SDC parameter configuration N-d-P-k for language identification is 7-1-3-7, which combines with MFCC to produce
an overall 56-dimensional feature vector on a frame-by-frame
basis. To be specific, the windowed signal is filtered through a Mel-scale filterbank over the band 300–4000 Hz, producing 26 log-filterbank energies. The log-filterbank energies are then processed via a DCT, producing seven cepstral coefficients (c0-c6). The cepstral coefficients are further normalized using cepstral mean and variance normalization (CMVN). Then, following the 7-1-3-7 SDC configuration, 56-dimensional feature vectors are derived. Once all of the features are extracted, Gaussian Mixture Modeling (GMM) is used to train the dialect models and make the classification decision accordingly.
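To make the 7-1-3-7 SDC construction concrete, the following NumPy sketch stacks k = 7 delta blocks (spread d = 1, shift P = 3) on top of the 7 static cepstra to produce the 56-dimensional frame-level vector. It is a minimal illustration of the standard SDC definition, not the exact feature extraction code used in this study (edge padding at the utterance boundaries is an assumption):

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Append shifted delta cepstra (SDC) blocks to static cepstra.

    cep : (T, N) array of per-frame cepstral coefficients (here N = 7, c0-c6,
          already CMVN normalized).  With the 7-1-3-7 configuration this
          returns a (T, N + N*k) = (T, 56) feature matrix.
    """
    T = cep.shape[0]
    pad = d + (k - 1) * P                 # enough context on both sides
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode='edge')
    blocks = []
    for i in range(k):
        plus = padded[pad + i * P + d: pad + i * P + d + T]
        minus = padded[pad + i * P - d: pad + i * P - d + T]
        blocks.append(plus - minus)       # i-th delta block: c(t+iP+d) - c(t+iP-d)
    return np.hstack([cep] + blocks)      # (T, 56)

# toy usage: 100 frames of 7-dim MFCCs -> 56-dim MFCC-SDC features
mfcc = np.random.randn(100, 7)
print(sdc(mfcc).shape)                    # (100, 56)
```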
4. Proposed DID system
The proposed system for DID consists of three Sub-Systems – an acoustic Sub-System (i-Vector based DID, noted as Sub-System 1), a language or text-based classification Sub-System (noted as Sub-System 2), and a score fusion Sub-System (noted as Sub-System 3). Sub-Systems 1 (acoustic system) and 2 (text system) are discussed in Sections 4.1 and 4.2, respectively.
Sub-System 3 is simply an equal weight fusion solution. The
overall proposed DID system is illustrated in Fig. 2.
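Since Sub-System 3 is an equal weight fusion, it reduces to averaging the per-dialect scores of the two sub-systems once they are placed on a comparable scale. The sketch below illustrates the idea; the z-normalization of each sub-system's scores is an assumption added for illustration, not a detail specified in the text:

```python
import numpy as np

def equal_weight_fusion(audio_scores, text_scores):
    """Equal-weight fusion of two score dicts over the same dialect set.

    Scores are z-normalized per sub-system before averaging so that
    neither system dominates (normalization scheme is an assumption).
    """
    def znorm(scores):
        v = np.array(list(scores.values()), dtype=float)
        return {k: (s - v.mean()) / (v.std() + 1e-12)
                for k, s in scores.items()}
    a, t = znorm(audio_scores), znorm(text_scores)
    return {d: 0.5 * a[d] + 0.5 * t[d] for d in audio_scores}

fused = equal_weight_fusion({'AU': 1.2, 'UK': -0.3, 'US': 0.1},
                            {'AU': 0.4, 'UK': 0.2, 'US': -0.6})
print(max(fused, key=fused.get))  # -> 'AU' for this toy input
```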
4.1. Audio based dialect identification: Sub-System #1
The MFCC-SDC features derived in Section 3 are also referred to as raw features since they can be further processed
and converted into more refined features such as i-Vectors.
The i-Vector, initially introduced for speaker recognition (Dehak et al., 2010; Liu et al., 2014a; Liu et al., 2014b; Yu et al., 2014a; Yu et al., 2014b; Zhang et al., 2015), is now widely adopted in speech processing fields such as emotion identification (Xia and Liu, 2012; Liu and Hansen, 2014) and language identification (Martinez et al., 2011). However, it has seldom been explored in dialect identification and therefore warrants further exploration here. The i-Vector is extracted following the factor analysis model (Dehak et al., 2010):

M = m + Tω,    (1)

where T is referred to as the Total Variability matrix, ω is the resulting i-Vector, m is the universal background model (UBM) mean supervector, and M is the supervector derived from the raw features. The extraction converts a variable-length matrix of spectral feature frames into a fixed-dimensional feature vector for each speech utterance.
All available training data are employed to train both
the UBM and total variability matrix with five iterations of
Expectation–Maximization. Next, the i-Vectors for both training and test sets are extracted with the total variability matrix.
Once the i-Vectors are ready, we rely on a Gaussian Backend algorithm for dialect modeling. The MFCC-SDC + i-Vector + Gaussian Backend combination is the state-of-the-art system for the language identification task (Torres-Carrasquillo et al., 2010; Dehak et al., 2011) and is adopted for the DID task here. In contrast to the popular GMM approach (Shen and Reynolds, 2008; Huang and Hansen, 2006; Huang and Hansen, 2007b), the i-Vector approach (Behravan et al., 2013) dramatically surpasses GMM as well as other popular approaches (Choueiter et al., 2008; Larcher et al., 2011; Liu et al., 2011; Brümmer et al., 2011), so we also want to explore its performance in the English dialect scenario. The flowchart of the proposed i-Vector based DID system is illustrated in Fig. 3. In this study, the mixture number for the UBM is 32 and the i-Vector dimension is 100; the i-Vectors are whitened and length normalized before applying the Gaussian Backend.
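The whitening, length normalization, and Gaussian Backend steps can be summarized with the following NumPy sketch. It is a simplified stand-in for the actual backend (a single shared covariance is assumed, and all variable names are hypothetical):

```python
import numpy as np

def whiten_and_length_norm(X, mean, W):
    """Whiten with training statistics, then project to the unit sphere."""
    Xw = (X - mean) @ W
    return Xw / np.linalg.norm(Xw, axis=1, keepdims=True)

def train_gaussian_backend(ivecs, labels):
    """Class means + shared (pooled) covariance over the i-Vectors."""
    classes = sorted(set(labels))
    means = {c: ivecs[labels == c].mean(axis=0) for c in classes}
    pooled = sum(np.cov(ivecs[labels == c].T, bias=True) *
                 np.sum(labels == c) for c in classes) / len(labels)
    prec = np.linalg.inv(pooled + 1e-6 * np.eye(ivecs.shape[1]))
    return classes, means, prec

def score_gaussian_backend(x, classes, means, prec):
    """Per-dialect log-likelihood (up to a shared constant) of one i-Vector."""
    return {c: -0.5 * (x - means[c]) @ prec @ (x - means[c]) for c in classes}

# toy usage with random 100-dim "i-Vectors" for three dialects
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = np.repeat(np.array(['AU', 'UK', 'US']), 100)
mean = X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(X.T))        # whitening from train covariance
W = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-10))
Xn = whiten_and_length_norm(X, mean, W)
classes, means, prec = train_gaussian_backend(Xn, y)
test = whiten_and_length_norm(X[:1], mean, W)[0]
print(max(score_gaussian_backend(test, classes, means, prec).items(),
          key=lambda kv: kv[1]))
```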
Fig. 2. Flowchart of proposed DID system. Text content is obtained from ASR. The acoustic DID system is detailed in Fig. 3. Refer to Fig. 4 for Text based
Dialect ID system details.
Fig. 3. Block diagram of the i-Vector based dialect identification (DID) system employed. Data 1 and 2 correspond to raw feature data used for UBM and
total variability matrix training, respectively. ‘Audio data’ represents all of the acoustic data involved for the identification task. 56-Dimensional MFCC-SDC
is employed as raw feature.
4.2. Text based dialect identification: Sub-System # 2

Next, we focus on Sub-System 2 (text system), which is illustrated in Fig. 4. As stated in Section 2, an ASR engine is used to derive the estimated text sequence from the input audio stream. Next, three text categorization approaches are explored, which are introduced in the following three subsections.

Fig. 4. (a) Text based dialect identification system. Text is derived from ASR. The three text based sub-systems follow a similar workflow, but differ in the implementation of individual modules. (b) TF-IDF logistic regression based Text DID system. See Section 4.2 for more details.

4.2.1. N-Gram based dialect identification: Text DID system #1

It is assumed that a given text document, associated with the corresponding audio, consists of many sentences. Each sentence can be regarded as a sequence of words W. The probability of generating the word sequence W can be measured as

P(W|D) = P(W_1, W_2, ..., W_n | D)
       = P(W_1|D) P(W_2|W_1, D) ... P(W_n|W_1, ..., W_{n-1}, D)
       = ∏_{i=1}^{n} P(W_i | W_1, ..., W_{i-1}, D)
       = ∏_{i=1}^{n} P(W_i | W_{i-(N-1)}, ..., W_{i-1}, D),    (2)

where n is the number of words in W, W_i is the ith word, N is the N-Gram order, and D is the dialect specific language model. For example, in the UT-Podcast corpus, D ∈ {AU, UK, US} (corresponding to Australian English, United Kingdom English, and American English, respectively). The final equality follows from the N-Gram definition. The N-Gram probabilities are calculated via occurrence
counting. The final result is then decided by

C = arg max_D ∏_{W∈ϕ} P(W|D),    (3)
where ϕ is the set of sentences in a given document stemming
from the corresponding input audio stream. The prior probabilities are dropped in this equation since different classes are
assumed to have equal priors for simplification. In this study,
we use the derivative measure of the cross entropy, known
as the test set perplexity for dialect classification. If the word
sequence is sufficiently long, the cross entropy of the word
sequence W is approximated as

H(W|D) = −(1/n) log_2 P(W|D),    (4)

where n is the length of the test word sequence W, measured in words. The perplexity of the test word sequence W as it relates to the language model D is

PP(W|D) = 2^{H(W|D)} = P(W|D)^{−1/n}.    (5)
The perplexity of the test word sequence therefore measures the generalization capability of the given language model: the smaller the perplexity, the better the language model generalizes to the test word sequence. The final classification decision
is then written as

C = arg min_D ∏_{W∈ϕ} P(W|D)^{−1/n}.    (6)
By comparing Eqs. 3 and 6, it is noted that the perplexity
measure is actually the normalized probability measure, where
the normalization factor is simply the word sentence length n.
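A toy illustration of the decision rule in Eqs. (2)-(6) is given below, using add-one-smoothed bigram counts as the dialect-specific language models. The smoothing scheme and token handling are assumptions made only to keep the sketch self-contained; the study does not specify its language-model toolkit here:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Add-one-smoothed bigram LM from a list of token lists (one per sentence)."""
    uni, bi, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ['<s>'] + sent
        vocab.update(toks)
        uni.update(toks[:-1])                      # history counts
        bi.update(zip(toks[:-1], toks[1:]))        # bigram counts
    V = len(vocab)
    return lambda prev, w: (bi[(prev, w)] + 1.0) / (uni[prev] + V)

def perplexity(sent, lm):
    """PP(W|D) = P(W|D)^(-1/n), computed in log space (Eqs. (4)-(5))."""
    toks = ['<s>'] + sent
    logp = sum(math.log2(lm(p, w)) for p, w in zip(toks[:-1], toks[1:]))
    return 2.0 ** (-logp / len(sent))

# toy dialect-specific language models
lms = {'UK': train_bigram_lm([['the', 'lorry', 'is', 'parked'],
                              ['my', 'flat', 'is', 'small']]),
       'US': train_bigram_lm([['the', 'truck', 'is', 'parked'],
                              ['my', 'apartment', 'is', 'small']])}
doc = [['the', 'lorry', 'is', 'small']]               # sentences of one test document
# Eq. (6): choose the dialect whose LM yields the lowest combined perplexity
scores = {d: math.prod(perplexity(s, lm) for s in doc) for d, lm in lms.items()}
print(min(scores, key=scores.get))                    # -> 'UK'
```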
4.2.2. Latent semantic analysis based dialect classification:
text DID System # 2
One approach used to address topic classification problems has been latent semantic analysis (LSA), which was first
explored for document indexing in (Deerwester, 1990). That
work focused on addressing the issues of synonymy - many
ways to refer to the same idea, and polysemy – words having more than one distinct meaning. These two issues present
problems for dialect classification because two conversations
about the same topic need not contain the same words and
conversely two conversations about different topics may in
fact contain the same words but with different intended meanings. There are other semantic differences as previously described in Section 1 , which can be explored in the classification of dialect.
In order to find an alternate feature space which avoids these problems, singular value decomposition (SVD) is performed to derive an orthogonal vector representation of the documents. SVD uses eigen-analysis to derive linearly independent directions of the original term–document matrix A, whose columns correspond to the dialects, while the rows correspond to the words/terms in the entire text database. SVD decomposes this original term–document matrix A into three other matrices, A = U·S·V^T, where the columns of U are the eigenvectors of AA^T (called the left eigenvectors), S is a diagonal matrix whose diagonal elements are the singular values of A, and the columns of V are the eigenvectors of A^T A (called the right eigenvectors). An example is shown in Figs. 5 and 6 to illustrate this concept for
text/language information SVD decomposition.
Fig. 5. Term–Document Matrix A and query vector q. d1, d2, and d3 represent three dialects; q is the query file that needs to be classified as one of the three dialects.

Terms     d1  d2  d3      q
He         1   1   1      0
said       0   1   1      0
she        1   0   0      0
wanted     1   0   1      0
to         1   1   1      0
go         0   2   0      1
public     1   1   1      1

Fig. 6. Decomposed matrices A = U·S·V^T for the term–document matrix in Fig. 5.

U =  0.4677   0.1148  -0.0249
     0.3326  -0.2040   0.6371
     0.1351   0.3188  -0.6620
     0.2849   0.4935   0.1553
     0.4677   0.1148  -0.0249
     0.3657  -0.7574  -0.3603
     0.4677   0.1148  -0.0249

S =  3.6733   0        0
     0        1.9049   0
     0        0        0.9371

V =  0.4963   0.6073  -0.6204
     0.6716  -0.7214  -0.1688
     0.5501   0.3329   0.7659
The new dialect vector coordinates in this lower dimensional space are the rows of V. The coordinates (noted as q_1) of the projection of query file q into the lower dimensional space are given by q_1 = q^T·U·S^{−1}. The test utterance is then classified as a particular dialect based on the scores given by the cosine similarity measure (also called Cosine Distance Scoring), as shown in Eq. 7:

d_best = arg max_{d_i}  (q_1 · d_i) / (|q_1| |d_i|),    (7)

where d_i is one of the dialect classes.
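Using the small term-document matrix of Fig. 5, the projection q_1 = q^T·U·S^{-1} and the Cosine Distance Scoring of Eq. (7) can be reproduced with a few lines of NumPy, as sketched below (SVD sign conventions may differ from Fig. 6, which does not affect the cosine scores):

```python
import numpy as np

# term-document matrix A of Fig. 5 (rows: He, said, she, wanted, to, go, public)
A = np.array([[1, 1, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 2, 0],
              [1, 1, 1]], dtype=float)
q = np.array([0, 0, 0, 0, 0, 1, 1], dtype=float)   # query: contains "go", "public"

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * Vt
dialects = Vt.T                                    # rows of V: coordinates of d1, d2, d3
q1 = q @ U @ np.diag(1.0 / s)                      # q1 = q^T U S^-1

# Eq. (7): cosine similarity between the projected query and each dialect
cos = dialects @ q1 / (np.linalg.norm(dialects, axis=1) * np.linalg.norm(q1))
print('best dialect: d%d' % (np.argmax(cos) + 1))
```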
4.2.3. TF-IDF – logistic regression based dialect
classification: Text DID system # 3
Both N-Gram and LSA are traditional popular NLP approaches, which directly work on text sequence data to find
potential semantic cues to achieve dialect classification. Another approach is using a keyword spotting approach for dialect classification. The underlying assumption behind this
is that each dialect should demonstrate some unique word
set, which could help distinguish. The text sequences need
to be pre-processed before performing feature extraction. For
feature extraction, a weighting factor algorithm called Term
Frequency–Inverse Document Frequency (TF-IDF) is adopted
(Jones, 1972). Next, logistic regression is employed to perform dialect modeling and classification.
4.2.3.1. Text pre-processing. In any language, there are many words that convey little or no meaning but are indispensable for the grammatical structure of the language; these words are defined as "stop words". For example, in the English language, words like "a", "the", "and" and many other high frequency words are considered stop words (Greenberg, 1997). It is common practice to exclude stop words from the feature vector since they are not expected to be class/dialect sensitive. A stop word list is employed for this procedure.
Another text pre-processing procedure is called stemming, which means deriving a morphological root for the words. For example, if an English document contained 5 instances of the word "play", 2 instances of the word "plays" and 1 instance of the word "playing", then stemming would reduce these three separate features to one feature describing the set of words with a root similar to "play", which occurred in the document 8 (= 5 + 2 + 1) times.
After applying these text processing procedures, we can perform an N-Gram operation (deciding how many word grams should be combined), as described in Section 4.2.1. It effectively groups the text information into a basic unit. A larger N can store more context: with sufficient training data, N can be set to a high value in the N-Gram operation. However, if the data are sparse, a large N can render the resulting model ineffective due to the sparsity of specific word sequence occurrences. Due to the data scarcity in this study, we only investigate two scenarios, N = 1 and N = 2, also called unigram and bigram, respectively.
These pre-processing procedures are applied in a sequential manner, with the combinations listed in Table 3 (a minimal sketch of this pipeline follows the table).

Table 3
Combinations of Text Pre-processing tasks. Note, only two variations for N-Gram are considered here, unigram and bigram. So under the heading "Employing bigram", N = "unigram", Y = "bigram".

Combination #    Employing stopword removal    Employing stemming    Employing bigram
0                N                             N                     N
1                N                             N                     Y
2                N                             Y                     N
3                N                             Y                     Y
4                Y                             N                     N
5                Y                             N                     Y
6                Y                             Y                     N
7                Y                             Y                     Y
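A minimal sketch of this pre-processing pipeline follows. NLTK's English stop-word list and Porter stemmer are used here only as stand-ins; the actual stop-word list and stemmer used in the study are not specified:

```python
from itertools import product
from nltk.corpus import stopwords          # assumed resource; requires nltk.download('stopwords')
from nltk.stem import PorterStemmer        # assumed stemmer

STOP = set(stopwords.words('english'))
STEM = PorterStemmer().stem

def preprocess(tokens, rm_stop, stem, bigram):
    """Apply one Table 3 combination to a list of lower-cased word tokens."""
    if rm_stop:
        tokens = [t for t in tokens if t not in STOP]
    if stem:
        tokens = [STEM(t) for t in tokens]
    if bigram:
        return [' '.join(p) for p in zip(tokens, tokens[1:])]
    return tokens

# enumerate combinations 0-7 in the same order as Table 3 (N/Y columns)
combos = list(product([False, True], repeat=3))   # (stopword, stemming, bigram)
toks = 'there is a truck parked right there'.split()
for idx, (rm_stop, stem, bigram) in enumerate(combos):
    print(idx, preprocess(toks, rm_stop, stem, bigram))
```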
4.2.3.2. Feature generation. Once the raw text from ASR is
processed in the pre-processing phase, there are several alternative feature extraction approaches, which are detailed in
Table 4.
Table 4
Feature representation variations.

Feature Method #    Feature representation approach
0                   Binary feature: use 0 or 1 to signify whether some word/item is present or not
1                   Term frequency: how often a word occurred in a document
2                   TF-IDF: Term Frequency–Inverse Document Frequency, reflecting how important a word is to a document in a collection

The implementations of Feature Methods 0 and 1 are self-evident. Further explanation is needed for Method 2, that is, the TF-IDF approach. TF-IDF is a numerical statistic which reflects how important a term/word is to a document in a dataset (Salton and Buckley, 1988). It is often used as a weighting factor for information retrieval and text mining (Hansen et al., 2005; Kim and Hansen, 2007). We use it here as a feature for our dialect text categorization tasks. There are two empirical intuitions in using TF-IDF: (a) the more frequently a term (say t_i) occurs in a document j (denoted as d_j), the more important it is for d_j (the term frequency intuition);
(b) the more documents t_i occurs in, the less discriminating it is (i.e., the smaller its contribution is in characterizing the semantics of a document in which it occurs) (the inverse document frequency intuition). In other words, the TF-IDF
value increases proportionally to the number of times a word
appears in the document, but is offset by the frequency of
the word across the entire corpus, which reflects its discriminating ability and helps control for the fact that some words
are generally more common than others. The weighting factor computed by TF-IDF techniques is often normalized so
as to adjust for the tendency of TF-IDF to emphasize longer
duration documents. To be more specific, TF-IDF can be calculated as:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D).    (8)
The first term in the right hand side of Eq. 8 is defined
as:
TF(t_i, d_j) = n_{i,j} / ∑_k n_{k,j}
             = (number of times term t_i appears in document d_j) / (total number of terms in document d_j),    (9)
where the numerator n_{i,j} is the number of occurrences of the specific term t_i in document d_j, and the denominator, the total number of terms in document d_j, is used for normalization. Note that, to simplify the notation, the subscripts i and j are dropped in Eq. 8. The second term on the right hand side of Eq. 8 is defined as:

IDF(t, D) = log( |D| / (1 + |{d : t_i ∈ d}|) ),    (10)
where |D| is the total number of documents in the dataset D, and |{d : t_i ∈ d}| is the number of documents in which term t_i actually appears. To avoid any potential divide-by-zero, 1 + |{d : t_i ∈ d}| is used in the denominator instead.
This feature extraction procedure essentially assigns some
weighting to the terms derived from the pre-processing
step (see Section 4.2.3.1). Having addressed the text pre-processing and feature generation steps of Fig. 4(b), we turn to model development.
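Because Eq. (10) places the "1 +" inside the denominator (slightly different from some toolkit defaults), a literal transcription of Eqs. (8)-(10) is sketched below; it is illustrative only and not the study's actual implementation:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocab, rows) with Eq. (8) weights."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))          # document frequency
    n_docs = len(docs)
    rows = []
    for d in docs:
        counts, total = Counter(d), len(d)
        row = []
        for t in vocab:
            tf = counts[t] / total                         # Eq. (9)
            idf = math.log(n_docs / (1 + df[t]))           # Eq. (10)
            row.append(tf * idf)                           # Eq. (8)
        rows.append(row)
    return vocab, rows

docs = [['the', 'lorry', 'is', 'parked'],
        ['the', 'truck', 'is', 'parked'],
        ['the', 'truck', 'is', 'big']]
vocab, X = tfidf_matrix(docs)
print(dict(zip(vocab, X[0])))   # 'lorry' gets the largest weight in document 0
```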
4.2.3.3. Text Categorization Modeling. Logistic regression
(LR) is a type of discriminative probabilistic classification
model that operates over real-valued vector inputs. It has
been explored in large text classification tasks (Genkin et al.,
2007), but limited studies have investigated its potential application in dialect text classification. Therefore, LR is investigated here.
Given a set of instance-label pairs {x_i, y_i}, i = 1, ..., l, where x_i ∈ R^N and y_i ∈ {−1, 1}, LR binary classification assumes the following probability model:

P(y = ±1 | x, w) = 1 / (1 + exp(−y(w^T x + b))).    (11)
To allow a simpler derivation without the bias term b, one often augments each instance with an additional dimension as follows:

x_i^T ← [x_i^T, 1],    w^T ← [w^T, b].    (12)

With this, Eq. 11 can be updated as follows:

P(y = ±1 | x, w) = 1 / (1 + exp(−y w^T x)).    (13)
The only LR model parameter, w, can now be estimated by maximizing the likelihood as follows:

max_w ∏_{i=1}^{l} 1 / (1 + exp(−y_i w^T x_i)),    (14)

or by minimizing the negative log-likelihood as follows:

min_w ∑_{i=1}^{l} log(1 + exp(−y_i w^T x_i)).    (15)

Moreover, to avoid over-fitting, one adds a regularization term w^T w / 2; thus, in this study, we consider the following unconstrained optimization problem as regularized logistic regression:

min_w [ (1/2) w^T w + C ∑_{i=1}^{l} log(1 + exp(−y_i w^T x_i)) ],    (16)
where C > 0 is a penalty parameter used to balance the two terms in Eq. 16. The trust region Newton method is employed to learn the LR model (i.e., w is learned by solving Eq. 16), which is also referred to as the primal form of logistic regression.
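In practice, the regularized problem of Eq. (16) can be solved with any off-the-shelf solver. The sketch below uses scikit-learn's LogisticRegression as a stand-in for the trust region Newton implementation referenced above; the toy features, labels, and the value of C are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy TF-IDF feature vectors (rows = documents) and dialect labels
X = np.array([[0.3, 0.0, 0.1],    # e.g. weights for ('lorry', 'truck', 'flat')
              [0.2, 0.0, 0.2],
              [0.0, 0.4, 0.0],
              [0.0, 0.3, 0.1]])
y = np.array(['UK', 'UK', 'US', 'US'])

# L2-regularized LR (Eq. 16); C balances the loss against the w^T w / 2 term
clf = LogisticRegression(C=1.0, solver='newton-cg', max_iter=1000)
clf.fit(X, y)
print(clf.predict(np.array([[0.25, 0.0, 0.05]])))   # -> ['UK'] for this toy data
```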
5. Experiments
As discussed in Section 2 and shown in Fig. 2, the input to the DID system is audio data only. As such, we rely on an ASR engine to derive the necessary parallel text information, which is then used for text modeling and classification. Next, as Fig. 2 shows, we leverage both audio and text modeling to boost the final dialect classification system performance. The proposed algorithms are evaluated on the two corpora (i.e., UT-Podcast and NIST LRE-2009) described in Section 2 to fully assess the validity of the proposed audio-text solution.
To measure performance, two metrics are used in this
study. The first one is unweighted average recall (UAR)(Liu
et al., 2014). In the binary class case (‘X’ and ‘NX’), it is
defined as:
UAR = (Recall(X) + Recall(NX)) / 2.    (17)
Table 5
Dialect identification performance confusion matrices of the GMM baseline and i-Vector systems on the UT-Podcast corpus. AU, UK and US correspond to Australian English, United Kingdom English, and American English, respectively. 'UAR' is the unweighted average recall.

                 Recall of GMM system (%)        Recall of i-Vector system (%)
Test →           AU      UK      US              AU      UK      US
Train ↓
AU               85.5    38.2    26.3            78.0    15.7    8.7
UK               3.6     32.6    10.8            11.5    61.8    7.5
US               10.8    29.2    62.9            10.5    22.5    83.8
UAR (%)          60.3                            74.5
Our study relies on unweighted average recall rather than weighted average (WA) recall or conventional classification accuracy (Bahari et al., 2013), since UAR remains meaningful for highly unbalanced data. Eq. 17 can easily be generalized to the multi-class case. The advantage of the UAR metric is that it is application independent. To assist potential comparisons, another popular language recognition evaluation metric, the average detection cost (Cavg), is also adopted here (Li et al., 2013). It is application dependent and follows the definition of LRE-2009 (NIST LRE 2009, http://www.itl.nist.gov/iad/mig/tests/lre/2009/LRE09_EvalPlan_v6.pdf).
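UAR is simply the mean of the per-class recalls, which the following small sketch (with hypothetical label arrays) makes explicit; it also shows why UAR is preferred over accuracy for unbalanced data:

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted average recall (Eq. 17, generalized to multiple classes)."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)

# toy example with highly unbalanced classes: accuracy looks good,
# but UAR exposes the poor recall of the minority class
y_true = np.array(['US'] * 90 + ['UK'] * 10)
y_pred = np.array(['US'] * 90 + ['US'] * 8 + ['UK'] * 2)
print((y_true == y_pred).mean())   # accuracy = 0.92
print(uar(y_true, y_pred))         # UAR = (1.0 + 0.2) / 2 = 0.6
```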
5.1. UT-Podcast Corpus
5.1.1. Audio system
Both the GMM baseline system (Section 3) and the i-Vector system are purely audio based (see Fig. 2), and are therefore referred to as the "Audio System" to distinguish them from the "Text System". The 3-way classification confusion matrices of the GMM and i-Vector systems are summarized in Table 5. As can be seen, the GMM baseline system offers decent performance on the AU and US dialects but faces a huge challenge in modeling the UK dialect due to the sparsity of UK English training data (see Table 1). However, the i-Vector system successfully models all three dialects and offers an absolute +14.2% improvement over the GMM baseline system in terms of UAR.
5.1.2. Text system
Next, we consider the performance of various text processing based DID solutions.
5.1.2.1. N-Gram based DID system. For the N-Gram text categorization system, one important parameter is "N", which represents the number of grams. As we can see from Table 6, a good choice for N is 2. However, due to the limited text content per file (i.e., only 1-, 2-, or 3-word sequence contexts), overall classification performance is weak. In reality, the performance of this N-Gram text processing is very close to random guessing.
Table 6
Performance of the N-Gram based dialect identification (DID) text system applied to the UT-Podcast corpus. AU, UK and US correspond to Australian English, United Kingdom English, and American English, respectively. 'UAR' is the unweighted average recall.

Classification system       Recall (%)                 UAR (%)
(N-Gram perplexity)         AU      UK      US
Unigram (N = 1)             5.7     74.2    17.5       32.5
Bigram (N = 2)              22.6    53.9    30.4       35.6
Trigram (N = 3)             22.9    53.9    30.0       35.6

5.1.2.2. LSA based DID system. As another popular text categorization approach, the LSA Cosine Distance Scoring approach is also evaluated. The corresponding results on the UT-Podcast data are detailed in Table 7. Compared with the N-Gram approach, LSA offers much better results, but is still far below the i-Vector audio system as a stand-alone classifier. Once again, this is because the sparse training data cannot support an effective combined feature representation and modeling solution for the text sequence.

Table 7
Performance of the LSA based dialect identification text system on the UT-Podcast corpus. AU, UK and US correspond to Australian English, United Kingdom English, and American English, respectively. 'UAR' is the unweighted average recall.

Classification system            Recall (%)                 UAR (%)
                                 AU      UK      US
LSA Cosine Distance Scoring      55.4    48.3    51.7       51.8
5.1.2.3. TF-IDF based DID system. For the TF-IDF approach, since the overall feature extraction includes pre-processing and various feature representation approaches, we would first like to investigate the optimal feature configuration. The pre-processing step depends on whether stopword removal, stemming, and bigram modeling are employed; with these text pre-processing options there are a total of 8 (= 2^3) combinations, which are detailed in Table 3 (noted as Step 1). Step 2 (i.e., feature representation) has three variations, as listed previously in Table 4 (Section 4.2.3.2). To find the optimal feature configuration, we consider all combinations presented by the above two steps and summarize the corresponding results in Table 8. Based on Tables 3, 4, and 8, we can observe that the optimal text feature configuration is: Pre-processing = 0; Feature representation = 2. That is, a unigram model without stopword removal and stemming should be the combination employed as the final pre-processing operation. Next, the pre-processed text files are used to extract TF-IDF weighting information, which is used as the final feature. Usually, pre-processing techniques such as stopword removal, stemming and higher order N-Grams should yield more discriminative information. However, both the training and test text data are very limited in this study, so keeping as much information as possible is more beneficial in this case. Therefore, we will use this optimal configuration for the remainder of this paper.
Fig. 7. Dialect identification system performance comparisons across test duration scenarios for English dialects from the NIST LRE-2009 corpus. Here the text system is the TF-IDF logistic regression system, and the audio system is the i-Vector solution. Four duration scenarios are shown: 30 s, 10 s, 3 s, and 'Mixed' (i.e., the mixture of the three duration scenarios). The metric is Cavg (%).
Table 8
Optimization of pre-processing and feature representation for text-based dialect identification, in terms of Cavg (%). The 8 pre-processing options from Table 3 and the 3 feature representations from Table 4 are considered. Optimal configuration is: Pre-processing = 0; Feature representation = 2.

                    Feature representation
Pre-processing      0        1        2
0                   34.7     33.1     31.0
1                   34.2     33.5     33.2
2                   33.5     31.4     32.0
3                   34.4     33.2     32.2
4                   35.6     35.2     34.6
5                   36.0     36.6     35.3
6                   36.2     37.1     37.5
7                   36.2     37.1     36.3
For convenience of comparison, the performance of all text based dialect classification systems is summarized in Table 9. We note that neither the N-Gram perplexity nor the LSA Cosine Distance Scoring approach individually provides improved performance in our case. We attribute the poor performance to the relatively limited amount of training data, since each training audio file is approximately 17 s long on average and contains about 49 words. However, the TF-IDF logistic regression approach provides competitive performance with the same limited training data. Therefore, TF-IDF logistic regression is adopted as the only text-based approach for the remainder of this study.
5.1.3. Final proposed DID system
Based on our previous observations, our final DID system includes the i-Vector Gaussian Backend-based audio system and the TF-IDF logistic regression-based text system. Their fusion results are also summarized in Table 9. Compared with the i-Vector audio system, the proposed audio-text fusion system improves Cavg relatively by +6.8%. Although the TF-IDF logistic regression text system also suffers from the data sparsity in modeling UK English, it shows the best modeling capability for Australian English. This differs from the i-Vector system, which best models American English.
Table 9
Performance comparison of different dialect identification approaches on UT-Podcast. AU, UK and US correspond to Australian English, United Kingdom English, and American English, respectively. 'UAR' is the unweighted average recall.

System #   Classification system                                       Recall (%)              UAR (%)   Cavg (%)
                                                                       AU      UK      US
0          Audio System (Baseline: GMM)                                85.5    32.6    62.9    60.3      29.7
1          Audio System (i-Vector)                                     78.0    61.8    83.8    74.5      19.1
2          Text System (N-Gram Perplexity)                             22.6    53.9    30.4    35.6      54.0
3          Text System (LSA Cosine Distance Scoring)                   55.4    48.3    51.7    51.8      36.2
4          Text System (TF-IDF logistic regression)                    83.1    32.6    60.4    58.7      31.0
5          Proposed Audio-Text system (Fusion of Systems 1 and 4)      86.1    60.7    82.1    76.3      17.8
The i-Vector acoustic system maps an utterance with a particular phoneme distribution into a lower dimensional space, but it ignores the time/context relationships among the phonemes and words of the utterance. This suggests that the i-Vector system is better suited to capturing a global image of the utterance in the dialect space, but may have difficulty modeling the local details which can help identification. The TF-IDF text system focuses on the word level and can capture semantic details to assist modeling. Due to the high similarity among dialects, many words may be shared and the resulting system may not be very effective; however, if a salient dialect cue does appear in an utterance, the text approach can capture it and perform an efficient classification. These two different approaches can complement each other, which explains the gain from fusion. The proposed audio-text system offers a relative +40.1% improvement against the GMM baseline system in terms of Cavg.
It is noted that the text based approach is computationally very efficient on UT-Podcast (similar results are also observed on the LRE-2009 corpus). On our hardware, an Intel(R) Xeon(R) CPU X5650, the text approach takes only 2 s, while the i-Vector audio approach requires 922 s.
5.2. NIST LRE-2009 corpus
To further verify the proposed DID approach, we also want to validate its performance on the English dialect portion of the NIST LRE-2009 corpus. The results are summarized in Table 10. First of all, we notice that the GMM baseline system cannot model US English well and its overall performance is the worst. The i-Vector system provides far more balanced performance across the dialects and is especially good at modeling Indian English. Compared with the i-Vector system, the TF-IDF logistic regression based text DID system demonstrates a different decision tendency: it favors US English more. This is caused by the accent mismatch between the US English based ASR engine and the Indian English accent, whereby the difference in acoustic space is automatically translated into the textual space. This can compensate the best individual system (the i-Vector system), and their fusion offers a relative +47.1% improvement over the GMM baseline system in terms of Cavg.
By comparing with the results on UT-Podcast data in Section 5.1 (Table 9), we can observe that the i-Vector based audio system consistently offers the best individual system performance. Due to the heterogeneous nature of the text system, its fusion with the i-Vector audio system can consistently boost performance at trivial computational cost.

Table 10
Comparison of different dialect identification approaches on the NIST LRE-2009 corpus. US and IN correspond to American English and Indian English, respectively. 'UAR' is the unweighted average recall.

System #   Classification system                                        Recall (%)       UAR (%)   Cavg (%)
                                                                        US      IN
0          Audio System (Baseline: GMM)                                 32.4    94.8     63.6      41.2
1          Audio System (i-Vector)                                      60.0    94.4     77.2      22.8
2          Text System (TF-IDF logistic regression)                     73.8    61.3     67.6      32.5
3          Proposed Audio-Text system (Fusion of Systems 1 and 2)       64.9    91.6     78.3      21.8
Since the test data of LRE-2009 have three duration variations (30 s, 10 s, and 3 s), it is also of interest to look more closely at performance in the three test duration scenarios; the results are illustrated in Fig. 7. From Fig. 7 we observe that duration affects performance in a significant way. In the 30 s scenario, the text DID approach has performance comparable to audio as an individual system. However, at shorter durations (i.e., 10 s and 3 s), its performance/gain decreases rapidly. Overall, however, the text approach can consistently help boost DID performance in all three scenarios when the audio and text based solutions are fused. These results are also highlighted in Table 10.
6. Conclusions
This study has presented an integrated audio and text based
solution for dialect identification (DID). Various processing options were considered for both audio and text sub-systems.
The proposed solution provides a cost-effective method to
collect spontaneous and realistic dialect data, which was harvested by downloading from Podcast websites. A range of
important factors were addressed to determine how dialects
differ across a range of speech/language characteristics. The
solution focused on investigating this based on two main approaches: acoustic and text based dialect classification. The
acoustic classification system was a state-of-the-art i-Vector
system, based on MFCC features and a Gaussian Backend
which was shown to consistently outperform the GMM baseline system. Next, a text modeling approach was systematically explored. Two traditional text categorization approaches
(i.e., N-Gram perplexity and LSA cosine similarity) were
adopted but failed to provide sufficient individual performance
gains as stand-alone solutions. Motivated by these results, a
new text based DID approach was investigated using TF-IDF
logistic regression, which not only individually demonstrated
comparable performance to the audio system (i-Vector system), but also demonstrated a degree of complementarity performance, making the final integrated audio-text system more
discriminative. The proposed system was shown to improve
the Cavg performance by +40.1% and +47.1% relatively on
the UT-Podcast and LRE-2009 corpora, respectively. By comparing with the best individual system (i-Vector system), the
audio-text system boosts performance by +6.8% and +4.4%
on the UT-Podcast and LRE-2009 corpora, respectively. This
benefit was achieved at a far less computational cost than
the best individual system (more than 400 times faster than
the i-Vector system). This observation is meaningful since
it suggests that it is possible to leverage broader internet
text data at a trivial cost to further boost overall DID performance. It was also observed that comparable performance
can be offered by TF-IDF text approach when the duration of
test file is 30 s. For shorter test duration scenarios, the TFIDF approach is less competitive but can still offer consistent
complementary gains. All results demonstrate that a multilevel DID approach which leverages information sources including audio and text is beneficial. The effective performance gains from the innovative combination of audio and
text information suggest that deep data mining of the available data can help boost performance on machine learning
systems.
The algorithms developed in this study have led to advancements in the area of dialect identification. While these algorithms have resulted in effective performance for English dialects, it is not claimed that the resulting solution is optimal or the final contribution in dialect classification. Other strategies could be considered to improve overall dialect classification rates; for example, deep neural networks (Montavon, 2009; Liu, 2010; Glorot et al., 2011; Ngiam et al., 2011; Hinton et al., 2012) could be applied to both audio and text at the feature learning stage. It is suggested that these be considered in future studies, in the context of the algorithms developed in the current study.
Finally, in an effort to contribute to the research community in dialect recognition, the collected and organized UTDallas-DID corpus will be freely shared with those interested, given that users complete a no-cost license agreement which prevents commercial use or further distribution of the corpus.
Acknowledgment
This project was supported by AFRL through contract FA8750-15-1-0205, and partially by the University of
Texas at Dallas from the Distinguished University Chair in
Telecommunications Engineering held by J. Hansen. The
authors wish to acknowledge that preliminary work on leveraging audio and text was conducted by Rahul Chitturi as part of his MSCS thesis at CRSS-UTDallas and published in (Chitturi and Hansen, 2007; Chitturi and Hansen, 2008; Chitturi, 2008). The work to collect and organize the UT-Podcast corpus stemmed from these earlier studies.
References
Akbacak, M., Hansen, J.H.L., Feb 2007. Environmental sniffing: noise knowledge estimation for robust speech systems. IEEE Trans. Audio Speech
Lang. Process. 15 (2), 465–477.
Angkititrakul, P., Hansen, J.H.L., 2006. Advances in phone-based modeling for automatic accent classification. IEEE Trans. Audio Speech Lang.
Process 14 (2), 634–646.
Angkititrakul, P., Hansen, J.H.L., 2007. Discriminative in-set/out-of-set
speaker recognition. IEEE Trans. Audio Speech Lang. Process. 15 (2),
498–508 Feb.
Antoine, J.Y., 1996. Spontaneous speech and natural language processing.
ALPES: a robust semantic-led parser. In: Proceedings of ICSLP-1996.
Philadelphia, PA, USA, pp. 1157–1160.
Arslan, L.M., Hansen, J.H.L., 1997. A study of temporal features and frequency characteristics in American English foreign accent. J. Acoust. Soc.
Am. 102, 28–40.
Bahari, M.H., Saeidi, R., van Leeuwen, D., 2013. Accent recognition using
i-Vector, Gaussian mean supervector and Gaussian posterior probability
supervector for spontaneous telephone speech. In: Proceedings of IEEE
ICASSP-2013. Vancouver, Canada, pp. 7344–7348.
Behravan, H., Hautamaki, V., Kinnunen, T., 2013. Foreign accent detection
from spoken Finnish using i-vectors. In: Proceedings of INTERSPEECH-2013. Lyon, France, pp. 79–83.
Bellegarda, J.R., 2000. Exploiting latent semantic information in statistical language modeling. Proc. IEEE 88, 1279–1296.
Biadsy, F., Hirschberg, J., Habash, N., 2009. Spoken Arabic dialect identification using phonotactic modeling. In: Proceedings of the EACL 2009
Workshop on Computational Approaches to Semitic Languages, pp. 53–
61.
Bielefeld, B., 1994. Language identification using shifted delta cepstrum. In:
Proceedings of The 14th Annual Speech Research Symposium.
Brümmer, N., Cumani, S., Glembek, O., Karafiát, M., Matejka, P., 2011. Description and analysis of the Brno276 system for LRE2011. In: Proceedings of NIST 2011 Language Recognition Evaluation Workshop. Atlanta,
USA.
Chitturi, R., Hansen, J.H.L., 2007. Multi-stream dialect classification using
SVM-GMM hybrid classifiers. In: Proceedings of IEEE ASRU-2007: Automatic Speech Recognition and Understanding Workshop. Kyoto, Japan,
pp. 431–436.
Chitturi, R., Hansen, J.H.L., 2008. Dialect classification for online podcasts
fusing acoustic and language based structural and semantic information.
In: Proceedings of ACL-08: HCT:Association for Computational Linguistics (ACL): Human Communication Technologies Conf. Columbus, Ohio,
pp. 21–24.
Chitturi, R., 2008. Unsupervised Dialect Classification for Podcast Data based
on Fusing Acoustic and Language Information, Center for Robust Speech
Systems, MSCS Thesis in Computer Science, University of Texas at Dallas.
Choueiter, G., Zweig, G., Nguyen, P., 2008. An empirical study of automatic
accent classification. In: Proceedings of IEEE ICASSP-2008, 1. Las Vegas, NV, pp. 4265–4268.
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W.,
Harshman, R.A., 1990. Indexing by latent semantic analysis. J. Am. Soc.
Inf. Sci. 391–407.
Dehak, N., Kenny, P., Dehak, R., Ouellet, P., Dumouchel, P., 2010. Frontend factor analysis for speaker verification. IEEE Trans. Audio, Speech,
Lang. Process. 19 (4), 788–798.
Dehak, N., Torres-Carrasquillo, P., Reynolds, D., Dehak, R., 2011. Language recognition via i-Vectors and dimensionality reduction. In:
INTERSPEECH-2011, pp. 857–860.
Diakoloukas, V., Digalakis, V., Neumeyer, L., Kaja, J., 1997. Development of dialect-specific speech recognizers using adaptation methods.
In: Proceedings of IEEE ICASSP-1997, 2. Munich, Germany, pp. 1455–
1458.
Genkin, A., Lewis, D.D., Madigan, D., 2007. Large-scale Bayesian logistic
regression for text categorization. Technometrics 49 (3), 291–304.
Glorot, X., Bordes, A., Bengio, Y., 2011. Domain adaptation for large-scale
sentiment classification: a deep learning approach. In: Proceedings of
ICML-2011, pp. 513–520.
Gray, S., Hansen, J.H.L., 2005. An integrated approach to the detection
and classification of accents/dialects for a spoken document retrieval
system. In: Proceedings of IEEE ASRU 2005. San Juan, Puerto Rico,
pp. 35–40.
Greenberg, S., 1997. On the origins of speech intelligibility in the real world.
In: Proceedings of the ESCA Workshop on Robust Speech Recognition
for Unknown Communication Channels. Pont-a-Mousson, France.
Gupta, V., Mermelstein, P., 1982. Effect of speaker accent on the performance
of a speaker-independent isolated word recognition. J. Acoust. Soc. Am.
71, 1581–1587.
Hansen, J.H.L., Hasan, T., Nov. 2015. Speaker recognition by machines and
humans: a tutorial review. IEEE Signal Process. Mag. 32 (6), 74–99.
Hansen, J.H.L., Gray, S., Kim, W., 2010. Automatic voice onset time detection for unvoiced stops (/p/, /t/, /k/) with application to accent classification. Speech Commun. 52, 777–789 Oct.
Hansen, J.H.L., Yapanel, U., Huang, R., Ikeno, A., 2004. Dialect analysis
and modeling for automatic classification. In: INTERSPEECH-2004. Jeju
Island, South Korea, pp. 1569–1572.
Hansen, J.H.L., Huang, R., Zhou, B., Seadle, M., Deller Jr., J.R.,
Gurijala, A.R., Kurimo, M., Angkititrakul, P., 2005. Speechfind: Advances
in spoken document retrieval for a national gallery of the spoken word.
IEEE Trans. Speech Audio Process 13 (5), 712–730.
Hasegawa-Johnson, M., Levinson, S.E., 2006. Extraction of pragmatic and
semantic salience from spontaneous spoken English. Speech Commun.
48 (3-4), 437–462.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N.,
Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B., 2012.
Deep neural networks for acoustic modeling in speech recognition: the
shared views of four research groups. Signal Process. Mag., IEEE 29 (6),
82–97.
Huang, C., Chen, T., Li, S., Chang, E., Zhou, J.-L., 2001. Analysis
of speaker variability. In: Proceedings of European Conference on
Speech Communication and Technology, 2. Aalborg, Denmark, pp. 1377–
1380.
Huang, C., Chen, T., Chang, E., 2004. Accent issues in large vocabulary
continuous speech recognition. Int. J. Speech Technol. 7 (2-3), 141–153.
Huang, R., Hansen, J.H.L., 2005. Dialect/accent classification via boosted
word modeling. In: Proceedings of IEEE ICASSP-2005, pp. I-585–I-588.
Huang, R., Hansen, J.H.L., 2006. Gaussian mixture selection and data selection for unsupervised Spanish dialect classification. In: INTERSPEECH-2006, pp. 445–448.
Huang, R., Hansen, J.H.L., 2007. Dialect classification on printed text using perplexity measure and conditional random fields. In: Proceedings of
IEEE ICASSP-2007, pp. IV-993–IV-996.
Huang, R., Hansen, J.H.L., 2007. Unsupervised discriminative training with
application to dialect classification. IEEE Trans. Audio, Speech, Lang.
Process. 15 (8), 2444–2453.
Humphries, J.J., Woodland, P.C., 1998. The use of accent-specific pronunciation dictionaries in acoustic model training. In: Proceedings of IEEE
ICASSP-1998, 1. Seattle, WA, pp. 317–320.
Jones, K.S., 1972. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28 (1), 11–21.
Kim, W., Hansen, J.H.L., 2007. Advances in speechfind: transcript reliability estimation employing confidence measure based on discriminative
sub-word model for SDR. In: INTERSPEECH-2007. Antwerp, Belgium,
pp. 2409–2412.
Larcher, A., Lee, K.A., Li, H., Ma, B., Sun, H., Tong, R., You, C.H., 2011.
IIR system description for the 2011 NIST language recognition evaluation.
In: Proceedings of NIST 2011 Language Recognition Evaluation Workshop. Atlanta, USA.
Laver, J., 1994. Principles of Phonetics. Cambridge University Press.
Lazaridis, A., Khoury, E., Goldman, J.P., Avanzi, M., Marcel, S., Garner, P.N.,
2014. Swiss French regional accent identification. In: Proceedings of
Odyssey-2014. Joensuu, Finland.
Lei, Y., Hansen, J.H.L., 2011. Dialect classification via text-independent training and testing for Arabic, Spanish, and Chinese. IEEE Trans. Audio,
Speech, Lang. Process. 19 (1), 85–96.
Li, H., Ma, B., Lee, K.A., 2013. Spoken language recognition: from fundamentals to practice. Proc. IEEE 101 (5), 1136–1159.
Li, H., Ma, B., Lee, C.-H., 2007. A vector space modeling approach to
spoken language identification. IEEE Trans. Audio, Speech, Language
Process. 15 (1), 271–284.
Liu, M.-K., Xu, B., Huang, T.-Y., Deng, Y.-G., Li, C.-R., 2000. Mandarin
accent adaptation based on context-independent/ context-dependent pronunciation modeling. In: Proceedings of IEEE ICASSP-2000, 2. Istanbul,
Turkey, pp. 1025–1028.
Liu, G., Lei, Y., Hansen, J.H.L., 2010. A novel feature extraction strategy
for multi-stream robust emotion identification. In: INTERSPEECH-2010.
Makuhari Messe, Japan, pp. 482–485.
Liu, G., Lei, Y., Hansen, J.H.L., 2010. Dialect Identification: Impact of difference between read versus spontaneous speech. In: EUSIPCO-2010.
Aalborg, Denmark, pp. 2003–2006.
Liu, G., Hansen, J.H.L., 2011. A systematic strategy for robust automatic
dialect identification. In: EUSIPCO-2011. Barcelona, Spain, pp. 2138–
2141.
Liu, G., Yu, C., Misra, A., Shokouhi, N., Hansen, J.H.L., 2014. Investigating
state-of-the-art speaker verification in the case of unlabeled development
data. In: Odyssey, pp. 118–122.
Liu, G., Yu, C., Shokouhi, N., Misra, A., Xing, H., Hansen, J.H.L.,
2014. Utilization of unlabeled development data for speaker verification. In: Proceedings of IEEE Spoken Language Technology Workshop (SLT 2014). South Lake Tahoe, Nevada, Dec 7-10, 2014, pp. 418–
423.
Liu, G., Zhang, C., Hansen, J.H.L., 2012. A linguistic data acquisition frontend for language recognition evaluation. In: Odyssey-2012. Singapore.
Liu, G., Hansen, J.H.L., 2014. Supra-segmental feature based speaker trait
detection. In: Odyssey-2014. Joensuu, Finland.
Liu, G., Sadjadi, S.O., Hasan, T., Suh, J.W., Zhang, C., Mehrabani, M.,
Boril, H., Hansen, J.H.L., 2011. UTD-CRSS systems for NIST language
recognition evaluation 2011. In: Proceeding of NIST 2011 Language
Recognition Evaluation Workshop. Atlanta, USA.
Liu, T., 2010. A novel text classification approach based on deep belief
network. In: Neural Information Processing. Theory and Algorithms,
pp. 314–321.
Ma, B., Zhu, D., Tong, R., May 2006. Chinese dialect identification using
tone features based on pitch flux. In: Proceedings of IEEE ICASSP-2006.
Toulouse, France, pp. 1029–1032.
Martínez, D., Plchot, O., Burget, L., Glembek, O., Matejka, P., 2011. Language recognition in iVectors space. In: INTERSPEECH-2011. Firenze,
Italy, pp. 861–864.
Montavon, G., 2009. Deep learning for spoken language identification. In:
Proceedings of NIPS Workshop on Deep Learning for Speech Recognition and Related Applications.
Muysken, P., 1995. Code-switching and grammatical theory. In: Milroy, L.,
Muysken, P. (Eds.), One Speaker, Two Languages: Cross-disciplinary Perspectives on Code-switching. Cambridge University Press, Cambridge,
pp. 177–198.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A., 2011. Multimodal
deep learning. In: ICML-2011, pp. 689–696.
NIST LRE 2007, http://www.itl.nist.gov/iad/mig/tests/lre/2007/LRE07EvalPlan-v8b.pdf. (accessed 17.01.16).
NIST LRE 2009, http://www.itl.nist.gov/iad/mig/tests/lre/2009/LRE09_
EvalPlan_v6.pdf. (accessed 17.01.16).
NIST LRE 2011, http://www.nist.gov/itl/iad/mig/upload/LRE11_EvalPlan_
releasev1.pdf. (accessed 17.01.16).
NIST LRE 2015, http://www.nist.gov/itl/iad/mig/upload/LRE15_EvalPlan_
v20_r1.pdf. (accessed 17.01.16).
Purnell, T., Idsardi, W., Baugh, J., 1999. Perceptual and phonetic experiments
on American English dialect identification. J. Lang. Soc. Psychol. 18 (1),
10–30.
Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text
retrieval. Inf. Process. Manag. 24 (5), 513–523.
Sangwan, A., Hansen, J.H.L., 2012. Automatic analysis of Mandarin accented
English using phonological features. Speech Commun. 54 (1), 40–54 Jan.
Shen, W., Reynolds, D., 2008. Improved GMM-based language recognition
using constrained MLLR transforms. In: Proceedings of IEEE ICASSP-2008. Las Vegas, NV, USA, pp. 4149–4152.
Sohn, J., Kim, N.S., Sung, W., 1999. A statistical model-based voice activity
detection. IEEE Signal Process. Lett. 6 (1), 1–3.
Sourceforge, 2015, http://sourceforge.net/projects/cmusphinx/files/Acoustic%
20and%20Language%20Models/US%20English%20HUB4%20Language
%20%20Model. (accessed 17.01.16).
Torres-Carrasquillo, P.A., Gleason, T.P., Reynolds, D.A., 2004. Dialect identification using Gaussian mixture models. In: Odyssey-2004. Toledo, Spain.
Torres-Carrasquillo, P.A., Singer, E., Gleason, T., McCree, A.,
Reynolds, D.A., Richardson, F., Sturim, D., 2010. The MITLL NIST
LRE 2009 language recognition system. In: Proceedings of IEEE
ICASSP-2010. Dallas, TX, USA, pp. 4994–4997.
Trudgill, P., 1999. The Dialects of England, 2nd edition. Blackwell Publishers
Ltd, Oxford, UK.
Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf,
P., Woelfel, J., 2004. Sphinx-4: A Flexible Open Source Framework for Speech Recognition. Technical Report SMLI TR2004-0811, Sun Microsystems Inc.
Ward, W., Krech, H., Yu, X., Herold, K., Figgs, G., Ikeno, A., Jurafsky, D.,
Byrne, W., 2002. Mandarin accent adaptation based on contextindependent/ context-dependent pronunciation modeling. In: Proceedings
of ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation.
Estes Park, CO.
Wells, J.C., 1982. Accents of English. vol. I, II, III. Cambridge University
Press, Cambridge, UK.
William, F., Sangwan, A., Hansen, J.H.L., 2013. Automatic accent assessment using phonetic mismatch and human perception. IEEE Trans. Audio, Speech Lang. Process. 21 (9), 1818–1828 Sept.
Xia, R., Liu, Y., 2012. Using i-Vector space model for emotion recognition.
In: INTERSPEECH-2012. Portland, USA, pp. 2230–2233.
Yu, C., Liu, G., Hansen, J.H.L., 2014. Acoustic feature transformation using UBM-based LDA for speaker recognition. In: ISCA Interspeech.
Singapore, pp. 1851–1854.
Yu, C., Liu, G., Hahm, S., Hansen, J.H.L., 2014. Uncertainty propagation in
front end factor analysis for noise robust speaker recognition. In: IEEE
ICASSP. Florence, pp. 4045–4049.
Zhang, C., Liu, G., Yu, C., Hansen, J.H.L., 2015. i-Vector based physical
task stress detection with different fusion strategies. In: Interspeech 2015.
Dresden, Germany.
Zhang, Q., Liu, G., Hansen, J.H.L., 2014. Robust language recognition based
on diverse features. In: ISCA Odyssey-2014. Joensuu, Finland, pp. 152–
157.
Zissman, M.A., Gleason, T.P., Rekart, D.M., Losiewicz, B.L., 1996. Automatic dialect identification of extemporaneous conversational, Latin
American Spanish speech. In: IEEE ICASSP-1996, 2, pp. 777–780.
Zissman, M.A., 1996. Comparison of four approaches to automatic language
identification of telephone speech. IEEE Trans. Speech Audio Process 4
(1), 31–44.
Zissman, M.A., Berkling, K.M., 2001. Automatic language identification.
Speech Commun. 35 (1-2), 115–124.