Dialect Classification via Text-Independent Training and Testing for Arabic, Spanish, and Chinese

Yun Lei, Student Member, IEEE, and John H. L. Hansen, Fellow, IEEE

Abstract—Automatic dialect classification has emerged as an important area in the speech research field. Effective dialect classification is useful in developing robust speech systems, such as speech recognition and speaker identification. In this paper, two novel algorithms are proposed to improve dialect classification for text-independent spontaneous speech in the Arabic and Spanish languages, along with probe results for Chinese. The problem considers the case where no transcripts, but only dialect labels, are available for training and test data, and speakers are speaking spontaneously; this is defined as text-independent dialect classification. The Gaussian mixture model (GMM) is used as the baseline system for text-independent dialect classification. The major motivation is to suppress confused/distractive regions of the dialect language space and emphasize discriminative/sensitive information of the available dialects. In the training phase, a symmetric version of the Kullback–Leibler divergence is used to find the most discriminative GMM mixtures (KLD-GMM), where the confusable acoustic GMM region is suppressed. For testing, the more discriminative frames are detected and used based on where the frames fall in the GMM mixture feature space, which is termed frame selection decoding (FSD-GMM). The KLD-GMM and FSD-GMM techniques are shown to improve dialect classification performance for three-way dialect tasks. The two algorithms and their combination are evaluated on dialects of Arabic and Spanish corpora. Measurable improvement is achieved in both cases over a generalized maximum-likelihood estimation GMM baseline (MLE-GMM).

Index Terms—Arabic dialects, dialect classification, frame selection, Gaussian mixture, Kullback–Leibler divergence, Spanish dialects.

I. INTRODUCTION

Dialect classification, or dialect identification as it is sometimes called, is an emerging research topic in the speech recognition community, because dialect is one of the most important factors, next to gender, that influences speech recognition performance [1]–[4]. Automatic dialect classification is important for characterizing speaker traits [5] and knowledge estimation, which can then be employed to build dynamic lexicons by selecting alternative pronunciations [6], generate pronunciation modeling via dialect adaptation [7], or train [8] and adapt [9] dialect-dependent acoustic models. Dialect knowledge is also helpful for data mining and spoken document retrieval [10], [11].
In this paper, the definition employed for the term dialect is: a pattern of pronunciation and/or vocabulary of a language used by the community of native speakers belonging to some geographical region.1 For example, Cuban Spanish and Peruvian Spanish are two dialects of Spanish; American English and U.K. English are two dialects of English. Here, we refer to American English and U.K. English as parent family-tree dialects, while dialects such as Cambridge, Belfast, or Cardiff are represented as subclasses under the U.K. family tree. It is noted that slight differences in the definition of dialect exist across research studies, depending on their perspective of the problem, linguistics, or speech engineering goals.

1 Dialect in this context refers to regional dialects of a language. Social, as well as economic-based, dialects also exist in languages/countries. Such studies consider problems of the origin and diffusion of linguistic change, the nature of stylistic variation in language use, and the effect of class structure on linguistic variation within a speech community. Such issues are not addressed in the present study, but we note the existence of such work in the field of sociolinguistics [12].

In previous studies, it has been shown that isolated words as well as individual phonemes can be successfully used for dialect classification [13], [14]. Utterance-based dialect classification presents two different text scenarios: constrained and unconstrained. If transcripts are available, supervised word-based dialect classification is suggested. This method turns the text-independent dialect classification problem into a text-dependent one by comparing a range of given words output by an automatic speech recognizer (ASR), and has been shown to obtain very high accuracy [15]. A context-adaptive training (CAT) algorithm has also been applied for cases where the training data set is very small [16]. In general, most conversational dialect data is unconstrained, since transcript information is expensive to produce. In the present framework, typically no text, speaker, or gender information except the dialect label is available for the data, and therefore a text-independent algorithm must be formulated. Alternatively, a Gaussian mixture model (GMM)-based classifier can be applied for unconstrained data [17]. Several successful methods have also been proposed based on reducing model confusion to achieve better performance for dialect classification. For example, training data selection and Gaussian mixture selection [18] based on the training corpus attempt to exclude or balance the confusion region; minimum classification error (MCE) [19] training, as a common discriminative training method, can also be applied to reduce model confusion. In a manner similar to MCE, maximum mutual information (MMI) has been applied successfully for language and accent identification [20], [21]. Factor analysis [22]–[24], constrained maximum-likelihood linear regression (CMLLR) [25], and vocal tract length normalization (VTLN), as methods for variability compensation, have all been successfully applied for language identification. While they are all general compensation techniques, they could also be applied for dialect classification.
Factor analysis, especially based on the eigenchannel model, can be used to describe the channel variability, which can influence dialect classification; CMLLR can also be used to compensate for the channel, but uses the assumption that the mean and covariance parameters are governed by one transform per class; VTLN, as an approach to normalizing speaker characteristics, can suppress the influence of speakers in dialect classification. In addition to the acoustic domain, the vocabulary and grammar differences of dialects can also be studied and applied for dialect classification [26].

The focus in this study is to identify and emphasize those traits that are most discriminative across dialects of a common language. A GMM is used to represent the acoustic space of the dialects. The hypothesis considered here is that some mixtures are significantly different among the dialects, which helps to classify the dialects, while others possess information that is dialect neutral. In the training phase, an algorithm based on the symmetrized KL divergence (KL2) [27] is employed to assess the dialect-dependent mixtures in order to enhance overall dialect discrimination (KLD), while suppressing dialect-neutral mixtures. The training phase, however, is not the only phase which can benefit from improved dialect modeling for classification. Following the same concept used for mixture division, the frames in the decoding phase can also be divided into two classes, dialect-dependent and dialect-neutral frames, based on the importance of the frames for dialect classification performance. Effective selection of dialect-dependent frames, while setting aside dialect-neutral frames, has a similar impact as seen for mixtures [frame selection decoding (FSD)].

This paper is organized as follows. The next section begins with a brief introduction of the GMM-based dialect classification system (Section II), followed by a discussion of training and testing techniques in Section III. The proposed training technique, KLD, is presented in Section III-A; the testing technique, FSD, is proposed in Section III-B. Section IV presents a series of experimental results with a comparison of the proposed methods to the traditional maximum-likelihood (ML) method. Finally, research findings are summarized along with a discussion of their impact in Section V.

II. GMM-BASED CLASSIFICATION ALGORITHM

In this paper, only text-independent classification is considered, since it is assumed that no transcripts are available for either training or test data. The GMM classifier, a soft Bayes classifier, has been successfully applied for speech-related classification tasks such as text-independent speaker recognition [28] and dialect classification [17]. Here, a GMM-based dialect classification algorithm is employed as the baseline system. Fig. 1 shows the flow diagram of the baseline GMM training process, where a closed set of dialects is considered. Each dialect GMM is trained with spontaneous speech data from that dialect. The training method is generalized maximum-likelihood estimation (MLE) employing the expectation–maximization (EM) algorithm [29], [30]. In the training phase, silence frames are first removed from the input audio stream using an energy threshold, followed by MFCC feature extraction. For each dialect, gender-dependent GMM models are constructed.

Fig. 1. Baseline MLE-GMM text-independent dialect training system.
Fig. 2. Baseline GMM text-independent dialect testing system.

The test phase is shown in Fig. 2, where silence removal and feature extraction steps are applied prior to dialect classification.
The details of the model formulation are described in the experimental section. To avoid the influence of gender and to emphasize dialect classification, gender information is assumed to be known, so gender classification is not considered here.

In general, dialect classification is considered to be similar in some respects to language identification, and a number of successful techniques for language identification could be applicable to dialect classification. For example, there are many methods based on phone recognition, such as phone recognition and language modeling (PRLM), parallel PRLM (PPRLM), and language-dependent parallel phone recognition (PPR) [31]. Also, a support vector machine (SVM), SVM phone recognition, or an SVM using a GMM supervector kernel could be applied to achieve good language ID performance [32]. Maximum mutual information (MMI), as a general discriminative learning method, also achieves significant improvement for language identification. The scope of research work in language identification is quite large compared with that in dialect identification. As such, more focused studies have been pursued in language identification which explore the minimization of issues such as microphone variability (factor analysis, CMLLR) and vocal tract length/speaker differences (VTLN). For the field of dialect identification, it is more important to first establish competitive solutions before such non-dialect-dependent variability can be effectively addressed. The lack of extensive dialect corpora in the field is one reason for the lack of research progress. The focus of this study is to develop better algorithms than the standard MLE for text-independent dialect classification by emphasizing dialect-specific traits. Therefore, this paper uses the MLE-GMM algorithm as the baseline system. Also, the focus is on dialect classification, and not on minimizing non-dialect-dependent variability. As such, it is possible to further improve actual classification scores if such additional processing is also included. This issue is suggested for future work.
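To make the baseline concrete, the following Python sketch outlines the MLE-GMM pipeline of Figs. 1 and 2 under stated assumptions: it is not the authors' implementation, the silence-removal threshold and all function names are ours, the library calls follow librosa and scikit-learn conventions, and c0 is used as a stand-in for the log-energy term of the 26-dimensional feature vector described in Section IV.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(y, sr=16000, silence_db=30.0):
    """Energy-based silence removal followed by 26-dim MFCC + delta features."""
    n_fft, hop = int(0.02 * sr), int(0.01 * sr)           # 20-ms frame, 10-ms skip
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,    # c0 stands in for log energy
                                n_fft=n_fft, hop_length=hop)
    feat = np.vstack([mfcc, librosa.feature.delta(mfcc)]).T   # (frames, 26)
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    n = min(len(feat), len(db))
    return feat[:n][db[:n] > -silence_db]                 # keep frames within 30 dB of peak

def train_dialect_models(train_audio, n_mix=256):
    """train_audio: {dialect_label: [waveform, ...]}; one diagonal GMM per dialect (EM/MLE)."""
    models = {}
    for dialect, clips in train_audio.items():
        X = np.vstack([extract_features(y) for y in clips])
        models[dialect] = GaussianMixture(n_components=n_mix,
                                          covariance_type="diag").fit(X)
    return models

def classify(models, y):
    """Score one test utterance by average per-frame log-likelihood; pick the best dialect."""
    X = extract_features(y)
    return max(models, key=lambda d: models[d].score_samples(X).mean())
```

With equal dialect priors, picking the maximum average log-likelihood is equivalent to the soft Bayes decision described above.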
III. TRAINING AND TESTING FOR TEXT-INDEPENDENT DIALECT CLASSIFICATION

Although dialect classification is similar to language identification, there are some differences. Language identification attempts to determine the language in which the speech was spoken. Normally, different languages have different phonemes, vocabulary, and grammar, as well as different pronunciations. Also, boundaries between languages are generally quite distinct, and easier to recognize perceptually. Dialects, especially subclass dialects (e.g., Cardiff, Belfast, Cambridge), are more subtle and less perceptually distinguishable. For the dialect case, differences among dialects of a language are usually smaller than those between languages in terms of grammar, pronunciation, and vocabulary selection. Also, there is less formal documentation available (i.e., it is easy to obtain dictionaries of the English, German, and Spanish languages, but very difficult to obtain a Belfast U.K. English dictionary versus a Cardiff U.K. English dictionary). In the acoustic space, it is suggested that the acoustic/linguistic distance between dialects is usually much closer than the distance between languages, and therefore there should be more overlap among dialects. In fact, the study by Mehrabani and Hansen [33] has illustrated this for dialect and language separation.

The proposed method here employs a two-step process. In the first technique, the focus is to find and remove the confusable region of the dialect model in the training phase. This technique, a training method, is called Gaussian Mixture Selection by KL2 Divergence (KLD-GMM). Here, a GMM is used to represent the acoustic space, where the individual mixtures of the GMM represent different regions of that space. It is assumed that some mixtures contribute to effective dialect classification, while other mixtures distract the model from effective dialect classification. The technique therefore classifies mixtures as contributing or distracting, and retains only the contributing mixtures. The second technique is the testing method, entitled Frame Selection Decoding (FSD). This technique finds and removes the confusable region of the frames: in testing, the frames are classified into contributing and distracting parts, with only the contributing frames retained for classification.

A. Training Algorithm: Gaussian Mixture Selection by KL2 Divergence

Assuming a single GMM is employed to describe one dialect, each Gaussian mixture component is expected to contribute to an individual part of the dialect acoustic space. We say individual parts since the covariance matrix is diagonal for each mixture component. Although there is no direct one-to-one mapping from individual mixtures to individual phones, we employ the following example mapping of mixtures to phones to explain why and how the mixtures are classified, since the selection of the mixture count is typically based on an approximation of the number of phonemes in the system. However, when the mixtures are actually classified, only the distances among the mixtures are measured, not phone labels (since these are not known), which means the mixtures can represent phones, vocal tract configurations, or even more general speaker/speech properties.

In this example, the pronunciation of some phonemes in different dialects can be similar from both an MFCC feature perspective as well as perceptually. If pronunciations of a phone are similar in different dialects, then the mixtures which represent this phone will not contribute to dialect classification, if we assume individual phones are represented by particular GMM mixtures. Alternatively, if the pronunciations of a particular phone are very different across dialects, then these mixtures will contribute to dialect classification. Clearly, the portion of the mixtures which represent the same phone will be similar across different dialects; these mixtures therefore do not contribute to improving overall dialect classification. However, the portion of the mixtures representing phones that are different will emphasize the separation between dialects. Fig. 3 illustrates these two scenarios, with two phones shown in each case. In Fig. 3(a), there are limited changes for the first phone between the dialect and anti-dialect models; however, notable differences exist for the second phone from the dialect to the anti-dialect model. Here, the anti-dialect model can either represent another dialect, or a composition of dialects expected to compete with the target dialect. Therefore, almost all mixtures representing the second phone have changed.
In Fig. 3(b), a portion of the mixtures are similar between the dialect and anti-dialect models, while some mixtures of the first phone change from the dialect to the anti-dialect model. Similarly, part of the mixtures of the second phone also change from the dialect to the anti-dialect model. The algorithm therefore results in two sets of mixtures: dialect-sensitive mixtures (i.e., discriminative mixtures) and neutral mixtures (i.e., general mixtures), where the latter are suppressed to decrease confusion between dialects.

Fig. 3. (a) One phoneme shows similar structure between the dialect and anti-dialect models, while the other shows key mixture differences between the dialect and anti-dialect models. (b) Subregions of the mixture space of both phonemes differ from the dialect to the anti-dialect models.

A method, however, is needed to measure the similarity between these mixtures. A symmetric version of the KL divergence (KL2) is appropriate for this task, since it is often used as a measure of similarity between two density distributions. The KL2 divergence between two probability density functions $f$ and $g$ is defined as [34]

$D_{KL2}(f, g) = D_{KL}(f \| g) + D_{KL}(g \| f)$   (1)

where $D_{KL}(f \| g)$ is the KL divergence from probability density function (pdf) $f$ to $g$, and $D_{KL}(g \| f)$ is the KL divergence from pdf $g$ to $f$. In general, it is difficult to calculate the KL2 divergence between two arbitrary distributions. For the case of GMMs, however, all mixtures are typically Gaussian distributions. Fortunately, the KL divergence between two Gaussian distributions has a closed-form expression

$D_{KL}(f \| g) = \frac{1}{2}\left[\log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{tr}(\Sigma_g^{-1}\Sigma_f) + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - d\right]$   (2)

where $\Sigma_f$ and $\Sigma_g$ are the covariance matrices of $f$ and $g$, $\mu_f$ and $\mu_g$ are the means of $f$ and $g$, $|\cdot|$ is the determinant, $\mathrm{tr}(\cdot)$ is the trace, and $d$ is the feature dimension. The preceding equations describe the KL2 divergence between any two Gaussian distributions. Assuming the covariance matrices are diagonal, the KL2 divergence can be calculated using only the mean and variance of the Gaussian distributions. Since MFCC features are employed for the dialect system, and the cross-correlation values between MFCC feature dimensions can be assumed to be zero, the diagonal assumption employed here is both valid and eases the overall computational analysis. It is noted that the complete GMM model is described by three parameters, mixture weight, mean, and variance:

$p(x) = \sum_{i=1}^{M} w_i\, p_i(x)$   (3)

where $M$ is the number of mixtures, $w_i$ is the mixture weight, and $p_i(x)$ is the Gaussian pdf with mean vector $\mu_i$ and variance vector $\sigma_i^2$ in the GMM model. Therefore, it is reasonable to add the mixture weight into the KL2 divergence calculation for the distance measurement between GMMs. Here, the weight $w$ is attached to the Gaussian distributions, with a redefinition of the functions

$\tilde{f}(x) = w_f f(x)$   (4)
$\tilde{g}(x) = w_g g(x)$   (5)

The KL divergences from function $\tilde{f}$ to function $\tilde{g}$, and from function $\tilde{g}$ to function $\tilde{f}$, are updated and recalculated with the three GMM parameters (mixture weight, mean, and variance) as follows:

$D_{KL}(\tilde{f} \| \tilde{g}) = w_f\left[\log\frac{w_f}{w_g} + D_{KL}(f \| g)\right]$   (6)

where $D_{KL}(f \| g)$ is given by (2) and $d$ therein is the feature dimension. The new KL2 divergence between $\tilde{f}$ and $\tilde{g}$ can then be recalculated by (1). Here, each Gaussian mixture of the dialect model or anti-dialect model is defined to be the function $\tilde{f}$ or $\tilde{g}$ in (4) and (5). Since each GMM has multiple Gaussian mixtures, the KL2 divergences between every individual Gaussian mixture of the dialect model and every mixture of the anti-dialect model are calculated, resulting in an $M \times M$ KL2 divergence matrix, where $M$ is the number of mixtures in the GMM. Here, let the element $d_{ij}$ of the matrix be defined as the KL2 divergence between mixture $i$ from the dialect model and mixture $j$ from the anti-dialect model.
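As a worked sketch of (1)–(6) for the diagonal-covariance case, the following Python functions compute the weight-attached KL2 divergence and fill the $M \times M$ matrix. The variable names are ours, the model objects are assumed to expose sklearn-style `weights_`/`means_`/`covariances_` attributes, and the algebraic form of (6) is our reconstruction from the surrounding text.

```python
import numpy as np

def kl_diag(mu_f, var_f, mu_g, var_g):
    """One-sided KL divergence KL(f||g) between diagonal-covariance Gaussians, per eq. (2)."""
    return 0.5 * np.sum(np.log(var_g / var_f) + var_f / var_g
                        + (mu_f - mu_g) ** 2 / var_g - 1.0)

def kl2_weighted(wf, mu_f, var_f, wg, mu_g, var_g):
    """Symmetric KL2 with mixture weights attached, following eqs. (1) and (4)-(6)."""
    d_fg = wf * (np.log(wf / wg) + kl_diag(mu_f, var_f, mu_g, var_g))
    d_gf = wg * (np.log(wg / wf) + kl_diag(mu_g, var_g, mu_f, var_f))
    return d_fg + d_gf

def kl2_matrix(gmm_d, gmm_a):
    """M x M matrix of KL2 divergences d_ij between every mixture i of the dialect
    model gmm_d and every mixture j of the anti-dialect model gmm_a."""
    M = len(gmm_d.weights_)
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            D[i, j] = kl2_weighted(gmm_d.weights_[i], gmm_d.means_[i], gmm_d.covariances_[i],
                                   gmm_a.weights_[j], gmm_a.means_[j], gmm_a.covariances_[j])
    return D
```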
First, the mixture pair with the minimal KL2 divergence in the matrix is considered. Next, all elements in the corresponding row and column of the matrix are set aside. The process is repeated over the remaining rows and columns, considering each successive minimal mixture pair in turn. Therefore, the resulting range $\{D_1, D_2, \ldots, D_M\}$ represents the KL2 divergence values for the mixture pairs ranked in ascending order. The proposed method designates those mixtures at the beginning of the range as general mixtures, with all others higher in the list tagged as discriminative mixtures. To determine whether mixture $k$, defined as any element in the range $\{D_1, \ldots, D_M\}$, is general or discriminative, let us define

$T_k = \begin{cases} 1, & k > N_T \cdot M \\ 0, & k \le N_T \cdot M \end{cases}$   (7)

where “1” signifies a discriminative mixture, “0” a general mixture, and $k$ represents the index of the $k$th mixture in the range. The value $N_T$ is the relative threshold which represents the upper bound on the fraction of general mixtures, where the range for $N_T$ is $[0, 1]$. For testing, the probabilities of the general mixtures are not calculated, since from a dialect perspective these do not contribute to dialect discrimination. In addition, an upper bound represented by the constant $C$ is needed to ensure that mixtures with sufficient divergence are retained as discriminative mixtures for the case where the dialect difference is very significant: any mixtures with KL2 divergences greater than the upper bound constant $C$ are tagged as discriminative mixtures.

Fig. 4 shows the flow diagram of the KL2 divergence-based discriminative training process, KLD. For each dialect, the dialect model and anti-dialect model are trained. The function of the processing block “KLD SELECTION” in Fig. 4 is to designate mixtures of the dialect model with the mixture selection algorithm formulated above, with results saved in a “TAG FILE.” In this file, all mixtures are tagged as one of two classes: discriminative mixtures or general mixtures. If the discriminative testing process formulated in the next subsection is not included, then the testing process is equivalent to the baseline, with the exception that only the discriminative mixtures are used instead of the entire GMM models.

Fig. 4. Training strategy based on Gaussian mixture selection by KL2 divergence (KLD-GMM).
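The greedy pair ranking and the tagging rule of (7) might be realized as follows. This is a sketch under our reading of the selection step, with the interaction between the rank threshold $N_T$ and the divergence bound $C$ implemented as described above (a low-ranked pair is still tagged discriminative if its divergence exceeds $C$); function and argument names are ours.

```python
import numpy as np

def rank_mixture_pairs(D):
    """Greedy pairing: repeatedly take the minimum entry of the KL2 matrix and
    retire its row and column, yielding pairs in ascending divergence order."""
    D = D.copy()
    pairs = []
    for _ in range(D.shape[0]):
        i, j = np.unravel_index(np.nanargmin(D), D.shape)
        pairs.append((i, j, D[i, j]))
        D[i, :] = np.nan          # mixture i of the dialect model is now paired
        D[:, j] = np.nan          # mixture j of the anti-dialect model is now paired
    return pairs                  # [(dialect_mix, anti_mix, kl2), ...], ascending

def tag_mixtures(pairs, n_t=0.2, c=0.006):
    """Eq. (7): the first N_T * M pairs are 'general' (tag 0) unless their divergence
    already exceeds the upper bound C; all remaining pairs are 'discriminative' (tag 1)."""
    M = len(pairs)
    tags = np.ones(M, dtype=int)              # indexed by dialect-model mixture id
    for rank, (i, _, d) in enumerate(pairs):
        if rank < n_t * M and d <= c:
            tags[i] = 0                       # general mixture: suppressed at test time
    return tags
```

The returned tag vector plays the role of the “TAG FILE” above; at test time the likelihood is accumulated over the tagged discriminative mixtures only.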
Next, since the size of the dialect corpus is typically small, model adaptation is generally considered to address this problem. To apply adaptation for the GMM, a universal background model (UBM)2 must first be trained. The UBM can be trained from another corpus, or from the entire multiple-dialect corpus. If the UBM is trained from a separate corpus, it must be of the same language as the dialects under evaluation, and should include a sampling of dialects. In this paper, all data in the dialect corpus is used to train the base UBM. In this case, no new or parallel corpus is needed to train the UBM, and the dialect corpus will typically be balanced across the dialects of interest, assuming a balance in the original training corpus. After training the UBM, the dialect model can be adapted from the UBM using data from the particular dialect. Here, MAP adaptation is considered to generate a dialect-dependent model from the UBM. The proposed KLD algorithm can also be applied to the dialect and anti-dialect models which are adapted from the UBM. However, since the dialect and anti-dialect models are derived from the UBM, the projection between the dialect model and anti-dialect model can be simplified. If the mixtures of the UBM are indexed from 1 to $M$, it is possible to record and track each index during model adaptation, so that mixtures with the same index in the dialect and anti-dialect models can be paired directly instead of deriving the pairing from the KL2 matrix. In this case, the calculation of the KL2 matrix is removed, and the meaning of the pairs between dialect and anti-dialect models becomes clearer. The simplified KLD-GMM algorithm can only be used for models which have been adapted from the UBM.

2 A UBM is a standard GMM for representing large numbers of speakers, typically subjects outside the evaluation set, as in open-set speaker recognition.

In the KLD-GMM algorithm, it is important to note that the pdf weights in the discriminative part are not re-normalized after removal of the general mixtures. The reason for this is that the discriminative mixtures more accurately represent the target dialect in the discriminative acoustic space, while the general mixtures represent the portion confused with the competing dialects. At some level, the resulting sum of the discriminative mixture weights reflects the true separation of the dialect against its neighbors. Since the a priori dialect probabilities are unknown, they are assumed to be equal, and therefore it is appropriate to employ the likelihood instead of the posterior probabilities. With this, the sum of the weights in each discriminative part can be considered to be the prior probability of the discriminative part.
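A minimal sketch of the MAP step, assuming mean-only relevance MAP in the style of Reynolds et al. [28] (this paper adapts weights, means, and variances for the Spanish evaluation; only the mean update is shown here, and the relevance factor r = 16 is a conventional choice rather than a value from this paper). Because adaptation preserves mixture indices, the simplified KLD-GMM can pair mixture i of the dialect model directly with mixture i of the anti-dialect model.

```python
import copy
import numpy as np

def map_adapt_means(ubm, X, r=16.0):
    """Mean-only relevance-MAP adaptation of a fitted sklearn GaussianMixture UBM
    to one dialect's feature matrix X (frames x dims). Mixture order is preserved,
    so the simplified KLD-GMM compares same-index mixtures with no matrix search."""
    post = ubm.predict_proba(X)                 # (frames, M) responsibilities
    n = post.sum(axis=0)                        # soft count per mixture
    ex = post.T @ X                             # first-order sufficient statistics
    alpha = (n / (n + r))[:, None]              # data-dependent adaptation coefficient
    adapted = copy.deepcopy(ubm)
    adapted.means_ = alpha * (ex / np.maximum(n, 1e-10)[:, None]) \
                     + (1.0 - alpha) * ubm.means_
    return adapted
```

Under this index-preserving adaptation, the simplified pairing amounts to evaluating `kl2_weighted` for each same-index pair (i, i) of the dialect and anti-dialect models, skipping `kl2_matrix` and the greedy search entirely.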
B. Testing Algorithm: Frame Selection Decoding

Along a similar concept to the proposed mixture-based KLD method, a frame selection decoding (FSD) scheme is also developed here for dialect classification. It is known that for accent classification, there is a reasonable expectation that a nonnative speaker will systematically substitute sounds (phones) from his/her native language for those foreign sounds which are less familiar. Here, accent can be defined as a pattern of pronunciation and/or vocabulary of a language used by a community of nonnative speakers belonging to some geographical region. For example, English spoken by native Chinese or German speakers represents two accents of English. In this case, the phones from the native language and the substituted phones in the foreign accent are very similar. For dialect classification, on the other hand, all speakers are native, so, we believe, it is not necessary for speakers to purposely substitute phonemes between two dialects in order to maintain mutual understanding. However, the interaction effect among dialects is significant due to communication among people with different dialects. In general, since different dialects originate from the same language, they will have similarities with each other. In addition, increased communication between populations of neighboring dialect regions is expected to contribute to some degree of dialect transference/fusion. This would more likely occur in word selection, versus phonemes, pronunciation, grammar, or prosodic (intonation, timing, etc.) structure. The more opportunities a speaker has to hear a dialect, the more likely the speaker is to acquire those dialect traits.

Fig. 5. Testing strategy based on frame selection decoding (FSD-GMM).

A speaker is more likely to hear phones with a high frequency of occurrence in a dialect than phones with a low frequency of occurrence. It is believed that the speaker will unintentionally imitate the pronunciations of phones with a high frequency of occurrence, and therefore be closer to the pronunciations in that dialect, although some phonemes that are used only in a particular dialect of the language could be difficult to learn by speakers from other dialect regions. Alternatively, there is a lower probability that the speaker will hear phones with a low frequency of occurrence, which suggests that these phones would better maintain the true traits of the dialect. Therefore, motivated by the arguments above, frames in the test data sequence will carry a nonuniform range of dialect-dependent information. The frames are classified into two classes based on their locations in the acoustic space. In the decoding phase, frames that represent phones with a high frequency of occurrence, and which are acoustically closer to each other, are believed to be more dialect confusable and less dialect dependent, while frames representing phones with a lower occurrence frequency will reflect more dialect-dependent information. In a manner consistent with the proposed mixture-based approach, frames which are less dialect dependent are tagged as General Frames, since they can decrease dialect classification performance and should be suppressed. Frames which are more dialect dependent are tagged as Discriminative Frames, since they enhance discrimination and should therefore improve classification performance. If the probabilities of a frame under all dialects under test are all relatively large, this means that the phone represented by this frame is used frequently in the language, and the frame is identified as a general frame and set aside; otherwise, the frame is retained for classification. In practice, the sum of the probabilities of the frame from all dialects is used to identify the frames, instead of the individual probabilities, under the assumption that the probabilities of that frame from all dialects are very similar. In the tagging process, the sum of the probabilities from all dialects is calculated for every frame in the test file as

$S_t = \sum_{k=1}^{K} p(x_t \mid \lambda_k)$   (8)

where $K$ is the number of dialects under test (typically three for our scenario), and $S_t$ is the sum of the probabilities of the $t$th frame $x_t$ under the dialect models $\lambda_k$. All $S_t$ are ranked in ascending order. To determine whether frame $t$ is a dialect-discriminative or general frame, define “1” to be a discriminative sample and “0” a general sample, where the $t$th frame is classified as

$F_t = \begin{cases} 1, & r_t \le (1 - M_T) \cdot N \\ 0, & r_t > (1 - M_T) \cdot N \end{cases}$   (9)

where $r_t$ represents the index of the $t$th frame in the ranked range, and $N$ represents the total number of frames in the test file. The value $M_T$ is the relative threshold which represents the upper bound on the fraction of dialect-recessive frames, where the range for $M_T$ is $[0, 1]$. The discriminative test process is shown in Fig. 5. Dialect-discriminative frames are selected by the block “Frame Credibility Analysis.” Dialect-general frames are set aside, and only dialect-dominant frames contribute to the overall classification score.
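The decoding rule of (8) and (9) can be sketched as follows. The quantile cut reflects our reading of the ranking step (the top $M_T$ fraction of frames by summed likelihood is set aside as general), and the `tags` argument carries each dialect's 0/1 discriminative-mixture mask from the KLD step; passing all-ones masks runs FSD on the full models. Note that the retained mixture weights are deliberately not renormalized, per Section III-A.

```python
import numpy as np

def log_gauss_diag(X, means, covs):
    """(frames, M) matrix of log N(x_t; mu_i, var_i) for diagonal covariances."""
    d = X.shape[1]
    quad = ((X[:, None, :] - means[None]) ** 2 / covs[None]).sum(axis=2)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(covs).sum(axis=1) + quad)

def fsd_classify(models, tags, X, m_t=0.15):
    """Frame selection decoding, eqs. (8)-(9), over one test utterance X."""
    names = list(models)
    like = []
    for name in names:
        g = models[name]
        comp = np.exp(log_gauss_diag(X, g.means_, g.covariances_)) * g.weights_
        like.append(comp @ tags[name])     # general mixtures zeroed; weights not renormalized
    like = np.asarray(like)                # (dialects, frames)
    s = like.sum(axis=0)                   # eq. (8): summed probability per frame
    keep = s <= np.quantile(s, 1.0 - m_t)  # eq. (9): top M_T fraction tagged general
    scores = np.log(like[:, keep] + 1e-300).mean(axis=1)
    return names[int(np.argmax(scores))]
```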
IV. EXPERIMENTS

In the experimental evaluation, the proposed algorithms are evaluated on two dialect corpora: Arabic and Spanish dialects. It is emphasized here that no associated transcripts for train or test data are employed for any of the dialects. All data used in these experiments are spontaneous speech, since earlier evaluations on dialect ID showed that read speech contains limited dialect-dependent structure [16], [19]. Mel-frequency cepstral coefficients (MFCCs) are used here, consisting of a total of 26 coefficients: log energy and 12 MFCCs, plus their delta coefficients, for each frame. The frame length is 20 ms, with a 50% overlap (10-ms skip rate). During test, all audio files are partitioned into short utterances; the length of each test utterance is 10 s. The final classification performance is the average over all test utterances. Finally, 256 mixtures were used in the GMM dialect model, as well as in the anti-dialect model, in all evaluations, since the number of mixtures does not significantly influence dialect classification performance [19]. Since several combinations of parameters are considered in the following experimental results, a further development data set would be needed to assess the best parameter settings in a real application, though the results here identify reasonable values for the parameters. Here, no development data is set aside for parameter optimization, in order to employ all available data for either train or test, due to the size limitation of the corpora.

A. Dialect Corpora: Arabic and Spanish

The first corpus employed consists of Arabic dialect data from five different regions: United Arab Emirates (UAE), Egypt, Iraq, Palestine, and Syria. The full set of 250 sessions (500 speakers) makes up the corpus, which represents the larger of the two dialect corpora considered. There are 100 speakers in each dialect, balanced between male/female gender. Each session consists of two speakers completing four combined conversational recordings. Each conversational recording contains four selected topics from a list of 12 preselected topics, such as the weather, shopping, travel, and other common topics. The topics were chosen according to the ease for speakers in conversation, but also with the aim of achieving, as much as possible, an equal distribution across all topics for the final database. The majority of speakers were asked to discuss weather as one of their four topics. A lapel microphone was used in the conversational recording for each speaker per conversation. The gender, topic, and signal-to-noise ratio (SNR) are labeled for each recording per conversation. To avoid the influence of noise on dialect ID in this study, all recordings not meeting a minimum SNR were set aside so that only a defined range (e.g., “clean”) of dialect data would be used. This was done to minimize non-dialect variability, since addressing such differences is suggested for future research. Since long periods of silence do exist within the audio, a silence remover based on an overall energy measure is applied to eliminate low-energy silence frames. Three dialects are selected for use in the study based on geographical origins: UAE (AE), Egypt (EG), and Syria (SY).

TABLE I. A summary of male Arabic dialect data employed in this study (after silence frame removal).
TABLE II. A summary of female Spanish dialect data employed in this study (after silence frame removal).
TABLE III. A summary of the combined male and female Chinese corpus employed in this study (after silence frame removal).
To avoid the influence of gender, only male data is used for train and test. Table I summarizes the train and test data employed after silence removal (all speakers in Table I are male).

The second corpus employed is based on the MIAMI corpus described in [3], and also employed in [19]. Dialect speech from Cuba, Peru, and Puerto Rico (PR) is used in this paper. Here, the spontaneous speech portion, which consists of spoken answers to questions prompted by interviewers, serves as the material for both train and test (i.e., only open test results are reported). All subjects used a close-talk headset microphone to record speech. In a manner similar to the Arabic corpus evaluation, silence removal is also performed prior to training. Here, only female speakers are considered in the evaluation. Table II summarizes the number of speakers and speech data duration for train and test after silence removal.

B. Additional Corpus Evaluation: Chinese Corpus

To further evaluate the FSD algorithm, a Chinese corpus is also used in the present study. The corpus includes three Chinese sub-languages: Mandarin, Cantonese, and Xiang. All data in this corpus consist of spontaneous noise-free speech. Both male and female subjects are used in this phase of the study. Table III summarizes the train and test data employed after silence removal. The reason for using this corpus in the evaluation of the FSD algorithm is that although these sub-languages are not true dialects of a common language, all three sub-languages have similar grammar and text, and all speakers from the three sub-languages live in one country, which potentially allows more contact with country-specific content (e.g., weather, politics, etc.). Although the native sub-languages of the Chinese speakers are different, most can partially understand each other. As such, communication among these speakers with different sub-languages is common, and this will help assess improved discrimination for the dialect case in the FSD algorithm. Another reason to employ this Chinese corpus for the evaluation of the FSD algorithm is that the size of the corpus is sufficiently large so that performance can be assessed with greater confidence.

C. Distribution of KL2 Divergence Between Mixtures

In the KLD algorithm, the key step is the ability to correctly tag discriminative and general mixtures. The tagging process is based on the KL2 divergence range $\{D_1, D_2, \ldots, D_M\}$, which represents the distances of the mixture pairs. Furthermore, the mixture pairs represent a form of projection map from the mixtures of the dialect model to the mixtures of the anti-dialect model. For each pair, the mixture of the dialect model and the mixture of the anti-dialect model are considered to be a matched pair. As an example, the KL2 divergence range of the dialect and anti-dialect models from the Arabic corpus is illustrated in Fig. 6. Note that the final 15 element pairs between dialect and anti-dialect pdfs are too large to be included in the figure. The x-axis represents the pdf index of the GMM range. Here, the KL2 divergence values are small or close to zero for the first half of the pairs (i.e., indices ranging from 1 to 150), which suggests that these pdfs are in fact linked and similar. Alternatively, the KL2 divergence values become progressively larger, with a rapid increase as the index approaches 256.

Fig. 6. Distribution of the KL2 divergence range $\{D_1, D_2, \ldots, D_M\}$ of dialect and anti-dialect models from the Arabic corpus.
Therefore, it is suggested that pairs with large divergences (e.g., pairs 175–256) in the tail of the distribution range are mismatched. All these mismatched mixtures (e.g., pairs 175–256) should be included as discriminative mixtures in this case. In addition, each KL2 divergence $D_i$ in the range is compared with all elements in the $i$th row of the KL2 divergence matrix defined in Section III-A. Fig. 7 shows, for the UAE dialect, the ratio of $D_i$ to the maximum value in the $i$th row of the matrix, with the exception of the self-test value. It is noted that similar distribution results are achieved for the other dialects of this language. If the ratio is greater than 1, then $D_i$ is the largest entry in the $i$th row of the matrix and the mixture pair is the most matched pair in that row; the larger the ratio, the more matched the mixture pair is, relatively. Alternatively, if the ratio is much smaller than 1, then there exists another, better-matched mixture for this mixture than its paired one, which means the pair is a mismatch. In Fig. 7, the left portion of the distribution shows the ratios which are large and should be considered matched pairs, while the right side of the range has lower ratios and is therefore mismatched. To retain the majority of the mismatched mixtures, a relative threshold $N_T$ in the range from 0 to 0.3 is used in the following experiments, so that at least 70% of the mixtures in the GMM are retained as discriminative mixtures.

Fig. 7. The distribution of the ratio of $D_i$ to the maximum value in the $i$th row of the matrix, excluding the self term, for the UAE dialect.

D. Differences Between Language and Dialect in the Distribution of KL2 Divergence

Compared to languages, dialects are in general much closer to each other. For example, Fig. 8 illustrates the assessed difference between languages and dialects using the distribution of the KL2 divergence of the mixtures. Here, a direct comparison of the KL2 divergence is performed, where the mixture GMMs from any two of the three Arabic dialects are first employed, followed by individual Arabic dialects compared with Spanish. The number of discriminative mixtures is expected to be higher for the Arabic-dialect-to-Spanish comparison than for pairwise Arabic dialect pairs. Arabic dialects are used as the dialect samples, with a total of 58 male speakers and 99 minutes, along with spontaneous Spanish (SP) data from the MIAMI corpus as the comparative language sample. A representative sampling of all three dialects from the MIAMI corpus was used to train the Spanish GMM model. MLE-GMM models with 256 mixtures are trained for the Arabic dialects as well as for the Spanish language. The KL2 divergence pair ranges are generated from mixture pairs of any two Arabic dialects (AE and EG, EG and SY, AE and SY), or of any single Arabic dialect and the overall Spanish data (AE and SP, EG and SP, SY and SP). Since the dialect mismatch is in the tail of the mixture pairs illustrated in Fig. 6, only the first 200 elements of the KL2 divergence ranges are shown in Fig. 8. The elements of the KL2 divergence for dialect pairs are smaller, with a gradual and smooth increase, as shown in Fig. 8, while the elements of the KL2 divergence between individual Arabic dialects and Spanish are larger, with a more significant rate of increase as the index approaches 200. Therefore, mixture pairs between dialects are a better match, while a larger mismatch exists for most mixture pairs between a dialect entry and a separate language. Equivalently, more confusion, or overlap regions, exist between dialect pairs versus a dialect-new-language pair. Based on these dialect-versus-language traits, it is more important to reduce the impact of neutral mixtures in dialect-pair classification as opposed to a dialect-new-language pair.

Fig. 8. Distribution of the KL2 divergence ranges of dialects and languages. The ranges from two Arabic dialects and the ranges from one Arabic dialect and Spanish are shown to illustrate the different distributions between dialects and languages.
E. Evaluation on Arabic and Spanish Dialect Corpora

Having illustrated the distribution of the KL2 divergence range between dialects, as well as the distance between a single dialect and an alternative language, the focus now shifts to an evaluation of the proposed algorithms over two dialect corpora: Arabic and Spanish. Fig. 9 shows the classification accuracy of the proposed KLD-GMM algorithm obtained by varying the relative divergence threshold $N_T$ from 0 to 30%. The upper bound constant of the KL2 divergence is set to 0.006 in both dialect evaluations. The x-axis of both subfigures is the relative threshold $N_T$, which reflects the percentage of general mixtures which are set aside. When the relative threshold is 0, all mixtures are tagged as discriminative mixtures and the KLD-GMM reduces to the baseline GMM (MLE-GMM) framework. The y-axis of both subfigures is the dialect classification accuracy (%), with 33% as chance for the three-way dialect task. The best performance for each corpus is achieved at the thresholds indicated in Fig. 9. Tables IV and V summarize the best performance of KLD-GMM from Fig. 9, where the performance of the baseline (MLE-GMM) system is 67.7% for the three-way Arabic dialects and 75.8% for the Spanish dialects. The best KLD-GMM performance achieves a measurable absolute improvement on both the Arabic and Spanish dialects, with relative error reductions over the baseline of 12.4% for Arabic and 12.0% for Spanish. In Fig. 9(a), there is some performance variability when the threshold is in the range 0%–6%, suggesting that a small reduction of the existing GMM mixture space may not suppress a particular dialect-related pdf from the model. This is expected, since the particular dialect GMM begins with 256 mixtures, so a 1%–5% mixture reduction effectively removes 2–12 pdfs, while each phone is expected to be modeled by 2–5 mixtures. When the threshold is greater than 20%, performance begins to decrease, suggesting that setting aside too many mixtures, which are then more likely to be discriminative mixtures, will impact performance, and so a balance must be achieved. This balance, as determined from these evaluations, should be between 6% and 20%, where the performance increases with only slight fluctuations. In Fig. 9(b), similar fluctuations appear in the range of small thresholds (less than 14%).

Since the size of the Spanish corpus is smaller, model adaptation is employed, and the simplified KLD-GMM algorithm is compared to the KLD-GMM algorithm (the simplified KLD-GMM method was presented at the end of Section III-A). First, the UBM is trained using all available training data in the Spanish dialect corpus, which includes 77 minutes from 44 speakers.
The dialect and anti-dialect models, including the parameters weight, mean, and variance, are adapted from the UBM via MAP adaptation (shown also in Table V). A new evaluation is performed for Spanish dialect ID. Fig. 10 shows the classification accuracy of the proposed KLD-GMM algorithm on the Spanish dialect corpus, obtained by varying the relative divergence threshold $N_T$ from 0% to 30%. All models used in the evaluation, including dialect and anti-dialect models, are adapted from the UBM. The baseline system is the MAP-GMM instead of the MLE-GMM. The simplified KLD-GMM algorithm is also evaluated, with the resulting performance virtually the same as that of the KLD-GMM algorithm, which suggests that the simplified KLD-GMM can achieve the same performance as the KLD-GMM when the models are adapted from the UBM, and that the pair sequence from the KLD-GMM algorithm is the same as that from the simplified KLD-GMM algorithm. From Fig. 10, the range of Spanish dialect ID performance fluctuations is from 0% to 6%, which is the same as that observed on the Arabic corpus. The best performance of 83.3% is achieved in Fig. 10. From Table V, the performance of the baseline system (MAP-GMM) is 76.4%, giving a best absolute performance improvement of 6.9% and a relative error reduction over the baseline (MAP-GMM) of 29.2%.

Fig. 9. Dialect classification accuracy of the KLD-GMM algorithm for a three-way test with two corpora: Arabic and Spanish. The x-axis is the relative threshold $N_T$; the y-axis is classification accuracy, with 33.3% as chance. (a) Three-way dialect classification accuracy over the Arabic corpus. (b) Three-way dialect classification accuracy over the Spanish corpus.

Fig. 10. Classification accuracy of the KLD-GMM and simplified KLD-GMM algorithms on the three-way Spanish dialect corpus. The x-axis is the relative threshold $N_T$; the y-axis is the classification accuracy. All models used are adapted from the UBM model by using MAP adaptation.

Having illustrated mixture selection in the KLD-GMM algorithm, frame selection is now considered. Fig. 11 shows the classification accuracy of the frame selection-based FSD-GMM algorithm obtained by varying the frame relative threshold $M_T$. For the Spanish dialect corpus, both the MLE-GMM and MAP-GMM systems are used as baseline systems since the size of the corpus is small. The x-axis of all subfigures is the frame relative threshold $M_T$, while the y-axis shows classification accuracy (%). When the frame relative threshold $M_T$ is 0, the FSD-GMM reduces to the baseline methods. The best operating points for the Arabic dialects, and for the Spanish dialects under the MLE-GMM and MAP-GMM baselines, are indicated in Fig. 11. Table IV summarizes the best performance of FSD-GMM from Fig. 11 for the Arabic dialects versus the MLE-GMM baseline, and Table V summarizes the corresponding Spanish results versus the MLE-GMM and MAP-GMM baseline systems. The relative error reductions over the baselines are 12.7% on the Arabic dialects, and 35.5% and 36.4% on the Spanish dialects versus the MLE-GMM and MAP-GMM baseline systems, respectively.

TABLE IV. Summary of three-way dialect classification performance on the Arabic dialect corpus.
TABLE V. Summary of three-way dialect classification performance on the Spanish dialect corpus.
To assess which frames are actually being set aside to achieve this performance improvement, an example is illustrated in Fig. 12, which shows the frames set aside in a test file sample from the Arabic dialect corpus. The y-axis indexes a group of 150 frames which are set aside from an input 10-s test file (1000 total frames), corresponding to an FSD relative frame threshold of 15%, while the x-axis is the time index/location of the frames. Here, vertical movement reflects frames that are retained. Longer contiguous frame sets in this plot suggest that the frame rejection is not random, but associated with speech types over time. From Fig. 12, it is clear that the frames which are set aside are divided into six major groups, which represent six phones over time. Since the likelihoods of these frames are high, it is also known that the frequency of occurrence of these phones is high, which means the frames representing these phones are dialect-recessive frames.

In addition, the combination of the KLD-GMM and FSD-GMM algorithms, named KLD-FSD-GMM, is evaluated on the Arabic dialect corpus. The system parameters $N_T$ and $M_T$ are set to the values which gave the best individual performances for KLD-GMM and FSD-GMM. Dialect ID performance is 76.4%, representing an 8.7% absolute improvement over the 67.7% baseline system (this is also shown in Table IV). This confirms that the two algorithms address the suppression of the common acoustic space between dialects in different ways, since the combination outperforms the individual algorithms for Arabic.

Fig. 11. Classification accuracy of FSD-GMM on the Arabic and Spanish corpora. The x-axis is the frame relative threshold $M_T$ for the FSD algorithm. The baseline is the MLE-GMM algorithm on the Arabic corpus, and both MLE-GMM and MAP-GMM are used as baselines on the Spanish corpus. (a) Classification accuracy on the Arabic corpus with the MLE-GMM baseline. (b) Classification accuracy on the Spanish corpus with the MLE-GMM baseline. (c) Classification accuracy on the Spanish corpus with the MAP-GMM baseline.

Fig. 12. Example of the FSD-GMM algorithm on the Arabic dialect corpus, showing the frames which are set aside from a 10-s test file. Six phones are set aside, and the frames representing these phones are identified as dialect-recessive frames.

F. Probe Evaluation of FSD Algorithm on Chinese Material

For a final evaluation, a three-way classification task is considered using data from a Chinese language corpus (e.g., Mandarin, Cantonese, and Xiang). Fig. 13 shows the classification accuracy of the frame selection-based FSD-GMM algorithm obtained by varying the frame relative threshold $M_T$ on the three-way Chinese language corpus. The MLE-GMM system is used as the baseline system. The x-axis of the figure shows the frame relative threshold $M_T$, and the y-axis shows classification accuracy (%). When the frame relative threshold is 0, the FSD-GMM system reduces to the baseline. From Fig. 13, the performance of the baseline system is 81.2%, and the resulting improvement achieved is, as expected, significant: the best operating point yields a relative error reduction over the baseline of 18.1%, confirming that the proposed algorithms are also appropriate for classification of related sub-languages.

Fig. 13. Classification accuracy of FSD-GMM on the Chinese language corpus. The x-axis is the frame relative threshold $M_T$ for the FSD algorithm. The baseline is the MLE-GMM algorithm.
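Tying the pieces together, a hypothetical end-to-end run of the combined KLD-FSD-GMM system could look as follows; `train_anti_model` (pooling the competing dialects into a single anti-dialect GMM) and the data variables are assumed helpers, and the threshold values are illustrative points within the ranges studied above.

```python
# Hypothetical end-to-end sketch of KLD-FSD-GMM (three-way task), reusing the
# helpers defined in the earlier sketches. train_anti_model is an assumed helper
# that pools the competing dialects' training data into one anti-dialect GMM;
# train_audio and test_clip are assumed data variables.
models = train_dialect_models(train_audio)                  # Section II baseline
tags = {}
for dialect in models:
    anti = train_anti_model(train_audio, exclude=dialect)   # assumed helper
    pairs = rank_mixture_pairs(kl2_matrix(models[dialect], anti))
    tags[dialect] = tag_mixtures(pairs, n_t=0.2, c=0.006)   # KLD selection, Sec. III-A
label = fsd_classify(models, tags, extract_features(test_clip), m_t=0.15)  # Sec. III-B
```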
V. CONCLUSION

Speech from distinct dialects of a language can be separated into dialect-sensitive and dialect-neutral parts, which are represented by discriminative mixtures and general mixtures in the GMM model, as well as by discriminative frames and general frames in test files. Due to the similarity of dialects, there will be more neutral content between dialects than between languages. The neutral content, representing the distractive/confusing region of the dialects, can be reduced or excluded via frame or mixture selection. In this paper, a training algorithm (Gaussian mixture selection by KL2 divergence) and a testing algorithm (frame selection decoding) have been proposed and developed for text-independent dialect classification, which means the dialect label for the data is known but no text transcripts are available. The algorithms focus on emphasizing those mixtures (KLD-GMM) and frames (FSD-GMM) which are more dialect sensitive, and de-emphasizing those which are dialect neutral. The three-way dialect classification algorithms were evaluated on two different-size corpora from two languages, with an MLE-trained GMM system used as the baseline. In addition, a MAP-trained GMM system was also used as an alternative baseline for the Spanish corpus, due to the limited size of that corpus. The KLD and FSD algorithms achieved measurable and significant performance improvement over the baseline system. The combination of KLD and FSD achieves further performance improvement for the Arabic corpus, but no additional gain for the Spanish dialects.

In conclusion, the proposed algorithms achieve a measurable absolute improvement and relative error reduction on the Arabic dialect corpus (from KLD-FSD-GMM) against the MLE-GMM baseline, and on the Spanish dialect corpus (from FSD-GMM) against the MAP-GMM baseline system. Therefore, the proposed algorithms have been shown to be effective for dialects of Arabic and Spanish, and are promising for generalization to dialects of other languages. Another strength of the proposed algorithms is their low implementation complexity, where it is easy to fall back to more traditional operating conditions without changing the fundamental algorithm structure or the existing dialect models. These findings confirm the effectiveness of improved mixture selection within the GMM and frame selection during decoding for GMM-based dialect classification. The premise of suppressing common dialect acoustic subareas, while maintaining discriminative regions, represents the primary advancement shown. The algorithms developed in this study have led to advancements in the area of dialect identification. While these algorithms have resulted in effective performance for Arabic, Spanish, and Chinese dialects (or related sub-languages), we do not claim that the resulting solution is the optimal or final contribution in dialect classification. Other strategies could be considered in order to improve overall dialect classification rates, such as factor analysis [23], CMLLR [25], MMI, and VTLN. It is suggested that these be considered in future studies, in the context of the algorithms developed in the current study. Dialect ID is a challenging research topic, with issues relating to uniqueness, knowledge of ground truth of the speakers, and separation of dialects/languages. Future studies could also leverage further knowledge of linguistics across the dialects under evaluation as well.

REFERENCES

[1] V. Gupta and P. Mermelstein, “Effect of speaker accent on the performance of a speaker-independent, isolated word recognizer,” J. Acoust. Soc. Amer., vol. 71, pp. 1581–1587, 1982.
[2] C. Huang, T. Chen, S. Li, E. Chang, and J. L. Zhou, “Analysis of speaker variability,” in Proc. Interspeech'01, Aalborg, Denmark, 2001, pp. 1377–1380.
[3] M. A. Zissman, T. P. Gleason, D. M. Rekart, and B. L. Losiewicz, “Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech,” in Proc. ICASSP'96, Atlanta, GA, 1996, vol. 2, pp. 777–780.
[4] L. M. Arslan and J. H. L. Hansen, “Language accent classification in American English,” Speech Commun., vol. 18, pp. 353–367, 1996.
[5] L. M. Arslan and J. H. L. Hansen, “A study of temporal features and frequency characteristics in American English foreign accent,” J. Acoust. Soc. Amer., vol. 102, pp. 28–40, 1997.
[6] W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne, “Lexicon adaptation for LVCSR: Speaker idiosyncracies, non-native speakers, and pronunciation choice,” in Proc. ISCA Workshop Pronunciat. Modeling Lexicon Adaptat., 2002, pp. 83–88.
[7] M. K. Liu, B. Xu, T. Y. Huang, Y. G. Deng, and C. R. Li, “Mandarin accent adaptation based on context-independent/context-dependent pronunciation modeling,” in Proc. ICASSP'00, Istanbul, Turkey, 2000, vol. 2, pp. 1025–1028.
[8] J. J. Humphries and P. C. Woodland, “The use of accent-specific pronunciation dictionaries in acoustic model training,” in Proc. ICASSP'98, Seattle, WA, 1998, vol. 1, pp. 317–320.
[9] V. Diakoloukas, V. Digalakis, L. Neumeyer, and J. Kaja, “Development of dialect-specific speech recognizers using adaptation methods,” in Proc. ICASSP'97, Munich, Germany, 1997, vol. 2, pp. 1455–1458.
[10] B. Zhou and J. H. L. Hansen, “Speechfind: An experimental on-line spoken document retrieval system for historical audio archives,” in Proc. Interspeech-02/ICSLP-02, Denver, CO, 2002, vol. 2, pp. 1969–1972.
[11] S. Gray and J. H. L. Hansen, “An integrated approach to the detection and classification of accents/dialects for a spoken document retrieval system,” in Proc. IEEE Workshop Autom. Speech Recognition Understanding, 2005, vol. 2, pp. 35–40.
[12] A. S. Kroch, “Toward a theory of social dialect variation,” in Language in Society. Cambridge, U.K.: Cambridge Univ. Press, 1978, vol. 7, pp. 17–36.
[13] L. Arslan and J. H. L. Hansen, “Selective training for hidden Markov models with applications to speech classification,” IEEE Trans. Speech Audio Process., vol. 7, no. 1, pp. 46–54, Jan. 1999.
[14] L. R. Yanguas, G. C. O'Leary, and M. A. Zissman, “Incorporating linguistic knowledge into automatic dialect identification of Spanish,” in Proc. ICSLP'98, Sydney, Australia, 1998.
[15] R. Huang and J. H. L. Hansen, “Dialect/accent classification via boosted word modeling,” in Proc. ICASSP'05, Philadelphia, PA, 2005, vol. 1, pp. 585–588.
[16] R. Huang and J. H. L. Hansen, “Advances in word based dialect/accent classification,” in Proc. Interspeech'05, Lisbon, Portugal, 2005, vol. 1, pp. 2241–2244.
[17] P. A. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds, “Dialect identification using Gaussian mixture models,” in Proc. Odyssey: Speaker Lang. Recog. Workshop, Toledo, Spain, 2004.
[18] R. Huang and J. H. L. Hansen, “Gaussian mixture selection and data selection for unsupervised Spanish dialect classification,” in Proc. Interspeech'06, Pittsburgh, PA, 2006, pp. 445–448.
[20] G. Choueiter, G. Zweig, and P. Nguyen, "An empirical study of automatic accent classification," in Proc. ICASSP'08, Las Vegas, NV, 2008, vol. 1, pp. 4265–4268.
[21] P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, "Brno University of Technology system for NIST 2005 language recognition evaluation," in Proc. IEEE Odyssey 2006: Speaker Lang. Recognition Workshop, 2006, vol. 1, pp. 1–7.
[22] C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, "Channel factors compensation in model and feature domain for speaker recognition," in Proc. IEEE Odyssey 2006: Speaker Lang. Recognition Workshop, 2006, vol. 4, pp. 1–6.
[23] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 345–354, May 2005.
[24] R. Vogt, B. Baker, and S. Sridharan, "Modeling session variability in text-independent speaker verification," in Proc. Interspeech'05, Lisbon, Portugal, 2005, pp. 3117–3120.
[25] W. Shen and D. Reynolds, "Improved GMM-based language recognition using constrained MLLR transforms," in Proc. ICASSP'08, Las Vegas, NV, 2008, pp. 4149–4152.
[26] R. Huang and J. H. L. Hansen, "Dialect classification on printed text using perplexity measure and conditional random fields," in Proc. ICASSP'07, Honolulu, HI, 2007, vol. 4, pp. 993–996.
[27] S. Kullback, Information Theory and Statistics. Mineola, NY: Dover, 1997.
[28] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19–41, 2000.
[29] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B, vol. 39, pp. 1–38, 1977.
[30] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[31] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. Speech Audio Process., vol. 4, no. 1, pp. 31–44, Jan. 1996.
[32] W. M. Campbell, F. Richardson, and D. A. Reynolds, "Language recognition with word lattices and support vector machines," in Proc. ICASSP'07, Honolulu, HI, 2007, pp. 989–992.
[33] M. Mehrabani and J. H. L. Hansen, "Dialect separation assessment using log-likelihood score distributions," in Proc. Interspeech'08, Brisbane, Australia, 2008, pp. 747–750.
[34] J. R. Hershey and P. A. Olsen, "Approximating the Kullback–Leibler divergence between Gaussian mixture models," in Proc. ICASSP'07, Honolulu, HI, 2007, pp. 317–320.

Yun Lei (S'07) received the B.S. degree in electrical engineering from Nanjing University, Jiangsu, China, in 2003 and the M.S. degree in electrical engineering from the Institute of Acoustics, Chinese Academy of Sciences (CAS), Beijing, China, in 2006. He is currently pursuing the Ph.D. degree in electrical engineering at the University of Texas at Dallas, Richardson.

He has been a Research Engineer at the Center for Robust Speech Systems (CRSS), University of Texas at Dallas.

John H. L. Hansen (S'81–M'82–SM'93–F'07) received the B.S.E.E. degree from the College of Engineering, Rutgers University, New Brunswick, NJ, in 1982 and the M.S. and Ph.D.
degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1983 and 1988, respectively.

He joined the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas (UTD), Richardson, in the fall of 2005, where he is Professor and Department Head of Electrical Engineering and holds the Distinguished University Chair in Telecommunications Engineering. He also holds a joint appointment as Professor in the School of Behavioral and Brain Sciences (Speech and Hearing). At UTD, he established the Center for Robust Speech Systems (CRSS), which is part of the Human Language Technology Research Institute. Previously, he served as Department Chairman and Professor in the Department of Speech, Language, and Hearing Sciences (SLHS) and Professor in the Department of Electrical and Computer Engineering at the University of Colorado, Boulder (1998–2005), where he cofounded the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory (RSPL), and he continues to direct research activities in CRSS at UTD. His research interests span the areas of digital speech processing, analysis and modeling of speech and speaker traits, speech enhancement, feature estimation in noise, robust speech recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human–computer interaction. He has supervised 50 thesis candidates (22 Ph.D., 28 M.S./M.A.). He is author/coauthor of 352 journal and conference papers and eight textbooks in the field of speech processing and language technology, coauthor of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), coeditor of DSP for In-Vehicle and Mobile Systems (Springer, 2004) and Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2006), and lead author of the report "The Impact of Speech Under 'Stress' on Military Speech Technology" (NATO RTO-TR-10, 2000).

Prof. Hansen was named an IEEE Fellow in 2007 for contributions in "Robust Speech Recognition in Stress and Noise" and is currently serving as a member of the IEEE Signal Processing Society Speech Technical Committee (2005–2008; 2010–2013; elected Chair-elect in 2010) and Educational Technical Committee (2005–2008; 2008–2010). Previously, he served as Technical Advisor to a U.S. Delegate for NATO (IST/TG-01), IEEE Signal Processing Society Distinguished Lecturer (2005/06), Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992–1999), Associate Editor for the IEEE SIGNAL PROCESSING LETTERS (1998–2000), and Editorial Board Member for the IEEE Signal Processing Magazine (2001–2003). He also served as Guest Editor of the October 1994 special issue on Robust Speech Recognition of the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He served on the Speech Communications Technical Committee of the Acoustical Society of America (2000–2003) and is a member of the International Speech Communication Association (ISCA) Advisory Council. He was the recipient of the 2005 University of Colorado Teacher Recognition Award, as voted by the student body. He organized and served as General Chair for ICSLP/Interspeech-2002: International Conference on Spoken Language Processing, September 16–20, 2002, and served as Co-Organizer and Technical Program Chair for IEEE ICASSP-2010, Dallas, TX.