T-SP-01444-2003.R2

Applications of Support Vector Machines to Speech Recognition

A. Ganapathiraju, Member, IEEE, J. Hamaker, Member, IEEE, and J. Picone, Senior Member, IEEE

Abstract—Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.

Index Terms—Support vector machines, statistical modeling, machine learning, speech recognition.

I. INTRODUCTION

Speech recognition systems have become one of the premier applications for machine learning and pattern recognition technology. Modern speech recognition systems, including those described in this paper, use a statistical approach [1] based on Bayes' rule. The acoustic modeling components of a speech recognizer are based on hidden Markov models (HMMs) [1,2]. The power of an HMM representation lies in its ability to model the temporal evolution of a signal via an underlying Markov process. The ability of an HMM to statistically model the acoustic and temporal variability in speech has been integral to its success. The probability distribution associated with each state in an HMM models the variability which occurs in speech across speakers or phonetic context. This distribution is typically a Gaussian mixture model (GMM), since a GMM provides a sufficiently general yet parsimonious parametric model, as well as an efficient and robust mathematical framework for estimation and analysis.
Manuscript received July 16, 2003; revised March 18, 2004. This work was supported in part by the National Science Foundation under Grant No. IIS-0095940. A. Ganapathiraju is with Conversay, Redmond, WA 98052 USA (email: [email protected]). J. Hamaker is with Microsoft Corporation, Redmond, WA 98052-6399 USA (email: [email protected]). J. Picone is with the Department of Electrical and Computer Engineering at Mississippi State University, Mississippi State, MS 39762 USA (phone: 662-325-3149; fax: 662-325-2298; email: [email protected]).

Widespread use of HMMs for modeling speech can be attributed to the availability of efficient parameter estimation procedures [1,2] that involve maximizing the likelihood (ML) of the data given the model. One of the most compelling reasons for the success of ML and HMMs has been the existence of iterative methods to estimate the parameters that guarantee convergence. The expectation-maximization (EM) algorithm provides an iterative framework for ML estimation with good convergence properties, though it does not guarantee finding the global maximum [3].

There are, however, problems with an ML formulation for applications such as speech recognition. A simple example, shown in Fig. 1, illustrates this problem. The two classes shown are derived from completely separable uniform distributions. ML is used to fit Gaussians to these classes, and Bayes' rule is used to classify the data. We see that the decision threshold occurs inside the range of class 2. This results in a significant probability of error. If we were to simply recognize that the range of data points in class 1 is less than 3.3 and that no data point in class 2 occurs within this range, we could achieve perfect classification. In this example, ML training of a Gaussian model will never achieve perfect classification. Learning decision regions discriminatively will improve classification performance.
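The Fig. 1 scenario is easy to reproduce numerically. The sketch below assumes illustrative class ranges (the text only specifies that class 1 lies below 3.3): it fits ML Gaussians to two separable uniform classes and compares the resulting Bayes decision against the simple threshold at 3.3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two perfectly separable classes drawn from uniform distributions,
# as in Fig. 1; the exact ranges are illustrative assumptions.
x1 = rng.uniform(1.0, 3.3, 1000)   # class 1, entirely below 3.3
x2 = rng.uniform(3.3, 9.0, 1000)   # class 2, entirely at or above 3.3

# Maximum-likelihood Gaussian fits (sample mean and standard deviation).
m1, s1 = x1.mean(), x1.std()
m2, s2 = x2.mean(), x2.std()

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Bayes' rule with equal priors: pick the class with the higher likelihood.
# The Gaussian crossing point falls inside the range of class 2, so some
# class-2 points are misclassified.
err_ml = (np.mean(gauss(x1, m1, s1) < gauss(x1, m2, s2))
          + np.mean(gauss(x2, m2, s2) < gauss(x2, m1, s1))) / 2

# A discriminatively chosen threshold at the class boundary is perfect.
err_disc = (np.mean(x1 >= 3.3) + np.mean(x2 < 3.3)) / 2

print(err_ml, err_disc)  # ML error is nonzero; the threshold error is zero
```

The ML-fit Gaussians err on class-2 points near the boundary, while the simple threshold classifies every sample correctly.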
Figure 1. An example of a two-class problem where a maximum likelihood-derived decision surface is not optimal (adapted from [4]). In the exploded view, the shaded region indicates the error induced by modeling the separable data by Gaussians estimated using maximum likelihood. This case is common for data, such as speech, where there is overlap in the feature space or where class boundaries are adjacent.

The important point here is not that Gaussian models are necessarily an incorrect choice, but rather that discriminative approaches are a key ingredient for creating robust and more accurate models. Many promising methods [4,5] have been introduced that use discriminative techniques to improve the estimation of HMM parameters. Artificial neural networks (ANNs) represent an interesting and important class of discriminative techniques that have been successfully applied to speech recognition [6-8]. Though ANNs attempt to overcome many of the problems previously described, their shortcomings with respect to applications such as speech recognition are well-documented [9,10]. Some of the most notable deficiencies include the design of optimal model topologies, slow convergence during training, and a tendency to overfit the data. However, it is important to note that many of the fundamental ideas presented in this paper (e.g., soft margin classifiers) have similar implementations within an ANN framework. In most classifiers, controlling the trade-off between overfitting and good classification performance is vital to the success of the approach.

In this paper, we describe the application of one particular discriminative approach, support vector machines (SVMs) [11], to speech recognition. We review the SVM approach in Section II, discuss applications to speech recognition in Section III, and present experimental results in Section IV.
More comprehensive treatments of fundamental topics such as risk minimization and speech recognition applications can be found in [11-20].

II. SUPPORT VECTOR CLASSIFIERS

A support vector machine (SVM) [11] is one example of a classifier that estimates decision surfaces directly rather than modeling a probability distribution across the training data. SVMs have demonstrated good performance on several classic pattern recognition problems [13]. Fig. 2 shows a typical two-class problem in which the examples are perfectly separable using a linear decision region. H1 and H2 define two hyperplanes. The distance separating these hyperplanes is called the margin. The closest in-class and out-of-class examples lying on these two hyperplanes are called the support vectors.

Empirical risk minimization (ERM) [11] can be used to find a good hyperplane, though this does not guarantee a unique solution. Adding an additional requirement that the optimal hyperplane should have good generalization properties can help choose the best hyperplane. The structural risk minimization (SRM) principle imposes structure on the optimization process by ordering the hyperplanes based on the margin. The optimal hyperplane is the one that maximizes the margin while minimizing the empirical risk. This indirectly guarantees better generalization [11]. Fig. 2 illustrates the differences between using ERM and SRM.

Figure 2. Difference between empirical risk minimization and structural risk minimization for a simple example involving a hyperplane classifier. Each hyperplane achieves perfect classification and, hence, zero empirical risk. However, C0 is the optimal hyperplane because it maximizes the margin, the distance between the hyperplanes H1 and H2. Maximizing the margin indirectly results in better generalization.

An SVM classifier is defined in terms of the training examples. However, not all training examples contribute to the definition of the classifier. In practice, the proportion of support vectors is small, making the classifier sparse. The data set itself defines how complex the classifier needs to be. This is in stark contrast to systems such as neural networks and HMMs, where the complexity of the system is typically predefined or chosen through a cross-validation process.

Real-world classification problems typically involve data which can only be separated using a nonlinear decision surface. Optimization on the input data in this case involves the use of a kernel-based transformation [11]:

K(x_i, x_j) = Φ(x_i) · Φ(x_j) .  (1)

Kernels allow a dot product to be computed in a higher dimensional space without explicitly mapping the data into these spaces. A kernel-based decision function has the form:

f(x) = Σ_{i=1}^{N} α_i y_i K(x, x_i) + b .  (2)

Two commonly used kernels explored in this study are:

K(x_i, x_j) = (x_i · x_j + 1)^d ,  (polynomial) (3)

K(x_i, x_j) = exp{-γ ||x_i - x_j||^2} .  (radial basis) (4)

Radial basis function (RBF) kernels are extremely popular, though data-dependent kernels [21] have recently emerged as a powerful alternative. Though convergence for RBF kernels is typically slower than for polynomial kernels, RBF kernels often deliver better performance [11]. Since there are N dot products involved in the definition of the classifier, where N is the number of support vectors, the classification task scales linearly with the number of support vectors.

Non-separable data is typically addressed by the use of soft margin classifiers. Slack variables ξ_i [11] are introduced to relax the separation constraints:

x_i · w + b ≥ +1 - ξ_i , for y_i = +1 ,  (5)

x_i · w + b ≤ -1 + ξ_i , for y_i = -1 ,  (6)

ξ_i ≥ 0 , ∀i ,  (7)

where y_i are the class assignments, w represents the weight vector defining the classifier, b is a bias term, and the ξ_i's are the slack variables. Derivation of an optimal classifier for this non-separable case is described in detail in [11,12].
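The quantities in Eqs. (2)-(4) are straightforward to compute directly. The sketch below implements the two kernels and the kernel-based decision function; the support vectors, weights, and bias are hand-picked illustrative values, not the result of training.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel, Eq. (4): K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def poly_kernel(x, y, d=3):
    """Polynomial kernel, Eq. (3): K(x, y) = (x . y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Eq. (2): f(x) = sum_i alpha_i y_i K(x, x_i) + b.
    Note the cost is linear in the number of support vectors."""
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Toy classifier with two hand-picked support vectors (illustrative values).
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
alphas = [1.0, 1.0]
labels = [+1, -1]
b = 0.0

# Points near the first support vector land on the +1 side, points near
# the second on the -1 side.
print(np.sign(svm_decision(np.array([0.1, 0.1]), svs, alphas, labels, b)))
print(np.sign(svm_decision(np.array([1.9, 1.9]), svs, alphas, labels, b)))
```

Swapping `kernel=poly_kernel` into `svm_decision` changes only the implicit feature space; the form of the decision function is unchanged.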
III. APPLICATIONS TO SPEECH RECOGNITION

Hybrid approaches for speech recognition [6] provide a flexible paradigm to evaluate new acoustic modeling techniques. These systems do not entirely eliminate the HMM framework, because classification models such as SVMs cannot model the temporal structure of speech effectively. Most contemporary connectionist systems use neural networks only to estimate posterior probabilities and use the HMM structure to model temporal evolution [6,10]. In integrating SVMs into such a hybrid system, several issues arise.

Posterior Estimation: One drawback of an SVM is that it provides an m-ary decision. Most signal processing applications, however, need a posterior probability that captures our uncertainty in classification. This issue is particularly important in speech recognition because there is significant overlap in the feature space. SVMs provide a distance or discriminant which can be used to compare classifiers. This is unlike connectionist systems, whose output activations are estimates of the posterior class probabilities [6,7]. One of the main concerns in using SVMs for speech recognition is the lack of a clear relationship between distance from the margin and the posterior class probability. A variety of options for converting the SVM distance to a posterior probability were analyzed in [18], including Gaussian fits and histogram approaches. These methods are not Bayesian in nature in that they do not account for the variability in the estimates of the SVM parameters. Ignoring this variability in the estimates often results in overly confident predictions by the classifiers on the test set [22]. Kwok [14] and Platt [15] have extensively studied the use of moderated SVM outputs as estimates of the posterior probability. Kwok's work also discusses the relationship between the SVM output and the evidence framework. We chose unmoderated probability estimates based on ML fitting as a trade-off between computational complexity and error performance.
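The ML sigmoid fit can be sketched as follows. The sigmoid takes the form of Eq. (8), p(y=1|f) = 1/(1 + exp(Af + B)); plain cross-entropy gradient descent stands in here for the model-trust algorithm, and the synthetic distances are illustrative.

```python
import numpy as np

def fit_sigmoid(f, t, iters=5000, lr=0.01):
    """Fit p(y=1|f) = 1/(1 + exp(A*f + B)) (Eq. 8) to SVM distances f with
    binary targets t by minimizing cross-entropy. Plain gradient descent is
    a simple stand-in for the model-trust algorithm Platt recommends."""
    A, B = -1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        g = t - p                      # d(loss)/d(A*f + B), per sample
        A -= lr * np.mean(g * f)
        B -= lr * np.mean(g)
    return A, B

# Synthetic held-out distances: positives centered at +1, negatives at -1
# (illustrative stand-ins for cross-validation SVM outputs).
rng = np.random.default_rng(1)
f = np.concatenate([rng.normal(1, 1, 1000), rng.normal(-1, 1, 1000)])
t = np.concatenate([np.ones(1000), np.zeros(1000)])

A, B = fit_sigmoid(f, t)
posterior = lambda x: 1.0 / (1.0 + np.exp(A * x + B))
print(posterior(2.0), posterior(-2.0))  # high for positive f, low for negative f
```

Consistent with the text, the distances used to fit A and B should come from a held-out set, not from the SVM's own training data.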
We used a sigmoid function to map the output distances to posteriors:

p(y = 1 | f) = 1 / (1 + exp(Af + B)) .  (8)

As suggested by Platt, the parameters A and B can be estimated using a model-trust minimization algorithm [23]. An example of the fit for a typical classifier is shown in Fig. 3. Note that we have assumed that the prior class probabilities are equal.

Figure 3. A sigmoid fit to the SVM distance-based posterior.

An issue that arises from this formulation of estimating posteriors is that the distance estimates are heavily biased by the training data. In order to avoid biased estimates, a cross-validation set must be used to estimate the parameters of the sigmoid [15]. The size of this data set can be determined based on the amount of training data that is available for the classifier.

Classifier Design: A fundamental issue in classifier design is whether the classifiers should be one-vs-one classifiers, which learn to discriminate one class from another class, or one-vs-all, which learn to discriminate one class from all other classes. One-vs-one classifiers are typically smaller and less complex, and can be estimated using fewer resources than one-vs-all classifiers. When the number of classes is N, we need to estimate N(N-1)/2 one-vs-one classifiers, as compared to N one-vs-all classifiers. On several standard classification tasks it has been shown that one-vs-one classifiers are marginally more accurate than one-vs-all classifiers [16,17]. Nevertheless, for computational efficiency, we chose to use one-vs-all classifiers in all experiments reported here.

Segmental Modeling: A logical step in building a hybrid system would be to replace the Bayes classifier in a traditional HMM system with an SVM classifier at the frame level. However, the amount of training data and the confusion inherent in frame-level acoustic feature vectors prevent this at the current time.
Though very efficient optimizers are used to train an SVM, it is still not computationally feasible to train on all of the data available in the large corpora used in this study. In our work, we have chosen a segment-based approach to avoid these issues. This allows every classifier to use all positive examples in the corpora and an equal number of randomly-selected negative examples. The other motivating factor for most segment-based approaches [24] is that the acoustic model needs to capture both the temporal and spectral structure of speech, structure that is clearly missing in frame-level classification schemes. Segmental approaches also overcome the assumption of conditional independence between frames of data in traditional HMM systems, taking better advantage of the correlation in adjacent frames of speech data.

A related problem is the variable length, or duration, problem. Segment durations are correlated with the word choice and speaking rate, but are difficult to exploit in an SVM-type framework. A simple but effective approach, motivated by the 3-state HMMs used in most state-of-the-art speech recognition systems, is to assume that the segments (phones in most cases) are composed of a fixed number of sections [10,24]. The first and third sections model the transition into and out of the segment, while the second section models the stable portion of the segment. We use segments composed of three sections in all recognition experiments reported in this work. The segment vector is then augmented with the logarithm of the duration of the phone instance to explicitly model the variability in duration. Fig. 4 demonstrates the construction of a composite vector for a phone segment. SVM classifiers in our hybrid system operate on such composite vectors. The composite segment feature vectors are based on the alignments from a baseline 3-state Gaussian-mixture HMM system.
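A minimal sketch of this composite-vector construction follows. The 3-4-3 section proportion and the appended log duration come from the text; summarizing each section by its frame mean is our assumption, since the pooling operation is not spelled out here.

```python
import numpy as np

def composite_vector(frames, proportions=(0.3, 0.4, 0.3)):
    """Build a segment-level feature vector: split the frames of a phone
    segment into three sections in a 3-4-3 proportion, summarize each
    section (here, by its mean -- an assumption), and append the log
    duration. With 39-dimensional frames this yields 3*39 + 1 = 118
    features."""
    n = len(frames)
    b1 = int(round(proportions[0] * n))
    b2 = int(round((proportions[0] + proportions[1]) * n))
    sections = [frames[:b1], frames[b1:b2], frames[b2:]]
    means = [s.mean(axis=0) for s in sections]
    return np.concatenate(means + [[np.log(n)]])

# A hypothetical phone segment: 20 frames of 39-dimensional features.
frames = np.random.default_rng(2).normal(size=(20, 39))
v = composite_vector(frames)
print(v.shape)  # (118,)
```

The duration term is carried explicitly because, once the frames are pooled into fixed-size sections, the segment length would otherwise be lost.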
The length of the composite vector depends on the number of sections in each segment and the dimensionality of the frame-level feature vectors. For example, with a 39-dimensional feature vector at the frame level and 3 sections per segment, the composite vector has a dimensionality of 117 (before appending the log duration). SVM classifiers are trained on these composite vectors, and recognition is also performed using these segment-level composite vectors.

Figure 4. Composition of the segment-level feature vector assuming a 3-4-3 proportion for the three sections.

N-best List Rescoring: As a first step towards building a stand-alone hybrid SVM/HMM system for continuous speech recognition, we have explored a simple N-best list rescoring paradigm. Assuming that we have already trained the SVM classifiers for each phone in the model inventory, we generate N-best lists using a conventional HMM system. A model-level alignment for each hypothesis in the N-best list is then generated using the HMM system. Segment-level feature vectors are generated from these alignments, and these segments are then classified using the SVMs. Posterior probabilities are computed using the sigmoid approximation previously discussed, which are then used to compute the utterance likelihood of each hypothesis in the N-best list. The N-best list is reordered based on the likelihood, and the top hypothesis is used to calibrate the performance of the system. An overview of the resulting hybrid system is shown in Fig. 5.

Figure 5. An overview of a hybrid HMM/SVM system based on an N-best list rescoring paradigm.

The above framework allows the SVM to rescore each N-best entry against the corresponding segmentation derived by a Viterbi alignment. It is also instructive to fix the segmentation so that each N-best entry is rescored against a single, common, segment-level feature stream. One approach to accomplishing this, which represents our second segmentation method, is to define a single feature stream by Viterbi-aligning the 1-best hypothesis generated by the baseline HMM system. A third segmentation approach uses a feature stream derived from the reference transcription to investigate the lower error bound when a perfect segmentation is available. These three segmentation methods are explored in the experiments below.

IV. EXPERIMENTAL RESULTS

We evaluated our SVM approach on three popular tasks: Deterding vowel classification [25], OGI Alphadigits [26] and Switchboard [27]. In these experiments, we compared the performance of the SVM approach to a baseline HMM system that was used to generate segmentations. In this section, we also present a detailed discussion of the classifier design in terms of data distribution and parameter selection.

A. Static Pattern Classification

The Deterding vowel data [25] is a relatively simple but popular static classification task used to benchmark nonlinear classifiers. In this evaluation, the speech data was collected at a 10 kHz sampling rate and low-pass filtered at 4.7 kHz. The signal was then transformed to 10 log-area parameters. A window duration of 50 ms was used for generating the features. The training set consisted of 528 frames from eight speakers, and the test set consisted of 462 frames from a different set of seven speakers. The speech data consisted of 11 vowels uttered by each speaker in an h*d context.

A traditional method for estimating an RBF classifier involves finding the RBF cluster centers for each class separately using a clustering mechanism such as K-means [28]. We then estimate weights corresponding to each cluster center to complete the definition of the classifier. The goal of this optimization process is typically not improved discrimination, but better representation. The training process requires using heuristics to determine the number of cluster centers.
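A minimal sketch of this traditional, representation-driven RBF estimation: K-means picks the centers and least squares fits the weights, with no discriminative control over the number of centers. The toy data and parameter values are illustrative, not the Deterding set.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm to pick RBF centers: the clustering step
    described above, which optimizes representation, not discrimination."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return centers

def fit_rbf_weights(X, y, centers, gamma=1.0):
    """Least-squares weights on top of the fixed centers."""
    Phi = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy two-class data with +/-1 labels (illustrative).
rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

centers = kmeans(X, k=4)
w = fit_rbf_weights(X, y, centers)
Phi = np.exp(-1.0 * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
acc = np.mean(np.sign(Phi @ w) == y)
print(acc)
```

The number of centers k is a heuristic input here, which is exactly the point of contrast with the SVM formulation discussed next.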
In contrast, the SVM approach for estimating RBF classifiers is more elegant: the number of centers (support vectors in this case) and their weights are learned automatically in a discriminative framework. The parameters of interest in tuning an RBF kernel are γ, which controls the variance of the kernel, and C, the parameter used to penalize training errors during SVM estimation [11,18]. Table 1 shows the performance of SVMs using RBF kernels for a wide range of parameter values. The results clearly indicate that the performance of the classifiers is closely tied to the parameter settings, though there exists a fairly wide range of values for which performance is comparable. Another interesting observation is the effect of C on the performance. Note that for values of C greater than 20 the performance does not change. This suggests that a penalty of 20 has already accounted for all overlapped data, and a larger value of C will have no additional benefit.

Table 1. Effect of the kernel parameters on the classification performance of RBF kernel-based SVMs on the Deterding vowel classification task.

Table 2 presents results comparing two types of SVM kernels to a variety of other classification schemes. A detailed description of the conditions for this experiment is given in [18]. Performance using the RBF kernel was better than most nonlinear classification schemes. The best error rate we report is 35%. This is worse than the best performance reported on this data set (30% using a speaker adaptation scheme called Separable Mixture Models [29]), but significantly better than the best neural network classifiers (e.g., Gaussian Node Network) [8]. We have observed similar trends on a variety of static classification tasks [30], and recent publications on applications of SVMs to other speech problems [31] report similar findings.
B. Spoken Letters and Numbers

We next evaluated this approach on a small vocabulary continuous speech recognition task: OGI Alphadigits (AD) [26]. This database consists of spoken letters and numbers recorded over long distance telephone lines. These words are extremely confusable for telephone-quality speech (e.g., "p" vs. "b"). AD is an extremely challenging task from an acoustic modeling standpoint, since the language model, a "self-loop" in which each letter or number can occur an arbitrary number of times and can follow any other letter or number, provides no predictive power.

Table 2. The SVM classifier using an RBF kernel provides significantly better performance than most other comparable classifiers on the Deterding vowel classification task.

A baseline HMM system described in [18] was used to generate segmented training data by Viterbi-aligning the training reference transcription to the acoustic data. The time marks derived from this Viterbi alignment were used to extract the segments. Before extraction, each feature dimension was normalized to the range [-1,1] to improve the convergence properties of the quadratic optimizers used as part of the SVM estimation utilities in SVMLight [32]. For each phone instance in the training transcription, a segment was extracted. This segment was divided into three parts as shown in Fig. 4. An additional parameter describing the log of the segment duration was added to yield a composite vector of size 3 * 39 features + 1 log duration = 118 features. Once the training sets were generated for all the classifiers, the SVMLight utilities were used to train each of the 29 phone SVM models. For each phone, an SVM model was trained to discriminate between this phone and all other phones (one-vs-all models), generating a total of 29 models.
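The preprocessing just described can be sketched as follows: per-dimension scaling to [-1, 1] and one-vs-all target construction for a single phone classifier. The tiny feature matrix and phone labels are illustrative; the actual training used SVMLight.

```python
import numpy as np

def normalize_features(X):
    """Scale each feature dimension to [-1, 1], as done before SVM training
    to help the quadratic optimizer converge."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant dimensions
    return 2.0 * (X - lo) / span - 1.0

def one_vs_all_labels(phone_labels, phone):
    """Targets for a single one-vs-all classifier: +1 for the phone of
    interest, -1 for everything else. With N phone classes this scheme
    needs N classifiers, versus N(N-1)/2 for one-vs-one."""
    return np.where(phone_labels == phone, 1, -1)

# Hypothetical composite vectors (2 dims for brevity) and phone labels.
X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
phones = np.array(["aa", "iy", "aa"])

Xn = normalize_features(X)
y = one_vs_all_labels(phones, "aa")
print(Xn.min(), Xn.max(), y)  # -1.0 1.0 [ 1 -1  1]
```

One such (Xn, y) pair would be built per phone, giving the 29 training sets mentioned above.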
In order to limit the number of samples (especially the out-of-class data) required by each classifier, a heuristic data selection process was used that required the training set to consist of equal amounts of within-class and out-of-class data. All within-class data available for a phone is by default part of the training set. The out-of-class data was randomly chosen such that one half came from phones that were phonetically similar to the phone of interest and one half came from all other phones [18].

Table 3 gives the performance of the hybrid SVM system as a function of the kernel parameters. These results were generated with 10-best lists whose total list error (the error inherent in the lists themselves) was 4.0%. The list error rate is the best performance any system can achieve by post-processing these N-best lists. Though we would ideally like this number to be as close to zero as possible, the constraints placed by the limitations of the knowledge sources used in the system (acoustic models, language models, etc.) force this number to be nonzero. In addition, the size of the N-best list was kept small to minimize computational complexity.

Table 3. Comparison of word error rates on the Alphadigits task as a function of the RBF kernel width (gamma) and the polynomial kernel order. Results are shown for a 3-4-3 segment proportion with the error penalty, C, set to 50. The WER for the baseline HMM system is 11.9%.

The goal in SRM is to build a classifier which balances generalization with discrimination on the training set. Table 3 shows how the RBF kernel parameter is used as a tuning parameter to achieve this balance. As gamma increases, the variance of the RBF kernel decreases. This in turn produces a narrower support region in a high-dimensional space. This support region requires a larger number of support vectors and leads to overfitting, as shown when gamma is set to 5.0. As gamma decreases, the number of support vectors decreases, which leads to a smoother decision surface. Eventually, we reduce the number of support vectors to a point where the decision region is overly smooth (gamma = 0.1), and performance degrades.

As with the vowel classification data, RBF kernel performance was superior to the polynomial kernel. In addition, as we observed in the vowel classification task, the generalization performance was fairly flat for a wide range of kernel parameters. The segmentation derived from the 1-best hypothesis from the baseline HMM resulted in an 11.0% WER using an RBF kernel. To provide an equivalent and fair comparison with the HMM system, we have rescored the 10-best lists with the baseline HMM system. Using a single segmentation to rescore the 10-best lists does force a few hypotheses in the lists to become inactive because of the duration constraints. The effective average list size after eliminating hypotheses that do not satisfy the duration constraints imposed by the segmentation is 6.9. The result for the baseline HMM system using these new N-best lists remains the same, indicating that the improvements provided by the hybrid system are indeed because of better classifiers and not the smaller search space.

As a point of reference, we also produced results in Table 3 that used the reference transcription to generate the segments. Using this oracle segmentation (generated using the reference transcription for Viterbi alignment), the best performance obtained on this task was 7.0% WER. This is a 36% relative improvement in performance over the best configuration of the hybrid system using hypothesis-based segmentations. The effective average list size after eliminating hypotheses that do not satisfy the duration constraints imposed by the oracle segmentation was 6.7. This experiment shows that SVMs efficiently lock on to good segmentations.
However, when we let SVMs choose the best segmentation and hypothesis combination by using the N-best segmentations, the performance gets worse (11.8% WER, as shown in Table 4). This apparent anomaly suggests the need to incorporate variations in segmentation into the classifier estimation process. Relaxing this strong interdependence between the segmentation and the SVM performance is a point for further research.

C. Conversational Speech

The Switchboard (SWB) [27] task, which is based on two-way telephone recordings of conversational speech, is very different from the AD task in terms of acoustic confusability and classifier complexity. The baseline HMM system for this task performs at 41.6% WER. We followed an SVM training process similar to the one described for the AD task. We estimated 43 classifiers for SWB. Since a standard cross-validation set does not exist for this task, we used 90,000 utterances from 60 hours of training data. The cross-validation set consisted of 24,000 utterances. We limited the number of positive examples for any classifier to 30,000. These positive examples were chosen at random.

A summary of our SWB results is shown in Table 4. In the first experiment, we used a segmentation derived from the baseline HMM system's top hypothesis to rescore the N-best list. This hybrid setup does improve performance over the baseline, albeit only marginally: 40.6% compared to a baseline of 41.6%. The second experiment was performed by using N segmentations to rescore each of the utterances in the N-best lists. From the experimental results on the AD task, we expected the performance with this setup to be worse than 40.6%. The performance of the system did indeed get worse, at 42.1% WER. When HMM models were used instead of SVMs with the same setup, the HMM system achieved a WER of 42.3%, compared to the baseline of 41.6%.
From this result we deduce that the lack of any language modeling information when we reorder the N-best lists is the reason for this degradation in performance. The next experiment was an oracle experiment. Use of the reference segmentation to rescore the N-best list gave a WER of 36.1%, confirming our hypothesis that the segmentation issue needs further exploration.

A second set of oracle experiments evaluated the richness of the N-best lists. The N-best list error rate was artificially reduced to 0% by adding the reference to the original 10-best lists. Rescoring these new N-best lists using the corresponding segmentations resulted in error rates of 38.1% and 9.1% WER on SWB and AD, respectively. The HMM system under a similar condition improves performance to 38.6%. On AD, the HMM system does not improve performance over the baseline even when the reference (or correct) transcription is added to the N-best list. This result suggests that SVMs are very sensitive to segmentations and can perform well if accurate segmentations are available.

Table 4. Experimental results comparing the baseline HMM system to a new hybrid SVM/HMM system. The hybrid system marginally outperforms the HMM system when used as a post-processor to the output of the HMM system, where the segmentation is derived using the hypothesis of the HMM system. However, when the reference transcription is included, or when the reference-based segmentation is used, SVMs offer a significant improvement.

Another set of experiments was run to quantify the absolute ceiling in performance improvements the SVM hybrid system can provide. This ceiling can be achieved when we use the hybrid system to rescore N-best lists that include the reference transcription and the reference-based segmentation. Using this setup, the system gives a WER of 5.8% on SWB and 3.3% on AD. This huge improvement should not be mistaken for a real improvement in performance, for two reasons.
First, we cannot guarantee that the reference segmentation is available at all times. Second, generating N-best lists with 0% WER is extremely difficult, if not impossible, for conversational speech. This improvement should instead be viewed as an indication that, by using good segmentations to rescore good N-best lists, the SVM/HMM hybrid system has the potential to improve performance. Also, using matched training and test segmentations seems to improve performance dramatically.

Table 4 summarizes the important results in terms of the various segmentations and N-best lists that were processed to arrive at the final hypothesis. The key point to note here is that experiments 2 and 4 are designed such that both the hybrid system and the HMM system are operating under the same conditions and offer a fair comparison of the two systems. For these experiments, since we reorder N-best lists by using segmentations corresponding to each of the hypotheses in the list, both systems have the opportunity to evaluate the same segments. On the other hand, if we were to run the experiments using a single segmentation (experiment 1, for example), the HMM system cannot use the segmentation information while the hybrid system can. Experiments 2 and 4 are key in order to compare both systems from a common point of reference. Experiment 4 suggests that when the HMM and hybrid systems process good segmentations and rich N-best lists, the hybrid system outperforms the HMM system, significantly in the case of AD and marginally on SWB.

V. CONCLUSIONS

This paper addresses the use of a support vector machine as a classifier in a continuous speech recognition system. The technology has been successfully applied to two speech recognition tasks. A hybrid SVM/HMM system has been developed which uses SVMs to post-process data generated by a conventional HMM system.
The results obtained in these experiments clearly indicate the classification power of SVMs and affirm their use for acoustic modeling. The oracle experiments reported here clearly show the potential of this hybrid system while highlighting the need for further research into the segmentation issue. Our ongoing research into new acoustic modeling techniques for speech recognition is taking two directions. First, it is clear that for a formalism such as SVMs, there is a need for an iterative SVM classifier estimation process embedded within the learning loop of the HMM system, to expose the classifier to the same type of ambiguity observed during recognition; mismatched training and evaluation conditions are harmful in recognition experiments. This should be done in a manner similar to the implicit integration of optimal segmentation and MCE/MMI-based discriminative training of HMMs [4,5]. Second, the basic SVM formalism suffers from three fundamental problems: scalability, sparsity, and the lack of a Bayesian formulation. Recent related research [20] based on relevance vector machines (RVMs) directly addresses the last two issues. Finally, the algorithms, software, and recognition systems described in this work are available in the public domain as part of our speech recognition toolkit (see http://www.isip.msstate.edu/projects/speech/software).

VI. ACKNOWLEDGMENTS

The inspiration for this work grew out of discussions with Vladimir Vapnik at the 1997 Summer Workshop at the Center for Speech and Language Processing at the Johns Hopkins University. We are also grateful to many students at the Institute for Signal and Information Processing at Mississippi State University, including Naveen Parihar and Bohumir Jelinek, for their contributions to the software infrastructure required for the experiments.

VII. REFERENCES

[1] F. Jelinek, “Continuous Speech Recognition by Statistical Methods,” Proceedings of the IEEE, vol. 64, no. 4, pp. 532-537, 1976.
[2] L.R. Rabiner and B.H.
Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, USA, 1993.
[3] A.P. Dempster, N.M. Laird and D.B. Rubin, “Maximum Likelihood Estimation from Incomplete Data,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977.
[4] E. McDermott, “Discriminative Training for Speech Recognition,” Ph.D. Thesis, Waseda University, Japan, 1997.
[5] P. Woodland and D. Povey, “Very Large Scale MMIE Training for Conversational Telephone Speech Recognition,” Proceedings of the Speech Transcription Workshop, Univ. of Maryland, MD, USA, 2000.
[6] H.A. Bourlard and N. Morgan, Connectionist Speech Recognition — A Hybrid Approach, Kluwer Academic Publishers, Boston, Massachusetts, USA, 1994.
[7] J.S. Bridle and L. Dodd, “An Alphanet Approach to Optimizing Input Transformations for Continuous Speech Recognition,” Proc. of the Int. Conf. on ASSP, Toronto, Canada, vol. 1, pp. 277-280, Apr. 1991.
[8] A.J. Robinson, Dynamic Error Propagation Networks, Ph.D. Thesis, Cambridge University, UK, 1989.
[9] G.D. Cook and A.J. Robinson, “The 1997 ABBOT System for the Transcription of Broadcast News,” Proc. of the DARPA BNTU Workshop, Lansdowne, VA, USA, 1997.
[10] N. Ström, L. Hetherington, T.J. Hazen, E. Sandness and J. Glass, “Acoustic Modeling Improvements in a Segment-Based Speech Recognizer,” Proc. IEEE ASR Workshop, Keystone, CO, Dec. 1999.
[11] V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, New York, USA, 1998.
[12] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, New York, USA, 1995.
[13] C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Knowledge Discovery and Data Mining, vol. 2, no. 2, pp. 121-167, 1998.
[14] J. Kwok, “Moderating the Outputs of Support Vector Machine Classifiers,” IEEE Trans. on Neural Networks, vol. 10, no. 5, 1999.
[15] J.
Platt, “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods,” in Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, USA, 1999.
[16] E. Allwein, R.E. Schapire and Y. Singer, “Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers,” Journal of Machine Learning Research, vol. 1, pp. 113-141, December 2000.
[17] J. Weston and C. Watkins, “Support Vector Machines for Multi-Class Pattern Recognition,” Proceedings of the Seventh European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 219-224, 1999.
[18] A. Ganapathiraju, Support Vector Machines for Speech Recognition, Ph.D. Thesis, Mississippi State University, MS State, MS, USA, 2002.
[19] A. Ganapathiraju, J. Hamaker and J. Picone, “A Hybrid ASR System Using Support Vector Machines,” Proc. Int. Conf. Spoken Language Processing, vol. 4, pp. 504-507, Beijing, China, October 2000.
[20] J. Hamaker, J. Picone, and A. Ganapathiraju, “A Sparse Modeling Approach to Speech Recognition Based on Relevance Vector Machines,” Proc. Int. Conf. Spoken Language Processing, vol. 2, pp. 1001-1004, Denver, CO, USA, September 2002.
[21] N. Smith and M. Niranjan, “Data-dependent Kernels in SVM Classification of Speech Patterns,” Proc. Int. Conf. Spoken Language Processing, pp. 297-300, Beijing, China, October 2000.
[22] M.E. Tipping, “The Relevance Vector Machine,” Advances in Neural Information Processing Systems 12, pp. 652-665, MIT Press, Cambridge, MA, USA, 2000.
[23] P.E. Gill, W. Murray, and M.H. Wright, Practical Optimization, Academic Press, London, U.K., 1981.
[24] J. Chang and J. Glass, “Segmentation and Modeling in Segment-based Recognition,” Proc. Eurospeech, pp. 1199-1202, Rhodes, Greece, Sept. 1997.
[25] D. Deterding, M. Niranjan and A.J. Robinson, “Vowel Recognition (Deterding data),” available at http://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/connectionist-bench/vowel/, 2004.
[26] R.
Cole, “Alphadigit Corpus v1.0,” http://www.cse.ogi.edu/CSLU/corpora/alphadigit, CSLU, OGI, Portland, OR, 1998.
[27] J. Godfrey, E. Holliman and J. McDaniel, “SWITCHBOARD: Telephone Speech Corpus for Research and Development,” Proc. ICASSP, San Francisco, CA, USA, vol. 1, pp. 517-520, March 1992.
[28] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, California, USA, 1990.
[29] J. Tenenbaum and W.T. Freeman, “Separating Style and Content,” Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, USA, 1997.
[30] J. Picone and D. May, “A Visual Introduction to Pattern Recognition,” http://www.isip.msstate.edu/projects/speech/software/demonstrations/applets/util/pattern_recognition/current/index.html, MS State, 2003.
[31] E. Singer, P.A. Torres-Carrasquillo, T.P. Gleason, and W.M. Campbell, “Acoustic, Phonetic, and Discriminative Approaches to Automatic Language Identification,” Proc. Eurospeech, pp. 1345-1348, Geneva, Switzerland, Sept. 2003.
[32] T. Joachims, SVMLight: Support Vector Machine, http://www.cs.cornell.edu/People/tj/svm_light/, Cornell Univ., Nov. 1999.
[33] J. Hamaker and J. Picone, “Advances in Speech Recognition Using Sparse Bayesian Methods,” IEEE Trans. SAP (in review).

Aravind Ganapathiraju (M’92) was born in Hyderabad, India in 1973. He received a B.E. degree from Regional Engineering College, Trichy, India in 1995, an M.S. degree in electrical engineering in 1997, and a Ph.D. degree in computer engineering in 2002, both from Mississippi State University. He is currently the VP of Core Technology at Conversay. He has been working on various aspects of speech recognition technology for the past eight years. His research interests include providing technology for limited-resource platforms using statistical techniques derived from various fields, including pattern recognition and neural networks.
He is also interested in nascent technology areas such as support vector machines and cognitive science. Dr. Ganapathiraju has served as a reviewer for several journals in the signal processing area.

Jonathan E. Hamaker (M’96) was born in Birmingham, Alabama, USA in 1974. He received a B.S. degree in electrical engineering in 1997 and an M.S. degree in computer engineering in 2000, both from Mississippi State University. He is currently a member of technical staff at Microsoft, where he is focused on the development of core acoustic modeling techniques for speech platforms. His primary research interest is the application of machine learning to speech recognition in constrained environments. Further interests include noise-robust acoustic modeling and discriminative modeling. Mr. Hamaker has served as a reviewer for several journals in the signal processing and machine learning areas.

Joseph Picone (M’80-SM’90) was born in Oak Park, Illinois, USA in 1957. He received a B.S. degree in 1979, an M.S. degree in 1980, and a Ph.D. degree in 1983, all in electrical engineering from Illinois Institute of Technology. He is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. His primary research interests are the application of machine learning techniques to speech recognition and the development of public domain speech technology. Dr. Picone is a registered Professional Engineer. He has also served in several capacities with the IEEE, been awarded 8 patents, and has published more than 120 papers in the area of speech processing.