SPRING 2004
Computer Science & Linguistics

Building a Regionally Inclusive Dictionary for Speech Recognition

Speech Recognition (SR) is the automated conversion of speech into written text. Applications range from simple phone-based information services to commercial-grade automated customer service systems, such as those for airline phone reservations. While this process is complex in and of itself, it is further complicated by the fact that speakers (the users) from different parts of the country have varying accents and pronounce the same words differently. Our aim is to create a more speaker-independent SR system while maintaining speed and accuracy of transcription. This requires the construction of an SR dictionary that takes into account the existence of multiple pronunciations for the same word. However, too many alternate pronunciations overload the system and are detrimental to accuracy and speed. By finding the optimal number of pronunciations per word, the percentage of words correctly identified by the SR system increased from 78% using the traditional technique to 85.7% using the improved method outlined in this article. This brings the technology 35% closer to the goal of complete recognition and the use of speech as the primary method of human-computer interaction.

Justin Burdick

Speech Recognition (SR) is a process that transcribes speech into text using a computer. Many repetitive phone tasks can be automated with speech recognition technology, saving businesses significant amounts of money. However, the transcription of everyday human speech is significantly more difficult than the recognition of a small set of words, as in an SR-based phone information-retrieval system (e.g. 411 services). A complete Speech Recognition system is fairly complex and consists of various subsystems that interact to convert speech into written text, as shown in Figure 1.

A typical SR system uses a common pattern-matching tool called Hidden Markov Models (HMMs) to model units of speech known as "phonemes" (similar to syllables) [1]. The models are first trained using a set of training data, which consists of a large number of wave sound files containing speech along with the corresponding text transcriptions. Each speech file is converted into a series of observations known as "feature vectors": sets of numbers that describe the sound mathematically by extracting specific information from the waveform at 25-millisecond intervals. The computer then matches this numerical representation to the corresponding text transcription included in the training data. Applied over the entire training set, this process creates a series of models that later allow the SR system to translate speech into text.

The training process results in the creation of two data files: one consisting of a model of properties for each phoneme and another containing a list of pronunciations for each word. Phonemes, the most fundamental units of speech, are single sounds; each phoneme is modeled by its average sound, the variation of that sound across speakers, and the transition probabilities between it and other phonemes. The list of pronunciations, termed the "pronouncing dictionary," contains each of the words used during training along with the sequence of phonemes that compose that word. Once the models are trained, the system can be used to perform recognition.
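As a rough illustration of the two training outputs described above, the sketch below shows one way a phoneme model and a pronouncing dictionary might be represented in code. The class and field names are illustrative assumptions, not the actual file formats produced by an HMM toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhonemeModel:
    """Toy stand-in for one trained phoneme model: its average sound,
    the variation of that sound across speakers, and the transition
    probabilities between it and other phonemes."""
    name: str
    mean: List[float]        # average feature vector for this phoneme
    variance: List[float]    # spread of that sound across speakers
    transitions: Dict[str, float] = field(default_factory=dict)

# The second training output, the "pronouncing dictionary": each word maps
# to one or more phoneme sequences observed in the training transcriptions.
PronouncingDictionary = Dict[str, List[List[str]]]

example_dictionary: PronouncingDictionary = {
    "maybe": [["m", "ey", "b", "iy"], ["m", "ey", "v", "iy"]],
}
```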
In order to produce meaningful sentences, the SR system relies on a dictionary and a grammar model as well as the trained phoneme models from the training phase. Transcription is accomplished using the Viterbi algorithm. This algorithm divides the incoming speech into individual "observations" and processes one observation at a time, comparing it to every possible phoneme. When the algorithm moves to the next observation, it eliminates paths that have a low matching probability. Through this elimination at each observation, only the most probable paths survive to the end of the observation sequence; the recognized word sequence can then be generated.

The dictionary is important in this process. As the system processes each observation, it must check the dictionary to see whether the phoneme sequence recognized so far forms a word. Once a word is recognized, the system uses the grammar file to influence the choice of the next word, which is very valuable for making the sentence meaningful, similar to the grammar check in a word-processing program.

Figure 1. A simplified model of an HMM-based speech recognition system

Implementing an effective SR system requires the synthesis of concepts from several fields, including:
• Signal Processing (for extracting important features from the sound waves)
• Probability and Statistics (for defining and training the models and performing recognition)
• Linguistics (for building the dictionary and the grammar model)
• Computer Science (for creating efficient search algorithms)

Improvements in any of the subsystems above could result in better overall performance. The following performance goals are among the main objectives of any SR system [2]:

1. Increased Accuracy. Accuracy is measured as the percentage of words (or sentences) that are correctly detected. To emphasize the importance of this value, consider an SR system with an accuracy of 99%. This means that during dictation, one word in every 100 is identified incorrectly. For the average dictation, such an error rate would result in approximately six errors per single-spaced page. Searching for and correcting these typos is a tedious and time-consuming task, which shows that even an accuracy of 99% may not be sufficiently high for dictation.

2. Increased Speed. Many speech recognition applications must run on small hand-held devices with limited CPU power. It is important for the SR system to be as computationally inexpensive as possible, so that it can transcribe speech in real time even on a low-performance platform such as a PDA or cell phone.

3. Speaker Independence. Most SR systems can operate with very high accuracy if they are trained for a specific speaker. However, problems can arise when the system attempts to transcribe the speech of a different person. The reasons for this include, but are not limited to:
a. Different speakers may pronounce the same phonemes (sub-word sounds) differently; e.g. a speaker from Brooklyn may pronounce certain vowels differently than a speaker from California.
b. Different speakers may pronounce the same words with a different sequence of phonemes; e.g. the word "maybe" could be pronounced as "m ey b iy" or as "m ey v iy."

In this project, our goal is to improve the recognition accuracy of speaker-independent SR systems by taking into account that different speakers may pronounce the same words differently. This requires building a dictionary that has multiple pronunciations for any given word.
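The path pruning performed by the Viterbi algorithm, described earlier in this section, can be illustrated with a toy hidden Markov model. This is a minimal sketch assuming discrete observation symbols and made-up probabilities; a real SR system scores continuous feature vectors against the trained phoneme models and searches a network built from the dictionary and grammar.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Toy Viterbi decoder: for each state, keep only the most probable
    path that reaches it at the current observation, so low-probability
    paths are pruned as the algorithm moves forward (log domain for
    numerical stability)."""
    # best[s] = (log probability, state path) of the best path ending in s
    best = {s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # choose the single best predecessor path to extend into state s
            prev = max(states, key=lambda p: best[p][0] + math.log(trans_p[p][s]))
            lp, path = best[prev]
            lp += math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            new_best[s] = (lp, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Made-up example: two phoneme-like states and three discrete observations.
states = ["ax", "b"]
start_p = {"ax": 0.6, "b": 0.4}
trans_p = {"ax": {"ax": 0.5, "b": 0.5}, "b": {"ax": 0.3, "b": 0.7}}
emit_p = {"ax": {"o1": 0.7, "o2": 0.3}, "b": {"o1": 0.2, "o2": 0.8}}
print(viterbi(["o1", "o2", "o2"], states, start_p, trans_p, emit_p))
```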
The details of building such a dictionary are explained below. In order to test the effects of this approach, a set of carefully designed experiments was conducted, the results of which appear in the Results and Discussion section.

The TIMIT (Texas Instruments-Massachusetts Institute of Technology) database was used for both training and testing of the SR system. This database contains audio files of sample sentences as well as word-level and phoneme-level transcriptions. However, we needed to create our own pronouncing dictionary before we could perform training and recognition. Each word in a pronouncing dictionary is associated with a sequence of phonemes that, when pronounced, approximates the sound of that word. Some words, however, need multiple pronunciations in order to be represented well (see Table 1 for examples).

Table 1. Sample dictionary entries. Multiple pronunciations can exist for the same word. For example, the first pronunciation of "coffee" is that of a typical New Yorker, while the second is that of someone from the Old Northwest.

    Word        Phoneme-level Transcription
    As          ae z
                ax z
    Coauthors   kcl k ow ao th axr z
    Coffee      k aa iy f
                k ao f iy

All of the TIMIT data was used in the experiments: about 70% was used to train the models, and the rest was used for testing. Table 2 summarizes key characteristics of the TIMIT data.

Table 2. Key characteristics of the TIMIT data

    Number of...
    Speakers                          630 (70% male, 30% female)
    Utterances                        6300 (10 per speaker)
    Distinct texts                    2342
    Words                             6099
    Utterances in the training set    4620
    Utterances in the test set        1680
    Male speakers                     326 (training set) + 112 (test set)
    Female speakers                   136 (training set) + 56 (test set)

Method limitations

In order to provide a versatile training set, data was collected from speakers spanning eight different geographical areas to capture regional dialects [4] (shown in Figure 2). Unfortunately, these regional dialects introduce problems for the construction of a dictionary. Since many of the speakers pronounce words differently, especially in the case of short, common words, multiple transcriptions for the same word were common. In fact, the unedited dictionary contained an average of two pronunciations per word [1]. However, these pronunciations tended to cluster around certain words, with most words having only one pronunciation but others having six or more. A certain level of multiple pronunciations can be helpful in capturing common divergences in pronunciation (e.g. potato/pot_to), as even the best hand-made dictionaries contain 1.1 to 1.2 pronunciations per word [5]. Too many pronunciations per word, however, can be misleading when the computer performs recognition, because a poorly recognized string of phonemes could cause a word mismatch that makes a sentence meaningless.

Figure 2: Speaker Breakdown by Region (total number of speakers = 630)

Methodology

In order to reduce the number of redundant definitions in the dictionary, two methods were investigated. The first, called "skimming," simply deletes low-frequency transcriptions: a threshold is defined based on the most commonly occurring pronunciations, and anything below this threshold is removed from the dictionary. The second method, called "percentaging," encodes the frequency at which each pronunciation was encountered into the definitions themselves, which a speech recognizer can use to modify its recognition network.
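Both skimming and percentaging start from a tally of how often each word/pronunciation pair occurs in the training transcriptions, as described below. The sketch that follows shows one possible form of that counting step; the function name and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def count_pronunciations(
    training_pairs: Iterable[Tuple[str, Tuple[str, ...]]]
) -> Dict[str, Counter]:
    """Tally how many times each word/phoneme-sequence combination occurs
    in the word- and phoneme-level training transcriptions."""
    counts: Dict[str, Counter] = {}
    for word, phonemes in training_pairs:
        counts.setdefault(word, Counter())[phonemes] += 1
    return counts

# Hypothetical training observations for the word "about".
pairs = ([("about", ("ax", "b", "aw", "t"))] * 17 +
         [("about", ("ax", "b", "aw", "d"))] * 4 +
         [("about", ("ax", "b", "ah"))] * 2)
print(count_pronunciations(pairs)["about"].most_common())
```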
As mentioned earlier, the dictionary was created by examining all the speech transcriptions during training. The original system simply added every combination of word and phoneme sequence it encountered to the dictionary. The new method does the same, except that it keeps a count of how many times each combination occurs. These counts can then be used to perform either skimming (removal of low-occurrence pairs) or percentaging (attaching percentage-of-occurrence information to the entries). This paper examines the effect on SR system accuracy of skimming alone, percentaging alone, and skimming followed by percentaging.

Results and Discussion

The two methods of dictionary editing (skimming and percentaging) were tested independently and then as a two-step method (skimming followed by percentaging). As shown in Figure 4, skimming provided a much larger accuracy gain than percentaging. When the two were used together, percentaging added only a very slim increase in accuracy over skimming alone.

An unexpected bonus of the skimming method is a reduction in required recognition time. Since there are fewer possible recognition sequences, the computer can process each speech fragment more rapidly. At optimal skimming (30%), the test took only two hours rather than the full three, a 33% reduction in time. Furthermore, at this level of skimming there was an average of 1.6 pronunciations per word, as opposed to 2.0 pronunciations without skimming. Thus, not only does skimming provide more accurate speech recognition, it also allows for faster recognition.

Table 3. Simulation Test Results

Figure 4. Test Results

Conclusion

Since people speak in many different ways, it might be expected that a large dictionary, inclusive of all pronunciation and dialect differences, would be a valuable aid to recognition. However, this paper demonstrates that the most effective SR system dictionary may be one with a small number of select alternate pronunciations per word. The dictionary must be built so that a balance is found between inclusiveness of pronunciation variants and the limitations of a computerized recognition system. This study observed that while a typical SR system contains on average 2.0 pronunciations per word, maximum SR system accuracy is achieved at 1.6 pronunciations per word. Tailoring a dictionary to contain fewer pronunciations per word is best achieved manually; however, given the size of a typical SR system dictionary (13,000 entries), a computerized method of deleting erroneous and infrequently used pronunciations is more practical. This study revealed that skimming both effectively increases recognition accuracy and results in a 33% reduction in the time required for speech recognition tasks.

Figure 3: The process of modifying the dictionary

An example of an unaltered dictionary:

    1    about    ax b ae t
    2    about    ax b ah
    4    about    ax b aw d
    17   about    ax b aw t
    1    about    b ah
    1    about    b aw

Clearly, "about ax b aw t" is the most commonly given pronunciation for the word "about," and so it should be given the highest weight, whereas the "b ah" and "b aw" definitions should simply be erased.

After skimming:

    4    about    ax b aw d
    17   about    ax b aw t

Here skimming has deleted the least frequently occurring pronunciations.

After percentaging:

    3.85%    about    ax b ae t
    7.69%    about    ax b ah
    15.38%   about    ax b aw d
    65.38%   about    ax b aw t
    3.85%    about    b ah
    3.85%    about    b aw

Here percentaging has inserted the frequency of each pronunciation as a percentage.
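To make the transformation in Figure 3 concrete, the sketch below applies both edits to the counted entries for "about." The article does not spell out exactly how the skimming threshold was derived from the most common pronunciations, so the rule used here (a fraction of the highest count) is an assumption chosen so the example reproduces the figure.

```python
from typing import Dict, Tuple

Pron = Tuple[str, ...]

def skim(counts: Dict[Pron, int], keep_fraction: float = 0.2) -> Dict[Pron, int]:
    """Skimming: drop pronunciations whose count falls below a threshold
    tied to the most frequently occurring pronunciation. The 0.2 fraction
    is an illustrative assumption, not the paper's exact rule."""
    cutoff = keep_fraction * max(counts.values())
    return {p: c for p, c in counts.items() if c >= cutoff}

def percentage(counts: Dict[Pron, int]) -> Dict[Pron, float]:
    """Percentaging: encode each pronunciation's share of the training
    occurrences as a percentage instead of a raw count."""
    total = sum(counts.values())
    return {p: round(100.0 * c / total, 2) for p, c in counts.items()}

# Counted entries for "about" from the unaltered dictionary in Figure 3.
about: Dict[Pron, int] = {
    ("ax", "b", "ae", "t"): 1, ("ax", "b", "ah"): 2,
    ("ax", "b", "aw", "d"): 4, ("ax", "b", "aw", "t"): 17,
    ("b", "ah"): 1, ("b", "aw"): 1,
}
print(skim(about))        # keeps "ax b aw d" (4) and "ax b aw t" (17)
print(percentage(about))  # "ax b aw t" -> 65.38, "ax b aw d" -> 15.38, ...
```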
References

1. Mohajer, K., Hu, Z.-M., and Pratt, V. Time Boundary Assisted Speech Recognition. International Journal of Information Fusion (Special Issue on Multi-Sensor Information Fusion for Speech Processing Applications).
2. Rabiner, L. and Juang, B.-H. Fundamentals of Speech Recognition, Chapter 6. Prentice Hall Signal Processing Series, New Jersey, 1993.
3. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., and Woodland, P. The HTK Book, Chapter 5. Cambridge University Engineering Department, 2001-2002.
4. Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., and Dahlgren, N. TIMIT Printed Documentation, Chapter 5. U.S. Department of Commerce, 1993.
5. Hain, T., Woodland, P. C., Evermann, G., et al. New Features in the CU-HTK System for Transcription of Conversational Telephone Speech. Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Vol. 1, 2001, pp. 57-60.
6. Culicover, Peter. Lecture: An Introduction to Language in the Humanities. Spring 2002. http://www.ling.ohio-state.edu/~culicove/H201/Language variation.pdf

Justy Burdick

Justy Burdick is a sophomore majoring in Electrical Engineering and is considering a minor in Computer Science. He would like to thank Professor Vaughan Pratt and the URP office for sponsoring this Research Experience for Undergraduates project. Furthermore, he would like to thank Keyvan Mohajer for the guidance and help he received at every step of the research process. Finally, he would like to thank the 2002 and 2003 speech group REU team members: Simon Hu, Melissa Mansur, Ryan Bickerstaff, Kenneth Lee, and Gu Pan Grace.