Building a Regionally Inclusive Dictionary for Speech Recognition

SPRING 2004
Computer Science & Linguistics
Speech Recognition (SR) is the automated conversion of speech into written text. Applications
range from simple phone-based information services to commercial-grade automated customer
service systems, such as those for airline phone reservations. While this process is complex in
and of itself, it is further complicated by the fact that speakers (the users) from different parts of
the country have varying accents and pronounce the same words differently. Our aim is to create
a more speaker-independent SR system while maintaining speed and accuracy of transcription.
This requires the construction of an SR dictionary that takes into account the existence of multiple
pronunciations for the same word. However, the existence of too many alternate pronunciations
overloads the system and is detrimental to accuracy and speed. By finding the optimal number of
pronunciations per word, the percentage of words correctly identified by the SR system increased
from 78% using the traditional technique to 85.7% using the improved method outlined in this
article. This closes 35% of the remaining gap toward the goal of complete recognition and the use of
speech as the primary method of human-computer interaction.
Justin Burdick
Speech Recognition (SR) is a process that transcribes speech into text using a computer. Many repetitive phone tasks can be automated with speech recognition technology, saving businesses significant amounts of money.
However, the transcription of everyday
human speech is significantly more difficult than simply the recognition of a
small set of words, as is the case in an SR-based phone information-retrieval
system (e.g. 411 services).
A complete Speech Recognition
system is fairly complex and consists
of various subsystems that interact together to convert speech into written
text, as shown in Figure 1. A typical
SR system utilizes a common tool for
pattern-matching called Hidden Markov Models (HMM) to model units of
speech known as “phonemes” (the individual sounds that make up words) [1]. The models are first
trained using a set of training data,
which consists of a large number of
wave sound files with speech and the
corresponding text transcriptions. Each
speech file is converted into a series of
observations known as “feature vectors.” These are sets of numbers that
describe the sound mathematically by
extracting specific information from
the waveform at 25-millisecond intervals. The computer then matches this
numerical representation to the corresponding text transcription included
in the training data. Applied over the
entire training set, this process creates
a series of models which later allow the
SR system to translate speech into text.
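To make this front end more concrete, the short Python sketch below frames a waveform into fixed-length windows and computes one small feature vector per frame. It is only an illustrative sketch: the 16 kHz sample rate is an assumption, and log energy plus zero-crossing rate stand in for the richer features (such as MFCCs) that a real SR front end would extract.

import numpy as np

def extract_feature_vectors(waveform, sample_rate=16000, frame_ms=25):
    """Split a waveform (1-D NumPy array of samples) into fixed-length frames
    and compute a small illustrative feature vector per frame.
    Real systems extract richer features such as MFCCs."""
    frame_len = int(sample_rate * frame_ms / 1000)           # samples per frame
    n_frames = len(waveform) // frame_len
    features = []
    for i in range(n_frames):
        frame = waveform[i * frame_len:(i + 1) * frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)       # overall loudness
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # rough noisiness cue
        features.append([log_energy, zcr])
    return np.array(features)  # shape (n_frames, 2): one observation per frame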
The training process results in the creation of two data files: one containing a model of properties for each phoneme and another containing a list of pronunciations for each word. Phonemes, the
most fundamental units of speech, are
single sounds; each phoneme is modeled by its average sound, the variation
of that sound across various speakers,
and the transition probabilities between
it and other phonemes. The list of pronunciations, termed the “pronouncing
dictionary,” contains each of the words
used during training along with the
sequence of phonemes that compose
that word.
Once the models are trained, the
system can be used to perform recognition. In order to produce meaningful
sentences, the SR system relies on a
dictionary and a grammar model as
well as the trained phoneme models
from the training phase. Transcription is accomplished using the Viterbi
algorithm. This algorithm divides the
incoming speech into individual “observations” and processes one observation at a time by comparing it to every
possible phoneme. When the algorithm
moves to the next observation, it eliminates paths that have a low matching
probability. By this elimination at each
observation, only the most probable
paths survive until the end of the observation sequence; the recognized word
sequence can then be generated. The
dictionary is important in this process.
As the system processes each observation, it must check the dictionary to see whether the phonemes recognized so far form a complete word. Once a word is recognized, the system uses the grammar file to influence the choice of the next word, which helps keep the sentence meaningful, similar to the grammar check in a word-processing program.
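The Python sketch below illustrates the core of such a search. It is a simplified, hypothetical version: the trained phoneme models are reduced to a table of per-observation match probabilities, and the elimination of unlikely paths is realized as simple beam pruning, which is one common way to implement that step; a real recognizer also weaves the dictionary and grammar into the search network.

import numpy as np

def viterbi_beam(observation_probs, transition_probs, beam_width=3):
    """Toy Viterbi search with beam pruning.

    observation_probs: (T, N) array with P(observation_t | state_n)
    transition_probs:  (N, N) array with P(state_j | state_i)
    beam_width: number of surviving states per step (assumed <= N)
    Returns the most probable state (phoneme) sequence as a list of indices.
    """
    T, N = observation_probs.shape
    log_obs = np.log(observation_probs + 1e-12)    # log domain avoids underflow
    log_trans = np.log(transition_probs + 1e-12)

    scores = log_obs[0].copy()                     # best score ending in each state
    backpointers = np.zeros((T, N), dtype=int)

    for t in range(1, T):
        # score of extending every surviving path into every state
        candidate = scores[:, None] + log_trans + log_obs[t][None, :]
        backpointers[t] = np.argmax(candidate, axis=0)
        scores = np.max(candidate, axis=0)
        # prune: keep only the beam_width most probable partial paths
        cutoff = np.sort(scores)[-beam_width]
        scores[scores < cutoff] = -np.inf

    # trace the best surviving path backwards through the backpointers
    path = [int(np.argmax(scores))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]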
Figure 1. A simplified model of an HMM-based speech recognition system

Implementing an effective SR system requires the synthesis of concepts from several fields, including:
• Signal Processing (for extracting
important features from the sound
waves)
• Probability and Statistics (for defining and training the models and performing recognition)
• Linguistics (for building the dictionary and the grammar model)
• Computer Science (for creating efficient search algorithms)
Improvements in any of the subsystems above could result in better
overall performance. The following
performance goals are among the main
objectives of any SR system [2]:
1. Increased Accuracy
Accuracy is measured as a percentage of words (or sentences) that are
correctly detected. To emphasize the importance of this value, consider an SR system with an accuracy of 99%. This means that during dictation, one word in every 100 is identified incorrectly.
For the average dictation, such an error
rate would result in approximately six
errors per single-spaced page. Searching for and correcting these typos is a
tedious and time-consuming task. This
shows that even an accuracy of 99%
may not be sufficiently high for the
task of dictation.
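As a concrete illustration of how such an accuracy figure can be computed, the Python sketch below aligns a recognized word sequence against its reference transcription using a standard edit-distance alignment and reports the percentage of reference words recognized correctly. The alignment-based scoring is a common convention shown here only as an example; it is not necessarily the exact scoring procedure used in this study.

def word_accuracy(reference, hypothesis):
    """Percentage of reference words correctly recognized, computed from a
    standard edit-distance (Levenshtein) alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    errors = dp[len(ref)][len(hyp)]
    return 100.0 * (len(ref) - errors) / len(ref)

# One substituted word in a five-word sentence gives 80% accuracy:
print(word_accuracy("she had your dark suit", "she had your dock suit"))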
2. Increased Speed
Many speech recognition applications must run on small hand-held
devices with limited CPU power.
It is important for the SR system
to be as computationally inexpensive as possible. This allows the
system to transcribe speech in realtime, even on a low-performance
system like a PDA or cell phone.
3. Speaker Independence
Most SR systems can operate with
very high accuracy if they are trained to
a specific speaker. However, problems
can arise when the system attempts to
transcribe the speech of a different person. The reasons for this include, but
are not limited to:
a. Different speakers may pronounce
the same phonemes (sub-words)
differently, e.g. a speaker from
Brooklyn may pronounce certain
vowels differently than a speaker
from California.
b. Different speakers may pronounce
the same words differently, with a
different sequence of phonemes,
e.g. the word “maybe” could be
pronounced as “m ey b iy” or as
“m ey v iy.”
In this project, our goal is to improve the recognition accuracy of speaker-independent SR systems by taking into account
that different speakers may pronounce
similar words differently. This requires
building a dictionary that has multiple
pronunciations for any given word. The
details of building such a dictionary are
explained below. In order to test the effects of this approach, a set of carefully
designed experiments were conducted,
the results of which are in the Results
and Discussion section.
The TIMIT (Texas Instruments-Massachusetts Institute of Technology) database was used for both training and testing of the SR system. This database contains audio files of sample sentences as well as word-level and phoneme-level transcriptions. However, we needed to create our own pronouncing dictionary before we could perform training and recognition. Each word in a pronouncing dictionary is associated with a series of phonemes that, when pronounced in sequence, approximate the sound of that word. Some words need multiple pronunciations in order to be represented well (see Table 1 for examples). All of the TIMIT data was used in the experiments: about 70% was used to train the models, and the rest was used for testing. Table 2 summarizes key characteristics of the TIMIT data.
Method Limitations
In order to provide a versatile
training set, data was collected from
speakers spanning eight different geographical areas to capture regional dialects [4] (shown in Figure 2). Unfortunately, these regional dialects introduce
problems for the construction of a
dictionary. Since many of the speakers pronounce words differently—especially in the case of short, common
words—multiple transcriptions for the
same word were common.

Figure 2: Speaker Breakdown by Region (total number of speakers = 630).

Word        Phoneme-level Transcription
As          ae z
            ax z
Coauthors   kcl k ow ao th axr z
Coffee      k aa iy f
            k ao f iy

Table 1. Sample dictionary entries. Multiple pronunciations can exist for the same word. For example, the first pronunciation of coffee is that of a typical New Yorker, while the second is that of someone from the old Northwest.

Number of...
Speakers                          630 (70% male, 30% female)
Utterances                        6300 (10 per speaker)
Distinct texts                    2342
Words                             6099
Utterances in the training set    4620
Utterances in the test set        1680
Male speakers                     326 (training set) + 112 (test set)
Female speakers                   136 (training set) + 56 (test set)

Table 2. Key characteristics of the TIMIT data

In fact, the unedited dictionary contained an average of two pronunciations per word [1].
However, these pronunciations tended to cluster around certain words, with most words having only one pronunciation while others had six or more. A certain level of multiple pronunciations can be helpful for capturing
common divergences in pronunciation
(e.g. potato/pot_to), as even the best
hand-made dictionaries contain 1.1 to
1.2 pronunciations per word [5]. Too
many pronunciations per word, however, could be misleading when the computer performs recognition, because a
poorly recognized string of phonemes
could cause a word mismatch that
makes a sentence meaningless.
Methodology
In order to reduce the number of
redundant definitions in the dictionary, two methods were investigated.
The first, called “skimming,” simply deletes low-frequency transcriptions. This method defined a
threshold based on the most commonly
occurring pronunciations. Anything below this threshold was removed from
the dictionary. The second method,
called “percentaging,” encodes the
frequency at which each pronunciation
was encountered into the definitions
themselves, which a speech recognizer
can use to modify its recognition network.
As mentioned earlier, the dictionary was created by examining all the
speech transcriptions during training
mode. The original system simply added every combination of a word and a phoneme sequence it encountered to the dictionary.
The new method does the same thing,
except that it keeps count of how many
times each combination occurs. These
counts can be used to perform either
skimming (removal of low-occurrence
pairs), or percentaging (placing percentage occurrence information with
the data). This paper examines the
effect of skimming alone, percentaging alone, and finally skimming followed by percentaging on SR system
accuracy.
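The Python sketch below shows this bookkeeping. It is only an illustrative sketch: the plain-text entry format mirrors the examples in Figure 3, and the interpretation of the 30% skimming threshold (relative to each word's most common pronunciation) is an assumption based on the description above, not necessarily the exact rule used by the study's tools.

from collections import Counter, defaultdict

def build_dictionary(word_pronunciation_pairs):
    """Count how often each (word, phoneme sequence) pair occurs in the
    word- and phoneme-level training transcriptions."""
    counts = defaultdict(Counter)
    for word, phonemes in word_pronunciation_pairs:
        counts[word][phonemes] += 1
    return counts

def skim(counts, threshold=0.30):
    """Skimming: keep only pronunciations whose count is at least `threshold`
    times the count of that word's most common pronunciation (assumed rule)."""
    skimmed = {}
    for word, prons in counts.items():
        top = max(prons.values())
        skimmed[word] = {p: c for p, c in prons.items() if c >= threshold * top}
    return skimmed

def percentage(counts):
    """Percentaging: attach to each pronunciation the percentage of times it
    was observed for that word, as in Figure 3."""
    entries = []
    for word, prons in counts.items():
        total = sum(prons.values())
        for p, c in prons.items():
            entries.append(f"{100.0 * c / total:.2f}% {word} {p}")
    return entries

# Example using the counts for "about" shown in Figure 3
pairs = ([("about", "ax b aw t")] * 17 + [("about", "ax b aw d")] * 4 +
         [("about", "ax b ah")] * 2 + [("about", "ax b ae t")] +
         [("about", "b ah")] + [("about", "b aw")])
counts = build_dictionary(pairs)
print(skim(counts))        # the rare pronunciations are removed
print(percentage(counts))  # e.g. "65.38% about ax b aw t"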
Results and Discussion
The two systems of dictionary
making (skimming and percentaging)
were tested independently and then
with the 2-step method (skimming
followed by percentaging). As shown
in Figure 4, skimming provided a much larger accuracy gain than did
percentaging. When the two were used
simultaneously, percentaging added only a slim increase in accuracy over skimming alone. An unexpected
bonus of using the skimming method
is a reduction in required recognition
time. Since there are fewer possible
recognition sequences, the computer
can process each speech fragment more
rapidly. At optimal skimming (30%),
the test took only two hours, rather than
the full three, which is a 33% reduction
in time. Furthermore, at this level of
skimming, there was an average of 1.6
pronunciations per word, as opposed to
2.0 pronunciations without skimming.
Thus, not only does skimming provide
more accurate speech recognition, it
also allows for faster recognition.
Conclusion
Since people speak in many different ways, it is expected that having a large dictionary, inclusive of all pronunciation and dialectal differences,
would be a valuable aid to recognition.
However, this paper demonstrates that
the most effective SR system dictionary may be one with a small number
of select alternate pronunciations per
word. The dictionary must be built so
that a balance is found between being
inclusive of pronunciation variants
and conforming to the limitations
of a computerized recognition system.
This study observed that while a typical SR system contains on average 2.0
pronunciations per word, maximum SR system accuracy is achieved at 1.6 pronunciations per word. Tailoring a dictionary to contain fewer pronunciations per word is best achieved
manually; however, given the size of a
typical SR system dictionary (13,000 entries), a computerized method of
deleting erroneous and infrequently
used pronunciations is more practical.
This study revealed that the method of
skimming both effectively increases
recognition accuracy and results in a
33% reduction in the time required for
speech recognition tasks.
An example of an unaltered dictionary

Original
1  about ax b ae t
2  about ax b ah
4  about ax b aw d
17 about ax b aw t
1  about b ah
1  about b aw

Clearly, “about ax b aw t” is the most commonly given pronunciation for the word about, and so it should be given the highest weight, whereas the definitions “b ah” and “b aw” should just be erased.

Examples of dictionaries after processing

After Skimming
4  about ax b aw d
17 about ax b aw t

Here we have done skimming, deleting the least frequently occurring pronunciations.

After Percentaging
3.85%  about ax b ae t
7.69%  about ax b ah
15.38% about ax b aw d
65.38% about ax b aw t
3.85%  about b ah
3.85%  about b aw

Here we have done percentaging, inserting the frequency of each pronunciation as a percentage.

Figure 3: The process of modifying the dictionary

Table 3. Simulation Test Results

Figure 4. Test Results
References
1. Mohajer, K., Hu, Z.-M., Pratt, V. Time Boundary Assisted Speech Recognition. International Journal of Information Fusion (Special Issue on Multi-Sensor Information Fusion for Speech Processing Applications).
2. Rabiner, Juang. Fundamentals of Speech Recognition, Chapter 6, Prentice Hall Signal Processing Series, New Jersey, 1993.
3. Young, Evermann, Kershaw, Moore, Odell, Ollason, Valtchev, Woodland. The HTK Book, Ch. 5, Cambridge University Engineering Department, 2001-2002.
4. Garofolo, Lamel, Fisher, Fiscus, Pallett, Dahlgren. TIMIT Printed Documentation, Ch. 5, U.S. Department of Commerce, 1993.
5. Hain, T., Woodland, P.C., Evermann, G., et al. New Features in the CU-HTK System for Transcription of Conversational Telephone Speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Vol. 1, pp. 57-60, 2001.
6. Culicover, Peter. Lecture: An Introduction to Language in the Humanities. Spring 2002. http://www.ling.ohio-state.edu/~culicove/H201/Language variation.pdf
Justy Burdick
Justy Burdick is a sophomore majoring in Electrical Engineering and is considering a minor in Computer
Science. He would like to thank Professor Vaughan Pratt and the URP office for sponsoring this Research
Experience for Undergraduates project. Furthermore, he would like to thank Keyvan Mohajer for the
guidance and help he received at every step of the research process. Finally, he would like to thank the
2002 and 2003 speech group REU team members: Simon Hu, Melissa Mansur, Ryan Bickerstaff, Kenneth Lee, Gu Pan Grace.