International Journal of Computer Science and Applications
© Technomathematics Research Foundation
Vol. 10, No. 1, pp. 31-45, 2013

Voice-driven Computer Game in Noisy Environments

ARTUR JANICKI
Institute of Telecommunications, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected]

DARIUSZ WAWER
Institute of Telecommunications, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected]

The paper describes the performance of a task-oriented continuous automatic speech recognition (ASR) system used in a computer game interface under noisy conditions. First, the process of designing the ASR system for Polish, based on CMU Sphinx4, is presented. Then, the concept of a computer game called Rally Navigator is described. The experiments were first run for clean speech, and then repeated with environmental noise added at various signal-to-noise ratios (SNR). Results of the experiments with clean speech show that as little as 15 minutes of audio material is enough to produce a highly effective single-speaker command-and-control ASR system for the computer game, providing a sentence recognition accuracy of 97.6%. Results of the tests under noisy conditions show that only minor degradation of performance was observed in the car environment, whereas accuracy decreased severely for babble and factory noises at SNRs below 20 dB.

Keywords: speech recognition; voice interface; computer game; noise.

1. Introduction

Automatic speech recognition (ASR) systems are becoming more and more popular as an input mechanism in various applications, especially in word processors, where dictation software is being introduced. Trials are also being conducted to replace pads and keyboards as controlling devices for computer games, thus making the games more interesting and enabling multimodal input. ASR systems can be successfully utilized in computer games, provided that they offer satisfactory recognition accuracy and short processing time. To ensure realistic conditions, an ASR system in a computer game should be able to recognize continuous speech, which is how people usually talk. Such speech is much more difficult to recognize than isolated words. Apart from a few cases, for example [Ziolko et al. (2008)] and [Szymanski et al. (2008)], continuous speech recognition systems barely exist for the Polish language, which is highly inflective and thus difficult to recognize. Recognizing Polish speech under noisy conditions poses an additional challenge. This paper describes experiments with a small-vocabulary, task-oriented continuous automatic speech recognition system for Polish, working in the voice interface of a computer game under clean and noisy conditions. The impact of noise on speech recognition performance is analyzed and discussed.

2. Voice-driven Computer Games

Nowadays computer games are a large and fast-developing industry, often generating high revenues. To be successful in the market, a game has to be easy to play, intuitive, fun and interesting, and overall player-friendly. While the first three requirements apply mainly to the game design, the last one also refers to the user interface, i.e., in this case, to the ASR input system. This imposes some demanding requirements. Firstly, the system must recognize a command in a short time; the maximum acceptable latency depends on how fast the action in a game is, but it should be below one second.
Secondly, the accuracy of recognition must be very high. For our work, we aimed at 90% sentence recognition accuracy. Should a player issue several misinterpreted commands in a row, the game would fail the friendliness requirement. An ideal game using speech recognition should work robustly for every speaker in most common acoustic conditions, including people with various accents, non-native speakers and even people with speech deficiencies. There are many factors that can have a negative impact on speech recognition robustness:

• a low-quality microphone;
• noisy environments;
• high processing power and memory requirements of an ASR system.

It is, however, advantageous that games usually require command-and-control (task-oriented) systems, which are easier to implement and tend to have better accuracy than dictation systems with a dictionary of similar size.

A computer game is a specific environment for a speech recognition system. It can be treated as a special type of dialog system, in which in some cases the game may actually predict what the user might want to say, knowing the game scenario. A slight on-the-fly modification of the language model may be made to reflect the scenario.

The usage of speech recognition systems in computer games is still largely unexplored. There have been a few attempts to utilize ASR systems, particularly in strategy games [Tom Clancy's Endwar (2009)] and flight simulators [Tom Clancy's H.A.W.X (2009)]. The ASR systems implemented in these games did work properly, but they have not been enthusiastically accepted, mostly because they were tedious to use.

3. Automatic Speech Recognition in Voice Interfaces

3.1. Continuous Speech Recognition

Early works on ASR systems, starting in the 1950s, concerned recognition of isolated phonemes or, at best, a few words. One example is the isolated digit recognizer constructed at Bell Labs in 1952 [Furui (2009)]. The invention of Dynamic Time Warping (DTW) in the late 1960s allowed for projects on larger-vocabulary word recognition and for processing of connected words [Rabiner (2008)]. In 1971 the ARPA SUR (Speech Understanding Research) program started, aiming at creating a reliable large-vocabulary ASR system for continuous speech. One of its results was HARPY, an ASR system developed at Carnegie Mellon University, working with a semantic accuracy of 95% at a processing speed of 80 times real time [Rabiner (2008)]. Studies on continuous speech recognition intensified in the 1980s, when statistical acoustic modeling and statistical language modeling advanced, and they have continued to this day. The key to successful continuous speech recognition is a combination of highly accurate acoustic modeling (AM), based usually on MFCC parameters and hidden Markov models (HMM), and precise yet flexible language modeling (LM), based most often on N-grams [Young (2008)] or grammars, such as the Java Speech Grammar Format (JSGF) [JSGF, Sun Microsystems (1998)].

3.2. Speech Recognition in Noisy Environments

The performance of speech recognition systems used in noisy environments is usually degraded; this phenomenon has been observed in many studies [Droppo and Acero (2008)]. Various methods of improving robustness against noise have been researched. One of them is the use of speech enhancement algorithms: the speech signal, before being submitted to an ASR system, undergoes a denoising process, e.g. by spectral subtraction [Yasui et al. (2009)] or Wiener filtering.
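To make the idea concrete, a basic power spectral subtraction rule (a textbook formulation, not necessarily the exact variant used in the cited work) estimates the clean-speech power spectrum from the noisy spectrum X(f) and a noise estimate obtained from non-speech frames:

```latex
% Oversubtraction factor \alpha \ge 1 and spectral floor 0 < \beta \ll 1
% are tuning constants, not values taken from this paper.
|\hat{S}(f)|^2 = \max\!\bigl(\,|X(f)|^2 - \alpha\,|\hat{N}(f)|^2,\;\; \beta\,|X(f)|^2 \bigr)
```

The enhanced magnitude is then combined with the phase of the noisy signal to resynthesize the waveform passed to the recognizer.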
Another way is to design new auditory models which are less sensitive to noise [Kim et al. (1999)]. Other researchers propose advanced feature processing, such as cepstral normalization techniques (e.g., cepstral mean normalization, CMN, or variable cepstral mean normalization, VCMN), see [Hansen et al. (2001)], or other techniques which try to estimate the cepstral parameters of undistorted speech given the cepstral parameters of the noisy speech [Deng et al. (2001)]; this is sometimes combined with multi-condition training, i.e., training acoustic models with speech distorted by different types of noise and signal-to-noise ratios (SNR). Using sparse-representation-based classification also improves robustness, though it requires a lot of processing power [Gemmeke et al. (2011)]. For some types of noise, the use of perceptual properties proved to improve the results of the ASR system [Haque et al. (2009)].

Many corpora with noisy speech exist for testing ASR systems, such as SPINE [Schmidt-Nielsen et al. (2000)]; unfortunately, they are dedicated mostly to English. However, researchers often use corpora of noise sounds, such as NOISEX-92 [Varga and Steeneken (1993)]; such noise signals can be added to clean speech data, thus emulating noisy conditions. Unfortunately, this method does not take into account phenomena such as the Lombard effect, i.e., speakers changing their articulation in noisy environments in order to be heard better through the noise.

3.3. The Sphinx Framework

The CMU Sphinx framework was created and has been continuously developed at Carnegie Mellon University (CMU). In our work we used the Sphinx4 recognizer and the SphinxTrain toolset for preparing training data and acoustic and language models. Sphinx4 is a multi-threaded CPU-based recognizer written in Java, which utilizes HMMs for speech recognition. For our research we chose 13-parameter MFCC acoustic models, an N-gram language model and a JSGF grammar. The Sphinx framework was originally designed for English, but nowadays it supports several languages, e.g. Spanish, French and Mandarin. However, no acoustic or language models are available for Polish. To use the package with Polish, it required some modifications to work properly, e.g. several internal classes needed to be adapted to correctly read names containing non-ASCII characters, which were found in the language models and dictionaries. An illustrative configuration sketch is given below.
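The sketch configures a recognizer through the high-level API of a recent Sphinx4 release (edu.cmu.sphinx.api). Note that this is an assumption-laden illustration: the 2013-era Sphinx4 used in the paper was configured through XML, and the Polish model paths shown here are hypothetical placeholders, since the authors' models are not publicly distributed.

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.FileInputStream;
import java.io.InputStream;

public class RallyRecognizerDemo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical paths standing in for the authors' Polish models:
        configuration.setAcousticModelPath("models/pl-acoustic");
        configuration.setDictionaryPath("models/pl.dict");
        configuration.setLanguageModelPath("models/rally.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream audio = new FileInputStream("command.wav")) {
            // 16 kHz mono PCM, matching the acoustic model's training data.
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println("Hypothesis: " + result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```

A JSGF grammar can be used instead of the N-gram model via setGrammarPath() and setUseGrammar(true) in the same Configuration object.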
4. Designing the Game Voice Interface

In this section we briefly characterize the Polish language, describe the concept of the voice-driven computer game Rally Navigator, originally proposed in [Janicki and Wawer (2011)], and present the consecutive steps of designing the speech recognition system for this game, as well as the procedure for testing it under noisy conditions.

4.1. The Polish Language

Spoken Polish contains 38 phonemes: 6 vowels and 32 consonants [Jassem (1973)]. Most of the phonemes (plosives, fricatives and affricates) exist in unvoiced-voiced pairs, e.g. [tS] - [dZ]. Polish grammar is complex. The language is highly inflective: nouns are inflected according to seven cases and two numbers, verbs are inflected according to gender, tense and number, and adjectives and numerals are inflected, too [Feldstein (2001)]. There are three genders in the singular and two genders in the plural. Owing to the high inflection, the word order in a Polish sentence is loose, as the function of a word (e.g. whether a noun is a subject or an object) is determined not by the position of the word within the sentence, but by the form of the word. For the same reason, subjects are often dropped.

These properties make continuous speech recognition for Polish quite a demanding challenge. The lack of articles makes detecting nouns difficult. High inflection sometimes causes a single word to have dozens of forms, which often sound similar. Loose word order increases the complexity of language models. Pronunciation variation, which happens to be advantageous in speech synthesis [Janicki et al. (2008)], is a disturbing factor in speech recognition. Fortunately, these problems have a smaller impact on system complexity in task-oriented, small-vocabulary recognition, which is usually the case for a computer game. Moreover, in such applications some speech recognition errors are negligible for the game's command interpreter. For example, changing the case of a numeral will result in decoding it as the same number; similarly, for the game considered here, some non-significant prepositions are ignored.

4.2. Concept of the Computer Game

We decided to create a game called Rally Navigator, in which the player competes in races not as a driver, but as a navigator. The player's task is to provide the driver with information about the route and track elements such as curves and straights. To make the game more difficult (and the ASR system more complex) we also decided to include speed control and gear switching. The aim of the game is to win the rally: the more precise and well-timed the information supplied by the player, the quicker the car reaches the finish line.

We wanted to make the ASR system an integral part of the game, not an additional or alternative way of controlling it. The ASR system in the game is activated manually, by pressing the space bar. This both decreases the chance of recognizing background noise as a command and simplifies the recognition process by determining command boundaries.

Fig. 1. A view of Rally Navigator's main window.

Fig. 1 shows the game prototype's main window. In the upper left corner there are three lines of text. The first line shows the car's current position and speed, the second line shows the sentence recognized by our ASR system, while the third line shows the command that was issued based on the recognition. In this case the recognized sentence was "Accelerate to 50" and the issued command was "Set speed 50". A minimal sketch of such a command interpreter is shown below.
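The paper does not publish the interpreter itself; the fragment below is a hedged Java sketch of the kind of mapping described above, with hypothetical class and command names, and English phrasings standing in for the actual Polish commands.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical command interpreter: maps an ASR hypothesis such as
// "Accelerate to 50" to a game command such as "SET_SPEED 50".
public final class CommandInterpreter {
    private static final Pattern SPEED = Pattern.compile("accelerate to (\\d+)");
    private static final Pattern GEAR  = Pattern.compile("(?:shift to )?gear (\\d)");

    public static Optional<String> interpret(String hypothesis) {
        String h = hypothesis.toLowerCase(Locale.ROOT).trim();
        Matcher m = SPEED.matcher(h);
        if (m.matches()) {
            return Optional.of("SET_SPEED " + m.group(1));
        }
        m = GEAR.matcher(h);
        if (m.matches()) {
            return Optional.of("SET_GEAR " + m.group(1));
        }
        return Optional.empty(); // unrecognized: ignore, per the friendliness goal
    }

    public static void main(String[] args) {
        System.out.println(interpret("Accelerate to 50")); // Optional[SET_SPEED 50]
    }
}
```

In such a design, recognition errors that map to the same command (e.g. a dropped preposition) are harmless, which is why semantic accuracy can exceed sentence accuracy, as discussed in Sec. 5.1.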
4.3. Developing the Acoustic Model

The following elements were required to train a new acoustic model:

• audio data with recorded speech;
• a transcription of each audio file;
• a dictionary with phonetic representations of all words appearing in the transcriptions;
• a list of phonemes (and other sounds) appearing in the transcriptions.

For a simple command-and-control single-speaker system (the one we began with) the amount of training audio data can be fairly low. For multi-speaker systems the amount of required audio increases, and it increases even further for dictation purposes. The training audio signal was recorded in a home environment, using 16 kHz sampling. The speaker read the following:

• 114 CORPORA [Grocholewski (1997)] sentences, which were phonetically balanced, i.e., contained all Polish phonemes and diphones;
• the same set of sentences, but spoken faster, in order to prepare the system for quickly spoken commands, for example for a sequence of tight curves; in total, the CORPORA sentences formed ca. 11 minutes of our training data;
• sample commands to be used in the game, together with numbers, which must be correctly recognized for the game to work (they are needed for the curve angles, gears and distances).

In total, we prepared 25 minutes of audio for a single-speaker acoustic model. The audio files were then transcribed, including special silence marks for silence appearing in the files. The transcription allowed the training algorithm to align the speech better and, in turn, produce better acoustic models. The transcription also covered non-speech sounds, such as mouse clicks or breathing. It is important to include such sounds in the transcriptions, so that they are not confused with phonemes during training. If the non-speech sounds are properly labeled in the training data, the ASR system may successfully recognize and omit them.

The phonetic dictionary was prepared in such a way that it contained all expected words with possible variants of their pronunciation. Careful preparation of the phonetic dictionary turned out to be crucial for the robustness of the speech recognition.

We also generated multi-speaker acoustic models based on the CORPORA speech database, using audio data originating from 45 speakers: 28 adult males, 3 young males, 11 adult females and 3 young females. In total, the multi-speaker model was trained on ca. 180 minutes of speech. To train our acoustic models we used applications and scripts from SphinxTrain, which is a part of CMU Sphinx.

4.4. Training the Language Model

We decided to use a tool supplied with CMU Sphinx called the Sphinx Knowledge Base Tool [Rudnicky (2010)], which generates a language model from a list of sentences. The tool calculates all N-grams appearing in the training data and then converts the results to an N-gram language model format compatible with Sphinx. The quality of the model therefore depends on how well the training sentences reflect the commands which will be issued in the system. In our work we decided to split our commands into parts at least three words long, each part with at most two parameters, and we generated all possible variations of each such fragment. Some of these fragments overlapped, so that all possible bigrams and trigrams were included. This approach also allowed us to repeat the less-parametrized fragments, since the number of repetitions did not have to be large. As a result, our training file was relatively small, simple and quick to generate. After running the Sphinx Knowledge Base Tool on this file, the resulting model required some fine-tuning. Most importantly, all impossible silence-starting and silence-ending N-grams were removed. In addition, we decided to include what we call negative N-grams, i.e. artificially created N-grams with near-zero probability. Their purpose is to disallow word sequences which we consider invalid in our system, but which in preliminary tests were sometimes incorrectly recognized.

Our JSGF grammar was created manually. First we defined groups of words describing parameters in our commands, such as distances, directions and angles. Then we defined constant fragments of commands using these groups, some of them in a few variations, each correct from the point of view of the Polish language. The next step was grouping these fragments into commands, and some commands into sequences of commands. It should be noted that we tried to relax the strict commands by making some words optional and by providing alternatives for some of the words. After the grammar was complete we ran the tests, and then repeated the whole process for test sentences which did not match any of the grammar rules. A short grammar sketch in this spirit is given below.
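For illustration, a fragment of a grammar built along these lines could look as follows. This is a hypothetical sketch in JSGF notation: the rule names and English words are stand-ins, since the actual grammar was written for Polish and has not been published.

```jsgf
#JSGF V1.0;
grammar rally;

// Hypothetical parameter groups (the real grammar used Polish words).
<direction> = left | right;
<angle>     = thirty | sixty | ninety;
<distance>  = fifty | one hundred | two hundred;

// A curve announcement; "in" and "meters" are optional filler words.
<curve> = curve <direction> [<angle>] [in] <distance> [meters];

// Speed control with an optional numeral parameter group reused from above.
<speed> = accelerate to <distance>;

// The public rule accepts one command or a short sequence of two commands.
public <command> = (<curve> | <speed>) [<curve> | <speed>];
```

A grammar like this restricts the recognizer's search space to the sentences its rules can generate, which is why relaxing the rules with optional words and alternatives mattered for longer utterances.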
Fig. 2. Spectrograms of the noise signals used for testing: (a) babble, (b) factory, (c) Volvo.

4.5. Testing Speech Recognition in Noisy Environments

The ASR system was tested using a set of 85 recordings, containing commands which are likely to occur in the Rally Navigator game. The set contained phrases 2-16 words long; the total number of tested words was 422. Two male speakers were recorded in a quiet environment: speaker A, whose voice was used in the acoustic model training, and speaker B, used only for testing. The total duration of the test recordings was 489 s. The game's ASR system was first tested with the original test recordings, i.e., in a quiet environment, and then it was challenged with noisy conditions.

We decided to conduct the experiments using three types of environmental noise, taken from the NOISEX-92 corpus [Varga and Steeneken (1993)]:

• babble: the sound of 100 people speaking in a canteen, with individual voices slightly audible; this noise can imitate background voices in the room where the computer game is played;
• factory: a noise recorded in a factory, near plate-cutting and electrical welding equipment; this sound can imitate a highly disturbing noisy environment;
• Volvo: car interior noise; the Volvo 340 noise was recorded at a speed of 120 km/h, in 4th gear, on an asphalt road in rainy conditions; this is the case of the computer game being played on a mobile device in a car.

Figure 2 presents spectrograms of the noises used. One can observe that the car noise ("Volvo") is a low-band stationary noise, whilst the babble and factory noises have much wider bandwidths and show frequent changes in time. The factory noise contains sudden sharp sounds, resembling hitting or cutting.

The noise samples were added to the recordings of the test commands in various proportions, controlling the SNR values, as sketched in the example below. In this way several sets of recordings were obtained, with SNRs ranging from ca. 5 dB to 40 dB, which were submitted to the ASR system of the game.
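A minimal sketch of this standard mixing step, assuming the speech and noise signals are already decoded into double arrays at the same sampling rate (file I/O omitted), could look as follows:

```java
// Scales a noise signal and adds it to clean speech so that the mixture
// has the requested SNR (in dB), computed from the average powers of the
// two signals. A minimal sketch; real experiments would also guard
// against clipping and handle empty inputs.
public final class NoiseMixer {

    static double power(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v * v;
        return sum / x.length;
    }

    public static double[] mixAtSnr(double[] speech, double[] noise, double snrDb) {
        // Gain chosen so that speechPower / (gain^2 * noisePower) = 10^(snrDb/10).
        double gain = Math.sqrt(power(speech)
                / (power(noise) * Math.pow(10.0, snrDb / 10.0)));
        double[] mix = new double[speech.length];
        for (int i = 0; i < speech.length; i++) {
            mix[i] = speech[i] + gain * noise[i % noise.length]; // loop noise if shorter
        }
        return mix;
    }
}
```

Calling mixAtSnr(speech, noise, 20.0) yields the 20 dB condition; sweeping the target value from ca. 5 dB to 40 dB reproduces the test grid described above.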
5. Results

This section presents the most significant results of the experiments carried out. Following other research on ASR system testing, the following metrics were evaluated (N below denotes the total number of words in the reference transcriptions):

• substitutions (Sub): the number of words which were recognized as other words;
• insertions (Ins): the number of words which were wrongly added to the recognized words;
• deletions (Del): the number of words which were omitted in the recognition;
• word error rate (WER): the ratio of word recognition errors to the total number of words, i.e. WER = (Sub + Ins + Del) / N (a computation sketch is given at the end of Sec. 5.1);
• word accuracy (WA): the ratio of correctly recognized words to the total number of words; it is strongly related to WER;
• sentence accuracy (SA): the ratio of correctly recognized sentences to the total number of sentences.

5.1. Tests for Clean Speech

Fig. 3. Word accuracy and sentence accuracy for recognition using various language models for clean speech.

Figure 3 shows the recognition performance for various language models for clean speech, using the single-speaker acoustic model for speaker A. The JSGF grammar yielded the worst results (WA = 79.15%, SA = 61.2% for speaker A). More detailed analysis of the recognition logs showed that it worked very well for short sentences. However, it had trouble correctly interpreting long word sequences, especially if they consisted of more than one short command. In such cases the recognizer would completely ignore the data after the first command, in spite of the grammar definition, which allowed compound commands. After switching to the N-gram language models, the recognition improved. Even unigram models achieved better accuracy than the JSGF grammar, and after adding information about the probabilities of bigrams and trigrams, the results improved significantly, yielding a word accuracy of 98.58% and a sentence accuracy of 92.9%. The use of negative trigrams turned out to be a successful move, giving for speaker A the final result of WA = 99.29% and SA = 96.5%.

Fig. 4. Word error rate against the duration of the audio signal used in the training process.

Figure 4 displays the recognition performance for clean speech when the single-speaker acoustic model was trained with different amounts of audio data. The word error rate for speaker A after training the acoustic model with 114 CORPORA sentences (ca. 7 minutes of recording) was almost 4%, which is considered high, taking into account that this is small-vocabulary, task-oriented recognition. After adding 4 more minutes, containing the same sentences but uttered faster, WER fell slightly below 3% and the sentence accuracy increased from 88.2% to 90.6%. When recordings containing numerals and control commands were added, the performance continued to improve until the training set contained 15 minutes of recordings, where WER equaled 0.7%. Sentence accuracy at this point reached 97.6%, which was considered a satisfactory result. Further enlargement of the audio data temporarily worsened WER, up to 2.1%, due to a slight increase in substitutions and deletions. Training using 25 minutes of recordings again yielded a low WER of 0.9%.

It is noteworthy that not every word deletion, substitution or insertion resulted in a wrong command. E.g., omitting w (here meaning 'to') in the sentence skręć w prawo ('turn to the right') changed the command into 'turn right', which is actually the same. So the semantic accuracy was actually higher than the sentence accuracy.

Table 1. Recognition results for speakers A and B with various numbers of sentences used for acoustic model adaptation. No noise added.

speaker / adaptation   WER     WA       SA      Sub   Ins   Del
A                      0.9%    99.29%   96.5%     2     1     1
B, no adapt.           8.3%    92.91%   72.9%    20     5    10
B, 10 sentences        9.69%   90.78%   69.4%    28     2    11
B, 20 sentences        6.62%   94.09%   77.6%    18     3     7
B, 30 sentences        5.44%   95.27%   81.2%    13     3     7
B, 40 sentences        4.49%   95.98%   82.4%     9     2     8
B, 60 sentences        3.55%   96.93%   84.7%     6     2     7
B, 80 sentences        1.42%   98.58%   92.9%     3     0     3

Table 1 gives information about the recognition performance for both speaker A and speaker B. When the system was challenged with the voice of speaker B, who was not used to create the single-speaker acoustic model, the ASR system was able to recognize 72.9% of the sentences correctly, at a WER of 8.3%. After adapting the models with 10 CORPORA sentences of speaker B, WER even increased slightly, but adaptation using 20 sentences decreased WER to 6.62%. Further enlargement of the adaptation session steadily improved the recognition performance, but 80 sentences were required to bring WER down to 1.42%, with a sentence accuracy of almost 93%.
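For reference, the error counts and WER values reported here are the standard by-products of a minimum-edit-distance alignment between each reference transcription and the recognizer's hypothesis. A compact sketch of the computation (not the authors' evaluation code) is:

```java
// Computes WER as the word-level Levenshtein distance between a reference
// and a hypothesis, divided by the reference length. Substitutions,
// insertions and deletions all cost 1, as in the standard definition.
public final class WordErrorRate {

    public static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        System.out.println(wer("turn to the right", "turn right")); // 0.5 (two deletions)
    }
}
```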
Fig. 5. Number of substitutions, insertions and deletions against the SNR value of the tested speech, for (a) speaker A and (b) speaker B.

5.2. Tests for Noisy Environments

All tests in noisy environments were performed using the multi-speaker acoustic model. Neither speaker's voice was used to generate or adapt the model.

Figure 5 shows the number of substitutions, insertions and deletions for babble-noised data against the SNR value. As the SNR decreases, the number of substitutions increases rapidly, while the number of deletions increases at a slower rate. As the SNR decreases further, the number of substitutions stops increasing and deletions start to occur more frequently. There is only a small range of SNRs where insertions occur, though it is different for each speaker. Interestingly, insertions did not occur in larger numbers in either factory- or Volvo-noised data.

Figure 6 shows the word error rate against the SNR for the three types of noise. As the SNR decreased, WER remained low up to a certain point and then started to rise. For the babble noise, the rate at which WER increased was almost linear. For the factory noise, the WER increase started earlier, but up to a certain SNR value it was slower; at SNRs between 15 dB and 20 dB WER started to increase rapidly, and then, at about 10 dB, the increase slowed down again. Audio data distorted with factory sounds tended to have the highest WER, though for speaker B there was a range of SNRs where the babble noise yielded the highest WER. Over the tested SNR range the Volvo noise had little or no influence on WER: down to a certain SNR value it did not increase WER at all, and as the SNR decreased further, WER started to grow gradually, but at a rate significantly slower than for both the babble and the factory noise.

Figure 7 shows the sentence accuracy against the SNR. At an SNR of 5 dB, none of the test sentences distorted with the babble or factory noise could be recognized, while sentences distorted with the Volvo noise could still be recognized with about 80% accuracy.

Fig. 6. Word error rate against the SNR for various types of noise, for (a) speaker A and (b) speaker B.

Fig. 7. Sentence accuracy against the SNR for various types of noise, for (a) speaker A and (b) speaker B.

Fig. 8. WER comparison for the two speakers in the presence of (a) babble and (b) factory noise.

6. Conclusion

This paper described a task-oriented continuous speech recognition system for Polish, working in a computer game called Rally Navigator under clean and noisy conditions. The process of designing and implementing the system, based on CMU Sphinx4, was described. We presented the steps undertaken to create the acoustic model and the language model, using both a grammar and a statistical N-gram model. As for the language model, we showed that the best results were achieved with a statistical trigram model. We improved it further by adding negative trigrams, which decreased the number of misrecognized words.
Initial experiments with clean speech showed that audio material as short as 15 minutes is enough to produce a highly effective single-speaker command-and-control ASR system, providing a sentence recognition accuracy of 97.6%. As expected, such a model required adaptation for another speaker. Twenty sentences from the new speaker enabled partial adaptation of the ASR system, so that it reached a word accuracy of 94.09%, but better results (WER below 4%) were obtained when the model was adapted with 60 or 80 sentences. Obviously, using such long audio material for the adaptation of each new user would be impractical, so the acoustic model needs to be improved.

Experiments with speech recognition in noisy conditions showed that the recognition accuracy was hardly affected when the SNR exceeded 25 dB. Only minor degradation of performance was observed when recognition was tested in the car-interior environment (the Volvo noise), even at an SNR as low as 5 dB. However, major degradation of accuracy was observed when the speech signal was distorted with the babble and factory noises and the SNR decreased below 20 dB. A game using a voice interface would need to be extended with, e.g., a noise suppression system if it is planned to be used in environments with loud background speech or factory-like noises.

References

B. Ziolko, S. Manandhar, R. C. Wilson, M. Ziolko, J. Galka, Application of HTK to the Polish Language, in Proc. IEEE International Conference on Audio, Language and Image Processing (ICALIP 2008), Shanghai, 2008, pp. 1759-1764.
M. Szymanski, J. Ogorkiewicz, M. Lange, K. Klessa, S. Grocholewski, G. Demenko, First evaluation of Polish LVCSR acoustic models obtained from the JURISDIC database, Speech and Language Technology, vol. 11, 2008.
S. Furui, Selected topics from 40 years of research in speech and speaker recognition, in Proc. Interspeech 2009, Brighton, UK, 2009.
L. Rabiner and B.-H. Juang, Historical Perspective of the Field of ASR/NLU, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
S. Young, HMMs and Related Speech Recognition Technologies, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
Sun Microsystems, Java Speech Grammar Format, version 1, 1998, http://java.sun.com/products/javamedia/speech/forDevelopers/JSGF/
Ubisoft Shanghai, Tom Clancy's Endwar, 2009, http://endwargame.us.ubi.com/
Ubisoft Romania, Tom Clancy's H.A.W.X, 2009, http://www.hawxgame.com/
A. Janicki and D. Wawer, Automatic Speech Recognition for Polish in a Computer Game Interface, in Proc. of the Federated Conference on Computer Science and Information Systems (FedCSIS 2011), Szczecin, 2011, pp. 711-716.
W. Jassem, Podstawy fonetyki akustycznej (Foundations of Acoustic Phonetics, in Polish), Warsaw, Poland: PWN, 1973.
R. F. Feldstein, A Concise Polish Grammar, Duke University, Durham, USA: Slavic and Eurasian Language Resource Center, 2001.
A. Janicki, P. Meus, M. Topczewski, Taking Advantage of Pronunciation Variation in Unit Selection Speech Synthesis for Polish, in Proc. 3rd International Symposium on Communications, Control and Signal Processing (ISCCSP 2008), St. Julians, Malta, 2008.
S. Grocholewski, CORPORA - Speech Database for Polish Diphones, in Proc. 5th European Conference on Speech Communication and Technology (Eurospeech '97), Rhodes, Greece, 1997.
A. Rudnicky, Sphinx Knowledge Base Tool, 2010, http://www.speech.cs.cmu.edu/tools/lmtool.html
J. Droppo and A. Acero, Environmental Robustness, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
H. Yasui, K. Shinoda, S. Furui, and K. Iwano, Noise robust speech recognition using spectral subtraction and F0 information extracted by Hough transform, in Proc. APSIPA Annual Summit and Conference 2009, TP-P1-4, pp. 631-634, 2009.
D. S. Kim, S. Y. Lee, R. M. Kil, Auditory Processing of Speech Signals for Robust Speech Recognition in Real-World Noisy Environments, IEEE Transactions on Speech and Audio Processing, 7, pp. 55-69, 1999.
J. H. L. Hansen, R. Sarikaya, U. Yapanel, B. Pellom, Robust Speech Recognition in Noise: An Evaluation using the SPINE Corpus, in Proc. Eurospeech 2001, Aalborg, Denmark, 2001.
L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, High-Performance Robust Speech Recognition Using Stereo Training Data, in Proc. ICASSP 2001, Salt Lake City, Utah, 2001.
J. F. Gemmeke, T. Virtanen, A. Hurmalainen and Y. Sun, Exemplar-based sparse representations for noise robust automatic speech recognition, IEEE Transactions on Audio, Speech and Language Processing, 19 (7), pp. 2067-2080, 2011.
S. Haque, R. Togneri and A. Zaknich, Perceptual features for automatic speech recognition in noisy environments, Speech Communication, 51 (1), pp. 58-75, 2009.
A. Schmidt-Nielsen, et al., Speech in Noisy Environments (SPINE) Training Audio, Linguistic Data Consortium, Philadelphia, 2000.
A. Varga, H. J. M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems, Speech Communication, 12 (3), pp. 247-251, 1993.