International Journal of Computer Science and Applications
© Technomathematics Research Foundation
Vol. 10, No. 1, pp. 31-45, 2013

Voice-driven Computer Game in Noisy Environments

ARTUR JANICKI
Institute of Telecommunications, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected]

DARIUSZ WAWER
Institute of Telecommunications, Warsaw University of Technology
ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
[email protected]

The paper describes the performance of a task-oriented continuous automatic speech recognition (ASR) system used in a computer game interface under noisy conditions. First, the process of designing the ASR system for Polish, based on CMU Sphinx4, is presented. Then, the concept of a computer game called Rally Navigator is described. The experiments were first run for clean speech, and then repeated with environmental noise added at various signal-to-noise ratios (SNR). Results of the experiments with clean speech show that as little as 15 minutes of audio material is enough to produce a highly effective single-speaker command-and-control ASR system for the computer game, providing a sentence recognition accuracy of 97.6%. Results of the tests under noisy conditions show that only minor degradation of performance was observed in the car environment, whereas accuracy decreased severely for babble and factory noises at SNRs below 20 dB.

Keywords: speech recognition; voice interface; computer game; noise.

1. Introduction

Automatic speech recognition (ASR) systems are becoming more and more popular as an input mechanism in various applications, especially in word processors, where dictation software is being introduced. Trials are also being conducted to replace pads and keyboards as controlling devices for computer games, thus making the games more interesting and enabling multimodal input. ASR systems can be successfully utilized in computer games, provided that they offer satisfactory recognition accuracy and short processing time. To ensure realistic conditions, an ASR system in a computer game should be able to recognize continuous speech, which is how people usually talk. Such speech is much more difficult to recognize than isolated words. Apart from a few cases, for example [Ziolko et al. (2008)] and [Szymanski et al. (2008)], continuous speech recognition systems barely exist for the Polish language, which is highly inflective and thus difficult to recognize. Recognizing Polish speech under noisy conditions poses an additional challenge. This paper describes experiments with a small-vocabulary, task-oriented continuous automatic speech recognition system for Polish, working in the voice interface of a computer game under clean and noisy conditions. The impact of noise on speech recognition performance is analyzed and discussed.

2. Voice-driven Computer Games

Nowadays computer games are a large and fast-developing industry, often generating high revenues. To be successful in the market, a game has to be easy to play, intuitive, fun and interesting, and overall player-friendly. While the first three requirements apply mainly to the game design, the last one also refers to the user interface, i.e., in this case, to the ASR input system. This imposes some demanding requirements. Firstly, the system must recognize a command in a short time; the maximum acceptable latency depends on how fast the action in a game is, but it should be below one second.
Secondly, the accuracy of recognition must be very high. For our work, we aimed at 90% sentence recognition accuracy. Should a player issue several misinterpreted commands in a row, the game would fail the friendliness requirement. An ideal game using speech recognition should work robustly for every speaker in most common acoustic conditions, including people with various accents, non-native speakers and even people with speech deficiencies. There are many factors that can have a negative impact on speech recognition robustness:

• a low-quality microphone;
• noisy environments;
• high processing power and memory requirements of an ASR system.

It is, however, advantageous that games usually require command-and-control (task-oriented) systems, which are easier to implement and tend to have better accuracy than dictation systems with a dictionary of similar size.

A computer game is a specific environment for a speech recognition system. It can be treated as a special type of dialog system, in which in some cases the game may actually predict what the user might want to say, knowing the game scenario. A slight on-the-fly modification of the language model may be made to reflect the scenario.

The usage of speech recognition systems in computer games is still largely unexplored. There have been a few attempts to utilize ASR systems, particularly in strategy games [Tom Clancy's Endwar (2009)] and flight simulators [Tom Clancy's H.A.W.X (2009)]. The ASR systems implemented in these games did work properly, but they have not been enthusiastically accepted, mostly because they were tedious to use.

3. Automatic Speech Recognition in Voice Interfaces

3.1. Continuous Speech Recognition

Early works on ASR systems, starting in the 1950s, concerned recognition of isolated phonemes or, at best, a few words. One example is the isolated digit recognizer constructed at Bell Labs in 1952 [Furui (2009)]. The invention of Dynamic Time Warping (DTW) in the late 1960s allowed for projects on larger-vocabulary word recognition and for processing of connected words [Rabiner (2008)]. In 1971 the ARPA SUR (Speech Understanding Research) program started, aiming at creating a reliable large-vocabulary ASR system for continuous speech. One of its results was HARPY, an ASR system developed at Carnegie Mellon University, working with a semantic accuracy of 95% at a processing speed of 80 times real time [Rabiner (2008)]. Studies on continuous speech recognition intensified in the 1980s, when statistical acoustic modeling and statistical language modeling advanced, and they have continued to this day. The key to successful continuous speech recognition is a combination of highly accurate acoustic modeling (AM), based usually on MFCC parameters and hidden Markov models (HMM), and precise yet flexible language modeling (LM), based most often on N-grams [Young (2008)] or grammars, such as the Java Speech Grammar Format (JSGF) [JSGF, Sun Microsystems (1998)].

3.2. Speech Recognition in Noisy Environments

The performance of speech recognition systems used in noisy environments is usually degraded; this phenomenon has been observed in many studies [Droppo and Acero (2008)]. Various methods of improving robustness against noise have been researched. One of them is the use of speech enhancement algorithms: the speech signal, before being submitted to an ASR system, undergoes a denoising process, e.g. by spectral subtraction [Yasui et al. (2009)] or Wiener filtering.
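To make the idea concrete, a basic power spectral subtraction rule (a textbook formulation, not necessarily the exact variant used in the cited work) estimates the clean-speech power spectrum from the noisy spectrum X(f) and a noise estimate obtained from non-speech frames:

```latex
% Oversubtraction factor \alpha \ge 1 and spectral floor 0 < \beta \ll 1
% are tuning constants, not values taken from this paper.
|\hat{S}(f)|^2 = \max\!\bigl(\,|X(f)|^2 - \alpha\,|\hat{N}(f)|^2,\;\; \beta\,|X(f)|^2 \bigr)
```

The enhanced magnitude is then combined with the phase of the noisy signal to resynthesize the waveform passed to the recognizer.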
Another way is to design new auditory models which are less sensitive to noise [Kim et al. (1999)]. Other researchers propose advanced feature processing, such as cepstral normalization techniques (e.g., cepstral mean normalization, CMN, or variable cepstral mean normalization, VCMN), see [Hansen et al. (2001)], or other techniques which try to estimate the cepstral parameters of undistorted speech given the cepstral parameters of the noisy speech [Deng et al. (2001)]; this is sometimes combined with multi-condition training, i.e., training acoustic models with speech distorted by different types of noise and signal-to-noise ratios (SNR). Using sparse-representation-based classification also improves robustness, though it requires a lot of processing power [Gemmeke et al. (2011)]. For some types of noise, the use of perceptual properties proved to improve the results of the ASR system [Haque et al. (2009)].

Many corpora with noisy speech exist for testing ASR systems, such as SPINE [Schmidt-Nielsen et al. (2000)]; unfortunately, they are dedicated mostly to English. However, researchers often use corpora of noise sounds, such as NOISEX-92 [Varga and Steeneken (1993)]; such noise signals can be added to clean speech data, thus emulating noisy conditions. Unfortunately, this method does not take into account phenomena such as the Lombard effect, i.e., speakers changing their articulation in noisy environments in order to be heard better through the noise.

3.3. The Sphinx Framework

The CMU Sphinx framework was created and has been continuously developed at Carnegie Mellon University (CMU). In our work we used the Sphinx4 recognizer and the SphinxTrain toolset for preparing training data and acoustic and language models. Sphinx4 is a multi-threaded CPU-based recognizer written in Java, which utilizes HMMs for speech recognition. For our research we chose 13-parameter MFCC acoustic models, an N-gram language model and a JSGF grammar. The Sphinx framework was originally designed for English, but nowadays it supports several languages, e.g. Spanish, French and Mandarin. However, no acoustic or language models are available for Polish. To use the package with Polish, it required some modifications to work properly, e.g. several internal classes needed to be adapted to correctly read names containing non-ASCII characters, which were found in the language models and dictionaries. An illustrative configuration sketch is given below.
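The sketch configures a recognizer through the high-level API of a recent Sphinx4 release (edu.cmu.sphinx.api). Note that this is an assumption-laden illustration: the 2013-era Sphinx4 used in the paper was configured through XML, and the Polish model paths shown here are hypothetical placeholders, since the authors' models are not publicly distributed.

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.FileInputStream;
import java.io.InputStream;

public class RallyRecognizerDemo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical paths standing in for the authors' Polish models:
        configuration.setAcousticModelPath("models/pl-acoustic");
        configuration.setDictionaryPath("models/pl.dict");
        configuration.setLanguageModelPath("models/rally.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream audio = new FileInputStream("command.wav")) {
            // 16 kHz mono PCM, matching the acoustic model's training data.
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println("Hypothesis: " + result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}
```

A JSGF grammar can be used instead of the N-gram model via setGrammarPath() and setUseGrammar(true) in the same Configuration object.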
4. Designing the Game Voice Interface

In this section we briefly characterize the Polish language, describe the concept of the voice-driven computer game Rally Navigator, originally proposed in [Janicki and Wawer (2011)], and present the consecutive steps of designing the speech recognition system for this game, as well as the procedure for testing it under noisy conditions.

4.1. The Polish Language

Spoken Polish contains 38 phonemes: 6 vowels and 32 consonants [Jassem (1973)]. Most of the phonemes (plosives, fricatives and affricates) exist in unvoiced-voiced pairs, e.g. [tS] - [dZ]. Polish grammar is complex. The language is highly inflective: nouns are inflected according to seven cases and two numbers, verbs are inflected according to gender, tense and number, and adjectives and numerals are inflected, too [Feldstein (2001)]. There are three genders in the singular and two genders in the plural. Owing to the high inflection, the word order in a Polish sentence is loose, as the function of a word (e.g. whether a noun is a subject or an object) is determined not by the position of the word within the sentence, but by the form of the word. For the same reason, subjects are often dropped.

These properties make continuous speech recognition for Polish quite a demanding challenge. The lack of articles makes detecting nouns difficult. High inflection sometimes causes a single word to have dozens of forms, which often sound similar. Loose word order increases the complexity of language models. Pronunciation variation, which happens to be advantageous in speech synthesis [Janicki et al. (2008)], is a disturbing factor in speech recognition. Fortunately, these problems have a smaller impact on system complexity in task-oriented, small-vocabulary recognition, which is usually the case for a computer game. Moreover, in such applications some speech recognition errors are negligible for the game's command interpreter. For example, changing the case of a numeral will result in decoding it as the same number; similarly, for the game considered here, some non-significant prepositions are ignored.

4.2. Concept of the Computer Game

We decided to create a game called Rally Navigator, in which the player competes in races not as a driver, but as a navigator. The player's task is to provide the driver with information about the route and track elements such as curves and straights. To make the game more difficult (and the ASR system more complex) we also decided to include speed control and gear switching. The aim of the game is to win the rally: the more precise and well-timed the information supplied by the player, the quicker the car reaches the finish line.

We wanted to make the ASR system an integral part of the game, not an additional or alternative way of controlling it. The ASR system in the game is activated manually, by pressing the space bar. This both decreases the chance of recognizing background noise as a command and simplifies the recognition process by determining command boundaries.

Fig. 1. A view of Rally Navigator's main window.

Fig. 1 shows the game prototype's main window. In the upper left corner there are three lines of text. The first line shows the car's current position and speed, the second line shows the sentence recognized by our ASR system, while the third line shows the command that was issued based on the recognition. In this case the recognized sentence was "Accelerate to 50" and the issued command was "Set speed 50". A minimal sketch of such a command interpreter is shown below.
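The paper does not publish the interpreter itself; the fragment below is a hedged Java sketch of the kind of mapping described above, with hypothetical class and command names, and English phrasings standing in for the actual Polish commands.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical command interpreter: maps an ASR hypothesis such as
// "Accelerate to 50" to a game command such as "SET_SPEED 50".
public final class CommandInterpreter {
    private static final Pattern SPEED = Pattern.compile("accelerate to (\\d+)");
    private static final Pattern GEAR  = Pattern.compile("(?:shift to )?gear (\\d)");

    public static Optional<String> interpret(String hypothesis) {
        String h = hypothesis.toLowerCase(Locale.ROOT).trim();
        Matcher m = SPEED.matcher(h);
        if (m.matches()) {
            return Optional.of("SET_SPEED " + m.group(1));
        }
        m = GEAR.matcher(h);
        if (m.matches()) {
            return Optional.of("SET_GEAR " + m.group(1));
        }
        return Optional.empty(); // unrecognized: ignore, per the friendliness goal
    }

    public static void main(String[] args) {
        System.out.println(interpret("Accelerate to 50")); // Optional[SET_SPEED 50]
    }
}
```

In such a design, recognition errors that map to the same command (e.g. a dropped preposition) are harmless, which is why semantic accuracy can exceed sentence accuracy, as discussed in Sec. 5.1.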
4.3. Developing the Acoustic Model

The following elements were required to train a new acoustic model:

• audio data with recorded speech;
• a transcription of each audio file;
• a dictionary with phonetic representations of all words appearing in the transcriptions;
• a list of phonemes (and other sounds) appearing in the transcriptions.

For a simple command-and-control single-speaker system (the one we began with) the amount of training audio data can be fairly low. For multi-speaker systems the amount of required audio increases, and it increases even further for dictation purposes. The training audio signal was recorded in a home environment, using 16 kHz sampling. The speaker read the following:

• 114 CORPORA [Grocholewski (1997)] sentences, which were phonetically balanced, i.e., contained all Polish phonemes and diphones;
• the same set of sentences, but spoken faster, in order to prepare the system for quickly spoken commands, for example for a sequence of tight curves; in total, the CORPORA sentences formed ca. 11 minutes of our training data;
• sample commands to be used in the game, together with numbers, which must be correctly recognized for the game to work (they are needed for the curve angles, gears and distances).

In total, we prepared 25 minutes of audio for a single-speaker acoustic model. The audio files were then transcribed, including special silence marks for silence appearing in the files. The transcription allowed the training algorithm to align the speech better and, in turn, produce better acoustic models. The transcription also covered non-speech sounds, such as mouse clicks or breathing. It is important to include such sounds in the transcriptions, so that they are not confused with phonemes during training. If the non-speech sounds are properly labeled in the training data, the ASR system may successfully recognize and omit them.

The phonetic dictionary was prepared in such a way that it contained all expected words with possible variants of their pronunciation. Careful preparation of the phonetic dictionary turned out to be crucial for the robustness of the speech recognition.

We also generated multi-speaker acoustic models based on the CORPORA speech database, using audio data originating from 45 speakers: 28 adult males, 3 young males, 11 adult females and 3 young females. In total, the multi-speaker model was trained on ca. 180 minutes of speech. To train our acoustic models we used applications and scripts from SphinxTrain, which is a part of CMU Sphinx.

4.4. Training the Language Model

We decided to use a tool supplied with CMU Sphinx called the Sphinx Knowledge Base Tool [Rudnicky (2010)], which generates a language model from a list of sentences. The tool calculates all N-grams appearing in the training data and then converts the results to an N-gram language model format compatible with Sphinx. The quality of the model therefore depends on how well the training sentences reflect the commands which will be issued in the system. In our work we decided to split our commands into parts at least three words long, each part with at most two parameters, and we generated all possible variations of each such fragment. Some of these fragments overlapped, so that all possible bigrams and trigrams were included. This approach also allowed us to repeat the less-parametrized fragments, since the number of repetitions did not have to be large. As a result, our training file was relatively small, simple and quick to generate. After running the Sphinx Knowledge Base Tool on this file, the resulting model required some fine-tuning. Most importantly, all impossible silence-starting and silence-ending N-grams were removed. In addition, we decided to include what we call negative N-grams, i.e. artificially created N-grams with near-zero probability. Their purpose is to disallow word sequences which we consider invalid in our system, but which in preliminary tests were sometimes incorrectly recognized.

Our JSGF grammar was created manually. First we defined groups of words describing parameters in our commands, such as distances, directions and angles. Then we defined constant fragments of commands using these groups, some of them in a few variations, each correct from the point of view of the Polish language. The next step was grouping these fragments into commands, and some commands into sequences of commands. It should be noted that we tried to relax the strict commands by making some words optional and by providing alternatives for some of the words. After the grammar was complete we ran the tests, and then repeated the whole process for test sentences which did not match any of the grammar rules. A short grammar sketch in this spirit is given below.
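For illustration, a fragment of a grammar built along these lines could look as follows. This is a hypothetical sketch in JSGF notation: the rule names and English words are stand-ins, since the actual grammar was written for Polish and has not been published.

```jsgf
#JSGF V1.0;
grammar rally;

// Hypothetical parameter groups (the real grammar used Polish words).
<direction> = left | right;
<angle>     = thirty | sixty | ninety;
<distance>  = fifty | one hundred | two hundred;

// A curve announcement; "in" and "meters" are optional filler words.
<curve> = curve <direction> [<angle>] [in] <distance> [meters];

// Speed control with an optional numeral parameter group reused from above.
<speed> = accelerate to <distance>;

// The public rule accepts one command or a short sequence of two commands.
public <command> = (<curve> | <speed>) [<curve> | <speed>];
```

A grammar like this restricts the recognizer's search space to the sentences its rules can generate, which is why relaxing the rules with optional words and alternatives mattered for longer utterances.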
Fig. 2. Spectrograms of the noise signals used for testing: (a) babble, (b) factory, (c) Volvo.

4.5. Testing Speech Recognition in Noisy Environments

The ASR system was tested using a set of 85 recordings, containing commands which are likely to occur in the Rally Navigator game. The set contained phrases 2-16 words long; the total number of tested words was 422. Two male speakers were recorded in a quiet environment: speaker A, whose voice was used in the acoustic model training, and speaker B, used only for testing. The total duration of the test recordings was 489 s. The game's ASR system was first tested with the original test recordings, i.e., in a quiet environment, and then it was challenged with noisy conditions.

We decided to conduct the experiments using three types of environmental noise, taken from the NOISEX-92 corpus [Varga and Steeneken (1993)]:

• babble: the sound of 100 people speaking in a canteen, with individual voices slightly audible; this noise can imitate background voices in the room where the computer game is played;
• factory: a noise recorded in a factory, near plate-cutting and electrical welding equipment; this sound can imitate a highly disturbing noisy environment;
• Volvo: car interior noise; the Volvo 340 noise was recorded at a speed of 120 km/h, in 4th gear, on an asphalt road in rainy conditions; this is the case of the computer game being played on a mobile device in a car.

Figure 2 presents spectrograms of the noises used. One can observe that the car noise ("Volvo") is a low-band stationary noise, whilst the babble and factory noises have much wider bandwidths and show frequent changes in time. The factory noise contains sudden sharp sounds, resembling hitting or cutting.

The noise samples were added to the recordings of the test commands in various proportions, controlling the SNR values, as sketched in the example below. In this way several sets of recordings were obtained, with SNRs ranging from ca. 5 dB to 40 dB, which were submitted to the ASR system of the game.
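A minimal sketch of this standard mixing step, assuming the speech and noise signals are already decoded into double arrays at the same sampling rate (file I/O omitted), could look as follows:

```java
// Scales a noise signal and adds it to clean speech so that the mixture
// has the requested SNR (in dB), computed from the average powers of the
// two signals. A minimal sketch; real experiments would also guard
// against clipping and handle empty inputs.
public final class NoiseMixer {

    static double power(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v * v;
        return sum / x.length;
    }

    public static double[] mixAtSnr(double[] speech, double[] noise, double snrDb) {
        // Gain chosen so that speechPower / (gain^2 * noisePower) = 10^(snrDb/10).
        double gain = Math.sqrt(power(speech)
                / (power(noise) * Math.pow(10.0, snrDb / 10.0)));
        double[] mix = new double[speech.length];
        for (int i = 0; i < speech.length; i++) {
            mix[i] = speech[i] + gain * noise[i % noise.length]; // loop noise if shorter
        }
        return mix;
    }
}
```

Calling mixAtSnr(speech, noise, 20.0) yields the 20 dB condition; sweeping the target value from ca. 5 dB to 40 dB reproduces the test grid described above.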
5. Results

This section presents the most significant results of the experiments carried out. Following other research on ASR system testing, the following metrics were evaluated (N below denotes the total number of words in the reference transcriptions):

• substitutions (Sub): the number of words which were recognized as other words;
• insertions (Ins): the number of words which were wrongly added to the recognized words;
• deletions (Del): the number of words which were omitted in the recognition;
• word error rate (WER): the ratio of word recognition errors to the total number of words, i.e. WER = (Sub + Ins + Del) / N (a computation sketch is given at the end of Sec. 5.1);
• word accuracy (WA): the ratio of correctly recognized words to the total number of words; it is strongly related to WER;
• sentence accuracy (SA): the ratio of correctly recognized sentences to the total number of sentences.

5.1. Tests for Clean Speech

Fig. 3. Word accuracy and sentence accuracy for recognition using various language models for clean speech.

Figure 3 shows the recognition performance for various language models for clean speech, using the single-speaker acoustic model for speaker A. The JSGF grammar yielded the worst results (WA = 79.15%, SA = 61.2% for speaker A). More detailed analysis of the recognition logs showed that it worked very well for short sentences. However, it had trouble correctly interpreting long word sequences, especially if they consisted of more than one short command. In such cases the recognizer would completely ignore the data after the first command, in spite of the grammar definition, which allowed compound commands. After switching to the N-gram language models, the recognition improved. Even unigram models achieved better accuracy than the JSGF grammar, and after adding information about the probabilities of bigrams and trigrams, the results improved significantly, yielding a word accuracy of 98.58% and a sentence accuracy of 92.9%. The use of negative trigrams turned out to be a successful move, giving for speaker A the final result of WA = 99.29% and SA = 96.5%.

Fig. 4. Word error rate against the duration of the audio signal used in the training process.

Figure 4 displays the recognition performance for clean speech when the single-speaker acoustic model was trained with different amounts of audio data. The word error rate for speaker A after training the acoustic model with 114 CORPORA sentences (ca. 7 minutes of recording) was almost 4%, which is considered high, taking into account that this is small-vocabulary, task-oriented recognition. After adding 4 more minutes, containing the same sentences but uttered faster, WER fell slightly below 3% and the sentence accuracy increased from 88.2% to 90.6%. When recordings containing numerals and control commands were added, the performance continued to improve until the training set contained 15 minutes of recordings, where WER equaled 0.7%. Sentence accuracy at this point reached 97.6%, which was considered a satisfactory result. Further enlargement of the audio data temporarily worsened WER, up to 2.1%, due to a slight increase in substitutions and deletions. Training using 25 minutes of recordings again yielded a low WER of 0.9%.

It is noteworthy that not every word deletion, substitution or insertion resulted in a wrong command. E.g., omitting w (here meaning 'to') in the sentence skręć w prawo ('turn to the right') changed the command into 'turn right', which is actually the same. So the semantic accuracy was actually higher than the sentence accuracy.

Table 1. Recognition results for speakers A and B with various numbers of sentences used for acoustic model adaptation. No noise added.

speaker / adaptation   WER     WA       SA      Sub   Ins   Del
A                      0.9%    99.29%   96.5%     2     1     1
B, no adapt.           8.3%    92.91%   72.9%    20     5    10
B, 10 sentences        9.69%   90.78%   69.4%    28     2    11
B, 20 sentences        6.62%   94.09%   77.6%    18     3     7
B, 30 sentences        5.44%   95.27%   81.2%    13     3     7
B, 40 sentences        4.49%   95.98%   82.4%     9     2     8
B, 60 sentences        3.55%   96.93%   84.7%     6     2     7
B, 80 sentences        1.42%   98.58%   92.9%     3     0     3

Table 1 gives information about the recognition performance for both speaker A and speaker B. When the system was challenged with the voice of speaker B, who was not used to create the single-speaker acoustic model, the ASR system was able to recognize 72.9% of the sentences correctly, at a WER of 8.3%. After adapting the models with 10 CORPORA sentences of speaker B, WER even increased slightly, but adaptation using 20 sentences decreased WER to 6.62%. Further enlargement of the adaptation session steadily improved the recognition performance, but 80 sentences were required to bring WER down to 1.42%, with a sentence accuracy of almost 93%.
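For reference, the error counts and WER values reported here are the standard by-products of a minimum-edit-distance alignment between each reference transcription and the recognizer's hypothesis. A compact sketch of the computation (not the authors' evaluation code) is:

```java
// Computes WER as the word-level Levenshtein distance between a reference
// and a hypothesis, divided by the reference length. Substitutions,
// insertions and deletions all cost 1, as in the standard definition.
public final class WordErrorRate {

    public static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        System.out.println(wer("turn to the right", "turn right")); // 0.5 (two deletions)
    }
}
```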
Fig. 5. Number of substitutions, insertions and deletions against the SNR value of the tested speech, for (a) speaker A and (b) speaker B.

5.2. Tests for Noisy Environments

All tests in noisy environments were performed using the multi-speaker acoustic model. Neither speaker's voice was used to generate or adapt the model.

Figure 5 shows the number of substitutions, insertions and deletions for babble-noised data against the SNR value. As the SNR decreases, the number of substitutions increases rapidly, while the number of deletions increases at a slower rate. As the SNR decreases further, the number of substitutions stops increasing and deletions start to occur more frequently. There is only a small range of SNRs where insertions occur, though it is different for each speaker. Interestingly, insertions did not occur in larger numbers in either factory- or Volvo-noised data.

Figure 6 shows the word error rate against the SNR for the three types of noise. As the SNR decreased, WER remained low up to a certain point and then started to rise. For the babble noise, the rate at which WER increased was almost linear. For the factory noise, the WER increase started earlier, but up to a certain SNR value it was slower; at SNRs between 15 dB and 20 dB WER started to increase rapidly, and then, at about 10 dB, the increase slowed down again. Audio data distorted with factory sounds tended to have the highest WER, though for speaker B there was a range of SNRs where the babble noise yielded the highest WER. Over the tested SNR range the Volvo noise had little or no influence on WER: down to a certain SNR value it did not increase WER at all, and as the SNR decreased further, WER started to grow gradually, but at a rate significantly slower than for both the babble and the factory noise.

Figure 7 shows the sentence accuracy against the SNR. At an SNR of 5 dB, none of the test sentences distorted with the babble or factory noise could be recognized, while sentences distorted with the Volvo noise could still be recognized with about 80% accuracy.

Fig. 6. Word error rate against the SNR for various types of noise, for (a) speaker A and (b) speaker B.

Fig. 7. Sentence accuracy against the SNR for various types of noise, for (a) speaker A and (b) speaker B.

Fig. 8. WER comparison for the two speakers in the presence of (a) babble and (b) factory noise.

6. Conclusion

This paper described a task-oriented continuous speech recognition system for Polish, working in a computer game called Rally Navigator under clean and noisy conditions. The process of designing and implementing the system, based on CMU Sphinx4, was described. We presented the steps undertaken to create the acoustic model and the language model, using both a grammar and a statistical N-gram model. As for the language model, we showed that the best results were achieved with a statistical trigram model. We improved it further by adding negative trigrams, which decreased the number of misrecognized words.
Initial experiments with clean speech showed that audio material as short as 15 minutes is enough to produce a highly effective single-speaker command-and-control ASR system, providing a sentence recognition accuracy of 97.6%. As expected, such a model required adaptation for another speaker. Twenty sentences from the new speaker enabled partial adaptation of the ASR system, so that it reached a word accuracy of 94.09%, but better results (WER below 4%) were obtained when the model was adapted with 60 or 80 sentences. Obviously, using such long audio material for the adaptation of each new user would be impractical, so the acoustic model needs to be improved.

Experiments with speech recognition in noisy conditions showed that the recognition accuracy was hardly affected when the SNR exceeded 25 dB. Only minor degradation of performance was observed when recognition was tested in the car-interior environment (the Volvo noise), even at an SNR as low as 5 dB. However, major degradation of accuracy was observed when the speech signal was distorted with the babble and factory noises and the SNR decreased below 20 dB. A game using a voice interface would need to be extended with, e.g., a noise suppression system if it is planned to be used in environments with loud background speech or factory-like noises.

References

B. Ziolko, S. Manandhar, R. C. Wilson, M. Ziolko, J. Galka, Application of HTK to the Polish Language, in Proc. IEEE International Conference on Audio, Language and Image Processing (ICALIP 2008), Shanghai, 2008, pp. 1759-1764.
M. Szymanski, J. Ogorkiewicz, M. Lange, K. Klessa, S. Grocholewski, G. Demenko, First evaluation of Polish LVCSR acoustic models obtained from the JURISDIC database, Speech and Language Technology, vol. 11, 2008.
S. Furui, Selected topics from 40 years of research in speech and speaker recognition, in Proc. Interspeech 2009, Brighton, UK, 2009.
L. Rabiner and B.-H. Juang, Historical Perspective of the Field of ASR/NLU, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
S. Young, HMMs and Related Speech Recognition Technologies, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
Sun Microsystems, Java Speech Grammar Format, version 1, 1998, http://java.sun.com/products/javamedia/speech/forDevelopers/JSGF/
Ubisoft Shanghai, Tom Clancy's Endwar, 2009, http://endwargame.us.ubi.com/
Ubisoft Romania, Tom Clancy's H.A.W.X, 2009, http://www.hawxgame.com/
A. Janicki and D. Wawer, Automatic Speech Recognition for Polish in a Computer Game Interface, in Proc. of the Federated Conference on Computer Science and Information Systems (FedCSIS 2011), Szczecin, 2011, pp. 711-716.
W. Jassem, Podstawy fonetyki akustycznej (Foundations of Acoustic Phonetics, in Polish), Warsaw, Poland: PWN, 1973.
R. F. Feldstein, A Concise Polish Grammar, Duke University, Durham, USA: Slavic and Eurasian Language Resource Center, 2001.
A. Janicki, P. Meus, M. Topczewski, Taking Advantage of Pronunciation Variation in Unit Selection Speech Synthesis for Polish, in Proc. 3rd International Symposium on Communications, Control and Signal Processing (ISCCSP 2008), St. Julians, Malta, 2008.
S. Grocholewski, CORPORA - Speech Database for Polish Diphones, in Proc. 5th European Conference on Speech Communication and Technology (Eurospeech '97), Rhodes, Greece, 1997.
A. Rudnicky, Sphinx Knowledge Base Tool, 2010, http://www.speech.cs.cmu.edu/tools/lmtool.html
J. Droppo and A. Acero, Environmental Robustness, in: Springer Handbook of Speech Processing, ed. J. Benesty et al., Berlin Heidelberg: Springer-Verlag, 2008.
H. Yasui, K. Shinoda, S. Furui, and K. Iwano, Noise robust speech recognition using spectral subtraction and F0 information extracted by Hough transform, in Proc. APSIPA Annual Summit and Conference 2009, TP-P1-4, pp. 631-634, 2009.
D. S. Kim, S. Y. Lee, R. M. Kil, Auditory Processing of Speech Signals for Robust Speech Recognition in Real-World Noisy Environments, IEEE Transactions on Speech and Audio Processing, 7, pp. 55-69, 1999.
J. H. L. Hansen, R. Sarikaya, U. Yapanel, B. Pellom, Robust Speech Recognition in Noise: An Evaluation using the SPINE Corpus, in Proc. Eurospeech 2001, Aalborg, Denmark, 2001.
L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, High-Performance Robust Speech Recognition Using Stereo Training Data, in Proc. ICASSP 2001, Salt Lake City, Utah, 2001.
J. F. Gemmeke, T. Virtanen, A. Hurmalainen and Y. Sun, Exemplar-based sparse representations for noise robust automatic speech recognition, IEEE Transactions on Audio, Speech and Language Processing, 19 (7), pp. 2067-2080, 2011.
S. Haque, R. Togneri and A. Zaknich, Perceptual features for automatic speech recognition in noisy environments, Speech Communication, 51 (1), pp. 58-75, 2009.
A. Schmidt-Nielsen, et al., Speech in Noisy Environments (SPINE) Training Audio, Linguistic Data Consortium, Philadelphia, 2000.
A. Varga, H. J. M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems, Speech Communication, 12 (3), pp. 247-251, 1993.