SPECOM'2009, St. Petersburg, 21-25 June 2009

Automatic Detection of Emphasized Words for Performance Enhancement of a Czech ASR System

Martin Kroul
Technical University of Liberec, Faculty of Mechatronics, Studentská 2, 461 17 Liberec, Czech Republic
[email protected]

Abstract

This paper deals with the problem of detecting prosodically emphasized words in Czech speech. The main goal is to propose an automatic emphasized word detection system that would be a component of an automatic speech recognition (ASR) system and would enrich its text output by highlighting emphasized words. The detection method is based on Czech prosodic rules and uses speech signal intensity, pitch (fundamental frequency) and speech segment duration as features. A large speech corpus has been used to compute the prosodic statistics needed for feature evaluation. The proposed system is speaker-independent and achieves a detection score of 91%.

1. Introduction

In this work we focus on the development of a prosody-based automatic emphasized word detection system, intended to enhance automatic speech recognition (ASR) systems. Nowadays, automatic speech recognizers provide a quite reliable textual transcription of any utterance. This transcription is usually post-processed to improve readability, e.g. by inserting punctuation marks and capitalization. Nevertheless, ASR systems by themselves cannot resolve the meaning of an utterance, determine the speech topic or extract keywords. Using the prosodic parameters of speech, we can detect emphasized words in the utterance, which can be helpful in these tasks. The goal of our work is to create a system that automatically searches for emphasized words or sentence regions in recognized utterances. Once these words are detected, we can highlight them in the recognizer's output text (using underlining or a different font type, size or color), which will improve comprehension of the text.
This system can help the reader to better understand the sense of an utterance, and in some cases it can resolve a meaning that would otherwise be undetectable from the textual information alone.

2. Related work

Various papers concerning the emphasized word detection problem have been presented in the past. Most of them have focused on pitch accent detection in general. In [1], a pitch-based emphasis detection algorithm has been used for segmenting speech recordings. This algorithm has been used in an interactive system called Speech Skimmer, which enables high-speed skimming and browsing of speech recordings. The algorithm first adapts to the pitch range of the speaker and then marks regions with significantly increased pitch as emphasized.

A combination of acoustic and lexical features has been used in [2] for the detection of prosodically salient or emphasized words in speech. Pitch, intensity and duration have been used as prosodic features. Word class, position in the sentence, word and syllable identity, term frequency, inverse document frequency and negative log frequency have been used as lexical features.

According to [3], stressed vowels are characterized by a considerable increase in the amplitude of voicing and a slightly increased open quotient. Furthermore, F0, duration, overall intensity, closure rate/skewness of the glottal pulse, glottal leakage and formant-based parameters were examined in [3]. Acoustic differences between stressed and unstressed vowels are presented in [4]. Methods based on stress detection have also been used for speech topic detection [5] and focus detection [6].

3. Stress in Czech

Stress in Czech is well described by linguists [7, 8]; however, its automatic detection and evaluation has not yet been resolved. It is generally agreed that stress is characterized by significant changes of prosodic parameters. There are two kinds of stress in Czech [7, 8]. The first is called lexical stress.
It affects syllables and plays an important role in producing the rhythm of speech. In Czech, this stress is fixed: the stressed syllable is the first syllable of each word. In the following example, stressed syllables are capitalized: „ZAsněžené VRcholky JIzerských HOR“. However, there are some exceptions:

• Some monosyllabic words (like „a“, „i“, „my“, „ti“, „by“, „jsem“) are unstressed, so they remain unaffected.
• All prepositions take over the stress of the following word in the sentence. For example „DO školy“, „BEZ telefonu“, „POD stromem“.

Stressed syllables are characterized by prosodic changes, of which the most significant is an increase of intensity. According to [8], this increase is only very small. However, the results of our tests show a 10% to 30% increase of intensity on the stressed syllable [9]. There is also some increase of duration (10 to 20%) and a minor increase of pitch.

The other kind of stress in Czech is called sentential stress, and its purpose is to emphasize a particular word in the sentence. In neutral speech, the stressed word is usually the last word of the sentence. Only when the speaker wants to draw special attention to a particular word (or to a part of the sentence) does he move the stress to this word. Like the lexical stress, the sentential stress is characterized by prosodic changes. The emphasized word is always prolonged, its first syllable is usually amplified (even more than due to the lexical stress alone), and there is also a significant increase of pitch.

Knowledge of the position of the emphasized word in the sentence is essential for full comprehension of the sense of the utterance. Putting emphasis on different words may change the meaning of the sentence completely. One typical example is the question „Půjdeme tam zítra?“ (Shall we go there tomorrow?).

• If the first word is emphasized („PŮJDEME tam zítra?“), the speaker asks whether they will go there or not.
• If the last word is emphasized („Půjdeme tam ZÍTRA?“), the speaker asks whether they will go there tomorrow or on another day.

This is a case where automatic detection of stressed words could help in better understanding of the recognized utterance and enrich the ASR system's output. This work focuses on sentential stress, especially on prosodic changes in the stressed syllables of emphasized words. However, the lexical stress has to be kept in view as well, because there is a strong interaction between both kinds of stress. A more detailed view of the problem of stress in Czech can be found in [7].

4. Data preparation and parametrization

A corpus of 180 sentences from three different speakers (2 male and 1 female) has been created for testing purposes. All the utterances have been recorded under clean conditions using an ordinary close-talk microphone. Speakers have been asked to put emphasis on an arbitrary word in each sentence. These recordings had to be forced-aligned to obtain the accurate position and length of each phoneme in the sentence. An automatic HMM-based speech recognizer and aligner has been used for this task, using 100-mixture 3-state monophone models. These models have been trained on approx. 35 hours of broadcast news speech. In total, 39 features (13 MFCC, 13 delta and 13 delta-delta) have been used for training. A comparison of alignment accuracy using various HMMs has already been published in [10].

Prosodic statistics have also been needed to evaluate some of the features. We obtained these statistics from the speech corpus mentioned above (35 hours of broadcast news speech). These recordings were first aligned using the automatic speech recognizer and aligner, and then the average phoneme durations and average phoneme intensities were acquired.

The final emphasis detection system is supposed to be part of an ASR system that uses the same HMMs for recognition as the aligner. Hence this ASR system can also be used for phoneme alignment.

5. Feature extraction

As mentioned before, an HMM-based automatic phoneme aligner has been used to detect phoneme positions and estimate their durations. However, this alignment is not accurate enough to build the emphasis detection system on phoneme-level features. In some cases, recognized phoneme borders have been shifted by as much as tens of milliseconds from the true borders. Thus we focused on syllable-based and word-based features instead. In that case these inaccuracies are less influential and the syllable/word alignment is sufficiently accurate.

5.1. Duration

With knowledge of statistical phoneme lengths, the statistical length of any word can be computed as the sum of the statistical lengths of its phonemes. The statistical word length can then be compared with the length of the examined word. Although the rate of the examined speech can differ from the rate of speech in the corpus (from which the statistics have been computed), it can be assumed that there is a constant difference between the statistical word length and the length of a word in the examined sentence. If this difference increases significantly for some word, we suppose that this word is emphasized. In our test we used the Relative word prolongation feature (RWP), which is defined as

\[ \mathrm{RWP} = \frac{w\_length - stat\_w\_length}{N}, \tag{1} \]

where w_length is the length of the investigated word, stat_w_length is the statistical length computed as the sum of statistical phoneme lengths, and N is the number of phonemes in the word.

5.2. Pitch

The intonation contour of speech has been estimated using an F0 detector based on a combination of the short-time autocorrelation function and the cepstrum of the speech signal. First, the speech signal has been filtered using a low-pass filter with a cutoff frequency of 400 Hz to reduce the amplitude of unvoiced sounds and high-frequency noise. The filtered signal has then been divided into 25 ms long frames, with a new frame every 10 ms.
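The framing step above, together with the energy-based voiced/unvoiced decision of Eq. (2), can be illustrated by a minimal sketch. It is not the implementation used in this work: the function names are hypothetical, the 16 kHz sampling rate in the usage note is an assumption (the paper does not state one), and treating a frame as voiced when its energy is strictly above the threshold is likewise an assumption.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping frames (25 ms long, one every 10 ms)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_energies(frames):
    """Energy of each frame, taken here as the sum of squared samples."""
    return [sum(x * x for x in frame) for frame in frames]


def energy_threshold(ene, kappa=0.55):
    """Threshold t = min(ene) + kappa * (max(ene) - min(ene)), as in Eq. (2)."""
    lo, hi = min(ene), max(ene)
    return lo + kappa * (hi - lo)


def voiced_flags(ene, kappa=0.55):
    """Mark a frame as voiced when its energy exceeds the threshold (assumed rule)."""
    t = energy_threshold(ene, kappa)
    return [e > t for e in ene]
```

For a signal sampled at the assumed 16 kHz, `frame_signal(samples, 16000)` yields frames of 400 samples taken every 160 samples; `voiced_flags(frame_energies(frames))` then gives one voicing decision per frame.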
After that, an energy value has been computed for each frame and a voiced/unvoiced decision has been made using an energy detector. The energy threshold t has been defined as

\[ t = \min(ene) + \kappa \cdot (\max(ene) - \min(ene)), \tag{2} \]

where ene is the vector of energy values and κ has been experimentally set to 0.55. With knowledge of the positions of voiced frames, pitch values have been evaluated for voiced segments only. This has been done by combining the autocorrelation function and the cepstrum of the filtered speech signal and searching for their maxima. A continuous intonation contour has been obtained by interpolating over the unvoiced segments. Finally, median and mean filters have been used to correct errors in pitch detection and to smooth the curve.

According to Section 3, there is usually a significant increase of pitch in emphasized words (sentential stress). The pitch contour usually has a lot of fluctuations, so we decided to compute the pitch mean for each word in the sentence and then compare these values with each other. The word with the maximum pitch mean should be the most emphasized word. The Word mean pitch (WMP) has been defined as

\[ \mathrm{WMP} = \frac{\sum_{i=n_1}^{n_2} pitch[i]}{N}, \tag{3} \]

where n1 and n2 are the initial and final frame of a word, respectively, and N is the number of frames in the word.

5.3. Intensity

In Czech, the first syllable of a word is stronger than the others in most cases (lexical stress). If the word is emphasized (sentential stress), the first syllable is even stronger. The other syllables in the emphasized word may be stronger too, but in most cases they remain unaffected. We proposed a feature based on comparing the first-syllable intensity of all the words in a sentence. As in the case of duration, knowledge of the statistical mean intensity of each phoneme is required. This information has been obtained from our speech corpus, as mentioned in Section 4. The Relative intensity increase feature (RII) has been defined as

\[ \mathrm{RII} = \frac{syl\_mean - stat\_syl\_mean}{N}, \tag{4} \]

where syl_mean is the mean intensity of the first syllable, stat_syl_mean is the statistical mean intensity of the first syllable, and N is the number of phonemes in the syllable.

For each word, three parameters have been obtained. All three feature vectors had to be normalized to be comparable with each other. The normalized feature vector f_n has been computed as

\[ f_n = \frac{f - \min(f)}{\max(f) - \min(f)}. \tag{5} \]

Normalized vectors have a minimum value of 0 and a maximum value of 1.

6. Experiments

The emphasized word detection system has been tested on 180 sentences spoken by 3 speakers (2 male and 1 female). In each sentence, one word, arbitrarily selected by the speaker, has been emphasized. All the utterances have been parametrized and three prosodic feature vectors have been obtained from each sentence. Each feature vector represents the relative amount of emphasis in a particular prosodic area. Since all the feature vectors have been normalized, we can obtain the overall emphasis vector simply by summing the three feature vectors. The word at which the overall emphasis vector reaches its maximum is taken as the emphasized word. This combination of prosodic features describes emphasis well. However, emphasized word detection tests using each prosodic feature separately have been done as well, to compare their contributions to the overall emphasis.

Figure 1 shows an example of the emphasis detection result for the Czech sentence „Chceme tam jet na kole“. In this case, all three prosodic features have their maxima on the second word. This indicates that the second word is most probably the emphasized one.

[Figure 1. Plot titled „Chceme TAM jet na kole“ (We want to go there by bicycle). Horizontal axis: Words (1 to 5). Vertical axis: Score (0 to 3). Legend: Relative Word Prolongation, Relative Intensity Increase, Word Mean Pitch, Overall Score.]

Fig. 1. Emphasis detection result of the Czech sentence „Chceme tam jet na kole“, with emphasis on the second word „tam“.

7. Results

The results of our tests are shown in Table 1. In total, 180 sentences have been used, each containing one emphasized word. In 164 sentences, the position of the emphasized word has been recognized correctly; in 16 sentences it has been misplaced. This gives an overall detection score of 91.1%.

The most significant feature for the emphasized word detection task seems to be Relative Word Prolongation. It achieved a score of 86.1%. However, the effectiveness of this feature depends strongly on the quality of the HMMs and of the forced alignment. In contrast, the score of the Word Mean Pitch feature is the lowest of all, but this does not necessarily mean that this feature is the least significant. There have been some octave jump errors in our intonation contour estimator, so not all the pitch contours have been extracted correctly. Besides, in some cases the increase of pitch does not affect only one particular word, and together with minor fluctuations of pitch this can lead to incorrect detection. This may be the reason for the low score when using the pitch-based feature only, although pitch should be a very good indicator of emphasis.

Not all three prosodic features necessarily have to be affected to create word emphasis. A combination of two features is usually sufficient; the contribution of the third feature may be imperceptible. The amount of emphasis and the share of each prosodic feature in it depend on the speaker.

Features used                   Detection score
Overall (RWP+RII+WMP)           91.1%
Relative Word Prolongation      86.1%
Relative Intensity Increase     77.8%
Word Mean Pitch                 70.6%

Table 1. Emphasized word detection results.

8. Conclusions

An algorithm for automatic emphasized word detection has been presented in this paper. Three features have been proposed, based on Czech prosodic rules.
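To make the combination of the three features concrete, the following is a minimal sketch of the min-max normalization of Eq. (5) and of the summation and maximum search described in Section 6. It is an illustration under stated assumptions, not the code used in the experiments: all function names are hypothetical, the feature values in the usage example are invented, and returning zeros for a constant feature vector is an assumption (Eq. (5) is undefined in that case).

```python
def relative_word_prolongation(word_length, stat_word_length, n_phonemes):
    """RWP of Eq. (1): excess word length per phoneme."""
    return (word_length - stat_word_length) / n_phonemes


def min_max_normalize(values):
    """Scale a feature vector to the range [0, 1], as in Eq. (5)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no evidence for any word.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def detect_emphasized_word(rwp, rii, wmp):
    """Sum the normalized per-word features and return (index, overall scores)."""
    normalized = [min_max_normalize(f) for f in (rwp, rii, wmp)]
    overall = [sum(per_word) for per_word in zip(*normalized)]
    best = max(range(len(overall)), key=overall.__getitem__)
    return best, overall


# Invented per-word feature values for a three-word sentence.
best, overall = detect_emphasized_word(
    rwp=[0.0, 2.0, 1.0],
    rii=[1.0, 3.0, 1.0],
    wmp=[100.0, 120.0, 110.0],
)
print(best, overall)
```

Because each normalized feature lies between 0 and 1, the overall score of a word lies between 0 and 3, which matches the score range of the vertical axis in Figure 1.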
It has been shown that a system based on prosodic features can detect emphasized words quite reliably and can be used as an enhancement of an ASR system. A score of 91.1% has been achieved using a combination of all three prosodic features.

9. Acknowledgements

This work has been supported by GA AVCR (grant no. 1QS108040569) and by grant MŠMT OC09066 (project COST 2102).

10. References

[1] Arons, B.: Pitch-Based Emphasis Detection for Segmenting Speech Recordings. In: Proceedings of International Conference on Spoken Language Processing (September 18-22, Yokohama, Japan), vol. 4, 1994, pp. 1931-1934.
[2] Brenier, J. M., Cer, D. M., Jurafsky, D.: The Detection of Emphatic Words Using Acoustic and Lexical Features. In: Interspeech 2005, Lisbon, Portugal, pp. 3297-3300.
[3] Sluijter, A. M. C., Heuven, V. J.: Acoustic Correlates of Linguistic Stress and Accent in Dutch and American English. In: Proc. ICSLP96, Philadelphia, PA, 1996, pp. 630-633.
[4] Kuijk, D., Boves, L.: Acoustic characteristics of lexical stress in continuous telephone speech. In: Speech Communication 27 (1999), pp. 95-111.
[5] Silipo, R., Crestani, F.: Prosodic stress and topic detection in spoken sentences. In: String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium, 27-29 Sept. 2000, pp. 243-252.
[6] Heldner, M., Strangert, E., Deschamps, T.: A focus detector using overall intensity and high frequency emphasis. In: Proceedings ICPhS'99.
[7] Palková, Z.: Fonetika a fonologie češtiny. Univerzita Karlova, Praha, Czech Republic, 1994.
[8] Hála, B., Sovák, M.: Hlas, řeč, sluch. Státní pedagogické nakladatelství, Praha, Czech Republic, 4th edition, 1962.
[9] Kroul, M.: Computer Speech Synthesis for Czech. Master thesis, Technical University of Liberec, Czech Republic, 2006.
[10] Kroul, M.: Automatic Speech Segmentation Based on HMM. In: Radioengineering, June 2007, Volume 16, Nr. 2, ISSN 1210-2512.