SPECOM'2009, St. Petersburg, 21-25 June 2009
Automatic Detection of Emphasized Words for Performance
Enhancement of a Czech ASR System
Martin Kroul
Technical University of Liberec, Faculty of Mechatronics,
Studentská 2, 461 17 Liberec, Czech Republic
[email protected]
Abstract
This paper deals with the problem of detecting prosodically
emphasized words in Czech speech. The main goal is to propose
an automatic emphasized word detection system that would be
a component of an automatic speech recognition (ASR) system
and would enrich its text output by highlighting emphasized
words. The detection method is based on Czech prosodic rules
and uses speech signal intensity, pitch (fundamental
frequency) and speech segment duration as features. A large
speech corpus has been used to compute the prosodic
statistics needed for feature evaluation.
The proposed system is speaker-independent and achieves a
detection score of 91.1% on a test set of 180 sentences.
1. Introduction
In this work we focus on the development of a prosody-based
automatic emphasized word detection system, which is
intended to enhance automatic speech
recognition (ASR) systems. Nowadays, automatic speech
recognizers provide a quite reliable textual transcription of an
utterance. This transcription is usually further processed to
improve readability by inserting punctuation marks,
capitalization, etc. Nevertheless, ASR systems by themselves
cannot resolve the meaning of an utterance, determine the speech
topic or extract keywords. Using the prosodic
parameters of speech, we can detect emphasized words in
the utterance, which can be helpful in these tasks.
The goal of our work is to create a system that can
automatically search for emphasized words or sentence
regions in recognized utterances. Once these
words are detected, we can highlight them in the recognizer’s output text
(using underlining or a different font type, size or color), which
improves comprehension of the text. Such a system can help
the reader to better understand the sense of an utterance, and in some
cases it can resolve a meaning that would otherwise be
undetectable from the textual information alone.
2. Related work
Various papers concerning the emphasized word detection
problem have been presented in the past. Most of them have
focused on pitch accent detection in general.
In [1], a pitch-based emphasis detection algorithm was
used for segmenting speech recordings. This
algorithm was used in an interactive system called
Speech Skimmer, which enables high-speed skimming and
browsing of speech recordings. The algorithm first adapts to
the pitch range of the speaker and then marks regions with
significantly increased pitch as emphasized.
A combination of acoustic and lexical features has been
used in [2] for detection of prosodically salient or emphasized
words in speech. Pitch, intensity and duration have been used
as prosodic features. Word class, its position in sentence,
word and syllable identity, term frequency, inverse document
frequency and negative log frequency have been used as
lexical features.
According to [3], stressed vowels are characterized by a
considerable increase in the amplitude of voicing and slightly
increased open quotient. Furthermore, F0, duration, overall
intensity, closure rate/skewness of the glottal pulse, glottal
leakage and formant based parameters were examined in [3].
Acoustic differences between stressed and unstressed vowels
are presented in [4].
Furthermore, methods based on stress detection have
also been used for speech topic [5] or focus [6] detection.
3. Stress in Czech
Stress in Czech is well described by linguists [7, 8]; however,
its automatic detection and evaluation has not yet been
resolved. It is generally agreed that stress is characterized by
significant changes of prosodic parameters.
There are two kinds of stress in Czech [7, 8]. The first is
called lexical stress. It affects syllables and plays an important
role in producing the rhythm of speech. In Czech, this stress is
fixed: the stressed syllable is the first syllable
of each word. In the following example, stressed syllables are
highlighted in capitals: „ZAsněžené VRcholky JIzerských HOR“. However,
there are some exceptions:
• Some monosyllabic words (like „a“, „i“, „my“, „ti“, „by“,
  „jsem“) are unstressed, so they remain unaffected.
• All prepositions take over the stress of the following word
  in the sentence. For example „do školy“, „bez telefonu“,
  „pod stromem“.
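The lexical stress rules above can be sketched in a few lines of Python. This is only an illustration and not part of the described system: the function name `stress_units` and the word lists `UNSTRESSED` and `PREPOSITIONS` (which contain only the examples quoted above) are assumptions, as is the choice to join a preposition with the following word into one stress unit.

```python
# Hypothetical word lists: only the examples quoted in the text.
UNSTRESSED = {"a", "i", "my", "ti", "by", "jsem"}
PREPOSITIONS = {"do", "bez", "pod"}

def stress_units(words):
    """Return (unit, stressed) pairs for a list of words.

    A preposition is joined with the following word into one unit whose
    first syllable (the preposition itself) carries the stress.
    """
    units = []
    i = 0
    while i < len(words):
        w = words[i].lower()
        if w in PREPOSITIONS and i + 1 < len(words):
            units.append((words[i] + " " + words[i + 1], True))
            i += 2
        elif w in UNSTRESSED:
            units.append((words[i], False))
            i += 1
        else:
            units.append((words[i], True))
            i += 1
    return units

print(stress_units(["jsem", "pod", "stromem"]))
```

In every unit marked as stressed, the stress falls on the first syllable of the unit.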
Stressed syllables are characterized by prosodic changes, of
which the most significant is an increase of intensity. According
to [8], this increase is only very small. However, the results of our
tests show a 10% to 30% increase of intensity
on the stressed syllable [9]. There is also some increase of
duration (10–20%) and a minor increase of pitch.
The other kind of stress in Czech is called
sentential stress; its purpose is to emphasize a
particular word in a sentence. In normal speech, the stressed
word is usually the last word in the sentence. Only when the speaker
wants to draw special attention to a particular word (or to a
part of the sentence) does he move the stress to this word.
Like lexical stress, sentential stress is also
characterized by prosodic changes. The emphasized word is
always prolonged, its first syllable is usually amplified (even
more than due to lexical stress alone), and there is also a
significant increase of pitch.
Knowledge of the emphasized word's position in a sentence is
essential for full comprehension of the sense of the utterance.
Putting emphasis on different words may change the meaning of
the sentence completely. One typical example is the
question „Půjdeme tam zítra?“ (Shall we go there
tomorrow?), shown below with the emphasized word in capitals:
• If the first word is emphasized („PŮJDEME tam
  zítra?“), the speaker asks whether they will go there or not.
• If the last word is emphasized („Půjdeme tam ZÍTRA?“),
  the speaker asks whether they will go there tomorrow or on another
  day.
This is a case where automatic detection of stressed
words could help in better understanding of the recognized
utterance and enrich the ASR system’s output. This work
focuses on sentential stress, especially on prosodic changes in
the stressed syllables of emphasized words. However, lexical
stress has to be kept in view as well, because there is a strong
interaction between the two kinds of stress.
A more detailed view of the problem of stress in Czech can be
found in [7].
4. Data preparation and parametrization
A corpus of 180 sentences from three different speakers
(2 male and 1 female) has been created for testing purposes.
All the utterances have been recorded under clean conditions
using an ordinary close-talk microphone. The speakers were
asked to put emphasis on an arbitrary word in each sentence.
These recordings had to be force-aligned to obtain the
accurate position and length of each phoneme in the sentence. An
automatic HMM-based speech recognizer and aligner has
been used for this task, using 100-mixture 3-state monophone
models. These models have been trained on approx. 35 hours
of broadcast news speech. In total, 39 features (13 MFCC, 13
delta and 13 delta-delta) have been used for training.
A comparison of alignment accuracy using various HMMs has
already been published in [10].
Prosodic statistics were also needed to evaluate some
of the features. We obtained these statistics from the speech
corpus mentioned above (35 hours of broadcast news speech).
These recordings were first aligned using the automatic speech
recognizer and aligner, and then the average phoneme
durations and average phoneme intensities were computed.
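The statistics step can be sketched as follows. This is only an illustration under assumptions: the alignment output is represented as a hypothetical list of `(phoneme, start_s, end_s, mean_intensity)` tuples, and the function name `phoneme_statistics` is not taken from the original system.

```python
from collections import defaultdict

def phoneme_statistics(segments):
    """Average duration and intensity per phoneme.

    segments: iterable of (phoneme, start_s, end_s, mean_intensity) tuples,
    a hypothetical representation of the forced-alignment output.
    Returns {phoneme: (mean_duration_s, mean_intensity)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # duration sum, intensity sum, count
    for phoneme, start, end, intensity in segments:
        acc = sums[phoneme]
        acc[0] += end - start
        acc[1] += intensity
        acc[2] += 1
    return {p: (d / n, e / n) for p, (d, e, n) in sums.items()}

# Made-up aligned segments, for illustration only.
aligned = [("a", 0.00, 0.08, 60.0), ("t", 0.08, 0.13, 40.0), ("a", 0.13, 0.23, 70.0)]
stats = phoneme_statistics(aligned)
```

The resulting dictionary plays the role of the "statistical" phoneme lengths and intensities used by the features in Section 5.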
The final emphasis detection system is supposed to be
part of an ASR system that uses the same HMMs for
recognition as the aligner. Hence this ASR system can
also be used for phoneme alignment.
5. Feature extraction
As mentioned before, an HMM-based automatic phoneme
aligner has been used to detect phoneme positions and
estimate their durations. However, this alignment is not
sufficiently accurate to build the emphasis detection
system on phoneme-level features. In some cases, the
recognized phoneme borders were shifted by as much as tens
of milliseconds from the true borders. We therefore focused on
syllable- and word-based features, for which these
inaccuracies are less influential and the syllable/word
alignment is sufficiently accurate.
5.1. Duration
Knowing the statistical phoneme lengths, the expected length of any
word can be computed as the sum of the lengths of its
phonemes. This statistical word length can then be compared
with the length of the examined word. Although the rate of the examined
speech can differ from the rate of speech in the corpus
(from which the statistics have been computed), it can be assumed
that there is a constant difference between the statistical word
length and the length of a word in the examined sentence. If
this difference increases significantly for some word, we
suppose that this word is emphasized.
In our tests we used the Relative word prolongation
feature (RWP), which is defined as
\[
\mathrm{RWP} = \frac{\mathit{w\_length} - \mathit{stat\_w\_length}}{N} , \tag{1}
\]
where w_length is the length of the investigated word,
stat_w_length is its statistical length computed as the sum of
statistical phoneme lengths and N is the number of phonemes in
the word.
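Equation (1) can be illustrated with a short Python sketch. The function name `relative_word_prolongation`, the dictionary `stat_lengths` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def relative_word_prolongation(word_length, phonemes, stat_lengths):
    """RWP = (w_length - stat_w_length) / N, as in equation (1)."""
    stat_w_length = sum(stat_lengths[p] for p in phonemes)
    return (word_length - stat_w_length) / len(phonemes)

# Hypothetical statistical phoneme lengths in milliseconds.
stat_lengths = {"t": 50, "a": 80, "m": 60}

# A word measured at 250 ms whose statistical length is 50 + 80 + 60 = 190 ms.
rwp = relative_word_prolongation(250, ["t", "a", "m"], stat_lengths)
print(rwp)  # (250 - 190) / 3 = 20.0 ms per phoneme
```

Dividing by the number of phonemes keeps long words from dominating the feature simply because they contain more segments.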
5.2. Pitch
The intonation contour of speech has been estimated using
an F0 detector based on a combination of the short-time
autocorrelation function and the cepstrum of the speech signal.
First, the speech signal was filtered using a low-pass (LP) filter with a
cutoff frequency of 400 Hz to reduce the amplitude of unvoiced
sounds and high-frequency noise. The filtered signal was
then divided into 25 ms long frames with a 10 ms frame shift.
After that, an energy value was computed for each frame
and a voiced/unvoiced decision was made using an
energy detector. The energy threshold t has been defined as
\[
t = \min(\mathit{ene}) + \kappa \cdot \bigl(\max(\mathit{ene}) - \min(\mathit{ene})\bigr) , \tag{2}
\]
where ene is the vector of energy values and κ has been
experimentally set to 0.55. Knowing the positions of
voiced frames, pitch values were evaluated for voiced
segments only. This was done using a combination of the
autocorrelation function and the cepstrum of the filtered speech
signal, searching for their maxima. A continuous
intonation contour was obtained by interpolating across
unvoiced segments. Finally, median and mean filters were
used to correct errors in pitch detection and to
smooth the curve.
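The voiced/unvoiced decision of equation (2) can be sketched as below. This is a minimal illustration, not the original detector: the sum-of-squares energy definition, the rule that frames above the threshold count as voiced, and the function names `frame_energies` and `voiced_mask` are assumptions. Filtering, pitch estimation and smoothing are not shown.

```python
KAPPA = 0.55  # value reported in the text

def frame_energies(signal, frame_len, frame_shift):
    """Sum-of-squares energy of each frame (this energy definition is an assumption)."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, frame_shift)
    ]

def voiced_mask(ene, kappa=KAPPA):
    """Mark frames whose energy exceeds t = min(ene) + kappa * (max(ene) - min(ene))."""
    t = min(ene) + kappa * (max(ene) - min(ene))
    return [e > t for e in ene]

# Made-up frame energies; here the threshold is 0 + 0.55 * (10 - 0) = 5.5.
ene = [0.0, 2.0, 10.0, 6.0, 1.0]
mask = voiced_mask(ene)
print(mask)  # [False, False, True, True, False]
```

With 25 ms frames and a 10 ms shift, `frame_len` and `frame_shift` would be those durations multiplied by the sampling rate, which the text does not state.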
According to Section 3, there is usually a significant
increase of pitch in emphasized words (sentential stress). The
pitch contour usually fluctuates a lot, so we decided
to compute the pitch mean for each word in the sentence and then
compare these values with each other. The word with the
maximum pitch mean should be the most emphasized word.
The Word mean pitch (WMP) has been defined as
\[
\mathrm{WMP} = \frac{\sum_{i=n_1}^{n_2} \mathit{pitch}[i]}{N} , \tag{3}
\]
where n1 and n2 are the initial and final frame of a word,
respectively, and N is the number of frames in the word.
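Equation (3) and the comparison of word means can be sketched as follows. The pitch values and word boundaries are made up for illustration, and the function name `word_mean_pitch` is an assumption.

```python
def word_mean_pitch(pitch, n1, n2):
    """WMP = sum(pitch[n1..n2]) / N with N = n2 - n1 + 1 frames, as in equation (3)."""
    frames = pitch[n1:n2 + 1]
    return sum(frames) / len(frames)

# Hypothetical smoothed pitch contour in Hz and word boundaries as frame indices.
pitch = [100, 110, 150, 170, 160, 120, 100]
words = [(0, 1), (2, 4), (5, 6)]

wmp = [word_mean_pitch(pitch, n1, n2) for n1, n2 in words]
most_emphasized = max(range(len(wmp)), key=wmp.__getitem__)
print(wmp, most_emphasized)  # [105.0, 160.0, 110.0] 1
```

Averaging over the whole word smooths out the local fluctuations of the contour mentioned above.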
5.3. Intensity
In Czech, the first syllable of a word is stronger than the others
in most cases (lexical stress). If the word is emphasized
(sentential stress), the first syllable is even stronger. The
other syllables in the emphasized word may be stronger too,
but in most cases they remain unaffected.
We proposed a feature based on comparing the first
syllable’s intensity across all the words in a sentence. As in the
case of duration, knowledge of the statistical mean intensity value
for each phoneme is required. This information has been
obtained from our speech corpus as mentioned in
Section 4. The Relative intensity increase feature (RII) has
been defined as
\[
\mathrm{RII} = \frac{\mathit{syl\_mean} - \mathit{stat\_syl\_mean}}{N} , \tag{4}
\]
where syl_mean is the mean intensity of the 1st syllable, stat_syl_mean
is the statistical mean intensity of the 1st syllable and N is the number of
phonemes in the syllable.
For each word, three parameters have been obtained. All
three feature vectors had to be normalized to be
comparable with each other. The normalized feature vector fn has
been computed as
\[
f_n = \frac{f - \min(f)}{\max(f) - \min(f)} . \tag{5}
\]
Normalized vectors have a minimal value of 0 and a
maximal value of 1.
6. Experiments
The emphasized word detection system has been tested on
180 sentences spoken by 3 speakers (2 male and 1 female). In
each sentence, one word has been emphasized, arbitrarily
selected by the speaker. All the utterances have been
parametrized and three prosodic feature vectors have been
obtained from each sentence. Each feature vector represents the
relative amount of emphasis in a particular prosodic area.
Since all the feature vectors have been normalized, we can
obtain the overall emphasis vector simply by summing
the three feature vectors. The word with the
maximal value of the overall emphasis vector is then taken as the
emphasized word.
This combination of prosodic features describes emphasis
well. However, emphasized word detection tests using each
prosodic feature separately have been done as well, to compare
their contributions to the overall emphasis.
In Figure 1, there is an example of an emphasis detection
result for the Czech sentence “Chceme tam jet na kole”. In this
case, all three prosodic features have their maxima on the
second word. This indicates that the second word is
most probably the emphasized one.
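Fig. 1. Emphasis detection result of a Czech sentence
“Chceme TAM jet na kole” (We want to go there by bicycle), with emphasis
on the second word “tam”. [Plot: score (0 to 3) for each of the words 1 to 5;
curves: Relative Word Prolongation, Relative Intensity Increase,
Word Mean Pitch and Overall Score.]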
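The min-max normalization and the summation of the three feature vectors described in this section can be sketched as follows. The raw feature values are hypothetical, the function names `normalize` and `detect_emphasis` are assumptions, and the handling of a constant feature vector is not covered by the text.

```python
def normalize(f):
    """Min-max normalization of one feature vector to the range 0..1."""
    lo, hi = min(f), max(f)
    if hi == lo:  # constant vector: not covered by the text, zeros are an assumption
        return [0.0 for _ in f]
    return [(x - lo) / (hi - lo) for x in f]

def detect_emphasis(rwp, rii, wmp):
    """Sum the three normalized vectors; return (index of the maximum, overall vector)."""
    overall = [sum(v) for v in zip(normalize(rwp), normalize(rii), normalize(wmp))]
    return max(range(len(overall)), key=overall.__getitem__), overall

# Hypothetical raw feature values for a five-word sentence.
rwp = [0.0, 20.0, 5.0, 10.0, 0.0]
rii = [1.0, 5.0, 3.0, 1.0, 2.0]
wmp = [100.0, 160.0, 130.0, 110.0, 100.0]

index, overall = detect_emphasis(rwp, rii, wmp)
print(index)  # 1, i.e. the second word
```

Because each normalized vector peaks at 1, the overall score can reach 3 only when all three features have their maximum on the same word.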
7. Results
The results of our tests are shown in Table 1. In total, 180
sentences have been used, each containing one emphasized
word. In 164 sentences, the position of the emphasized word has
been recognized correctly; in 16 sentences it has been
misplaced. This gives an overall detection score of 91.1%.
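The overall score follows directly from the counts given above, as this small check shows. The function name `detection_score` is only illustrative.

```python
def detection_score(correct, total):
    """Percentage of sentences in which the emphasized word was located correctly."""
    return 100 * correct / total

score = detection_score(164, 180)
print(f"{score:.1f}%")  # 91.1%
```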
The most significant feature for the emphasized word
detection task seems to be Relative Word Prolongation, which
achieved a score of 86.1%. However, the effectiveness of this
feature depends strongly on the quality of the HMMs and of the
forced alignment.
By contrast, the Word Mean Pitch feature has the
lowest score of all, but this does not necessarily mean that this
feature is the least significant. There were some errors in
octave jump detection in our intonation contour estimator,
so not all the pitch contours were extracted correctly.
Besides, in some cases the increase of pitch does not affect only
one particular word, which together with minor fluctuations of pitch can
lead to incorrect detection. This may be the reason for the low
score when using the pitch-based feature only, although pitch
should be a very good indicator of emphasis.
Not all three prosodic features necessarily have to be affected
to create word emphasis. A combination of two
features is usually sufficient; the third feature’s contribution may
be imperceptible. The amount of emphasis and the share of
each prosodic feature in it depend on the speaker.
Features used                  Detection score
Overall (RWP+RII+WMP)          91.1%
Relative Word Prolongation     86.1%
Relative Intensity Increase    77.8%
Word Mean Pitch                70.6%
Table 1. Emphasized word detection results.
8. Conclusions
An algorithm for automatic emphasized word detection has been
presented in this paper. Three features have been proposed,
based on Czech prosodic rules. It has been shown that a system
based on prosodic features can detect emphasized words quite
reliably and can be used as an enhancement of an ASR system.
A score of 91.1% has been achieved using a combination of
all three prosodic features.
9. Acknowledgements
This work has been supported by GA AVCR (grant no.
1QS108040569) and by grant MŠMT OC09066 (project
COST 2102).
10. References
[1] Arons, B.: Pitch-Based Emphasis Detection for
Segmenting Speech Recordings. In: Proceedings of the
International Conference on Spoken Language
Processing (September 18–22, Yokohama, Japan), vol. 4,
1994, pp. 1931–1934.
[2] Brenier, J. M., Cer, D. M., Jurafsky, D.: The Detection of
Emphatic Words Using Acoustic and Lexical Features.
In: Interspeech 2005, Lisbon, Portugal, pp. 3297–3300.
[3] Sluijter, A. M. C., Heuven, V. J.: Acoustic Correlates of
Linguistic Stress and Accent in Dutch and American
English. In: Proc. ICSLP96, Philadelphia, PA, 1996, pp.
630–633.
[4] Kuijk, D., Boves, L.: Acoustic characteristics of lexical
stress in continuous telephone speech. In: Speech
Communication 27 (1999), pp. 95–111.
[5] Silipo, R., Crestani, F.: Prosodic stress and topic
detection in spoken sentences. In: String Processing and
Information Retrieval, 2000. SPIRE 2000. Proceedings.
Seventh International Symposium, 27–29 Sept. 2000,
pp. 243–252.
[6] Heldner, M., Strangert, E., Deschamps, T.: A focus
detector using overall intensity and high frequency
emphasis. In: Proceedings ICPhS'99.
[7] Palková, Z.: Fonetika a fonologie češtiny. Univerzita
Karlova, Praha, Czech Republic, 1994.
[8] Hála, B., Sovák, M.: Hlas, řeč, sluch. Státní pedagogické
nakladatelství, Praha, Czech Republic, 4th edition, 1962.
[9] Kroul, M.: Computer Speech Synthesis for Czech. Master's
thesis, Technical University of Liberec, Czech Republic,
2006.
[10] Kroul, M.: Automatic Speech Segmentation Based on
HMM. In: Radioengineering, June 2007, Volume 16,
No. 2, ISSN 1210-2512.