O2-4
Emphasized Speech Synthesis Based on Hidden Markov Models
Kumiko Morizane, Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology
{kei-naka, tomoki, sawatari, shikano}@is.naist.jp
Abstract
This paper presents a statistical approach to
synthesizing emphasized speech based on hidden
Markov models (HMMs). Context-dependent HMMs
are trained using emphasized speech data uttered by
intentionally emphasizing an arbitrary accentual
phrase in a sentence. To model acoustic
characteristics of emphasized speech, new contextual
factors describing an emphasized accentual phrase are
additionally considered in model training. Moreover,
to build HMMs for synthesizing both normal speech
and emphasized speech, we investigate two training
methods; one is training of individual models for
normal and emphasized speech using each of these two
types of speech data separately; and the other is
training of a mixed model using both of them
simultaneously. The experimental results demonstrate
that 1) HMM-based speech synthesis is effective for
synthesizing emphasized speech and 2) the mixed
model allows a more compact HMM set generating
more naturally sounding but slightly less emphasized
speech compared with the individual models.
1. Introduction
Speech is the most common way for people to communicate with each other. One feature of speech as a communication medium is that it simultaneously conveys not only linguistic information but also other information such as speaker individuality, speaker attitude, and emotion. We usually generate various speech sounds by controlling speech features such as prosody and voice quality to convey the intended information.
Text-to-speech (TTS) synthesis is a technique for generating speech waveforms from an input text. This technique has been studied for several decades and has been used in many applications, e.g., man-machine interfaces such as car navigation and spoken dialogue systems, text readers for visually impaired people, and speech generators for speaking aids. The current trend in TTS is corpus-based approaches [1] built on a large amount of speech data and statistical processing. A typical approach is concatenative speech synthesis based on unit selection [2], which generates waveform signals by concatenating speech segments selected from a speech database. This approach has brought dramatic improvements in the naturalness of synthetic speech.
In order to convey not only linguistic but also paralinguistic information, there have been several attempts at flexibly synthesizing expressive speech (e.g., several speaking styles [3, 4], emotional speech [5, 6], and emphasized speech [3, 7]). These techniques make synthetic speech more informative and man-machine communication more useful.
For instance, when synthesizing a text “It is eleven
o'clock.’’ as an answer to a question “What time is it
now?’’, it would be more informative to emphasize the
word “eleven” rather than the other words. Although concatenative speech synthesis has been successfully applied to expressive speech synthesis [3, 5, 7], its flexibility is limited because it inevitably requires a large speech database that widely covers the desired acoustic space.
Recently, HMM-based speech synthesis [8] has
attracted attention as an alternative approach to
concatenative speech synthesis because of its potential
for realizing a very flexible TTS system. Unlike
concatenative speech synthesis, this method generates
waveform signals from statistics of acoustic
inventories extracted from the speech database. This
framework has many attractive features such as
completely data-driven voice building, flexible voice
quality control, speaker adaptation, small footprint,
and so on. Therefore, it is worthwhile to investigate its
effectiveness in various types of expressive speech
synthesis.
In this paper, we apply HMM-based speech
synthesis to emphasized speech synthesis. Context-dependent HMMs are trained using emphasized speech
data uttered by intentionally emphasizing an arbitrary
accentual phrase within a sentence. To effectively
model acoustic characteristics of emphasized speech,
new contextual factors describing the emphasized
accentual phrase are additionally considered in model
training. Two training methods are investigated in this
paper; one is training of individual models for normal
speech and emphasized speech using each of those two
types of speech data separately; and the other is
training of a mixed model using both of them
simultaneously. We conduct several subjective
evaluations to demonstrate the effectiveness of the
proposed approach to synthesizing emphasized speech.
This paper is organized as follows. In Section 2, we
describe an overview of HMM-based speech synthesis.
In Section 3, we describe our attempts at synthesizing
emphasized speech based on HMMs. In Section 4,
experimental evaluations are presented. Finally, we
summarize this paper in Section 5.
2. HMM-based speech synthesis
The basic framework of HMM-based speech synthesis consists of a training process and a synthesis process.
In the training process, speech parameters including
spectral parameters such as mel-cepstral coefficients
and excitation parameters such as log-scaled F0 are
extracted from speech waveforms. Time sequences of
these parameters including their dynamic features are
modeled by multi-stream HMMs consisting of
continuous density HMMs for the spectral component
and multi-space probability distribution HMMs (MSD-HMMs) [9] for the F0 component. HMM state
duration is also modeled by a Gaussian distribution. To capture not only segmental but also suprasegmental features, we use context-dependent phoneme HMMs considering various types of contextual factors [8], such as:
- mora¹ count of sentence
- position of breath group in sentence
- mora count of {preceding, current, succeeding} breath group
- position of current accentual phrase in current breath group
- mora count and accent type of {preceding, current, succeeding} accentual phrase
- {preceding, current, succeeding} part-of-speech
- position of current phoneme in current accentual phrase
- {preceding, current, succeeding} phoneme
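As a concrete illustration, the factors above can be packed into a full-context label for each phoneme. The label format sketched below is hypothetical (real HTS-style labels use a much denser encoding); only the factor inventory follows the text, and undefined contexts are marked "x".

```python
# Sketch: composing a context label for one phoneme. The key names and the
# "/"-separated format are illustrative assumptions, not the actual scheme.

def make_context_label(prev_ph, cur_ph, next_ph, factors):
    """factors: dict of the suprasegmental contexts listed above."""
    parts = [f"{prev_ph}^{cur_ph}+{next_ph}"]
    for key in ("pos_phoneme_in_accent_phrase",
                "mora_count_accent_phrase",
                "accent_type",
                "pos_accent_phrase_in_breath_group",
                "mora_count_breath_group",
                "mora_count_sentence"):
        parts.append(f"{key}={factors.get(key, 'x')}")  # 'x' = undefined
    return "/".join(parts)

label = make_context_label("a", "r", "a",
                           {"pos_phoneme_in_accent_phrase": 2,
                            "mora_count_accent_phrase": 4,
                            "accent_type": 3})
print(label)
```

Every distinct label corresponds to one context-dependent phoneme model before clustering.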
The model structure of context-dependent HMMs is automatically determined using a decision-tree-based context clustering technique based on the minimum description length (MDL) criterion [10]. Since each component (e.g., spectrum, F0, and duration) has its own dominant contextual factors, the clustering process is performed independently for each component.

¹ A mora is a syllable-sized unit in Japanese.
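The MDL decision underlying the clustering can be sketched as follows, assuming diagonal-covariance Gaussians at each leaf: a question splits a node only when the gain in log-likelihood exceeds the description-length penalty for the extra parameters. The dimensions and data here are illustrative, not the paper's configuration.

```python
import numpy as np

def gaussian_loglik(x):
    """Maximized log-likelihood of samples x under a diagonal-covariance Gaussian."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-8
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def mdl_gain(node, yes, no, n_total):
    """Positive value -> the question's split is accepted under the MDL criterion."""
    d = node.shape[1]
    delta_ll = gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(node)
    penalty = 0.5 * (2 * d) * np.log(n_total)  # one extra mean + variance set
    return delta_ll - penalty

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (200, 3))  # e.g. segments answering "no"
b = rng.normal(5.0, 1.0, (200, 3))  # e.g. segments answering "yes"
node = np.vstack([a, b])
good = mdl_gain(node, b, a, len(node))                  # informative question
bad = mdl_gain(node, node[::2], node[1::2], len(node))  # uninformative split
```

An informative question yields a large positive gain, while an arbitrary split does not pay for its penalty.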
In the synthesis process, a composite sentence HMM is constructed by concatenating context-dependent HMMs according to a context label sequence generated from the input text. Then, smoothly varying speech parameter trajectories are generated by maximizing the likelihood of the composite sentence HMM [11]. Finally, a vocoding technique is employed to generate a speech waveform from the generated speech parameters.
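The generation step can be sketched for a single one-dimensional stream: given per-frame Gaussian means and variances for the static and delta features, the most likely static trajectory has the closed form c = (WᵀU⁻¹W)⁻¹WᵀU⁻¹μ, where W stacks the static and delta windows. The delta window and toy numbers below are illustrative assumptions.

```python
import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Solve (W' U^-1 W) c = W' U^-1 mu for the static trajectory c."""
    T = len(mu_static)
    I = np.eye(T)
    # delta window: 0.5 * (c[t+1] - c[t-1]), edges clamped
    D = np.zeros((T, T))
    for t in range(T):
        D[t, min(t + 1, T - 1)] += 0.5
        D[t, max(t - 1, 0)] -= 0.5
    W = np.vstack([I, D])
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# step-shaped static means (two HMM states); delta means of 0 smooth the jump
mu_s = np.array([0.0] * 5 + [1.0] * 5)
traj = mlpg(mu_s, np.full(10, 0.1), np.zeros(10), np.full(10, 0.01))
```

Because the delta means are zero at the state boundary, the generated trajectory crosses the step gradually instead of jumping, which is the source of the "smoothly varying" trajectories mentioned above.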
3. Emphasized speech synthesis based on
HMM
We develop an HMM-based speech synthesizer to
synthesize both normal speech and emphasized speech.
3.1. Recording of emphasized speech
We recorded 503 phonetically balanced sentences of normal speech and the same sentences of emphasized speech uttered by one Japanese female speaker. When recording the emphasized speech, the speaker uttered each sentence by intentionally emphasizing one accentual phrase that was arbitrarily determined in advance.
Figure 1 shows an example of waveforms, F0
contours, and spectrograms of normal speech and
emphasized speech. We can see some acoustic changes
caused by emphasis especially around the emphasized
accentual phrase, e.g., longer duration of the accentual
phrase, higher F0, and larger waveform power.
3.2. Contextual factors describing emphasized
information
We additionally consider the following new contextual factors describing emphasis information, assuming that the emphasis of an accentual phrase significantly affects the other accentual phrases in the same breath group but affects those in different breath groups less:
- position of the emphasized accentual phrase in the current breath group
- distance of the current accentual phrase to the {preceding, succeeding} emphasized accentual phrase in the current breath group
The position is represented as forward count and
backward count based on mora and accentual phrase.
The distance is also represented as mora count and
accentual phrase count. Figure 2 shows an example of
these new contextual factors when emphasizing the 4th
accentual phrase within a sentence. It is noted that only
accentual phrase count is shown in this figure although
we actually use the mora count as well. Because no
accentual phrase is emphasized in the 1st breath group,
no contextual factors on emphasis are defined for
individual phonemes in the 1st and 2nd accentual
phrases. On the other hand, some contextual factors on
emphasis are defined for every phoneme in the 2nd
breath group because the emphasized accentual phrase
is included in it. The contextual factors based on the
position are considered as a part of a context label for
every phoneme in the emphasized accentual phrase
(i.e., the 4th accentual phrase). Moreover, the
contextual factors based on the distance are considered
as a part of a context label for every phoneme in the
other accentual phrases in the 2nd breath group (i.e.,
the 3rd, 5th, and 6th accentual phrases).
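Following the convention of Fig. 2, the two factors can be sketched in accentual-phrase counts (the mora-based counts work analogously). The function and its dictionary keys below are illustrative, not the actual label format.

```python
# Sketch of the proposed emphasis contexts within one breath group.
# Phrase indices are 0-based within the breath group; None means the
# emphasis contexts are undefined (breath group with no emphasized phrase).

def emphasis_contexts(n_phrases, emphasized):
    """emphasized: 0-based index of the emphasized phrase, or None."""
    out = []
    for i in range(n_phrases):
        if emphasized is None:
            out.append(None)  # no emphasis context defined
        elif i == emphasized:
            # forward/backward position of the emphasized phrase
            out.append({"pos_fw": i + 1, "pos_bw": n_phrases - i})
        elif i < emphasized:
            out.append({"dist_to_succeeding": emphasized - i})
        else:
            out.append({"dist_to_preceding": i - emphasized})
    return out

# 2nd breath group of Fig. 2: four phrases, the 2nd of them emphasized
print(emphasis_contexts(4, 1))
# 1st breath group: no emphasized phrase -> all contexts undefined
print(emphasis_contexts(2, None))
```

For the 2nd breath group this reproduces the counts shown in Fig. 2: distance 1 for the 3rd phrase, forward/backward positions 2/3 for the 4th, and distances 1 and 2 for the 5th and 6th.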
Fig. 1. An example of waveforms, F0 contours, and spectrograms of a) normal speech and b) emphasized speech for a breath group within the same sentence. Parts surrounded by rectangles show the accentual phrase to be emphasized in emphasized speech.

Fig. 2. An example of contextual factors describing emphasis information. The 4th accentual phrase in the 2nd breath group is emphasized in the sentence “/ a r a y u r u / g e N j i ts u o / pau / s u b e t e / j i b u N n o / h o o e / n e j i m a g e t a n o d a /.”

3.3. Model training
We train the context-dependent HMMs considering
both the original contextual factors and the new
contextual factors on emphasis. Various questions
related to the new contextual factors are additionally
used in the decision-tree-based context clustering.
We adopt two straightforward training methods [6]
for building context-dependent HMMs to generate
both normal speech parameters and emphasized speech
parameters. One method is to independently train two different HMM sets using normal speech data and emphasized speech data separately; i.e., one
HMM set models only normal speech and the other
HMM set models only emphasized speech. The
resulting models are called individual models in this
paper. In synthesis, we select one of these two HMM
sets according to which type of speech is synthesized.
The other method is to train one HMM set using both normal speech data and emphasized speech data simultaneously. The resulting model is called the mixed model in this paper. In this training method,
speech segments of emphasized speech are
distinguished from those of normal speech by the
contextual factors on emphasis. Consequently,
differences of acoustic characteristics between normal
speech and emphasized speech are captured by the
decision trees. If these differences are small enough in
some contexts, clusters including speech segments
from both normal speech and emphasized speech
would be created. If these differences are significantly
large in other contexts, speech segments from normal
speech and those from emphasized speech would be
separated into different clusters based on a question
related to the contextual factors on emphasis.
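This sharing behavior can be illustrated with synthetic one-dimensional data: for each stream, an "is this phrase emphasized?" question splits a node only when it buys enough likelihood. A simple variance-reduction score stands in for the full MDL computation, and all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_score(pooled, normal, emphasized):
    """Log-likelihood-style gain from answering the emphasis question."""
    def h(x):  # differential-entropy-like term per sample
        return 0.5 * np.log(x.var() + 1e-8)
    return (len(pooled) * h(pooled)
            - len(normal) * h(normal)
            - len(emphasized) * h(emphasized))

# F0-like stream: emphasis shifts the distribution a lot
f0_nor = rng.normal(5.0, 0.3, 500)
f0_emp = rng.normal(5.6, 0.3, 500)
# spectrum-like stream: emphasis changes little
sp_nor = rng.normal(0.0, 1.0, 500)
sp_emp = rng.normal(0.05, 1.0, 500)

threshold = 10.0  # stand-in for the MDL penalty
f0_gain = split_score(np.concatenate([f0_nor, f0_emp]), f0_nor, f0_emp)
sp_gain = split_score(np.concatenate([sp_nor, sp_emp]), sp_nor, sp_emp)
print(f0_gain > threshold, sp_gain > threshold)
```

The F0-like stream is separated into normal and emphasized clusters, while the spectrum-like stream keeps a shared cluster, mirroring the behavior described above and the state counts reported in Section 4.2.1.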
4. Experimental evaluations
We conducted several experimental evaluations to
demonstrate the effectiveness of HMM-based
emphasized speech synthesis.
4.1. Experimental conditions
We used 450 sentences from normal speech data
and the same 450 sentences from emphasized speech
data described in Section 3.1 as training data. The
remaining 53 sentences of each data set were used for evaluation. The sampling frequency was set to 16 kHz.
The 0th through 39th mel-cepstral coefficients were
used as a spectral parameter. Log-scaled F0 was used
as an excitation parameter. STRAIGHT [12] was employed as the analysis-synthesis method. A speech parameter vector including the static features and their delta and delta-delta features was used. The frame shift was
set to 5 ms.
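The observation vectors can be sketched as below. The 3-point regression windows are common defaults; the exact window coefficients are not specified in the paper, so they are an assumption here.

```python
import numpy as np

def add_dynamic_features(static):
    """static: (T, D) array -> (T, 3*D) array [static, delta, delta-delta]."""
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")  # clamp edges
    delta = 0.5 * (padded[2:] - padded[:-2])                # 0.5*(c[t+1]-c[t-1])
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]  # c[t+1]-2c[t]+c[t-1]
    return np.hstack([static, delta, delta2])

# toy 1-D "parameter track" at 5 ms frames
feats = add_dynamic_features(np.arange(10.0).reshape(-1, 1))
```

With 40 mel-cepstral coefficients plus log F0, the same windows applied per dimension give the full observation vector used for HMM training.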
The individual models were trained using 450
sentences of normal speech and those of emphasized
speech separately. The mixed model was trained using
900 sentences consisting of both normal speech and
emphasized speech. We employed state-shared context-dependent phoneme HMMs (5-state left-to-right with no skips), whose model structures were determined by the decision-tree-based context clustering technique adopting the MDL criterion [10].
4.2. Experimental results
4.2.1. Model complexity. Table 1 shows the number
of HMM states for each feature component of the
individual models and the mixed model. We can see
that the number of HMM states of the emphasized
speech model is larger than that of the normal speech
model in every feature component of the individual models. One factor causing this increase in the number of HMM states is that the acoustic variations of emphasized speech are larger than those of normal speech. A comparison of the increasing rates among the feature components also shows that the rate for the F0 component is much larger than those for the other components. This result
suggests that the speaker who provided the speech data in this paper emphasizes an accentual phrase mainly by changing F0 widely.
The number of HMM states of the mixed model is
smaller than the total number of HMM states of
the individual models. This is because speech segments in
some contexts are effectively shared and modeled by
the same HMM states between normal speech and
emphasized speech in the mixed model. We can see
that the decreasing rate for the F0 component is much smaller than those for the other components. This is because only a small number of HMM states share speech segments with similar F0 parameters between normal speech and emphasized speech, since the F0 of emphasized speech is quite different from that of normal speech, as mentioned above.
Table 1. Number of HMM states for each component
of individual models and mixed model. “Inc. rate”
denotes an increasing rate of the number of HMM
states from the normal model to the emphasized model (“Emp. model”) in the individual models. “Dec. rate”
denotes a decreasing rate of the number of HMM
states from the individual models to the mixed model.
                   Individual models                     Mixed    Dec.
            Normal    Emp.     Inc.        Total        model    rate [%]
            model     model    rate [%]
Spectrum    536       594      10.8        1130         853      24.5
F0          920       1245     35.3        2165         1951     9.9
Duration    277       319      15.2        596          465      22.0
Total       1733      2158     24.5        3891         3269     16.0
4.2.2. Naturalness. We conducted a preference test
on naturalness of synthetic speech. Samples of normal
speech synthesized by the normal speech model (i.e.,
one individual model) and by the mixed model for
each test sentence were presented to listeners in
random order. Listeners were asked which sample
sounded more natural. Samples of emphasized speech
synthesized by the emphasized speech model (i.e., the
other individual model) and by the mixed model were
also evaluated in the same manner. Ten Japanese
listeners participated in the test. Each listener
evaluated 100 sentence pairs including 25 different
sentences randomly selected from the evaluation data.
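For reference, a preference score and its 95% confidence interval can be computed with a normal approximation to the binomial; the win count below is a made-up example, not the paper's result.

```python
import math

def preference_score(wins, trials):
    """Fraction of trials won, with a Wald 95% confidence interval."""
    p = wins / trials
    half = 1.96 * math.sqrt(p * (1 - p) / trials)
    return p, (p - half, p + half)

# e.g. 10 listeners x 100 pairs = 1000 trials, hypothetical win count
p, (lo, hi) = preference_score(620, 1000)
print(f"{100 * p:.1f}% (95% CI: {100 * lo:.1f}-{100 * hi:.1f}%)")
```

Non-overlap of such intervals is what justifies calling a preference difference significant.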
Figure 3 shows the preference scores. The mixed model synthesizes significantly more natural-sounding speech than the individual models for both normal speech and emphasized speech. This is
because the mixed model enables the use of a larger
amount of training data than each individual model by
effectively sharing speech segments with similar
acoustic characteristics between normal speech and
emphasized speech. Consequently, the mixed model
captures more variations of acoustic characteristics
compared with each individual model.
Fig. 3. Result of preference test on naturalness. Preference scores are in percent with 95% confidence intervals.

Fig. 4. Result of opinion test for the degree of emphasis. Mean opinion scores are on a 5-point scale with 95% confidence intervals.

These results suggest that 1) the mixed model is effective for synthesizing both normal speech and emphasized speech reasonably well and 2) its model size is relatively small.
4.2.3. Degree of emphasis. We conducted an opinion
test on the degree of emphasis. First, we showed listeners which accentual phrase was to be emphasized. Then, the listeners evaluated the degree of emphasis of the intended accentual phrase on a 5-point opinion scale (from 5: clearly emphasized to 1: not emphasized). Four types of
synthetic speech including normal speech and
emphasized speech synthesized by the individual
models and those by the mixed model were evaluated.
Ten Japanese listeners participated in the test. Each
listener evaluated 100 speech samples including 25
different sentences randomly selected from the
evaluation data.
Figure 4 shows the mean opinion scores. Both the individual models and the mixed model properly synthesize normal speech exhibiting a low degree of emphasis and emphasized speech exhibiting a high degree of emphasis. We can see that the mixed model produces slightly less emphasized speech than the individual model for emphasized speech. In the mixed model, it is possible to use some
HMM states trained using not only emphasized speech
data but also normal speech data for synthesizing the
emphasized accentual phrase. Although the use of such
HMM states slightly decreases the degree of emphasis,
the emphasized speech synthesized by the mixed
model still keeps the degree of emphasis high enough.
4.2.4. Correctness of emphasis. In order to
investigate whether or not the intended accentual
phrase is correctly emphasized, we conducted a
perceptual test. In this test, emphasized speech was
presented to listeners, who chose the accentual phrase they perceived as most prominent in each sentence. Three types of emphasized speech including
recorded speech, synthetic speech by the emphasized
speech model (i.e., one individual model), and that by
the mixed model were evaluated. Twelve Japanese
listeners participated in the test. Each listener
evaluated 102 speech samples including 17 different
sentences randomly selected from the evaluation data.
The percentages of emphasized phrases correctly
chosen by listeners are 99.51% for the recorded
emphasized speech, 97.30% for the emphasized
speech synthesized by the emphasized speech model,
and 93.56% for the emphasized speech synthesized by
the mixed model. Although the correct rates for synthetic speech are still lower than that for the recorded speech, HMM-based speech synthesis allows us to emphasize the intended accentual phrase correctly at a relatively high rate (> 90%).
4.2.5. Impact of individual speech features on
degree of emphasis. To demonstrate which feature
component has to be controlled for synthesizing
emphasized speech, we conducted another opinion test
on the degree of emphasis. We generated speech
parameters such as spectrum, F0, and HMM state
duration of each of normal speech and emphasized
speech using the mixed model. Then, eight types of speech were synthesized with the corresponding speech parameters: all combinations of normal/emphasized speech for the spectral/F0/duration
components, as shown in Fig. 5. Twelve Japanese
listeners participated in the test. Each listener
evaluated 104 speech samples including 13 different
sentences randomly selected from the evaluation data.
The other conditions were the same as in Section 4.2.3.
Figure 5 shows the result of the opinion test. We can see that the F0 and duration components strongly affect the degree of emphasis, whereas the impact of the spectral component is very limited. We can also see that
simultaneous control of both F0 and duration is very
effective for yielding significant improvements in the
degree of emphasis.
Spectrum:  Nor  Emp  Nor  Emp  Nor  Emp  Nor  Emp
F0:        Nor  Nor  Emp  Emp  Nor  Nor  Emp  Emp
Duration:  Nor  Nor  Nor  Nor  Emp  Emp  Emp  Emp
Fig. 5. Result of opinion test for the degree of
emphasis depending on individual speech parameter
components. “Nor” denotes synthesized parameters of
normal speech and “Emp” denotes synthesized
parameters of emphasized speech.
5. Conclusions
This paper has described a statistical approach to
synthesizing emphasized speech based on HMM-based
speech synthesis. In order to effectively model acoustic
characteristics of emphasized speech with context-dependent HMMs, we have proposed additional
contextual factors describing emphasis information.
We have evaluated two types of models for
synthesizing normal speech and emphasized speech;
one is individual models trained using each of normal
speech data and emphasized speech data separately;
and the other is a mixed model trained using both of
them simultaneously. Experimental results have
demonstrated that 1) HMM-based speech synthesis
works reasonably well in emphasized speech synthesis
and 2) the mixed model allows more compact models
generating more naturally sounding but slightly less
emphasized speech compared with the individual
models.
6. Acknowledgement
The authors are grateful to Prof. Hideki Kawahara
of Wakayama University, Japan, for permission to use
the STRAIGHT analysis-synthesis method.
7. References
[1] Y. Sagisaka, “Speech synthesis by rule using an optimal
selection of non-uniform synthesis units,” Proc. of ICASSP,
pp. 679-682, New York, USA, Apr. 1988.
[2] A.J. Hunt and A.W. Black, “Unit selection in a
concatenative speech synthesis system using a large speech
database,” Proc. of ICASSP, pp. 373-376, Atlanta, USA,
May 1996.
[3] J.F. Pitrelli, R. Bakis, E.M. Eide, R. Fernandez, W.
Hamza, and M.A. Picheny, “The IBM expressive Text-to-Speech synthesis system for American English,” IEEE Trans. Speech and Audio Processing, Vol. 14, No. 4, pp. 1099-1108, 2006.
[4] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi, “A
style control technique for HMM-based expressive speech
synthesis,” IEICE Trans. Inf. and Syst., Vol. E90-D, No. 9,
pp. 1406-1413, 2007.
[5] A. Iida, N. Campbell, F. Higuchi, and M. Yasumura, “A
corpus-based speech synthesis system with emotion,”
Speech Communication, Vol. 40, No. 1-2, pp. 161-187, 2003.
[6] R. Tsuzuki, H. Zen, K. Tokuda, T. Kitamura, M. Bulut,
and S.S. Narayanan, “Constructing emotional speech
synthesizers with limited speech database,” Proc. of
INTERSPEECH, pp. 1185-1188, Jeju, Korea, Oct. 2004.
[7] V. Strom, R. Clark, and S. King, “Expressive prosody for
unit-selection speech synthesis,” Proc. of INTERSPEECH,
pp. 1296-1299, Pittsburgh, USA, Sep. 2006.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and
T. Kitamura, “Simultaneous modeling of spectrum, pitch and
duration in HMM-based speech synthesis,” Proc. of
EUROSPEECH, pp. 2347-2350, Budapest, Hungary, Sep.
1999.
[9] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi,
“Multi-space probability distribution HMM,” IEICE Trans.
Inf. and Syst., Vol. E85-D, No. 3, pp. 455-464, 2002.
[10] K. Shinoda and T. Watanabe, “MDL-based context-dependent subword modeling for speech recognition,” J. Acoust. Soc. Jpn. (E), Vol. 21, No. 2, pp. 79-86, 2000.
[11] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi,
and T. Kitamura, “Speech parameter generation algorithms
for HMM-based speech synthesis,” Proc. of ICASSP, pp.
1315-1318, Istanbul, Turkey, June 2000.
[12] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds,” Speech Communication, Vol. 27, No. 3-4, pp. 187-207, 1999.