INTERSPEECH 2007
A trainable excitation model for HMM-based speech synthesis
R. Maia†, T. Toda‡, H. Zen††, Y. Nankaku††, K. Tokuda†,††
†National Inst. of Inf. and Comm. Tech. (NICT) / ATR Spoken Language Comm. Labs, Japan
‡Nara Institute of Science and Technology, Japan
††Nagoya Institute of Technology, Japan
[email protected], [email protected], {zen,nankaku,tokuda}@sp.nitech.ac.jp
Abstract

This paper introduces a novel excitation approach for speech synthesizers in which the final waveform is generated from parameters directly obtained from Hidden Markov Models (HMMs). Despite the attractiveness of the HMM-based speech synthesis technique, namely the utilization of small corpora and flexibility in achieving different voice styles, synthesized speech presents a characteristic buzziness caused by the simple excitation model employed during speech production. This paper presents an innovative scheme in which mixed excitation is modeled through closed-loop training of a set of state-dependent filters and pulse trains, with minimization of the error between excitation and residual sequences. The proposed method shows effectiveness, yielding synthesized speech with quality far superior to the simple excitation baseline and comparable to the best excitation schemes thus far reported for HMM-based speech synthesis.

Index Terms: Speech Processing, Speech Synthesis, HMM.

1. Introduction

In the last years the speech synthesis technique in which the final waveform is generated by parameters directly obtained from Hidden Markov Models (HMMs) [1] has emerged as a good choice for synthesizing speech with different voice styles and characteristics, e.g. [2]. Nevertheless, speech synthesized by this technique presents a certain degree of unnaturalness due to the waveform generation part, which consists of a source-filter model wherein the excitation is assumed to be either a periodic pulse train or a white noise sequence. Consequently, the same kind of artifacts that are usually observed in linear prediction (LP) vocoding [3] are noticed in the synthesized speech.

Although many attempts have been made to solve the unnaturalness problem of LP vocoders, the most successful reported approach has been the one in [4], which eventually became the Mixed Excitation Linear Prediction (MELP) algorithm [3]. Focusing on this idea, an excitation scheme for HMM-based synthesis was proposed in [5], which concerned the modeling of MELP parameters jointly with mel-cepstral coefficients and F0. Following the same direction, in order to utilize the high-quality vocoding technique described in [6], the modeling of aperiodicity components with consequent application to HMM-based synthesis was performed by Zen et al. [7]. Approaches related to harmonic plus noise and sinusoidal models have also been reported, e.g. [8]. The common aspect among these methods is that they are based on the implementation of an excitation model through the utilization of some special parameters modeled by HMMs. Ideas related to the minimization of the distortion between artificial excitation and speech residual have not been exploited until now.

In the method described in this paper, mixed excitation is produced by inputting a pulse train and white noise into two state-dependent filters. These specific states may be represented, for instance, by leaves of phonetic decision-trees. The filters are derived so as to maximize the likelihood of residual sequences over the corresponding states through an iterative process. Aside from filter determination, the amplitudes and positions of the pulse trains are also optimized in the sense of residual likelihood maximization during the referred closed-loop training. Although some analysis-by-synthesis methods (similar to Code-Excited Linear Prediction (CELP) speech coding algorithms [3]) have already been proposed for unit concatenation-based speech synthesis, e.g. [9], the present approach targets residual instead of speech and considers the error of the system as the unvoiced component for the waveform generation part.

This paper is organized as follows: Section 2 outlines the proposed method; Section 3 presents some experiments; and the conclusions are in Section 4.

2. Mixed excitation by residual modeling
2.1. Mixed Excitation

The proposed excitation model is depicted in Figure 1. The input pulse train, t(n), and white noise sequence, w(n), are filtered through Hv(z) and Hu(z), respectively, and added together to result in the excitation signal e(n). The voiced and unvoiced filters, Hv(z) and Hu(z), respectively, are associated with each HMM state s = {1,...,S}, as depicted in Figure 1, and their transfer functions are given by

H_v^s(z) = \sum_{l=-M/2}^{M/2} h_s(l) z^{-l},   (1)

H_u^s(z) = \frac{K_s}{1 - \sum_{l=1}^{L} g_s(l) z^{-l}},   (2)

where M and L are the respective orders.
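As a concrete illustration of (1) and (2), the per-state construction of e(n) can be sketched in NumPy. This is a minimal sketch, not the authors' implementation; the function name and arguments are hypothetical, and h holds the taps h_s(-M/2),...,h_s(M/2).

```python
import numpy as np

def mixed_excitation(t, h, g, K, rng=None):
    """Sketch of e(n) = Hv(z)t(n) + Hu(z)w(n) for a single state.

    t : pulse train (nonzero entries are pulse amplitudes at pulse positions)
    h : voiced FIR taps h(-M/2..M/2), length M+1, cf. eq. (1)
    g : unvoiced AR coefficients g(1..L), cf. eq. (2)
    K : unvoiced gain
    """
    rng = np.random.default_rng(0) if rng is None else rng
    M = len(h) - 1
    # Non-causal FIR of eq. (1): full convolution, re-centered so that the
    # tap at index M/2 corresponds to l = 0.
    v = np.convolve(t, h)[M // 2 : M // 2 + len(t)]
    # All-pole filtering of white noise: u(n) = K*w(n) + sum_l g(l) u(n-l)
    w = rng.standard_normal(len(t))
    u = np.zeros(len(t))
    L = len(g)
    for n in range(len(t)):
        acc = 0.0
        for l in range(1, min(L, n) + 1):
            acc += g[l - 1] * u[n - l]
        u[n] = K * w[n] + acc
    return v + u
```

With K = 0 and h(n) = δ(n) the excitation reduces to the pulse train itself, which is the initialization used later in the closed-loop training.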
2.1.1. Function of Hv(z)

The function of the voiced filter Hv(z) is to transform the input pulse train t(n), yielding the signal v(n), which is intended to be as similar as possible to the residual sequence e(n). Because pulses are mostly considered in the voiced regions, v(n) is referred to as the voiced excitation component. The property of having a finite impulse response leads to stability and phase information retention. Further, since the final waveform is synthesized offline, an anti-causal structure is advantageous in terms of resolution.
August 27-31, Antwerp, Belgium
Figure 1: Proposed excitation scheme for HMM-based speech synthesis: filters Hv(z) and Hu(z) are associated with each state (mel-cepstral coefficients, F0, and state durations are generated from the HMMs).
2.1.2. Function of Hu(z)

Because white noise is assumed to be the input of the unvoiced filter, the function of Hu(z) is thus to weight the noise - in terms of spectral shape and power - which is eventually added to the voiced excitation v(n). The reason for the all-pole structure is elucidated in Section 2.3.

2.2. Problem formulation

Through the re-arrangement of the excitation construction block of Figure 1, Figure 2 can be obtained. In this case pulse train and speech residual correspond to the input of the system whereas white noise is the output, as a result of the filtering of u(n) through the inverse unvoiced filter G(z).

Figure 2: Re-arrangement of the excitation part: pulse train and residual are the input while white noise is the output.

By observing the system shown in Figure 2, an analogy with analysis-by-synthesis speech coders [3] can be made as follows. The target signal is represented by the residual e(n), the error of the system is w(n), and the terms whose incremental modification can minimize the power of w(n) are the filters and pulse train. Therefore, according to this interpretation, the problem of achieving an excitation signal whose waveform is as close as possible to the residual consists in the design of Hv(z) and Hu(z), and optimization of the positions, {p_1,...,p_Z}, and amplitudes, {a_1,...,a_Z}, of t(n).

2.3. Filter determination

2.3.1. Likelihood of e(n) given the excitation model

The likelihood of the residual vector e = [e(0) ... e(N-1)]^T, with [.]^T meaning transposition and N being the whole database length in number of samples(1), given the voiced excitation vector v = [v(0) ... v(N-1)]^T and G, is

P[e | v, G] = \frac{1}{\sqrt{(2\pi)^N |G^T G|^{-1}}} e^{-\frac{1}{2}(e-v)^T G^T G (e-v)},   (3)

where G is an N x N matrix containing the overall impulse response of the inverse unvoiced filter G(z) including all the states {1,...,S}, with S being the number of states contained in the entire database. Since the filters are state-dependent, v can be written as

v = \sum_{s=1}^{S} A_s h_s = A_1 h_1 + ... + A_S h_S,   (4)

where h_s = [h_s(-M/2) ... h_s(M/2)]^T is the impulse response vector of the voiced filter for state s, and the term A_s is the overall pulse train matrix where only pulse positions belonging to state s are non-zero.

After substituting (4) into (3), and taking the logarithm, the following expression can be obtained for the log likelihood of e, given the filters Hv(z) and Hu(z) and pulse train t(n),

\log P[e | H_v(z), H_u(z), t(n)] = -\frac{N}{2}\log(2\pi) + \frac{1}{2}\log|G^T G| - \frac{1}{2}\left[e - \sum_{s=1}^{S} A_s h_s\right]^T G^T G \left[e - \sum_{s=1}^{S} A_s h_s\right].   (5)

(1) The entire database is considered to be contained in a single vector.
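The structure of the pulse-train matrix A_s in (4) can be illustrated with a small sketch: column l of A_s holds the state-s pulse train shifted by l - M/2, so that A_s h_s is the convolution of the pulses with the voiced filter. The helper below is hypothetical (single state, database-level bookkeeping omitted).

```python
import numpy as np

def pulse_matrix(N, M, positions, amplitudes):
    """Build the N x (M+1) pulse-train matrix of eq. (4) for one state.

    A[n, l + M/2] = t(n - l): a pulse of amplitude a at position p
    contributes a at row p + l for each tap l = -M/2..M/2, so that
    A @ h equals the voiced excitation v(n) for this state.
    """
    A = np.zeros((N, M + 1))
    for p, a in zip(positions, amplitudes):
        for col in range(M + 1):
            row = p + col - M // 2  # shift for tap l = col - M/2
            if 0 <= row < N:
                A[row, col] = a
    return A
```

Multiplying A by a filter vector h places a scaled, shifted copy of h around each pulse, which is exactly the voiced component v(n) of (4).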
2.3.2. Determination of Hv(z)

For a given state s, the corresponding vector of coefficients h_s which maximizes (5) is determined from

\frac{\partial \log P[e | H_v(z), H_u(z), t(n)]}{\partial h_s} = 0.   (6)

The expression above results in

A_s^T G^T G \left[ e - \sum_{j=1}^{S} A_j h_j \right] = 0,   (7)

which corresponds to the least-squares formulation for filter design by the solution of an over-determined linear system [10].
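Solving (7) is a weighted least-squares problem on the over-determined system G A h ≈ G e. A minimal NumPy sketch, assuming a single state with precomputed A (the function name is hypothetical, and G defaults to the identity):

```python
import numpy as np

def voiced_filter_ls(A, e, G=None):
    """Least-squares solve of eq. (7) for one state:
    A^T G^T G A h = A^T G^T G e, i.e. lstsq on G A h ~ G e.
    With G = I this is ordinary least squares on A h ~ e."""
    if G is None:
        G = np.eye(len(e))
    GA, Ge = G @ A, G @ e
    # lstsq solves the normal equations robustly via SVD
    h, *_ = np.linalg.lstsq(GA, Ge, rcond=None)
    return h
```

With several states, (7) couples the h_s through the cross terms A_s^T G^T G A_j, so the full system stacks all states' unknowns into one block linear system.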
2.3.3. Determination of Hu(z)

Based on the assumption that w(n) is white noise, the function of Hu(z) is thus to remove long and short-term correlation from the unvoiced excitation u(n). For such purpose the all-pole structure based on LP coefficients shown in (2), with large L, seems to be a good choice, due to its simplicity in terms of computational complexity. Therefore, the coefficients of the unvoiced filter for state s, {g_s(1),...,g_s(L)}, and the related gain, K_s, are determined through autoregressive spectral estimation [11] of u(n) over speech segments belonging to state s.
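One common member of the autoregressive estimation family in [11] is the autocorrelation method with the Levinson-Durbin recursion; the paper does not specify which estimator is used, so the sketch below is an assumption, not the authors' choice.

```python
import numpy as np

def ar_estimate(u, L):
    """Autocorrelation-method AR estimate: returns g(1..L) and gain K
    such that Hu(z) = K / (1 - sum_l g(l) z^-l), cf. eq. (2)."""
    # Biased autocorrelation r(0..L)
    r = np.correlate(u, u, mode="full")[len(u) - 1 : len(u) + L] / len(u)
    a = np.zeros(L)       # predictor coefficients g(1..L)
    E = r[0]              # prediction error power
    for i in range(L):    # Levinson-Durbin recursion
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / E
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[i - 1::-1]
        a = a_new
        E *= (1.0 - k * k)
    return a, np.sqrt(E)
```

For a first-order process u(n) = 0.5 u(n-1) + w(n) with unit-variance white w(n), the recursion should recover g(1) ≈ 0.5 and K ≈ 1.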
2.4. Pulse optimization

Aside from the determination of the filters, the positions and amplitudes of t(n), {p_1,...,p_Z} and {a_1,...,a_Z}, with Z being the number of pulses of the entire training database, are modified in the sense of minimizing the mean squared error of the system of Figure 2, given by

\epsilon = \frac{1}{2} w^T w = \frac{1}{2} \left[ e - \sum_{s=1}^{S} A_s h_s \right]^T G^T G \left[ e - \sum_{s=1}^{S} A_s h_s \right].   (8)

In fact, it can be verified that the minimization of (8) corresponds to the maximization of (5) when G(z) is minimum phase, which is assured by the assumption that Hu(z) is stable.

The procedure for calculating the positions and amplitudes resembles multipulse excitation linear prediction coding algorithms [3]. For the present case the search is performed for each pulse position {p_1,...,p_Z}.
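A single step of such a multipulse-style search can be sketched as follows: for each candidate position the optimal amplitude has a closed form, and the position with the lowest resulting squared error is kept. G(z) is omitted (G = I) for brevity and the names are hypothetical; this illustrates the classical technique rather than the paper's exact procedure.

```python
import numpy as np

def optimize_pulse(e_res, h, M):
    """Find the position p and amplitude a of one pulse minimizing
    |e_res - a * c_p|^2, where c_p is a unit pulse at p filtered by h."""
    N = len(e_res)
    best = (0, 0.0, np.inf)
    for p in range(N):
        # contribution of a unit pulse at p: h shifted to p, clipped at edges
        c = np.zeros(N)
        lo, hi = max(0, p - M // 2), min(N, p + M // 2 + 1)
        c[lo:hi] = h[lo - p + M // 2 : hi - p + M // 2]
        denom = c @ c
        if denom == 0.0:
            continue
        a = (e_res @ c) / denom               # closed-form optimal amplitude
        err = e_res @ e_res - a * (e_res @ c)  # resulting squared error
        if err < best[2]:
            best = (p, a, err)
    return best[:2]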
2.5. Recursive algorithm

The overall procedure for the determination of the filters Hv(z) and Hu(z), and optimization of the positions and amplitudes of t(n), is depicted in Figure 3. Pitch marks may represent the best choice to construct the initial pulse trains t(n). Furthermore, initialization of the voiced filters as the unit sample sequence, h_s(n) = δ(n), consists in a good approximation considering that the pitch marks may indicate the actual pitch pulses in e(n). The variation of the voiced filter coefficients is taken into account as the convergence criterion.
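The recursive procedure can be sketched end-to-end on toy data: alternate the least-squares filter fit and the amplitude re-fit, starting from h_s(n) = δ(n), until the filter coefficients stop changing. This is a simplified single-state sketch with G = I and fixed pulse positions (as if taken from pitch marks), not the authors' implementation.

```python
import numpy as np

def closed_loop_train(e, positions, M, iters=10):
    """Toy sketch of the recursive procedure of Figure 3 (one state, G = I):
    alternate (i) filter fit given the pulse train and (ii) amplitude fit
    given the filter; stop when the filter coefficients converge."""
    N = len(e)
    h = np.zeros(M + 1)
    h[M // 2] = 1.0                      # initialization h_s(n) = delta(n)
    amps = np.ones(len(positions))       # initial unit amplitudes
    for _ in range(iters):
        # pulse-train matrix (cf. eq. 4): A @ h = voiced excitation
        A = np.zeros((N, M + 1))
        for p, a in zip(positions, amps):
            for col in range(M + 1):
                row = p + col - M // 2
                if 0 <= row < N:
                    A[row, col] = a
        h_new, *_ = np.linalg.lstsq(A, e, rcond=None)   # eq. (7) with G = I
        # amplitude re-fit: columns are h_new shifted to each pulse position
        B = np.zeros((N, len(positions)))
        for j, p in enumerate(positions):
            lo, hi = max(0, p - M // 2), min(N, p + M // 2 + 1)
            B[lo:hi, j] = h_new[lo - p + M // 2 : hi - p + M // 2]
        amps, *_ = np.linalg.lstsq(B, e, rcond=None)
        if np.allclose(h_new, h, atol=1e-8):             # convergence criterion
            break
        h = h_new
    return h, amps
```

On a residual actually generated by pulses filtered through some h, the loop reconstructs the target almost exactly after a couple of iterations.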
3. Experiment

In order to verify the effectiveness of the proposed excitation approach, the English speech database CMU ARCTIC [12], female speaker SLT, was utilized. For the present case, the states {1,...,S} were regarded as leaves of phonetic decision-trees generated for mel-cepstral coefficients. The choice of the states was made upon the assumption that residual signals are highly correlated with their corresponding spectral parameters [13]. The trees were constructed with the utilization of phone-based questions, and the clustering parameters were adjusted to derive small trees in order to group general phonetic characteristics within each terminal node. Under these conditions, the number of states obtained at the end of the clustering process was a hundred and thirty-one, i.e., S = 131.

Figure 3: Closed-loop training for the determination of the excitation model (derive residual signals e(n) from the speech corpus; initialize pulse trains t(n) from pitch marks; initialize voiced filters h_s(n) = δ(n), s = 1,...,S; recursive procedure).

Residual segmentation was performed as follows. First, Viterbi alignment of the database was done using the usual context-dependent models yielded by the training of the HMM-based synthesizer. After that, the contextual labels of the resulting segments were mapped onto the states of the above-described small phonetic trees. Finally, the appropriately tagged segments were used to train the excitation model. Filter orders were M = 512 and L = 256.

3.1. Effect of the closed-loop training
The right side of Figure 4 shows a segment of natural speech with three corresponding versions synthesized by using: pulse train/white noise switch (simple excitation) - second row; the mixed excitation approach utilized in the HMM-based speech synthesizer for the Blizzard Challenge 2005 [7] - third row; the proposed excitation model - fourth row. The related sources are depicted in the left side of the figure. One can observe that, among the excitation and synthesized speech versions, the proposed method best approaches the residual and natural speech waveforms, as an effect of the closed-loop training in which phase information is also modeled by the voiced filters.
3.2. Subjective evaluation

A subjective evaluation was performed with six subjects. Two of the listeners were speech synthesis specialists, whereas one of the other four subjects had English as native language. The test consisted in an AB forced preference, with ten utterances randomly selected out of forty sentences taken from the BTEC corpus [14]. The order in which the utterances were played within each test pair was also randomized. The experiment was executed in a quiet room with the utilization of headphones.
Figure 4: Top: residual (left) and natural speech (right). Second row: simple excitation (left) and corresponding synthesized speech (right). Third row: excitation according to [7] (left) and corresponding synthesized speech (right). Fourth row: excitation given by the proposed method (left) and respective synthesized speech (right).
Figure 5: Overall preference for each excitation version.
Figure 5 shows the listeners' preference. It can be seen that the proposed approach is equivalent to the Blizzard Challenge 2005 HMM-based synthesizer in terms of quality. In fact, the question asked during the test was: "which utterance from the pair presents better quality?". Though most of the time the difference between the proposed and Blizzard versions was noticed, the decision concerning which one presented better quality was usually difficult to make, according to the listeners.
4. Conclusions

The proposed method synthesizes speech with quality considerably better than the simple excitation baseline. Furthermore, when compared with one of the best approaches reported to eliminate the buzziness of HMM-based speech synthesis, the Blizzard Challenge 2005 system as described in [7], the proposed model presents the advantage of producing speech which seems closer to natural. The results have been promising, and future steps towards the conclusion of this work include pulse train modeling for the waveform generation part and state clustering using the residual signals themselves.
5. References

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. of EUROSPEECH, 1999.
[2] J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, "Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis," in Proc. of ICASSP, 2004.
[3] W. Chu, Speech Coding Algorithms. Wiley-Interscience, 2003.
[4] A. McCree and T. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, July 1995.
[5] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed-excitation for HMM-based speech synthesis," in Proc. of EUROSPEECH, 2001.
[6] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, Apr. 1999.
[7] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis for Blizzard Challenge 2005," IEICE Trans. on Inf. and Systems, vol. E90-D, no. 1, 2007.
[8] S. J. Kim and M. Hahn, "Two-band excitation for HMM-based speech synthesis," IEICE Trans. Inf. & Syst., vol. E90-D, 2007.
[9] M. Akamine and T. Kagoshima, "Analytic generation of synthesis units by closed loop training for totally speaker driven text to speech system (TOS drive TTS)," in Proc. ICSLP, 1998.
[10] L. B. Jackson, Digital Filters and Signal Processing. Kluwer Academic, 1996.
[11] S. Kay, Modern Spectral Estimation. USA: Prentice-Hall, 1988.
[12] http://festvox.org/cmu_arctic.
[13] H. Duxans and A. Bonafonte, "Residual conversion versus prediction on voice morphing systems," in Proc. of ICASSP, 2006.
[14] http://www.atr.jp/.../2004_whatsnew/archives/000782.html