INTERSPEECH 2007, August 27-31, Antwerp, Belgium

A trainable excitation model for HMM-based speech synthesis

R. Maia†, T. Toda‡, H. Zen††, Y. Nankaku††, K. Tokuda†,††

†National Inst. of Inf. and Comm. Tech. (NICT) / ATR Spoken Language Comm. Labs, Japan
‡Nara Institute of Science and Technology, Japan
††Nagoya Institute of Technology, Japan

[email protected], [email protected], {zen,nankaku,tokuda}@sp.nitech.ac.jp

Abstract

This paper introduces a novel excitation approach for speech synthesizers in which the final waveform is generated through parameters directly obtained from Hidden Markov Models (HMMs). Despite the attractiveness of the HMM-based speech synthesis technique, namely the utilization of small corpora and flexibility in achieving different voice styles, synthesized speech presents a characteristic buzziness caused by the simple excitation model employed during speech production. This paper presents an innovative scheme in which mixed excitation is modeled through closed-loop training of a set of state-dependent filters and pulse trains, with minimization of the error between excitation and residual sequences. The proposed method shows effectiveness, yielding synthesized speech with quality far superior to the simple excitation baseline and comparable to the best excitation schemes thus far reported for HMM-based speech synthesis.

Index Terms: Speech Processing, Speech Synthesis, HMM.

1. Introduction

In the last years the speech synthesis technique in which the final waveform is generated by parameters directly obtained from Hidden Markov Models (HMMs) [1] has emerged as a good choice for synthesizing speech with different voice styles and characteristics, e.g. [2]. Nevertheless, speech synthesized by this technique presents a certain degree of unnaturalness due to the waveform generation part, which consists in a source-filter model wherein the excitation is assumed to be either a periodic pulse train or a white noise sequence. Consequently, the same kind of artifacts usually observed in linear prediction (LP) vocoding [3] are noticed in the synthesized speech.

Although many attempts have been made to solve the unnaturalness problem of LP vocoders, the most successful reported approach has been the one in [4], which eventually became the Mixed Excitation Linear Prediction (MELP) coder [3]. Focusing on this idea, an excitation scheme for HMM-based synthesis was proposed in [5], which concerned the modeling of MELP parameters jointly with mel-cepstral coefficients and F0. Following the same direction, in order to utilize the high-quality vocoding technique described in [6], the modeling of aperiodicity components with consequent application to HMM-based synthesis was performed by Zen et al. [7]. Approaches related to harmonic plus noise and sinusoidal models have also been reported, e.g. [8]. The common aspect among these methods is that they implement an excitation model through some special parameters modeled by HMMs. Ideas related to the minimization of the distortion between artificial excitation and speech residual have not been exploited until now.

In the method described in this paper, mixed excitation is produced by inputting a pulse train and white noise into two state-dependent filters. These specific states may be represented, for instance, by leaves of phonetic decision-trees. The filters are derived so as to maximize the likelihood of residual sequences over the corresponding states through an iterative process. Aside from filter determination, the amplitudes and positions of the pulse trains are also optimized in the sense of residual likelihood maximization during the referred closed-loop training. Although some analysis-by-synthesis methods (similar to Code Excited Linear Prediction (CELP) speech coding algorithms [3]) have already been proposed for unit concatenation-based speech synthesis, e.g. [9], the present approach targets the residual instead of the speech and considers the error of the system as the unvoiced component for the waveform generation part.

This paper is organized as follows: Section 2 outlines the proposed method; Section 3 presents some experiments; and the conclusions are in Section 4.

2. Mixed excitation by residual modeling

2.1. The idea

The proposed excitation model is depicted in Figure 1. The input pulse train, t(n), and white noise sequence, w(n), are filtered through Hv(z) and Hu(z), respectively, and added together to form the excitation signal e(n). The voiced and unvoiced filters, Hv(z) and Hu(z), respectively, are associated with each HMM state s = {1,...,S}, as depicted in Figure 1, and their transfer functions are given by

  H_v^s(z) = \sum_{l=-M/2}^{M/2} h_s(l) z^{-l},        (1)

  H_u^s(z) = \frac{K_s}{1 - \sum_{l=1}^{L} g_s(l) z^{-l}},        (2)

where M and L are the respective orders.

2.1.1. Function of Hv(z)

The function of the voiced filter Hv(z) is to transform the input pulse train t(n), yielding the signal v(n), which is intended to be as similar as possible to the residual sequence e(n). Because pulses are mostly considered in the voiced regions, v(n) is referred to as the voiced excitation component. The property of having a finite impulse response leads to stability and phase information retention. Further, since the final waveform is synthesized offline, an anti-causal structure is advantageous in terms of resolution.
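As a rough illustration of Eqs. (1)-(2), the excitation construction of Figure 1 can be sketched in a few lines of NumPy/SciPy. This is a toy, single-state sketch, not the authors' implementation; the function name and the filter values are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def mixed_excitation(pulse_train, noise, h, g, K):
    """Toy single-state excitation: e(n) = Hv{t(n)} + Hu{w(n)}."""
    # Eq. (1): non-causal FIR voiced filter. The taps h run over lags
    # -M/2..M/2, so np.convolve with mode='same' keeps them centred.
    voiced = np.convolve(pulse_train, h, mode='same')
    # Eq. (2): all-pole unvoiced filter Hu(z) = K / (1 - sum_l g(l) z^-l).
    unvoiced = lfilter([K], np.concatenate(([1.0], -np.asarray(g))), noise)
    return voiced + unvoiced

# Toy usage: a 10-sample pitch period and mild AR(2) noise shaping.
rng = np.random.default_rng(0)
t = np.zeros(200)
t[::10] = 1.0                              # pulse train t(n)
w = rng.standard_normal(200)               # white noise w(n)
h = np.array([0.1, 0.5, 1.0, 0.5, 0.1])    # M = 4: taps for lags -2..2
g = np.array([0.5, -0.2])                  # L = 2
e = mixed_excitation(t, w, h, g, K=0.3)
print(e.shape)                             # (200,)
```

In a full system the filters and the pulse train would of course change with the state sequence; the sketch only shows how the two branches combine.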
Figure 1: Proposed excitation scheme for HMM-based speech synthesis: filters Hv(z) and Hu(z) are associated with each state. [Diagram: generated mel-cepstral coefficients and F0 drive the per-state filters H_v^s(z) and H_u^s(z), which build the excitation.]

2.1.2. Function of Hu(z)

Because white noise is assumed to be the input of the unvoiced filter, the function of Hu(z) is to weight the noise - in terms of spectral shape and power - which is eventually added to the voiced excitation v(n). The reason for the all-pole structure is elucidated in Section 2.3.

2.2. Problem formulation

Through the re-arrangement of the excitation construction block of Figure 1, Figure 2 can be obtained. In this case the pulse train and the speech residual correspond to the input of the system whereas white noise is the output, as a result of the filtering of u(n) through the inverse unvoiced filter G(z).

By observing the system shown in Figure 2, an analogy with analysis-by-synthesis speech coders [3] can be made as follows. The target signal is represented by the residual e(n), the error of the system is w(n), and the terms whose incremental modification can minimize the power of w(n) are the filters and the pulse train. Therefore, according to this interpretation, the problem of achieving an excitation signal whose waveform is as close as possible to the residual consists in the design of Hv(z) and Hu(z), and the optimization of the positions, {p_1,...,p_Z}, and amplitudes, {a_1,...,a_Z}, of t(n).

2.3. Filter determination

2.3.1. Likelihood of e(n) given the excitation model

The likelihood of the residual vector e = [e(0) ... e(N-1)]^T, with [.]^T meaning transposition and N being the whole database length in number of samples, given the voiced excitation vector v = [v(0) ... v(N-1)]^T and G, is

  P[e | v, G] = \frac{1}{\sqrt{(2\pi)^N |(G^T G)^{-1}|}} \exp\left\{ -\frac{1}{2} (e - v)^T G^T G (e - v) \right\},        (3)
where G is an N x N matrix containing the overall impulse response of the inverse unvoiced filter G(z) including all the states {1,...,S}, with S being the number of states contained in the entire database (the entire database is considered to be contained in a single vector). Since the filters are state-dependent, v can be written as

  v = \sum_{s=1}^{S} A_s h_s = A_1 h_1 + ... + A_S h_S,        (4)

where h_s = [h_s(-M/2) ... h_s(M/2)]^T is the impulse response vector of the voiced filter for state s, and A_s is the overall pulse train matrix in which only pulse positions belonging to state s are non-zero. After substituting (4) into (3) and taking the logarithm, the following expression can be obtained for the log likelihood of e, given the filters Hv(z) and Hu(z) and the pulse train t(n):

  \log P[e | H_v(z), H_u(z), t(n)] = -\frac{N}{2}\log 2\pi + \frac{1}{2}\log|G^T G| - \frac{1}{2}\Big(e - \sum_{s=1}^{S} A_s h_s\Big)^T G^T G \Big(e - \sum_{s=1}^{S} A_s h_s\Big).        (5)

Figure 2: Re-arrangement of the excitation part: pulse train and residual are the input while white noise is the output. [Diagram: the residual e(n) (target signal) and the pulse train t(n) with positions p_1, p_2, p_3, ... enter the system; filtering the unvoiced excitation through the inverse unvoiced filter G(z) yields the white-noise error w(n).]

2.3.2. Determination of Hv(z)

For a given state s, the corresponding vector of coefficients which maximizes (5) is determined from

  \frac{\partial \log P[e | H_v(z), H_u(z), t(n)]}{\partial h_s} = 0.        (6)

The expression above results in

  A_s^T G^T G A_s h_s = A_s^T G^T G e,        (7)

which corresponds to the least-squares formulation for filter design through the solution of an over-determined linear system [10].

[Figure 3 (flowchart): derive residual signals e(n) from the speech corpus; initialize the pulse trains t(n) from pitch marks; initialize the voiced filters as h_s(n) = δ(n), s = 1,...,S; then iterate the recursive procedure.]
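For a single state with G taken as the identity (i.e., ignoring the unvoiced whitening), the normal equations (7) reduce to ordinary least squares on a pulse-train convolution matrix. A minimal NumPy sketch under those simplifying assumptions (names illustrative, not the authors' code):

```python
import numpy as np

def estimate_voiced_filter(e, pulse_train, M, G=None):
    """Least-squares solve of Eq. (7) for one state:
    (A^T G^T G A) h = A^T G^T G e.

    Column k of A is the pulse train delayed by lag (k - M/2), so that
    A @ h realizes v(n) = sum_l h(l) t(n - l), l = -M/2..M/2.
    G defaults to the identity (no unvoiced whitening) in this sketch.
    """
    N = len(e)
    A = np.zeros((N, M + 1))
    for k in range(M + 1):
        lag = k - M // 2
        if lag >= 0:
            A[lag:, k] = pulse_train[:N - lag]
        else:
            A[:N + lag, k] = pulse_train[-lag:]
    if G is not None:
        A, e = G @ A, G @ e
    h, *_ = np.linalg.lstsq(A, e, rcond=None)
    return h

# Toy check: a residual generated by a known 5-tap filter is recovered.
rng = np.random.default_rng(1)
t = np.zeros(400)
t[np.arange(5, 400, 20)] = rng.uniform(0.5, 1.5, 20)   # pulses, 20 apart
h_true = np.array([0.2, 0.6, 1.0, 0.6, 0.2])
e = np.convolve(t, h_true, mode='same')                # synthetic residual
h_est = estimate_voiced_filter(e, t, M=4)
print(np.allclose(h_est, h_true, atol=1e-6))           # True
```

In the paper's setting the system is solved per state with the whitening matrix G in place, and the pulse matrices A_s select only the pulses belonging to state s.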
2.3.3. Determination of Hu(z)

Based on the assumption that w(n) is white noise, the function of Hu(z) is to remove long- and short-term correlation from the unvoiced excitation u(n). For such a purpose the all-pole structure based on LP coefficients shown in (2) with large L seems to be a good choice, due to its simplicity in terms of computational complexity. Therefore, the coefficients of the unvoiced filter for state s, {g_s(1),...,g_s(L)}, and the related gain, K_s, are determined through autoregressive spectral estimation [11] of u(n) over speech segments belonging to state s.

2.4. Pulse optimization

Aside from the determination of the filters, the positions and amplitudes of t(n), {p_1,...,p_Z} and {a_1,...,a_Z}, with Z being the number of pulses of the entire training database, are modified in the sense of minimizing the mean squared error of the system of Figure 2, given by

  \varepsilon = \frac{1}{2} w^T w = \frac{1}{2}\Big(e - \sum_{s=1}^{S} A_s h_s\Big)^T G^T G \Big(e - \sum_{s=1}^{S} A_s h_s\Big).        (8)

In fact, it can be verified that the minimization of (8) corresponds to the maximization of (5) when G(z) is minimum phase, which is assured by the assumption that Hu(z) is stable. The procedure for calculating the positions and amplitudes resembles multipulse excitation linear prediction coding algorithms [3]. For the present case the search is performed over each pulse position {p_1,...,p_Z}.

2.5. Recursive algorithm

The overall procedure for the determination of the filters Hv(z) and Hu(z), and the optimization of the positions and amplitudes of t(n), is depicted in Figure 3. Pitch marks may represent the best choice to construct the initial pulse trains t(n). Furthermore, initialization of the voiced filters as the unit sample sequence, h_s(n) = δ(n), is a good approximation considering that the pitch marks may indicate the actual pitch pulses in e(n). The variation of the voiced filter coefficients is taken into account as the convergence criterion.

3. Experiment

In order to verify the effectiveness of the proposed excitation approach, the English speech database CMU ARCTIC [12], female speaker SLT, was utilized.
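The pulse search of Section 2.4 resembles multipulse LPC: for a candidate position the optimal amplitude has a closed form, and positions are scanned for the largest error reduction. A toy single-pulse sketch, again assuming G = I and a single state (names and values illustrative):

```python
import numpy as np

def best_pulse(residual, h, M):
    """Scan all positions p; for each, the amplitude minimizing
    ||residual - a * h(n - p)||^2 is a = <r, v_p> / <v_p, v_p>,
    giving an error reduction of <r, v_p>^2 / <v_p, v_p>."""
    N = len(residual)
    best_red, best_p, best_a = 0.0, 0, 0.0
    for p in range(N):
        v = np.zeros(N)                      # v_p(n) = h(n - p)
        lo, hi = max(0, p - M // 2), min(N, p + M // 2 + 1)
        v[lo:hi] = h[lo - (p - M // 2):hi - (p - M // 2)]
        denom = v @ v
        if denom == 0.0:
            continue
        corr = residual @ v
        red = corr * corr / denom            # error reduction at p
        if red > best_red:
            best_red, best_p, best_a = red, p, corr / denom
    return best_p, best_a

# Toy check: the residual is exactly one filtered pulse of amplitude 0.8
# centred at position 50, which the search should recover.
h = np.array([0.2, 0.6, 1.0, 0.6, 0.2])      # M = 4
r = np.zeros(200)
r[48:53] = 0.8 * h
p, a = best_pulse(r, h, M=4)
print(p, round(a, 6))                        # 50 0.8
```

The full procedure of Section 2.4 repeats such searches over all Z pulses of the database, with the whitened error of Eq. (8) as the criterion.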
For the present case, the states {1,...,S} were regarded as leaves of phonetic decision-trees generated for mel-cepstral coefficients. The choice of the states was made upon the assumption that residual signals are highly correlated with their corresponding spectral parameters [13]. The trees were constructed with phone-based questions, and the clustering parameters were adjusted to derive small trees, in order to group general phonetic characteristics within each terminal node. Under these conditions, the number of states obtained at the end of the clustering process was a hundred and thirty-one, i.e., S = 131.

Figure 3: Closed-loop training for the determination of the excitation model.

Residual segmentation was performed as follows. First, Viterbi alignment of the database was done using the usual context-dependent models yielded by the training of the HMM-based synthesizer. After that, the contextual labels of the resulting segments were mapped onto the states of the above-described small phonetic trees. Finally, the appropriately tagged segments were used to train the excitation model. Filter orders were M = 512 and L = 256.

3.1. Effect of the closed-loop training

The right side of Figure 4 shows a segment of natural speech with three corresponding versions synthesized by using: pulse train/white noise switch (simple excitation) - second row; the mixed excitation approach utilized in the HMM-based speech synthesizer for the Blizzard Challenge 2005 [7] - third row; the proposed excitation model - fourth row. The related sources are depicted in the left side of the figure. One can observe that, among the excitation and synthesized speech versions, the proposed method best approaches the residual and natural speech waveforms, as an effect of the closed-loop training in which phase information is also modeled by the voiced filters.

3.2. Subjective evaluation

A subjective evaluation was performed with six subjects.
Two of the listeners were speech synthesis specialists, whereas one of the other four subjects was a native speaker of English. The test consisted in an AB forced preference, with ten utterances randomly selected out of forty sentences taken from the BTEC corpus [14]. The order in which the utterances were played within each test pair was also randomized. The experiment was executed in a quiet room with headphones.

Figure 4: Top: residual (left) and natural speech (right). Second row: simple excitation (left) and corresponding synthesized speech (right). Third row: excitation according to [7] (left) and corresponding synthesized speech (right). Fourth row: excitation given by the proposed method (left) and respective synthesized speech (right).

Figure 5: Overall preference for each excitation version. [Bar chart: proposed 58.33% vs. Blizzard; horizontal axis: number of test pairs, 0-120.]

Figure 5 shows the listeners' preference. It can be seen that the proposed approach is equivalent to the Blizzard Challenge 2005 HMM-based synthesizer in terms of quality.
In fact, the question asked during the test was: "which utterance from the pair presents better quality?". Though most of the time the difference between the proposed and Blizzard versions was noticed, the decision concerning which one presented better quality was usually difficult to make, according to the listeners.

4. Conclusions

The proposed method synthesizes speech with quality considerably better than the simple excitation baseline. Furthermore, when compared with one of the best approaches reported to eliminate the buzziness of HMM-based speech synthesis, the Blizzard Challenge 2005 system as described in [7], the proposed model presents the advantage of producing speech which seems closer to natural. The results have been promising, and future steps towards the conclusion of this work include pulse train modeling for the waveform generation part and state clustering using the residual signals themselves.

5. References

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. of EUROSPEECH, 1999.
[2] J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, "Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis," in Proc. of ICASSP, 2004.
[3] W. C. Chu, Speech Coding Algorithms. Wiley-Interscience, 2003.
[4] A. McCree and T. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, July 1995.
[5] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed-excitation for HMM-based speech synthesis," in Proc. of EUROSPEECH, 2001.
[6] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, Apr. 1999.
[7] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. on Inf. and Systems, vol. E90-D, no. 1, 2007.
[8] S. J. Kim and M. Hahn, "Two-band excitation for HMM-based speech synthesis," IEICE Trans. Inf. and Systems, vol. E90-D, 2007.
[9] M. Akamine and T. Kagoshima, "Analytic generation of synthesis units by closed loop training for totally speaker driven text to speech system (TOS drive TTS)," in Proc. of ICSLP, 1998.
[10] L. B. Jackson, Digital Filters and Signal Processing. Kluwer Academic, 1996.
[11] S. Kay, Modern Spectral Estimation. USA: Prentice-Hall, 1988.
[12] http://festvox.org/cmu_arctic.
[13] H. Duxans and A. Bonafonte, "Residual conversion versus prediction on voice morphing systems," in Proc. of ICASSP, 2006.
[14] http://www此atr.jp江wsr.:η004_whatsnew/archives/000782.html