A study of hypo- and hyper-articulated synthesized speech
Mauro Nicolao
Speech and Hearing Research Group - Department of Computer Science, The University of Sheffield
SCALE - Speech Communication with Adaptive Learning, 2nd Winter School, Aachen, February 15, 2011

Outline
a) The Speech Synthesis by Analysis project
b) Complete project architecture
c) TTS prototype with control on speech quality (towards H&H)
   a) Weighted MLLR transformation
   b) Global Variance model manipulation
   c) Dynamic- vs static-feature weight control in speech generation
d) Next steps

Speech Synthesis by Analysis Project
• Modifications of human speech:
  ‒ voice intensity increase
  ‒ speech rate adjustments
  ‒ noise rhythm adaptation
  ‒ signal processing (i.e. Lombard effect)
  ‒ change of word vocabulary
• Success in communication:
  ‒ to produce intelligible speech
  ‒ to satisfy the listener's needs
  ‒ to transfer a concept from the talker's to the listener's mind
Lindblom (1990), Lane et al. (2007), Levelt et al. (1999)

Speech Synthesis by Analysis Project
• Automatic TTS systems ignore environmental effects on speech and any feedback from the listener.
• Many researchers in different disciplines are investigating models to describe this human behaviour.
• A new way of thinking about automatic speech synthesis.
Moore (2007), Casserly and Pisoni (2010)

Complete project architecture
[Figure: block diagram of the complete project architecture; recoverable labels: FEEDFORWARD, FEEDBACK, SII]

TTS prototype with control on speech quality
• Control function:
  ‒ none
• Synthesis:
  ‒ HTS + SAT synthesis
  ‒ STRAIGHT parameters
  ‒ GV control
• Control actions:
  ‒ Phoneme substitution
  ‒ MLLR transformation
  ‒ GV Gaussian model manipulation
  ‒ Dynamic feature weight control

TTS prototype with control on speech quality
[Figure: a line running from hyper-articulated speech (intelligible but unnatural), through HTS-Demo speech, to hypo-articulated speech (muttered but friendly)]
• Aim:
  ‒ Manipulate HTS model parameters to shift the speech quality along this line
  ‒ Act on generation parameters
  ‒ Only acoustic model manipulation
• Strategies:
  ‒ Weighted MLLR transformation
  ‒ Global Variance model manipulation
  ‒ Dynamic- vs static-feature weight control in speech generation

Weighted MLLR transformation
Idea: hypo-articulation can be obtained by reducing all the normally articulated vowels to a minimally articulated schwa. A CMLLR transformation can be trained to perform this change. Ideally, the opposite CMLLR transformation should define a transformation from the standard to the hyper-articulated acoustic space.
[Figure: transformation diagram; recoverable labels: T1, T2, T'1, T'2]

Weighted MLLR transformation
1. Substitute each vowel in the generation label files with a schwa, because this is the least articulated vowel.
2. Generate a small corpus of hypo-articulated speech examples (about 1100 utterances).
3. Train a CMLLR transformation from the standard to the hypo-articulated acoustic model.
4. New vectors (spectrum, F0 and duration):
   $$o' = W o \tag{5}$$
   $$W = [\, b \;\; A \,] \tag{6}$$
   $$o' = A o + b \tag{7}$$
   o: observation vector generated by the standard model (taken with a leading 1 in (5), so that (5) and (7) agree).
   A, b: parameters of the transformation; A is an n × n transformation matrix and b a bias vector, one pair for each class of the decision tree.
   I: identity matrix. 0: all-zero matrix.
   The transformation can also be seen as a vector from o (source observation) to o' (transformed observation):
   $$v = o' - o \tag{8}$$
   When an observation vector o transformed by (α · 100)% is required:
   $$\hat{o} = o + \alpha v \tag{9}$$
   $$v = (A - I)\, o + b \tag{10}$$
   with α the scaling factor, 0 ≤ α ≤ 1.
5. The transformation can therefore be weighted by a scalar α:
   $$\hat{o} = (\alpha A + (1 - \alpha) I)\, o + (\alpha b + (1 - \alpha)\mathbf{0}) \tag{11}$$
6. Ideally, the opposite CMLLR transformation should define a transformation from the standard to the hyper-articulated acoustic space. Assuming that the vector v defines a transformation that moves the spectral characteristics in one direction (i.e. a movement in the F1-F2 diagram towards its centre), -v should transform in the opposite direction:
   $$-v = o - o' \tag{12}$$
7. The inverse transformation has been computed. From (11) we have
   $$\hat{o} - (\alpha b + (1 - \alpha)\mathbf{0}) = (\alpha A + (1 - \alpha) I)\, o \tag{13}$$
   $$o = (\alpha A + (1 - \alpha) I)^{-1} \hat{o} - (\alpha A + (1 - \alpha) I)^{-1} (\alpha b + (1 - \alpha)\mathbf{0}) \tag{14}$$
   Substituting $A' = \alpha A + (1 - \alpha) I$ and $b' = \alpha b + (1 - \alpha)\mathbf{0}$ in both (11) and (14) gives $\hat{o} = A' o + b'$ and $o = A'^{-1} \hat{o} - A'^{-1} b'$.
[Figure labels: Hypo speech, HTS-Demo speech, Hyper speech]
[Figure 1: diagram of the standard average distribution of F1, F2 values for the English …]

Global Variance model manipulation
Idea: to change the Global Variance (GV) model parameters either to reduce or to amplify the range of variation in the generated feature vectors.
‒ Variance-controlled generation of c vectors with a Global Variance term (Toda and Tokuda, 2007):
  $$P(c \mid \lambda, \lambda_v) = \sum_{\text{all } Q} P(Wc, Q \mid \lambda)^{\omega}\, P(v(c) \mid \lambda_v) \tag{17}$$
‒ Manipulating the GV model means manipulating the range of variance values of the observation vectors.
‒ Scaling factors are used to control the transformation (none for F0).
‒ This allows the variance to be increased, but the mean of the observation vectors still drives the feature generation.

Dynamic- vs static-feature weight control
Idea: to give more importance to dynamic vs. static features in the speech generation process (18).
$$\underbrace{\begin{bmatrix} \vdots \\ c_t \\ \Delta c_t \\ \Delta^2 c_t \\ \vdots \end{bmatrix}}_{o} = \underbrace{\begin{bmatrix} & \vdots & \vdots & \vdots & \\ \cdots & 0 & I & 0 & \cdots \\ \cdots & -I/2 & 0 & I/2 & \cdots \\ \cdots & I & -2I & I & \cdots \\ & \vdots & \vdots & \vdots & \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c} \tag{18}$$
The generated static sequence is
$$c = (W^{\top} \hat{U}^{-1} W)^{-1} W^{\top} \hat{U}^{-1} \hat{\mu}$$
1. By increasing (decreasing) the window weights in the generation process, the realization with low (high) variation is chosen among the possible realizations of a phoneme.
2. A different weight is used for each dynamic feature. The transformation is defined by the vector [α1 α2 α3]:
$$\underbrace{\begin{bmatrix} \vdots \\ c_t \\ \Delta c_t \\ \Delta^2 c_t \\ \vdots \end{bmatrix}}_{o} = \underbrace{\begin{bmatrix} & \vdots & \vdots & \vdots & \\ \cdots & \alpha_1 0 & \alpha_1 I & \alpha_1 0 & \cdots \\ \cdots & -\alpha_2 I/2 & \alpha_2 0 & \alpha_2 I/2 & \cdots \\ \cdots & \alpha_3 I & -2\alpha_3 I & \alpha_3 I & \cdots \\ & \vdots & \vdots & \vdots & \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c} \tag{21}$$
3. α1 is usually set to 1 for F0 (pitch shifting).

Dynamic- vs static-feature weight control
[Figure: F1 formant frequency (Hz) against time (s), with phone labels ae, l, ax, s, for three settings: α1=1 α2=0.2 α3=0.2; α1=1 α2=1 α3=1; α1=1 α2=10 α3=10]

Audio examples
[Audio examples placed along the line from hyper-articulated speech, through HTS-Demo speech, to hypo-articulated speech]
‒ Vowel reduction
‒ GV weight
‒ Dynamic control
‒ Dynamic + reduction
‒ Dynamic + reduction in noise
‒ GUI

Outline
a) The Speech Synthesis by Analysis project
b) Complete project architecture
c) First realizations:
   a) TTS prototype with extended Speech Intelligibility Index (SII) feedback
   b) TTS prototype with control on speech quality (towards H&H)
d) Next steps

Next steps
• Add articulatory constraints
• Find new parameters to control feature generation
• Complete the control feedback by:
  ‒ defining an optimization function
  ‒ adding a recognition function
  ‒ real-time reactions
• Investigate a formant synthesiser as a possible vocoder
• Add more generalization in the parameter generation process:
  ‒ Multiple phonetizations activated by the same word
  ‒ Bayesian synthesiser (ref. Zen, H.)

Thank you
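
Backup: numerical sketch of two control strategies
The weighted MLLR transformation and the dynamic-feature weight control in this deck both reduce to small pieces of linear algebra, so a toy NumPy sketch may help to check the formulas. It is not the HTS implementation. All function names (weighted_transform, towards_hypo, towards_hyper, window_matrix, generate_trajectory) are hypothetical, and the sketch assumes a one-dimensional static feature, a diagonal covariance given as a vector of variances, and edge frames repeated at the utterance boundaries.

```python
import numpy as np


def weighted_transform(A, b, alpha):
    """Return (A', b') with A' = alpha*A + (1 - alpha)*I and b' = alpha*b."""
    n = A.shape[0]
    return alpha * A + (1.0 - alpha) * np.eye(n), alpha * b


def towards_hypo(o, A, b, alpha):
    """Move o by (alpha*100)% along the standard-to-hypo transformation."""
    A_w, b_w = weighted_transform(A, b, alpha)
    return A_w @ o + b_w


def towards_hyper(o, A, b, alpha):
    """Apply the inverse of the weighted transformation to o."""
    A_w, b_w = weighted_transform(A, b, alpha)
    return np.linalg.solve(A_w, o - b_w)


# Static, delta and delta-delta windows over frames (t-1, t, t+1).
WINDOWS = ((0.0, 1.0, 0.0), (-0.5, 0.0, 0.5), (1.0, -2.0, 1.0))


def window_matrix(T, alphas=(1.0, 1.0, 1.0)):
    """Build the (3T x T) matrix W, with each window scaled by its alpha."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, (alpha, window) in enumerate(zip(alphas, WINDOWS)):
            for offset, coef in zip((-1, 0, 1), window):
                # Assumption: edge frames are repeated at the boundaries.
                col = min(max(t + offset, 0), T - 1)
                W[3 * t + k, col] += alpha * coef
    return W


def generate_trajectory(mu, var, alphas=(1.0, 1.0, 1.0)):
    """Solve c = (W^T U^-1 W)^-1 W^T U^-1 mu for a diagonal U.

    mu and var hold, frame by frame, the mean and variance of
    (static, delta, delta-delta), so both have length 3T.
    """
    T = mu.shape[0] // 3
    W = window_matrix(T, alphas)
    Wt_Uinv = W.T / var  # divides column j of W^T by var[j]
    return np.linalg.solve(Wt_Uinv @ W, Wt_Uinv @ mu)
```

With alpha = 0 the weighted transformation leaves the observation unchanged and with alpha = 1 it applies the full transformation; towards_hyper undoes towards_hypo for the same alpha. In generate_trajectory, larger α2 and α3 give more weight to the dynamic-feature rows, which in this toy setting yields a flatter trajectory.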