A study of hypo- and hyper-articulated synthesized speech

Mauro Nicolao
Speech and Hearing Research Group - Department of Computer Science
The University of Sheffield
SCALE - Speech Communication with Adaptive Learning
2nd Winter School, Aachen, February 15, 2011
Outline
a)  The Speech Synthesis by Analysis project
b)  Complete project architecture
c)  TTS prototype with control on speech quality (towards H&H)
    a)  Weighted MLLR transformation
    b)  Global Variance model manipulation
    c)  Dynamic- vs static-feature weight control in speech generation
d)  Next steps
Mauro Nicolao
A study of hypo- and hyper-articulated synthesised speech
Aachen, February 15, 2011
Speech Synthesis by Analysis Project
•  Modifications of human speech:
‒  voice intensity increasing
‒  speech rate adjustments
‒  noise rhythm adaptation
‒  signal processing (i.e. Lombard effect)
‒  change of word vocabulary
•  Success in communication:
‒  to produce intelligible speech
‒  to satisfy the listener's needs
‒  to transfer a concept from the talker's to the listener's mind
Lindblom (1990), Lane et al. (2007), Levelt et al. (1999)
Speech Synthesis by Analysis Project
•  Automatic TTS systems ignore environmental effects on speech and any feedback from the listener.
•  Many researchers in different disciplines are investigating models to describe human behaviour.
•  A new way of thinking about automatic speech synthesis
Moore (2007), Casserly and Pisoni (2010)
Complete project architecture
[Figure: project architecture diagram with FEEDFORWARD and FEEDBACK paths and an SII block]
TTS prototype with control on speech quality
•  Control function:
‒  none
•  Synthesis:
‒  HTS + SAT synthesis
‒  STRAIGHT parameters
‒  GV control
•  Control actions:
‒  Phoneme substitution
‒  MLLR transformation
‒  GV Gaussian model manipulation
‒  Dynamic feature weight control
TTS prototype with control on speech quality
[Figure: articulation line running from hypo-articulated speech (muttered but friendly), through HTS-Demo speech, to hyper-articulated speech (intelligible but unnatural)]
•  Aim:
‒  Manipulate HTS model parameters to shift the speech quality along this line
‒  Act on generation parameters
‒  Only acoustic model manipulation
•  Strategies
‒  Weighted MLLR transformation
‒  Global Variance model manipulation
‒  Dynamic- vs static-feature weight control in speech generation
Weighted MLLR transformation
Idea: hypo-articulation can be obtained by reducing all the normally articulated
vowels to the minimally articulated schwa. A CMLLR transformation can be trained to perform
this change. Ideally, the opposite CMLLR transformation should define a
transformation from the standard to the hyper-articulated acoustic space.
[Figure: diagram of the transformations T1, T2 and T'1, T'2]
Weighted MLLR transformation
1.  Substituting each vowel in the generation label files with a schwa vowel, because this is the least articulated vowel amongst the others.
2.  Generating a small corpus of hypo-articulated speech examples (about 1100 utterances).
3.  Training a CMLLR transformation from the standard to the hypo acoustic model:
    $$ o' = W o \qquad (5) $$
    $$ W = [\, b \;\; A \,] \qquad (6) $$
    $$ o' = A o + b \qquad (7) $$
    A is an n × n transformation matrix and b a bias vector, one pair for each class of the decision tree.
4.  The transformation can also be seen in vector form, from o (source observation) to o' (transformed observation):
    $$ v = o' - o \qquad (8) $$
    A new observation vector (spectrum, F0 and duration), o transformed by (α ∗ 100)%, is required:
    $$ \hat{o} = o + \alpha v \qquad (9) $$
    $$ v = (A - I)\, o + b \qquad (10) $$
5.  This transformation can therefore be weighted by using a scalar α:
    $$ \hat{o} = (\alpha A + (1 - \alpha) I)\, o + (\alpha b + (1 - \alpha) O) \qquad (11) $$
    with α the scaling factor, 0 ≤ α ≤ 1, I the identity matrix and O the all-zero matrix.
[Figure: HTS-Demo speech and Hypo speech]
6.  Ideally, the opposite CMLLR transformation should define a transformation from the standard to the hyper-articulated acoustic space. Assuming that the vector v defines a transformation which moves the spectral characteristics in one direction (i.e. a movement in the F1-F2 diagram towards its centre), -v should transform the spectrum in the opposite direction:
    $$ -v = o - o' \qquad (12) $$
7.  The inverse transformation has been computed:
    $$ \hat{o} - (\alpha b + (1 - \alpha) O) = (\alpha A + (1 - \alpha) I)\, o \qquad (13) $$
    $$ o = (\alpha A + (1 - \alpha) I)^{-1} \hat{o} - (\alpha A + (1 - \alpha) I)^{-1} (\alpha b + (1 - \alpha) O) \qquad (14) $$
    Substituting A' = αA + (1 − α)I and b' = αb + (1 − α)O in both (11) and (14), we have
    $$ \hat{o} = A' o + b' $$
    $$ o = A'^{-1} \hat{o} - A'^{-1} b' $$
[Figure: HTS-Demo speech and Hyper speech]
o: observation vector generated by the standard model
A, b: parameters of the transformation
I: identity matrix
O: all-zero matrix
Figure 1: Diagram of standard average distribution of F1, F2 values for the English [...]
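The weighted transformation of (11) and its inverse of (14) can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy values of A, b and o are assumptions made for the example (they are not trained CMLLR parameters), and a single (A, b) pair is used, whereas the slides describe one pair per decision-tree class.

```python
import numpy as np

def weighted_transform(o, A, b, alpha):
    """Apply the alpha-weighted affine transform: o_hat = A' o + b'."""
    n = A.shape[0]
    A_w = alpha * A + (1.0 - alpha) * np.eye(n)  # A' = alpha*A + (1 - alpha)*I
    b_w = alpha * b                              # b' = alpha*b + (1 - alpha)*O, O all-zero
    return A_w @ o + b_w

def inverse_weighted_transform(o_hat, A, b, alpha):
    """Invert the weighted transform: o = A'^{-1} (o_hat - b')."""
    n = A.shape[0]
    A_w = alpha * A + (1.0 - alpha) * np.eye(n)
    b_w = alpha * b
    # Solve the linear system instead of forming the explicit inverse.
    return np.linalg.solve(A_w, o_hat - b_w)

# Toy values, chosen only for illustration.
A = np.array([[0.8, 0.1],
              [0.0, 0.9]])
b = np.array([0.5, -0.2])
o = np.array([1.0, 2.0])

o_half = weighted_transform(o, A, b, alpha=0.5)          # halfway towards A o + b
o_back = inverse_weighted_transform(o_half, A, b, alpha=0.5)
```

With alpha = 0 the observation is left unchanged, and with alpha = 1 the full transformation A o + b is applied; intermediate values interpolate linearly between the two.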
Global Variance model manipulation
Idea: to change Global-Variance model parameters either to reduce or to
amplify the range of variations in the generated feature vectors.
‒  Generation of c vectors with Global Variance term (Toda and Tokuda, 2007):
   $$ P(c \mid \lambda, \lambda_\nu) = \sum_{\text{all } Q} P(Wc, Q \mid \lambda)^{\omega} \, P(\nu(c) \mid \lambda_\nu) \qquad (17) $$
‒  Manipulation of the GV model is the manipulation of the variance value range of the observation vectors
‒  Scaling factors are used to control the transformation (none for F0)
‒  This allows for an increase of variance, but the mean of the observation vector is still leading the feature generation
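The effect of reducing or amplifying the range of variation can be illustrated with a much simpler operation than GV-based generation: rescaling an already generated trajectory around its per-dimension mean. The sketch below is a post-hoc variance scaling, not the likelihood-based generation of (17) and not a manipulation of the GV Gaussian itself. The function names, the `scale` parameter and the random toy trajectory are assumptions made for the example.

```python
import numpy as np

def global_variance(c):
    """Per-dimension variance over the frames of one utterance; c has shape (T, D)."""
    return c.var(axis=0)

def rescale_to_gv(c, scale):
    """Return a trajectory whose global variance is scale * GV(c), with the same mean.

    scale > 1 amplifies the range of variation, scale < 1 reduces it.
    """
    mean = c.mean(axis=0)
    # Variance scales with the square of the amplitude, hence the square root.
    return mean + np.sqrt(scale) * (c - mean)

# Toy trajectory: 50 frames of a 3-dimensional feature (random, illustration only).
rng = np.random.default_rng(0)
c = rng.normal(size=(50, 3))

c_wide = rescale_to_gv(c, scale=2.0)    # amplified variation
c_narrow = rescale_to_gv(c, scale=0.5)  # reduced variation
```

As on the slide, the mean of the trajectory is untouched by this operation; only the spread around it changes.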
Dynamic- vs static-feature weight control
Idea: to give more importance to dynamic vs. static features in the speech generation process.
$$ \underbrace{\begin{bmatrix} \vdots \\ c_t \\ \Delta c_t \\ \Delta^2 c_t \\ \vdots \end{bmatrix}}_{o} = \underbrace{\begin{bmatrix} & \vdots & \vdots & \vdots & \\ \cdots & 0 & I & 0 & \cdots \\ \cdots & -I/2 & 0 & I/2 & \cdots \\ \cdots & I & -2I & I & \cdots \\ & \vdots & \vdots & \vdots & \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c} \qquad (18) $$
$$ o = W c \qquad (19) $$
$$ c = (W^T \hat{U}^{-1} W)^{-1} W^T \hat{U}^{-1} \hat{\mu} \qquad (20) $$
1.  By increasing (decreasing) the window weights in the generation process, the realization with low (high) variation is chosen among the possible realizations of a phoneme.
2.  Different weight for each dynamic feature. The transformation is defined by the vector [α1 α2 α3]:
$$ \underbrace{\begin{bmatrix} \vdots \\ c_t \\ \Delta c_t \\ \Delta^2 c_t \\ \vdots \end{bmatrix}}_{o} = \underbrace{\begin{bmatrix} & \vdots & \vdots & \vdots & \\ \cdots & \alpha_1 0 & \alpha_1 I & \alpha_1 0 & \cdots \\ \cdots & -\alpha_2 I/2 & \alpha_2 0 & \alpha_2 I/2 & \cdots \\ \cdots & \alpha_3 I & -2\alpha_3 I & \alpha_3 I & \cdots \\ & \vdots & \vdots & \vdots & \end{bmatrix}}_{W} \underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c} \qquad (21) $$
$$ o = W c \qquad (22) $$
3.  α1 is usually set to 1 for F0 (other values would shift the pitch)
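A minimal NumPy sketch of the weighted-window generation follows, for a one-dimensional feature stream with diagonal covariance. It builds W with the window rows scaled by [α1 α2 α3] and solves c = (WᵀU⁻¹W)⁻¹WᵀU⁻¹μ. Several things here are assumptions made for the example rather than details taken from the slides: the function names, the interleaved (static, Δ, Δ²) ordering of the mean vector, the truncation of windows at utterance boundaries, the standard delta window (−1/2, 0, 1/2), and the toy step-shaped means.

```python
import numpy as np

# Window coefficients applied to (c[t-1], c[t], c[t+1]).
WINDOWS = np.array([
    [0.0, 1.0, 0.0],    # static
    [-0.5, 0.0, 0.5],   # delta
    [1.0, -2.0, 1.0],   # delta-delta
])

def build_window_matrix(T, alphas=(1.0, 1.0, 1.0)):
    """Build the (3T x T) matrix W, with row k of each frame scaled by alphas[k]."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k in range(3):
            for j, tau in enumerate((t - 1, t, t + 1)):
                if 0 <= tau < T:  # window taps outside the utterance are dropped
                    W[3 * t + k, tau] = alphas[k] * WINDOWS[k, j]
    return W

def generate_trajectory(mu, var, alphas=(1.0, 1.0, 1.0)):
    """Solve (W^T U^-1 W) c = W^T U^-1 mu for the static trajectory c."""
    T = mu.shape[0] // 3
    W = build_window_matrix(T, alphas)
    U_inv = np.diag(1.0 / var)  # diagonal covariance assumed
    return np.linalg.solve(W.T @ U_inv @ W, W.T @ U_inv @ mu)

# Toy model output: a step in the static means, zero means for the dynamic features.
T = 6
static_mean = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
mu = np.zeros(3 * T)
mu[0::3] = static_mean
var = np.ones(3 * T)

c_weak = generate_trajectory(mu, var, alphas=(1.0, 0.2, 0.2))
c_std = generate_trajectory(mu, var, alphas=(1.0, 1.0, 1.0))
c_strong = generate_trajectory(mu, var, alphas=(1.0, 10.0, 10.0))
```

In this toy setting, raising α2 and α3 puts more weight on matching the (zero) dynamic means, so the step is smoothed more strongly, while lowering them lets the trajectory follow the static means more closely.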
Dynamic- vs static-feature weight control
[Figure: F1 formant frequency (Hz) against time (s), from 0.1 s to 0.4631 s, with phone labels ae, l, ax, s, for three weight settings: α1=1 α2=0.2 α3=0.2; α1=1 α2=1 α3=1; α1=1 α2=10 α3=10]
Audio examples
[Figure: audio examples placed along the line from hypo-articulated speech, through HTS-Demo speech, to hyper-articulated speech]
•  Vowel reduction
•  GV weight
•  Dynamic control
•  Dynamic + reduction
•  Dynamic + reduction in noise
•  GUI
Outline
a)  The Speech Synthesis by Analysis project
b)  Complete project architecture
c)  First realizations:
    a)  TTS prototype with extended Speech Intelligibility Index (SII) feedback
    b)  TTS prototype with control on speech quality (towards H&H)
d)  Next steps
Next steps
•  Add articulatory constraints
•  Find new parameters to control feature generation
•  Complete the control feedback by:
‒  defining an optimization function
‒  adding a recognition function
‒  real-time reactions
•  Investigate formant synthesiser as possible vocoder
•  Add more generalization in the parameter generation process:
‒  Multiple phonetization activated by the same word
‒  Bayesian synthesiser (ref. Zen, H.)
Thank you