SPAM

11-752: Prosody
Statistical Phrase Accent Modeling (SPAM)
SPAM
◆ Gopala Anumanchipalli's PhD Thesis
● Data-driven approach
● Higher level than frames/syllables
● Accent Groups
● Multi-tiered
Raw vs Smooth
Frame-based prediction
ToBI labelling
Fujisaki Model
Tilt Model (Taylor)
Prediction level
◆ Frame
◆ HMM State
◆ Phone
◆ Syllable
◆ Word
◆ Accent Group
◆ Phrase
◆ Sentence
Frame Level
Syllable Level
Word Level
Accent Group
Resynth from Accent Groups
Multi-tiered
◆ Fujisaki:
● Phrase and Accent groups combine
◆ SPAM can be multi-tiered too
● Phrase component
● Accent Group component
● Microprosody component
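A minimal sketch of how the three tiers could combine into one F0 contour. It assumes the tiers are simply added frame by frame, in the spirit of the Fujisaki phrase-plus-accent idea; the function name `combine_tiers` and the toy contours are hypothetical illustrations, not the thesis implementation.

```python
import numpy as np

def combine_tiers(phrase, accent, micro):
    """Sum three per-frame components into one F0 contour.

    Assumption: the tiers are additive and already time-aligned.
    """
    phrase, accent, micro = (np.asarray(x, dtype=float)
                             for x in (phrase, accent, micro))
    if not (phrase.shape == accent.shape == micro.shape):
        raise ValueError("all tiers must have the same number of frames")
    return phrase + accent + micro

# Toy data (hypothetical values, in Hz) over a normalized time axis.
n = 100
t = np.linspace(0.0, 1.0, n)
phrase = 120.0 - 20.0 * t                          # slow declination across the phrase
accent = 15.0 * np.exp(-((t - 0.3) / 0.05) ** 2)   # one accent-group bump near t = 0.3
micro = 0.5 * np.sin(2.0 * np.pi * 25.0 * t)       # small fast perturbation

f0 = combine_tiers(phrase, accent, micro)
```

Each tier can then be predicted at its own level (phrase, accent group, frame) and the results summed at synthesis time.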
Multi-tiered SPAM
Full SPAM Prediction
SPAM Examples
◆ Frame based
◆ SPAM based
SPAM Preference
◆ User preference
● SPAM is smoother, more “sing-songy”
● Preferred about 2-1 in listening tests
◆ Objective Measures
● RMSE and Correlation
● Not always better
● Mean and variance better
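The objective measures named above can be sketched as follows. This assumes the reference and predicted F0 contours are already time-aligned and restricted to voiced frames; the function name `f0_objective_measures` and the toy contours are hypothetical.

```python
import numpy as np

def f0_objective_measures(reference, predicted):
    """RMSE, Pearson correlation, and mean/variance of two aligned F0 contours."""
    ref = np.asarray(reference, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    if ref.ndim != 1 or ref.shape != pred.shape:
        raise ValueError("contours must be 1-D and the same length")
    return {
        "rmse": float(np.sqrt(np.mean((ref - pred) ** 2))),
        "corr": float(np.corrcoef(ref, pred)[0, 1]),
        "ref_mean": float(ref.mean()),
        "pred_mean": float(pred.mean()),
        "ref_var": float(ref.var()),
        "pred_var": float(pred.var()),
    }

# Toy contours in Hz (hypothetical values).
ref = np.array([100.0, 110.0, 120.0, 110.0, 100.0])
shifted = ref + 5.0                                # right shape, wrong level
flattened = ref.mean() + 0.2 * (ref - ref.mean())  # right level, too little movement

m_shifted = f0_objective_measures(ref, shifted)
m_flat = f0_objective_measures(ref, flattened)
```

In this toy case both predictions correlate perfectly with the reference and have similar RMSE, yet only the variance shows that the second contour is much flatter, which is one reason to report mean and variance alongside RMSE and correlation.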
But there is more …
◆ SPAM in voice conversion
● Not just frame-by-frame z-score mapping
● Accent shape to Accent shape
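For contrast, a minimal sketch of the frame-by-frame z-score baseline that SPAM goes beyond. It assumes the common practice of normalizing in the log-F0 domain and marking unvoiced frames with 0; the function name `zscore_f0_map` and all statistics below are hypothetical. The accent-shape-to-accent-shape mapping itself is not shown here.

```python
import numpy as np

def zscore_f0_map(src_f0, src_mean, src_std, tgt_mean, tgt_std):
    """Map each voiced source frame to the target speaker's log-F0 distribution.

    Means and standard deviations are log-F0 statistics (assumption).
    Frames with F0 <= 0 are treated as unvoiced and left at 0.
    """
    if src_std <= 0 or tgt_std <= 0:
        raise ValueError("standard deviations must be positive")
    src = np.asarray(src_f0, dtype=float)
    out = np.zeros_like(src)
    voiced = src > 0
    z = (np.log(src[voiced]) - src_mean) / src_std  # per-frame z-score
    out[voiced] = np.exp(tgt_mean + z * tgt_std)
    return out

# Toy example: 0 marks unvoiced frames (hypothetical values in Hz).
src = np.array([0.0, 100.0, 120.0, 0.0, 110.0])
converted = zscore_f0_map(src,
                          src_mean=np.log(110.0), src_std=0.1,
                          tgt_mean=np.log(200.0), tgt_std=0.2)
```

Because every frame is mapped independently, this baseline rescales the source contour but cannot change the shape of an accent.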
F0 speaker variance
SPAM in voice conversion
Speech to Speech Translation
◆ Find out what a speaker accents
● Transfer that to the translated word
● Captures cross-linguistic emphasis
● Captures cross-linguistic style
Speech to Speech Translation
Intonation Modeling
◆ Next steps
● Better Accent Group modeling
● Three tiers (or more?)
● Better objective measure
◆ But where do you put the accents?
● How do you use them?
● (Lenzo PhD 2014 ...)