11-752: Prosody

Statistical Phrase Accent Modeling (SPAM)

SPAM
◆ Gopala Anumanchipalli's PhD Thesis
  ● Data-driven approach
  ● Higher level than frames/syllables
  ● Accent Groups
  ● Multi-tiered

Raw vs Smooth

Frame-based prediction

ToBI labelling

Fujisaki Model

Tilt Model (Taylor)

Prediction level
◆ Frame
◆ HMM State
◆ Phone
◆ Syllable
◆ Word
◆ Accent Group
◆ Phrase
◆ Sentence

Frame Level

Syllable Level

Word Level

Accent Group

Resynth from Accent Groups

Multi-tiered
◆ Fujisaki:
  ● Phrase and Accent groups combine
◆ SPAM can be multi-tiered too
  ● Phrase component
  ● Accent Group component
  ● Microprosody component

Multi-tiered SPAM

Full SPAM Prediction

SPAM Examples
◆ Frame based
◆ SPAM based

SPAM Preference
◆ User preference
  ● SPAM is smoother, more “sing-songy”
  ● Preferred about 2-1 in listening tests
◆ Objective Measures
  ● RMSE and Correlation
  ● Not always better
  ● Mean and variance better

But there is more …
◆ SPAM in voice conversion
  ● Not just frame-by-frame z-score mapping
  ● Accent shape to Accent shape

F0 speaker variance

SPAM in voice conversion

Speech to Speech Translation
◆ Find out what a speaker accents
  ● Transfer that to the translated word
  ● Captures cross-linguistic emphasis
  ● Captures cross-linguistic style

Intonation Modeling
◆ Next steps
  ● Better Accent Group modeling
  ● Three tiers (or more?)
  ● Better objective measure
◆ But where do you put the accents?
  ● How do you use them?
  ● (Lenzo PhD 2014 ...)
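
Sketch: objective measures and the z-score baseline

The slides name RMSE and correlation as objective measures for predicted F0 contours, and frame-by-frame z-score mapping as the baseline that SPAM-based voice conversion goes beyond. The Python below is a minimal sketch of those three computations, not the thesis implementation. The function names and the example numbers are illustrative assumptions. It also assumes the contours are equal-length lists of voiced-frame F0 values. In practice the mapping is often applied to log F0, which the slides do not specify.

```python
import math

def mean_std(values):
    """Mean and population standard deviation of an F0 sequence."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

def rmse(predicted, reference):
    """Root mean squared error between two equal-length F0 contours."""
    if len(predicted) != len(reference):
        raise ValueError("contours must have the same number of frames")
    sq = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(sq / len(predicted))

def correlation(predicted, reference):
    """Pearson correlation between two equal-length F0 contours."""
    if len(predicted) != len(reference):
        raise ValueError("contours must have the same number of frames")
    mp, sp = mean_std(predicted)
    mr, sr = mean_std(reference)
    cov = sum((p - mp) * (r - mr) for p, r in zip(predicted, reference))
    return (cov / len(predicted)) / (sp * sr)

def zscore_map(source_f0, source_stats, target_stats):
    """Frame-by-frame z-score mapping (the baseline named in the slides).

    Each frame is normalised by the source speaker's (mean, std) and
    rescaled with the target speaker's (mean, std). Accent shape is not
    modelled here; that is the part SPAM is meant to add.
    """
    (ms, ss), (mt, st) = source_stats, target_stats
    return [mt + st * (f - ms) / ss for f in source_f0]

# Hypothetical values in Hz, for illustration only.
source = [100.0, 120.0]
mapped = zscore_map(source, (110.0, 10.0), (200.0, 20.0))
error = rmse([100.0, 120.0], [110.0, 110.0])
```

The z-score mapping changes only the mean and variance of the contour, which fits the slides' remark that mean and variance can look better even when RMSE and correlation do not.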