Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005

Christina Bennett
Language Technologies Institute
Carnegie Mellon University
Student Research Seminar
September 23, 2005

What is corpus-based speech synthesis?
[Diagram labels: Speech Synthesizer; Corpus (Transcript + Voice talent speech); New text = New speech]

Motivation

Need for Speech Synthesis Evaluation
- Determine effectiveness of our "improvements"
- Closer comparison of various corpus-based techniques
- Learn about users' preferences
- Healthy competition promotes progress and brings attention to the field

Blizzard Challenge Goals
- Compare methods across systems
- Remove effects of different data by providing & requiring same data to be used
- Establish a standard for repeatable evaluations in the field
- [My goal:] Bring need for improved speech synthesis evaluation to forefront in community (positioning CMU as a leader in this regard)

Challenge

Blizzard Challenge: Overview
- Released first voices and solicited participation in 2004
- Additional voices and test sentences released Jan. 2005
- 1 - 2 weeks allowed to build voices & synthesize sentences
- 1000 samples from each system (50 sentences x 5 tests x 4 voices)

Evaluation Methods
- Mean Opinion Score (MOS): Evaluate sample on a numerical scale
- Modified Rhyme Test (MRT): Intelligibility test with tested word within a carrier phrase
- Semantically Unpredictable Sentences (SUS): Intelligibility test preventing listeners from using knowledge to predict words

Challenge setup: Tests
- 5 tests from 5 genres
- 3 MOS tests (1 to 5 scale): News, prose, conversation
- 2 "type what you hear" tests
  - MRT: "Now we will say ___ again"
  - SUS: 'det-adj-noun-verb-det-adj-noun'
- 50 sentences collected from each system, 20 selected for use in testing

Challenge setup: Systems
- 6 systems (random ID A-F): CMU, Delaware, Edinburgh (UK), IBM, MIT, Nitech (Japan)
- Plus 1: "Team Recording Booth" (ID X): Natural examples from the 4 voice talents

Challenge setup: Voices
- CMU ARCTIC databases
- American English; 2 male, 2 female
- 2 from initial release: bdl (m), slt (f)
- 2 new DBs released for quick build: rms (m), clb (f)

Challenge setup: Listeners
Three listener groups:
- S: speech synthesis experts (50); 10 requested from each participating site
- V: volunteers (60, 97 registered*); anyone online
- U: native US English speaking undergraduates (58, 67 registered*); solicited and paid for participation
*as of 4/14/05

Challenge setup: Interface
- Entirely online
  http://www.speech.cs.cmu.edu/blizzard/register-R.html
  http://www.speech.cs.cmu.edu/blizzard/login.html
- Register/login with email address
- Keeps track of progress through tests
- Can stop and return to tests later
- Feedback questionnaire at end of tests

Results

Fortunately, Team X is clear "winner"

Listener type S          Listener type V          Listener type U
MOS        type-in       MOS        type-in       MOS        type-in
X - 4.76   X - 8.5       X - 4.41   X - 10.3      X - 4.58   X - 7.3
D - 3.19   D - 14.7      D - 3.02   D - 17.1      D - 3.06   D - 16.3
E - 3.11   B - 15.0      E - 2.83   A - 19.7      E - 2.83   A - 19.3
C - 2.91   A - 17.4      B - 2.66   B - 20.3      B - 2.67   B - 19.6
B - 2.88   E - 20.6      C - 2.48   E - 25.0      C - 2.42   E - 21.7
F - 2.15   C - 22.5      F - 2.07   C - 25.6      A - 2.00   C - 22.8
A - 2.07   F - 32.7      A - 1.98   F - 41.8      F - 1.98   F - 35.2

Team D consistently outperforms others
(same table as above)

Speech experts are biased "optimistic"
(same table as above)

Speech experts are better (in fact, experts)
(same table as above)

Voice results: Listener preference
- slt is most liked, followed by rms
  - Type S: slt - 50% of votes cast; rms - 28.26%
  - Type V: slt - 43.48% of votes cast; rms - 36.96%
  - Type U: slt - 47.27% of votes cast; rms - 34.55%
- But, preference does not necessarily match test performance…

Voice results: Test performance (Female voices - slt)

Listener type   all sys-MOS   natural-MOS   all sys-type-in   natural-type-in
S               rms - 3.233   bdl - 4.827   rms - 10.5        rms - 3.2
S               clb - 3.154   rms - 4.809   clb - 16.0        clb - 9.3
S               slt - 2.994   slt - 4.738   slt - 20.8        bdl - 9.4
S               bdl - 2.941   clb - 4.690   bdl - 22.7        slt - 11.3
V               clb - 2.946   rms - 4.568   rms - 14.0        rms - 3.8
V               rms - 2.894   clb - 4.404   clb - 17.1        bdl - 12.0
V               slt - 2.884   bdl - 4.382   slt - 25.2        slt - 12.0
V               bdl - 2.635   slt - 4.296   bdl - 29.3        clb - 13.1
U               clb - 2.987   slt - 4.611   clb - 11.9        slt - 5.9
U               slt - 2.930   clb - 4.587   slt - 17.5        clb - 5.9
U               rms - 2.873   rms - 4.584   rms - 17.6        rms - 8.8
U               bdl - 2.678   bdl - 4.551   bdl - 28.7        bdl - 9.1

Voice results: Test performance (Female voices - clb)
(same table as above)

Voice results: Test performance (Male voices - rms)
(same table as above)

Voice results: Test performance (Male voices - bdl)
(same table as above)

Voice results: Natural examples

Listener type S            Listener type V            Listener type U
MOS           type-in      MOS           type-in      MOS           type-in
bdl - 4.827   rms - 3.2    rms - 4.568   rms - 3.8    slt - 4.611   slt - 5.9
rms - 4.809   clb - 9.3    clb - 4.404   bdl - 12.0   clb - 4.587   clb - 5.9
slt - 4.738   bdl - 9.4    bdl - 4.382   slt - 12.0   rms - 4.584   rms - 8.8
clb - 4.690   slt - 11.3   slt - 4.296   clb - 13.1   bdl - 4.551   bdl - 9.1

What makes natural rms different?

Voice results: By system
- Only system B consistent across listener types (slt best MOS, rms best WER)
- Most others showed group trends (with exception of B above and F*), i.e.
  - S: rms always best WER, often best MOS
  - V: slt usually best MOS, clb usually best WER
  - U: clb usually best MOS and always best WER
- Again, people clearly don't prefer the voices they most easily understand

Lessons

Lessons learned: Listeners
- Reasons to exclude listener data: incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
- Type-in tests very hard to process automatically: homophones, misspellings/typos, dialectal differences, "smart" listeners
- Group differences: V most variable, U most controlled, S least problematic but not representative

Lessons learned: Test design
- Feedback re tests:
  - MOS: Give examples to calibrate scale (ordering schema); use multiple scales (lay-people?)
  - Type-in: Warn about SUS; hard to remember SUS; words too unusual/hard to spell
- Uncontrollable user test setup
- Pros & cons to having natural examples in the mix: analyzing user response (+), differences in delivery style (-), availability of voice talent (?)

Goals Revisited
- One methodology clearly outshone the rest
- All systems used same data, allowing for actual comparison of systems
- Standard for repeatable evaluations in the field was established
- [My goal:] Brought attention to need for better speech synthesis evaluation (while positioning CMU as the experts)

Future

For the Future
- (Bi-)Annual Blizzard Challenge: introduced at Interspeech 2005 special session
- Improve design of tests for easier analysis post-evaluation
- Encourage more sites to submit their systems!
- More data resources (problematic for the commercial entities)
- Expand types of systems accepted (& therefore test types), e.g. voice conversion
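
Note on scoring type-in tests: the slides report type-in results and refer to "best WER", which suggests typed responses were scored by word error rate, but they do not say how responses were normalized or scored. The following is a minimal sketch, assuming a standard word-level edit-distance WER with lowercasing and punctuation stripping. The function names and the filled-in test word "bat" are hypothetical, chosen only for illustration, and this is not the scoring code used in the challenge. It does not handle the homophone or misspelling issues listed under the lessons learned.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, response):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(response)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from response
            insertion = curr[j - 1] + 1  # extra word typed by the listener
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # MRT carrier phrase from the slides; "bat"/"pat" are hypothetical test words.
    wer = word_error_rate("Now we will say bat again", "now we will say pat again.")
    print(f"WER: {wer:.1%}")
```

With this normalization, one misheard word in the six-word carrier phrase counts as one substitution out of six reference words. A response such as a homophone or a typo would still be counted as an error, which is one reason the slides describe type-in tests as hard to process automatically.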