Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
Christina Bennett
Language Technologies Institute
Carnegie Mellon University
Student Research Seminar
September 23, 2005
What is corpus-based speech synthesis?
[Diagram. Labels: Corpus (Transcript, Voice talent speech) + New text, Speech Synthesizer = New speech]
Motivation
Need for Speech Synthesis Evaluation
Determine effectiveness of our “improvements”
Closer comparison of various corpus-based techniques
Learn about users' preferences
Healthy competition promotes progress and brings attention to the field
Blizzard Challenge Goals
Compare methods across systems
Remove effects of different data by providing & requiring same data to be used
Establish a standard for repeatable evaluations in the field
[My goal:] Bring need for improved speech synthesis evaluation to forefront in community (positioning CMU as a leader in this regard)
Challenge
Blizzard Challenge: Overview
Released first voices and solicited participation in 2004
Additional voices and test sentences released Jan. 2005
1-2 weeks allowed to build voices & synthesize sentences
  1000 samples from each system (50 sentences x 5 tests x 4 voices)
Evaluation Methods
Mean Opinion Score (MOS)
  Evaluate sample on a numerical scale
Modified Rhyme Test (MRT)
  Intelligibility test with tested word within a carrier phrase
Semantically Unpredictable Sentences (SUS)
  Intelligibility test preventing listeners from using knowledge to predict words
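
A minimal sketch of how the two kinds of score could be computed. The slides do not say how the type-in responses were scored; this assumes a word error rate (the later results slides refer to WER) based on word-level edit distance. The function names are illustrative, not taken from the challenge tooling.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings on a 1-5 scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("need at least one rating, each between 1 and 5")
    return mean(ratings)


def word_error_rate(reference, typed):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), typed.lower().split()
    if not ref:
        raise ValueError("reference sentence is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j typed words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # word missing from response
                           cur[j - 1] + 1,           # extra word in response
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Example: one wrong word in a six-word MRT carrier sentence.
wer = word_error_rate("now we will say bat again", "now we will say pat again")
```

Multiplying the result by 100 gives a percentage, which is one plausible reading of the type-in figures reported later.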
Challenge setup: Tests
5 tests from 5 genres
  3 MOS tests (1 to 5 scale)
    News, prose, conversation
  2 “type what you hear” tests
    MRT – “Now we will say ___ again”
    SUS – ‘det-adj-noun-verb-det-adj-noun’
50 sentences collected from each system, 20 selected for use in testing
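
As an illustration of the two type-in formats, the sketch below fills the SUS template and the MRT carrier phrase quoted above. The word lists are hypothetical placeholders; the actual test vocabulary is not given in the slides.

```python
import random

# Hypothetical word lists, one per slot type (not the challenge vocabulary).
LEXICON = {
    "det": ["the", "a"],
    "adj": ["green", "loud", "brave"],
    "noun": ["table", "river", "promise"],
    "verb": ["paints", "follows", "drinks"],
}
SUS_TEMPLATE = "det-adj-noun-verb-det-adj-noun".split("-")
MRT_CARRIER = "Now we will say {} again"


def make_sus(rng):
    """Fill each template slot with a random word of the right class."""
    return " ".join(rng.choice(LEXICON[slot]) for slot in SUS_TEMPLATE)


def make_mrt(word):
    """Embed the tested word in the carrier phrase."""
    return MRT_CARRIER.format(word)


rng = random.Random(2005)  # fixed seed so the sentence list is repeatable
sentences = [make_sus(rng) for _ in range(3)]
```

Because each slot is filled independently, the result is grammatical but gives listeners no semantic context from which to guess a word.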
Challenge setup: Systems
6 systems: (random ID A-F)
  CMU
  Delaware
  Edinburgh (UK)
  IBM
  MIT
  Nitech (Japan)
Plus 1: “Team Recording Booth” (ID X)
  Natural examples from the 4 voice talents
Challenge setup: Voices
CMU ARCTIC databases
  American English; 2 male, 2 female
  2 from initial release: bdl (m), slt (f)
  2 new DBs released for quick build: rms (m), clb (f)
Challenge setup: Listeners
Three listener groups:
  S – speech synthesis experts (50)
    10 requested from each participating site
  V – volunteers (60, 97 registered*)
    Anyone online
  U – native US English speaking undergraduates (58, 67 registered*)
    Solicited and paid for participation
*as of 4/14/05
Challenge setup: Interface
Entirely online
  http://www.speech.cs.cmu.edu/blizzard/register-R.html
  http://www.speech.cs.cmu.edu/blizzard/login.html
Register/login with email address
Keeps track of progress through tests
Can stop and return to tests later
Feedback questionnaire at end of tests
Results
Fortunately, Team X is clear “winner”

Listener type S       Listener type V       Listener type U
MOS       type-in     MOS       type-in     MOS       type-in
X - 4.76  X - 8.5     X - 4.41  X - 10.3    X - 4.58  X - 7.3
D - 3.19  D - 14.7    D - 3.02  D - 17.1    D - 3.06  D - 16.3
E - 3.11  B - 15.0    E - 2.83  A - 19.7    E - 2.83  A - 19.3
C - 2.91  A - 17.4    B - 2.66  B - 20.3    B - 2.67  B - 19.6
B - 2.88  E - 20.6    C - 2.48  E - 25.0    C - 2.42  E - 21.7
F - 2.15  C - 22.5    F - 2.07  C - 25.6    A - 2.00  C - 22.8
A - 2.07  F - 32.7    A - 1.98  F - 41.8    F - 1.98  F - 35.2
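
The figures above can be checked mechanically. The sketch below copies the table into dictionaries and tests three readings of it: which synthesizer ranks best in each column, whether the experts (S) give higher MOS than the other groups, and whether they make fewer type-in errors. It assumes lower type-in figures are better (an error rate); variable and function names are illustrative.

```python
# Scores copied from the results table, keyed by listener type then system ID.
MOS = {
    "S": {"X": 4.76, "D": 3.19, "E": 3.11, "C": 2.91, "B": 2.88, "F": 2.15, "A": 2.07},
    "V": {"X": 4.41, "D": 3.02, "E": 2.83, "B": 2.66, "C": 2.48, "F": 2.07, "A": 1.98},
    "U": {"X": 4.58, "D": 3.06, "E": 2.83, "B": 2.67, "C": 2.42, "A": 2.00, "F": 1.98},
}
TYPE_IN = {
    "S": {"X": 8.5, "D": 14.7, "B": 15.0, "A": 17.4, "E": 20.6, "C": 22.5, "F": 32.7},
    "V": {"X": 10.3, "D": 17.1, "A": 19.7, "B": 20.3, "E": 25.0, "C": 25.6, "F": 41.8},
    "U": {"X": 7.3, "D": 16.3, "A": 19.3, "B": 19.6, "E": 21.7, "C": 22.8, "F": 35.2},
}
SYNTHESIZERS = "ABCDEF"  # X is the natural-speech reference, not a synthesizer


def best_synthesizer(scores, higher_is_better):
    """Return the ID of the best-scoring synthesizer, ignoring natural speech."""
    pick = max if higher_is_better else min
    return pick(SYNTHESIZERS, key=lambda system: scores[system])


best_mos = {group: best_synthesizer(MOS[group], True) for group in MOS}
best_type_in = {group: best_synthesizer(TYPE_IN[group], False) for group in TYPE_IN}

# Experts rate every system, natural speech included, above the other groups.
experts_rate_highest = all(
    MOS["S"][s] > max(MOS["V"][s], MOS["U"][s]) for s in SYNTHESIZERS + "X"
)
# Experts make fewer type-in errors than both other groups on every synthesizer.
experts_lowest_error = all(
    TYPE_IN["S"][s] < min(TYPE_IN["V"][s], TYPE_IN["U"][s]) for s in SYNTHESIZERS
)
```

Natural speech (X) is left out of the last check on purpose: for X the undergraduates (7.3) score below the experts (8.5).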
Team D consistently outperforms others
Speech experts are biased “optimistic”
Speech experts are in fact experts: lower type-in error than V and U on every synthesized system
Voice results: Listener preference

slt is most liked, followed by rms

Type S:


Type V:


slt - 50% of votes cast; rms - 28.26%
Type U:


slt - 43.48% of votes cast; rms - 36.96%
slt - 47.27% of votes cast; rms - 34.55%
But preference does not necessarily match test performance…
Voice results: Test performance
Female voices - slt

                  all sys-MOS   natural-MOS   all sys-type-in   natural-type-in
Listener type S   rms - 3.233   bdl - 4.827   rms - 10.5        rms - 3.2
                  clb - 3.154   rms - 4.809   clb - 16.0        clb - 9.3
                  slt - 2.994   slt - 4.738   slt - 20.8        bdl - 9.4
                  bdl - 2.941   clb - 4.690   bdl - 22.7        slt - 11.3
Listener type V   clb - 2.946   rms - 4.568   rms - 14.0        rms - 3.8
                  rms - 2.894   clb - 4.404   clb - 17.1        bdl - 12.0
                  slt - 2.884   bdl - 4.382   slt - 25.2        slt - 12.0
                  bdl - 2.635   slt - 4.296   bdl - 29.3        clb - 13.1
Listener type U   clb - 2.987   slt - 4.611   clb - 11.9        slt - 5.9
                  slt - 2.930   clb - 4.587   slt - 17.5        clb - 5.9
                  rms - 2.873   rms - 4.584   rms - 17.6        rms - 8.8
                  bdl - 2.678   bdl - 4.551   bdl - 28.7        bdl - 9.1
Voice results: Test performance
Female voices - clb
Voice results: Test performance
Male voices - rms
Voice results: Test performance
Male voices - bdl
Voice results: Natural examples

Listener type S           Listener type V           Listener type U
MOS          type-in      MOS          type-in      MOS          type-in
bdl - 4.827  rms - 3.2    rms - 4.568  rms - 3.8    slt - 4.611  slt - 5.9
rms - 4.809  clb - 9.3    clb - 4.404  bdl - 12.0   clb - 4.587  clb - 5.9
slt - 4.738  bdl - 9.4    bdl - 4.382  slt - 12.0   rms - 4.584  rms - 8.8
clb - 4.690  slt - 11.3   slt - 4.296  clb - 13.1   bdl - 4.551  bdl - 9.1
What makes natural rms different?
Voice results: By system
Only system B consistent across listener types (slt best MOS, rms best WER)
Most others showed group trends (with exception of B above and F*), i.e.
  S: rms always best WER, often best MOS
  V: slt usually best MOS, clb usually best WER
  U: clb usually best MOS and always best WER
Again, people clearly don’t prefer the voices they most easily understand
Lessons
Lessons learned: Listeners
Reasons to exclude listener data:
  Incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
Type-in tests very hard to process automatically:
  Homophones, misspellings/typos, dialectal differences, “smart” listeners
Group differences:
  V most variable, U most controlled, S least problematic but not representative
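
One way to reduce the processing problems listed above is to normalize each typed response before scoring it. The sketch below is an assumption about how that could be done, not the procedure used in the challenge: the homophone table is a hypothetical stub, and the length tolerance is an arbitrary illustrative value.

```python
import re

# Hypothetical homophone table; a real one would be built from the test vocabulary.
HOMOPHONES = {"their": "there", "two": "to", "too": "to"}


def normalize(response):
    """Lowercase, drop punctuation, and map homophones to one canonical spelling."""
    words = re.findall(r"[a-z']+", response.lower())
    return [HOMOPHONES.get(word, word) for word in words]


def is_usable(response, expected_len, tolerance=3):
    """Flag empty or wildly off-length responses for exclusion or manual review."""
    words = normalize(response)
    return bool(words) and abs(len(words) - expected_len) <= tolerance


cleaned = normalize("The cat went TOO far.")
```

Misspellings and dialectal variants are harder: they would need a spelling-tolerant match or hand checking, which this sketch does not attempt.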
Lessons learned: Test design
Feedback re tests:
  MOS: Give examples to calibrate scale (ordering schema); use multiple scales (lay-people?)
  Type-in: Warn about SUS; hard to remember SUS; words too unusual/hard to spell
Uncontrollable user test setup
Pros & Cons to having natural examples in the mix
  Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
Goals Revisited
One methodology clearly outshone the rest
All systems used the same data, allowing for actual comparison of systems
Standard for repeatable evaluations in the field was established
[My goal:] Brought attention to need for better speech synthesis evaluation (while positioning CMU as the experts)
Future
For the Future
(Bi-)Annual Blizzard Challenge
  Introduced at Interspeech 2005 special session
Improve design of tests for easier analysis post-evaluation
Encourage more sites to submit their systems!
More data resources (problematic for the commercial entities)
Expand types of systems accepted (& therefore test types)
  e.g. voice conversion