Creating a Voice for Festival

Presentation by Matthew Hood
Supervisors: S. Bangay
A. Lobb
Voice: cmu_uk_rab_diphone
Presentation Overview



About the project
Festival
About Text to Speech




Making a voice



3 layer approach
Waveform Generation
Languages, phones and diphones
Recording Diphones
Labelling
Results
About the Project





Text to speech programs have been around for many
years without much excitement.
Many new applications have arisen, sparking new
interest.
One of the factors limiting its usefulness is the limited
number of voices (fewer than 10?)
Creating a voice is a long, tedious process. But a
greater problem is the lack of documentation.
This project aims to give a comprehensive overview
of how to make a voice in Festival, pointing out all the
pitfalls ahead of time.
Festival




Festival is an open source TTS system
developed at the University of Edinburgh in
the late 90s.
“It offers a free, portable, language
independent, run-time speech synthesis
engine for various platforms under various
APIs.” [Black et al]
Supported by the FestVox toolkit.
Documented in “Building Synthetic Voices”
[Black et al]
General Text to Speech

Text Analysis
Words and Utterances identified.

Linguistic Analysis
Words analysed in context and pronunciation
generated, e.g. whether “1990” is read as a year or
as a quantity.

Waveform Generation
Utterances turned into sound and the words
“spoken”. Due to abstraction from the previous
layers, this is the only layer where the voice is
used.
Waveform Generation


Festival is a concatenative synthesis system.
This means sound clips are joined together to
generate speech, e.g. talking clocks.
Recorded Sound set
“The time is”; “past”; “o’clock”; numbers etc.
Generated Output
“The time is” – “half” – “past” – “three”.
Voice: cmu_us_kal_diphone
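The talking clock idea can be sketched in a few lines of Python. This is only an illustration of concatenation, not Festival code: the `CLIPS` table and the `concatenate` function are hypothetical names, and short lists of numbers stand in for recorded audio samples.

```python
# Hypothetical clip inventory for a talking clock. Each key is a recorded
# phrase; each value stands in for that recording's audio samples.
CLIPS = {
    "the time is": [0.1, 0.2],
    "half": [0.3],
    "past": [0.4, 0.5],
    "three": [0.6],
}


def concatenate(phrases, clips=CLIPS):
    """Join the recorded clips for the given phrases into one sample list."""
    output = []
    for phrase in phrases:
        if phrase not in clips:
            # A concatenative system can only say what has been recorded.
            raise KeyError(f"no recording for {phrase!r}")
        output.extend(clips[phrase])
    return output


samples = concatenate(["the time is", "half", "past", "three"])
```

Asking for a phrase that was never recorded (say "quarter") fails, which is exactly the limitation that motivates breaking speech into smaller units.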
Waveform Generation



For a more general system it is not feasible to
record everything that could be said.
Speech needs to be broken down into
smaller units.
A phone is a single phonetic sound that is
generated by a human when speaking.
eh - get ; feather
s - sit ; mass
zh - vision ; casual
Languages




A language is defined by its phoneme set.
A phoneme set is a collection of every
phonetic sound used in any word in the
language (including silence).
US English phoneset used in Festival has 44
phones.
BUT it is not enough to record every phone in
the phoneset.
Diphones




We do not always pronounce a phone the
same way.
Its pronunciation depends on its neighbouring
phones. This is known as the co-articulatory
effect.
Festival relies on the simplifying assumption
that the co-articulatory effect does not extend
across more than a pair of phones.
Such pairs of adjacent phones are known as diphones.
Diphones


By combining recorded diphones, we can
now “say” any word in the language.
E.g. Jack - jh-ae-k
__ - jh
jh - ae
ae - k
k - __
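The "Jack" example can be reproduced with a small Python sketch. The function name `phones_to_diphones` is made up for illustration, and the `__` silence marker simply follows the notation on the slide.

```python
SILENCE = "__"  # silence marker, written as on the slide


def phones_to_diphones(phones):
    """Return the diphone sequence for a word, padded with silence."""
    padded = [SILENCE] + list(phones) + [SILENCE]
    # Each diphone is a pair of neighbouring phones.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


jack = phones_to_diphones(["jh", "ae", "k"])
print(jack)  # ['__-jh', 'jh-ae', 'ae-k', 'k-__']
```

A word of n phones therefore needs n + 1 diphones once the leading and trailing silences are included.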
Recording Diphones


Because of the co-articulatory effect, it is
nearly impossible to pronounce a diphone
accurately on its own.
Using made up words is preferable to using
real words.
us_006 “pau t aa k aa k aa pau” - “k-aa” “aa-k”
us_603 “pau t aa t ey ah t aa pau” - “ey-ah”
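As a rough illustration of how a nonsense prompt carries its target diphones, the Python sketch below lists every position in a prompt where a given diphone occurs. The helper `find_diphone` is hypothetical. The slides do not say which occurrence Festival actually extracts, so the sketch reports all of them.

```python
def find_diphone(prompt, diphone):
    """Return the indices of the first phone of each occurrence of `diphone`."""
    phones = prompt.split()
    first, second = diphone.split("-")
    return [
        i
        for i in range(len(phones) - 1)
        if phones[i] == first and phones[i + 1] == second
    ]


us_006 = "pau t aa k aa k aa pau"
us_603 = "pau t aa t ey ah t aa pau"

print(find_diphone(us_006, "k-aa"))   # [3, 5]
print(find_diphone(us_006, "aa-k"))   # [2, 4]
print(find_diphone(us_603, "ey-ah"))  # [4]
```

The surrounding `t aa` syllables act as a carrier, so the target diphone is spoken in the middle of a word-like sequence rather than in isolation.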
Recording Diphones




In theory the number of diphones needed to
speak a language is the number of phones
squared.
But not every combination actually occurs in speech.
The standard US diphone list used by Festival
contains 1396 diphones.
It is often worth extending this list to take into
account strong accents or common foreign
words.
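The gap between the theoretical and the actual diphone count can be checked with a little arithmetic. The sketch below uses only the two figures quoted in this presentation (44 phones, 1396 listed diphones).

```python
phones = 44    # size of the US English phoneset quoted earlier
listed = 1396  # diphones in the standard US list

theoretical = phones ** 2       # every ordered pair of phones
coverage = listed / theoretical

print(theoretical)        # 1936
print(f"{coverage:.1%}")  # 72.1%
```

So the standard list covers roughly three quarters of the theoretical pairs; the rest are combinations that the list does not include.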
Recording Diphones

Because pronouncing the words can be a bit
tricky, especially the first few times you try,
FestVox provides a prompting tool.
Recording Environment





The better the recording, the better the voice.
With a decent sound card it is possible to
record straight onto the PC.
Background noise must be kept to a
minimum.
It takes approximately 1.5 hours to record all
diphones.
The environment must be repeatable.
Labelling




Labelling is the hardest and one of the most
important parts of creating a voice.
A label file consists of a series of boundary
times.
Emu label is an open source program that
graphically shows where in the wave file the
phones are marked.
It is part of the Emu Speech Tools available on
SourceForge.
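To make "a series of boundary times" concrete, here is a minimal Python sketch that reads a label file. The exact layout is an assumption, not something stated on the slides: header lines ending with a line containing only `#`, then one boundary per line as end time in seconds, a numeric field, and the phone name. The sample text and the names `parse_labels` and `is_monotonic` are hypothetical.

```python
# Hypothetical label text in the assumed layout (times are made up).
SAMPLE = """separator ;
nfields 1
#
0.400 26 pau
0.520 26 t
0.700 26 aa
"""


def parse_labels(text):
    """Parse label text into (end_time, phone) pairs."""
    lines = [line.strip() for line in text.splitlines()]
    if "#" in lines:
        # Skip the header, which ends at the line containing only '#'.
        lines = lines[lines.index("#") + 1:]
    labels = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            labels.append((float(fields[0]), fields[2]))
    return labels


def is_monotonic(labels):
    """Check that boundary times strictly increase, a basic sanity test."""
    return all(a < b for (a, _), (b, _) in zip(labels, labels[1:]))


labels = parse_labels(SAMPLE)
print(labels)  # [(0.4, 'pau'), (0.52, 't'), (0.7, 'aa')]
```

A check like `is_monotonic` catches only gross errors; whether a boundary sits in the right place in the waveform still has to be judged by eye and ear.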
Hand Labelling



Displays phones, frequency and waveform.
Sound extracted from mid point of labels.
Worth moving further into the phone when recording eh-__.
Us-0603 “ey-ah”
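One way to read "sound extracted from mid point of labels" is that a diphone runs from the middle of its first phone to the middle of its second. The Python sketch below computes those cut points under that reading. The function `diphone_span` and the boundary times are hypothetical, and Festival's own extraction may differ in detail.

```python
def diphone_span(labels, index, utterance_start=0.0):
    """Return (start, boundary, end) times for phones index and index + 1.

    `labels` is a list of (end_time, phone) pairs in seconds.
    """
    ends = [end for end, _ in labels]
    starts = [utterance_start] + ends[:-1]

    def midpoint(i):
        return (starts[i] + ends[i]) / 2

    # Cut from the middle of the first phone to the middle of the second;
    # the phone boundary itself lies between the two cut points.
    return midpoint(index), ends[index], midpoint(index + 1)


# Made-up boundary times for the start of a prompt.
labels = [(0.40, "pau"), (0.52, "t"), (0.70, "aa"), (0.82, "k")]
start, boundary, end = diphone_span(labels, 1)  # the "t-aa" diphone
print(round(start, 3), round(boundary, 3), round(end, 3))
```

Under this reading a misplaced label shifts the cut points, which is why labelling accuracy has such a direct effect on the voice.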
Auto Labelling - results




FestVox provides an auto labeller.
1.6% failure rate.
8 – 15% error rate.
70% usable diphones (400+ needed hand
correction).
Auto labeller




Test, test and retest.
Created splittest.pl
Hand label any problem phones.
Remove DB markers.
Finishing voice






Once happy with the labels:
Optional pitchmark extraction.
Volume levelling.
Load the voice into Festival and test with
actual speech.
Build final voice database.
Create symbolic link.
What I have learnt & achieved






Learnt a lot about speech and speech synthesis.
Learnt a lot about Linux and sound editing.
Created a number of variations of
ru_us_matt_diphone, used to test different labelling
methods, how recordings affect results etc.
Final paper giving a step-by-step guide and helpful
hints.
There is much room for future work, including voice
adaptation.
Am sick of the sound of my own voice.
Voice: ru_us_matt_diphone