
PROSODIC ANALYSIS OF INDIAN
LANGUAGES AND ITS APPLICATIONS TO
TEXT TO SPEECH SYNTHESIS
A THESIS
submitted by
RAGHAVA KRISHNAN K
for the award of the degree
of
MASTER OF SCIENCE
(by Research)
DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS.
JULY 2015
THESIS CERTIFICATE
This is to certify that the thesis titled PROSODIC ANALYSIS OF INDIAN
LANGUAGES AND ITS APPLICATIONS TO TEXT TO SPEECH
SYNTHESIS, submitted by Raghava Krishnan K, to the Indian Institute of
Technology, Madras, for the award of the degree of Master of Science, is a bona
fide record of the research work done by him under our supervision. The contents
of this thesis, in full or in parts, have not been submitted to any other Institute
or University for the award of any degree or diploma.
Prof. S. Umesh
Research Guide
Dept. of Electrical Engineering
IIT-Madras, 600 036
Place: Chennai
Date: 26th July, 2015
Prof. Hema A. Murthy
Research Guide
Dept. of Computer Science and
Engineering
IIT-Madras, 600 036
ACKNOWLEDGEMENTS
It would not have been possible to complete this thesis without the contribution
of several people. I would like to express my gratitude to my advisor Prof. Hema
A. Murthy for her guidance and unwavering support. She has been a constant
source of encouragement and guidance and has played a major role in instilling
confidence in me as a researcher. My interactions with her over the last five years
have not only shaped my outlook on research, but on life as well. Her unabated
energy and enthusiasm for research are qualities I truly admire and can only aspire
to emulate.
I am grateful to my co-advisor Prof. S. Umesh and the members of the GTC
Committee, Prof. C. S. Ramalingam and Prof. C. Chandra Sekhar for their
insightful comments and suggestions with respect to my thesis. I would also like
to express my gratitude to Prof. Kishore Prahallad for his valuable suggestions
and criticism on various tasks that I undertook.
I would like to thank Jom, Anusha, Aswin, Kasthuri, Akshay, Shreya and other
members of Microsoft lab and Donlab for their support and encouragement over
the years and for helping me conduct numerous listening tests. I would like to
thank Anjana Babu, in particular, for having played an invaluable part in the
initial work that we did on prosodic analysis.
Lastly, I would like to thank my parents, my sister Jananie and Aunt Usha
Rani for their unreserved support and encouragement. Knowing that I have them
behind me has always made my life so much easier and has given me the strength
and courage to pursue any path I wish to choose.
ABSTRACT
KEYWORDS: Syllable-based; Prosody; Pruning; Prosodic phrasing; Structural similarity; Rhythmic similarity.
Synthesis of natural sounding speech for Indian languages has been a challenging
task in the field of text-to-speech synthesis over the past few years. The quality
of synthesised speech mainly suffers due to the presence of artifacts owing to the
mismatch in the acoustic properties both at segmental and suprasegmental levels.
These artifacts affect the naturalness and intelligibility of speech, which in turn is
reflected in the poor mean opinion scores on listening tests. In this thesis, methods
have been proposed to improve the quality of speech synthesis by correcting errors
in the speech database, and by manipulating the prosody of utterances to suit the
given context.
Predicting prosody for text to speech synthesisers is heavily dependent on the
punctuation marks present in the text and the part of speech (POS) tags of the
words in the text. Therefore, incorporating the appropriate prosody for a given
text in a text to speech synthesis system, especially for Indian languages that
seldom have punctuation marks and do not have effective methods of part of
speech tagging, is a challenging task.
Prosody in speech is characterised by rhythm, stress and intonation, and is
primarily a suprasegmental feature. Suprasegmental refers to unit levels above
the phoneme such as syllables, words, phrases etc. In this work, we refer to features
at the syllable level as segmental features because syllables are the preferred units
for synthesis in syllable-timed Indian languages. Suprasegmental therefore refers
to levels above the syllable.
At the segmental level, ‘bad’ units are discarded from the database using the
acoustic properties of syllable units. This process is called pruning. This ensures
that acoustic continuity is maintained in the database and segmentation errors are
also corrected. This method results in a considerable improvement in the quality
of synthesis. Additionally, using the units remaining after pruning to initialise
hidden Markov models when building a statistical parametric speech synthesiser
is also helpful.
At the suprasegmental level, an analysis is conducted to understand the factors
that affect the tones and breaks in a spoken utterance. A method to predict
prosodic phrase breaks using cues from the text is proposed. The synthesis quality
obtained from this system is superior to that of a system without prosodic phrase
break prediction.
The role played by the structure of a phrase on the prosody of an utterance is
also analysed. A new measure called structural similarity, which attempts to correlate two phrases based on the structure of the text present in them, is presented.
Structural similarity is also used to define a modified cost measure to select units
to synthesise speech in a syllable-based unit selection speech synthesiser.
Further, the effect of syllable rhythm on the prosody of a spoken utterance is
studied. A measure called rhythmic similarity that correlates two phrases based on
their syllabic rhythm patterns is proposed. This analysis shows that rhythmically
similar phrases show similarities in prosodic characteristics.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 Introduction
  1.1 Overview of Thesis
  1.2 Organisation of the thesis
  1.3 Contribution of the thesis

2 Theoretical Background of USS and HTS
  2.1 Description of Work done on USS for Indian languages
    2.1.1 Training
    2.1.2 Pre-clustering
    2.1.3 Fallback Units
    2.1.4 Synthesis Phase
  2.2 Description of Work done on HTS for Indian languages
    2.2.1 Training
    2.2.2 Synthesis
  2.3 Related Previous Work
    2.3.1 Previous work on maintaining consistency in speech databases and pruning them
    2.3.2 Previous work on Prosodic phrase break prediction
    2.3.3 Related work on prosody prediction
    2.3.4 Related work on speech rhythm
  2.4 Summary

3 A method to prune speech databases to improve the quality of Indian TTSes
  3.1 Pruning Technique for USS systems
    3.1.1 Training USS systems
  3.2 Preliminary Experiments and Results
  3.3 Using pruning to improve the quality of phone-based HTS systems
    3.3.1 Results of Listening tests conducted to evaluate effect of pruning on HTS system
  3.4 Pruning speech databases using likelihood as an additional cue
    3.4.1 Results of Listening tests conducted to evaluate the pruning using likelihood approach
  3.5 Summary

4 Prosodic Phrase Break Prediction
  4.1 Importance of Prosodic Phrase Break Prediction
  4.2 Challenges faced in prosodic phrase break prediction for Indian languages
    4.2.1 Lack of Punctuation and POS taggers
    4.2.2 Agglutinative nature of Indian languages
    4.2.3 Low resourcedness of Indian languages
  4.3 Case Markers for Prosodic phrasing
  4.4 Word terminal syllables for prosodic phrasing
  4.5 Experiments and Results
  4.6 Summary

5 Analysing the Effects of Phrase Structure and Syllable Rhythm on the Prosody of Syllable-Timed Indian Languages
  5.1 Structural Similarity
    5.1.1 Transplantation
    5.1.2 Application of structural similarity to USS
    5.1.3 Experiments and Results
    5.1.4 Summary
  5.2 Rhythmic Similarity
    5.2.1 Data Preparation
    5.2.2 Rhythmic Similarity and Transplantation
    5.2.3 Experiments and Results
    5.2.4 Summary

6 Conclusion
  6.1 Summary
  6.2 Criticism of the thesis
  6.3 Scope for future work
LIST OF TABLES

2.1 Language databases used
2.2 Small portion of the Common Label Set illustrating how a few examples have been mapped to their roman character equivalent
3.1 Pairwise comparison tests for Hindi
3.2 Pairwise comparison tests for Tamil
3.3 Pairwise comparison tests for Hindi and Tamil to evaluate performance of the HTS system after initialising using pruned models
3.4 Pairwise comparison test results for Hindi and Tamil to observe the performance of systems that use likelihood as a criterion
4.1 Probabilities of Hindi case markers and Tamil word-terminal syllables (along with their notation in common label set format [1]) being followed by phrase breaks
4.2 Results of Pairwise comparison tests for Hindi and Tamil USS to compare systems with and without prosodic phrasing
5.1 Similarity to the original utterance scores for Hindi and Tamil
5.2 Results of DMOS and WER tests for Hindi
5.3 Results of DMOS and WER tests for Tamil
5.4 Similarity to the original utterance listening test results
5.5 DTW distances between pitch and energy contours of Hindi phrases
5.6 DTW distances between pitch and energy contours of Tamil phrases
LIST OF FIGURES

2.1 Plot showing the long-tailed distribution of syllables in two languages - Hindi and Tamil
2.2 Waveform segmented at the syllable level
2.3 Flowchart of the hybrid segmentation algorithm
2.4 An example portion of a CART for the unit sabeg
2.5 Overview of an HMM-based TTS
2.6 (a) Histogram of duration difference between a pair of adjacent syllables in the database (b) Histogram of average f0 difference between a pair of adjacent syllables in the database (c) Histogram of average energy difference between a pair of adjacent syllables in the database
3.1 Example waveform with artifact
3.2 Example waveform and transcription with a segmentation error
3.3 Distribution of acoustic parameters for syllable /see/_end
3.4 Comparison between syllable segments obtained using the 2 approaches
4.1 An example of Prosodic hierarchy
4.2 A portion of the CART tree used for predicting phrase breaks for Hindi
4.3 A portion of the CART tree used for predicting phrase breaks for Tamil
5.1 (a) Pie chart showing the number of syllables belonging to each type structure for Hindi (b) Pie chart showing the number of syllables belonging to each type structure for Tamil
5.2 (A) Waveform, pitch and energy contours of a Hindi phrase (B) Waveform, pitch and energy contours of a Hindi phrase which is structurally similar to (A) (C) Waveform, pitch and energy contours of the phrase obtained by transplanting the prosodic contour of (B) on (A)
5.3 Plot showing the range of scores for Similarity to the original utterance tests for Hindi and Tamil
5.4 Selecting a structurally similar phrase from the database
5.5 (a) Similarity to the original utterance scores for Hindi (b) Similarity to the original utterance scores for Tamil
5.6 (a) Correlation between duration of phrase and number of syllables per phrase for Hindi (b) Correlation between duration of phrase and number of syllables per phrase for Tamil
5.7 (A) Pitch and energy contours of rhythmically similar phrases of Hindi (B) Pitch and energy contours of rhythmically dissimilar phrases of Hindi
5.8 (A) Pitch and energy contours of rhythmically similar phrases of Tamil (B) Pitch and energy contours of rhythmically dissimilar phrases of Tamil
5.9 (A) Waveform, pitch contour and energy contour of a Hindi phrase (B) Waveform, pitch contour and energy contour of the waveform in (A) when a rhythmically similar Hindi phrase is transplanted on it (C) Waveform, pitch contour and energy contour of the waveform in (A) when a rhythmically dissimilar Hindi phrase is transplanted on it
5.10 (a) Similarity to the original utterance scores for Hindi (b) Similarity to the original utterance scores for Tamil
5.11 (a) DTW alignment between f0 contours of 2 rhythmically similar phrases (b) DTW alignment between f0 contours of 2 rhythmically dissimilar phrases
5.12 (a) DTW alignment between energy contours of 2 rhythmically similar phrases (b) DTW alignment between energy contours of 2 rhythmically dissimilar phrases
ABBREVIATIONS

TTS      Text to Speech
USS      Unit Selection Speech Synthesis
HMM      Hidden Markov Model
CDHMM    Context Dependent HMM
GPMF     Global Prosodic Mismatch Function
MSDHMM   Multi-Space Probability Distribution HMM
HTS      HMM-based Speech Synthesis System
CART     Classification and Regression Trees
STE      Short-Term Energy
MOS      Mean Opinion Score
DMOS     Degradation MOS
WER      Word Error Rate
ToBI     Tones and Break Indices
DTW      Dynamic Time Warping
CHAPTER 1
Introduction
Text-to-speech synthesis (TTS), as the name suggests, is the process of converting
text input to speech output. The main focus of TTS research in the recent past
has been to make synthetic speech sound more natural. Since the input to the TTS
is only text, the challenge in the system building part of the TTS is finding ways
of extracting appropriate information from text that can be realised acoustically
to make the synthesised speech output sound more natural and intelligible.
State-of-the-art high-quality unit selection speech synthesisers (USS) for Indian
languages have been built using the syllable as the basic unit. USS systems are
based on concatenating actual speech units from the database to synthesise speech.
The available literature shows that syllables are a better choice of sub-word unit
than phones and diphones for USS [2], [3]. The reasons why syllables are a better
choice of sub-word unit for speech synthesis can be summarised as follows:
• Indian languages belong to the category of syllable-timed languages [4].
• Syllables are the fundamental units of speech production [5].
• Syllables tend to capture the co-articulation between phonemes well.
• Being relatively large units, the use of syllables results in a reduction in the
number of concatenation points for speech synthesised using the concatenative speech synthesis framework.
• Syllable boundaries are regions of low energy because of which spectral discontinuities at concatenation points are not perceived.
A major consortium effort on USS for Indian languages is based on syllable-like
units [6]. Although syllable-based USS systems do perform well, their performance
in many cases is inconsistent. The synthesised speech seems to lack the flow,
continuity and rhythm of natural speech, and also suffers in terms of intelligibility
due to the introduction of various artifacts. The feedback obtained from various
listening tests conducted to evaluate systems was that the synthesised speech
lacked the prosody of natural speech.
Prosody plays a crucial role in making speech sound natural, and also in the
comprehension of the syntax and semantics of a spoken utterance. As a field,
the study of the prosody of Indian languages is still in its infancy. Therefore, the
main focus of this work is to critically analyse the factors that affect the prosody
of a spoken utterance.
Nooteboom in [7] defines prosody as the study of tone, melody and rhythm in
speech. Prosody serves semantic purposes and cannot be represented using just
the orthographic transcription and the sequence of phonemes. It has to be dealt
with at a level higher than that of phonemes (suprasegmental level) and has to be
studied more as phenomena caused by the combined effect of sub-word units.
Languages like English, Japanese, French etc. use rule-based approaches to
predict prosodic characteristics with certain cues from the text. These prosodic
characteristics which are called tones and break indices (ToBI), as described in [8],
mainly encompass intonation contours and different levels of breaks in a spoken
utterance. These methods have proven to be very successful in synthesising high-quality speech.
This work describes some of the efforts made at improving the synthesis quality of Indian language syllable-based TTS systems. Attempts have been made
to study the prosody of syllable-timed Indian languages. The work is directed
towards minimising errors in the speech database and extracting appropriate features from the input text that can be used to predict the prosody, and in turn
improve the quality of speech synthesis.
Statistical parametric speech synthesis using hidden Markov models (HMMs)
[9] has gained a lot of popularity in the recent past. These systems differ from USS
systems in that they model sub-word speech units and do not concatenate actual
speech waveforms. Speech is synthesised by generating speech waveforms from the
models built using context information. These systems use the context-dependent
monophone as the basic sub-word unit. The automatic speech segmentation algorithm used to obtain syllable segments is also used to segment speech waveforms
at the monophone level. The performance of these systems is again heavily dependent on the accuracy of segmentation. Techniques to improve the quality of
synthesis of these systems have been dealt with in this thesis in addition to the
work done on USS.
1.1 Overview of Thesis
Synthesised speech usually suffers from many artifacts such as spikes, overlaps,
sudden variations in acoustic properties of units across the synthesised utterance,
buzziness etc. These artifacts are caused by the effects of errors in segmentation,
inconsistencies in recording, poor prosody prediction, etc. These artifacts are
usually introduced into the speech waveform due to errors at the sub-word unit
level. Artifacts, when present in the synthesised speech output, cause a degradation
in the system’s performance.
Various approaches to reduce artifacts in synthesised speech have already been
proposed in [10], [11], [12], [13], [14] and [15]. These approaches use acoustic cues
such as average f0 , average short-term energy (STE) and duration of units to
prune outlier units from the database. Along with these features, the approach in
this work uses likelihood obtained from the forced Viterbi alignment step as an
additional feature for pruning.
Artifacts can also be introduced in synthesised speech due to erroneous segmentation. Segmentation of the speech waveforms in the case of speech synthesis
has to be very precise. The accuracy of segmentation has a direct effect on the
quality of speech synthesis. Small errors in segmentation can lead to degradation
in quality due to the co-articulation in speech. Algorithms that produce precise
syllable segments are therefore of great importance in building a TTS system. Although the algorithm used to segment the speech waveform in this thesis is very
accurate, there still are cases where there are errors in segmentation. A method
has been proposed in this thesis by which such segmentation errors can be
corrected using the acoustic cues of individual units.
Prosody modeling for Indian languages is a particularly hard task. Prosody
models for many languages are based on predicting tones and break indices (ToBI)
based on a set of rules. Accents are first predicted using the punctuations and
part of speech (POS) tags of the text. The rules are then used to predict ToBI
using these predicted accents. ToBI have been widely used in various speech
applications and have been found to enhance the performance of these applications
considerably.
Developing rule-based methods for Indian languages is a hard task due to the
lack of information present in the text. Indian languages are rarely punctuated
except for the punctuations denoting the end of a sentence. Punctuations such
as commas denoting prosodic phrase breaks are usually absent. Part of speech
tags are additional cues from the text which are very useful in predicting prosodic
phrase breaks, and tools to POS tag Indian language text are still not completely
effective. These breaks have to therefore, be predicted using cues directly from
the text. A method has been proposed in this thesis that uses word-level features
of the text to predict prosodic phrase breaks. The rules to predict prosodic phrase
breaks are learnt from the text using classification and regression trees (CART).
While predicting breaks, which have more to do with the rhythmic aspect of
speech [16], has been successful, predicting the tonal aspects of speech still remains
a challenging task. The absence of punctuation and POS tags again proves to be
a major hurdle for this task. A part of the work in this thesis aims at correlating
the structure of a phrase with prosody. A new measure called structural similarity
which measures the similarity in structure between two phrases is defined. It was
observed that transplanting prosody between two structurally similar phrases did
not degrade the naturalness and intelligibility significantly. However, this measure
was not effective when used as an additional criterion in the cost measure to select
units for a syllable-based USS system.
Further analysis has shown that syllabic rhythm can also be used to correlate
the prosody between two phrases. A new measure called rhythmic similarity has
been defined which shows that rhythmically similar phrases exhibit similarities
in prosodic characteristics as compared to rhythmically dissimilar phrases. This
criterion has however not yet been incorporated in the synthesis paradigm.
1.2 Organisation of the thesis
The rest of the thesis is organised as follows. Chapter 2 gives a theoretical background of the popular TTS paradigms. Chapter 3 describes the technique used
to minimise database errors and prune outlier units from the speech database.
The subsequent chapters deal with modeling suprasegmental prosody. Chapter 4
describes the method used to predict prosodic phrases using cues from the text.
Chapter 5 describes the work done on structural similarity and rhythmic similarity. Chapter 6 concludes the work, discusses issues in the proposed methods and
prospects for future work.
1.3 Contribution of the thesis
The following are the major contributions of this research thesis:
• Acoustically inconsistent units are those which have acoustic properties very
different from the rest of the units of their class in the database. A method
to discard these units through a process called pruning has been proposed.
• Improving the quality of HMM-based TTS (HTS) synthesis by correcting
segmentation errors.
• Development of textual cues for the prediction of prosodic phrase breaks.
• Analysis of structural similarity and its role in prosody.
• Analysing the role played by syllable rhythm in the prosody of an utterance.
CHAPTER 2
Theoretical Background of USS and HTS
Text to speech synthesis (TTS) is the artificial production of human speech by a
system which converts the input text into output speech. The challenge is to derive
maximum acoustic information from the text so that high-quality speech can be
synthesised. TTS systems comprise two parts: i) the text analysis part,
and ii) the speech synthesis part. There are various paradigms of speech synthesis
based on the method of waveform synthesis. Two paradigms have been mainly
dealt with in this thesis, and those have been described in this chapter. The two
paradigms of speech synthesis are Unit Selection Speech Synthesis Systems (USS)
and HMM-based speech synthesis systems (HTS). The various parts of these two
systems with respect to Indian languages have been described here. Analysis has
been carried out and systems built for two languages, Hindi and Tamil.
The data to build TTS systems has to be collected with great care. The
quality of the synthesised output depends very heavily on the quality of the data
collected. Therefore, as a first step, the training text was chosen very carefully and
longer words were avoided. Also the sentences chosen for recording were purely
declarative sentences. The text for speech recording was selected to maximize
syllable coverage. It was also ensured that as many aksharas1 as possible and all
monophones in the language were covered, which were the back-off units in the
absence of a syllable. The complete details of data collection are given in [17].
The speech was recorded in noise-free studio environment sampled at 48KHz with
a resolution of 16 bits per sample. The waveforms were downsampled to 16KHz
to build the USS systems. A native speaker of the language was chosen as the
voice artist. The amount of data collected and details of the speaker are given in
Table 2.1.
Table 2.1: Language databases used

Language | Hours of Data | Speaker
Hindi    | 6.45          | Male
Tamil    | 6             | Female

1. C*V* units, where C stands for consonant and V for vowel.
2.1 Description of Work done on USS for Indian languages
Unit selection is the simplest form of speech synthesis technology which is based
on the concatenation of simple sub-word units. The commonly chosen sub-word
unit to build a unit selection speech synthesiser is a phoneme. The sub-word unit
chosen in the case of Indian languages is the syllable. Syllables are units of the
form C*VC*, where C stands for consonant and V stands for a vowel. Therefore,
syllables are usually composed of at least one vowel and may or may not have
consonants preceding and/or succeeding the vowel. The reasons for choosing the
syllable as the sub-word unit are listed in Chapter 1.
The training phase of building any speech synthesiser involves organising the
speech database in such a way that selecting units or generating a speech waveform
is easier during the synthesis phase. The various processes involved in the training
and testing phase of a USS system are described below. The USS systems were
built using the Festival Speech synthesiser based on the FestVox platform.
2.1.1 Training
The training phase is when the speech database is organised into structures that
make retrieving the most suitable unit for a given context easy. The structures to
organise the speech database in this case, are called classification and regression
trees (CART). To build these structures, the text has to be broken down into its
sub-word unit representation. This is followed by the waveforms being segmented
at the syllable level corresponding to the sub-word unit representation. The CART
structures are built using linguistic, acoustic and phonetic features. The details
of the entire process are described below.
2.1.1.1 Letter to Sound Rules
Letter-to-sound rules are a set of rules that help break the given text into its
corresponding sub-word unit representation, i.e. convert the grapheme representation
to the phoneme representation, or the written form to the spoken form. Indian
languages being low-resourced, there is not enough transcribed text available
from which letter-to-sound rules can be derived automatically.
The letter to sound rules, in this case, are a set of handwritten rules. The rules
have been written for two different kinds of language sub-categories, for Aryan
and Dravidian languages. Using the set of rules for Aryan languages, the rules
can be directly adapted to any language under the Aryan language sub-category
and the same applies to the Dravidian languages. The major differentiation when
it comes to Aryan languages is the aspect of schwa (ə) deletion. The rest of the
rules for the two sub-categories are more or less the same.
Apart from the language-specific part, the rules have been written to break the
given text into syllables. This syllabic representation is then broken down into
aksharas or monophones. The system is trained using all three kinds of units for the
same set of sentences.
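As a rough illustration of the syllabification step described above, the sketch below groups a phone sequence into C*V(C*) units. The vowel inventory and the rule that trailing consonants attach to the final syllable are simplifying assumptions made for illustration; they are not the thesis's handwritten rules.

```python
# Minimal syllabification sketch; the vowel set and coda rule are assumptions.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "o", "oo", "ai", "au"}

def syllabify(phones):
    """Group phones into C*V(C*) syllables, attaching consonants as onsets."""
    syllables, onset = [], []
    for ph in phones:
        if ph in VOWELS:
            syllables.append(onset + [ph])   # buffered consonants form the onset
            onset = []
        else:
            onset.append(ph)
    if onset:
        if syllables:
            syllables[-1].extend(onset)      # word-final consonants become a coda
        else:
            syllables.append(onset)          # degenerate all-consonant input
    return ["".join(s) for s in syllables]

print(syllabify(["n", "a", "m", "a", "s", "k", "aa", "r"]))
# -> ['na', 'ma', 'skaar']
```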
2.1.2 Pre-clustering
After breaking the text into its sub-word unit representation, another set of rules
is applied to the syllables constituting each word to tag them based on their
position in the word and their occurrence in a geminate context. The reason
for retaining the positional context is that syllables occurring at different
positions within a word were observed to have different acoustic properties.
The tags used for positional context are begin, middle and end. Monosyllabic
words are also tagged with a begin tag.
It was also found necessary to tag syllables based on whether they occur in a
geminate context. This is because the segmentation algorithm was found to
segment syllables corresponding to geminates erroneously in many cases, as
syllables in geminate contexts are articulated very differently from those that
are not. Therefore, an extra context was added to indicate whether or not a
syllable belonged to a geminate context. During synthesis, care was taken to use
a syllable in its original context wherever possible.
2.1.3 Fallback Units
The syllable set for a language is finite but large in number. This can be seen
from Figure 2.1, in which we can observe that the distribution is long-tailed for
the two languages, Hindi and Tamil.

Figure 2.1: Plot showing the long-tailed distribution of syllables in two languages - Hindi and Tamil

In the event that a syllable is not present in the
database, fallback units are used. A three-level fallback is employed. When a
syllable from a particular positional context is not available, the system falls back
to the same syllable in a different positional context. This is the first level of
fallback. The second fallback level is aksharas. Aksharas are defined as the set of
C*V and C units. The aksharas are also tagged with the positional context tags
beg, mid and end. The third fallback level is monophones. In the absence of an akshara,
monophones are used. Since the database is designed to cover all monophones
of a language, the system is capable of synthesising any arbitrary text with this
three-level fallback.
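The fallback logic can be sketched as follows; `database`, `to_aksharas` and `to_monophones` are hypothetical helpers standing in for the actual CART lookup and letter-to-sound decomposition.

```python
# Sketch of the three-level fallback: requested positional-context syllable ->
# other positional contexts of the same syllable -> aksharas -> monophones.
POSITIONS = ("begin", "middle", "end")

def select_units(syllable, position, database, to_aksharas, to_monophones):
    # Level 0: the syllable in its requested positional context.
    if (syllable, position) in database:
        return [(syllable, position)]
    # Level 1: the same syllable in any other positional context.
    for pos in POSITIONS:
        if pos != position and (syllable, pos) in database:
            return [(syllable, pos)]
    # Level 2: aksharas (C*V and C units) covering the syllable.
    aksharas = to_aksharas(syllable)
    if all((a, position) in database for a in aksharas):
        return [(a, position) for a in aksharas]
    # Level 3: monophones, guaranteed to exist by the database design.
    return [(p, position) for p in to_monophones(syllable)]
```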
2.1.3.1 A hybrid approach to segmenting speech waveforms at the syllable, akshara and monophone level
Speech segmentation refers to the process by which a waveform of continuous
speech is segmented into the sub-word units it is composed of. Figure 2.2 shows a
waveform segmented at the syllable level. Speech segmentation in the context of
speech synthesis is very important, as the sub-word units need to be segmented
very precisely in order to obtain high-quality synthesis. Segmenting speech waveforms into their corresponding sub-word units is a very hard task as we need to
convert the given text into its corresponding sub-word unit representation and
correlate it with the articulatory properties of the waveform. Earlier, a semi-automatic labeling tool was employed to segment the waveform at the syllable level [18]. This was still prone to errors, and the labeling had to be corrected a number of times.

Figure 2.2: Waveform segmented at the syllable level

The manual intervention required makes using the syllable as a fundamental unit
for synthesis a tall order. In this work, an automatic speech segmentation algorithm is used which is found to give reasonably accurate syllable boundaries. This
is achieved by using the Hybrid Segmentation algorithm which employs Hidden
Markov Models (HMMs) in tandem with the group delay algorithm to segment
the speech waveform. The details of this algorithm are given in [19]. The overall
algorithm has been outlined in the flowchart given in Figure 2.3.
In the hybrid segmentation approach, HMM-based segmentation and group
delay (GD) based segmentation are performed iteratively to obtain accurate segmentation automatically. Parameters are tuned in GD based segmentation to
over-estimate the syllable boundaries. This results in many spurious boundaries,
but the correct boundaries are not misplaced.
Boundaries obtained from flat-start initialised embedded training of monophone
HMMs, followed by forced Viterbi alignment, are first given as input to this
algorithm. HMMs and the group delay algorithm are then used iteratively to
correct the syllable and monophone boundaries. The group delay boundaries in
the proximity of the boundaries
given by HMMs are considered as the correct boundaries for syllables. Flat-start
embedded training is performed on the monophones within each syllable, restricted
to that syllable, to obtain accurate monophone labels. This process is performed
iteratively to obtain accurate segmentation at the syllable and monophone levels.
Figure 2.3: Flowchart of the hybrid segmentation algorithm
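A high-level sketch of the iterative refinement is given below; `viterbi_align`, `group_delay_boundaries` and `retrain_within` are hypothetical helpers, and the snapping tolerance is an assumed value.

```python
def hybrid_segment(wav, transcript, hmms, viterbi_align,
                   group_delay_boundaries, retrain_within,
                   iterations=3, tolerance=0.05):
    """Iterative HMM + group-delay (GD) boundary refinement (a sketch)."""
    gd = group_delay_boundaries(wav)    # deliberately over-estimated boundaries
    boundaries = None
    for _ in range(iterations):
        hmm_bounds = viterbi_align(wav, transcript, hmms)
        # Snap each HMM boundary to the nearest GD boundary, if one is close.
        boundaries = []
        for b in hmm_bounds:
            nearest = min(gd, key=lambda g: abs(g - b))
            boundaries.append(nearest if abs(nearest - b) < tolerance else b)
        # Flat-start embedded re-training of the monophones, restricted to
        # each corrected syllable span.
        hmms = retrain_within(wav, transcript, boundaries, hmms)
    return boundaries, hmms
```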
Labels for Aksharas are obtained by concatenating the segments in the monophone label files corresponding to each Akshara. The accuracy of segmentation
is very crucial, and this algorithm was found to give reasonably
accurate labels. These labels are then used in the voice building process to build
CARTs.
2.1.3.2 Clustering Syllables
Any database used for building a text to speech synthesiser will have multiple
occurrences of each syllable. These syllables, therefore, have to be clustered to
help reduce the search space and the time complexity during synthesis. Linguistic,
acoustic and phonetic criteria are used to cluster these syllables during the training
phase. The acoustic distance in Equation 2.1 used to cluster units is a weighted
Mahalanobis distance to measure the distance between two phonemes of the same
class. Using the context information and the acoustic distance these units are
clustered using CART [20], [12]. Figure 2.4 shows a small portion of a CART. As
can be seen from the figure, the leaf nodes in the CART are clusters. The number
next to each indexed unit is the target cost for that particular unit, which is the
acoustic distance of that unit from the cluster centre. The acoustic distance for
clustering and the target cost are computed as follows:
$$T_{dist}(V, U) = \frac{W_D \, |U|}{|V|} \sum_{i=1}^{|U|} \sum_{j=1}^{n} \frac{W_j \, \left| F_{ij}(U) - F_{(i\,|V|/|U|)\,j}(V) \right|}{SD_j \cdot n \cdot |U|}, \qquad \text{if } |V| > |U| \tag{2.1}$$
where
• |U| is the number of frames in U
• F_{ij}(U) is parameter j of frame i of unit U
• SD_j is the standard deviation of parameter j
• W_j is the weight for parameter j
• W_D is the duration penalty weight
• n is the number of parameters per frame
The equation gives the mean weighted distance between the two units, with the
shorter unit linearly interpolated onto the longer unit.
The features used for acoustic cost computation are the mel-frequency cepstral
coefficients (MFCCs) and their velocity coefficients, f0, and absolute power.
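A direct transcription of Equation 2.1 is sketched below, assuming each unit is given as a (frames x parameters) feature matrix; this illustrates the distance itself, not the Festival implementation.

```python
import numpy as np

def target_distance(V, U, weights, sd, dur_weight=1.0):
    """Weighted frame-wise distance of Eq. 2.1 (a sketch).

    V, U: (frames, n) feature matrices with |V| >= |U| (the shorter unit U is
    linearly mapped onto the longer unit V).  `weights` and `sd` hold the
    per-feature weights W_j and standard deviations SD_j.
    """
    nU, n = U.shape
    nV = V.shape[0]
    assert nV >= nU, "call with the shorter unit as U"
    total = 0.0
    for i in range(nU):
        k = min(int(i * nV / nU), nV - 1)   # frame of V matched to frame i of U
        total += np.sum(weights * np.abs(U[i] - V[k]) / sd)
    # Normalise by frames and features; scale by the duration penalty term.
    return dur_weight * nU / nV * total / (n * nU)
```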
CART is a decision tree based on a set of yes/no questions. Every unit in the
database has its own CART, which contains all occurrences of that unit in the
database clustered with respect to the context in which they occur in the training
sentences.
Figure 2.4: An example portion of a CART for the unit sabeg
2.1.4 Synthesis Phase
The synthesis phase involves breaking down the input text using a set of LTS rules,
and searching for a unit in the CART that best suits the context in which the
unit is present in the sentence to be synthesised.
2.1.4.1 Letter to sound rules
The letter to sound rules for synthesis are very similar to the rules that are used
during training. The language-specific rules and the syllabification process are the
same as the process followed during the training phase. The main difference is
that it searches for the context closest to what is present in the input sentence.
In the absence of the syllable, it is broken down into smaller sub-word units or
back-off units which are substituted for the missing syllable. Two levels of back-off
are performed. The syllable is first broken down into aksharas, which are searched
for; in the absence of an akshara, the monophones corresponding to the missing
units are substituted.
2.1.4.2 Selection of units
Once the text has been broken down into the smaller sub-word units, a target
specification is generated using the units that make up the sentence that has to
be synthesised. Using this target specification, an appropriate cluster is found. A
Viterbi search is performed through each one of these candidate clusters to find
the optimal set of units that could be used to synthesise the sentence. The optimal
set of units is found by minimising the cost

$$\sum_{i=1}^{N} \Bigl[ Cdist(S_i) + W \cdot Jcost(S_i, S_{i-1}) \Bigr] \tag{2.2}$$
$Cdist(S_i)$ is the distance of syllable $S_i$ from the centre of its cluster, known
as the target cost. $Jcost(S_i, S_{i-1})$ is the cost of concatenating syllable $S_i$ with
the previous syllable $S_{i-1}$. $W$ is used to weigh the join cost against the target
cost, and $N$ is the number of syllables in the utterance to be synthesised.
The target cost is computed as the acoustic distance given in Equation 2.1 between
each unit in the candidate cluster and the cluster centre. The concatenation cost is
computed by measuring the cepstral distance, the difference in absolute power and
f0 at the point of concatenation of the pair of syllables.
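The search that minimises Equation 2.2 is standard dynamic programming; a minimal sketch follows, with `target_cost` and `join_cost` as assumed scoring functions standing in for Cdist and Jcost.

```python
import numpy as np

def select_units_viterbi(clusters, target_cost, join_cost, W=1.0):
    """Find the unit sequence minimising Eq. 2.2 by dynamic programming.

    clusters[i] is the list of candidate units for syllable i.
    """
    n = len(clusters)
    best = np.array([target_cost(u) for u in clusters[0]])
    back = []                          # back[i][j]: best predecessor of unit j
    for i in range(1, n):
        prev = clusters[i - 1]
        cur_best, cur_back = [], []
        for u in clusters[i]:
            trans = [best[k] + W * join_cost(prev[k], u) for k in range(len(prev))]
            k = int(np.argmin(trans))
            cur_best.append(trans[k] + target_cost(u))
            cur_back.append(k)
        best, back = np.array(cur_best), back + [cur_back]
    # Trace back from the cheapest final unit.
    j = int(np.argmin(best))
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    path.reverse()
    return [clusters[i][j] for i, j in enumerate(path)]
```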
2.1.4.3 Synthesis using the selected units
Once the units have been selected, they are concatenated using the windowed join
method, in which the ends of the units are windowed before being joined. Concatenation using optimal coupling is usually preferred when phones are used as
the sub-word units. Optimal coupling is disabled in the case of syllables as the
boundaries of syllables are usually acoustically stable.
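A minimal sketch of the windowed join is given below; the raised-cosine window and the 5 ms overlap are assumed values, as the thesis does not specify them.

```python
import numpy as np

def windowed_join(a, b, overlap=80):
    """Concatenate two unit waveforms with a short raised-cosine cross-fade.

    `overlap` is in samples (80 samples = 5 ms at 16 kHz) -- an assumed value.
    """
    fade = 0.5 * (1 - np.cos(np.pi * np.arange(overlap) / overlap))  # 0 -> 1
    head, tail = a[:-overlap], b[overlap:]
    cross = a[-overlap:] * (1 - fade) + b[:overlap] * fade
    return np.concatenate([head, cross, tail])
```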
2.2 Description of Work done on HTS for Indian languages
HTS systems, unlike USS, do not involve concatenation of waveforms. Instead,
sub-word units are modeled using sequences of coded vectors and speech parameter sequences are generated using these models. The model that best suits our
purpose is the Hidden Markov Model (HMM) as it effectively captures sequential
information.
Once the system has been built, there is no need to retain the original speech
waveforms as speech parameter sequences are generated using the models built.
This results in a large reduction in the footprint size. But HMM-based speech
synthesisers suffer mildly in terms of naturalness, as the models are an average
representation of each sub-word unit. This results in the synthesised waveform
sounding buzzy. This kind of synthesis, though, is highly intelligible, as a
continuous stream of parameters is generated; USS systems, by contrast, suffer in
terms of intelligibility because the speech waveform is formed by concatenating
smaller waveforms, which makes the synthesised speech sound discontinuous.
Figure 2.5 illustrates the overview of an HMM-based speech synthesiser. The
sub-word unit used, in this case, is the context-dependent monophone. The details
of the procedure for building an HMM-based speech synthesiser are described below.

Figure 2.5: Overview of an HMM-based TTS
2.2.1 Training
2.2.1.1 Letter to sound rules
The letter to sound rules are the same as the rules mentioned in Section 2.1.1.1.
The input text is first broken down into syllables using the language specific rules
followed by syllabification rules. Once the words have been broken down into
syllables, the syllables are broken down into their constituent monophones. To
map the characters in the native language script to their corresponding phoneme
mapping in roman script, a mapping scheme has been proposed in [1]. In this
scheme, common sounds from 13 different Indian languages are mapped to a common symbol in the roman script. This mapping is called the common label set. A
small portion of the common label set for the two languages dealt with in this thesis is
shown in Table 2.2.
Table 2.2: Small portion of the Common Label Set illustrating how a few examples
have been mapped to their roman character equivalent

Common Label Set Notation | Hindi | Tamil
a  | अ     | அ
i  | इ     | இ
ii | ई     | ஈ
rq | ऋ, ॠ  | -
zh | -     | ழ
k  | क     | க
kh | ख     | -
2.2.1.2 Segmentation of waveforms at monophone level
The waveforms are segmented at the monophone level using the algorithm described in Section 2.1.3.1. Once the speech waveforms have been segmented using
the hybrid segmentation algorithm, monophone2 and fullcontext3 labels are generated from the utterances. These label files are then used to initialise models for
an HMM-based TTS system.
2.2.1.3 Feature Extraction
An HMM-based speech synthesiser is built by generating models of sub-word units
using features of the speech waveform. Since the speech production mechanism
can be viewed as glottal pulses exciting the vocal tract to produce speech output,
the features extracted correspond to the excitation and vocal tract characteristics.
The features used, in this case, are mel-generalised cepstral coefficients (MGC) and
their dynamic features (35+35+35), log f0 (lf0) and its dynamic features (1+1+1),
and the duration of each sub-word unit. The MGC, in this case, represent
the vocal tract characteristics, while lf0 represents the excitation.
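The sketch below illustrates extracting the two feature streams, using MFCCs with deltas and pyin-estimated log-f0 as stand-ins for the MGC and lf0 streams actually used (MGC are typically extracted with SPTK-style tools).

```python
import numpy as np
import librosa

def extract_features(path):
    """Spectral + excitation streams for HTS-style training (a sketch).

    MFCCs stand in for the MGC stream; the thesis uses mel-generalised
    cepstra, which this sketch does not compute.
    """
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=35)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    spectral = np.vstack([mfcc, d1, d2])          # 35 + 35 + 35 stream
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    lf0 = np.full_like(f0, np.nan)                # NaN marks unvoiced frames
    lf0[voiced] = np.log(f0[voiced])
    return spectral, lf0
```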
2. Contains only time-stamps and the monophone transcription.
3. Contains detailed context information - segmental and suprasegmental - about the utterance.
2.2.1.4 Model Building
The model building process involves initialising and re-estimating context independent (CI) and context dependent (CD) monophone HMMs based on the maximum
likelihood criterion:
$$\hat{\lambda} = \arg\max_{\lambda} \, p(O \mid W, \lambda) \tag{2.3}$$
where $\lambda$ represents the model parameters, $O$ is the training data, and $W$
represents the transcriptions corresponding to the training data.
The acoustic parameters are modeled using multi-space probability distribution
HMMs (MSDHMMs), with MGC, lf0 and their dynamic parameters modeled together
using one HMM. Acoustic parameters and duration are modeled separately, using
5 states for each phoneme model.
2.2.2 Synthesis
An utterance structure is formed for the text to be synthesised using the letter to
sound rules (Section 2.2.1.1) and context information is extracted for the same.
Speech parameter sequences are generated using CDHMMs for this sequence of
phonemes using:
$$\hat{o} = \arg\max_{o} \, p(o \mid w, \hat{\lambda}) \tag{2.4}$$
where o represents speech parameters and w is the transcription of the test
sentence. The output speech is synthesised using these sequences of parameters.
2.3 Related Previous Work
There have been several efforts in the past that have worked on pruning speech
databases and analysing the rhythmic and tonal characteristics of speech. In
Sections 2.3.1 to 2.3.4, a brief review of these works is presented.
Figure 2.6: (a) Histogram of duration difference between a pair of adjacent syllables in the database (b) Histogram of average f0 difference between a pair of adjacent syllables in the database (c) Histogram of average energy difference between a pair of adjacent syllables in the database
2.3.1 Previous work on maintaining consistency in speech databases and pruning them
[11], [14] and [15] propose various methods for pruning speech databases to reduce
the size of the footprint, to enable porting the TTS onto devices with smaller
memory capacities, and for various other applications. The initial impetus for
the work done in this thesis was given by [21].
2.3.1.1 A Probabilistic Approach to Selecting Units for Speech Synthesis Based on Acoustic Similarity
In [21] a method is proposed in which the USS synthesises sentences based on a
cost measure which focuses on reducing the acoustic variability between adjacent
units. This reduces sudden prosodic variations in the synthesised sentence and
makes the synthesised output sound more continuous.
To synthesise speech, differences in f0 , energy and duration of consecutive
pairs of units in the database were observed. Figure 2.6 shows the histogram of
differences in these 3 parameters for a pair of syllables at adjacent positions. It
was observed from this figure and many others that the distributions in most cases
were Gaussian. These differences were converted to probability density functions.
During synthesis, the differences in the 3 parameters were computed for all possible
adjacent pairs of candidate units and the units were then chosen based on the
values of the 3 parameters that best fit the distribution.
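The criterion of [21] can be sketched as follows: fit Gaussians to the per-parameter differences between adjacent database units, then score candidate pairs by their likelihood under those Gaussians. The dictionary-based unit representation is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm

def fit_adjacency_model(adjacent_diffs):
    """Fit one Gaussian per parameter to the observed adjacent-unit differences.

    adjacent_diffs: (pairs, 3) array of [f0, energy, duration] differences.
    """
    return adjacent_diffs.mean(axis=0), adjacent_diffs.std(axis=0)

def pair_score(unit_a, unit_b, mean, std):
    """Log-likelihood that unit_b naturally follows unit_a."""
    diff = np.array([unit_b["f0"] - unit_a["f0"],
                     unit_b["energy"] - unit_a["energy"],
                     unit_b["dur"] - unit_a["dur"]])
    return norm.logpdf(diff, loc=mean, scale=std).sum()
```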
This work showed that it is very important to maintain acoustic continuity
across a synthesised utterance, and that sudden jumps in acoustic properties
between adjacent units in a synthesised utterance cause the system to score very
low on listening tests.
2.3.1.2 Automatic pruning of unit selection speech databases for synthesis without loss of naturalness
In [11], two methods of selecting the most suitable syllable unit for synthesis have
been described.
Selecting the average unit - The first method describes choosing the average
unit from a number of realisations such that it is prosodically neutral with minimal
influence of context. This was done by computing the mean of prosodic features of
all the realisations of the unit in the speech database, and selecting the unit whose
prosodic features were closest to the computed mean. The prosodic features used
were pitch (f0 ), short-term energy (STE) and duration.
Selecting the optimal unit - In this approach, a measure called global prosodic
mismatch function (GPMF) is defined. This function is computed as follows:
$$GPMF(X) = \sum_{i=1}^{N} \left\{ \left| 1 - \frac{P_X}{P(A_i)} \right| + \left| 1 - \frac{D_X}{D(A_i)} \right| + \left| 1 - \frac{E_X}{E(A_i)} \right| \right\} \tag{2.5}$$
where $A_i$ is the $i$-th instance of the unit under consideration, $N$ is the number
of instances of the unit, $P(A_i)$, $D(A_i)$ and $E(A_i)$ are the pitch, duration and
energy of the instance $A_i$, and $P_X$, $D_X$ and $E_X$ are the expected values of
pitch, duration and energy for the unit.
This function measures the distance between the candidate units in the database
and the target specification predicted using the corresponding acoustic models.
The ideal unit, in this case, would be the unit with the minimum value of the GPMF.
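A direct transcription of Equation 2.5, as reconstructed above, assuming each instance is given as a (pitch, duration, energy) tuple:

```python
def gpmf(instances, targets):
    """Global prosodic mismatch function of Eq. 2.5 (a sketch).

    instances: list of (pitch, duration, energy) tuples, one per realisation
    A_i of the unit; targets: the expected (P_X, D_X, E_X) for the unit.
    """
    px, dx, ex = targets
    return sum(abs(1 - px / p) + abs(1 - dx / d) + abs(1 - ex / e)
               for p, d, e in instances)
```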
Perceptual tests showed that the GPMF worked better as a cost function than
the average unit method. A large number of sentences were synthesised from a
large text corpus and the GPMF cost was computed for all units in the database.
The instances of a particular unit with minimum cost were added to the database
while the others were excluded. It was found that this method of pruning not only
resulted in a reduction of the database size but also improved the quality of
synthesis, since it removed ‘bad’ units from the database. Further, using these
corrected labels to build an HMM-based speech synthesiser also proved very helpful.
2.3.1.3 A Statistical Method for Database Reduction for Embedded
Unit Selection Speech Synthesis
[14] proposed a method to prune the database for a TTS to reduce the size of
the footprint. This method relies on the statistics produced by the unit selection
speech synthesis on a large text corpus. It uses the frequency of occurrence of each unit and the acoustic cost [20] to discard redundant units from the
database. A fitness vector was defined which was formed using a combination of
the frequency of occurrence of a unit during synthesis on text from a large corpus,
and the mean of the acoustic cost of the unit each time it is used for synthesis.
This method effectively removes redundant units from the database because the
fitness vector of each unit under consideration was multiplied by the score
difference from the previously selected unit. This reduced the fitness value of
similar units and increased the fitness value of dissimilar units. Units with
higher fitness values were added to the database and the others were excluded. It
was observed that this method of pruning reduced the size of the database with
minimal degradation in quality.
These methods of pruning the speech database rely primarily on feedback from
the synthesised output to decide on the units that have to be discarded. The
method proposed in this thesis differs from the aforementioned approaches in
that it discards the ‘bad’ units prior to the system building phase. The labels
from the output of the segmentation phase are used and the syllable units are
discarded based on their prosodic consistency with the rest of the units in the
speech database.
2.3.2 Previous work on Prosodic phrase break prediction
Prosodic phrases have traditionally been predicted using the ToBI approach, which
uses POS tags to predict different levels of tones and breaks. Initially, punctuation
marks in the text were assumed to be good predictors of phrase breaks. However,
later research suggested that punctuation marks are often wrongly used in the
text and can lead to several errors in phrase break prediction. Therefore, several
approaches that use different cues from text were proposed to handle this problem
of POS tagging. [22] compare and contrast various methods of phrase break
prediction for the English language. These methods have been briefly described
in this section.
2.3.2.1 POS tag based approach
In this approach, the sentence for which phrase breaks have to be predicted is first
grouped into one of the generalised POS tag groups from a ‘bag’ of phrases. After
this, the sentence is analysed further and classified into a more specific group that
better matches the combination of words in it. Prosodic phrase breaks are then
predicted using a set of rules for that particular combination of POS tags.
2.3.2.2 Prediction by example approach
Here, phrase breaks are predicted using an exemplar-based approach.
This approach requires a very large and accurately annotated text corpus. A sentence which is similar to the given input sentence is searched for in the database
and breaks are placed in the test sentence at positions corresponding to the sentence in the text corpus. This similarity is measured in terms of the combination
of POS tags. Therefore, a sentence is first reduced to a set of POS tags. This
reduced set is then searched for in the text corpus and the best matching sentence
is used to predict phrase breaks for the input sentence. The problem with this
approach is that it needs an almost infinite text corpus; in addition, the
approach is very slow.
2.3.2.3 Prediction by phrase modelling
HMMs have also been used to predict phrase breaks. These models have been
quite successful in phrase break prediction mainly because HMMs [23] capture
sequential information very effectively. The sentences in the training corpus are
first POS tagged and clustered using k-means clustering [24]. Left-to-right HMMs
are then built for these clusters, with word skipping allowed. The
number of states is decided by the mean length of the phrases in each cluster.
Phrase breaks are predicted given the POS tags for a sentence. Phrase breaks are
postulated at the junctures of these models.
2.3.2.4 Prediction using features from a syntactic parser
This approach uses a syntactic parser to output the best set of features which
can be used for clustering. The syntactic parser takes an input sentence and
gives parsed output as a set of features which denote various levels of syntactic
information about the sentence. These features were then used to build decision
trees which were found to predict phrase breaks very effectively.
There have also been various efforts that have successfully predicted phrase
breaks for Indian languages. Predicting phrase breaks for Indian languages is a
hard task as the text is seldom punctuated and POS tagging for Indian languages
has not been perfected yet. [3] and [25] propose a method by which pauses can
be predicted using cues from the text. The method proposed in this thesis uses
these same cues to predict prosodic phrase breaks.
2.3.3 Related work on prosody prediction
The tonal aspect of prosody has been successfully predicted for many languages
such as English, Japanese, Mandarin, Portuguese etc. These languages use a
system of prosody labeling called ToBI.
2.3.3.1 Tones and break indices (ToBI)
This is a rule-based approach that was initially developed for English and has now
been extended to various languages. ToBI predicts 2 kinds of prosodic aspects for
a given text namely tones and breaks. Tones are predicted from the accents
associated with every word in the phrase. These accents are predicted using POS
tags. There are different kinds of tone indices such as word pitch accent, phrase
pitch accent, etc. which are predicted based on a set of rules. Breaks are another
prosodic aspect which are again dependent on the POS tags. There are different
levels of break indices as well, such as the breaks that separate words, the breaks
at the ends of phrases, etc.
Although ToBI have successfully been used for various languages, adapting a
similar set of rules for Indian languages is a hard task as Indian languages lack
punctuation and methods of POS tagging. Therefore, the method described in
this thesis analyses prosody based on the structure of the text rather than the
content. This criterion has also been included in the cost measure to select units
in a USS system. This cost function has been added as an additional criterion to
the one described in [12].
2.3.4 Related work on speech rhythm
Rhythm is defined as any regularly recurring event. [7] states that rhythm in
speech is defined either with respect to every syllable in an utterance or with
respect to stressed syllables. Languages in which rhythm is defined as an equal
duration between the production of successive syllables are called syllable-timed
languages. In these languages, there is an equal prominence associated with every
syllable and they generally lack reduced vowels. Stress-timed languages are those
in which there is an equal duration between stressed syllables and the unstressed
syllables are shortened or lengthened accordingly.
In [26], though, a different interpretation of speech rhythm is presented. It is
said that no language can be strictly classified as syllable-timed or stress-timed,
and that all languages exhibit both varieties of timing. Further, the same speaker
can exhibit different kinds of timing on different occasions. Further analysis
showed that the syllable durations in syllable-timed languages are not uniform.
For Hindi, the mean and standard deviation in syllable duration are around 196ms
and 76ms respectively and for Tamil, is around 199ms and 91ms respectively for
the databases used in this paper. This shows that there is a large variation in the
duration of syllables spoken by one speaker for one language.
Since it is not possible to define rhythm on the basis of syllable and stress
timing, a new criterion for analysing rhythm is proposed in this thesis.
Syllabic rhythm is defined here as the rate of production of syllables, in
terms of the number of syllables per word and the duration of each word. The
analysis has been conducted at the intonational phrase level of the prosodic
hierarchy [27]. The reason for restricting the analysis to an intonational
phrase is mainly that the analysis can then be confined to one prosodic
contour. Also, there is a considerable amount of co-articulation between
syllables within a phrase, whereas co-articulation across phrases does not
occur, as a significant pause separates two phrases.
There have been many other efforts over the years aimed at analysing the
rhythmic aspect of speech. [28; 29] characterise rhythm and describe methods to
generate rhythm patterns, particularly for the purpose of text-to-speech
synthesis. [30; 31; 32; 33] detail the classification of languages into
stress-timed and syllable-timed based on rhythm, while [34], [35], [36] and
[37] study isochrony (the rhythmic division of time into equal portions by a
language) from production and perception perspectives. The effect of syllable
stress and rate on isochrony is explored in [34]. [35] analyses the role played
by rhythmic and syntactic units in production and perception and suggests that
isochrony is largely a perceptual phenomenon. [38; 39; 40; 41] analyse the
effects of speech rate on perceiving rhythm, and [42] studies factors of a
language, such as syllable structure, that can be used to characterise rhythm.
2.4 Summary
The procedures to build USS and HTS systems for Indian languages were first
discussed. Previous efforts on removing artifacts focused on removing ‘dirty’
units from the database based on their acoustic score during synthesis and
their frequency of occurrence. Prosody prediction in previous work on languages
such as English relied mostly on punctuation and POS tags. There have also been
efforts towards data-driven approaches to phrase break prediction. The
prediction of tonal aspects of speech, though, relies more on rule-based
approaches, which again depend on POS tags.
CHAPTER 3
A method to prune speech databases to improve
the quality of Indian TTSes
The synthesis system described in the previous chapter has serious issues in
terms of prosody. These issues include errors due to artifacts and inaccurate
segmentation. Speech synthesisers that work on unit concatenation suffer in
quality mainly due to sudden variations in acoustic properties between adjacent
units. An example of such an artifact in synthesised speech is shown in Figure
3.1. In this figure, the unit थे (thee) is not articulated properly. Artifacts
such as this cause the performance of the USS to degrade significantly. Also,
incorporating appropriate rules to select a unit from the database that has the
right co-articulatory influences for the given context is a hard task. This
chapter describes methods by which the chances of selecting a wrongly segmented
unit or a ‘bad’ unit are minimised.
Figure 3.1: Example waveform with artifact
Another artifact that is commonly encountered in speech synthesised using
a USS is an erroneously segmented unit. In these cases, a unit that is part of
the utterance that is to be synthesised has retained a part of the adjacent unit
from the original sentence it was selected from. This is due to errors in the
speech segmentation process. An example of a unit which has been erroneously
segmented is shown in Figure 3.2. Using the method described in this chapter,
erroneous segmentation can be corrected and many of the erroneously segmented
units which have not been corrected can also be discarded.
Figure 3.2: Example waveform and transcription with a segmentation error
Another issue faced in building a TTS is inconsistency in the database. Despite
hiring a professional voice talent to record the data, there are
inconsistencies present in the recordings. The recording of text is done over
many sessions and the conditions vary across sessions. In addition, there is
the problem of maintaining a uniform syllable rate while speaking, especially
when there are complex words in the text. The speaker has a tendency to shorten
syllable durations while speaking. For example, the words पुरुषोत्तम अग्रवाल
(purushottam agrawaal) are in some cases articulated as पुरुषोत्त मग्रवाल
(purushotta magrawaal). Here the syllable length changes and the syllabification
differs from what is obtained using the text syllabification rules. When such
syllables are picked up by the USS, the quality of synthesis degrades and the
mean opinion score (MOS) drops significantly.
It is essential that some uniformity in syllable parameters such as duration,
average short-term energy (STE) and average pitch (f0) be maintained across the
units in the database. This is achieved by pruning outliers from the database,
making use of average STE, average f0 and duration criteria [10], or based on a
score that depends on a weighted sum of the three parameters [13], [11]. [12] and
[14] focus on reducing the size of the database by pruning units from the database,
with little or no degradation in synthesised speech quality or naturalness. The
same is achieved by a vector quantization technique in [15].
There have been a number of attempts aimed at (i) improving the quality of
unit selection systems (USS) and (ii) reducing the size of the speech database
in [12], [13], [11], [14]. These have resulted, directly or indirectly, in the
re-organisation of the classification and regression tree (CART).
In this chapter, we use a syllable-based approach to USS and attempt to
maintain consistency in acoustic parameters across an utterance by appropriate
pruning of the database. Instances of a unit which vary significantly from the
average acoustic properties of that unit are discarded. The first step is to
prepare a syllable database with carefully chosen units. The transcription
must be completely free of errors, and the syllable boundaries must be very
accurate. Since the syllable is a larger unit than the phone, an effective
segmentation algorithm, as described in Section 2.1.3.1, is used.
Once accurate syllable boundaries are available, different pruning techniques
based on the acoustic properties of the syllables are employed (Section 3.1).
In particular, the statistical characteristics of the duration, average STE and
average pitch f0 of the syllable are used to prune units from the database.
3.1 Pruning Technique for USS systems
Pruning is performed mainly to remove badly segmented units and to avoid the
effect of inconsistencies in recording. The database is pruned using the
combined effects of the acoustic cues of duration, average STE and average f0.
Syllable-timing primarily corresponds to duration. Nevertheless, in the context
of preserving the semantic content of the utterance, prosodic parameters like
energy and f0 must also be considered. Energy and f0 play an important role
mainly because, if the sequence of units chosen for synthesis has non-uniform
pitch and energy, overlaps are perceived in the synthesised speech. The pruning
performed thus ensures rhythmic and acoustic consistency in the database.
The steps to select the units to be pruned are as follows (a code sketch is
given at the end of this section):
• The duration, average f0 and average STE are computed for each unit using
all instances of that particular unit.
• The mean (µ) and standard deviation (σ) of these parameters for each unit
are then computed.
• The units lying outside the region specified by some fraction of σ for all 3
parameters are tagged with a special symbol so that they will not participate
in synthesis.
• Only units with more than 10 occurrences in the database are chosen for
pruning.
• If there are more than 50 occurrences of the unit after pruning, the first
50 occurrences are retained.
• For pruning back-off units (aksharas and monophones), only the units
corresponding to the syllables that have not been pruned are retained.
The reason for choosing the first 50 occurrences of a unit, rather than
adopting a k-nearest neighbour approach, is mainly to prevent the synthesis
from sounding too monotonous. The main aim of this chapter is to remove
unwanted units from the database while still making an effort to synthesise
speech that sounds prosodically natural. This also reduces the search space
considerably. Aksharas and monophones are the back-off units used for synthesis
when a syllable is absent from the database during synthesis. Syllable-based
USS systems are built using the units that remain after pruning the speech
database.
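A minimal sketch of this selection procedure is given below, assuming a
hypothetical in-memory mapping from each unit to its occurrences with
pre-computed duration, average STE and average f0. The actual system operates
on Festvox database files, so this is illustrative only.

    import numpy as np
    from collections import defaultdict

    def prune_units(instances, k=0.25, min_count=10, max_keep=50):
        """Tag outlier occurrences of each unit and cap the survivors.

        `instances` maps a unit name (e.g. a syllable) to a list of dicts
        holding 'duration', 'ste' and 'f0' for every occurrence in the
        database (a hypothetical structure used here for illustration).
        """
        kept = defaultdict(list)
        for unit, occs in instances.items():
            # Only units with more than `min_count` occurrences are pruned.
            if len(occs) <= min_count:
                kept[unit] = list(occs)
                continue
            feats = np.array([[o['duration'], o['ste'], o['f0']] for o in occs])
            mu, sigma = feats.mean(axis=0), feats.std(axis=0)
            # Retain only occurrences within k*sigma of the mean for all 3 cues.
            inside = np.all(np.abs(feats - mu) <= k * sigma, axis=1)
            survivors = [o for o, ok in zip(occs, inside) if ok]
            # If more than `max_keep` remain, keep the first `max_keep`.
            kept[unit] = survivors[:max_keep]
        return kept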
3.1.1 Training USS systems
To test the claim, systems with two levels of pruning have been built for two
Indian languages, Hindi and Tamil. Hindi is an Indo-Aryan language, and Tamil
is a Dravidian language.
The unit selection TTS systems are built using the Festival speech synthesiser
based on the Festvox framework, as described in Section 2.1. The systems are
built with the syllable as the basic unit. A set of hand-written pronunciation
rules is used to split the text into syllables, as described in Section
2.1.1.1, and CARTs are built using linguistic and acoustic context as described
in Section 2.1.3.2. The appropriate sequence for synthesising output speech is
chosen by performing a Viterbi search through a set of target clusters. The set
of units with the optimum weighted sum of target and concatenation cost is
chosen and concatenated to synthesise speech.
Prediction of prosodic phrases for Indian languages plays a crucial role in
synthesising good quality speech, mainly because Indian language text is seldom
punctuated. Therefore, it is important to have a robust method of predicting
prosodic phrase breaks; details are given in Chapter 4. Cues from the text are
used to build CARTs to predict prosodic phrase breaks. Word-terminal syllables
and case markers are used as features to build a CART, which is later used to
predict pauses during synthesis.
3.2 Preliminary Experiments and Results
Systems with both pruned and unpruned databases for USS were built for the two
languages. Pruning was based on the parameters duration, average STE and
average f0. Pairwise comparison listening tests [43] were performed to evaluate
the effect of pruning with different values of k in k × σ. The results of the
subjective listening tests for 0.25 σ and 0.75 σ are presented in Tables 3.1
and 3.2. Pruning was also performed for other values of standard deviation,
0.5 and 1 times σ; based on informal listening tests, these systems were
excluded from the subjective evaluation.
The systems were evaluated subjectively using pairwise comparison tests [43].
This test consists of giving a preference between synthesised sentences of two
systems, A and B, with the text for both remaining the same. The first part of
the test is an “A-B” test, where the synthesised sentences of system A are always
played first against those of system B, and vice-versa in the “B-A” test. The
score “A-B+B-A” gives an overall preference for system A against system B and
is calculated by the following formula:
“A-B+B-A” = (“A-B” + (100 − “B-A”)) / 2
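As a quick illustration, the overall preference can be computed as follows; the
usage comment checks the helper against the scores reported in Table 3.1.

    def overall_preference(ab, ba):
        """Overall preference "A-B+B-A" of system A over B, both in percent."""
        return (ab + (100.0 - ba)) / 2.0

    # Example with the Hindi system Q scores from Table 3.1:
    # overall_preference(75.0, 20.8) -> 77.1 (reported as 77)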
In our evaluation, the proposed system is always system A and the default system
is always B. The test was conducted on 15 listeners on a set of 10 synthesised
sentences. We will follow these conventions when referring to systems built using
databases pruned to
29
35
30
35
25
20
15
10
Number of Syllables
25
Number of Syllables
Number of Syllables
30
20
15
10
30
25
20
15
10
5
5
0
40
5
0
0.1
0.2
0.3
0.4
Duration
0
0
2000
4000
6000
8000
0
0
Average Energy
50
100
150
200
250
300
Average f0
Figure 3.3: Distribution of acoustic parameters for syllable /see/_end
1. 0.75 σ - System P
2. 0.25 σ - System Q
Table 3.1: Pairwise comparison tests for Hindi

Score      P       Q
A-B        58.3    75
B-A        66.7    20.8
A-B+B-A    45.83   77

Table 3.2: Pairwise comparison tests for Tamil

Score      P       Q
A-B        58.3    62.5
B-A        45.8    37.5
A-B+B-A    56.2    62.5
From the USS results, we observe that the overall preference is for the
proposed systems, except in the case of system P against the default system for
Tamil, where the preference is for the default system. The most preferred is
system Q, which is pruned to 0.25 σ of the three parameters. This result is
consistent with our initial observation that reducing acoustic variability
improves synthesis quality. As seen from the Hindi USS results, pruning using a
combination of the three parameters and excluding units lying outside the
region of 0.25 × σ definitely boosts system performance. Although there is an
improvement for both Hindi and Tamil, the improvement for Tamil is not as
significant as for Hindi. This is mainly due to the agglutinative nature of
Tamil, because of which the language is replete with geminates. Many
segmentation errors can occur when geminates are encountered, which causes a
drop in system performance.
Figure 3.3 shows the distribution of duration, average STE and average f0 for
all instances of the Hindi syllable से_end (see_end). It can be seen from the
figure that the deviation of the above three parameters is not very high.
Therefore, choosing a smaller value of standard deviation suffices.
3.3 Using pruning to improve the quality of phone-based HTS systems
Since pruning using acoustic cues helped improve the quality of USS systems, it
was proposed that it could also be used indirectly to improve the quality of
HTS systems. Since HTS systems are phone-based, only the monophones
corresponding to the unpruned syllables were used to initialise an HTS system.
Once pruning using the method described in Section 3.1 was performed and the
monophones corresponding to the pruned syllables were excluded, the remaining
monophones were used to build monophone HMMs. Re-estimation and forced
alignment were then performed iteratively to obtain accurate monophone labels.
These monophone labels were then used to initialise the models of an HTS
system. The building process is the same as the one described in Section 2.2.
Listening tests were performed to evaluate how important this initialisation is
to an HMM-based system.
3.3.1 Results of listening tests conducted to evaluate the effect of pruning on the HTS system
Pairwise comparison listening tests as described in Section 3.2 were conducted
to evaluate the effect of initialising HMMs using the monophones corresponding
to the syllables retained after pruning. The proposed system was compared with
the system initialised without the pruned models; the results are given in
Table 3.3. The test was conducted on 15 listeners on a set of 10 synthesised
sentences.
Table 3.3: Pairwise comparison tests for Hindi and Tamil to evaluate the
performance of the HTS system after initialising using pruned models

Score      Hindi   Tamil
A-B        56.72   45.55
B-A        18.8    32.22
A-B+B-A    68.96   56.66
From the results above it can be concluded that initialising an HTS using the
monophones retained after pruning does improve the performance of the HTS, and
that this system is preferred over the HTS whose models are initialised using
the output of the automatic segmentation algorithm. This shows that
initialising the HTS with accurate segments is crucial to improving synthesis
quality. Here again, the improvement in performance for Tamil is not as
significant as for Hindi. This can again be attributed to the agglutinative
nature of Tamil, which makes segmental correction very difficult.
3.4 Pruning speech databases using likelihood as an additional cue
It can be clearly seen from the results in Section 3.3 that pruning the speech
database using acoustic cues has a significant effect on the quality of speech
synthesis for USS. Although this is true for all the languages, further
analysis showed that a few erroneous units were still retained, which made the
performance of the system inconsistent. A slight modification of the algorithm
described in Section 3.1 was therefore proposed.
Since most of the units that remain after pruning the speech database can be
assumed to be units with acoustic consistency and accurate segmentation, the
corresponding monophones were used to initialise monophone models.
Syllable-level forced Viterbi alignment was then performed using these models,
and it was observed that the syllable labels obtained using this method were
very accurately segmented. It can be seen from Figure 3.4 that the erroneous
syllable boundaries given by the hybrid segmentation approach are corrected
when syllable segments are obtained using the pruned monophones. The dotted
black line indicates the erroneous syllable boundary, and the region in yellow
enclosed within the solid black lines indicates the correct segment for that
same syllable.

Figure 3.4: Comparison between syllable segments obtained using the two approaches
After correcting the syllable boundaries using the aforementioned approach,
pruning was performed again, this time using the likelihood obtained while
performing forced alignment at the syllable level as an additional cue. The
algorithm for the two-pass pruning is as follows (a sketch of the second-pass
filter follows the discussion below):
• The duration, average f0 and average STE are computed for each unit using
all instances of that particular unit.
• The mean (µ) and standard deviation (σ) of these parameters for each unit
are then computed.
• The units lying outside the region specified by some fraction of σ for all 3
parameters are tagged with a special symbol so that they will not participate
in synthesis.
• Only units with more than 10 occurrences in the database are chosen for
pruning.
• If there are more than 50 occurrences of the unit after pruning, the first
50 occurrences are retained.
• For pruning back-off units (aksharas and monophones), only the units
corresponding to the syllables that have not been pruned are retained.
• Initialise HMMs using the monophones that have not been excluded.
• Perform initialisation, re-estimation and forced alignment at the monophone
level iteratively until the monophone segments obtained are accurate.
• Use these re-estimated monophone models to perform forced Viterbi alignment
at the syllable level.
• The duration, average f0, average STE and likelihood are then extracted for
each unit.
• The mean and standard deviation (σ) of the average f0, average STE, duration
and likelihood are then computed for each unit.
• The units lying outside the region 0.25 × σ for average f0 and average STE,
and those whose duration lies more than 0.25 × σ below the mean, are tagged
with a special symbol; the longer units are retained.
• Of the units that have not been excluded, those whose likelihood lies more
than 0.25 × σ below the mean are excluded; the units with greater likelihood
values are retained.
• Only units with more than 10 occurrences in the database are chosen for
pruning.
• If there are more than 50 occurrences of the unit after pruning, the first
50 occurrences lying within the specified value of σ are retained.
• For pruning back-off units, only the units corresponding to the syllables
that have not been pruned are retained. The other back-off units, i.e. aksharas
and monophones, are excluded.
During the second pass, units with longer durations were not excluded because
previously conducted informal tests had suggested that using long, well
articulated syllable units improves synthesis quality. Such units would in any
case be excluded by the likelihood criterion if they were erroneous segments.
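A minimal sketch of the second-pass test on one unit's occurrences is given
below. It assumes a hypothetical feature array, and it reads "below 0.25 × σ"
as falling more than 0.25σ below the unit's mean; that reading of the
asymmetric duration and likelihood criteria is an assumption.

    import numpy as np

    def second_pass_mask(feats, k=0.25):
        """Boolean mask of occurrences retained in the second pruning pass.

        `feats` is a hypothetical (n, 4) array whose columns are duration,
        average STE, average f0 and forced-alignment likelihood for the n
        occurrences of one unit. Only short and low-likelihood occurrences
        are dropped; STE and f0 must lie within a symmetric band.
        """
        mu, sigma = feats.mean(axis=0), feats.std(axis=0)
        dur, ste, f0, llh = feats.T
        keep = np.abs(ste - mu[1]) <= k * sigma[1]   # STE within the band
        keep &= np.abs(f0 - mu[2]) <= k * sigma[2]   # f0 within the band
        keep &= dur >= mu[0] - k * sigma[0]          # longer units retained
        keep &= llh >= mu[3] - k * sigma[3]          # low likelihood dropped
        return keep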
USS systems were built using the labels obtained with this approach, and the
results of the listening tests conducted are given below.
3.4.1 Results of listening tests conducted to evaluate pruning using the likelihood approach
Pairwise comparison listening tests were conducted to evaluate the new pruning
technique. The evaluation methodology is as described in Section 3.2. Each
system was compared with the system built using the previous pruning technique.
The results are given in Table 3.4. The test was conducted on 15 listeners on a
set of 10 synthesised sentences. In this case, A is the system pruned using the
additional cue of likelihood, while B is the system pruned using the old method
described in Section 3.1.
Table 3.4: Pairwise comparison test results for Hindi and Tamil to observe the
performance of systems pruned using likelihood as a criterion

Score      Hindi   Tamil
A-B        50.00   69.23
B-A        38.80   26.16
A-B+B-A    58.60   71.53
From the results obtained, it can be seen that pruning using the two-pass
technique results in a significant improvement in the performance of the USS.
The improvement for Tamil is much more significant than for Hindi. This
suggests that the agglutinative nature of Tamil causes the segmentation to be
erroneous in many cases, which is effectively corrected during forced Viterbi
alignment using the pruned monophone models.
3.5 Summary
In this chapter, we propose a new approach to prune speech databases for
syllable-timed Indian languages. Given the various issues faced in developing a
syllable-based USS system, this pruning technique proves to be very useful. It
has been shown that using appropriate prosodic criteria for pruning results in
a database that can be used to build better quality speech synthesisers.
Pairwise comparison of the standard USS systems with the pruned versions shows
that pruning using duration, STE and f0 is preferred, and that allowing only a
very small deviation in the prosodic parameters of the individual units is
sufficient. Further, using the retained units to initialise monophone HMMs for
an HTS results in an improvement in the quality of the HTS. Since these models
were precise enough to improve the HTS, they were also used to obtain syllable
and monophone labels for a USS, whereby many segmentation errors were
corrected. When these labels were pruned and used to build a USS, the result
was better than with the old pruning technique. In the new technique, the
likelihood score obtained from forced alignment was used as an extra cue in
addition to the average f0, average STE and duration.
CHAPTER 4
Prosodic Phrase Break Prediction
In natural speech, humans tend to group words together, with noticeable pauses
between the groups. These groups are called prosodic phrases, and the pause
between them is called a prosodic phrase break. A prosodic phrase, linguistically
known as the intonational phrase in the prosodic hierarchy (Figure 4.1), is a
segment that occurs within one prosodic contour. Prosodic phrases help in
understanding the semantics of a spoken utterance and have been found to be
very important in the context of TTS. Prosodic phrase breaks also contribute to
the rhythm of a spoken utterance [16] and have been found to enhance the
quality of synthesised speech. This chapter describes a knowledge-based
approach that uses cues from the text to predict prosodic phrase breaks.
Figure 4.1: An example of Prosodic hierarchy
4.1 Importance of Prosodic Phrase Break Prediction
Prosodic phrase break prediction is the task of breaking the given text into
meaningful chunks of information. It improves the quality of text to speech
synthesisers because it inserts pauses in the synthesised speech wherever
required and makes the synthesised speech more meaningful. Previously conducted
informal listening tests showed that inserting these pauses in the synthesised
speech was crucial: if there was no pause, listeners perceived the units of
speech as overlapping each other wherever they expected a prosodic phrase
break. This resulted in the TTS systems scoring very poorly on the mean opinion
score (MOS) test. Therefore, an efficient strategy to predict prosodic phrase
breaks was deemed necessary.
4.2 Challenges faced in prosodic phrase break prediction for Indian languages
Prosodic phrase break prediction is especially hard for Indian languages
because certain characteristics of Indian language texts make it very difficult
to disambiguate the semantics of the utterance to be spoken. The following are
some of the challenges faced.
4.2.1 Lack of Punctuation and POS taggers
The most crucial part of a text to speech synthesis system is deriving as much
relevant information from the text as possible to achieve high-quality speech
synthesis. Prosody prediction for languages such as English is handled using
rule-based approaches that depend on punctuation and part of speech (POS) tags.
These rules are called Tones and Break Indices (ToBI) and are widely used in
English TTSes; they have also been developed for Japanese, French and many
other languages. Indian languages, though, do not have well developed POS
taggers and generally lack punctuation except for the full stop at the end of
sentences, which makes the task harder.
4.2.2 Agglutinative nature of Indian languages
Many of the languages in India, for example Tamil and Telugu, are agglutinative.
Agglutinative languages are those in which multiple words can be combined and
spoken as one single word. In these cases, the meaning does not change while
the prosodic characteristics can vary significantly. An example in Tamil is the
three words வந்து-ெகாண்டு-இருக்கிறான் (vandu-kondu-irukkiraan), which can be spoken
in isolation or as a complex word, வந்துெகாண்டிருக்கிறான் (vandukondirukkiraan).
Although these words mean the exact same thing whether spoken in isolation or
as a single word, the prosodic hierarchy in the two cases changes significantly.
4.2.3 Low resourcedness of Indian languages
Most Indian languages are low resourced, in that accurately annotated text
corpora are rarely available for them. Therefore, developing approaches that
use machine learning to learn from a huge text corpus is not feasible.
It is due to issues such as these that knowledge-based approaches to prosodic
phrasing were found necessary in the context of Indian language TTSes. Prosodic
phrase break prediction for Indian languages has been included in the Festvox
framework. Since phrase break prediction in the Festvox framework happens at
the word level, only features at the word level and higher were used to build
CARTs. Phrase break prediction for the two languages dealt with in this thesis
is described in the following sections.
4.3 Case Markers for Prosodic phrasing
Initially, to understand the phrasing pattern in Hindi, the text transcription
needs to be very precise. Pauses were marked manually in the text by listening
to the sentences in the database and marking commas wherever the speaker had
paused. Corrections were also made to the text if there was a disparity between
the text transcription and the recording. On doing this for Hindi, it was found
that there were certain monosyllabic words in the text which had a very high
probability of being followed by a pause; a list of a few of these words is
given in Table 4.1. These words, known as case markers, were used as cues to
perform phrase break prediction. A CART was built to predict pauses using the
following textual features (an illustrative sketch follows the list):
• identity of the present word
• identity of the previous word
• identity of the next word
• position of the word in the phrase
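The CARTs themselves are built within the Festvox framework; purely as an
illustration, an analogous break/no-break classifier over such features can be
sketched with scikit-learn, using entirely hypothetical toy samples and labels.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Toy training data: one sample per word juncture, labelled break ('B')
    # or no-break ('NB'). Feature names and values are purely illustrative.
    samples = [
        {'word': 'hei', 'prev': 'gayaa', 'next': 'aur',    'pos_in_phrase': 4},
        {'word': 'ko',  'prev': 'raam',  'next': 'dekhaa', 'pos_in_phrase': 2},
        {'word': 'aur', 'prev': 'hei',   'next': 'vah',    'pos_in_phrase': 1},
    ]
    labels = ['B', 'B', 'NB']

    # One-hot encode the categorical word identities, then grow the tree.
    cart = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier())
    cart.fit(samples, labels)
    print(cart.predict([{'word': 'hei', 'prev': 'thaa', 'next': 'vah',
                         'pos_in_phrase': 3}]))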
Figure 4.2: A portion of the CART tree used for predicting phrase breaks for Hindi
Table 4.1: Probabilities of Hindi case markers and Tamil word-terminal
syllables (along with their notation in common label set format [1]) being
followed by phrase breaks

Hindi              Tamil
है (hei)     0.93   ேவ (we)       0.51
थी (thii)    0.91   னால் (naal)    0.62
था (thaa)    0.42   வும் (vum)     0.43
पर (par)     0.44   க_v (ka)      0.34
को (ko)      0.33   ைய (yai)      0.46
An example portion of the CART used to predict pauses for Hindi is shown in
Figure 4.2. In the figure, P(B) corresponds to the probability of a phrase
break and P(NB) to the probability of no break. Using this CART, it was found
that pauses could be predicted with an accuracy of ≈ 89%.
4.4 Word-terminal syllables for prosodic phrasing
Unlike Hindi, phrase breaks in Tamil could not be predicted using simple word
identity features alone. This is mainly because Tamil is an agglutinative
language, and the identity of a word is lost when it is merged with other
words. Therefore, in the case of Tamil, the identities of word-terminal
syllables were used to build a CART to predict phrase breaks. Examples of
word-terminal syllables used for Tamil are given in Table 4.1. The textual
features used to build CARTs for Tamil are the same as those for Hindi, with
the following additional word-terminal syllable features:
• Identity of word-terminal syllable of present word
• Identity of word-terminal syllable of previous word
• Identity of word-terminal syllable of next word
Figure 4.3: A portion of the CART tree used for predicting phrase breaks for Tamil
An example portion of the CART used to predict pauses for Tamil is shown in
Figure 4.3. Using this CART, it was found that pauses could be predicted with
an accuracy of ≈ 86%. Using this method to predict prosodic phrase breaks, it
was found that the quality of synthesis improved considerably for both Hindi
and Tamil.
4.5 Experiments and Results
Pairwise comparison listening tests were conducted to evaluate the performance of
pause prediction. The results of the test were evaluated as described in Section 3.2.
The results of these tests are given in Table 4.2. The listening test was conducted
on 15 listeners on a set of 20 synthesised sentences. For both languages, A is
the system with prosodic phrase break prediction while B is the system without
prosodic phrase break prediction.
Table 4.2: Results of pairwise comparison tests for Hindi and Tamil USS to
compare systems with and without prosodic phrasing

Score      Hindi   Tamil
A-B        50.00   46.36
B-A        26.00   62.72
A-B+B-A    62.00   54.54
From Table 4.2 it can be seen that the system with phrase break prediction is
given higher preference for both Hindi and Tamil. The improvement for Tamil,
however, is not as significant as for Hindi. This can be attributed to the
agglutinative nature of Tamil, which makes prosody prediction for Tamil a hard
task.
4.6 Summary
In this chapter, methods to predict prosodic phrase breaks using cues from the
text are described. The lack of punctuation and of efficient methods of POS
tagging for Indian languages makes this task harder; therefore, a
knowledge-based approach had to be developed. Case markers and word-terminal
syllables were identified as cues to predict prosodic phrase breaks. It was
found that these cues were effective in predicting prosodic phrases and that
the quality of synthesis improved when prosodic phrase break prediction was
used.
CHAPTER 5
Analysing the Effects of Phrase Structure and
Syllable Rhythm on the Prosody of
Syllable-Timed Indian Languages
Stress, intonation, intensity and rhythm of speech are factors that generally
characterise the prosody of a speaker [44]. Since prosodic features are
suprasegmental, segmental correction of prosody alone is inadequate. Speech can
be grouped into prosodic units called phrases. Chapter 4 describes breaking a
given text into phrases using cues from the text alone; phrase break prediction
deals more with the rhythmic aspect of speech. The work in this chapter
analyses the factors of text that can be used to predict the tonal elements of
a spoken utterance.
Two criteria are proposed in this chapter to analyse the tonal aspects of speech.
The first criterion proposed in this chapter focuses on correlating acoustic elements
of a phrase with similarities in text patterns, for a speech database of declarative
sentences. This criterion is then used to define a modified acoustic cost measure
which is used along with the traditional acoustic cost to select units for a syllable
based unit selection text to speech synthesiser.
The second criterion correlates two phrases based on the similarities in their
syllable rhythm. This analysis showed that phrases with similar rhythmic patterns
also have similar prosodic characteristics.
5.1 Structural Similarity
In [45] and [46], it is shown that features such as syllable structure, the
position of the syllable in a word, the number of syllables in the word, etc.
play a crucial role in prosody. A thorough analysis was performed to decide on
the features to be extracted from the text. The aim was to find a pair of
phrases that could be matched in terms of patterns in their text and then see
if they could be correlated in any way. To analyse prosody, a new measure
called structural similarity is proposed, which matches a pair of phrases in
terms of the following parameters (an illustrative matching sketch follows the
list):
1. Position of the phrase in a sentence
• On analysing the prosodic characteristics of phrases uttered at different
parts of a sentence, it was observed that the characteristics of phrases differ
depending on their position in the sentence.
2. Number of words in the phrase
• The number of words in a phrase is highly correlated with the duration of
the phrase, and the duration of a phrase is indicative of the rhythm of the
utterance.
3. Number of syllables in each word of the phrase
• This can be used as a measure of rhythm. Syllables towards the end of words
tend to get shorter as the word gets longer.
4. Syllable structure of each syllable in the phrase (V, CV, CVC, etc.)
• Most Indian languages contain syllables with a very simple structure. It can
be seen from Figure 5.1 that most of the syllables in Hindi and Tamil are of
the form CV, VC, CVC and V. Also, [47] and [42] have shown that the structure
of a syllable affects the rhythm of a spoken utterance.
5. Position of the syllable in a word
• The articulatory properties of syllables change when they are uttered at
different positions in a word.
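As an illustration of how such a match could be scored, the following sketch
compares two phrases on the five parameters above. The representation and the
simple mismatch count are assumptions for illustration, not the exact matching
procedure used in the system.

    def describe(position, words):
        """Structural description of a phrase.

        `position` is begin / middle / end / isolated; `words` is a list of
        per-word syllable-structure lists, e.g. [['CV', 'CVC'], ['CV', 'V']].
        A hypothetical representation of the five parameters above.
        """
        return {
            'position': position,
            'n_words': len(words),
            'syls_per_word': [len(w) for w in words],
            'structures': [s for w in words for s in w],
        }

    def mismatch(a, b):
        """Disagreement count between two phrase structures (lower = closer)."""
        if a['position'] != b['position'] or a['n_words'] != b['n_words']:
            return float('inf')   # pre-clustering keys must match exactly
        cost = sum(x != y for x, y in zip(a['syls_per_word'], b['syls_per_word']))
        cost += sum(x != y for x, y in zip(a['structures'], b['structures']))
        cost += abs(len(a['structures']) - len(b['structures']))
        return cost

    def closest_phrase(target, database):
        """Structurally most similar phrase; exact matches score zero."""
        return min(database, key=lambda cand: mismatch(target, cand))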
Figure 5.1: (a) Pie chart showing the number of syllables belonging to each
type of structure for Hindi (b) Pie chart showing the number of syllables
belonging to each type of structure for Tamil
Figure 5.2: (A) Waveform, pitch and energy contours of a Hindi phrase (B)
Waveform, pitch and energy contours of a Hindi phrase which is structurally
similar to (A) (C) Waveform, pitch and energy contours of the phrase obtained
by transplanting the prosodic contour of (B) on (A)
5.1.1 Transplantation
Initial experiments involved transplanting the prosodic characteristics of one
structurally similar phrase onto another and observing whether there was any
degradation in naturalness. Transplanting prosody means resynthesising one of
the phrases using the prosodic features of the other. A phase vocoder
synchronous overlap-add algorithm was used for time-scale modification and
pitch synchronous overlap-add for pitch-scale modification. Informal listening
tests showed that the degradation in the naturalness of the resulting utterance
was considerably small. An example of such a transplanted phrase is shown in
Figure 5.2. In this figure, (C) is the resultant utterance and (B) is the
utterance whose prosodic characteristics were transplanted onto (A).
It can be seen from the figure that there are no striking similarities between
the prosodic contours of the reference phrase and the structurally similar
phrase, in spite of which transplanting prosody between them resulted in no
significant loss in naturalness. Similarity to the original utterance listening
tests were conducted to verify this claim. In this test, listeners are asked to
rate the phrase with transplanted prosody on a scale of 1 to 5 based on how
similar they felt the transplanted phrase was to the original. The results are
given in Table 5.1. The test was conducted on 15 listeners on a set of 15
phrases.
Figure 5.3: Plot showing the range of scores for Similarity to the original
utterance tests for Hindi and Tamil
Table 5.1: Similarity to the original utterance scores for Hindi and Tamil

Language   Hindi   Tamil
Score      4.24    3.75
It is evident from the results that transplanting prosody between structurally
similar phrases results in a minimal loss of naturalness. From Figure 5.3 it
can be seen that for Hindi the degradation from the original is considerably
less, while for Tamil there is some degradation, most probably because of the
agglutinative nature of Tamil. Since transplantation under this criterion shows
minimal degradation, it was proposed that the criterion be used to define a new
cost measure for selecting units in a USS system.
5.1.2 Application of structural similarity to USS
Transplanting prosody between two structurally similar phrases results in minimal degradation in naturalness. It was therefore proposed that during synthesis,
acoustically similar units should be chosen from structurally similar phrases in the
database.
5.1.2.1 Training the USS
To train the USS, the same steps as mentioned in Section 2.1 are followed. This
system, though, has been built purely for syllables and will fail to synthesise
any text containing a syllable that is not present in the database. The first
step in building the system is to use a set of hand-written LTS rules to break
the sentences in the text corpus into their respective syllables. These are
then used to segment the speech waveforms using the hybrid segmentation
algorithm described in Section 2.1.3.1. After obtaining syllable-level
segmentation for all the waveforms in the database, the syllables are clustered
using linguistic, acoustic and phonetic criteria, and CARTs are built for each
unit.
After the usual steps of building a USS, the additional steps to be carried out
are pre-clustering the phrases based on the number of words in them and their
position in a sentence. The syllable structure of each syllable belonging to
each phrase is also stored. The prosodic phrase prediction module has also been
included because, during synthesis, the sentence to be synthesised has to first
be broken down into phrases, after which a structurally similar phrase from the
database has to be looked up.
5.1.2.2 Synthesis
During synthesis, the text to be synthesised is first broken down into phrases
using the phrase prediction module as described in Chapter 4. Once the phrases
are obtained, an exemplar-based approach is used to find the structurally similar
phrase from the database. Depending on the position of the phrase in the sentence
and the number of words in the phrase, an appropriate phrase from the training
corpus is looked up and the units to be synthesised are selected based on their
acoustic similarity to their corresponding units in the structurally similar phrase.
The process of selecting a structurally similar phrase from the database is
shown in Figure 5.4. The two phrases are matched in terms of the features
mentioned in Section 5.1: the number of words in the phrase, the number of
syllables per word, and the structure of every syllable in the phrase. Even if
an exact match is not found, the phrase in the database that matches the phrase
to be synthesised most closely is selected.
After selecting the structurally similar phrase and the cluster for every unit,
a Viterbi search is performed through the candidate clusters to select the
optimal sequence of units. The cost measure used to perform the Viterbi search
is a slightly modified version of the cost described in [20], which, in this
case, is referred to as the traditional cost measure.

Figure 5.4: Selecting a structurally similar phrase from the database
Unit selection speech synthesisers traditionally use two cost measures to
decide the optimal set of units for synthesis: the target cost and the
concatenation cost. In equation 5.1, Cdist(S_i) is the distance of syllable S_i
from the centre of the cluster, known as the target cost. Jcost(S_i, S_{i-1})
is the cost of concatenating syllable S_i with the previous syllable S_{i-1}. W
is used to weigh the join cost against the target cost. N is the number of
syllables in the utterance to be synthesised.
\sum_{i=1}^{N} Cdist(S_i) + W \ast Jcost(S_i, S_{i-1})    (5.1)
The modified cost measure is as follows:
\sum_{i=1}^{N} Cdist(S_i) + W \ast Jcost(S_i, S_{i-1}) + DTW(S_i, S_i^r)    (5.2)
where DTW(S_i, S_i^r) is the dynamic time warped distance between each
candidate unit S_i at position i and its corresponding unit S_i^r in the
structurally similar phrase selected from the database. Using the modified
cost, the optimal sequence of units is selected and the units are concatenated
to synthesise the output speech.
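To make the modified cost concrete, the following is a minimal sketch: a
textbook DTW recursion over per-frame feature vectors plugged into the per-unit
term of Eq. 5.2. The function names and the choice of frame features (e.g.
MFCCs) are assumptions for illustration, not the system's actual interface.

    import numpy as np

    def dtw_distance(x, y):
        """Plain DTW distance between two feature sequences of shape (n, d)."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(x[i - 1] - y[j - 1])
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def modified_cost(cdist, jcost, cand_feats, ref_feats, W=1.0):
        """Per-unit term of Eq. 5.2: target cost + weighted join cost + the
        DTW distance of the candidate to its counterpart in the structurally
        similar phrase."""
        return cdist + W * jcost + dtw_distance(cand_feats, ref_feats)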
For the process of concatenation, the syllable units are windowed at the unit
boundaries to make sure the join between two units is smooth. This ensures that
there are not too many discontinuities in the synthesised speech.
5.1.3 Experiments and Results
Listening tests were performed to evaluate the performance of the system with the
modified cost measure (A), and the USS with the traditional cost measure (B).
Degradation Mean Opinion Score (DMOS) and Word Error rate (WER) tests were
conducted for Hindi and Tamil. The scores obtained in these tests are given in
Table 5.2 and Table 5.3.
5.1.3.1 DMOS
The DMOS test was conducted by playing sentences synthesised using the two
approaches, along with natural sentences, in random order. The main reason for
including natural sentences in the test is to normalise the scores, since for a
TTS system the synthesis quality is limited by the quality of the data. The
test was conducted on 15 listeners on a set of 10 sentences synthesised using
each approach. The scores are given on a scale of 1 to 5, where 5 is excellent
and 1 is very poor.
5.1.3.2 WER
In this test, semantically unpredictable sentences synthesised using the two
approaches are played in random order, and listeners are asked to transcribe
whatever they understood from the synthesised sentences. Insertions, deletions
and substitutions were counted as errors. The test was conducted on 15
listeners on a set of 10 sentences synthesised using each approach.
Table 5.2: Results of DMOS and WER tests for Hindi

Score      A       B
DMOS       3.35    3.52
WER (%)    6.53    6.14

Table 5.3: Results of DMOS and WER tests for Tamil

Score      A       B
DMOS       2.79    3.53
WER (%)    12.67   8.91
Figure 5.5: (a) Similarity to the original utterance scores for Hindi.
(b) Similarity to the original utterance scores for Tamil.
As seen from Tables 5.2 and 5.3, the proposed method performs almost as well as
the traditional USS for Hindi, while this is not the case for Tamil. Figures
5.5(a) and 5.5(b) show the spread of the scores of the two systems. This shows
that the performance of the traditional cost measure is consistently better
than that of the system with the modified cost.
5.1.4 Summary
In this section, the effect of similarities in structure on the prosody of a
spoken utterance was analysed. It is observed that the structure of a phrase
does play a role in its prosody. Although transplanting prosody across
structurally similar phrases showed promising results, the idea of structural
similarity has not yet been used to synthesise speech effectively.
5.2 Rhythmic Similarity
In the previous section, analysis was performed to observe the effect of the structure of a phrase on the prosody of an utterance. The analysis showed that even
though transplanting prosody between structurally similar phrases does not degrade naturalness, structural similarity alone is not sufficient to synthesise high
quality speech. This section analyses the effect of syllable rhythm on the prosody
of a spoken utterance.
Rhythm is defined as any regularly recurring event. Rhythm in speech is
defined either with respect to every syllable in an utterance or with respect to
stressed syllables. Languages in which rhythm is defined as an equal duration
between the production of successive syllables are called syllable-timed languages.
In these languages, there is an equal prominence associated with every syllable
and they generally lack reduced vowels. Stress-timed languages are those in which
there is an equal duration between stressed syllables and the unstressed syllables
are shortened or lengthened accordingly.
Previous research suggests that Indian languages belong to the syllable-timed
category [4]. This can also be seen from Figures 5.6(a) and 5.6(b) which show
the correlation between the duration of phrases and the number of syllables in
them for the languages Hindi and Tamil respectively. In these figures, the second
quadrant represents the histogram of the number of syllables per phrase, and the
fourth quadrant the histogram of their durations. The first and third quadrants
show the correlation measures between the two, calculated as the upper and lower
triangles of the corresponding cross-correlation matrices. The correlation value
comes to approximately 0.96 for Hindi and 0.95 for Tamil which does support
our initial hypothesis that Indian languages are syllable-timed. This is because
phrases with a similar number of syllables have similar durations. But detailed
analysis shows that the durations of the syllables within the phrase itself vary
significantly depending on the syllable structure.
Since it is not possible to define rhythm on the basis of syllable and stress
timing, a new criterion for analysing rhythm is proposed in this thesis.
Syllabic rhythm is defined here as the rate of production of syllables, in
terms of the number of syllables per word and the duration of each word. The
analysis has been conducted at the intonational phrase level of the prosodic
hierarchy [27]. The reason for restricting the analysis to an intonational
phrase is mainly that the analysis can then be confined to one prosodic
contour. Also, there is a considerable amount of co-articulation between
syllables within a phrase, whereas co-articulation across phrases does not
occur, as a significant pause separates two phrases.

Figure 5.6: (a) Correlation between duration of phrase and number of syllables
per phrase for Hindi. (b) Correlation between duration of phrase and number of
syllables per phrase for Tamil. (NOS: number of syllables.)
This work analyses the role played by the syllabic rhythm in the prosody of
speech. The analysis is performed to observe characteristics of rhythmically similar
and dissimilar utterances. Various listening tests conducted support the hypothesis that there are similarities in prosodic characteristics between rhythmically
similar phrases.
5.2.1 Data Preparation
This section describes the various steps involved in data preparation. The major part of data preparation goes into segmenting the speech waveforms. This is
because manually isolating phrases from continuous speech recordings is a very
tedious task and, therefore, has to be done automatically. To obtain maximum
precision in extracting phrases, the speech waveforms have to be accurately segmented at the monophone, syllable, word and phrase levels.
5.2.1.1 Accurate Transcriptions
The text transcriptions for this task need to be very precise. The text needs
to be accurately annotated with pauses marked at the correct places, which
makes it easier to isolate phrases. Thus, pauses were marked manually in the
text by listening to the sentences in the database and marking commas wherever
the speaker had paused. Corrections were also made to the text if there was a
disparity between the text transcription and the recording.
5.2.1.2 Letter to Sound Rules
The letter to sound rules are a set of hand-written rules for each language,
written in consultation with a linguist. These rules convert the grapheme
notation of a word into its phonetic representation and break a word into
syllables. They include language specific rules such as schwa deletion¹ and
voiced/unvoiced tagging², followed by syllabification rules to get the
appropriate syllables. These syllables are then broken down into monophones
based on a mapping given in [1].
5.2.1.3 Accurate Segmentation of the Speech Waveforms
The segmentation needs to be highly precise in order to extract the phrases
from the waveforms. Segmentation was performed using the hybrid segmentation
approach described in [19], where hidden Markov models (HMMs) are used in
tandem with the group delay algorithm to obtain accurate syllable and
monophone labels. Even though the monophone and syllable labels obtained using
this method are fairly accurate, there are certain cases where they go wrong.
To correct this, ‘bad’ syllables in the database were discarded using acoustic
cues. The remaining syllables and their constituent monophones were then used
to re-build HMMs, and forced alignment was performed at both the monophone and
syllable levels. The syllable and monophone labels obtained using this method
were found to be more accurate.
¹ In Hindi, the schwa (ə) attached to the last consonant in a syllable is
deleted in many cases.
² In Tamil, voiced and unvoiced stop consonants have the same orthographic
representation. Rules are written to tag these consonants as voiced or unvoiced
depending on the context.
Figure 5.7: (A) Pitch and energy contours of rhythmically similar phrases of
Hindi, (B) Pitch and energy contours of rhythmically dissimilar phrases of
Hindi

Figure 5.8: (A) Pitch and energy contours of rhythmically similar phrases of
Tamil, (B) Pitch and energy contours of rhythmically dissimilar phrases of
Tamil
5.2.2 Rhythmic Similarity and Transplantation
Prosodic analysis of Indian languages using text is a hard task, mainly because
Indian language texts do not contain punctuation marks or any other prosodic
markers except for a full stop. Text to speech synthesisers for Indian
languages suffer in quality mainly because of problems like these. Take the
innocuous English sentence, ‘The Panda eats, shoots, and leaves’; here the
punctuation marks provide crucial discriminatory information. The complete lack
of such prosodic markers in Indian languages is the prime motivation for
pursuing a more text-independent approach. Furthermore, the agglutinative
nature of Indian languages causes additional issues in the prosody of an
utterance, mainly because the prosody of words that have been concatenated to
form new words can be significantly different from that of the original words
taken in isolation. An example in Tamil is the three words
வந்து-ெகாண்டு-இருக்கிறான் (vandu-kondu-irukkiraan), which can be spoken in
isolation or as a complex word, வந்துெகாண்டிருக்கிறான் (vandukondirukkiraan).
Although these words mean the exact same thing whether spoken in isolation or
as a single word, the prosody in the two cases is significantly different. For
these reasons, a method which gives less importance to the lexical and semantic
content of the text had to be adopted.
5.2.2.1 Computing Rhythmic Similarity
To compute rhythmic similarity, phrases were first pre-clustered based on the
number of words contained in them and their position of occurrence in a
sentence (begin, middle, end or isolated). For each phrase, a two-dimensional
feature matrix was formed by extracting the number of syllables in each word of
the phrase and the duration of each word. The Frobenius norm was then computed
between each pair of these phrases as follows:
||A||_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{2} |a_{ij} - b_{ij}|^2}    (5.3)
where ||A||_F represents the value of the Frobenius norm, a and b represent the
two phrases and m represents the number of words in either phrase. The pair of
phrases with the smallest Frobenius distance between them was considered
rhythmically most similar, and the pair with the maximum Frobenius distance was
taken as rhythmically most dissimilar. Such ‘best’ and ‘worst’ pairs were found
for numerous phrases, and prosody was transplanted between them.
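A minimal sketch of this computation is given below, assuming each phrase has
already been reduced to an m × 2 array of (syllable count, word duration) rows;
the representation is hypothetical.

    import numpy as np

    def frobenius_distance(a, b):
        """Eq. 5.3 between two phrases with the same number of words.

        `a` and `b` are (m, 2) arrays whose rows hold, for each word, the
        number of syllables and the word duration. Phrases are compared
        within pre-clusters keyed on word count and sentence position, so
        the shapes match.
        """
        return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float), 'fro')

    # The pair with the smallest distance is rhythmically most similar, e.g.
    # frobenius_distance([(2, 0.31), (3, 0.45)], [(2, 0.29), (3, 0.48)])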
5.2.2.2 Transplantation
Rhythmic analysis of the prosodic characteristics of reference phrases was
performed under two different experimental conditions (a phase vocoder
synchronous overlap-add algorithm was used for time-scale modification and a
pitch synchronous overlap-add algorithm for pitch-scale modification):
(i) transplanting the prosody
of a rhythmically similar phrase on the reference, and (ii) transplanting the
prosody of a rhythmically dissimilar phrase on the reference waveform shown in
Figure 5.9(A). It can be seen from Figures 5.9(B) and 5.9(C) that (i) does not
affect the prosody of the waveform significantly, while (ii) causes significant
distortion in the prosodic characteristics of the waveform. From Figures 5.7(A)
and 5.8(A), it is evident that the pitch and energy contours of pairs of
rhythmically similar phrases show some similarity, while from Figures 5.7(B)
and 5.8(B) it can be seen that rhythmically dissimilar pairs of phrases show no
similarity in terms of prosodic characteristics.

Figure 5.9: (A) Waveform, pitch contour and energy contour of a Hindi phrase
(B) Waveform, pitch contour and energy contour of the waveform in (A) when a
rhythmically similar Hindi phrase is transplanted on it (C) Waveform, pitch
contour and energy contour of the waveform in (A) when a rhythmically
dissimilar Hindi phrase is transplanted on it
5.2.3 Experiments and Results
Various experiments were conducted to analyse the similarity in prosodic characteristics of phrases that are rhythmically similar. Listening tests were conducted
and the dynamic time-warped (DTW) distances between the prosodic contours of
the rhythmically similar and dissimilar phrases were also analysed.
5.2.3.1 Listening Test
A similarity to the original utterance listening test was performed to analyse
the similarity in prosodic characteristics of rhythmically similar phrases, as
described in Section 5.1.1. Phrases of different lengths were taken, and the
prosody of rhythmically similar and dissimilar phrases was transplanted on them
as described in Section 5.2.2.2. Evaluators were asked to rate each such group
of phrases on a scale of 1 to 5 based on how similar to the original phrase
they felt the phrase with the transplanted prosody was. The results of the
listening test are given in Table 5.4 and in the plots of Figures 5.10(a) and
5.10(b). The test was undertaken by 15 listeners proficient in the concerned
languages, on a set of 15 phrases. The plots show the range of scores for each
kind of transplanted utterance; the red line indicates the median value of the
scores obtained.
• P: Original phrase with the prosody of a rhythmically similar phrase transplanted on it
• Q: Original phrase with the prosody of a rhythmically dissimilar phrase transplanted on it
Table 5.4: Similarity to the original utterance listening test results

    Language   P      Q
    Hindi      3.85   3.02
    Tamil      3.34   2.59
Table 5.5: DTW distances between pitch and energy contours of Hindi phrases

               Pitch          Energy
    Phrase     P      Q       P      Q
    Phrase-1   1.64   1.83    1.37   1.91
    Phrase-2   2.22   2.41    1.60   1.71
    Phrase-3   1.32   2.23    2.38   2.03
    Phrase-4   1.88   8.31    2.70   3.35
    Phrase-5   2.74   2.99    2.41   2.17

Table 5.6: DTW distances between pitch and energy contours of Tamil phrases

               Pitch          Energy
    Phrase     P      Q       P      Q
    Phrase-1   6.12   8.62    2.40   1.77
    Phrase-2   3.53   5.48    1.43   2.30
    Phrase-3   4.69   5.07    2.76   4.12
    Phrase-4   4.00   8.18    1.16   1.74
    Phrase-5   6.31   6.99    3.05   3.73
5.2.3.2 Dynamic Time Warped Distance
The pitch contours and energy contours of the rhythmically similar and rhythmically dissimilar phrases were compared by computing the DTW distance between them. The pitch and energy contours were first scaled to values between 0 and 1. The DTW distances between 5 example phrases for the pitch and energy contours are shown in Table 5.5 for Hindi and Table 5.6 for Tamil.
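A self-contained sketch of this comparison follows, assuming each contour is a one-dimensional array of per-frame values; the contour values used here are hypothetical.

    import numpy as np

    def minmax_scale(x):
        """Scale a contour to the range [0, 1], as done before the comparison."""
        return (x - x.min()) / (x.max() - x.min())

    def dtw_distance(x, y):
        """Classic dynamic-programming DTW with absolute-difference local cost."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Hypothetical pitch contours (Hz) of a reference and a candidate phrase.
    f0_ref = minmax_scale(np.array([110., 120, 150, 140, 115, 100]))
    f0_cand = minmax_scale(np.array([112., 125, 148, 138, 110, 102]))
    print(dtw_distance(f0_ref, f0_cand))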
Figure 5.10: (a) Similarity to the original utterance scores for Hindi. (b) Similarity to the original utterance scores for Tamil. (Box plots of similarity scores for the similar-phrase and dissimilar-phrase transplantations.)

Figure 5.11: (a) DTW alignment between f0 contours of 2 rhythmically similar phrases. (b) DTW alignment between f0 contours of 2 rhythmically dissimilar phrases. (Axes: f0 of the similar/dissimilar phrase versus f0 of the reference phrase.)

Figure 5.12: (a) DTW alignment between energy contours of 2 rhythmically similar phrases. (b) DTW alignment between energy contours of 2 rhythmically dissimilar phrases. (Axes: STE of the similar/dissimilar phrase versus STE of the reference phrase.)
The optimal matching path between the f0 contours of a pair of rhythmically similar and a pair of rhythmically dissimilar phrases is shown in Figure 5.11. The optimal matching path between the energy contours of such pairs is shown in Figure 5.12.
As can be seen from Table 5.4, similarity to the original utterance does not degrade a great deal when the prosody of a rhythmically similar phrase is transplanted onto a reference. On the other hand, transplanting the prosody of a rhythmically dissimilar phrase results in a significant degradation with respect to the prosody of the original waveform.
It can be observed from Tables 5.5 and 5.6 that, in most cases, the DTW distances between the pitch and energy contours of the reference and the rhythmically similar phrase were smaller than those between the reference and the rhythmically dissimilar phrase. Figures 5.11 and 5.12 show the optimal matching paths between the f0 and energy contours of a pair of rhythmically similar and a pair of rhythmically dissimilar phrases. It can be seen from these figures that the prosodic contours of rhythmically similar phrases align much better with the reference than those of rhythmically dissimilar phrases. Across the full set of phrases, the similar phrase was closer to the reference than the dissimilar one for 59% and 58.31% of the pitch contours, and 67.11% and 69.20% of the energy contours, of Hindi and Tamil respectively. Although this held for the f0 and energy contours individually, joint f0 and energy contours did not show such consistency.
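The consistency figure above is simply the fraction of phrases for which the DTW distance to the similar phrase (P) is smaller than that to the dissimilar phrase (Q). A minimal sketch, using the Hindi pitch-contour distances of Table 5.5 as example input:

    import numpy as np

    # DTW distances of each reference phrase to its rhythmically similar (P)
    # and dissimilar (Q) counterpart; Hindi pitch values from Table 5.5.
    dtw_P = np.array([1.64, 2.22, 1.32, 1.88, 2.74])
    dtw_Q = np.array([1.83, 2.41, 2.23, 8.31, 2.99])

    fraction = np.mean(dtw_P < dtw_Q)  # fraction of phrases where P beats Q
    print(f"Similar phrase closer in {100 * fraction:.2f}% of cases")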
5.2.4 Summary
From the experiments and results in this section, it can be concluded that rhythmic similarity has a significant effect on the prosody of a phrase: rhythmically similar phrases have similar prosodic characteristics when compared to rhythmically dissimilar phrases. This has been verified by performing listening tests and by measuring the DTW distances between the prosodic features of these phrases. An application of this idea would be to use it to improve the quality of text-to-speech synthesis and speech recognition systems.
CHAPTER 6
Conclusion
6.1 Summary
The work done in this thesis discusses methods of overcoming the issues faced in building Indian language TTS systems and analyses the factors that affect the prosody of a spoken utterance. Methods of improvement are proposed for two paradigms, namely USS and HTS.
Firstly, an approach is proposed in which 'bad' units are discarded from the database using acoustic cues such as average f0, average STE and duration. This improved performance, but the quality of synthesis was inconsistent: erroneously segmented units were not discarded when their acoustic properties lay within the permitted range. Hence, a second method was used that minimised these segmentation errors and added a further cue, the likelihood score obtained during segmentation, to discard 'bad' units. This method performed far more consistently than the first.
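As a rough illustration of this pruning scheme, the sketch below discards units whose cues fall outside mean ± k standard deviations, and then applies the likelihood cue; the field names and thresholds are assumptions for illustration, not the settings actually used in the thesis.

    import numpy as np

    def prune_units(units, k=2.0, min_loglik=-50.0):
        """Discard 'bad' units via acoustic-cue ranges and a likelihood cue.

        `units` is a list of dicts with hypothetical fields 'avg_f0',
        'avg_ste', 'duration' and 'loglik' (segmentation log-likelihood).
        """
        for cue in ('avg_f0', 'avg_ste', 'duration'):
            vals = np.array([u[cue] for u in units])
            mu, sigma = vals.mean(), vals.std()
            # Keep only units whose cue lies within the permitted range.
            units = [u for u in units if abs(u[cue] - mu) <= k * sigma]
        # Added cue: the likelihood score obtained during segmentation.
        return [u for u in units if u['loglik'] >= min_loglik]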
The task of prosodic phrase break prediction for Indian languages has also been handled in this work. Since Indian languages are seldom punctuated and lack reliable methods of POS tagging, a method that uses word-level cues from the text to predict prosodic phrase breaks has been proposed. Implementing this was found to result in better quality synthesis. These results also show that the cues used for this task, namely case markers and word-terminal syllables, are accurate predictors of phrase breaks.
The task of phrase break prediction dealt with the rhythmic aspect of speech. Predicting the tonal aspect of speech in the case of Indian languages still posed a big problem. For this purpose, the effect of the structure of a phrase and of the syllabic rhythm within a phrase were analysed to study the tonal aspect of prosody. The experiments conducted showed that both these factors have a role to play in the prosody. It was observed that transplanting prosody between structurally similar and rhythmically similar phrases did not affect the prosody significantly. This prompted the inclusion of these aspects in the cost computation used to select units in a syllable-based USS system. Structural similarity was used as a factor in the cost computation, although this did not result in a significant improvement in synthesis quality.
6.2 Criticism of the thesis
• Word-terminal syllables and case markers only help in predicting intonational phrase boundaries. Other levels of breaks are yet to be dealt with.
• Although structural similarity and syllabic rhythm do have an effect on the prosody, these have not yet been incorporated in the TTS framework.
6.3 Scope for future work
• Finding methods of predicting other levels of breaks within a phrase, which would help in achieving better quality speech synthesis
• Coming up with robust methods of prosody prediction which can easily be extended to other Indian languages
• Exploring other factors of prosody and trying to achieve articulatory speech synthesis
LIST OF PAPERS BASED ON THESIS
1. Ashwin Bellur, K. Badri Narayan, Raghava Krishnan, and Hema Murthy, "Prosody modeling for syllable-based concatenative speech synthesis of Hindi and Tamil," in National Conference on Communications (NCC), pp. 1-5, IEEE, 2011.

2. Raghava Krishnan, S Aswin Shanmugam, A Prakash, K Sekaran and Hema A Murthy, "IIT Madras's Submission to the Blizzard Challenge 2014," Blizzard Challenge 2014, Singapore, September 2014.

3. Anjana Babu, Raghava Krishnan K, Anil Kumar Sao and Hema A Murthy, "A Probabilistic Approach to Selecting Units for Speech Synthesis Based on Acoustic Similarity," in Twenty First National Conference on Communications (NCC), IEEE, 2014.