Cross-lingual Speech Synthesis and Enhancement of
Dysarthric Speech
A THESIS
submitted by
ANUSHA PRAKASH
for the award of the degree
of
MASTER OF SCIENCE
(by Research)
DEPARTMENT OF APPLIED MECHANICS
INDIAN INSTITUTE OF TECHNOLOGY, MADRAS
February 2017
THESIS CERTIFICATE
This is to certify that the thesis titled Cross-lingual Speech Synthesis and Enhancement of Dysarthric Speech, submitted by Anusha Prakash, to the Indian
Institute of Technology, Madras, for the award of the degree of Master of Science (by
Research), is a bonafide record of the research work done by her under our supervision.
The contents of this thesis, in full or in parts, have not been submitted to any other
Institute or University for the award of any degree or diploma.
Dr. M. Ramasubba Reddy
Research Guide
Dept. of Applied Mechanics
IIT-Madras, 600 036
Place: Chennai
Date: February, 2017
Dr. Hema A. Murthy
Research Guide
Dept. of Computer Science
& Engineering
IIT-Madras, 600 036
ACKNOWLEDGEMENTS
This thesis has been an enriching learning experience for me at academic and personal
levels. I am extremely thankful to my advisor Prof. M. Ramasubba Reddy, for his
sustained support and guidance throughout the period of the study. Discussions with
him have been both motivating and thought-provoking. His constructive comments and
attention to detail have contributed to making this study more nuanced.
I would like to thank my co-advisor Prof. Hema A. Murthy for her encouragement
in pursuing research. It would not have been possible to complete my thesis without
her academic inputs and encouragement – including prodding me at various times. Her
outlook on research and life has not only inspired me but also has influenced my own
perspectives. A thesis thrives not just on ideas but also on food. Prof. Hema Murthy had
a clear understanding of this and ensured a regular supply of food to keep up our spirits.
I am grateful to Prof. C. Chandra Sekhar for his insightful inputs during various
stages of my Masters. I would also like to thank Dr. K. Samudravijaya of TIFR, Mumbai
and Dr. T. Nagarajan of SSNCE, Chennai for engaging me in numerous discussions on
a variety of issues related to the study.
I would like to especially thank Raghav, Jom, Aswin, Jeena, Rupak, Arun, Anju,
Krishnaraj, Akshay, Shreya, Shrey, Venkat and other members of DONLab and SMT
Lab for their invaluable help, especially for conducting listening tests. They kept me
sane and always encouraged me to look on the bright side of life.
I appreciate and thank M.V.N. Murthy (Appi mama) for keeping an eye on my
progress and entertaining me whenever I needed it, which was quite frequent. The sensitive talks that we had helped me to keep moving. I would like to thank my parents,
Devaki and Prakash, sisters Pavitra and Kaveri, and grandparents who have always stood
by me by refraining from asking about the progress of my study.
A work of this kind would not have been possible without the financial aid of the
Department of Electronics and Information Technology (DeitY), Government of India. I am grateful for
this. I would like to thank IIT Madras for giving me an opportunity to conduct this
study.
Anusha Prakash
ABSTRACT
KEYWORDS: Text-to-speech synthesis, Indian languages, unified framework, cross-lingual, continuous dysarthric speech, durational analysis
India ranks second in the world in the usage of mobile phones. With about 81 wireless
connections per 100 citizens, smartphones (owing to their small form factor) employing
speech technologies are likely to become the most convenient assistive devices. The
objective is to develop small footprint speech synthesis technologies that will eventually
be integrated into mobile phones to aid marginalised people. Specifically, the current
work focuses on two tasks: developing Indian language text-to-speech (TTS) synthesisers
for the language-challenged, and enhancing continuous dysarthric speech.
The first task aims to make Indian language content available on digital platforms
accessible to the language challenged, i.e., people who are unlettered or have difficulty
in reading the native script(s). Even though Indian languages are spoken by over 17%
of people in the world, they are low-resource languages in terms of the availability of
annotated data and natural language processing tools. In this context, Indian language
text-to-speech synthesisers are developed in a uniform manner by exploiting the similarities that exist among the languages. Cross-lingual analysis and borrowing of models
enable faster development of TTSes for new languages. An average degradation mean
opinion score (DMOS) of 3.59 and word error rate (WER) of 5.63% are obtained from
subjective evaluation of the speech synthesisers.
The second task focuses on aiding people with dysarthria to communicate more effectively by improving the quality of their speech. Dysarthria is a group of motor speech
disorders resulting from a neurological injury. It affects the physical production of
speech. As a result, dysarthric speech is not as comprehensible as normal speech. In this
work, durational attributes of dysarthric speech utterances are studied and corrected to
match those of normal speech. An automatic technique is developed to incorporate these
modifications. This technique is compared with two other techniques available in the
literature, namely, a formant re-synthesis technique and a hidden Markov model (HMM)
based adaptive synthesis technique. Dysarthric speech modified using the proposed approach is preferred over speech modified using the existing approaches.
In building Indian language TTSes, a cross-lingual approach is adopted. Analogously,
to improve the quality of dysarthric speech, normal speech characteristics are transplanted
onto dysarthric speech.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS  i
ABSTRACT  iii
LIST OF TABLES  viii
LIST OF FIGURES  x
ABBREVIATIONS  xi

1 Overview of the Thesis  1
  1.1 Introduction  1
  1.2 Overview of the work  2
  1.3 Key contributions  3
  1.4 Organisation of the thesis  3

I A Unified Approach to Developing Indian Language Text-to-Speech Synthesisers  5

2 Building Statistical Parametric based TTSes for Indian Languages  6
  2.1 Introduction  6
  2.2 Source-filter model of speech production  7
  2.3 State-of-the-art techniques to build a TTS system  9
    2.3.1 Unit selection synthesis (USS)  10
    2.3.2 HMM-based speech synthesis systems (HTS)  11
  2.4 Building TTSes for Indian languages  13
  2.5 Common label set (CLS)  14
  2.6 Common question set (CQS)  17
  2.7 Summary  20

3 Cross-lingual Analysis and Synthesis  21
  3.1 Introduction  21
  3.2 Cross-lingual analysis  22
    3.2.1 Datasets used  22
    3.2.2 Pre-processing the data  23
    3.2.3 Analysis of syllables  24
  3.3 Annotating data in a language  29
    3.3.1 Flat start method  29
    3.3.2 Bootstrap method  30
    3.3.3 Hybrid segmentation  30
  3.4 Cross-lingual borrowing of HMMs for annotation  31
    3.4.1 Evaluation of TTSes  33
    3.4.2 Bootstrap segmentation in source language  33
    3.4.3 Hybrid segmentation in source language  35
    3.4.4 Discussion  36
  3.5 Summary  36

II Enhancement of Continuous Dysarthric Speech  38

4 Dysarthric Speech  39
  4.1 Introduction  39
  4.2 Related work  40
  4.3 Datasets used  41
    4.3.1 Nemours database  42
    4.3.2 Indian English dysarthric speech dataset  44
  4.4 Formant re-synthesis technique  45
  4.5 HMM-based TTS using adaptation  48
  4.6 Summary  50

5 Proposed Modifications to Dysarthric Speech  51
  5.1 Introduction  51
  5.2 Durational analysis  52
    5.2.1 Vowel durations  53
    5.2.2 Average speech rate  55
    5.2.3 Total utterance duration for text in the datasets  56
  5.3 Manual modifications  57
  5.4 Proposed automatic method  59
    5.4.1 Features used  59
    5.4.2 Dynamic Time Warping (DTW)  60
    5.4.3 Adding frame thresholds  63
    5.4.4 Adding short-term energy criteria  63
  5.5 Performance evaluation  67
    5.5.1 Pairwise comparison tests  67
    5.5.2 Intelligibility tests  70
    5.5.3 Analogy to speech synthesis techniques  71
  5.6 Summary  71

6 Conclusion and Future Work  73
  6.1 Summary  73
  6.2 Criticisms of the thesis and scope of future work  74

Appendices  76
A Common Label Set  77
B Common Question Set  86
LIST OF TABLES
2.1 Major Indian languages listed according to the language families  15
2.2 Pentaphone model: example  17
3.1 Details of speech data used for analysis  23
3.2 Syllable rate of different speakers  27
3.3 Flat start segmentation  34
3.4 Cross-lingual segmentation: bootstrap segmentation in source language  34
3.5 Degradation mean opinion scores (DMOS)  35
3.6 Word error rates (WER) (%)  35
4.1 Details of Nemours database used in the experiments  43
4.2 Details of Indian English dysarthric speech dataset  45
LIST OF FIGURES
2.1 Text-to-speech synthesis  6
2.2 Parts involved in speech production  8
2.3 Source-filter model of speech production  9
2.4 Building a TTS system  10
2.5 Overview of HTS System  12
2.6 Partial common label set  17
2.7 Example of decision tree  18
3.1 Zipfian distribution (Zipf) for male data across languages  24
3.2 Percentage of types of syllables based on number of constituent phones for different datasets  25
3.3 Percentage of types of words based on number of constituent syllables for different datasets  25
3.4 Distribution of average duration of top 300 syllables for male data (in descending order of average duration)  27
3.5 Distribution of average duration of top 300 syllables for female data (in descending order of average duration)  28
3.6 Segmentation at phone level for speech utterance in Bengali  30
4.1 Example of UBM-GMM and mean-adapted UBM-GMM  46
4.2 Formant re-synthesis of dysarthric speech  47
4.3 Overview of HTS system with adaptation  49
5.1 Average vowel durations across dysarthric and normal speakers  53
5.2 Duration plot of vowels for (a) dysarthric speech BB and normal speech JPBB, (b) dysarthric speech RL and normal speech JPRL, and (c) Indian dysarthric speech IE and normal speech of other speakers  54
5.3 Average speech rates across dysarthric and normal speakers  56
5.4 Total utterance durations across dysarthric and normal speakers  57
5.5 Original (top panel) and manually modified (bottom panel) dysarthric speech for the utterance The badge is waking the bad spoken by speaker SC  58
5.7 DTW grid  61
5.8 Flowchart of automatic (DTW+STE) method to modify dysarthric speech  64
5.9 DTW paths of an utterance of speaker BK between: (a) original dysarthric speech and normal speech, and (b) modified dysarthric speech and normal speech  64
5.10 Original (top panel) and automatically modified (bottom panel) dysarthric speech for the utterance The dive is waning the watt spoken by speaker BK. PI- phone insertions  65
5.11 Total utterance durations across dysarthric speech, modified dysarthric speech and normal speech  66
5.12 Preference for manually and automatically (DTW+STE) modified speech over original dysarthric speech of different speakers  68
5.13 Word error rates for different types of speech across dysarthric speakers  69
ABBREVIATIONS
CLS    Common Label Set
CQS    Common Question Set
DMOS   Degradation Mean Opinion Score
DTW    Dynamic Time Warping
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
HTS    Hidden Markov model-based Speech Synthesis System
JDE    Joint Density Estimation
LTS    Letter-To-Sound
MFCC   Mel Frequency Cepstral Coefficients
MGC    Mel Generalized Cepstral coefficients
NLP    Natural Language Processing
SPSS   Statistical Parametric Speech Synthesis
STE    Short-Term Energy
TTS    Text-To-Speech
UBM    Universal Background Model
USS    Unit Selection Synthesis
WER    Word Error Rate
CHAPTER 1
Overview of the Thesis
1.1 Introduction
India is the second largest consumer of mobile phones in the world. More than 20% of the
population in India uses smartphones, and this figure is expected to rise rapidly in the
coming years. According to the Telecom Regulatory Authority of India, the wireless tele-density
is 81.35, i.e., for every 100 citizens there are about 81 wireless connections available [38].
Smartphones have the potential to be the most popular and convenient devices of assistive
technology.
The objective of this work is to develop speech synthesis technologies to aid the
language challenged and people with dysarthria. These synthesisers should have a small
footprint so that they can be eventually integrated into mobile phones.
With cheaper smartphones available in the market, there has been an increase in
the Indian language content on the web to cater to the requirements of the non-English
speaking population. Indian language content is in the form of regional newspapers,
stories, blogs, online books, general information, etc. With so much textual information
available in native languages on the digital platform, one of the tasks is to make the
information accessible to people who are language challenged. We define the language
challenged as people who are unlettered or people who have difficulty in reading or writing
the native script(s). For this purpose, text-to-speech (TTS) synthesisers are developed
for Indian languages to read out the native text.
The second task deals with dysarthria, which refers to a group of neuromuscular speech
disorders. Dysarthria affects the motor component of the motor-speech system, i.e., the
physical production of speech. Disruption in muscular control makes the speech imperfect.
A person with dysarthria may have additional problems such as drooling and difficulty in swallowing while
speaking. As a result, dysarthric speech is not as comprehensible as normal speech. Most
often, people with dysarthria have problems with communicating and this inhibits their
social participation. Hence, for effective communication, it is vital that we
develop assistive speech technologies for people with dysarthria.
1.2 Overview of the work
A TTS synthesiser takes the given text as input and generates the corresponding synthesised speech. Although Indian languages are spoken by over 17% of people in the world,
they are considered as low-resource languages in terms of the availability of annotated
data, natural language processing (NLP) tools, etc. In this work, a unified framework
is developed to build TTS systems for multiple Indian languages. Similarities among
Indian languages are studied so as to enable cross-lingual borrowing of models. This
cross-lingual approach results in a reduction of the turnaround time for building a TTS
system for a new language, without compromising on the synthesis quality.
For assisting people with dysarthria to communicate better, the quality of continuous
dysarthric speech needs to be improved. The goal is to improve the naturalness and
intelligibility of dysarthric speech while retaining the characteristics of the speaker. For
this purpose, durational attributes across dysarthric and normal speech utterances are
first studied. An automatic technique is developed to bring dysarthric speech closer
to normal speech. In other words, normal speech characteristics are transplanted onto
dysarthric speech to improve the latter’s quality while preserving the speaker’s voice
characteristics. The performance of this technique is compared with that of two other
techniques available in the literature: a formant re-synthesis technique and a hidden
Markov model (HMM) based adaptive speech synthesis technique.
1.3 Key contributions
The major contributions of this work are as follows:
• A unified framework is developed for easier and faster building of Indian language
TTSes.
• Cross-lingual analysis and borrowing of models are performed to build good quality
TTSes effortlessly.
• Based on analysis of durational attributes, an automatic technique is developed
to transplant normal speech characteristics onto dysarthric speech to improve the
latter’s quality.
1.4 Organisation of the thesis
The thesis is organised into two parts. Part I deals with developing a generic framework
to build Indian language TTS synthesisers. Chapter 2 reviews the source-filter model of
speech production, the state-of-the-art techniques to build TTSes for Indian languages
and the proposed common framework for system building. Chapter 3 describes the cross-lingual studies across Indian languages and experiments involving cross-lingual borrowing
of models to build TTS systems.
Part II deals with work done in the domain of dysarthric speech. The datasets used
in the experiments and two existing techniques for improving dysarthric speech quality
are detailed in Chapter 4. Durational analyses and the proposed technique for improving
dysarthric speech quality are presented in Chapter 5.
The work is summarised in Chapter 6. Limitations of the work along with possible
directions for future implementations are briefly mentioned.
Part I
A Unified Approach to Developing
Indian Language Text-to-Speech
Synthesisers
CHAPTER 2
Building Statistical Parametric based TTSes for
Indian Languages
2.1 Introduction
A text-to-speech (TTS) synthesiser automatically converts the given text to speech. This
is illustrated in Figure 2.1. The TTS synthesisers used in railway announcement
systems, weather forecast systems, etc., are restricted-domain TTSes. As the TTSes that
need to be developed for the given task are not restricted to any particular domain, they
should have the ability to synthesise any arbitrary text. These generic domain TTSes
can be used in a wide range of applications such as aids to persons with disabilities
(visual impairment, dyslexia) and interactive voice response (IVR) systems. The challenge
is to produce human-like speech. The synthesised speech should be both natural and
intelligible.
Figure 2.1: Text-to-speech synthesis
With over 1.2 billion people [37], India is the second most populous nation. While
over 74% of this population is literate in the vernacular, less than 5% of the population
is fluent in English. In this scenario, speech synthesis technologies in regional languages
are crucial enablers for people who are unlettered or have difficulty in reading the native
language scripts.
This work focuses on developing TTS synthesisers for different Indian languages. The
emphasis is on building TTSes as effortlessly and as language-independently as possible.
The objective is to make use of similarities that exist among Indian languages for easier
system building. In this sense, our approach to system building is cross-lingual.
To represent speech, features are extracted from the speech signal. These features are
based on the source-filter model of speech production. Then appropriate synthesis techniques are used to train the TTS. This chapter reviews the features used for representing
speech and the popular synthesis techniques used for building Indian language TTSes.
The proposed generic framework for system building is also presented.
2.2 Source-filter model of speech production
To synthesise human-like speech, the features used to represent speech should approximate the human speech production mechanism. The major components involved in
producing speech are the lungs, the larynx and the vocal tract. Air from the lungs passes through
the larynx and is modulated by the vocal tract to produce speech. The larynx contains
the vocal folds or vocal cords. To produce voiced sounds, the vocal folds vibrate at periodic intervals. For unvoiced sounds, the vocal folds remain open. The vocal tract consists
of the oral and nasal cavities. The lips, teeth, tongue, alveolar ridge, palate (soft and
hard) and glottis are the articulators. Depending on the vocal fold activity and the configuration of the vocal tract, different sounds are produced. The various organs involved
in the speech production mechanism are shown in Figure 2.2.

Figure 2.2: Parts involved in speech production (source: www.teachit.co.uk/armoore/lang/phonology.htm)
The signal produced from the vocal fold activity is termed as the source signal, x[n].
The vocal tract configuration acts like a linear filter (system), whose impulse response is
h[n]. The speech signal, y[n], is the result of the discrete convolution of the source and filter:

y[n] = x[n] ∗ h[n].    (2.1)
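For illustration, a minimal numerical sketch of Eq. (2.1) in Python is given below. The impulse-train source and the single two-pole resonator standing in for the vocal tract are assumptions made purely for this example; they are not parameters used elsewhere in the thesis.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                                   # sampling rate in Hz (assumed)
    f0 = 120                                     # fundamental frequency of the source in Hz (assumed)

    # Source x[n]: impulse train with period fs/f0 samples (voiced excitation)
    x = np.zeros(int(0.5 * fs))                  # 0.5 s of excitation
    x[::fs // f0] = 1.0

    # Filter h[n]: truncated impulse response of a two-pole resonator (~500 Hz),
    # a toy stand-in for one vocal tract configuration
    r, theta = 0.97, 2 * np.pi * 500 / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]
    h = lfilter([1.0], a, np.r_[1.0, np.zeros(511)])

    # Speech y[n] = x[n] * h[n], Eq. (2.1): discrete convolution of source and filter
    y = np.convolve(x, h)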
The source-filter model is the most popular theory of speech production. This model is shown in Figure 2.3. The assumption here is that the source and the filter are independent of each other. Voiced sounds are approximated by a train of impulses at the fundamental frequency (f0), or pitch, and unvoiced sounds by white noise. (Although pitch is the perceived frequency of a sound, the terms pitch and fundamental frequency are used interchangeably in this thesis.) Hence, to represent the source information of speech, pitch values are used. System parameters are represented by mel generalized cepstral (MGC) coefficients [58]. MGC coefficients model the spectrum of speech by taking into account the behaviour of the human auditory system, which has higher resolution in low-frequency regions. The dynamic characteristics of speech are accounted for by using velocity and acceleration parameters of both source and system.
Figure 2.3: Source-filter model of speech production (re-drawn from [40])
2.3 State-of-the-art techniques to build a TTS system
The procedure to build a TTS system is illustrated in Figure 2.4. The training data
consists of some amount of recorded speech along with the corresponding transcriptions
or text. The text is converted to a sub-word sequence (labels) using language-specific
letter-to-sound (LTS) rules or parser. The sub-word units may be phones, diphones,
syllables, etc. Wavefiles are then time-aligned or labeled according to the sub-word
sequence. This process is known as segmentation. A TTS system is then built using
the training data. In the synthesis phase, the test sentence is parsed into the sub-word
sequence and speech is synthesised using the TTS system.
The existing state-of-the-art techniques to build TTSes for Indian languages are unit selection synthesis (USS) [23] and statistical parametric speech synthesis (SPSS) techniques. These techniques are discussed briefly in the following sub-sections.

Figure 2.4: Building a TTS system
2.3.1 Unit selection synthesis (USS)
In USS, after segmenting the speech data according to the given transcriptions, annotated waveforms are stored in the database. During synthesis, the sub-word sequence is
obtained from the text. For Indian languages, syllables have been observed to be the most
suitable units for synthesis [18, 30]. Depending on a cost criterion, which is a combination
of target and concatenation costs, suitable syllables are chosen from the database. The
corresponding waveforms are concatenated to produce the synthesised speech. Though
the synthesis quality is natural, discontinuities are perceived at the concatenation points.
Moreover, since the original speech waveforms are stored in the database, the footprint
size of the synthesiser is large (in hundreds of MB, depending on the amount of training data). Such a large footprint size is unsuitable for our purpose.
The other paradigm in speech synthesis is statistical parametric speech synthesis.
Instead of storing the actual waveforms in the database, models of the sub-word units
are stored. Although the synthesised speech is somewhat muffled due to averaging, the footprint size reduces considerably, to about 6-10 MB, making SPSS a popular choice for use in mobile phones. One popular SPSS technique is the hidden Markov model (HMM) based speech synthesis system, popularly known by its acronym HTS [70].
2.3.2 HMM-based speech synthesis systems (HTS)
The training and synthesis phases of HTS are shown in Figure 2.5. In the training
phase of the HTS, the native script is converted to a sequence of labels. Spectral and
excitation parameters are generated from the speech data. Spectral parameters are 105-dimensional feature vectors consisting of MGC features (35), along with their velocity
(35) and acceleration values (35). Excitation parameters are log f0 values, along with
their velocity and acceleration values. The spectral and excitation features corresponding
to each monophone in the language are modeled by an HMM using the time-aligned
phonetic transcriptions. The HMMs are four-stream HMMs with five states and a single
mixture component per state. The first stream is for the static and dynamic values of
MGC. The remaining three streams each correspond to the log f 0 features, their velocity
and acceleration values. The excitation features model both voiced and unvoiced regions
and hence, they are multi-space probability distributions in three different streams. Five-stream duration HMMs, with a single state and a single mixture component per state,
are also generated for each phone. The model parameters are estimated based on the maximum likelihood criterion:

λ̂ = arg max_λ p(O | W, λ),    (2.2)

where λ represents the model parameters, λ̂ the re-estimated parameters, O the extracted features from the training data and W the transcriptions corresponding to the training data. These are the context-independent monophone models in the language.

Figure 2.5: Overview of HTS System (re-drawn from [70])
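As context for the 105-dimensional spectral stream described above, the sketch below shows how static MGC features can be stacked with velocity and acceleration values. The simple numpy gradient used for the dynamics is an assumption for illustration; HTS computes deltas with its own window coefficients.

    import numpy as np

    def add_dynamics(static):
        # static: (num_frames, 35) matrix of MGC coefficients per frame
        velocity = np.gradient(static, axis=0)          # first-order (velocity) features
        acceleration = np.gradient(velocity, axis=0)    # second-order (acceleration) features
        return np.hstack([static, velocity, acceleration])   # (num_frames, 105)

    mgc = np.random.randn(200, 35)                      # placeholder for extracted MGC features
    observations = add_dynamics(mgc)
    assert observations.shape[1] == 105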
To preserve variations in continuous speech, HTS considers a pentaphone context and
additionally about 55 suprasegmental contexts. The contexts for the training data are
obtained from the utterance structure which is derived from Festival [10]. The basic unit
in an HTS system is the context-dependent pentaphone model. Context-dependent pentaphone models are initialised with context-independent monophone models and stored in
the database. To handle the various contexts, decision tree based clustering is performed
using a question set. Details of the question set are given in Section 2.6.
In the synthesis phase, labels are generated from the test sentence. Based on the
context, appropriate context-dependent HMMs are concatenated to form the sentence
HMM. Based on the duration HMMs, state durations are determined. State sequence is
then obtained from the state durations. Spectral and excitation parameters are generated
such that the output probability is maximised:
ô = arg max_o p(o | w, λ̂),    (2.3)
where o represents the speech parameters, ô the re-estimated speech parameters and
w the transcription of the test sentence. During the generation of speech parameters,
dynamic features are also taken into account. To get the filter coefficients from the
spectral features, the mel log spectral approximation (MLSA) filter is used and speech is
synthesised.
2.4 Building TTSes for Indian languages
According to the 1961 Census report, there are 1652 languages in India. Of these, 29
languages, written in different scripts, have more than 1 million native speakers each. Indian languages come from different language families of the
world. Most of the Indian languages primarily belong to the Indo-Aryan or Dravidian
language group. The former is predominant in the northern part of India and the latter
in the four southern states. Owing to the geographical proximity of the regions and
intermixing among races, there is significant borrowing across languages.
Building a TTS system for each Indian language from scratch is time-consuming and
expensive. Given the lack of annotated data for each language and the rising demand for
the faster development of TTS for multiple Indian languages, a common framework is
developed for training Indian language TTSes. This work is inspired by the efforts in [39]
to build synthesisers for multiple languages. The objective is to exploit similarities that
exist among languages to aid in system building. A natural design choice is to make the
system modules as language-independent as possible, and eventually build TTSes from
only the speech data and transcriptions.
The quality of the synthesised speech in the HTS framework crucially depends on an
accurate representation of the sounds in the language and the question set for clustering
of the phones. Hence, as part of preliminary work, the following modules are developed
for the generic framework:
1. Common label set (CLS)
2. Common question set (CQS)
The common label set and the common question set are discussed in detail in the following sections.
2.5 Common label set (CLS)
The first step in system building is to list the phones in the language. Scripts of Indian languages are non-Latin; hence, they require a separate representation in the Latin alphabet.

Table 2.1: Major Indian languages listed according to the language families
Indo-Aryan: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Rajasthani
Dravidian: Kannada, Malayalam, Tamil, Telugu
Sino-Tibetan: Bodo, Manipuri

Scripts of most Indian languages trace their origin to the Brahmi script. The
characteristic of the Brahmi script is that the characters are ordered according to their
place and manner of articulation. The native scripts are structured corresponding to the
classes of sounds: vowels, stop consonants, semi-vowels, fricatives, voicing and aspiration. Each language is observed to have 11 to 13 vowels. Most consonants are phonetically similar across languages; 33 consonants are common to all languages except Tamil, which has only 26 consonants. In addition to the long
vowels for a, i and u present in most Indo-Aryan languages, Dravidian languages also
have long vowels for e and o.
Studies in [53] suggest the possibility of a compact phone set for Indian languages.
By exploiting the common attributes across Indian languages, a common label set (CLS)
is designed. The labels in the CLS are standard notations for phones across 13 Indian languages (these languages are currently being processed by the ASR/TTS consortia of TDIL, DeitY, Government of India). Table 2.1 shows the languages listed according to their language families.
In the CLS, similar sounds across languages are mapped together and denoted by the
same label. International phonetic alphabet (IPA) symbols are used as references for the
mapping. The v tag for characters in Tamil is used for denoting voiced sounds, as voicing
in Tamil is not represented separately in the written form.
The features of the common label set are as follows:
1. All labels are in lowercase Roman characters. They do not contain any special
characters such as a quote, hyphen, etc.
2. Suffixes used: Since the number of sounds in a language exceeds the number of
Roman characters, certain suffixes are used in the labels. For aspiration, suffix h is
used (g (ग, ಗ, ଗ) vs. gh (घ, ಘ, ଘ) ). Suffix x is used to denote the retroflex place of
articulation (d (द, ದ, ଦ) vs. dx (ड, ಡ, ଡ) ).
3. Labels for language-specific sounds are provided:
• Retroflex zh is predominant in Tamil (ழ) and Malayalam (ഴ)
• In addition to palatal affricates c (च) and j (ज) present in all languages, Marathi
is characterised by dental affricates cx and jx represented by the same characters (च, ज).
• Some Hindi characters with nukta (kq (क़), f (फ़), z (ज़) ) are included to account
for sounds borrowed from foreign languages like Urdu and Persian.
4. The label of a diphthong (ei, ou) is a concatenation of the labels of the individual
vowels. The reverse may not be true. For example, eu is a monophthong and not
a diphthong.
5. Separate grapheme (G) and phoneme (P) representations are provided for a few languages. This is to ensure that the native script is largely recoverable from the CLS
transliteration.
6. The label for a vowel matra is the same as that of the vowel.
7. The halant (◌् ) or the virama ( ್, ్ ) in Indian language scripts denotes the absence
of the inherent vowel a (schwa) in consonant characters. It is not a sound and
hence, there is no label for halant.
A partial set is shown in Figure 2.6. The complete CLS is given in Appendix A. The
LTS rules or the parser converts the native text to the spoken form represented by the
CLS labels. The advantages of using the CLS are:
• Standard representation of phones across languages
• Easier to build systems for languages that one does not know
• For a new language, any new phone can be mapped to a phonetically similar phone
in the CLS. The common framework can then be used.
• Provides a common platform for cross-lingual studies of languages with different
scripts
• To develop speech interfaces for languages that do not have a script, the CLS can
be used for representing phones of the language.
Figure 2.6: Partial common label set
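To make the mapping idea concrete, a toy Python sketch is given below: phonetically similar characters from different scripts share a single CLS label. Only the example mappings quoted in this section are included; the dictionary is illustrative and not the complete label set of Appendix A.

    # Toy subset of the CLS: aspiration marked by suffix 'h', retroflexion by suffix 'x'
    cls_map = {
        "ग": "g",  "ಗ": "g",  "ଗ": "g",      # unaspirated velar stop
        "घ": "gh", "ಘ": "gh", "ଘ": "gh",     # aspirated counterpart
        "द": "d",  "ದ": "d",  "ଦ": "d",      # dental stop
        "ड": "dx", "ಡ": "dx", "ଡ": "dx",     # retroflex counterpart
    }

    def to_cls(characters):
        # Map native-script characters to CLS labels; unknown characters are flagged
        return [cls_map.get(c, "unk") for c in characters]

    print(to_cls(["घ", "ಡ"]))    # ['gh', 'dx']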
2.6 Common question set (CQS)
The basic unit in HTS is the context-dependent pentaphone model. The pentaphone
model is in fact the monophone model in a pentaphone context. This is illustrated with
an example in Table 2.2. For the first three words of the text, “It is a lovely day”, the
monophone and pentaphone context models are given. The model for the central phone
is trained with the given pentaphone context.
Table 2.2: Pentaphone model: example
Monophones    Pentaphone context
i             x-x-i-tx-i
tx            x-i-tx-i-s
i             i-tx-i-s-a
s             tx-i-s-a-l
a             i-s-a-l-a
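The pentaphone contexts of Table 2.2 can be generated mechanically from the monophone sequence, as in the short sketch below. The padding symbol x follows the table; the parse of "lovely" beyond the phones shown in the table is an assumed continuation for illustration.

    def pentaphone_contexts(phones, pad="x"):
        # Each phone is placed in a window of two left and two right neighbours
        padded = [pad, pad] + phones + [pad, pad]
        return ["-".join(padded[i:i + 5]) for i in range(len(phones))]

    # Phones for "It is a lovely ...": the first five rows reproduce Table 2.2
    phones = ["i", "tx", "i", "s", "a", "l", "a", "v", "l", "i"]
    print(pentaphone_contexts(phones)[:5])
    # ['x-x-i-tx-i', 'x-i-tx-i-s', 'i-tx-i-s-a', 'tx-i-s-a-l', 'i-s-a-l-a']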
To obtain natural-sounding TTS, suprasegmental contexts are included. Contexts such as the position of the current phoneme in the current syllable, whether the previous syllable is stressed/accented or not, the number of phonemes in the current syllable, the position of the current syllable in the current word, etc., are considered.

Figure 2.7: Example of decision tree
If a language has 50 phones, the number of monophone models is 50, and the number of pentaphone (context) models is 50⁵. Including suprasegmental contexts will result
in a huge number of combinations (or models). It is not possible to cover all these
combinations in the training data. Some of the combinations may not even be valid
in the language. The problem arises when unseen combinations are present in the test
sentence. To resolve this problem, decision tree-based context clustering is performed
in HTS [69]. A binary tree is used for clustering, where HMMs are split into two subcategories based on certain yes/no questions. The model parameters are shared among
states in each leaf node. An example of the decision tree is shown in Figure 2.7.
A question set is required for tree-based clustering. It consists of questions based on
linguistic and phonetic contexts. Linguistic questions are based on the structure of the
sentence. The information regarding the sentence structure is obtained from the utterance
structure derived from Festival. Phonetic questions are based on the characteristics of the
sounds such as vowels, consonants, stop consonants, nasals, front vowels, back consonants,
sonorants, fricatives, affricates, etc. Phonetic questions need to be carefully designed for
each language. The design of the question set requires knowledge of the acoustic-phonetics
of all the sounds in a language.
In this work, we propose a common question set (CQS), which is derived from the
common label set. Instead of designing a careful set of questions for each language, as is
done conventionally, the CQS is used for training TTSes for multiple Indian languages.
The CQS is a superset of questions across the 13 Indian languages. Questions were first
carefully designed for Hindi (Indo-Aryan) and Tamil (Dravidian). For Hindi, most of the
questions were derived from [17]. Hindi and Tamil covered most of the labels in CLS.
The CQS was then extended to include the remaining labels from the CLS.
Since the conventional question set is designed for English, questions were adapted to
suit the Indian language pronunciations. Some questions present in the conventional
English question set were excluded from the CQS. Indian languages have short and
long vowels. Hence, the question Reduced_Vowel was excluded. Since the question
Syllabic_Consonant had only one member, label rq (ऋ, ಋ, ঋ), it was also excluded
from the CQS. Along with the affricates present in the English question set, aspirated affricates (ch (छ, ഛ, છ); jh (झ, ഝ, ઝ)) and dental affricates (cx (च), jx (ज)) were included. In the question Fricatives, the labels khq (ख़) and gq (ग़) were included. The CQS has a total of
53 questions relevant to Indian languages. The number of questions in the CQS is fixed
regardless of the language, since irrelevant entries in the question set are ignored while
clustering. A part of the CQS is given in Appendix B.
Using the common question set makes the task of system building easier. For a new
language, any new phone can be mapped to labels in CLS, and the CQS can be used
directly for system building.
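The role of the CQS in clustering can be sketched as follows: each question names a phonetic class containing a set of CLS labels, and a yes/no question simply tests membership of a phone in that class. The classes and members below are a small, partly assumed subset given only for illustration; the actual CQS has 53 questions.

    # Illustrative subset of the common question set (class name -> member CLS labels)
    common_question_set = {
        "Vowel":               {"a", "i", "u", "e", "o", "ei", "ou"},
        "Aspirated_Affricate": {"ch", "jh"},
        "Dental_Affricate":    {"cx", "jx"},
        "Fricative":           {"s", "sh", "h", "f", "z", "khq", "gq"},
    }

    def answer(question, phone):
        # Yes/no answer used when splitting HMM states while growing the decision tree
        return phone in common_question_set[question]

    print(answer("Aspirated_Affricate", "jh"))   # True
    print(answer("Vowel", "dx"))                 # False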
2.7 Summary
In this chapter, the source-filter model of speech and the state-of-the-art techniques for
building TTS synthesisers for Indian languages, USS and HTS, are reviewed. The HTS
framework is chosen for system building due to its low footprint size. The common
label set for a standard representation of phones and the common question set for tree-based clustering are designed. Using this generic framework, TTSes for multiple Indian
languages can be built with ease.
CHAPTER 3
Cross-lingual Analysis and Synthesis
3.1 Introduction
The common label set and the common question set are part of the generic framework
for system building. They facilitate the easier development of Indian language TTSes.
In this chapter, we attempt to build a TTS system for a new language with the aid of
another language. This is especially useful in the case of under-resourced languages.
TTSes for such languages can be developed with the aid of another language which has
a considerable amount of data. To perform this type of cross-lingual system building,
similarities among languages are determined based on syllable analyses.
This work is distinctly different from polyglot synthesis [32,33] where common phones
are generated using monophones from various languages. This is also different from the
work in [9], where speakers are clustered to produce a monolingual synthesiser for a new
language with little adaptation data. It is an attempt similar to the GlobalPhone project,
where speech recognisers for multiple languages are built quickly [50].
Previous efforts on analysis have mostly focused on individual Indian languages [41,54,
60]. In [51], similarities among different languages are studied from a language identification perspective. To better design speech technologies from a multilingual perspective, [8]
analyses subtle variations in phonetic features across Indian languages.
In this work, a simple analysis is performed to study multilingual characteristics that
may enable the development of synthesisers cross-lingually. This is motivated by the fact
that Indian languages share a common base of sounds, and vary mostly in the phonotactics (sequencing of phonemes) across languages. Moreover, Indians are characterised
by their bilingual (or multilingual) nature. In this work, syllables are analysed from
both textual and speech data across multiple Indian languages. Based on the analysis,
similarities among languages are determined. For the purpose of annotating speech data
in a language, models are borrowed across languages based on language similarity. The
cross-lingual analysis and experiments with cross-lingual segmentation are presented in
this chapter.
3.2 Cross-lingual analysis
Cross-lingual analysis of syllables is carried out in the context of continuous speech.
Syllables are defined as C*VC* units, where V represents a vowel and C represents a consonant; C* indicates zero or more consonants. Syllables are chosen for
analysis as they are the fundamental units of speech production [20]. Moreover, Indian
languages are characterised by syllable-timed rhythm [46]. The datasets used for analysis
and the types of analysis carried out are detailed in this section.
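A toy Python sketch of the C*VC* grouping is given below. The rule used (each vowel closes a syllable, one consonant of every internal cluster becomes the onset of the next syllable, and word-final consonants form the coda of the last syllable) is an illustrative assumption; the actual syllabification is done by the language-specific parsers.

    VOWELS = {"a", "e", "i", "o", "u"}     # toy vowel inventory for the example

    def syllabify(phones):
        # Group phones into chunks that each end at a vowel
        chunks, current = [], []
        for p in phones:
            current.append(p)
            if p in VOWELS:
                chunks.append(current)
                current = []
        if chunks:
            chunks[-1].extend(current)      # word-final consonants -> coda of the last syllable
        elif current:
            chunks.append(current)
        # Keep one consonant as the onset of each later syllable; push the rest back as coda
        for i in range(1, len(chunks)):
            vowel_pos = next(j for j, p in enumerate(chunks[i]) if p in VOWELS)
            if vowel_pos > 1:
                chunks[i - 1].extend(chunks[i][:vowel_pos - 1])
                chunks[i] = chunks[i][vowel_pos - 1:]
        return ["".join(c) for c in chunks]

    print(syllabify(list("vandu")))         # ['van', 'du']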
3.2.1 Datasets used
Six Indian languages are considered for the analysis: Bengali, Hindi and Marathi, belonging to the Indo-Aryan language group, and Kannada, Tamil and Telugu, belonging to the Dravidian
group. The datasets consist of studio recordings of speech by native professional news
readers or radio jockeys. Transcriptions corresponding to the audio data are also provided. Both female and male data are used. Details of the datasets are given in Table
3.1. The datasets are part of the Indic TTS database [4].
Table 3.1: Details of speech data used for analysis
Data              Duration (in hrs)
Bengali Male      4.78
Bengali Female    4.98
Hindi Male        5.16
Hindi Female      5.18
Marathi Male      3.56
Marathi Female    4.8
Kannada Male      3.4
Kannada Female    3.95
Tamil Male        4.56
Tamil Female      5.37
Telugu Male       4.24
Telugu Female     5.0

3.2.2 Pre-processing the data
The text corresponding to the speech data is parsed into a sequence of syllables using
the language-specific parser. Reference [61] reports that the properties of a syllable are
significantly influenced by its position in a word. The acoustic properties of the same
syllable vary when the syllable is in the beginning, middle or end of a word. Hence,
syllables are postfixed with BEG, MID and END tags, depending on their position in a
word, and are treated as distinct syllables.
The number of syllables in a language is finite but large. The probabilities of occurrence of the top 300 frequent syllables for the male data of different languages are plotted
in Figure 3.1. It is observed that the plots follow Zipfian (Zipf) distributions [24]. The
Zipf law states that if the terms in any natural language data are arranged in decreasing
order of their frequency, then the frequency of each term is inversely proportional to its
rank in the order. The Zipf distributions have a long tail. In the current work, therefore,
the analysis is restricted to only the top 300 syllables, referred to as the “top syllables”.
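The position tagging and frequency pruning described above can be sketched as follows. The corpus here is a toy, pre-syllabified stand-in for the parser output, and the treatment of single-syllable words (tagged BEG) is an assumption, since the text does not specify that case.

    from collections import Counter

    def tag_positions(word_syllables):
        if len(word_syllables) == 1:
            return [word_syllables[0] + "_BEG"]             # assumed handling of monosyllables
        tags = ["BEG"] + ["MID"] * (len(word_syllables) - 2) + ["END"]
        return [s + "_" + t for s, t in zip(word_syllables, tags)]

    # Toy corpus: each word is already a list of syllables (parser output stand-in)
    parsed_corpus = [["van", "du"], ["kon", "du"], ["i", "ruk", "ki", "raan"], ["van", "du"]]

    counts = Counter()
    for word in parsed_corpus:
        counts.update(tag_positions(word))

    top_syllables = [s for s, _ in counts.most_common(300)]   # the "top syllables"
    print(top_syllables[:3])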
Since the Indian languages considered in this work have different scripts, syllables in
the native script are mapped to labels in the common label set. This makes it easier to
compare syllables belonging to different languages. To obtain syllable-level segmentation
of speech data, the hybrid segmentation algorithm is used [52].
Figure 3.1: Zipfian distribution (Zipf) for male data across languages (probability of occurrence of syllables vs. syllable index)
3.2.3 Analysis of syllables
Analyses of syllables from text and speech data are carried out. As part of textual analysis, syllable types and word types are studied. Distributions of acoustic attributes (duration, average energy, average f0) are compared across languages. Based on the
analyses, similarities among languages are determined.
Textual analysis
The number of phones that constitute each syllable in the top syllables list is analysed. The statistics are shown in Figure 3.2.

Figure 3.2: Percentage of types of syllables based on number of constituent phones for different datasets

Syllables consisting of two phones occur most frequently across all languages. These are mainly of CV structure, and in very few instances of VC structure. This is in keeping with the fact that the
writing system of Indian languages is akshara-based, which has a C*V structure. The next most frequently occurring syllable structure consists of three phones, mostly of CVC structure and sometimes of CCV structure. Syllables with a single phone are vowels (V
structure). Syllables consisting of four or more phones occur rarely and are due to the
presence of English words in the text in most of the cases.
Figure 3.3: Percentage of types of words based on number of constituent syllables for different datasets
Unique words in the database are syllabified, and the probabilities of mono-syllabic, bi-syllabic, tri-syllabic words, etc., are studied. The statistics of different word types are
shown in Figure 3.3. For Indo-Aryan languages, the occurrence of words is in the following descending order: bi-syllabic, tri-syllabic, mono-syllabic, and it reduces further as the
number of syllables in the word increases. This pattern is in contrast to that of Dravidian
languages. For Dravidian languages, words consisting of three or more syllables are common. Words containing six or more syllables are rare in Indo-Aryan languages, whereas in Dravidian languages this probability is quite high. This highlights
the agglutinative nature of Dravidian language scripts [65]. In agglutinative languages,
multiple words are concatenated together and written as a single word. The meaning
of the word before and after concatenation does not change. For example, in Tamil,
வந்து ெகாண்டு இருக்கிறான் (vandu kondu irukkiraan) is written as வந்துெகாண்டிருக்கிறான்
(vandukondirukkiraan).
Acoustic analysis
For each instance of a syllable belonging to the top syllables list, acoustic features, namely,
duration, average energy and average f0, are calculated. A frame size of 25 ms and a
frame shift of 10 ms are used for determining these values. The average duration of
each syllable is normalised with respect to the average syllable rate of the speaker. The
average syllable rate is the number of syllables uttered in one second. The syllable rate
of different speakers is given in Table 3.2.
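The per-speaker normalisation can be read as multiplying each syllable's average duration by the speaker's syllable rate (equivalently, dividing by the speaker's mean syllable duration); the exact formula is not spelt out in the text, so the sketch below is one plausible reading, with placeholder durations.

    def normalise_durations(avg_durations, syllable_rate):
        # avg_durations: {syllable: mean duration in seconds}; syllable_rate: syllables/sec
        return {syl: dur * syllable_rate for syl, dur in avg_durations.items()}

    # Placeholder durations; 5.66 syllables/sec is the Tamil male rate from Table 3.2
    print(normalise_durations({"van_BEG": 0.21, "du_END": 0.15}, syllable_rate=5.66))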
The distribution of normalised average duration of syllables is plotted after arranging
the top 300 syllables in descending order of their values in each language. It is clearly seen
from Figure 3.4, which is the distribution for male data, that syllables in the Dravidian
languages have lower average syllable durations, when compared to those belonging to the
Indo-Aryan languages. The shorter average duration of syllables in Dravidian languages is due to the agglutinative nature of their scripts. As a word becomes longer in terms of the number of syllables, speakers tend to shorten it in terms of duration. This is not so in the case of the distributions for the female data (Figure 3.5), where the distributions of the Bengali and Hindi data overlap with those of the Dravidian languages. Additionally, 5 hours of Rajasthani male and female data are used for the analysis, and their distributions conform to the said pattern for both male and female data. Further studies have to be carried out to analyse the anomaly occurring for the female data in Hindi and Bengali.

Table 3.2: Syllable rate of different speakers
Data              Syllable rate (syllables/sec)
Bengali Male      4.91
Bengali Female    5.14
Hindi Male        5.01
Hindi Female      4.94
Marathi Male      4.15
Marathi Female    4.22
Kannada Male      5.73
Kannada Female    4.94
Tamil Male        5.66
Tamil Female      5.18
Telugu Male       6.52
Telugu Female     5.2

Figure 3.4: Distribution of average duration of top 300 syllables for male data (in descending order of average duration)

Figure 3.5: Distribution of average duration of top 300 syllables for female data (in descending order of average duration)
The values of average energy and average f0 are normalised to zero mean and unit variance for every syllable in the top syllables list. An analysis similar to that of duration is carried out for the average energy and average f0 of syllables. These results are not presented in the thesis for lack of conclusive findings.

The cross-lingual analysis of syllables validates the following observations with respect to Indian languages:
• The writing system of Indian languages is akshara-based.
• The agglutinative nature of Dravidian language scripts is evident from the analysis
of word types and average syllable durations.
• These studies highlight the differences between Dravidian and Indo-Aryan language
groups.
Using this classification of languages into Indo-Aryan and Dravidian, TTS synthesisers
are built with the aid of another language belonging to the same language group.
3.3 Annotating data in a language
For building a phone-based HTS system, speech data has to be segmented at the phone
level. Segmentation is performed using the speech wavefiles and the corresponding transcriptions. An example of a Bengali utterance segmented at the phone level is shown in
Figure 3.6. Accurate phone-level segmentation results in better acoustic models using
which a good-quality TTS can be developed. The techniques generally used for segmentation are flat start initialisation [69], the bootstrap method and the hybrid segmentation
technique [52]. Flat start initialisation and hybrid segmentation techniques are fully automatic, while the bootstrap method is semi-automatic. For segmentation, HMMs model
only the system parameters, which are mel frequency cepstral coefficients (MFCC) and
their dynamic values.
3.3.1 Flat start method
In the flat start method, the entire duration of a wave file is divided equally among the
phones that make up that utterance. Monophone HMMs are then built from the time-aligned data. Basically, the state means and variances of the HMMs are initialised with the global
mean and variance [69]. These models are then iteratively re-estimated using embedded re-estimation, and phone-level segmentation is obtained by forced Viterbi alignment.

Figure 3.6: Segmentation at phone level for speech utterance in Bengali
3.3.2 Bootstrap method
The bootstrap method is semi-automatic. A few minutes of data are chosen with
enough phone coverage and are manually segmented at the phone level. HMMs are built
from the manually segmented data and are used to segment the entire data at the phone
level using the forced Viterbi algorithm. Models are then trained from the entire segmented
data and iteratively re-estimated using embedded re-estimation. Segmentation is then
obtained from the re-estimated models.
3.3.3 Hybrid segmentation
In the hybrid segmentation algorithm [52], machine learning is used in tandem with signal processing. Using flat start initialisation, where all the monophone HMMs are assigned the global mean and variance, HMMs are re-estimated. However, these HMMs do not consider any boundary information. The boundary information is obtained from signal processing: a group-delay based segmentation method gives
a set of syllable boundaries. The closest group-delay boundary in the vicinity of an HMM
boundary is considered as the correct syllable boundary. Then embedded re-estimation
is performed by restricting segmentation to the syllable boundary. This is performed
a couple of times iteratively. Further boundary correction is achieved using short-term
energy and spectral flux as cues. The resulting boundaries are the syllable boundaries.
HMMs are then re-estimated within syllable boundaries and used to time-align the data
at the phone level.
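One step of the hybrid scheme, snapping each HMM boundary to the closest group-delay boundary, is sketched below; the boundary times are illustrative placeholders.

    def snap_to_group_delay(hmm_boundaries, gd_boundaries):
        # For each HMM-derived boundary, pick the nearest group-delay boundary (in seconds)
        return [min(gd_boundaries, key=lambda b: abs(b - t)) for t in hmm_boundaries]

    hmm_bounds = [0.31, 0.58, 0.90]            # boundaries from re-estimated HMMs (placeholder)
    gd_bounds = [0.0, 0.29, 0.61, 0.88, 1.20]  # boundaries from group-delay processing (placeholder)
    print(snap_to_group_delay(hmm_bounds, gd_bounds))   # [0.29, 0.61, 0.88]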
3.4 Cross-lingual borrowing of HMMs for annotation
Owing to similarities among Indian languages, we develop text-to-speech (TTS) synthesisers by cross-lingual segmentation. The objective is to segment speech data at the phone
level using acoustic models from other languages as initial models, after mapping phones
of the target language to those of the source language.
In the field of speech recognition, borrowing of acoustic models across languages has
been quite successful. In [5,7,35], the phoneme mapping from source language(s) to target
language is obtained through data-driven or knowledge-based approaches. Next, context-independent acoustic models are copied to the target language. With some adaptation
techniques, waveforms of the target language are segmented into phonemes. In some
instances, context-dependent acoustic models are also borrowed across languages [34].
Unlike [56], where phoneme mapping is obtained automatically, the common label set for
Indian languages is used for mapping in the current work. Context-independent HMMs
(CI-HMMs) are considered as the acoustic models for cross-lingual borrowing.
The following steps summarise the procedure to annotate speech data cross-lingually (a code sketch of this procedure follows the list):
1. Phone-level transcriptions in terms of the labels in CLS are obtained for the source
and target languages.
2. Phone-level aligned speech data is obtained in the source language using any of the
conventional segmentation techniques.
3. Context-independent monophone HMMs (CI-HMMs) are trained from the aligned
data.
4. Context-independent monophone HMMs from source language are borrowed to the
target language for all the phones in the latter. These are the initial monophone
models in the target language.
5. For any phone present in the target language that is not present in the source
language, the model corresponding to a phonetically similar phone in the source
language is borrowed.
6. Using the initial HMMs and the CLS transcriptions, speech data of the target
language is segmented using forced Viterbi alignment.
7. Using the newly derived time-aligned speech data, new context-independent phone
models are trained.
8. Steps 6 and 7 are performed iteratively N times (for our experiments, N = 5).
After N iterations, a set of language-dependent HMMs is obtained.
9. The HMMs are used to segment the entire data again. These are the final phone-level boundaries in the target language.
Using the time-aligned data and the common question set for decision-tree based
clustering, a TTS system is built for the target language. TTSes are trained by cross-lingual segmentation, considering different conventional segmentation techniques in
the source language. For Indo-Aryan languages, Hindi is chosen as the source language,
and for Dravidian languages, Tamil is chosen. This is because of the availability of some
amount of manually annotated data in these languages.
3.4.1 Evaluation of TTSes
To evaluate the quality of synthesised speech, subjective evaluations are conducted. The
test sentences used in the evaluations should not be from the training data. Therefore,
test sentences are collected from the web belonging to different domains like news, sports,
nature, travel, etc. The evaluation consists of degradation mean opinion score (DMOS)
[63] and word error rate (WER) [6] tests. In the DMOS test, listeners are asked to rate the
quality of the synthesised speech on a scale of 1-5, 5 being the best (human-like speech).
The scores are then calculated with respect to the scores of natural or recorded speech. In
the WER test, listeners are asked to transcribe a set of semantically unpredictable sentences
(SUS). SUS are grammatically correct sentences, which do not convey any meaning.
WER is calculated based on the number of insertions, deletions and substitutions of
words. DMOS is a measure of the naturalness and WER is a measure of the intelligibility
of the system. A good quality TTS has high DMOS and low WER scores.
The evaluation is conducted in a noise-free environment using headphones. Ideally, a large number of listeners would evaluate each TTS system. However, owing to the limited
availability of native listeners, and after discarding the evaluations of a few listeners (outliers),
an average of 10 native listeners evaluated the systems in each language.
3.4.2 Bootstrap segmentation in source language
About 3 hours of female data in Bengali, Hindi, Marathi, Malayalam, Tamil and Telugu
are used for the experiments. For each source language, a few minutes of data (≈ 5 minutes)
are chosen such that all the phones of that language are covered. Speech data is manually
annotated at the phone level using visual representations such as waveforms and the
corresponding spectrograms. Wavesurfer software is used for this purpose [55]. Speech
Table 3.3: Flat start segmentation

Language | DMOS | WER (%)
Bengali  | 2.30 | 25.56
Tamil    | 2.40 |  6.73
Telugu   | 1.93 | 14.63
Table 3.4: Cross-lingual segmentation: bootstrap segmentation in source language

Target Language | Source Language | DMOS | WER (%)
Bengali         | Hindi           | 2.50 | 15.06
Marathi         | Hindi           | 2.79 |  3.48
Telugu          | Tamil           | 2.63 | 16.41
Malayalam       | Tamil           | 2.88 |  3.13
Tamil           | Tamil           | 2.97 |  1.92
Tamil           | Hindi           | 2.53 |  5.16
data in the source language is time-aligned using bootstrap segmentation. One set of
TTSes is built for the target languages using flat start segmentation. Another set of
TTSes is built for different target languages by cross-lingual borrowing of CI-HMMs from
the source language (Tamil or Hindi depending on the language group) for annotation.
Evaluations are conducted to compare the performance of both sets of TTSes. For
each system, 10 sentences are evaluated by each listener. From Tables 3.3 and 3.4, it is
seen that the cross-lingual segmentation outperforms the flat-start method with higher
DMOS and lower WER for most target languages. Models in the source language built by
bootstrap segmentation are better initial models than those built in the target language
using flat start segmentation.
Further, an experiment is performed across language groups. A Tamil TTS is first
developed by segmenting the Tamil speech data at the phone level using the bootstrap
method. This is the language-specific Tamil TTS, for which Tamil is both the source and
the target language. Another Tamil TTS is built by borrowing
models from Hindi, which is an Indo-Aryan language. This is the cross-lingually trained
Tamil TTS.
Table 3.5: Degradation mean opinion scores (DMOS)

Language | Using hybrid segmentation | Source language | Using source language
Bengali  | 2.99                      | Hindi           | 3.24
Hindi    | 3.45                      | Tamil           | 3.43
Tamil    | 4.21                      | Hindi           | 4.22
Telugu   | 3.28                      | Tamil           | 3.46
Table 3.6: Word error rates (WER) (%)

Language | Using hybrid segmentation | Source language | Using source language
Bengali  | 9.82                      | Hindi           | 3.67
Hindi    | 6.06                      | Tamil           | 5.40
Tamil    | 0.40                      | Hindi           | 3.90
Telugu   | 12.50                     | Tamil           | 9.54
Although the language-specific Tamil TTS performs better than its cross-lingually trained counterpart, the degradation is not considerable, given that it is cross-lingual synthesis across language groups.
3.4.3 Hybrid segmentation in source language
About 4-5 hours of male data in Bengali, Hindi, Tamil and Telugu are used in the
experiments. One set of TTSes is trained for the target languages using the hybrid
segmentation algorithm. The next set of TTSes is developed for different target languages
by using CI-HMMs from the source language as the initial models for segmentation. The
source language data is segmented using the hybrid segmentation technique. Additionally,
another experiment is conducted wherein the source language is from a different family
group. For Hindi, Tamil is used as the source language, and vice-versa.
A set of 15 and 10 synthesised sentences for the DMOS and WER tests, respectively,
was evaluated by each listener. Results of the evaluation are shown in Tables 3.5 and 3.6.
From both DMOS and WER scores, it is evident that using HMMs from a source language
leads to better synthesis quality in terms of naturalness and intelligibility in most of the
cases. This is due to the use of better initial models as a result of accurate phone-level
segmentation in the source language. It is interesting to note that cross-lingual synthesis
across language groups also leads to good synthesis quality.
3.4.4 Discussion
In Section 3.4.2, we see that cross-lingual synthesis using bootstrap segmentation outperforms the language-specific system built using flat start segmentation. Although the
flat start method is simple to implement, its initial assumption of equal duration for all
phones is incorrect (vowels, for instance, are generally longer than consonants).
From Tables 3.4, 3.5 and 3.6, it is observed that the language-specific and cross-lingual
systems built using hybrid segmentation outperform both the language-specific and the
cross-lingual systems built using bootstrap segmentation. This is in accordance with the
observations made for language-specific systems in [52]. As phones cannot be perceived
properly in isolation, manual labeling of phone boundaries is prone to errors. Hybrid segmentation gives better phone boundaries. Cross-lingual borrowing of models also leads
to better synthesis quality. This is due to robust models in the source language as a
result of accurate segmentation.
Cross-lingual synthesis across language groups also leads to good synthesis quality.
This can be attributed to the fact that Indian languages share a common set of sounds.
Over time, due to the borrowing of sounds (or words) across languages, the divergence
between the language groups has narrowed down [15].
3.5 Summary
This chapter describes the cross-lingual approach to building a TTS synthesiser. Based
on syllable analyses across Indian languages, similarities among languages are determined. Developing good quality TTS systems crucially depends on accurate phone-level
segmentation of the speech data. The performance of cross-lingually trained TTSes is
on par with systems built conventionally. This paves the way for the development of
polyglot-based synthesisers for Indian languages.
Part II
Enhancement of Continuous Dysarthric Speech
CHAPTER 4
Dysarthric Speech
4.1 Introduction
Dysarthria refers to a group of speech disorders resulting from neurological injury [2]. It
can be developmental (cerebral palsy) or acquired. The causes of acquired dysarthria
include degenerative diseases (Parkinson’s disease, multiple sclerosis), traumatic brain
injuries and strokes affecting neuromuscular control. Based on the region of the nervous
system that is affected, dysarthrias are classified as spastic, ataxic, flaccid, hyperkinetic,
hypokinetic and mixed.
The word dysarthria, originating from dys and arthrosis, means difficult or imperfect articulation. Dysarthria can affect any of the speech subsystems, such as respiration,
phonation, resonance, articulation and prosody. Dysarthric speech is characterised by
the poor articulation of phonemes, problems with speech rate, incorrect pitch trajectory,
the voicing of unvoiced units, etc. Additionally, speech may be affected due to problems with swallowing or drooling while speaking. As a result, dysarthric speech is not
as comprehensible as normal speech. This affects communication, social interaction and
rehabilitation of people with dysarthria. Hence, to aid people with dysarthria, the quality
of dysarthric speech needs to be improved.
The objective of this work is to create aids for people with dysarthria such that
given a dysarthric speech utterance, more natural and intelligible speech is produced
while retaining the characteristics of the speaker. The focus is on continuous speech
rather than isolated words (or sub-word units). Given the multiple causes and types of
dysarthria, the challenge is in developing a generic technique that can enhance dysarthric
speech across this wide range.
This chapter reviews the previous attempts to enhance dysarthric speech. The standard dysarthric speech Nemours database [36] and an Indian English dysarthric speech
dataset used in the experiments are then detailed. To compare the performance of the
proposed technique for enhancement (presented in Chapter 5), two techniques available
in the literature– a formant re-synthesis technique and an HMM-based adaptive synthesis
technique, are implemented. These two techniques are briefly described in this chapter.
4.2 Related work
For enabling people with dysarthria, there have been several efforts in speech recognition [12, 25] and automatic correction/synthesis of dysarthric speech [22, 28, 43]. In [22],
dynamic time warping (DTW) is first performed across dysarthric and normal phoneme
feature vectors for each utterance, and then a transformation function is determined to
correct dysarthric speech. The word-level intelligibility of dysarthric speech utterances
of speaker LL in Nemours database is improved from 67% to 87%. The drawback of this
method is that labeling and segmentation of dysarthric and normal speech need to be
manually verified and corrected. In [43] and [44], dysarthric speech is improved by correcting pronunciation errors and by morphing the waveform in time and frequency. The
authors report that the morphing does not increase the intelligibility of dysarthric speech.
Moreover, to correct pronunciation errors, corresponding transcriptions are required.
Some corrections are made by using an HMM-based speech recogniser followed by
a concatenation algorithm and grafting technique to correct wrongly uttered units [67],
or by synthesising speech using HMM-based adaptation [13]. In [49], poorly uttered
phonemes are replaced by phonemes from normal speech with discontinuities in short
term energy, pitch and formant contours at concatenation points addressed. In [67]
and [49], the poorly uttered units in dysarthric speech need to be identified. In [27]
and [28], the intelligibility of vowels in isolated words spoken by a dysarthric person
is improved by formant re-synthesis of transformed formants, smoothened energy and
synthetic pitch contours.
Two techniques available in the literature are implemented to improve the intelligibility of dysarthric speech. The first is the formant re-synthesis method [28], with a few
modifications to suit the data used in the experiments. This method is explained in Section 4.4. The second is the HMM-based text-to-speech (TTS) synthesis system adapted
to the dysarthric person’s voice [13]. We assume that a recognition system having 100%
recognition accuracy is already available to transcribe speech for synthesis. The HMM-based technique is discussed in Section 4.5.
4.3 Datasets used
Standard databases available for dysarthric speech are Universal Access, TORGO and
Nemours [29,36,45]. Universal Access database contains audiovisual isolated word recordings and is hence not suitable for our purpose. TORGO database consists of acoustic
and articulatory data of non-words, short words, and complete sentences. However, complete sentences are fewer in number and they account for low phone coverage. Nemours
database contains 74 utterances for each dysarthric speaker and is therefore suitable for
the current work.
Additionally, speech of a native Indian having dysarthria is collected and used in the
experiments. The motive for collecting the Indian English dataset is two-fold:
• To make use of unstructured text, unlike the structured text present in Nemours
database (explained in Section 4.3.1).
• To assess the performance of the techniques for dysarthric speech of a native Indian.
The Indian English dysarthric speech data is referred to as “IE” in this work (available at www.iitm.ac.in/donlab/website_files/resources/IEDysarthria.zip). The
Nemours database and the Indian English dysarthric speech dataset are described in this
section.
4.3.1 Nemours database
Nemours database [36] consists of dysarthric speech data of 11 male North American
speakers. The degree of severity of dysarthria varies across speakers: mild (BB, FB, LL,
MH), moderate (JF, RK, RL) and severe (BK, BV, KS, SC). The speech data consists of
74 nonsense sentences and 2 paragraphs for each speaker. Since phone-level transcriptions
for the paragraphs are not provided, speech data of the paragraphs are excluded from
the experiments. The sentences follow the same format: The X is Y’ing the Z, where
X and Z are selected from a set of 74 monosyllabic nouns with X ≠ Z, and Y’ing is
selected from a set of 37 bisyllabic verbs. An example of such a structured sentence is–
The bash is pairing the bath. Along with the recording of each dysarthric speaker, the
corresponding speech of a normal speaker is provided. The normal speakers are appended
with the prefix “JP”. Phone-level transcriptions of each word are available in terms of
Arpabet labels [66].
Pauses within an utterance were already marked for speaker RK in the database; however,
they were not available for speakers BK, RL and SC. Hence, for these three
Table 4.1: Details of Nemours database used in the experiments

Number of speakers             | 10
Number of sentences/speaker    | 74
Number of unique words/speaker | 113
Total duration/speaker         | 2.9-9 minutes depending on the speaker
speakers, pauses were marked manually. Significant intra-utterance pauses are not present
in the speech of other dysarthric speakers. For speaker KS, phonemic labeling is not
provided; hence, this speaker is excluded from the experiments. Details of the Nemours database
used in the experiments are given in Table 4.1.
Phone-level segmentation is available for dysarthric speech data while only word-level
segmentation is available for normal speech data. The procedure to obtain phone-level
segmentation from word-level segmentation for normal speech data is described below.
Segmentation of normal speech data at the phone level
Hidden Markov models (HMM) are used to segment normal speech data at the phone
level. Word-level boundaries and phone transcriptions for each word are available in
the database. HMMs are used to model source and system parameters of monophones
in the data. The source features are log f 0 (pitch) values, along with their velocity
and acceleration values. The system parameters are mel frequency cepstral coefficients
(MFCC), along with their velocity and acceleration values. Instead of embedded training
of HMM parameters at the sentence level, embedded re-estimation is restricted to the
word boundary. This is inspired by [3], where phone-level alignment is obtained from
embedded training within syllable boundaries.
HMMs built using Carnegie Mellon University (CMU) corpus [31] were used as initial
monophone HMMs instead of using the conventional flat start method to build HMMs,
where the models were initialised with global mean and variance. This resulted in better
phone boundaries. Data of the American speaker referred to as “rms” in the CMU corpus was
used for this purpose.
4.3.2 Indian English dysarthric speech dataset
The process of text selection, speech recording and segmentation of the Indian English
dysarthric speech data is detailed.
Text selection
The text was chosen from the CMU corpus, which contains unstructured text [31]. 73 sentences were selected to ensure sufficient phone coverage. The phoneme transcriptions of the text were obtained from the CMU pronunciation dictionary [11] and were
later manually corrected when the word pronunciation varied. An additional label “pau”
was added to account for pauses or silences.
Speech recording
Speech of an Indian male suffering from cerebral palsy, who is mildly dysarthric, was
recorded. The speech was recorded in a low-noise environment and sampled at 16 kHz,
with 16 significant bits. The recording was performed over several sessions, each session
not exceeding half-an-hour. Frequent breaks were given during the sessions as per the
convenience of the speaker so that fatigue did not affect the quality of speech. About 11
minutes of speech data was collected. Frenchay dysarthria assessment (FDA) [16] was
not performed due to unavailability of a speech pathologist. Details of the IE dataset are
given in Table 4.2.
Table 4.2: Details of Indian English dysarthric speech dataset

Number of sentences    | 73
Number of unique words | 369
Total duration         | 11 minutes
Segmentation at the phone level
Before segmenting the dysarthric speech data, long silence regions (more than 100 ms)
were removed from the speech waveforms by voice activity detection (VAD). The 11 minutes
of data were thus reduced to 8.5 minutes. Segmentation was performed semi-automatically.
HMMs were built from already available normal English speech data of an Indian (Malayalam) speaker “IEm” [4], as speaker IE is a native Malayalam speaker. These HMMs were
used as initial HMMs to segment dysarthric speech data at the phone level. Segmentation
was then manually inspected and corrected.
4.4 Formant re-synthesis technique
In reference [28], the intelligibility of dysarthric vowels in isolated words of CVC type
(C- consonant, V- vowel) is improved. Borrowing from this work, a similar approach is
adopted to improve intelligibility of continuous dysarthric speech in the current work.
Formants F1-F4, pitch and short-term energy values are extracted from dysarthric and
normal speech. Frame length of 25 ms and frame shift of 10 ms are considered. Formant
transformation from dysarthric space to normal space is only carried out in the vowel
regions. For this purpose, utterances segmented at the phone level are required. The
transformation makes use of vowel boundaries and vowel identities. Then, formant values
at the stable point of the vowels are determined [28]. The stable point (or region) is the
vowel point (or region) that is least affected by context. A 4-dimensional feature vector
represents each instance of a vowel: F1stable, F2stable, F3stable and vowel duration.
Figure 4.1: Example of UBM-GMM and mean-adapted UBM-GMM
In [28], formant transformation is achieved by training Gaussian mixture model (GMM)
parameters using joint density estimation (JDE). This works well for data that is phonetically balanced. The data used in the experiments in this work suffers from data
imbalance as the frequency of individual vowels in the database varies. To overcome this
problem, a universal background model-GMM (UBM-GMM) [42] is trained and adapted
to individual vowels of dysarthric and normal speech. Maximum a posteriori (MAP) is
the adaptation algorithm used. An example of the UBM-GMM and its mean adapted
model is shown in Figure 4.1. The procedure to obtain adapted models is as follows:
• Each frame in a vowel region is represented by a 4-dimensional feature vector–
formants F1-F3, and vowel duration. All the feature vectors, irrespective of stable
points, are pooled together for all vowel instances across dysarthric and normal
speech to train the UBM-GMM.
• The adaptation data for a vowel of dysarthric or normal speech is a 4-dimensional
feature set (F1stable, F2stable, F3stable, vowel duration) across all instances of
that vowel.
• A set of (2 × number_of_vowels) models is obtained by adapting only the means of
the UBM-GMM. This is a codebook of means for the same vowel across dysarthric
and normal speech.
The dysarthric speech data is initially split into train (80%) and test data (20%).
Normal speech corresponding to the dysarthric speech in Nemours database is used for
obtaining the codebook. For the Indian dysarthric speech, speech of speaker “ksp” from
CMU corpus [31] is used as normal speech. Adapted models are built using the train
data. The codebook size of the UBM-GMM is 64.
Figure 4.2: Formant re-synthesis of dysarthric speech (re-drawn from [28])
The procedure to re-synthesise dysarthric speech is shown in Figure 4.2. For test
data, pitch (F0) and energy contours are smoothened. Smoothening is performed by
using a median filter of order 3 and then low-passing using a Hanning window. This
approach differs from the work carried out in reference [28], where a synthetic F0 contour
is used for re-synthesis. Using the vowel boundaries, every vowel in the test utterance is
represented by a 4-dimensional feature vector (stable F1-F3+vowel duration). Using the
codebook of means for vowels across dysarthric and normal speech, the features of the
dysarthric vowels are replaced by the means of their normal counterpart. The replaced
or transformed stable point formants represent the entire vowel. Hence, the same stable
point formant value is repeated across the duration of the vowel. Using the transformed
formant contours, smoothened pitch and energy contours, speech is synthesised using a
formant vocoder [40]. The modified dysarthric speech is then obtained by replacing non-vowel regions in the re-synthesised dysarthric speech by the original dysarthric speech.
4.5 HMM-based TTS using adaptation
An HMM-based TTS synthesiser (HTS) adapted to the dysarthric person’s voice is developed [13, 68]. This is to evaluate the maximum intelligibility of synthesised speech
that can be obtained given a recognition system for dysarthric speech that is 100% accurate. The purpose of using an HMM-based adaptive TTS synthesiser is two-fold: (1)
there is not enough data to build a speaker-dependent system for every dysarthric speaker, and
(2) the dysarthric speaker’s pronunciation can be corrected.
The HMM-based adaptive TTS can be divided into three phases: training, adaptation and synthesis. This is illustrated in Figure 4.3. Audio files and corresponding
transcriptions are available for training and adaptation data. In the training phase,
mel-generalized cepstral (MGC) coefficients and log f0 values, along with their velocity
and acceleration values are extracted from the audio files. Average voice models are then
trained from speech features corresponding to the training data. In the adaptation phase,
CSMAPLR+MAP adaptation (CSMAPLR- constrained structural maximum a posteriori linear regression) is performed to adapt the average voice models to the adaptation
features. Speaker adaptive training (SAT) is performed to reduce the influence of speaker
48
Figure 4.3: Overview of HTS system with adaptation (re-drawn from [68])
differences in the training data. In the synthesis phase, the test sentence is converted to a
sequence of phones. Phone HMMs are chosen based on the context and concatenated to
form the sentence HMM. MGC coefficients and f0 values are generated from the sentence
HMM, and speech is synthesised using the mel log spectrum approximation (MLSA) filter.
To build an adaptive TTS system for speakers in Nemours database, speech of two
normal American male speakers, “bdl” and “rms” from the CMU corpus, is used as the
training data. For the Indian English dysarthric data, the training data is speech of an
Indian speaker “ksp” from the CMU corpus. One hour of speech data is available for every
speaker in the CMU corpus. Dysarthric speech data is split into adaptation data (80%)
and test data (20%). Synthesised speech of the sentences in the test data is used in the
subjective evaluation. For developing the HMM-based adaptive TTS synthesiser, HTS
version 2.3 software is used.
4.6 Summary
This chapter briefly reviews the previous attempts at improving the quality of dysarthric
speech automatically. The Nemours database and the Indian English dysarthric speech
dataset used in the experiments are briefly described. Two existing techniques available
in the literature for enhancement of dysarthric speech are implemented. The formant
re-synthesis technique is modified to suit the dysarthric speech databases. To generate
more intelligible speech, an HMM-based synthesiser is developed by adapting it to the
dysarthric person’s voice. The next chapter presents the preliminary analysis and the
proposed technique for enhancing dysarthric speech.
CHAPTER 5
Proposed Modifications to Dysarthric Speech
5.1 Introduction
In the previous chapter, normal speech data was used either to determine a transformation function or to build an HTS system. The aim of both techniques was to bring
dysarthric speech closer to normal speech. In this work too, the objective is to improve
the quality of continuous dysarthric speech by matching dysarthric speech characteristics
closer to that of normal speech. For this purpose, dysarthric speech and normal speech
have to be analysed.
Different types of acoustic analyses have been conducted in the past. The types of
acoustic measures to be analysed largely depend on their application. In [59], based on
acoustic analysis of certain sounds over several months, the progression of dysarthria in
Amyotrophic Lateral Sclerosis (ALS) patients was studied. In [21], acoustic measures
such as syllable duration were analysed at syllabic and intra-syllabic segments across four
different types of neurological dysarthrias. The test materials included a set of isolated
words, isolated vowels and nonsense sentences in German. Reference [1] reported reduced
speech tempo in terms of syllable and utterance durations with a set of 24 structured
sentences in German as test materials. Reference [57] performed some acoustic analysis on
Nemours and TIMIT databases [19,36] to determine the severity and cause of dysarthria.
The authors reported that normalised mean duration for overall phonemes and normalised
speech rate of dysarthric speakers were always greater than those of normal speakers.
The analysis carried out in this work is based on duration, which is an important
acoustic measure. Durational attributes of both structured (Nemours database) and unstructured sentences (IE dataset) are studied. Based on the analysis, manual corrections
are made directly to the dysarthric speech waveforms. A technique to automatically incorporate these modifications is then developed. Analogous to the cross-lingual approach
used in building Indian language TTSes, normal speech characteristics are transplanted
onto dysarthric speech to improve the latter’s quality while preserving the speaker’s voice
characteristics.
Details of the analysis, manual modifications and the proposed automatic technique
for enhancement are presented in this chapter. The performance of the proposed technique is assessed and compared with those of the formant re-synthesis technique and the
HMM-based adaptive synthesiser. Results of the subjective evaluation are presented at
the end of this chapter.
5.2 Durational analysis
Though dysarthria is mostly characterised by slow speech [45, 57], there are studies reporting a rapid rate of speech [2, 14]. In [14], it is observed that dysarthric speakers have
higher speaking rates for interrogative sentences. Hypokinetic dysarthria, which is often associated with Parkinson’s disease, is characterised by bursts of rapid speech [26].
Hence, it is necessary to assess the speech of people with dysarthria in terms of durational
attributes, for the task of speech enhancement.
The corresponding normal speech for the utterances spoken by every dysarthric speaker
is available in the Nemours database. For the Indian dysarthric speaker IE, the speech of
Malayalam speaker “IEm” in the Indic TTS database [4], is considered as the reference
normal speech. Using phonemic labeling of the speech data, dysarthric and normal speech
are compared based on vowel duration, average speech rate and total utterance duration.
In all the subsequent figures, BB-SC refers to speakers in the Nemours database, while
IE is the Indian speaker with dysarthria.
5.2.1 Vowel durations
Figure 5.1: Average vowel durations across dysarthric and normal speakers
The average vowel duration is the average duration of all the vowels pooled together.
The average vowel durations of dysarthric speech in the databases are longer than their
normal speech counterparts [57], as illustrated in Figure 5.1. The average vowel duration
for normal speakers is about 110-150 ms, while it is about 140-320 ms for dysarthric
speakers. Longer average vowel durations are noticed especially for speakers BK and RL.
Average durations of individual vowels are also analysed. The numbers of vowels in
the Nemours database and the IE dataset are 12 and 16, respectively.
Figure 5.2: Duration plot of vowels for (a) dysarthric speech BB and normal speech JPBB, (b) dysarthric speech RL and normal speech JPRL, and (c) Indian dysarthric speech IE and normal speech of other speakers
It is observed that the average duration for most vowels is longer for dysarthric speech compared to normal
speech. For dysarthric speakers BV and RK, the average duration of
the vowel “er” is shorter, as it is hardly articulated. Surprisingly, for the vowel “eh”, the average
duration is longer for normal speech compared to dysarthric speech for most speakers.
We conclude that this may be a characteristic of the normal speaker.
Standard deviations of vowel durations of dysarthric speakers are also observed to
be higher. A larger standard deviation indicates that either the vowel is sustained for
a longer duration or is hardly uttered. In Figure 5.2, the estimated probability density
functions (pdf) of vowel duration for different speakers are plotted. The probability of the
vowel duration being in a specific range is given by the area of the pdf within that range.
It is observed from Figures 5.2(a) and (b) that the degree to which a dysarthric person’s
vowel durations differ from the normal counterpart is speaker-dependent. The average
and standard deviation of vowel durations are closer to normal speech for speaker BB
than those for speaker RL.
Speech data of the Indian dysarthric speaker IE is compared with the speech of different normal speakers. Four different nativities of Indian English (Hindi, Tamil, Telugu
and Malayalam) in the Indic TTS corpus [4], and speech of an American speaker “rms”
from CMU corpus [31] are the normal speech data considered. It is observed that the
duration plot of speaker IE is clearly shifted with respect to that of normal speakers
(Figure 5.2(c)).
5.2.2 Average speech rate
Average speech rate is defined as the average number of phones uttered per second. Figure
5.3 shows the average speech rate for different dysarthric and normal speakers. While for
normal speakers, the average speech rate is about 7.5-9.5 phones per second, it ranges
from 4 to 8.5 phones per second for different dysarthric speakers. Lower speech rates for dysarthric
speakers can be attributed to the fact that the coordination and movement amongst the
articulators involved in speech production are not as smooth and effortless as those for
55
normal speakers. This results in the same sound being sustained for a longer time.
Figure 5.3: Average speech rates across dysarthric and normal speakers
5.2.3 Total utterance duration for text in the datasets
For the same set of sentences spoken by dysarthric and normal speakers, the total utterance duration is longer for the dysarthric speaker (Figure 5.4). The normal speaker in
the Nemours database takes about 2-3 minutes to speak multiple sets of 74 six-worded
sentences. Most dysarthric speakers take about 2.5-5 minutes to speak the same set of
sentences. The longest utterance durations are evident for speakers BK, RL and SC. This
indicates insertions of phones, intra-utterance pauses, etc., while speaking.
Based on the above analysis, if the duration is reduced closer to that of normal
speech, the quality of dysarthric speech may improve. Reference [62] observes that as
phone durations of dysarthric speech increase, the intelligibility of speech in terms of
the Frenchay dysarthria assessment (FDA) score [16] decreases. Taking this observation
forward, in this work, dysarthric speech is modified both manually and automatically to
achieve this durational correction.
Figure 5.4: Total utterance durations across dysarthric and normal speakers
5.3 Manual modifications
For the datasets used in the experiments, it is observed that dysarthric speech has longer
average vowel duration, lower speech rate and longer utterance duration compared to
normal speech. Increasing the speech rate of the entire dysarthric utterance is not useful;
specific corrections need to be made. Each phone segment of the dysarthric speech is
compared with its counterpart in normal speech. The following manual modifications are
made to dysarthric speech:
• removing artifacts in the utterance
• deleting long intra-utterance pauses
• splicing out steady regions of elongated vowels
• removing repetitions of phones
Figure 5.5: Original (top panel) and manually modified (bottom panel) dysarthric speech for the utterance The badge is waking the bad spoken by speaker SC
Modifications are made such that the intelligibility of speech is not degraded. Regions
of the dysarthric utterance are carefully deleted so as not to cause a sudden change in
spectral content. To ensure this, the spectrogram of the waveform is used for visual
representation.
Figure 5.5 shows the word-level labeled waveform of an utterance– The badge is waking
the bad, spoken by dysarthric speaker SC and its manually modified waveform. The first
thing to be observed is the reduction in utterance duration in terms of the number of
samples. Long pauses, labeled as “pau”, are removed from the waveform. It is also
observed that the second pause contains some artifacts, which are spliced out in the
manual process. The elongation of the vowel “ae” in the word badge is also addressed.
For each dysarthric speaker, a set of 12 dysarthric speech waveforms is manually
modified. Informal evaluation of the original and the corresponding manually modified waveforms
indicates an improvement in quality in the latter case. Encouraged by this, an automatic
technique is developed to replace the manual procedure.
5.4 Proposed automatic method
An automatic technique is developed to improve the quality of continuous dysarthric
speech. A dysarthric speech utterance is compared to the same utterance spoken by a
normal speaker at the frame level, and dissimilar regions in the former are spliced out
based on certain criteria. To compare the two utterances, the Dynamic Time Warping (DTW) algorithm is used. The features used to represent an utterance, the DTW
algorithm and the criteria for deletion are discussed in the following subsections.
5.4.1 Features used
To represent speech, or the content of the spoken utterance, mel frequency cepstral coefficients (MFCC) are used. MFCCs capture the speech spectrum while taking human speech
perception into consideration. 13 static MFCCs are extracted from
the audio waveform. To capture the dynamic nature of speech, 13 first-order derivatives
(velocity) and 13 second-order derivatives (acceleration) of the MFCCs are also determined. Hence, each frame of the speech signal is represented by a 39-dimensional feature
vector. Overlapping sliding Hamming windows of 25 ms length and shift of 10 ms are
the frame attributes. Cepstral mean subtraction (CMS) is then performed to compensate
for the effects of speaker variation [64]: the mean of the cepstral coefficients of an utterance is
subtracted from the cepstral coefficients of each frame of that utterance. The mean-subtracted
MFCCs of the dysarthric and normal speech are the features used for comparison.
5.4.2 Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is a dynamic programming algorithm that finds
the optimal alignment between two temporal sequences [47]. It measures the similarity
between two utterance sequences that may vary in timing and pronunciation. As seen
from Figure 5.6, the DTW between two similar signals warps non-linearly along the time
axis, i.e., elongated regions in one signal get warped with smaller regions in the other
signal and vice-versa for compressed regions.
Figure 5.6: Example of dynamic time warping (source: Wikipedia, https://en.wikipedia.org/wiki/Dynamic_time_warping)
Let the two sequences be:
X = [\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_i, \ldots, \vec{x}_m],    (5.1)
Y = [\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_j, \ldots, \vec{y}_n].
⃗xi and ⃗yj are the frame vectors of each utterance, respectively. Distance between
elements ⃗xi and ⃗yj is defined by the Euclidean distance,
dist(\vec{x}_i, \vec{y}_j) = \|\vec{x}_i - \vec{y}_j\| = \sqrt{(x_{i1} - y_{j1})^2 + (x_{i2} - y_{j2})^2 + \cdots + (x_{id} - y_{jd})^2},    (5.2)
where d is the dimensionality of feature vectors, which is 39 in our case. The two sequences
can be arranged in the form of an n × m matrix (or grid) as shown in Figure 5.7. Each
cell (i, j) in the matrix corresponds to the cumulative distance up to elements
⃗xi and ⃗yj. The goal of the DTW is to find an optimal path [(a1 , b1 ), (a2 , b2 ), ..., (aq , bq )]
that minimises the cumulative distance between the two sequences:
DTW(\mathbf{X}, \mathbf{Y}) = \min_{p} \left[ \sum_{k=1}^{N_p} dist(\vec{x}_{a_k}, \vec{y}_{b_k}) \right],    (5.3)
where p is the number of possible paths, and Np is the number of cells in the pth path.
Figure 5.7: DTW grid
To restrict the space of the warping paths, the following constraints are included [47]:
• Monotonicity: The indices of the path should be monotonically ordered with respect
to time: ak−1 ≤ ak and bk−1 ≤ bk .
• Continuity: Only one-step movements of the path are allowed: ak − ak−1 ≤ 1 and
bk − bk−1 ≤ 1.
• Boundary conditions: The first and last vectors of X and Y are aligned to each
other: a1 = b1 = 1 and aq = m, bq = n.
An example of an optimal path is shown in Figure 5.7. Diagonal regions indicate
similarity between the sequences, horizontal regions indicate elongation in sequence X
and vertical regions indicate elongation in sequence Y.
Algorithm 1 Finding the DTW path
Input: Vectors X of length m, Y of length n
Matrices: dist, DTW, minIndex of size (m × n)
DTW calculation:
  Calculate dist(⃗xi, ⃗yj) for all i = 1, ..., m and j = 1, ..., n
  Set DTW(1, 1) = dist(1, 1)
  Set minIndex(i, j) = 0 for all i = 1, ..., m and j = 1, ..., n
  for each i in 2, ..., m do Calculate DTW(i, 1) = dist(i, 1) + DTW(i − 1, 1) end for
  for each j in 2, ..., n do Calculate DTW(1, j) = dist(1, j) + DTW(1, j − 1) end for
  for each i in 2, ..., m do
    for each j in 2, ..., n do
      Calculate DTW(i, j) = dist(i, j) + min[DTW(i − 1, j), DTW(i − 1, j − 1), DTW(i, j − 1)],
        where the three candidates correspond to the directions (←), (↙) and (↓), respectively
      Store the selected direction in minIndex(i, j)
    end for
  end for
Backtracking:
  Backtrack along the matrix minIndex from (m, n) to (1, 1) to get the optimal path
The procedure to obtain the DTW path is given in Algorithm 1. This algorithm is adapted from [47]. The dysarthric
utterance is the test sequence (X) and the normal utterance is the reference sequence
(Y). The DTW algorithm is used to obtain the optimal path between the features
of dysarthric and normal speech utterances. Then the slope of the path is calculated.
Wherever the slope is zero, the corresponding frames are deleted from the waveform.
However, discontinuities are perceived in the resultant waveform. Therefore, certain
criteria are considered before deletion of frames.
5.4.3 Adding frame thresholds
An initial criterion for deletion of speech frames allows some amount of elongation in
the dysarthric utterance: only when the slope of the DTW path is zero for at least a
minimum number of frames, termed frameThres, are the speech frames corresponding to the
horizontal path deleted.
5.4.4 Adding short-term energy criteria
When deleting frames, it is important to ensure that there is no sudden change in energy
at the points of join, i.e., the energies between frames before and after deleted regions.
It is observed that artifacts are introduced in places where the energy difference between
frames at concatenation points is high. The short-term energy (STE, the square of the
L2-norm of a speech frame) difference is therefore considered as an additional criterion for deletion. Whenever
the STE difference is less than a certain limit, STEThres, frames are deleted. Algorithm 2 is
proposed in this work for selecting the frames for deletion.
Algorithm 2 Determining the frames for deletion
Input: Frame numbers corresponding to a deletion region (delFrames) after thresholding with frameThres
Let delFrames = [g_1, g_2, ..., g_k]
Steps:
1. consideredFrames = [g_1, g_2, ..., g_k, g_{k+1}]
2. Calculate STE for all frames in consideredFrames
3. Calculate the absolute STE difference for every pair of frames
4. Identify frame pairs whose STE difference ≤ STEThres
5. From the identified frame pairs, choose the pair for which the maximum number of frames
can be deleted. If the chosen frame pair is (g_i, g_j) with i < j, then the frames to be
deleted are g_{i+1}, ..., g_{j-1}.
The automatic modification method is referred to as the DTW+STE modification
technique. This automatic technique is illustrated in Figure 5.8. In the experiments,
frameThres and STEThres are set to 6 and 0.5, respectively. These thresholds are obtained empirically after testing with frameThres ranging from 4 to 10 and STEThres ranging from 0.3 to 2.5.
Figure 5.8: Flowchart of automatic (DTW+STE) method to modify dysarthric speech
Figure 5.9: DTW paths of an utterance of speaker BK between: (a) original dysarthric speech and normal speech, and (b) modified dysarthric speech and normal speech
Figure 5.10: Original (top panel) and automatically modified (bottom panel) dysarthric speech for the utterance The dive is waning the watt spoken by speaker BK. PI- phone insertions
Experiments are also performed using the normalised STE difference as a threshold; its performance
is on par with that of STEThres.
The DTW paths of a sample utterance of dysarthric speaker BK before and after
automatic modifications compared with respect to the same utterance of normal speaker
JPBK are shown in Figure 5.9. Comparing Figures 5.9a and 5.9b, it can be concluded
that the modified speech is closer to that of normal speech as indicated by the diagonal
DTW path in Figure 5.9b. It also results in a considerable reduction in the number of
frames or duration of the utterance.
An example of the same utterance spoken by dysarthric speaker BK and that modified
using the DTW+STE method is given in Figure 5.10. It is observed that pauses and
insertion of phones (PI) present in the original dysarthric speech are removed in the
65
10
Total utterance duration (min)
9
dysarthric speech
modified dysarthric speech
normal speech
8
7
6
5
4
3
2
1
0
BB
BK
BV
FB
JF
LL
MH
RK
RL
SC
IE
Speakers
Figure 5.11: Total utterance durations across dysarthric speech, modified dysarthric
speech and normal speech
Overall, there is a reduction in the number of samples or duration of
the dysarthric utterance. The total utterance duration of the original and automatically
modified dysarthric speech are given in Figure 5.11. The total utterance duration of the
modified speech is closer to normal speech, except in the case of speakers BB and RK,
where the reduced utterance duration compared to normal speech is an indication of some
amount of elongation present in the normal speech. Moreover, for these two speakers,
dysarthric speech is closer to normal speech in terms of the duration attributes.
The DTW+STE method not only reduces duration, it also removes insertions, long
pauses and elongations that affect the intelligibility of dysarthric speech.
Subjective evaluation is performed to compare the quality of the automatically modified
and original dysarthric speech.
5.5 Performance evaluation
Subjective evaluation is conducted to evaluate the techniques implemented for enhancement of dysarthric speech. A pairwise comparison test is performed to assess the proposed modification techniques and a word error rate test to compare intelligibility across
different methods. Naive listeners are used in the subjective tests rather than expert
listeners to assess how a naive listener, who has little or no interaction with dysarthric
speakers, evaluates the quality of dysarthric speech. Tests are conducted in a noise-free
environment.
5.5.1 Pairwise comparison tests
A pairwise comparison (PC) test is conducted to compare the quality of speech modified
by the proposed techniques and original dysarthric speech [48]. In the “A-B” test, “A” is
played first and then “B”, and vice-versa in the “B-A” test to remove the bias in listening.
“A” is the modified speech and “B” is the original speech in both the tests. Preference
is always calculated in terms of the audio sample played first. The score “A-B+B-A”
gives an overall preference for system “A” against system “B” and is calculated by the
following formula:
“A-B+B-A” = (“A-B” + (100 − “B-A”)) / 2.
About 11 listeners evaluated a set of 8 sentences for each speaker. Results of the evaluation, shown in Figure 5.12, indicate a preference for the modified versions
over the original dysarthric speech in almost all the cases. From Figure 5.12, it is evident that
the manual method outperforms the DTW+STE (automatic) method. This is because
manual modifications are hand-crafted carefully so as to produce better-sounding speech.
Figure 5.12: Preference for manually and automatically (DTW+STE) modified speech over original dysarthric speech of different speakers
For speakers BB and IE, who are mildly dysarthric, the performance of the DTW+STE
method drops drastically due to artifacts introduced in the modified speech. This also holds
for speakers BK and BV, where artifacts in the original speech are not eliminated by
the DTW+STE technique. The drop in performance from the manual to the automatic
technique is quite high for speaker SC because of the slurred nature of the speech. Hence,
in such cases, identifying the specifics of dysarthria for individual speakers is vital to
improving speech quality. Nonetheless, the performance of both methods is almost on
par for speakers JF, LL, FB, MH, RL and RK, who are mildly to severely dysarthric.
Pairwise comparison tests were also conducted between original and formant re-synthesised speech, and between automatically modified and formant re-synthesised speech.
About 10 listeners evaluated a set of 8 sentences for each speaker in each test. The formant re-synthesis technique was clearly not preferred; the competing technique scored
considerably better, at 82%, in both tests.
Figure 5.13: Word error rates for different types of speech across dysarthric speakers
5.5.2 Intelligibility tests
To evaluate intelligibility across different systems, a word error rate (WER) test was
conducted. Feedback from the pairwise comparison tests, together with the fact that the
text in the Nemours database consists of nonsensical sentences, indicated that it is difficult to recognise words in
dysarthric speech. Hence, given the text, listeners were asked to enter the number of words
that were totally unintelligible. Though knowledge of the pronounced word may have
an influence on its recognition, this is a uniform bias that is present when evaluating all
the systems. About 10 listeners participated in the evaluation. The following types of
speech were used in the listening tests:
P (original): original dysarthric speech
Q (DTW+STE): dysarthric speech modified using the DTW+STE method
R (Formant Synth): output speech of the formant re-synthesis technique
S (HTS-in): speech synthesised using the HMM-based adapted TTS for text in the
database not used for training (held-out sentences)
T (HTS-out): speech synthesised using the HMM-based adapted TTS for text from the
web
The results of the WER test are presented in Figure 5.13. It can be seen that the
intelligibility of the formant re-synthesis technique is poor for all the speakers. For the
DTW+STE method, the WER is higher compared to original dysarthric speech for a majority of speakers. The WER of the HMM-based adaptive synthesiser on held-out sentences, i.e.,
sentences not used during training, is high compared to original dysarthric speech in almost all the cases. Intelligibility of sentences synthesised from the web is quite poor
compared to that of held-out sentences for speakers in the Nemours database; the
opposite holds for the Indian dysarthric speaker IE. This is due to the similar structure of held-out sentences and sentences used in training the HMM-based synthesiser in the Nemours
database, unlike the sentences in the Indian dysarthric dataset, which are unstructured.
Overall, none of the techniques improves upon the intelligibility of the original dysarthric speech. However, for
speakers BK, BV and JF, DTW+STE modified speech has the lowest WER. For speaker
RK, the intelligibility of HMM-based adaptive synthesised speech is on par with that of
original dysarthric speech. By informal evaluation, it is observed that some pronunciations of the dysarthric speaker do get corrected in the sentences synthesised using the
HMM-based adaptive TTS system. This indicates that the technique used to increase
intelligibility largely depends on the type and severity of dysarthria.
5.5.3 Analogy to speech synthesis techniques
In the speech synthesis domain, the HMM-based adaptive synthesiser [68] is a statistical
parametric speech synthesiser (SPSS) and the DTW+STE technique is analogous to a
unit selection speech synthesis (USS) system [23]. The synthesised speech of the HMM-based
synthesiser lacks the voice quality of the dysarthric speaker. Similar to the USS system,
the speech output of the DTW+STE method has discontinuities, while preserving the
voice characteristics of the dysarthric speaker. In USS, sub-word units are concatenated
together to produce speech. Units are selected based on certain target and concatenation
costs. In a similar manner, in the DTW+STE method, frames to be concatenated are
selected based on STE difference criteria.
5.6 Summary
The quality of continuous dysarthric speech is improved upon in this work. A durational analysis is performed by comparing dysarthric and normal speech for speakers in the Nemours
database and the Indian English dysarthric speech dataset. Based on the analysis,
dysarthric speech is directly modified manually, and the DTW+STE technique is developed to make modifications automatically. The intelligibility of dysarthric speech
modified using different techniques is studied. Though the DTW+STE technique does
not increase intelligibility for most speakers, the overall perceptual quality of the modified
dysarthric speech is improved. This emphasises the importance of duration in perceptual
speech quality, indicating that this kind of modification may be used as a pre-processing
step for enhancing dysarthric speech.
CHAPTER 6
Conclusion and Future Work
6.1 Summary
The work carried out in this thesis is focused on developing small-footprint speech synthesis technologies for the language-challenged and people with dysarthria. To enable
people who are language-challenged to access Indian language content on digital platforms, text-to-speech synthesisers are developed. To enable people with dysarthria to
communicate more effectively, the quality of dysarthric speech is enhanced. The central
idea in the thesis is the cross-lingual approach in the first task and transplantation of
normal speech characteristics onto dysarthric speech in the second task.
Given that Indian languages are low-resource languages, TTSes are built for multiple
Indian languages in a unified framework. The common label set for a standard representation of phones and the common question set for decision-tree based clustering are
designed as part of preliminary work. To build a TTS system for a new language with
the aid of another language, cross-lingual studies are first carried out. Properties of syllables in continuous speech across languages, from both the speech data and the text, are
studied. Based on the analysis, similarities among languages are determined. Context-independent monophone HMMs are borrowed across similar languages for the purpose
of annotating speech data at the phone level. TTS systems built cross-lingually produce
good quality speech provided that the speech data of the source language is segmented
accurately. An average degradation mean opinion score of 3.59 and word error rate of
5.63% are obtained from subjective evaluation of these speech synthesisers. Moreover,
cross-lingual synthesis across language groups also results in good quality TTSes. Despite
there being a divergence among the language groups, there is a convergence in linguistic
features due to a significant borrowing of sounds (or words) across languages.
To assist people with dysarthria to communicate effectively, dysarthric speech is enhanced. In addition to using the Nemours database for experiments, continuous and
unstructured speech of a native Indian having dysarthria is collected. Normal speech
characteristics are transplanted onto dysarthric speech to improve the latter’s quality
while preserving the speaker’s voice characteristics. A durational analysis is first performed across dysarthric and normal speech. Based on the analysis, manual modifications
are made directly to the dysarthric speech waveforms. The DTW+STE technique is developed to automatically correct dysarthric speech to match the durational attributes of
normal speech. For most dysarthric speakers, this technique outperforms two other techniques from the literature, namely a formant re-synthesis technique and a hidden Markov model (HMM)-based adaptive TTS technique. The dysarthric speech modified using the DTW+STE technique is preferred over the original speech 67.04% of the time.
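The alignment step underlying the DTW+STE technique can be sketched as follows (in Python with NumPy; a plain DTW with hypothetical helper names, not the exact configuration used in this work). The dysarthric and normal utterances are represented as frame-wise feature sequences, and the warping path exposes stretches where several dysarthric frames map onto a single normal frame; these stretches are the candidates for shortening when the durational attributes of normal speech are transplanted.

import numpy as np

def dtw_path(ref, test):
    # Plain DTW between two feature sequences (ref: normal speech frames,
    # test: dysarthric frames), each of shape (num_frames, num_features).
    # Returns the accumulated cost and the warping path as (ref_idx, test_idx) pairs.
    n, m = len(ref), len(test)
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = dist[i, j] + prev
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            steps.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            steps.append((acc[i, j - 1], (i, j - 1)))
        i, j = min(steps)[1]
        path.append((i, j))
    return acc[n - 1, m - 1], path[::-1]

def prolonged_regions(path):
    # Dysarthric frames that all map onto one normal frame are candidates for
    # shortening (inserted or prolonged sounds in the dysarthric utterance).
    mapping = {}
    for ref_idx, test_idx in path:
        mapping.setdefault(ref_idx, []).append(test_idx)
    return {r: t for r, t in mapping.items() if len(t) > 1}

# Toy usage with random 13-dimensional feature frames.
rng = np.random.default_rng(0)
normal = rng.standard_normal((20, 13))
dysarthric = np.repeat(normal, 2, axis=0)   # a crudely "prolonged" version
cost, path = dtw_path(normal, dysarthric)
print(len(prolonged_regions(path)))

The sketch shows only the alignment; in this work, the actual waveform modification at the identified regions is carried out using the STE criterion described in Chapter 5.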
6.2 Criticisms of the thesis and scope of future work
In the cross-lingual studies, the focus is on textual and durational analyses of syllables.
A detailed study of other acoustic properties of syllables across languages, as well as gender-specific studies, could be taken up. Such studies may enable the cross-lingual study of
prosody to produce natural-sounding speech that is pleasing to listeners. The ultimate
goal would be to build a generic TTS across languages and build TTSes for new languages
using small amounts of adaptation data.
While the DTW+STE technique does not need labels or segmented boundaries, it
makes use of a reference (a normal speech utterance) for comparison. Only insertions of
sounds in dysarthric speech are handled; deletions and substitutions of phones are not
addressed. In addition to the analysis of durational attributes, this work can be extended to
analyse other attributes that affect the speech of a person with dysarthria. To build an
efficient and practical device for enhancing dysarthric speech, a person-centric approach
needs to be adopted.
Appendices
APPENDIX A
Common Label Set
[Tables of the common label set, giving a standard representation for the phones of the Indian languages considered, are presented in this appendix.]
APPENDIX B
Common Question Set
Phonetic classifications with respect to the second succeeding phone (LL-) are given here; a short sketch of how such questions are parsed and applied follows the listing. The complete common question set can be downloaded from www.iitm.ac.in/donlab/tts/synthDocs.php.
QS "LL-Vowel" {a^*, ax^*, aa^*, i^*, ii^*, u^*, eu^*, uu^*, e^*, rq^*, ee^*, ei^*,
ai^*, oi^*, o^*, oo^*, ae^*, au^*, ou^*}
QS "LL-Consonant" {k^*, kh^*, g^*, gh^*, ng^*, c^*, ch^*, cx^*, j^*, jh^*, jx^*,
nj^*, tx^*, txh^*, dx^*, dxh^*, nx^*, t^*, th^*, d^*, dh^*, n^*, nd^*, p^*, ph^*,
b^*, bh^*, m^*, y^*, r^*, l^*, lx^*, w^*, sh^*, sx^*, s^*, h^*, kq^*, khq^*, gq^*,
z^*, jhq^*, dxq^*, dxhq^*, dhq^*, f^*, bq^*, yq^*, nq^*, rx^*, sq^*, zh^*, q^*,
hq^*, mq^*}
QS "LL-Stop" {k^*, kh^*, g^*, gh^*, c^*, ch^*, cx^*, j^*, jh^*, jx^*, tx^*, txh^*,
dx^*, dxh^*, t^*, th^*, d^*, dh^*, p^*, ph^*, b^*, bh^*, kq^*, dxq^*, dxhq^*,
dhq^*, bq^*}
QS "LL-Nasal" {ng^*, nj^*, nx^*, n^*, nd^*, m^*, nq^*, q^*, mq^*}
QS "LL-Fricative" {sh^*, sx^*, s^*, h^*, khq^*, gq^*, z^*, jhq^*, f^*, sq^*, hq^*}
QS "LL-Liquid" {y^*, r^*, l^*, lx^*, w^*, yq^*, rx^*, zh^*}
QS "LL-Front" {i^*, ii^*, e^*, rq^*, ee^*, p^*, ph^*, b^*, bh^*, m^*, w^*, f^*, bq^*}
QS "LL-Central" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, nj^*, tx^*, txh^*, dx^*, dxh^*,
nx^*, t^*, th^*, d^*, dh^*, n^*, nd^*, y^*, r^*, l^*, lx^*, sh^*, sx^*, s^*, z^*,
jhq^*, dxq^*, dxhq^*, dhq^*, yq^*, nq^*, rx^*, sq^*, zh^*, q^*, mq^*}
QS "LL-Back" {a^*, ax^*, aa^*, u^*, eu^*, uu^*, o^*, oo^*, k^*, kh^*, g^*, gh^*,
ng^*, h^*, kq^*, khq^*, gq^*, hq^*}
QS "LL-Front_Vowel" {i^*, ii^*, e^*, rq^*, ee^*}
QS "LL-Back_Vowel" {a^*, ax^*, aa^*, u^*, eu^*, uu^*, o^*, oo^*}
QS "LL-Long_Vowel" {aa^*, ii^*, uu^*, ee^*, oo^*}
QS "LL-Short_Vowel" {a^*, ax^*, i^*, u^*, eu^*, e^*, rq^*, o^*}
QS "LL-Dipthong_Vowel" {ax^*, eu^*, ei^*, ai^*, oi^*, ae^*, au^*, ou^*}
QS "LL-High_Vowel" {u^*, eu^*, uu^*, i^*, ii^*}
QS "LL-Medium_Vowel" {ax^*, e^*, rq^*, ee^*, o^*, oo^*}
QS "LL-Low_Vowel" {a^*, aa^*}
QS "LL-Rounded_Vowel" {ax^*, u^*, eu^*, uu^*, oi^*, o^*, oo^*, au^*, ou^*}
QS "LL-Unrounded_Vowel" {a^*, aa^*, i^*, ii^*, e^*, rq^*, ee^*, ei^*, ai^*, ae^*}
QS "LL-IVowel" {i^*, ii^*}
QS "LL-EVowel" {eu^*, e^*, rq^*, ee^*, ei^*}
QS "LL-AVowel" {a^*, ax^*, aa^*, ai^*, ae^*, au^*}
QS "LL-OVowel" {oi^*, o^*, oo^*, ou^*}
QS "LL-UVowel" {u^*, uu^*}
QS "LL-Unvoiced_Consonant" {k^*, kh^*, c^*, ch^*, cx^*, tx^*, txh^*, t^*, th^*,
p^*, ph^*, sh^*, sx^*, s^*, h^*, kq^*, khq^*, f^*, sq^*, hq^*}
QS "LL-Voiced_Consonant" {g^*, gh^*, j^*, jh^*, jx^*, dx^*, dxh^*, d^*, dh^*, b^*,
bh^*, y^*, r^*, l^*, lx^*, w^*, gq^*, z^*, jhq^*, dxq^*, dxhq^*, dhq^*, bq^*, yq^*,
rx^*, zh^*}
QS "LL-Front_Consonant" {p^*, ph^*, b^*, bh^*, m^*, w^*, f^*, bq^*}
QS "LL-Central_Consonant" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, nj^*, tx^*, txh^*,
dx^*, dxh^*, nx^*, t^*, th^*, d^*, dh^*, n^*, nd^*, y^*, r^*, l^*, lx^*, sh^*, sx^*,
s^*, z^*, jhq^*, dxq^*, dxhq^*, dhq^*, yq^*, nq^*, rx^*, sq^*, zh^*, q^*, mq^*}
QS "LL-Back_Consonant" {k^*, kh^*, g^*, gh^*, ng^*, h^*, kq^*, khq^*, gq^*, hq^*}
QS "LL-Fortis_Consonant" {k^*, kh^*, c^*, ch^*, cx^*, tx^*, txh^*, t^*, th^*,
p^*, ph^*}
QS "LL-Lenis_Consonant" {g^*, gh^*, j^*, jh^*, jx^*, dx^*, dxh^*, d^*, dh^*,
b^*, bh^*}
QS "LL-Neigther_F_or_L" {ng^*, nj^*, nx^*, n^*, nd^*, m^*, y^*, r^*, l^*, lx^*, w^*,
sh^*, sx^*, s^*, h^*, kq^*, khq^*, gq^*, z^*, jhq^*, dxq^*, dxhq^*, dhq^*, f^*,
bq^*, yq^*, nq^*, rx^*, sq^*, zh^*, q^*, hq^*, mq^*}
QS "LL-Coronal_Consonant" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, nj^*, tx^*, txh^*, dx^*,
dxh^*, nx^*, t^*, th^*, d^*, dh^*, n^*, nd^*, y^*, r^*, l^*, lx^*, sh^*, sx^*, s^*,
z^*, jhq^*, dxq^*, dxhq^*, dhq^*, yq^*, nq^*, rx^*, sq^*, zh^*, q^*, mq^*}
QS "LL-Non_Coronal" {k^*, kh^*, g^*, gh^*, ng^*, p^*, ph^*, b^*, bh^*, m^*, w^*, h^*,
kq^*, khq^*, gq^*, f^*, bq^*, hq^*}
QS "LL-Anterior_Consonant" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, t^*, th^*, d^*, dh^*,
n^*, nd^*, p^*, ph^*, b^*, bh^*, m^*, r^*, l^*, w^*, sh^*, sx^*, s^*, z^*, jhq^*,
dhq^*, f^*, bq^*, nq^*, rx^*, sq^*, q^*, mq^*}
QS "LL-Non_Anterior" {k^*, kh^*, g^*, gh^*, ng^*, nj^*, tx^*, txh^*, dx^*, dxh^*,
nx^*, y^*, lx^*, h^*, kq^*, khq^*, gq^*, dxq^*, dxhq^*, yq^*, zh^*, hq^*}
QS "LL-Continuent" {a^*, ax^*, aa^*, i^*, ii^*, u^*, eu^*, uu^*, e^*, rq^*, ee^*,
ei^*, ai^*, oi^*, o^*, oo^*, ae^*, au^*, ou^*, y^*, r^*, l^*, lx^*, w^*, sh^*, sx^*,
s^*, h^*, z^*, jhq^*, f^*, yq^*, rx^*, sq^*, hq^*}
QS "LL-No_Continuent" {k^*, kh^*, g^*, gh^*, ng^*, c^*, ch^*, cx^*, j^*, jh^*, jx^*,
nj^*, tx^*, txh^*, dx^*, dxh^*, nx^*, t^*, th^*, d^*, dh^*, n^*, nd^*, p^*, ph^*,
b^*, bh^*, m^*, kq^*, khq^*, gq^*, dxq^*, dxhq^*, dhq^*, bq^*, nq^*, zh^*, q^*, mq^*}
QS "LL-Positive_Strident" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, sh^*, sx^*, s^*, jhq^*,
z^*, f^*, sq^*}
QS "LL-Negative_Strident" {h^*, khq^*, gq^*, hq^*}
QS "LL-Neutral_Strident" {k^*, kh^*, g^*, gh^*, ng^*, nj^*, tx^*, txh^*, dx^*, dxh^*,
nx^*, t^*, th^*, d^*, dh^*, n^*, nx^*, p^*, ph^*, b^*, bh^*, m^*, y^*, r^*, l^*, lx^*,
w^*, kq^*, dxq^*, dxhq^*, dhq^*, bq^*, yq^*, nq^*, rx^*, zh^*, q^*, mq^*}
QS "LL-Glide" {y^*, w^*, yq^*, zh^*}
QS "LL-Voiced_Stop" {g^*, gh^*, j^*, jh^*, jx^*, dx^*, dxh^*, d^*, dh^*, b^*, bh^*,
dxq^*, dxhq^*, dhq^*, bq^*}
QS "LL-Unvoiced_Stop" {k^*, kh^*, c^*, ch^*, cx^*, tx^*, txh^*, t^*, th^*, p^*,
ph^*, kq^*}
QS "LL-Front_Stop" {p^*, ph^*, b^*, bh^*, bq^*}
QS "LL-Central_Stop" {c^*, ch^*, cx^*, j^*, jh^*, jx^*, tx^*, txh^*, dx^*, dxh^*,
t^*, th^*, d^*, dh^*, dxq^*, dxhq^*, dhq^*}
QS "LL-Back_Stop" {k^*, kh^*, g^*, gh^*, kq^*}
QS "LL-Voiced_Fricative" {gq^*, z^*, jhq^*}
QS "LL-Unvoiced_Fricative" {sh^*, sx^*, s^*, h^*, khq^*, f^*, sq^*, hq^*}
QS "LL-Central_Fricative" {sh^*, sx^*, s^*, z^*, jhq^*, sq^*}
QS "LL-Back_Fricative" {h^*, khq^*, gq^*, hq^*}
QS "LL-Affricate_Consonant" {c^*, ch^*, cx^*, j^*, jh^*, jx^*}
QS "LL-Not_Affricate" {sh^*, sx^*, s^*, h^*, khq^*, gq^*, z^*, jhq^*, f^*, sq^*, hq^*}
REFERENCES
[1] Hermann Ackermann and Ingo Hertrich. Speech rate and rhythm in cerebellar dysarthria:
An acoustic analysis of syllabic timing. Folia Phoniatrica et Logopaedica, 46(2):70–78,
1994.
[2] American Speech Language Hearing Association. Dysarthria. www.asha.org/public/
speech/disorders/dysarthria/. [last accessed 22-2-2017].
[3] S. Aswin Shanmugam and Hema A Murthy. Group delay based phone segmentation for
HTS. In National Conference on Communications (NCC), pages 1–6, Kanpur, India, February 2014.
[4] Arun Baby, Anju Leela Thomas, Nishanthi N L, and Hema A Murthy. Resources for Indian
languages. In Community-based Building of Language Resources (International Conference
on Text, Speech and Dialogue), pages 37–43, Brno, Czech Republic, September 2016.
[5] R. Bayeh, S. Lin, G. Chollet, and C. Mokbel. Towards multilingual speech recognition
using data driven source/target acoustical units association. In International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), pages 521–524, Quebec, Canada,
May 2004.
[6] C Benoit, M Grice, and V Hazan. The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication, 18(4):381–392, 1996.
[7] P. Beyerlein, W. Byrne, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek,
J. Picone, and W. Wang. Towards language independent acoustic modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1029–1032,
Istanbul, Turkey, June 2000.
[8] Peri Bhaskararao. Salient phonetic features of Indian languages in speech technology.
Sadhana, 36(5):587–599, 2011.
[9] Alan W Black and Tanja Schultz. Speaker clustering for multilingual synthesis. In ISCA
Tutorial Research Workshop (ITRW) on Multilingual Speech and Language Processing,
Stellenbosch, South Africa, April 2006.
[10] Alan W Black, Paul Taylor, and Richard Caley. The festival speech synthesis system: System documentation. www.festvox.org/docs/manual-2.4.0/festival_toc.html. [last
accessed 22-2-2017].
[11] Carnegie Mellon University. The CMU pronouncing dictionary. www.speech.cs.cmu.edu/cgi-bin/cmudict. [last accessed 22-2-2017].
[12] J.R. Deller, D. Hsu, and L.J. Ferrier. On the use of hidden Markov modelling for recognition
of dysarthric speech. Computer Methods and Programs in Biomedicine, 35(2):125–139,
1991.
[13] M. Dhanalakshmi and P. Vijayalakshmi. Intelligibility modification of dysarthric speech
using HMM-based adaptive synthesis system. In 2nd International Conference on Biomedical Engineering (ICoBE), pages 1–5, Penang, Malaysia, March 2015.
[14] Guylaine Le Dorze, Lisa Ouellet, and John Ryalls. Intonation and speech rate in dysarthric
speech. Journal of Communication Disorders, 27(1):1 – 18, 1994.
[15] M. B. Emeneau. India as a linguistic area. Language, Linguistic Society of America,
32(1):3–16, 1956.
[16] P. Enderby. Frenchay Dysarthria Assessment. International Journal of Language & Communication Disorders, 15(3):165–173, December 2010.
[17] P. Eswar. A rule based approach for spotting characters from continuous speech in Indian
languages. PhD dissertation, Indian Institute of Technology, Department of Computer
Science and Engg., Madras, India, 1991.
[18] Hemant A. Patil et al. A syllable-based framework for unit selection synthesis in 13 Indian
languages. In Oriental COCOSDA held jointly with Asian Spoken Language Research and
Evaluation (O-COCOSDA/CASLRE), pages 1–8, Delhi, India, November 2013.
[19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren.
DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.
[20] Steven Greenberg. Understanding speech understanding: Towards a unified theory of
speech perception. In Proceedings of the ESCA Workshop, pages 1–8, Keele, United Kingdom, 1996.
[21] I. Hertrich and H. Ackermann. Acoustic analysis of durational speech parameters in neurological dysarthrias. In From the Brain to the Mouth, volume 12 of Neuropsychology and
Cognition, pages 11–47. Springer Netherlands, 1997.
[22] J. P. Hosom, A. B. Kain, T. Mishra, J. P. H. van Santen, M. Fried-Oken, and J. Staehely.
Intelligibility of modifications to dysarthric speech. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 924 – 927, Hong Kong, China, April
2003.
[23] Andrew J. Hunt and Alan W. Black. Unit selection in a concatenative speech synthesis
system using a large speech database. In International Conference on Acoustics and Speech
Signal Processing (ICASSP), pages 373–376, Atlanta, USA, May 1996.
[24] B. D. Jayaram and M. N. Vidya. Zipf’s Law for Indian Languages. Journal of Quantitative
Linguistics, 15(4):293–317, 2008.
[25] G. Jayaram and K. Abdelhamied. Experiments in dysarthric speech recognition using
artificial neural networks. Rehabilitation Research and Development, 32(2):162–169, May
1995.
[26] A.M. Johnson and S.G. Adams. Nonpharmacological management of hypokinetic dysarthria in Parkinson’s disease. Journal of Geriatrics and Aging, 9(1):40–43, 2006.
[27] Alexander Kain, Xiaochuan Niu, John-Paul Hosom, Qi Miao, and Jan P. H. van Santen.
Formant re-synthesis of dysarthric speech. In Fifth ISCA ITRW on Speech Synthesis, pages
25–30, Pittsburgh, USA, June 2004.
[28] Alexander B. Kain, John-Paul Hosom, Xiaochuan Niu, Jan P.H. van Santen, Melanie
Fried-Oken, and Janice Staehely. Improving the intelligibility of dysarthric speech. Speech
Communication, 49(9):743 – 759, 2007.
[29] Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman, Jon Gunderson, Thomas S.
Huang, Kenneth Watkin, and Simone Frame. Dysarthric speech database for universal
access research. In Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 1741–1744, Brisbane, Australia, September 2008.
[30] S P Kishore and Alan W Black. Unit size in unit selection speech synthesis. In European
Conference on Speech Communication and Technology (EUROSPEECH)-INTERSPEECH,
pages 1317–1320, Geneva, Switzerland, September 2003.
[31] John Kominek and Alan W Black. The CMU arctic speech databases. In 5th ISCA Speech
Synthesis Workshop, pages 223–224, June 2004.
[32] J. Latorre, K. Iwano, and S. Furui. Polyglot synthesis using a mixture of monolingual corpora. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
pages 1–4, Philadelphia, USA, March 2005.
[33] Javier Latorre, Koji Iwano, and Sadaoki Furui. New approach to polyglot synthesis: how
to speak any language with anyone’s voice. In ISCA Tutorial Research Workshop (ITRW)
on Multilingual Speech and Language Processing, Stellenbosch, South Africa, April 2006.
[34] V B Le, L Besacier, and T Schultz. Acoustic-phonetic unit similarities for context dependent acoustic model portability. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1101–1104, Toulouse, France, May 2006.
[35] Viet Bac Le and L. Besacier. First steps in fast acoustic modeling for a new target language:
Application to Vietnamese. In International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 821–824, Philadelphia, USA, March 2005.
[36] X. Menendez-Pidal, J.B. Polikoff, S.M. Peters, J.E. Leonzio, and H.T. Bunnell. The
Nemours database of dysarthric speech. In Fourth International Conference on Spoken
Language (ICSLP), pages 1962–1965, Philadelphia, USA, October 1996.
[37] Ministry of Home Affairs, Government of India. Census of India, 2011.
[38] Telecom Regulatory Authority of India. TRAI Press Release No. 49/2016, June 2016.
[39] Project funded by the European Community’s Seventh Framework Programme (FP7/2007-2013). Simple4All. www.simple4all.org. [last accessed 22-2-2017].
[40] Lawrence Rabiner and Ronald Schafer. Theory and Applications of Digital Speech Processing. Prentice Hall Press, Upper Saddle River, NJ, USA, 1st edition, 2010.
[41] B.B. Rajapurohit and Central Institute of Indian Languages. Acoustic Studies in Indian
Languages: Research Papers Prepared at the Summer Institute in Advanced Phonetics,
1984. Central Institute of Indian Languages, 1986.
[42] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker verification using
adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, January 2000.
[43] Frank Rudzicz. Acoustic transformations to improve the intelligibility of dysarthric speech.
In 2nd Workshop on Speech and Language Processing for Assistive Technologies (SLPAT),
pages 11–21, Edinburgh, UK, July 2011.
[44] Frank Rudzicz. Adjusting dysarthric speech signals to be more intelligible. Computer
Speech & Language, 27(6):1163–1177, 2013.
[45] Frank Rudzicz, Aravind Kumar Namasivayam, and Talya Wolff. The TORGO database of
acoustic and articulatory speech from speakers with dysarthria. Language Resources and
Evaluation, 46(4):523–541, 2012.
[46] S Rupak Vignesh, Aswin Shanmugam, and Hema A. Murthy. Significance of pseudo-syllables in building better acoustic models for Indian English TTS. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5620–5624, Shanghai, China, 2016.
[47] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing,
26(1):43–49, 1978.
[48] P. Salza, E. Foti, L. Nebbia, and M. Oreglia. MOS and pair comparison combined methods
for quality evaluation of text to speech systems. In Acta Acustica, volume 82, pages 650–
656, 1996.
[49] M. Saranya, P. Vijayalakshmi, and N. Thangavelu. Improving the intelligibility of
dysarthric speech by modifying system parameters, retaining speaker’s identity. In International Conference on Recent Trends In Information Technology (ICRTIT), pages
60–65, Chennai, India, April 2012.
[50] Tanja Schultz, Martin Westphal, and Alex Waibel. The GlobalPhone project: Multilingual LVCSR with JANUS-3. In Multilingual Information Retrieval Dialogs: 2nd SQEL Workshop, pages 20–27, Plzen, Czech Republic, 1997.
pages 20–27, Plzen, Czech Republic, 1997.
[51] Debapriya Sengupta and Goutam Saha. Study on similarity among Indian languages using
language verification framework. Advances in Artificial Intelligence, 2015:1–24, 2015.
[52] S Aswin Shanmugam and Hema Murthy. A hybrid approach to segmentation of speech
using group delay processing and HMM based embedded reestimation. In Annual Conference of the International Speech Communication Association, INTERSPEECH, pages
1648–1652, Singapore, September 2014.
[53] A K Singh. A computational phonetic model for Indian language scripts. In Constraints
on Spelling Changes: 5th International Workshop on Writing Systems, Gelderland, Netherlands, October 2006.
[54] S. Sinha, S. S. Agrawal, and A. Jain. Dialectal influences on acoustic duration of Hindi
phonemes. In Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken
Language Research and Evaluation (O-COCOSDA/CASLRE), pages 1–5, India, November
2013.
[55] Kåre Sjölander and Jonas Beskow. Wavesurfer - an open source speech tool. In Annual Conference of the International Speech Communication Association, INTERSPEECH, pages
464–467, Beijing, China, October 2000. ISCA.
[56] J J Sooful and J C Botha. An acoustic distance measure for automatic cross-language
phoneme mapping. In Pattern Recognition Association of South Africa (PRASA’01), pages
99–102, November 2001.
[57] V Surabhi, P Vijayalakshmi, T S Lily, and Ra V Jayanthan. Assessment of laryngeal
dysfunctions of dysarthric speakers. In IEEE Engineering in Medicine and Biology Society,
pages 2908–2911, Minnesota, USA, September 2009.
[58] Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, and Satoshi Imai. Mel-generalized
cepstral analysis-a unified approach to speech spectral estimation. In International Conference on Spoken Language Processing, pages 1043–1046, Yokohama, Japan, September
1994.
[59] Barbara Tomik, Jerzy Krupinski, Lidia Glodzik-Sobanska, Maria Bala-Slodowska, Wieslaw
Wszolek, Monika Kusiak, and Anna Lechwacka. Acoustic analysis of dysarthria profile in
ALS patients. Journal of the Neurological Sciences, 169(1–2):35 – 42, 1999.
[60] S. S. Vel, D. M. N. Mubarak, and S. Aji. A study on vowel duration in Tamil: Instrumental
approach. In IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC), pages 1–4, India, December 2015.
[61] Venugopalakrishna Y. R., Vinodh M. V., Hema A. Murthy, and C. S. Ramalingam. Methods for improving the quality of syllable based speech synthesis. In Spoken Language Technology (SLT) workshop, pages 29–32, Goa, India, December 2008.
[62] P. Vijayalakshmi and M.R. Reddy. Assessment of dysarthric speech and analysis on
velopharyngeal incompetence. In 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), pages 3759–3762, New York, USA,
August 2006.
[63] M Viswanathan and M Viswanathan. Measuring speech quality for text-to-speech systems:
development and assessment of a modified mean opinion score (MOS) scale. Computer Speech and Language, 19(1):55–83, 2005.
[64] Martin Westphal. The use of cepstral means in conversational speech recognition. In
European Conference on Speech Communication and Technology (EUROSPEECH), pages
1143–1146, Rhodes, Greece, September 1997.
[65] Wikipedia. Agglutinative language. https://en.wikipedia.org/wiki/Agglutinative_
language. [last accessed 22-2-2017].
[66] Wikipedia. Arpabet. https://en.wikipedia.org/wiki/Arpabet. [last accessed 22-2-2017].
[67] M. S. Yakcoub, S. A. Selouani, and D. O’Shaughnessy. Speech assistive technology to
improve the interaction of dysarthric speakers with machines. In 3rd International Symposium on Communications, Control and Signal Processing (ISCCSP), pages 1150–1154,
Malta, March 2008.
[68] J. Yamagishi, T. Nose, H. Zen, Z. H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio,
Speech, and Language Processing, 17(6):1208–1230, August 2009.
[69] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell,
D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version
3.4). Cambridge University Engineering Department, 2002.
[70] H Zen, K Tokuda, and A W Black. Statistical parametric speech synthesis. Speech Communication, 51(3):1039–1064, November 2009.
LIST OF PAPERS BASED ON THESIS
1. B. Ramani, S. Lilly Christina, G. Anushiya Rachel, V. Sherlin Solomi, Mahesh Kumar Nandwana, Anusha Prakash, Aswin Shanmugam S., Raghava Krishnan, S. P.
Kishore, K. Samudravijaya, P. Vijayalakshmi, T. Nagarajan and Hema A. Murthy,
“A Common Attribute based Unified HTS framework for Speech Synthesis in Indian Languages”, in Speech Synthesis Workshop (SSW8), pages 311-316, Barcelona,
Spain, August 2013.
2. Anusha Prakash, M. Ramasubba Reddy, Nagarajan T. and Hema A. Murthy, “An
Approach to Building Language-Independent Text-to-Speech Synthesis for Indian
Languages”, in National Conference on Communications (NCC), pages 1-5, Kanpur, India, February 2014.
3. Anusha Prakash, Jeena J. Prakash and Hema A. Murthy, “Acoustic Analysis of
Syllables across Indian Languages”, Annual Conference of the International Speech
Communication Association, INTERSPEECH, pages 327-331, San Francisco, USA,
September 2016.
4. Anusha Prakash, M. Ramasubba Reddy and Hema A. Murthy, “Improvement of
Continuous Dysarthric Speech Quality”, in Speech and Language Processing for
Assistive Technology (SLPAT), pages 43-49, San Francisco, USA, September 2016.
OTHER PUBLICATIONS
1. Abhijit Pradhan, S. Aswin Shanmugham, Anusha Prakash, V. Kamakoti and Hema
A. Murthy, “A syllable-based statistical text to speech system”, in European Signal
Processing Conference (EUSIPCO), pages 1-5, Marrakech, Morocco, September
2013.
2. Raghava Krishnan K., S. Aswin Shanmugam, Anusha Prakash, Kasthuri G. R. and
Hema A. Murthy, “IIT Madras’s Submission to the Blizzard Challenge 2014”, in
Blizzard Challenge, Singapore, September 2014.
3. Abhijit Pradhan, Anusha Prakash, S. Aswin Shanmugam, G. R. Kasthuri, Raghava
Krishnan and Hema A. Murthy, “Building Speech Synthesis Systems for Indian
Languages”, Invited paper in National Conference on Communications (NCC),
pages 1-6, Bombay, India, February 2015.
4. Anusha Prakash, Arun Baby, Aswin Shanmugam S., Jeena J. Prakash, Nishanthi
N. L., Raghava Krishnan K., Rupak Vignesh Swaminathan and Hema A. Murthy,
“Blizzard Challenge 2015 : Submission by DONLab, IIT Madras”, in Blizzard
Challenge, Berlin, Germany, September 2015.