
A SEGMENTED SYLLABLE-BASED
ISOLATED WORD RECOGNIZER FOR
INDIAN LANGUAGES
A THESIS
submitted by
P. G. DEIVAPALAN
for the award of the degree
of
MASTER OF SCIENCE
(by Research)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
JUNE 2008
THESIS CERTIFICATE
This is to certify that the thesis entitled A segmented syllable-based isolated
word recognizer for Indian languages submitted by P. G. Deivapalan to
the Indian Institute of Technology Madras, for the award of the degree of Master
of Science (by research), is a bonafide record of the research carried out by
him under my supervision and guidance. The contents of this thesis, in full or in
parts, have not been submitted to any other Institute or University for the award
of any degree or diploma.
Place: Chennai
Date:
Dr. Hema A. Murthy
ABSTRACT
KEYWORDS:
Segmented syllable-based approach for large vocabulary IWR;
Group delay based segmentation; Delta entropy; Hierarchical
HMM-based recognition; Automatic labeling tool
Small vocabulary isolated word recognition (IWR) systems are the most successful speech recognition systems. As the vocabulary size grows, a word-based approach (as used for small vocabularies) makes the IWR system more complex, resulting in longer response times and degraded accuracy. To overcome this, a subword-based (syllable) approach is used in large vocabulary IWR systems. The main disadvantage of the subword-based approach is the word insertion problem (a long word utterance is recognized as two shorter words), which is caused by misalignment of subword boundaries.
This work proposes a segmented syllable-based approach that avoids the word insertion problem by providing subword boundary information to the recognizer. Group delay based segmentation is used to segment the word utterance into syllable-like units. In this work, for the first time, group delay based segmentation has been tried under both clean and noisy speech conditions for an Indian English database. The proposed approach requires very high recognition accuracy on the syllable segments to achieve good word accuracy, but the presence of confusable syllables degrades this accuracy. To improve the discriminability among syllable models, a delta entropy based technique is used to identify acoustically similar syllable models, which are combined to form merged models. The
recognition of syllable segments can be performed in two ways:
1. Single pass recognition - It uses only the leaf nodes of the model tree (i.e., individual syllable models are used in the recognition of the syllable segments).
2. Multi pass recognition - It uses nodes at different levels of the model tree (i.e., both individual and merged syllable models are used in the recognition of the syllable segments). Each syllable segment undergoes multiple passes, and in each pass a different set of syllable models is used for recognition, governed by the hierarchy of the model tree.
This produces the recognized syllable sequence, which is then matched, using a simple Edit distance based method, against a dictionary that lists the words in the vocabulary and their corresponding syllable sequences. The syllable sequence with the lowest score is selected, and the word corresponding to it is taken as the recognized word. If the lowest score is above a threshold (Threshold-oov), the word utterance is considered an OOV word and the syllable sequence itself is output. The proposed system is evaluated on three isolated-word databases of very small, small, and medium sized vocabularies. Experimental results show that the proposed approach performs better than the conventional methods. In addition, an automatic labeling tool for Indian languages has been developed and used to annotate the training corpus.
TABLE OF CONTENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

1 Overview of the thesis
  1.1 Introduction
  1.2 Scope of the thesis
  1.3 Organization of thesis
  1.4 Major contributions of the thesis

2 Isolated word recognition system
  2.1 Introduction
  2.2 Databases used in the study
  2.3 Medium sized vocabulary IWR systems
    2.3.1 A word-based approach
    2.3.2 Existing word-based approaches to reduce the response time
    2.3.3 Experiments based on word duration and syllable count
    2.3.4 Disadvantages of the word-based approach
    2.3.5 A subword-based approach
    2.3.6 Advantages of subword-based approaches
    2.3.7 Baseline accuracy
  2.4 Disadvantages of current subword-based approaches
  2.5 Need for a segmented subword-based approach
  2.6 Summary

3 Segmented syllable-based IWR (SSWR) system
  3.1 Introduction
  3.2 SSWR system description
    3.2.1 Training phase of the SSWR system
    3.2.2 Testing phase of the SSWR system
  3.3 Segmentation unit - a group delay based segmentation
    3.3.1 Effect of segmentation on different words and speakers
    3.3.2 Effect of segmentation on noisy speech signals
    3.3.3 Results and discussion
  3.4 Recognition unit - decoding of the syllable segments
    3.4.1 Isolated style recognition
    3.4.2 Merging of the syllable models
    3.4.3 Hierarchical recognition of the syllable segments
  3.5 Matching unit - Edit distance based matching
  3.6 Results and Discussion
  3.7 Summary

4 Out-Of-Vocabulary (OOV) words
  4.1 Introduction
  4.2 Review of approaches to OOV word detection
    4.2.1 Acoustic modeling
    4.2.2 Language modeling
    4.2.3 Confidence measures
  4.3 Handling of the OOV words in the SSWR system
    4.3.1 Handling of Type-I OOV words
    4.3.2 Handling of Type-II OOV words
  4.4 Summary

5 Conclusion
  5.1 Summary
  5.2 Major contributions of the thesis
  5.3 Criticisms of the work

Appendix A
  A.1 Baum-Welch algorithm
  A.2 Viterbi decoding
  A.3 A discriminative training algorithm

Appendix B
  B.1 Syllabification algorithm
  B.2 Group delay based segmentation algorithm
  B.3 DONLabel: an automatic labeling tool for Indian languages

REFERENCES

LIST OF PUBLICATIONS
LIST OF TABLES

2.1 Performance of word-based IWR systems.
2.2 Performance of word duration and syllable count based methods. Baseline accuracy is 97% and baseline response time is 1.103 CPU sec.
2.3 Baseline accuracy of word-based and syllable-based IWR systems. (LM denotes language models.)
3.1 Performance of the group delay based segmentation algorithm on the test datasets.
3.2 Performance of the group delay based segmentation algorithm on noisy (20 dB SNR) medium sized vocabulary test datasets.
3.3 Syllable recognition accuracy of the SSWR system.
3.4 A partial list of confusable pairs (out of 200) occurring more than once in the train dataset.
3.5 Merging of the syllable models.
3.6 Entropy of HMMs.
3.7 Delta entropy of HMMs.
3.8 The structure of a node in the model tree.
3.9 Illustration of Edit distance computation between two strings, mothers and brother.
3.10 Edit distance based matching between kunamatai and kunamataitha.
3.11 Edit distance based matching between kunamatai and kanamana.
3.12 Modified Edit distance based matching between kunamatai and kunamataitha.
3.13 Modified Edit distance based matching between kunamatai and kanamana.
3.14 Performance of the SSWR system on test dataset.
3.15 Performance of the word-based and syllable-based IWR systems on test dataset.
3.16 Performance of the SSWR system on medium vocabulary noisy test datasets.
4.1 A partial list of OOV words (out of 1000 examples) and their syllable sequence.
B.1 Illustration of syllabification of a Tamil sentence.
LIST OF FIGURES

2.1 A typical isolated word recognition system.
2.2 Functional block diagram of a word-based IWR system.
2.3 Functional block diagram of a subword-based IWR system.
2.4 Training of subword models in a syllable-based CSR system for the word athiradi.
2.5 Testing process (Viterbi search) in a subword-based IWR system.
2.6 An example showing different alignments of subword models in recognizing the word thodanggukirathu.
2.7 An example showing explicit segmentation of the test utterance of the word thodanggukiradhu.
3.1 Training of subword models in conventional syllable-based system for the word athiradi.
3.2 An example that shows the alignment of subword models to recognize the word athiradi.
3.3 Training of syllable models in segmented subword-based approach for the word athiradi.
3.4 An example showing explicit segmentation of the test utterance of the word athiradi.
3.5 Testing process in the segmented syllable-based IWR system for the test utterance of the word athiradi.
3.6 Block diagram of the segmented syllable-based word recognizer.
3.7 A screenshot of labeling tool.
3.8 Functional block diagram of the testing part of the SSWR system.
3.9 Functional block diagram of the group delay based segmentation.
3.10 Group delay segmentation of a word utterance apara.
3.11 Case (A): Segmentation of two Tamil words with different durations uttered by the same speaker.
3.12 Case (B): Segmentation of Tamil word pAkisthAnudan with different durations uttered by different speakers.
3.13 Segmentation of word utterances corrupted with white noise, resulting in 15 dB and 10 dB SNR.
3.14 Segmentation of word utterances corrupted with white noise, resulting in 0 dB and -5 dB SNR.
3.15 Segmentation of word utterances corrupted with babble noise, resulting in 15 dB and 10 dB SNR.
3.16 Segmentation of word utterances corrupted with babble noise, resulting in 0 dB and -5 dB SNR.
3.17 Segmentation of word utterances corrupted with pink noise, resulting in 15 dB and 10 dB SNR.
3.18 Segmentation of word utterances corrupted with pink noise, resulting in 0 dB and -5 dB SNR.
3.19 Isolated style recognition of syllable segments.
3.20 Structure of a hierarchical syllable model tree.
3.21 Illustration of a hierarchical recognition of a syllable segment /pA/.
3.22 Edit distance based word searching method.
4.1 Augmenting OOV words into the vocabulary using a generic word model.
B.1 Functional block diagram of DONLabel.
B.2 A screenshot of labeling tool.
B.3 A sample lab file represented using standard format.
CHAPTER 1
Overview of the thesis
1.1 Introduction
The task of an isolated word recognition (IWR) system is to identify a spoken
word from a fixed vocabulary. IWR systems have been widely used in applications such as airline reservations, information and data retrieval systems [1].
These applications require at least a medium sized vocabulary of about a few thousand words. In general, words in the vocabulary can be represented in two ways: (a) as whole word units, i.e., a separate model for each word, and (b) as subword units, assuming that a word is made up of a sequence of subwords
as subword units, assuming that, a word is made up of a sequence of subwords
like phonemes or syllables. The choice of representative unit depends upon the
size of the vocabulary, i.e., a whole word unit is the natural choice for very small
vocabulary systems, whereas a subword unit is a better choice for medium and
large vocabulary systems.
Commercially available IWR systems use hidden Markov model (HMM) based
techniques to achieve speaker-independent speech recognition. Two phases in
HMM-based systems are the training phase and the testing phase. The training
phase involves building a HMM for each representative unit in the fixed vocabulary,
whereas the testing phase (recognition) involves likelihood computation against all
the previously built HMMs. The disadvantages of using the whole word models for
a medium sized vocabulary system are as follows: (a) a large amount of training data is required to build all the word models, (b) after the system is built, adding new words to the vocabulary requires additional effort in speech data collection, and words that are not part of the vocabulary (out-of-vocabulary or OOV words) cannot be handled, and (c) the recognition time is long, as the total number of word HMMs is large. These issues can be overcome by using subword models.
The problem of larger training data size can be avoided as different words share
common subword segments. New words can be added to the vocabulary without
building new models, as long as the trained subword models corresponding to the
words are already available. The recognition time is comparatively short, as there are far fewer subword models than whole word models. The objective of this thesis is to build a medium sized vocabulary IWR
system for Indian languages using subword models.
In a conventional subword-based word recognition system, the training process
involves building a HMM for each subword unit. During testing, the given word
utterance is recognized by aligning the subword models. The main disadvantage
of the subword-based approach is the word insertion problem (i.e., a single long
duration word utterance is recognized as two short words) which is due to the
misalignment of subword models. This problem can be overcome by providing the segment boundaries. Hence, an explicitly segmented subword-based approach
is necessary for medium and large vocabulary systems.
In speech recognition systems, phonemes and syllables are the most popular
subword units used in recognition [2] [3]. A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning. A syllable is a sequence of speech sounds built around a vowel (the nucleus), with or without surrounding consonants. Locating phoneme boundaries in a speech signal is quite
difficult and is always prone to errors. Most Indian languages are syllable-centric
in nature and algorithms exist for automatic segmentation of speech signals into
syllable-like units [4]. Hence syllable has been chosen as the representative unit
for recognition.
This thesis attempts to develop a medium sized vocabulary IWR system for
Indian languages using a segmented subword-based approach. In this approach, the test utterance is explicitly segmented, which reduces the insertion problem. This IWR system is called the segmented syllable-based word recognition (SSWR) system, as it uses the boundary information of the syllable segments for recognition.
Segmented syllable-based IWR system
During the training phase of the SSWR system, different word examples are segmented into syllable-like units using group delay based segmentation [5]. These
syllable segments are then annotated with the syllabified transcription text corresponding to the word utterances. To automate and speed up the training process,
an automatic segmentation and labeling tool known as DONLabel has been developed. The syllable examples are then used to build individual syllable models
using the Baum-Welch algorithm [6].
During the testing phase of the SSWR system, the word utterance is first
segmented into syllable segments using group delay based segmentation [5]. The
SSWR system requires a high syllable recognition accuracy to enable good word
recognition accuracy. But the presence of confusable syllables (similar sounding
syllables) degrades the accuracy, and hence the corresponding syllable models have
to be merged to achieve better accuracy. A delta entropy [7] based similarity
measure is used to determine acoustically similar syllable models. The merged
syllable models and the individual syllable models are represented in the form of
a tree known as the model tree. The internal nodes of the model tree represent the
merged models, whereas the leaf nodes of the model tree represent the individual
models. The recognition of the syllable segments can be performed in any one of
two ways:
1. Single pass recognition - It uses only the leaf nodes of the model tree (i.e., individual syllable models are used in the recognition of the syllable segments).
2. Multi pass recognition - It uses nodes at different levels of the model tree (i.e., both individual and merged syllable models are used in the recognition of the syllable segments). Each syllable segment undergoes multiple passes, and in each pass a different set of syllable models is used for recognition, governed by the hierarchy of the model tree.
This produces a syllable sequence. The recognized syllable sequence is then
looked up in a dictionary that lists the words and their corresponding syllable sequences. The recognized word is obtained by performing a simple Edit distance [8]
based matching instead of applying a computationally intensive language model.
The Edit distance scores¹ between the recognized syllable sequence and the syllable sequences of the different words in the dictionary are calculated, and the sequence that gives the least score is selected. The word corresponding to the selected syllable
sequence is considered as the recognized word.
OOV words
The presence of OOV words is a known source of recognition errors in speech
recognition applications [9]. To minimize such errors, the handling of OOV words
becomes crucial in speech recognition systems. The proposed SSWR system outputs the recognized syllable sequence in the case of OOV words. The SSWR
system detects the OOV words using an OOV threshold, i.e., if the smallest Edit distance score is above a particular threshold (THRESH_OOV), then the word utterance is considered to be an OOV word and the system outputs the recognized
syllable sequence.
¹ If the Edit distance score is zero, then the two strings are the same.
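As an illustration of this matching unit, the sketch below computes the Edit distance between syllable sequences, picks the closest dictionary entry, and falls back to the syllable sequence itself when the best score exceeds an OOV threshold. The function names, the dictionary layout, and the example threshold value are hypothetical; they are not taken from the thesis implementation.

import math

def edit_distance(a, b):
    # Standard Levenshtein distance between two syllable sequences.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def match_word(recognized, dictionary, thresh_oov=2):
    # Return the dictionary word whose syllable sequence is closest to the
    # recognized sequence, or the sequence itself if the best score exceeds
    # the OOV threshold (thresh_oov is an illustrative value).
    best_word, best_score = None, math.inf
    for word, syllables in dictionary.items():
        score = edit_distance(recognized, syllables)
        if score < best_score:
            best_word, best_score = word, score
    if best_score > thresh_oov:            # treat as an OOV word
        return "-".join(recognized), best_score
    return best_word, best_score

# Example: dictionary = {"athiradi": ["a", "thi", "ra", "di"]}
# match_word(["a", "thi", "ra", "di"], dictionary) -> ("athiradi", 0)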
1.2 Scope of the thesis
The SSWR system presented in this thesis involves the training phase and the
testing phase. The novelty of this work is in the usage of explicit segmentation
of words into syllable-like units during the recognition. A substantial part of
the thesis contains the detailed discussion on the recognition phase of the SSWR
system.
Further, to reduce the search space, a hierarchical approach is used at the
subword level. Finally, if the Edit distance score obtained during testing is poor,
the syllable sequence is output as it is. Thus enabling handling of OOV words.
As a byproduct of the SSWR system, the OOV words are handled in a better
manner. However, the problem of identification of OOV words is not within the
scope of this thesis.
1.3 Organization of thesis
The thesis is organized as follows.
Chapter 2 gives an overview of word-based and subword-based approaches to
medium and large vocabulary IWR systems. The need for a segmented subword-based approach is discussed in this chapter. The various existing approaches to
reduce the response time of large vocabulary IWR systems are explained briefly.
Chapter 3 describes a segmented subword-based approach for medium and
large vocabulary systems. A hierarchical recognition approach is proposed and
explained in this chapter.
Chapter 4 addresses the OOV words. It also explains the novelty of the SSWR
system in handling the OOV words.
Finally, in Chapter 5, the thesis is summarised. A brief discussion of the major contributions and drawbacks of this thesis is also given in that chapter.
1.4 Major contributions of the thesis
The following are the major contributions of the thesis:
(a) A segmented syllable-based word recognizer for a medium sized vocabulary
has been developed.
(b) A merging strategy has been explored to merge acoustically similar models.
(c) A hierarchical recognition has been attempted to reduce the time taken
during decoding.
(d) An automatic labeling tool has been developed and used to speed up the adaptation of the recognizer to new tasks, especially during the training phase.
CHAPTER 2
Isolated word recognition system
2.1 Introduction
Speech recognition systems can be broadly classified into three categories based on
their input types. They are (a) isolated word recognition (IWR) system, in which
the user is asked to speak one word at a time, (b) connected word recognition
system, in which the user speaks multiple words but with pauses between them,
and (c) continuous speech recognition (CSR) system, in which the user has the
freedom to speak words continuously without disfluencies [10]. Among these three
systems, the IWR system is the most successful because of its high accuracy (close to
100%) [1]. Fig. 2.1 shows a pictorial view of a typical IWR system in which the
speech signal (word utterance) is converted into its symbolic form (text). As shown
in Fig. 2.1, the IWR system recognizes the spoken word utterance “Hello” and
outputs the text Hello. Generally, IWR systems achieve very high accuracy with
small vocabularies [1]. As the vocabulary size increases, the accuracy decreases
rapidly and the response time increases drastically [1], which are undesirable.
As many real-time applications [1] require medium and large vocabulary IWR
systems, it is important to address this problem.
"Hello"
Figure 2.1: A typical isolated word recognition system.
7
In this chapter, we first discuss the various approaches to handle this problem.
This chapter is organized as follows: The databases used in this study are described
in Section 2.2. Section 2.3 gives an overview of word-based and subword-based
approaches for medium sized vocabulary IWR systems. The disadvantages of the
subword-based system are given in Section 2.4. Section 2.5 presents the need for
a segmented syllable-based IWR system for medium sized vocabulary.
2.2 Databases used in the study
For the isolated word recognition task, three databases of very small, small and
medium sized vocabulary are used in this study. They are described below.
(a) Very small sized vocabulary - This consists of speech utterances of 36 Hindi words. These words were excised from a DDNEWS Hindi bulletin [11] in such a way that each word has more than 60 examples. 60% of the data is used for training
and the remaining 40% is used for testing.
(b) Small sized vocabulary - This consists of speech utterances of 400 Tamil
words spoken by 50 speakers. These words are taken from the cricket news domain. Each speaker records all the words once. The training dataset consists of 40 examples per word, while the testing dataset consists of the remaining 10 examples per word.
(c) Medium sized vocabulary - This consists of the 2000 most frequently occurring English words, collected from non-native English speakers (mostly native Tamil speakers). The training dataset consists of 40 examples per word and the testing dataset consists of the remaining 10 examples per word. The total number of
speakers is 50.
(d) Noisy database - Noise (babble, pink, and white) is added to the medium sized vocabulary data. This results in three sets at 20 dB signal-to-noise ratio.
The isolated speech utterances were quantized at 16 bits per sample, with a sampling frequency of 16 kHz. Mel-frequency cepstral coefficient (MFCC) features are extracted from these speech signals. The feature vectors are 42-dimensional (13 MFCC, 13 ∆MFCC, 13 ∆∆MFCC, energy, ∆ energy, and ∆∆ energy).
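As an illustration only, the 42-dimensional feature vectors described above could be computed as sketched below. The librosa front end, the 25 ms/10 ms framing (400/160 samples at 16 kHz), the use of log RMS energy as the energy term, and the ordering of the stacked features are assumptions made for a runnable example; the thesis does not prescribe a particular toolkit.

import numpy as np
import librosa  # assumed front end; any MFCC implementation would do

def extract_features(wav_path):
    # Return a (frames x 42) matrix: 13 MFCC + energy, plus deltas and delta-deltas.
    y, sr = librosa.load(wav_path, sr=16000)     # 16 kHz utterances
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
    energy = np.log(librosa.feature.rms(y=y, frame_length=400, hop_length=160) + 1e-10)
    static = np.vstack([mfcc, energy])           # 14 x T
    delta = librosa.feature.delta(static)        # first derivatives
    delta2 = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, delta2]).T  # T x 42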
2.3 Medium sized vocabulary IWR systems
Depending upon the representative unit to be modeled, the hidden Markov model
(HMM) based IWR systems are of two types. They are (a) word-based IWR
system, which uses word HMMs, and (b) subword-based IWR system, which uses
subword HMMs.
2.3.1 A word-based approach
Fig. 2.2 shows the functional block diagram of a word-based IWR system. During
the training phase, word examples are collected from different speakers and individual word models are built for each word in the vocabulary. The training of a
word HMM involves two steps (see Fig. 2.2(a)). Firstly, the MFCCs [12] (observation sequence) are extracted from examples of a particular word. Secondly, the
observation sequences are then used to estimate the HMM parameters using the
Baum-Welch algorithm [6] (see Appendix for a brief overview of the algorithm).
Similarly, word models are built for all the words in the vocabulary.
During the testing phase (see Fig. 2.2(b)), the MFCC features are extracted
from the given test word utterance. The likelihood probability of this observation sequence being generated by each word model is evaluated using the Viterbi
decoding [13]. The word corresponding to the model which has the maximum
likelihood probability is considered as the recognized word.
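The decision rule of this testing phase can be sketched as follows. It assumes hmmlearn-style word models whose decode() method returns the Viterbi log-likelihood; the toolkit and the model objects are assumptions for illustration, not the implementation used in the thesis.

def recognize_word(features, word_models):
    # features: (frames x dims) MFCC matrix of the test utterance.
    # word_models: dict mapping word -> trained HMM (e.g. hmmlearn GaussianHMM).
    # Returns the word whose model gives the maximum Viterbi log-likelihood.
    best_word, best_logprob = None, float("-inf")
    for word, model in word_models.items():
        logprob, _ = model.decode(features)   # Viterbi decoding
        if logprob > best_logprob:
            best_word, best_logprob = word, logprob
    return best_word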
The word-based approach has been attempted on the test datasets; the results are shown in Table 2.1.
Figure 2.2: Functional block diagram of a word-based IWR system. (a) Training phase; (b) testing phase.
Table 2.1: Performance of word-based IWR systems.

Database      Vocabulary   Word accuracy (in %)   Response time (in CPU sec)
Very small    36           92.36                  0.074
Small         400          97.00                  1.103
Medium        2000         99.06                  12.037
Observe that the response time increases drastically
as the vocabulary size increases from small to medium. To improve the response
time of IWR systems, several methods have been proposed in the literature and
some of them are discussed in the next section.
2.3.2 Existing word-based approaches to reduce the response time
Hypothesis and verification paradigm
In these methods [14] [15] [16] [17], during recognition, instead of computing the likelihood against all the word models, a subset of models is selected (using fast selection algorithms in the first pass) such that the most likely word is among
them. In the second pass, a detailed Viterbi decoding is performed on those
selected word models.
In [14], L. R. Bahl et al. proposed a rapid scheme for obtaining an approximate acoustic match for all the words in the vocabulary. In this approach, a fast
matching algorithm is used to identify a few words such that the correct word is
obtained with high probability. The likelihood is computed against those selected
words. The proposed algorithm is reported to be a hundred times faster than
doing a detailed acoustic likelihood computation on all the words. The dataset
used was the IBM Office Correspondence isolated word dictation task which has
a vocabulary of 20,000 words. The disadvantage of this approach is that it uses a word-based approach, which requires a large amount of training data to build all the word models.
In [16], K. Minamino et al. proposed a structural search using a word-node
graph to speed up the response time of IWR systems. A distance measure is
defined for comparing two word models. Let λa and λb be the two given models, and assume that the model λa generates the observation sequence O. The likelihood probabilities P(O|λa) and P(O|λb) are computed. The difference between their
probabilities gives the distance between them. A graph is constructed depending
upon the distance between the words. Based on this graph, the recognition is
performed at different levels considering the most likely nodes in different levels
of the graph. This restricts the number of words to be examined. The distance
measure used in this approach does not guarantee the symmetric property, i.e., model λa being similar to λb does not imply that λb is similar to λa.
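A minimal sketch of such a distance measure is given below. The scoring callables stand in for per-utterance log-likelihood computation (for example, the score() method of an HMM toolkit), and averaging over several example utterances is an assumption made for illustration.

import numpy as np

def hmm_distance(score_a, score_b, observations_from_a):
    # Distance from model a to model b, following the idea in [16]: score
    # utterances belonging to model a under both models and take the mean
    # log-likelihood difference. Note that d(a, b) != d(b, a) in general,
    # i.e. the measure is not symmetric.
    diffs = [score_a(obs) - score_b(obs) for obs in observations_from_a]
    return float(np.mean(diffs))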
2.3.3 Experiments based on word duration and syllable count
In an attempt to reduce the recognition time in the word-based IWR system, we
used simple heuristics, such as word duration and syllable count, and performed some experiments. In these experiments, the training process involves grouping the words depending upon either their duration or their syllable count (the total number of syllables in a word, obtained using the group delay based segmentation). A HMM is trained for each group using the word examples in that
group. The testing involves a two pass recognition. In the first pass, the given
word utterance is recognized against all the group models and a group model is
selected. In the second pass, the given word utterance is recognized against the
word models of the selected group.
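The two-pass procedure just described can be sketched as follows. The containers group_models and word_models_by_group are hypothetical collections of trained HMMs, and decode() is assumed to return a Viterbi log-likelihood as in hmmlearn; none of these names come from the thesis.

def two_pass_recognition(features, group_models, word_models_by_group):
    # First pass: pick the best duration/syllable-count group model.
    best_group = max(group_models,
                     key=lambda g: group_models[g].decode(features)[0])
    # Second pass: detailed recognition against the word models of that group only.
    candidates = word_models_by_group[best_group]
    best_word = max(candidates,
                    key=lambda w: candidates[w].decode(features)[0])
    return best_word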
These methods are tested on the small sized vocabulary and the results are shown in Table 2.2.
Table 2.2: Performance of word duration and syllable count based methods. Baseline accuracy is 97% and baseline response time is 1.103 CPU sec.

Heuristic        Accuracy    Response time    % improvement in    % reduction in
                 (in %)      (in CPU sec)     response time       accuracy
Word duration    78.0        0.67             39                  19.0
Syllable count   70.7        0.70             36                  27.1
The word duration based approach reduces the response time
by 39% but degrades the accuracy by 19% for the test dataset of small sized
vocabulary. Similarly, the syllable count based approach reduces the response
time by 36% but degrades the accuracy by 27.1% for the same test dataset of
small sized vocabulary. Since these heuristic based methods introduce a reduction
in accuracy, no further investigations were carried out in this direction.
A single HMM for each group, as in the above methods, may not be the right approach, since two acoustically very different words with the same duration may end up in the same group under the word duration heuristic. The group HMM then represents neither of them well and fails in the first pass. This could be one of the possible reasons for the poor accuracy. A similar argument holds for the syllable count based method. The motivation for these experiments, however, is that some set of models is surely unlikely to be recognized for a given test word utterance, and hence scoring against all the models is a brute-force way of solving the problem. Instead, it would be better to perform a hierarchical recognition, removing unlikely models at each stage of recognition.
2.3.4 Disadvantages of the word-based approach
The word-based approach for medium sized vocabulary IWR systems introduces various issues, which are listed below:
(a) Need for a large amount of training data - The true distribution of
the word examples can be accurately described by the statistical model only when
the size of the training dataset tends to infinity. So to get a better estimate of the
model parameters, a large amount of training data is required.
(b) Longer recognition time - Recognition time becomes longer, as the total
number of word HMMs is large (proportional to the vocabulary size).
(c) Addition of new words to the vocabulary - Adding new words to
the vocabulary requires an additional effort in collecting the speech data, apart
from training new word models. Also words that are not part of the vocabulary
(out-of-vocabulary or OOV words) cannot be handled.
To overcome these issues, subword-based approaches are used in large vocabulary IWR systems; this is explained in the next section.
2.3.5 A subword-based approach
Fig. 2.3 gives the functional block diagram of a subword-based IWR system. During the training phase, the MFCC features of different word examples are uniformly segmented using a dictionary (words in the vocabulary and their corresponding subword sequences). This is called a flat start. The initial subword boundaries are marked as dashed vertical lines in the speech signal shown in Fig. 2.4. The observation sequences corresponding to different subwords are then used to estimate (initialize) their model parameters. Since the subword boundaries are obtained from uniformly segmented observation sequences, the boundaries may not correspond to actual subwords, so they have to be refined. The subword HMMs are Viterbi aligned against the observation sequence to obtain new subword boundaries, which are then used to re-estimate the subword HMM parameters. This step is repeated either for a predefined number of iterations or until no further change in the boundaries is detected. The final subword boundaries are marked as solid vertical lines in the speech signal shown in Fig. 2.4.
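A schematic version of this flat start and re-alignment loop is given below. The helper callables train_hmm and viterbi_align are placeholders for the Baum-Welch re-estimation and Viterbi alignment steps described in the text; this is a sketch of the control flow only, not the toolkit actually used in the thesis.

def flat_start_training(word_examples, dictionary, train_hmm, viterbi_align, n_iters=10):
    # word_examples: dict word -> list of MFCC matrices.
    # dictionary: dict word -> list of subword labels.
    # train_hmm(segments) and viterbi_align(models, features, labels) are placeholders.

    # Flat start: cut every example uniformly into as many pieces as it has subwords.
    segments = {}
    for word, examples in word_examples.items():
        labels = dictionary[word]
        for feats in examples:
            step = len(feats) // len(labels)
            for k, label in enumerate(labels):
                segments.setdefault(label, []).append(feats[k * step:(k + 1) * step])
    models = {label: train_hmm(segs) for label, segs in segments.items()}

    # Iteratively refine the boundaries and re-estimate the subword models.
    for _ in range(n_iters):
        segments = {}
        for word, examples in word_examples.items():
            labels = dictionary[word]
            for feats in examples:
                boundaries = viterbi_align(models, feats, labels)  # new cut points
                for (start, end), label in zip(boundaries, labels):
                    segments.setdefault(label, []).append(feats[start:end])
        models = {label: train_hmm(segs) for label, segs in segments.items()}
    return models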
Figure 2.3: Functional block diagram of a subword-based IWR system. (a) Training phase; (b) testing phase.
Figure 2.4: Training of subword models in a syllable-based CSR system for the word athiradi (syllables /a/, /thi/, /ra/, /di/; initial, misaligned, and final boundaries are marked).
Figure 2.5: Testing process (Viterbi search) in a subword-based IWR system.
During the testing phase, the observation sequences are extracted from the given test word utterance and are used in the Viterbi search (see Fig. 2.5). This search considers various paths through the different subword HMMs by evaluating a combined probability score (the acoustic model probability and the language model (LM) probability²). For example, in Fig. 2.5, P(s1|start) and P(end|s1) denote the bigram probabilities of syllable s1 occurring at the beginning and at the end of the word, respectively. The subword HMMs are Viterbi aligned (see Fig. 3.2) such that the combined probability score is maximized, and the recognized word corresponding to the aligned subword sequence is obtained from the dictionary.
² The language model gives the probability of the word that follows the current word. These probabilities are estimated from a large text corpus. In the case of IWR, the language model gives the probability of the syllables that can occur after the current syllable.
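For one candidate alignment, the combined score maximized by this search can be written, in log form, roughly as sketched below. The per-segment acoustic scores, the bigram table, and the LM weight are hypothetical inputs used only to make the combination of acoustic and language model probabilities concrete.

import math

def combined_path_score(acoustic_logprobs, syllables, bigram, lm_weight=1.0):
    # acoustic_logprobs: list of log P(segment_i | syllable_i) for one alignment.
    # syllables: the aligned syllable sequence, e.g. ["tho", "dang", "gu"].
    # bigram: dict mapping (previous, current) -> P(current | previous),
    #         with "start"/"end" as boundary symbols.
    score = sum(acoustic_logprobs)
    context = ["start"] + list(syllables) + ["end"]
    for prev, cur in zip(context[:-1], context[1:]):
        score += lm_weight * math.log(bigram.get((prev, cur), 1e-8))
    return score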
2.3.6 Advantages of subword-based approaches
The advantages of the subword-based approach over the word-based approach for
large vocabulary are listed below:
(a) The problem of needing a large amount of training data is avoided in the subword-based approach, as different words share common subword segments.
(b) New words can be added to the vocabulary without building new models,
as long as the subwords corresponding to the words to be added exist. In this way,
the OOV words are handled to a certain extent.
(c) The recognition time is shorter when compared with the word-based approach, as the total number of subword models is smaller than the total number of whole-word models.
2.3.7 Baseline accuracy
The performance (word accuracy) of the word-based and syllable-based IWR systems is shown in Table 2.3. The syllable-based IWR system shows a 20% degradation in accuracy compared with the word-based IWR system for the small sized vocabulary. A discriminative training [18] method was attempted instead of the ordinary maximum likelihood method to achieve better accuracy; this discriminative training algorithm is explained in Appendix A. Since the improvement in accuracy is negligible (0.2%), no further investigations were carried out in this direction. Table 2.3 reveals that the word-based IWR system gives better accuracy than the syllable-based IWR system for the very small, small, and medium sized vocabularies.
Results from Table 2.3 show that whole word units represent the words better than subword units. Also observe that the difference in accuracy between the word-based and subword-based systems tends to be smaller as the vocabulary size increases from small to medium. Hence the subword-based approach is considered
Table 2.3: Baseline accuracy of word-based and syllable-based IWR systems. (LM denotes language models.)

                               Word accuracy (in %)                    Response time (in CPU sec)
Database     Vocabulary size   word-based   syll. (no LM)  syll. (LM)   word-based   syll. (no LM)  syll. (LM)
Very small   36                92.36        90.09          -            0.074        0.153          -
Small        400               97.00        77.59          99.19        1.103        1.050          -
Medium       2000              99.06        95.70          -            12.037       9.607          -
to be a better alternative to the word-based approach for medium and large vocabularies. Nevertheless, the subword-based approach has its own disadvantages, which are discussed in the next section.
2.4 Disadvantages of current subword-based approaches
Most IWR systems use a continuous speech recognizer (CSR) for large vocabulary
IWR [19]. In these systems, the words are treated as sentences and subwords
correspond to the words in the sentence. Although the subword-based approach
shows a better response time when compared to the word-based approach, it has inherent disadvantages. A language model is required for such recognizers [19]. The
application of language models during recognition is computationally expensive.
This becomes a major drawback in using these subword-based IWR systems for
memory constrained devices like PDAs and laptops.
There is a chicken-and-egg problem in CSR: the subword boundaries are obtained by aligning the acoustic models, whereas building better acoustic models requires annotated speech data with proper subword boundaries. Obtaining subword boundaries with the trained acoustic models in this way leads to mismatches at the subword boundaries. During testing, a Viterbi based search is performed to identify the test utterance. In this search, the language model probabilities are
combined with acoustic model probabilities. The increase of the search space in
Viterbi decoding due to the misaligned subword boundaries and the usage of computationally intensive language models are the disadvantages of the subword-based
approaches.
2.5 Need for a segmented subword-based approach
As mentioned in the previous section, the use of continuous speech recognizer
(CSR) for medium and large vocabulary IWR is an overkill as language models
are needed. From another perspective, the language model probabilities are important
for continuous speech utterances but not necessarily for isolated word utterances,
because the syllables are well articulated in the latter. Hence if the words can be
segmented at the subword level, then each subword can be recognized individually
in isolated style, and simple text matching techniques can be employed to recognize the spoken word. Thus, a subword-based approach for medium and large
vocabulary IWR systems without language models can be built.
Fig. 2.6 shows the recognition of the test word utterance thodanggukirathu
using a conventional subword-based approach. It also shows three different types
of boundaries that are marked in the speech signal as vertical solid, dashed, and
dashed-dot lines. Below the speech signal (see Fig. 2.6), three different paths
corresponding to three different types of boundary are shown. Observe that in
Fig. 2.6, different paths branch at different instants of time, i.e., the starting and ending of the subword models occur at different instants of time. This increases the search space, which is the major disadvantage of the conventional subword-based approach.

A segmented subword-based approach is proposed in this thesis and explained in Chapter 3. In this approach, the word utterances are segmented into syllable-like units using group delay based segmentation [5].
Figure 2.6: An example showing different alignments of subword models in recognizing the word thodanggukirathu.
Figure 2.7: An example showing explicit segmentation of the test utterance of the word thodanggukiradhu.
In group delay based segmentation, a signal processing based technique is used to obtain syllable boundaries. This avoids the problems involved in the training of syllable models for Indian languages. One of the important features of this approach is that the boundary
information of the syllable segments provided to the recognizer reduces the search
space during recognition. The recognizer hypothesizes only those words that have a similar number of syllables to the test word utterance. Further, the search space is restricted, as the search branches to different
paths only at the syllable boundaries.
Fig. 2.7 shows the same test utterance thodanggukirathu that is explicitly
segmented into syllable-like units using group delay based segmentation. Observe
that the syllable boundaries are identified correctly³. Due to the explicit segmentation of the test utterance, the branching during Viterbi decoding happens only at the syllable boundaries. The proposed approach differs from the conventional subword-based approach and is advantageous in terms of reducing the search space.

³ These boundaries were verified by listening to them.
2.6 Summary
The word-based and subword-based IWR systems are explained in this chapter. Several existing word-based and subword-based approaches for reducing the response time of large vocabulary IWR systems are reviewed. The preliminary word-based experiments based on heuristics like word duration and syllable count are found to be inappropriate for reducing the response time of the system, since they degrade the accuracy. Various issues in the word-based approach for large vocabulary IWR systems are also discussed. The existing subword-based approaches overcome these issues, but increase the search space and use language models. To overcome these disadvantages, a segmented subword-based approach is proposed; this novel approach is described in the next chapter.
CHAPTER 3
Segmented syllable-based IWR (SSWR) system
3.1 Introduction
A segmented syllable-based approach for an IWR system is proposed in this chapter. In this approach, the boundary information of the syllables is made available to the system at recognition time by segmenting the test word utterance into syllable-like units using group delay based segmentation [4]. The explicit segmentation of the word utterance not only avoids the use of language models but also reduces the search space during recognition, as discussed earlier in Chapter 2.
To demonstrate the difference between the conventional subword-based and
the segmented syllable-based approaches, consider the test utterance of a Tamil
word athiradi which consists of four syllables namely /a/, /thi/, /ra/, and
/di/.
Fig. 3.1 shows the training of the syllable models in a conventional syllable-based system for the utterance athiradi. Different examples of the word athiradi are used to train the syllable models, but for the sake of illustration consider one such example, shown in Fig. 3.1. The example is uniformly segmented into four segments and each segment is assigned to one syllable in sequence; this process is referred to as a flat start. The initial boundaries are marked as dashed vertical lines in Fig. 3.1. MFCC features (the observation sequence) are extracted from these segments and are then used to estimate the parameters of the syllable models. Since the initial boundaries are not appropriate, the four trained syllable models are used to Viterbi align the given observation sequence and produce new syllable boundaries.
Figure 3.1: Training of subword models in conventional syllable-based system for the word athiradi (initial, misaligned, and final boundaries are marked).
From the newly obtained syllable boundaries, the syllable models are again trained and used to obtain new syllable
boundaries. The above step is repeated for a pre-defined number of iterations.
The final boundaries are marked as solid vertical lines in Fig. 3.1.
Observe that one of the final boundaries is not properly aligned (see Fig. 3.1).
Figure 3.2: An example that shows the alignment of subword models to recognize the word athiradi.
During recognition, the syllable models are Viterbi aligned as shown in Fig. 3.2
to recognize the word athiradi. Observe that two very different paths are hypothesized (one recognizing the utterance as the word thodanggukiradhu while the other recognizes it as the word athirrudhu, as shown in Fig. 3.2). Notice that the syllable model /dang/ (in the bottom hypothesis) starts before the model /a/ (in the top hypothesis) reaches its final state, and this can happen at any instant of time. Hence this leads to an increase in the search space.

We now explain a different approach to building HMMs.
Figure 3.3: Training of syllable models in segmented subword-based approach for the word athiradi (syllable segments /a/, /thi/, /ra/, /di/ and their examples).
Figure 3.4: An example showing explicit segmentation of the test utterance of the word athiradi.
Figure 3.5: Testing process in the segmented syllable-based IWR system for the test utterance of the word athiradi (syllable segments are recognized with syllable HMMs and matched against the dictionary using Edit distance).
In this approach, words are explicitly segmented into syllables both during training and testing. Fig. 3.3 shows the training of the syllable models /a/, /thi/, /ra/, and /di/. During training, the different examples of the word athiradi are segmented into syllable-like units using group delay based segmentation. The transcription corresponding to the word is syllabified using rule-based methods (see Appendix B for the syllabification algorithm), and the syllable segments obtained are annotated with the syllabified text.
In the segmented subword-based approach, during testing, the word utterance is
explicitly segmented into syllable-like units using group delay based segmentation
as shown in Fig. 3.4. Observe that the four syllable boundaries are
accurately identified using this signal processing based segmentation algorithm.
The solid vertical lines in Fig. 3.4 represent the syllable boundaries. The explicitly segmented units (see Fig. 3.5) are recognized individually using the trained
syllable models (as in the case of a typical IWR system). This produces a syllable
sequence. The recognized syllable sequence is then matched against a dictionary
and the recognized word is obtained.
We now detail the SSWR system in the next section (Section 3.2). The training phase is discussed in Section 3.2.1. The testing phase involves (a) the group delay based segmentation of the speech signal, (b) the recognition of the syllable segments, and (c) the matching of the syllable sequence; these are explained in Section 3.3, Section 3.4, and Section 3.5, respectively. In Section 3.6, the experimental results are discussed.
3.2 SSWR system description
The functional block diagram of the SSWR system is shown in Fig. 3.6. The SSWR
system consists of two phases: a) the training phase and b) the testing phase. The
training process (see Fig. 3.6) involves the collection of different syllable examples
from different words and the estimation of syllable model parameters.
The testing process consists of three units namely the input unit, the processing
30
Figure 3.6: Block diagram of the segmented syllable-based word recognizer (training phase: syllable data to syllable models; testing phase: segmentation unit, recognition unit, matching unit, and OOV threshold check).
The processing unit is further subdivided
into the segmentation unit, the recognition unit and the matching unit. The
input unit captures the isolated word utterance. The segmentation unit splits the
word utterance into syllable-like segments. The recognition unit recognizes the
syllable segments using the trained syllable models. The matching unit matches
the recognized syllable sequence with the dictionary and obtains the recognized
word.
3.2.1 Training phase of the SSWR system
Figure 3.7: A screenshot of the labeling tool (waveform panel, group delay panel, and text panel).
Different syllable examples are extracted from different word examples. The
word examples are segmented into syllable-like units using the group delay based
segmentation algorithm and the syllable segments are annotated with the help of
syllabified text (word) corresponding to the chosen speech signal (see Appendix
B for the syllabification algorithm). In order to automate this task, an automatic labeling tool for Indian languages has been developed. It is called DONLabel
(see Fig. B.2). The description of DONLabel is given in Appendix B. It provides
an easy to use web-based interface. The annotated database is collected in a
centralised place (server). The MFCC features are extracted from the syllable
examples and are then used to estimate the syllable model parameters using the
Baum-Welch algorithm.
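A sketch of this model-building step is shown below, using hmmlearn's GaussianHMM as a stand-in for the HMM toolkit; the number of states, the covariance type, and the choice of toolkit are assumptions made for illustration and are not taken from the thesis.

import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed toolkit

def train_syllable_models(syllable_examples, n_states=5):
    # syllable_examples: dict syllable -> list of (frames x 42) MFCC matrices.
    # Returns one trained HMM per syllable (Baum-Welch runs inside fit()).
    models = {}
    for syllable, examples in syllable_examples.items():
        X = np.vstack(examples)               # stack all examples of this syllable
        lengths = [len(e) for e in examples]  # per-example frame counts
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                 # Baum-Welch re-estimation
        models[syllable] = model
    return models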
3.2.2 Testing phase of the SSWR system
Fig. 3.8 illustrates the testing process of the SSWR system. The given test word
utterance is first segmented into syllable-like units using the group delay based
segmentation. These syllable segments are then recognized using the syllable
models. A simple Edit distance based matching of the recognized syllable sequence
against the dictionary is performed to output the recognized word. The testing
process is carried out using three different units, namely the segmentation unit,
the recognition unit and the matching unit. These units are explained in Section
3.3, Section 3.4, and Section 3.5.
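Putting the three units together, the testing phase can be sketched as follows. The boundary indices are assumed to come from the group delay based segmentation, decode() assumes hmmlearn-style syllable models, and match_word refers to the Edit distance matching sketch given in Chapter 1; all of these names are illustrative placeholders, not the thesis implementation.

def sswr_test(features, boundaries, syllable_models, dictionary, thresh_oov=2):
    # features: (frames x dims) matrix of the test utterance.
    # boundaries: frame indices from the group delay based segmentation.
    # syllable_models: dict syllable -> HMM with an hmmlearn-style decode().
    # dictionary: dict word -> syllable sequence.

    # 1. Cut the utterance into syllable segments at the given boundaries.
    edges = [0] + list(boundaries) + [len(features)]
    segments = [features[s:e] for s, e in zip(edges[:-1], edges[1:]) if e > s]

    # 2. Recognize each segment in isolated style (single pass over leaf models).
    recognized = []
    for seg in segments:
        scores = {syl: m.decode(seg)[0] for syl, m in syllable_models.items()}
        recognized.append(max(scores, key=scores.get))

    # 3. Edit distance matching against the dictionary (see the matching sketch).
    return match_word(recognized, dictionary, thresh_oov)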
3.3 Segmentation unit - a group delay based segmentation
The negative derivative of the Fourier transform phase is defined as the “group
delay”. In [5], it has been shown that the minimum phase group delay function
derived from the short term energy function can be used for segmentation of a
speech utterance into syllable-like units. Fig. 3.9 shows the functional block diagram of the group delay based segmentation. The group delay based segmentation
algorithm is explained in Appendix B. Fig. 3.10 shows the syllable boundaries of a
Tamil word apara, where the solid vertical lines represent the syllable boundaries.
Figure 3.8: Functional block diagram of the testing part of the SSWR system.
In order to avoid spurious boundaries, a threshold of 0.25 (found experimentally)
has been used.
Figure 3.9: Functional block diagram of the group delay based segmentation (short term energy, symmetrised and inverted, is processed through the root cepstrum/IDFT and group delay computation, followed by peak detection to obtain syllable boundaries).
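The numpy sketch below follows the block diagram of Fig. 3.9. The frame sizes, the root-cepstrum exponent, the minimum-phase construction, and the peak-picking details are assumptions made for illustration; they do not reproduce the exact algorithm of [5], which is given in Appendix B.

import numpy as np
from scipy.signal import find_peaks

def short_term_energy(x, frame_len=320, hop=160):
    # Frame-wise energy (20 ms frames, 10 ms hop at 16 kHz -- assumed values).
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

def gd_segment(x, peak_thresh=0.25, gamma=0.5):
    # Return candidate syllable boundary locations (in frame indices).
    e = short_term_energy(x)
    e = e / (e.max() + 1e-12)
    inv = 1.0 / (e + 1e-3)                   # invert: energy valleys become peaks
    sym = np.concatenate([inv, inv[::-1]])   # symmetrise -> treat as a magnitude spectrum
    c = np.real(np.fft.ifft(sym ** gamma))   # root cepstrum via IDFT
    n = len(c)
    c_min = np.zeros(n)                      # minimum-phase cepstrum (causal part doubled)
    c_min[0] = c[0]
    c_min[1:n // 2] = 2.0 * c[1:n // 2]
    # Group delay of the minimum-phase sequence: Re{ DFT(n * c_min[n]) }.
    tau = np.real(np.fft.fft(np.arange(n) * c_min))[:len(e)]
    tau = tau / (np.max(np.abs(tau)) + 1e-12)
    peaks, _ = find_peaks(tau, height=peak_thresh)   # peaks ~ syllable boundaries
    return peaks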
3.3.1 Effect of segmentation on different words and speakers
The performance of the group delay based segmentation algorithm is investigated
under the following two cases: (A) segmentation of different words of short and long durations, and (B) segmentation of short and long duration examples of the same word. The former experiment checks the consistency of the algorithm across different words, while the latter checks its consistency across different speakers. Figs. 3.11 and 3.12 illustrate the segmentation of the word utterances for case (A) and case (B) respectively.

Figure 3.10: Group delay segmentation of a word utterance apara (panels: speech signal, group delay spectrum, and energy function versus number of samples, with boundaries for /a/, /pa/, /ra/).

Case (A): Segmentation of different words of short and long durations
Consider a short duration example (365.9 msec) of the word ani, which consists of two syllables /a/ and /ni/, and a long duration example (772.5 msec) of the word thikkumukkAdinArgal, which consists of seven syllables /thik/, /ku/, /muk/, /kAd/, /di/, /nAr/, and /gal/. The segmentation results are shown in Fig. 3.11. It demonstrates that the syllable boundaries are identified accurately in both cases; in order to suppress spurious boundaries, a peak threshold value of 0.25 (found experimentally) has been used.
(a) Case I: short duration example of the word ani
(b) Case II: long duration example of the word thikkumukkAdinArgal
Figure 3.11: Case (A): Segmentation of two Tamil words with different durations uttered by the same speaker.
Case (B): Segmentation of short and long duration examples of the same word
Consider a short duration (847.5 msec) example and a long duration (1102.5 msec) example of the word pAkisthAnudan, which consists of five syllables /pA/, /kis/, /thA/, /nu/, and /dan/. The segmentation results are shown in Fig. 3.12. They show that the syllable boundaries are identified accurately in the case of the long duration example, whereas one boundary is missed in the case of the short duration example. This is shown as a dashed vertical line between /thA/ and /nu/ in Fig. 3.12 (Case II). Interestingly, the locations of the other boundaries are not disturbed. A simple duration analysis, as mentioned in [20], is used to improve the segmentation accuracy.
3.3.2 Effect of segmentation on noisy speech signals
To check the segmentation accuracy on corrupted speech data, the following experiments have been carried out on the noisy database (mentioned in Chapter 2).
Noise is added to the medium size vocabulary such that each of the three datasets (babble, pink, and white) contains speech signals at four different SNRs (15 dB, 10 dB, 0 dB, and −5 dB).
The segmentation results are shown in Figs. 3.13, 3.14, 3.15, 3.16, 3.17, and 3.18. Three different types of noise are added to an example of the word thodakkukirathu, which contains seven syllables, viz. /tho/, /dak/, /ku/, /ki/, /ir/, /ra/, and /thu/. The figures show that the segmentation algorithm provides correct and accurate syllable boundaries for the word utterance corrupted with noise. This is important because automatic speech recognition on noisy data is considered to be a difficult task. Observe that even at a very low SNR, say −5 dB, the group delay based segmentation algorithm is able to give accurate boundaries, as shown in the figures. These promising results strengthen the segmented subword-based approach proposed in this thesis.
(a) Case I: long duration (1102.5 ms) example of the word pAkisthAnudan
(b) Case II: short duration (847.5 ms) example of the word pAkisthAnudan
Figure 3.12: Case (B): Segmentation of the Tamil word pAkisthAnudan with different durations uttered by different speakers.
Table 3.1: Performance of the group delay based segmentation algorithm on the test datasets.

Database     Syllable count   No. of insertions   No. of deletions   Accuracy (in %)
Very small   6168             4                   0                  99.9
Small        13938            916                 738                87.97
Medium       22014            1496                828                89.44
Table 3.2: Performance of the group delay based segmentation algorithm on noisy (20 dB SNR) medium sized vocabulary test datasets.

Noise type   Syllable count   No. of insertions   No. of deletions   Accuracy (in %)
Babble       22541            3538                1979               75.52
Pink         22678            3670                2020               74.90
White        22924            3748                1914               75.30
3.3.3 Results and discussion
The group delay based segmentation algorithm has been evaluated on the test datasets (see Section 2.5) and the results are shown in Tables 3.1 and 3.2. It was found that the segmentation algorithm gives accurate boundaries in most of the cases. The errors in the segmentation are of two kinds, namely: (a) the split
error - splitting a larger segment into two smaller segments and (b) the merge
error - joining two smaller segments into a single large segment. The split errors
result in insertions of syllable segments, while the merge errors result in deletions
of syllable segments. Simple error corrective measures like duration check on the
segments have been employed to check whether the segment actually corresponds
to a syllable unit or not. Nevertheless, unnoticed errors have to be handled at
later stages in recognition.
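A minimal sketch of such a duration check is given below. The minimum and maximum duration thresholds, and the choice to glue very short fragments back to the previous segment, are hypothetical, since the exact corrective rules are not specified here.

```python
import numpy as np

def duration_check(segments, sr, min_ms=80.0, max_ms=400.0):
    """Illustrative duration-based sanity check on syllable segments.
    Segments shorter than min_ms are treated as probable split errors and
    merged into the previous segment; overly long segments are flagged as
    possible merge errors for later handling in the recognizer."""
    checked, flagged = [], []
    for seg in segments:
        dur_ms = 1000.0 * len(seg) / sr
        if dur_ms < min_ms and checked:
            # Probable split error: glue the fragment back to the previous segment.
            checked[-1] = np.concatenate([checked[-1], seg])
        else:
            checked.append(seg)
            if dur_ms > max_ms:
                flagged.append(len(checked) - 1)  # probable merge error
    return checked, flagged
```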
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 15 dB, corrupted with white noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of 10 dB, corrupted with white noise.
Figure 3.13: Segmentation of word utterances corrupted with white noise, resulting in 15 dB and 10 dB SNR.
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 0 dB, corrupted with white noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of -5 dB, corrupted with white noise.
Figure 3.14: Segmentation of word utterances corrupted with white noise, resulting in 0 dB and -5 dB SNR.
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 15 dB, corrupted with babble noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of 10 dB, corrupted with babble noise.
Figure 3.15: Segmentation of word utterances corrupted with babble noise, resulting in 15 dB and 10 dB SNR.
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 0 dB, corrupted with babble noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of -5 dB, corrupted with babble noise.
Figure 3.16: Segmentation of word utterances corrupted with babble noise, resulting in 0 dB and -5 dB SNR.
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 15 dB, corrupted with pink noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of 10 dB, corrupted with pink noise.
Figure 3.17: Segmentation of word utterances corrupted with pink noise, resulting in 15 dB and 10 dB SNR.
(a) Case I: Segmentation of the utterance thodakkukirathu with an SNR of 0 dB, corrupted with pink noise.
(b) Case II: Segmentation of the utterance thodakkukirathu with an SNR of -5 dB, corrupted with pink noise.
Figure 3.18: Segmentation of word utterances corrupted with pink noise, resulting in 0 dB and -5 dB SNR.
3.4 Recognition unit - decoding of the syllable segments
This section discusses the recognition of the syllable segments. Further, to reduce the search space, a hierarchical recognition scheme is proposed and explained.
3.4.1 Isolated style recognition
Fig. 3.19 shows the functional block diagram of the isolated style recognition of
syllable segments. The syllable segments are recognized using isolated syllable
models (as in typical isolated style recognition) and a syllable sequence is obtained. Let λi be the ith individual syllable model of N syllable models. Let ζj be the jth syllable segment and Oj be the observation sequence of ζj. Then the recognized syllable sequence is given by S = (s1 s2 · · · sk · · · sK) of K segments (ζ1 ζ2 · · · ζk · · · ζK), where sk is obtained using Eqn (3.1). The syllable recognition accuracy (SRA) on the test datasets is shown in Table 3.3. The SSWR system gives an SRA of 82.3%, 61.5%, and 81.5% for the very small, small, and medium vocabularies, respectively (see Table 3.3). Table 3.3 shows that there are more substitution errors
than insertion or deletion errors. The increase in substitution errors is because of
the presence of confusable syllables (similar sounding syllables) in the database as
shown in Table. 3.4.
s_k = \arg\max_{i} \log P(O_j \mid \lambda_i), \qquad 1 \le i \le N \qquad (3.1)
Figure 3.19: Isolated style recognition of syllable segments.
Table 3.3: Syllable recognition accuracy of the SSWR system.

Database     Vocabulary size   Errors (in %)                             Accuracy (in %)
                               Insertion   Deletion   Substitution
Very small   36                0.5         0.4        17.1               82.3
Small        400               7.2         6.0        40.6               61.5
Medium       2000              0.4         0.5        18.6               81.5
Table 3.4: A partial list of confusable pairs (out of 200) occurring more than once in the train dataset.

(tha, thu)    (ka, tha)    (ti, thi)    (ku, thu)
(tu, thu)     (ka, ra)     (path, pa)   (il, vil)
(ka, kan)     (a, pa)      (ka, kal)    (kal, thal)
(thir, thirk) (thu, thum)  (thi, thir)  (kum, vum)
(nai, thai)   (thil, til)  (tha, than)  (vil, vi)
(na, nar)     (il, thil)   (tai, thai)  (thil, thi)
Improving syllable recognition accuracy
As the presence of confusable syllables (or similar sounding syllables) degrades the
syllable recognition accuracy, the models corresponding to those syllables have to
be merged. This is done to increase the separability among syllable models. The
merging of the syllable models is explained in the next section.
3.4.2 Merging of the syllable models
To determine which pair of models should be merged, let us consider two distinct models, λa and λb. Their corresponding observation sequences are Oa and Ob. Let us also consider a merged model λa+b with the merged observation sequence Oa+b. The entropy of an individual model (λa) producing an observation sequence Oa of length T1 is given by Eqn. (3.2). Similarly, the entropy of λb producing an observation sequence Ob of length T2 is given by Eqn. (3.3). The entropy of the merged model λa+b producing an observation sequence Oa+b of length T1 + T2 is given by Eqn. (3.4). The delta entropy is defined as the difference between the entropy of the merged model and the sum of the entropies of the two individual models; the change in entropy resulting from the merged model is given by Eqn. (3.5). Whenever the delta entropy ∆Hab [7] between two models is small, or less than a particular threshold, the change in entropy resulting from the merged model will not affect the system performance; hence the two models can be merged. In Table 3.5, a partial list of syllables is given in the first column. In column 2, the set of syllables confused with the syllable in the first column is shown. The confusable syllable list is obtained by testing on the training dataset. Using delta entropy, acoustically similar syllable models are determined. Finally, the syllable model in the first column is merged with the most similar model in the second column. The merged models are shown in the third column of Table 3.5.
Table 3.5: Merging of the syllable models.

Syllable   Confusable model list     Merged models
/kA/       /kAr/, /kAi/, /kAn/       /kA/ + /kAn/
/kil/      /thil/, /nil/, /til/      /kil/ + /til/
/pArp/     /pAr/, /pA/, /pu/         /pAr/ + /pArp/
/rap/      /ra/, /lap/, /rup/        /rap/ + /rup/
/taiN/     /tain/, /tai/, /thai/     /taiN/ + /tai/
/thal/     /tha/, /kal/, /ra/        /thal/ + /ra/
/thith/    /the/, /thi/, /thir/      /thith/ + /thir/
H(\lambda_a) = -\sum_{\forall O \in O_a} p(O \mid \lambda_a) \log p(O \mid \lambda_a) \qquad (3.2)

H(\lambda_b) = -\sum_{\forall O \in O_b} p(O \mid \lambda_b) \log p(O \mid \lambda_b) \qquad (3.3)

H(\lambda_{a+b}) = -\sum_{\forall O \in O_a} p(O \mid \lambda_{a+b}) \log p(O \mid \lambda_{a+b}) - \sum_{\forall O \in O_b} p(O \mid \lambda_{a+b}) \log p(O \mid \lambda_{a+b}) \qquad (3.4)

\Delta H_{ab} = H(\lambda_{a+b}) - \{H(\lambda_a) + H(\lambda_b)\} \qquad (3.5)
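The sketch below mirrors Eqns (3.2)-(3.5) directly, under the simplifying assumption that the sums over O run over the training observation sequences of each model and that a model exposes a score() method returning log p(O|λ) (as in hmmlearn). In practice the sequence likelihoods may need per-frame normalisation to avoid numerical underflow; this is only an illustration of the definitions, not the exact computation used to produce the values reported below.

```python
import numpy as np

def entropy(model, observations):
    """H(lambda) as in Eqns (3.2)-(3.4), with the sum over O approximated by
    the model's training observation sequences."""
    log_p = np.array([model.score(O) for O in observations])  # log p(O | model)
    # NOTE: exp() of whole-sequence log-likelihoods can underflow; in practice
    # the likelihoods are normalised or computed per frame.
    return float(-np.sum(np.exp(log_p) * log_p))

def delta_entropy(model_a, obs_a, model_b, obs_b, model_ab):
    """Eqn (3.5): entropy of the merged model minus the sum of the individual
    entropies. model_ab is assumed to be trained on obs_a + obs_b."""
    h_a = entropy(model_a, obs_a)
    h_b = entropy(model_b, obs_b)
    h_ab = entropy(model_ab, obs_a) + entropy(model_ab, obs_b)   # Eqn (3.4)
    return h_ab - (h_a + h_b)
```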
For illustration purposes, consider three syllable models /a/, /A/, and /di/. Let us try to find acoustically similar syllable models among these. The entropies¹ of these models are shown in Table 3.6. Table 3.6 also shows the entropies of the merged models, viz. /a+A/, /a+di/, and /A+di/. The delta entropy between the different syllable models is shown in Table 3.7. From Table 3.7, it is clear that the syllable models /a/ and /A/ are the most acoustically similar among /a/, /A/, and /di/. Thus the syllable models /a/ and /A/ are merged to form a merged model, /a+A/.

¹ Training syllable examples are used for the entropy computation.
Table 3.6: Entropy of HMMs.

Model name   Entropy         Model name   Entropy
/a/+/A/      8.8301e+04      /a/          3.3458e+04
/A/+/di/     8.3927e+04      /A/          5.2355e+04
/a/+/di/     6.4112e+04      /di/         2.7882e+04
Table 3.7: Delta entropy of HMMs.

Model Name   /a/      /A/      /di/
/a/          -        0.2488   0.3772
/A/          0.2488   -        0.3690
/di/         0.3772   0.3690   -

3.4.3 Hierarchical recognition of the syllable segments
The syllable models (merged models and individual models) are represented using
a tree data structure called the model tree (see Fig. 3.20). Each node in the
tree represents a model. The internal nodes represent the merged syllable models,
whereas the leaf nodes represent the individual syllable models.
Figure 3.20: Structure of a hierarchical syllable model tree.
The syllable segments obtained from the group delay based segmentation are
recognized with the help of the model tree. Fig. 3.21 illustrates the hierarchical
recognition of a syllable segment /pA/. If the height of the tree is h, then the
recognition takes at most h passes. The dotted lines in Fig. 3.21 represent the
Figure 3.21: Illustration of a hierarchical recognition of a syllable segment /pA/.
Table 3.8: The structure of a node in the model tree.
nodeId (integer)
nodeName (string)
nodeChildren (array of nodes)
pruned paths and solid lines represent the active paths. The structure of each node j in the model tree is shown in Table 3.8. The children list of node j is NULL if and only if node j is a leaf node. The syllable sequence is given by S = (s1 s2 · · · sk · · · sK) of K segments (ζ1 ζ2 · · · ζk · · · ζK), where sk is obtained using the following procedures.
Procedure 1: Main
    node ← Root
    List ← GetChildrenOfNode(Root)            /* get the children of the Root node */
    while List ≠ ∅
        node ← ProcessList(List)              /* child model with the maximum log probability */
        List ← GetChildrenOfNode(node)        /* descend into that child */
    endwhile
    return node.nodeName                      /* a leaf node, i.e. an individual syllable model */

Procedure 2: ProcessList(List)
    maxNode ← NULL
    forall node ∈ List
        logProb ← GetLikelihoodProb(ObsSeq, node)
        UpdateMaxNode(logProb, maxNode, node) /* keep the node with the largest logProb */
    endfor
    return maxNode
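A compact Python rendering of the node structure of Table 3.8 and of Procedures 1 and 2 is sketched below. The score() method on a model is an assumption standing in for GetLikelihoodProb, and the field names follow Table 3.8.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node of the model tree (Table 3.8). Internal nodes hold merged
    models, leaf nodes hold individual syllable models."""
    node_id: int
    node_name: str
    model: object                      # an HMM exposing score(features) -> log-likelihood
    children: List["Node"] = field(default_factory=list)

def recognize_segment(root: Node, features) -> str:
    """Multi-pass recognition of one syllable segment (Procedures 1 and 2):
    at each level keep only the child with the highest log-likelihood and
    descend into it, until a leaf (individual syllable model) is reached."""
    node = root
    while node.children:                                       # Procedure 1: main loop
        # Procedure 2: pick the child model with the maximum log probability.
        node = max(node.children, key=lambda n: n.model.score(features))
    return node.node_name
```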
3.5 Matching unit - Edit distance based matching
Edit distance is a popular distance measure for comparing two strings. It is computed as the minimum number of basic operations required to convert one string
into another. The basic operations are (i) inserting a character, (ii) deleting a
character, and (iii) substituting a character.
For example, consider two strings, mothers and brother. The Edit distance computation is illustrated in Table 3.9 and their distance score is 3, assuming that the costs of insertion, deletion, and substitution are all 1. To demonstrate the Edit distance based matching of syllable sequences, consider a vocabulary of two Tamil words, kunamataitha and kanamana. Let kunamatai be the recognized syllable sequence of a test word utterance. The task is to find the recognized word, i.e., to find the word in the dictionary that is the closest match for the recognized syllable sequence. Table 3.10 and Table 3.11 give the Edit distance scores for the two pairs of strings (kunamatai, kunamataitha) and (kunamatai, kanamana) respectively.
Table. 3.10 and Table. 3.11 show the Edit distance score between both these
pairs (kunamatai, kunamataitha) and (kunamatai, kanamana) as 3. Because of
this, there exists an ambiguity in deciding the closest match for the given test word
utterance whose recognized syllable sequence is kunamatai. A modified approach
Table 3.9: Illustration of Edit distance computation between two strings, mothers and brother.

mothers ⇒ mother  (delete character s)
        ⇒ rother  (substitute character m with character r)
        ⇒ brother (insert character b)
Table 3.10: Edit distance based matching between kunamatai and kunamataitha.

kunamatai ⇒ kunamatait   (insert character t)
          ⇒ kunamataith  (insert character h)
          ⇒ kunamataitha (insert character a)
Table 3.11: Edit distance based matching between kunamatai and kanamana.

kunamatai ⇒ kanamatai (substitute character u with character a)
          ⇒ kanamanai (substitute character t with character n)
          ⇒ kanamana  (delete character i)
has been proposed to resolve the ambiguity. In the proposed method, the three basic operations are performed with syllables (arrays of characters) instead of characters. If the ambiguity still exists, then the Edit distance is computed in two steps: (a) use the syllable as the basic unit and shortlist candidate words, and (b) use characters as the basic unit on those selected words (from step (a)). To illustrate the modified approach, consider the syllable sequences of the two Tamil words as /ku/na/ma/tai/tha/ and /ka/na/ma/na/. The recognized syllable sequence of the given test word is /ku/na/ma/tai/. The modified Edit distance scores between these two pairs of strings are given in Table 3.12 and Table 3.13. From Tables 3.12 and 3.13, it is clear that the word kunamataitha is the closest match for the recognized syllable sequence kunamatai. Hence the word kunamataitha is chosen as the recognized word. Observe that the recognized syllable sequence and the recognized word differ by the last syllable. This is because of a deletion error in the group delay based segmentation.
Table 3.12: Modified Edit distance based matching between kunamatai and kunamataitha.

/ku/na/ma/tai/ ⇒ /ku/na/ma/tai/tha/ (insert syllable /tha/)

Table 3.13: Modified Edit distance based matching between kunamatai and kanamana.

/ku/na/ma/tai/ ⇒ /ka/na/ma/tai/ (substitute syllable /ku/ with syllable /ka/)
               ⇒ /ka/na/ma/na/ (substitute syllable /tai/ with syllable /na/)

Fig. 3.22 shows the Edit distance score between the recognized syllable sequence and the syllable sequences of different words in the vocabulary. Let ss_t be the recognized syllable sequence of the test word utterance and ss_i be the syllable sequence of the ith word in the dictionary. Let E_dist(ss_t, ss_1) be the Edit distance score between the test syllable sequence ss_t and the syllable sequence of the 1st word in the dictionary. Similarly, the Edit distance score is computed against the syllable sequences of all the words in the dictionary. The least score is given by Eqn. (3.6) and the recognized word is obtained using Eqn. (3.7).
score = \min_{i} E_{dist}(ss_t, ss_i), \qquad i = 1, 2, 3, \ldots, V \qquad (3.6)

recognized\ word = \arg\min_{i} E_{dist}(ss_t, ss_i), \qquad i = 1, 2, 3, \ldots, V \qquad (3.7)
Figure 3.22: Edit distance based word searching method.
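A minimal sketch of the Edit distance computation and of the dictionary search of Eqns (3.6) and (3.7) is given below. It works on any sequences, so the same routine covers both character-level and syllable-level matching; the dictionary layout shown is an assumption.

```python
def edit_distance(src, dst):
    """Minimum number of insertions, deletions and substitutions (unit costs)
    needed to turn 'src' into 'dst'. The units may be characters or, for the
    modified matching described above, whole syllables."""
    m, n = len(src), len(dst)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def match_word(recognized_syllables, dictionary):
    """Eqns (3.6) and (3.7): score every dictionary entry and return the word
    with the smallest Edit distance. 'dictionary' maps a word to its syllable
    sequence, e.g. {"kunamataitha": ["ku", "na", "ma", "tai", "tha"]}."""
    scores = {w: edit_distance(recognized_syllables, syls)
              for w, syls in dictionary.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```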
Table 3.14: Performance of the SSWR system on test dataset.

Database     Vocabulary   Accuracy (in %)             Response time (in CPU sec)
             size         single-pass   multi-pass    single-pass   multi-pass
Very small   36           98.11         96.4          0.146         1.071
Small        400          99.9          -             0.90          -
Medium       2000         98.13         99.29         6.541         3.689
Table 3.15: Performance of the word-based and syllable-based IWR systems on test dataset.

Database     Vocabulary   Accuracy (in %)                Response time (in CPU sec)
             size         word-based   syllable-based    word-based   syllable-based
Very small   36           92.36        90.09             0.074        0.153
Small        400          97           77.59             1.103        1.050
Medium       2000         99.06        95.70             12.037       9.607

3.6 Results and Discussion
The performance of the SSWR system is shown in Table 3.14. It reveals that hierarchical recognition using the model tree improves the accuracy and, at the same time, reduces the response time. Table 3.14 shows that the SSWR system indeed works well for the very small, small and medium vocabularies. The performance of the word-based and the syllable-based IWR systems is shown in Table 3.15 for comparison purposes. Observe that the response time is reduced to almost 50% for the medium vocabulary system. The height of the tree obtained for this wordlist is 5. There is an increase in the response time (see Table 3.15) in the case of the syllable-based IWR system for the small vocabulary, when compared to the word-based IWR system. This is because the number of syllable models is larger than the number of word models. Since the accuracy of the SSWR system for the small vocabulary is better than that of the other systems under consideration, the hierarchical recognition is not performed for that case.
Table 3.16: Performance of the SSWR system on medium vocabulary noisy test datasets.

Train noise   Babble           Pink             White
              Acc (in %)       Acc (in %)       Acc (in %)
Clean         67.96            66.69            62.22
Noise         66.29            58.28            61.00

3.7 Summary
In this chapter, a segmented syllable-based IWR system for medium sized vocabulary has been proposed. The SSWR system performs better than the conventional
syllable-based IWR system and avoids the problem caused by the misalignment of syllable boundaries. It works well for very small, small and medium size vocabulary. Methods for two different problems have been suggested in this chapter, namely: (a) use of merged syllable models to improve the syllable recognition
accuracy and (b) performing a hierarchical recognition of the syllable segments to
reduce the recognition time for medium size vocabulary systems. The SSWR system presented in this chapter is a simple syllable-based IWR system in the sense
that it does not have any overhead in terms of applying language models. Due to
this fact, the recognized syllable sequence is available for OOV words and this is
explained in the next chapter.
CHAPTER 4
Out-Of-Vocabulary (OOV) words
4.1 Introduction
Current state-of-the-art ASR systems cannot recognize words which are not contained within the vocabulary. As the vocabulary size is always fixed, out-of-vocabulary (OOV) words cannot be avoided. However, in domain-specific dialog applications, the OOV words are rejected using a rejection strategy as in [21]. This
is justified in these systems, because acting upon a misunderstood input may incur a high cost to the user. But many speech recognition systems are expected to
handle such OOV words and not simply reject them. Hence handling the OOV
words becomes crucial in speech recognition systems.
A SSWR system has been proposed in Chapter 3. Handling the OOV words
is a byproduct of the SSWR system. The system outputs the recognized syllable
sequence in case of OOV words and this sequence is used to get a clue about the
spoken OOV word.
The rest of this chapter is organized as follows. The various existing approaches
for detecting the OOV words are explained in Section. 4.2. Section. 4.3 explains
the procedure used to handle the OOV words in the SSWR system. Section. 4.4
gives the summary of this chapter.
4.2 Review of approaches to OOV word detection
Several researchers have proposed various techniques in the literature for detecting
the OOV words. They are briefly discussed in this section. Approaches to detect
the OOV words can be broadly categorized into the following types:
4.2.1 Acoustic modeling
In this approach, a single acoustic model is built for all the OOV words, such that
the model picks up the OOV words during recognition. This technique has been
used in [22] [23].
In [22], I. Bazzi et al. proposed a generic word model for OOV words. The
generic word model is trained with all the phonemes in the pronunciation dictionary, such that any phoneme can follow any other phoneme and thus represent the
OOV words. To allow OOV words during recognition, the vocabulary of the IWR
system is augmented with an OOV word whose underlying model is the generic
word model as shown in Fig. 4.1.
Figure 4.1: Augmenting OOV words into the vocabulary using a generic word model.
Fig. 4.1 shows the word models (w1, w2, . . . , wn) corresponding to the words in the vocabulary. A single generic OOV word model is added to this set of models. The OOV word model is trained in such a way that it recognizes all possible phoneme sequences; it is marked as OOV word in Fig. 4.1. Hence, during recognition, the OOV word is treated like any other word in the vocabulary.
In [23], I. Bazzi et al. described an approach in which the words in the vocabulary are categorized into multiple classes such as verbs, nouns, and adjectives. An
OOV word model (as in [22]) is trained for each class. These explicit OOV models
compete with in-vocabulary word models during recognition and are hypothesized
just as any other word.
4.2.2 Language modeling
A powerful language model improves the detection of OOV words. Both the word
language models and subword language models have been used in [24].
In [24], A. Yazgan et al. showed the importance of a language model for
detection of OOV words. For the acoustic model, a generic word model was used
as in [22]. For the language model, all words in the training corpus which are not
included in the vocabulary are mapped into a single symbol. This has the effect
of modeling a generic OOV word.
4.2.3 Confidence measures
Instead of incorporating explicit models into the system, information about certain parameters during recognition can be used implicitly to detect the OOV
words. These parameters are usually termed as confidence measures. Two useful
confidence measures have been proposed in [9]:
• duration normalized acoustic scores - this measure takes its cue from the actual acoustic score. The acoustic scores for OOV words tend to be low when compared with those of in-vocabulary words.
• normalized score drop - this measure finds the clue from the relative acoustic
score drop. The difference between the scores of the first best and the
second best hypotheses will be large for an in-vocabulary word, whereas
their difference becomes small for OOV words.
In [25], B. Decadt et al. proposed an approach based on phoneme-to-grapheme conversion. For each word with a low confidence score, the corresponding acoustic data is sent to a phoneme recognizer and the resulting phoneme string is then transformed into a grapheme string using an automatic phoneme-to-grapheme converter. This grapheme string replaces the originally recognized
words having low confidence scores.
4.3 Handling of the OOV words in the SSWR system
These existing approaches use either an acoustic model (generic word model) and
a language model or some confidence measures to detect OOV words. The SSWR
system, proposed in this thesis, is based on a segmented subword-based approach.
It recognizes the words using Edit distance based matching and does not need any
language models. Because of this, acoustic word model and language model based techniques to detect OOV words cannot be used in the SSWR system. Therefore, a threshold based on the Edit distance score is used to detect the OOV words. The two confidence measures, namely the normalized score drop and the duration normalized acoustic scores, are used to validate the detection of OOV words.
In case of OOV words, the Edit distance score is very high since the recognized
syllable sequence does not match with any syllable sequence in the dictionary. If
the Edit distance score is above an OOV threshold and satisfies the confidence
measures, then the SSWR system outputs the recognized syllable sequence instead
of the matched word. This is the byproduct of the SSWR system.
To evaluate the SSWR on OOV words, a test dataset consisting of 100 Tamil OOV words for the small vocabulary (see Section 2.5) is chosen. The test dataset contains utterances of 10 speakers (5 male and 5 female), with the 100 Tamil words spoken by each speaker. If the least distance score between the recognized syllable sequence and the syllable sequences of the words in the dictionary corresponds to more than 39% misclassified syllables, then the spoken word is considered to be an OOV word (as the overall syllable recognition accuracy is 61%; see Table 3.3). Table 4.1 shows a partial list of OOV words and their syllable sequences, and reveals that the syllable sequences of the OOV words are indeed readable. All the recognized test OOV words along with their syllable sequences are logged for analysis.
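The following sketch shows one way the OOV decision could be made. Apart from the 39% ratio quoted above, the numeric thresholds and the exact way the two confidence measures are combined with the Edit distance threshold are assumptions, since they are not specified here.

```python
def detect_oov(best_edit_score, n_syllables, norm_acoustic_score, score_drop,
               oov_ratio=0.39, acoustic_floor=-75.0, min_drop=5.0):
    """Illustrative OOV decision combining the Edit-distance threshold with
    the two confidence measures (duration-normalised acoustic score and
    normalised score drop). acoustic_floor and min_drop are hypothetical."""
    too_far = best_edit_score > oov_ratio * n_syllables      # Threshold-oov
    low_acoustic = norm_acoustic_score < acoustic_floor      # low per-frame score
    small_drop = score_drop < min_drop                       # 1st-best vs 2nd-best drop
    # Combination rule is an assumption: require the distance threshold plus
    # at least one supporting confidence cue.
    return too_far and (low_acoustic or small_drop)
```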
In the analysis, the OOV words are split into two types namely: (a) Type-I
OOV words - all the subwords of the OOV words exist in the lexicon, and (b)
Type-II OOV words - one or more subwords of the OOV word are not present
in the lexicon. For example, consider a vocabulary of three Tamil words AdAmal
(/A/dA/mal/), apAra (/a/pA/ra/), and adiththAr (/a/dith/thAr/). The syllable sequences are given in brackets. Let us consider two OOV words adiththAl
(/a/dith/thAl/) and pAdAmal (/pA/dA/mal/). The OOV words, adiththAl
and pAdAmal are of Type-II and Type-I respectively.
4.3.1 Handling of Type-I OOV words
The analysis reveals that in the case of Type-I OOV words, the recognized words
are “wrong” but their syllable sequences are “correct”. If the correct words along
with their actual syllable sequences are added to the dictionary, then the “correct”
Table 4.1: A partial list of OOV words (out of 1000 examples) and their syllable sequence.

Word            Original syllable sequence   Recognised syllable sequence
etiyathu        e-ti-ya-thE                  e-ti-yi-thE
kunamataitha    ku-nA-ma-tai-tha             ku-na-ma-thai-ka
kuvithanar      ku-vi-tha-nar                ku-vi-ta-nar
thoku           tho-ku                       thu-ku
thotarthu       tho-tar-thu                  tho-nar-thu
ulaga           u-la-ga                      ku-la-ga
ullathu         ul-la-thu                    pul-la-thu
vadikatti       va-di-kat-ti                 va-di-kat-ti
vakikkum        va-ki-kum                    va-ti-kum
vativam         va-ti-vam                    a-ti-vam
varai           va-rai                       va-rai
varukira        va-ru-ki-ra                  va-ru-ki-ra
varukirArgal    va-ru-ki-rAr-gal             va-ra-ki-ra-gal
varumAru        va-ru-mA-ru                  va-ru-mA-thu
vAkai           vA-kai                       mA-kai
vEkamAna        vE-ka-mA-na                  vE-kam-na
vidhaMaka       vi-dha-mA-ka                 vi-dha-mA-ka
vinaval         vi-na-val                    vi-na-val
vivaram         vi-var-ram                   i-var-ram
recognized word will be obtained in future recognition. The Type-I OOV words and their syllable sequences can be directly added to the dictionary without disturbing the SSWR system, as their subwords are already present in the lexicon.
4.3.2 Handling of Type-II OOV words
The analysis shows that in the case of Type-II OOV words, the recognized words
are “wrong” and their recognized syllable sequences are “partially right” because
some subwords are missing in the lexicon. The Type-II OOV words cannot be
added directly to the dictionary (as in the case of Type-I OOV words). Only after building models for the missing subword units can the Type-II OOV words be added. Nevertheless, the recognized syllable sequences are readable and give useful clues about the spoken words.
4.4 Summary
In this chapter, a simple threshold-based approach has been used to detect the
OOV words. Besides using an OOV threshold, two confidence measures have also
been used in detecting the OOV words. However, the problem of detection of
OOV words is not within the scope of this thesis and the handling of OOV words
is just a byproduct of the SSWR system. As no language models are used in the
SSWR system, the recognized syllable sequences are available for OOV words and
hence the OOV words are handled in a better manner. The recognized syllable
sequences are readable and they give a better clue about the actual spoken OOV
word.
CHAPTER 5
Conclusion
5.1 Summary
The work presented in this thesis represents an attempt to develop an isolated word recognition system for medium and large vocabularies. The segmented subword-based approach is shown to be better than the conventional subword-based approach for a medium size vocabulary, as the former avoids the misalignment of subword
boundaries during recognition. In addition to that, the SSWR system proposed
in this thesis works well for very small and small vocabulary. In this approach,
besides adding new words to the vocabulary with a constraint, the OOV words
are also handled in a better manner.
A hierarchical recognition using the model tree is proposed in this thesis. The
model tree contains the merged models and the individual models. A delta entropy
based measure is used to identify acoustically similar models and these confusable
models are combined to form merged models. Hierarchical recognition not only
improves the word accuracy, but also reduces the response time of the SSWR
system. Without applying computationally expensive language models, the word
utterance is recognized using a simple Edit distance based matching. In addition, an
automatic segmentation and labeling tool for Indian languages has been developed.
It is used for adaptation of the speech recognition system for a new domain or
new task, particularly during the training of the Indian language speech corpus.
5.2 Major contributions of the thesis
The following are the major contributions of the thesis:
(a) A segmented syllable-based word recognizer for medium sized vocabulary
has been developed.
(b) A delta entropy based measure has been explored to merge acoustically
similar models.
(c) A hierarchical recognition has been attempted to reduce the response time
of the IWR system.
(d) An automatic labeling tool for Indian languages has been developed.
5.3 Criticisms of the work
The purpose of this work is to build a small footprint IWR system for handheld
devices. To make the SSWR system one such system, the following issues need
to be addressed.
1. The SSWR system has been tested under artificial noise environments. But
in actual practice, data must be collected from all environments.
2. The SSWR system has been tested for medium sized vocabulary. The performance of the proposed system for large vocabulary is still a conjecture.
At the time of this writing, no such dataset for Indian English exists for
evaluation.
3. The detection of OOV words using an OOV threshold is a simple and ad hoc way of solving this problem. The choice of the OOV threshold is crucial for keeping the number of false positives low.
4. The parameters of the group delay segmentation algorithm need to be tuned,
as they vary from one database to another.
Appendix A
A.1 Baum-Welch algorithm
Generally the HMM parameters are estimated using the maximum likelihood (ML)
objective function. The aim of the ML estimation is to find the parameter set that
maximizes the likelihood of the training utterances given their transcription. The
ML estimation emerges from the assumption that the speech signal is distributed
according to the model. Another advantage of the ML estimation is its simplicity
of implementation using the Baum-Welch re-estimation algorithm [26].
Let the vocabulary consist of V words, say {1, 2, . . . , v, . . . , V}. The speech signal is divided into frames, and the MFCC features are extracted for each frame. Let o_t denote the features of the tth frame, and let O = (o_1, o_2, . . . , o_T) be the feature sequence (or observation sequence) of T frames of the given speech signal. Let us assume that each word v in the vocabulary is characterized by a conditional probability density function p(O|v). Let us choose to model the words by a Gaussian mixture density, i.e., the probability density function of v is
p_\theta(O \mid v) = \sum_{s} \left\{ \prod_{t=0}^{T} a_{s_t s_{t+1}} \right\} \left\{ \prod_{t=1}^{T} b_{s_t}(o_t) \right\} \qquad (A.1)
where {aij } are the transition probabilities. s = s0 , . . . , sT +1 is the state sequence where states s0 and sT +1 are constrained to be the initial and the final
states respectively. The summation is over all possible state sequences. bi (ot ) are
the output distributions.
b_i(o_t) = \sum_{k=1}^{K} c_{ik}\, b_{ik}(o_t) \qquad (A.2)
c_{ik} is the weight of mixture k in state i, and b_{ik}(\cdot) is a Gaussian distribution with a mean vector \mu_{ik} and a diagonal covariance matrix \Sigma_{ik}. Thus the parameter set of each word comprises the following elements:

• a_{ij} is the transition probability from state i to state j, such that \sum_{j=1}^{N} a_{ij} = 1.

• c_{ik} is the weight of the kth mixture of the ith state, such that \sum_{k=1}^{K} c_{ik} = 1.

• \mu_{ik} is the mean vector of the kth mixture of the ith state.

• \Sigma_{ik} = \mathrm{diag}\{\sigma^2_{ik1}, \ldots, \sigma^2_{ikn}\} is the diagonal covariance matrix of the kth mixture of the ith state.
Let θ denote the entire parameter set of all the words in the vocabulary. The
training of these models is performed according to the given training set. The
training set consists of the U utterances O = (O1 , . . . , OU ) and their corresponding
transcriptions W = (w1 , . . . , wU ). ML training is basically the maximization of
the ML objective function, L(θ) which is defined as
L(\theta) = \log p_\theta(\mathbf{O} \mid W) = \sum_{u=1}^{U} \log p_\theta(O^u \mid w^u) = \sum_{v=1}^{V} \sum_{u \in A_v} \log p_\theta(O^u \mid v) \qquad (A.3)
where Av = {u|wu = v}. Assume that parameters are not tied across words.
Hence, it is clear that training can be performed on each word separately. The
optimization of the ML objective function is iteratively implemented using the
Baum-Welch algorithm. The re-estimation formulas are given below:
\bar{a}_{ij} = \frac{\sum_{u \in A_v} \sum_{t=0}^{T} p_\theta(s_t = i, s_{t+1} = j \mid O^u, v)}{\sum_{u \in A_v} \sum_{t=0}^{T} \psi_i^u(t)} \qquad (A.4)

\bar{c}_{ik} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_i^u(t)} \qquad (A.5)

\bar{\mu}_{ikj} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} [o_t^u]_j \, \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t)} \qquad (A.6)

\bar{\sigma}^2_{ikj} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} ([o_t^u]_j - \bar{\mu}_{ikj})^2 \, \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t)} \qquad (A.7)
where

\psi_{ik}^u(t) = p_\theta(s_t = i, g_t = k \mid O^u, v) \qquad (A.8)

\psi_i^u(t) = p_\theta(s_t = i \mid O^u, v) \qquad (A.9)

A.2 Viterbi decoding
Problem statement: Given the observation sequence and the model, how to
obtain the “optimal” state sequence associated with the observation sequence.
Viterbi decoding [27] is performed to find the optimal solution to this problem, i.e., finding the “optimal” state sequence associated with the given observation sequence. It is a parallel search algorithm and it searches for the best state sequence by processing all the states in parallel. Let Q = (q_1, q_2, · · · , q_T) be the state sequence of length T, where q_1 and q_T are the initial state and the final state respectively. The most widely used criterion to find the single best state sequence is to maximize P(Q|O, λ), which is equivalent to maximizing P(Q, O|λ).
Let us define a probability quantity \delta_t(i), which represents the maximum probability along the best possible state sequence path of a given observation after t instants and being in state i:

\delta_t(i) = \max_{q_1, q_2, \cdots, q_{t-1}} P[q_1 q_2 \cdots q_{t-1}, q_t = i, o_1, o_2, \cdots, o_t \mid \lambda] \qquad (A.10)
The best state sequence is backtracked by another function ψt (j). This function
holds the index of the state at time t − 1, from which the best transition is made
to the current state. The complete algorithm is given below:
1. Initialization:

\delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N \qquad (A.11)

\psi_1(i) = 0 \qquad (A.12)

2. Recursion:

\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}] \, b_j(o_t), \quad 2 \le t \le T \qquad (A.13)

\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) a_{ij}], \quad 1 \le j \le N \qquad (A.14)

3. Termination:

P^* = \max_{1 \le i \le N} [\delta_T(i)] \qquad (A.15)

Q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)] \qquad (A.16)

4. Backtracking:

Q_t^* = \psi_{t+1}(Q_{t+1}^*), \quad t = T-1, T-2, \cdots, 2, 1. \qquad (A.17)
It is clear that Viterbi recursion is similar to that of the forward procedure
except that maximization over previous states is used instead of summation.
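A NumPy sketch of the algorithm follows. For clarity it works with raw probabilities, whereas a practical implementation would work in the log domain to avoid underflow.

```python
import numpy as np

def viterbi(pi, A, B):
    """Viterbi decoding (Eqns A.11-A.17). pi: (N,) initial state probabilities,
    A: (N, N) transition matrix, B: (N, T) state-conditional observation
    likelihoods b_i(o_t). Returns the best state sequence and its probability."""
    N, T = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, 0]                                   # initialisation
    for t in range(1, T):                                     # recursion
        trans = delta[t - 1][:, None] * A                     # delta_{t-1}(i) * a_ij
        psi[t] = np.argmax(trans, axis=0)
        delta[t] = trans[psi[t], np.arange(N)] * B[:, t]
    path = np.zeros(T, dtype=int)
    path[T - 1] = int(np.argmax(delta[T - 1]))                # termination
    for t in range(T - 2, -1, -1):                            # backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path, float(delta[T - 1].max())
```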
A.3 A discriminative training algorithm
This algorithm uses maximum mutual information (MMI) estimation instead of maximum likelihood (ML) estimation for the HMM parameters. Let the training set consist of the U utterances O = (O^1, . . . , O^U) and their corresponding transcriptions W = (w^1, . . . , w^U). The MMI objective function is given by
M(\theta) = \log p_\theta(W \mid \mathbf{O}) = \sum_{u=1}^{U} \log p_\theta(w^u \mid O^u) = \sum_{u=1}^{U} \log \frac{p(w^u)\, p_\theta(O^u \mid w^u)}{\sum_{v=1}^{V} p(v)\, p_\theta(O^u \mid v)} = \sum_{u=1}^{U} \left\{ \log[p(w^u)\, p_\theta(O^u \mid w^u)] - \log \sum_{v=1}^{V} p(v)\, p_\theta(O^u \mid v) \right\}
By applying the approximation

\log \sum_{i} X_i \approx \log \max_{i} X_i \qquad (A.18)

we get the modified objective function

M(\theta) = \sum_{u=1}^{U} \left\{ \log[p(w^u)\, p_\theta(O^u \mid w^u)] - \log \max_{v} [p(v)\, p_\theta(O^u \mid v)] \right\} \qquad (A.19)

M(\theta) \approx \sum_{v=1}^{V} \left\{ \sum_{u \in A_v} \log[p(v)\, p_\theta(O^u \mid v)] - \sum_{u \in B_v} \log[p(v)\, p_\theta(O^u \mid v)] \right\} \qquad (A.20)
where A_v = \{u \mid w^u = v\} and B_v = \{u \mid v = \arg\max_w [p(w)\, p_\theta(O^u \mid w)]\}. The set A_v represents the examples of word v that are correctly recognized as word v, and B_v represents the examples of words other than word v that are wrongly recognized as word v.
Now the objective function becomes

J_v(\theta) = \sum_{u \in A_v} \log[p_\theta(O^u \mid v)] - \lambda \sum_{u \in B_v} \log[p_\theta(O^u \mid v)] \qquad (A.21)

where \lambda takes a value between 0 and 1. A value of zero means that this objective function reduces to the ML objective function.
The two steps in the algorithm are as follows:
• Perform recognition on the training dataset and obtain the Bv sets, and the
objective function Jv (θ).
• Maximize Jv (θ) with respect to θ, and obtain new estimates of the parameters.
The models are initialized with ML parameter estimation and re-estimated
using re-estimation formulae given below.
\bar{a}_{ij} = \frac{\sum_{u \in A_v} \sum_{t=0}^{T} p_\theta(s_t = i, s_{t+1} = j \mid O^u, v) - \lambda \sum_{u \in B_v} \sum_{t=0}^{T} p_\theta(s_t = i, s_{t+1} = j \mid O^u, v)}{\sum_{u \in A_v} \sum_{t=0}^{T} \psi_i^u(t) - \lambda \sum_{u \in B_v} \sum_{t=0}^{T} \psi_i^u(t)}

\bar{c}_{ik} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_i^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} \psi_i^u(t)}

\bar{\mu}_{ikj} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} [o_t^u]_j \, \psi_{ik}^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} [o_t^u]_j \, \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} \psi_{ik}^u(t)}

\bar{\sigma}^2_{ikj} = \frac{\sum_{u \in A_v} \sum_{t=1}^{T} ([o_t^u]_j - \bar{\mu}_{ikj})^2 \, \psi_{ik}^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} ([o_t^u]_j - \bar{\mu}_{ikj})^2 \, \psi_{ik}^u(t)}{\sum_{u \in A_v} \sum_{t=1}^{T} \psi_{ik}^u(t) - \lambda \sum_{u \in B_v} \sum_{t=1}^{T} \psi_{ik}^u(t)}
where

\psi_{ik}^u(t) = p_\theta(s_t = i, g_t = k \mid O^u, v) \qquad (A.22)

\psi_i^u(t) = p_\theta(s_t = i \mid O^u, v) \qquad (A.23)
Appendix B
B.1 Syllabification algorithm
For each word, w in the sentence, do the following:
1. Find the first occurrence of the vowel, let it be at character cf .
2. Starting from the cf , find the next occurrence of the vowel and let it be
character cs , where the position of cs is greater than the position of cf in
that particular word w.
3. If there exists a consonant between cf and cs, then split before the last consonant.

4. If not, then split between cf and cs.
For example, consider a Tamil sentence “aram say virumbu”. It is syllabified
as “/a/ram/ /say/ /vi/rum/bu/”.
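A minimal sketch of these rules is given below. The vowel set for the Latin transliteration is an assumption, and the example at the end reproduces the syllabification quoted above.

```python
VOWELS = set("aeiouAEIOU")  # assumption: vowel symbols of the transliteration

def syllabify(word):
    """Syllabification rules of Appendix B.1 applied to one transliterated word."""
    syllables, start, i = [], 0, 0
    # Step 1: position of the first vowel.
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    while True:
        # Step 2: position of the next vowel after i.
        j = i + 1
        while j < len(word) and word[j] not in VOWELS:
            j += 1
        if j >= len(word):                 # no further vowel: emit the last syllable
            syllables.append(word[start:])
            return syllables
        if j - i > 1:                      # Step 3: consonants between the two vowels,
            split = j - 1                  #         split before the last consonant
        else:                              # Step 4: adjacent vowels,
            split = j                      #         split between them
        syllables.append(word[start:split])
        start, i = split, j

print([syllabify(w) for w in "aram say virumbu".split()])
# [['a', 'ram'], ['say'], ['vi', 'rum', 'bu']]
```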
B.2 Group delay based segmentation algorithm
The group delay function exhibits an additive property. If

H(w) = H_1(w) \cdot H_2(w) \qquad (B.24)

then the group delay function \tau_h(w) can be written as

\tau_h(w) = -\partial(\arg(H(w)))/\partial w \qquad (B.25)
          = \tau_{h1}(w) + \tau_{h2}(w) \qquad (B.26)
From Eqns (B.24) and (B.26), it is clear that a multiplication in the spectral
domain becomes an addition in the group delay domain. In the group delay spectrum of any signal, the peaks (poles) and the valleys (zeros) are resolved properly
only when the signal is a minimum phase signal. The minimum phase signal derived from the magnitude spectrum can be used for segmenting the acoustic speech
signal into syllable-like units.
The group delay function of the minimum phase signal is a better representation than the short term energy function for performing segmentation. In [5], it has been shown that the minimum phase group delay function derived from the short term energy function can be used for segmentation of a speech utterance into syllable-like units. The peaks and valleys of the group delay function correspond to the peaks and valleys in the short term energy function. In general, the number of syllables is equal to the number of voiced segments. In the short term energy function of a syllable segment, the energy is quite high in the voiced region and tapers down at both ends. The consonants present at the ends contribute to local energy fluctuations. These local variations have to be smoothed before the valley points can be considered as syllable boundaries. Fig. 3.9 shows the functional block diagram of the group delay based segmentation algorithm. The algorithm for segmenting the speech signal is given below; a rough code sketch follows the list.
• Let x(n) be the given digitized speech signal of a word utterance.
• Compute the short term energy function E(n) using overlapped windows.
Consider this as the magnitude spectrum of some arbitrary speech signal
and denote it as E(K).
• Symmetrize the magnitude spectrum E(K) and invert it along Y axis. Let
us denote it as Éi (K).
• Compute the inverse DFT of Éi (K) and obtain the root cepstrum, é(m).
The causal portion of é(m) has the properties of the minimum phase signal.
• Compute the minimum phase group delay function of the windowed causal
portion of é(m) and denote it as Égd(K). The window size N_c is given by

N_c = \frac{\text{Length of energy function}}{\text{window scale factor}} \qquad (B.27)
• The location of the peaks in the group delay function, Égd (K) corresponds
to the syllable boundaries, as the signal has been inverted.
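The sketch below follows the listed steps only in spirit; it is not a faithful re-implementation of [5]. The frame sizes, the window scale factor, and the way the minimum phase group delay is computed from the causal root cepstrum are simplifying assumptions, while the peak threshold of 0.25 is the value quoted in Chapter 3.

```python
import numpy as np

def group_delay_segment_boundaries(x, sr, frame_ms=20.0, hop_ms=10.0,
                                   wsf=8, peak_threshold=0.25):
    """Rough sketch of the group delay based segmentation of Appendix B.2.
    Returns approximate syllable boundary positions in samples."""
    frame, hop = int(sr * frame_ms / 1e3), int(sr * hop_ms / 1e3)
    # 1. Short-term energy with overlapped windows.
    energy = np.array([np.sum(x[i:i + frame] ** 2)
                       for i in range(0, len(x) - frame, hop)])
    # 2. Treat the energy as a magnitude spectrum: symmetrise and invert along Y.
    sym = np.concatenate([energy, energy[::-1]])
    inv = sym.max() - sym
    # 3. Root cepstrum via the inverse DFT; keep a windowed causal portion (Eqn B.27).
    cep = np.fft.ifft(inv ** 0.5).real
    nc = max(4, len(energy) // wsf)
    causal = cep[:nc]
    # 4. Group delay of the windowed causal portion: negative derivative of the phase.
    spec = np.fft.fft(causal, n=2 * len(energy))
    phase = np.unwrap(np.angle(spec))
    gd = -np.diff(phase)[:len(energy)]
    gd = gd / (np.abs(gd).max() + 1e-12)     # normalise to [-1, 1]
    # 5. Peaks above the threshold mark the syllable boundaries.
    peaks = [k for k in range(1, len(gd) - 1)
             if gd[k] > gd[k - 1] and gd[k] > gd[k + 1] and gd[k] > peak_threshold]
    return [p * hop for p in peaks]
```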
B.3 DONLabel: an automatic labeling tool for Indian languages
Adapting existing speech synthesis or speech recognition systems to a new language or to a new task requires building the systems on a large labeled database. The bottleneck is not in collecting the database, but in labeling it. Typically,
the labeling process involves defining a speech segment and assigning a suitable
label to it. Manual labeling is not only laborious, but also prone to errors. Thus
an automatic labeling tool is necessary for speech-related systems.
Labeling tool requirement specifications
A labeling tool is a piece of software that segments the speech signal and labels the segments appropriately. To label a large speech corpus, a suitable software environment is
required. The requirements of a labeling tool for Indian languages are as follows:
• automatic labeling of speech corpus - The tool should perform labeling in
an automatic manner.
• handling long audio signals - The software should be able to work with
a limited amount of memory.
• displaying labels in Indian languages - The display of labels in Indian languages is important, so that it helps novice users to use the tool.
• distributing the task - A web-based interface (client) is required to distribute
the labeling task, so that the annotated speech corpus is collected at a centralized place (server).
• simple and easy navigation of the tool is required. For example, users should
be allowed to add, delete or modify segment boundaries by clicking with the
mouse. In addition to viewing the speech signal, an interactive access with
zoom options should also be provided.
• graphical user interface (GUI) should be provided.
The following are the facilities provided by DONLabel:
1. faster automatic segmentation and labeling
2. provides graphical user interface
3. allows segment boundaries to be added or removed or changed
4. currently displays the labels in two Indian languages (Tamil and Hindi)
5. plays a selected segment of speech or the entire speech signal
6. zooms a selected speech segment or the entire speech signal
7. facilities to re-segment the selected portion of the speech signal
DONLabel: design and implementation
DONLabel performs labeling in two steps: (a) speech signal segmentation - the given speech signal is segmented into syllable-like units using the group delay based segmentation, and (b) annotation of syllable segments - the transcription corresponding to the speech signal is syllabified using syllabification rules and the result is used to annotate the syllable speech segments.
The DONLabel frontend is developed in JAVA [28], whereas the backend (core engines) is developed in C [29]. Two versions are available: a stand-alone version and a web-based version. The stand-alone version can be installed on a machine running Linux [30]. A web-based interface is also provided for this tool and is available online at: http://www.lantana.tenet.res.in/apache2-default/donlabel.php.
Architecture of DONLabel
The DONLabel consists of three units namely (a) the input unit, (b) the processing
unit, and (c) the output unit. The functional block diagram of DONLabel is shown
in Fig. B.1.
(a) Input unit - This unit takes four parameters viz. (i) the speech signal to
be labeled, (ii) the transcription corresponding to the given speech signal, (iii) the
Indian language in which the labels have to be displayed, and (iv) the controlled
parameter value (WSF, the window scale factor described in Section B.2) for proper segmentation of the speech signal.
(b) Processing unit - The processing unit is subdivided into (i) the waveform segmentation unit and (ii) the text syllabification unit. In the waveform segmentation unit, the speech signal is segmented into syllable-like units using the group delay based segmentation (see Section B.2). In the text syllabification unit, the
Figure B.1: Functional block diagram of DONLabel.
Table B.1: Illustration of syllabification of a Tamil sentence.

Sentence: ammA inggE vA vA Asay muththam thA thA
⇒ /am/mA/ /ing/gE/ /vA/ /vA/ /A/sai/ /muth/tham/ /thA/ /thA/
⇒ /vc/cv/ /vcc/cv/ /cv/ /cv/ /v/cvc/ /cvcc/ccvc/ /ccv/ /ccv/
transcription corresponding to the speech signal is syllabified using rule-based
methods. The syllabified text is used to annotate the speech segments.
The syllabification rules are given below and an example is illustrated in
Table. B.1.
• No two vowels should occur next to each other in the segment.
• Only one vowel is allowed in the segment, but any number of consonants can
be present.
• A segment consisting of only consonants is not allowed.
(c) Output unit - The output unit consists of three display panels namely (i)
the text panel - displays the labels in Indian languages, (ii) the waveform panel
- displays the waveform of the speech signal, and (iii) the group delay panel displays the group delay function of the speech signal (see Fig. B.2).
Graphical user interface
The GUI facilities provided by DONLabel are listed below:
(a) Audio playback server - In the web-based interface, an audio playback
is provided by the server and controlled through sockets. Hence, users at a remote place can play back the speech signal (server) and hear it on their machines (client). Several clients can concurrently access the sound driver. DONLabel allows the
user to play the entire waveform by pressing the leftmost play button (see Fig.
B.2). It also supports play of a particular segment which is selected by the user.
The selection of a segment is made easy with mouse click and drag operations.
Pressing the second play button from the left then plays the selected
segment (see Fig. B.2).
(b) Sound viewer - The waveform of the speech signal is displayed in a
window (waveform panel). Zoom options are provided to realize scaling of the
speech waveform. While zooming, the alignment of the group delay panels and
the text panels is maintained. DONLabel provides three types of zoom options
(zoom in, zoom out, zoom to fit).
Figure B.2: A screenshot of the labeling tool.
(c) Addition, deletion, or modification of boundary lines - The addition of boundary lines is achieved by clicking the left mouse button. Removal of boundary lines is performed by clicking the right mouse button. DONLabel
allows the user to move the boundary lines. By left clicking on the diamond shape
(see Fig. B.2) and dragging the mouse to a desired location, the boundary lines
can be moved.
Lab file format
Information such as the duration of each segment, along with its label, is stored in the Lab file. The Lab file for a particular speech signal is generated by pressing the save button (see Fig. B.2). The Lab file format is similar to the format used in Emulabel and is shown in Fig. B.3. In addition, the labels can be represented in two Indian languages (Tamil and Hindi).
The first three lines in the Lab file give the header information. The remaining lines contain the durations of the syllable segments. As seen in Fig. B.3, the start and end times of the syllable nak are 0.153 sec and 0.255 sec respectively.
Figure B.3: A sample lab file represented using standard format.
REFERENCES
[1] B. Shneiderman, “The Limits of Speech Recognition,” Communications of
the ACM, vol. 43, no. 9, pp. 63–65, September 2000.
[2] O. Fujimura, “Syllable as a unit of speech recognition,” in IEEE Trans.
Acoust., Speech, Signal Processing, vol. 23, February 1975, pp. 82–87.
[3] J. L. Gauvian, “A syllable-based isolated word recognition experiment,” in
IEEE International Conf. on ICASSP’86, vol. 11, April 1986, pp. 57–60.
[4] T. Nagarajan and H. A. Murthy, “Group delay based segmentation of spontaneous speech into syllable-like units,” in ISCA and IEEE Workshop on
Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003,
pp. 115–118.
[5] T. Nagarajan, V. K. Prasad, and H. A. Murthy, “The minimum phase signal
derived from the magnitude spectrum and its application to speech segmentation,” in Sixth Biennial Conference of Signal Processing and Communications,
July 2001.
[6] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. New Jersey: PTR Prentice Hall, 1993.
[7] K. F. Lee, Automatic Speech Recognition - The Development of the SPHINX
System. Boston: Kluwer Academic Publishers, 1989.
[8] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, second
edition ed. Singapore: A Wiley-Interscience Publication, 2001.
[9] I. Bazzi and J. R. Glass, “Modeling out-of-vocabulary words for robust speech
recognition,” in Proceedings of Int. Conf. Spoken Language Processing, Beijing, China, October 2000, pp. 401–404.
[10] R. Reddy, “Speech recognition by Machines - A Review,” Proc. IEEE, vol. 64,
no. 4, pp. 501–531, April 1976.
[11] “Database for Indian languages,” Speech and Vision lab, IIT Madras, India,
2001.
[12] P. Mermelstein and S. B. Davis, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp.
357–366, August 1980.
[13] G. D. Fornay, “The Viterbi algorithm,” in Proc. IEEE, March 1973, pp. 268–
278.
82
[14] L. R. Bahl, R. Bakis, P. V. de Souza, and R. L. Mercer, “Obtaining candidate words by polling in a large vocabulary speech recognition system,” in
Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 5,
Newyork, April 1988, pp. 489–492.
[15] J. Macias-Guarasa, A. Gallarado, J. Ferreiros, J. M. Pardo, and L. Villarrubia, “Initial evaluation of a preselection module for a flexible large vocabulary
speech recognition system in Telephone environment,” in Proceedings of Int.
Conf. Spoken Language Processing, vol. 1, Philadelphia, October 1996, pp.
1343–1346.
[16] L. R. Rabiner and J. Wilpon, “Isolated word recognition using a two-pass
pattern recognition,” in Proceedings of IEEE Int. Conf. Acoust., Speech, and
Signal Processing, vol. 6, Atlanta, April 1981, pp. 724–727.
[17] J. Miwa and K. Kido, “Speaker-independent word recognition for large vocabulary using pre-selection and non-linear spectral matching,” in Proceedings of
IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 3, Tokyo, April
1986, pp. 2695–2698.
[18] A. B. Yishai and D. Burshtein, “A Discriminative Training Algorithm for
Hidden Markov Models,” IEEE Transactions on Speech and Audio Processing,
vol. 12, no. 3, pp. 204–217, May 2004.
[19] R. Chen, M. Tanaka, D. Wu, L. Olorenshaw, and M. Amador, “A Four
layer sharing HMM system for large vocabulary isolated word recognition,”
in Proceedings ICSLP-98, vol. 2, Sydney, December 1998, pp. 309–313.
[20] A. Lakshmi, “A Syllable based Continuous Speech Recognizer for Indian Languages,” Master’s thesis, Indian Institute of Technology Madras, July 2007.
[21] B. Decadt, J. Duchateau, W. Daelemns, and P. Wambacq, “Transcription
of out-of-vocabulary words in large vocabulary speech recognition based
on phoneme-to-grapheme conversion,” in Proceedings of IEEE Int. Conf.
Acoust., Speech, and Signal Processing, place, month 2002, pp. 861–864.
[22] I. Bazzi and J. Glass, “A multi-class approach for modelling out-of-vocabulary
words,” in Proceedings of Int. Conf. Spoken Language Processing, Denver,
CO, USA, September 2002, pp. 1613–1616.
[23] A. Yazgan and M. Saraclar, “Hybrid language models for out of vocabulary
word detection in large vocabulary conversational speech recognition,” in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, Montreal,
Canada, May 2004, pp. 745–748.
[24] H. Sun, G. Zhang, F. Zheng, and M. Xu, “Using word confidence measure for
OOV words detection in large vocabulary,” in Proceedings of EUROSPEECH,
Geneva, September 2003, pp. 2713–2716.
[25] N. Bach, M. Noamany, I. Lane, and T. Schultz, “Handling OOV words in
Arabic ASR via flexible morphological constraints,” in Proceedings of EUROSPEECH, Antwerp, Belgium, August 2007, pp. 2373–2376.
83
[26] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition.
Jersey: PTR Prentice Hall, 1993.
New
[27] G. D. Fornay, “The Viterbi algorithm,” in Proc. IEEE, March 1973, pp. 268–
278.
[28] “JAVA,” Available Online: http://www.sun.java.com/.
[29] “The C language,” Available Online: http://www.gnu.gcc.org/.
[30] “Fedora Project,” Available Online: http://www.fedora.org/.
LIST OF PUBLICATIONS
1. P. G. Deivapalan and Hema A. Murthy, “A syllable-based IWR for Tamil
handling OOV words,” in National Conference on Communications 2008,
pp. 267-271, IIT Bombay, February 2008.
2. P. G. Deivapalan, Mukund Jha, Rakesh Guttikonda, and Hema A. Murthy,
“DONLABEL: An Automatic Labeling Tool for Indian Languages,” in National Conference on Communications 2008, pp. 263-266, IIT Bombay,
February 2008.