The 13th International Arab Conference on Information Technology (ACIT'2012), Dec. 10-13
ISSN: 1812-0857
Dealing with Emphatic Consonants to Improve the
Performance of Arabic Speech Recognition
Majed Alsabaan*
Iman Alsharhan*
Allan Ramsay*
Hanady Ahmad**
*School of Computer Science, The University of Manchester, UK
**Department of Arabic, Qatar University, Qatar.
Abstract: The Arabic language presents a number of challenges for speech recognition, due in part to the presence of some unique phonemes in its phonetic system. In this paper, we are concerned with developing a speaker-independent, isolated-word Automatic Speech Recognition (ASR) system that can deal with the unique emphatic consonants of the Arabic language. The system is based on Hidden Markov Models (HMMs) and is built using the Hidden Markov Model Toolkit (HTK). The database used for both training and testing consists of 3200 tokens in total, recorded with the help of 20 native Arabic speakers. To make the best use of the data, the research uses 5-fold cross-validation to improve the robustness of the testing. The experimental results show that the overall system accuracy was 22.6% when using the word as the acoustic unit for modelling the speech, which confirms that these unique Arabic consonants are a major source of difficulty in developing Arabic ASR systems. Furthermore, the research finds that by using phonemes as the acoustic units, the accuracy can be improved to 61.4%. The research presents a novel approach to confusability reduction which uses an underspecified phonetic transcription of the input speech. The presented technique reduced the overall system error rate by 23% compared to the baseline system. The research suggests that using this system in language learning would be particularly helpful for non-native speakers in correctly producing the required sounds of the language.
Keywords: Automatic Speech Recognition (ASR), Arabic Speech Recognition, Emphatic Consonants, Phoneme
Confusability, Underspecified transcription.
1. Introduction
Arabic belongs to the Semitic language family, and differences naturally emerge when it is compared with any of the Indo-European languages. One such difference relates to the phonetic system: Arabic has four unique consonants, called emphatic consonants, which have non-emphatic counterparts of a high degree of similarity, especially in their places of articulation. Researchers have confirmed that these distinctive Arabic consonants are a major source of difficulty for developing an Arabic ASR system [7] [15]. In this paper, we attempt to uphold the view that handling Arabic emphatic consonants is among the most challenging tasks in the field of ASR. Hence, the main concern is to find a solution for handling the confusability that results from processing the emphatic sounds. Furthermore, we also aim to highlight the factors that may affect the ASR system's performance, negatively or positively. Therefore, several experiments are carried out using limited vocabularies and speakers who differ in age, gender, and dialect. Phoneme-level and word-level HMM models are also built and compared in this research.
Throughout this paper, we use the symbols of the International Phonetic Alphabet (IPA) to present the Arabic phonemes. The rest of the paper is organised as follows: Section 2 gives a short background to speech recognition in general and to the Arabic language, aiming to identify the aspects that make the language particularly difficult for the recognition process. Section 3 presents related work. Section 4 gives an overview of the experimental framework, beginning with a brief description of the unique Arabic consonants and the database used in training and testing the system, and ending with an overview of the proposed recognition system. Section 5 addresses the results of the conducted experiments. Section 6 presents the discussion, draws the main conclusions of the paper, and gives a concrete proposal for future work on the current topic.
2. Background
2.1 Automatic Speech Recognition (ASR)
ASR is the process of converting an acoustic waveform to a string of words. In other words, it is a technology that allows a computer to identify the words that a person speaks through a microphone or telephone. This technology can make life much easier when employed in applications such as hands-free operation and control (as in cars and airplanes) and interactive voice response [4]. Moreover, it can also be used to help people with disabilities to interact with society. Furthermore, it is an ideal tool for language teaching and testing.
Considering the importance of ASR, several research studies have been conducted with the aim of allowing a computer to recognise words in real time with 100% accuracy.
Obviously, this is a very challenging task, due to various factors. Perhaps the complexity of human language is the most significant challenge for speech recognition: humans use their knowledge of the speaker, the subject, and the language system to predict the intended meaning, while in ASR we only have the speech signal. Languages also exhibit various kinds of ambiguity, which is a problem for any computer-based language application. Many ambiguities arise in ASR, such as homophones (words that are pronounced the same but have different meanings, such as “to”, “too” and “two”) and word-boundary ambiguity, which arises from words that are not clearly delineated in continuous speech [8].
There are numerous speech recognition toolkits; HTK (http://htk.eng.cam.ac.uk), Sphinx (http://www.speech.cs.cmu.edu/sphinx), CSLU, and Julius are examples which are widely used for research purposes.
2.2 Arabic ASR
Developing high-accuracy speech recognition systems
for Arabic faces many hurdles besides the difficulties
mentioned above. The main problem is the enormous dialectal variety: the same word can be pronounced in various ways by speakers of different dialects, even within Modern Standard Arabic (MSA). Take the example of رجل /radʒul/ (meaning “man”), which could be pronounced /radʒul/, /ragul/, or /raʒul/. These regional differences in the pronunciation of MSA can cause recognition errors.
Furthermore, numerous studies have considered the lack of diacritics in Arabic script to be one of the main obstacles to developing Arabic ASR systems [13] [12]. These studies argue that the lack of diacritics causes difficulties in both acoustic and language modelling, which leads to a significant increase in Word Error Rate (WER). The issue here is that the absence of diacritics causes ambiguity in the pronunciation of words. For example, the undiacritised word علم /ʕlm/ can be read as عَلَم /ʕalam/ (meaning “flag”), عَلِم /ʕalim/ (“he knew”), عِلْم /ʕilm/ (“science”), عُلِم /ʕulim/ (“it was known”), عَلَّم /ʕallam/ (“he taught”), or عُلِّم /ʕullim/ (“he was taught”). However, we showed in previous experiments that if the system is trained and tested on non-diacritised text, we obtain a 3.2% decrease in WER compared with fully diacritised text. Our explanation is that by using underspecified text we make the task of the recogniser considerably easier, since we minimise the number of predictions it has to make; for instance, all the different readings of علم /ʕlm/ above are identified as a single word.
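The effect of underspecification on the lexicon can be illustrated with a small sketch (the dictionary entries below are hypothetical, for illustration only):

```python
# Hypothetical mapping from an undiacritised form to its possible readings.
# With diacritics, the recogniser must distinguish six surface forms;
# without them, all six collapse into the single entry "ʕlm".
readings = {
    "ʕlm": ["ʕalam", "ʕalim", "ʕilm", "ʕulim", "ʕallam", "ʕullim"],
}

# Undiacritised lexicon: one entry per underspecified form.
undiacritised_lexicon = set(readings)

# Diacritised lexicon: one entry per fully specified reading.
diacritised_lexicon = {form for forms in readings.values() for form in forms}

print(len(undiacritised_lexicon))  # 1
print(len(diacritised_lexicon))    # 6
```

The recogniser trained on the undiacritised lexicon has one sixth as many candidates to distinguish for this word, which is the source of the WER improvement described above.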
Arabic is also considered one of the most morphologically complex languages; its morphology is commonly described as non-concatenative, since a simple verb root such as ك ت ب /ktb/ (meaning “to write”) can be modified into more than thirty words using processes like infixation and gemination [5]. This has two consequences for Arabic speech recognition. Firstly, the set of utterances required for training a system might have to be enormous to cover the multiple instances of every word we want to recognise, so much larger training sets are needed than are typically used for languages with less complex morphology. Secondly, it results in great lexical ambiguity, which requires an effective disambiguation mechanism to avoid perplexity when processing the language.
On the other hand, achieving a high-performance speech recogniser may be influenced by paralinguistic (i.e. non-linguistic) factors [16], such as the speaker's age and sex, surrounding noise, signal channels, and other factors which make it impossible for a speech signal to repeat itself exactly [1].
Speech recognition systems can be classified according to the types of utterances they can recognise: whether the system is trained on, and can only recognise, speech from a particular person (speaker-dependent) [11] or can recognise speech from people it has not previously encountered (speaker-independent); and whether the system can recognise continuous speech or only isolated words [11]. Systems can also be classified by vocabulary size, from small to large, where a large-vocabulary system has roughly 20,000 to 60,000 words. A given package can fit multiple classes.
ASR systems typically use Hidden Markov Models (HMMs) as the statistical method for speech recognition [10]. This technique essentially recognises speech by determining the probability of each phoneme over successive frames of the speech signal. The fundamental part of a Markov model is the state; a group of states corresponds to the configurations of the speaker's vocal system. A Markov process is a process which moves from state to state depending on the previous n states (see the tutorial at http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html). These states are hidden, as the model's name suggests. An embedded task in an ASR system decodes the current state sequence to explain the speech signal [2]. Afterwards, the target words to which these states belong are modelled as sequences of phonemes, and a search method is applied to match these phonemes with the phonemes of spoken words [4].
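To make the HMM decoding step concrete, here is a minimal Viterbi sketch over a toy two-state model (the states, probabilities, and observation symbols are invented for illustration and are not from the paper):

```python
# Toy 2-state HMM: illustrative numbers only.
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def viterbi(obs):
    # v[t][s]: probability of the best path ending in state s at time t.
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []  # backpointers to the best previous state
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: v[-1][p] * trans[p][s])
            ptr[s] = best_prev
            scores[s] = v[-1][best_prev] * trans[best_prev][s] * emit[s][o]
        v.append(scores)
        back.append(ptr)
    # Trace the best final state back through the pointers.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "b", "b"]))  # ['s1', 's2', 's2']
```

Real recognisers work with per-frame acoustic likelihoods over thousands of context-dependent states, but the search principle is the same.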
Another problem with Arabic is the existence of confusable phonemes, such as the emphatic sounds. These sounds raise two related problems. Firstly, as researchers have confirmed, they are frequently mispronounced, even by native Arabic speakers [15]. Secondly, Arabic makes use of other consonants that have great similarity to the emphatic consonants. It is therefore quite challenging to train a speech recognition system to distinguish between the emphatic consonants and the non-emphatic ones and to recognise each of them accurately, as we will see.
3 Related Work
Despite the fact that Arabic is the fourth language in the world in terms of number of native speakers, the volume of research conducted so far on Arabic ASR is not commensurate with that number. However, the development of Arabic ASR systems for MSA has recently been addressed by a number of researchers. Alotaibi and Muhammad (2010) designed an Arabic phoneme recognition system to investigate the problem of misrecognising pharyngeal and uvular phonemes in MSA, especially by non-native Arabic speakers [7]. Satori et al. (2007) designed a spoken Arabic recognition system based on CMUSphinx-4 from Carnegie Mellon University and applied it to isolated Arabic digits in order to demonstrate the system's adaptability to Arabic speech [15]. Selouani and Caelen (1998) presented an approach for identifying problematic Arabic phonemes in continuous speech based on a mixture of Artificial Neural Networks (ANN). Their study included the four emphatic consonants of Arabic and used time-delay neural networks together with the autoregressive backpropagation algorithm (AR-TDNN). They observed a total failure in recognising the emphatic consonant ض /dˤ/, which they attributed to the speakers' poor ability to utter it correctly, even though they were native speakers, besides the difficulty inherent in the consonant's acoustic properties. Their overall results confirmed that the three designed systems had relatively high error rates in recognising emphatic consonants when compared to vowels, fricatives, plosives, nasals, liquids, and geminated consonants [17].
4 Experimental Work
4.1 Arabic unique consonants
Arabic has four unique emphatic consonants which are not found in most other languages: ض /dˤ/, ط /tˤ/, ص /sˤ/, and ظ /ðˤ/. Each emphatic consonant has a plain equivalent: respectively د /d/, ت /t/, س /s/, and ذ /ð/. The contrast can be shown in minimal pairs (two words with different meanings when only one sound is changed). The emphatic sounds and their non-emphatic counterparts have many features in common; Table 1 provides a description of the two groups in terms of place and manner of articulation.
Table 1 Emphatic and non-emphatic sounds
As illustrated in Table 1, there are no significant differences between the two groups. The similarity between each emphatic consonant and its non-emphatic counterpart was noted early by the famed Arab grammarian Siybawayh (796 A.D.), who said that the difference between the two groups lies in the place of articulation, the emphatics having two places of articulation. He called the emphatic consonants الحروف المطبقة (Alhuruf AlmuTbaqa), 'covered letters', because they are produced with the tongue covering the area extending from the main place of articulation towards the palate. This has been verified by modern studies, which confirm that the main articulatory difference between the two groups is that the articulation of the emphatic consonants involves a secondary articulation in which the tongue root is retracted, narrowing the upper portion of the pharynx, accompanied by a retraction in the lower part of the pharynx's interior wall [9] [3]. Figure 1 shows the tongue configuration during the articulation of the emphatic consonants and of their non-emphatic counterparts, based on the acoustic description presented in [9] and [3].
Figure 1 Tongue position during the articulation of the emphatic (Cˤ) and non-emphatic (C) consonants
4.2 Training and testing
A set of real Arabic words was chosen, half of which contain emphatic consonants while the other half contain their non-emphatic counterparts. The data contains a set of words in which the emphatic and non-emphatic consonants occur within the same context (minimal pairs), as well as a set of words chosen according to the position of the confusable phoneme (initial, medial, or final). The speakers were asked to repeat each word 5 times. The total number of utterances is 3200 tokens, collected from 20 native Arabic speakers of different nationalities: three from Kuwait, three from Egypt, three from Syria, two from Saudi Arabia, two from Yemen, two from Jordan, two from Sudan, one from Bahrain, one from Iraq, and one from Palestine. Twelve of these speakers were male and eight were female, all aged between 25 and 35.
To ensure reliable results, we used a 5-fold cross-validation approach to assess the proposed system. This involves randomly partitioning the data into 5 equal-sized subsets, performing the training on 4 subsets and validating the performance on the remaining subset. This process is repeated 5 times, with each of the 5 subsets used exactly once as the validation data. The results from the 5 folds are then averaged to compute a single estimate. The advantage of this method over the standard evaluation approach is that all observations are used for both training and testing, which gives more robust testing for experiments with small data sets.
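The 5-fold procedure above can be sketched as follows (generic illustration; `evaluate` stands in for the actual HTK training-and-testing run):

```python
import random

def five_fold_cv(tokens, evaluate, seed=0):
    """Randomly partition tokens into 5 equal folds; train on 4, test on 1,
    rotating the held-out fold, and average the 5 accuracy estimates."""
    rng = random.Random(seed)
    data = tokens[:]
    rng.shuffle(data)
    folds = [data[i::5] for i in range(5)]
    scores = []
    for i in range(5):
        test = folds[i]
        train = [t for j, f in enumerate(folds) if j != i for t in f]
        scores.append(evaluate(train, test))
    return sum(scores) / 5.0

# Toy usage: "evaluate" would normally train and test the HMMs;
# here it just returns the held-out share of the 3200 tokens.
tokens = list(range(3200))
avg = five_fold_cv(tokens, lambda train, test: len(test) / len(tokens))
print(avg)  # each fold holds 1/5 of the data, so the average is 0.2
```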
The recognition platform used throughout all the experiments is based on HMMs, using the Hidden Markov Model Toolkit (HTK), a portable toolkit for building and manipulating HMMs developed at Cambridge University (http://htk.eng.cam.ac.uk). HTK has been one of the most successful tools used in speech recognition research over the last two decades.
Two native Arabic annotators were asked to identify the words used in the system, to see whether recognising words that contain emphatics is difficult for both machines and humans or only for machines. The experiment showed that all the utterances were recognised successfully by the annotators. The word /dˤarb/ was recognised correctly despite some of the speakers pronouncing it as /ðˤarb/; this is because humans draw on their knowledge of the language in addition to their hearing ability. In general, speech recognisers depend mainly on the acoustic input to identify the phonemes, without any knowledge about the speaker or the language, which makes their task very challenging compared to humans. Figures 2 and 3 illustrate the spectrogram representations of the words /dˤarb/ and /darb/.
Figure 2 Spectrogram representation for the word /dˤarb/
Figure 3 Spectrogram representation for the word /darb/
In this research, we conducted five main
experiments. The first two experiments will show
how using different speech units (phoneme level
or word level) can affect the accuracy of the
recognition. In the third and fourth experiments,
the system was trained and tested by only male
speakers and only female speakers with the same
size of data set. The aim of this division is to
determine the influence of having the same
gender speakers on the recognition accuracy.
From these experiments, we analysed the confusion matrix to identify the phonemes the recogniser confuses most often. The final experiment was aimed at assessing the proposed solution to the problem of confusable phonemes.
4.3 Recognition system overview
The system developed in this paper is designed to
recognise Arabic phonemes with respect to the
emphatic consonants using HTK. This includes
two major processing stages as shown in Figures 4
and 5.
The training phase is concerned with initialising the HMMs and estimating their parameters by maximum likelihood. It requires the speech data to be supplemented by its phonetic transcription. The structure of this process is presented in Figure 4. A set of HTK tools is used for the purpose of building a well-trained set of HMMs.
Figure 4 Training stage
The second stage is the testing stage, whose function is to match the input speech against a network of HMMs and return a transcription for each speech signal. This process is shown in Figure 5.
Figure 5 Testing stage
5 Results
The results reported here are based on the outcomes of
the Arabic ASR system described above. The results
are presented in four subsections. The first subsection
gives the outcome of word and phoneme level
experiments. The second subsection discusses the
results of the gender dependent experiments. The third
subsection reviews the confusable phonemes noted
from these experiments. The final subsection presents the results of applying the underspecified transcription method in the recogniser.
5.1 Phone level and word level experiment
HTK, like most speech recognition systems, requires a dictionary and a grammar to constrain what can be said. However, there is a question about what the terminal symbols of the grammar should be: words or phonemes?
Two experiments were carried out with twenty speakers. The first experiment is at the word level, using a grammar with words as terminal symbols. The second experiment is at the phone level, using a grammar with phonemes as terminal symbols. At the phone level, HTK treats each phoneme as a word. The recognition accuracies were 61.4% and 22.6% at the phoneme level and word level respectively. This result supports the finding that phoneme-level HMM models are superior for limited-vocabulary ASR systems [6]. For this reason, the phoneme level is used as the baseline system for the subsequent experiments.
5.2 Male and female speakers' experiments
Two experiments were carried out to investigate the effect of speaker gender on recognition accuracy. The first experiment was performed with eight male speakers; HTK recognised 103 words out of 256, with an accuracy of 62.9%. The second experiment was performed with
eight female speakers, and the HTK gave an accuracy
of 66.9%.
5.3 Confusability of emphatic consonants
By analysing the confusion matrix resulting from the phoneme-level experiment, we can confirm that the emphatic sounds are a major source of difficulty in recognising Arabic speech. Apart from ص /sˤ/ and its non-emphatic counterpart س /s/, it is apparent from Table 2 that recognition of the emphatic sounds and their non-emphatic counterparts is below 50 per cent. The poorest accuracy was found for the ض /dˤ/ sound (20%), which was confused most often (16.2% of the time) with the emphatic sound ظ /ðˤ/.
Table 2 Confusion matrix for emphatic sounds
This confusion may result from the fact that most Arabic speakers tend to pronounce the sound ض /dˤ/ as /ð/ while speaking, as researchers confirm [14]. It can also be observed that the /d/ sound shows relatively high confusion with the sound /t/ and vice versa. This might be due to the fact that these two sounds share all features except that /d/ is voiced and /t/ is voiceless.
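The confusion analysis above boils down to counting, for each reference phoneme, how often the recogniser outputs each hypothesis phoneme. A minimal sketch with invented alignment pairs (not the paper's data):

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """pairs: iterable of (reference_phoneme, recognised_phoneme) decisions."""
    matrix = defaultdict(Counter)
    for ref, hyp in pairs:
        matrix[ref][hyp] += 1
    return matrix

# Toy aligned decisions illustrating /dˤ/ being confused with /ðˤ/.
pairs = [("dˤ", "dˤ"), ("dˤ", "ðˤ"), ("dˤ", "ðˤ"), ("dˤ", "d"),
         ("d", "d"), ("d", "t")]
m = confusion_matrix(pairs)

# Fraction of /dˤ/ tokens recognised as /ðˤ/.
rate = m["dˤ"]["ðˤ"] / sum(m["dˤ"].values())
print(round(rate, 2))  # 0.5
```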
6.1 Future Work
Table 2 Confusion matrix for emphatic sounds
The purpose of the current study was to investigate the factors that affect recognition outcomes and to train the speech recogniser, with respect to these factors, to identify the Arabic emphatic consonants correctly. However, this investigation was limited by the small size of the training data. Therefore, further experimental investigations are needed to confirm these findings and to determine how much data is required to establish a greater degree of accuracy on this matter.
Moreover, the methods used for recognising the emphatic consonants may be applied to other confusable phonemes, such as vowels and the pharyngeal and uvular consonants, which can also be hurdles for developing a speech recognition system.
The findings of this study also have a number of important applications for future practice. One possible application is to help language learners produce sounds correctly: using a graphical representation, learners can be shown an animation of the vocal tract illustrating how the sounds were articulated and how they should be produced correctly.
5.4 Using underspecified phonetic
transcription
As a technique for handling the investigated problem, the research uses HTK to acquire an underspecified phonetic representation of the input speech. This is done by blurring the difference between the emphatic sounds and their non-emphatic counterparts, using one symbol to represent both the emphatic and the non-emphatic consonant. This makes the task of the recogniser considerably easier: it has fewer distinctions to make, so it has less chance of making a mistake. This process should then be followed by a linguistic analysis that extracts the intended content of the utterance from the underspecified transcription.
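The merging step can be sketched as follows (the symbol names and the simple list representation are our illustration; the actual HTK label handling is more involved):

```python
# Map each emphatic consonant onto its plain counterpart so that both are
# represented by a single underspecified symbol in the transcription.
MERGE = {"dˤ": "d", "tˤ": "t", "sˤ": "s", "ðˤ": "ð"}

def underspecify(phonemes):
    return [MERGE.get(p, p) for p in phonemes]

print(underspecify(["dˤ", "a", "r", "b"]))  # ['d', 'a', 'r', 'b']
print(underspecify(["d", "a", "r", "b"]))   # ['d', 'a', 'r', 'b']
# Both /dˤarb/ and /darb/ now share one transcription, so the recogniser
# no longer has to make this error-prone distinction; a later linguistic
# analysis step recovers the intended word.
```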
The findings of this study have, for instance, a
number of important applications for future practice.
One of the possible applications is to help the
language learners to produce sounds correctly. By
using a graphical representation, the learners can be
given an animation of the vocal tract on how the
sounds have been articulated, and how they are to be
correctly produced.
Using this technique, we measured a 23% relative decrease in word error rate (WER).
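For clarity, a 23% relative reduction means the new WER is 77% of the baseline WER. The sketch below shows the arithmetic; the 38.6% baseline figure is our assumption (100 minus the 61.4% phoneme-level accuracy), used only for illustration:

```python
def relative_reduction(baseline_wer, new_wer):
    # Fraction of the baseline error that was eliminated.
    return (baseline_wer - new_wer) / baseline_wer

def apply_reduction(baseline_wer, relative):
    # WER remaining after a given relative reduction.
    return baseline_wer * (1.0 - relative)

# Hypothetical illustration: if the baseline WER were 38.6%,
# a 23% relative reduction would give:
print(round(apply_reduction(38.6, 0.23), 2))  # 29.72
```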
6 Discussion and Conclusion
In this research, the Arabic emphatic consonants were investigated from the ASR point of view. The research confirms that these consonants are hard to recognise, and found that the /dˤ/ consonant is the most confusable of the emphatics, which is attributed to acoustic confusability as well as the difficulty that native speakers have in pronouncing this sound. A novel technique was applied to enhance the performance of the speech recognition system; its main idea is to use an underspecified transcription that blurs the differences between confusable sounds. The research found that using this technique on limited-vocabulary systems gives promising results. The research also suggests training HTK on a larger vocabulary containing these confusable phonemes in order to limit the confusion. Moreover, it has been shown that using phonemes as the acoustic unit gives better results than using words, a finding which supports previous research indicating that phoneme-level HMM models are superior for limited-vocabulary ASR [6]. Finally, the research sheds light on the influence of gender variation, as it found that female speakers perform better than male speakers; further studies with larger data sets are needed to confirm this finding.
References
[1] Čeidaitė G. and Telksnys L., “Analysis of Factors Influencing Accuracy of Speech Recognition,” Electronics and Electrical Engineering, Kaunas: Technologija, vol. 105, no. 9, pp. 69-72, 2010.
[2] Çömez M.A., “Large vocabulary continuous speech recognition for Turkish using HTK,” Master's thesis, Department of Electrical and Electronics Engineering, Middle East Technical University, 2003.
[3] Al-Ani S.H., Arabic Phonology: An Acoustical and Physiological Investigation, Mouton and Co., The Hague, Netherlands, 1970.
[4] Ali M., Elshafei M., Al-Ghamdi M., Al-Muhtaseb H., and Al-Najjar A., “Generation of Arabic phonetic dictionaries for speech recognition,” International Conference on Innovations in Information Technology, pp. 59-63, 2008.
[5] Al Najem S.R.J., “An Exploration of Computational Arabic Morphology,” PhD thesis, Department of Language and Linguistics, University of Essex, England, 1998.
[6] Alotaibi Y.A., “Is Phoneme Level Better than Word Level for HMM Models in Limited Vocabulary ASR Systems?,” Seventh IEEE International Conference on Information Technology, pp. 332-337, 2010.
[7] Alotaibi Y.A. and Muhammad G., “Study on pharyngeal and uvular consonants in foreign accented Arabic for ASR,” Computer Speech & Language, vol. 24, no. 2, pp. 219-231, 2010.
[8] Forsberg M., “Why is Speech Recognition Difficult?,” Chalmers University of Technology, 2003.
[9] Hess S., “Pharyngeal articulations in Akan and Arabic,” Ms., UCLA, Los Angeles, Calif., 1990.
[10] Huang X. et al., Spoken Language Processing, Prentice Hall, Englewood Cliffs, New Jersey, USA, 2001.
[11] Jurafsky D., Martin J.H., and Kehler A., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2009.
[12] Kirchhoff K. et al., “Novel approaches to Arabic speech recognition: report from the 2002 Johns-Hopkins Summer Workshop,” Acoustics, Speech, and Signal Processing, vol. 1, pp. I-344, 2003.
[13] Massaro D.W. and Light J., “Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/,” Proceedings of the 8th European Conference on Speech Communication and Technology, pp. 1-4, 2003.
[14] Massaro D.W., Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, The MIT Press, 1998.
[15] Satori H., Harti M., and Chenfour N., “Introduction to Arabic Speech Recognition Using CMUSphinx System,” arXiv preprint arXiv:0704.2083, 2007.
[16] Schötz S., “Linguistic & Paralinguistic Phonetic Variation in Speaker Recognition & Text-to-Speech Synthesis,” GSLT: Speech Technology, 2002.
[17] Selouani S.A. and Caelen J., “Arabic phonetic features recognition using modular connectionist architectures,” Proceedings of the IEEE Interactive Voice Technology for Telecommunications Applications, pp. 155-160, 1998.