The 13th International Arab Conference on Information Technology ACIT'2012, Dec. 10-13, ISSN: 1812-0857

Dealing with Emphatic Consonants to Improve the Performance of Arabic Speech Recognition

Majed Alsabaan*, Iman Alsharhan*, Allan Ramsay*, Hanady Ahmad**
*School of Computer Science, The University of Manchester, UK
**Arabic Department, Qatar University, Qatar

Abstract: The Arabic language presents a number of challenges for speech recognition, due in part to the presence of some unique phonemes in its phonetic system. In this paper, we are concerned with developing a speaker-independent, isolated-word Automatic Speech Recognition (ASR) system that can deal with the unique emphatic consonants of the Arabic language. The system is based on Hidden Markov Models (HMMs) and is built using the Hidden Markov Model Toolkit (HTK). The database used for both training and testing consists of 3200 tokens in total, recorded with the help of 20 native Arabic speakers. To make the best use of the data, the research uses 5-fold cross-validation in order to improve the robustness of the testing. The experimental results show that overall system accuracy was 22.6% when the word was used as the acoustic unit for modelling the speech, which confirms that these unique Arabic consonants are a major source of difficulty in developing Arabic ASR systems. Furthermore, the research finds that by using phonemes as the acoustic models the accuracy rate can be improved to 61.4%. The research presents a novel approach to confusability reduction which uses an underspecified phonetic transcription of the input speech; the presented technique reduced the overall system error rate by 23% compared to the baseline system. The research suggests that using such a system in language learning would be particularly helpful to non-native speakers in producing the required sounds of the language correctly.
Keywords: Automatic Speech Recognition (ASR), Arabic Speech Recognition, Emphatic Consonants, Phoneme Confusability, Underspecified Transcription.

1. Introduction

Arabic is part of the Semitic language family, and differences naturally emerge when it is compared with any of the Indo-European languages. One such difference concerns the phonetic system: Arabic has four unique consonants, called emphatic consonants, which have non-emphatic counterparts of a high degree of similarity, especially in their places of articulation. Researchers have confirmed that these distinctive Arabic consonants are a major source of difficulty in developing an Arabic ASR system [7] [15]. In this paper, we attempt to uphold the view that the Arabic emphatic consonants are among the most challenging problems in the field of ASR. Hence, the main concern is to find a solution for handling the confusability that results from processing the emphatic sounds. Furthermore, we also aim to highlight the factors that may affect an ASR system's performance negatively or positively. To this end, several experiments are carried out using limited vocabularies and different speakers (in age, gender, and dialect). Phoneme-level and word-level HMM models are also built and compared in this research.

Throughout this paper, we use the symbols of the International Phonetic Alphabet (IPA) to represent the Arabic phonemes. The rest of the paper is organised as follows: Section 2 gives a short background to speech recognition in general and to the Arabic language, aiming to identify the aspects that make the language particularly difficult for the recognition process. Section 3 presents related work. Section 4 gives an overview of the experimental framework, beginning with a brief description of the unique Arabic consonants and the database used in training and testing the system, and ending with an overview of the proposed recognition system. Section 5 addresses the results of the conducted experiments. Section 6 presents the discussion, draws the main conclusions of the paper, and gives a concrete proposal for future work on the current topic.

2. Background

2.1 Automatic Speech Recognition (ASR)

ASR is the process of converting an acoustic waveform into a string of words. In other words, it is a technology that allows a computer to identify the words that a person speaks into a microphone or telephone. This technology can make life much easier when employed in applications such as hands-free operation and control (as in cars and airplanes) and interactive voice response [4]. Moreover, it can be used to help people with disabilities to interact with society, and it is well suited to language teaching and testing.

Considering the importance of ASR, several research studies have aimed at allowing a computer to recognise words in real time with 100% accuracy. Obviously, this is a very challenging task, due to various factors. Perhaps the complexity of human language is the most significant challenge for speech recognition. Humans usually use their knowledge of the speaker, the subject, and the language system to predict the intended meaning, while in ASR we only have the speech signal. Languages exhibit, for instance, various kinds of ambiguity, which is a problem for any computer-related language application. Many kinds of ambiguity arise in ASR, such as homophones (words that are pronounced the same but have different meanings, such as "to", "too" and "two") and word-boundary ambiguity arising from words that are not clearly delineated in continuous speech [8]. There are numerous speech recognition toolkits; HTK, Sphinx, CSLU, and Julius are examples that are widely used for research purposes.
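The two ambiguity sources just mentioned can be made concrete with a toy pronunciation dictionary (a hypothetical Python sketch; the phone strings are illustrative and not taken from the paper's data):

```python
# Toy illustration (hypothetical data) of two ambiguity sources in ASR.
# A pronunciation dictionary maps words to phone sequences.
PRON = {
    "to":   ["t", "uw"],
    "too":  ["t", "uw"],
    "two":  ["t", "uw"],
    "a":    ["ah"],
    "an":   ["ah", "n"],
    "ice":  ["ay", "s"],
    "nice": ["n", "ay", "s"],
}

def homophones(phones):
    """Words whose pronunciation matches the given phone sequence exactly."""
    return sorted(w for w, p in PRON.items() if p == phones)

def segmentations(phones, prefix=()):
    """All ways of splitting a phone sequence into dictionary words
    (word-boundary ambiguity in continuous speech)."""
    if not phones:
        yield list(prefix)
        return
    for word, pron in PRON.items():
        if phones[: len(pron)] == pron:
            yield from segmentations(phones[len(pron):], prefix + (word,))

print(homophones(["t", "uw"]))                      # ['to', 'too', 'two']
print(sorted(segmentations(["ah", "n", "ay", "s"])))  # [['a', 'nice'], ['an', 'ice']]
```

With acoustics alone, both readings of /ah n ay s/ are equally valid; only higher-level knowledge of the kind humans apply can choose between them.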
2.2 Arabic ASR

Developing high-accuracy speech recognition systems for Arabic faces many hurdles besides the general difficulties mentioned above. The main problem is the enormous dialectal variety: the same word can be pronounced in various ways by speakers of different dialects, even within MSA. Take the example of رَجُل /radʒul/ (meaning "man"), which could be pronounced /radʒul/, /ragul/, or /raʒul/. These regional differences in the pronunciation of MSA can cause recognition errors. Furthermore, numerous studies have considered the lack of diacritics in Arabic script to be one of the main obstacles to developing Arabic ASR systems [13] [12]. These studies argue that the lack of diacritics causes difficulties in both acoustic and language modelling, which leads to a significant increase in Word Error Rate (WER). The issue is that the absence of diacritics makes the pronunciation of words ambiguous. For example, the word علم /ʕlm/, when diacritised, can be: عَلَم /ʕalam/ (meaning "flag"), عَلِم /ʕalim/ (meaning "he knew"), عِلْم /ʕilm/ (meaning "science"), عُلِم /ʕulim/ (meaning "it was known"), عَلَّم /ʕallam/ (meaning "he taught") or عُلِّم /ʕullim/ (meaning "he was taught"). However, we showed in previous experiments that if the system is trained and tested on non-diacritised text, we obtain a 3.2% decrease in WER compared with fully diacritised text. Our explanation is that by using underspecified text we make the task of the recogniser considerably easier, since we minimise the number of distinctions it has to make: all the different forms of the example above (علم /ʕlm/) can be identified as a single word. Arabic is also considered one of the most morphologically complex languages; its morphology is commonly described as non-concatenative.
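The gain from underspecification described above can be pictured with a small sketch (hypothetical code, not part of the paper's system): deleting the short vowels that the diacritics encode, and collapsing gemination, maps all six readings of علم onto a single lexical key.

```python
# Hypothetical sketch of underspecification: collapsing short vowels
# (the information carried by diacritics) and gemination maps all six
# diacritised readings of <ʕlm> onto one lexical key.
SHORT_VOWELS = set("aiu")

def underspecify(phonemic):
    """Drop short vowels and degeminate, yielding a diacritic-free key."""
    consonants = [c for c in phonemic if c not in SHORT_VOWELS]
    key = []
    for c in consonants:
        if not key or key[-1] != c:   # collapse gemination, e.g. ll -> l
            key.append(c)
    return "".join(key)

readings = ["ʕalam", "ʕalim", "ʕilm", "ʕulim", "ʕallam", "ʕullim"]
assert {underspecify(r) for r in readings} == {"ʕlm"}
```

A recogniser working with such keys has a single target where a fully diacritised lexicon has six, at the cost of deferring the disambiguation to a later linguistic-analysis step.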
A simple verb root such as كتب /ktb/ (meaning "to write") can be modified into more than thirty word forms in a non-concatenative manner, using processes such as infixation and gemination [5]. This has two consequences for Arabic speech recognition. Firstly, the set of utterances required for training a system may have to be enormous to cover the multiple instances of every word we want to recognise, so much larger training sets are needed than are typically used for languages with less complex morphology. Secondly, it results in great lexical ambiguity, which requires an effective disambiguation mechanism to avoid perplexity when processing the language.

Another problem with Arabic is the existence of confusable phonemes such as the emphatic sounds. These sounds raise two related problems. Firstly, as researchers have confirmed, they are mispronounced much of the time, even by native Arabic speakers [15]. Secondly, Arabic makes use of other consonants that are very similar to the emphatic consonants. As we will see, it is quite challenging to train a speech recognition system to distinguish between the emphatic consonants and the non-emphatic ones and to recognise each of them accurately.

On the other hand, the performance of a speech recogniser may also be influenced by paralinguistic (i.e. non-linguistic) factors [16] such as the speaker's age and sex, surrounding noise, signal channels, and other factors which make it impossible for a speech signal to repeat itself exactly [1]. Speech recognition systems can be classified according to the types of utterances they can recognise: whether the system is trained on, and can only recognise, speech from a particular person (speaker-dependent) [11], or can recognise speech from people whose speech it has not previously encountered (speaker-independent); and whether the system is able to recognise continuous speech or only isolated words [11]. Systems can also be classified by the size of their vocabulary, from small to large, where a large-vocabulary system has roughly 20,000 to 60,000 words. A given package can, however, fit multiple classes.

ASR systems typically use the Hidden Markov Model (HMM) as a statistical method for speech recognition [10]. This technique essentially recognises speech by determining the probability of each phoneme at successive frames of the speech signal. The fundamental part of the Markov model is the state; a group of states corresponds to the positions of the speaker's vocal system. A Markov process is a process which moves from state to state depending on the previous n states. These states are hidden (as the model's name indicates). An embedded task in an ASR system decodes the current state sequence that explains the speech signal [2]. The target words to which these states belong are then modelled as sequences of phonemes, and a search method is applied to match these phonemes against the phonemes of the spoken words [4].

3 Related Works

Despite the fact that Arabic is the fourth language in the world in terms of the number of native speakers, the volume of research conducted so far on Arabic ASR is not commensurate with that large number of speakers. However, the development of Arabic ASR systems for MSA has recently been addressed by a number of researchers. Alotaibi and Muhammad (2010) designed an Arabic phoneme recognition system to investigate the problem of misrecognising pharyngeal and uvular phonemes in MSA, especially by non-native Arabic speakers [7]. Satori et al. (2007) designed a spoken Arabic recognition system based on CMU Sphinx-4 from Carnegie Mellon University and applied it to isolated Arabic digits in order to demonstrate the adaptability of the system to Arabic speech [15]. Selouani and Caelen (1998) presented an approach for identifying problematic Arabic phonemes in continuous speech based on a mixture of Artificial Neural Networks (ANNs). Their study included the four emphatic consonants of Arabic and used time-delay neural networks together with the autoregressive backpropagation algorithm (AR-TDNN). They observed total failure in recognising the emphatic consonant ض /dɭ/, which they attributed to the poor ability of the speakers to utter it correctly, even though they were native speakers, besides the difficulty inherent in the consonant's acoustic properties. Their overall results confirmed that the three designed systems had relatively high error rates in recognising emphatic consonants compared with vowels, fricatives, plosives, nasals, liquids, and geminated consonants [17].

4 Experimental Work

4.1 Arabic unique consonants

Arabic has some unique consonants which do not exist in other languages. In particular, there are four emphatic consonants in Arabic: ض /dɭ/, ط /tɭ/, ص /sɭ/, and ظ /ðɭ/. Each emphatic consonant has a plain equivalent consonant, respectively د /d/, ت /t/, س /s/, and ذ /ð/, as can be shown with minimal pairs (two words with different meanings that differ in only one sound). The emphatic sounds and their non-emphatic counterparts have many features in common; Table 1 describes the two groups in terms of place and manner of articulation.

Table 1 Emphatic and non-emphatic sounds

As illustrated in Table 1, there are no significant differences between the two groups. The similarity between each emphatic consonant and its non-emphatic counterpart was noted early by the famed Arab grammarian Sibawayh (796 A.D.), who said that the difference between the two groups lies in the place of articulation, the emphatics having two places of articulation. He called the emphatic consonants الحروف المطبقة (Alhuruf AlmuTbaqa), 'covered letters', because they are produced with the tongue covering the area extending from the main place of articulation towards the palate. This has been verified by modern studies, which confirm that the main articulatory difference between the two groups is that the articulation of the emphatic consonants involves a secondary articulation in which the tongue root retracts, resulting in a narrowing of the upper portion of the pharynx accompanied by a retraction of the lower part of the pharynx's interior wall [9] [3]. Figure 1 shows the tongue configuration during the articulation of the emphatic consonants and of their non-emphatic counterparts, based on the acoustic description presented in [9] and [3].

Figure 1 Tongue position during the articulation of the emphatic (Cɭ) and non-emphatic consonants (C)

Two native Arabic annotators were asked to identify the words used in the system, to see whether recognising the words that contain emphatics is difficult for both machines and humans or only for machines. The experiment showed that all the utterances were recognised successfully by the annotators. The word /dɭarb/ was recognised correctly despite some of the speakers pronouncing it as /ðɭarb/, because humans draw on their knowledge of the language as well as their hearing. Speech recognisers, in contrast, depend mainly on the acoustic input to identify the phonemes, without any knowledge of the speaker or the language, which makes their task very challenging compared to humans'. Figures 2 and 3 illustrate the spectrogram representations of the words /dɭarb/ and /darb/.

Figure 2 Spectrogram representation for the word /dɭarb/

Figure 3 Spectrogram representation for the word /darb/

4.2 Training and testing

A set of real Arabic words was chosen, half of which contain emphatic consonants and half their non-emphatic counterparts. The data contains a set of words in which the emphatic and non-emphatic consonants occur within the same context (minimal pairs), as well as a set of words chosen according to the position of the confusable phoneme (initial, medial, final). The speakers were asked to repeat each word 5 times. The total number of utterances is 3200 tokens, collected from 20 native Arabic speakers of different nationalities: three from Kuwait, three from Egypt, three from Syria, two from Saudi Arabia, two from Yemen, two from Jordan, two from Sudan, one from Bahrain, one from Iraq, and one from Palestine. Twelve of the speakers were male and eight were female, all aged between 25 and 35.

To ensure reliable results, we used a 5-fold cross-validation approach to assess the proposed system. This involves randomly partitioning the data into 5 equal-sized subsets, performing the training on 4 subsets and validating the performance on the remaining subset. The process is repeated 5 times, with each of the 5 subsets used exactly once as the validation data. The results from the 5 folds are then averaged to compute a single estimate. The advantage of this method over the standard evaluation approach is that all observations are used for both training and testing, which gives more robust testing for experiments with small data sets.

The recognition platform used throughout all the experiments is based on HMMs, using the Hidden Markov Model Toolkit (HTK). HTK is a portable toolkit for building and manipulating HMMs, developed at Cambridge University (http://htk.eng.cam.ac.uk/), and is considered one of the most successful tools used in speech recognition research over the last two decades. In this research, we conducted five main experiments.
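The 5-fold protocol just described can be sketched in a few lines (a hypothetical illustration; HTK's actual training and decoding tools are stood in for by caller-supplied functions):

```python
# Sketch of 5-fold cross-validation as described in Section 4.2.
# train_fn and eval_fn are placeholders for the real HTK training
# and decoding steps; they are not part of the paper's toolchain.
import random

def k_fold_splits(data, k=5, seed=0):
    """Partition the data into k roughly equal folds and yield one
    (train, test) pair per fold, so every item is tested exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)          # random partitioning
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(data, train_fn, eval_fn, k=5):
    """Average the per-fold scores into a single estimate."""
    scores = [eval_fn(train_fn(tr), te) for tr, te in k_fold_splits(data, k)]
    return sum(scores) / k
```

With 3200 tokens and k = 5, each fold holds 640 tokens, and the reported accuracy is the mean of the five validation scores.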
The first two experiments show how using different speech units (phoneme level or word level) affects the accuracy of the recognition. In the third and fourth experiments, the system was trained and tested with only male speakers and with only female speakers, using data sets of the same size; the aim of this division is to determine the influence of same-gender speakers on recognition accuracy. From these experiments, we analysed the confusion matrix to identify the phonemes most often confused by the recogniser. The final experiment assessed the proposed solution to the problem of confusable phonemes.

4.3 Recognition system overview

The system developed in this paper is designed to recognise Arabic phonemes, with particular attention to the emphatic consonants, using HTK. It comprises two major processing stages, as shown in Figures 4 and 5. The training stage is concerned with initialising the HMM model parameters and estimating their maximum-likelihood values; it requires the speech data together with its phonetic transcription. The structure of this process is presented in Figure 4, and a set of HTK tools is used to build a well-trained set of HMMs.

Figure 4 Training stage

The second stage is the testing stage, whose function is to match the input speech against a network of HMMs and return a transcription for each speech signal. This process is shown in Figure 5.

Figure 5 Testing stage

5 Results

The results reported here are based on the outcomes of the Arabic ASR system described above and are presented in four subsections. The first gives the outcome of the word-level and phoneme-level experiments. The second discusses the results of the gender-dependent experiments. The third reviews the confusable phonemes observed in these experiments. The final subsection presents the result of applying the underspecified transcription method in the recogniser.

5.1 Phone level and word level experiment

HTK, like all speech recognition systems, requires a dictionary and a grammar to constrain what can be said. However, there is a question about what the terminal symbols of the grammar should be: words or phonemes? Two experiments were carried out with twenty speakers. The first was at the word level, using a grammar with words as terminal symbols. The second was at the phone level, using a grammar with phonemes as terminal symbols; at the phone level, HTK treats each phoneme as a word. The recognition accuracies were 61.4% and 22.6% at the phoneme level and word level respectively. This result supports the finding that using phonemes as the HMM models is superior for limited-vocabulary ASR systems [6]. For this reason, the phoneme level is used as the baseline system for the subsequent experiments.

5.2 Male and female speakers' experiments

Two experiments were carried out to investigate the effect of gender on recognition accuracy. The first experiment was performed with eight male speakers, for whom HTK recognised 103 words out of 256, with an accuracy of 62.9%. The second experiment was performed with eight female speakers, for whom HTK gave an accuracy of 66.9%.

5.3 Confusability of emphatic consonants

By analysing the confusion matrix resulting from the phoneme-level experiment, we can confirm that the emphatic sounds pose a major source of difficulty in recognising Arabic speech. Apart from ص /sɭ/ and its non-emphatic counterpart س /s/, it is apparent from Table 2 that the recognition of the emphatic sounds and their non-emphatic counterparts is below 50 per cent. The poorest accuracy was found for the ض /dɭ/ sound (20%), which was confused most often with the emphatic sound ظ /ðɭ/ (16.2%). This confusion may result from the fact that most Arabic speakers tend to pronounce the sound ض /dɭ/ as /ð/ while speaking, as researchers confirm [14]. It can also be observed that the /d/ sound shows relatively high confusion with the sound /t/ and vice versa; this might be because the two sounds share all their features except that /d/ is voiced and /t/ is voiceless.

Table 2 Confusion matrix for emphatic sounds

5.4 Using underspecified phonetic transcription

As a technique for handling the investigated problem, the research uses HTK to acquire an underspecified phonetic representation of the input speech. This is done by blurring the difference between each emphatic sound and its non-emphatic counterpart: we use one symbol to represent both the emphatic and the non-emphatic consonant. This makes the task of the recogniser considerably easier; it has fewer distinctions to make, so it has less chance of making a mistake. This process should then be followed by linguistic analysis to extract the intended content of the utterance from the underspecified transcription. We recorded a decrease in word error rate (WER) of 23% using this technique.

6 Discussion and Conclusion

In this research the Arabic emphatic consonants have been investigated from the ASR point of view. The research confirms that these consonants are hard to recognise, and finds that the /dɭ/ consonant is the most confusable of the emphatics, owing to its acoustic confusability together with the difficulty native speakers have in pronouncing it. A novel technique was developed to enhance the performance of the speech recognition system; its central idea is to use an underspecified transcription that blurs the differences between confusable sounds. The research found that this technique can give promising results for limited-vocabulary systems. The research also suggests training HTK on a larger vocabulary containing these confusable phonemes in order to limit the confusion. Moreover, it has been shown that using phonemes as the acoustic unit gives better results than using words, a finding that supports previous research indicating that phoneme-level HMM models are superior for limited-vocabulary ASR. Finally, the research sheds light on the influence of gender variation, finding that female speakers obtain better recognition results than male speakers; further studies with larger data sets are needed to confirm this finding.

6.1 Future Work

The purpose of the current study was to investigate the factors that affect the recognition outcomes and to train the speech recogniser, with respect to these factors, to identify the Arabic emphatic consonants correctly. However, this investigation was limited by the small size of the training data. Therefore, further experimental investigations need to be carried out to confirm these findings and to establish how much data is needed to achieve a greater degree of accuracy. Moreover, the methods used for recognising the emphatic consonants may be applied to other confusable phonemes, such as vowels and the pharyngeal and uvular consonants, which may also be hurdles for developing a speech recognition system. The findings of this study also have a number of important practical applications. One possibility is to help language learners produce sounds correctly: using a graphical representation, learners can be shown an animation of the vocal tract illustrating how the sounds are articulated and how they should be produced.

References

[1] Čeidaitė G. and Telksnys L., "Analysis of Factors Influencing Accuracy of Speech Recognition," Electronics and Electrical Engineering, Kaunas: Technologija, vol. 105, no. 9, pp. 69-72, 2010.

[2] Çömez M.A., "Large vocabulary continuous speech recognition for Turkish using HTK," Master's thesis, Department of Electrical and Electronics Engineering, Middle East Technical University, 2003.

[3] Al-Ani S.H., Arabic Phonology: An Acoustical and Physiological Investigation, Mouton and Co., The Hague, Netherlands, 1970.

[4] Ali M., Elshafei M., Al-Ghamdi M., Al-Muhtaseb H., and Al-Najjar A., "Generation of Arabic phonetic dictionaries for speech recognition," International Conference on Innovations in Information Technology, pp. 59-63, 2008.

[5] Al Najem S.R.J., "An Exploration of Computational Arabic Morphology," PhD thesis in Computational Linguistics, Department of Language and Linguistics, University of Essex, England, 1998.

[6] Alotaibi Y.A., "Is Phoneme Level Better than Word Level for HMM Models in Limited Vocabulary ASR Systems?," Seventh IEEE International Conference on Information Technology, pp. 332-337, 2010.

[7] Alotaibi Y.A. and Muhammad G., "Study on pharyngeal and uvular consonants in foreign accented Arabic for ASR," Computer Speech & Language, vol. 24, no. 2, pp. 219-231, 2010.

[8] Forsberg M., "Why is Speech Recognition Difficult?," Chalmers University of Technology, 2003.

[9] Hess S., "Pharyngeal articulations in Akan and Arabic," Ms., UCLA, Los Angeles, Calif., 1990.

[10] Huang X. et al., Spoken Language Processing, Prentice Hall, Englewood Cliffs, New Jersey, USA, 2001.

[11] Jurafsky D., Martin J.H., and Kehler A., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2009.

[12] Kirchhoff K. et al., "Novel approaches to Arabic speech recognition: report from the 2002 Johns Hopkins Summer Workshop," Acoustics, Speech, and Signal Processing, vol. 1, pp. I-344, 2003.

[13] Massaro D.W. and Light J., "Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/," Proceedings of the 8th European Conference on Speech Communication and Technology, pp. 1-4, 2003.

[14] Massaro D.W., Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, The MIT Press, 1998.

[15] Satori H., Harti M., and Chenfour N., "Introduction to Arabic Speech Recognition Using CMUSphinx System," arXiv preprint arXiv:0704.2083, 2007.

[16] Schötz S., "Linguistic & Paralinguistic Phonetic Variation in Speaker Recognition & Text-to-Speech Synthesis," GSLT: Speech Technology, 2002.

[17] Selouani S.A. and Caelen J., "Arabic phonetic features recognition using modular connectionist architectures," Proceedings of the IEEE Interactive Voice Technology for Telecommunications Applications, pp. 155-160, 1998.