Challenges for Design of Pronunciation Lexicon Specification (PLS) for Punjabi Language

Swaran Lata
Head, Technology Development for Indian Languages (TDIL) Programme, Department of Information Technology, Govt. of India

Abstract: This paper is an attempt towards developing a pronunciation lexicon for the Punjabi language. Punjabi belongs to the Indo-Aryan branch of the Indo-European family of languages but is unique due to its tonal characteristics. Within the Indo-European family, the Scandinavian languages and Lithuanian exhibit similar traits. Among Indo-Aryan languages, the tonal feature of Punjabi makes it phonetically complex. The major hurdle in creating a PLS for Punjabi is capturing the pronunciation nuances as understood by a native speaker. Web content in Punjabi is scarce, mostly non-standard, and often uses proprietary fonts. Awareness of Unicode and IPA is very limited among the print media and the public. Hence, in spite of being spoken by a very large number of speakers (approx. 30 million in India as per the 2011 census), Punjabi can still be called a less-resourced language (LRL) on account of the non-availability of electronic resources. The negligible work available on phonetic resources makes it even less resourced from the PLS point of view.

Keywords: PLS, Punjabi, Tone, Phonology, Indo-Aryan, Indo-European, LRL, Unicode, IPA, W3C, XML, TTS, ASR

1. Introduction

1.1 What is Pronunciation Lexicon Specification (PLS)?

PLS is a standard of the World Wide Web Consortium (W3C); its current version is PLS 1.0 (2008), produced by the Voice Browser Working Group of W3C. PLS was designed to provide an interoperable specification of pronunciation information for use in speech technology development. It provides a mapping between words or short phrases, their written representations and their pronunciations, especially for use by speech engines.
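To make the grapheme-to-pronunciation mapping concrete, the following is a minimal sketch of how a small PLS 1.0 document could be generated. The element names (lexicon, lexeme, grapheme, phoneme) and the namespace come from the W3C PLS 1.0 Recommendation; the helper name build_lexicon, the language tag "pa" and the choice of Python are assumptions made for illustration and are not part of this paper's tooling. The two sample words and their transcriptions are taken from the stress examples in Section 3.1. Each entry takes a list of pronunciations because PLS permits several phoneme elements per lexeme.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(entries, lang="pa"):
    """Build a PLS 1.0 <lexicon> element (hypothetical helper).

    entries: iterable of (grapheme, [ipa_pronunciation, ...]) pairs.
    lang: language tag for xml:lang; "pa" (Punjabi) is an assumption.
    """
    # Serialize the PLS namespace as the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {
            "version": "1.0",
            "alphabet": "ipa",
            f"{{{XML_NS}}}lang": lang,
        },
    )
    for grapheme, pronunciations in entries:
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        # One <phoneme> element per pronunciation of this orthography.
        for ipa in pronunciations:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ipa
    return lexicon


# Sample entries reuse transcriptions given later in this paper.
entries = [("ਗਾਜਰ", ["ˈgadʒər"]), ("ਮਾਲੀ", ["ˈmali"])]
pls_xml = ET.tostring(build_lexicon(entries), encoding="unicode")
print(pls_xml)
```

Keeping the IPA strings as ordinary Unicode text means the serialized lexicon can be stored directly as UTF-8.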
PLS data for a specific language is prepared in XML format using the baseline PLS specification of W3C. The specification allows multiple pronunciations for the same orthography as well as multiple orthographies for a single pronunciation entry, which adequately covers homographs and homophones. Acronyms and abbreviations can also be incorporated by providing them as aliases. The PLS specification provides a framework and guidelines that can be tailored to the needs of a specific language; the XML tag set can then be defined to build the PLS data using IPA in UTF-8 representation. PLS can be used by Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) engines and has a wide variety of applications such as voice browsers and pedagogical tools.

1.2 Global Status of PLS Development

PLS work for Indian languages is almost non-existent. Some work has started recently in Bangla (Das Mandal Shyamal, 2010) and Hindi, but it is at a very early stage. Development of PLS data for European languages has already been taken up extensively. Some of the reported works are described below.

For Slovenian, SI-PRON, a comprehensive pronunciation lexicon (1.4 million words), has been prepared. For Swedish, a Swedish Pronunciation Lexicon has been developed; it has 8529 words and is delivered in two formats: (a) a tab-separated format and (b) an XML format. Similar work has been reported for Turkish, named the Finite State Pronunciation Lexicon. Turkish is an agglutinating language with extremely productive inflectional and derivational morphology, so it has an essentially infinite lexicon. The Turkish lexicon takes a word form as input and produces all possible pronunciations, encoded using SAMPA. The total number of words is approximately 750,000.
However, as mentioned earlier, such extensive study is almost non-existent for Indian languages, and for Punjabi in particular. The paper is organized as follows: Section 2 describes the specific phonological and supra-segmental features of Punjabi. The specific requirements for development of a PLS in Punjabi are described in Section 3. The challenges faced during development of a PLS in Punjabi are touched upon in Section 4, and Section 5 concludes with future directions for building a PLS in Punjabi.

2. Phonological and Supra-segmental Features of Punjabi

2.1 Phonological Features

Conjunct Consonants: Three types of conjunct consonants are written, in which a modified form of the second consonant letter is subjoined to the unaltered first consonant letter. The member consonant letters are ਹ /h/, ਰ /r/, ਲ /l/, e.g.
ਪੜ੍ਹ /pə́ɽ/ Study
ਪ੍ਰਕਾਰ /prkar/ Type of / Similar
[Gurmukhi garbled in source] /ə̀rsv/ Small

Diphthongs: There are six glides. The first member of a diphthong is always a short vowel and the second is a long vowel.
/ɪo/ = ਇ + ਓ, e.g. ਪਿਓ = /pɪo/
/ɪɔ/ = ਇ + ਔ, e.g. ਲਿਔਣਾ = /lɪɔɳɑ/
/əi/ = ਅ + ਈ, e.g. ਗਈ = /gəi/
/əe/ = ਅ + ਏ, e.g. ਗਏ = /ɡəe/
/əu/ = ਅ + ਉ, e.g. ਗਊ = /ɡəu/
/Ua/ = ਉ + ਆ, e.g. ਗੁਵਾਚਾ = /ɡUvatʃa/

Geminates: Consonants can be geminated by using Addak (ਅੱਦਕ = /əddk/). It is placed on the preceding character, and the following character is then pronounced as a full character plus a half character, e.g. ਮਿੱਟੀ = /mɪtti/

Prolative Vowel: Addak is also used to elongate a long vowel when the vowel occurs at the end of a word. It makes the vowel one and a half times its normal length, e.g.
ਰਲਾ = /rla/ (noun), ਰਲਾੱ = /rlaa/ (verb)
ਲਮਕਾ = /ləmka/ (noun), ਲਮਕਾੱ = /ləmkaa/ (verb)

2.2 Supra-segmental Features of Punjabi

Tone: Punjabi is highly tonal (Haudricourt, 1971), and this is the contrastive feature of Punjabi among Indo-Aryan languages. Punjabi does not have contour tones of the kind found in Mandarin. There are five tonal characters and three types of tone, i.e. high-tone /Ó/, low-tone /Ò/ and mid-tone /ō/.
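As a rough illustration of how the three-way tone distinction could be read off an IPA transcription, the sketch below classifies a word by its tone diacritic. It assumes that high tone is written with an acute accent and low tone with a grave accent on the vowel, as in the transcriptions used in this paper, and that a transcription with neither mark carries mid tone. The function name tone_of is hypothetical, and the check is word-level only; it does not locate the tone-bearing syllable.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent, used here for high tone
GRAVE = "\u0300"  # combining grave accent, used here for low tone


def tone_of(ipa):
    """Return 'high', 'low' or 'mid' for an IPA transcription (sketch).

    Assumption: tone is marked with a single acute or grave accent;
    an unmarked transcription is treated as mid tone.
    """
    # NFD splits precomposed letters such as 'á' into base + accent,
    # so precomposed and combining spellings are handled alike.
    decomposed = unicodedata.normalize("NFD", ipa)
    if ACUTE in decomposed:
        return "high"
    if GRAVE in decomposed:
        return "low"
    return "mid"


# Transcriptions taken from the examples in this paper.
for word in ["sá", "kòɽa", "koɽa"]:
    print(word, tone_of(word))
```

A nasalization tilde does not interfere with the check, since only the acute and grave combining marks are inspected.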
Synchronically, tone placement interacts with accent/stress. In the production of tones there is neither friction nor stoppage of air in the mouth. Tones are always pronounced concurrently with a syllable. In the production of the low tone, there is a considerable amount of constriction in the larynx along with some creakiness. Sometimes a fall of the larynx is accompanied by the lowest possible pitch. The fall in pitch is followed by a rise, though not to the same level in all cases. In a monosyllabic word the pitch of the voice rises and falls within the same syllable, but in polysyllabic words the fall is realized on the tail syllable which follows the onset syllable. In mid-tone words, the pitch remains fairly level and may rise towards the end; the rise is not necessarily realized in all cases.

High-tone: It is a rising-falling tone. It is used to represent ਹ /h/ in words, e.g.
ਸਾਹ = /sá/
ਸ਼ੈਹਰ = /ʃɛ́r/
ਇਹਦਾ = /ɪ́da/
ਓਹਦਾ = /óda/

Low-tone: It is a falling tone and is used in words with half ਹ /h/ only in the initial position, e.g.
ਨ੍ਹਾਤਾ = /nàta/
ਹੋਇਆ = /òɪa/

Mid-tone: It is considered to be intermediate in pitch between the high and low tones; the syllable is of intermediate height. It is not marked in the phonetic transcription since it is predicted by rules of redundancy: if a vowel does not bear any tone specification at the level of phonetic representation, it carries a mid tone.

Stress: Stress is not a prominent feature of Punjabi. It is used in disyllabic words to distinguish between grammatical categories. In Punjabi, the accent on stressed syllables is a combination of length and pitch. Unstressed syllables lack length and high pitch. Emphasized syllables contain a greater amount of energy. Phonemic stress can fall on either the initial or the final syllable, e.g. ਨੱਕ /nəkk/, ਕੰਨ /kə̃n/

Intonation: Intonation is the pitch fluctuation pattern as applied to a unit larger than the word, i.e.
a clause or a sentence; hence it is not pertinent to PLS. A given sentence may be spoken in more than one way to express different attitudes, each of the different terms of the tonal system accordingly having more than one pitch-exponent (Bahl, 1957), e.g.
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ ? /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ ?
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ ! /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ !
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ । /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ ।

3. Requirements for a Punjabi PLS

Tone is the most characteristic feature of Punjabi (Singh Harkirat, 1991), i.e. the tones arise as a reinterpretation of different consonant series in terms of pitch. It imposes special requirements on the development of a pronunciation lexicon. A pulmonic egressive airstream mechanism is involved in the production of all phonetic segments of the language (Bhatia Tej K). The issues relating to geminates and diphthongs also need to be examined. Homophones and word variants also need to be accounted for while designing the PLS. The homophones will need to cover homonyms (which are very frequently used in Punjabi) and heterographs (which are rarely used).

3.1 What needs to be done for designing Punjabi Language PLS

Phonetic Transcription using IPA: Most of the existing literature has used non-standard representations for transcription of Punjabi; hence the phonemic inventory needs to be standardized as per the International Phonetic Alphabet (IPA 2005), since this is the primary requirement for building standard PLS data. Internationally, IPA is the preferred phonetic representation, and it now has a one-to-one mapping in the Unicode standard. In India, the Unicode standard is already being adopted for storage of data. The phonemic inventory of standard spoken Punjabi has been defined here using IPA 2005 (Wiki & Pandey 2011). However, X-SAMPA representations may also be studied for comparison.

Vowels: Three basic symbols are used to form vowels. A tilde (~) placed on top of any vowel makes it nasalized.
[ɪ] (ਇ), [i] (ਈ), [e] (ਏ), [ɛ] (ਐ), [ə] (ਅ), [a] (ਆ), [u] (ਊ), [U] (ਉ), [o] (ਓ), [ɔ] (ਔ)

Semi-vowels: [j] (ਯ); [v] (ਵ), labio-dental (most frequently used); [w] (ਵ), bilabial (rare in use)

Consonants:
Stops: [p] (ਪ), [pʰ] (ਫ), [b] (ਬ), (ਭ)*
       [t] (ਤ), [tʰ] (ਥ), [d] (ਦ), (ਧ)*
       [ʈ] (ਟ), [ʈʰ] (ਠ), [ɖ] (ਡ), (ਢ)*
Velar: [k] (ਕ), [kʰ] (ਖ), [ɡ] (ਗ), (ਘ)*
Affricates: [tʃ] (ਚ), [tʃʰ] (ਛ), [dʒ] (ਜ), (ਝ)*
Nasals: [m] (ਮ), [n] (ਨ), [ɳ] (ਣ)
Laterals: [l] (ਲ)
Trill: [r] (ਰ)
Flap: [ɽ] (ੜ)
Fricatives: [s] (ਸ), [h] (ਹ), [ʃ] (ਸ਼)¹, [z] (ਜ਼)², [f] (ਫ਼)³, [x] (ਖ਼)⁴, [ɣ] (ਗ਼)⁵

* These tonal characters are represented by the corresponding aspirated/un-aspirated and voiced/unvoiced forms and are also marked with high rising tone /Ó/ or low rising tone /Ò/ on top of the accompanying vowel.
¹ ² ³ ⁴ ⁵ Used only for words borrowed from Perso-Arabic and English (Bhatia, 1997).

Rules for characterization of tonal characters (shown for the tonal character ਘ):
Position; Example word; Represented by; Type of tone
Word initial; ਘਰ = /k‵ər/; un-aspirated and unvoiced /k/; low tone /Ò/
Word medial; ਮਘਾਯਾ = /mgaj‵a/; un-aspirated and voiced /g/; low tone /Ò/
Word final; ਮਾਘ = /m′ag/; un-aspirated and voiced /g/; high tone /Ó/

Rules for phonetic stress placement:
1. Stress falls on the final syllable in cases where the initial syllable has a syllable peak with centralized vowels (ɪ, ə, u) and the final syllable has a consonant cluster with centralized vowel peaks, e.g. ਪਾਖੰਡੀ /pɑ ˈkʰandi/ Hypocrite
2. When the first syllable is long or heavy and the second is a closed one, the stress falls on the first syllable, e.g. ਗਾਜਰ /ˈgadʒər/ Carrot
3. In disyllabic words, the initial syllable is stressed if the final syllable is open, e.g. ਮਾਲੀ /ˈmali/ Gardener
4. In trisyllabic words, the stress falls on the second syllable if it is long; otherwise it falls on the first syllable, e.g. ਚਮਕੀਲਾ /tʃəmˈkila/ Shining
(Bhatia, 1997)

Collection of data: The most frequently used words need to be covered in the PLS.
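Selecting the most frequent words amounts to a frequency count over a corpus. The sketch below shows one simple way to do this; it is not the frequency count tool used in this work. The function name most_frequent_words, the whitespace tokenization and the set of punctuation marks stripped from tokens are all assumptions made for illustration.

```python
from collections import Counter

# Punctuation stripped from token edges; the danda (।) ends
# declarative sentences in Gurmukhi text. This set is an assumption.
PUNCTUATION = "।?!,.;:\"'()"


def most_frequent_words(sentences, top_n):
    """Return the top_n most frequent word forms (hypothetical helper).

    Assumption: words are separated by whitespace, which is adequate
    for a sketch but not a substitute for a real tokenizer.
    """
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            word = token.strip(PUNCTUATION)
            if word:  # skip tokens that were punctuation only
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]


# Toy input built from words that appear as examples in this paper.
sample = ["ਘਰ ਘਰ ਮਾਲੀ ।", "ਘਰ ਮਾਲੀ ਗਾਜਰ ?"]
print(most_frequent_words(sample, 2))
```

A frequency list of this kind covers only the high-frequency requirement; the phonetic coverage criteria still have to be checked separately.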
The selection of words also needs to be governed by the following phonetic considerations, so as to have complete coverage of the phonetic specificities of the language:
(a) Words formed with all tonal characters in their initial/medial/final positions
(b) Pairs of words with normal /a/ and prolated /a/ at the end, and different durations of short vowels depending on their position in the word
(c) Homonyms, allophones, word variants
(d) Minimal pairs of tonal and short & long consonants
(e) Duplicate words, borrowed words, proper names, numbers in word form

Methodology of selection of data: A Punjabi tagged corpus of 100,000 sentences from the tourism and health domains (TDIL Anglabharti Project) was used for selection of data, together with a frequency count tool. A phonetically rich word list, also covering word variants, homonyms and minimal pairs, was collected as well. 1000 words have been selected.

Recording of data: Recording of data by native speakers, including the author (a native speaker of Punjabi), is in progress. Within one year, rough PLS data for the 10,000 most frequently occurring words will be developed as per the criteria mentioned above. Recording is done with a dynamic microphone with a frequency response of 80 Hz-20 kHz, in a speech studio (SNR>=-45dB), in 16-bit PCM mono format at 48 kHz. The emotion of the informants was kept neutral. There were 2 informants (1 male, 1 female).

Annotation: Presently this data is raw, as there is negligible prior work in Punjabi and the raw data is being compiled ab initio. The data will be annotated using an open-source semi-automatic tool.

4. Challenges of Punjabi PLS

Tonal Minimal Pairs: Only Punjabi and Sindhi are tonal languages out of the 22 constitutionally recognized languages, which poses extra challenges in building a PLS. Distinguishing mid tone from high or low tone in similar-sounding minimal pairs requires sound phonetic knowledge for building an accurate PLS, e.g.
Mid-tone: ਕੋੜਾ /koɽa/ Whip; ਕੜੀ /kəɽi/ A ring of a chain
Low-tone: ਘੋੜਾ /kòɽa/ Horse; ਘੜੀ /k‵əɽi/ A small pitcher or watch

Mid-tone: ਕੋੜਾ /koɽa/ Whip; ਸੌਂ /sɔ̃/ Sleep
High-tone: ਕੋਹੜਾ /kóɽa/ Leper; [Gurmukhi garbled in source] Oath

Stress: There are three gradations of short vowels according to their position of occurrence (Singh H. 1991), and these result in distinctive phonemes for PLS work.

Vowel /ɪ/
Longest: ਇੱਕ /ɪkk/, ਵਿੱਚ /vɪtʃ/
Normal: ਇਕਧਿਰ /ɪkdɪ‵r/, ਵਿਚਕਾਰ /vɪtʃkar/
Least length: ਇਕਾਈ /ɪkai/, ਵਿਚੋਲਾ /vɪtʃola/

Vowel /ə/
Longest: ਅੱਗੇ /əgge/, ਲੱਗ /ləgg/
Normal: ਅਗਲਾ /əgla/, ਲਗਨ /ləgn/
Least length: [Gurmukhi garbled in source] /əgetra/, ਲਗਾ /lga/

Vowel /U/
Longest: ਉੱਡ /Uɖɖ/, ਘੁੱਟ /kÙtt/
Normal: ਉਡਣਾ /Uɖɖɳa/, ਘੁਟਵਾਂ /kÙtvã/
Least length: ਉਡਾਈ /Uɖɖai/, ਘੁਟਾਈ /kÚtai/

Homographs:
ਚਲ /tʃəl/ Flood
ਚਲ /tʃəl/ Mind/Hindrance

Minimal pairs of short and long consonants:
ਦਿਲੀ /dɪli/ Internal; ਦਿੱਲੀ /dɪlli/ Delhi
ਚੁਨੀ /tʃUni/ Selected; ਚੁੰਨੀ /tʃŨni/ Dupatta

Allophones:
Complementary allophones, e.g. word initial ਤਾਰ /tar/; word final ਰਾਤ /rat/
Contrastive allophones, e.g.
ਸੋਨਾ /sona/ Gold; ਸੌਣਾ /sɔɳa/ To sleep
ਗੋਲੀ /goli/ Toffee; ਗ਼ੋਲੀ /ɣoli/ Maid servant
ਜਰ /dʒər/ To tolerate; ਜ਼ਰ /zər/ Land

Homophones: Words having the same pronunciation but a different grammatical category or a different meaning would require POS information (Das Mandal, 2010) to resolve the meaning, e.g.
ਉੱਤਰ (noun) Answer
ਉੱਤਰ (adjective) North direction

Word Variants: Words having minor variations in spelling but the same meaning, e.g.
ਜਾਲਮ /dʒaləm/, ਜ਼ਾਲਮ /zaləm/ Tyrant
ਜਸਪਤ /dʒspət/, ਜਸਪਤਿ /dʒsptɪ/, ਜਸਪਤੀ /dʒspti/ Famous person

5. Conclusion

The PLS is designed using Pronunciation Mark-up Language, based on the PLS Version 1.0 standard of W3C, which will allow an open and portable specification of pronunciation. The hierarchical tag structures for Punjabi will be defined keeping in mind all the phonetic requirements brought out in this paper.

References
[1] W3C Recommendation (2008), Pronunciation Lexicon Specification Ver 1.0
[2] Gros, J.Z.
(2006), SI-PRON Pronunciation Lexicon: a New Language Resource for Slovenian, Informatica
[3] Das Mandal, Shyamal, Chandra, Somnath, Lata, Swaran (2010), Use of Parts Of Speech (POS) and morphological information for resolving Multiple PLS Indian Languages - Bengali as a Case Study. USA: W3C Workshop on conversational applications
[4] Karamat, Nayyara (2010), Phonetic Inventory of Punjabi. Pakistan: Center for Research in Urdu Language Processing
[5] Singh, Atam (1993), Linguistics. Chandigarh: Punjab State University Text Book Board
[6] Singh, Harkirat (1991), Prominent features of Punjabi language. Patiala: Publication Bureau, Punjabi University
[7] Bhatia, T. K. (1997), Punjabi, A cognitive-descriptive grammar. London/New York: Routledge
[8] Gupta, B. Raj (1990), Indian Linguistics. New Delhi: Punjabi-Tamil Phonology
[9] Pandey, P. (2011), Phonetics & Phonology of Indian Languages.
[10] Joshi, S.S. (1973), Pitch and Related Phenomena in Punjabi. Patiala: Pakha Sanjam
[11] Ladefoged, Peter (1996), Acoustics Phonetics. Chicago: The Univ. of Chicago Press Books
[12] Gill, H.S. (1962), A descriptive grammar of Punjabi, PhD dissertation. Patiala: Panjabi Univ
[13] Haudricourt, A.G. (1971), Tones in Punjabi. Paris: C.N.R.S.
[14] Dulai, N. K. (1980), Punjabi Phonetic Reader. Mysore: CIIL
[15] Bahl, K.C. (1957), Indian Linguistics Vol. 17
[16] A Swedish Pronunciation Lexicon for TTS/ASR Development (2008), stts.se
[17] Oflazer, Kemal, The Architecture and the Implementation of a Finite State Pronunciation Lexicon for Turkish.