Challenges for Design of Pronunciation Lexicon Specification (PLS) for Punjabi Language
Swaran Lata
Head, Technology Development for Indian Languages (TDIL) Programme,
Department of Information Technology, Govt. of India
[email protected]
Abstract:
This paper is an attempt towards developing a pronunciation lexicon for the Punjabi language. Punjabi belongs to the
Indo-Aryan branch of the Indo-European family of languages but is unique due to its tonal characteristics. Within the
Indo-European family, the Scandinavian languages and Lithuanian exhibit similar traits. Among Indo-Aryan languages, the
tonal feature of Punjabi makes it phonetically complex. The major hurdle in creating a PLS for Punjabi is capturing the
pronunciation nuances as properly understood by a native speaker. Web content in Punjabi is scarce, mostly non-standard,
and often uses proprietary fonts. Awareness of Unicode and IPA is very limited among the print media and the public.
Hence, in spite of being spoken by a very large number of speakers (approx. 30 million in India as per the 2011 census),
it can still be called a less-resourced language (LRL) on account of the non-availability of electronic resources. The
negligible work available on phonetic resources makes it even less resourced from the PLS point of view.
Keywords: PLS, Punjabi, Tone, Phonology, Indo-Aryan, Indo-European, LRL, Unicode, IPA, W3C, XML, TTS, ASR
1. Introduction
1.1 What is Pronunciation Lexicon Specification (PLS)?
PLS is a standard of the World Wide Web Consortium (W3C); its current version, PLS 1.0 (2008), was
produced by the W3C Voice Browser Working Group. PLS was designed to provide an interoperable
specification of pronunciation information for use in speech technology development. It provides a
mapping between words or short phrases, their written representations and their pronunciations,
especially for use by speech engines. PLS data for a specific language is prepared in XML using
the baseline W3C PLS specification. The specification allows multiple pronunciations for the same
orthography as well as multiple orthographies for a single pronunciation within one entry. This
adequately covers homographs and homophones respectively. Acronyms and abbreviations can also be
incorporated by providing them as aliases. The PLS specification provides a framework and
guidelines that can be tailored to the needs of a specific language; the XML tag set can
accordingly be defined to build the PLS data using IPA in UTF-8 representation. PLS can be used by
Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) engines and has a wide variety of
applications such as voice browsers and pedagogical tools.
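The entry structure described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard xml.etree.ElementTree module: the element and attribute names (lexicon, lexeme, grapheme, phoneme, alias, alphabet) and the namespace come from PLS 1.0, while the function name build_lexicon, the layout of the entry dictionaries and the "W3C" abbreviation entry are assumptions made for this example. The Punjabi entry reuses a word-variant pair from Section 4.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(entries, lang="pa"):
    """Build a PLS 1.0 <lexicon> element from simple entry dictionaries.

    Each entry has 'graphemes' (list of spellings) and either
    'phonemes' (list of IPA strings) or 'alias' (an expansion).
    """
    root = ET.Element("lexicon", {
        "version": "1.0",
        "xmlns": PLS_NS,        # written out literally as the default namespace
        "alphabet": "ipa",
        "xml:lang": lang,       # "pa" is the ISO 639-1 code for Punjabi
    })
    for entry in entries:
        lexeme = ET.SubElement(root, "lexeme")
        for spelling in entry["graphemes"]:
            ET.SubElement(lexeme, "grapheme").text = spelling
        for ipa in entry.get("phonemes", []):
            ET.SubElement(lexeme, "phoneme").text = ipa
        if "alias" in entry:
            ET.SubElement(lexeme, "alias").text = entry["alias"]
    return root


entries = [
    # Multiple orthographies and multiple pronunciations in one entry.
    {"graphemes": ["ਜਾਲਮ", "ਜ਼ਾਲਮ"], "phonemes": ["dʒaləm", "zaləm"]},
    # Hypothetical abbreviation expanded through <alias>.
    {"graphemes": ["W3C"], "alias": "World Wide Web Consortium"},
]
pls_xml = ET.tostring(build_lexicon(entries), encoding="unicode")
print(pls_xml)
```

Because the IPA strings are ordinary Unicode text, writing the result to a file with UTF-8 encoding gives the representation mentioned above.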
1.2 Global Status of PLS Development
PLS work for Indian languages is almost non-existent. Some work has started recently in Bangla
(Das Mandal, 2010) and Hindi, but it is at a very early stage. Development of PLS data for
European languages has already been taken up extensively. Some of the reported works are
described below.
SI-PRON, a comprehensive pronunciation lexicon for Slovenian (1.4 million words), has been
prepared. For Swedish, a Swedish Pronunciation Lexicon has been developed; it has 8529 words and
is delivered in two formats, namely (a) a tab-separated format and (b) an XML format. Similar
work, the Finite State Pronunciation Lexicon, has been reported for Turkish. Turkish is an
agglutinating language with extremely productive inflectional and derivational morphology, so it
has an essentially infinite lexicon. The Turkish lexicon takes a word form as input and produces
all possible pronunciations, encoded using SAMPA. The total number of words is approximately
750,000.
However, as mentioned earlier, such extensive study is almost non-existent for Indian languages,
and especially for Punjabi.
The paper is organized as follows: Section 2 describes the specific phonological and
supra-segmental features of the Punjabi language. The specific requirements for development of a
PLS in Punjabi are described in Section 3. The challenges faced during development of a PLS in
Punjabi are touched upon in Section 4, and Section 5 concludes with future directions for
building up a PLS in Punjabi.
2. Phonological and Supra-segmental Features of Punjabi
2.1 Phonological Features
Conjunct Consonants: Three types of conjunct consonants are written, in which a modified form of
the second consonant letter is subjoined to the unaltered first consonant letter. The consonant
letters that can be subjoined are ਹ /h/, ਰ /r/ and ਲ /l/, e.g.
ਪੜ੍ਹ     /pə́ɽ/     Study
ਪ੍ਰਕਾਰ   /prkar/   Type of / Similar
ਹ੍ਰਸਵ    /ə̀rsv/    Small
Diphthongs: There are six glides. The first member of a diphthong is always a short vowel and the
second a long vowel.
/ɪo/ = ਇ + ਓ, e.g. ਪਿਓ = /pɪo/;
/ɪɔ/ = ਇ + ਔ, e.g. ਲਿਔਣਾ = /lɪɔɳa/;
/əi/ = ਅ + ਈ, e.g. ਗਈ = /ɡəi/;
/əe/ = ਅ + ਏ, e.g. ਗਏ = /ɡəe/;
/əu/ = ਅ + ਉ, e.g. ਗਊ = /ɡəu/;
/Ua/ = ਉ + ਆ, e.g. ਗੁਵਾਚਾ = /ɡUvatʃa/
Geminates: Consonants can be geminated by using Addak (ਅੱਦਕ = /əddk/). It is placed on the
preceding character, and the following character is then pronounced as a full character plus a
half character, e.g. ਮਿੱਟੀ = /mɪtti/
Prolative Vowel: Addak is also used to elongate a long vowel when the vowel occurs at the end of
a word. It makes the vowel one and a half times its normal length, e.g.
ਰਲਾ = /rla/ (noun)
ਰਲਾੱ = /rlaa/ (verb)
ਲਮਕਾ = /ləmka/ (noun)
ਲਮਕਾੱ = /ləmkaa/ (verb)
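The gemination rule for Addak can be sketched as follows. This is a simplified illustration rather than a full grapheme-to-phoneme converter: the letter tables cover only a handful of letters from the inventory in Section 3.1, the function name transcribe_with_addak is an assumption, and the inherent vowel /ə/ is written only when Addak sits directly on a consonant (schwa deletion elsewhere is not modelled).

```python
ADDAK = "\u0A71"  # Gurmukhi sign addak

# Partial letter tables, limited to what the examples below need.
CONSONANTS = {
    "\u0A28": "n",  # ਨ
    "\u0A15": "k",  # ਕ
    "\u0A17": "g",  # ਗ
    "\u0A32": "l",  # ਲ
    "\u0A2E": "m",  # ਮ
}
INDEPENDENT_VOWELS = {"\u0A07": "ɪ", "\u0A05": "ə"}  # ਇ, ਅ
VOWEL_SIGNS = {"\u0A3E": "a", "\u0A3F": "ɪ", "\u0A40": "i", "\u0A47": "e"}


def transcribe_with_addak(word):
    """Transcribe a Gurmukhi word, doubling the consonant that follows addak."""
    out = []
    geminate_next = False
    for i, ch in enumerate(word):
        if ch == ADDAK:
            geminate_next = True
        elif ch in CONSONANTS:
            ipa = CONSONANTS[ch]
            out.append(ipa * 2 if geminate_next else ipa)
            geminate_next = False
            # Simplification: write the inherent vowel only when addak
            # is placed directly on this consonant.
            if word[i + 1:i + 2] == ADDAK:
                out.append("ə")
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return "".join(out)


print(transcribe_with_addak("\u0A28\u0A71\u0A15"))        # ਨੱਕ -> nəkk
print(transcribe_with_addak("\u0A07\u0A71\u0A15"))        # ਇੱਕ -> ɪkk
print(transcribe_with_addak("\u0A05\u0A71\u0A17\u0A47"))  # ਅੱਗੇ -> əgge
```

Doubling the symbol is adequate for single-character consonants; affricates such as /tʃ/ would need a dedicated rule.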
2.2 Supra-segmental Features of Punjabi
Tone: Punjabi is highly tonal (Haudricourt, 1971), and this is the feature that distinguishes
Punjabi among the Indo-Aryan languages. Punjabi does not have contour tones of the kind found in
Mandarin. There are five tonal characters and three types of tone, i.e. high-tone /Ó/, low-tone
/Ò/ and mid-tone /ō/. Synchronically, tone placement interacts with accent/stress. In the
production of tones there is neither friction nor stoppage of air in the mouth, and tones are
always pronounced concurrently with a syllable. In the production of the low tone there is a
considerable amount of constriction in the larynx along with some creakiness. Sometimes a fall of
the larynx is accompanied by the lowest possible pitch. The fall in pitch is followed by a rise,
though not to the same level in all cases. In a monosyllabic word the pitch of the voice rises
and falls within the same syllable, but in polysyllabic words the fall is realized on the tail
syllable that follows the onset syllable. In mid-tone words the pitch remains fairly level and
may rise towards the end, although the rise is not necessarily realized in all cases.
High-tone: It is a rising-falling tone. It is used to represent ਹ /h/ within words, e.g.
ਸਾਹ = /sá/, ਸ਼ੈਹਰ = /ʃɛ́r/, ਇਹਦਾ = /ɪ́da/, ਓਹਦਾ = /óda/
Low-tone: It is a falling tone and is used in words with half ਹ /h/ only in the initial
position, e.g.
ਨ੍ਹਾਤਾ = /nàta/
ਹੋਇਆ = /òɪa/
Mid-tone: It is considered to be intermediate in pitch
between the high and low tones. The syllable is of an
intermediate height in this case. It is not marked in the
phonetic transcription since it is predicted by rules of
redundancy. If a vowel doesn’t bear any tone
specification at the level of phonetic representation, it
carries a mid tone.
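The marking convention just described (acute for high tone, grave for low tone, no mark for mid tone) can be handled with Unicode combining diacritics. The sketch below is illustrative only; the function names mark_tone and tone_of are assumptions, and each transcription is assumed to carry at most one tone mark.

```python
import unicodedata

HIGH = "\u0301"  # combining acute accent, high tone
LOW = "\u0300"   # combining grave accent, low tone


def mark_tone(vowel, tone):
    """Attach a tone diacritic to a vowel; the mid tone stays unmarked."""
    mark = {"high": HIGH, "low": LOW, "mid": ""}[tone]
    return unicodedata.normalize("NFC", vowel + mark)


def tone_of(transcription):
    """Return 'high', 'low' or 'mid' for a transcription with at most one tone mark."""
    decomposed = unicodedata.normalize("NFD", transcription)
    if HIGH in decomposed:
        return "high"
    if LOW in decomposed:
        return "low"
    return "mid"  # default: no tone specification means mid tone


print(tone_of("sá"), tone_of("nàta"), tone_of("koɽa"))
```

Decomposing to NFD first means that precomposed letters such as "á" and base-plus-combining sequences such as "ɛ́" are treated alike, and the nasalization tilde is not mistaken for a tone mark.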
Stress: Stress is not a prominent feature of Punjabi. It is utilized in disyllabic words to
distinguish between grammatical categories. In Punjabi, the accent on stressed syllables is a
combination of length and pitch; unstressed syllables lack length and high pitch. Emphasized
syllables contain a greater amount of energy. Phonemic stress can fall on either the initial or
the final syllable, e.g.
ਨੱਕ /nəkk/
ਕੰਨ /kə̃n/
Intonation: Intonation is the pattern of pitch fluctuation applied to a unit larger than the
word, i.e. a clause or a sentence; hence it is not pertinent to PLS. A given sentence may be
spoken in more than one way to express different attitudes, each of the different terms of the
tonal system accordingly having more than one pitch-exponent (Bahl, 1957), e.g.
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ ? /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ ?
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ ! /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ !
ਕਰਮ ਸਿੰਘ ਚਲਾ ਗਯਾ । /kərm/ /sɪ̃g/ /tʃla/ /gɪa/ ।
3. Requirements for a Punjabi PLS:
Tone is the most characteristic feature of Punjabi (Singh Harkirat, 1991), i.e. the tones arise
as a reinterpretation of different consonant series in terms of pitch. It imposes special
requirements on the development of a pronunciation lexicon. The pulmonic egressive airstream
mechanism is involved in the production of all phonetic segments of the language (Bhatia, 1997).
The issues relating to geminates and diphthongs also need to be examined. Homophones and word
variants also need to be accounted for while designing the PLS. The homophones will need to
cover homonyms (which are very frequently used in Punjabi) and heterographs (which are rarely
used).
3.1 What needs to be done for designing Punjabi Language PLS
Phonetic Transcription using IPA: Most of the existing literature uses non-standard
representations for the transcription of Punjabi. The phonemic inventory therefore needs to be
standardized as per the International Phonetic Alphabet (IPA 2005), since this is the primary
requirement for building standard PLS data. Internationally, IPA is the preferred phonetic
representation, and it now has a one-to-one mapping in the Unicode standard. In India the Unicode
standard is already being adopted for storage of data.
The phonemic inventory of standard spoken Punjabi has been defined here using IPA 2005 (Wiki &
Pandey 2011). However, X-SAMPA representations may also be studied for comparison.
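For such a comparison, the IPA symbols used in this paper can be converted to X-SAMPA with a small lookup table. The sketch below is a hedged illustration covering only the symbols that occur in the inventory and examples here; the table name IPA_TO_XSAMPA and the function name ipa_to_xsampa are assumptions. Plain ASCII letters, including the "U" used in this paper for the short back vowel, pass through unchanged.

```python
import unicodedata

# IPA symbol -> X-SAMPA equivalent (partial table).
IPA_TO_XSAMPA = {
    "ɪ": "I", "ə": "@", "ɛ": "E", "ɔ": "O", "ʊ": "U", "ɑ": "A",
    "ʈ": "t`", "ɖ": "d`", "ɳ": "n`", "ɽ": "r`",
    "ʃ": "S", "ʒ": "Z", "ɣ": "G", "ɡ": "g",
    "ʰ": "_h",          # aspiration
    "\u0303": "~",      # nasalization (combining tilde)
    "\u0301": "_H",     # high tone (combining acute)
    "\u0300": "_L",     # low tone (combining grave)
    "ˈ": '"',           # primary stress
}


def ipa_to_xsampa(ipa):
    """Convert an IPA string to X-SAMPA, symbol by symbol."""
    decomposed = unicodedata.normalize("NFD", ipa)
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in decomposed)


print(ipa_to_xsampa("tʃəmkila"))  # tS@mkila
print(ipa_to_xsampa("koɽa"))      # kor`a
```

Since X-SAMPA is pure ASCII, such a mapping is also convenient for tools that cannot display IPA.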
Vowels: Three basic symbols are used to form vowels. A tilde (~) placed on top of any vowel
makes it nasalized.
[ɪ] (ਇ), [i] (ਈ), [e] (ਏ), [ɛ] (ਐ), [ə] (ਅ), [a] (ਆ), [u] (ਊ), [U] (ਉ), [o] (ਓ), [ɔ] (ਔ)
Semi-vowels: [j] (ਯ); [v] (ਵ), labio-dental (most frequently used); [w] (ਵ), bilabial (rare in
use)
Consonants:
Stops
  Bilabial     [p] (ਪ)    [pʰ] (ਫ)    [b] (ਬ)    (ਭ)*
  Dental       [t] (ਤ)    [tʰ] (ਥ)    [d] (ਦ)    (ਧ)*
  Retroflex    [ʈ] (ਟ)    [ʈʰ] (ਠ)    [ɖ] (ਡ)    (ਢ)*
  Velar        [k] (ਕ)    [kʰ] (ਖ)    [ɡ] (ਗ)    (ਘ)*
Affricates     [tʃ] (ਚ)   [tʃʰ] (ਛ)   [dʒ] (ਜ)   (ਝ)*
Nasals         [m] (ਮ)    [n] (ਨ)     [ɳ] (ਣ)
Laterals       [l] (ਲ)
Fricatives     [s] (ਸ)    [h] (ਹ)     [ʃ] (ਸ਼)¹   [z] (ਜ਼)²   [f] (ਫ਼)³   [x] (ਖ਼)⁴   [ɣ] (ਗ਼)⁵
Trill          [r] (ਰ)
Flap           [ɽ] (ੜ)
* These tonal characters are represented by the corresponding aspirated/un-aspirated and
voiced/unvoiced forms and are also marked with high rising tone /Ó/ and low rising tone /Ò/ on
top of the accompanying vowel.
¹ ² ³ ⁴ ⁵ Used only for words borrowed from Perso-Arabic and English (Bhatia, 1997).
Rules for characterization of tonal characters:
Word using tonal character (ਘ)     Represented by                   Type of tone
Word initial: ਘਰ = /kə̀r/           Un-aspirated and unvoiced /k/    Low tone /Ò/
Word medial: ਮਘਾਯਾ = /mgajà/       Un-aspirated and voiced /g/      Low tone /Ò/
Word final: ਮਾਘ = /mág/            Un-aspirated and voiced /g/      High tone /Ó/
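The positional rule illustrated above for ਘ can be written as a small lookup. The sketch below is an assumption-laden illustration: only the ਘ row is given in this paper, and the rows for the other four tonal characters (ਭ, ਧ, ਢ, ਝ) are extended by analogy from the consonant inventory, so they should be verified before use. The function name realize_tonal_letter is hypothetical.

```python
# Tonal letter -> (unvoiced unaspirated, voiced unaspirated) counterparts.
# Only the first row is taken from the rule table; the rest are assumed
# by analogy with the consonant inventory.
TONAL_LETTERS = {
    "\u0A18": ("k", "g"),     # ਘ (given in the rule table)
    "\u0A2D": ("p", "b"),     # ਭ (assumed)
    "\u0A27": ("t", "d"),     # ਧ (assumed)
    "\u0A22": ("ʈ", "ɖ"),     # ਢ (assumed)
    "\u0A1D": ("tʃ", "dʒ"),   # ਝ (assumed)
}


def realize_tonal_letter(letter, position):
    """Return (consonant, tone) for a tonal letter in a given word position."""
    unvoiced, voiced = TONAL_LETTERS[letter]
    if position == "initial":
        return unvoiced, "low"
    if position == "medial":
        return voiced, "low"
    if position == "final":
        return voiced, "high"
    raise ValueError("position must be 'initial', 'medial' or 'final'")


print(realize_tonal_letter("\u0A18", "initial"))  # ('k', 'low'), as in ਘਰ
print(realize_tonal_letter("\u0A18", "final"))    # ('g', 'high'), as in ਮਾਘ
```

The returned tone still has to be placed on the accompanying vowel of the transcription, which this sketch leaves to the caller.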
Rules for phonetic stress placement:
1. Stress falls on the final syllable in cases where the initial syllable has a syllable peak
with centralized vowels (ɪ, ə, u) and the final syllable has a consonant cluster with
centralized vowel peaks,
e.g. ਪਾਖੰਡੀ /pɑˈkʰandi/ Hypocrite
2. When the first syllable is long or heavy and the second is a closed one, the stress falls on
the first syllable,
e.g. ਗਾਜਰ /ˈgadʒər/ Carrot
3. In disyllabic words, the initial syllable is stressed if the final syllable is open,
e.g. ਮਾਲੀ /ˈmali/ Gardener
4. In trisyllabic words, the stress falls on the second syllable if it is long; otherwise it
falls on the first syllable,
e.g. ਚਮਕੀਲਾ /tʃəmˈkila/ Shining
(Bhatia, 1997)
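Rules 2 to 4 above lend themselves to a simple procedure once a word has been split into syllables. The following sketch is a rough illustration under several assumptions: syllabification is supplied by the caller, a syllable counts as long if it contains a long vowel (syllable weight from codas is ignored), rule 1 is not implemented because it needs consonant-cluster information, and the function names are hypothetical.

```python
SHORT_VOWELS = set("ɪəU")        # short vowels in this paper's notation
LONG_VOWELS = set("ieɛauoɔɑ")
VOWELS = SHORT_VOWELS | LONG_VOWELS


def is_open(syllable):
    """A syllable is open when it ends in a vowel."""
    return syllable[-1] in VOWELS


def is_long(syllable):
    """Approximation: a syllable is long when it contains a long vowel."""
    return any(ch in LONG_VOWELS for ch in syllable)


def stressed_syllable(syllables):
    """Return the 0-based index of the stressed syllable, or None if undecided."""
    if len(syllables) == 2:
        first, second = syllables
        if is_open(second):      # rule 3: open final syllable
            return 0
        if is_long(first):       # rule 2: long first, closed second syllable
            return 0
        return None
    if len(syllables) == 3:      # rule 4
        return 1 if is_long(syllables[1]) else 0
    return None


def add_stress_mark(syllables):
    """Join syllables, inserting the IPA primary stress mark where it applies."""
    index = stressed_syllable(syllables)
    return "".join(
        ("ˈ" + syl) if i == index else syl for i, syl in enumerate(syllables)
    )


print(add_stress_mark(["ga", "dʒər"]))        # ˈgadʒər
print(add_stress_mark(["ma", "li"]))          # ˈmali
print(add_stress_mark(["tʃəm", "ki", "la"]))  # tʃəmˈkila
```

Words for which none of the implemented rules applies are returned without a stress mark, so that they can be reviewed manually.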
Collection of data: The most frequently used words need to be covered in the PLS. The selection
of words also needs to be governed by the following phonetic considerations, so as to have
complete coverage of the phonetic specificities of the language:
• Words formed by all tonal characters in their initial/medial/final position
• Pairs of words with normal /a/ and prolated /a/ at the end, and different durations of short
vowels depending on their position in the word
• Homonyms, allophones, word variants
• Minimal pairs of tones and of short and long consonants
• Duplicate words, borrowed words, proper names, numbers in word form
Methodology of selection of data: A Punjabi tagged corpus of 100,000 sentences from the tourism
and health domains (TDIL Anglabharti Project) was used for selection of data, together with a
frequency count tool. A phonetically rich word list covering word variants, homonyms and minimal
pairs was also collected. 1000 words have been selected.
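The frequency-based selection step can be approximated with a few lines of Python. This is not the frequency count tool used in the project, only a minimal sketch assuming whitespace-separated tokens; the function name most_frequent_words and the sample sentences are illustrative.

```python
from collections import Counter

PUNCTUATION = "।?!,."  # danda and common punctuation to strip from tokens


def most_frequent_words(sentences, n):
    """Return the n most frequent word forms in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            word = token.strip(PUNCTUATION)
            if word:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]


sample = ["ਘਰ ਚਲਾ ਗਯਾ।", "ਘਰ ਚਲਾ", "ਘਰ"]
print(most_frequent_words(sample, 2))
```

The phonetically motivated additions (tonal characters in all positions, minimal pairs and so on) would then be merged into the frequency-ranked list by hand or by further filters.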
Recording of data: Recording of data by native speakers, including the author (a native speaker
of Punjabi), is in progress. Within one year, rough PLS data for the 10,000 most frequently
occurring words will be developed as per the criteria mentioned above. The data is recorded with
a dynamic microphone with a frequency response of 80 Hz-20 kHz in a speech studio (SNR>=-45dB),
in 16-bit PCM mono format at 48 kHz. The emotion of the informants was kept neutral. There were
two informants (one male, one female).
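The stated recording format (16-bit PCM, mono, 48 kHz) can be verified automatically for each recorded file. The sketch below uses Python's standard wave module and assumes the recordings are stored as WAV files, which the paper does not state; the function name matches_recording_spec is illustrative.

```python
import wave


def matches_recording_spec(path):
    """Check that a WAV file is 16-bit PCM, mono, sampled at 48 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16 bit
            and wav.getframerate() == 48000  # 48 kHz
        )
```

Running such a check over every file before annotation helps catch recordings made with the wrong studio settings.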
Annotation: Presently the data is raw. There is negligible prior work in Punjabi, so the raw
data is being compiled ab initio. The data will be annotated using an open-source
semi-automatic tool.
4. Challenges of Punjabi PLS:
Tonal Minimal Pairs: Of the 22 constitutionally recognized languages, only Punjabi and Sindhi
are tonal, which poses extra challenges in building a PLS. Distinguishing the mid tone from the
high or low tone in similar-sounding minimal pairs requires sound phonetic knowledge for
building an accurate PLS, e.g.
Tone        Word     Transcription   Meaning
Mid-tone    ਕੋੜਾ     /koɽa/          Whip
Low-tone    ਘੋੜਾ     /kòɽa/          Horse
Mid-tone    ਕੜੀ      /kəɽi/          A ring of a chain
Low-tone    ਘੜੀ      /kə̀ɽi/          A small pitcher or watch
Mid-tone    ਕੋੜਾ     /koɽa/          Whip
High-tone   ਕੋਹੜਾ    /kóɽa/          Leper
            ਸੌਂ       /sɔ̃/            Sleep
            ਸੌਂਹ      [transcription not recoverable]   Oath
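Tonal minimal pairs such as those above can be located automatically in a draft lexicon by comparing transcriptions with their tone marks removed. The sketch below is an illustration under the assumption that tones are written with combining acute and grave accents; the function names and the sample lexicon keyed by English glosses are hypothetical.

```python
import unicodedata
from collections import defaultdict

TONE_MARKS = {"\u0301", "\u0300"}  # combining acute (high), grave (low)


def strip_tone(transcription):
    """Remove tone diacritics while keeping other marks such as nasalization."""
    decomposed = unicodedata.normalize("NFD", transcription)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def tonal_minimal_sets(lexicon):
    """Group entries whose transcriptions differ only in tone.

    lexicon maps an entry label to its IPA transcription.
    """
    groups = defaultdict(list)
    for label, transcription in lexicon.items():
        groups[strip_tone(transcription)].append((label, transcription))
    result = []
    for members in groups.values():
        distinct = {unicodedata.normalize("NFC", t) for _, t in members}
        if len(distinct) > 1:  # skip plain homophones with identical tone
            result.append([label for label, _ in members])
    return result


lexicon = {
    "whip": "koɽa",
    "horse": "kòɽa",
    "leper": "kóɽa",
    "gardener": "mali",
}
print(tonal_minimal_sets(lexicon))  # [['whip', 'horse', 'leper']]
```

Entries flagged this way are the ones where a transcriber most needs native-speaker judgement.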
Stress: There are three gradations of short vowels according to their position of occurrence
(Singh H., 1991), and these result in distinctive phonemes for PLS work.
Vowel   Longest                     Normal                             Least length
/ɪ/     ਇੱਕ /ɪkk/, ਵਿੱਚ /vɪtʃ/      ਇਕਧਿਰ /ɪkdɪ̀r/, ਵਿਚਕਾਰ /vɪtʃkar/    ਇਕਾਈ /ɪkai/, ਵਿਚੋਲਾ /vɪtʃola/
/ə/     ਅੱਗੇ /əgge/, ਲੱਗ /ləgg/     ਅਗਲਾ /əgla/, ਲਗਨ /ləgn/            ਅਗੇਤਰਾ /əgetra/, ਲਗਾ /lga/
/U/     ਉੱਡ /Uɖɖ/, ਘੁੱਟ /kÙtt/      ਉਡਣਾ /Uɖɖɳa/, ਘੁਟਵਾਂ /kÙtvã/       ਉਡਾਈ /Uɖɖai/, ਘੁਟਾਈ /kÚtai/
Homographs:
ਚਲ    /tʃəl/    Flood
ਚਲ    /tʃəl/    Mind/Hindrance
Minimal pairs of short and long consonants:
ਦਿਲੀ     /dɪli/     Internal
ਦਿੱਲੀ    /dɪlli/    Delhi
ਚੁਨੀ     /tʃUni/    Selected
ਚੁੰਨੀ    /tʃŨni/    Dupatta
Allophones:
Complementary allophones, e.g.
Word initial    ਤਾਰ    /tar/
Word final      ਰਾਤ    /rat/
Contrastive allophones, e.g.
ਸੋਨਾ    /sona/    Gold
ਸੌਣਾ    /sɔɳa/    To sleep
ਗੋਲੀ    /goli/    Toffee
ਗ਼ੋਲੀ    /ɣoli/    Maid servant
ਜਰ     /dʒər/    To tolerate
ਜ਼ਰ     /zər/     Land
Homophones: Words having the same pronunciation but a different grammatical category or a
different meaning require POS information (Das Mandal, 2010) to resolve the meaning, e.g.
ਉੱਤਰ    Noun    Answer
ਉੱਤਰ    Adj     North direction
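PLS 1.0 offers an optional role attribute on the lexeme element, whose values are qualified names, and this is one way POS information could be attached to such entries. The sketch below is a hedged illustration: the namespace URI for the POS tag set, the tag names noun and verb, and the function name homograph_lexicon are all hypothetical, and the POS labels in the usage line are placeholders rather than an analysis from this paper.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
POS_NS = "http://www.example.com/punjabi-pos"  # hypothetical POS tag-set namespace


def homograph_lexicon(grapheme, readings, lang="pa"):
    """Serialize one spelling with several POS-specific lexemes.

    readings is a list of (pos_tag, ipa) pairs.
    """
    root = ET.Element("lexicon", {
        "version": "1.0",
        "xmlns": PLS_NS,
        "xmlns:pos": POS_NS,   # declares the prefix used in role values
        "alphabet": "ipa",
        "xml:lang": lang,
    })
    for pos_tag, ipa in readings:
        lexeme = ET.SubElement(root, "lexeme", {"role": "pos:" + pos_tag})
        ET.SubElement(lexeme, "grapheme").text = grapheme
        ET.SubElement(lexeme, "phoneme").text = ipa
    return ET.tostring(root, encoding="unicode")


# Placeholder POS tags for the homograph ਚਲ /tʃəl/ listed above.
role_xml = homograph_lexicon("ਚਲ", [("noun", "tʃəl"), ("verb", "tʃəl")])
print(role_xml)
```

A speech engine that receives POS tags from a tagger can then pick the lexeme whose role matches, which is the kind of disambiguation described above.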
Word Variants: Words having minor variations in spelling but the same meaning, e.g.
ਜਾਲਮ /dʒaləm/, ਜ਼ਾਲਮ /zaləm/: Tyrant
ਜਸਪਤ /dʒspət/, ਜਸਪਤਿ /dʒsptɪ/, ਜਸਪਤੀ /dʒspti/: Famous person
5. Conclusion:
The PLS is designed using a pronunciation mark-up language based on the W3C PLS Version 1.0
standard, which will allow an open and portable specification of pronunciation. The various
hierarchical tag structures for Punjabi will be defined keeping in mind all the phonetic
requirements brought out in this paper.
References
[1] W3C Recommendation (2008), Pronunciation Lexicon Specification Ver 1.0
[2] Gros, J.Z. (2006), SI-PRON Pronunciation Lexicon: a New Language Resource for Slovenian, Informatica
[3] Das Mandal, Shyamal, Chandra, Somnath, Lata, Swaran (2010), Use of Parts Of Speech (POS) and morphological information for resolving Multiple PLS Indian Languages - Bengali as a Case Study. USA: W3C Workshop on conversational applications
[4] Karamat, Nayyara (2010), Phonetic Inventory of Punjabi. Pakistan: Center for Research in Urdu Language Processing
[5] Singh, Atam (1993), Linguistics. Chandigarh: Punjab State University Text Book Board
[6] Singh, Harkirat (1991), Prominent features of Punjabi language. Patiala: Publication Bureau, Punjabi University
[7] Bhatia, T. K. (1997), Punjabi, A cognitive-descriptive grammar. London/New York: Routledge
[8] Gupta, B. Raj (1990), Indian Linguistics. New Delhi: Punjabi-Tamil Phonology
[9] Pandey, P. (2011), Phonetics & Phonology of Indian Languages.
[10] Joshi, S.S. (1973), Pitch and Related Phenomena in Punjabi. Patiala: Pakha Sanjam
[11] Ladefoged, Peter (1996), Acoustics Phonetics. Chicago: The Univ. of Chicago Press Books
[12] Gill, H.S. (1962), A descriptive grammar of Punjabi, PhD dissertation, Patiala: Panjabi Univ
[13] Haudricourt, A.G. (1971), Tones in Punjabi. Paris: C.N.R.S.
[14] Dulai, N. K. (1980), Punjabi Phonetic Reader. Mysore: CIIL
[15] Bahl, K.C. (1957), Indian Linguistics Vol. 17
[16] A Swedish Pronunciation Lexicon for TTS/ASR Development (2008), [email protected]://stts.se
[17] Oflazer, Kemal, The Architecture and the Implementation of a Finite State Pronunciation Lexicon for Turkish.