Pronunciation Lexicon Specification (PLS)
W3C Working Draft
Abstract
This document defines the syntax for specifying pronunciation lexicons to be used by Automatic Speech
Recognition and Speech Synthesis engines in voice browser applications. A pronunciation lexicon is a
collection of words or phrases together with their pronunciations specified using an appropriate
pronunciation alphabet.
Table of Contents
1. Introduction
2. PLS
2.1. Orthographic Requirement
2.2. Pronunciation Requirements
2.3. Pronunciation Alphabet Requirement
2.4. Lexicon Requirements
3. TTS
3.1. Standards used in TTS
4. SSML
5. SRGS
6. SAMPA
7. IPA
8. Validation criteria for spoken language resources
8.1. Overview
8.2. Documentation
8.2.1. Technical information
8.2.2. Database contents
8.2.3. Lexicon
8.3. Formal and technical criteria
8.4. Validation checks for Lexicon
8.5. Examples
8.5.1. LILA: Cellular Telephone Speech Databases from Asia
8.5.2. Annotated Speech Corpora Development in Indian Languages
9. Multiple Pronunciation for the same Orthography in Hindi
9.1. Multiple Orthographies
9.2. Homophones
9.3. Homographs
9.4. Pronunciation by Orthography (Acronyms, Abbreviations)
10. VoiceXML
10.1. VoiceXML Future
11. Issue regarding pronunciation of Hindi Language
12. Glossary
13. References
14. Acknowledgement
1. Introduction
The PLS specification is about the "Pronunciation Lexicon": how to pronounce words and phrases, how to deal with the variability of pronunciations by country, region, person, etc., and how to spell abbreviations and acronyms. The Pronunciation Lexicon Specification (PLS) is a W3C Recommendation designed to enable interoperable specification of pronunciation information for both speech recognition and speech synthesis engines within voice browsing applications. The language is intended to be easy for developers to use while supporting the accurate specification of pronunciation information for international use. It allows one or more pronunciations for a word or phrase to be specified using a standard pronunciation alphabet or, if necessary, vendor-specific alphabets. Pronunciations are grouped together into a PLS document, which may be referenced from other markup languages such as the Speech Recognition Grammar Specification (SRGS) and the Speech Synthesis Markup Language (SSML).
Here is an example PLS document:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
    http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
  alphabet="ipa" xml:lang="hi-IN">
  <lexeme>
    <grapheme>माहराष्ट्र</grapheme>
    <grapheme>माहाराष्ट्र</grapheme>
    <phoneme>महाराष्ट्र</phoneme>
  </lexeme>
</lexicon>
2. Pronunciation Lexicon Specification (PLS)
The pronunciation lexicon markup language will enable consistent, platform-independent control of pronunciations for use by voice browsing applications. This markup language should be sufficient to cover the requirements of speech recognition and speech synthesis systems within a voice browser. It will be an XML language and shall be interoperable with relevant W3C specifications. It should be easy and computationally efficient to automatically generate and process documents using the pronunciation lexicon markup language. All features of the pronunciation lexicon markup language should be implementable with existing, generally available technology. Anticipated capabilities should be considered to ensure future extensibility (but are not required to be covered in the specification). It should be easy to author, where appropriate deriving from existing pronunciation lexicon formats and using existing pronunciation alphabets. The Pronunciation Lexicon Specification will be usable in a large number of human languages.
2.1. Orthographic Requirement
Multi-word orthographies
The pronunciation lexicon markup must allow multi-word orthographies. This is particularly important for natural speech applications, where common phrases may have pronunciations that differ significantly from the concatenation of the individual word pronunciations, requiring a phrase-level pronunciation. An example would be "how about", often pronounced "how 'bout".
Alternate orthographies
The pronunciation lexicon markup must provide the ability to indicate an alternative equivalent form of
the orthography. This is required to cover the following situations:
• Regional spelling variations, e.g. "प्रशाद" and "प्रसाद" (mostly in North India)
• Free spelling variations, e.g. "हिंदी" and "हिन्दी"
• Alternate writing systems, e.g. the Oriya language
• Ancient vs. modern spellings, e.g. the Bangla language before and after the reform of the spelling system (Nagari-derived script and Bengali script)
Handling of orthographic textual variability
The pronunciation lexicon markup must provide a mechanism to indicate the allowable textual
variability in the orthography. Types of variability include, but are not limited to,
• Whitespace handling
• Case sensitivity
• Unicode sequence variation
• Valid character sets
• Diacritics within languages such as Arabic or Farsi
• Accent matching within languages such as French.
The definition of a standard text normalization scheme is beyond the scope of this specification.
Handling of homographs
The pronunciation lexicon markup may provide a mechanism to deal with the problem of specifying
homographs (words with the same spelling, but potentially different meanings and pronunciations),
within the same document.
2.2. Pronunciation Requirements
Single Pronunciations
The pronunciation lexicon markup must provide the ability to specify a single pronunciation for a given
lexicon entry as a sequence of symbols according to the pronunciation alphabet selected.
Multiple pronunciations
The pronunciation lexicon markup must support the ability to specify multiple pronunciations for a given
lexicon entry.
Dialect indication
The pronunciation lexicon markup may provide a mechanism for indicating the dialect or language
variation for each pronunciation.
Pronunciation preference
The pronunciation lexicon markup must enable indication of which pronunciation is the preferred form
for use by a speech synthesizer where there are multiple pronunciations for a lexicon entry. The
pronunciation lexicon markup must define the default selection behavior for the situations where there
are multiple pronunciations but no indicated preference.
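In the eventual PLS 1.0 language this requirement is addressed by the optional prefer attribute on <phoneme> (and <alias>); a minimal sketch, with illustrative transcriptions:

<lexeme>
  <grapheme>Newton</grapheme>
  <phoneme prefer="true">ˈnjuːtən</phoneme>
  <phoneme>ˈnuːtən</phoneme>
</lexeme>

When no pronunciation is marked as preferred, the default selection behavior required above applies.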
Pronunciation weighting
The pronunciation lexicon markup may allow for relative weightings to be applied to pronunciations.
Orthographic Specification of Pronunciation
The pronunciation lexicon markup should allow the specification of the pronunciation of orthography in
terms of other orthographies with previously defined pronunciations, for example, the pronunciation for
"W3C" specified as the concatenation of pronunciations of the words "double you three see".
2.3. Pronunciation Alphabet Requirement
Standard Pronunciation alphabets
We will standardize on at least one existing pronunciation alphabet, such as the phonetic alphabet defined by the International Phonetic Association (IPA). We do not plan to develop a new standard pronunciation alphabet.
Internationalization
The pronunciation alphabet must allow the specification of pronunciations for any language including
tonal languages.
Suprasegmental annotations
The pronunciation alphabet must provide a mechanism for indicating suprasegmental structure such as word/syllable boundaries and stress markings. The specification may address other types of suprasegmental structure.
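As an illustration (the word and transcription are examples only), IPA stress marks and syllable breaks can be carried directly in a transcription:

<lexeme>
  <grapheme>pronunciation</grapheme>
  <phoneme>prəˌnʌn.siˈeɪ.ʃən</phoneme>
</lexeme>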
Interoperability
The choice of pronunciation alphabet should take into account the requirements of interoperability
between platforms.
Vendor Specific Pronunciation Alphabets
The pronunciation lexicon markup must allow for vendor specific pronunciation alphabets to be used.
2.4. Lexicon Requirements
A pronunciation lexicon is a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine. However, the word "lexicon" can mean other things in other contexts; in its most general sense, a lexicon is merely a list of words or phrases, possibly containing information associated with and related to the items in the list.

Multiple entries per lexicon: The pronunciation lexicon markup must support the ability to specify multiple entries within a document, each entry containing orthographic and pronunciation information.

Multiple lexicons per document: The pronunciation lexicon markup may provide named groupings of lexicon entries within a single lexicon document. This may be useful for separating lexicons into application-specific classes of pronunciation, e.g. all city names.

Pronunciation alphabet per lexicon: The pronunciation lexicon markup must provide the ability to specify the pronunciation alphabet for use by all entries within a document, such as the phonetic alphabet defined by the International Phonetic Association (IPA).

Language identifier per lexicon: The pronunciation lexicon markup must provide the ability to specify language identifiers for use by all entries within a document. Each language identifier must be expressed following RFC 3066, Tags for the Identification of Languages.

Language identifier per lexicon entry: The pronunciation lexicon may support the ability to specify language identifiers for an individual entry within a document. Each language identifier must be expressed following RFC 3066.

Lexicon can import other lexicons: The pronunciation lexicon markup may support the ability to import other pronunciation lexicons written in the pronunciation lexicon markup.

Lexicon can import individual lexicon entries: The pronunciation lexicon markup may support the ability to import lexicon entries from other pronunciation lexicons.

Metadata information: The pronunciation lexicon markup should provide a mechanism for specifying metadata within pronunciation lexicon documents. This metadata can contain information about the document itself rather than the document content, for example the purpose of the lexicon document, the author, etc.
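A hedged sketch of how such metadata can be attached in PLS using the meta element (the name and content values below are illustrative only):

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="ipa" xml:lang="hi-IN">
  <meta name="description"
        content="City names for a Hindi directory-assistance application"/>
  <lexeme>
    <grapheme>मुंबई</grapheme>
    <phoneme>mʊmbəi</phoneme>
  </lexeme>
</lexicon>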
3. TTS (Text-to-speech) system
A text-to-speech system can be defined as the automatic production of speech through a grapheme-to-phoneme transcription of the sentences to utter. A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound.
Currently, many commercial TTS systems employ unit selection synthesis techniques and deliver highly
intelligible synthetic speech. A unit selection TTS system works well most of the time and can synthesize
very natural speech. Voice inventories of unit selection TTS are generally created by an automatic
labeling method for which the accuracy has a direct influence on the segmental quality of the TTS
system.
Analysis of Pronunciation Variants
For a unit selection TTS system having several hundred thousand labeled units in each speaker's voice inventory, it is impractical to check transcription variants without ASR techniques. To detect speaker-specific pronunciation variants, we use a hidden Markov model (HMM)-based phone recognizer. Even though we train the HMM models with speaker-dependent data, the accuracy of the phone recognizer is limited. The outputs of the phone recognizer are not reliable enough to use without manual checking. Therefore, we identified frequent mistakes of the phone recognizer and applied them as a filter to reduce the number of transcriptions to check. The transcriptions that our phone recognizer routinely mis-transcribes, but the speaker does not actually produce, are mostly caused by misinterpretations of co-articulation phenomena. These frequent mistranscriptions can be categorized into insertion, deletion, and substitution errors.

Potential applications of high-quality TTS systems are indeed numerous:
• Telecommunications services: TTS systems make it possible to access textual information over the telephone.
• Language education: High-quality TTS synthesis can be coupled with a Computer Aided Learning system and provide a helpful tool to learn a new language. To our knowledge, this has not been done yet, given the relatively poor quality available with commercial systems, as opposed to the critical requirements of such tasks.
• Aid to handicapped persons: Voice handicaps originate in mental or motor/sensation disorders. Blind people also widely benefit from TTS systems when coupled with Optical Character Recognition (OCR) systems, which give them access to written information.
• Talking books and toys: The toy market has already been touched by speech synthesis. Many speaking toys have appeared, under the impulse of the innovative 'Magic Spell' from Texas Instruments.
• Vocal monitoring: In some cases, oral information is more efficient than written messages. The appeal is stronger, while the attention may still focus on other visual sources of information. Hence the idea of incorporating speech synthesizers in measurement or control systems.
• Multimedia, man-machine communication: In the long run, the development of high-quality TTS systems is a necessary step (as is the enhancement of speech recognizers) towards more complete means of communication between humans and computers. Multimedia is a first but promising move in this direction.
• Fundamental and applied research: TTS synthesizers possess a very peculiar feature which makes them wonderful laboratory tools for linguists: they are completely under control, so that repeated experiments provide identical results (as is hardly the case with human beings).
4. SSML
The Speech Synthesis Markup Language (SSML) specification is a W3C markup language specification that defines directives in the form of XML tags that can be used with text-to-speech (TTS) synthesis systems to control different speech parameters (e.g. pronunciation, prosody) and also to provide additional information, such as language and metadata, for enhancing the quality of synthetic speech output in voice-based applications.

SSML is one of three types of markup language used to create voice-enabled functionality with Internet browsers and email programs. Sometimes used as a standalone approach, SSML is also sometimes used in tandem with the Spoken Text Markup Language (STML) and the Java Speech Markup Language (JSML). The ultimate goal of SSML is to provide applications that allow persons to use voice commands with various online tasks such as searching the Internet, receiving and responding to emails, and enjoying the content of various web sites.
The structure of an SSML document can be well understood with an example which is given below:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="hi-IN">
<lexicon uri="http://www.somelexiconfile.com/lexicon.file"/>
<voice gender="female">
<p>
<s>I speak <emphasis>Hindi</emphasis></s>
<s>I also speak <emphasis>Marathi</emphasis></s>
</p>
<sub alias="International Phonetic Association">IPA</sub>
</voice>
<audio src="royal.wav">
आपका का <emphasis>स्वागत</emphasis> है
</audio>
</speak>
There are certain aspects that would be useful in addition to the existing tags that are specified in the
current SSML version, especially in the context of Indian languages. The two tag extensions to the
existing SSML specification in the context of Indian languages are:
Transliteration tag - <transliterate>
The text input to most Indian language TTS systems is either an English transliteration of the Indian language script or is in Unicode. There is, however, no uniform transliteration scheme to represent the different Indian language scripts. Various Indian language TTS systems assume a particular input transliteration scheme; a popular scheme in use is the ITRANS package. It is therefore important to have a mechanism to specify whether the input text is in Unicode or has been transliterated using a particular transliteration scheme. We propose a <transliterate> tag that has two attributes: "codepage" and "uri". This allows the correct phone sequence to be generated for the input text stream by the speech synthesizer. An example of such a situation is given below.
<?xml version="1.0"?>
<speak version="1.0" xml:lang="hi-IN">
<transliterate codepage="1137">
मेरा नाम राम है । </transliterate>
</speak>
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
<transliterate codepage="1252" uri="http://www.example.com/trans.file">
mera nam ram hai </transliterate>
</speak>
In these instances of the <transliterate> tag, the codepage attribute specifies how the input text has to be interpreted. In the first example, the codepage attribute is set to the code page index for Hindi, which is 1137. This indicates that the text has to be interpreted using the Hindi character set and has not been transliterated. In the second example, because the input has already been transliterated to English, the codepage attribute is set to 1252 (the codepage for English). The mapping scheme that has been used for transliteration is indicated in the uri attribute. This tag is also useful because it would allow the speech synthesizer to use different text parsers depending on the various transliteration schemes that could be used.
Tag to specify lexicon for phrases or words - <foreign>
The current SSML specification has <lexicon> and <phoneme> tags that can be used to specify the pronunciation of words or phrases. The <lexicon> tag is used to reference external pronunciation dictionaries that are applicable to the entire document. In the case of the <phoneme> tag, the pronunciation for the word or short phrase has to be specified explicitly. We propose a tag that can be used to indicate that a certain word or phrase needs to be pronounced using a different pronunciation scheme without having to specify its exact phone sequence. In this case, the tag points to a lexicon which is different from the globally specified lexicon for the whole document. Such a tag would be helpful when dealing with foreign-language words or phrases embedded in text of a given language, or even in the case of loan words. We propose a <foreign> tag that has two attributes, "lang" and "uri".
<?xml version="1.0"?>
<speak version="1.0" xml:lang="hi-IN">
म�ने उसको <foreign lang="en-US"
8
uri="http://www.example.com/lex.file"> Good Morning </foreign> कहा और उसने अंग्रेजी �फ <foreign
lang="en-US" uri="http://www.example.com/lex.file"> “The Pirates of Caribbean”</foreign> को दे खने
क� इच्छा �दखलाई।
</speak>
In the above example, instances of the <foreign> tag have been used to indicate how an English word, "Good Morning" (a greeting), and an English phrase, "The Pirates of Caribbean" (the name of an English movie), have been used in a Hindi sentence without having to specify their exact pronunciations. External lexicons are specified for these phrases using the uri attribute of the tags.
5. SRGS
Speech Recognition Grammar Specification (SRGS) is the preferred markup for grammar syntax used in
speech recognition. SRGS has two forms: ABNF and XML. The ABNF is an augmented Backus-Naur Form
(BNF) grammar format, which has been modeled after JSGF. The XML syntax uses XML elements to
represent the grammar constructs. It is a W3C markup language standard for speech recognition, used
in the VoiceXML standard and other IVR / speech platforms. A speech recognition grammar is a set of
word patterns, and tells a speech recognition system what to expect a human to say. For instance, if you
call a voice directory application, it will prompt you for the name of the person you would like to talk
with. It will then start up a speech recognizer, giving it a speech recognition grammar. This grammar
contains the names of the people in the directory, and the various sentence patterns callers typically
respond with.
If the speech recognizer returned just a string containing the actual words spoken by the user, the voice application would have to do the tedious job of extracting the semantic meaning from those words. For this reason, SRGS grammars can be decorated with tag elements, which, when executed, build up the semantic result. SRGS does not specify the contents of the tag elements: this is done in a companion W3C standard, Semantic Interpretation for Speech Recognition (SISR). SISR is based on ECMAScript, and ECMAScript statements inside the SRGS tags build up an ECMAScript semantic result object that is easy for the voice application to process.
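As a sketch (the rule contents and result values are assumptions, not taken from the specifications), a grammar declaring tag-format="semantics/1.0" can attach ECMAScript to items so that recognizing a city fills the rule variable out:

<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="city" mode="voice" tag-format="semantics/1.0">
  <rule id="city" scope="public">
    <one-of>
      <item>Boston <tag>out = "BOS";</tag></item>
      <item>Miami <tag>out = "MIA";</tag></item>
    </one-of>
  </rule>
</grammar>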
Both SRGS and SISR are W3C Recommendations, the final stage of the W3C standards track. The
W3C VoiceXML standard, which defines how voice dialogs are specified, depends heavily on SRGS and
SISR.
Simple SRGS grammar:
<?xml version="1.0" encoding="ISO-8859-1"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="hi-IN" version="1.0" root="city_state" mode="voice">
<rule id="city" scope="public">
<one-of> <item>Boston</item>
<item>Miami</item>
<item>Fargo</item> </one-of>
</rule>
<rule id="state" scope="public">
9
<one-of> <item>Florida</item>
<item>North Dakota</item>
<item>Massachusetts</item> </one-of>
</rule>
<rule id="city_state" scope="public">
<ruleref uri="#city"/> <ruleref uri="#state"/>
</rule>
</grammar>
An SRGS example with PLS
The grammar allows different pronunciations of words to accommodate many different speakers. This is
a simple SRGS grammar that references an external Pronunciation Lexicon:
<?xml version="1.0" encoding="ISO-8859-1"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="en-US" version="1.0" root="city_state" mode="voice">
<lexicon uri= =“http://www.example.com/city_state.pls"/>
<rule id="city" scope="public">
<one-of> <item>Boston</item>
<item>Miami</item>
<item>Fargo</item> </one-of>
</rule>
<rule id="state" scope="public">
<one-of> <item>Florida</item>
<item>North Dakota</item>
<item>Massachusetts</item> </one-of>
</rule>
<rule id="city_state" scope="public">
<ruleref uri="#city"/> <ruleref uri="#state"/>
</rule>
</grammar>
The primary use of a speech recognizer grammar is to permit a speech application to indicate to a recognizer what it should listen for, specifically:
• words that may be spoken,
• patterns in which those words may occur,
• the spoken language of each word.
Speech recognizers may also support the Stochastic Language Models (N-Gram) Specification [NGRAM]. Both specifications define ways to set up a speech recognizer to detect spoken input, but they define the words and patterns of words by different and complementary means. Some recognizers permit cross-references between grammars in the two formats. The rule reference element of this specification describes how to reference an N-gram document.
6. SAMPA (Speech Assessment Methods Phonetic Alphabet)
The Speech Assessment Methods Phonetic Alphabet (SAMPA) is a computer-readable phonetic script
using 7-bit printable ASCII characters, based on the International Phonetic Alphabet (IPA).
It was originally developed in the late 1980s for six European languages by the EEC ESPRIT information
technology research and development program. As many symbols as possible have been taken over
from the IPA; where this is not possible, other signs that are available are used, e.g. [@]
for schwa (IPA [ə]), [2] for the vowel sound found in French deux (IPA [ø]), and [9] for the vowel sound
found in French neuf (IPA [œ]). Today, officially, SAMPA has been developed for all the sounds of the
following languages:
Arabic, Bulgarian, Cantonese, Croatian (Serbo-Croatian), Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Scots, Slovak, Spanish, Swedish, Thai and Turkish.
The characters ["s{mp@] represent the pronunciation of the name SAMPA in English. Like IPA, SAMPA is
usually enclosed in square brackets or slashes, which are not part of the alphabet proper and merely
signify that it is phonetic as opposed to regular text.
Features of SAMPA
SAMPA was developed in the late 1980s in the European Commission funded ESPRIT project 2589
"Speech Assessment Methods" (SAM), hence "SAM Phonetic Alphabet" in order to facilitate email data
exchange and computational processing of transcriptions in phonetics and speech technology.
SAMPA is a partial encoding of the IPA. The first version of SAMPA was the union of the sets of phoneme
codes for Danish, Dutch, English, French, German and Italian; later versions extended SAMPA to cover
other European languages. Since SAMPA is based on phoneme inventories, each SAMPA table is valid
only in the language it was created for. In order to make this IPA encoding technique universally
applicable, X-SAMPA was created, which provides one single table without language-specific differences.
[Table: SAMPA, IPA and Hindi (Devanagari) equivalents for vowels, consonants, and nasals, fricatives, semi-vowels, laterals, trills and flaps, including nasalized vowels and nukta consonants.]
The difference between SAMPA and IPA is shown above with Hindi equivalents. SAMPA was devised as a hack to work around the inability of text encodings to represent IPA symbols. Consequently, as Unicode support for IPA symbols becomes more widespread, the necessity for a separate, computer-readable system for representing the IPA in ASCII decreases. However, text input still relies on specific keyboard encodings or input devices. For this reason, SAMPA and X-SAMPA are still widely used in computational phonetics and in speech technology.

SAMPA (Speech Assessment Methods Phonetic Alphabet) is a widely accepted scheme for encoding the IPA into ASCII. For representing tonal variations, SAMPROSA notation is used. The chart above compares SAMPA and IPA and also lists the Hindi sounds with both SAMPA and IPA.
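For instance, a single Hindi entry could carry both encodings; in this sketch the transcriptions are illustrative and the alphabet name "x-sampa" is an assumption, since PLS standardizes only "ipa" and treats other alphabet values as vendor-specific:

<lexeme>
  <grapheme>कमल</grapheme>
  <phoneme alphabet="ipa">kəməl</phoneme>
  <phoneme alphabet="x-sampa">k@m@l</phoneme>
</lexeme>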
7. IPA- International Phonetic Alphabet
Phonetic alphabets are used to describe the pronunciation of a word or phrase. An alphabet contains symbols to represent speech sounds, just as in a dictionary, e.g.

Cracked /krakt/ adj. 1. Having cracks. 2. (predic.) slang crazy

IPA can be referred to as a "universally agreed system of notation for the sounds of languages". The IPA was created by the International Phonetic Association (active since 1896), a collaborative effort by the major phoneticians around the world. The International Phonetic Alphabet is largely used by phoneticians and in dictionaries and phonetic resources. The W3C chose to normatively reference IPA in the SSML and PLS specifications. IPA is used in many dictionaries and by phoneticians for broad and narrow transcriptions. It describes the phonemes that cover all the world's languages: consonants, vowels, other symbols, diacritics, suprasegmentals, tones and word accents.
IPA Full Chart
Consonants (some)
All these are possible pulmonic consonants; the columns are "places of articulation", the rows are "manner of articulation", and the gray areas are considered to be impossible to articulate.
Vowels
A vowel is a speech sound created by the relatively free passage of breath through the larynx and oral cavity, usually forming the most prominent and central sound of a syllable. Vowels are distinguished on the basis of "height" and "backness". The IPA vowel diagram reflects the place of articulation of the vowels.
Diacritics
Diacritics are small marks that can be added to a symbol to modify its value and are used to differentiate allophones of a phoneme. They are very important for narrow transcriptions, which show more phonetic detail.
Suprasegmentals
These are aspects of speech that involve more than single phonemes; the principal features are stress, length, tone and intonation.
Tones and Word Accents
Pitch variations that affect the meaning of a word, e.g. /ma/ in Mandarin Chinese may mean "mother", "hemp", "horse", or "scold", depending on whether the tone is high level, rising, low dipping, or falling.
8. Validation Criteria for Spoken Language Resources (SLR)
8.1. Overview
Validation is understood as the quality evaluation of a database against a checklist of relevant criteria. SLR validation can be performed in two fundamentally different ways: (a) the feasibility of evaluation and the criteria to be employed for such an evaluation are already taken into account during the definition of the specifications; or (b) an SLR is created first, and the validation criteria and procedure are defined afterwards. Furthermore, validation can be done in house (internal validation) or by another organization (external validation).
Internal validation, during production: Each database producer should safeguard the database quality during the collection and processing of the data in order to ascertain that the specifications are met.
Internal validation, after production: A final check should be an obvious, ideally superfluous, part of this procedure. In principle, this is the way in which the Linguistic Data Consortium (LDC) operates.
External validation, during production: An external organization can be contracted to carry out the validation of an SLR. In that case the best approach is that this organization is closely involved in the definition of the specifications and performs quality checks for all phases of the production process.
External validation, after production: Final check after database completion.
8.2. Documentation
Documentation addresses the written design specification that should accompany every database; this includes issues related to the number of speakers, selection of speakers, recording conditions, etc. The documentation should include an explanation and motivation of the decisions that were made in designing and building the database. For obvious reasons, the documentation cannot be evaluated in an automatic way.
8.2.1. Technical information
The minimum documentation must comprise sufficient technical information about the contents and structure of the database to allow ELRA and its customers efficient and effective access to the data. This part of the documentation must include:
• layout of the CD-ROMs, DVDs or tapes
• file nomenclature and directory structure
• formats of the signal files and of the label files
• coding
• compression
• sampling frequency
• number of bits per sample
• multiplexed signals
8.2.2. Database contents
The documentation should clearly describe the purpose with which the resource was collected and the types of speech material recorded (e.g. multi-party conversations, human-human dialogues, human-machine dialogues, read sentences, connected and/or isolated digits, isolated words, etc.).
The database contains the following information:

Linguistic contents
• A specification of the individual items of the prompting material.
• Specification (and motivation) for the sheet design (e.g. how items were spread over the sheet to prevent list effects).
• In the case of text prompting, an example prompting sheet should be provided.

Speaker information
• Speaker recruitment strategies
• Number of speakers
• Distribution of speakers over the categories of sex, age, and dialect regions, including a reasoned description of the regional pronunciation variants that are distinguished.

Recording platforms
• Recording platform and telephone link description (analogue, digital)
• Network from which the call originated
• Environment in which the caller was speaking (quiet office, pay phone in public location, etc.)
• Handsets
8.2.3. Lexicon
Each database should be accompanied by a lexicon comprising all words occurring in the annotations.
The description in the documentation should include:
• The format of the lexicon
• Procedures used to obtain phonemic forms from the orthographic input
• Symbols in the transcriptions used as delimiters to obtain the lexicon entries
• An explanation of or reference to the phoneme set used
• Phonological or higher order phenomena accounted for in the phonemic transcriptions
• Case sensitivity of entries (matching the transcriptions)
8.3. Formal and technical criteria
Formal and technical criteria address issues like the medium on which a database is delivered, the structure of the directory trees, the format of the speech files and of the annotation files, contents lists, speaker tables, etc. Formal and technical criteria are by definition language independent and most are amenable to (semi-)automatic checks.
8.4. Validation checks for Lexicon
For the lexicon the following checks are carried out:
• The entries should be taken from the orthographic transcriptions.
• A list of delimiters used to generate the orthographic entries in the lexicon must be provided. Preferably, words are split by spaces only, not by apostrophes, and not by hyphens.
• Each entry should have at least one phonemic transcription.
• The lexicon should be complete. A check is carried out on the orthographic transcriptions in the label files in order to find out if the lexicon is under-complete or over-complete. Under-completeness of the lexicon is not acceptable, whereas over-completeness is not problematic.
• Words which only occur with a distortion marker may not appear in the lexicon.
• The orthographic lexicon entries should exactly match the transcriptions.
• Frequency information is optional. Alternative transcriptions are also optional, unless the design specifications of the SLR say otherwise.
• The entries should be alphabetically ordered.
• Optional information that may be present in the phonemic transcriptions includes stress and word/morphological/syllabic boundaries.
The lexicon validation is focused on the format of the lexicon table only; the lexicon contents (i.e. the
correctness of the phonemic transcriptions) are not validated.
Changes in the orthographic transcriptions directly affect the lexicon. It is, however, impossible for the user to regenerate the lexicon after (new) words are added to the transcriptions. A software tool for this makes little sense, since it is impossible for the owner to ship a database containing all words and their phone transcriptions. Therefore, this regeneration can only be done by the database producer, and quick and efficient ways should be developed to update deficiencies in the lexicon. (This is different from the updating of the contents list, which can be carried out directly by the database user if the appropriate software is available.)
8.5. Examples in Context of Indian Languages
8.5.1. LILA: Cellular Telephone Speech Databases from Asia
In this project, the main focus is on two aspects of the Hindi language: Hindi as a first language and Hindi as a second language. Apart from this, Indian English is also considered.
Hindi as First Language (Hindi L1): Hindi is spoken as first language by about 420 million
people in India. Most native speakers reside in North-Central India, i.e. the states of Delhi, Uttar
Pradesh, Haryana, Chhattisgarh, Madhya Pradesh, Rajasthan, Bihar, Himachal Pradesh and Uttaranchal.
The LILA Hindi L1 database comprises 2000 speakers from North-Central India. There are 5 main dialects
of Hindi: Western Hindi, Eastern Hindi, Rajasthani, Bihari and the Pahari. Supervisors for each region
were hired to recruit speakers. In the LILA Hindi L1 database, spelling items are read out slowly instead
of being spelled out as the notion of spelling does not apply to Hindi. Also, the frequency of names for
weekdays differs from the specifications since in Hindi there are a total of nine possible names for the
seven weekdays. Orthographic transcriptions are written in Devanagari script. The Romanized
transcription is in a modified form of INSROT where the vowels have been separated into inherent and
non-inherent. Also, a SAMPA set was developed for Hindi to provide the phonetic representations for
the lexicon.
Hindi as Second Language (Hindi L2)
Hindi is spoken as a second language in India by approximately 160 million speakers. Hindi L2 speakers are classified according to their native language into ten dialects: Tamil, Gujarati, Telugu, Malayalam, Kannada, Bengali, Assamese, Punjabi, Marathi and Urdu. Speakers were selected based on Hindi L2 fluency, having studied Hindi as a language up to the secondary level. Although Hindi as a second language has quite a regular spelling/pronunciation correspondence, there are some variations that occur as a result of the influence of the speakers' first language. There is little standardisation in the use of nuktas on these particular consonants, so where appropriate in the lexicon, (ph) and (j), which would normally have the pronunciations /p_h/ and /dZ/ respectively, may also have variation coded in the form of /f/ and /z/. As there is little standardisation in the spelling of Hindi words within India, considerable work was done to ensure consistent and standard spellings. In particular, the use of diacritics, chandrabindu, anusvara, and word-final halant can vary, and internal rules were developed to ensure that spellings correctly reflecting the pronunciation were used.
Indian English
Ten dialect regions were identified based on the native language of the speaker: Hindi/Urdu, Tamil, Gujarati, Telugu, Malayalam, Kannada, Bengali, Assamese, Punjabi and Marathi. To accurately represent the pronunciation of these words, a foreign lexicon was used. The phone set for the foreign lexicon was similar to the Hindi phone set.
8.5.2. Annotated Speech Corpora Development in Indian Languages
Corpora of written text exist in most of the Indian languages; unfortunately, none of these exist in spoken form. The major issues in building such speech corpora are the selection of dialect, content, informants and recording environment. Proper annotation and marking of linguistic and phonetic units are necessary. The selected dialect should adequately cover the speaking patterns of native informants, both normal and under different emotions, in different types of sentences and discourses.
Text corpora in the major Indian languages are of little significance for speech research and technology, because the spoken language is usually at large variance with the written one. The spoken lexicon may differ widely from the grapheme lexicon, as may the syntax and grammar. Furthermore, the acoustic parameters of spoken units of sound and the prosodic structure of each language, even if one restricts oneself to the standard dialect, need accurate identification. As all of these are rather nebulous in nature, their statistical characterization requires large-scale data analysis. It will be impossible to derive the requisite knowledge base for speech technology without a well-structured database. It is, therefore, imperative to build standard speech corpora (databases) for all major Indian languages.
9. Multiple Pronunciation for the same Orthography in Hindi
For ASR systems it is common to rely on multiple pronunciations of the same word or phrase in order to
cope with variations of pronunciation within a language. In the Pronunciation Lexicon language, multiple
pronunciations are represented by more than one <phoneme> (or <alias>) element within the same
<lexeme> element. In the following example the word "Newton" has two possible pronunciations.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-GB">
<lexeme>
<grapheme>Newton</grapheme>
<phoneme>njutən</phoneme>
<phoneme>nutən</phoneme>
</lexeme>
</lexicon>
9.1. Multiple orthographies
In some situations there are alternative textual representations for the same word or phrase. This can
arise due to a number of reasons. Because these are representations that have the same meaning (as
opposed to homophones), it is recommended that they be represented using a single <lexeme> element
that contains multiple graphemes. Here are two simple examples of multiple orthographies: alternative
spelling of an English word and multiple writings of a Japanese word.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>colour</grapheme>
<grapheme>color</grapheme>
<phoneme>kʌlər</phoneme>
</lexeme>
</lexicon>
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="jp">
<!-- Japanese entry showing how multiple writing systems are handled
romaji, kanji and hiragana orthographies -->
<lexeme>
<grapheme>nihongo</grapheme>
<grapheme>日本語</grapheme>
<grapheme>にほんご</grapheme>
<phoneme>ɲihoŋo</phoneme>
</lexeme>
</lexicon>
9.2. Homophones
Most languages have homophones, words with the same pronunciation but different meanings (and
possibly different spellings), for instance "बली" (बलवान) and "बलि" (बलिदान). It is recommended that these be represented as different lexemes.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang=" hi-IN">
<lexeme>
<grapheme>ब�ल</grapheme>
<phoneme>बल�</phoneme>
</lexeme>
<lexeme>
<grapheme>बल� </grapheme>
<phoneme>बल�</phoneme>
</lexeme>
</lexicon>
9.3. Homographs
Most languages have words with different meanings but the same spelling (and sometimes different
pronunciations), called homographs. For example, in Hindi the word कनक (धतूरा) and the word कनक (सोना) have identical spellings but different meanings. Although it is recommended that these words be represented using separate <lexeme> elements that are distinguished by different values of the role attribute, if a pronunciation lexicon author does not want to distinguish between the two words, they could simply be represented as alternative pronunciations within the same <lexeme> element. In the latter case the TTS processor will not be able to distinguish when to apply the first or the second transcription. In this example the pronunciations of the homograph "कनक" are shown.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="hi-IN">
<lexeme>
<grapheme>कनक</grapheme>
<phoneme>कनक</phoneme>
<phoneme>कनक</phoneme>
</lexeme>
</lexicon>
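A hedged sketch of the role-based alternative, shown here with the English homograph "read" because its two readings differ in pronunciation; the mypos prefix, its URI and the role values are assumptions for illustration only:

<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:mypos="http://www.example.org/pos"
  alphabet="ipa" xml:lang="en-US">
  <lexeme role="mypos:present">
    <grapheme>read</grapheme>
    <phoneme>riːd</phoneme>
  </lexeme>
  <lexeme role="mypos:past">
    <grapheme>read</grapheme>
    <phoneme>rɛd</phoneme>
  </lexeme>
</lexicon>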
9.4. Pronunciation by Orthography (Acronyms, Abbreviations, etc.)
For some words and phrases pronunciation can be expressed quickly and conveniently as a sequence of
other orthographies. The developer is not required to have linguistic knowledge, but instead makes use
of the pronunciations that are already expected to be available. To express pronunciations using other
orthographies the <alias> element may be used. This feature may be very useful to deal with acronym
expansion.
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="ipa" xml:lang="hi-IN">
<!-- Acronym expansion -->
<lexeme>
<grapheme>रु</grapheme>
<alias>रुपए</alias>
</lexeme>
<!-- number representation -->
<lexeme>
<grapheme>101</grapheme>
<alias>एक सौ एक</alias>
</lexeme>
<!-- crude pronunciation mechanism and acronym expansion -->
<lexeme>
<grapheme>BBC 1</grapheme>
<alias>बी बी सी एक</alias>
</lexeme>
</lexicon>
10. VoiceXML
VXML, or VoiceXML, technology allows a user to interact with the Internet through voice-recognition
technology. It is the W3C's standard XML format for specifying interactive voice dialogues between a
human and a computer. It allows voice applications to be developed and deployed in an analogous way
to HTML for visual applications. Just as HTML documents are interpreted by a visual web browser,
VoiceXML documents are interpreted by a voice browser. A common architecture is to deploy banks of
voice browsers attached to the Public Switched Telephone Network (PSTN) so that users can use a
telephone to interact with voice applications. Using VXML, the user interacts with the voice browser by listening to audio output that is either pre-recorded or computer-synthesized, and by submitting audio input through the user's natural speaking voice or through a keypad, such as a telephone keypad.
[Figure: VoiceXML architecture. Web server(s) deliver HTML, scripts, images/audio/video and VoiceXML documents; a VoiceXML gateway hosting a voice browser fetches the VoiceXML, audio and grammars and connects callers to the application.]
VoiceXML 1.0 was published by the VoiceXML Forum, a consortium of over 500 companies, in March
2000. The Forum then gave control of the standard to the World Wide Web Consortium (W3C), and now
concentrates on conformance, education, and marketing. The W3C has just published VoiceXML 2.0 as a
Candidate Recommendation. Products based on VoiceXML 2.0 are already widely available.
Many commercial VoiceXML applications have been deployed, processing millions of telephone calls per
day. These applications include: order inquiry, package tracking, driving directions, emergency
notification, wake-up, flight tracking, voice access to email, customer relationship management,
prescription refilling, audio newsmagazines, voice dialing, real-estate information and national directory
assistance applications.
VoiceXML has tags that instruct the voice browser to provide speech synthesis, automatic speech
recognition, dialog management, and audio playback. VoiceXML is designed for creating audio dialogs
that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of
spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of
Web-based development and content delivery to interactive voice response applications. The following
is an example of a VoiceXML document:
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
<form>
<block>
21
<prompt>
...........स्वागत.............
</prompt>
</block>
</form>
</vxml>
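The document above only plays a prompt. The sketch below hints at how a field paired with an SRGS grammar could also collect spoken input; the grammar URL, field name and Hindi prompts are assumptions for illustration:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="city">
      <prompt>कृपया शहर का नाम बोलिए</prompt>
      <grammar src="http://www.example.com/city.grxml"
               type="application/srgs+xml"/>
      <filled>
        <prompt>आपने <value expr="city"/> कहा</prompt>
      </filled>
    </field>
  </form>
</vxml>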
10.1. VoiceXML Future
Beginning in 2003, the W3C's Voice Browser Working Group will start work on VoiceXML 3.0. Some
suggestions that were too large to incorporate in 2.0 will be addressed, as well as other new extensions.
There will also likely be changes to VoiceXML to support new multimodal markup standards. The conceptually cleanest approaches to multimodality use XHTML as a container for mode-specific markup (XHTML for visual, VoiceXML for voice, InkXML for ink, etc.), and then define how the modes interact using XML Events. As part of this effort, a modularization of VoiceXML would be defined such that one subset could be used for multimodal markup. Finally, voice, and therefore VoiceXML, is important for web devices other than the phone.
11. Issue regarding pronunciation of Hindi Language
• Second language speakers - nukta consonants: Some of the nukta consonants are pronounced in one way only, e.g. ड़ and ढ़ will always be produced as /r`/ and /r`_h/, respectively. Others, such as क़ (k.), ख़ (kh.), and ग़ (g.), may vary in pronunciation depending on the first language of the speaker: those familiar with Arabic-influenced languages will produce /q/, /x/ and /G/, while others not familiar with these languages will produce /k/, /k_h/ and /g/, hence the decision to put these in different lexicons (as discussed above). The nukta consonants फ़ (ph.) and ज़ (j.) are primarily pronounced as /f/ and /z/ by most speakers of Hindi as a second language and can be considered native sounds; however, there is also a proportion of speakers who use /p_h/ and /dZ/ respectively for these. Therefore these have been included as dispreferred variants in the main lexicon.
• Spoken Hindi does not match written/textbook Hindi. When speaking, यह yah and ये ye are pronounced the same, as are वह vah and वे ve: यह yah and ये ye are both pronounced ye, while वह vah and वे ve are both pronounced vo. Hindi is mostly spoken as spelled, but a common exception is that an "a" before an "h" is usually pronounced like an "e". Similarly, पर is often spoken as पे.
• The general belief amongst people is that there is exact conformity between the spoken and written forms of words in Hindi, and that words are written exactly as they are spoken and spoken precisely as they are written. The truth is, however, remarkably different. 'नाना', for example, is pronounced as 'नाँनाँ', 'उपन्यास' as 'उपन्याँस', 'बहन' as 'बैहन', and so on. It is because of this gulf between the spellings and pronunciations that we have tried to indicate the actual pronunciations of all the main entries.
• राजमहल / राजमेहल
• अँगुली / उँगली
• करोड़ / करोड
• रुपये / रुपए
12. Glossary
13. References
14. Acknowledgement