Hindi Noun Inflection and Distributed Morphology

Hindi Noun Inflection and Distributed
Morphology
Smriti Singh
Vaijayanthi M Sarma
Indian Institute of Technology Bombay
Indian Institute of Technology Bombay
Proceedings of the 17th International Conference on
Head-Driven Phrase Structure Grammar
Université Paris Diderot, Paris 7, France
Stefan Müller (Editor)
2010
CSLI Publications
pages 307–321
http://csli-publications.stanford.edu/HPSG/2010
Singh, Smriti, & Sarma, Vaijayanthi M. (2010). Hindi Noun Inflection and Distributed Morphology. In Stefan Müller (Ed.): Proceedings of the 17th International
Conference on Head-Driven Phrase Structure Grammar, Université Paris Diderot,
Paris 7, France (pp. 307–321). Stanford, CA: CSLI Publications.
1
Introduction
This paper1 primarily presents an analysis of nominal inflection in Hindi
within the framework of Distributed Morphology (Halle & Marantz 1993,
1994 and Harley and Noyer 1999). Müller (2002, 2003, 2004) for German,
Icelandic and Russian nouns respectively and Weisser (2006) for Croatian
nouns have also used Distributed Morphology (henceforth DM) to analyze
nominal inflectional morphology. This paper will discuss in detail the
inflectional categories and inflectional classes, the morphological processes
operating at syntax, the distribution of vocabulary items and the readjustment
rules required to describe Hindi nominal inflection. Earlier studies on Hindi
inflectional morphology (Guru 1920, Vajpeyi 1958, Upreti 1964, etc.) were
greatly influenced by the Paninian tradition (classical Sanskrit model) and
work with Paninian constructs such as root and stem. They only provide
descriptive studies of Hindi nouns and verbs and their inflections without
discussing the role or status of affixes that take part in inflection. The
discussion on the mechanisms (morphological operations and rules) used to
analyze or generate word forms are missing in these studies. In addition,
these studies do not account for syntax-morphology or morphologyphonology mismatches that show up in word formation. One aim of this
paper is to present an economical way of forming noun classes in Hindi as
compared to other traditional methods, especially gender and stem ending
based or paradigm based methods that give rise to a large number of
inflectional paradigms. Using inflectional class information to analyse the
various forms of Hindi nouns, we can reduce the number of affixes and wordgeneration and readjustment rules that are required to describe nominal
inflection. The analysis also helps us in developing a morphological analyzer
for Hindi. The small set of rules and fewer inflectional classes are of great
help to lexicographers and system developers. To the best of our knowledge,
the analysis of Hindi inflectional morphology based on DM and its
implementation in a Hindi morphological analyzer has not been done before.
The methods discussed here can be applied to other Indian languages for
analysis as well as word generation.
1
Acknowledgements
We thank the anonymous conference reviewers for their reviews. In this revised
submission, we have tried to incorporate their suggestions and answer their
questions. We also thank P. Bhattacharyya and O. Damani (IIT Bombay) for their
input and support and Nikhilesh Sharma who helped us implement the Hindi
morphological analyzer.
308
308
2
Inflection in Hindi Nouns
Hindi nouns show morphological marking only for number and case. Number
can be either singular or plural and can be represented as a binary valued
feature [±pl]. Singular [-pl] is the default value for number which is
morphologically unexpressed while plural or the non-default value [+pl] may
be phonologically realized. Case marking on Hindi nouns is either direct or
oblique. Marked (oblique) nouns show cumulative exponence for case and
number, e.g., -e in lǝɽk-e (boy-oblique) and -õ in rājā-õ (kings-oblique) for
singular-oblique and plural-oblique respectively. Gender, an inherent, lexical
property of Hindi nouns(masculine or feminine) is not morphologically
marked, but is realized via agreement with adjectives and verbs. We must
point out that (1) a few nouns may be in either gender given the context, e.g.,
dost or mitr (friend) and that (2) natural sex distinction in humans lǝɽkā-lǝɽkī
(boy-girl), bəccā-bəccī (baby-boy and baby-girl), in a few animals ghoɽāghoɽī (horse-mare) and some kinship terms dādā-dādī (paternal grandpagrandma), māmā-māmī (maternal uncle-aunt) are marked using specific stem
endings, i.e., feminine nouns tend to end in vowel /ī/ while masculine nouns
tend to end in /ā/. This is, however, not generally the case, for example pānī
(water) is masculine and mālā (garland) is feminine.
In the following tables we show the inflections selected by Hindi nouns.
Table 1 shows that Hindi feminine nouns of inflection Type 1 are marked
null for all number-case values. Type 2 and Type 3 nouns inflect only in the
plural for both case values. Table 2 shows the inflection for masculine Hindi
nouns. Inflection is seen again in Type 2 and 3 nouns in the plural for both
case values and in the singular for only Type 2 nouns in the oblique.
Table 1: Types of Inflections for Hindi Feminine Nouns
Singular
Examples
Type 1
Direct
Oblique
null
null
āg ‘fire’, āg, pyās
pyās
‘thirst’
Type 2
Direct
Oblique
null
null
nədī
nədī,
‘river’,
šǝkti
šǝkti
‘power’
Type 3
Direct Oblique
null
null
lətā
lətā, rāt
‘vine’,
rāt
‘night’
Plural
Examples
null
āg, pyās
-yā̃
nədi-yā̃ ,
šǝkti-yā̃
-ẽ
lətā-ẽ,
rāt-ẽ
null
āg, pyās
309
309
-yõ
nədi-yõ,
šǝkti-yõ
-õ
lətā-õ,
rāt-õ
Table 2: Types of Inflections for Hindi Masculine Nouns
Singular
Example
Plural
Example
3
Type 1
Direct
Oblique
null
null
krodh
krodh,
‘anger’,
pyār
pyār
‘love’
null
krodh,
pyār
null
krodh,
pyār
Type 2
Direct Oblique
null
-e
lǝɽkā
lǝɽk-e,
‘boy’,
bǝcc-e
bǝccā
‘baby’
-e
lǝɽk-e,
bǝcc-e
-õ
lǝɽk-õ,
bǝcc-õ
Type 3
Direct Oblique
null
null
ādmī
ādmī,
‘man’,
ghər
ghər
‘home’
null
ādmi
ghər
-õ/-yõ
ādmi-yõ,
ghər-õ
Noun Classification Systems for Hindi in the Literature
Traditional classification (from the Paninian perspective) of Hindi nouns is
based on gender and stem endings. This system does not allow two nouns of
different genders or different stem endings to be in one class. With two
genders and around nine stem endings (ā, ī, i, ū, u, o, O/au, yā and
consonant), we get at least eighteen classes. In addition, nouns that have one
of these stem endings but take null for all case-number values are put into
different inflectional classes. This results in a large number of nominal
classes (approximately thirty) that display similar inflectional behaviour.
Many readjustment rules are also required to explain the phonological
changes in the inflected forms. Table 3 provides one example of nouns
placed in different classes because of different stem endings even though
they take similar inflectional markers and belong to the same gender.
Table 3: Hindi Feminine Nouns Taking Similar Inflections
consonant
ending
ā
ending
Noun
rāt
‘night’
mātā
‘mother’
Pl-dir
rāt-ẽ
mātā-ẽ
Pl-obl
rāt-õ
mātā-õ
ū
ending
bǝhū
‘daughter-in
-law’
u
ending
au
ending
ritu
‘season’
lau
‘flame’
bəhu-ẽ
ritu-ẽ
lau-ẽ
bəhu-õ
ritu-õ
lau-õ
Kachru (2006) categorizes Hindi nouns into five declension types as given in
Table 4 below. This classification is based on how Hindi nouns decline for
gender, number and case. The classification criteria, however, are not clear.
310
310
Each class includes both masculine and feminine nouns. The last three
declensions include nouns with identical stem endings i, ū and consonant
respectively while the first two do not, i.e., the masculine nouns in the first
declension are ā ending while feminine nouns are ī ending and the second
declension has ī ending masculine nouns and ā ending feminine nouns.
Further, rules that describe affix insertion, stem alternation/modification are
also missing from the discussion.
Table 4: Kachru's Classification of Hindi Nouns (Kachru, 2006, p52-53)
Class 1
Masc: ā,
Fem: ī
ending
Class 2
Masc: ī,
Fem: ā
ending
Class 3
i ending
Class 4
ū ending
Class 5
consonant
ending
Masc
[-pl, -obl]
[pl,+obl]
[+pl,-obl]
ləɽkā ‘boy’
ləɽkī ‘girl’
ləɽk-e
ləɽkī
ləɽk-e
ləɽki-yā̃
[+pl,+ob
l]
ləɽk-õ
ləɽki-yā̃
sāthī ‘friend’
sāthī
sāthī
sāthi-yõ
kənyā ‘girl’
kənyā
kənyā-ẽ
kənyā-õ
pəti
pəti
pəti-yõ
sidd i
sāɽū
h
sidd i-yā̃
sāɽū
siddhi
sāɽu-õ
bəhū
bəhū-ẽ
bəhū-õ
siyār
siyār
siyār-õ
cīl
cīl-ẽ
cīl-õ
Fem
Masc
Fem
Masc pəti ‘husband’
Fem siddhi ‘success’
Masc sāɽū‘co-brother’
bəhū ‘daughterFem
in-law’
siyār
‘jackal’
Masc
cīl ‘eagle’
Fem
h
We see in Table 4 that the feminine nouns in Classes 2, 4 and 5 show similar
inflectional behaviour as they are marked with -ẽ and -õ in the plural, direct
and the plural, oblique respectively. Similarly, the feminine nouns in Classes
1 and 3 take similar inflections. The masculine nouns in Classes 2, 3, 4 and 5
are marked with -õ or -yõ in the plural, oblique and null for all other
combinations of number and case values. Since many of these classes group
together quite naturally they should be merged. This classification appears to
be neither intuitive nor systematic.
4
Inflection-based Noun Classes for Hindi Nouns
We propose that nominal classes in Hindi should be formed based entirely on
the inflectional behaviour of nominal forms. Consequently, all feminine
311
311
nouns in Table 3 can be put in a single class. The feminine nouns in Classes 2
and 4 in Kachru’s classification scheme given in Table 4 belong in this class.
Class 1 and Class 3 feminine nouns in her classification may be merged to
form another class. Masculine nouns in Classes 2, 3, 4 and 5 can be merged
into one class, while the masculine nouns in Class 1 form a separate class.
This classification is similar to that of Shapiro (2000), summarized in Table
5, who identifies four inflectional classes based on the inflectional behaviour
of Hindi nouns, two each for masculine and feminine nouns. Shapiro,
however, does not give any reasons for his classification strategy nor the
rules to derive the inflectional forms.
Table 5: Shapiro's Classification of Hindi Nouns (Shapiro, 2000, p31-33, 38-39)
Feminine
Class I
Class II
Masculine
Class III
Class IV
Sg-dir
null
null
null
null
Sg-obl
null
null
-e
null
Pl-dir
-yā̃
-ẽ
-e
null
Pl-obl
-yõ
-õ
-õ
-yõ/-õ
Shapiro also does not discuss the behaviour of nouns that are marked null for
all case-number pairs. We put these nouns in Class A along with Type 1
feminine and Type 1 masculine nouns seen in Tables 1 and 2 respectively.
The five proposed nominal classes along with the exponents (leaving out
vocative case inflections) are shown in Table 6 below.
Table 6: Inflectional Classes and Suffixes for Hindi Nouns
Class A
Class B
Class C
Class D
Class E
Sg-dir
null
null
null
null
null
Sg-obl
null
null
null
-e
null
Pl-dir
null
-yā̃
-ẽ
-e
null
Pl-obl
null
-yõ
-õ
-õ
-yõ/-õ
The inflection based nominal classification system, permits us to describe the
inflectional behaviour of Hindi nouns using a very small set of affixes and
readjustment rules. All nouns of one class display similar inflectional
312
312
behaviour for all case-number pairs. In the following we discuss briefly some
identifiable properties of each class.
Class A: Includes those nouns (masculine and feminine) that take null for all
case-number values such as pyār (love), krodh (anger), bhūkh (hunger), pyās
(thirst), mithās (sweetness), etc. These nouns are typically abstract or
uncountable2.
Class B: Includes /ī/, /i/ or /yā/ ending feminine nouns that take -yā̃ for the
features [+pl, -oblique] and -yõ for [+pl, +oblique] such as lǝɽkī (girl), šǝkti
(power) and dibiyā (small box), guɽiyā (doll), etc.
Class C: Includes feminine nouns that take -ẽ for the feature [+pl] and -õ for
[+pl, +oblique] such as rāt (night), mālā (garland), bəhū (daughter-in-law),
ritu (season), lO (flame), etc.
Class D: Includes masculine nouns that end in /ā/ or /yā/ such as lǝɽkā (boy),
dhāgā (thread), lohā (iron), kuā̃ (water well), etc. A few kinship terms such
as bhətījā (paternal nephew), bhā̃ jā (maternal nephew), sālā (borther-in-law)
(Guru, 1920) are also a part of this class. Nouns borrowed directly from
Sanskrit such as rājā (king), pitā (father), yuvā (youngster), devtā (God),
kərtā (doer), etc. are excluded.
Class E: Includes masculine nouns that inflect only for the features [+pl,
+oblique]. The nouns in this class end with /ū/, /u/, /ī/, /i/ or a consonant.
Examples are ālū (potato), sādhū (saint), mālī (gardener), kəvi (poet), ghər
(home), khet (farm), etc. The /ā/ ending tatsam masculine nouns borrowed
from Sanskrit such as rājā (king), pitā (father), yuvā (youngster), etc. also
belong to this class.
There are significant advantages to forming inflection based noun classes.
First the classification is based on the choice of inflectional markers for four
case-number pairs rather than on the stem endings or gender property of
nouns which do not uniquely describe the inflectional behaviour of nominals
in Hindi. Gender or stem endings are stored as lexical features of the nouns.
Second, this approach yields fewer nominal classes, and this economy is
coupled with greater generalization of nominal inflectional behaviour. Many
stem alternation patterns are properly left to the domain of phonology.
2
According to classical Hindi Grammar, these nouns are bhāvavācək
(abstract) or guŋavācək (qualitative) nouns (Guru, 1920).
313
313
5
Syncretism and Allomorphy in Hindi Nouns
In DM, syncretism is defined as the realization of a single vocabulary item
(affix) that is matched with more than one set of features on a terminal node.
Intra-class syncretism in Hindi is exhibited by suffix -e of Class D that
consists of /a/ ending masculine nouns. This suffix marks nouns of the same
class for two different set of morphological features [+pl, -oblique] as well as
[-pl, +oblique]. Some of the nominal suffixes are also allomorphic. The two
suffixes, i.e., -õ and -yõ which realize the features [+pl, +oblique] for Classes
B and C are phonologically conditioned allomorphs selected based on the
phonological form of the stem. Nouns that end in the vowels /ī/, /i/, or /yā/
take the suffix -yõ while all other vowel and consonantal ending nouns take õ. Allomorphy in Hindi is also driven by etymological origins of the words.
Masculine tatsam nouns such as rājā (king) and pitā (father) do not behave
like non-tatsam /ā/ ending words such as lǝɽkā (boy) and dhāgā (thread). All
/ā/ ending Hindi nouns take -e for the features [-pl, +oblique] and [+pl, oblique] (except those in Class A). But, tatsam nouns do not inflect for these
features in the language of origin, Sanskrit, and appear to retain the same
behaviour in Hindi as well.
6
Predicting Inflectional Class for New Lexemes
Using the inflection based nominal classification system, let us see how a
new noun lexeme entering the language could be assigned gender and how
we could predict its inflectional class. Gender can be assigned in two ways to
a new lexeme 1) by virtue of its phonological form and 2) by semantically
mapping the noun to an existing noun in Hindi. In Hindi, most of the
masculine nouns end in ā while feminine nouns end in ī. If the new lexeme
ends in one of these vowels, it is relatively easy to assign it gender. Certain
new words such as kār (car) or moʈər (motor) refer to ‘gāɽī’ (vehicle) in
Hindi which is feminine. Both borrowed words are assigned feminine gender.
After gender is lexically assigned to the new lexeme, its inflectional class can
be predicted using the procedure outlined in Figure 1. A masculine noun may
or may not be inflected - based on its semantic property. If it is an abstract
noun or a mass noun it will fall into the non-inflecting Class A irrespective of
its phonological form. On the other hand, a countable lexeme will fall into
one of the two masculine classes based on its phonological form. For
example, zirauks (xerox) and pepər (paper) are both consonant final nouns
that fall into the second masculine class, Class E. Similar procedures apply to
feminine nouns as well.
314
314
Hindi Nouns
Take Inflection
Class D
[±masc]
[-masc]
[+masc
ā/yā ending
(non-tatsam)
Do not take inflection
ā ending
(tatsam),
other vowels
and consonant
Class E
ī, ɪ or
yā
ending
Class B
Other vowels,
consonant
ending
Class C
Any
ending
Class A
Figure 1: Diagrammatic Representation of New Noun Classification in Hindi
7
Morphological Operations and Hindi Nouns
In DM, before vocabulary insertion, the terminal nodes available in the
syntactic structure undergo morphological operations such as merger, fusion,
fission, and impoverishment (Halle & Marantz 1993, Harley & Noyer 1999).
The operations account for the mismatches between the syntactic and
morphological structures of word forms. In Hindi, where number and case
inflections are marked cumulatively on a noun, a terminal node with casenumber features accompanies the N node for all nouns in the syntactic tree.
The noun node raises up the tree by head movement and merges with the
case-number node (after fusion of case and number node). Thus, even though
syntax provides insertion nodes for root, case and number, only two remain
available for final insertion after morphological operations are applied. This
results in a structure where two kinds of morphemes (root and an affix) are
inserted in the two nodes. The final surface form is realized as a single word
with two morpheme pieces such as rājā-õ (kings-pl-oblique), ləɽki-yā̃ (girlspl-direct), māli-yõ (gardeners-pl-oblique), etc.
After syntax and the application of morphological operations, vocabulary
items are inserted into terminal nodes to provide connections between
phonological and grammatical features. This is called vocabulary insertion
315
315
in DM. These vocabulary items are underspecified and compete for insertion
at the terminal nodes. The items are arranged in order of specificity (highly
specified followed by less specified ones) and feature hierarchy (plural
entries followed by those for singular). The more specific entries succeed
over less specified items. The vocabulary items for Hindi nouns are given
below in (1).
(1) Vocabulary Insertion Rules
[±pl, ±oblique] ↔ null / Class A
[+pl, +oblique] ↔ -yõ / Class B & E (Stem ending ī or yā)
[+pl, +oblique] ↔ -õ
[+pl] ↔ -yā̃ / Class B
[+pl] ↔ -ẽ / Class C
[+pl] or [-pl, +oblique] ↔ -e / Class D
[±pl] ↔ null
----------- 1
----------- 2
----------- 3
----------- 4
----------- 5
----------- 6
----------- 7
(elsewhere rule)
Rule 1 applies when a noun root is specifically marked for Class A. It inserts
null for all case-number values. Rule 2 is for those /ī/ and /yā/ ending nouns
that take -yõ for the features [+pl, +oblique]. Rule 3 inserts -õ for the features
[+pl, +oblique] for all other nouns. Rule 4 and 5 are specific for plurals of
Class B and Class C respectively. Rule 6 applies to Class D nouns in [+pl]
and [-pl, +oblique]. Rule 7 is the elsewhere rule that entails null insertion for
the remaining plural and singular noun forms.
We also propose an impoverishment rule in (2) that deletes [-oblique] when
the number feature is present. This means that the entries specified for
number (singular or plural) need not be specified for [-oblique] feature (or for
direct case). Thus the rules [-pl, -oblique] ↔ null and [+pl, -oblique] ↔ null
can be replaced by a single rule, i.e., [±pl] ↔ null.
(2) Impoverishment Rule
[-oblique] → null / [±pl]
Affixation also yields some phonological changes. We propose the following
Readjustment rules for Hindi:
(3) Readjustment Rules
Stem final /ā/ → null / Class D with -e or -õ
Stem final /ū/ → u / -ẽ or -õ
Stem final /ī/ → i / -yā̃ or -õ
316
316
----------- 8
----------- 9
----------- 10
The first readjustment rule (rule 8) deletes the stem final vowel of Class D
nouns that take either -e or -õ, e.g., lǝɽkā-e, lǝɽkā-õ and sāyā-õ and create
ləɽke, lǝɽkõ and sāyõ respectively. Rules 9 and 10 are not class specific and
result in final vowel shortening in nouns (masculine or feminine) that end in
either /ū/ or /ī/. Thus, bəhū-ẽ and bahū-õ become bəhuẽ and bəhuõ while
ləɽkī-yā̃ and ləɽkī-yõ become ləɽkiyā̃ and ləɽkiyõ respectively.
8
DM Based Hindi Morphological Analyzer
A morphological analyzer aims to recover from an inflected word its base
form (stem) by stripping off possible affixes. To this base, phonological
(readjustment) rules are applied to generate the root. A search is made for this
root in the lexicon to determine if there is a match. This process can also
yield multiple roots belonging to multiple lexical categories. Morphological
information for roots and suffixes is also provided. In order to develop such a
system, a root lexicon, affixal entries and phonological rules are needed. We
developed a Hindi lexicon with forty thousand noun root entries. These roots
were manually categorized into five classes and were then marked with
information about the inflectional class, lexical category, gender and stem
ending. Vocabulary items or affixal rule entries were developed that provide
information about the context(s) in which affixes appear. Since these rules
are bidirectional, these can be used to analyze as well as generate nominal
forms. We provide an example below of the analysis of a noun using the DM
based morphological analyzer.
•
•
•
•
•
Input noun form: lǝɽkiyā̃ (girls)
Rule (vocabulary item) applicable: [+pl] ↔ -yā̃ / Class B (rule 4)
Output after extracting out the suffix → Stem: lǝɽki, Suffix: yā̃
Readjustment Rule applied: Stem final /ī/→i/ -yā̃ or -õ (rule10)
Apply the rule in the reverse direction to get the root and look for it
in the lexicon.
If found, output the root which is lǝɽkī (girl). If not, try applying
another applicable rule.
The actual output of the system for the input words शहर (šǝhǝrõ) ‘cities’ and
मौक़े (mauke) ‘chances’ is given below.
(4) Token: शहर, Total Output: 1
[Root: शहर, Class: E, Category: noun, Suffix: ◌ो◌ं]
[Gender: +masc, Number: +pl, Case: +oblique]
317
317
(5) Token: मौक़े, Total Output : 1
[Root: मौक़ा, Class: D, Category: noun, Suffix: ◌े ]
[Gender: +masc, Number: -pl, Case: +oblique]
[Gender: +masc, Number: +pl, Case: -oblique]
It may be noted that we require a few more affixal rules to implement the
morphological analyzer since the analyzer works on Hindi data in the
devanagri script, the new set of rules is given below in (6). Rules 3, 5, 6, 9
and 10 have been split into a and b to account for different devanagri
characters for the phonemes /õ/, /ẽ/, /e/, /ū/ and /ī/ respectively. . We have
also made some modification to our previous list of Stem Readjustment rules
(rules 8-10 in (3)) for the same reason.
(6) Vocabulary Insertion Rules (revised)
[±pl, ±oblique] ↔ null /Class A
[+pl, +oblique] ↔ -य / Class B and E (Stem ending ī, i or yā)
------------ 1
------------ 2
[+pl, +oblique] ↔ -◌ो◌ं / Class C and E [NC], Class D
-----------3a
[+pl, +oblique] ↔ -ओं
----------- 3b
[+pl] ↔ -याँ / Class B
------------ 4
[+pl] ↔ -◌े◌ं / Class C [NC]
-----------5a
[+pl, -oblique] ↔ -एँ / Class C
-----------5b
[+pl] or [-pl, +oblique] ↔ -◌े / Class D [Nā]
-----------6a
[+pl] or [-pl, +oblique] ↔ -ए / Class D
[±pl] ↔ null
-----------6b
------------ 7
(Note: NC: noun stem ending in a consonant, Nā: Noun stem ending in ā)
(7) Readjustment Rules (revised)
Stem final -◌ा or -आ → ø / Class D [Nā] with -◌े or -◌ो◌ं
------------ 8
Stem final -◌ू → -◌ु / -एँ or -ओं
-----------9a
Stem final -ऊ → -उ / -एँ or -ओं
-----------9b
Stem final -◌ी → -ि◌ / -याँ or -य
----------10a
Stem final -ई → -इ / -याँ or -य
----------10b
318
318
9
Evaluation, Results and Future Directions
We performed the test on 14480 Hindi noun forms extracted from news items
sourced from the website www.bbc.co.uk/Hindi and carried out manual
evaluation to verify the results. The system was able to identify and produce
correct root and morphological analysis for 12784 nouns (more than half of
which had more than one possible stem) while 1696 remain unidentified. Out
of these 1696 noun forms, about 900 were unique forms. Analysis showed
that many of these words (two hundred) were left unidentified because of
either incorrect or variant spelling. Hyphenated compound nouns (350) too
remain unidentified. A large number of the remaining unrecognized entries
were uninflected nouns for which the lexicon lacked entries. The current
system does not produce any output for these uninflected nouns. The types of
unidentified words with their counts are given in Table 7 and Table 8 below.
Table 7: Results of DM Based Hindi Morphological Analyzer
Testing Results
Total Number of Words in the Testing Corpus
14480
Number of words correctly analyzed
12784
Total number of unidentified words
1696
Total number of unique unidentified words
900
Table 8: Types of Unidentified Words and their Counts
Unique unidentified/unknown words (900)
Words with incorrect or variant spelling
200
Hyphenated words
350
Missing root entry in the lexicon
350
Below are various types of errors faced by the system and the examples of
each error type.
Roots not available in the lexicon:
इंटरनेट ‘internet’, मेमर ‘memory’, टॉयलेट ‘toilet’
Spelling variants, Urdu-Hindi letter alternations,
nasalization etc.:
319
319
nasal
vs.
क़ै!दय/कै!दय‘prisoners’, ह$ते/ह&ते ‘weeks’, 'ाि(तकार/'ां)तकार
•
‘revolutionists’, क*प)नय/कंप)नय‘ companies’, ,त*भ /,तंभ
‘pillar’
Hyphenated words:
दाह-सं,कार ‘cremation’, वण1-भेद ‘casteism’
Incorrect spelling:
भैस (correct spelling: भ2स) ‘buffaloes’, क3त4 (correct spelling:
क3)त1) ‘fame’, कज1 (correct spelling: क़ज़1) ‘debt’
Adjectives/qualifiers functioning as nouns:
स2कड़ ‘thousands’, तीन ‘all three’
We would like to emphasize that there was no instance of failure at analysis
of a nominal form as long as the root was available in the lexicon. In
addition, roots for a number of forms including borrowed words from English
taking Hindi nominal inflections such as kār-ẽ (car-s), moʈər-õ (motor-s),
pepərõ (paper-s) for which roots are missing in the dictionary are also,
interestingly, suggested by the system. This is done by applying a rule that is
applicable for the given form (i.e., if there was a match between the suffix in
the word form and in the rule). Thus, the morphological analysis that is
discussed here finds reliable, natural extension in other Natural Language
Processing systems and tools such as Part-of-Speech Taggers and Parsers.
References
Guru, K. P. (1920). Hindi Vyakaran. Kashi: Lakshmi Narayan Press. (1962
edition).
Halle, M. and A. Marantz (1993). 'Distributed Morphology and the pieces of
inflection'. In K. Hale and S. J. Keyser (eds.), The View from Building
20. Cambridge, MA: MIT Press, 111–176.
Halle, M. and A. Marantz (1994). ‘Some Key Features of Distributed
Morphology’. In A. Carnie, et al. (eds.), Papers in Phonology and
Morphology, 275-288. MITWPL 21.
Harley, H. and R. Noyer (1999). ‘Distributed Morphology’. GLOT
International 4:4: 3-9.
Lieber, R. (1992). Deconstructing Morphology. Chicago: University of
Chicago Press.
Kachru, Y. (2006). Hindi. Amsterdam and Philadelphia: John Benjamins,
2006.
320
320
Müller, G. (2002). Remarks on Nominal Inflection in German. In: I.
Kaufmann & B. Stiebels, eds., More than words: A Festschrift for
Dieter Wunderlich. Akademie Verlag, Berlin, 113-145.
Müller, G. (2003). Syncretism and Iconicity in Icelandic Noun Declensions:
A Distributed Morphology Approach. Ms., IDS Mannheim.
Müller, G. (2004). A Distributed Morphology Approach to Syncretism in
Russian Noun Inflection. Proceedings of Formal Approaches to Slavic
Linguistics 12, 353-373.
Shapiro, M. (2000). A Primer of modern standard Hindi Grammar. Motilal
Banarsidass Publications.
Upreti, M. L. (1964). Hindi me Pratayay Vichar. Agra: Vinod Pustak
Mandir.
Vajpeyi, K. (1958). Hindi Shabdanushasan. Kashi: Nagri Pracharni Sabha.
Weisser, P. (2006). A distributed morphology analysis of Croatian noun
inflection. In G. Müller, & J. Trommer (Eds.), Subanalysis of argument
encoding in distributed morphology. Linguistische Arbeitsberichte (Vol.
84, pp. 131–142). Universität Leipzig.
321
321