Hindi Word Sense Disambiguation Using Semantic Relatedness

Hindi Word Sense Disambiguation Using Semantic
Relatedness Measure
Satyendr Singh1, Vivek Kumar Singh2, and Tanveer J. Siddiqui3
1,3
Department of Electronics & Communication, University of Allahabad, Allahabad, India
2
South Asian University, New Delhi, India
{satyendr,vivekks12}@gmail.com, [email protected]
Abstract. In this paper we propose and evaluate a method of Hindi word sense
disambiguation that computes similarity based on the semantics. We adapt an
existing measure for semantic relatedness between two lexically expressed concepts of Hindi WordNet. This measure is based on the length of paths between
noun concepts in an is-a hierarchy. Instead of relying on direct overlap the algorithm uses Hindi WordNet hierarchy to learn semantics of words and exploits it
in the disambiguation process. Evaluation is performed on a sense tagged dataset consisting of 20 polysemous Hindi nouns. We obtained an overall average
accuracy of 60.65% using this measure.
Keywords: Hindi Word Sense Disambiguation, Semantic Relatedness,
Semantic Similarity, Semantic Distance.
1
Introduction
Polysemy is inherent to natural languages. Every natural language has a large number
of words carrying multiple meanings. For instance, the English noun bark can mean
sound made by a dog, a type of sailing vessel with three or more masts or the outermost layers of stems and roots of woody plants; similarly the Hindi word हल (hal)
can mean जमीन जोतने का एक उपकरण (an instrument used to plough field) or
समाधान/ नबटारा (solution). Human beings are fairly apt in determining the correct
sense of the word. But for machines this is a very difficult task. The task of computational identification of the correct sense of a word in a given context is called Word
Sense Disambiguation (WSD). One of the widely used approaches in automatic WSD
is to measure similarity between the textual context of an ambiguous word and dictionary definition of its senses and assigning appropriate sense based on the similarity
value. However, this method requires an exact match of content words and fails to
take into account semantic or conceptual similarity.
In this paper, we describe a method of word sense disambiguation based on the
semantics. Our algorithm uses Leacock-Chodorow semantic relatedness measure [7]
to compute semantic similarity between two lexically expressed concepts. It uses
Hindi WordNet [2] hierarchy to compute distance between the two concepts. Instead
of relying on direct overlap the algorithm uses Hindi WordNet hierarchy to learn
S. Ramanna et al. (Eds.): MIWAI 2013, LNCS 8271, pp. 247–256, 2013.
© Springer-Verlag Berlin Heidelberg 2013
248
S. Singh, V.K. Singh, and T.J. Siddiqui
semantics of words and exploits it in the disambiguation process. The algorithm does
not require any training corpus and can disambiguate any ambiguous word that is
found in Hindi WordNet. The performance of semantic similarity based WSD algorithm is compared with Lesk-like direct overlap method [11].
A number of earlier work reports on the use of semantic similarity measures for
WSD and other NLP task. Budanitsky and Hirst [3] evaluated five measures of semantic relatedness using WordNet. They compared the performance of these measures in detecting and correcting real world spelling errors. They found information
content based measure proposed by Jiang and Conrath superior to those proposed by
Hirst and St-Onge [4], Leacock and Chodorow [7], Lin [8], and Resnik [10]. They
reported an accuracy level of 65% on F1-Score for all the three languages. Reddy et
al. [9] evaluated and compared six semantic relatedness measures for Hindi semantic
category labeling. They reported that adapted Lesk performed better than other measures. Sinha and Mihalcea [13] proposed an unsupervised graph-based WSD algorithm
and reported results using six different measures on SENSEVAl-2 and SENSEVAL-3
dataset. Torres and Gelbukh [14] compared Lesk algorithm for original Lesk algorithm. They evaluated Jiang–Conrath, Lesk and combination of similarity measures
using random sense and most frequent sense as a back-off procedure on senseval-2,
Senseval-3, Semeval and Semcor corpus. The experimental results show that different
measures performed better on different corpus and using different back-off procedure.
In general, combination of similarity measures was more accurate than each measure
separately. However, Lesk measure performs better for Senseval-3, Semval and Semcor corpus when back-off is random sense while Jiang–Conrath [5] measure shows
good results with most frequent sense back-off is used. However, these results can not
be generalized for Hindi. The work presented in this paper is an attempt to investigate
the usefulness of existing semantic similarity measures for Hindi WSD task. To the
best of our knowledge this kind of work is not reported earlier for Hindi WSD.
Other work on Indian languages includes [6, 9, 11, 12]. Sinha et al. [12] used an
extended Lesk-like algorithm and performed contextual overlapping for Hindi WSD
task. They explored extended sense definitions derived from Hindi WordNet and
context of target polysemous word to perform contextual overlap between them.
Winner sense is assigned as one which maximized the overlap. Khapra et al. [6] performed their study on domain specific WSD for nouns, adjectives and adverbs in a
trilingual setting of English, Hindi and Marathi. Dominant senses of words in specific
domains were used for performing disambiguation. Singh and Siddiqui [11] evaluated
the effects of stemming, stop word removal and context window size on a Lesk like
overlap based algorithm for Hindi WSD. They reported an improvement of 9.24% in
precision after stop word removal and stemming over baseline.
The rest of the paper is organized as follows: In Section 2, an overview of semantic
similarity is given and Leacock-Chodorow measure of semantic relatedness is discussed. Section 3 discusses the proposed algorithm. Details of the dataset and experiments conducted are provided in Section 4. Results are discussed in Section 5 and
finally conclusions are drawn in Section 6.
Hindi Word Sense Disambiguation Using Semantic Relatedness Measure
2
249
Overview of Semantic Similarity
Semantic similarity is a kind of relatedness between two words. Semantic relatedness
includes relationships between concepts including semantic similarity and other relations such as is-a-kind-of, is-a-part-of, is-a-specific-example-of, is-the-opposite-of,
etc. A number of measures of semantic relatedness have been proposed [4, 5, 7, 8,
10]. The relatedness measure by Hirst and St-Onge [4] considers many other relations
in WordNet and is not restricted to nouns. This measure assigns a relatedness score
for word types rather than concepts. Jiang and Conrath [5] used information content
proposed by Resnik [10]. They augmented it with a notion of path length between
concepts. Resnik[10] introduced a measure of relatedness based on the information
content. The information content is assigned to each concept in a hierarchy based on
evidence found in the corpus. In Lin measure [8] of semantic relatedness, similarity
between two concepts is measured by the ratio of the amount of information needed
to state the commonalty of two concepts to the amount of information needed to describe them. In this work, we use the Leacock-Chodorow measure [7] of semantic
relatedness to disambiguate a polysemous word. Relatedness is measured by computing the length of the shortest path between two concepts. We have used only nouns in
this work. The underlying assumption is that the concepts of nearby nouns of target
word are better indicator of the correct sense of target polysemous noun. The Leacock-Chodorow measure [7] of semantic relatedness is based on the length of paths
between noun concepts in an is-a hierarchy. The shortest path between two concepts
is the path having least number of intermediate nodes. The shortest path is scaled by
the depth of the hierarchy. Depth is the length of the longest path from a leaf node to
the root node of the hierarchy. The relatedness between two concepts c1 and c2 is
computed as:
relatednesslch (c1, c2) = [- log (ShortestLength (c1, c2) / (2*D)]
ShortestLength (c1, c2) is the shortest path length between two concepts and D is the
maximum depth of the taxonomy. A hypothetical root node is created to join all the
noun hierarchies. The maximum depth of noun hierarchy is 6 for Hindi WordNet.
3
WSD Algorithm
We extend the idea of direct overlap method for performing disambiguation using the
Leacock-Chodorow measure of semantic relatedness. Firstly, stop words are removed
from test instances. A test instance is a context containing an ambiguous word. Then
we extract all the nouns from the test instances irrespective of their usage in the context. The test instance is represented as a vector of words appearing in a window containing two nouns on either side of the target word. The sense definitions are also
represented as a vector comprising of words appearing in the sense definitions. The
similarity between these two vectors is computed. Instead of using a keyword-based
similarity the Hindi WordNet hierarchy is utilized to get meaning-based similarity.
250
S. Singh, V.K. Singh, and T.J. Siddiqui
The algorithm takes as input synset id’s of various senses of an ambiguous word and
synset id’s of nouns appearing in the test instance. For example, for Hindi noun
सोना (sona), there are 3 senses listed in Hindi WordNet as shown in Fig. 1. The synsets pertaining to these 3 senses in Hindi WordNet along with their synset id are
shown in Fig. 2.
1. सोना, वण, कंचन, हे म, कनक, सव
ु रन, कांचन, सव
ु ण, अ , हर य, वरवण, शातकंु भ, शातकु भ, शातक भ,
शातकौ भ, शु ,
ने , चामीकर, पु द, ज़र, व ण, अह, अव भ
ं , अव म ्भ, ीम कंु भ, ीम कु भ, रस वरोधक,
रं जन, मनोहर, शतकंु भ, शतकु भ, शतक भ, शतक भक, शतखंड, शतख ड, भ , अ मकर, अ ापद, म त ्, द ,
आ नेय, वसु, गा ड़, तामरस, ता य; एक बहुमू य पील धातु िजसके गहने आ द बनते ह ; "आज कल
सोने का भाव आसमान छू रहा है / चैत य महापु ष के शर र से वण जैसी आभा नकलती थी"
2. शयन, सोना, सयन; सोने क
या; "शयन के लए ह रा त बनायी गयी है"
3. सोनापाठा, योनाक, टटू, सोना, सोनापाढ़ा,
पू तव ृ ,
एक
मत
ृ ,
वणव कल, नसोथ, नसत
ृ ा, नसौत, या ादनी, पू तप ,
मत
ृ ा, शुक, शुकनास, प ोण, अरलु,
वेला, भूतपु प, धौतकोषज, पथ
ृ ु शंब, पथ
ृ ु श ब, भू मपु ;
कार का ऊँचा पेड़; "सोना पाठा के बीज, छाल और फल दवा के
प म काम आते ह"
Fig. 1. Senses of सोना in Hindi WordNet
3045 - NOUN - [सोना, वण, कंचन, हे म, कनक, सव
ु रन, कांचन, सव
ु ण, अ , हर य, वरवण,
शातकंु भ, शातकु भ, शातक भ, शातकौ भ, शु ,
अव
भ,
ीम कंु भ,
ने , चामीकर, पु द, ज़र, व ण, अह, अव भ
ं ,
ीम कु भ, रस वरोधक, रं जन, मनोहर, शतकंु भ, शतकु भ, शतक भ,
शतक भक, शतखंड, शतख ड, भ , अ मकर, अ ापद, म त ्, द , आ नेय, वस,ु गा ड़, तामरस,
ता य]
8042 - NOUN - [शयन, सोना, सयन]
18571 - NOUN - [सोनापाठा, योनाक, टटू, सोना, सोनापाढ़ा, वणव कल, नसोथ, नसत
ृ ा,
नसौत, या ादनी, पू तप , पू तव ृ ,
मत
ृ ,
भूतपु प, धौतकोषज, पथ
ृ ु शंब, पथ
ृ ु श ब, भू मपु ]
मत
ृ ा, शुक, शुकनास, प ोण, अरल,ु
वेला,
Fig. 2. Synset id’s of सोना in Hindi WordNet
For our dataset, we have considered two senses of सोना , pertaining to gold and
sleep sense. This consists of sense1 and 2 from Hindi WordNet. Thus, the synset id’s
of the target word सोना pertaining to sense 1 and 2 are {3045, 8042}. This constitutes
the input sysnet id’s of target word ( सोना ) used in this experiment.
The synset id’s of the words in the test instance vector, excluding target word are
computed. This forms output synset id’s. For each synset id of the target word, semantic relatedness is computed for all output synset id’s using the measure of Leacock Chodorow. Currently, we are not discriminating the senses of nouns appearing
in the nearby context. The overall score for each sense is computed by summing the
pair-wise semantic similarity between each of its senses. The sense pertaining to the
synset id which maximizes the score is assigned as the winner sense. The steps in
WSD algorithm are shown in fig. 3.
Hindi Word Sense Disambiguation Using Semantic Relatedness Measure
251
WSD Algorithm
1. Remove stop words from the test instances, extract all the nouns and then create context vector
by extracting all the nouns to the left and right of the target word in the proximity of 2 words
2. Extract input synset id’s of target word (Syn-input)
3. Extract output synset id’s of the nouns in context vector, excluding target word (Syn-output)
4. For i = 1 to number of Syn-input do
Scorei similarity (Syn-inputi, Syn-output)
5. Return Syn-inputi for which score is maximum.
Computing score:
similarity (Syn-inputi,, Syn-output) // Syn-inputi is the synset-id of the ith sense of target word and
Syn-output is the synset id’s of all nouns , excluding target words in context vector //
sense_score 0
For each synset id a in Syn-output
relatednesslch (Syn-inputi, a) = [ - log (ShortestLength (Syn-inputi, a) / (2*D)]
// D is maximum depth of Hindi WordNet (12) //
sense_scoresense_score + relatednesslch (Syn-inputi, a)
return sense_score
Fig. 3. WSD algorithm
4
Data Set and Experiment
4.1
Data Set
We evaluate our algorithm on a test dataset consisting of 20 Hindi polysemous nouns
(Table 1). Sense inventory is derived from Hindi WordNet. Some of the senses having
fine-grained sense distinctions are merged. Some of the senses for which instances
were not easily available have been deleted.Instances were collected from Hindi Corpus [1] created by Centre for Indian Language Technology (CFILT), IIT Bombay.
Instances were also collected from www.khoj.com and www.google.com by firing
queries derived from sense definitions. Instances are from varying domains including
news, literature, medical, sports, etc. Our dataset has a total of 710 instances. The
average number of instances per word is 35.5. The average number of instances per
sense is 14.48. The average number of senses per word is 2.45. The detail of the dataset is given in Table A1 in Appendix. The translation and transliteration of the dataset
is given in Table A2 in Appendix.The performance evaluation is donein terms of
precision and recall. Precision is defined as the ratio of the correctly answered instances and the total number of test instances answered for the target word. Recall is
defined as the ratio of the correctly answered instances and the total number of test
instances to be answered for the target word.
Table 1. Dataset
No of
Senses
2
3
4
Words (Nouns)
कोटा, गु , चंदा, जेठ, डाक, तान, तीर, दाम, माँग, व ध, सोना, हल, हार
उ र, कंु भ, फल, वचन, सं मण, संबंध
मूल
252
4.2
S. Singh, V.K. Singh, and T.J. Siddiqui
Experiment
For evaluating the proposed algorithm test run for each target word is conducted on
fixed window size of two. Precision and recall for 20 ambiguous Hindi words are
shown in Table 2.We also conducted a test run using direct overlap method. The direct overlap method used for comparison is adapted from WSD algorithm after stop
word removal from sense definitions and test instances (Case III) as given in [11].The
precision and recall values averaged over a window size of 5, 10, 15, 20, 25 for direct
overlap method are shown in Table 2. Table 3 summarizes the results of comparison.
Table 2. Precision and recall
Word
Semantic relatedness (LeacockChodorow Measure)
Direct Overlap Measure
Precision
Recall
Precision
उ र
0.9030
0.8898
Recall
0.7888
0.7772
कंु भ
0.5701
0.5701
0.6490
0.6490
कोटा
0.6416
0.6416
0.1940
0.1940
गु
0.6782
0.6782
0.5881
0.5881
चंदा
0.5687
0.4409
0.8265
0.7276
जेठ
0.6666
0.6666
0.2428
0.2428
डाक
0.5000
0.5000
0.1336
0.1336
तान
0.4166
0.4166
0.3500
0.3000
तीर
0.7500
0.7500
0.5615
0.5615
दाम
0.7368
0.7368
0.4798
0.4798
फल
0.6526
0.6288
0.4985
0.4737
माँग
0.6363
0.6363
0.3999
0.3999
मूल
0.9019
0.9019
0.4359
0.4272
वचन
0.3809
0.3809
0.2559
0.2559
वध
0.5909
0.4090
0.3811
0.3681
सं मण
0.1809
0.1809
0.3919
0.3919
संबंध
0.3888
0.3055
0.6499
0.6499
सोना
0.7500
0.4732
0.4916
0.3196
हल
0.6114
0.6114
0.5070
0.5070
हार
0.6052
0.6052
0.3263
0.3263
Hindi Word Sense Disambiguation Using Semantic Relatedness Measure
253
Table 3. Comparision of Precision and recall
Method
Semantic relatedness
(Leacock-Chodorow
Measure)
Direct
Overlap
Measure
5
Overall average
Precision (over
20 words)
Overall average
Recall (over 20
words)
0.6065
0.5711
0.4576
0.4386
Results and Discussion
As shown in Table 3, the average precision and recall using WSD algorithm based on
Leacock-Chodorow measure over 20 words is 60.65% and 57.11%.We obtained highest precision of 0.9030 for word उ र . The lowest precision of 0.1809 was observed
for the word सं मण . A comparison of the proposed algorithm with Lesk like direct
overlap method is shown in Table 3. For Direct Overlap method results were taken on
window size of 5, 10, 15, 20 and 25 after stop word removal for every target word and
then average is computed. The average precision over 20 words using direct overlap
method is 45.76% as shown in Table 3.We obtained higher precision for many words
using Leacock-Chodorow measure than direct overlap measure. The direct overlap
method showed better precision than Leacock-Chodorow measure for words कंु भ , चंदा,
सं मण and संबंध . This is due the availability of better content words in dictionary definitions for which contextual match is possible in the test instances. The use of semantic similarity measure results in an improvement of 32.53%. This is mainly due
to the use of WordNet hierarchy in identifying similarity between the test instance and
the sense definitions.
6
Conclusions and Future Work
In this paper, we proposed and evaluated Hindi Word Sense Disambiguation using
Leacock-Chodorow measure of semantic relatedness. The experimental results demonstrate that the use of semantics increases the accuracy of disambiguation. A comparison with the direct overlap method yields an improvement of 32.53%. One of the
drawbacks of our algorithm is that it does not consider part of speech other than noun.
If words have multiple syntactic categories then the algorithm utilizes only the semantic similarity arising out of noun category of matching words.
References
1. Hindi Corpus, http://www.cfilt.iitb.ac.in/Downloads.html
2. Hindi WordNet, http://www.cfilt.iitb.ac.in/wordnet/webhwn/wn.php
3. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics 32(1), 13–47 (2006)
254
S. Singh, V.K. Singh, and T.J. Siddiqui
4. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and
correction of malapropisms. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database, pp. 305–332 (1998)
5. Jiang, J., Conrath, D.: Semantic similarity based on corpus statistics and lexical taxonomy.
In: Proceedings on International Conference on Research in Computational Linguistics,
Taiwan, (1997)
6. Khapra, M., Bhattacharyya, P., Chauhan, S., Nair, S., Sharma, A.: Domain Specific Iterative Word Sense Disambiguation in a Multilingual Set-ting. In: Proceedings of International Conference on NLP (ICON 2008), Pune, India (2008)
7. Leacock, C., Chodorow, M.: Combining local context and WordNet similarity for word
sense identification. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database,
pp. 265–283 (1998)
8. Lin, D.: Using syntactic dependency as a local context to resolve word sense ambiguity.
In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 64–71 (1997)
9. Reddy, S., Inumella, A., Singh, N., Sangal, R.: Hindi Semantic Category Labeling using
Semantic Relatedness Measures. In: Proceedings of Global WordNet Conference (2010)
10. Resnik, P.: Using information content to evaluate semantic similarity in taxonomy. In:
Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal,
pp. 448–453 (1995)
11. Singh, S., Siddiqui, T.J.: Evaluating Effect of Context Window Size, Stemming and Stop
Word Removal on Hindi Word Sense Disambiguation. In: Proceedings of the International
Conference on Information Retrieval & Knowledge Management, CAMP 2012, Malaysia,
pp. 1–5 (2012)
12. Sinha, M., Kumar, M., Pande, P., Kashyap, L., Bhattacharyya, P.: Hindi Word Sense Disambiguation. In: International Symposium on Machine Translation, Natural Language
Processing and Translation Support Systems, Delhi, India (2004)
13. Sinha, R., Mihalcea, R.: Unsupervised Graph-based Word Sense Disambiguation Using
Measures of Word Semantic Similarity. In: Proceedings of the International Conference on
Semantic Computing, ICSC 2007, pp. 363–369 (2007)
14. Torres, S., Gelbukh, A.: Comparing Similarity Measures for Original WSD Lesk Algorithm Advances in Computer Science and Applications. Research in Computing
Science 43, 155–166 (2009)
Hindi Word Sense Disambiguation Using Semantic Relatedness Measure
Appendix
Table. A1. Statistics of Dataset
Word
No. of
Senses
Instances
Sense1
Sense 2
Sense 3
Sense 4
उ र
3
10
23
12
कंु भ
3
22
21
19
कोटा
2
20
21
गु
2
31
19
चंदा
2
25
22
जेठ
2
3
7
डाक
2
19
20
तान
2
6
7
तीर
2
30
13
दाम
2
19
6
फल
3
30
25
माँग
3
4
11
मूल
5
4
3
17
12
वचन
3
7
9
8
वध
2
22
22
सं मण
3
10
19
7
संबंध
3
8
12
3
सोना
2
21
8
हल
2
8
23
हार
2
10
19
12
255
256
S. Singh, V.K. Singh, and T.J. Siddiqui
Table. A2. Dataset – Translation& Transliteration
Word
No
of
Sens
es
Sense1
Sense2
Sense3
Sense4
उ र (uttar)
3
Answer
North direction
A person’s name
कंु भ (kumbh)
3
Waterpot made of
mud
A Sun Sign (Aquarius) in Hindi
A Holy event happing every 12 years
in India
कोटा (kota)
2
Reservation, quota
Name of a district in
Rajasthan in India
गु (guru)
2
teacher
Jupiter (name of a
planet)
चंदा (chanda)
2
moon
Financial contribution, subscription
जेठ (jeth)
2
Name of a month in
Hindi
Husband’s
elder
brother, brother in
law
डाक (dak)
2
Bid, bidding
Post, postal service
तान (taan)
2
Process of stretching
Music tone
तीर (teer)
2
arrow
shore of river or sea
दाम (dam)
2
Cost, price
Type of strategy or
policy
फल (phal)
3
fruit
result
माँग (maang)
2
Requirement, need,
Parting of hairs on
head where married
Hindu woman put
vermilion as a sign
of marriage
मूल (mul)
4
root of plant
Basic reason, fundamental
Time for a type of
star
वचन (vachan)
3
whatever one speaks
or says, saying
Promise,
ment
commit-
Agent
in
Hindi
grammar to denote
singular or plural
व ध (vidhi)
2
Way or process of
doing something
law
3
Process of sun’s
transition from one
star-sign to another
Process of disease
infection
Process of transition
from one place or
state to another place
or state
3
relation
Agent
in
Hindi
grammar that shows
relation between two
words
marriage
सोना (sona)
2
gold
sleep
हल (hal)
2
solution
Instrument used to
plough
हार (har)
2
defeat
necklace, garland
सं मण
(san-
kraman)
संबंध
(sambandh)
Front sharp part of
arrow or spear
Capital/Principal
money