Short-Text Similarity Measurement Using Word Sense
Disambiguation and Synonym Expansion
Khaled Abdalgader and Andrew Skabar
Department of Computer Science and Computer Engineering
La Trobe University, Bundoora, Australia
[email protected], [email protected]
Abstract. Measuring the similarity between text fragments at the sentence level
is made difficult by the fact that two sentences that are semantically related may
not contain any words in common. This means that standard IR measures of
text similarity, which are based on word co-occurrence and designed to operate
at the document level, are not appropriate. While various sentence similarity
measures have been recently proposed, these measures do not fully utilise the
semantic information available from lexical resources such as WordNet. In this
paper we propose a new sentence similarity measure which uses word sense
disambiguation and synonym expansion to provide a richer semantic context to
measure sentence similarity. Evaluation of the measure on three benchmark
datasets shows that as a stand-alone sentence similarity measure, the method
achieves better results than other methods recently reported in the literature.
1 Introduction
Measuring the similarity between small-sized text fragments (e.g., sentences) is a
fundamental function in applications such as text mining and text summarization, which
usually operate at the sentence or sub-sentence level [1][2]; question answering, where
it is necessary to calculate the similarity between a question-answer pair [3][4]; and
image retrieval, where we are interested in the similarity between a query and an image
caption [5]. Although methods for measuring text similarity have been in existence for
decades, most approaches are based on word co-occurrence [6][7]. The assumption here
is that the more similar two texts are, the more words they have in common. While this
assumption is generally valid for large-size text fragments (e.g., documents)—and hence
the widespread and successful use of these methods in information retrieval (IR)—the
assumption does not hold for small-sized text fragments such as sentences, since two
sentences may be semantically similar despite having few, if any, words in common.
One approach to measuring similarity between two sentences is based on
representing the sentences in a reduced vector space consisting only of the words
contained in the sentences. For example, the sentences “Dogs chase cats” and “Felines
kill mice” could be represented respectively as the vectors (1, 1, 1, 0, 0, 0) and (0, 0, 0,
1, 1, 1) in a vector space in which dimensions correspond to the ordered terms ‘cat’,
‘chase’, ‘dog’, ‘feline’, ‘kill’ and ‘mice’, and a vector entry of 1 (0) represents the
presence (absence) of the corresponding word in the sentence. However, this in itself
does not solve the problem of lack of word co-occurrence, since vector space similarity
measures such as cosine similarity will still yield a value of 0.
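To make this concrete, the following minimal Python sketch (our own illustration, not code from the paper) builds the binary reduced vector space for the two example sentences and confirms that their cosine similarity is zero:

```python
import math

# Build binary presence/absence vectors over the reduced vector space:
# the vocabulary consists only of the content words of the two sentences.
def reduced_vectors(s1_words, s2_words):
    vocab = sorted(set(s1_words) | set(s2_words))
    v1 = [1.0 if w in s1_words else 0.0 for w in vocab]
    v2 = [1.0 if w in s2_words else 0.0 for w in vocab]
    return vocab, v1, v2

# Standard cosine similarity between two equal-length vectors.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vocab, v1, v2 = reduced_vectors({'dog', 'chase', 'cat'},
                                {'feline', 'kill', 'mice'})
print(vocab)           # ['cat', 'chase', 'dog', 'feline', 'kill', 'mice']
print(cosine(v1, v2))  # 0.0 -- no words in common, so similarity is zero
```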
To solve this problem, rather than assigning a value of 0 for the vector entry
corresponding to an absent word, we can estimate a non-zero value which reflects the
extent to which the word is related semantically to the collection of words in the sentence.
For example, the vector for the first sentence above has a 0 entry corresponding to ‘feline’.
By comparing the word ‘feline’ semantically with the words ‘dog’, ‘chase’ and ‘cat’ (by
using a dictionary, for example), we would expect to arrive at a non-zero value, since
‘feline’ would presumably be found to be semantically related to ‘cat’, as well as to ‘dog’
(by virtue of cats and dogs being members of the animal kingdom). This will result in a
non-zero value for the fourth entry in the first vector, and hence a non-zero value when the
cosine similarity of the resulting vectors is calculated. We refer to vectors determined in
this way as semantic vectors. There are different approaches to measuring the similarity
between a word, x, and a collection of n words. One approach is to take the mean semantic
similarity between x and each of the n words; another is to use the maximum word-to-word similarity score, which is the approach taken in Li et al. (2006) [8].
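As an illustration of the second strategy, here is a short sketch of semantic-vector construction; `word_sim` is a placeholder for any WordNet word-to-word similarity measure, not a function defined in the paper:

```python
# Entry for vocabulary word x = maximum word-to-word similarity between
# x and the words of the sentence (the max strategy of Li et al. (2006)).
def semantic_vector(vocab, sentence_words, word_sim):
    return [max(word_sim(x, w) for w in sentence_words) for x in vocab]
```

Replacing the max with a mean gives the first strategy mentioned above.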
Sentence similarity measures can also be defined in which sentences are not
explicitly represented in a vector space. For example, in Mihalcea et al. (2006) [9],
each word in a sentence is assigned a score determined as the maximum semantic
similarity between the word and the words in the opposing sentence. These scores are
then weighted by inverse document frequency (idf) values, summed over both
sentences, and finally normalized, resulting in a measure of sentence similarity.
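A hedged sketch of this style of measure follows; `word_sim` and `idf` are assumed callables (a word-to-word similarity measure and an inverse-document-frequency lookup), and the normalization follows the description above rather than the original paper's notation:

```python
# Each word in one sentence is scored by its best match in the opposing
# sentence, weighted by idf; the two directed scores are normalized
# and averaged to give a symmetric sentence similarity.
def mihalcea_style_sim(s1, s2, word_sim, idf):
    def directed(src, tgt):
        num = sum(max(word_sim(w, t) for t in tgt) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den
    return 0.5 * (directed(s1, s2) + directed(s2, s1))
```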
Although the above approaches differ in how the final similarity score is calculated,
both are based on the estimation of similarity between a word and a set of words, and the
purpose of this paper is to explore how this measurement can be improved through better
utilising the semantic information available from lexical resources such as WordNet [10].
The contribution of the paper is two-fold. Firstly, the paper proposes a method by which
word sense identification, used in conjunction with synonym expansion, can be used to
create an enriched semantic context, enabling a more accurate estimate of semantic
similarity. Results of applying the measure to three benchmark datasets show that as a
stand-alone measure, the method achieves better results than other methods recently
reported in the literature. The second contribution of the paper is a novel word sense
disambiguation (WSD) algorithm that operates by comparing WordNet glosses of the
target word with a context vector comprising the remaining words in the sentence.
The remainder of this paper is structured as follows. Section 2 introduces the
method of synonym expansion through word sense identification. Section 3 describes
the WSD algorithm that we have developed, and the word-to-word semantic
similarity measures that we use. Section 4 provides empirical results, and Section 5
concludes the paper.
2 Word Sense Disambiguation and Synonym Expansion
The approach that we present in this section is depicted in Figure 1. For each of the
sentences being compared, we first apply a word sense disambiguation step to identify
the sense in which words are being used within the sentence. We then apply a
synonym expansion step, allowing a richer semantic context from which to estimate
[Figure: each input sentence passes through WSD and synonym expansion (both drawing on WordNet); the resulting expansion sets yield semantic vectors over the union set U, from which the similarity score is computed.]

Fig. 1. Sentence similarity computation diagram
semantic vectors. The similarity between semantic vectors can then be calculated
using a standard vector space similarity measure such as cosine similarity. We first
describe the role of WSD. We then describe the procedure for synonym expansion.
2.1 The Role of Word Sense Disambiguation (WSD)
Sentence similarity as measured using methods such as those described in the
Introduction is based on word-to-word similarities. The standard approach used within
sentence similarity measures based on WordNet [10] is to simply use the first WordNet
sense for each of the two words being compared [8][9]. (Senses in WordNet are
ordered from most-frequent to least-frequent). However this can lead to inaccurate
similarity measurements. To illustrate, consider the following:
Sentence 1: I deposited a cheque at the bank.
Sentence 2: There is oil sediment on the south bank of the river.

Using the reduced vector space representation we obtain

S1: [('deposited', 0), ('cheque', 0), ('bank', 0)]
S2: [('oil', 0), ('sediment', 0), ('south', 0), ('bank', 0), ('river', 0)]
U:  [('river', 0), ('south', 0), ('oil', 0), ('sediment', 0), ('deposited', 0), ('cheque', 0), ('bank', 0)]
V1: [0.066, 0.062, 0.058, 0.055, 1.0, 1.0, 1.0]
V2: [1.0, 1.0, 1.0, 1.0, 0.0, 0.059, 1.0]
where S1 and S2 contain the word-sense pairs for non-stopwords (stopwords are words
such as ‘a’, ‘the’, etc., and are removed because they carry little semantic information); U
is the reduced vector space, consisting of all word-sense pairs in the union of S1 and S2;
and V1 and V2 are the vectors for S1 and S2 in this reduced vector space. The entry
corresponding to a word x in V1 (V2) is determined as the maximum similarity between
x and the words in S1 (S2). For example, 0.066 is the maximum similarity between
('river', 0) and the words in S1 having the same part of speech. (Many WordNet word-to-word similarity measures are only defined between words with the same part of speech.)
Calculating the cosine similarity between V1 and V2 results in a value of 0.33.
The similarity value of 0.33 is likely to be an overestimate. For example, the word
‘bank’ appears in both sentences, but its sense is different in each. Assigning sense 0 in both cases will always result in maximum similarity between them. Problems might
also arise between words which are not common between the two sentences. For
example, there is a sense of ‘deposit’ which is closely related to ‘sediment’ (An oil
deposit might be considered a sediment). If we perform the same calculation, but this time incorporating WSD, we obtain:
S1: [('deposited', 1), ('cheque', 0), ('bank', 13)]
S2: [('oil', 4), ('sediment', 1), ('south', 3), ('bank', 0), ('river', 0)]
U:  [('river', 0), ('bank', 0), ('south', 3), ('deposited', 1), ('sediment', 1), ('cheque', 0), ('oil', 4), ('bank', 13)]
V1: [0.059, 0.051, 0.052, 1.0, 0.044, 1.0, 0.050, 1.0]
V2: [1.0, 1.0, 1.0, 0.050, 1.0, 0.059, 1.0, 0.049]
which results in a cosine similarity value of 0.11. This is lower than that achieved
without the use of WSD, and is more in accord with the human judgement that S1 and
S2 bear little semantic similarity.
Now consider the following sentences, which most humans would consider to be
semantically related:
Sentence 3: The world is in economic crisis.
Sentence 4: The current dismal fiscal situation is global.
Calculating sentence similarity with and without WSD results in similarity values of 0.08
and 0.09 respectively. It is problematic that a value of 0.08 has been obtained for a pair of
sentences which we consider to be semantically related, yet a higher value of 0.11 was
obtained for Sentences 1 and 2, which we consider not to be semantically related. Thus
use of WSD on its own appears to be insufficient. In the next section we describe how
using synonym expansion can solve this problem.
2.2 Increasing Semantic Context through Synonym Expansion
WordNet [10] represents each unique meaning of a word by a synset consisting of that
word together with its synonyms (if any). Synonyms are words with the same meaning.
For example, the synset ['fiscal', 'financial'] represents the sense of ('fiscal', 0): “involving
financial matters”. Synsets provide a means of expanding the semantic context. For
example, consider Sentences 3 and 4 above. Disambiguating the words in these sentences
results in a vector space consisting of the following sense-assigned words:
U: [('fiscal', 0), ('current', 3), ('crisis', 0), ('dismal', 0), ('situation', 0), ('global', 1), ('world', 1), ('economic', 0)]
We can use information from the respective synsets of these words to add context to the
original sentences. For example, Sentence 4 above was originally represented as the set
[('current', 3), ('dismal', 0), ('fiscal', 0), ('situation', 0), ('global', 1)]
Using information from the synsets of these words, we can expand this to
[('current', 0), ('blue', 0), ('dark', 0), ('dingy', 0), ('disconsolate', 0), ('gloomy', 0), ('grim', 0), ('sorry', 0),
('drab', 0), ('drear', 0), ('dreary', 0), ('financial', 0), ('state_of_affairs', 0), ('ball-shaped', 0), ('globose', 0),
('globular', 0), ('orbicular', 0), ('spheric', 0), ('spherical', 0)]
It is important to note that all synonyms are added with sense 0. While this might appear
counter-intuitive, since this may not be the sense of the synonym in the original synset
(i.e., the synset of the word being expanded), it is precisely through including synonyms
with sense 0 that we are able to expand the context. There are two inter-related reasons
for this. Firstly, adding the correct sense for a synonym would achieve nothing, since
the similarity of some word x to this synonym would be the same as its similarity to all
other words in the same synset (which includes the identified sense of the original word
used to produce the synset). Secondly, WordNet assigns sense 0 to the most frequently used sense of a word. This means that using this sense is most likely (but not guaranteed) to expand the context in a semantic direction that helps find semantic similarities between words in the two sentences being compared. We also note
that using synonym expansion does not require the dimensionality of the vector space to
be increased (i.e., we do not add synonyms to U). The expanded context is utilized when
we calculate the semantic vectors. Whereas originally the entries for these vectors were
based only on similarities to words in the original sentence, we now consider similarities
to the synonyms that have been introduced.
To complete the above example, the expanded description for Sentence 3 is [('domain',
0), ('economic', 0), ('crisis', 0)]. This results in the following semantic vectors:
V3: [0.0, 0.807, 1.0, 0.0, 0.0, 0.068, 0.059, 0.0]
V4: [1.0, 1.0, 0.111, 1.0, 0.0, 0.074, 1.0, 0.0]
These vectors have a cosine similarity of 0.38, which is higher than the value of 0.08
achieved without synonym expansion.
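A sketch of the synonym-expansion step using NLTK's WordNet interface follows (an assumption on our part; the paper does not specify an implementation). We assume the paper's sense index k corresponds to NLTK's frequency-ordered synset list:

```python
from nltk.corpus import wordnet as wn

# Expand each (word, sense) pair with the lemmas of its synset;
# every synonym is added with sense 0, as described in the text.
def expand(sense_assigned_words):
    expanded = []
    for word, k in sense_assigned_words:
        expanded.append((word, k))
        synsets = wn.synsets(word)   # ordered most- to least-frequent sense
        if k < len(synsets):
            for lemma in synsets[k].lemma_names():
                if lemma != word:
                    expanded.append((lemma, 0))  # synonyms enter with sense 0
    return expanded

print(expand([('fiscal', 0)]))  # [('fiscal', 0), ('financial', 0)]
```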
WSD and synonym expansion pull in opposite directions: WSD tends to decrease
similarity values; synonym expansion tends to increase them. Thus, even though
synonym expansion has increased the similarity value for Sentences 3 and 4, it is likely
also to have increased the similarity value for Sentences 1 and 2. While it may appear
that WSD and synonym expansion are working at odds, this is not the case. What is
crucial to note is that synonym expansion is based on identified word senses. The
semantic context is not expanded blindly, but is focused in the vicinity of the semantic
context provided by the sense-assigned meanings of the original words. Synonym
expansion is not independent of WSD; it requires WSD. We also note that ultimately
it is relative—not absolute—similarity values which are important.
In Section 4 we demonstrate empirically that the resulting similarity measure
outperforms other recently-proposed measures. We now describe the WSD algorithm
we have developed, and the WordNet word-to-word similarity measures which we use.
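Before turning to the WSD algorithm, the overall pipeline of Figure 1 can be summarized in a short sketch. It assumes the helpers sketched elsewhere in this paper's sections: `disambiguate` (Section 3) maps a sentence's content words to (word, sense) pairs, `expand` (Section 2.2) performs synonym expansion, `cosine` is standard, and `word_sim` compares two sense-assigned words; all of these names are ours:

```python
# End-to-end sketch of the Fig. 1 pipeline under the assumptions above.
def sentence_similarity(s1_words, s2_words, word_sim):
    s1, s2 = disambiguate(s1_words), disambiguate(s2_words)
    u = sorted(set(s1) | set(s2))    # reduced space of sense-assigned words
    e1, e2 = expand(s1), expand(s2)  # expanded contexts (u itself unchanged)
    v1 = [max(word_sim(x, w) for w in e1) for x in u]
    v2 = [max(word_sim(x, w) for w in e2) for x in u]
    return cosine(v1, v2)
```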
3 Word Sense Disambiguation
Synonym expansion relies on the correct identification of the WordNet sense in which
the word to be expanded is being used. In this section we describe a novel variant of
the word sense disambiguation algorithm originally proposed by Lesk (1986) [11].
Lesk's method determines the sense of a polysemous word by calculating
the word overlap between the glosses (i.e., definitions) of two or more target words. The
actual senses of the target words are then assumed to be those whose glosses have the
greatest word overlap. For example, in the case of two words w1 and w2, the Lesk score is
defined as $\mathrm{Score}_{\mathrm{Lesk}}(S_1, S_2) = |\mathrm{gloss}(S_1) \cap \mathrm{gloss}(S_2)|$, where $S_1 \in \mathrm{Senses}(w_1)$, $S_2 \in \mathrm{Senses}(w_2)$, and $\mathrm{gloss}(S_i)$ is the bag of words in the definition of sense $S_i$ of $w_i$. Senses which
score the highest value from the above calculation are assigned to the respective words.
While this approach is feasible when the context is small (e.g., two words), it leads to
combinatorial explosion as the number of words increases. For example, in a two-word
context the number of gloss overlap calculations is |senses(w1)| · |senses(w2)|, whereas in the
case of an n-word context, this increases exponentially to |senses(w1)| · |senses(w2)| · … ·
|senses(wn)|. For this reason, a simplified version of this approach is commonly used, in
which the actual sense for word w is selected as the one whose gloss has the greatest
overlap with the words in the context of w. That is, $\mathrm{Score}_{\mathrm{LeskVar}}(S) = |\mathrm{context}(w) \cap \mathrm{gloss}(S)|$,
where context(w) is the bag of words in a context window that surrounds the word w.
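For reference, a minimal sketch of this simplified Lesk variant (using NLTK's WordNet glosses; tokenization and stopword handling are deliberately naive here):

```python
from nltk.corpus import wordnet as wn

# Choose the sense of `word` whose gloss shares the most words with the
# context window; returns the index of the winning sense.
def simplified_lesk(word, context_words):
    best_k, best_overlap = 0, -1
    for k, synset in enumerate(wn.synsets(word)):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & set(context_words))
        if overlap > best_overlap:
            best_k, best_overlap = k, overlap
    return best_k
```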
The method that we propose similarly disambiguates words one at a time; however,
rather than using the context provided only in some fixed-size context window
surrounding the target word, the method disambiguates the target word using the
context provided by all remaining words in the sentence. Essentially, the algorithm
computes the semantic similarity (not overlap) between WordNet glosses of the target
polysemous word and the text made up of all of the remaining words in the sentence,
which we refer to as the context text. The target word is then assigned the sense associated
with the gloss which has the highest semantic similarity score to the context text. This
procedure is then repeated for all other words in the sentence.
To formalize, let $W = \{w_i \mid i = 1..N\}$ be the set of non-stopwords in the sentence containing the words to be disambiguated, and suppose that we wish to disambiguate word $w_i$. Let $G_{w_i}$ be the set of WordNet glosses corresponding to word $w_i$; i.e., $G_{w_i} = \{ g_{w_i}^k \mid k = 1..N_{w_i} \}$, where $N_{w_i}$ is the number of WordNet senses for $w_i$, and $g_{w_i}^k$ is the set of non-stopwords in the $k$th WordNet gloss of $w_i$. Let $R_i$ be the context vector comprising all words from $W$ except $w_i$; i.e., $R_i = \{ w_j \mid w_j \in W,\, j \neq i \}$. The sense for word $w_i$ is identified as the value of $k$ for which $g_{w_i}^k$ is semantically most similar to $R_i$.
The problem, therefore, is again one of calculating the similarity between two text
fragments: the gloss, and the context text. The situation is thus somewhat circular, as our
motivation for introducing word sense disambiguation was to improve the measurement
of short-text similarity. Since attempting to identify the sense of polysemous words in the
gloss and context vectors would lead to an infinite regress, we use only the first WordNet
sense in comparing these vectors, and define the similarity between a gloss and context
vector simply as their cosine similarity in the reduced vector space.
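A sketch of this disambiguation step follows, reusing `semantic_vector` and `cosine` from the earlier sketches; `word_sim0` is an assumed first-sense word-to-word measure, and stopword removal is omitted for brevity:

```python
from nltk.corpus import wordnet as wn

# Score each gloss of the target word by cosine similarity (in the
# reduced vector space) to the context words R_i; return the index of
# the best-scoring sense.
def disambiguate_word(word, context_words, word_sim0):
    best_k, best_score = 0, -1.0
    for k, synset in enumerate(wn.synsets(word)):
        gloss_words = synset.definition().lower().split()
        vocab = sorted(set(gloss_words) | set(context_words))
        vg = semantic_vector(vocab, gloss_words, word_sim0)
        vc = semantic_vector(vocab, context_words, word_sim0)
        score = cosine(vg, vc)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```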
3.1 Word-to-Word Semantic Similarity
Various word-to-word similarity measures have been proposed in the literature, and
can broadly be categorized as either corpus-based, in which case similarity is
calculated based on distributional information derived from large corpora, or
knowledge-based, in which similarity is based on semantic relations expressed in
external resources such as dictionaries or thesauri. In this paper we use knowledge-based measures; specifically, we use the lexical knowledge base WordNet [10].
Two widely used WordNet-based measures, and the measures used in this research,
are shortest path similarity [12] and the Jiang and Conrath [13] measure. Shortest path
similarity is the simpler of the two, and is defined as:
$$\mathrm{Sim}_{\mathrm{path}}(w_1, w_2) = \frac{1}{\mathrm{length}(w_1, w_2)} \qquad (1)$$
where length is the length of the shortest path between two words, and is determined by
simple node counting. The Jiang and Conrath measure [13] is a more sophisticated
measure, based on the idea that the degree to which two words are similar is
proportional to the amount of information they share:
$$\mathrm{Sim}_{\mathrm{J\&C}}(w_1, w_2) = \frac{1}{IC(w_1) + IC(w_2) - 2 \times IC(\mathrm{LCS}(w_1, w_2))} \qquad (2)$$
where LCS(w1, w2) is the word that is the deepest common ancestor of w1 and w2 in
the WordNet hierarchy, IC(w) is the information content of word w, and is defined as
IC(w) = −log P(w), where P(w) is the probability that word w appears in a large
corpus (e.g., the Brown corpus).
Unlike shortest path similarity, the Jiang and Conrath measure cannot calculate the similarity between words with different parts of speech. For this reason, we use the shortest
path measure in the word sense disambiguation phase where we are assuming WordNet
sense 0 for words in the gloss and context vectors, and use either of the measures when
calculating similarity between sense-assigned words. For a comprehensive review of these
and other word similarity measures, see Budanitsky & Hirst (2006) [14].
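Both measures are available in NLTK; a brief sketch is given below (it requires the WordNet and `wordnet_ic` data; the synset-level calls shown would be wrapped with a max over each word's synsets to obtain a word-to-word score):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # Brown-corpus information content
s1, s2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

print(s1.path_similarity(s2))           # Eq. (1): 1 / shortest-path length
print(s1.jcn_similarity(s2, brown_ic))  # Eq. (2): Jiang & Conrath
```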
4 Empirical Results
We present results from applying the similarity measure to three benchmark datasets:
the Microsoft Research Paraphrase (MSRP) Corpus [15], the Recognizing Textual Entailment challenge datasets (RTE2, RTE3) [16], and the 30 sentence-pair dataset [17].
4.1 Paraphrase Recognition
The MSRP dataset consists of 5801 pairs of text fragments collected from a large
number of web newswire postings over a period of 18 months. Each pair was manually
labelled with a binary true or false value by two human annotators, indicating whether
or not the two fragments in a pair were considered a paraphrase of each other. The
corpus is divided into 4076 training pairs and 1725 test pairs. Since the proposed
algorithm is unsupervised (i.e., does not require training from labelled data), we use
only test data. Since it is a binary classification task, a classification threshold needs to
be determined (i.e., the candidate pair is classified as a paraphrase if the similarity
score exceeds this threshold), and the ideal method for comparing performance
between classifiers is to look at the performance corresponding to different
thresholds; e.g., by comparing the area under the ROC curve. Unfortunately this information
is not available for other methods, so in line with other researchers, we consider
thresholds in increments of 0.1, and provide results corresponding to the best
threshold, which in this case was 0.6. Table 1 compares the performance of our
measure with other recently reported approaches, as well as two baselines. Our measure
achieves best performance in terms of both overall accuracy and F-measure, and far
exceeds the baselines. The best performance achieved by a human judge was 83%.
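The threshold sweep just described amounts to the following sketch (`scores` and `labels` are assumed parallel lists of similarity scores and gold paraphrase labels):

```python
# Sweep classification thresholds in increments of 0.1 and keep the one
# giving the best accuracy on the paraphrase task.
def best_threshold(scores, labels):
    best_t, best_acc = 0.0, 0.0
    for i in range(1, 10):
        t = i / 10.0
        preds = [s > t for s in scores]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```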
4.2 Textual Entailment Recognition
Textual entailment recognition is the task of determining whether one text fragment (the hypothesis) is entailed by another (the text). Entailment is an asymmetric relation based on directional inference, and symmetric similarity measures such as that proposed in this paper should not be expected to perform as well as measures designed to utilize a deeper semantic analysis specifically to determine entailment. Nevertheless, the dataset has previously been used to evaluate sentence similarity, and we follow suit.
Table 1. Comparison of performance with other techniques on the MSRP classification dataset. The vector-based baseline measures cosine similarity between vectors in a full bag-of-words representation with tf-idf weighting; the random baseline was created by randomly assigning a true or false value to pairs of text fragments. Baselines are due to Mihalcea et al. (2006) [9].

Measure                                   Acc    Prec   Rec    F
Proposed Similarity Measure
  J&C                                     75.5   91.5   74.6   82.7
  Path                                    73.9   92.4   73.2   82.1
Islam & Inkpen (2008), Corpus-based
  STS                                     72.6   74.7   89.1   81.3
Mihalcea et al. (2006), Corpus-based
  PMI-IR                                  69.9   70.2   95.2   81.0
  LSA                                     68.4   69.7   95.2   80.5
Mihalcea et al. (2006), WordNet-based
  L&C                                     69.5   72.4   87.0   79.0
  J&C                                     69.3   72.2   87.1   79.0
Baselines
  Vector-based                            65.4   71.6   79.5   75.3
  Random                                  51.3   68.3   50.0   57.8
Table 2 shows performance of our measure compared with that recently reported by
Ramage et al. (2009) [19]. Note that two sets of results are reported in [19]: one set in
which the Random Graph Walk method is used as a stand–alone measure, and a
second set in which the graph walk method is incorporated within an existing RTE
system (i.e., a system designed specifically to detect entailment) [20]. The baseline
represents the original performance of this RTE system [20]. The performance of our
measure markedly exceeds that of both the baseline and the Ramage et al. (2009) measure used as a stand-alone measure. It also performs better on the RTE3 dataset than the Ramage et al. [19] method incorporated into the RTE system, and performs approximately equally on the RTE2 dataset. As noted above, participants in the RTE challenge have
used a variety of strategies beyond lexical relatedness, and accuracies as high as 75.4%
[16] and 80% [21] respectively have been reported on the RTE2 and RTE3 datasets.
4.3 30-Sentences Dataset
This dataset is due to Li et al. (2006) [8], and was created by taking a set of 65 noun
pairs, replacing the nouns with their dictionary definitions, and having 32 human
participants rate the similarity in meaning of each of the sentence pairs on a scale of 0.0
to 4.0. When the similarity scores were averaged, the distribution of the scores was
heavily skewed toward the low similarity end of the scale, with 46 pairs rated from 0.0 to
0.9, and 19 pairs rated from 1.0 to 4.0. To obtain a more even distribution across the
similarity range, a subset of 30 sentence pairs was selected, consisting of all 19 sentence
pairs rated 1.0 to 4.0, and 11 taken at equally spaced intervals from the 46 pairs rated 0.0
to 0.9 [17]. Unlike the datasets described above, in which the task is binary classification,
this dataset has been used to compare correlation with human-rated similarity.
The similarity measures proposed in Islam and Inkpen (2008) [18] and Li et al.
(2006) [8] achieved correlations of 0.853 and 0.816 respectively on this task. Our
sentence similarity measure exceeds both of these results, achieving correlations of
0.877 and 0.874 respectively using J&C and Path Length word-to-word similarity
measures. These figures also exceed the mean human correlation of 0.825, and are not
far from the highest correlation of 0.921 achieved by a human participant.
Table 2. Comparison of performance against results recently reported by Ramage et al. (2009) [19], and a baseline RTE system [20]. Classification threshold is 0.5.

Measure                                         RTE3 Accuracy   RTE2 Accuracy
Proposed similarity measure
  Path                                          70.2            62.8
  J&C                                           68.7            63.8
Ramage et al. (2009) with Random Graph Walk
  Cosine                                        55.7            57.0
  Dice                                          55.7            54.2
  Jensen-Shannon                                56.7            57.5
Ramage et al. (2009) with existing RTE system
  Cosine                                        65.8            64.5
  Dice                                          65.4            63.1
  Jensen-Shannon                                65.4            63.2
Baselines
  Existing RTE system                           65.4            63.6
5 Conclusion
The results from the previous section are positive, and suggest that incorporating
word sense disambiguation and synonym expansion does lead to improvement in
sentence similarity measurement. Importantly, this improvement is gained with very
little increase in computational cost. Although we have described in this paper how
these ideas can be incorporated into a measure based on a reduced vector space
representation, the ideas can readily be applied to measures such as that of Mihalcea
et al. (2006), which do not use an explicit vector space representation. While the
empirical results reported in this paper have focused mainly on binary classification
tasks, we believe that an important test for a sentence similarity measure is how well
it performs when used in the context of a more encompassing task such as text mining
or document summarization. Evaluating sentence similarity measures in such contexts
is difficult, however, as many different factors play a role in the success of such
systems, and it is difficult to isolate the effect that any specific sentence similarity
measure may have. As a step towards such a broader evaluation, we are in the process
of applying the measure to challenging sentence clustering tasks. We are also
comparing the performance of the disambiguation algorithm with that of other
disambiguation algorithms directly on standard WSD datasets.
References
1. Atkinson-Abutridy, J., Mellish, C., Aitken, S.: Combining Information Extraction with
Genetic Algorithms for Text Mining. IEEE Intelligent Systems 19(3), 22–30 (2004)
2. Erkan, G., Radev, D.: LexRank: Graph-based Lexical Centrality as Salience in Text
Summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004)
3. Bilotti, M.W., Ogilvie, P., Callan, J., Nyberg, E.: Structured Retrieval for Question
Answering. In: SIGIR 2007, pp. 351–358. ACM, New York (2007)
4. Mohler, M., Mihalcea, R.: Text-to-Text Semantic Similarity for Automatic Short Answer
Grading. In: EACL 2009, Athens, Greece, pp. 567–575 (2009)
5. Coelho, T., Calado, P., Souza, L., Ribeiro-Neto, B., Muntz, R.: Image Retrieval using
Multiple Evidence Ranking. IEEE TKDE 16(4), 408–417 (2004)
6. Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer. Addison-Wesley, Reading (1989)
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, Cambridge (2008)
8. Li, Y., McLean, D., Bandar, Z., O’Shea, F., Crockett, K.: Sentence Similarity Based on
Semantic Nets and Corpus Statistics. IEEE TKDE 18(8), 1138–1150 (2006)
9. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and Knowledge-based Measures
of Text Semantic Similarity. In: 21st National Conference on Artificial Intelligence, vol. 1, pp. 775–780
(2006)
10. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
11. Lesk, M.: Automatic Sense Disambiguation using Machine Readable Dictionaries: How to
tell a pine cone from an ice cream cone. In: Proc. of the SIGDOC, pp. 24–26 (1986)
12. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and Application of a Metric to
Semantic Nets. IEEE Trans. Sys., Man and Cyb. 19(1), 17–30 (1989)
13. Jiang, J.J., Conrath, D.W.: Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy. In: 10th Inter. Conf. on Research in Computational Linguistics, pp. 19–33 (1997)
14. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic
Relatedness. Computational Linguistics 32(1), 13–47 (2006)
15. Dolan, W., Quirk, C., Brockett, C.: Unsupervised Construction of Large
Paraphrase Corpora: Exploiting Massively Parallel News Sources. In: 20th International
Conf. on Computational Linguistics, pp. 350–356 (2004)
16. Dagan, I., Dolan, B., Giampiccolo, D., Magnini, B.: The Third PASCAL Recognizing
Textual Entailment Challenge. In: ACL-PASCAL Workshop on Textual Entailment and
Paraphrasing, pp. 1–9 (2007)
17. Li, Y., McLean, D., Bandar, Z., O’Shea, F., Crockett, K.: Pilot Short Text Semantic
Similarity Benchmark Data Set: Full Listing and Description (2009),
http://www.mendeley.com
18. Islam, A., Inkpen, D.: Semantic Text Similarity using Corpus-based Word Similarity and
String Similarity. ACM Trans. on KDD 2(2), 1–25 (2008)
19. Ramage, D., Rafferty, A., Manning, C.: Random Walks for Text Semantic Similarity. In:
ACL-IJCNLP 2009, pp. 23–31 (2009)
20. Chambers, N., Cer, D., Grenager, T., Hall, D., Kiddon, C., MacCartney, B., de Marneffe,
M., Ramage, D., Yeh, E., Manning, C.: Learning Alignments and Leveraging Natural
Logic. In: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp.
165–170 (2007)
21. Hickl, A., Bensley, J.: A Discourse Commitment-based Framework for Recognizing
Textual Entailment. In: ACL-PASCAL Workshop on Textual Entailment and Paraphrasing,
pp. 171–176 (2007)