An Information-Theoretic Definition of Similarity
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
Abstract
Similarity is an important and widely used concept. Previous definitions of similarity are tied
to a particular application or a form of knowledge representation. We present an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model. We
demonstrate how our definition can be used to
measure the similarity in a number of different
domains.
1 Introduction
Similarity is a fundamental and widely used concept.
Many similarity measures have been proposed, such as
information content [Resnik, 1995b], mutual information
[Hindle, 1990], Dice coefficient [Frakes and Baeza-Yates,
1992], cosine coefficient [Frakes and Baeza-Yates, 1992],
distance-based measurements [Lee et al., 1989; Rada et al.,
1989], and the feature contrast model [Tversky, 1977]. McGill et al. surveyed and compared 67 similarity measures used in information retrieval [McGill et al., 1979].
A problem with previous similarity measures is that each
of them is tied to a particular application or assumes a
particular domain model. For example, distance-based
measures of concept similarity (e.g., [Lee et al., 1989;
Rada et al., 1989]) assume that the domain is represented in
a network. If a collection of documents is not represented
as a network, the distance-based measures do not apply.
The Dice and cosine coefficients are applicable only when
the objects are represented as numerical feature vectors.
Another problem with the previous similarity measures is
that their underlying assumptions are often not explicitly
stated. Without knowing those assumptions, it is impossible to make theoretical arguments for or against any particular measure. Almost all of the comparisons and evaluations of previous similarity measures have been based on empirical results.
This paper presents a definition of similarity that achieves
two goals:
Universality: We define similarity in information-theoretic terms. It is applicable as long as the domain
has a probabilistic model. Since probability theory
can be integrated with many kinds of knowledge
representations, such as first order logic [Bacchus,
1988] and semantic networks [Pearl, 1988], our definition of similarity can be applied to many different
domains where very different similarity measures had
previously been proposed. Moreover, the universality
of the definition also allows the measure to be used in
domains where no similarity measure has previously
been proposed, such as the similarity between ordinal
values.
Theoretical Justification: The similarity measure is not
defined directly by a formula. Rather, it is derived
from a set of assumptions about similarity. In other
words, if the assumptions are deemed reasonable, the
similarity measure necessarily follows.
The remainder of this paper is organized as follows. The
next section presents the derivation of a similarity measure from a set of assumptions about similarity. Sections 3
through 6 demonstrate the universality of our proposal by
applying it to different domains. The properties of different
similarity measures are compared in Section 7.
2 Definition of Similarity
Since our goal is to provide a formal definition of the intuitive concept of similarity, we first clarify our intuitions
about similarity.
Intuition 1: The similarity between A and B is related
to their commonality. The more commonality they
share, the more similar they are.
Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.

Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

Our goal is to arrive at a definition of similarity that captures the above intuitions. However, there are many alternative ways to define similarity that would be consistent with the intuitions. In this section, we first make a set of additional assumptions about similarity that we believe to be reasonable. A similarity measure can then be derived from those assumptions.

In order to capture the intuition that the similarity of two objects is related to their commonality, we need a measure of commonality. Our first assumption is:

Assumption 1: The commonality between A and B is measured by I(common(A, B)), where common(A, B) is a proposition that states the commonalities between A and B, and I(s) is the amount of information contained in a proposition s.

For example, if A is an orange and B is an apple, the proposition that states the commonality between A and B is "fruit(A) and fruit(B)". In information theory [Cover and Thomas, 1991], the information contained in a statement is measured by the negative logarithm of the probability of the statement. Therefore,

  I(common(A, B)) = -log P(fruit(A) and fruit(B))

We also need a measure of the differences between two objects. Since knowing both the commonalities and the differences between A and B means knowing what A and B are, we assume:

Assumption 2: The differences between A and B are measured by

  I(description(A, B)) - I(common(A, B))

where description(A, B) is a proposition that describes what A and B are.

Intuitions 1 and 2 state that the similarity between two objects is related to their commonalities and differences. We assume that commonalities and differences are the only factors.

Assumption 3: The similarity between A and B, sim(A, B), is a function of their commonalities and differences. That is,

  sim(A, B) = f(I(common(A, B)), I(description(A, B)))

The domain of f is {(x, y) | x >= 0, y > 0, y >= x}.

Intuition 3 states that the similarity measure reaches a constant maximum when the two objects are identical. We assume the constant is 1.

Assumption 4: The similarity between a pair of identical objects is 1.

When A and B are identical, knowing their commonalities means knowing what they are, i.e., I(common(A, B)) = I(description(A, B)). Therefore, the function f must have the property: for all x > 0, f(x, x) = 1.

When there is no commonality between A and B, we assume their similarity is 0, no matter how different they are. For example, the similarity between "depth-first search" and "leather sofa" is neither higher nor lower than the similarity between "rectangle" and "interest rate".

Assumption 5: For all y > 0, f(0, y) = 0.

Suppose two objects A and B can be viewed from two independent perspectives. Their similarity can be computed separately from each perspective. For example, the similarity between two documents can be calculated by comparing the sets of words in the documents or by comparing their stylistic parameter values, such as average word length, average sentence length, and average number of verbs per sentence. We assume that the overall similarity of the two documents is a weighted average of their similarities computed from the two perspectives, where the weights are the amounts of information in the descriptions. In other words, we make the following assumption:

Assumption 6: For all x1 <= y1 and x2 <= y2,

  f(x1 + x2, y1 + y2) = (y1 / (y1 + y2)) * f(x1, y1) + (y2 / (y1 + y2)) * f(x2, y2)

From the above assumptions, we can prove the following theorem:

Similarity Theorem: The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

  sim(A, B) = log P(common(A, B)) / log P(description(A, B))

Proof:

  f(x, y) = f(x + 0, x + (y - x))
          = (x / y) * f(x, x) + ((y - x) / y) * f(0, y - x)    (Assumption 6)
          = x / y                                              (Assumptions 4 and 5)

Q.E.D.
Since similarity is the ratio between the amount of information in the commonality and the amount of information in the description of the two objects, if we know the commonality of the two objects, their similarity tells us how much more information is needed to determine what these two objects are.
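As a concrete illustration, the following is a minimal Python sketch of the Similarity Theorem; the orange/apple probabilities are invented for illustration and are not taken from the paper.

```python
import math

def sim(p_common: float, p_description: float) -> float:
    """Similarity Theorem: ratio of the information in the commonality
    of A and B to the information in their full description."""
    return math.log(p_common) / math.log(p_description)

# Hypothetical example: common(A, B) = "fruit(A) and fruit(B)",
# description(A, B) = "orange(A) and apple(B)".  Both probabilities
# below are invented for illustration only.
p_fruit_and_fruit = 0.25       # P(fruit(A) and fruit(B))
p_orange_and_apple = 0.0025    # P(orange(A) and apple(B))
print(sim(p_fruit_and_fruit, p_orange_and_apple))  # about 0.23
```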
In the next 4 sections, we demonstrate how the above definition can be applied in different domains.
3 Similarity between Ordinal Values
Many features have ordinal values. For example, the “quality” attribute can take one of the following values “excellent”, “good”, “average”, “bad”, or “awful”. None of the
previous definitions of similarity provides a measure for
the similarity between two ordinal values. We now show
how our definition can be applied here.
If "the quality of X is excellent" and "the quality of Y is average", the maximally specific statement that can be said of both X and Y is that "the quality of X and Y is between 'average' and 'excellent'". Therefore, the commonality between two ordinal values is the interval delimited by them.
Suppose the distribution of the “quality” attribute is known
(Figure 1). The following are four examples of similarity
calculations:
  sim(excellent, good) = 2 * log P({excellent, good}) / (log P(excellent) + log P(good))
  sim(good, average) = 2 * log P({good, average}) / (log P(good) + log P(average))
  sim(excellent, average) = 2 * log P({excellent, good, average}) / (log P(excellent) + log P(average))
  sim(good, bad) = 2 * log P({good, average, bad}) / (log P(good) + log P(bad))

where P(S) denotes the probability that the quality attribute takes a value in the interval S.

Figure 1: Example Distribution of Ordinal Values (the probability of each of the values "excellent", "good", "average", "bad", and "awful")

The results show that, given the probability distribution in Figure 1, the similarity between "excellent" and "good" is much higher than the similarity between "good" and "average"; the similarity between "excellent" and "average" is much higher than the similarity between "good" and "bad".
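The calculation can be sketched in Python as follows; the distribution below is a hypothetical stand-in rather than the exact distribution plotted in Figure 1.

```python
import math

# Hypothetical distribution over the ordinal "quality" values, ordered from
# best to worst; invented for illustration, not the distribution of Figure 1.
QUALITY = ["excellent", "good", "average", "bad", "awful"]
P = {"excellent": 0.05, "good": 0.10, "average": 0.50, "bad": 0.20, "awful": 0.15}

def sim_ordinal(a: str, b: str) -> float:
    """2 log P(interval delimited by a and b) / (log P(a) + log P(b))."""
    i, j = sorted((QUALITY.index(a), QUALITY.index(b)))
    p_interval = sum(P[v] for v in QUALITY[i:j + 1])
    return 2 * math.log(p_interval) / (math.log(P[a]) + math.log(P[b]))

print(sim_ordinal("excellent", "good"))   # narrow, low-probability interval
print(sim_ordinal("good", "average"))     # wider, high-probability interval
```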
4 Feature Vectors

Feature vectors are one of the simplest and most commonly used forms of knowledge representation, especially in case-based reasoning [Aha et al., 1991; Stanfill and Waltz, 1986] and machine learning. Weights are often assigned to features to account for the fact that the dissimilarity caused by more important features is greater than the dissimilarity caused by less important features. The assignment of the weight parameters is generally heuristic in nature in previous approaches. Our definition of similarity provides a more principled approach, as demonstrated in the following case study.

4.1 String Similarity—A case study

Consider the task of retrieving from a word list the words that are derived from the same root as a given word. For example, given the word "eloquently", our objective is to retrieve the other related words such as "ineloquent", "ineloquently", "eloquent", and "eloquence". To do so, assuming that a morphological analyzer is not available, one can define a similarity measure between two strings and rank the words in the word list in descending order of their similarity to the given word. The similarity measure should be such that words derived from the same root as the given word appear early in the ranking.

We experimented with three similarity measures. The first one is defined as follows:

  simedit(x, y) = 1 / (1 + editDist(x, y))

where editDist(x, y) is the minimum number of character insertion and deletion operations needed to transform one string into the other.
The second similarity measure is based on the number of different trigrams in the two strings:

  simtri(x, y) = 1 / (1 + |tri(x)| + |tri(y)| - 2 * |tri(x) ∩ tri(y)|)

where tri(x) is the set of trigrams in x. For example, tri(eloquent) = {elo, loq, oqu, que, uen, ent}.
Table 1: Top-10 Most Similar Words to "grandiloquent"

Rank  simedit                simtri                  sim
 1    grandiloquently  1/3   grandiloquently  1/2    grandiloquently  0.92
 2    grandiloquence   1/4   grandiloquence   1/4    grandiloquence   0.89
 3    magniloquent     1/6   eloquent         1/8    eloquent         0.61
 4    gradient         1/6   grand            1/9    magniloquent     0.59
 5    grandaunt        1/7   grande           1/10   ineloquent       0.55
 6    gradients        1/7   rand             1/10   eloquently       0.55
 7    grandiose        1/7   magniloquent     1/10   ineloquently     0.50
 8    diluent          1/7   ineloquent       1/10   magniloquence    0.50
 9    ineloquent       1/8   grands           1/10   eloquence        0.50
10    grandson         1/8   eloquently       1/10   ventriloquy      0.42
Table 2: Evaluation of String Similarity Measures (11-point average precisions)

Root    Meaning                   Words  simedit  simtri  sim
agog    leader, leading, bring    23     37%      40%     70%
cardi   heart                     56     18%      21%     47%
circum  around, surrounding       58     24%      19%     68%
gress   to step, to walk, to go   84     22%      31%     52%
loqu    to speak                  39     19%      20%     57%
The third similarity measure is based on our proposed definition of similarity, under the assumption that the probability of a trigram occurring in a word is independent of the other trigrams in the word:

  sim(x, y) = 2 * Σ_{t ∈ tri(x) ∩ tri(y)} log P(t) / (Σ_{t ∈ tri(x)} log P(t) + Σ_{t ∈ tri(y)} log P(t))
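A minimal Python sketch of the three string measures follows. The tiny word list and the trigram probabilities derived from it are placeholders for the 109,582-word list used in the experiment, and the insertion/deletion edit distance is computed via the longest common subsequence.

```python
import math
from collections import Counter

def trigrams(word):
    """Set of letter trigrams of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def edit_dist(x, y):
    """Minimum number of character insertions and deletions, via the LCS."""
    lcs = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            lcs[i][j] = lcs[i - 1][j - 1] + 1 if a == b else max(lcs[i - 1][j], lcs[i][j - 1])
    return len(x) + len(y) - 2 * lcs[len(x)][len(y)]

def sim_edit(x, y):
    return 1.0 / (1.0 + edit_dist(x, y))

def sim_tri(x, y):
    tx, ty = trigrams(x), trigrams(y)
    return 1.0 / (1.0 + len(tx) + len(ty) - 2 * len(tx & ty))

def sim_info(x, y, p):
    """Proposed measure: 2 * I(shared trigrams) / (I(tri(x)) + I(tri(y)))."""
    tx, ty = trigrams(x), trigrams(y)
    info = lambda ts: -sum(math.log(p[t]) for t in ts)
    denom = info(tx) + info(ty)
    return 2 * info(tx & ty) / denom if denom else 0.0

# Toy trigram probabilities estimated from a tiny stand-in word list.
words = ["eloquent", "eloquently", "eloquence", "ineloquent", "grandiloquent", "gradient"]
counts = Counter(t for w in words for t in trigrams(w))
total = sum(counts.values())
p = {t: c / total for t, c in counts.items()}
print(sim_edit("eloquent", "ineloquent"), sim_tri("eloquent", "ineloquent"),
      sim_info("eloquent", "ineloquent", p))
```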
Table 1 shows the top 10 most similar words to "grandiloquent" according to the above three similarity measures. To determine which similarity measure ranks higher the words that are derived from the same root as the given word, we adopted the evaluation metrics used in the Text Retrieval Conference [Harman, 1993]. We used a 109,582-word list from the AI Repository (http://www.cs.cmu.edu/afs/cs/project/ai-repository). The probabilities of trigrams are estimated by their frequencies in the words. Let W denote the set of words in the word list and W_R denote the subset of W whose members are derived from the root R. Let (w_1, w_2, ..., w_n) denote the ordering of W in descending order of similarity to a given word w according to a similarity measure. The precision of (w_1, ..., w_n) at recall level N% is defined as the maximum value of |{w_1, ..., w_k} ∩ W_R| / k such that 1 <= k <= n and |{w_1, ..., w_k} ∩ W_R| / |W_R| >= N%. The quality of the sequence (w_1, ..., w_n) can be measured by the 11-point average of its precisions at recall levels 0%, 10%, 20%, ..., and 100%.
The average precision values are then averaged over all the words in W_R. The results on 5 roots are shown in Table 2. It can be seen that much better results were achieved with sim than with the other similarity measures. The reason for this is that simedit and simtri treat all characters or trigrams equally, whereas sim is able to automatically take into account the varied importance of different trigrams.
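The 11-point average precision used here is the standard TREC-style interpolated measure; the following Python sketch (a hypothetical helper, not code from the paper) shows the computation for a single query word.

```python
def eleven_point_average_precision(ranking, relevant):
    """11-point interpolated average precision of a ranked word list.
    ranking: words ordered by decreasing similarity to the query word.
    relevant: set of words derived from the same root as the query word."""
    hits, points = 0, []
    for k, w in enumerate(ranking, 1):
        hits += w in relevant
        points.append((hits / len(relevant), hits / k))   # (recall, precision) at cutoff k
    levels = [n / 10 for n in range(11)]                   # recall levels 0%, 10%, ..., 100%
    return sum(max((p for r, p in points if r >= level), default=0.0)
               for level in levels) / len(levels)

# Toy example: three of the five retrieved words share the query's root.
ranking = ["eloquent", "gradient", "eloquence", "grandson", "ineloquent"]
relevant = {"eloquent", "eloquence", "ineloquent"}
print(eleven_point_average_precision(ranking, relevant))
```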
5 Word Similarity
In this section, we show how to measure similarities between words according to their distribution in a text corpus
[Pereira et al., 1993]. Similar to [Alshawi and Carter, 1994;
Grishman and Sterling, 1994; Ruge, 1992], we use a parser
to extract dependency triples from the text corpus. A dependency triple consists of a head, a dependency type and
a modifier. For example, the dependency triples in “I have
a brown dog” consist of:
(1) (have subj I), (have obj dog), (dog adj-mod brown),
(dog det a)
where “subj” is the relationship between a verb and its subject; “obj” is the relationship between a verb and its object;
“adj-mod” is the relationship between a noun and its adjective modifier and “det” is the relationship between a noun
and its determiner.
We can view dependency triples extracted from a corpus
as features of the heads and modifiers in the triples. For example, if (avert obj duty) is a dependency triple, we say that
“duty” has the feature obj-of(avert) and “avert” has the feature obj(duty). Other words that also possess the feature
obj-of(avert) include “default”, “crisis”, “eye”, “panic”,
“strike”, “war”, etc., which are also used as objects of
“avert” in the corpus.
Table 3 shows a subset of the features of "duty" and "sanction". Each row corresponds to a feature. An 'x' in the "duty" or "sanction" column means that the word possesses that feature.
Table 3: Features of "duty" and "sanction"

Feature                   duty  sanction  I(f)
f1: subj-of(include)       x              3.15
f2: obj-of(assume)         x       x      5.43
f3: obj-of(avert)          x       x      5.88
f4: obj-of(ease)           x       x      4.99
f5: obj-of(impose)         x       x      4.97
f6: adj-mod(fiduciary)     x              7.76
f7: adj-mod(punitive)              x      7.10
f8: adj-mod(economic)              x      3.70
Let F(w) be the set of features possessed by a word w. F(w) can be viewed as a description of the word w. The commonality between two words w1 and w2 is then F(w1) ∩ F(w2). The similarity between two words is defined as follows:

(2)  sim(w1, w2) = 2 * I(F(w1) ∩ F(w2)) / (I(F(w1)) + I(F(w2)))

where I(F) is the amount of information contained in a set of features F. Assuming that features are independent of one another, I(F) = -Σ_{f ∈ F} log P(f), where P(f) is the probability of feature f. When two words have identical sets of features, their similarity reaches the maximum value of 1. The minimum similarity 0 is reached when two words do not have any common feature.

The probability P(f) can be estimated by the percentage of words that have feature f among the set of words that have the same part of speech. For example, there are 32868 unique nouns in a corpus, 1405 of which were used as subjects of "include". The probability of subj-of(include) is therefore 1405/32868. The probability of the feature adj-mod(fiduciary) is 14/32868 because only 14 (unique) nouns were modified by "fiduciary". The amount of information in the feature adj-mod(fiduciary), 7.76, is greater than the amount of information in subj-of(include), 3.15. This agrees with our intuition that saying that a word can be modified by "fiduciary" is more informative than saying that the word can be the subject of "include".
The fourth column in Table 3 shows the amount of information contained in each feature. If the features in Table 3 were all the features of "duty" and "sanction", the similarity between "duty" and "sanction" would be:

  sim(duty, sanction) = 2 * I({f2, f3, f4, f5}) / (I({f1, f2, f3, f4, f5, f6}) + I({f2, f3, f4, f5, f7, f8}))

which is equal to 0.66.
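The calculation can be sketched in Python as follows. The information values are those of Table 3, but the exact assignment of the subset features to "duty" and "sanction" is a reconstruction chosen to be consistent with the reported 0.66, not data given explicitly in the paper.

```python
import math

def word_sim(feats1, feats2, info):
    """sim(w1, w2) = 2 * I(F1 & F2) / (I(F1) + I(F2)), with I(F) = sum of -log P(f)."""
    I = lambda fs: sum(info[f] for f in fs)
    denom = I(feats1) + I(feats2)
    return 2 * I(feats1 & feats2) / denom if denom else 0.0

# Information values from Table 3; the feature sets below are illustrative
# reconstructions of which word possesses which feature.
info = {"subj-of(include)": 3.15, "obj-of(assume)": 5.43, "obj-of(avert)": 5.88,
        "obj-of(ease)": 4.99, "obj-of(impose)": 4.97, "adj-mod(fiduciary)": 7.76,
        "adj-mod(punitive)": 7.10, "adj-mod(economic)": 3.70}
duty = {"subj-of(include)", "obj-of(assume)", "obj-of(avert)",
        "obj-of(ease)", "obj-of(impose)", "adj-mod(fiduciary)"}
sanction = {"obj-of(assume)", "obj-of(avert)", "obj-of(ease)",
            "obj-of(impose)", "adj-mod(punitive)", "adj-mod(economic)"}
print(round(word_sim(duty, sanction, info), 2))  # about 0.66 with these sets
```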
We parsed a 22-million-word corpus consisting of Wall
Street Journal and San Jose Mercury with a principle-based
broad-coverage parser, called PRINCIPAR [Lin, 1993;
Lin, 1994]. Parsing took about 72 hours on a Pentium
200 with 80MB memory. From these parse trees we extracted about 14 million dependency triples. The frequency
counts of the dependency triples are stored and indexed in
a 62MB dependency database, which constitutes the set of
feature descriptions of all the words in the corpus. Using
this dependency database, we computed pairwise similarity
between 5230 nouns that occurred at least 50 times in the
corpus.
The words with similarity to “duty” greater than 0.04 are
listed in (3) in descending order of their similarity.
(3) responsibility, position, sanction, tariff, obligation,
fee, post, job, role, tax, penalty, condition, function,
assignment, power, expense, task, deadline, training,
work, standard, ban, restriction, authority,
commitment, award, liability, requirement, staff,
membership, limit, pledge, right, chore, mission,
care, title, capability, patrol, fine, faith, seat, levy,
violation, load, salary, attitude, bonus, schedule,
instruction, rank, purpose, personnel, worth,
jurisdiction, presidency, exercise.
The following is the entry for “duty” in the Random House
Thesaurus [Stein and Flexner, 1984].
(4) duty n. 1. *obligation*, *responsibility*; onus; business, province; 2. *function*, *task*, *assignment*, charge. 3. *tax*, *tariff*, customs, excise, *levy*.

The words marked with asterisks in (4) also appear in (3). It can be seen that our program captured all three senses of "duty" in [Stein and Flexner, 1984].
Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 622 pairs of RNNs among the 5230 nouns that occurred at least 50 times in the parsed corpus. Table 4 shows every 10th RNN.
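A small Python sketch of the RNN extraction follows; the (earnings, profit) score is taken from Table 4, while the remaining similarity scores are toy values.

```python
def respective_nearest_neighbors(sim_matrix):
    """Return pairs (w1, w2) such that each is the other's most similar word.
    sim_matrix: dict mapping each word to a dict of its similarity scores."""
    nearest = {w: max(others, key=others.get) for w, others in sim_matrix.items() if others}
    return {tuple(sorted((w, n))) for w, n in nearest.items() if nearest.get(n) == w}

sims = {"earnings": {"profit": 0.50, "revenue": 0.30},
        "profit": {"earnings": 0.50, "revenue": 0.35},
        "revenue": {"profit": 0.35, "earnings": 0.30}}
print(respective_nearest_neighbors(sims))  # {('earnings', 'profit')}
```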
Table 4: Respective Nearest Neighbors

Rank  RNN                         Sim
  1   earnings profit             0.50
 11   revenue sale                0.39
 21   acquisition merger          0.34
 31   attorney lawyer             0.32
 41   data information            0.30
 51   amount number               0.27
 61   downturn slump              0.26
 71   there way                   0.24
 81   fear worry                  0.23
 91   jacket shirt                0.22
101   film movie                  0.21
111   felony misdemeanor          0.21
121   importance significance     0.20
131   reaction response           0.19
141   heroin marijuana            0.19
151   championship tournament     0.18
161   consequence implication     0.18
171   rape robbery                0.17
181   dinner lunch                0.17
191   turmoil upheaval            0.17
201   biggest largest             0.17
211   blaze fire                  0.16
221   captive westerner           0.16
231   imprisonment probation      0.16
241   apparel clothing            0.15
251   comment elaboration         0.15
261   disadvantage drawback       0.15
271   infringement negligence     0.15
281   angler fishermen            0.14
291   emission pollution          0.14
301   granite marble              0.14
311   gourmet vegetarian          0.14
321   publicist stockbroker       0.14
331   maternity outpatient        0.13
341   artillery warplanes         0.13
351   psychiatrist psychologist   0.13
361   blunder fiasco              0.13
371   door window                 0.13
381   counseling therapy          0.12
391   austerity stimulus          0.12
401   ours yours                  0.12
411   procurement zoning          0.12
421   neither none                0.12
431   briefcase wallet            0.11
441   audition rite               0.11
451   nylon silk                  0.11
461   columnist commentator       0.11
471   avalanche raft              0.11
481   herb olive                  0.11
491   distance length             0.10
501   interruption pause          0.10
511   ocean sea                   0.10
521   flying watching             0.10
531   ladder spectrum             0.09
541   lotto poker                 0.09
551   camping skiing              0.09
561   lip mouth                   0.09
571   mounting reducing           0.09
581   pill tablet                 0.08
591   choir troupe                0.08
601   conservatism nationalism    0.08
611   bone flesh                  0.07
621   powder spray                0.06
Some of the pairs may look peculiar, but detailed examination actually reveals that they are quite reasonable. For example, the 221st-ranked pair is "captive" and "westerner". It is very unlikely that any manually created thesaurus would list them as near-synonyms. We manually examined all 274 occurrences of "westerner" in the corpus and found that 55% of them refer to westerners in captivity. Some of the bad RNNs, such as (avalanche, raft) and (audition, rite), were due to their relatively low frequencies (they all occurred 50-60 times in the parsed corpus), which make them susceptible to accidental commonalities, such as:

(5) The avalanche/raft drifted/hit ....
    To hold/attend the audition/rite.
    An uninhibited audition/rite.
6 Semantic Similarity in a Taxonomy
Semantic similarity [Resnik, 1995b] refers to the similarity between two concepts in a taxonomy such as WordNet [Miller, 1990] or the CYC upper ontology. The semantic similarity between two classes A and B is not about the classes themselves. When we say "rivers and ditches are similar", we are not comparing the set of rivers with the set of ditches. Instead, we are comparing a generic river and a generic ditch. Therefore, we define sim(A, B) to be the similarity between x and y if all we know about x and y is that x ∈ A and y ∈ B.

The two statements "x ∈ A" and "y ∈ B" are independent (instead of being assumed to be independent) because the selection of a generic A is not related to the selection of a generic B. The amount of information contained in "x ∈ A and y ∈ B" is

  -log P(A) - log P(B)

where P(A) and P(B) are the probabilities that a randomly selected object belongs to A and B, respectively. Assuming that the taxonomy is a tree, if x ∈ A and y ∈ B, the commonality between x and y is "x ∈ C and y ∈ C", where C is the most specific class that subsumes both A and B. Therefore,

  sim(A, B) = 2 * log P(C) / (log P(A) + log P(B))
For example, Figure 2 is a fragment of WordNet. The number attached to each node C is P(C).

Figure 2: A Fragment of WordNet. entity (0.395) -> inanimate-object (0.167) -> natural-object (0.0163) -> geological-formation (0.00176); geological-formation -> natural-elevation (0.000113) -> hill (0.0000189); geological-formation -> shore (0.0000836) -> coast (0.0000216)

The similarity between the concepts of Hill and Coast is:

  sim(Hill, Coast) = 2 * log P(Geological-Formation) / (log P(Hill) + log P(Coast))

which is equal to 0.59.
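A Python sketch of this computation over the Figure 2 fragment follows; the parent links below simply encode the tree shown in the figure.

```python
import math

# Node probabilities and parent links from the WordNet fragment in Figure 2.
prob = {"entity": 0.395, "inanimate-object": 0.167, "natural-object": 0.0163,
        "geological-formation": 0.00176, "natural-elevation": 0.000113,
        "hill": 0.0000189, "shore": 0.0000836, "coast": 0.0000216}
parent = {"inanimate-object": "entity", "natural-object": "inanimate-object",
          "geological-formation": "natural-object",
          "natural-elevation": "geological-formation", "shore": "geological-formation",
          "hill": "natural-elevation", "coast": "shore"}

def ancestors(c):
    """Chain of classes from c up to the root, including c itself."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def sim_taxonomy(a, b):
    """2 * log P(C) / (log P(a) + log P(b)), C = most specific common superclass."""
    anc_b = set(ancestors(b))
    common = next(c for c in ancestors(a) if c in anc_b)
    return 2 * math.log(prob[common]) / (math.log(prob[a]) + math.log(prob[b]))

print(round(sim_taxonomy("hill", "coast"), 2))  # 0.59
```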
There have been many proposals to use the distance between two concepts in a taxonomy as the basis for their similarity [Lee et al., 1989; Rada et al., 1989]. Resnik [Resnik, 1995b] showed that the distance-based similarity measures do not correlate with human judgments as well as his measure does. Resnik's similarity measure is quite close to the one proposed here:

  simResnik(A, B) = (1/2) * I(common(A, B))

For example, in Figure 2, simResnik(Hill, Coast) = -log P(Geological-Formation).
Wu and Palmer [Wu and Palmer, 1994] proposed a measure for semantic similarity that could be regarded as a special case of sim(A, B):

  simWu&Palmer(A, B) = 2 * N3 / (N1 + N2 + 2 * N3)

where N1 and N2 are the number of IS-A links from A and B to their most specific common superclass C, and N3 is the number of IS-A links from C to the root of the taxonomy. For example, the most specific common superclass of Hill and Coast is Geological-Formation. Thus, N1 = 2, N2 = 2, N3 = 3, and simWu&Palmer(Hill, Coast) = 2 * 3 / (2 + 2 + 2 * 3) = 0.6.
Table 5: Results of Comparison between Semantic Similarity Measures

Word Pair                          Miller&Charles  Resnik  Wu&Palmer  sim
car, automobile                    3.92            11.630  1.00       1.00
gem, jewel                         3.84            15.634  1.00       1.00
journey, voyage                    3.84            11.806   .91        .89
boy, lad                           3.76             7.003   .90        .85
coast, shore                       3.70             9.375   .90        .93
asylum, madhouse                   3.61            13.517   .93        .97
magician, wizard                   3.50             8.744  1.00       1.00
midday, noon                       3.42            11.773  1.00       1.00
furnace, stove                     3.11             2.246   .41        .18
food, fruit                        3.08             1.703   .33        .24
bird, cock                         3.05             8.202   .91        .83
bird, crane                        2.97             8.202   .78        .67
tool, implement                    2.95             6.136   .90        .80
brother, monk                      2.82             1.722   .50        .16
crane, implement                   1.68             3.263   .63        .39
lad, brother                       1.66             1.722   .55        .20
journey, car                       1.16             0       0          0
monk, oracle                       1.10             1.722   .41        .14
food, rooster                      0.89              .538   .7         .04
coast, hill                        0.87             6.329   .63        .58
forest, graveyard                  0.84             0       0          0
monk, slave                        0.55             1.722   .55        .18
coast, forest                      0.42             1.703   .33        .16
lad, wizard                        0.42             1.722   .55        .20
chord, smile                       0.13             2.947   .41        .20
glass, magician                    0.11              .538   .11        .06
noon, string                       0.08             0       0          0
rooster, voyage                    0.08             0       0          0
Correlation with Miller & Charles  1.00             0.795  0.803      0.834
Resnik [Resnik, 1995a] evaluated three different similarity measures by correlating their similarity scores on 28 pairs of concepts in the WordNet with assessments made by human subjects [Miller and Charles, 1991]. We adopted the same data set and evaluation methodology to compare simResnik, simWu&Palmer and sim. Table 5 shows the similarities between the 28 pairs of concepts under the three similarity measures. Column Miller&Charles lists the average similarity scores (on a scale of 0 to 4) assigned by human subjects in Miller and Charles's experiments [Miller and Charles, 1991]. Our definition of similarity yielded a slightly higher correlation with human judgments than the other two measures.

Interestingly, if P(x)/P(y) is the same for all pairs of concepts x and y such that there is an IS-A link from x to y in the taxonomy, simWu&Palmer(A, B) coincides with sim(A, B).

7 Comparison between Different Similarity Measures
One of the most commonly used similarity measures is the Dice coefficient. Suppose two objects can be described with two numerical feature vectors (a_1, ..., a_n) and (b_1, ..., b_n); their Dice coefficient is defined as

  simdice(A, B) = 2 * Σ_{i=1..n} a_i b_i / (Σ_{i=1..n} a_i^2 + Σ_{i=1..n} b_i^2)

Another class of similarity measures is based on a distance metric. Suppose dist(A, B) is a distance metric between two objects; simdist can be defined as follows:

  simdist(A, B) = 1 / (1 + dist(A, B))
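Minimal Python sketches of these two measures follow; the Euclidean distance used for simdist is just one example of a distance metric.

```python
import math

def sim_dice(a, b):
    """Dice coefficient for two numerical feature vectors."""
    num = 2 * sum(x * y for x, y in zip(a, b))
    den = sum(x * x for x in a) + sum(y * y for y in b)
    return num / den if den else 0.0

def sim_dist(a, b):
    """Distance-based similarity: 1 / (1 + dist(a, b)), here with Euclidean distance."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

print(sim_dice([1, 0, 2], [1, 1, 2]), sim_dist([1, 0, 2], [1, 1, 2]))
```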
Table 6: Comparison between Similarity Measures
(WP: simWu&Palmer; R: simResnik; Dice: simdice)

Property                    sim  WP   R    Dice  simdist
increases with commonality  yes  yes  yes  yes   no
decreases with difference   yes  yes  no   yes   yes
triangle inequality         no   no   no   no    yes
Assumption 6                yes  yes  no   yes   no
maximum value = 1           yes  yes  no   yes   yes
semantic similarity         yes  yes  yes  no    yes
word similarity             yes  no   no   yes   yes
ordinal values              yes  no   no   no    no
Table 6 summarizes the comparison among 5 similarity
measures.
Commonality and Difference: While most similarity
measures increase with commonality and decrease with
difference, simdist only decreases with difference and
simResnik only takes commonality into account.
Triangle Inequality: A distance metric must satisfy the triangle inequality: dist(A, C) <= dist(A, B) + dist(B, C). Consequently, simdist has the property that simdist(A, C) cannot be arbitrarily close to 0 if neither simdist(A, B) nor simdist(B, C) is 0. This can be counter-intuitive in some situations. For example, in Figure 3, A and B are similar in their shades, B and C are similar in their shape, but A and C are not similar.
Figure 3: Counter-example of Triangle Inequality
Assumption 6: The strongest assumption that we made in Section 2 is Assumption 6. However, this assumption is not unique to our proposal. Both simWu&Palmer and simdice also satisfy Assumption 6. Suppose two objects A and B are represented by two feature vectors (a_1, ..., a_n) and (b_1, ..., b_n), respectively. Without loss of generality, suppose the first k features and the remaining n - k features represent two independent perspectives of the objects. Then

  simdice(A, B) = 2 * Σ_{i=1..n} a_i b_i / Σ_{i=1..n} (a_i^2 + b_i^2)
               = [Σ_{i=1..k} (a_i^2 + b_i^2) / Σ_{i=1..n} (a_i^2 + b_i^2)] * [2 * Σ_{i=1..k} a_i b_i / Σ_{i=1..k} (a_i^2 + b_i^2)]
                 + [Σ_{i=k+1..n} (a_i^2 + b_i^2) / Σ_{i=1..n} (a_i^2 + b_i^2)] * [2 * Σ_{i=k+1..n} a_i b_i / Σ_{i=k+1..n} (a_i^2 + b_i^2)]

which is a weighted average of the similarity between A and B in each of the two perspectives.
Maximum Similarity Values: With most similarity measures, the maximum similarity is 1; the exception is simResnik, which has no upper bound on its similarity values.
Application Domains: The similarity measure proposed in
this paper can be applied in all the domains listed in Table
6, including the similarity of ordinal values, where none of
the other similarity measures is applicable.
8 Conclusion
Similarity is an important and fundamental concept in AI
and many other fields. Previous proposals for similarity
measures are heuristic in nature and tied to a particular domain or form of knowledge representation. In this paper,
we present a universal definition of similarity in terms of
information theory. The similarity measure is not directly
stated as in earlier definitions; rather, it is derived from a
set of assumptions. In other words, if one accepts the assumptions, the similarity measure necessarily follows. The
universality of the definition is demonstrated by its applications in different domains where different similarity measures have been employed before.
Acknowledgment
The author wishes to thank the anonymous reviewers for
their valuable comments. This research has been partially
supported by NSERC Research Grant OGP121338.
References
[Aha et al., 1991] Aha, D., Kibler, D., and Albert, M.
(1991). Instance-Based Learning Algorithms. Machine
Learning, 6(1):37–66.
[Alshawi and Carter, 1994] Alshawi, H. and Carter, D.
(1994). Training and scaling preference functions for
disambiguation. Computational Linguistics, 20(4):635–
648.
[Bacchus, 1988] Bacchus, F. (1988). Representing and
Reasoning with Probabilistic Knowledge. PhD thesis,
University of Alberta, Edmonton, Alberta, Canada.
[Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A.
(1991). Elements of information theory. Wiley series in
telecommunications. Wiley, New York.
[Frakes and Baeza-Yates, 1992] Frakes, W. B. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. Prentice Hall.
[Grishman and Sterling, 1994] Grishman, R. and Sterling,
J. (1994). Generalizing automatically generated selectional patterns. In Proceedings of COLING-94, pages
742–747, Kyoto, Japan.
[Harman, 1993] Harman, D. (1993). Overview of the first
text retrieval conference. In Proceedings of SIGIR’93,
pages 191–202.
[McGill et al., 1979] McGill, M. et al. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Project report, Syracuse University School of Information Studies.
[Miller, 1990] Miller, G. A. (1990). WordNet: An on-line
lexical database. International Journal of Lexicography,
3(4):235–312.
[Miller and Charles, 1991] Miller, G. A. and Charles,
W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in
Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers, Inc., San Mateo, California.
[Pereira et al., 1993] Pereira, F., Tishby, N., and Lee, L.
(1993). Distributional Clustering of English Words.
In Proceedings of ACL-93, pages 183–190, Ohio State
University, Columbus, Ohio.
[Rada et al., 1989] Rada, R., Mili, H., Bicknell, E., and
Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems,
Man, and Cybernetics, 19(1):17–30.
[Resnik, 1995a] Resnik, P. (1995a). Disambiguating noun
groupings with respect to wordnet senses. In Third
Workshop on Very Large Corpora. Association for Computational Linguistics.
[Resnik, 1995b] Resnik, P. (1995b). Using information
content to evaluate semantic similarity in a taxonomy.
In Proceedings of IJCAI-95, pages 448–453, Montreal,
Canada.
[Ruge, 1992] Ruge, G. (1992). Experiments on linguistically based term associations. Information Processing
& Management, 28(3):317–332.
[Hindle, 1990] Hindle, D. (1990). Noun classification
from predicate-argument structures. In Proceedings of
ACL-90, pages 268–275, Pittsburgh, Pennsylvania.
[Stanfill and Waltz, 1986] Stanfill, C. and Waltz, D.
(1986). Toward Memory-based Reasoning. Communications of the ACM, 29:1213–1228.
[Lee et al., 1989] Lee, J. H., Kim, M. H., and Lee, Y. J.
(1989). Information retrieval based on conceptual distance in is-a hierarchies. Journal of Documentation,
49(2):188–207.
[Stein and Flexner, 1984] Stein, J. and Flexner, S. B., editors (1984). Random House College Thesaurus. Random House, New York.
[Lin, 1993] Lin, D. (1993). Principle-based parsing without overgeneration. In Proceedings of ACL–93, pages
112–120, Columbus, Ohio.
[Lin, 1994] Lin, D. (1994). PRINCIPAR—an efficient, broad-coverage, principle-based parser. In Proceedings of
COLING–94, pages 482–488. Kyoto, Japan.
[Tversky, 1977] Tversky, A. (1977). Features of similarity.
Psychological Review, 84:327–352.
[Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994).
Verb semantics and lexical selection. In Proceedings of
the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New
Mexico.