Paper Id: 573
Automatic Retrieval and Clustering of Similar Words
Dekang Lin
Department of Computer Science
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
[email protected]
January 28, 1998
Abstract

Bootstrapping semantics from text is one of the greatest challenges in natural language learning. Earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. We present an approach that goes a step further by obtaining a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees.
Submission Type: paper
Topic Areas: R2: Lexical Resources
Author of Record: Dekang Lin
Under consideration for other conferences (specify)? none
1 Introduction

Bootstrapping semantics from syntax is one of the greatest challenges in natural language learning. Earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words (Grefenstette, 1994; Hindle, 1990; Ruge, 1992). We present a method that goes a step further by creating a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees.

The main advantage of automatically retrieved similar words over manually constructed general-purpose dictionaries and thesauri is that the similar words extracted for a given word are related to the meanings of that word in the corpus. One of the biggest problems in using general-purpose lexical resources in a particular application is that they contain many senses of words that are never used in the application domain (Jacob, 1991). Furthermore, automatically extracted similar words can provide valuable help in the compilation of dictionaries and thesauri, which is a tremendously difficult task. By comparison with WordNet (Miller et al., 1990), we demonstrate that our program is able to identify common usages of words that have been overlooked by lexicographers.

The next section is concerned with computing similarities between words according to the collocations of the words. In Section 3, we define the notion of a similarity tree, which organizes the similar words of a given word in a tree structure, and then present an algorithm for pruning the similarity tree. We will also discuss some sample results and compare them with the corresponding entries in WordNet (Miller et al., 1990). Finally, in Section 4, we briefly review related work and summarize our contributions.
2 Word Similarity

2.1 Dependency triples

Similar to (Alshawi and Carter, 1994; Grishman and Sterling, 1994; Ruge, 1992), we use a parser to extract dependency triples from the text corpus. A dependency triple consists of a head, a dependency type and a modifier. For example, the triples extracted from the sentence "I have a brown dog" are:

(1) (have subj I), (have obj dog), (dog adj-mod brown), (dog det a)

Our text corpus consists of the 55-million-word Wall Street Journal and the 45-million-word San Jose Mercury. Two steps are taken to reduce the number of errors in the parsed corpus. Firstly, only sentences with no more than 25 words are fed into the parser. Secondly, only complete parses are included in the parsed corpus. The 100-million-word text corpus is parsed in about 72 hours on a Pentium 200 with 80MB of memory. There are about 22 million words in the parse trees.
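To make the representation concrete, the following is a minimal Python sketch (an illustration, not the paper's implementation) of how triples like those in (1) can be inverted into per-word features; the feature-naming convention follows Section 2.2:

```python
# Sketch: dependency triples as (head, type, modifier) tuples, inverted
# into per-word features. The head of a triple gets the feature
# type(modifier); the modifier gets the feature type-of(head).
# The triples are those of example (1).

from collections import defaultdict

triples = [
    ("have", "subj", "I"),
    ("have", "obj", "dog"),
    ("dog", "adj-mod", "brown"),
    ("dog", "det", "a"),
]

def features(triples):
    """Map each word to the set of features it receives from the triples."""
    feats = defaultdict(set)
    for head, dep_type, modifier in triples:
        feats[head].add(f"{dep_type}({modifier})")
        feats[modifier].add(f"{dep_type}-of({head})")
    return feats

feats = features(triples)
print(sorted(feats["dog"]))  # ['adj-mod(brown)', 'det(a)', 'obj-of(have)']
```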
2.2 Similarity measure

We can view the dependency triples extracted from the corpus as features of the words that participate in them. Suppose (avert obj duty) is a dependency triple extracted from the corpus; we say that "duty" has the feature obj-of(avert) and "avert" has the feature obj(duty). Other words that also have the feature obj-of(avert) include "default", "crisis", "eye", "panic", "strike", "war", etc.

Table 1 shows a subset of the features of "duty" and "sanction". Each row corresponds to a feature. An `x' in the "duty" or "sanction" column means that the word has that feature.

Table 1: Features of "duty" and "sanction"

  Feature                  duty  sanction  -log P(f)
  f1: subj-of(include)      x       x        3.15
  f2: obj-of(assume)        x                5.43
  f3: obj-of(avert)         x       x        5.88
  f4: obj-of(ease)                  x        4.99
  f5: obj-of(impose)        x       x        4.97
  f6: adj-mod(fiduciary)    x                7.76
  f7: adj-mod(punitive)     x       x        7.10
  f8: adj-mod(economic)             x        3.70

The similarity between two words can be computed according to their features. Our similarity measure is based on a proposal in (Lin, 1997), where the similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects. Let F(w) be the set of features possessed by w and I(S) be the amount of information contained in a set of features S. We define the similarity between two words as follows:

(2) sim(w1, w2) = 2 x I(F(w1) ∩ F(w2)) / (I(F(w1)) + I(F(w2)))

Assuming that features are independent of one another, I(S) = -Σ_{f∈S} log P(f), where P(f) is the probability of feature f. When two words have identical sets of features, their similarity reaches the maximum value 1.0. The minimum similarity 0 is reached when two words do not have any common feature.

The probability P(f) is estimated by the percentage of words that have feature f among the set of words that have the same part of speech. For example, there are 32868 unique nouns in the parsed corpus, 1405 of which were used as the subject of "include". The probability of subj-of(include) is 1405/32868. The probability of the feature adj-mod(fiduciary) is 14/32868 because only 14 (unique) nouns were modified by "fiduciary". The amount of information in the feature adj-mod(fiduciary) is therefore greater than the amount of information in subj-of(include). This agrees with our intuition that saying that a word can be modified by "fiduciary" is more informative than saying that the word can be the subject of "include".

The -log P(f) column in Table 1 shows the amount of information contained in each feature. If the features in Table 1 were all the features of "duty" and "sanction", their similarity would be:

  2 x I({f1, f3, f5, f7}) / (I({f1, f2, f3, f5, f6, f7}) + I({f1, f3, f4, f5, f7, f8})) = 0.66

2.3 Sample results

The following are words with similarity to "duty" greater than 0.04:

(3) responsibility, position, sanction, tariff, obligation, fee, post, job, role, tax, penalty, condition, function, assignment, power, expense, task, deadline, training, work, standard, ban, restriction, authority, commitment, award, liability, requirement, staff, membership, limit, pledge, right, chore, mission, care, title, capability, patrol, fine, faith, seat, levy, violation, load, salary, attitude, bonus, schedule, instruction, rank, purpose, personnel, worth, jurisdiction, presidency, exercise

The following is the entry for "duty" in the Random House Thesaurus (Stein and Flexner, 1984).

(4) duty n. 1. *obligation*, *responsibility*; onus; business, province; 2. *function*, *task*, *assignment*, charge. 3. *tax*, *tariff*, customs, excise, *levy*.

The shadowed words in (4) (marked with * above) also appear in (3).

Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 622 pairs of RNNs among the 5230 nouns that occurred at least 50 times in the parsed corpus. Table 2 shows every 10th RNN.

Some of the pairs may look peculiar. Detailed examination actually reveals that they are quite reasonable. For example, the 221st-ranked pair is "captive" and "westerner". It is very unlikely for any manually created thesaurus to consider them near-synonyms. We examined all 274 occurrences of "westerner" in the 45-million-word San Jose Mercury corpus and found that 55% of them refer to westerners in captivity. Some of the bad RNNs, such as (avalanche, raft) and (audition, rite), are due to their relatively low frequencies (they all occurred 50-60 times in the parsed corpus), which make them susceptible to accidental commonalities, such as:

(5) The {avalanche, raft} {drifted, hit} .... To {hold, attend} the {audition, rite}. An uninhibited {audition, rite}.
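The example computation above can be sketched as follows (an illustrative reimplementation, not the original program; the feature information values are those from Table 1):

```python
import math

# -log P(f) values from Table 1 (natural log; e.g. for subj-of(include),
# -log(1405/32868) = 3.15).
info = {
    "f1": 3.15, "f2": 5.43, "f3": 5.88, "f4": 4.99,
    "f5": 4.97, "f6": 7.76, "f7": 7.10, "f8": 3.70,
}

# Feature sets of "duty" and "sanction", read off the x-marks in Table 1.
duty = {"f1", "f2", "f3", "f5", "f6", "f7"}
sanction = {"f1", "f3", "f4", "f5", "f7", "f8"}

def I(feature_set):
    """Amount of information in a set of features: sum of -log P(f)."""
    return sum(info[f] for f in feature_set)

def sim(w1, w2):
    """Equation (2): shared information over total information."""
    return 2 * I(w1 & w2) / (I(w1) + I(w2))

print(round(sim(duty, sanction), 2))  # 0.66, as in the text
```

Identical feature sets give the maximum similarity 1.0, and disjoint sets give 0, matching the bounds stated above.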
Table 2: Respective Nearest Neighbors

  Rank  Respective Nearest Neighbors  Similarity
     1  earnings, profit                 0.50
    11  revenue, sale                    0.39
    21  acquisition, merger              0.34
    31  attorney, lawyer                 0.32
    41  data, information                0.30
    51  amount, number                   0.27
    61  downturn, slump                  0.26
    71  there, way                       0.24
    81  fear, worry                      0.23
    91  jacket, shirt                    0.22
   101  film, movie                      0.21
   111  felony, misdemeanor              0.21
   121  importance, significance         0.20
   131  reaction, response               0.19
   141  heroin, marijuana                0.19
   151  championship, tournament         0.18
   161  consequence, implication         0.18
   171  rape, robbery                    0.17
   181  dinner, lunch                    0.17
   191  turmoil, upheaval                0.17
   201  biggest, largest                 0.17
   211  blaze, fire                      0.16
   221  captive, westerner               0.16
   231  imprisonment, probation          0.16
   241  apparel, clothing                0.15
   251  comment, elaboration             0.15
   261  disadvantage, drawback           0.15
   271  infringement, negligence         0.15
   281  angler, fishermen                0.14
   291  emission, pollution              0.14
   301  granite, marble                  0.14
   311  gourmet, vegetarian              0.14
   321  publicist, stockbroker           0.14
   331  maternity, outpatient            0.13
   341  artillery, warplanes             0.13
   351  psychiatrist, psychologist       0.13
   361  blunder, fiasco                  0.13
   371  door, window                     0.13
   381  counseling, therapy              0.12
   391  austerity, stimulus              0.12
   401  ours, yours                      0.12
   411  procurement, zoning              0.12
   421  neither, none                    0.12
   431  briefcase, wallet                0.11
   441  audition, rite                   0.11
   451  nylon, silk                      0.11
   461  columnist, commentator           0.11
   471  avalanche, raft                  0.11
   481  herb, olive                      0.11
   491  distance, length                 0.10
   501  interruption, pause              0.10
   511  ocean, sea                       0.10
   521  flying, watching                 0.10
   531  ladder, spectrum                 0.09
   541  lotto, poker                     0.09
   551  camping, skiing                  0.09
   561  lip, mouth                       0.09
   571  mounting, reducing               0.09
   581  pill, tablet                     0.08
   591  choir, troupe                    0.08
   601  conservatism, nationalism        0.08
   611  bone, flesh                      0.07
   621  powder, spray                    0.06
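The RNN extraction described in Section 2.3 can be sketched in a few lines; the similarity table below is a toy stand-in for the corpus-derived measure, with values loosely modeled on Table 2:

```python
# Sketch of respective-nearest-neighbor (RNN) extraction: two words form
# an RNN pair iff each is the other's most similar word.
# The similarity values here are invented for illustration.

sims = {
    ("earnings", "profit"): 0.50,
    ("earnings", "revenue"): 0.30,
    ("profit", "revenue"): 0.25,
    ("revenue", "sale"): 0.39,
    ("profit", "sale"): 0.10,
    ("earnings", "sale"): 0.05,
}

def sim(a, b):
    """Symmetric lookup; unlisted pairs share no features (sim = 0)."""
    return sims.get((a, b), sims.get((b, a), 0.0))

def rnn_pairs(words):
    # Each word's single most similar word.
    nearest = {
        w: max((v for v in words if v != w), key=lambda v: sim(w, v))
        for w in words
    }
    # Keep a pair only when the relation is mutual.
    return sorted(
        {tuple(sorted((w, n))) for w, n in nearest.items() if nearest[n] == w}
    )

words = ["earnings", "profit", "revenue", "sale"]
print(rnn_pairs(words))  # [('earnings', 'profit'), ('revenue', 'sale')]
```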
3 Clustering and Pruning

The automatically retrieved similar words in (3) cover all three senses of "duty". However, unlike the thesaurus entry (4), the similar words are not divided into groups. In this section, we present algorithms for clustering the similar words and pruning the clusters.
3.1 Similarity tree

Let W = {w1, ..., wn} be a list of words in descending order of their similarity to a given word w0. The similarity tree for w0 is created as follows:

- Initialize the similarity tree to consist of a single node w0.

- For i = 1, 2, ..., n, insert wi as a child of the node wj that is the most similar one to wi among {w0, ..., w(i-1)}.

The similarity tree for "duty" is shown in Figure 1. The first number after a word is its similarity to its parent. The second number is its similarity to the root node of the tree.

The first three subtrees of "duty" in Figure 1 clearly correspond to the three senses of "duty" in (4). Our program also extracted "training" because "training" is often associated with "duty", e.g.,

(6) {combat, crowd control, leadership, lifeguard, management, National Guard} {duty, training}
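The tree-construction step above can be sketched as follows; the parent-map representation and the toy similarity values (loosely modeled on the top of Figure 1) are illustrative assumptions:

```python
# Sketch of similarity-tree construction: each word w_i is attached as a
# child of whichever already-placed word (including the root) it is most
# similar to. Toy similarity values stand in for equation (2).

toy_sims = {
    ("duty", "responsibility"): 0.13,
    ("duty", "obligation"): 0.09,
    ("duty", "commitment"): 0.06,
    ("responsibility", "obligation"): 0.08,
    ("responsibility", "commitment"): 0.05,
    ("obligation", "commitment"): 0.09,
}

def sim(a, b):
    return toy_sims.get((a, b), toy_sims.get((b, a), 0.0))

def build_tree(root, words, sim):
    """words must be sorted by descending similarity to root.
    Returns a child -> parent mapping."""
    parent = {}
    placed = [root]
    for w in words:
        parent[w] = max(placed, key=lambda p: sim(p, w))
        placed.append(w)
    return parent

parent = build_tree("duty", ["responsibility", "obligation", "commitment"], sim)
print(parent)
# {'responsibility': 'duty', 'obligation': 'duty', 'commitment': 'obligation'}
```

With these values, "commitment" attaches under "obligation" (similarity 0.09 to its parent, 0.06 to the root), mirroring the parent/root number pairs shown in Figure 1.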
Although "duty" and "tariff" are synonyms, our similarity measure ranks "sanction" as more similar to "duty" and to "tariff" than they are to each other. This is because the similarity being measured is the similarity between words instead of word senses. The existence of other senses of "duty" reduces its similarity to "tariff".
Figure 1: Similarity tree for "duty" (tree indentation lost in extraction; the first number after each word is its similarity to its parent, the second its similarity to the root):

duty 1 1: responsibility 0.13 0.13; obligation 0.09 0.09; commitment 0.09 0.06; position 0.10 0.10; post 0.20 0.08; title 0.10 0.05; seat 0.12 0.05; presidency 0.08 0.04; job 0.17 0.08; assignment 0.10 0.07; award 0.08 0.06; work 0.14 0.07; patrol 0.06 0.05; role 0.17 0.08; power 0.13 0.07; authority 0.18 0.06; staff 0.10 0.06; personnel 0.12 0.04; jurisdiction 0.11 0.04; freedom 0.10 0.06; right 0.10 0.05; capability 0.09 0.05; faith 0.08 0.05; attitude 0.09 0.05; condition 0.10 0.07; standard 0.12 0.06; function 0.10 0.07; task 0.08 0.07; chore 0.10 0.05; purpose 0.09 0.05; pledge 0.08 0.06; mission 0.08 0.05; rank 0.09 0.05; sanction 0.10 0.10; tariff 0.10 0.09; fee 0.13 0.09; penalty 0.15 0.08; fine 0.14 0.05; violation 0.09 0.05; expense 0.15 0.07; membership 0.09 0.06; load 0.11 0.05; salary 0.14 0.05; bonus 0.12 0.05; tax 0.19 0.08; liability 0.12 0.06; worth 0.07 0.04; levy 0.13 0.05; deadline 0.08 0.07; ban 0.15 0.06; restriction 0.17 0.06; requirement 0.15 0.06; schedule 0.08 0.05; limit 0.18 0.06; training 0.07 0.07; care 0.09 0.05; instruction 0.07 0.05; exercise 0.08 0.04
3.2 Meaning shift

An interesting observation one could make from Figure 1 is that there are sometimes meaning shifts along directed paths. Consider the path duty→sanction→ban. Both "duty" and "ban" are quite similar to "sanction" because they may both be a form of sanction. The commonalities between "duty" and "ban" are mostly due to their commonality with "sanction". For example,

(7) The {ban, duty, sanction} {affected, became, forced, resulted, took effect} ... To {begin, breach, continue, impose, lead to, live, overturn, recommend, undermine, violate} a {ban, duty, sanction}

Along the path duty→sanction→ban, the meaning shifts from one form of sanction to another.

Other examples of meaning shifts in Figure 1 include:

position→function→purpose. The words "position" and "function" are similar because there is usually a functionality associated with a position/post. The words "function" and "purpose" are similar because a functionality usually serves certain purposes.

duty→position→post→title. The meaning shifts from the responsibility associated with a position/post to its name.
3.3 Pruning the similarity tree

Previous approaches for finding similar words usually use a threshold on similarity, or on similarity rank, to select a subset of words from an ordered list of similar words. We propose a different approach that is based on the detection of meaning shifts along the directed paths of similarity trees. That is, a subtree is pruned if a meaning shift is detected.

Consider a path A→B→C in a similarity tree, where the arcs are in the direction from root to leaf, and let d_xy = 1/sim(x,y) - 1 be the dissimilarity between x and y. The fact that C is a child of B implies that d_AC > d_AB and d_AC > d_BC. When d_AC is much greater than d_AB and d_BC, the similarities between A and C are often related to two different senses of B, i.e., there is a meaning shift along the path A→B→C. Therefore, we used the following method to prune the similarity tree:

(8) Let A be the root node of a similarity tree. For any node C in the tree, the subtree rooted at C is deleted if there exists an ancestor B of C such that d_AC^2 > d_AB^2 + d_BC^2 + d_AB x d_BC.

The right-hand side of the above inequality is the average of (d_AB + d_BC)^2 and d_AB^2 + d_BC^2. If d_AC^2 >= (d_AB + d_BC)^2, the dissimilarities between A, B and C violate the triangle inequality that must be satisfied by a distance metric. If A, B and C are treated as points in Euclidean space, the vectors AB and BC are orthogonal when d_AC^2 = d_AB^2 + d_BC^2.

3.4 Sample results

Figure 2 shows the pruned similarity trees for 4 nouns (duty, marriage, score, suit), a verb (attack), an adjective (powerful) and an adverb (openly). We now compare these similarity trees with the corresponding entries in WordNet (Miller et al., 1990). For the purpose of this section, we say two words are WordNet synonyms if they are within 3 hyponym or hypernym links or 1 synonym link from each other in WordNet (version 1.5).

duty. The word "duty" has three senses in WordNet1.5, which correspond to the first 3 subtrees of "duty" in Figure 2. Compared with Figure 1, the pruned similarity tree for "duty" has a much higher percentage of closely related words.

marriage. Our program ranked "relationship" as the most similar word to "marriage". In WordNet1.5, they are not particularly close: they are connected only through "state", with "relationship" under "state", and "marriage" under "marital status", which is in turn under "state". Our program also discovered two aspects of "marriage" that are missed by WordNet1.5. The first is the process aspect of "marriage", which has a time span and can be "ruined", "wrecked" or "saved". The second is that "marriage" is a deal/accord. The words "matrimony" and "wedlock" were not found in the similarity tree because they did not occur frequently enough (50 times) to be included in the similarity computation.

score. The word "score" has 11 senses in WordNet1.5. The similarity tree for "score" captured 5 of them (including the first four senses): mark/grade, number/abstraction, musical score, a set of 20 things, and success in games. The similarities between "score" and "star", as well as "score" and "pay", were boosted by several accidental common features.

suit. Our program identified both the "lawsuit" and "clothing" senses of "suit". WordNet1.5 also contains a "playing card" sense. The word "jacket" is the 36th most similar word to "suit". The similarity tree in Figure 2 obviously boosted the prominence of "jacket" among the similar words to "suit". The word "plaintiff" is not close to any of the senses of "suit" in WordNet1.5. However, it is close to "suit" in the similarity tree because "plaintiff" and "suit" share many common features, such as: subj-of(accuse), subj-of(allege), subj-of(ask for), subj-of(assert), subj-of(claim), subj-of(contend), subj-of(demand), and subj-of(seek).

attack. The three subtrees of "attack" correspond to physical, verbal and military attack, respectively. The word "visit" is similar to "attack" because one can only attack persons or places one can visit.
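The pruning test in (8) can be sketched directly, using the dissimilarity transform d = 1/sim - 1; the similarity values below are invented for illustration:

```python
def dissim(s):
    """Convert a similarity in (0, 1] to a dissimilarity d = 1/s - 1."""
    return 1.0 / s - 1.0

def shifted(sim_ab, sim_bc, sim_ac):
    """Pruning test (8): True if the subtree rooted at C should be
    deleted, i.e. d_AC^2 > d_AB^2 + d_BC^2 + d_AB * d_BC."""
    d_ab, d_bc, d_ac = dissim(sim_ab), dissim(sim_bc), dissim(sim_ac)
    return d_ac**2 > d_ab**2 + d_bc**2 + d_ab * d_bc

# Toy values: when A and C are barely similar relative to the A-B and
# B-C links, a meaning shift is detected and the subtree is pruned.
print(shifted(0.5, 0.5, 0.3))   # True  (d_AC^2 = 5.44 > 3.0)
print(shifted(0.5, 0.5, 0.45))  # False (d_AC^2 = 1.49 < 3.0)
```

Here d_AB = d_BC = 1, so the right-hand side of the inequality is 3.0, the average of (1+1)^2 = 4 and 1^2 + 1^2 = 2, matching the explanation above.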
Figure 2: Clusters of similar words (pruned similarity trees; tree indentation lost in extraction). The first number after a word is its similarity to its parent; the second number is the similarity between the word and the root of the tree. The shadowed words in the original figure are WordNet synonyms.

duty 1 1: responsibility 0.13 0.13; obligation 0.09 0.09; commitment 0.09 0.06; position 0.10 0.10; post 0.20 0.08; job 0.17 0.08; assignment 0.10 0.07; work 0.14 0.07; role 0.17 0.08; power 0.13 0.07; condition 0.10 0.07; standard 0.12 0.06; function 0.10 0.07; pledge 0.08 0.06; mission 0.08 0.05; sanction 0.10 0.10; tariff 0.10 0.09; tax 0.19 0.08; deadline 0.08 0.07; training 0.07 0.07; care 0.09 0.05; instruction 0.07 0.05; exercise 0.08 0.04

marriage 1 1: relationship 0.11 0.11; career 0.10 0.10; stint 0.11 0.07; reign 0.13 0.07; sentence 0.08 0.06; life 0.14 0.07; struggle 0.08 0.06; friendship 0.07 0.05; retirement 0.07 0.07; deal 0.08 0.08; accord 0.15 0.07; release 0.09 0.07; alliance 0.08 0.06; truce 0.11 0.06; experiment 0.09 0.06; conversion 0.08 0.05; concept 0.07 0.05; combination 0.08 0.05; wedding 0.07 0.07; couple 0.06 0.06; brother 0.07 0.04; sex 0.05 0.05; dream 0.05 0.05

suit 1 1: lawsuit 0.25 0.25; litigation 0.18 0.15; complaint 0.14 0.14; case 0.22 0.22; action 0.19 0.18; proposal 0.22 0.13; petition 0.14 0.09; award 0.10 0.08; claim 0.18 0.17; charge 0.20 0.15; indictment 0.20 0.13; allegation 0.19 0.13; plaintiff 0.09 0.09; jacket 0.08 0.08; shirt 0.22 0.08; sock 0.14 0.07; pant 0.14 0.06; uniform 0.08 0.06; tie 0.06 0.06; shoe 0.11 0.05

attack 1 1: kill 0.12 0.12; strike 0.10 0.09; hit 0.15 0.08; catch 0.09 0.06; fire 0.09 0.06; seize 0.06 0.06; visit 0.07 0.05; oppose 0.09 0.09; support 0.29 0.08; fight 0.15 0.08; threaten 0.10 0.05; defend 0.12 0.08; reject 0.23 0.08; denounce 0.12 0.07; express 0.12 0.05; accuse 0.08 0.08; criticize 0.13 0.07; challenge 0.14 0.07; bomb 0.07 0.07; blast 0.10 0.05

score 1 1: touchdown 0.07 0.07; pass 0.17 0.07; total 0.06 0.06; ratio 0.11 0.06; count 0.08 0.05; mark 0.06 0.04; song 0.05 0.05; piece 0.07 0.04; sound 0.06 0.04; text 0.05 0.05; file 0.06 0.03; hundred 0.05 0.05; million 0.12 0.04; grade 0.05 0.05; star 0.04 0.04; pay 0.03 0.03

powerful 1 1: strong 0.16 0.16; important 0.16 0.14; political 0.17 0.12; local 0.14 0.11; popular 0.11 0.09; formidable 0.10 0.08; sophisticated 0.11 0.11; influential 0.10 0.10; prominent 0.16 0.08; famous 0.07 0.05

openly 1 1: publicly 0.11 0.11; privately 0.15 0.11; freely 0.09 0.07; repeatedly 0.10 0.08; widely 0.11 0.07; generally 0.12 0.05; often 0.30 0.05; loudly 0.07 0.06; strongly 0.10 0.10; explicitly 0.09 0.09; readily 0.07 0.05; fiercely 0.05 0.05; intensely 0.12 0.05; frankly 0.03 0.03
powerful. The following are WordNet synonyms of "powerful" that appeared at least 50 times in our parsed corpus (we only computed the similarities among words that occurred at least 50 times):

(9) strong_1, influential_8, potent_27, high-powered_140, compelling_263, vigorous_480, mighty_675, hefty_1953

The subscripts of the words are their rankings in similarity to "powerful" with our measure. The word "sophisticated" is very similar to "powerful" when they are used to describe artifacts such as "programs", "machines", and "systems". This relationship between the two words is not captured by WordNet1.5.

openly. Our program found many closely related words. Surprisingly, none of these words is a synonym or an antonym of "openly" in WordNet1.5. This clearly demonstrates that manual construction of lexical resources is a tremendously difficult task and that automatically extracted similar words can be of great help.

4 Related Work and Conclusion

There have been many approaches to the automatic detection of similar words from text corpora. Ours is similar to (Grefenstette, 1994; Hindle, 1990; Ruge, 1992) in the use of dependency relationships as the word features, based on which word similarities are computed. The difference is that we use a full parser instead of a partial parser and that we adopt a different similarity measure.

The problem of automatic recognition of verbal polysemy is investigated by (Fukumoto and Tsujii, 1994). An overlapping clustering algorithm based on (Jardine and Sibson, 1968) is employed to cluster 56 verbs (10 of which are polysemous) and achieved 69.2% accuracy in grouping.

The main contribution of this paper is that we proposed methods for clustering the retrieved similar words and for pruning the clusters. Our clustering algorithm is extremely simple, yet appears to be very effective in grouping words that are similar to different senses of the given word. Our pruning algorithm is based on the detection of meaning shifts along the directed paths in similarity trees. In contrast, previous approaches usually rely on an arbitrary threshold to select similar words from an ordered list.

References

Hiyan Alshawi and David Carter. 1994. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4):635-648, December.

Fumiyo Fukumoto and Jun'ichi Tsujii. 1994. Automatic recognition of verbal polysemy. In Proceedings of COLING-94, pages 762-768, Kyoto, Japan.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, MA.

Ralph Grishman and John Sterling. 1994. Generalizing automatically generated selectional patterns. In Proceedings of COLING-94, pages 742-747, Kyoto, Japan.

Donald Hindle. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL-90, pages 268-275, Pittsburgh, Pennsylvania, June.

Paul Jacob. 1991. Making sense of lexical acquisition. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 29-44. Lawrence Erlbaum Associates, Publishers.

N. Jardine and R. Sibson. 1968. The construction of hierarchic and non-hierarchic classifications. Computer Journal, pages 177-184.

Dekang Lin. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL/EACL-97, pages 64-71, Madrid, Spain, July.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-244.

Gerda Ruge. 1992. Experiments on linguistically based term associations. Information Processing & Management, 28(3):317-332.

Jess Stein and Stuart Berg Flexner, editors. 1984. Random House College Thesaurus. Random House, New York.