Paper Id: 573 Automatic Retrieval and Clustering of Similar Words Dekang Lin Department of Computer Science University of Manitoba Winnipeg, Manitoba, Canada R3T 2N2 [email protected] January 28, 1998 Abstract Bootstrapping semantics from text is one of the greatest challenges in natural language learning. Earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. We present an approach that goes a step further by obtaining a tree structure among the most similar words so that dierent senses of a given word can be identied with dierent subtrees. Submission Type: paper Topic Areas: R2: Lexical Resources Author of Record: Dekang Lin Under consideration for other conferences (specify)? none Paper Id: 573 Automatic Retrieval and Clustering of Similar Words Abstract Bootstrapping semantics from text is one of the greatest challenges in natural language learning. Earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. We present an approach that goes a step further by obtaining a tree structure among the most similar words so that dierent senses of a given word can be identied with dierent subtrees. 1 Introduction results and compare them with the corresponding entries in the WordNet (Miller et al., 1990). Finally, in Section 4, we briey review related work and summarize our contributions. Bootstrapping semantics from syntax is one of the greatest challenges in natural language learning. Earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words (Grefenstette, 1994; Hindle, 1990; Ruge, 1992). We present a method that goes a step further by creating a tree structure among the most similar words so that so that dierent senses of a given word can be identied with dierent subtrees. The main advantage of automatically retrieved similar words over manually constructed general-purpose dictionaries and thesauri is that automatically extracted similar words to a given word are related to the meanings of the given word in the corpus. One of the biggest problems for using general-purpose lexical resources in a particular application is that they contain many senses of words that are never used in the application domain (Jacob, 1991). Furthermore, automatically extracted similar words can provide valuable help in compilation of dictionaries and thesauri which is a tremendously dicult task. By comparison with the WordNet (Miller et al., 1990), we domonstrate that our program is able to identify common usages of words that have been overlooked by lexicographers. The next section is concerned with computing similarities between words according to the collocations of the words. In Section 3, we dene the notion of similarity tree which organizes similar words of a given word in a tree structure and then present an algorithm for pruning the similarity tree. We will also discuss some sample 2 Word Similarity 2.1 Dependency triples Similar to (Alshawi and Carter, 1994; Grishman and Sterling, 1994; Ruge, 1992), we use a parser to extract dependency triples from the text corpus. A dependency triple consists of a head, a dependency type and a modier. For example, the triples extracted from the sentence \I have a brown dog" are: (1) (have subj I), (have obj dog), (dog adj-mod brown), (dog det a) Our text corpus consists of 55-million-word Wall Street Journal and 45-million-word San Jose Mercury. Two steps are taken to reduce the number of errors in the parsed corpus. Firstly, only sentences with no more than 25 words are fed into the parser. Secondly, only complete parses are included in the parsed corpus. The 100-million-word text corpus is parsed in about 72 hours on a Pentium 200 with 80MB memory. There are about 22 million words in the parse trees. 2.2 Similarity measure We can view dependency triples extracted from the corpus as features of the words that participate in them. Suppose (avert obj duty) is a dependency triple extracted from corpus, we say that \duty" has the feature obj-of(avert) and \avert" has the feature obj(duty). Other words 1 by \duciary". The amount of information in the feature adj-mod(duciary) is greater than the amount of information in subj-of(include). This agrees with our intuition that saying that a word can be modied by \duciary" is more informative than saying that the word can be the subject of \include". The ? log ( ) column in Table 1 shows the amount of information contained in each feaTable 1: Features of \duty" and \sanction" ture. If features in Table 1 were all the features Feature duty sanction ? log P (f ) of \duty" and \sanction", their similarity would f1 : subj-of(include) x x 3.15 be: I (ff1 ;f2 ;f3 ;f52;f6I;f(f7fg1)+;f3I;f(f5f;f1 ;f7 g3);f4 ;f5 ;f7 ;f8 g) =0.66. f2 : obj-of(assume) x 5.43 that also have the feature obj-of(avert) include \default", \crisis", \eye", \panic", \strike", \war", etc. Table 1 shows a subset of the features of \duty" and \sanction". Each row corresponds to a feature. A `x' in the \duty" or \sanction" column means that the word has that feature. P f f3 : f4 : f5 : f6 : f7 : f8 : obj-of(avert) obj-of(ease) obj-of(impose) adj-mod(duciary) adj-mod(punitive) adj-mod(economic) x x x x x x x 5.88 4.99 4.97 7.76 7.10 3.70 x x 2.3 Sample results The following are words with similarity to \duty" greater than 0.4: (3) responsibility, position, sanction, tari, obligation, fee, post, job, role, tax, penalty, condition, function, assignment, power, expense, task, deadline, training, work, standard, ban, restriction, authority, commitment, award, liability, requirement, sta, membership, limit, pledge, right, chore, mission, care, title, capability, patrol, ne, faith, seat, levy, violation, load, salary, attitude, bonus, schedule, instruction, rank, purpose, personnel, worth, jurisdiction, presidency, exercise The following is the entry for \duty" in the Random House Thesaurus (Stein and Flexner, 1984). (4) duty n. 1. obligation , responsibility ; onus; business, province; 2. function , task , assignment , charge. 3. tax , tari , customs, excise, levy . The shadowed words in (4) also appear in (3). Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 622 pairs of RNNs among the 5230 nouns that occurred at least 50 times in the parsed corpus. Table 2 shows every 10th RNN. Some of the pairs may look peculiar. Detailed examination actually reveals that they are quite reasonable. For example, the 221 ranked pair is The similarity between two words can be computed according to their features. Our similarity measure is based on a proposal in (Lin, 1997), where the similarity between two objects is dened to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects. Let ( ) be the set of features possessed by and ( ) be the amount of information contained in a set of features . We dene the similarity between two words as follows: w1 )\F (w2 )) (2) sim( 1 2 ) = I2(FI((wF1())+ I (F (w2 )) Assuming that features P are independent of one another, ( ) = ? f 2S log ( ), where ( ) is the probability of feature . When two words have identical sets of features, their similarity reaches the maximum value 1.0. The minimum similarity 0 is reached when two words do not have any common feature. The probability ( ) is estimated by the percentage of words that have feature among the set of words that have the same part of speech. For example, there are 32868 unique nouns in the parsed corpus, 1405 of which were used as the subject of \include". The proba1405 . The probability of subj-of(include) is 32868 14 bility of the feature adj-mod(duciary) is 32868 because only 14 (unique) nouns were modied F w w I S S w ;w P f I S P f f P f f 2 \captive" and \westerner". It is very unlikely for any manually created thesaurus to consider them as near-synonyms. We examined all 274 occurrences of \westerner" in a 45-million-word San Jose Mercury corpus and found that 55% of them refer to westerners in captivity. Some of the bad RNNs, such as (avalanche, raft), (audition, rite), are due to their relative low frequencies,1 which make them susceptible to accidental commonalities, such as: (5) The favavalanche, raftg fdrifted, hitg .... To fhold, attendg the faudition, riteg. An uninhibited faudition, riteg. Table 2: Respective Nearest Neighbors Rank 1 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 161 171 181 191 201 211 221 231 241 251 261 271 281 291 301 311 321 331 341 351 361 371 381 391 401 411 421 431 441 451 461 471 481 491 501 511 521 531 541 551 561 571 581 591 601 611 621 Respective Nearest Neighbors earnings prot revenue sale acquisition merger attorney lawyer data information amount number downturn slump there way fear worry jacket shirt lm movie felony misdemeanor importance signicance reaction response heroin marijuana championship tournament consequence implication rape robbery dinner lunch turmoil upheaval biggest largest blaze re captive westerner imprisonment probation apparel clothing comment elaboration disadvantage drawback infringement negligence angler shermen emission pollution granite marble gourmet vegetarian publicist stockbroker maternity outpatient artillery warplanes psychiatrist psychologist blunder asco door window counseling therapy austerity stimulus ours yours procurement zoning neither none briefcase wallet audition rite nylon silk columnist commentator avalanche raft herb olive distance length interruption pause ocean sea ying watching ladder spectrum lotto poker camping skiing lip mouth mounting reducing pill tablet choir troupe conservatism nationalism bone esh powder spray Similarity 0.50 0.39 0.34 0.32 0.30 0.27 0.26 0.24 0.23 0.22 0.21 0.21 0.20 0.19 0.19 0.18 0.18 0.17 0.17 0.17 0.17 0.16 0.16 0.16 0.15 0.15 0.15 0.15 0.14 0.14 0.14 0.14 0.14 0.13 0.13 0.13 0.13 0.13 0.12 0.12 0.12 0.12 0.12 0.11 0.11 0.11 0.11 0.11 0.11 0.10 0.10 0.10 0.10 0.09 0.09 0.09 0.09 0.09 0.08 0.08 0.08 0.07 0.06 3 Clustering and Pruning The automatically retrieved similar words in (3) covers all three senses of \duty". However, unlike the thesaurus entry (4), the similar words are not divided into groups. In this section, we present algorithms for clustering the similar words and pruning the clusters. 3.1 Similarity tree Let = f n g be a list of words in deW w1 ; : : : ; w scending order of their similarity to a given word 0 . The similarity tree for 0 is created as follows: Initialize the similarity tree to consist of a single node 0 . For =1, 2, , , insert i 2 as a child of j such that j is the most similar one to i among f 0 , , i?1 g. The similarity tree for \duty" is shown in Figure 1. The rst number after a word is its similarity to its parent. The second number is its similarity to the root node of the tree. The rst three subtrees of \duty" in Figure 1 clearly correspond to the three senses of \duty" in (4). Our program also extracted \training" because \training" is often associated with \duty", e.g., (6) fcombat, crowd control, leadership, lifeguard, management, National Guardg fduty, trainingg w w w i ::: n w w 1 3 w W w w ::: w They all occurred 50{60 times in the parsed corpus. Although \duty" and \tari" are synonyms, our similarity measure ranks \sanction" to be more similar to \duty" and \tari" than they are to each other. This is because the similarity being measured is the similarity between words instead of word senses. The existance of other senses of \duty" reduces its similarity to \tari". duty 1 1 responsibility 0.13 0.13 obligation 0.09 0.09 commitment 0.09 0.06 position 0.10 0.10 post 0.20 0.08 title 0.10 0.05 seat 0.12 0.05 presidency 0.08 0.04 job 0.17 0.08 assignment 0.10 0.07 award 0.08 0.06 work 0.14 0.07 patrol 0.06 0.05 role 0.17 0.08 power 0.13 0.07 authority 0.18 0.06 sta 0.10 0.06 personnel 0.12 0.04 jurisdiction 0.11 0.04 freedom 0.10 0.06 right 0.10 0.05 capability 0.09 0.05 faith 0.08 0.05 attitude 0.09 0.05 condition 0.10 0.07 standard 0.12 0.06 function 0.10 0.07 task 0.08 0.07 chore 0.10 0.05 purpose 0.09 0.05 pledge 0.08 0.06 mission 0.08 0.05 rank 0.09 0.05 sanction 0.10 0.10 tari 0.10 0.09 fee 0.13 0.09 penalty 0.15 0.08 ne 0.14 0.05 violation 0.09 0.05 expense 0.15 0.07 membership 0.09 0.06 load 0.11 0.05 salary 0.14 0.05 bonus 0.12 0.05 tax 0.19 0.08 liability 0.12 0.06 worth 0.07 0.04 levy 0.13 0.05 deadline 0.08 0.07 ban 0.15 0.06 restriction 0.17 0.06 requirement 0.15 0.06 schedule 0.08 0.05 limit 0.18 0.06 training 0.07 0.07 care 0.09 0.05 instruction 0.07 0.05 exercise 0.08 0.04 3.2 Meaning shift An interesting observation one could make from Figure 1 is that there are sometimes meaning shifts along directed paths. Consider the path duty!sanction!ban. Both \duty" and \ban" are quite similar to \sanction" because they may both be a form of sanction. The commonalities between \duty" and \ban" are mostly due to their commonality with \sanction". For example, (7) The fban, duty, sanctiong faected, became, forced, resulted, took eectg ... To fbegin, breach, continue, impose, lead to, live, overturn, recommend, undermine, violateg a fban, duty, sanctiong Along the path duty!sanction!ban, the meaning shifts from one form of sanction to another. Other examples of meaning shifts in Figure 1 include: position!function!purpose. The words \position" and \function" are similar because there is usually a functionality associated with a position/post. The words \function" and \purpose" are similar because a functionality usually serves certain purposes. duty!position!post!title. The meaning shifts from the responsibility associated with a position/post to its name. 3.3 Pruning the similarity tree Previous approaches for nding similar words usually use a threshold for similarity or ranking of similarity to select a subset of words from an ordered list of similar words. We propose a dierent approach that is based on the detection Figure 1: Similarity tree for \duty" 4 of meaning shifts along the directed paths of similarity trees. That is a subtree is pruned if a meaning shift is detected. Consider the following path in a similarity tree: A Compared with Figure 1, the pruned similarity tree for \duty" has much higher percentage of closely related words. marriage. Our program ranked \relationship" as the most similar word to \marriage". In WordNet1.5, they are not particularly close: d AC d AB state B dBC C relationship where the arcs are in the direction from root to leaf, and xy = sim1(x;y) ? 1 is the dissimilarity between and . The fact that C is a child of B implies that AC AB and AC BC . When AC is much greater than AB and BC , the similarities between A and C are often related to two dierent senses of B, i.e., there is a meaning shift along the path A!B!C. Therefore, we used the following method to prune the similarity tree: (8) Let A be the root node of a similarity tree. For any node C in tree, the subtree rooted at C is deleted if there exists an ancestor B of C such that 2 2 2 AC AB + BC + AB BC . The right-hand side of the above inequality is the average of ( AB + BC )2 and 2AB + 2BC . If 2AC ( AB + BC )2 , the dissimilarities between A, B and C violate the triangular inequality that must be satised by a distance metric. If A, B and C are treated as points in Euclidean space, the vectors AB and BC are orthogonal when 2AC = 2AB + 2BC . d y > d d > d d d d d d d > d d d d > d d d d d d d marital status marriage Our program also discovered two aspects of \marriage" that are missed by WordNet1.5. The rst is the process aspect of \marriage" which has a time span and can be \ruined", \wrecked" or \saved". The second is that \marriage" is a deal/accord. The words \matrimony" and \wedlock" were not found in the similarity tree because they did not occur frequently enough (50) to be included in similarity computation. score. The word \score" has 11 senses in WordNet1.5. The similarity tree for \score" captured 5 of them (including the rst four senses): mark/grade, number/abstraction, musical score, a set of 20 things, and success in games. The similarities between \score" and \star", as well as \score" and \pay" were boosted by several accidental common features. suit. Our program identied both \lawsuit" and \clothing" senses of \suit". WordNet1.5 also contains a \playing card" sense. The word \jacket" is the 36th most similar word to \suit". The similarity tree in Figure 2 obviously boosted the prominence of \jacket" among the similar words to \suit". The word \plainti" is not close to any one of the senses of \suit" in WordNet1.5. However, it is close to \suit" in the similarity tree because \plainti" and \suit" share many common features, such as: subj-of(accuse), subj-of(allege), subj-of(ask for), subj-of(assert), subj-of(claim), subj-of(contend), subj-of(demand), and subjof(seek). attack. The three subtrees of \attack" correspond to physical, verbal and military attack, respectively. The word \visit" is similar to at- d x status d 3.4 Sample results Figure 2 shows the pruned similarity trees for 4 nouns (duty, marriage, score, suit), a verb (attack), an adjective (powerful) and an adverb (openly). We now compare these similarity trees with the corresponding entries in WordNet (Miller et al., 1990). For the purpose of this section, we say two words are WordNet synonyms if they are within 3 hyponym or hypernym links or 1 synonym link from each other in WordNet (version 1.5). duty. The word \duty" has three senses in WordNet1.5, which corresponds to the rst 3 subtrees of \duty" in Figure 2. 5 duty 1 1 responsibility 0.13 0.13 obligation 0.09 0.09 commitment 0.09 0.06 position 0.10 0.10 post 0.20 0.08 job 0.17 0.08 assignment 0.10 0.07 work 0.14 0.07 role 0.17 0.08 power 0.13 0.07 condition 0.10 0.07 standard 0.12 0.06 function 0.10 0.07 pledge 0.08 0.06 mission 0.08 0.05 sanction 0.10 0.10 tari 0.10 0.09 tax 0.19 0.08 deadline 0.08 0.07 training 0.07 0.07 care 0.09 0.05 instruction 0.07 0.05 exercise 0.08 0.04 marriage 1 1 relationship 0.11 0.11 career 0.10 0.10 stint 0.11 0.07 reign 0.13 0.07 sentence 0.08 0.06 life 0.14 0.07 struggle 0.08 0.06 friendship 0.07 0.05 retirement 0.07 0.07 deal 0.08 0.08 accord 0.15 0.07 release 0.09 0.07 alliance 0.08 0.06 truce 0.11 0.06 experiment 0.09 0.06 conversion 0.08 0.05 concept 0.07 0.05 combination 0.08 0.05 wedding 0.07 0.07 couple 0.06 0.06 brother 0.07 0.04 sex 0.05 0.05 dream 0.05 0.05 suit 1 1 lawsuit 0.25 0.25 litigation 0.18 0.15 complaint 0.14 0.14 case 0.22 0.22 action 0.19 0.18 proposal 0.22 0.13 petition 0.14 0.09 award 0.10 0.08 claim 0.18 0.17 charge 0.20 0.15 indictment 0.20 0.13 allegation 0.19 0.13 plainti 0.09 0.09 jacket 0.08 0.08 shirt 0.22 0.08 sock 0.14 0.07 pant 0.14 0.06 uniform 0.08 0.06 tie 0.06 0.06 shoe 0.11 0.05 attack 1 1 kill 0.12 0.12 strike 0.10 0.09 hit 0.15 0.08 catch 0.09 0.06 re 0.09 0.06 seize 0.06 0.06 visit 0.07 0.05 oppose 0.09 0.09 support 0.29 0.08 ght 0.15 0.08 threaten 0.10 0.05 defend 0.12 0.08 reject 0.23 0.08 denounce 0.12 0.07 express 0.12 0.05 accuse 0.08 0.08 criticize 0.13 0.07 challenge 0.14 0.07 bomb 0.07 0.07 blast 0.10 0.05 score 1 1 touchdown 0.07 0.07 pass 0.17 0.07 total 0.06 0.06 ratio 0.11 0.06 count 0.08 0.05 mark 0.06 0.04 song 0.05 0.05 piece 0.07 0.04 sound 0.06 0.04 text 0.05 0.05 le 0.06 0.03 hundred 0.05 0.05 million 0.12 0.04 grade 0.05 0.05 star 0.04 0.04 pay 0.03 0.03 powerful 1 1 strong 0.16 0.16 important 0.16 0.14 political 0.17 0.12 local 0.14 0.11 popular 0.11 0.09 formidable 0.10 0.08 sophisticated 0.11 0.11 inuential 0.10 0.10 prominent 0.16 0.08 famous 0.07 0.05 openly 1 1 publicly 0.11 0.11 privately 0.15 0.11 freely 0.09 0.07 repeatedly 0.10 0.08 widely 0.11 0.07 generally 0.12 0.05 often 0.30 0.05 loudly 0.07 0.06 strongly 0.10 0.10 explicitly 0.09 0.09 readily 0.07 0.05 ercely 0.05 0.05 intensely 0.12 0.05 frankly 0.03 0.03 * The rst number after a word is its similarity to its parent. The second number is the similarity between the word and the root of the tree. The shadowed words are WordNet synonyms. Figure 2: Clusters of similar words 6 clustering algorithm is extremely simple, yet appears to be very eective in grouping words that are similar to dierent senses of the given word. Our pruning algorithm is based on the detection of meaning shifts along the directed paths in similarity trees. In contrast, previous approaches usually relies on an arbitrary threshold to select similar words from an ordered list. tack because one can only attack persons or places one can visit. powerful. The following are WordNet synonyms of \powerful" that appeared at least 50 times2 in our parsed corpus. (9) strong1 , inuential8, potent27 , high-powered140 , compelling263, vigorous480 , mighty675 , hefty1953 The subscripts of the words are their rankings in similarity to \powerful" with our measure. The word \sophisticated" is very similar to \powerful" when they are used to describe artifacts such as \programs", \machines", and \systems". This relationship between the two words is not captured by WordNet1.5. openly. Our program found many closely related words. Surprisingly, none of these words is a synonym or an antonym of \openly" in WordNet1.5. This clearly demonstrates that manual construction of lexical resources is a tremendously dicult task and automatically extracted similar words can be of great help. References Hiyan Alshawi and David Carter. 1994. Training and scaling preference functions for disambiguation. Computational Linguistics, 20(4):635{648, December. Fumiyo Fukumoto and Jun'ichi Tsujii. 1994. Automatic recognition of verbal polysemy. In Proceedings of COLING-94, pages 762{768, Kyoto, Japan. Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, MA. Ralph Grishman and John Sterling. 1994. Generalizing automatically generated selectional patterns. In Proceedings of COLING-94, pages 742{747, Kyoto, Japan. Donald Hindle. 1990. Noun classication from predicate-argument structures. In Proceedings of ACL-90, pages 268{275, Pittsburg, Pennsylvania, June. Paul Jacob. 1991. making sense of lexical acquisition. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 29{44. Lawence Erlbaum Associates, Publishers. N. Jardine and R. Sibson. 1968. The construction of hierarchie and non-hierarchic classications. Computer Journal, pages 177{184. Dekang Lin. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL/EACL-97, pages 64{71, Madrid, Spain, July. George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235{244. Gerda Ruge. 1992. Experiments on linguistically based term associations. Information Processing & Management, 28(3):317{332. Jess Stein and Stuart Berg Flexner, editors. 1984. Random House College Thesaurus. Random House, New York. 4 Related Work and Conclusion There have been many approaches to automatic detection of similar words from text corpora. Ours is similar to (Grefenstette, 1994; Hindle, 1990; Ruge, 1992) in the use of dependency relationship as the word features, based on which word similarities are computed. The dierence is that we use a full parser instead of a partial parser and that we adopt a dierent similarity measure. The problem of automatic recognition of verbal polysemy is investigated by (Fukumoto and Tsujii, 1994). An overlapping clustering algorithm based on (Jardine and Sibson, 1968) is employed to cluster 56 verbs (10 of which are polysemous) and achieved 69.2% accuracy in grouping. The main contribution of this paper is that we proposed methods for clustering the retrieved similar words and for pruning the clusters. Our We only computed the similarities among words that occurred at least 50 times 2 7
© Copyright 2025 Paperzz