LING115: Final Project

1. Goal

The goal of the final project is to automatically build a thesaurus, as Lin did in his paper (Lin, 1998). More specifically, here’s what you should do:

For each adjective in /ling115/sample_data/adjectives.1000, identify the ten most similar adjectives within the same list. Use the formulas in section 2 in conjunction with the dependency triples in /home/ling115/sample_data/deps. The formulas may look different from the ones in Lin’s paper, but they yield the same results, so rest assured. Each line in the deps file specifies how often a dependency triple appears in a corpus.

2. Formulas

How does Lin know if two words are similar? Roughly speaking, two words are similar if there is a lot of overlap in the “meaningful” contexts in which they appear. There are two issues here:

(1) What is a “meaningful” context?
(2) How do you measure the amount of overlap?

2.1. Meaningful context

A context in which a word appears is characterized by a dependency triple that begins with the word. For example, in “People drink coffee”, one context of “coffee” is (coffee,is-obj-of,drink), which means “coffee” is the object of “drink”. A context, i.e. a dependency triple, is meaningful if it appears more often than expected. We compare the following two probabilities to see if a dependency triple (a,r,b) is meaningful:

(3) Pobs(a,r,b)
(4) Pexp(a,r,b)

Here is how the two probabilities are calculated.

\[ P_{obs}(a,r,b) = \frac{\|a,r,b\|}{\|*,*,*\|} \]

\[ P_{exp}(a,r,b) = \frac{\|a,r,*\|}{\|*,r,*\|} \cdot \frac{\|*,r,b\|}{\|*,r,*\|} \cdot \frac{\|*,r,*\|}{\|*,*,*\|} \]

In the formulas above, a and b are words that are in relation r: for example, a = coffee, b = drink, r = is-obj-of in (coffee,is-obj-of,drink). * means ‘anything’: any word or any relation. A triple surrounded by || denotes the frequency of the triple. For example, ||a,r,b|| denotes how often (a,r,b) is found in the corpus, and ||a,r,*|| denotes how often a appears in relation r with any word.
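The two probabilities can be computed from triple frequencies roughly as follows. This is only a sketch: it assumes the counts have already been loaded into a dictionary keyed by (a, r, b), since the exact layout of the deps file is not described here, and it assumes Pexp is the product of ||a,r,*|| / ||*,r,*||, ||*,r,b|| / ||*,r,*|| and ||*,r,*|| / ||*,*,*||. The function names and the toy counts are made up for illustration.

```python
from collections import Counter


def marginal_counts(triple_counts):
    """Sum triple frequencies into the marginals used by Pobs and Pexp."""
    total = 0           # ||*,*,*||
    a_r = Counter()     # ||a,r,*||
    r_b = Counter()     # ||*,r,b||
    r_only = Counter()  # ||*,r,*||
    for (a, r, b), n in triple_counts.items():
        total += n
        a_r[(a, r)] += n
        r_b[(r, b)] += n
        r_only[r] += n
    return total, a_r, r_b, r_only


def p_obs(triple, triple_counts, total):
    """Observed probability: ||a,r,b|| / ||*,*,*||."""
    return triple_counts[triple] / total


def p_exp(triple, total, a_r, r_b, r_only):
    """Expected probability, assuming a and b are independent given r."""
    a, r, b = triple
    return (
        (a_r[(a, r)] / r_only[r])
        * (r_b[(r, b)] / r_only[r])
        * (r_only[r] / total)
    )


# Toy counts, invented for illustration only (not taken from the deps file).
toy = {
    ("coffee", "is-obj-of", "drink"): 4,
    ("tea", "is-obj-of", "drink"): 2,
    ("coffee", "is-obj-of", "buy"): 1,
    ("book", "is-obj-of", "buy"): 3,
}

total, a_r, r_b, r_only = marginal_counts(toy)
triple = ("coffee", "is-obj-of", "drink")
obs = p_obs(triple, toy, total)
exp = p_exp(triple, total, a_r, r_b, r_only)
print(obs, exp)  # obs is larger than exp for this toy triple
```

Collecting the marginal counts in a single pass keeps each later probability lookup cheap, which matters when the same marginals are reused for every triple.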
If Pobs(a,r,b) is higher than Pexp(a,r,b), the dependency triple (a,r,b) is a meaningful context for the word a. The ratio of Pobs(a,r,b) to Pexp(a,r,b) measures the amount of contextual information contained in (a,r,b). Let’s denote that amount by I(a,r,b). That is,

\[ I(a,r,b) = \frac{P_{obs}(a,r,b)}{P_{exp}(a,r,b)} \]

By this definition, the dependency triple (a,r,b) is meaningful exactly when I(a,r,b) is greater than 1.

2.2. Similarity

The amount of similarity between two words x and y is proportional to the amount of information contained in the meaningful contexts shared by both x and y. Let’s follow the naming convention below to define precisely how to measure the similarity between two words x and y.

(5) T(x): the set of all dependency triples beginning with x that are meaningful
(6) T(y): the set of all dependency triples beginning with y that are meaningful
(7) I(x and y): the sum of information over all dependency triples in both T(x) and T(y)
(8) I(x or y): the sum of information over all dependency triples in either T(x) or T(y)

The similarity between two words x and y, sim(x,y), is measured as follows:

\[ \mathrm{sim}(x,y) = \frac{I(x \text{ and } y)}{I(x \text{ or } y)} \]
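The definitions in this section can be sketched in code as follows, assuming sim(x,y) is I(x and y) divided by I(x or y). Several points are assumptions rather than part of the assignment. Triple counts are taken to be already loaded into a dictionary keyed by (a, r, b). Since every triple in T(x) begins with x, a triple is treated as being "in both T(x) and T(y)" when the two words share the same (r, b) part. I(x and y) then adds the information of both words' triples for each shared context, and I(x or y) adds the information of every triple in T(x) and in T(y). This reading mirrors the form of the measure in Lin (1998). All function names and toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def information(triple_counts):
    """Return I(a,r,b) = Pobs / Pexp for every triple that has a count."""
    total = sum(triple_counts.values())
    a_r, r_b, r_only = Counter(), Counter(), Counter()
    for (a, r, b), n in triple_counts.items():
        a_r[(a, r)] += n
        r_b[(r, b)] += n
        r_only[r] += n
    info = {}
    for (a, r, b), n in triple_counts.items():
        p_obs = n / total
        p_exp = (
            (a_r[(a, r)] / r_only[r])
            * (r_b[(r, b)] / r_only[r])
            * (r_only[r] / total)
        )
        info[(a, r, b)] = p_obs / p_exp
    return info


def meaningful_contexts(info):
    """Map each word x to {(r, b): I(x,r,b)} for triples with I > 1, i.e. T(x)."""
    contexts = defaultdict(dict)
    for (a, r, b), value in info.items():
        if value > 1:
            contexts[a][(r, b)] = value
    return contexts


def sim(x, y, contexts):
    """I(x and y) / I(x or y), under the shared-(r, b) assumption above."""
    tx, ty = contexts.get(x, {}), contexts.get(y, {})
    shared = tx.keys() & ty.keys()
    i_and = sum(tx[c] + ty[c] for c in shared)
    i_or = sum(tx.values()) + sum(ty.values())
    return i_and / i_or if i_or else 0.0


def most_similar(x, words, contexts, k=10):
    """Return the k words most similar to x, ties broken alphabetically."""
    scored = [(sim(x, y, contexts), y) for y in words if y != x]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [y for _, y in scored[:k]]


# Toy counts, invented for illustration only (not taken from the deps file).
toy = {
    ("hot", "mod", "coffee"): 4,
    ("hot", "mod", "day"): 1,
    ("warm", "mod", "coffee"): 2,
    ("warm", "mod", "day"): 1,
    ("warm", "mod", "bath"): 2,
    ("long", "mod", "day"): 4,
    ("long", "mod", "road"): 3,
}

ctx = meaningful_contexts(information(toy))
print(sim("hot", "warm", ctx))  # "hot" and "warm" share the context (mod, coffee)
print(most_similar("hot", ["hot", "warm", "long"], ctx, k=1))
```

For the real task, the word list passed to most_similar would be the adjectives from the adjectives file, and k would stay at its default of ten.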