LING115: Final Project
1. Goal
The goal of the final project is to automatically build a thesaurus, as Lin did in his paper
(Lin, 1998). More specifically, here’s what you should do:
For each adjective in /ling115/sample_data/adjectives.1000, identify the ten most
similar adjectives within the same list.
Use the formulas in section 2 in conjunction with the dependency triples in
/home/ling115/sample_data/deps. The formulas may seem different from the ones in
Lin’s paper. But they yield the same results, so rest assured. Each line in the deps file specifies
how often a dependency triple appears in a corpus.
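To make the input concrete, here is a minimal Python sketch of reading the triples into a frequency table. The exact layout of the deps file is not described above, so the line format used here (word, relation, word, count, separated by whitespace) is an assumption; check a few lines of the real file and adjust the parsing to match. The function name parse_deps and the sample lines are made up for illustration.

```python
from collections import Counter

def parse_deps(lines):
    """Return a Counter mapping (a, r, b) -> frequency.

    ASSUMPTION: each line looks like 'a r b count', separated by whitespace.
    The real deps file may use a different layout.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        a, r, b, n = fields
        counts[(a, r, b)] += int(n)
    return counts

# Hypothetical sample lines, not taken from the real file.
sample = ["coffee is-obj-of drink 12", "tea is-obj-of drink 7", ""]
counts = parse_deps(sample)
print(counts[("coffee", "is-obj-of", "drink")])
```

With a real file you would pass an open file handle instead of the sample list, since iterating over a file yields its lines.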
2. Formulas
How does Lin know if two words are similar? Roughly speaking, two words are similar if there
is a lot of overlap in “meaningful” contexts in which they appear. There are two issues here:
(1) What is a “meaningful” context?
(2) How do you measure the amount of overlap?
2.1. Meaningful context
A context in which a word appears is characterized by a dependency triple that begins with the
word. For example, in “People drink coffee”, one context of “coffee” is (coffee,is-obj-of,drink),
which means “coffee” is the object of “drink”.
A context, i.e. a dependency triple, is meaningful if it appears more often than expected.
We compare the following two probabilities to see if a dependency triple (a,r,b) is meaningful:
(3) Pobs(a,r,b)
(4) Pexp(a,r,b)
Here is how the two probabilities are calculated.
$$P_{obs}(a,r,b) = \frac{\|a,r,b\|}{\|*,*,*\|}$$

$$P_{exp}(a,r,b) = \frac{\|a,r,*\|}{\|*,r,*\|} \cdot \frac{\|*,r,b\|}{\|*,r,*\|} \cdot \frac{\|*,r,*\|}{\|*,*,*\|}$$
In the formulas above, a and b are words that are in relation r: for example, a = coffee, b = drink,
r = is-obj-of in (coffee,is-obj-of,drink). * means ‘anything’: any word or any relation. A triple
surrounded by || denotes the frequency of the triple. For example, ||a,r,b|| denotes how often
(a,r,b) is found in the corpus.
If Pobs(a,r,b) is higher than Pexp(a,r,b), the dependency triple (a,r,b) is a meaningful context for
the word a.
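As a sanity check on the two formulas, here is a small Python sketch that computes Pobs and Pexp directly from a table of triple frequencies. The function name and the toy counts are invented for illustration, and the loops over the whole table are written for readability, not speed.

```python
def p_obs_and_p_exp(counts, a, r, b):
    """Return (Pobs, Pexp) for the triple (a, r, b).

    counts maps (word, relation, word) -> frequency, i.e. ||a,r,b||.
    """
    total = sum(counts.values())                       # ||*,*,*||
    n_arb = counts.get((a, r, b), 0)                   # ||a,r,b||
    n_ar = sum(n for (w1, rel, _), n in counts.items()
               if w1 == a and rel == r)                # ||a,r,*||
    n_rb = sum(n for (_, rel, w2), n in counts.items()
               if rel == r and w2 == b)                # ||*,r,b||
    n_r = sum(n for (_, rel, _), n in counts.items()
              if rel == r)                             # ||*,r,*||
    if n_r == 0:
        return 0.0, 0.0  # relation r never occurs, so there is nothing to estimate
    p_obs = n_arb / total
    p_exp = (n_ar / n_r) * (n_rb / n_r) * (n_r / total)
    return p_obs, p_exp

# Toy frequencies, made up for this example.
toy = {
    ("coffee", "is-obj-of", "drink"): 4,
    ("coffee", "is-obj-of", "buy"): 1,
    ("tea", "is-obj-of", "drink"): 3,
    ("car", "is-obj-of", "buy"): 2,
}
p_obs, p_exp = p_obs_and_p_exp(toy, "coffee", "is-obj-of", "drink")
print(p_obs > p_exp)  # is (coffee,is-obj-of,drink) a meaningful context?
```

In the toy table, ||coffee,is-obj-of,drink|| = 4, ||coffee,is-obj-of,*|| = 5, ||*,is-obj-of,drink|| = 7, and both ||*,is-obj-of,*|| and ||*,*,*|| are 10.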
The ratio of Pobs(a,r,b) to Pexp(a,r,b) measures the amount of contextual information contained in
(a,r,b). Let’s denote that amount by I(a,r,b) below. That is,
$$I(a,r,b) = \frac{P_{obs}(a,r,b)}{P_{exp}(a,r,b)}$$
Obviously, if I(a,r,b) is greater than 1, the dependency triple (a,r,b) is meaningful.
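Substituting the definitions of Pobs and Pexp into the ratio, the total ||*,*,*|| cancels, leaving I(a,r,b) = ||a,r,b|| · ||*,r,*|| / (||a,r,*|| · ||*,r,b||). The Python sketch below uses that simplified form to compute I for every triple in one pass and then keeps the meaningful ones. The function names and toy counts are invented for illustration, and the code assumes every frequency in the table is positive.

```python
from collections import Counter

def information(counts):
    """Map each triple (a, r, b) to I(a,r,b) = Pobs(a,r,b) / Pexp(a,r,b)."""
    n_ar, n_rb, n_r = Counter(), Counter(), Counter()
    for (a, r, b), n in counts.items():
        n_ar[(a, r)] += n   # ||a,r,*||
        n_rb[(r, b)] += n   # ||*,r,b||
        n_r[r] += n         # ||*,r,*||
    # Simplified ratio: ||a,r,b|| * ||*,r,*|| / (||a,r,*|| * ||*,r,b||)
    return {
        (a, r, b): n * n_r[r] / (n_ar[(a, r)] * n_rb[(r, b)])
        for (a, r, b), n in counts.items()
    }

def meaningful(info):
    """Keep only the triples whose information is greater than 1."""
    return {triple: i for triple, i in info.items() if i > 1}

# Toy frequencies, made up for this example.
toy = {
    ("coffee", "is-obj-of", "drink"): 4,
    ("coffee", "is-obj-of", "buy"): 1,
    ("tea", "is-obj-of", "drink"): 3,
    ("car", "is-obj-of", "buy"): 2,
}
info = information(toy)
kept = meaningful(info)
print(sorted(kept))
```

Precomputing the three marginal tables means each triple is scored with a few dictionary lookups, which matters once the table holds every triple in the corpus.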
2.2. Similarity
The amount of similarity between two words x and y is proportional to the amount of
information contained in the meaningful contexts shared by both x and y.
Let’s follow the naming convention below to define precisely how to measure the similarity
between two words x and y.
(5) T(x): the set of all dependency triples beginning with x that are meaningful
(6) T(y): the set of all dependency triples beginning with y that are meaningful
(7) I(x and y): the sum of information over all dependency triples in both T(x) and T(y)
(8) I(x or y): the sum of information over all dependency triples in either T(x) or T(y)
The similarity between two words x and y, sim(x,y), is measured as follows:
$$sim(x,y) = \frac{I(x \text{ and } y)}{I(x \text{ or } y)}$$
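Here is one way the similarity measure and the top-ten lookup might be sketched in Python. It rests on an assumed reading of definitions (7) and (8), following Lin's formulation: a triple of x and a triple of y count as the same context when they share the relation and the second word; I(x and y) adds up both I(x,r,b) and I(y,r,b) for each shared context; and I(x or y) adds up the information of every triple in T(x) plus every triple in T(y). All function names and the toy information values are invented for illustration.

```python
from collections import defaultdict

def contexts_by_word(info):
    """Build T: word -> {(r, b): I(word, r, b)}, keeping meaningful triples only."""
    T = defaultdict(dict)
    for (a, r, b), i in info.items():
        if i > 1:
            T[a][(r, b)] = i
    return T

def sim(T, x, y):
    """sim(x,y) = I(x and y) / I(x or y), under the reading described above."""
    tx, ty = T.get(x, {}), T.get(y, {})
    shared = tx.keys() & ty.keys()                 # contexts (r, b) in both T(x) and T(y)
    i_and = sum(tx[c] + ty[c] for c in shared)     # I(x and y)
    i_or = sum(tx.values()) + sum(ty.values())     # I(x or y)
    return i_and / i_or if i_or else 0.0

def most_similar(T, x, words, k=10):
    """Return the k words most similar to x, excluding x itself."""
    scored = [(sim(T, x, y), y) for y in words if y != x]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties alphabetical
    return [y for _, y in scored[:k]]

# Toy information values, made up for this example.
info = {
    ("hot", "mod", "coffee"): 2.0,
    ("hot", "mod", "day"): 1.5,
    ("warm", "mod", "day"): 3.0,
    ("warm", "mod", "coat"): 1.2,
    ("cold", "mod", "coffee"): 0.8,  # not meaningful, so it is dropped
}
T = contexts_by_word(info)
print(most_similar(T, "hot", ["hot", "warm", "cold"]))
```

Under this reading a word compared with itself scores 1 and two words with no shared meaningful context score 0, which is a quick check to run on your own implementation.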