Collocations / Multi-Word Expressions

Modeling the Gram in N-Gram
• While doing language modeling, parsing, or translation, we need to know what constitutes a word
• I want to eat Pani Puri vs. I want to eat Samosa
• President vs. Prime Minister
• Pani Puri and Prime Minister are written as two tokens but behave as single units, whereas Samosa and President are single words
Collocations / Multi-Word Expressions
(Based on Chapter 5 of FSNLP)
More examples
Multi-Word Expression

Conceptually:
A sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.

From the computational-linguistic viewpoint, MWEs could be:
Compound Nouns: green card, traffic signal, जल प्रपात (waterfall)
Verb-Particle collocation: figured out, ate up
Verb-Noun collocation: fall asleep, अंगुली उठाना (to point a finger at someone)
Idioms: kick the bucket, spill the beans

A sequence of lexemes, morphologically processed as a unit, whose meaning or distribution cannot be accounted for by the productive rules of the language system, i.e. an MWE crosses word boundaries and either:
is semantically non-compositional: blow hot and cold
exhibits institutionalized usage: traffic signal
Frequency Criteria
• If two words occur together a lot, then some special meaning gets attached to them
• Statistical co-occurrence is a good indicator (see the sketch below)
• e.g. traffic signal, good morning, Prime Minister

What makes a collocation a multiword?
• Institutionalized usage: by convention, the collocation is frequently used in everyday discourse
• Semantic non-compositionality: the meaning of the collocation is not completely composable from those of its constituents, e.g. traffic signal, spill the beans, run for office, चाय पानी (chai-paani, a euphemism for a small bribe)
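As a concrete illustration of the frequency criterion, here is a minimal sketch (mine, not from the slides) that counts adjacent word pairs in a tokenised text and lists the most frequent ones; the toy sentence is invented for the example.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs: the raw-frequency criterion for collocations."""
    return Counter(zip(tokens, tokens[1:]))

# toy corpus, just to exercise the function
tokens = ("the prime minister met the prime minister of nepal "
          "near the traffic signal").split()
for bigram, freq in bigram_counts(tokens).most_common(3):
    print(" ".join(bigram), freq)
# 'prime minister' and 'the prime' both occur twice: raw frequency also surfaces
# uninteresting pairs built around function words, which motivates the POS filter below
```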
POS Tag Filter

POS Filter (NY Times Corpus)
• We would like to ignore the function words (see the sketch below)
• What about degrees of freedom? (its tag pattern is N P N, so the filter must allow patterns beyond A N and N N)
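A minimal sketch of such a filter, assuming the input is already POS-tagged with Penn Treebank style tags; the two-tag pattern set (adjective-noun, noun-noun and so on) is in the spirit of the Justeson and Katz filter described in FSNLP, and the tagged sentence is invented for illustration.

```python
from collections import Counter

# Penn-Treebank-style tag patterns for candidate two-word terms
# (adjective-noun and noun-noun combinations; extendable to N P N, A A N, ...)
GOOD_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"), ("NN", "NNS"),
                 ("NNP", "NNP")}

def pos_filtered_bigrams(tagged_tokens):
    """Keep only bigrams whose tag sequence matches an allowed pattern,
    dropping pairs built around function words (determiners, prepositions, ...)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in GOOD_PATTERNS:
            counts[(w1.lower(), w2.lower())] += 1
    return counts

tagged = [("the", "DT"), ("Prime", "NNP"), ("Minister", "NNP"), ("crossed", "VBD"),
          ("the", "DT"), ("traffic", "NN"), ("signal", "NN")]
print(pos_filtered_bigrams(tagged).most_common())
# [(('prime', 'minister'), 1), (('traffic', 'signal'), 1)]
```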
List of Top-10 NN Hindi Compounds in a Tourism Corpus

MWE | Freq
समुद्र तट (sea shore) | 87
राष्ट्रीय उद्यान (national park) | 53
मंदिर यात्रा (temple pilgrimage) | 51
हिल स्टेशन (hill station) | 36
वर्ग किलोमीटर (square kilometre) | 35
संयुक्त राज्य (United States) | 33
खान पान (food and drink) | 31
प्रवेश द्वार (entrance gate) | 31
हिमाचल प्रदेश (Himachal Pradesh) | 28
ट्रैवेल डेस्टिनेशंस (travel destinations) | 28

Specific Application: Contrasting Synonyms
Is it by Chance?
• At rank 42 in the tourism corpus we find मंदिर भारत ("temple India"), with frequency 13
• If both मंदिर and भारत occur often enough in the corpus, then just by chance they will occur together
• Null Hypothesis H0: there is no association between the words
• Compute the probability p that the observed collocations would occur if H0 were true
• Reject H0 if p is too low (typically below a significance level of 0.05, 0.01, or 0.005) and retain H0 as possible otherwise
Forming the Hypothesis
• w1 and w2 occur independently:
  – H0: P(w1 w2) = P(w1) P(w2)
• We now need a test to tell us if the observed counts of w1 w2 are significant or not
The t test

t = (x̄ − µ) / √(s² / N)

where x̄ is the sample mean, µ the mean expected under the null hypothesis, s² the sample variance, and N the number of samples.
Modeling Bigram Occurrence as a Bernoulli Trial
If the t statistic is large enough, we can reject the null hypothesis.
Treat the corpus as a long sequence of N bigrams; the samples are then indicator random variables that take on the value 1 when the bigram of interest occurs, and are 0 otherwise.

There are 8 occurrences of new companies among the 14,307,668 bigrams in our corpus.
The resulting t value of 0.999932 is not larger than 2.576, the critical value for a significance level of α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
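The same computation as a small sketch (my code, not the slides'): each bigram position is a Bernoulli indicator, so x̄ is the observed bigram probability, µ the probability expected under H0, and s² ≈ x̄(1 − x̄). The unigram counts for new and companies are the ones quoted in FSNLP for this example and should be treated as approximate.

```python
from math import sqrt

def t_statistic(c1, c2, c12, n):
    """t statistic for the bigram w1 w2 under the Bernoulli-trial model.

    c1, c2 : corpus frequencies of w1 and w2
    c12    : frequency of the bigram w1 w2
    n      : total number of bigrams in the corpus
    """
    x_bar = c12 / n               # observed bigram probability (sample mean)
    mu = (c1 / n) * (c2 / n)      # expected probability under H0: P(w1)P(w2)
    s2 = x_bar * (1 - x_bar)      # variance of the Bernoulli indicator
    return (x_bar - mu) / sqrt(s2 / n)

# "new companies": 8 occurrences among 14,307,668 bigrams (counts as in FSNLP)
t = t_statistic(c1=15828, c2=4675, c12=8, n=14307668)
print(t)   # roughly 0.9999, below the 2.576 critical value, so H0 is not rejected
```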
More Examples
Likelihood Ratios
• What does a t value of 1.2 or 2.7 mean?
• The t test assumes that probabilities are approximately normally distributed
• Alternative: how likely is one hypothesis compared to the other?

Can be used for Contrasting Synonyms
Computing the Likelihood Ratio

The log-likelihood ratio is calculated as
  log λ = log L(k1, n1, p) + log L(k2, n2, p) − log L(k1, n1, p1) − log L(k2, n2, p2)
where L(k, n, x) = x^k (1 − x)^(n−k) is the likelihood of the observed frequency k of w2 in n trials (the binomial coefficient cancels in the ratio).

The following are the quantities involved:
  p1 = P(w2|w1), p2 = P(w2|~w1), n1 = c1, k1 = c12
  n2 = n − c1, k2 = c2 − c12
  c1, c2, c12 = corpus frequencies of w1, w2, w1 w2
  n = total number of words in the corpus

For the alternative hypothesis, the MLE estimates of p1 and p2 are
  p1 = k1/n1 and p2 = k2/n2
For the null hypothesis, we have p1 = p2 = p, with
  p = (k1 + k2)/(n1 + n2)
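A sketch of the corresponding computation, assuming the formulation above (Dunning's likelihood-ratio test as presented in FSNLP); it returns −2 log λ, the form usually compared against chi-square critical values. The counts reused below are the new / companies figures from the t-test example.

```python
from math import log

def log_L(k, n, x):
    """log of x**k * (1 - x)**(n - k); the binomial coefficient cancels in the ratio."""
    return k * log(x) + (n - k) * log(1 - x)

def llr(c1, c2, c12, n):
    """-2 log(lambda) for the bigram w1 w2, using the quantities defined above."""
    p = c2 / n                    # null hypothesis: p1 = p2 = p = (k1 + k2) / (n1 + n2)
    p1 = c12 / c1                 # alternative hypothesis: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)    # alternative hypothesis: P(w2 | ~w1)
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, n - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, n - c1, p2))
    return -2 * log_lambda        # asymptotically chi-square distributed under H0

print(llr(c1=15828, c2=4675, c12=8, n=14307668))
```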
Examples from [Dunning 93], the paper that introduced LLR
Hindi Examples Ranked by Frequency and LLR

Rank | Frequency | LLR
1 | किलो मीटर (kilometre) | किलो मीटर (kilometre)
2 | समुद्र तट (sea shore) | समुद्र तट (sea shore)
3 | राष्ट्रीय उद्यान (national park) | प्रवेश द्वार (entrance gate)
4 | जल प्रपात (waterfall) | राष्ट्रीय उद्यान (national park)
5 | प्रवेश द्वार (entrance gate) | जल प्रपात (waterfall)
6 | वर्ग किलोमीटर (square kilometre) | खान पान (food and drink)
7 | संयुक्त राज्य (United States) | वर्ग किलोमीटर (square kilometre)
8 | खान पान (food and drink) | वास्तु शिल्प (architecture)
9 | भू दृश्य (landscape) | संयुक्त राज्य (United States)
10 | वास्तु शिल्प (architecture) | भीड़ भाड़ (hustle and bustle)
Dealing with Non-Adjacent Words
• Results 1 - 10 of about 1,680,000 for knock door. (First Screen)
  – A Knock at the Door
  – Knocking on Heaven's Door
  – Knock on Any Door
  – Knock On Door Cartoons
• Enough evidence that knock...door is some kind of a phrase
• Contrast this with beat door
  – Beat Door
  – beat a path to door
  – Beat swing door
  – Door Wide Open: A Beat Love Affair in Letters 1957
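To gather this kind of evidence programmatically, one can record the position of the second word relative to the first whenever the two occur within a small window; a minimal sketch (function name, window size, and example sentence are my own choices):

```python
def cooccurrence_offsets(tokens, w1, w2, window=5):
    """Collect signed distances (in words) from each occurrence of w1 to every
    occurrence of w2 that lies within +/- `window` positions of it."""
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in (w1, w2)}
    offsets = []
    for i in positions[w1]:
        for j in positions[w2]:
            if 0 < abs(j - i) <= window:
                offsets.append(j - i)
    return offsets

tokens = ("she knocked on the door then after a long while "
          "he knocked at their front door").split()
print(cooccurrence_offsets(tokens, "knocked", "door"))   # [3, 4]: offsets cluster tightly
```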
Mean and Variance
• Is the distance between the two terms predictable? If so, the variance of the observed distances is low
• Mean estimation: MLE of the average distance
• Variance estimation (note the definition: the denominator is n − 1, the sample variance):

  s² = Σᵢ₌₁ⁿ (dᵢ − d̄)² / (n − 1)

where d̄ is the mean of the observed distances d₁, …, dₙ.
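Continuing the sketch above, the mean and the sample variance of the collected offsets can be computed directly; statistics.variance uses the n − 1 denominator, matching the definition on the slide. The offsets below are hypothetical.

```python
from statistics import mean, variance

offsets = [3, 4, 3, 5, 3, 4]   # hypothetical offsets collected for some word pair
d_bar = mean(offsets)          # estimate of the average distance
s2 = variance(offsets)         # sample variance: sum((d - d_bar)**2) / (n - 1)
print(d_bar, s2)               # a small variance suggests the distance is predictable
```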
Examples
The pair previous / games (distance 2) corresponds to phrases like in the previous 10 games or in the previous 15 games; minus / points corresponds to phrases like minus 2 percentage points, minus 3 percentage points, etc.; hundreds / dollars corresponds to hundreds of billions of dollars and hundreds of millions of dollars.