Modeling the Gram in N-Gram

• While doing language modeling, parsing, or translation, we need to know what constitutes a word
• I want to eat Pani Puri vs. I want to eat Samosa
• President vs. Prime Minister

Collocations / Multi-Word Expressions
(Based on Chapter 5 of FSNLP)

More Examples

Multi-Word Expression

Conceptually: a sequence, continuous or discontinuous, of words or other elements, which is or appears to be prefabricated, i.e. stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.

From a computational-linguistic viewpoint, MWEs may be:
• Compound nouns: green card, traffic signal, जल प्रपात (waterfall)
• Verb-particle collocations: figured out, ate up
• Verb-noun collocations: fall asleep, अंगुली उठाना (to point a finger)
• Idioms: kick the bucket, spill the beans

An MWE is a sequence of lexemes, morphologically processed as a unit, whose meaning or distribution cannot be accounted for by the productive rules of the language system. That is, an MWE crosses word boundaries and either:
• is semantically non-compositional: blow hot and cold
• exhibits institutionalized usage: traffic signal

Frequency Criteria

What makes a collocation a multiword?
• If two words occur together often, some special meaning gets attached to them
• Institutionalized usage: by convention, the collocation is frequently used in everyday discourse
• Statistical co-occurrence is a good indicator: traffic signal, good morning, Prime Minister

Semantic Non-Compositionality

The meaning of the collocation is not completely composable from those of its constituents, e.g.
traffic signal, spill the beans, run for office, चाय पानी (lit. "tea and water")

POS Tag Filter

• We would like to ignore the function words
• What about patterns like degrees of freedom?

POS Filter (NY Times Corpus)

List of Top-10 NN Hindi Compounds in a Tourism Corpus

MWE                                        Freq
समुद्र तट (sea shore)                         87
राष्ट्रीय उद्यान (national park)               53
मंदिर यात्रा (temple tour)                    51
हिल स्टेशन (hill station)                     36
वर्ग किलोमीटर (square kilometre)             35
संयुक्त राज्य (united states)                  33
खान पान (food and drink)                    31
प्रवेश द्वार (entrance gate)                   31
हिमाचल प्रदेश (Himachal Pradesh)             28
ट्रैवल डेस्टिनेशंस (travel destinations)         28

Specific Application: Contrasting Synonyms

Is It by Chance?

• At rank 42 in the tourism corpus we find मंदिर भारत (temple India), with frequency 13
• If both मंदिर and भारत occur often enough in the corpus, then just by chance they will sometimes occur together
• Null hypothesis H0: there is no association between the words
• Compute the probability p that the observed co-occurrences would arise if H0 were true
• Reject H0 if p is too low (typically below a significance level of 0.05, 0.01, or 0.005), and retain H0 as possible otherwise

Forming the Hypothesis

• w1 and w2 occur independently:
  H0: P(w1 w2) = P(w1) P(w2)
• We now need a test that tells us whether the observed counts of w1 w2 are significant or not

The t Test

t = (x̄ − µ) / sqrt(s² / N)

If the t statistic is large enough, we can reject the null hypothesis.

Modeling Bigram Occurrence as a Bernoulli Trial

Treat the corpus as a long sequence of N bigrams; the samples are then indicator random variables that take the value 1 when the bigram of interest occurs and 0 otherwise. There are 8 occurrences of "new companies" among the 14,307,668 bigrams in our corpus. The resulting t value of 0.999932 is not larger than 2.576, the critical value at significance level α = 0.005. So we cannot reject the null hypothesis: new and companies occur independently and do not form a collocation.
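The t-test calculation above can be sketched as follows. This is a minimal sketch; the unigram counts c(new) = 15,828 and c(companies) = 4,675 are the ones used in FSNLP's worked example and are assumed here, since only the joint count appears in the slide.

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t statistic for bigram w1 w2 under H0: P(w1 w2) = P(w1) P(w2).

    Each of the N bigram positions is treated as a Bernoulli trial, so the
    sample mean is the observed bigram probability and s^2 = x(1 - x).
    """
    x_bar = c12 / N               # observed P(w1 w2)
    mu = (c1 / N) * (c2 / N)      # expected P(w1 w2) under independence
    s2 = x_bar * (1 - x_bar)      # Bernoulli variance
    return (x_bar - mu) / sqrt(s2 / N)

# FSNLP's "new companies" example: 8 joint occurrences among 14,307,668 bigrams.
t = t_score(c1=15828, c2=4675, c12=8, N=14307668)
print(round(t, 4))  # ≈ 0.9999, well below the 2.576 critical value
```

Since t < 2.576, H0 is retained, matching the conclusion above.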
More Examples

Likelihood Ratios

• What does a t value of 1.2 or 2.7 actually mean?
• The t test assumes that probabilities are approximately normally distributed
• Alternative: how likely is one hypothesis compared to the other?
• Can also be used for contrasting synonyms

Computing the Likelihood Ratio

The quantities involved are:
• p1 = P(w2 | w1), p2 = P(w2 | ¬w1)
• n1 = c1, k1 = c12
• n2 = N − c1, k2 = c2 − c12
where c1, c2, c12 are the corpus frequencies of w1, w2, and w1 w2, and N is the total number of words in the corpus.

For the alternative hypothesis, the MLE estimates are p1 = k1/n1 and p2 = k2/n2.
For the null hypothesis, p1 = p2 = p = (k1 + k2)/(n1 + n2).

Writing L(k, n, x) = x^k (1 − x)^(n−k) for the likelihood of observing w2 k times in n opportunities, the log-likelihood ratio is calculated as

log λ = log L(k1, n1, p) + log L(k2, n2, p) − log L(k1, n1, p1) − log L(k2, n2, p2)

Examples are from [Dunning 93], the paper that introduced the LLR test.

Hindi Examples Ranked by Frequency and LLR

Frequency                LLR
किलो मीटर                किलो मीटर
समुद्र तट                 समुद्र तट
राष्ट्रीय उद्यान            प्रवेश द्वार
जल प्रपात                राष्ट्रीय उद्यान
प्रवेश द्वार               जल प्रपात
वर्ग किलोमीटर            खान पान
संयुक्त राज्य              वर्ग किलोमीटर
खान पान                 वास्तुशिल्प
भू दृश्य                  संयुक्त राज्य
वास्तुशिल्प               भीड़ भाड़

Dealing with Non-Adjacent Words

• Results 1–10 of about 1,680,000 for knock door (first screen):
  – A Knock at the Door
  – Knocking on Heaven's Door
  – Knock on Any Door
  – Knock On Door Cartoons
• Enough evidence that knock ... door is some kind of a phrase
• Contrast this with beat door:
  – Beat Door
  – beat a path to door
  – Beat swing door
  – Door Wide Open: A Beat Love Affair in Letters 1957

Mean and Variance

• Is the distance between the terms predictable, i.e. is the variance low?
• Mean estimation: MLE of the average distance
• Variance estimation (note the n − 1 denominator):

  s² = ∑_{i=1}^{n} (d_i − d̄)² / (n − 1)

Examples

The pair previous / games (distance 2) corresponds to phrases like "in the previous 10 games" or "in the previous 15 games"; minus / points corresponds to phrases like "minus 2 percentage points", "minus 3 percentage points", etc.; hundreds / dollars corresponds to "hundreds of billions of dollars" and "hundreds of millions of dollars".
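The mean and variance estimators above can be sketched as follows. The list of offsets is hypothetical (made-up distances at which door follows knock), not data from the slides:

```python
from math import fsum

def offset_stats(offsets):
    """Sample mean and sample variance (n - 1 denominator) of w1..w2 distances."""
    n = len(offsets)
    d_bar = fsum(offsets) / n
    s2 = fsum((d - d_bar) ** 2 for d in offsets) / (n - 1)
    return d_bar, s2

# Hypothetical offsets of "door" relative to "knock" in a toy sample.
mean_d, var_d = offset_stats([3, 3, 4, 3, 5, 3])
print(mean_d, var_d)  # 3.5 0.7 — low variance suggests a phrasal pattern
```

A pair with a peaked, low-variance offset distribution (like previous / games above) is a better collocation candidate than one whose offsets are spread out.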
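The log-likelihood ratio from the Likelihood Ratios section can be sketched as follows. This is a minimal Dunning-style implementation; the counts in the usage example are illustrative, not taken from the tourism corpus:

```python
from math import log

def log_L(k, n, x):
    """log of the binomial likelihood x^k (1 - x)^(n - k)."""
    eps = 1e-12                    # guard against log(0) at the boundaries
    x = min(max(x, eps), 1 - eps)
    return k * log(x) + (n - k) * log(1 - x)

def llr(c1, c2, c12, N):
    """-2 log lambda for the bigram w1 w2; higher means stronger association."""
    p = c2 / N                     # null hypothesis: p1 = p2 = p
    p1 = c12 / c1                  # P(w2 | w1)
    p2 = (c2 - c12) / (N - c1)     # P(w2 | not w1)
    return -2 * (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                 - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))

# Illustrative counts: one pair co-occurs far above chance, the other exactly
# at the chance level (expected count under H0 is c1 * c2 / N = 1).
strong = llr(c1=100, c2=100, c12=80, N=10000)
weak = llr(c1=100, c2=100, c12=1, N=10000)
print(strong > weak)  # True
```

The at-chance pair scores essentially zero, while the strongly associated pair gets a large positive score, which is what makes LLR a usable ranking criterion.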