LBSC 796/INFM 718R: Week 4
Language Models
Jimmy Lin
College of Information Studies, University of Maryland
Monday, February 20, 2006

Last Time...
Boolean model
- Based on the notion of sets
- Documents are retrieved only if they satisfy the Boolean conditions specified in the query
- Does not impose a ranking on retrieved documents
- Exact match
Vector space model
- Based on geometry: the notion of vectors in high-dimensional space
- Documents are ranked based on their similarity to the query (ranked retrieval)
- Best/partial match

Today
Language models
- Based on the notion of probabilities and processes for generating text
- Documents are ranked based on the probability that they generated the query
- Best/partial match
First we start with probabilities...

Probability
What is probability?
- Statistical: relative frequency as n → ∞
- Subjective: degree of belief
Thinking probabilistically
- Imagine a finite amount of "stuff" (= probability mass); the total amount of "stuff" is one
- The event space is "all the things that could happen"
- Distribute that "mass" over the possible events
- The sum of all probabilities has to add up to one

Key Concepts
- Defining probability with frequency
- Statistical independence
- Conditional probability
- Bayes' Theorem

Statistical Independence
A and B are independent if and only if: P(A and B) = P(A) P(B)
- Simplest example: a series of coin flips
- Independence formalizes "unrelated"
Example:
- P("being brown-eyed") = 6/10
- P("being a doctor") = 1/1000
- P("being a brown-eyed doctor") = P("being brown-eyed") × P("being a doctor") = 6/10,000

Dependent Events
Suppose:
- P("having a B.S. degree") = 4/10
- P("being a doctor") = 1/1000
Would you expect:
- P("having a B.S. degree and being a doctor") = P("having a B.S. degree") × P("being a doctor") = 4/10,000?
Another example:
- P("being a doctor") = 1/1000
- P("having studied anatomy") = 12/1000
- P("having studied anatomy" | "being a doctor") = ??

Conditional Probability
P(A | B) = P(A and B) / P(B)
- P(A) = probability of A relative to the entire event space
- P(A | B) = probability of A given that we know B is true

Doctors and Anatomy
What is P("having studied anatomy" | "being a doctor")?
- A = having studied anatomy, B = being a doctor
- P("being a doctor") = 1/1000
- P("having studied anatomy") = 12/1000
- P("being a doctor who studied anatomy") = 1/1000
- P("having studied anatomy" | "being a doctor") = (1/1000) / (1/1000) = 1

More on Conditional Probability
What if P(A|B) = P(A)? Then A and B must be statistically independent!
Is P(A|B) = P(B|A)? Not in general:
- P("having studied anatomy" | "being a doctor") = 1: if you're a doctor, you must have studied anatomy
- P("being a doctor" | "having studied anatomy") = 1/12: if you've studied anatomy, you're more likely to be a doctor, but you could also be a biologist, for example

Probabilistic Inference
- Suppose there's a horrible, but very rare disease: the probability that you contracted it is 0.01%
- But there's a very accurate test for it: the test is 99% accurate
- Unfortunately, you tested positive... Should you panic?

Bayes' Theorem
You want to find P("have disease" | "test positive"), but you only know:
- How rare the disease is
- How accurate the test is
Use Bayes' Theorem (hence Bayesian Inference):
  P(A | B) = P(B | A) P(A) / P(B)
where P(A) is the prior probability and P(A | B) is the posterior probability.

Applying Bayes' Theorem
- P("have disease") = 0.0001 (0.01%)
- P("test positive" | "have disease") = 0.99 (99%)
- P("test positive") = 0.010098
Two cases:
1. You have the disease, and you tested positive: (0.0001)(0.99) = 0.000099
2. You don't have the disease, but you tested positive (error): (0.9999)(0.01) = 0.009999
Case 1 + Case 2 = 0.010098
P("have disease" | "test positive") = (0.99)(0.0001) / 0.010098 = 0.009804 = 0.9804%
Don't worry!
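The disease-test arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior` and its parameterization are mine; all the numbers come from the slide.

```python
# Bayes' theorem applied to the rare-disease example:
# P(disease) = 0.0001, test is 99% accurate (so a 1% false-positive rate).

def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive) = P(pos|disease) P(disease) + P(pos|healthy) P(healthy)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.0001, sensitivity=0.99, false_positive_rate=0.01)
print(round(p, 6))  # ≈ 0.009804, i.e. under 1% despite the positive test
```

The key point the code makes concrete: the denominator is dominated by the false positives from the healthy 99.99% of the population, which is why the posterior stays small.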
Another View
In a population of one million people:
- 100 are infected: 99 test positive, 1 tests negative
- 999,900 are not infected: 9,999 test positive, 989,901 test negative
So 10,098 people will test positive... Of those, only 99 really have the disease!

Competing Hypotheses
Consider:
- A set of hypotheses: H1, H2, H3
- Some observable evidence: O
If you observed O, what likely caused it?
- P1 = P(H1|O), P2 = P(H2|O), P3 = P(H3|O)
- Which explanation is most likely?
Example: you know that three things can cause the grass to be wet: rain, sprinkler, flood. You observe that the grass is wet. What caused it?

An Example
Let:
- O = "Joe earns more than $80,000/year"
- H1 = "Joe is an NBA referee"
- H2 = "Joe is a college professor"
- H3 = "Joe works in food services"
Suppose we know that Joe earns more than $80,000 a year... What should be our guess about Joe's profession?
Suppose we do a survey and we find out:
- P(O|H1) = 0.6, P(H1) = 0.0001 (referee)
- P(O|H2) = 0.07, P(H2) = 0.001 (professor)
- P(O|H3) = 0.001, P(H3) = 0.02 (food services)
We can calculate (each divided by the same P("earning > $80K/year")):
- P(H1|O) = 0.00006 / P(O)
- P(H2|O) = 0.00007 / P(O)
- P(H3|O) = 0.00002 / P(O)
What do we guess? H2 has the largest numerator, so "college professor" is the most likely explanation.

Recap: Key Concepts
- Defining probability with frequency
- Statistical independence
- Conditional probability
- Bayes' Theorem

What is a Language Model?
A probability distribution over strings of text: how likely is a string in a given "language"?
- p1 = P("a quick brown dog")
- p2 = P("dog quick a brown")
- p3 = P("быстрая brown dog")
- p4 = P("быстрая собака")
In a language model for English: p1 > p2 > p3 > p4
Probabilities depend on what language we're modeling. In a language model for Russian: p1 < p2 < p3 < p4

How do we model a language?
- Brute force counts? Think of all the things that have ever been said or will ever be said, of any length; count how often each one occurs; build a model based on this.
- Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed.
- Throw up our hands and admit defeat?
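The "what's Joe's job?" comparison can be written out directly. A small sketch (the dictionary layout is mine; the likelihoods and priors are the survey numbers from the slide):

```python
# Compare hypotheses by P(H_i | O) ∝ P(O | H_i) P(H_i).
# P(O) is the same for every hypothesis, so ranking by the numerator suffices.

hypotheses = {
    "NBA referee":       (0.6,   0.0001),  # (P(O|H), P(H))
    "college professor": (0.07,  0.001),
    "food services":     (0.001, 0.02),
}

scores = {h: likelihood * prior for h, (likelihood, prior) in hypotheses.items()}
best = max(scores, key=scores.get)
print(best)  # "college professor": 0.00007 beats 0.00006 and 0.00002
```

Note how the tiny prior on "NBA referee" overwhelms its large likelihood, exactly the effect Bayes' theorem is designed to capture.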
Unigram Language Model
Assume each word is generated independently.
- Obviously, this is not true... but it seems to work well in practice!
The probability of a string, given a model:
  P(q1 ... qk | M) = ∏ i=1..k P(qi | M)
The probability of a sequence of words decomposes into a product of the probabilities of the individual words.

A Physical Metaphor
Colored balls are randomly drawn from an urn M (with replacement); words are the balls.
The probability of a sequence of draws is the product of the individual draw probabilities, e.g. (4/9) × (2/9) × (4/9) × (3/9).

An Example
Model M:
  w      P(w)
  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02
  ...
P("the man likes the woman" | M)
  = P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
  = 0.2 × 0.01 × 0.02 × 0.2 × 0.01
  = 0.00000008

Comparing Language Models
Consider the string s = "the class pleaseth yon maiden" under two models:
  w         P(w | M1)   P(w | M2)
  the       0.2         0.2
  yon       0.0001      0.1
  class     0.001       0.01
  maiden    0.0005      0.01
  sayst     0.0003      0.03
  pleaseth  0.0001      0.02
  ...
P(s|M2) > P(s|M1). What exactly does this mean?

Noisy-Channel Model of IR
The user has an information need, "thinks" of a relevant document in the collection (d1, d2, ..., dn), and writes down some query terms.
Task of information retrieval: given the query, figure out which document it came from.
How is this a noisy channel?
- Classic noisy channel: a source sends a message through a transmitter, across a noisy channel, to a receiver and destination
- In IR: the information need is the source message; the query formulation process is the noisy channel; the query terms are what the receiver observes
No one seriously claims that this is actually what's going on... but this view is mathematically convenient!

Retrieval w/ Language Models
- Build a model MD for every document
- Rank document d based on P(MD | q)
- Expand using Bayes' Theorem:
  P(MD | q) = P(q | MD) P(MD) / P(q)
- P(q) is the same for all documents, so it doesn't change ranks
- P(MD) [the prior] is assumed to be the same for all d
- So this is the same as ranking by P(q | MD)
What does it mean?
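The "the man likes the woman" computation above is a one-liner once the model is a dictionary. A minimal sketch (the function name `p_string` is mine; the probabilities are the slide's model M):

```python
from functools import reduce

# The unigram model M from the example slide. Under the independence
# assumption, P(s | M) is just the product of per-word probabilities.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(words, model):
    # Words not in the model get probability 0.0 (the zero-frequency problem)
    return reduce(lambda p, w: p * model.get(w, 0.0), words, 1.0)

s = "the man likes the woman".split()
print(p_string(s, M))  # ≈ 8e-08, i.e. 0.00000008
```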
Ranking by P(MD | q)...
Ranking by P(MD | q) means asking each model in turn: "Hey, what's the probability this query came from you?" This is the same as ranking by P(q | MD), i.e. asking each model: "Hey, what's the probability that you generated this query?" We pose the question to model1, model2, ..., modeln.

Ranking Models?
Ranking by P(q | MD) is the same as ranking documents, because each modeli is a model of documenti.

Building Document Models
How do we build a language model for a document? In the physical metaphor: what's in the urn M, i.e. which colored balls, and how many of each?

A First Try
Simply count the frequencies in the document = the maximum likelihood estimate:
  P(w | MS) = #(w, S) / |S|
where
- #(w, S) = number of times w occurs in S
- |S| = length of S
Example: for a sequence S of four balls, two of one color and one each of two others, the model assigns probabilities 1/2, 1/4, and 1/4.

Zero-Frequency Problem
Suppose some event is not in our observation S: the model will assign zero probability to that event. Then the probability of any sequence containing that event is also zero, e.g. (1/2) × (1/4) × 0 × (1/4) = 0!
Why is this a bad idea? We're modeling a document:
- Just because a word didn't appear doesn't mean it'll never appear... but it's safe to assume that unseen words are rare (analogy: fishes in the sea)
- Think of the document model as a topic: there are many documents that can be written about a single topic, and we're trying to figure out what the model is based on just one document
- Practical effect: assigning zero probability to unseen words forces exact match, but partial matches are useful also!
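The maximum likelihood estimate and the zero-frequency problem it causes can both be demonstrated in a few lines. A sketch with made-up toy data (the color words stand in for the colored-balls metaphor; the function names are mine):

```python
from collections import Counter

# Maximum likelihood estimate: P(w | M_S) = #(w, S) / |S|
def mle_model(document):
    counts, total = Counter(document), len(document)
    return {w: c / total for w, c in counts.items()}

doc = "red red blue green".split()
M = mle_model(doc)  # {"red": 0.5, "blue": 0.25, "green": 0.25}

def p_query(query, model):
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)  # unseen word -> zero probability
    return p

print(p_query("red blue".split(), M))    # 0.125
print(p_query("red yellow".split(), M))  # 0.0 -- the zero-frequency problem
```

One unseen word ("yellow") zeroes out the whole query, no matter how well the rest matches, which is exactly why smoothing is needed.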
Smoothing
The solution: "smooth" the word probabilities.
Maximum likelihood estimate:
  pML(w) = count of w / count of all words
A smoothed probability distribution assigns some nonzero probability to every word w.
How do you smooth?
- Assign some small probability to unseen events
- But remember to take away "probability mass" from other events
- Simplest example: for words you didn't see, pretend you saw each of them once
Other more sophisticated methods:
- Absolute discounting
- Linear interpolation (Jelinek-Mercer)
- Dirichlet, Witten-Bell
- Good-Turing
- ...
There is lots of performance to be gotten out of smoothing!

Recap: LM for IR
Build language models for every document:
- Models can be viewed as "topics"
- Models are "generative"
- Smoothing is very important
Retrieval:
- Estimate the probability of generating the query according to each model
- Rank the documents according to these probabilities

Advantages of LMs
- A novel way of looking at the problem of text retrieval
- Conceptually simple and explanatory
- A formal mathematical model (satisfies math envy); unfortunately, not realistic
- Natural use of collection statistics, not heuristics

Comparison With Vector Space
Similar in some ways:
- Term weights are based on frequency
- Terms are treated as if they were independent (unigram language model)
- Probabilities have the effect of length normalization
Different in others:
- Based on probability rather than similarity
- Intuitions are probabilistic (processes for generating text) rather than geometric
- Details of the use of document length and term, document, and collection frequencies differ

What's the point?
Language models formalize assumptions, none of which are true:
- Binary relevance: but relevance isn't binary
- Document independence: but documents are often not independent
- Term independence: but terms are clearly not independent
- Uniform priors: but some documents are inherently higher in quality
But it works!

One Minute Paper
What was the muddiest point in today's class?
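The whole pipeline, smoothed document models plus query-likelihood ranking, fits in a short sketch. This uses Jelinek-Mercer (linear interpolation) smoothing, one of the methods listed above; the toy documents and the choice of λ = 0.5 are illustrative assumptions, not from the slides:

```python
from collections import Counter

# Query-likelihood retrieval with Jelinek-Mercer smoothing:
#   P(w | d) = lam * P_ml(w | d) + (1 - lam) * P_ml(w | collection)
# Smoothing "borrows" probability mass from the collection model so that
# a word unseen in one document never zeroes out the whole query.

docs = {  # toy collection, for illustration only
    "d1": "the man likes the woman".split(),
    "d2": "the woman said the man said".split(),
}
collection = [w for d in docs.values() for w in d]
coll_counts, coll_len = Counter(collection), len(collection)

def score(query, doc, lam=0.5):
    counts, n = Counter(doc), len(doc)
    p = 1.0
    for w in query:
        p_doc = counts[w] / n
        p_coll = coll_counts[w] / coll_len
        p *= lam * p_doc + (1 - lam) * p_coll  # nonzero if w occurs anywhere
    return p

query = "man likes".split()
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)  # d1 first: it contains both "man" and "likes"
```

Note that d2, which lacks "likes", still gets a small nonzero score: smoothing is what turns the exact-match behavior of the unsmoothed model into the partial matching the slides argue for.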