Probability: Review

• The state of the world is described using random variables
• Probabilities are defined over events
  – Sets of world states characterized by propositions about random variables
  – E.g., D1, D2: rolls of two dice
    • P(D1 > 2)
    • P(D1 + D2 = 11)
  – W: the state of the weather
    • P(W = “rainy” ∨ W = “sunny”)

Kolmogorov’s axioms of probability

• For any propositions (events) A, B:
  – 0 ≤ P(A) ≤ 1
  – P(True) = 1 and P(False) = 0
  – P(A ∨ B) = P(A) + P(B) – P(A ∧ B)

Joint probability distributions

• A joint distribution is an assignment of probabilities to every possible atomic event

    P(Cavity, Toothache)
    Cavity = false ∧ Toothache = false    0.8
    Cavity = false ∧ Toothache = true     0.1
    Cavity = true  ∧ Toothache = false    0.05
    Cavity = true  ∧ Toothache = true     0.05

Marginal probability distributions

• From the joint distribution above, what are the marginals?

    P(Cavity)                  P(Toothache)
    Cavity = false     ?       Toothache = false     ?
    Cavity = true      ?       Toothache = true      ?

• Given the joint distribution P(X, Y), how do we find the marginal distribution P(X)?

    P(X = x) = P((X = x ∧ Y = y_1) ∨ … ∨ (X = x ∧ Y = y_n))
             = P((x, y_1) ∨ … ∨ (x, y_n))
             = ∑_{i=1}^{n} P(x, y_i)

• General rule: to find P(X = x), sum the probabilities of all atomic events where X = x

Conditional probability

• For any two events A and B,

    P(A | B) = P(A ∧ B) / P(B) = P(A, B) / P(B)

  [Figure: diagram of events A and B with their overlap A ∧ B]

Conditional distributions

• A conditional distribution is a distribution over the values of one variable given fixed values of other variables
• From the joint distribution P(Cavity, Toothache) above:

    P(Cavity | Toothache = true)        P(Cavity | Toothache = false)
    Cavity = false     0.667            Cavity = false     0.941
    Cavity = true      0.333            Cavity = true      0.059

    P(Toothache | Cavity = true)        P(Toothache | Cavity = false)
    Toothache = false  0.5              Toothache = false  0.889
    Toothache = true   0.5              Toothache = true   0.111
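The conditional tables above follow mechanically from the joint table. As a minimal sketch (not part of the slides; the names joint, marginal_cavity, and conditional_cavity_given_toothache are illustrative), the same computations in Python:

    # Joint distribution P(Cavity, Toothache) from the table above,
    # keyed by (cavity, toothache) truth values.
    joint = {
        (False, False): 0.8,
        (False, True):  0.1,
        (True,  False): 0.05,
        (True,  True):  0.05,
    }

    def marginal_cavity(c):
        """P(Cavity = c): sum the probabilities of all atomic events where Cavity = c."""
        return sum(p for (cav, tooth), p in joint.items() if cav == c)

    def conditional_cavity_given_toothache(c, t):
        """P(Cavity = c | Toothache = t) = P(c, t) / P(Toothache = t)."""
        p_t = sum(p for (cav, tooth), p in joint.items() if tooth == t)
        return joint[(c, t)] / p_t

    print(marginal_cavity(True))                             # 0.05 + 0.05 = 0.1
    print(conditional_cavity_given_toothache(False, True))   # 0.1 / 0.15 ≈ 0.667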
Normalization trick

• To get the whole conditional distribution P(X | y) at once, select all entries in the joint distribution matching Y = y and renormalize them to sum to one
• Example: select the entries of P(Cavity, Toothache) with Cavity = false

    Toothache = false    0.8
    Toothache = true     0.1

  and renormalize to obtain P(Toothache | Cavity = false):

    Toothache = false    0.889
    Toothache = true     0.111

• Why does it work?

    P(x | y) = P(x, y) / P(y) = P(x, y) / ∑_{x′} P(x′, y)

  since P(y) = ∑_{x′} P(x′, y) by marginalization

Product rule

• Definition of conditional probability:  P(A | B) = P(A, B) / P(B)
• Sometimes we have the conditional probability and want to obtain the joint:

    P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• The chain rule:

    P(A_1, …, A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) … P(A_n | A_1, …, A_{n−1})
                   = ∏_{i=1}^{n} P(A_i | A_1, …, A_{i−1})

Bayes Rule

• The product rule gives us two ways to factor a joint distribution:

    P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Therefore,

    P(A | B) = P(B | A) P(A) / P(B)

  (named after Rev. Thomas Bayes, 1702–1761)
• Why is this useful?
  – Can get a diagnostic probability, e.g., P(cavity | toothache), from a causal probability, e.g., P(toothache | cavity):

      P(Cause | Evidence) = P(Evidence | Cause) P(Cause) / P(Evidence)

  – Can update our beliefs based on evidence
  – Important tool for probabilistic inference

Independence

• Two events A and B are independent if and only if P(A, B) = P(A) P(B)
  – In other words, P(A | B) = P(A) and P(B | A) = P(B)
  – This is an important simplifying assumption for modeling; e.g., Toothache and Weather can be assumed to be independent
• Are two mutually exclusive events independent?
  – No, but for mutually exclusive events we have P(A ∨ B) = P(A) + P(B)
• Conditional independence: A and B are conditionally independent given C iff P(A, B | C) = P(A | C) P(B | C)

Conditional independence: Example

• Toothache: boolean variable indicating whether the patient has a toothache
• Cavity: boolean variable indicating whether the patient has a cavity
• Catch: whether the dentist’s probe catches in the cavity
• If the patient has a cavity, the probability that the probe catches in it doesn’t depend on whether he/she has a toothache:

    P(Catch | Toothache, Cavity) = P(Catch | Cavity)

  Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity:

    P(Toothache | Catch, Cavity) = P(Toothache | Cavity)

• Equivalent statement: P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

• How many numbers do we need to represent the joint probability table P(Toothache, Cavity, Catch)?
  2^3 − 1 = 7 independent entries
• Write out the joint distribution using the chain rule:

    P(Toothache, Catch, Cavity) = P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity)
                                = P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)

• How many numbers do we need to represent these distributions?
  1 + 2 + 2 = 5 independent numbers
• In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
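As a sanity check on the counting argument above, here is a small Python sketch (the numbers are made up for illustration and are not from the slides) that stores only the 1 + 2 + 2 = 5 parameters of the factored model and reconstructs any entry of the full 2^3-entry joint table from them:

    from itertools import product

    # Factored model: P(Cavity), P(Catch | Cavity), P(Toothache | Cavity).
    p_cavity = 0.1                                   # 1 number (P(¬cavity) = 1 − 0.1)
    p_catch_given_cavity = {True: 0.9, False: 0.2}   # 2 numbers, one per value of Cavity
    p_tooth_given_cavity = {True: 0.6, False: 0.1}   # 2 numbers, one per value of Cavity

    def joint(toothache, catch, cavity):
        """P(Toothache, Catch, Cavity) = P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)."""
        p_c = p_cavity if cavity else 1 - p_cavity
        p_k = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
        p_t = p_tooth_given_cavity[cavity] if toothache else 1 - p_tooth_given_cavity[cavity]
        return p_c * p_k * p_t

    # The 8 reconstructed entries (7 of them independent) sum to one, as a joint must.
    print(sum(joint(t, k, c) for t, k, c in product([True, False], repeat=3)))  # ≈ 1.0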
Naïve Bayes model

• Suppose we have many different types of observations (symptoms, features) that we want to use to diagnose the underlying cause
• It is usually impractical to directly estimate or store the joint distribution P(Cause, Effect_1, …, Effect_n)
• To simplify things, we can assume that the different effects are conditionally independent given the underlying cause
• Then we can estimate the joint distribution as

    P(Cause, Effect_1, …, Effect_n) = P(Cause) ∏_i P(Effect_i | Cause)

• This is usually not accurate, but very useful in practice

Example: Naïve Bayes Spam Filter

• Bayesian decision theory: to minimize the probability of error, we should classify a message as spam if
  P(spam | message) > P(¬spam | message)
  – Maximum a posteriori (MAP) decision
• Apply Bayes rule to the posterior:

    P(spam | message)  = P(message | spam) P(spam) / P(message)
    P(¬spam | message) = P(message | ¬spam) P(¬spam) / P(message)

• Notice that P(message) is just a constant normalizing factor and doesn’t affect the decision
• Therefore, to classify the message, all we need is to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)

• The message is a sequence of words (w_1, …, w_n)
• Bag of words representation
  – The order of the words in the message is not important
  – Each word is conditionally independent of the others given the message class (spam or not spam)

Bag of words illustration

  [Figure: US Presidential Speeches Tag Cloud, http://chir.ag/projects/preztags/]

Example: Naïve Bayes Spam Filter

• Under the bag of words assumption,

    P(message | spam) = P(w_1, …, w_n | spam) = ∏_{i=1}^{n} P(w_i | spam)

• Our filter will classify the message as spam if

    P(spam) ∏_{i=1}^{n} P(w_i | spam) > P(¬spam) ∏_{i=1}^{n} P(w_i | ¬spam)

• Equivalently,

    P(spam | w_1, …, w_n) ∝ P(spam) ∏_{i=1}^{n} P(w_i | spam)
         posterior            prior         likelihood
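A minimal sketch of this decision rule in Python (not from the slides; classify, p_word_given_spam, and p_word_given_ham are hypothetical names for the model parameters, with “ham” standing for ¬spam). It assumes every word in the message has an estimated likelihood; in practice the products are usually computed in log space to avoid numerical underflow on long messages.

    def class_score(words, prior, likelihood):
        """P(class) * prod_i P(w_i | class) -- the quantity compared by the MAP rule."""
        score = prior
        for w in words:
            score *= likelihood[w]   # assumes every word has an estimated likelihood
        return score

    def classify(words, p_spam, p_word_given_spam, p_word_given_ham):
        """Return 'spam' iff P(spam) * prod P(w|spam) exceeds the same quantity for ¬spam."""
        spam_score = class_score(words, p_spam, p_word_given_spam)
        ham_score = class_score(words, 1.0 - p_spam, p_word_given_ham)
        return "spam" if spam_score > ham_score else "not spam"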
Parameter estimation

• In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)
  – These are the parameters of the probabilistic model

  [Example figure: prior P(spam) = 0.33 and P(¬spam) = 0.67, alongside tables of P(word | spam) and P(word | ¬spam)]

• How do we obtain the values of these parameters?
  – Empirically: use training data

      P(word | spam) = (# of occurrences of word in spam messages) / (total # of words in spam messages)

  – This is the maximum likelihood (ML) estimate, i.e., the estimate that maximizes the likelihood of the training data:

      ∏_{d=1}^{D} ∏_{i=1}^{n_d} P(w_{d,i} | class_d)

    where d indexes the training documents, i indexes the words within document d, and class_d is the class (spam or ¬spam) of document d
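A sketch of these counting-based ML estimates (illustrative only: the function ml_estimates and the assumption that training data arrives as a list of (word list, label) pairs, with the label "ham" standing for ¬spam, are not from the slides):

    from collections import Counter

    def ml_estimates(training_docs):
        """training_docs: list of (words, label) pairs, label in {'spam', 'ham'}.
        Returns the ML prior P(spam) and per-class word likelihoods
        P(word | class) = (# occurrences of word in class) / (total # words in class)."""
        word_counts = {"spam": Counter(), "ham": Counter()}
        doc_counts = Counter()
        for words, label in training_docs:
            word_counts[label].update(words)
            doc_counts[label] += 1

        p_spam = doc_counts["spam"] / sum(doc_counts.values())
        likelihoods = {}
        for label, counts in word_counts.items():
            total = sum(counts.values())
            likelihoods[label] = {w: c / total for w, c in counts.items()}
        return p_spam, likelihoods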