Bayesian Learning
• Bayes theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naïve Bayes learner
• Bayesian belief networks

Two Roles for Bayesian Methods
Provide practical learning algorithms:
• Naïve Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Require prior probabilities
Provide a useful conceptual framework:
• A "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor

Bayes Theorem

   P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

Choosing Hypotheses
Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:

   h_MAP = argmax_{h ∈ H} P(h | D)
         = argmax_{h ∈ H} P(D | h) P(h) / P(D)
         = argmax_{h ∈ H} P(D | h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

   h_ML = argmax_{h_i ∈ H} P(D | h_i)

Bayes Theorem: Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.
• P(cancer) = 0.008      P(¬cancer) = 0.992
• P(+|cancer) = 0.98     P(−|cancer) = 0.02
• P(+|¬cancer) = 0.03    P(−|¬cancer) = 0.97
P(cancer|+) = ?          P(¬cancer|+) = ?

Some Formulas for Probabilities
• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
   P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B:
   P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A_1,…,A_n are mutually exclusive with Σ_{i=1..n} P(A_i) = 1, then
   P(B) = Σ_{i=1..n} P(B|A_i) P(A_i)
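The two open questions in the lab-test example can be answered by plugging the stated numbers into Bayes theorem and the theorem of total probability. Below is a minimal Python sketch of that calculation; the variable names are my own, not part of the course material.

```python
# Bayes theorem applied to the lab-test example above.
p_cancer = 0.008
p_not_cancer = 1.0 - p_cancer            # 0.992
p_pos_given_cancer = 0.98                # correct positive rate
p_pos_given_not_cancer = 1.0 - 0.97      # 1 - correct negative rate = 0.03

# Theorem of total probability over the two mutually exclusive hypotheses.
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_not_cancer * p_not_cancer)

# Bayes theorem: P(cancer | +) = P(+ | cancer) P(cancer) / P(+).
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_not_cancer_given_pos = 1.0 - p_cancer_given_pos

print(f"P(cancer|+)  = {p_cancer_given_pos:.3f}")      # ≈ 0.21
print(f"P(¬cancer|+) = {p_not_cancer_given_pos:.3f}")  # ≈ 0.79
```

Even though the test came back positive, the MAP hypothesis is ¬cancer: P(+|cancer)P(cancer) ≈ 0.0078 is smaller than P(+|¬cancer)P(¬cancer) ≈ 0.0298.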
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability
      P(h | D) = P(D | h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability
      h_MAP = argmax_{h ∈ H} P(h | D)

Relation to Concept Learning
Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})
What would Bayes rule produce as the MAP hypothesis?
Does FindS output a MAP hypothesis?

Relation to Concept Learning (cont.)
Assume a fixed set of instances (x_1,…,x_m), and let D be the set of classifications D = (c(x_1),…,c(x_m)).
Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise
Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H
Then
   P(h | D) = 1 / |VS_{H,D}|   if h is consistent with D
   P(h | D) = 0                otherwise

Learning a Real Valued Function
Consider any real-valued target function f.
Training examples (x_i, d_i), where d_i is a noisy training value:
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:
   h_ML = argmin_{h ∈ H} Σ_{i=1..m} (d_i − h(x_i))²

Learning a Real Valued Function (derivation)
   h_ML = argmax_{h ∈ H} p(D | h)
        = argmax_{h ∈ H} ∏_{i=1..m} p(d_i | h)
        = argmax_{h ∈ H} ∏_{i=1..m} (1/√(2πσ²)) e^(−½((d_i − h(x_i))/σ)²)
Maximize the natural log of this instead:
   h_ML = argmax_{h ∈ H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − ½((d_i − h(x_i))/σ)² ]
        = argmax_{h ∈ H} Σ_{i=1..m} −(d_i − h(x_i))²
        = argmin_{h ∈ H} Σ_{i=1..m} (d_i − h(x_i))²

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
   h_MDL = argmin_{h ∈ H} L_C1(h) + L_C2(D | h)
where L_C(x) is the description length of x under encoding C.
Example:
• H = decision trees, D = training data labels
• L_C1(h) is the number of bits to describe tree h
• L_C2(D|h) is the number of bits to describe D given h
   – Note L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions
• Hence h_MDL trades off tree size for training errors

Minimum Description Length Principle (cont.)
   h_MAP = argmax_{h ∈ H} P(D | h) P(h)
         = argmax_{h ∈ H} log₂ P(D | h) + log₂ P(h)
         = argmin_{h ∈ H} − log₂ P(D | h) − log₂ P(h)     (1)
Interesting fact from information theory: the optimal (shortest expected length) code for an event with probability p is −log₂ p bits.
So interpret (1):
• −log₂ P(h) is the length of h under the optimal code
• −log₂ P(D|h) is the length of D given h under the optimal code
→ prefer the hypothesis that minimizes length(h) + length(misclassifications)

Bayes Optimal Classifier
Bayes optimal classification:
   argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
Example:
   P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
   P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
   P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
therefore
   Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4   and   Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6
and so
   argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = −

Gibbs Classifier
The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
   E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Suppose a correct, uniform prior distribution over H. Then
• Pick any hypothesis from VS, with uniform probability
• Its expected error is no worse than twice Bayes optimal
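The three-hypothesis example above is small enough to do by hand, but the following Python sketch (my own illustration, not course code) makes both classification rules explicit: the Bayes optimal vote over the labels, and a single draw of the Gibbs classifier from P(h|D).

```python
import random

# Posterior over hypotheses and each hypothesis's prediction distribution,
# copied from the Bayes optimal classifier example above.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_label_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(labels=("+", "-")):
    # argmax over v of  sum_h P(v | h) P(h | D)
    scores = {v: sum(p_label_given_h[h][v] * posterior[h] for h in posterior)
              for v in labels}
    return max(scores, key=scores.get), scores

def gibbs_classify():
    # 1. draw one hypothesis according to P(h|D); 2. classify with it alone
    h = random.choices(list(posterior), weights=list(posterior.values()))[0]
    return max(p_label_given_h[h], key=p_label_given_h[h].get)

label, scores = bayes_optimal()
print(scores)            # {'+': 0.4, '-': 0.6}
print(label)             # '-' : the Bayes optimal classification
print(gibbs_classify())  # '-' with probability 0.6, '+' with probability 0.4
```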
Naïve Bayes Classifier
Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.
When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification
Successful applications:
• Diagnosis
• Classifying text documents

Naïve Bayes Classifier (cont.)
Assume target function f: X → V, where each instance x is described by attributes (a_1, a_2, …, a_n). The most probable value of f(x) is:
   v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n)
         = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j) / P(a_1, a_2, …, a_n)
         = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j)
Naïve Bayes assumption:
   P(a_1, a_2, …, a_n | v_j) = ∏_i P(a_i | v_j)
which gives the Naïve Bayes classifier:
   v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)

Naïve Bayes Algorithm
Naive_Bayes_Learn(examples)
   For each target value v_j
      P̂(v_j) ← estimate P(v_j)
      For each attribute value a_i of each attribute a
         P̂(a_i|v_j) ← estimate P(a_i|v_j)
Classify_New_Instance(x)
   v_NB = argmax_{v_j ∈ V} P̂(v_j) ∏_{a_i ∈ x} P̂(a_i|v_j)

Naïve Bayes Example
Consider CoolCar again and the new instance
   (Color=Blue, Type=SUV, Doors=2, Tires=WhiteW)
We want to compute
   v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
P(+)·P(Blue|+)·P(SUV|+)·P(2|+)·P(WhiteW|+) = 5/14 · 1/5 · 2/5 · 4/5 · 3/5 = 0.0137
P(−)·P(Blue|−)·P(SUV|−)·P(2|−)·P(WhiteW|−) = 9/14 · 3/9 · 4/9 · 3/9 · 3/9 = 0.0106
so v_NB = +.

Naïve Bayes Subtleties
1. The conditional independence assumption
      P(a_1, a_2, …, a_n | v_j) = ∏_i P(a_i | v_j)
   is often violated
   • … but it works surprisingly well anyway. Note that the estimated posteriors do not need to be correct; we need only that
        argmax_{v_j ∈ V} P̂(v_j) ∏_i P̂(a_i | v_j) = argmax_{v_j ∈ V} P(v_j) P(a_1,…,a_n | v_j)
   • see Domingos & Pazzani (1996) for analysis
   • Naïve Bayes posteriors are often unrealistically close to 1 or 0

Naïve Bayes Subtleties (cont.)
2. What if none of the training instances with target value v_j have attribute value a_i? Then
      P̂(a_i | v_j) = 0, and so P̂(v_j) ∏_i P̂(a_i | v_j) = 0
   The typical solution is a Bayesian estimate for P̂(a_i | v_j):
      P̂(a_i | v_j) ← (n_c + m·p) / (n + m)
   • n is the number of training examples for which v = v_j
   • n_c is the number of examples for which v = v_j and a = a_i
   • p is a prior estimate for P̂(a_i | v_j)
   • m is the weight given to the prior (i.e., the number of "virtual" examples)
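The sketch below is one way to implement Naive_Bayes_Learn and Classify_New_Instance, including the m-estimate from subtlety 2. The function names, data structures, and the tiny training set are my own illustration (the actual CoolCar table is not reproduced in these notes), so treat it as a sketch under those assumptions rather than a reference implementation.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples, m=1.0):
    """examples: list of (attribute_dict, label) pairs.
    Conditional probabilities use the m-estimate (n_c + m*p) / (n + m)."""
    label_counts = Counter(label for _, label in examples)
    priors = {v: c / len(examples) for v, c in label_counts.items()}  # P̂(v_j)

    counts = defaultdict(Counter)   # counts[(v, a)][value] = n_c
    values = defaultdict(set)       # values[a] = observed values of attribute a
    for attrs, v in examples:
        for a, val in attrs.items():
            counts[(v, a)][val] += 1
            values[a].add(val)

    def cond_prob(a, val, v):       # P̂(a = val | v), smoothed
        n = label_counts[v]
        n_c = counts[(v, a)][val]
        p = 1.0 / len(values[a])    # uniform prior estimate over seen values
        return (n_c + m * p) / (n + m)

    return priors, cond_prob

def classify_new_instance(x, priors, cond_prob):
    # v_NB = argmax_v  P̂(v) * prod over attributes a in x of P̂(a | v)
    def score(v):
        s = priors[v]
        for a, val in x.items():
            s *= cond_prob(a, val, v)
        return s
    return max(priors, key=score)

# Hypothetical toy data (NOT the actual CoolCar table).
train = [
    ({"Color": "Blue", "Type": "SUV",    "Doors": 2}, "+"),
    ({"Color": "Red",  "Type": "Sports", "Doors": 2}, "-"),
    ({"Color": "Blue", "Type": "Sports", "Doors": 4}, "+"),
    ({"Color": "Red",  "Type": "SUV",    "Doors": 2}, "-"),
]
priors, cond = naive_bayes_learn(train, m=1.0)
print(classify_new_instance({"Color": "Blue", "Type": "SUV", "Doors": 2},
                            priors, cond))   # prints '+' for this toy data
```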
Bayesian Belief Networks
Interesting because:
• The Naïve Bayes assumption of conditional independence is too restrictive
• But learning is intractable without some such assumptions…
• Bayesian belief networks describe conditional independence among subsets of variables
• This allows combining prior knowledge about (in)dependence among variables with observed training data
• (also called Bayes Nets)

Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
   (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
More compactly, we write P(X|Y,Z) = P(X|Z).
Example: Thunder is conditionally independent of Rain given Lightning:
   P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Naïve Bayes uses conditional independence to justify
   P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

Bayesian Belief Network
[Figure: a directed acyclic graph over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, with the conditional probability table for Campfire given its parents Storm (S) and BusTourGroup (B):]

          S,B    S,¬B   ¬S,B   ¬S,¬B
   C      0.4    0.1    0.8    0.2
   ¬C     0.6    0.9    0.2    0.8

The network represents a set of conditional independence assumptions:
• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors
• Directed acyclic graph

Bayesian Belief Network (cont.)
• Represents the joint probability distribution over all variables
• e.g., P(Storm, BusTourGroup, …, ForestFire)
• In general,
     P(y_1, …, y_n) = ∏_{i=1..n} P(y_i | Parents(Y_i))
  where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph
• So the joint distribution is fully defined by the graph, plus the tables P(y_i | Parents(Y_i))

Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
• The Bayes net contains all the information needed
• If only one variable has an unknown value, it is easy to infer it
• In the general case, the problem is NP-hard
In practice, we can succeed in many cases:
• Exact inference methods work well for some network structures
• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Learning of Bayesian Networks
Several variants of this learning task:
• The network structure might be known or unknown
• Training examples might provide values of all network variables, or just some
If the structure is known and we observe all variables:
• Then it is as easy as training a Naïve Bayes classifier

Learning Bayes Net
Suppose the structure is known and the variables are partially observable,
e.g., we observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire, …
• Similar to training a neural network with hidden units
• In fact, we can learn the network conditional probability tables using gradient ascent!
• Converge to the network h that (locally) maximizes P(D|h)

Gradient Ascent for Bayes Nets
Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:
   w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)
e.g., if Y_i = Campfire, then u_ik might be (Storm = T, BusTourGroup = F).
Perform gradient ascent by repeatedly
1. Updating all w_ijk using the training data D:
      w_ijk ← w_ijk + η Σ_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk
2. Then renormalizing the w_ijk to assure
      Σ_j w_ijk = 1   and   0 ≤ w_ijk ≤ 1

Summary of Bayes Belief Networks
• Combine prior knowledge with observed data
• The impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area:
   – Extend from Boolean to real-valued variables
   – Parameterized distributions instead of tables
   – Extend to first-order instead of propositional systems
   – More effective inference methods
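To close, here is a small Python sketch that makes the factorization P(y_1,…,y_n) = ∏ P(y_i | Parents(Y_i)) and a brute-force conditional query concrete, restricted to Storm, BusTourGroup, and Campfire. Only the Campfire table comes from the network above; the priors for Storm and BusTourGroup are invented placeholders for illustration.

```python
# Joint probability for a fragment of the belief network: Storm and BusTourGroup
# are root nodes, Campfire's parents are (Storm, BusTourGroup).
p_storm = {True: 0.2, False: 0.8}   # hypothetical prior, not from the slides
p_bus   = {True: 0.5, False: 0.5}   # hypothetical prior, not from the slides
p_campfire = {                      # P(Campfire=T | Storm, BusTourGroup), from the CPT above
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) = P(S) P(B) P(C | S, B):
    each variable is conditioned only on its parents in the graph."""
    pc = p_campfire[(storm, bus)]
    if not campfire:
        pc = 1.0 - pc
    return p_storm[storm] * p_bus[bus] * pc

# e.g. P(Storm=T, BusTourGroup=F, Campfire=T) = 0.2 * 0.5 * 0.1 = 0.01
print(joint(True, False, True))

# A brute-force conditional query, P(Campfire=T | Storm=T), by summing the joint
# over the unobserved variable (feasible only for tiny networks):
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(True, b, c) for b in (True, False) for c in (True, False))
print(num / den)   # = 0.5*0.4 + 0.5*0.1 = 0.25
```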