LESS NAÏVE CONDITIONAL CLASSIFIERS
October 5: Lecture 10

Independence: the key assumption for the naïve Bayes classifier
• In general we have a learning problem where we observe some data and want to figure out the posterior probability of something happening. In other words, the data alter our prior expectations. Which data are relevant?
• Say we have three distinct random variables X, Y, Z.
• We say that X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z).
• In other words, once I give you Z, Y is totally irrelevant for estimating the probability of X. This can be generalized to sets of variables.

Independence: the key assumption for the naïve Bayes classifier
• The naïve Bayes classifier reads the data and works under the assumption that any given attribute is independent of any other attribute given the target value V. What is the probability that two things happen simultaneously given V? Ask the question more gradually…
• This assumption is crucial for efficiency but not true in practice. Barack ……. vs. Barack…..

Bayesian Belief Networks: encoding causality
• A picture is worth 1000 words…
• Ovals are variables, arrows indicate causality. The graph is acyclic.
• A variable Y is a descendant of X if there is a directed path from X to Y.
• A variable X is conditionally independent of its nondescendants given the values of its parents.

Bayesian Belief Networks: encoding causality
• In general a BBN represents the joint probability distribution of a set of variables, along with some conditional dependencies.
• For example, Campfire is independent of Lightning and Thunder once Storm and BusTourGroup are known.

Bayesian Belief Networks: encoding causality
• We want to compute the joint probability. What is missing? A conditional probability table P(node | parents); this table should be known for every oval.

Bayesian Belief Networks: encoding causality
• What is the joint probability? It factors as P(y1, …, yn) = ∏i P(yi | Parents(Yi)).
• How is this different from the case of full independence? There the joint would simply be ∏i P(yi), with no conditioning on parents. (A small numerical sketch of this product appears below, just before the galaxy example.)

Bayesian Belief Networks: encoding causality
• Bayesian Belief Networks can be used to infer the joint probability distribution of a subset of the variables, given specific observed values of some other variables.
• In general, inferring the exact distribution is NP-hard. Even approximating a distribution is a hard problem. In practice, though, sampling methods seem to work satisfactorily.
• In the learning problem we want to learn the tables that go with the ovals. The objective function we try to maximize is P(D|h), i.e. we look for the hypothesis (the tables) that maximizes the probability of observing the data. One algorithm for this is gradient ascent, following the gradient of ln P(D|h).
• With more advanced algorithms we can also learn the acyclic graph itself. This helps discover causal relationships, for example in data involving genes and diseases.

Learning Hidden Creatures
• In many experiments we cannot afford to observe all attributes, or it may be impossible to observe them all.
• In fact, in some cases some random variables may be eluding our grasp. We need to look carefully to see that they are there.
• The EM algorithm is a general technique that maintains a current hypothesis h and estimates the missing data according to it. It then uses both the actually observed data and the estimates of the missing data to find a (better) new hypothesis h' that becomes the new h. The whole process is iterated and converges to a local maximum.
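Returning to the belief-network factorization above, here is a minimal sketch of how the joint probability of one complete assignment is computed as a product of P(node | parents) terms, using the Storm / BusTourGroup / Lightning / Campfire / Thunder variables from the lecture. Every number in the conditional probability tables below is made up purely for illustration, not taken from the lecture.

```python
# Minimal sketch: joint probability of a small belief network as a product
# of P(node | parents) terms.  The structure follows the lecture's example;
# all table entries are hypothetical.

# Each node stores its parents and P(node = True | parent values).
network = {
    "Storm":        {"parents": [],                        "cpt": {(): 0.2}},
    "BusTourGroup": {"parents": [],                        "cpt": {(): 0.5}},
    "Lightning":    {"parents": ["Storm"],                 "cpt": {(True,): 0.7, (False,): 0.05}},
    "Campfire":     {"parents": ["Storm", "BusTourGroup"], "cpt": {(True, True): 0.1, (True, False): 0.2,
                                                                   (False, True): 0.8, (False, False): 0.3}},
    "Thunder":      {"parents": ["Lightning"],             "cpt": {(True,): 0.95, (False,): 0.01}},
}

def joint_probability(assignment):
    """P(y1, ..., yn) = product over nodes of P(yi | Parents(Yi))."""
    prob = 1.0
    for name, node in network.items():
        parent_values = tuple(assignment[p] for p in node["parents"])
        p_true = node["cpt"][parent_values]
        prob *= p_true if assignment[name] else (1.0 - p_true)
    return prob

# One full assignment of all five variables:
print(joint_probability({"Storm": True, "BusTourGroup": True, "Campfire": False,
                         "Lightning": True, "Thunder": True}))
```

Under full independence, every node's table would be unconditional and the same product would reduce to ∏i P(yi); the parents in the tables are exactly what the network adds.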
Learning Hidden Creatures (I am totally making this up)
• Consider the following problem. We have a radio telescope that receives radio activity from three different galaxies, and we want to estimate their distances. The average intensity of the activity is proportional to the distance, but there are fluctuations that follow a Gaussian distribution.

Learning Hidden Creatures
• In the simple case we have only one distribution whose mean we want to estimate, given data x1, …, xm.
• Then we just take the average, μ = (1/m) Σi xi, which is also the value that minimizes the sum of squared errors.
• We can also calculate confidence intervals.

Learning Hidden Creatures: mixture of Gaussians
• But now we have a mixture of Gaussians.
• The key is to imagine the hidden variables that will make our life easier.
• Every data point xi is now viewed as a triple <xi, zi1, zi2>.
• zi1 is 1 if xi comes from the first Gaussian and 0 otherwise; similarly for zi2.

Learning Hidden Creatures: an instance of EM
• But we don't know the zi's! Otherwise the problem would be just two separate cases of one Gaussian each.
• Find some first estimates for μ1 and μ2.
• E step: compute E[zij], the probability that the instance xi was generated by the jth distribution:
  E[zij] = exp(-(xi - μj)² / 2σ²) / Σn=1..2 exp(-(xi - μn)² / 2σ²)

Learning Hidden Creatures: an instance of EM
• M step: now update the estimates. This is accomplished with
  μj ← Σi E[zij] xi / Σi E[zij]
• (A small numerical sketch of these two steps appears at the end of these notes.)

Homework
Read and understand the derivation of EM.
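To complement the derivation, here is a minimal sketch of the two-Gaussian E and M steps above. It assumes a known common variance σ², equal mixing weights, and two components; the data points, the initial means, and the number of iterations are made up purely for illustration.

```python
import math

# EM for a mixture of two Gaussians with known, shared variance sigma^2
# and equal mixing weights.  Data and initial means are hypothetical.
sigma = 1.0
data = [1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1]   # observations x_i
mu = [0.0, 6.0]                                    # initial guesses for mu_1, mu_2

def weight(x, m):
    # Unnormalized Gaussian density; the normalizing constant cancels
    # in the E-step ratio because sigma is shared by both components.
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2))

for _ in range(20):
    # E step: E[z_ij] = probability that x_i was generated by Gaussian j
    e = []
    for x in data:
        w = [weight(x, m) for m in mu]
        total = sum(w)
        e.append([wj / total for wj in w])

    # M step: mu_j <- sum_i E[z_ij] * x_i / sum_i E[z_ij]
    mu = [
        sum(e[i][j] * data[i] for i in range(len(data))) /
        sum(e[i][j] for i in range(len(data)))
        for j in range(2)
    ]

print(mu)   # should approach roughly the two cluster means (~1.0 and ~5.0)
```

Each iteration first fills in the hidden zij values in expectation using the current means, then refits the means as weighted averages, which is exactly the E/M alternation described in the slides.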