Bayes and Naïve Bayes
CS534 Machine Learning

Bayes Classifier
• A generative model learns P(x | y) and P(y)
• Prediction is made by
  ŷ = argmax_y P(y | x), where P(y | x) = P(x | y) P(y) / Σ_{y'} P(x | y') P(y')
• This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
• In the last lecture, we saw that for continuous x modeled with Gaussians, we get
  – a linear decision boundary (LDA) with different means and a shared covariance matrix
  – a quadratic decision boundary (QDA) with different means and different covariance matrices

Joint Density Estimation
• More generally, learning P(x | y) is a density estimation problem, which is a challenging task
• Now we will consider the case where x is a d-dimensional binary vector
• How do we learn P(x | y) in this case?

The Joint Distribution (Copyright © Andrew W. Moore)
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C
A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
[Venn diagram of A, B, C showing the same probabilities as overlapping regions]

Learning a Joint Distribution
• Build a JD table for your attributes in which the probabilities are initially unspecified, then fill in each row with
  P̂(row) = (# of records matching row) / (total number of records)
• For example, the entry for row (A = 1, B = 1, C = 0) is the fraction of all records in which A and B are true but C is false
• (A short code sketch of this counting recipe follows below.)

Example of Learning a Joint
• This joint was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
• UCI machine learning repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
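To make the counting recipe concrete, here is a minimal sketch in Python. It assumes records are stored as tuples of 0/1 values; the variable count and the toy records below are made up for illustration and are not the census data mentioned above.

```python
from collections import Counter
from itertools import product

def learn_joint_table(records, num_vars):
    """Estimate a joint distribution table by counting.

    Each row's probability is (# records matching row) / (total # records),
    exactly as in the recipe above.
    """
    counts = Counter(records)
    total = len(records)
    table = {}
    for row in product([0, 1], repeat=num_vars):   # all 2^M combinations
        table[row] = counts[row] / total           # unseen rows get probability 0
    return table

# Toy data over three Boolean attributes (A, B, C); values are made up.
records = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 0)]
joint = learn_joint_table(records, num_vars=3)
print(joint[(1, 1, 0)])   # fraction of records where A and B are true, C false
assert abs(sum(joint.values()) - 1.0) < 1e-9      # probabilities sum to 1
```

Note that any value combination absent from the data gets probability exactly 0, which is the overfitting problem discussed next.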
Learning a Joint Distribution Overfits
• Let x be a d-dimensional binary vector, and y ∈ {1, 2, …, K}
• Learning the joint distribution P(x | y) for y = 1, …, K involves estimating K(2^d − 1) parameters (2^d − 1 for each class)
  – For a large d, this number is prohibitively large
  – Typically we don’t have enough data to estimate the joint distribution accurately
  – It is common to encounter a situation where no training example has the exact (x_1, …, x_d) value combination; then P(x_1, …, x_d | y) = 0 for all values of y
  – This presents severe overfitting

Naïve Bayes Assumption
• Assume that each feature is independent of every other feature given the class label
• Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
  ∀ i, j, k  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
  Often denoted X ⊥ Y | Z
• For example, once Z is known, observing Y provides no additional information about X
• If X and Y are conditionally independent given Z, we have:
  P(X, Y | Z) = P(X | Z) P(Y | Z)

Conditional Independence vs. Independence
• Conditional independence: P(X | Y, Z) = P(X | Z), or equivalently P(X, Y | Z) = P(X | Z) P(Y | Z)
• Independence: P(X | Y) = P(X), or equivalently P(X, Y) = P(X) P(Y)
• Does one imply the other? No: conditional independence does not imply independence, and independence does not imply conditional independence

Naïve Bayes Classifier
• Under the Naïve Bayes assumption, we have:
  P(x | y) = Π_{i=1}^{d} P(x_i | y)
• Thus, there is no need to estimate the joint distribution
• Instead, we only need to estimate P(x_i | y) for each feature
• Example: with d binary features and K classes, this reduces the number of parameters from K(2^d − 1) to Kd
  – Significantly reduces overfitting

Example: Spam Filtering
• Bag-of-words representation of emails
• Represent an email by a vector x whose dimension equals the number of words in our “dictionary”

Example: Bernoulli Features
• x_i = 1 if the i-th word is present; x_i = 0 if the i-th word is not present
• The ordering/position of the words does not matter
• The “dictionary” can be formed by looking through the training set and identifying all the words and tokens that appear at least once (with stop-words like “the” and “and” removed)

MLE for Naïve Bayes with the Bernoulli Model
• Suppose our training set contains N emails; the maximum likelihood estimates of the parameters are:
  P(y = 1) = N_1 / N, where N_1 is the number of spam emails
  P(x_i = 1 | y = 1) = N_{i|1} / N_1, i.e., the fraction of spam emails in which word i appeared
  P(x_i = 1 | y = 0) = N_{i|0} / N_0, i.e., the fraction of non-spam emails in which word i appeared

Naïve Bayes Prediction
• To make a prediction for a new example with features x ∈ {0, 1}^d:
  ŷ = argmax_y P(y) Π_{i=1}^{d} P(x_i | y)
• (A code sketch of the Bernoulli model and this prediction rule follows below.)

Discrete and Continuous Features
• Naïve Bayes can be easily extended to handle features that are not binary-valued
• Discrete: for multi-valued features, use a categorical distribution in place of the Bernoulli
• Continuous:
  – Discretize the feature, then build a categorical distribution for each feature
  – When the feature does not follow a Gaussian distribution, this can result in a better classifier than fitting a Gaussian to it

Problem with MLE
• Suppose you picked up a new word “Mahalanobis” in your class and started using it in your emails
• Because “Mahalanobis” (say it is the (n+1)-th word in the vocabulary) has never appeared in any of the training emails, the probability estimates for this word will be
  P(x_{n+1} = 1 | y = 1) = 0 and P(x_{n+1} = 1 | y = 0) = 0
• Now Π_i P(x_i | y) = 0 for both y = 0 and y = 1
• Given limited training data, MLE can often result in probabilities of 0 or 1. Such extreme probabilities are “too strong” and cause problems
• To solve this problem, we can use the Maximum A Posteriori (MAP) estimate, which requires us to think of the parameters as random variables, a Bayesian view of parameters
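Below is a minimal sketch of the Bernoulli naïve Bayes estimates and the prediction rule above, written with NumPy; the array names X and y and the toy emails are made up for illustration. The product Π_i P(x_i | y) is computed in log space, and the sketch assumes every estimated probability lies strictly between 0 and 1; a zero, as in the “Mahalanobis” example, would produce -inf or NaN here, which is exactly the problem MAP estimation addresses.

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """MLE for the Bernoulli naive Bayes model.

    X: (N, d) array of 0/1 word-presence features; y: (N,) array of 0/1 labels.
    Returns P(y=1) and the per-class word probabilities P(x_i = 1 | y).
    """
    prior_spam = np.mean(y == 1)              # P(y=1) = N_1 / N
    p_word_spam = X[y == 1].mean(axis=0)      # N_{i|1} / N_1
    p_word_ham = X[y == 0].mean(axis=0)       # N_{i|0} / N_0
    return prior_spam, p_word_spam, p_word_ham

def predict(x, prior_spam, p_word_spam, p_word_ham):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    def log_likelihood(x, p):
        # log P(x_i | y) for a Bernoulli feature: x_i*log(p_i) + (1-x_i)*log(1-p_i)
        return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    log_spam = np.log(prior_spam) + log_likelihood(x, p_word_spam)
    log_ham = np.log(1 - prior_spam) + log_likelihood(x, p_word_ham)
    return int(log_spam > log_ham)

# Toy usage: 6 emails over a 3-word dictionary (made-up data).
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1],    # spam
              [1, 0, 0], [0, 1, 0], [0, 0, 1]])   # non-spam
y = np.array([1, 1, 1, 0, 0, 0])
prior_spam, p_spam, p_ham = fit_bernoulli_nb(X, y)
print(predict(np.array([1, 1, 0]), prior_spam, p_spam, p_ham))   # -> 1 (spam)
```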
Bayesian Parameter Estimation
• In the Bayesian framework, we consider the parameters we are trying to estimate to be random variables
• We give them a prior distribution, which is used to capture our prior belief about the parameter
• When the data is sparse, this allows us to fall back on the prior and avoid the issues faced by MLE

Example: Bernoulli
• Given an unfair coin, we want to estimate θ, the probability of heads
• We toss the coin N times and observe n_H heads
• MLE estimate: θ̂ = n_H / N
• Now let’s go Bayesian and assume that θ is a random variable with a prior distribution
• For reasons that will become clear later, we assume the following prior for θ:
  θ ~ Beta(α, β),  p(θ; α, β) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)
• Depending on α and β, the Beta density can be uniform (α = β = 1), U-shaped (α, β < 1), or unimodal and symmetric (α = β > 1)

Posterior Distribution of θ
  p(θ | D) ∝ p(D | θ) p(θ)
           ∝ θ^{n_H} (1 − θ)^{N − n_H} · θ^{α−1} (1 − θ)^{β−1}
           = θ^{n_H + α − 1} (1 − θ)^{N − n_H + β − 1}
so θ | D ~ Beta(n_H + α, N − n_H + β).
Note that the posterior has exactly the same form as the prior. This is not a coincidence; it is due to careful selection of the prior distribution: a conjugate prior.

Maximum a Posteriori (MAP)
• A full Bayesian treatment does not care about estimating the parameters
• Instead, it uses the posterior distribution to make predictions for the target variable
• But if we do need a concrete estimate of the parameters, we can use Maximum A Posteriori estimation:
  θ̂_MAP = argmax_θ p(θ | D)

MAP for Bernoulli
• p(θ | D) = Beta(n_H + α, N − n_H + β)
• Mode: the maximum point of Beta(a, b) is (a − 1) / (a + b − 2)
• In this case we have θ̂_MAP = (n_H + α − 1) / (N + α + β − 2)
• Let α = 2, β = 2; then θ̂_MAP = (n_H + 1) / (N + 2)
• Compared to the MLE, this is like adding some fake coin tosses (one head, one tail) to the observed data; this is also sometimes called Laplace smoothing

Laplace Smoothing
• Suppose we estimate a probability P(z), and we have n_0 examples where z = 0 and n_1 examples where z = 1
  MLE estimate: P(z = 1) = n_1 / (n_0 + n_1)
• Laplace smoothing: add 1 to the numerator and 2 to the denominator:
  P(z = 1) = (n_1 + 1) / (n_0 + n_1 + 2)
• If we don’t observe any examples, we expect P(z = 1) = 0.5, but our belief is weak (equivalent to seeing one example of each outcome)
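A small numeric sketch of the MLE versus MAP estimates for the coin example above, assuming the Beta(α, β) prior; with α = β = 2 the MAP estimate reduces to the Laplace-smoothed (n_H + 1) / (N + 2).

```python
def bernoulli_mle(n_heads, n_tosses):
    """MLE of the head probability: n_H / N (fails when N = 0)."""
    return n_heads / n_tosses

def bernoulli_map(n_heads, n_tosses, alpha=2.0, beta=2.0):
    """MAP estimate under a Beta(alpha, beta) prior.

    The posterior is Beta(n_H + alpha, N - n_H + beta); its mode (a-1)/(a+b-2)
    gives (n_H + alpha - 1) / (N + alpha + beta - 2).
    With alpha = beta = 2 this is Laplace smoothing: (n_H + 1) / (N + 2).
    """
    return (n_heads + alpha - 1.0) / (n_tosses + alpha + beta - 2.0)

print(bernoulli_mle(3, 10))    # 0.3
print(bernoulli_map(3, 10))    # (3 + 1) / (10 + 2) = 0.333...
print(bernoulli_map(0, 0))     # 0.5: with no data we fall back to the prior mode
```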
MAP for Naïve Bayes Spam Filter
• When estimating P(x_i = 1 | y = 1) and P(x_i = 1 | y = 0):
  – Bernoulli case, MLE:  P(x_i = 1 | y = 0) = N_{i|0} / N_0
  – Bernoulli case, MAP:  P(x_i = 1 | y = 0) = (N_{i|0} + 1) / (N_0 + 2)
• When we encounter a new word that has not appeared in the training set, the probabilities no longer go to zero
• This is called Laplace smoothing

Multinomial Event Model for Text
• Instead of treating the absence/presence of each word in the dictionary as a Bernoulli, we can treat each word in the email as a roll of a |V|-sided die, where |V| is the total size of the dictionary
• Each of the |V| words has a fixed probability of being selected
• This allows us to take the counts of the words into consideration
• (A code sketch of this model appears after the summary below.)

MLE for Naïve Bayes with the Multinomial Model
• The likelihood of observing one email x consisting of words w_1, …, w_n:
  P(x | y) = Π_{j=1}^{n} P(w_j | y)
• MLE estimate for the k-th word in the dictionary:
  P(w = k | y) = (total count of word k in emails of class y) / (total number of words in emails of class y)
• Total number of parameters: |V| − 1 word probabilities per class, plus the class prior

MAP Estimation for the Multinomial
• The conjugate prior for the multinomial is called the Dirichlet distribution
• For K outcomes, a Dirichlet distribution has K parameters, each serving a similar purpose as the Beta distribution’s parameters
  – Acting as fake observation(s) for the corresponding outcome; the count depends on the value of the parameter
• Laplace smoothing for this case:
  P(z = k) = (n_k + 1) / (n + K)

Laplace Smoothing for the Multinomial Case
• MLE:  P(w = k | y = 0) = (total # of occurrences of word k in non-spam emails) / (total # of words in non-spam emails)
• MAP:  P(w = k | y = 0) = (total # of occurrences of word k in non-spam emails + 1) / (total # of words in non-spam emails + |V|)
  where |V| is the size of the vocabulary

Naïve Bayes Summary
• Generative classifier
  – Learn P(x | y) and P(y)
  – Use the Bayes rule to compute P(y | x) for classification
• Assumes conditional independence between features given the class label
  – Greatly reduces the number of parameters to learn
• MAP estimation (or Laplace smoothing) is necessary to avoid overfitting and extreme probability values
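To make the multinomial event model concrete, here is a minimal sketch under the lecture’s assumptions: each email is a list of word indices into a vocabulary of size |V|, labels are 0 (non-spam) or 1 (spam), and the word probabilities use the Laplace-smoothed (count + 1) / (total + |V|) estimate from the slides. The function names and the toy data are illustrative, not from the lecture.

```python
import math
from collections import Counter

def fit_multinomial_nb(emails, labels, vocab_size):
    """Fit the multinomial event model with Laplace smoothing.

    emails: list of emails, each a list of word indices in [0, vocab_size).
    Returns class priors and P(word = k | y) for each class.
    """
    priors, word_probs = {}, {}
    for c in (0, 1):
        docs = [e for e, lab in zip(emails, labels) if lab == c]
        priors[c] = len(docs) / len(emails)            # plain MLE class prior, as in the slides
        counts = Counter(w for e in docs for w in e)   # word counts within class c
        total = sum(counts.values())
        # MAP / Laplace: (count of word k in class c + 1) / (total words in class c + |V|)
        word_probs[c] = [(counts[k] + 1) / (total + vocab_size) for k in range(vocab_size)]
    return priors, word_probs

def predict(email, priors, word_probs):
    """argmax_y of log P(y) + sum_j log P(w_j | y) over the words in the email."""
    scores = {}
    for c in (0, 1):
        scores[c] = math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in email)
    return max(scores, key=scores.get)

# Toy usage (made-up data): vocabulary of 4 words, 2 spam and 2 non-spam emails.
emails = [[0, 0, 1], [0, 2], [2, 3], [3, 3, 2]]
labels = [1, 1, 0, 0]
priors, word_probs = fit_multinomial_nb(emails, labels, vocab_size=4)
print(predict([0, 1], priors, word_probs))   # classified as spam (1)
```

Because of the smoothing, a word that never occurred in one class still gets probability 1 / (total + |V|) rather than zero, so a single unseen word can no longer force the whole product to zero.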