Introduction to Probability (4)
John Kelleher & Brian Mac Namee

Overview
1. Introduction
   Overview
2. Naive Bayes Models
   Theory
   Example: Learning to Classify Text
3. Estimating Probabilities and Smoothing


Naive Bayes Models

Theory

A simple form of Bayesian network, the naive Bayes approach is a common classification technique. The naive Bayes classifier applies to learning tasks where:

1. Each instance x is described by a conjunction of attribute values ⟨a1, ..., an⟩, and the target function f(x) can take on any value from some finite set C.
2. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values ⟨a1, ..., an⟩.

The learner is asked to predict the target value, or classification, for this new instance.

Naive Bayes Model

The model is naive because it assumes that the attributes are conditionally independent of each other given the class. With the observed attribute values a1, ..., an for the new instance that is to be classified, the probability of each class is given by:

    P(C | a1, ..., an) = α P(C) ∏_i P(ai | C)

A deterministic prediction is obtained by choosing the most likely class; probabilistic predictions can also be given. (A short code sketch of this rule is given below.)

The naive Bayes learning method involves a learning step in which the various P(cj) and P(ai | cj) terms are estimated based on their frequency over the training data. Naive Bayes learning has some attractive properties:

1. Naive Bayes learning scales well to very large problems: with n Boolean variables, there are just 2n + 1 parameters.
2. With a naive Bayes system there is no search required to find h_ML, the maximum likelihood naive Bayes hypothesis; instead the hypothesis is formed, without searching, simply by counting the frequency of various data combinations within the training examples.
3. Naive Bayes systems have little difficulty with noisy or missing data and can give probabilistic predictions when appropriate.
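As a minimal sketch of the classification rule above (not part of the original slides), the code below assumes the prior P(C) and the conditional probabilities P(ai | C) have already been estimated from training-data frequencies. The function names (predict_proba, predict) and the toy conditional probabilities in the usage example are purely illustrative; only the class priors echo the text-classification example later in these notes.

```python
# Minimal sketch of the rule P(C | a1, ..., an) = alpha * P(C) * prod_i P(ai | C).
# The priors and per-attribute likelihoods are assumed to already be estimated;
# all names and the toy numbers below are illustrative.

def predict_proba(priors, likelihoods, instance):
    """priors:      {class: P(class)}
    likelihoods: {class: {attribute: {value: P(value | class)}}}
    instance:    {attribute: value} describing the new instance to classify."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            score *= likelihoods[c][attr][value]
        scores[c] = score
    alpha = 1.0 / sum(scores.values())            # normalising constant
    return {c: alpha * s for c, s in scores.items()}

def predict(priors, likelihoods, instance):
    """Deterministic prediction: choose the most likely class."""
    proba = predict_proba(priors, likelihoods, instance)
    return max(proba, key=proba.get)

# Toy usage: two Boolean attributes, two classes, made-up conditional probabilities.
priors = {"like": 0.3, "dislike": 0.7}
likelihoods = {
    "like":    {"fun": {1: 0.20, 0: 0.80}, "machine": {1: 0.10, 0: 0.90}},
    "dislike": {"fun": {1: 0.05, 0: 0.95}, "machine": {1: 0.15, 0: 0.85}},
}
print(predict_proba(priors, likelihoods, {"fun": 1, "machine": 1}))  # like ~0.53, dislike ~0.47
print(predict(priors, likelihoods, {"fun": 1, "machine": 1}))        # 'like'
```

The normalising constant α is obtained simply by dividing each unnormalised score by the sum of the scores, so the returned class probabilities sum to 1.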
Example: Learning to Classify Text

To illustrate how a naive Bayes classifier works we will examine how to use a naive Bayes classifier to classify text. Imagine we are trying to learn a target function that classifies documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two categories (a typical document filtering problem).

There are two main design issues in applying the naive Bayes classifier to such text classification problems:

1. How do we represent an arbitrary text document in terms of attribute values?
2. How do we estimate the probabilities required by the naive Bayes classifier?

How to represent an arbitrary text document?

There are many ways to represent a document; for this example we will use an extremely simple one. Given a text document, we define an attribute for each word in a default English dictionary and define the value of that attribute to indicate the presence or absence of the word in the document. So for the text "Machine learning is fun" we might have the following representation:

    word:    and  fun  is  learning  machine  much  not
    value:     0    1   1         1        1     0    0

Given this document representation we can now apply the naive Bayes classifier. Let's assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 that they have classified as like. We are now given a new document and asked to classify it; let us assume that it is "Machine learning is fun". Given this, we calculate the naive Bayes classification as:

    C_NB = argmax_{Cj ∈ {like, dislike}} P(Cj) ∏_{i=1}^{d} P(ai | Cj)
         = argmax_{Cj ∈ {like, dislike}} P(Cj) P(machine | Cj) P(learning | Cj) P(is | Cj) P(fun | Cj)

In summary, C_NB is the classification that maximizes the probability of observing the words that were actually found in the document, subject to the usual naive Bayes independence assumptions.

To calculate C_NB we need estimates for P(Cj) and P(wk | Cj) (here wk indicates the k-th word in the English language). P(Cj) is the fraction of each class in the training data; for our example P(like) = .3 and P(dislike) = .7. Estimating the class conditional probability P(wk | Cj) is a bit more difficult!

Estimating the class conditional probability P(wk | Cj): we must estimate one such probability term for each combination of English word (wk) and target value (Cj). Unfortunately, there are approximately 50,000 distinct words in the English vocabulary and 2 possible target values, i.e. we must estimate 50,000 × 2 ≈ 100,000 such terms from the training data!

To summarise, we end up using a naive Bayes classifier together with the assumption that the probability of word occurrence is independent of position within the text. The algorithm has two steps (sketched in code below):

1. During learning, the algorithm examines all the training documents to extract the vocabulary of all the words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates.
2. During classification, we use these probability estimates to calculate C_NB.

Note: any words appearing in the new document that are not observed in the training set are simply ignored by this algorithm.
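The sketch below illustrates these two steps under stated assumptions rather than reproducing the exact implementation from the slides: training documents are (text, label) pairs, each word is treated as a Boolean presence/absence attribute, and plain relative-frequency estimates are used (so zero counts remain possible; smoothing, discussed in the next section, addresses this). The names learn and classify are illustrative.

```python
from collections import Counter, defaultdict

def learn(training_docs):
    """Estimate P(Cj) and P(wk | Cj) by relative frequency from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)   # word_counts[c][w] = number of class-c docs containing w
    vocabulary = set()
    for text, label in training_docs:
        words = set(text.lower().split())    # presence/absence only; word position is ignored
        vocabulary |= words
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
    n_docs = sum(class_counts.values())
    priors = {c: class_counts[c] / n_docs for c in class_counts}
    cond = {c: {w: word_counts[c][w] / class_counts[c] for w in vocabulary}
            for c in class_counts}
    return priors, cond, vocabulary

def classify(text, priors, cond, vocabulary):
    """Return C_NB = argmax_c P(c) * product over the document's known words of P(w | c)."""
    words = set(text.lower().split()) & vocabulary   # unseen words are simply ignored
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            score *= cond[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Illustrative usage with two tiny made-up training documents:
docs = [("machine learning is fun", "like"), ("the weather report is dull", "dislike")]
priors, cond, vocab = learn(docs)
print(priors)                                                      # {'like': 0.5, 'dislike': 0.5}
print(classify("machine learning is fun", priors, cond, vocab))    # 'like'
```

Note how classify intersects the document's words with the training vocabulary, so words that were never observed during training are ignored, exactly as described in the note above.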
Practice

Here we want to build a naive Bayes classification model to diagnose heart disease based on a small snippet of text from a discharge report written by a doctor.

Here are some sample texts for patients who went on to develop heart disease within 2 years of discharge:

The patient is male and old.
The patient is female with a history of diabetes and obesity.
The patient is male with a history of diabetes and obesity.
The patient is male, old and diabetic.
The patient is female and young.

Here are some sample texts for patients who did not go on to develop heart disease within 2 years of discharge:

The patient is male and young.
The patient is female with no history of diabetes and obesity.
The patient is female and young.
The patient is female and diabetic.
The patient is female and old.

Can we calculate the probability of the following patients having heart disease based on the given text extracts from their discharge reports?

1. The patient is male.
2. The patient is diabetic.
3. The patient is female and diabetic.
4. The patient is female.

See the associated Excel sheet for the answers.


Estimating Probabilities and Smoothing

To complete the design of our algorithm we must choose a method for estimating the probability terms. We could use the relative frequency of a term in the training data. Although in many cases the relative frequency model works well, it provides poor estimates if the training data is small.

What about probabilities of 0?

What happens when a word appears in our dictionary but does not appear in our training set? For example, suppose the word whelmed is in our dictionary but not in our training set; then it would have a frequency count of zero. Should we assign P(whelmed) = 0? If we do, then the text "The student was whelmed by the difficulty of the exam question" would have an English probability of zero, which seems wrong! What can we do?

We have a problem of generalisation: we want our language models to generalize well to texts they haven't seen yet. Thus we will adjust our language model so that sequences that have a count of zero in the training corpus are assigned a small nonzero probability (and the other counts are adjusted downward slightly so that the probabilities still sum to 1).

Smoothing: the process of adjusting the probability of low frequency counts is called smoothing.

Laplace smoothing (aka add-one smoothing)

Suggested by Pierre-Simon Laplace: if a random Boolean variable X has been false in all n observations so far, then the estimate for P(X = true) should be 1/(n + 2), i.e. we assume that with two more trials, one might be true and one false. (A small code sketch is given at the end of this section.)

There are many other, more sophisticated smoothing models, e.g. Good-Turing discounting, Witten-Bell discounting, Kneser-Ney smoothing, Dirichlet smoothing. All of them are getting at the same goals:

1. addressing the issue of zero counts for rare events;
2. reducing the variance in the statistical model.

Another approach to these issues is to gather more training data. With more data, even simple smoothing techniques work well.
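As a small illustration of the add-one idea applied to the Boolean word-presence attributes used earlier (a sketch, not part of the slides): a word seen in k out of n documents of a class gets the smoothed estimate (k + 1) / (n + 2), so a zero count becomes 1/(n + 2) instead of 0. The function name laplace_estimate is illustrative.

```python
def laplace_estimate(true_count, n_observations):
    """Add-one (Laplace) smoothed estimate of P(X = true) for a Boolean variable
    observed true `true_count` times in `n_observations` trials: (k + 1) / (n + 2)."""
    return (true_count + 1) / (n_observations + 2)

# A word such as "whelmed" that never appears in 700 documents of a class:
print(laplace_estimate(0, 700))      # 1/702, roughly 0.0014: small, but no longer zero
# A word appearing in 350 of those 700 documents:
print(laplace_estimate(350, 700))    # 351/702 = 0.5
```

In the earlier learn sketch, replacing the relative frequency word_counts[c][w] / class_counts[c] with laplace_estimate(word_counts[c][w], class_counts[c]) removes the zero-probability problem for words that never appear with a given class.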