Introduction to Probability (4)

John Kelleher & Brian Mac Namee
Overview
1. Introduction
   Overview
2. Naive Bayes Models
   Theory
   Example: Learning to Classify Text
3. Estimating Probabilities and Smoothing
Naive Bayes Models
Theory
A simple form of Bayesian network, the naive Bayes
approach is a common classification technique.
The naive Bayes classifier applies to learning tasks where:
1. Each instance x is described by a conjunction of attribute values ⟨a_1, ..., a_n⟩, and the target function f(x) can take on any value from some finite set C.
2. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values ⟨a_1, ..., a_n⟩.
The learner is asked to predict the target value, or classification, for this new instance.
Naive Bayes Model
The model is naive because it assumes that the attributes are conditionally independent of each other given the class.
Given the observed attribute values a_1, ..., a_n for the new instance to be classified, the probability of each class is given by:

P(C | a_1, ..., a_n) = α P(C) ∏_i P(a_i | C)
A deterministic prediction is obtained by choosing the most likely class; probabilistic predictions can also be given.
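
As a minimal sketch of how this posterior could be computed, assuming hypothetical class priors and conditional probability tables (every name and number below is illustrative, not taken from the slides):

    # Minimal sketch: computing P(C | a_1, ..., a_n) = alpha * P(C) * prod_i P(a_i | C)
    # from hypothetical probability tables (illustrative values only).
    priors = {"c1": 0.6, "c2": 0.4}                     # P(C)
    likelihoods = {                                     # P(a_i | C)
        "c1": {"a1": 0.8, "a2": 0.3},
        "c2": {"a1": 0.1, "a2": 0.7},
    }

    def posterior(observed, priors, likelihoods):
        """Return P(C | observed attributes) for every class."""
        scores = {}
        for c, prior in priors.items():
            score = prior
            for a in observed:
                score *= likelihoods[c][a]              # multiply in P(a_i | C)
            scores[c] = score
        alpha = 1.0 / sum(scores.values())              # normalisation constant alpha
        return {c: alpha * s for c, s in scores.items()}

    probs = posterior(["a1", "a2"], priors, likelihoods)
    print(probs)                                        # probabilistic prediction
    print(max(probs, key=probs.get))                    # deterministic prediction: most likely class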
The naive Bayes learning method involves a learning step in which the various P(c_j) and P(a_i | c_j) terms are estimated based on their frequency over the training data.
Naive Bayes learning has some attractive properties:
1. Naive Bayes learning scales well to very large problems: with n Boolean variables, there are just 2n + 1 parameters.
2. With a naive Bayes system there is no search required to find h_ML, the maximum likelihood naive Bayes hypothesis; instead, the hypothesis is formed without searching, simply by counting the frequency of various data combinations within the training examples (see the sketch after this list).
3. Naive Bayes systems have little difficulty with noisy or missing data and can give probabilistic predictions when appropriate.
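
To illustrate the counting-only learning step, here is a small sketch that estimates the P(c_j) and P(a_i | c_j) terms by relative frequency from a made-up Boolean dataset (names and values are purely illustrative):

    # Sketch: the naive Bayes "hypothesis" is obtained purely by counting frequencies.
    from collections import Counter, defaultdict

    # Toy Boolean training data: (attribute values, class) -- illustrative only.
    data = [
        ({"a1": True,  "a2": False}, "pos"),
        ({"a1": True,  "a2": True},  "pos"),
        ({"a1": False, "a2": True},  "neg"),
        ({"a1": False, "a2": False}, "neg"),
    ]

    class_counts = Counter(c for _, c in data)
    true_counts = defaultdict(Counter)                  # per class: how often each attribute is True
    for attrs, c in data:
        for a, v in attrs.items():
            if v:
                true_counts[c][a] += 1

    priors = {c: n / len(data) for c, n in class_counts.items()}              # P(c_j)
    likelihoods = {c: {a: true_counts[c][a] / class_counts[c]                 # P(a_i = True | c_j)
                       for a in ("a1", "a2")} for c in class_counts}
    print(priors)
    print(likelihoods)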
Example: Learning to Classify Text
To illustrate how a naive Bayes classifier works, we will examine how to use one to classify text.
Imagine we are trying to learn a target function that classifies documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two categories (a typical document filtering problem).
There are two main design issues in applying the naive
Bayes classifier to such text classification problems:
1. How do we represent an arbitrary text document in terms of attribute values?
2. How do we estimate the probabilities required by the naive Bayes classifier?
How to represent an arbitrary text document?
There are many ways to represent a document; for this example we will use an extremely simple one:
Given a text document, we define an attribute for each word in a default English dictionary and define the value of that attribute to indicate the presence or absence of the word in the document.
So the text "Machine learning is fun" might be represented as follows:
word:    and  fun  is  learning  machine  much  not
value:     0    1   1         1        1     0    0
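
Here is a small sketch of this representation step, assuming a tiny seven-word dictionary rather than a full English one (the dictionary is just the words shown in the table above):

    # Sketch: represent a document by presence/absence attributes over a dictionary.
    dictionary = ["and", "fun", "is", "learning", "machine", "much", "not"]

    def represent(text, dictionary):
        """Map a text to {word: 1 if present in the text, else 0} for every dictionary word."""
        words = set(text.lower().split())
        return {w: int(w in words) for w in dictionary}

    print(represent("Machine learning is fun", dictionary))
    # -> {'and': 0, 'fun': 1, 'is': 1, 'learning': 1, 'machine': 1, 'much': 0, 'not': 0}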
Given this document representation we can now apply the
naive Bayes classifier.
Let us assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 that they have classified as like.
We are now given a new document and asked to classify it; let us assume that it is "Machine learning is fun".
Given this, we calculate the naive Bayes classification as:

C_NB = argmax_{C_j ∈ {like, dislike}} P(C_j) ∏_{i=1}^{d} P(a_i | C_j)
     = argmax_{C_j ∈ {like, dislike}} P(C_j) P(machine | C_j) P(learning | C_j) P(is | C_j) P(fun | C_j)

(where d is the number of word positions in the document and a_i is the word at position i).
In summary, C_NB is the classification that maximizes the probability of observing the words that were actually found in the document, subject to the usual naive Bayes independence assumptions.
To calculate C_NB we need estimates for P(C_j) and P(w_k | C_j) (here w_k indicates the k-th word in the English language).
P(C_j) is the fraction of each class in the training data; for our example P(like) = .3 and P(dislike) = .7.
Estimating the class conditional probability P(w_k | C_j) is a bit more difficult!
Estimating the class conditional probability P(w_k | C_j):
We must estimate one such probability term for each combination of English word (w_k) and target value (C_j).
Unfortunately, there are approximately 50,000 distinct words in the English vocabulary and 2 possible target values, i.e. we must estimate 50,000 × 2 = 100,000 such terms from the training data!
To summarise, we end up using a naive Bayes classifier
together with the assumption that the probability of word
occurrence is independent of position within the text.
1. During learning, the algorithm examines all the training documents to extract the vocabulary of all the words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates.
2. During classification, we use these probability estimates to calculate C_NB. Note: any words appearing in the new document that are not observed in the training set are simply ignored by this algorithm. (Both steps are sketched below.)
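
The following is a minimal sketch of both steps for the text task, using raw relative-frequency estimates (smoothing comes in the next section); the function names learn and classify are made up for illustration:

    # Sketch: learning = extract the vocabulary and count word frequencies per class;
    # classification = multiply the estimated probabilities, ignoring words outside the vocabulary.
    from collections import Counter, defaultdict

    def learn(documents):
        """documents: list of (text, label) pairs. Returns (vocabulary, priors, word_probs)."""
        class_counts = Counter(label for _, label in documents)
        word_counts = defaultdict(Counter)
        vocabulary = set()
        for text, label in documents:
            words = text.lower().split()
            vocabulary.update(words)
            word_counts[label].update(words)
        priors = {c: class_counts[c] / len(documents) for c in class_counts}       # P(C_j)
        word_probs = {c: {w: word_counts[c][w] / sum(word_counts[c].values())      # P(w_k | C_j)
                          for w in vocabulary} for c in class_counts}
        return vocabulary, priors, word_probs

    def classify(text, vocabulary, priors, word_probs):
        """Return C_NB, the class maximising P(C_j) * prod_k P(w_k | C_j)."""
        scores = {}
        for c, prior in priors.items():
            score = prior
            for w in text.lower().split():
                if w in vocabulary:                     # unseen words are simply ignored
                    score *= word_probs[c][w]           # may be 0 without smoothing
            scores[c] = score
        return max(scores, key=scores.get)

Note that a word which appears in the vocabulary but never with a particular class gets probability zero here; this is exactly the problem the smoothing section addresses.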
Practice
Here we want to build a naive Bayes classification model to diagnose heart disease based on a small snippet of text
from a discharge report written by a doctor.
Here are some sample texts for patients who went on to develop heart disease within 2 years of discharge:
The patient is male and old.
The patient is female with a history of diabetes and obesity.
The patient is male with a history of diabetes and obesity.
The patient is male, old and diabetic.
The patient is female and young.
Here are some sample texts for patients who did not develop heart disease within 2 years of discharge:
The patient is male and young.
The patient is female with no history of diabetes and obesity.
The patient is female and young.
The patient is female and diabetic.
The patient is female and old.
Can we calculate the probability of the following patients having heart
disease based on the given text extracts from their discharge reports?
1) The patient is male.
2) The patient is diabetic.
3) The patient is female and diabetic.
4) The patient is female.
See the associated Excel sheet for the answers.
Estimating Probabilities and Smoothing
To complete the design of our algorithm we must choose a
method for estimating the probability terms.
We could use the relative frequency of a term in the
training data.
Although the relative frequency model works well in many cases, it provides poor estimates when the training data set is small.
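
Concretely, for the text example the relative-frequency estimate of P(w_k | C_j) is just the number of times w_k occurs in the class C_j documents divided by the total number of word occurrences in that class; a tiny sketch with made-up counts:

    # Relative-frequency estimate of P(w_k | C_j); all counts are illustrative.
    n = 1000                               # total word occurrences in the class C_j documents
    n_k = {"machine": 30, "learning": 25}  # occurrences of each word within that class
    estimates = {w: count / n for w, count in n_k.items()}
    print(estimates)                       # {'machine': 0.03, 'learning': 0.025}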
What about probabilities of 0?
What happens when a word appears in our dictionary, but does not appear in our training set?
For example, suppose the word whelmed is in our dictionary but not in our training set; it would then have a frequency count of zero.
Should we assign P(whelmed) = 0? If we do, then the text "The student was whelmed by the difficulty of the exam question" would be assigned a probability of zero, which seems wrong!
What can we do?
We have a problem of generalisation: we want our language models to generalise well to texts they haven't seen yet.
Thus we will adjust our language model so that sequences that have a count of zero in the training corpus are assigned a small nonzero probability (and the other counts are adjusted downward slightly so that the probabilities still sum to 1).
Smoothing
The process of adjusting the probability estimates of low-frequency counts is called smoothing.
Laplace smoothing (aka Add-one smoothing)
Suggested by Pierre-Simon Laplace
If a random Boolean variable X has been false in all n observations so far, then the estimate for P(X = true) should be 1/(n + 2),
i.e., we assume that with two more trials, one might be true and one false.
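
A minimal sketch of add-one smoothing applied to word-probability estimates; the general form (count + 1) / (total + vocabulary size) is the standard add-one estimate, and the vocabulary and counts below are illustrative:

    # Add-one (Laplace) smoothing: give every word a pseudo-count of 1 so no estimate is
    # exactly zero; the other estimates shrink slightly so the probabilities still sum to 1.
    vocabulary = ["machine", "learning", "is", "fun", "whelmed"]
    counts = {"machine": 30, "learning": 25, "is": 40, "fun": 5, "whelmed": 0}
    total = sum(counts.values())

    unsmoothed = {w: counts[w] / total for w in vocabulary}
    smoothed = {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}

    print(unsmoothed["whelmed"])   # 0.0 -- the zero-count problem
    print(smoothed["whelmed"])     # small but nonzero

For the Boolean case on this slide, the same rule with two possible outcomes gives (0 + 1) / (n + 2), matching the 1/(n + 2) estimate above.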
There are many other more sophisticated smoothing models, e.g., Good-Turing discounting, Witten-Bell discounting, Kneser-Ney smoothing, and Dirichlet smoothing.
All of them are getting at the same goals:
1. addressing the issue of zero counts for rare events;
2. reducing the variance in the statistical model.
Another approach to these issues is to gather more training data. With more data, even simple smoothing techniques work well.