Bayes and Naïve Bayes
cs534-Machine Learning
Bayes Classifier
• Generative model learns $P(\mathbf{x} \mid y)$ and $P(y)$
• Prediction is made by
$$\hat{y} = \arg\max_y P(y \mid \mathbf{x})$$
where
$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{\sum_{y'} P(\mathbf{x} \mid y')\,P(y')}$$
and
$$P(\mathbf{x}, y) = P(\mathbf{x} \mid y)\,P(y)$$
• This is often referred to as the Bayes Classifier, because
of the use of the Bayes rule
• In the last lecture, we saw that for continuous $\mathbf{x}$ modeled with class-conditional Gaussians, we have
– a linear decision boundary (LDA) with different means and
shared covariance matrix
– a quadratic decision boundary with different means and
covariance matrices
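To make the prediction rule concrete, here is a minimal sketch (not from the lecture; all numbers are made up for illustration) of Bayes-rule prediction when $P(x \mid y)$ and $P(y)$ are already known for a single discrete feature:

```python
import numpy as np

# Hypothetical class-conditional model for one discrete feature x in {0, 1, 2}
# and two classes y in {0, 1}; all numbers are illustrative.
p_x_given_y = np.array([[0.7, 0.2, 0.1],   # P(x | y = 0)
                        [0.1, 0.3, 0.6]])  # P(x | y = 1)
p_y = np.array([0.6, 0.4])                 # P(y)

def bayes_predict(x):
    """Return argmax_y P(y | x) and the posterior, via Bayes rule."""
    joint = p_x_given_y[:, x] * p_y        # P(x | y) P(y), proportional to P(y | x)
    posterior = joint / joint.sum()        # normalize by sum_y' P(x | y') P(y')
    return posterior.argmax(), posterior

print(bayes_predict(2))   # class 1 wins here: the posterior is [0.2, 0.8]
```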
Joint Density estimation
• More generally, learning $P(\mathbf{x} \mid y)$ is a density estimation problem, which is a challenging task
• Now we will consider the case where $\mathbf{x}$ is a $d$-dimensional binary vector
• How do we learn $P(\mathbf{x} \mid y)$ in this case?
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
[Figure: diagram of the joint distribution over A, B, C]

Copyright © Andrew W. Moore
Learning a joint distribution
Build a JD table for your attributes in which the probabilities are unspecified, then fill in each row with
$$\hat{P}(\text{row}) = \frac{\text{number of records matching row}}{\text{total number of records}}$$

A  B  C  Prob        A  B  C  Prob
0  0  0  ?           0  0  0  0.30
0  0  1  ?           0  0  1  0.05
0  1  0  ?           0  1  0  0.10
0  1  1  ?     →     0  1  1  0.05
1  0  0  ?           1  0  0  0.05
1  0  1  ?           1  0  1  0.10
1  1  0  ?           1  1  0  0.25
1  1  1  ?           1  1  1  0.10

For example, $\hat{P}(A=1, B=1, C=0) = 0.25$ is the fraction of all records in which A and B are True but C is False.
Copyright © Andrew W. Moore
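As a quick illustration (toy records, not the Census data used on the next slide), a minimal sketch of filling in the joint table by counting:

```python
from collections import Counter
from itertools import product

# Hypothetical records, each a tuple of Boolean attributes (A, B, C).
records = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)]

counts = Counter(records)
total = len(records)

# P_hat(row) = (# records matching row) / (total # of records)
joint = {row: counts[row] / total for row in product((0, 1), repeat=3)}

for row, p in joint.items():
    print(row, p)
print("sum =", sum(joint.values()))   # must be 1 under the axioms of probability
```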
Example of Learning a Joint
• This joint was obtained by learning from three attributes in the UCI "Adult" Census Database [Kohavi 1995]
Copyright © Andrew W. Moore
UCI machine learning repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Learning joint distribution overfits
• Let $\mathbf{x}$ be a $d$-dimensional binary vector, and $y \in \{1, 2, \ldots, K\}$
• Learning the joint distribution $P(\mathbf{x} \mid y = k)$ for $k = 1, \ldots, K$ involves estimating $2^d - 1$ parameters for each class
– For a large $d$, this number is prohibitively large (e.g., $d = 30$ already gives over a billion parameters per class)
– Typically we don't have enough data to estimate the joint distribution accurately
– It is common to encounter the situation where no training example has a particular $(x_1, \ldots, x_d)$ value combination
– Then $\hat{P}(x_1, \ldots, x_d \mid y) = 0$ for all values of $y$
– This is severe overfitting
Naïve Bayes Assumption
• Assume that each feature is independent of the others given the class label
• Definition: $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$, given the value of $Z$:
$$\forall i, j, k \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$
Often denoted as $X \perp Y \mid Z$
• In other words, once $Z$ is known, also observing $Y$ does not change the distribution of $X$
• If $X, Y$ are conditionally independent given $Z$, we have:
$$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$$
Conditional independence vs. independence
• Conditional independence: $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
or equivalently: $P(X \mid Y, Z) = P(X \mid Z)$
• Independence: $P(X, Y) = P(X)\,P(Y)$
or equivalently: $P(X \mid Y) = P(X)$
• Conditional independence vs. independence
– Does one imply the other? No: neither property implies the other
Naïve Bayes Classifier
• Under the Naïve Bayes assumption, we have:
$$P(\mathbf{x} \mid y) = \prod_{i=1}^{d} P(x_i \mid y)$$
• Thus, no need to estimate the joint distribution
• Instead, we only need to estimate $P(x_i \mid y)$ for each feature
• Example: with $d$ binary features, we reduce the number of parameters from $2^d - 1$ to $d$ for each class
– Significantly reduces overfitting
Example: spam filtering
• Bag-of-words representation of emails
• Represent an email by a vector whose dimension = the number of words in our "dictionary"
Example: Bernoulli features
• $x_i = 1$ if the $i$-th word is present
• $x_i = 0$ if the $i$-th word is not present
• The ordering/position of the words does not matter
• The "dictionary" can be formed by looking through the training set and identifying all the words and tokens that appear at least once (with stop-words like "the" and "and" removed)
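A minimal sketch of this representation (the tokenization, stop-word list, and toy emails below are hypothetical, not the course's actual preprocessing):

```python
# Build a dictionary from the training emails (stop-words removed),
# then represent each email as a binary presence/absence vector.
stop_words = {"the", "and"}

train_emails = ["win money now", "meeting schedule for monday", "win the lottery now"]

vocab = sorted({w for email in train_emails for w in email.split() if w not in stop_words})
index = {w: i for i, w in enumerate(vocab)}

def to_bernoulli_vector(email):
    x = [0] * len(vocab)
    for w in email.split():
        if w in index:          # x_i = 1 if the i-th word is present
            x[index[w]] = 1     # position/ordering and repeat counts are ignored
    return x

print(vocab)
print(to_bernoulli_vector("win money money now"))
```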
MLE for Naïve Bayes with Bernoulli Model
• Suppose our training set contains $N$ emails; the maximum likelihood estimates of the parameters are:
$$P(y = 1) = \frac{N_1}{N}, \text{ where } N_1 \text{ is the number of spam emails}$$
$$P(x_i = 1 \mid y = 1) = \frac{N_{i|1}}{N_1}, \text{ i.e., the fraction of spam emails in which } x_i \text{ appeared}$$
$$P(x_i = 1 \mid y = 0) = \frac{N_{i|0}}{N_0}, \text{ i.e., the fraction of non-spam emails in which } x_i \text{ appeared}$$
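A minimal sketch of these estimates with numpy, assuming X is an N x d binary matrix of word presences and y marks spam with 1 (toy data, not from the slides):

```python
import numpy as np

# Toy data: N = 6 emails, d = 4 dictionary words; y = 1 means spam.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

N = len(y)
N1, N0 = (y == 1).sum(), (y == 0).sum()

p_y1 = N1 / N                              # P(y = 1) = N1 / N
p_xi_given_spam = X[y == 1].mean(axis=0)   # P(x_i = 1 | y = 1) = N_{i|1} / N_1
p_xi_given_ham = X[y == 0].mean(axis=0)    # P(x_i = 1 | y = 0) = N_{i|0} / N_0

print(p_y1, p_xi_given_spam, p_xi_given_ham)
```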
Naïve Bayes Prediction
• To make a prediction for a new example with feature vector $\mathbf{x} \in \{0, 1\}^d$:
$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{d} P(x_i \mid y)$$
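A minimal sketch of this rule for the Bernoulli model; it works with sums of logs rather than products to avoid numerical underflow (a standard implementation detail, not stated on the slide), and the parameter values are hypothetical:

```python
import numpy as np

def nb_predict(x, p_y1, p_x_spam, p_x_ham, eps=1e-12):
    """argmax_y P(y) * prod_i P(x_i | y) for a binary feature vector x."""
    x = np.asarray(x)
    # Bernoulli likelihood of each feature: p_i if x_i = 1, (1 - p_i) if x_i = 0.
    log_spam = np.log(p_y1 + eps) + np.sum(
        x * np.log(p_x_spam + eps) + (1 - x) * np.log(1 - p_x_spam + eps))
    log_ham = np.log(1 - p_y1 + eps) + np.sum(
        x * np.log(p_x_ham + eps) + (1 - x) * np.log(1 - p_x_ham + eps))
    return int(log_spam > log_ham)

# Example with hypothetical estimates for d = 3 words.
print(nb_predict([1, 0, 1], 0.5, np.array([0.8, 0.1, 0.7]), np.array([0.2, 0.6, 0.1])))
```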
Discrete and Continuous Features
• Naïve Bayes can be easily extended to handle features that are not binary-valued
• Discrete: use a categorical distribution for each feature in place of the Bernoulli
• Continuous:
– Discretize the feature, then build a categorical distribution for each feature
– When the feature does not follow a Gaussian distribution, this can result in a better classifier than fitting a Gaussian
Problem with MLE
• Suppose you picked up a new word "Mahalanobis" in your class and started using it in your email $\mathbf{x}$
• Because "Mahalanobis" (say it's the $(n+1)$-th word in the vocabulary) has never appeared in any of the training emails, the probability estimates for this word will be
$$P(x_{n+1} = 1 \mid y = 1) = 0, \qquad P(x_{n+1} = 1 \mid y = 0) = 0$$
• Now $P(\mathbf{x} \mid y) = \prod_i P(x_i \mid y) = 0$ for both $y = 0$ and $y = 1$
• Given limited training data, MLE can often result in probabilities of 0 or 1. Such extreme probabilities are "too strong" and cause problems
• To solve this problem, we can use the Maximum A Posteriori (MAP) estimate, which requires us to think of the parameters as random variables, the Bayesian view of parameters
Bayesian Parameter Estimation
• In the Bayesian framework, we consider
the parameters we are trying to estimate
to be random variables
• We give them a prior distribution, which is used to capture our prior belief about the parameters
• When the data is sparse, this allows us to fall back on the prior and avoid the issues faced by MLE
Example: Bernoulli
• Given an unfair coin, we want to estimate $\theta$, the probability of heads
• We toss the coin $N$ times and observe $n_H$ heads
• MLE estimate: $\hat{\theta}_{MLE} = \frac{n_H}{N}$
• Now let's go Bayesian and assume that $\theta$ is a random variable with a prior distribution
• For reasons that will become clear later, we assume the following prior for $\theta$:
$$\theta \sim \mathrm{Beta}(\alpha, \beta): \quad P(\theta; \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$
[Figure: Beta distributions for different $(\alpha, \beta)$: uniform, U-shaped, unimodal symmetric]
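A small sketch (using scipy, with illustrative (α, β) values) that evaluates the Beta density to reproduce the shapes the figure refers to:

```python
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.05, 0.95, 7)
for a, b, label in [(1, 1, "uniform"),
                    (0.5, 0.5, "U-shaped"),
                    (2, 2, "unimodal symmetric")]:
    # Evaluate the Beta(a, b) density on a grid of theta values.
    print(f"Beta({a}, {b}) [{label}]:", np.round(beta.pdf(thetas, a, b), 2))
```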
Posterior distribution of $\theta$
$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta) \propto \theta^{n_H} (1 - \theta)^{n_T} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \propto \theta^{n_H + \alpha - 1} (1 - \theta)^{n_T + \beta - 1}$$
Note that the posterior has exactly the same form as the prior. This is not a coincidence; it is due to careful selection of the prior distribution – the Beta is the conjugate prior for the Bernoulli.
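A tiny numeric sketch of this conjugacy (toy counts, not from the slides): combining a Beta(α, β) prior with n_H heads and n_T tails gives a Beta(α + n_H, β + n_T) posterior:

```python
alpha, beta_prior = 2, 2     # prior Beta(alpha, beta)
n_heads, n_tails = 7, 3      # observed coin tosses

# Conjugacy: the posterior is again a Beta distribution.
post_a, post_b = alpha + n_heads, beta_prior + n_tails
print(f"posterior = Beta({post_a}, {post_b})")
print("posterior mean:", post_a / (post_a + post_b))            # (n_H + alpha) / (N + alpha + beta)
print("posterior mode:", (post_a - 1) / (post_a + post_b - 2))  # this is the MAP estimate (next slides)
```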
Maximum a-Posteriori (MAP)
• A full Bayesian treatment does not care about point estimates of the parameters
• Instead, it uses the whole posterior distribution to make predictions for the target variable
• But if we do need a concrete estimate of the parameters, we can use Maximum A Posteriori estimation, that is:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D)$$
MAP for Bernoulli
$$P(\theta \mid D) \propto \theta^{n_H + \alpha - 1} (1 - \theta)^{n_T + \beta - 1}$$
• Mode: the maximum point of $\mathrm{Beta}(a, b)$ is $\frac{a - 1}{a + b - 2}$
• In this case we have:
$$\hat{\theta}_{MAP} = \frac{n_H + \alpha - 1}{N + \alpha + \beta - 2}$$
• Let $\alpha = 2$, $\beta = 2$; we have $\hat{\theta}_{MAP} = \frac{n_H + 1}{N + 2}$
• Compared to the MLE, it is like adding some fake coin tosses (one head, one tail) to the observed data – this is also sometimes called Laplace smoothing
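A tiny sketch of this "fake tosses" view with α = β = 2 (toy counts): the MAP estimate equals the MLE computed after adding one fake head and one fake tail:

```python
def mle(n_heads, n_tails):
    return n_heads / (n_heads + n_tails)

def map_beta22(n_heads, n_tails):
    # MAP with a Beta(2, 2) prior: (n_H + 1) / (N + 2)
    return (n_heads + 1) / (n_heads + n_tails + 2)

print(mle(3, 0))           # 1.0 -- an extreme probability
print(map_beta22(3, 0))    # 0.8 -- pulled back toward 0.5
print(mle(3 + 1, 0 + 1))   # 0.8 -- same as adding one fake head and one fake tail
```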
Laplace Smoothing
• Suppose we estimate a probability $P(z)$ and we have $n_0$ examples where $z = 0$ and $n_1$ examples where $z = 1$
• The MLE estimate is
$$P(z = 1) = \frac{n_1}{n_0 + n_1}$$
• Laplace smoothing: add 1 to the numerator and 2 to the denominator
$$P(z = 1) = \frac{n_1 + 1}{n_0 + n_1 + 2}$$
• If we don't observe any examples, we expect $P(z = 1) = 0.5$, but our belief is weak (equivalent to seeing one example of each outcome)
MAP for Naïve Bayes Spam Filter
• When estimating $P(x_i = 1 \mid y = 1)$ and $P(x_i = 1 \mid y = 0)$
– Bernoulli case:
$$\text{MLE: } P(x_i = 1 \mid y = 0) = \frac{N_{i|0}}{N_0} \quad \Rightarrow \quad \text{MAP: } P(x_i = 1 \mid y = 0) = \frac{N_{i|0} + 1}{N_0 + 2}$$
• When we encounter a new word that has not appeared in the training set, the probabilities no longer go to zero
• This is called Laplace smoothing
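A minimal sketch of the smoothed Bernoulli estimates (toy data; the helper name is made up for illustration):

```python
import numpy as np

def bernoulli_nb_estimates(X, y, smooth=True):
    """Per-word P(x_i = 1 | y) for each class, with optional Laplace smoothing."""
    X, y = np.asarray(X), np.asarray(y)
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        Nc = len(Xc)
        counts = Xc.sum(axis=0)                  # N_{i|c}
        if smooth:
            params[c] = (counts + 1) / (Nc + 2)  # MAP: (N_{i|c} + 1) / (N_c + 2)
        else:
            params[c] = counts / Nc              # MLE: N_{i|c} / N_c
    return params

X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
print(bernoulli_nb_estimates(X, y))   # no zero probabilities, even for unseen words
```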
Multinomial Event Model for Text
• Instead of treating the absence/presence of each word in the dictionary as a Bernoulli, we can treat each word in the email as the outcome of rolling a $|V|$-sided die, where $|V|$ is the total size of the dictionary
• Each of the $|V|$ words has a fixed probability of being selected
• This allows us to take the counts of the words into consideration
MLE for Naïve Bayes with Multinomial Model
• The likelihood of observing one email $\mathbf{x}$ consisting of words $w_1, \ldots, w_n$ is $P(\mathbf{x} \mid y) = \prod_{j=1}^{n} P(w_j \mid y)$
• MLE estimate for the $k$-th word in the dictionary:
$$P(w = k \mid y) = \frac{\text{total count of word } k \text{ in class-}y\text{ emails}}{\text{total number of words in class-}y\text{ emails}}$$
• Total number of parameters: $|V| - 1$ for each class
MAP estimation for Multinomial
• The conjugate prior for the multinomial is the Dirichlet distribution
• For $K$ outcomes, a Dirichlet distribution has $K$ parameters, each serving a similar purpose to the Beta distribution's parameters
– They act as fake observation(s) for the corresponding outcome; the count depends on the value of the parameter
• Laplace smoothing for this case:
$$P(z = k) = \frac{n_k + 1}{n + K}$$
Laplace Smoothing for Multinomial Case
MLE: $P(w = k \mid y = s) = \dfrac{\text{total \# of word } k \text{ in } y{=}s \text{ emails}}{\text{total \# of words in } y{=}s \text{ emails}}$

MAP: $P(w = k \mid y = s) = \dfrac{\text{total \# of word } k \text{ in } y{=}s \text{ emails} + 1}{\text{total \# of words in } y{=}s \text{ emails} + |V|}$

where $|V|$ is the size of the vocabulary
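A minimal sketch of these multinomial MLE and MAP estimates, assuming emails are given as word-count vectors over a vocabulary of size |V| (toy counts, not from the slides):

```python
import numpy as np

# Toy counts: rows are emails, columns are words in a vocabulary of size |V| = 4.
counts = np.array([[2, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 0, 3, 2]])
y = np.array([1, 1, 0])            # 1 = spam, 0 = non-spam
V = counts.shape[1]

def multinomial_word_probs(counts, y, cls, laplace=True):
    """P(word k | y = cls): word counts in class-cls emails over total words."""
    word_counts = counts[y == cls].sum(axis=0)        # total # of word k in y = cls emails
    total_words = word_counts.sum()                   # total # of words in y = cls emails
    if laplace:
        return (word_counts + 1) / (total_words + V)  # MAP with Dirichlet / Laplace smoothing
    return word_counts / total_words                  # MLE

print(multinomial_word_probs(counts, y, cls=1))
print(multinomial_word_probs(counts, y, cls=0))
```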
Naïve Bayes Summary
• Generative classifier
– learn P( x|y ) and P(y)
– Use Bayes rule to compute P( y|x ) for classification
• Assumes conditional independence between features
given class labels
– Greatly reduces the number of parameters to learn
• MAP estimation (or Laplace smoothing) is necessary to
avoid overfitting and extreme probability values