Text Classification using Naïve Bayes
Rao Vemuri
Midterm Examination
•  April 17th mid-term examination
•  Covers everything discussed in the class and reading assignments

Naive Bayes Idea
•  Let D = {(x^(1), y_1), (x^(2), y_2), ..., (x^(n), y_n)}
•  Each x^(i) is a point in d-dimensional real space:

       x^(i) = (x^(i)_1, x^(i)_2, ..., x^(i)_d) ∈ R^d,   y_i ∈ Y = {1, 2, ..., K}
•  Each y_i can assume one of K possible values (class labels).
•  Naive Bayes is a generative method. That is, we assume D is generated from the family pθ(x, y), parameterized by θ, and we find θ.

Naive Bayes Assumption
•  Now we write the joint distribution as

       pθ(x, y) = pθ(x | y) pθ(y);   (x is a vector)
•  Now we make the KEY assumption (the independent-feature assumption, or Naive Bayes assumption):

       pθ(x, y) = pθ(x | y) pθ(y)
                = pθ(x_1 | y) pθ(x_2 | y) ⋯ pθ(x_d | y) pθ(y)
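To make the factorization concrete, here is a minimal Python sketch. The names prior and cond are hypothetical stand-ins for pθ(y) and pθ(x_j = 1 | y) over binary features; they are not defined in the slides.

```python
# Minimal sketch of the factorized joint p(x, y) under the Naive Bayes assumption.
# 'prior[y]' stands in for p_theta(y); 'cond[y][j]' for p_theta(x_j = 1 | y).
def joint_probability(x, y, prior, cond):
    p = prior[y]
    for j, xj in enumerate(x):
        # Bernoulli feature: use cond if the word is present, (1 - cond) if absent
        p *= cond[y][j] if xj == 1 else 1.0 - cond[y][j]
    return p
```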
Strategy
•  Goal: Given a new x, predict y, the class label
•  Algorithm:
•  Step 1: Estimate θ from D; call it "theta hat", θ̂
   –  This can be done using the Maximum Likelihood (ML) estimate, the Maximum A Posteriori (MAP) estimate, ...
•  Step 2: Compute "y-hat", an estimate of y:

       ŷ = argmax_{y ∈ Y} pθ̂(y | x)
•  Calculate the RHS for all possible y and pick the max

Now Invoke Bayes Theorem
•  For a new x, rewrite the previous eqn using Bayes' Theorem:

       ŷ = argmax_{y ∈ Y} pθ̂(x | y) pθ̂(y) / pθ̂(x)

•  The denominator doesn't depend on y:

       ŷ = argmax_{y ∈ Y} pθ̂(x | y) pθ̂(y)
         = argmax_{y ∈ Y} pθ̂(x_1 | y) pθ̂(x_2 | y) ⋯ pθ̂(x_d | y) pθ̂(y)
•  The second eqn follows from the NB assumption

Example: Filtering Spam using Naive Bayes
•  Consider incoming e-mails as good and spam
•  We have a training set where some e-mails are labeled good and some are labeled spam
•  Let us choose features as follows
   –  There are many ways of doing this
   –  We are presenting one method here

Representing e-mail Messages
•  Let us represent each e-mail by a vector x whose length is equal to the number of words in an English dictionary, |V| (say 50,000)
•  If an e-mail contains the ith word in our dictionary then we set x_i = 1; otherwise, x_i = 0
•  Say our dictionary has the words "a, aardvark, aardwolf, ..., buy, ..., zygmurgy"
•  Then one of our e-mails could be represented as shown below

One Message
•  The e-mail message containing the words "a" and "buy" can be represented by the vector x as shown (a code sketch for building such a vector follows):
           [ 1 ]      a
           [ 0 ]      aardvark
       x = [ ⋮ ]      ⋮
           [ 1 ]      buy
           [ ⋮ ]      ⋮
           [ 0 ]      zygmurgy
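A minimal sketch of building this binary word-presence vector; the tiny vocabulary list below is an illustrative stand-in for the 50,000-word dictionary.

```python
# Minimal sketch: turn an e-mail into the binary word-presence vector described above.
# This 5-word vocabulary is an illustrative stand-in for the 50,000-word dictionary.
vocabulary = ["a", "aardvark", "aardwolf", "buy", "zygmurgy"]

def email_to_vector(email_text):
    """x[i] = 1 if the i-th vocabulary word appears in the e-mail, else 0."""
    words_in_email = set(email_text.lower().split())
    return [1 if word in words_in_email else 0 for word in vocabulary]

print(email_to_vector("a great chance to buy now"))   # -> [1, 0, 0, 1, 0]
```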
Limitations
•  This representation does not capture the case
   –  When the same word occurs more than once
   –  When there are numbers or symbols in the document that are not found in a dictionary

Dimension of x
•  The size of the vector x corresponds to the number of words (vocabulary, |V|) in our dictionary
•  All x's are of the same size
•  We want to model p(x | y)
•  If the dimension of x is 50,000 and each element is either 0 or 1, then there are about 2^50000 possible x-vectors. TOO MANY

Naive Bayes' Assumption
•  To manage the "size problem" we will make a strong assumption
•  We'll assume that the x_i's are conditionally independent given y
   –  The probability of each word occurring in a document is independent of the other words, given its class label
•  This is called the Naive Bayes assumption
•  The resulting algorithm is called the Naive Bayes classifier

Meaning of Conditional Independence
•  Suppose y = 1 for spam mail
•  Assume "buy" is word number 2096
•  Assume "price" is word number 38451
•  If I tell you y = 1 (i.e., it is spam), then knowledge of x_2096 (viz., that the word "buy" appears) will have no effect on your belief about the value of x_38451 (viz., that the word "price" appears)
•  (This is obviously not true in this case! Yet the assumption gives good results!!)

What is NOT Independence
•  We are saying:

       p(x_2096 | y) = p(x_2096 | y, x_38451)
•  We are not saying:

       p(x_2096) = p(x_2096 | x_38451)
•  x_2096 and x_38451 are NOT independent; they are conditionally independent
•  By conditional independence, we mean

       p(x_1, ..., x_50000 | y)
         = p(x_1 | y) p(x_2 | y, x_1) p(x_3 | y, x_1, x_2) ⋯ p(x_50000 | y, x_1, ..., x_49999)    (chain rule, always true)
         = ∏_{i=1..50000} p(x_i | y)                                                              (by the NB assumption)

•  Once you know whether the mail is spam or not, the presence of the other words is irrelevant
The Model
•  The output y is Bernoulli distributed
•  The input features x_j are Bernoulli given y:

       y ~ Bernoulli(φ_y)
       ∀j :  x_j | y = 0  ~  Bernoulli(φ_{j|y=0})
       ∀j :  x_j | y = 1  ~  Bernoulli(φ_{j|y=1})
Model Parameters
•  Our model is parameterized by:

       φ_{i|y=1} = p(x_i = 1 | y = 1),  i.e., the fraction of spam (y=1) e-mails in which word i appears
       φ_{i|y=0} = p(x_i = 1 | y = 0),  i.e., the fraction of non-spam (y=0) e-mails in which word i appears, and
       φ_y = p(y = 1)
Joint Likelihood
•  Given the training set {(x^(i), y^(i)); i = 1, 2, ..., m}
•  We can write down the joint likelihood of the data:

       L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = ∏_{i=1..m} p(x^(i), y^(i))

•  Maximizing this in the usual manner gives us the Maximum Likelihood Estimate of the parameters (prove this at home)

MLE of Parameters
•  The MLE estimates are:
       φ_{j|y=1} = ( ∑_{i=1..m} I{x^(i)_j = 1 ∧ y^(i) = 1} ) / ( ∑_{i=1..m} I{y^(i) = 1} )  =  p(x_j = 1 | y = 1)

       φ_{j|y=0} = ( ∑_{i=1..m} I{x^(i)_j = 1 ∧ y^(i) = 0} ) / ( ∑_{i=1..m} I{y^(i) = 0} )  =  p(x_j = 1 | y = 0)

       φ_y = ( ∑_{i=1..m} I{y^(i) = 1} ) / m  =  p(y = 1)
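A minimal sketch of these counting formulas. The names X, Y, and mle_parameters are illustrative: X is assumed to be a list of m binary feature vectors and Y a list of m labels in {0, 1}.

```python
# Minimal sketch of the (unsmoothed) MLE formulas above.
# X: list of m binary feature vectors of length d; Y: list of m labels in {0, 1}.
def mle_parameters(X, Y):
    m, d = len(X), len(X[0])
    n_spam = sum(Y)                 # counts I{y^(i) = 1}
    n_good = m - n_spam             # counts I{y^(i) = 0}
    phi_y = n_spam / m
    phi_j_y1 = [sum(x[j] for x, y in zip(X, Y) if y == 1) / n_spam for j in range(d)]
    phi_j_y0 = [sum(x[j] for x, y in zip(X, Y) if y == 0) / n_good for j in range(d)]
    return phi_y, phi_j_y0, phi_j_y1
```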
Prediction
•  Now we can use Bayes' rule to predict whether a new mail with features x is spam or not:

       p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)

                    = ( ∏_{j=1..d} p(x_j | y = 1) ) p(y = 1)
                      ─────────────────────────────────────────────────────────────────────────────
                      ( ∏_{j=1..d} p(x_j | y = 1) ) p(y = 1) + ( ∏_{j=1..d} p(x_j | y = 0) ) p(y = 0)
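A minimal sketch of this posterior, using parameters like those returned by the earlier mle_parameters sketch (spam_posterior is an illustrative name). For a binary feature, p(x_j | y) is φ_{j|y} when the word is present and 1 − φ_{j|y} when it is absent. In practice one would sum logarithms rather than multiply 50,000 factors, to avoid numerical underflow.

```python
# Minimal sketch of the prediction formula above for a new binary vector x,
# given phi_y, phi_j_y0, phi_j_y1 as in the earlier mle_parameters sketch.
def spam_posterior(x, phi_y, phi_j_y0, phi_j_y1):
    p_x_given_spam, p_x_given_good = 1.0, 1.0
    for j, xj in enumerate(x):
        p_x_given_spam *= phi_j_y1[j] if xj == 1 else 1.0 - phi_j_y1[j]
        p_x_given_good *= phi_j_y0[j] if xj == 1 else 1.0 - phi_j_y0[j]
    numerator = p_x_given_spam * phi_y
    denominator = numerator + p_x_given_good * (1.0 - phi_y)
    return numerator / denominator      # p(y = 1 | x)
```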
Laplace Smoothing
•  We do not HAVE TO do this, but it is a good idea to do it always
•  Without smoothing, you are safe most of the time
•  On rare occasions you can get into BIG trouble if you do not do it
•  It is like protecting against "divide by 0"

Laplace Smoothing: Motivation
•  Suppose you did an excellent job on your class project (separating spam from good e-mail) and I encouraged you to submit your paper to an SPIE conference. Until now you never received any e-mail from SPIE, so it was not included in your training set. Now they send you a message that has the word "spie" in it. Assume that "spie" is the 35,000th word in your dictionary.
•  Upon seeing "spie," your spam filter will calculate the ML parameters as shown on the next slide

ML Values of Φ
•  The ML parameters are:
       φ_{35000|y=1} = ( ∑_{i=1..m} I{x^(i)_35000 = 1 ∧ y^(i) = 1} ) / ( ∑_{i=1..m} I{y^(i) = 1} )  =  p(x_35000 = 1 | y = 1)  =  0

       φ_{35000|y=0} = ( ∑_{i=1..m} I{x^(i)_35000 = 1 ∧ y^(i) = 0} ) / ( ∑_{i=1..m} I{y^(i) = 0} )  =  p(x_35000 = 1 | y = 0)  =  0
Classifier Judgment
•  Because the filter has never seen "spie" in either spam or good e-mail, it thinks the probability of seeing it in either class is zero
•  Thus, it makes the judgment:

       p(y = 1 | x) = ( ∏_{j=1..d} p(x_j | y = 1) ) p(y = 1)
                      ─────────────────────────────────────────────────────────────────────────────
                      ( ∏_{j=1..d} p(x_j | y = 1) ) p(y = 1) + ( ∏_{j=1..d} p(x_j | y = 0) ) p(y = 0)

                    = 0 / 0
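In the earlier spam_posterior sketch this failure is concrete: if φ_{j|y} is 0 for a word that appears in x, both products vanish and the ratio is 0/0 (in Python, a ZeroDivisionError). A tiny illustration with hypothetical two-word parameters:

```python
# Tiny illustration of the 0/0 failure: word 1 (say "spie") was never seen in
# training, so its estimated probability is 0 under both classes (made-up numbers).
phi_y = 0.5
phi_j_y1 = [0.8, 0.0]     # p(x_j = 1 | spam): word 0 common in spam, word 1 unseen
phi_j_y0 = [0.1, 0.0]     # p(x_j = 1 | good): word 1 unseen here too
x = [1, 1]                # the new mail contains both words

num, alt = phi_y, 1.0 - phi_y
for j, xj in enumerate(x):
    num *= phi_j_y1[j] if xj == 1 else 1.0 - phi_j_y1[j]
    alt *= phi_j_y0[j] if xj == 1 else 1.0 - phi_j_y0[j]
print(num, num + alt)     # 0.0 0.0  ->  p(y = 1 | x) would be 0/0, undefined
```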
Why?
•  This is because each of the terms ∏_{j=1..d} p(x_j | y) contains the factor p(x_35000 | y) = 0 that is multiplied into it
•  Statistically it is a bad idea to estimate the probability of an event to be zero just because we have not encountered that event in the past

Laplace Smoothing Formula
•  MLE:

       P(A | B) = count / N

•  Laplace Smoothing:

       P_Laplace(A | B) = (count + k) / (N + k · |classes|);   choose k = 1
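A one-line sketch of this smoothed estimate (the name laplace_estimate is illustrative); applied to the class prior in the next slide it reproduces (3+1)/(6+2) = 1/2.

```python
# Minimal sketch of the Laplace-smoothed estimate: (count + k) / (N + k * |classes|).
def laplace_estimate(count, total, num_classes, k=1):
    return (count + k) / (total + k * num_classes)

print(laplace_estimate(3, 6, 2))    # P(movie) with smoothing = 4/8 = 0.5
```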
Example: k=1
•  Suppose we have two classes: Movies, Songs
•  We have 6 titles:

       Movies              Songs
       A perfect world     A perfect day
       My perfect woman    Electric storm
       Pretty woman        Another rainy day

•  Query: "Perfect storm"
•  P(movie) = count/N = 3/6 = 1/2 = prior
•  With smoothing: P(movie) = (count + k)/(N + k|C|) = (3+1)/(6+2) = 1/2 = same as before
•  Similarly, P(song) = 1/2
•  P("perfect"|movie) = ?
   –  In the movie titles, we see 2 "perfect"s out of 8 words
   –  P("perfect"|movie) = 2/8 = 0.25
•  With Laplace smoothing: P("perfect"|movie) = ??
   –  The number of "classes" here is the size of the vocabulary; across the six titles there are 11 unique words
   –  P("perfect"|movie) = (2+1)/(8+11) = 3/19 ≈ 0.16

Calculate these at home
•  P("perfect"|song) = ?  (2/19)
•  P("storm"|movie) = ?  (1/19)
•  P("storm"|song) = ?  (2/19)
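A quick sketch checking this arithmetic, assuming (as the at-home answers imply) that the denominator uses the full 11-word vocabulary across both classes:

```python
# Quick check of the movie/song example with Laplace smoothing (k = 1).
movies = ["a perfect world", "my perfect woman", "pretty woman"]
songs  = ["a perfect day", "electric storm", "another rainy day"]

movie_words = " ".join(movies).split()       # 8 tokens
song_words  = " ".join(songs).split()        # 8 tokens
vocab = set(movie_words) | set(song_words)   # 11 unique words

def smoothed(word, class_words, k=1):
    return (class_words.count(word) + k) / (len(class_words) + k * len(vocab))

print(smoothed("perfect", movie_words))   # (2+1)/(8+11) = 3/19 ≈ 0.158
print(smoothed("perfect", song_words))    # (1+1)/(8+11) = 2/19
print(smoothed("storm",   movie_words))   # (0+1)/(8+11) = 1/19
print(smoothed("storm",   song_words))    # (1+1)/(8+11) = 2/19
```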
Smoothed Formulas
•  The modified ML estimates for the multinomial case (z taking one of k values) are:

       φ_j = ( ∑_{i=1..m} I{z^(i) = j} + 1 ) / ( m + k )

•  For our spam model, each x_j takes two values, so k = 2:

       φ_{j|y=1} = ( ∑_{i=1..m} I{x^(i)_j = 1 ∧ y^(i) = 1} + 1 ) / ( ∑_{i=1..m} I{y^(i) = 1} + 2 )

       φ_{j|y=0} = ( ∑_{i=1..m} I{x^(i)_j = 1 ∧ y^(i) = 0} + 1 ) / ( ∑_{i=1..m} I{y^(i) = 0} + 2 )
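A minimal sketch of these smoothed estimates, mirroring the earlier mle_parameters sketch: add 1 to each count and 2 to each class denominator, since x_j is binary.

```python
# Minimal sketch of the Laplace-smoothed parameter estimates above.
# X: list of m binary feature vectors of length d; Y: list of m labels in {0, 1}.
def smoothed_parameters(X, Y):
    m, d = len(X), len(X[0])
    n_spam = sum(Y)
    n_good = m - n_spam
    phi_y = n_spam / m
    phi_j_y1 = [(sum(x[j] for x, y in zip(X, Y) if y == 1) + 1) / (n_spam + 2) for j in range(d)]
    phi_j_y0 = [(sum(x[j] for x, y in zip(X, Y) if y == 0) + 1) / (n_good + 2) for j in range(d)]
    return phi_y, phi_j_y0, phi_j_y1
```

With these in place of the unsmoothed estimates, the "spie" example above yields small but nonzero word probabilities, and the posterior p(y = 1 | x) is well defined.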