Tackling the Poor Assumptions of Naive Bayes Text Classifiers

Published by: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David R. Karger
Liang Lan
11/19/2007
Outline
- Introduce the Multinomial Naive Bayes Model for Text Classification
- The Poor Assumptions of the Multinomial Naive Bayes Model
- Solutions to some problems of the Naive Bayes Classifier

Multinomial Naive Bayes Model for Text Classification

Given:
- A description of the document d: f = (f_1, ..., f_n), where f_i is the frequency count of word i occurring in document d.
- A fixed set of classes C = {1, 2, ..., m} and a parameter vector for each class. The parameter vector for class c is θ_c = (θ_c1, ..., θ_cn), where θ_ci is the probability that word i occurs in class c.

Determine:
- The class label of d.
Introduce the Multinomial Naive Bayes Model for Text Classification

- The likelihood of a document is a product of the parameters of the words that appear in the document.
- Classification selects the class with the largest posterior probability (both formulas are sketched below).
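
A sketch of the two formulas behind these bullets, written in LaTeX; this is the standard multinomial naive Bayes form, using the notation defined above:

p(d \mid \theta_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i \theta_{ci}^{f_i}
\qquad
l(d) = \arg\max_c \Big[ \log p(\theta_c) + \sum_i f_i \log \theta_{ci} \Big]

where p(\theta_c) is the prior probability of class c.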
Parameter Estimation for the Naive Bayes Model

The parameters θ_ci must be estimated from the training data.

Then we get the MNB classifier. For simplicity, we use a uniform class prior:

l_MNB(d) = argmax_c Σ_i f_i w_ci,   where w_ci = log θ̂_ci
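
A minimal Python (NumPy) sketch of the estimation and the classifier, assuming a document-term count matrix X and add-one smoothing (corresponding to the Dirichlet prior α_i = 1 used in the paper); the function names are mine:

import numpy as np

def train_mnb(X, y, n_classes, alpha=1.0):
    # X: (n_docs, n_words) term-frequency counts; y: labels in {0, ..., n_classes-1}
    n_words = X.shape[1]
    W = np.zeros((n_classes, n_words))
    for c in range(n_classes):
        N_ci = X[y == c].sum(axis=0) + alpha      # smoothed count of word i in class c
        W[c] = np.log(N_ci / N_ci.sum())          # w_ci = log(theta_ci)
    return W

def predict_mnb(f, W):
    # l_MNB(d) = argmax_c sum_i f_i * w_ci, with a uniform class prior
    return int(np.argmax(W @ f))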

The Poor Assumptions of the Multinomial Naive Bayes Model

Two systemic errors (occurring in any naive Bayes classifier):
1. Skewed data bias (caused by uneven training set sizes)
2. Weight magnitude errors (caused by the independence assumption)

In addition, the multinomial distribution does not model text well.

Correcting the Skewed Data Bias

Having more training examples for one class than another can cause the classifier to prefer one class over the other.

Solution: use Complement Naive Bayes (CNB), which estimates each class's parameters from the complement of that class (a sketch follows below).

Ñ_ci is the number of times word i occurred in documents in classes other than c.
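
A sketch of the complement estimate under the same assumptions as the MNB code above (add-one smoothing; the argmin decision rule is my reading of picking the class whose complement matches the document least):

import numpy as np

def train_cnb(X, y, n_classes, alpha=1.0):
    # Estimate each class's weights from all documents NOT in that class.
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        N_comp = X[y != c].sum(axis=0) + alpha    # complement count of word i for class c
        W[c] = np.log(N_comp / N_comp.sum())
    return W

def predict_cnb(f, W):
    # argmax_c [ -sum_i f_i * w_ci ], equivalently the argmin of the complement score
    return int(np.argmin(W @ f))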
Correcting the Weight Magnitude Errors

These errors are caused by the independence assumption: for example, "San Francisco" contributes two dependent words of evidence while "Boston" contributes only one, so classes with many such dependent terms get disproportionately large weight.

Solution: normalize the weight vectors. We call this Weight-normalized Complement Naive Bayes (WCNB).
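
A one-line sketch of the normalization, as I read it from the paper (divide each class's weight vector by the sum of its absolute values):

import numpy as np

def normalize_weights(W):
    # w_ci <- w_ci / sum_k |w_ck|, so no class dominates merely by weight magnitude
    return W / np.abs(W).sum(axis=1, keepdims=True)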
Modeling Text Better
- Transforming Term Frequency
- Transforming by Document Frequency
- Transforming Based on Length

Transforming Term Frequency

The empirical term distribution has heavier tails than the multinomial model predicts, looking more like a power-law distribution, whose probability is proportional to (d + f_i)^(log θ_i).

So we can use the multinomial model to generate probabilities proportional to this class of power-law distributions via a simple transform, f_i' = log(d + f_i) with d = 1, since θ_i^(log(d + f_i)) = (d + f_i)^(log θ_i).
Transforming by Document Frequency

Common words are unlikely to be related to the class of a document, but random variations can create apparently fictitious correlations.

Solution: discount the weight of common words using inverse document frequency (a common IR transform), which discounts terms by their document frequency.
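
A sketch of the IDF discount; X is the (documents x words) count matrix, and the plain log(N / df) form is my assumption about the exact variant:

import numpy as np

def idf_transform(X):
    # f_i <- f_i * log(#docs / #docs containing word i)
    n_docs = X.shape[0]
    doc_freq = (X > 0).sum(axis=0)                    # documents containing each word
    idf = np.log(n_docs / np.maximum(doc_freq, 1))    # guard against unseen words
    return X * idf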

Transforming Based on Length

The jump to larger term frequencies is disproportionately large for longer documents.

Solution: discount the influence of long documents by normalizing the term-frequency vector: f_i' = f_i / sqrt(Σ_k f_k^2).
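
A sketch of the length normalization (per-document Euclidean normalization; my reading of the paper's transform):

import numpy as np

def length_normalize(X):
    # f_i <- f_i / sqrt(sum_k f_k^2), computed per document (per row)
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return X / np.maximum(norms, 1e-12)               # guard against empty documents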

The New Naive Bayes Procedure

Combining the three text transforms with CNB and weight normalization gives TWCNB (Transformed Weight-normalized Complement Naive Bayes). Experiments comparing MNB with TWCNB and the SVM show that TWCNB performs substantially better than MNB and approaches the SVM's performance.
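
Putting the pieces together, a compact sketch of the whole TWCNB procedure under the assumptions above (count matrix in, class label out); this mirrors the paper's recipe, but the code itself is mine, not the authors':

import numpy as np

def train_twcnb(X, y, n_classes, alpha=1.0):
    # 1) Transform term frequencies: log(1 + f), IDF discount, length normalization
    X = np.log(1.0 + X)
    doc_freq = (X > 0).sum(axis=0)
    X = X * np.log(X.shape[0] / np.maximum(doc_freq, 1))
    X = X / np.maximum(np.sqrt((X ** 2).sum(axis=1, keepdims=True)), 1e-12)
    # 2) Complement estimates: counts come from all classes other than c
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        N_comp = X[y != c].sum(axis=0) + alpha
        W[c] = np.log(N_comp / N_comp.sum())
    # 3) Weight normalization
    return W / np.abs(W).sum(axis=1, keepdims=True)

def classify_twcnb(f, W):
    # f: the test document's term-frequency vector, passed through the same
    # three transforms; assign the class whose complement weights match least.
    return int(np.argmin(W @ f))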