Department of Computer Science and Information Systems

Information Retrieval
For the MSc Computer Science Programme
Lecture 4
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 13
Dell Zhang
Birkbeck, University of London
Is this spam?
Text Classification/Categorization

Given:
- A document, d ∈ D.
- A set of classes C = {c1, c2, …, cn}.

Determine:
- The class of d: c(d) ∈ C, where c(d) is a classification function ("classifier").
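As a minimal illustrative sketch (Python; the type names and example classes below are assumptions, not from the lecture), the task amounts to a function from documents to classes:

from typing import Callable, Set

Document = str                          # a document d, represented here simply by its text
classes: Set[str] = {"spam", "ham"}     # the set of classes C (example classes, invented)

# A classifier is any function c(d) mapping a document to one of the classes in C.
Classifier = Callable[[Document], str]

always_ham: Classifier = lambda d: "ham"   # a (useless) example classifier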
Classification Methods (1)

Manual Classification

- For example: Yahoo! Directory, DMOZ, Medline, etc.
- Very accurate when the job is done by experts.
- Difficult to scale up.
Classification Methods (2)

Hand-Coded Rules

- For example: CIA, Reuters, SpamAssassin, etc.
- Accuracy is often quite high, if the rules have been carefully refined over time by experts.
- Expensive to build/maintain the rules.
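As a rough, invented illustration of this approach (not an actual SpamAssassin rule set), a hand-coded rule classifier might look like the following; the phrases and threshold are made up:

# Toy hand-coded rule classifier (phrases and threshold invented for illustration).
SPAM_PHRASES = ["free money", "click here", "you have won", "limited offer"]

def rule_based_filter(text: str) -> str:
    t = text.lower()
    hits = sum(phrase in t for phrase in SPAM_PHRASES)
    # Hand-tuned rule: two suspicious phrases, or an all-caps "URGENT", flags the message.
    if hits >= 2 or "URGENT" in text:
        return "spam"
    return "ham"

print(rule_based_filter("You have won! Click here for free money"))   # -> 'spam'

Refining such rules is exactly the expensive expert work the slide refers to.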
Classification Methods (3)

Machine Learning (ML)

For example:
- Automatic Email Classification: PopFile (http://popfile.sourceforge.net/)
- Automatic Webpage Classification: MindSet (http://mindset.research.yahoo.com/)

There is no free lunch: hand-classified training data are required.
But the training data can be built up (and refined) easily by amateurs.
Text Classification via ML
[Diagram: in the Learning phase, a classifier is learned from the training documents (L); in the Predicting phase, the learned classifier assigns classes to the test documents (U).]
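A minimal sketch of this learn-then-predict pipeline (illustrative Python; the trivial majority-class learner below is just a placeholder for a real learning algorithm such as Naïve Bayes):

from collections import Counter
from typing import Callable, List, Tuple

def train_classifier(training_docs: List[Tuple[str, str]]) -> Callable[[str], str]:
    """Learn a classifier from labelled (document, class) pairs.
    Placeholder learner: always predict the most frequent training class."""
    majority = Counter(label for _, label in training_docs).most_common(1)[0][0]
    return lambda doc: majority

# Learning phase: labelled training documents -> classifier.
classifier = train_classifier([("cheap meds now", "spam"),
                               ("meeting at 10am", "ham"),
                               ("agenda attached", "ham")])

# Predicting phase: the classifier labels unseen test documents.
print(classifier("lunch tomorrow?"))   # -> 'ham'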
Text Classification via ML - Example
Test data: a document containing the terms "planning language proof intelligence".

Classes (grouped into the broader areas AI, Programming, and HCI), each shown with sample terms from its training documents:
- ML: learning, intelligence, algorithm, reinforcement, network, ...
- Planning: planning, temporal, reasoning, plan, language, ...
- Semantics: programming, semantics, language, proof, ...
- Garb.Coll.: garbage, collection, memory, optimization, region, ...
- Multimedia: ...
- GUI: ...

The classifier learned from the training data must assign the test document to one of these classes.
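Purely as an intuition-building sketch (not the probabilistic method developed below), one could score each class by how many of its sample training terms occur in the test document; the term sets are taken from the example above:

# Score classes by term overlap with the test document (intuition only).
class_terms = {
    "ML":         {"learning", "intelligence", "algorithm", "reinforcement", "network"},
    "Planning":   {"planning", "temporal", "reasoning", "plan", "language"},
    "Semantics":  {"programming", "semantics", "language", "proof"},
    "Garb.Coll.": {"garbage", "collection", "memory", "optimization", "region"},
}
test_doc = "planning language proof intelligence".split()
scores = {c: sum(w in terms for w in test_doc) for c, terms in class_terms.items()}
print(scores)   # {'ML': 1, 'Planning': 2, 'Semantics': 2, 'Garb.Coll.': 0}

The tie between Planning and Semantics is exactly why a principled, probability-weighted method is needed.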
Evaluating Classification

Classification Accuracy
- The proportion of correct predictions.

Precision, Recall and F1 (for each class)
- Macro-averaging: computes the performance measure for each class, and then computes a simple average over the classes.
- Micro-averaging: pools per-document predictions across classes, and then computes the performance measure on the pooled contingency table.
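An illustrative sketch of the two averaging schemes (plain Python; the per-class contingency counts are invented):

# Macro vs. micro averaging of F1 from per-class (TP, FP, FN) counts (counts invented).
counts = {"spam": (40, 10, 5), "ham": (90, 5, 10), "promo": (10, 2, 20)}

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Macro-averaging: F1 per class, then a simple average over classes.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Micro-averaging: pool the contingency tables, then compute one F1.
TP, FP, FN = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = prf(TP, FP, FN)[2]

print(round(macro_f1, 3), round(micro_f1, 3))

Micro-averaging is dominated by the large classes, whereas macro-averaging weights every class equally.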
Sample Learning Curve
[Figure: sample learning curve on the Yahoo Science data, showing classification performance as a function of training-set size.]
Bayesian Methods for Classification

Before seeing the content of document d:
- Classify d to the class with maximum prior probability Pr[c].
- For each class cj ∈ C, Pr[cj] could be estimated from the training data:

      Pr[cj] = Nj / Σj' Nj'

  where Nj is the number of documents in class cj.
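A small sketch of this estimate (Python; the label counts are invented):

# Estimating class priors Pr[cj] = Nj / sum_j Nj from training labels (counts invented).
from collections import Counter

train_labels = ["spam"] * 30 + ["ham"] * 70
N = Counter(train_labels)
priors = {c: N[c] / sum(N.values()) for c in N}
print(priors)   # {'spam': 0.3, 'ham': 0.7}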
Bayesian Methods for Classification

After seeing the content of document d:
- Classify d to the class with maximum a posteriori probability Pr[c|d].
- For each class cj ∈ C, Pr[cj|d] could be computed by Bayes' Theorem.
Bayes’ Theorem
      Pr[c|d] = Pr[d|c] Pr[c] / Pr[d]

- Pr[c|d]  the posterior probability
- Pr[c]    the prior probability
- Pr[d|c]  the class-conditional probability
- Pr[d]    a constant (the same for every class)
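A worked numeric illustration (all numbers invented): suppose 30% of messages are spam, and the word "free" appears in 50% of spam messages but only 5% of non-spam messages; then for a message containing "free":

# Bayes' Theorem with invented numbers: Pr[spam | 'free'] = Pr['free' | spam] Pr[spam] / Pr['free'].
p_spam, p_ham = 0.3, 0.7                            # priors Pr[c]
p_free_spam, p_free_ham = 0.5, 0.05                 # class-conditional Pr[d | c]

p_free = p_free_spam * p_spam + p_free_ham * p_ham  # Pr[d], by total probability
p_spam_given_free = p_free_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.811: seeing 'free' raises Pr[spam] from 0.3 to about 0.81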
Naïve Bayes: Classification
      c(d) = argmax_{cj ∈ C} Pr[cj | d]
           = argmax_{cj ∈ C} Pr[d | cj] Pr[cj] / Pr[d]
           = argmax_{cj ∈ C} Pr[d | cj] Pr[cj]        (as Pr[d] is a constant)

How can we compute Pr[d|cj] ?
Naive Bayes Assumptions

To facilitate the computation of Pr[d|cj], two simplifying assumptions are made.

- Conditional Independence Assumption: given the document's topic, the word in one position tells us nothing about the words in other positions.
- Positional Independence Assumption: each document is treated as a bag of words; the occurrence of a word does not depend on its position.

Then Pr[d|cj] is given by the class-specific unigram language model (essentially a multinomial distribution).
Unigram Language Model
      Pr[d | cj] = Π_{wi ∈ d} Pr[wi | cj]

Example model for class cj:

      the     0.2
      a       0.1
      man     0.01
      woman   0.01
      said    0.03
      likes   0.02
      ...

For the document "the man likes the woman", multiply the probabilities of its terms:

      Pr[d | cj] = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
Naïve Bayes: Learning

Given the training data:

      for each class cj ∈ C:
          estimate Pr[cj] (as before)
          for each term wi in the vocabulary V:
              estimate Pr[wi | cj]

      Pr[wi | cj] = (Tji + 1) / Σi' (Tji' + 1)

where Tji is the number of occurrences of term wi in documents of class cj.
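A compact sketch of this learning step (Python; the toy labelled corpus and variable names are invented for illustration):

# Naive Bayes learning: class priors plus add-one-smoothed term probabilities.
from collections import Counter, defaultdict

training_docs = [("free money now", "spam"), ("meeting agenda attached", "ham"),
                 ("free tickets click now", "spam"), ("lunch meeting tomorrow", "ham")]

N = Counter(label for _, label in training_docs)
priors = {c: N[c] / len(training_docs) for c in N}        # Pr[cj] (as before)

T = defaultdict(Counter)                                  # Tji: occurrences of term wi in class cj
for text, label in training_docs:
    T[label].update(text.split())

V = {w for counter in T.values() for w in counter}        # vocabulary
cond_prob = {c: {w: (T[c][w] + 1) / (sum(T[c].values()) + len(V)) for w in V} for c in N}

print(priors["spam"], round(cond_prob["spam"]["free"], 3))   # 0.5 and (2+1)/(7+10) ≈ 0.176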
Smoothing

Why not just use the maximum-likelihood estimate (MLE)?

      Pr[wi | cj] = Tji / Σi' Tji'

If a term w (in a test doc d) did not occur in the training data, Pr[w|cj] would be 0, and then Pr[d|cj] would be 0 no matter how strongly other terms in d are associated with class cj.

Add-One (Laplace) Smoothing:

      Pr[wi | cj] = (Tji + 1) / Σi' (Tji' + 1)
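A tiny illustration of the problem and the fix (Python; the counts and vocabulary size are invented):

# MLE gives an unseen term probability 0; add-one smoothing gives a small non-zero value.
T_spam = {"free": 2, "money": 1, "now": 2}   # Tji for class 'spam'; 'prize' never occurred
V_size = 10                                  # vocabulary size |V|
total = sum(T_spam.values())

mle = T_spam.get("prize", 0) / total                         # 0.0 -> Pr[d|spam] collapses to 0
laplace = (T_spam.get("prize", 0) + 1) / (total + V_size)    # 1/15, small but non-zero
print(mle, laplace)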
Naïve Bayes is Not So Naïve

Fairly Effective
- It is the Bayes-optimal classifier if the independence assumptions do hold.
- Often performs well even if the independence assumptions are badly violated.
- Usually yields highly accurate classification (though the estimated probabilities are not so accurate).
- Took the 1st and 2nd places in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
- A good, dependable baseline for text classification (though not the best).
Naïve Bayes is Not So Naïve

Very Efficient
- Linear time complexity for learning/classification.
- Low storage requirements.
Take Home Messages



- Text Classification via Machine Learning
- Bayes' Theorem:

      Pr[c|d] = Pr[d|c] Pr[c] / Pr[d]

- Naïve Bayes
  - Learning:

        Pr[cj] = Nj / Σj' Nj'
        Pr[wi | cj] = (Tji + 1) / Σi' (Tji' + 1)

  - Classification:

        c(d) = argmax_{cj ∈ C} ( Π_{wi ∈ d} Pr[wi | cj] ) Pr[cj]
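To tie the take-home formulas together, a self-contained end-to-end sketch (Python; the toy corpus is invented, and log probabilities are used when taking the argmax to avoid underflow):

# Naïve Bayes end to end: estimate Pr[cj] and Pr[wi|cj], then classify by argmax.
import math
from collections import Counter, defaultdict

train = [("free money now", "spam"), ("meeting agenda attached", "ham"),
         ("free tickets click now", "spam"), ("lunch meeting tomorrow", "ham")]

N = Counter(c for _, c in train)
priors = {c: N[c] / len(train) for c in N}                    # Pr[cj] = Nj / sum_j Nj

T = defaultdict(Counter)                                      # Tji
for text, c in train:
    T[c].update(text.split())
V = {w for counter in T.values() for w in counter}

def cond_prob(w, c):                                          # Pr[wi|cj], add-one smoothed
    return (T[c][w] + 1) / (sum(T[c].values()) + len(V))

def classify(doc):                                            # c(d) = argmax_cj Pr[cj] * prod Pr[wi|cj]
    words = [w for w in doc.split() if w in V]                # terms outside V are simply ignored here
    return max(priors, key=lambda c: math.log(priors[c]) + sum(math.log(cond_prob(w, c)) for w in words))

print(classify("free lunch now"))   # -> 'spam'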