Information Retrieval
For the MSc Computer Science Programme
Lecture 4
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 13
Dell Zhang
Birkbeck, University of London
Is this spam?
Text Classification/Categorization
Given:
A document d ∈ D.
A set of classes C = {c1, c2, …, cn}.
Determine:
The class of d: c(d) ∈ C, where c(d) is a classification function (“classifier”).
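As a minimal sketch of this setting (in Python; the type names and class labels below are illustrative, not part of the lecture), the goal is to learn a classification function mapping documents to classes:

```python
from typing import Callable, Set

Document = str        # a document d ∈ D, represented here simply by its text
ClassLabel = str      # a class c ∈ C

# The classification function ("classifier") c : D -> C that is to be learned.
Classifier = Callable[[Document], ClassLabel]

# A possible set of classes C = {c1, c2, ..., cn}, echoing the example later in the lecture.
classes: Set[ClassLabel] = {"AI", "Programming", "HCI"}
```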
Classification Methods (1)
Manual Classification
For example,
Yahoo! Directory, DMOZ, Medline, etc.
Very accurate when job is done by experts.
Difficult to scale up.
Classification Methods (2)
Hand-Coded Rules
For example,
CIA, Reuters, SpamAssassin, etc.
Accuracy is often quite high, if the rules have
been carefully refined over time by experts.
Expensive to build/maintain the rules.
Classification Methods (3)
Machine Learning (ML)
For example
Automatic Email Classification: PopFile
Automatic Webpage Classification: MindSet
http://popfile.sourceforge.net/
http://mindset.research.yahoo.com/
There is no free lunch: hand-classified training
data are required.
But the training data can be built up (and refined)
easily by amateurs.
Text Classification via ML
[Diagram: training documents feed a learning step (L) that produces a classifier; the classifier is then used for predicting the classes of test documents (U).]
Text Classification via ML - Example
Test document: “planning language proof intelligence”
Classes (grouped under the areas AI, Programming, and HCI), each shown with sample terms from its training documents:
ML: learning, intelligence, algorithm, reinforcement, network, ...
Planning: planning, temporal, reasoning, plan, language, ...
Semantics: programming, semantics, language, proof, ...
Garb.Coll.: garbage, collection, memory, optimization, region, ...
Multimedia: ...
GUI: ...
Evaluating Classification
Classification Accuracy
The proportion of correct predictions
Precision, Recall, and F1 (for each class)
Macro-averaging: compute the performance measure for each class, then take a simple average over the classes.
Micro-averaging: pool the per-document predictions across classes, then compute the performance measure on the pooled contingency table.
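As a hedged sketch of how the two averages differ in practice (assuming scikit-learn is available; the labels below are made up), the function f1_score supports both modes directly:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions over three classes.
y_true = ["AI", "AI", "Programming", "HCI", "HCI", "HCI"]
y_pred = ["AI", "Programming", "Programming", "HCI", "HCI", "AI"]

# Macro-averaging: compute F1 per class, then take a simple (unweighted) mean.
print(f1_score(y_true, y_pred, average="macro"))

# Micro-averaging: pool all per-document decisions, then compute F1 once.
print(f1_score(y_true, y_pred, average="micro"))
```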
Sample Learning Curve
[Figure: learning curve on the Yahoo Science data.]
Bayesian Methods for Classification
Before seeing the content of document d
Classify d to the class with maximum prior
probability Pr[c].
For each class cj ∈ C, Pr[cj] can be estimated from the training data:
Pr[cj] = Nj / Σj' Nj'
where Nj is the number of training documents in class cj.
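A minimal sketch of this prior estimate (in Python; the training labels below are hypothetical):

```python
from collections import Counter

# One label per training document (hypothetical data).
train_labels = ["AI", "AI", "Programming", "AI", "HCI"]

counts = Counter(train_labels)                        # N_j for each class c_j
total = sum(counts.values())                          # sum over j' of N_j'
priors = {c: n / total for c, n in counts.items()}    # Pr[c_j] = N_j / sum_j' N_j'
print(priors)   # {'AI': 0.6, 'Programming': 0.2, 'HCI': 0.2}
```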
Bayesian Methods for Classification
After seeing the content of document d
Classify d to the class with maximum a posteriori probability Pr[c|d].
For each class cj ∈ C, Pr[cj|d] can be computed by Bayes’ Theorem.
Bayes’ Theorem
Pr[c|d] = Pr[d|c] Pr[c] / Pr[d]
Pr[c|d]: the posterior probability
Pr[c]: the prior probability
Pr[d|c]: the class-conditional probability
Pr[d]: a constant
Naïve Bayes: Classification
c(d) = argmax_{cj ∈ C} Pr[cj|d]
     = argmax_{cj ∈ C} Pr[d|cj] Pr[cj] / Pr[d]
     = argmax_{cj ∈ C} Pr[d|cj] Pr[cj]   (as Pr[d] is a constant)
How can we compute Pr[d|cj]?
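Before turning to Pr[d|cj], here is a minimal sketch of the decision rule itself (in Python; the probability values are made up for illustration):

```python
# Hypothetical class priors Pr[c_j] and likelihoods Pr[d|c_j] for one document d.
priors = {"AI": 0.5, "Programming": 0.3, "HCI": 0.2}
likelihoods = {"AI": 8e-8, "Programming": 2e-8, "HCI": 1e-9}

# c(d) = argmax_{c_j} Pr[d|c_j] * Pr[c_j]; Pr[d] is dropped as it is the same for all classes.
c_of_d = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(c_of_d)   # 'AI'
```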
Naive Bayes Assumptions
To facilitate the computation of Pr[d|cj], two
simplifying assumptions are made.
Conditional Independence Assumption
Given the document’s topic, the word appearing in one position tells us nothing about the words in other positions.
Positional Independence Assumption
Each document is treated as a bag of words: the occurrence of a word does not depend on its position.
Then Pr[d|cj] is given by the class-specific
unigram language model
Essentially a multinomial distribution.
Unigram Language Model
Pr[d|cj] = Π_{wi ∈ d} Pr[wi|cj]

Model for cj (term probabilities):
the     0.2
a       0.1
man     0.01
woman   0.01
said    0.03
likes   0.02
…

Example document: “the man likes the woman”
Pr[d|cj] = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
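The same product can be reproduced with a few lines of Python (a sketch using the toy model above):

```python
import math

# Toy unigram language model for class c_j, copied from the table above.
model_cj = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

doc = "the man likes the woman".split()

# Pr[d|c_j] is the product of the per-word probabilities.
prob = math.prod(model_cj[w] for w in doc)
print(prob)   # about 8e-08, i.e. 0.00000008
```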
Naïve Bayes: Learning
Given the training data
for each class cj ∈ C:
    estimate Pr[cj] (as before)
    for each term wi in the vocabulary V:
        estimate Pr[wi|cj] = (Tji + 1) / Σi' (Tji' + 1)
Tji: the number of occurrences of term wi in documents of class cj
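A sketch of this learning loop (in Python, over a hypothetical toy training set; the vocabulary V is taken from the training documents themselves):

```python
from collections import Counter, defaultdict

# Hypothetical labelled training documents (text, class), echoing the earlier example terms.
train = [("learning intelligence algorithm", "ML"),
         ("planning temporal reasoning plan", "Planning"),
         ("programming semantics language proof", "Semantics")]

vocab = {w for text, _ in train for w in text.split()}        # the vocabulary V
doc_counts, term_counts = Counter(), defaultdict(Counter)     # N_j and T_ji
for text, c in train:
    doc_counts[c] += 1
    term_counts[c].update(text.split())

priors = {c: n / len(train) for c, n in doc_counts.items()}   # Pr[c_j], as before
cond = {c: {w: (term_counts[c][w] + 1) / (sum(term_counts[c].values()) + len(vocab))
            for w in vocab}
        for c in doc_counts}                                  # Pr[w_i|c_j], add-one smoothed
```

Here Σi' (Tji' + 1) is written as Σi' Tji' + |V|, which is the same quantity.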
Smoothing
Why not just use the MLE?
Pr[wi|cj] = Tji / Σi' Tji'
If a term w (in a test doc d) did not occur in the training data,
Pr[w|cj] would be 0, and then Pr[d|cj] would be 0 no matter
how strongly other terms in d are associated with class cj.
Add-One (Laplace) Smoothing
Pr[wi|cj] = (Tji + 1) / Σi' (Tji' + 1)
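A small numeric sketch of why this matters (hypothetical counts for one class over a four-term vocabulary):

```python
import math

# Hypothetical term counts T_ji for class c_j; "proof" never occurred in its training docs.
counts = {"planning": 3, "temporal": 2, "reasoning": 1, "proof": 0}
total, V = sum(counts.values()), len(counts)

mle = {w: t / total for w, t in counts.items()}                   # MLE: Pr["proof"|c_j] = 0
laplace = {w: (t + 1) / (total + V) for w, t in counts.items()}   # add-one: Pr["proof"|c_j] = 0.1

doc = ["planning", "proof"]
print(math.prod(mle[w] for w in doc))       # 0.0   -> the whole Pr[d|c_j] collapses
print(math.prod(laplace[w] for w in doc))   # 0.04  -> the other evidence still counts
```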
Naïve Bayes is Not So Naïve
Fairly Effective
The Bayes optimal classifier if the independence
assumptions do hold.
Often performs well even if the independence
assumptions are badly violated.
Usually yields highly accurate classification
(though the estimated probabilities are not so
accurate).
The 1st & 2nd place in KDD-CUP 97 competition,
among 16 (then) state-of-the-art algorithms.
A good dependable baseline for text classification
(though not the best).
Naïve Bayes is Not So Naïve
Very Efficient
Linear time complexity for learning/classification.
Low storage requirements.
Take Home Messages
Text Classification via Machine Learning
Bayes’ Theorem
Pr[c|d] = Pr[d|c] Pr[c] / Pr[d]
Naïve Bayes
Learning:
Pr[cj] = Nj / Σj' Nj'
Pr[wi|cj] = (Tji + 1) / Σi' (Tji' + 1)
Classification:
c(d) = argmax_{cj ∈ C} Pr[cj] Π_{wi ∈ d} Pr[wi|cj]
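To tie the take-home formulas together, here is a self-contained sketch of a multinomial Naïve Bayes classifier in Python (the toy training data are illustrative, not from the lecture; test words never seen in training are simply skipped):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, class) pairs. Returns priors Pr[c_j], smoothed Pr[w_i|c_j], vocabulary."""
    vocab = {w for text, _ in docs for w in text.split()}
    doc_counts, term_counts = Counter(), defaultdict(Counter)
    for text, c in docs:
        doc_counts[c] += 1
        term_counts[c].update(text.split())
    priors = {c: n / len(docs) for c, n in doc_counts.items()}
    cond = {c: {w: (term_counts[c][w] + 1) / (sum(term_counts[c].values()) + len(vocab))
                for w in vocab}
            for c in doc_counts}          # add-one (Laplace) smoothing
    return priors, cond, vocab

def classify_nb(text, priors, cond, vocab):
    """c(d) = argmax_c Pr[c] * prod_{w in d} Pr[w|c], computed in log space to avoid underflow."""
    def score(c):
        s = math.log(priors[c])
        for w in text.split():
            if w in vocab:                # words never seen in training are skipped here
                s += math.log(cond[c][w])
        return s
    return max(priors, key=score)

# Toy usage, echoing the lecture's example terms.
train = [("learning intelligence algorithm reinforcement network", "ML"),
         ("planning temporal reasoning plan language", "Planning"),
         ("programming semantics language proof", "Semantics")]
priors, cond, vocab = train_nb(train)
print(classify_nb("planning language proof intelligence", priors, cond, vocab))
# prints 'Semantics' for this toy one-document-per-class training set
```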