Information Retrieval
For the MSc Computer Science Programme
Lecture 6
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 16
Dell Zhang
Birkbeck, University of London
What is text clustering?
Text clustering – grouping a set of documents
into classes of similar documents.
Classification vs. Clustering
Classification: supervised learning
Labeled data are given for training
Clustering: unsupervised learning
Only unlabeled data are available
Why text clustering?
To improve user interface
Navigation/analysis of corpus or search results,
e.g., http://clusty.com/
To improve recall
Cluster docs in corpus a priori. When a query
matches a doc d, also return other docs in the
cluster containing d. The hope is that the query
“car” will then also return docs containing “automobile”.
To improve retrieval speed
Cluster pruning.
What makes a clustering good?
External criteria
Consistent with the latent classes in gold standard
(ground truth) data.
Internal criteria
High intra-cluster similarity
Low inter-cluster similarity
Issues for Clustering
Similarity between docs
Ideal: semantic similarity.
Practical: statistical similarity, e.g., cosine.
Number of clusters
Fixed, e.g., kMeans.
Flexible, e.g., Single-Link HAC.
Structure of clusters
Flat partition, e.g., kMeans.
Hierarchical tree, e.g., Single-Link HAC.
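The practical similarity measure mentioned above, cosine, can be sketched in a few lines of Python (a minimal illustration, not code from the slides):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_x * norm_y)
```

Vectors pointing in the same direction score 1.0 regardless of length; orthogonal vectors (documents sharing no terms) score 0.0.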
kMeans Algorithm
Pick k docs {s1, s2,…,sk} randomly as seeds.
Repeat until clustering converges
(or other stopping criterion):
For each doc di :
Assign di to cluster cj such that sim(di, sj) is maximal.
c(d) = argmax_{c_j ∈ C} sim(d, s_j)
For each cluster cj :
Update sj to the centroid (mean) of cluster cj.
s_j = (1/|c_j|) Σ_{d ∈ c_j} d
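The algorithm above translates directly into code. The following is a minimal Python sketch (names and details are my own, not from the slides), using cosine similarity for the assignment step and the cluster mean for the update step:

```python
import math
import random

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def kmeans(docs, k, max_iter=100, seed=0):
    """Flat kMeans clustering of doc vectors; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Pick k docs randomly as seeds.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = [0] * len(docs)
    for _ in range(max_iter):
        # Assignment step: each doc goes to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda j: cosine_sim(d, centroids[j]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: clusters no longer change
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids
```

For example, clustering four 2-D vectors, two near the x-axis and two near the y-axis, with k = 2 separates the two pairs.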
kMeans – Example
(k = 2)
[Figure: kMeans iterations on a 2-D point set — pick seeds;
reassign clusters; compute centroids; reassign clusters;
compute centroids; reassign clusters; converged!]
kMeans – Online Demo
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Convergence
kMeans is guaranteed to converge, i.e., to reach a
state in which the clusters no longer change.
kMeans usually converges quickly, i.e., the
number of iterations is small in most cases.
Seeds
Problem
Results can vary because of random seed selections.
Some seeds can result in a poor convergence rate, or
convergence to a sub-optimal clustering.
Example showing sensitivity to seeds:
if you start with B and E as centroids,
you converge to {A,B,C} and {D,E,F};
if you start with D and F,
you converge to {A,B,D,E} and {C,F}.
Solution
Try kMeans multiple times with
different random seed selections.
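The suggested solution — repeated runs with different random seed selections, keeping the best result — can be sketched as follows (an illustrative Python sketch, not from the slides; here each run is scored by its total within-cluster squared distance, lower being better):

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_once(docs, k, rng, max_iter=100):
    """One kMeans run from a random seed selection; returns (cost, assignment)."""
    centroids = [list(d) for d in rng.sample(docs, k)]
    for _ in range(max_iter):
        # Assign each doc to its nearest centroid.
        assignment = [min(range(k), key=lambda j: sq_dist(d, centroids[j]))
                      for d in docs]
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            new_centroids.append(
                [sum(v) / len(members) for v in zip(*members)] if members
                else centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    cost = sum(sq_dist(d, centroids[a]) for d, a in zip(docs, assignment))
    return cost, assignment

def kmeans_restarts(docs, k, n_restarts=10, seed=0):
    """Run kMeans n_restarts times and keep the lowest-cost clustering."""
    rng = random.Random(seed)
    return min(kmeans_once(docs, k, rng) for _ in range(n_restarts))
```

With enough restarts, at least one seed selection is likely to avoid the sub-optimal local minima that a single unlucky run can converge to.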
Take Home Message
kMeans
c(d) = argmax_{c_j ∈ C} sim(d, s_j)
s_j = (1/|c_j|) Σ_{d ∈ c_j} d