
Information Retrieval
For the MSc Computer Science Programme
Lecture 6
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 16
Dell Zhang
Birkbeck, University of London

What is text clustering?

Text clustering: grouping a set of documents into classes of similar documents.

Classification vs. Clustering

Classification: supervised learning
  Labeled data are given for training.
Clustering: unsupervised learning
  Only unlabeled data are available.

Why text clustering?

To improve the user interface
  Navigation/analysis of a corpus or of search results,
  e.g., http://clusty.com/
To improve recall
  Cluster docs in the corpus a priori. When a query matches a doc d,
  also return the other docs in the cluster containing d. The hope is
  that a query for "car" will then also return docs containing
  "automobile".
To improve retrieval speed
  Cluster pruning

What makes a clustering good?

External criteria
  Consistent with the latent classes in gold-standard (ground-truth)
  data; see the purity sketch below.
Internal criteria
  High intra-cluster similarity
  Low inter-cluster similarity
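
To make the external criteria concrete, here is a minimal sketch of purity, one standard external measure (IIR Chapter 16): each cluster is scored by its most frequent gold-standard class. The representation used here (clusters as lists of doc indices, labels as a parallel list) is an assumption made for this example.

```python
from collections import Counter

def purity(clusters, labels):
    """Purity: each cluster contributes the count of its most frequent
    gold-standard label; the total is divided by the number of docs.

    clusters -- list of clusters, each a list of doc indices (assumed)
    labels   -- gold-standard class label for each doc index (assumed)
    """
    majority_total = sum(
        Counter(labels[d] for d in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    n_docs = sum(len(cluster) for cluster in clusters)
    return majority_total / n_docs

# Made-up example: two clusters over six docs with gold labels A/B.
clusters = [[0, 1, 2], [3, 4, 5]]
labels = ["A", "A", "B", "B", "B", "A"]
print(purity(clusters, labels))  # (2 + 2) / 6 = 0.666...
```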

Issues for Clustering

Similarity between docs
  Ideal: semantic similarity
  Practical: statistical similarity, e.g., cosine (see the sketch below)
Number of clusters
  Fixed, e.g., kMeans
  Flexible, e.g., Single-Link HAC
Structure of clusters
  Flat partition, e.g., kMeans
  Hierarchical tree, e.g., Single-Link HAC
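
A minimal sketch of the cosine measure mentioned above, assuming docs are represented as sparse {term: weight} dicts; in practice the weights would come from a tf-idf weighted index, which this example glosses over.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two sparse term-weight vectors,
    represented as {term: weight} dicts (assumed representation)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up doc vectors: the two docs overlap only on "engine".
d1 = {"car": 2.0, "engine": 1.0}
d2 = {"automobile": 1.0, "engine": 1.0}
print(cosine_sim(d1, d2))  # ~0.316
```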

kMeans Algorithm

Pick k docs {s_1, s_2, ..., s_k} randomly as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
  For each doc d_i:
    Assign d_i to the cluster c_j such that sim(d_i, s_j) is maximal:
      c(d) = \arg\max_{c_j \in C} \mathrm{sim}(d, s_j)
  For each cluster c_j:
    Update s_j to the centroid (mean) of cluster c_j:
      s_j = \frac{1}{|c_j|} \sum_{d \in c_j} d
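
The following is a minimal runnable sketch of the algorithm above, not a reference implementation: it assumes docs are rows of a dense NumPy array and uses cosine similarity for sim(d, s_j); the tiny 2-D example at the end is made up for illustration.

```python
import numpy as np

def kmeans(docs, k, max_iter=100, seed=0):
    """Minimal kMeans over the rows of `docs` (an n_docs x n_terms
    array): assign each doc to the most cosine-similar centroid,
    then recompute each centroid as the mean of its cluster."""
    docs = np.asarray(docs, dtype=float)
    rng = np.random.default_rng(seed)
    # Pick k docs at random as the initial seeds s_1..s_k.
    centroids = docs[rng.choice(len(docs), size=k, replace=False)]
    # Unit-normalise docs once so a dot product gives cosine similarity.
    unit_docs = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-12)
    assign = np.full(len(docs), -1)
    for _ in range(max_iter):
        # Assignment step: c(d) = argmax_j sim(d, s_j).
        unit_cents = centroids / (np.linalg.norm(centroids, axis=1,
                                                 keepdims=True) + 1e-12)
        new_assign = (unit_docs @ unit_cents.T).argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no doc changed cluster: converged
        assign = new_assign
        # Update step: s_j = (1/|c_j|) * sum of the docs in cluster j.
        for j in range(k):
            members = docs[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assign, centroids

# Made-up 2-D example: two clear groups of points.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = kmeans(docs, k=2)
print(assignments)  # e.g., [0 0 1 1] (cluster ids depend on the seed)
```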

kMeans – Example (k = 2)

[Figure: animation of kMeans with k = 2. Steps: pick seeds; reassign
clusters; compute centroids; reassign clusters; compute centroids;
reassign clusters; converged.]

kMeans – Online Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Convergence

kMeans is guaranteed to converge, i.e., to reach a state in which
the clusters no longer change.
kMeans usually converges quickly, i.e., the number of iterations is
small in most cases.
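
Why convergence is guaranteed (the standard argument, given here for the Euclidean variant analysed in IIR Chapter 16): both steps monotonically decrease the residual sum of squares,

```latex
\mathrm{RSS} = \sum_{j=1}^{k} \sum_{d \in c_j} \lVert d - s_j \rVert^{2}
```

The reassignment step never increases RSS because each doc moves to a centroid at least as close; the update step never increases it because the mean minimises the sum of squared distances within a cluster. Since there are only finitely many partitions of the docs, the algorithm cannot cycle and must terminate.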

Seeds

Problem
  Results can vary because of random seed selection.
  Some seeds can result in a poor convergence rate, or in
  convergence to a sub-optimal clustering.

[Figure: example showing sensitivity to seeds, with six points A-F.]
  If you start with B and E as centroids, you converge to
  {A,B,C} and {D,E,F}.
  If you start with D and F, you converge to {A,B,D,E} and {C,F}.

Solution
  Run kMeans multiple times with different random seed selections
  and keep the best clustering (see the sketch below).
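
A sketch of this restart heuristic, reusing the kmeans() function from the algorithm slide; scoring runs by RSS is an assumption made here (any internal criterion would do).

```python
import numpy as np

def rss(docs, assign, centroids):
    """Residual sum of squares: total squared distance from each doc
    to the centroid of its cluster (lower is better)."""
    docs = np.asarray(docs, dtype=float)
    return sum(np.linalg.norm(docs[i] - centroids[j]) ** 2
               for i, j in enumerate(assign))

def kmeans_restarts(docs, k, n_runs=10):
    """Run kMeans once per seed and keep the clustering with the
    lowest RSS (assumes the kmeans() sketch defined earlier)."""
    best = None
    for seed in range(n_runs):
        assign, cents = kmeans(docs, k, seed=seed)  # defined earlier
        score = rss(docs, assign, cents)
        if best is None or score < best[0]:
            best = (score, assign, cents)
    return best[1], best[2]
```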

Take Home Message

kMeans
  c(d) = \arg\max_{c_j \in C} \mathrm{sim}(d, s_j)
  s_j = \frac{1}{|c_j|} \sum_{d \in c_j} d