
Clustering
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Objectives of Cluster Analysis
- Finding groups of objects such that the objects in a group are similar (or related) to one another, and different from (or unrelated to) the objects in other groups
- Competing objectives: intra-cluster distances are minimized, while inter-cluster distances are maximized
The most common form of unsupervised learning
Notion of similarity/distance
- Ideal: semantic similarity
- Practical: term-statistical similarity
  - Docs as vectors
  - Similarity = cosine similarity, LSH, …
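As a quick illustration of the practical notion of similarity, here is a minimal sketch (assuming documents are already represented as term-weight vectors in NumPy arrays; the numbers are made up):

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two document vectors."""
    return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))

# toy term-weight vectors for two documents
doc_a = np.array([0.0, 1.2, 0.0, 3.4])
doc_b = np.array([0.5, 1.0, 0.0, 2.0])
print(cosine_similarity(doc_a, doc_b))   # ≈ 0.97
```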
Clustering Algorithms
- Flat algorithms
  - Create a set of clusters
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
- Hierarchical algorithms
  - Create a hierarchy of clusters (dendrogram)
  - Bottom-up, agglomerative
  - Top-down, divisive
Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: each document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - News articles are a good example; search results are another
Flat & Partitioning Algorithms
- Given: a set of n documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: requires exhaustively enumerating all partitions, which is intractable for many objective functions
  - Locally optimal: effective heuristic methods, such as the K-means and K-medoids algorithms
Sec. 16.4
K-Means
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:

  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$

- Reassignment of instances to clusters is based on distance to the current cluster centroids.
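As a minimal illustration (assuming the documents of a cluster are stored as rows of a NumPy array), the centroid is just the column-wise mean:

```python
import numpy as np

# hypothetical toy cluster: 3 documents in a 4-dimensional term space
cluster = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [2.0, 2.0, 2.0, 0.0],
])

centroid = cluster.mean(axis=0)   # μ(c) = (1/|c|) Σ_{x∈c} x
print(centroid)                   # [1.  1.  2.  0.333...]
```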
Sec. 16.4
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seed centroids.
Until the clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cr such that dist(di, sr) is minimal.
  For each cluster cj:
    sj = μ(cj)
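A compact Python/NumPy sketch of the algorithm above (function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's K-means: X is an (N, M) matrix of document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random seed docs
    for _ in range(max_iters):
        # Reassignment: each doc goes to the closest centroid, O(KNM)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as the mean of each cluster, O(NM)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # centroids unchanged => converged
            break
        centroids = new_centroids
    return labels, centroids
```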
Sec. 16.4
K-means Example (K=2)
[Figure: two seeds are picked; the algorithm then alternates "reassign clusters" and "compute centroids" steps until it converges.]
Sec. 16.4
Termination conditions
- Several possibilities, e.g.:
  - A fixed number of iterations.
  - The doc partition is unchanged.
  - Centroid positions don't change.
Sec. 16.4
Time Complexity
- There are K centroids; each doc/centroid consists of M dimensions.
- Computing the distance between two vectors takes O(M) time.
- Reassigning clusters: each doc is compared with all centroids, O(KNM) time.
- Computing centroids: each doc gets added once to some centroid, O(NM) time.
- Assuming these two steps are each done once per iteration, for I iterations: O(IKNM) time overall.
How Many Clusters?
- The number of clusters K is given
  - Partition n docs into a predetermined number of clusters
- Finding the "right" number of clusters is part of the problem
- One can usually take an algorithm for one flavor and convert it to the other.
Bisecting K-means
Variant of K-means that can produce either a partitional or a hierarchical clustering: a cluster (e.g. the one with the largest SSE = Sum of Squared Error) is repeatedly picked and split in two with K-means (K = 2).

Bisecting K-means Example
[Figure: example run of bisecting K-means.]
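A hedged sketch of the bisecting strategy, reusing the `kmeans` sketch given earlier (choosing the cluster with the largest SSE to split is an assumption of this sketch, not prescribed by the slides):

```python
import numpy as np

def sse(X):
    """Sum of squared distances of the points in X to their centroid."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(X, K, seed=0):
    """Repeatedly split the cluster with the largest SSE until K clusters remain."""
    clusters = [np.arange(len(X))]                   # start with one cluster: all docs
    while len(clusters) < K:
        worst = max(range(len(clusters)), key=lambda i: sse(X[clusters[i]]))
        idx = clusters.pop(worst)
        labels, _ = kmeans(X[idx], K=2, seed=seed)   # bisect it with 2-means
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                                  # list of index arrays, one per cluster
```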
K-means
Pros
- Simple
- Fast for low-dimensional data
- Good performance on globular data

Cons
- K-means is restricted to data for which there is a notion of a center (centroid)
- K-means cannot handle non-globular clusters, or clusters of different sizes and densities
- K-means will not identify outliers
Ch. 17
Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:
  - animal
    - vertebrate: fish, reptile, amphib., mammal
    - invertebrate: worm, insect, crustacean
- One approach: recursive application of a partitional clustering algorithm
Strengths of Hierarchical Clustering
- No assumption of any particular number of clusters
- Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
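For instance (a sketch assuming SciPy is available; the data is made up), the dendrogram can be built once and then cut at different levels to obtain any number of clusters without re-clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                         # 20 toy "documents", 5 dimensions

Z = linkage(X, method="average")                     # build the dendrogram (average-link)
labels_2 = fcluster(Z, t=2, criterion="maxclust")    # cut it into 2 clusters...
labels_5 = fcluster(Z, t=5, criterion="maxclust")    # ...or into 5, from the same tree
```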
Sec. 17.1
Hierarchical Agglomerative Clustering (HAC)
- Starts with each doc in a separate cluster
- Then repeatedly merges the closest pair of clusters, until there is only one cluster
- The history of merges forms a binary tree or hierarchy
- The closest pair drives the merges; how is it defined?
Sec. 17.2
Closest pair of clusters
- Single-link
  - Similarity of the closest points, i.e. the most cosine-similar pair
- Complete-link
  - Similarity of the farthest points, i.e. the least cosine-similar pair
- Centroid
  - Clusters whose centroids are the closest (or most cosine-similar)
- Average-link
  - Clusters whose average distance (or cosine similarity) between pairs of elements is the best
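A small NumPy sketch of these four definitions, expressed as distances between two clusters A and B (the helper names are illustrative, not from the slides):

```python
import numpy as np

def pairwise_dists(A, B):
    """All Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    return pairwise_dists(A, B).min()    # closest pair (MIN)
def complete_link(A, B):  return pairwise_dists(A, B).max()    # farthest pair (MAX)
def centroid_link(A, B):  return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
def average_link(A, B):   return pairwise_dists(A, B).mean()   # group average
```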
How to Define Inter-Cluster Similarity
[Figure, repeated over several slides: points p1…p5 and their proximity matrix, illustrating in turn the four choices of inter-cluster similarity: MIN (single link), MAX (complete link), centroids, and group average.]