Clustering

• Clustering. Given a set U of n objects labeled p1, …, pn, classify them into coherent groups (e.g., photos, documents, micro-organisms).
• Distance function. Numeric value specifying "closeness" of two objects (e.g., the number of corresponding pixels whose intensities differ by more than some threshold).
• Fundamental problem. Divide into clusters so that points in different clusters are far apart.
– Routing in mobile ad hoc networks.
– Identify patterns in gene expression.
– Document categorization for web search.
– Similarity searching in medical image databases.
– Skycat: cluster 10^9 sky objects into stars, quasars, galaxies.
Clustering of Maximum Spacing
• k-clustering. Divide objects into k non-empty groups.
• Distance function. Assume it satisfies several natural properties.
– d(pi, pj) = 0 iff pi = pj   (identity of indiscernibles)
– d(pi, pj) ≥ 0   (nonnegativity)
– d(pi, pj) = d(pj, pi)   (symmetry)
• Spacing. Min distance between any pair of points in different clusters.
• Clustering of maximum spacing. Given an integer k, find a k-clustering of maximum spacing.
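To make the spacing definition concrete, here is a small Python sketch; the point set, the cluster labels, and the |x − y| distance function are invented for illustration:

```python
from itertools import combinations

def spacing(points, cluster_of, d):
    """Spacing of a clustering: the minimum distance d(p, q) over
    all pairs p, q that lie in different clusters."""
    return min(
        d(p, q)
        for p, q in combinations(points, 2)
        if cluster_of[p] != cluster_of[q]
    )

# Toy example: two well-separated groups of points on a line.
pts = [0, 1, 2, 10, 11, 12]
labels = {0: "A", 1: "A", 2: "A", 10: "B", 11: "B", 12: "B"}
print(spacing(pts, labels, lambda x, y: abs(x - y)))  # closest cross-pair is 2 and 10
```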
Greedy Clustering Algorithm
• Single-link k-clustering algorithm.
– Form a graph on the vertex set U, corresponding to n
clusters.
– Find the closest pair of objects such that each object
is in a different cluster, and add an edge between
them.
– Repeat n-k times until there are exactly k clusters.
• Key observation. This procedure is precisely Kruskal's
algorithm
(except we stop when there are k connected
components).
• Remark. Equivalent to finding an MST and deleting the
k-1 most expensive edges.
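The single-link procedure above can be sketched with a sorted edge list and a union-find structure, exactly as in Kruskal's algorithm; the function name and the (dist, i, j) edge format are our own conventions:

```python
def single_link_clustering(n, edges, k):
    """Single-link k-clustering on n objects 0..n-1.
    edges: list of (dist, i, j) tuples; returns (clusters, spacing),
    where spacing is the weight of the cheapest edge NOT added."""
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(edges)
    components, idx = n, 0
    # Merge the closest inter-cluster pair, n - k times (Kruskal,
    # stopped when k connected components remain).
    while components > k:
        dist, i, j = edges[idx]
        idx += 1
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # Spacing = weight of the cheapest remaining inter-cluster edge.
    spacing = next(d for d, i, j in edges[idx:] if find(i) != find(j))
    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values()), spacing
```

For example, four points on a line at 0, 1, 10, 11 with k = 2 yield the clusters {0, 1} and {10, 11}, with spacing 9.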
Greedy Analysis
• Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k-1 most expensive edges of an MST. C* is a k-clustering of max spacing.
• Pf. Let C denote some other clustering C1, …, Ck.
– The spacing of C* is the length d* of the (k-1)st most expensive edge of the MST.
– Let pi, pj be in the same cluster in C*, say C*r, but different clusters in C, say Cs and Ct.
– Some edge (p, q) on the pi-pj path in C*r spans two different clusters in C.
– All edges on the pi-pj path have length ≤ d*, since Kruskal chose them.
– Spacing of C is ≤ d*, since p and q are in different clusters. ▪
Dendrogram
• Dendrogram. Scientific visualization of
hypothetical sequence of evolutionary
events.
– Leaves = genes.
– Internal nodes = hypothetical ancestors.
Reference: http://www.biostat.wisc.edu/bmi576/fall-2003/lecture13.pdf
Dendrogram of Cancers in Humans
• Tumors in similar tissues cluster together.
(figure: gene expression matrix, rows = genes 1…n; dark = gene expressed, light = gene not expressed)
Reference: Botstein & Brown group
13.2 Global Minimum Cut
Global Minimum Cut
• Global min cut. Given a connected, undirected graph G = (V, E), find a cut (A, B) of minimum cardinality.
• Applications. Partitioning items in a database, identifying clusters of related documents, network reliability, network design, circuit design, TSP solvers.
• Network flow solution.
– Replace every edge (u, v) with two antiparallel edges (u, v) and (v, u).
– Pick some vertex s and compute a min s-v cut separating s from each other vertex v ∈ V; return the smallest of these n-1 cuts.
• False intuition. Global min-cut is harder than min s-t cut.
Contraction Algorithm
• Contraction algorithm. [Karger 1995]
– Pick an edge e = (u, v) uniformly at random.
– Contract edge e.
• replace u and v by single new super-node w
• preserve edges, updating endpoints of u and v to w
• keep parallel edges, but delete self-loops
– Repeat until graph has just two nodes v1 and v2.
– Return the cut (all nodes that were contracted to form v1).
(figure: contracting edge u-v replaces u and v by a single super-node w; the remaining edges to a, b, c, d, e, f are preserved)
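One run of the contraction algorithm can be sketched in Python, using union-find to track which original nodes each super-node contains; the edge-list format and function names are assumptions for illustration, and this sketch returns the size of the cut it finds rather than the node partition:

```python
import random

def karger_cut(edges, n):
    """One run of Karger's contraction algorithm on nodes 0..n-1.
    edges: list of (u, v) pairs (parallel edges allowed).
    Returns the number of edges crossing the resulting 2-way cut."""
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    pool = [e for e in edges if e[0] != e[1]]  # no self-loops
    while components > 2:
        u, v = random.choice(pool)             # uniform random edge
        parent[find(u)] = find(v)              # contract: merge super-nodes
        components -= 1
        # Delete self-loops; parallel edges between super-nodes are kept.
        pool = [e for e in pool if find(e[0]) != find(e[1])]
    # Every surviving edge crosses the cut between the two super-nodes.
    return sum(1 for u, v in edges if find(u) != find(v))
```

On two triangles joined by a single bridge edge, repeated runs quickly find the min cut of size 1.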
Contraction Algorithm
• Claim. The contraction algorithm returns a min cut with probability ≥ 2/n².
• Pf. Consider a global min-cut (A*, B*) of G. Let F* be the edges with one endpoint in A* and the other in B*, and let k = |F*| = size of the min cut.
– In the first step, the algorithm contracts an edge in F* with probability k / |E|.
– Every node has degree ≥ k, since otherwise (A*, B*) would not be a min-cut. ⇒ |E| ≥ ½kn.
– Thus, the algorithm contracts an edge in F* with probability ≤ 2/n.
– Let G' be the graph after j iterations, with n' = n - j super-nodes.
– Suppose no edge in F* has been contracted. The min-cut in G' is still k.
– Since the value of the min-cut is k, |E'| ≥ ½kn'.
– Thus, the algorithm contracts an edge in F* with probability ≤ 2/n'.
– Let Ej = event that no edge in F* is contracted in iteration j.

Pr[E1 ∩ E2 ∩ … ∩ En-2] = Pr[E1] × Pr[E2 | E1] × … × Pr[En-2 | E1 ∩ … ∩ En-3]
   ≥ (1 - 2/n)(1 - 2/(n-1)) ⋯ (1 - 2/4)(1 - 2/3)
   = ((n-2)/n)((n-3)/(n-1)) ⋯ (2/4)(1/3)
   = 2 / (n(n-1))
   ≥ 2/n²  ▪
Contraction Algorithm
• Amplification. To amplify the probability of success, run the contraction algorithm many times.
• Claim. If we repeat the contraction algorithm n² ln n times with independent random choices, the probability of failing to find the global min-cut is at most 1/n².
• Pf. By independence, the probability of failure is at most

(1 - 2/n²)^(n² ln n) = [(1 - 2/n²)^(½n²)]^(2 ln n) ≤ (e⁻¹)^(2 ln n) = 1/n²

using (1 - 1/x)^x ≤ 1/e. ▪
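The bound above can be sanity-checked numerically, assuming (pessimistically) that each independent run succeeds with probability exactly 2/n²; the function name is our own:

```python
import math

def failure_bound(n):
    """Probability that n^2 ln n independent runs all miss the min cut,
    if each run succeeds with probability exactly 2/n^2."""
    trials = n * n * math.log(n)
    return (1 - 2 / (n * n)) ** trials

# The computed failure probability stays below the claimed 1/n^2 bound.
for n in [10, 100, 1000]:
    print(n, failure_bound(n), 1 / n**2)
```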
Global Min Cut: Context
• Remark. Overall running time is slow, since we perform Θ(n² log n) iterations and each takes Θ(m) time.
• Improvement. [Karger-Stein 1996] O(n² log³ n).
– Early iterations are less risky than later ones: the probability of contracting an edge in the min cut hits 50% when n / √2 nodes remain.
– Run the contraction algorithm until n / √2 nodes remain.
– Run the contraction algorithm twice on the resulting graph, and return the best of the two cuts.
• Extensions. Naturally generalizes to handle positive weights.
• Best known. [Karger 2000] O(m log³ n), faster than the best known max-flow algorithm or deterministic global min-cut algorithm.