Clustering

• Clustering. Given a set U of n objects labeled p1, …, pn, classify them into coherent groups. Ex: photos, documents, micro-organisms.
• Distance function. Numeric value specifying the "closeness" of two objects. Ex: number of corresponding pixels whose intensities differ by some threshold.
• Fundamental problem. Divide into clusters so that points in different clusters are far apart.
  – Routing in mobile ad hoc networks.
  – Identify patterns in gene expression.
  – Document categorization for web search.
  – Similarity searching in medical image databases.
  – Skycat: cluster 10^9 sky objects into stars, quasars, galaxies.

Clustering of Maximum Spacing

• k-clustering. Divide objects into k non-empty groups.
• Distance function. Assume it satisfies several natural properties.
  – d(pi, pj) = 0 iff pi = pj (identity of indiscernibles)
  – d(pi, pj) ≥ 0 (nonnegativity)
  – d(pi, pj) = d(pj, pi) (symmetry)
• Spacing. Minimum distance between any pair of points in different clusters.
• Clustering of maximum spacing. Given an integer k, find a k-clustering of maximum spacing. [Figure: example with k = 4]

Greedy Clustering Algorithm

• Single-link k-clustering algorithm.
  – Form a graph on the vertex set U, corresponding to n clusters.
  – Find the closest pair of objects such that each object is in a different cluster, and add an edge between them.
  – Repeat n − k times until there are exactly k clusters.
• Key observation. This procedure is precisely Kruskal's algorithm (except we stop when there are k connected components).
• Remark. Equivalent to finding an MST and deleting the k − 1 most expensive edges.

Greedy Analysis

• Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k − 1 most expensive edges of an MST. C* is a k-clustering of maximum spacing.
• Pf. Let C denote some other clustering C1, …, Ck.
  – The spacing of C* is the length d* of the (k − 1)st most expensive edge.
  – Let pi, pj be in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct.
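The single-link k-clustering procedure above (Kruskal's algorithm stopped at k components) can be sketched as follows; the function name and return convention are illustrative, not from the slides:

```python
def single_link_clustering(points, d, k):
    """Single-link k-clustering: run Kruskal's algorithm on all pairwise
    distances, stopping when exactly k connected components remain.
    Returns (clusters, spacing), where spacing is the length of the
    cheapest inter-cluster edge (the next edge Kruskal would have added)."""
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's edge order: all pairs, sorted by distance ascending.
    edges = sorted((d(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))

    components, spacing = n, float("inf")
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                      # would create a cycle; skip
        if components == k:
            spacing = dist                # closest pair in different clusters
            break
        parent[ri] = rj                   # merge the two clusters
        components -= 1

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(points[x])
    return list(clusters.values()), spacing
```

For example, clustering the 1-D points 0, 1, 2, 10, 11, 20 with k = 3 under absolute-difference distance yields {0, 1, 2}, {10, 11}, {20} with spacing 8.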
  – Some edge (p, q) on the pi–pj path in C*r spans two different clusters of C, namely Cs and Ct.
  – All edges on the pi–pj path have length ≤ d*, since Kruskal chose them.
  – The spacing of C is ≤ d*, since p and q are in different clusters. ▪

Dendrogram

• Dendrogram. Scientific visualization of a hypothetical sequence of evolutionary events.
  – Leaves = genes.
  – Internal nodes = hypothetical ancestors.
Reference: http://www.biostat.wisc.edu/bmi576/fall-2003/lecture13.pdf

Dendrogram of Cancers in Human

• Tumors in similar tissues cluster together. [Figure: gene-expression matrix, Gene 1 … Gene n; legend: gene expressed / gene not expressed]
Reference: Botstein & Brown group

13.2 Global Minimum Cut

Global Minimum Cut

• Global min cut. Given a connected, undirected graph G = (V, E), find a cut (A, B) of minimum cardinality.
• Applications. Partitioning items in a database, identifying clusters of related documents, network reliability, network design, circuit design, TSP solvers.
• Network flow solution.
  – Replace every edge (u, v) with two antiparallel edges (u, v) and (v, u).
  – Pick some vertex s and compute a min s–v cut separating s from each other vertex v ∈ V.
• False intuition. Global min-cut is harder than min s-t cut.

Contraction Algorithm

• Contraction algorithm. [Karger 1995]
  – Pick an edge e = (u, v) uniformly at random.
  – Contract edge e.
    • Replace u and v by a single new super-node w.
    • Preserve edges, updating endpoints of u and v to w.
    • Keep parallel edges, but delete self-loops.
  – Repeat until the graph has just two nodes v1 and v2.
  – Return the cut (all nodes that were contracted to form v1).
[Figure: contracting edge u–v merges u and v into super-node w]

• Claim. The contraction algorithm returns a min cut with probability ≥ 2/n².
• Pf. Consider a global min-cut (A*, B*) of G. Let F* be the edges with one endpoint in A* and the other in B*, and let k = |F*| = size of the min cut.
  – In the first step, the algorithm contracts an edge in F* with probability k / |E|.
  – Every node has degree ≥ k, since otherwise (A*, B*) would not be a min-cut. Hence |E| ≥ ½kn.
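A single run of the contraction algorithm described above can be sketched as follows, representing the graph as an edge list and the super-nodes with union-find; the function name and conventions are illustrative:

```python
import random

def contract_to_two(edges, n):
    """One run of Karger's contraction algorithm on a connected multigraph
    with nodes 0..n-1 and edge list `edges` (assumed free of self-loops).
    Returns the number of edges crossing the final 2-super-node cut."""
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = n
    live = list(edges)                    # parallel edges are kept
    while components > 2:
        u, v = random.choice(live)        # pick an edge uniformly at random
        parent[find(u)] = find(v)         # contract: merge the two super-nodes
        components -= 1
        # delete self-loops: edges now inside a single super-node
        live = [(a, b) for (a, b) in live if find(a) != find(b)]
    return len(live)
```

On a triangle, for instance, every run returns 2, which is indeed the size of its global min cut.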
  – Thus, the algorithm contracts an edge in F* with probability ≤ 2/n.
  – Let G' be the graph after j iterations, with n' = n − j supernodes.
  – Suppose no edge in F* has been contracted. The min-cut in G' is still k.
  – Since the value of the min-cut is k, |E'| ≥ ½kn'.
  – Thus, the algorithm contracts an edge in F* with probability ≤ 2/n'.
  – Let Ej = event that no edge in F* is contracted in iteration j. Then
      Pr[E1 ∩ E2 ∩ ⋯ ∩ E_{n−2}] = Pr[E1] · Pr[E2 | E1] ⋯ Pr[E_{n−2} | E1 ∩ ⋯ ∩ E_{n−3}]
        ≥ (1 − 2/n)(1 − 2/(n−1)) ⋯ (1 − 2/4)(1 − 2/3)
        = ((n−2)/n)((n−3)/(n−1)) ⋯ (2/4)(1/3)
        = 2 / (n(n−1))
        ≥ 2/n². ▪

Contraction Algorithm

• Amplification. To amplify the probability of success, run the contraction algorithm many times.
• Claim. If we repeat the contraction algorithm n² ln n times with independent random choices, the probability of failing to find the global min-cut is at most 1/n².
• Pf. By independence, the probability of failure is at most
      (1 − 2/n²)^(n² ln n) = [(1 − 2/n²)^(½n²)]^(2 ln n) ≤ (e^(−1))^(2 ln n) = 1/n²,
  using (1 − 1/x)^x ≤ 1/e. ▪

Global Min Cut: Context

• Remark. The overall running time is slow, since we perform Θ(n² log n) iterations and each takes Ω(m) time.
• Improvement. [Karger–Stein 1996] O(n² log³ n).
  – Early iterations are less risky than later ones: the probability of contracting an edge in the min cut hits 50% when n / √2 nodes remain.
  – Run the contraction algorithm until n / √2 nodes remain.
  – Run the contraction algorithm twice on the resulting graph, and return the best of the two cuts.
• Extensions. Naturally generalizes to handle positive weights.
• Best known. [Karger 2000] O(m log³ n); faster than the best known max-flow algorithm or deterministic global min-cut algorithm.
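Putting contraction and amplification together, a self-contained sketch follows, using the n² ln n repetition count from the claim above; the function name and edge-list representation are my own assumptions:

```python
import math
import random

def karger_min_cut(edges, n, trials=None):
    """Global min cut of a connected multigraph (nodes 0..n-1, no
    self-loops) by repeated random contraction [Karger 1995].
    With n^2 ln n independent runs, the failure probability is <= 1/n^2."""
    if trials is None:
        trials = int(n * n * math.log(n)) + 1

    def one_run():
        parent = list(range(n))

        def find(i):  # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        components, live = n, list(edges)
        while components > 2:
            u, v = random.choice(live)    # uniform random edge
            parent[find(u)] = find(v)     # contract it into one super-node
            components -= 1
            # drop self-loops; parallel edges survive
            live = [(a, b) for (a, b) in live if find(a) != find(b)]
        return len(live)                  # cut size found by this run

    return min(one_run() for _ in range(trials))
```

For example, on two triangles joined by a single bridge edge, the repeated runs recover the bridge as a cut of size 1 with overwhelming probability.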