Correlation Clustering
Nikhil Bansal, joint work with Avrim Blum and Shuchi Chawla

Clustering

Say we want to cluster n objects of some kind (documents, images, text strings), but we don't have a meaningful way to project them into Euclidean space. Idea of Cohen, McCallum, and Richman: use past data to train up a classifier f(x,y) = same/different, then run f on all pairs and try to find the most consistent clustering.

The problem

Example: entity resolution over the names Harry B., Harry Bovik, H. Bovik, and Tom X. Train up f(x,y) = same/different (+: same, -: different) and run f on all pairs. A clustering is totally consistent if all + edges lie inside clusters and all - edges go between clusters. In general f makes mistakes, so no clustering agrees with every edge; the goal is to find the most consistent clustering.

Problem: Given a complete graph on n vertices, with each edge labeled + or -, find the partition of the vertices that is as consistent as possible with the edge labels: Max #(agreements), or equivalently Min #(disagreements). There is no k: the number of clusters could be anything.

Two motivations:
- Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well.
- Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited representation power. E.g., research communities via a collaboration graph.

Our results
- A constant-factor approximation for minimizing disagreements.
- A PTAS for maximizing agreements.
- Results for the random-noise case.
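The objective is easy to make concrete in code. Below is a minimal sketch (not the authors' code): edges carry a +/- label, and we count how many labels a given clustering violates. The toy instance, with one mistaken '-' edge among the Bovik variants, is an assumption for illustration.

```python
def disagreements(labels, clustering):
    """Count edges whose +/- label disagrees with a clustering.

    labels: dict {(u, v): '+' or '-'} over all unordered pairs.
    clustering: dict {vertex: cluster id}.
    A '+' edge agrees iff its endpoints share a cluster;
    a '-' edge agrees iff they do not.
    """
    bad = 0
    for (u, v), sign in labels.items():
        same = clustering[u] == clustering[v]
        if (sign == '+') != same:
            bad += 1
    return bad

# Toy instance from the slides: three name variants plus an outsider.
labels = {
    ('HarryB', 'HarryBovik'): '+',
    ('HarryB', 'HBovik'): '+',
    ('HarryBovik', 'HBovik'): '-',   # f made a mistake on this pair
    ('HarryB', 'TomX'): '-',
    ('HarryBovik', 'TomX'): '-',
    ('HBovik', 'TomX'): '-',
}
one_cluster = {'HarryB': 0, 'HarryBovik': 0, 'HBovik': 0, 'TomX': 1}
print(disagreements(labels, one_cluster))  # 1: the '-' edge inside the cluster
```

Merging all three Bovik variants costs one disagreement (the mistaken '-' edge), which is the best possible here; no clustering can satisfy all six labels.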
PTAS for maximizing agreements

It is easy to get agreement on half of the edges. Goal: an additive approximation of εn². Standard approach: draw a small sample, guess the partition of the sample, and compute the partition of the remainder. This can be done directly, or by plugging into the General Property Tester of [GGR]. The running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.

Minimizing disagreements

Goal: get a constant-factor approximation. Problem: even if we can find a cluster that is as good as the best cluster of OPT, we are headed toward O(log n) [Set-Cover-like analysis]. We need a way of lower-bounding D_OPT, the number of disagreements of the optimal clustering.

Lower bounding idea: bad triangles

Consider a "bad" triangle: two + edges and one - edge. Any clustering has to disagree with at least one of these three edges. If there are several edge-disjoint bad triangles, every clustering makes a mistake on each one. Example: on vertices 1..5 with + edges 1-2, 2-3, 3-4, 4-5, the triangles (1,2,3) and (3,4,5) are edge-disjoint and bad. Hence

    D_OPT ≥ #{edge-disjoint bad triangles}   (not tight).

How can we use this?
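The lower bound suggests a simple greedy counter: scan triangles and claim one whenever its three edges are still unclaimed. A sketch, assuming the graph is given as a set of '+' edges (the greedy scan order is my choice; greedy yields some maximal edge-disjoint family, not necessarily the largest):

```python
from itertools import combinations

def bad_triangle_lower_bound(vertices, plus_edges):
    """Greedily count edge-disjoint 'bad' triangles (two + edges, one - edge).

    Any clustering must err on some edge of each bad triangle, so this
    count is a lower bound on D_OPT (not tight).
    plus_edges: set of frozensets {u, v}; all other pairs are '-'.
    """
    used = set()   # edges already claimed by some chosen triangle
    count = 0
    for tri in combinations(vertices, 3):
        edges = [frozenset(p) for p in combinations(tri, 2)]
        if any(e in used for e in edges):
            continue
        if sum(e in plus_edges for e in edges) == 2:   # bad triangle
            used.update(edges)
            count += 1
    return count

# The slides' example: a + path 1-2-3-4-5, every other pair '-'.
plus = {frozenset(p) for p in [(1, 2), (2, 3), (3, 4), (4, 5)]}
print(bad_triangle_lower_bound(range(1, 6), plus))  # 2: (1,2,3) and (3,4,5)
```

On this instance the bound certifies that every clustering makes at least 2 mistakes.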
δ-clean clusters

Given a clustering, call a vertex v in cluster C δ-good if it has few disagreements:
- |N-(v) ∩ C| < δ|C|  (few - neighbors inside C), and
- |N+(v) \ C| < δ|C|  (few + neighbors outside C),

where +: similar, -: dissimilar. Essentially, N+(v) ≈ C(v). A cluster C is δ-clean if all v ∈ C are δ-good.

Observation

Any δ-clean clustering is an 8-approximation for δ < 1/4. Idea: charge mistakes to bad triangles. Intuitively, for each wrong - edge (u,v) inside a cluster there are enough choices of a third vertex w, with + edges to both u and v, to complete a bad triangle, and these bad triangles can be chosen edge-disjoint. A similar argument handles + edges between two clusters.

General structure of the argument

Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: that may not be possible!
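The δ-good test translates directly into code. A minimal sketch, assuming the graph is given as a map from each vertex to its set of '+' neighbors (the function names are mine):

```python
def is_delta_good(v, C, plus_neighbors, delta):
    """v is delta-good w.r.t. cluster C if it has fewer than delta*|C|
    '-' neighbors inside C and fewer than delta*|C| '+' neighbors
    outside C (so N+(v) is approximately C)."""
    minus_inside = sum(1 for u in C if u != v and u not in plus_neighbors[v])
    plus_outside = sum(1 for u in plus_neighbors[v] if u not in C)
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def is_delta_clean(C, plus_neighbors, delta):
    """A cluster is delta-clean if every vertex in it is delta-good."""
    return all(is_delta_good(v, C, plus_neighbors, delta) for v in C)

# A perfectly clean cluster: all internal edges '+', no '+' edge leaving.
plus_neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set()}
print(is_delta_clean({1, 2, 3}, plus_neighbors, 0.25))  # True
```

Adding a single '+' edge from vertex 1 to the outsider 4 already breaks 1/4-cleanliness of the 3-vertex cluster, since 1 ≥ 0.25 · 3 is false only for counts below 0.75.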
Approach:
1) Consider clusterings of a special form, OPT(δ): every cluster is either δ-clean or a singleton, and the clustering makes not many more mistakes than OPT.
2) Produce something "close" to OPT(δ).

Existence of OPT(δ)

Start from OPT. In each cluster, identify the δ/3-bad vertices. 1) Move the δ/3-bad vertices out as singletons. 2) If "many" (a ≥ δ/3 fraction) of a cluster's vertices are δ/3-bad, split the whole cluster into singletons. The result OPT(δ) consists of singletons and δ-clean clusters, and D_OPT(δ) = O(1) · D_OPT.

Main result

The algorithm produces a clustering with the guarantee:
- every non-singleton cluster is 11δ-clean;
- the singletons are a subset of the singletons of OPT(δ).

Choose δ = 1/44, so the non-singletons are 1/4-clean. Mistakes among non-singletons are bounded by edge-disjoint bad triangles; mistakes involving singletons are bounded by those of OPT(δ). Approximation ratio: 9/δ² + 8.

Open problems
- How about a small constant? Is D_OPT ≤ 2 #{edge-disjoint bad triangles}? Is D_OPT ≤ 2 #{fractional edge-disjoint bad triangles}?
- Extend to {-1, 0, +1} weights: approximation to a constant, or even a log factor?
- Extend to the weighted case? Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].

Nice features of the formulation
- There is no k (OPT can have anywhere from 1 to n clusters).
- If a perfect solution exists, it is easy to find: C(v) = N+(v).
- It is easy to get agreement on half of the edges.

Algorithm
1. Pick a vertex v. Let C(v) = {v} ∪ N+(v).
2. Modify C(v):
   (a) Remove 3δ-bad vertices from C(v).
   (b) Add 7δ-good vertices into C(v).
3. Delete C(v) from the graph. Repeat until done, or until the above only produces empty clusters.
4.
Output the remaining nodes as singletons.

Analysis, Case 1: v lies in a δ-clean cluster C1 of OPT(δ) (C2 is another cluster of OPT(δ), C the algorithm's cluster):

Step 1: Choose v; let C = {v} plus the + neighbors of v.
Step 2 (vertex removal phase): if x is 3δ-bad, set C = C - {x}. Then 1) no vertex of C1 is removed, and 2) all vertices of C2 are removed.
Step 3 (vertex addition phase): add 7δ-good vertices to C. Then 1) all remaining vertices of C1 are added, 2) no vertex of C2 is added, and 3) the cluster C is 11δ-clean.

Case 2: v is a singleton in OPT(δ). Again choose v and let C = the + neighbors of v; the same idea works.
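Putting the pieces together, the whole algorithm can be sketched as follows. This follows the slides' outline but simplifies two points, which are my assumptions: badness in each phase is measured against the candidate set all at once rather than incrementally, and pivot vertices are tried in a fixed order. Treat it as an illustration, not the paper's exact procedure.

```python
def delta_good(x, C, plus_nbrs, delta):
    """x is delta-good w.r.t. C: fewer than delta*|C| '-' neighbors
    inside C and fewer than delta*|C| '+' neighbors outside C."""
    minus_inside = sum(1 for u in C if u != x and u not in plus_nbrs[x])
    plus_outside = sum(1 for u in plus_nbrs[x] if u not in C)
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def cluster_removal(vertices, plus_nbrs, delta=1 / 44):
    """Sketch of the slides' algorithm: grow C(v) = {v} + N+(v),
    remove 3*delta-bad vertices, add 7*delta-good ones, delete the
    cluster, and repeat; leftover vertices come out as singletons."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        # Restrict '+' neighborhoods to the vertices still in play.
        nbrs = {u: plus_nbrs[u] & remaining for u in remaining}
        v = min(remaining)            # fixed pivot order: an assumption
        C = {v} | nbrs[v]
        # (a) Vertex removal phase: drop 3*delta-bad vertices from C(v).
        kept = {x for x in C if delta_good(x, C, nbrs, 3 * delta)}
        # (b) Vertex addition phase: pull in 7*delta-good vertices.
        kept |= {x for x in remaining - kept
                 if delta_good(x, kept, nbrs, 7 * delta)}
        clusters.append(kept)
        remaining -= kept
    return clusters

# Two '+' cliques, {1,2,3} and {4,5}: the algorithm recovers them.
plus = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(cluster_removal([1, 2, 3, 4, 5], plus))
```

The pivot v is always 0-good for its own C(v), so each round removes at least one vertex and the loop terminates; a vertex that joins no cluster simply ends up in a cluster of size one.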