
Correlation Clustering
Nikhil Bansal
Joint work with Avrim Blum and Shuchi Chawla
Clustering
Say we want to cluster n objects of some kind (documents, images, text strings),
but we don't have a meaningful way to project them into Euclidean space.
Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different.
Then run f on all pairs and try to find the most consistent clustering.
The problem
[Figure: four name strings (Harry B., Harry Bovik, H. Bovik, Tom X.) with + edges among the three Bovik variants; legend: + = same, − = different]
Train up f(x,y) = same/different.
Run f on all pairs.
The problem
[Figure: the same four nodes with a totally consistent labeling; legend: + = same, − = different]
Totally consistent:
+ edges inside clusters
− edges between clusters
The problem
[Figure: the same four nodes; one edge label disagrees with every possible clustering]
Train up f(x,y) = same/different.
Run f on all pairs.
Find the most consistent clustering.
The problem
[Figure: a clustering of the four nodes with one disagreement]
Problem: Given a complete graph on n vertices, each edge labeled + or −.
Goal: find a partition of the vertices as consistent as possible with the edge labels:
maximize #(agreements) or minimize #(disagreements).
There is no k: the number of clusters can be anything.
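To make the objective concrete, here is a minimal sketch (a hypothetical Python helper, not from the talk) that counts the disagreements of a candidate clustering on a labeled complete graph:

```python
from itertools import combinations

def disagreements(labels, clustering):
    """Count label/clustering disagreements on a complete graph.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    clustering: dict mapping vertex -> cluster id
    """
    bad = 0
    for u, v in combinations(list(clustering), 2):
        same = clustering[u] == clustering[v]
        sign = labels[frozenset({u, v})]
        # a '+' edge across clusters, or a '-' edge inside one, is a mistake
        if (sign == '+') != same:
            bad += 1
    return bad

# Toy instance built from the four name strings in the figure
V = ['Harry B.', 'Harry Bovik', 'H. Bovik', 'Tom X.']
labels = {frozenset(e): '-' for e in combinations(V, 2)}
for e in [('Harry B.', 'Harry Bovik'), ('Harry Bovik', 'H. Bovik'),
          ('Harry B.', 'H. Bovik')]:
    labels[frozenset(e)] = '+'
clustering = {'Harry B.': 0, 'Harry Bovik': 0, 'H. Bovik': 0, 'Tom X.': 1}
print(disagreements(labels, clustering))  # → 0, perfectly consistent here
```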
The Problem
Noise removal:
There is some true clustering, but some edges are incorrect. We still want to do well.
Agnostic learning:
There is no inherent clustering.
Try to find the best representation using a hypothesis class with limited representation power.
E.g.: research communities via a collaboration graph.
Our results
A constant-factor approximation for minimizing disagreements.
A PTAS for maximizing agreements.
Results for the random-noise case.
PTAS for maximizing agreements
Easy to get ½ of the edges.
Goal: an additive approximation of εn².
Standard approach:
- Draw a small sample,
- Guess a partition of the sample,
- Compute the partition of the remainder.
Can do this directly, or plug into the General Property Tester of [GGR].
- Running time doubly exponential in 1/ε, or singly exponential with a bad exponent.
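The sample-and-guess approach above can be sketched as follows. This is a simplified illustration with hypothetical sample size and group count (the real PTAS chooses both as functions of ε), not the paper's exact procedure: draw a sample S, try every partition of S, place each remaining vertex in the group that agrees most with S, and keep the best result.

```python
import random
from itertools import combinations, product

def agreements(labels, clustering):
    """Count label/clustering agreements on a complete graph."""
    ok = 0
    for u, v in combinations(list(clustering), 2):
        same = clustering[u] == clustering[v]
        if (labels[frozenset({u, v})] == '+') == same:
            ok += 1
    return ok

def ptas_sketch(V, labels, eps=0.5, seed=0):
    rng = random.Random(seed)
    k = max(1, int(1 / eps))        # groups guessed within the sample (illustrative)
    s = min(len(V), 4)              # tiny sample size, for illustration only
    S = rng.sample(sorted(V), s)
    rest = [v for v in V if v not in S]
    best = None
    for assign in product(range(k + 1), repeat=s):   # guess the sample's partition
        clustering = dict(zip(S, assign))
        for v in rest:               # place each remaining vertex greedily vs. the sample
            clustering[v] = max(
                range(k + 1),
                key=lambda c: sum(
                    (labels[frozenset({v, u})] == '+') == (c == clustering[u])
                    for u in S))
        if best is None or agreements(labels, clustering) > agreements(labels, best):
            best = dict(clustering)
    return best
```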
Minimizing Disagreements
Goal: get a constant-factor approximation.
Problem: Even if we can find a cluster that is as good as the best cluster of OPT, we are headed toward O(log n) [Set-Cover-like analysis].
We need a way of lower-bounding D_OPT.
Lower bounding idea: bad triangles
Consider a bad triangle: two + edges and one − edge.
Any clustering has to disagree with at least one of these three edges.
If several such triangles are edge-disjoint, then a mistake is forced on each one.
[Figure: vertices 1 through 5 with two edge-disjoint bad triangles, (1,2,3) and (3,4,5)]
D_OPT ≥ #{edge-disjoint bad triangles}   (not tight)
How can we use this?
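One simple way to compute such a lower bound is to collect bad triangles greedily. This sketch (a hypothetical helper, not the slides' analysis) returns some maximal set of edge-disjoint bad triangles, not necessarily the largest one, so it gives a valid but possibly weak bound:

```python
from itertools import combinations

def edge_disjoint_bad_triangles(V, labels):
    """Greedily collect edge-disjoint triangles with two '+' edges and one '-' edge."""
    used = set()    # edges already claimed by an earlier triangle
    found = []
    for a, b, c in combinations(V, 3):
        edges = [frozenset({a, b}), frozenset({b, c}), frozenset({a, c})]
        signs = [labels[e] for e in edges]
        if signs.count('+') == 2 and not used.intersection(edges):
            used.update(edges)
            found.append((a, b, c))
    return found
```

On the five-vertex example from the figure, this recovers exactly the two triangles (1,2,3) and (3,4,5).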
δ-clean clusters
Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements:
|N−(v) within C| < δ|C|
|N+(v) outside C| < δ|C|
(+: similar, −: dissimilar)
Essentially, N+(v) ≈ C(v).
A cluster C is δ-clean if all v ∈ C are δ-good.
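The definition can be checked directly. A sketch (hypothetical helper names; `labels` maps vertex pairs to '+' or '-'):

```python
def is_delta_good(v, C, V, labels, delta):
    """C is v's cluster (a set); V is the full vertex set."""
    minus_inside = sum(1 for u in C if u != v and labels[frozenset({u, v})] == '-')
    plus_outside = sum(1 for u in V if u not in C and labels[frozenset({u, v})] == '+')
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def is_delta_clean(C, V, labels, delta):
    return all(is_delta_good(v, C, V, labels, delta) for v in C)
```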
Observation
Any δ-clean clustering is an 8-approximation for δ < 1/4.
Idea: charge mistakes to bad triangles.
[Figure: a wrong − edge (u,v) inside a cluster, with common + neighbors w]
Intuitively, there are enough choices of w for each wrong edge (u,v), so we can find edge-disjoint bad triangles.
A similar argument works for + edges between two clusters.
General structure of the argument
Any δ-clean clustering is an 8-approximation for δ < 1/4.
Consequence: we just need to produce a δ-clean clustering for some δ < 1/4.
Bad news: this may not be possible!
Approach:
1) Show there is a clustering of a special form, OPT(δ):
   clusters are either δ-clean or singletons,
   and it makes not many more mistakes than OPT.
2) Produce something "close" to OPT(δ).
Existence of OPT(δ)
Start from OPT and identify the δ/3-bad vertices in each cluster.
1) Move the δ/3-bad vertices out as singletons.
2) If a cluster has "many" (≥ a δ/3 fraction) δ/3-bad vertices, "split" it entirely.
[Figure: OPT's clusters C1, C2 transformed into OPT(δ): singletons and δ-clean clusters]
D_OPT(δ) = O(1) · D_OPT
Main result
The algorithm's guarantee, relative to the clustering OPT(δ):
- Non-singleton clusters produced are 11δ-clean.
- Singletons produced are a subset of OPT(δ)'s singletons.
Choose δ = 1/44, so the non-singletons are ¼-clean.
- Bound mistakes among non-singletons by bad triangles.
- Bound mistakes involving singletons by those of OPT(δ).
Approximation ratio: 9/δ² + 8.
Open problems
How about a small constant?
Is D_OPT ≤ 2 · #{edge-disjoint bad triangles}?
Is D_OPT ≤ 2 · #{fractional edge-disjoint bad triangles}?
Extend to {−1, 0, +1} weights: approximation to a constant or even a log factor?
Extend to the weighted case?
Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].
Nice features of the formulation
There is no k (OPT can have anywhere from 1 to n clusters).
If a perfect solution exists, it is easy to find: take C(v) = N+(v).
Easy to get agreement on ½ of the edges.
Algorithm
1. Pick a vertex v. Let C(v) = N+(v).
2. Modify C(v):
   (a) Remove 3δ-bad vertices from C(v).
   (b) Add 7δ-good vertices into C(v).
3. Delete C(v). Repeat until done, or until the above always produces empty clusters.
4. Output the nodes left over as singletons.
Step 1
Choose v; let C = the + neighbors of v.
[Figure: C straddles OPT(δ) clusters C1 and C2]
Step 2
Vertex removal phase: if x is 3δ-bad, set C = C − {x}.
[Figure: OPT(δ) clusters C1 and C2]
1) No vertex in C1 is removed.
2) All vertices in C2 are removed.
Step 3
Vertex addition phase: add 7δ-good vertices to C.
[Figure: OPT(δ) clusters C1 and C2]
1) All remaining vertices in C1 will be added.
2) None in C2 are added.
3) The resulting cluster C is 11δ-clean.
Case 2: v is a singleton in OPT(δ)
Choose v, C = the + neighbors of v.
[Figure: the cluster C around v]
The same idea works.