Slides

LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Xiaoxin Yin, Jiawei Han
Univ. of Illinois at Urbana-Champaign
Philip S. Yu
IBM T.J. Watson Research Center
1
A Motivating Example
[Figure: a linked network of three object types: Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05), and Conferences (sigmod, vldb, aaai).]
Questions:
Q1: How to cluster objects of each type?
Q2: How to define similarity between objects of each type?
2
Link-based Similarities
• Two objects are similar if they are linked with
similar objects
– SimRank (Jeh & Widom, 2002)
– The similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y
[Figure: example link graphs over authors (Tom, Mike, Cathy, John), proceedings (sigmod03 to sigmod05, vldb03 to vldb05), and conferences (sigmod, vldb).]
• Very expensive to compute
– For a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities
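To make the cost concrete, here is a minimal sketch of the naive pairwise computation. It is not the authors' code. The decay factor `c`, the iteration count, and the example links are assumptions for illustration; the slide gives only the averaging rule.

```python
from itertools import product

def simrank(neighbors, c=0.8, iterations=5):
    """Naive SimRank over an undirected link graph.

    neighbors: dict mapping each object to the set of objects linked with it.
    c: decay factor (assumed value; the slide does not give one).
    """
    nodes = list(neighbors)
    # One entry per ordered pair of objects: this is the O(N^2) space cost.
    sim = {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for x, y in product(nodes, nodes):
            if x == y:
                new[(x, y)] = 1.0
            elif not neighbors[x] or not neighbors[y]:
                new[(x, y)] = 0.0
            else:
                # Average similarity between objects linked with x and with y.
                total = sum(sim[(a, b)] for a in neighbors[x] for b in neighbors[y])
                new[(x, y)] = c * total / (len(neighbors[x]) * len(neighbors[y]))
        sim = new
    return sim

# Hypothetical author-proceeding links, loosely modeled on the example.
links = [("Tom", "sigmod03"), ("Tom", "sigmod04"),
         ("Mike", "sigmod04"), ("Mary", "aaai04")]
neighbors = {}
for author, proc in links:
    neighbors.setdefault(author, set()).add(proc)
    neighbors.setdefault(proc, set()).add(author)

sim = simrank(neighbors)
```

In this toy graph Tom and Mike share a proceeding and get a positive similarity, while Tom and Mary are not connected and stay at 0.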
3
Observation 1: Hierarchical Structures
• Hierarchical structures often exist naturally
among objects (e.g., taxonomy of animals)
[Figure: relationships between Articles and Words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004).]
[Figure: a hierarchical structure of products in Walmart, with root "All" and nodes grocery, electronics, apparel, TV, DVD, and camera.]
4
Observation 2: Distribution of Similarity
[Figure: distribution of SimRank similarities among DBLP authors. x-axis: similarity value (0 to 0.24); y-axis: portion of entries (0 to 0.4).]
• Power law distribution exists in similarities
– 56% of similarity entries are in [0.005, 0.015]
– 1.4% of similarity entries are larger than 0.1
– Our goal: Design a data structure that stores the
significant similarities and compresses insignificant ones
5
Our Data Structure: SimTree
• Each leaf node represents an object
• Each non-leaf node represents a group of similar lower-level nodes
• Similarities between siblings are stored
[Figure: example SimTree over products, with leaves such as "Canon A40 digital camera" and "Sony V3 digital camera" under "Digital Cameras", and higher-level nodes "Consumer electronics", "TVs", and "Apparels".]
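One way to picture the structure is a tree whose nodes keep similarities only to their siblings. The sketch below is a hypothetical layout, not the authors' implementation. The class name `SimNode`, the helper `set_sibling_sim`, the root "All", the exact parent-child arrangement, and the value 0.9 are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(eq=False)
class SimNode:
    name: str
    parent: Optional["SimNode"] = field(default=None, repr=False)
    children: List["SimNode"] = field(default_factory=list)
    # Similarities are kept only for siblings, keyed by sibling name.
    sibling_sim: Dict[str, float] = field(default_factory=dict)

    def add_child(self, child: "SimNode") -> "SimNode":
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children

def set_sibling_sim(a: SimNode, b: SimNode, value: float) -> None:
    """Store a similarity, refusing pairs that are not siblings."""
    if a.parent is None or a.parent is not b.parent:
        raise ValueError("similarities are stored only between siblings")
    a.sibling_sim[b.name] = value
    b.sibling_sim[a.name] = value

# Assumed hierarchy built from the product names in the example figure.
root = SimNode("All")
electronics = root.add_child(SimNode("Consumer electronics"))
apparel = root.add_child(SimNode("Apparels"))
cameras = electronics.add_child(SimNode("Digital Cameras"))
tvs = electronics.add_child(SimNode("TVs"))
canon = cameras.add_child(SimNode("Canon A40 digital camera"))
sony = cameras.add_child(SimNode("Sony V3 digital camera"))
set_sibling_sim(canon, sony, 0.9)  # hypothetical similarity value
```

Because each node stores values only for its siblings, the number of stored similarities is bounded by the number of sibling pairs rather than by all pairs of objects.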
6
Similarity Defined by A SimTree
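[Figure: a SimTree with nodes n1 to n9. Edges between siblings carry the similarity between two sibling nodes (for example n1 and n2); edges between a child and its parent carry the adjustment ratio of the child (for example node n7).]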
• simp(n7,n8) = s(n7,n4) x s(n4,n5) x s(n5,n8)
– Path-based node similarity
• Similarity between two nodes is the average similarity between
nodes linked with them in other SimTrees
• Adjustment ratio for x = (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes)
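The path-based similarity can be sketched as a walk up the tree: multiply the adjustment ratios on both sides until the two ancestors are siblings, then multiply by the stored sibling similarity. The function name, the dictionary layout, and all numeric values below are assumptions for illustration.

```python
def path_similarity(x, y, parent, ratio, sibling_sim):
    """Path-based similarity between two distinct nodes on the same level.

    parent: dict node -> parent node
    ratio: dict node -> adjustment ratio s(node, parent)
    sibling_sim: dict frozenset({a, b}) -> stored similarity of siblings a, b
    """
    result = 1.0
    # Climb until the two current nodes share a parent, i.e. are siblings.
    while parent[x] != parent[y]:
        result *= ratio[x] * ratio[y]
        x, y = parent[x], parent[y]
    return result * sibling_sim[frozenset((x, y))]

# Hypothetical fragment of a SimTree: n7 under n4, n8 and n9 under n5.
parent = {"n7": "n4", "n8": "n5", "n9": "n5", "n4": "n1", "n5": "n1"}
ratio = {"n7": 0.8, "n8": 0.9, "n9": 1.0}
sibling_sim = {frozenset(("n4", "n5")): 0.3, frozenset(("n8", "n9")): 0.9}

# simp(n7, n8) = s(n7, n4) x s(n4, n5) x s(n5, n8) = 0.8 x 0.3 x 0.9
simp_n7_n8 = path_similarity("n7", "n8", parent, ratio, sibling_sim)
```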
7
Overview of LinkClus
• Initialize a SimTree for objects of each type
• Repeat
– For each SimTree, update the similarities between
its nodes using similarities in other SimTrees
• Similarity between two nodes x and y is the average
similarity between objects linked with them
– Adjust the structure of each SimTree
• Assign each node to the parent node that it is most
similar to
8
Initialization of SimTrees
• The “SimTrees” before initialization
– Each leaf node has similarity 1 to itself and 0 to all other nodes
[Figure: two flat SimTrees before initialization: ST1 with leaf nodes l to y and ST2 with leaf nodes 10 to 24.]
• Initializing a SimTree
– Repeatedly find groups of tightly related nodes,
which are merged into a higher-level node
9
(continued)
• Tightness of a group of nodes
– For a group of nodes {n1, …, nk}, its tightness is
defined as the number of leaf nodes in other
SimTrees that are connected to all of {n1, …, nk}
[Figure: nodes n1 and n2 linked to leaf nodes 1 to 5 in another SimTree. The tightness of {n1, n2} is 3.]
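Read literally, the definition is a set intersection: take the leaf nodes linked to each node in the group and count those common to all of them. A minimal sketch follows. The specific links are assumed, chosen so that {n1, n2} has tightness 3 as in the example.

```python
def tightness(group, links):
    """Number of leaf nodes in the other SimTree linked to every node in group.

    links: dict node -> set of leaf nodes (in the other SimTree) it links to.
    """
    linked_sets = [links[n] for n in group]
    return len(set.intersection(*linked_sets))

# Assumed links between n1, n2 and leaf nodes 1..5 of another SimTree.
links = {"n1": {1, 2, 3, 4}, "n2": {2, 3, 4, 5}}
t = tightness(["n1", "n2"], links)
```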
10
(continued)
• Finding tight groups is reduced to frequent pattern mining
– The tightness of a group of nodes is the support of a frequent pattern
[Figure: nodes n1 to n4, with groups g1 and g2, linked to leaf nodes 1 to 9 in another SimTree; each leaf node gives one transaction.]
Transactions:
1  {n1}
2  {n1, n2}
3  {n2}
4  {n1, n2}
5  {n1, n2}
6  {n2, n3, n4}
7  {n4}
8  {n3, n4}
9  {n3, n4}
• Procedure of initializing a tree
– Start from leaf nodes (level-0)
– At each level l, find non-overlapping groups of similar
nodes with frequent pattern mining
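The reduction can be illustrated on the nine transactions from the example. The sketch below is a simplification, not the authors' procedure. It counts the support of node pairs only, by brute force, and then picks non-overlapping groups greedily; a real frequent pattern miner would handle larger groups. The function names and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# One transaction per leaf node of the other SimTree (from the example).
transactions = [
    {"n1"}, {"n1", "n2"}, {"n2"}, {"n1", "n2"}, {"n1", "n2"},
    {"n2", "n3", "n4"}, {"n4"}, {"n3", "n4"}, {"n3", "n4"},
]

def pair_supports(transactions):
    """Support of every node pair = number of transactions containing both."""
    support = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            support[pair] += 1
    return support

def greedy_groups(support, min_support=2):
    """Pick non-overlapping pairs in order of decreasing support."""
    groups, used = [], set()
    for pair, count in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_support and used.isdisjoint(pair):
            groups.append(set(pair))
            used.update(pair)
    return groups

support = pair_supports(transactions)
groups = greedy_groups(support)
```

On these transactions, {n1, n2} and {n3, n4} each have support 3, while pairs that cross the two groups appear only in transaction 6.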
11
Updating Similarities Between Nodes
• The initial similarities can seldom capture the
relationships between objects
• Iteratively update similarities
– Similarity between two nodes is the average similarity
between objects linked with them
[Figure: SimTrees ST1 and ST2 with linkages between their leaf nodes. sim(na, nb) is the average similarity between the nodes linked with na and the nodes linked with nb (13 and 14); computing it pair by pair takes O(3x2) time.]
12
Aggregation-based Similarity Computation
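[Figure: in ST2, n4 has children n10, n11, n12 (similarities 0.9, 1.0, 0.8 to n4) and n5 has children n13, n14 (similarities 0.9, 1.0 to n5), with s(n4, n5) = 0.2; in ST1, na is linked with n10, n11, n12 and nb with n13, n14.]
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4)·s(n4, n5)·s(n5, nl).
$$sim(n_a, n_b) = \frac{\sum_{k=10}^{12} s(n_k, n_4)}{3} \cdot s(n_4, n_5) \cdot \frac{\sum_{l=13}^{14} s(n_l, n_5)}{2} = 0.171$$
This takes O(3+2) time.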
After aggregation, we reduce quadratic time computation to
linear time computation.
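The saving comes from the fact that an average of products factorizes into a product of averages when every path shares the middle term s(n4, n5). The sketch below checks this on the numbers from the example. Which of the values 0.9, 1.0, 0.8 belongs to which child is assumed, but the result does not depend on that assignment.

```python
# Similarities of the linked leaf nodes to their parents (example values).
s_to_n4 = {"n10": 0.9, "n11": 1.0, "n12": 0.8}
s_to_n5 = {"n13": 0.9, "n14": 1.0}
s_n4_n5 = 0.2

# Quadratic version: average the path-based similarity over all 3 x 2 pairs.
pairwise = sum(a * s_n4_n5 * b
               for a in s_to_n4.values()
               for b in s_to_n5.values())
pairwise /= len(s_to_n4) * len(s_to_n5)

# Linear version: aggregate each side once (3 + 2 terms), then multiply.
avg_n4 = sum(s_to_n4.values()) / len(s_to_n4)
avg_n5 = sum(s_to_n5.values()) / len(s_to_n5)
aggregated = avg_n4 * s_n4_n5 * avg_n5
```

Both variables evaluate to approximately 0.171, the value shown in the example.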
13
Simweights of Linkages
Simweight between nodes na and n4: the average similarity and total weight of the linkages between them.
na has a linkage of weight 1 and similarity 1 to each leaf node it is linked with.
[Figure: leaf nodes n10 to n14 annotated with simweights a:(1,1) or b:(1,1); n4 annotated with a:(0.9,3) and n5 with b:(0.95,2).]
simweight(na, n4) = ((0.9 + 1.0 + 0.8) / 3, 3) = (0.9, 3)
– First component: weighted average similarity of linkages between na and the children of n4
– Second component: total weight of linkages between na and the children of n4
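A simweight can be sketched as a small aggregation over (similarity, weight) pairs. The function below is an illustrative assumption about how the pair might be computed. It takes each linkage similarity to be the leaf-level similarity (1) multiplied by the child's similarity to its parent, which reproduces the numbers in the example.

```python
def simweight(linkages):
    """Aggregate linkages into (weighted average similarity, total weight).

    linkages: list of (similarity, weight) pairs, one per linked child.
    """
    total_weight = sum(w for _, w in linkages)
    avg_sim = sum(s * w for s, w in linkages) / total_weight
    return avg_sim, total_weight

# na reaches n4 through three children; nb reaches n5 through two children.
sw_a_n4 = simweight([(0.9, 1), (1.0, 1), (0.8, 1)])
sw_b_n5 = simweight([(0.9, 1), (1.0, 1)])
```

Here `sw_a_n4` is approximately (0.9, 3) and `sw_b_n5` approximately (0.95, 2), matching the annotations a:(0.9,3) and b:(0.95,2).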
14
Computing Similarity with Simweights
sim(na, nb) can be computed from aggregated similarities.
[Figure: na linked with n10, n11, n12 under n4, annotated a:(0.9,3); nb linked with n13, n14 under n5, annotated b:(0.95,2); s(n4, n5) = 0.2.]
sim(na, nb) = simweight(na, n4).sim x s(n4, n5) x simweight(nb, n5).sim
= 0.9 x 0.2 x 0.95 = 0.171
To compute sim(na,nb):
• Find all pairs of sibling nodes ni and nj, so that na linked with ni
and nb with nj.
• Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
• Calculate weighted average similarity between na and nb w.r.t. all
such pairs.
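The three steps above can be sketched as a weighted average over sibling pairs. The slides do not say how each pair is weighted. The sketch assumes the weight is the product of the two linkage weights, which makes the result equal to the plain average over all underlying leaf pairs. The function name and input layout are also hypothetical.

```python
def similarity_from_simweights(pairs):
    """Weighted average similarity over sibling pairs (ni, nj).

    pairs: list of ((sim_a, w_a), s_ij, (sim_b, w_b)) where (sim_a, w_a) is
    simweight(na, ni), s_ij is s(ni, nj), and (sim_b, w_b) is simweight(nb, nj).
    """
    numerator = 0.0
    denominator = 0.0
    for (sim_a, w_a), s_ij, (sim_b, w_b) in pairs:
        weight = w_a * w_b  # assumed weighting: number of leaf pairs covered
        numerator += weight * sim_a * s_ij * sim_b
        denominator += weight
    return numerator / denominator if denominator else 0.0

# Single sibling pair (n4, n5) from the example: 0.9 x 0.2 x 0.95.
sim_ab = similarity_from_simweights([((0.9, 3), 0.2, (0.95, 2))])
```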
15
Adjusting SimTree Structures
[Figure: a SimTree with nodes n1 to n9, illustrating a node being moved to a different parent.]
• After similarity changes, the tree structure also
needs to be changed
– If a node is more similar to its parent’s sibling, then move
it to be a child of that sibling
– Try to move each node to its parent’s sibling that it is most
similar to, under the constraint that each parent node can
have at most c children
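A simple way to respect the capacity constraint is a greedy assignment: consider (node, candidate parent) pairs in order of decreasing similarity and place each node at the first candidate that still has room. This greedy order is an assumption; the slides state only the goal and the constraint. All names and values below are hypothetical.

```python
def reassign(nodes, candidates, sim, c):
    """Assign each node to a similar candidate parent with at most c children.

    nodes: nodes to place
    candidates: dict node -> allowed parents (current parent and its siblings)
    sim: dict (node, parent) -> similarity between node and that parent
    c: maximum number of children per parent
    """
    load = {}
    assignment = {}
    # Strongest (node, parent) preferences are handled first.
    ranked = sorted(
        ((sim[(n, p)], n, p) for n in nodes for p in candidates[n]),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    for _, n, p in ranked:
        if n not in assignment and load.get(p, 0) < c:
            assignment[n] = p
            load[p] = load.get(p, 0) + 1
    return assignment

nodes = ["n7", "n8", "n9"]
candidates = {n: ["n4", "n5"] for n in nodes}
sim = {("n7", "n4"): 0.3, ("n7", "n5"): 0.8,
       ("n8", "n4"): 0.2, ("n8", "n5"): 0.9,
       ("n9", "n4"): 0.1, ("n9", "n5"): 0.7}
assignment = reassign(nodes, candidates, sim, c=2)
```

With c = 2, n8 and n7 fill n5, so n9 falls back to n4 even though it is more similar to n5.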
16
Complexity
For two types of objects, N in each, and M linkages between them:

                             Time              Space
Updating similarities        O(M(logN)^2)      O(M+N)
Adjusting tree structures    O(N)              O(N)
LinkClus                     O(M(logN)^2)      O(M+N)
SimRank                      O(M^2)            O(N^2)
17
Empirical Study
• Generating clusters using a SimTree
– Suppose K clusters are to be generated
– Find a level in the SimTree whose number of nodes is closest to K
– Merge the most similar nodes or divide the largest nodes on that level to get K clusters
• Accuracy
– Measured by manually labeled data
– Accuracy of clustering: percentage of pairs of objects in the same cluster that share a common label
• Efficiency and scalability
– Scalability w.r.t. number of objects, clusters, and linkages
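The accuracy measure described above can be sketched as a count over pairs of labeled objects. The sketch assumes each object has exactly one label and one cluster; the function name and the toy data are hypothetical.

```python
from itertools import combinations

def pairwise_accuracy(cluster_of, label_of):
    """Fraction of same-cluster pairs of labeled objects that share a label.

    cluster_of: dict object -> cluster id
    label_of: dict object -> manually assigned label (single label assumed)
    """
    same_cluster = 0
    shared_label = 0
    for x, y in combinations(sorted(label_of), 2):
        if cluster_of[x] == cluster_of[y]:
            same_cluster += 1
            if label_of[x] == label_of[y]:
                shared_label += 1
    return shared_label / same_cluster if same_cluster else 0.0

# Toy data: three objects in cluster 0, one in cluster 1.
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1}
label_of = {"a": "db", "b": "db", "c": "ai", "d": "ai"}
acc = pairwise_accuracy(cluster_of, label_of)
```

In the toy data, three pairs fall in the same cluster and only (a, b) shares a label, so the accuracy is 1/3.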
18
Approaches in Comparison
• SimRank (Jeh & Widom, KDD 2002)
– Computing pair-wise similarities
• Pruned-SimRank (P-SimRank)
– Only compute similarities between objects that are linked
to the same object
• SimRank with FingerPrints (F-SimRank)
– Fogaras & Rácz, WWW 2005
– Pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity
• ReCom (Wang et al. SIGIR 2003)
– Iteratively clustering objects using cluster labels of linked
objects
19
DBLP Dataset
Authors(author-id, author-name, email)
Publishes(author-id, paper-id)
Publications(paper-id, title, proc-id)
Proceedings(proc-id, conference, year, location)
Conferences(conference, publisher)
• We use the 4170 most productive authors and the 154 well-known conferences with the most proceedings
– Manually labeled the research areas of the 400 most productive authors according to their home pages (or publications)
– Manually labeled the areas of the 154 conferences according to their calls for papers
20
Accuracy
[Figure: two plots of accuracy versus #iteration (1 to 19) for LinkClus, SimRank, ReCom, and F-SimRank.]

Approaches   Accr-Author   Accr-Conf   average time
LinkClus     0.957         0.723       76.7
SimRank      0.958         0.760       1020
ReCom        0.907         0.457       43.1
F-SimRank    0.908         0.583       83.6
21
(continued)
[Figure: two plots of Accuracy versus Time (sec), from 0 to 1500 sec, for LinkClus, SimRank, ReCom, F-SimRank, and P-SimRank.]
• Accuracy vs. Running time
– LinkClus is almost as accurate as SimRank (most
accurate), and is much more efficient
22
Email Dataset
• F. Nielsen. Email dataset.
http://www.imm.dtu.dk/~rem/data/Email-1431.zip
• 370 emails on conferences, 272 on jobs, and 789 spam
emails
Approach     Accuracy   Total time (sec)
LinkClus     0.8026     1579.6
SimRank      0.7965     39160
ReCom        0.5711     74.6
F-SimRank    0.3688     479.7
CLARANS      0.4768     8.55
23
Scalability (1)
• Tested on synthetic datasets, with randomly
generated clusters
• Scalability w.r.t. number of objects
– Number of clusters is fixed (40)
[Figure: time (sec, log scale) and Accuracy versus #objects per relation (1000 to 5000) for LinkClus, SimRank, ReCom, and F-SimRank; the time plot includes reference curves O(N), O(N*(logN)^2), and O(N^2).]
24
Scalability (2)
• Scalability w.r.t. number of objects & clusters
– Each cluster has fixed size (100 objects)
[Figure: time (sec, log scale) and Accuracy versus #objects per relation (500 to 20000) for LinkClus, SimRank, ReCom, and F-SimRank; the time plot includes reference curves O(N), O(N*(logN)^2), and O(N^2).]
25
Scalability (3)
• Scalability w.r.t. number of linkages from each
object
[Figure: time (sec, log scale) and Accuracy versus selectivity (5 to 25) for LinkClus, SimRank, ReCom, and F-SimRank; the time plot includes reference curves O(S) and O(S^2).]
26
Conclusions
• With our data structure SimTree, LinkClus
can compress the pair-wise similarities while
achieving high accuracy
• Experimental results show that LinkClus is a
highly accurate and scalable approach for
clustering multi-typed linked objects
27
Thank you
• Questions and comments
28