
SCS CMU
Colibri: Fast Mining of Large
Static and Dynamic Graphs
Joint Work by
Hanghang Tong, Spiros Papadimitriou, Jimeng Sun,
Philip S. Yu, Christos Faloutsos
Speaker: Hanghang Tong
Aug. 24-27, 2008, Las Vegas
KDD 2008
Graphs are everywhere!
Q: How to find patterns?
e.g., community, anomaly, etc.
Motivation
• Q: How to find patterns?
– e.g., community, anomaly, etc.
• A: Low-Rank Approximation (LRA) for
Adjacency Matrix of the Graph.
$A \approx L \times M \times R$
LRA for Graph Mining: Example
Adj. matrix: $A \approx L \times M \times R$ (Author x Conf.)
[Figure: authors John, Tom, Bob, Carl, Van, Roy; conferences ICDM, KDD, ISMB, RECOMB; L: Au. clusters, M: Interaction, R: Conf. Cluster]
Recon. error is high → ‘Carl’ is abnormal
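A minimal sketch of this idea in Python (NumPy). The 0/1 entries of the author x conference matrix below are made up for illustration (they are not taken from the slide), and a rank-2 truncated SVD stands in for the low-rank approximation:

```python
import numpy as np

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
confs = ["ICDM", "KDD", "ISMB", "RECOMB"]

# Hypothetical adjacency matrix (illustrative values only):
# rows are authors, columns are conferences, 1 means "has published at".
A = np.array([
    [1, 1, 0, 0],   # John
    [1, 1, 0, 0],   # Tom
    [1, 1, 0, 0],   # Bob
    [0, 1, 1, 0],   # Carl: straddles both groups
    [0, 0, 1, 1],   # Van
    [0, 0, 1, 1],   # Roy
], dtype=float)

k = 2  # number of clusters kept in the low-rank approximation
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Per-author reconstruction error: norm of each row of A - A_hat.
recon_error = np.linalg.norm(A - A_hat, axis=1)
most_abnormal = authors[int(np.argmax(recon_error))]
print(most_abnormal)
```

With these made-up entries, the row for Carl is the one the two clusters explain worst, so it has the largest reconstruction error.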
Challenges
• How to get (L, M, R)
+ Efficiently (both time and space);
+ Intuitively (easy for interpretation);
+ Dynamically (track patterns over time)?
Roadmap
• Motivation
• Existing Methods
– SVD
– CUR/CX
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
Matrix & Column Space
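• Matrix
$B = [b_1 \; b_2] = \begin{bmatrix} 3 & 1 \\ 1 & 1 \\ 0 & 0 \end{bmatrix}$
b1, b2 are vectors in 3-d space!
• Column Space of a Matrix
[Figure: the column space of B, spanned by b1 and b2]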
Projection, Projection Matrix & Core Matrix
$\tilde{v} = B \, (B^T B)^{+} \, B^T v$
• $v$: an arbitrary vector
• $\tilde{v}$: projection of $v$
• $(B^T B)^{+}$: core matrix
• $B (B^T B)^{+} B^T$: projection matrix of B
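The projection formula can be checked numerically. The sketch below uses the matrix B from the "Matrix & Column Space" slide. The vector v and the helper name `project_onto_columns` are illustrative assumptions, not part of the talk:

```python
import numpy as np

# B from the "Matrix & Column Space" slide: columns b1 = (3, 1, 0), b2 = (1, 1, 0).
B = np.array([[3.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])

def project_onto_columns(B, v):
    """Return the projection of v onto the column space of B."""
    core = np.linalg.pinv(B.T @ B)   # core matrix (B^T B)^+
    return B @ core @ (B.T @ v)      # v~ = B (B^T B)^+ B^T v

v = np.array([2.0, 5.0, 7.0])        # an arbitrary vector (illustrative values)
v_proj = project_onto_columns(B, v)
print(v_proj)
```

Because b1 and b2 span the plane where the third coordinate is zero, the projection keeps the first two coordinates of v and drops the third, giving approximately (2, 5, 0).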
Singular-Value-Decomposition (SVD)
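$A \approx U \Sigma V^T$, with $A$: n x m and columns $a_1, a_2, \ldots, a_m$
• $U = [u_1 \ldots u_k]$: left singular vectors
• $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$: singular values
• $V = [v_1 \ldots v_k]$: right singular vectors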
SVD: How to
• #1: Find the left matrix U, where
$u_i = \frac{A v_i}{\sigma_i} = \frac{a_1 v_{i,1} + a_2 v_{i,2} + \ldots + a_m v_{i,m}}{\sigma_i}$
• #2: Project A into the column space of U
$A \approx U (U^T U)^{+} U^T A = \ldots = U \Sigma V^T$
($U (U^T U)^{+} U^T$: projection matrix of the column space of U)
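Both steps can be verified with NumPy's SVD routine. This is a small sketch on a random matrix. The sizes, the seed, and the rank k = 2 are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))   # illustrative n x m matrix
k = 2                    # target rank

U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U, sigma, V = U_full[:, :k], s[:k], Vt[:k].T

# Step 1: each left singular vector is a weighted sum of the columns of A,
# u_i = A v_i / sigma_i (broadcasting divides column i by sigma_i).
U_from_A = (A @ V) / sigma

# Step 2: project A onto the column space of U.
P = U @ np.linalg.pinv(U.T @ U) @ U.T   # projection matrix of the column space of U
A_k = P @ A

print(np.allclose(U_from_A, U), np.allclose(A_k, U @ np.diag(sigma) @ V.T))
```

Both checks hold for any matrix, since A v_i = sigma_i u_i is the defining relation between the left and right singular vectors.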
SVD: drawbacks
[Figure: $A = U \Sigma V^T$]
• Efficiency
– Time: $O(\min(n^2 m, n m^2))$
– Space: (U, V) are dense
• Interpretation
[Figure: 1st singular vector, 2nd singular vector]
• Dynamic: not easy
CUR (CX) decomposition
$A \approx C \times (C^T C)^{+} \times C^T A = C \times U \times R$, with $A$: n x m, $U = (C^T C)^{+}$, $R = C^T A$
• Sample columns from A to form C
• Project A onto the col. space of C
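The two steps can be sketched in a few lines of NumPy. The slide does not spell out the sampling scheme, so uniform sampling with replacement is an assumption here, and the function name `cx_approximation` and the matrix sizes are illustrative:

```python
import numpy as np

def cx_approximation(A, c, rng):
    """Sample c columns of A and build the factors C, U, R of the approximation."""
    m = A.shape[1]
    cols = rng.integers(0, m, size=c)          # assumption: uniform, with replacement
    C = A[:, cols]                             # n x c sampled columns
    # Explicit cutoff so duplicated columns (singular C^T C) are handled stably.
    U = np.linalg.pinv(C.T @ C, rcond=1e-10)   # (C^T C)^+, c x c
    R = C.T @ A                                # c x m
    return C, U, R

rng = np.random.default_rng(0)
A = rng.random((8, 10))                        # illustrative n x m matrix
C, U, R = cx_approximation(A, c=5, rng=rng)
A_hat = C @ U @ R                              # projection of A onto the col. space of C
rel_error = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(rel_error)
```

Sampling with replacement can pick the same column more than once, which is why the pseudo-inverse is used instead of a plain inverse.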
CUR (CX): advantages
• Efficiency (better than SVD)
– Time: $O(c^2 n)$ or $O(c^3 + cm)$
• (c is # of sampled col.s)
– Space: (C, R) are sparse
• Interpretation
CUR (CX): drawbacks
• Redundancy in C, wasting both time and space
• 3 copies of green,
• 2 copies of red,
• 2 copies of purple
• purple=0.5*green + red…
• Dynamic: not easy
Roadmap
• Motivation
• Existing Methods
• Colibri
– Colibri-S for static graphs
– Colibri-D for dynamic graphs
• Experimental Results
• Conclusion
Colibri-S: Basic Idea
[Figure: Original Matrix; CUR (CX) vs. Colibri-S, $A \approx L \times M \times R$]
CUR (CX) sampled columns:
• 3 copies of green,
• 2 copies of red,
• 2 copies of purple
• purple=0.5*green + red…
We want the col.s in L to be linearly independent of each other!
Input: initially sampled matrix C
Output:
• L: linearly ind. col.s
• $M = (L^T L)^{-1}$: core matrix
• $R = L^T \times A$
Q: How to find L & M from C efficiently?
A: Find L & M iteratively!
Start from the initially sampled matrix C and the current L & M.
For each col. v in C:
• Project it on L
• Redundant? Yes: discard v. No: expand L & M.
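The loop can be sketched as follows. The slides do not say how L & M are expanded, so the block-inverse update of M, the relative tolerance `tol`, and the function name `colibri_s_sketch` are assumptions made for this illustration rather than the authors' implementation:

```python
import numpy as np

def colibri_s_sketch(C, tol=1e-8):
    """Keep only linearly independent columns of C; return L and M = (L^T L)^-1."""
    n = C.shape[0]
    L = np.empty((n, 0))
    M = np.empty((0, 0))
    for v in C.T:                      # iterate over the columns of C
        y = M @ (L.T @ v)              # coefficients of the projection of v on L
        res = v - L @ y                # part of v outside the column space of L
        delta = res @ res
        if delta <= tol * (v @ v):
            continue                   # redundant: discard v
        # Expand M with the block-inverse identity, then append v to L.
        top = np.hstack([M + np.outer(y, y) / delta, -y[:, None] / delta])
        bottom = np.hstack([-y / delta, [1.0 / delta]])
        M = np.vstack([top, bottom])
        L = np.hstack([L, v[:, None]])
    return L, M

# Illustrative sampled matrix with the redundancy pattern from the CUR (CX) slide.
rng = np.random.default_rng(0)
green, red, blue = rng.random((3, 8))
purple = 0.5 * green + red
C = np.column_stack([green, red, green, purple, blue, red, green, purple])

L, M = colibri_s_sketch(C)
print(L.shape)
```

Only three of the eight sampled columns survive here, and updating M block by block avoids recomputing an inverse each time a column is added.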
Colibri-S vs. CUR(CX)
• Quality:
• Colibri-S = CUR(CX)
• Time: $O(\tilde{c}^3 + \tilde{c}\tilde{m})$ vs. $O(c^3 + cm)$, where $\tilde{c} \le c$, $\tilde{m} \le m$
• Colibri-S >= CUR(CX)
• Space
• Colibri-S >= CUR(CX)
• Illustrations
[Figure: Colibri-S vs. CUR (CX)]
Colibri-D for dynamic graphs
[Figure: at time t, the initially sampled matrix gives $L^t$, $M^t$, $R^t$; at time t+1, $L^{t+1}$, $M^{t+1}$, $R^{t+1}$ = ?]
Q: How to update L and M efficiently?
Colibri-D: How-To
[Figure: columns of the initially sampled matrix at t and at t+1, marked Selected or Redundant, with some columns at t+1 labeled "Changed from t"; $L^t$, $M^t$, $R^t$ are known, and $L^{t+1}$, $M^{t+1}$, $R^{t+1}$ = ?]
Colibri-D: How-To
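[Figure: Selected and Redundant columns of the initially sampled matrix at t and at t+1]
Unchanged cols! The selected columns that are unchanged from t give $\tilde{L}$ and $\tilde{M}$ (from $L^t$, $M^t$).
Subspace by blue cols at t+1: $L^{t+1}$, $M^{t+1}$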
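The slides do not state how the core matrix for the unchanged columns is obtained. One standard option, shown here purely as an assumption, is the block-inverse (Schur complement) identity, which derives it from $M^t$ without inverting from scratch. The function name `shrink_core` and the random data are illustrative:

```python
import numpy as np

def shrink_core(M_t, keep):
    """Core matrix for the kept columns only, derived from M_t without re-inverting."""
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(M_t.shape[0]), keep)
    if drop.size == 0:
        return M_t.copy()
    M_kk = M_t[np.ix_(keep, keep)]
    M_kd = M_t[np.ix_(keep, drop)]
    M_dd = M_t[np.ix_(drop, drop)]
    # Block-inverse identity; M_t is symmetric, so the lower-left block is M_kd^T.
    return M_kk - M_kd @ np.linalg.solve(M_dd, M_kd.T)

rng = np.random.default_rng(0)
L_t = rng.random((8, 4))                 # illustrative selected columns at time t
M_t = np.linalg.inv(L_t.T @ L_t)         # core matrix at time t

unchanged = [0, 2]                       # hypothetical: columns 1 and 3 changed at t+1
L_tilde = L_t[:, unchanged]
M_tilde = shrink_core(M_t, unchanged)
print(np.allclose(M_tilde, np.linalg.inv(L_tilde.T @ L_tilde)))
```

Starting from these two matrices, the changed columns at t+1 can then be tested one at a time and added only if they are not redundant, as in the static case.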
Roadmap
• Motivation
• Existing Methods
• Colibri
• Experimental Results
• Conclusion
Experimental Setup
• Datasets
• Network traffic
• 21,837 sources/destinations
• 1,222 consecutive hours
• 22,800 edges per hour
• Accuracy:
Accu =
• Space Cost:
Performance of Colibri-S
[Figures: Time and Space for CUR, CMD, Ours]
• Accuracy
• Same 91%+
• Time
• 12x faster than CMD
• 28x faster than CUR
• Space
• ~1/3 of CMD
• ~10% of CUR
More Evaluation on Colibri-S
[Figure: Log Time (Sec) vs. Approximation Accuracy for CUR, CMD, Colibri-S]
Performance of Colibri-D
[Figure: Time vs. # of changed cols for CMD, Colibri-S, Colibri-D]
Colibri-D achieves up to 112x speedups
A Family of Low-Rank Approximation
for Fast Graph Mining
• Colibri-S
– For static graphs
– Remove redundancy
– Significant savings in time & space for “free”
• Colibri-D
– For dynamic graphs
– Explores “smoothness”
– Up to 112x faster than the best known methods
Poster tonight!
Thank you!
www.cs.cmu.edu/~htong