Peta-Graph-Mining

CMU SCS
Peta-Graph Mining
Christos Faloutsos
Appel, Ana
Chau, Polo
Leskovec, Jure
Kang, U
Prakash, Aditya
Shringarpure, Suyash
Tsourakakis, Charalampos
Yahoo/Hadoop, 2008
#1
CMU SCS
Our goal:
One-stop solution for mining huge graphs
Yahoo/Hadoop, 2008
2
CMU SCS
Outline
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
Yahoo/Hadoop,
(b) NetFlix (20K movies,
~500K2008
users, 100M edges)
3
CMU SCS
Degree Distributions - NetFlix
count
100 machines - 8min
Yahoo/Hadoop, 2008
Movie in-degree
4
CMU SCS
Degree Distributions - NetFlix
count
Theoretically expected
100 machines - 8min
Yahoo/Hadoop, 2008
Movie in-degree
5
CMU SCS
Degree Distributions - NetFlix
count
100 machines - 8min
Yahoo/Hadoop, 2008
User out-degree
6
CMU SCS
Degree Distributions - NetFlix
count
Theoretically expected
Sharp drop
below 100 ratings
User out-degree
100 machines - 8min
Yahoo/Hadoop, 2008
7
CMU SCS
Degree Distributions - Kronecker
count
degree
100 machines - 6h
Yahoo/Hadoop, 2008
Nodes:259M
- Edges: 1B
8
CMU SCS
Degree Distributions - timings
Time (sec)
24 tasks
48 tasks
1 task
Yahoo/Hadoop, Edge
2008
file size (MB)
9
CMU SCS
Outline
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
Yahoo/Hadoop,
(b) NetFlix (20K movies,
~500K2008
users, 100M edges)
10
CMU SCS
Diameter
Diameter of a graph
Maximum shortest path
Normally, > O(N**2)
ANF : `Approximate Neighborhood
function’ [Palmer+02]: O(E)
Goal : calculate neighborhood function
Neighborhood N(h) : number of pairs
of nodes within distance h
Yahoo/Hadoop, 2008
11
CMU SCS
Time (min)
Diameter
1 node
48 nodes
28 nodes
Edge file (MB)


For large jobs, parallelization helps
Yahoo/Hadoop, 2008
Unstable results due
to shared machines
12
CMU SCS
Diameter / Hop Plot (Netflix)
# of reachable pairs
within <= h hops
h: # of hops
Yahoo/Hadoop, 2008
13
CMU SCS
Diameter / Hop Plot (Netflix)
# of reachable pairs
within <= h hops
Diameter: 3
h: # of hops
Yahoo/Hadoop, 2008
14
CMU SCS
Outline
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
Yahoo/Hadoop,
(b) NetFlix (20K movies,
~500K2008
users, 100M edges)
15
CMU SCS
Community detection
• Cross associations [Chakrabarti+ ’04]
Yahoo/Hadoop, 2008
16
CMU SCS
Community detection
Yahoo/Hadoop, 2008
17
CMU SCS
Outline
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
Yahoo/Hadoop,
(b) NetFlix (20K movies,
~500K2008
users, 100M edges)
18
CMU SCS
Triangles
• ‘friends of friends are friends’
Yahoo/Hadoop, 2008
19
CMU SCS
Triangles
• ‘friends of friends are friends’
Yahoo/Hadoop, 2008
20
CMU SCS
Triangles
• ‘friends of friends are friends’
• Naïve algo: 3-way join (slow)
• [Tsourakakis’08]: # triangles ~ sum of
cubes of eigenvalues
• Thus, super-fast computation of #triangles
(100x - 25,000x faster than naïve; >95%
accuracy
Yahoo/Hadoop, 2008
21
CMU SCS
Triangles
• Easy to implement on hadoop: it only needs
eigenvalues (to do, with Lanczos)
Yahoo/Hadoop, 2008
22
CMU SCS
Outline
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Datasets:
(a) Synthetic (‘Kronecker’, ~300M nodes, 1B edges)
Yahoo/Hadoop,
(b) NetFlix (20K movies,
~500K2008
users, 100M edges)
23
CMU SCS
Visualization
Principled visualization of large graphs
Yahoo/Hadoop, 2008
(show few most `important’ edges)
24
CMU SCS
Summary
Centralized
Hadoop
Degree Distribution
old
old
Pagerank
old
old
Diameter/ANF
old
X
Communities
old
X
Triangles
X
todo
Visualization
X
todo
Goal: one-stop solution for mining huge graphs
Yahoo/Hadoop, 2008
25