Massive Graph Triangulation - by X. Hu, Y. Tao, and C. Chung

Massive Graph Triangulation
by X. Hu, Y. Tao, and C. Chung, SIGMOD’13
Ilias Giechaskiel
Cambridge University, R212
[email protected]
February 21, 2014
Conclusions
Takeaway Messages
I Triangle listing important input for graph properties
I I/O becomes bottleneck for massive graphs
I
I
Obvious approach doesn’t work
MGT algorithm
I
I
I
Total order of vertices guarantees unique triangle orientation
Near optimal asymptotic I/O + CPU performance
Much faster than alternatives in practice
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
2 / 19
Triangle Listing
Definition
Given a graph G = (V , E ), list exactly once all
∆v1 v2 v3 = {v1 , v2 , v3 } such that vi ∈ V and (vi , vj ) ∈ E
Motivation
I Triangle = shortest non-trivial cycle and clique
I Various metrics
I
I
I
I
Dense neighborhood discovery
Triangular connectivity
k-truss
Clustering coefficient
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
3 / 19
In-Memory Triangle Listing [CC12]
The Algorithm
procedure list(G )
∆(G ) ← ∅
loop u ∈ V
loop v ∈ adjG (u) & v > u
loop w ∈ adjG (u) ∩ adjG (v ) & w > v
∆(G ) ← ∆(G ) ∪ {∆uvw }
return ∆(G )
The Problem
I Random access to adjG (v ) for v ∈ adjG (u)
I O (|E | · scan(dmax )) I/Os in the worst case
I
I
When it doesn’t fit in the memory of size M
Recall: scan(N) = Θ(N/B) where B is the disk block size
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
4 / 19
Motivation
Previous Approaches
I
External Memory Compact Forward (EM-CF)
I
I
I
I
External Memory Node Iterator (EM-NI)
I
I
I
I
O |E | + |E |1.5 /B I/Os
|E | I/O reads
Output insensitive
O |E |1.5 /B · logM/B (|E |/B) I/Os
Almost insensitive to M
Output insensitive
Graph Partition [CC12]
I
I
I
I
O |E |2 /(MB) + K /B I/Os where K triangles
p
In practice, M > |E |
If M = c|E |, asymptotically optimal
But under a set of assumptions...
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
5 / 19
Contributions
This Approach
I
I
O |E |2 /(MB) + K /B I/Os in all settings
O |E | log |E | + |E |2 /M + α|E | CPU time
I
α is the arboricity of the graph
I
Both optimal up to constants
I
Key idea: total order for unique triangle orientation
I
Side note: also improves analysis of previous work
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
6 / 19
Orienting G
Defining G ∗
I Define ≺ on V by u ≺ v iff
I
I
I
G ∗ is G with edges oriented by ≺
I
I
I
d(u) < d(v ) or d(u) = d(v ) and id(u) < id(v )
Is a total order
Takes O (sort(|E |)) I/Os
Recall: sort(N) = Θ N/BlogM/B N/B
Every triangle {u, v , w } has unique orientation u ≺ v ≺ w
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
7 / 19
The Algorithm
Initial Idea
1. Load next cM edges of G ∗ into memory (Emem )
I
All-or-nothing requirement (small-degree assumption)
2. Find all triangle with pivot edges in Emem
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
8 / 19
The Algorithm
Step 2 (Initial)
procedure list(G , Emem )
loop u ∈ V
Vmem (u) ← N + (u) ∩ Vmem
Find triangles with u cone in Emem (u) ∪ Emem
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
9 / 19
The Algorithm
Step 2 (Details)
procedure list(G ∗ , Emem )
Build hash structures
loop u ∈ V
Vmem (u) ← N + (u) ∩ Vmem
+ (u)
loop v ∈ Vmem
loop w ∈ Vmem (u)
if v 6= w & (v , w ) ∈ Emem then
Output ∆uvw
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
10 / 19
The Algorithm
Analysis
I
O |E |2 /(MB) + K /B I/O
I
I
I
I
O |E | log |E | + |E |2 /M + α|E | CPU
I
I
I
I
I
I
Θ (|E |/M) iterations
O (|E |/B) I/Os for scanning
O (K /B) for listing
O (|E | log |E |) for G ∗ sorting
Θ (|E |/M) iterations
+
O (|N + (u)| + |N + (u)| · |Vmem
(u)|)
+
Σ|N (u)| = |E |
Σv ∈V d + (v )2 = O(α|E |)
Optimality comes from considering the complete graph
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
11 / 19
The Algorithm
Small-Degree Assumption
I
What if ∃v such that d + (v ) > cM/2?
1.
2.
3.
4.
5.
Find one
Load a set S of cM/2 of its out-edges
Report all triangles involving one of the edges in S
Remove S from the graph
Repeat
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
12 / 19
The Algorithm
Small-Degree Assumption
I How to implement step 3
I
I
I
I
Create hash table of loaded vertices
Scan all |E | edges
Also scan N(v ) for each v 6= u with u ∈ N(v )
Does not change complexity
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
13 / 19
Evaluation
Experimental Setup
I
8GB memory (but memory conscious)
I
Graphs unoriented
Real data
I
I
I
I
I
I
I
364MB to 7.5GB
4.8 to 165 million vertices
28 to 938 million edges
|E |/|V | from 1.2 to 15.1
Varied M from 5% to 25% of disk size
Synthetic data
I
I
I
Random, Recursive Matrix, Small World
m = 16n, n from 16 to 80 million
2.1GB to 10.6GB
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
14 / 19
Evaluation
Real Data
I MGT always better for CPU
I
MGT almost always better for I/O
I
RGP higher hidden constant in complexity!
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
15 / 19
Evaluation
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
16 / 19
Evaluation
Criticism
I I/O analysis excludes cost of sorting
I Algorithm does not exploit parallelism
I
I
I
I
Is inherently sequential
Not applicable to distributed environment
Or across cores
RGP ideas applied in this case [PC13]
I
Block I/O model for SSDs and parallel environment?
I
Behavior for large-degree vertices
I
Experiments lacking when M bigger percentage of graph
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
17 / 19
Conclusions
Key Insights
I Total order of vertices guarantees unique triangle orientation
I
Key idea simple, but multiple tricks
I
Near optimal asymptotic I/O + CPU performance
I
Much faster than alternatives in practice
Key Questions
I Can you parallelize the algorithms non-trivially on a single PC?
I
How can you extend the I/O model to different environments?
I
How can you minimize data transfers in a distr. environment?
I
Your questions?
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
18 / 19
Bibliography I
Shumo Chu and James Cheng, Triangle listing in massive
networks, ACM Trans. Knowl. Discov. Data 6 (2012), no. 4,
17:1–17:32.
Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung, Massive
graph triangulation, Proceedings of the 2013 ACM SIGMOD
International Conference on Management of Data (New York,
NY, USA), SIGMOD ’13, ACM, 2013, pp. 325–336.
Ha-Myung Park and Chin-Wan Chung, An efficient mapreduce
algorithm for counting triangles in a very large graph,
Proceedings of the 22Nd ACM International Conference on
Conference on Information &#38; Knowledge Management
(New York, NY, USA), CIKM ’13, ACM, 2013, pp. 539–548.
Ilias Giechaskiel [email protected]
Massive Graph Triangulation
19 / 19