Parallel Depth First on GPU

M. Naumov, A. Vrielink and M. Garland, GTC 2017
AGENDA
Introduction
Directed Trees
Directed Acyclic Graphs (DAGs)
 • Path- and SSSP-based variants
 • Optimizations
Performance Experiments
What is DFS?
(graph figure omitted; the rows below show the final state of the step-by-step traversal starting at a)
Node:      a,b,c,d,e,f,g,i,j
Parent:    /,a,a,a,b,b,d,f,f
Discovery: a,b,e,f,i,j,c,d,g
Finish:    e,i,j,f,b,c,g,d,a
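The Parent, Discovery and Finish rows above can be reproduced with a short sequential DFS. This is only a sketch: the edge list is an assumption read off the Parent row, and the function name `dfs` is illustrative rather than taken from the talk.

```python
def dfs(children, root):
    """Sequential DFS; returns (parent, discovery order, finish order)."""
    parent = {root: None}
    discovery, finish = [], []

    def visit(u):
        discovery.append(u)              # u is discovered on entry
        for v in children.get(u, []):    # children in their stored order
            if v not in parent:          # skip nodes that were already reached
                parent[v] = u
                visit(v)
        finish.append(u)                 # u finishes after all its descendants

    visit(root)
    return parent, discovery, finish


# Edges inferred from the Parent row of the slide (an assumption).
children = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
parent, discovery, finish = dfs(children, "a")
print(",".join(discovery))  # a,b,e,f,i,j,c,d,g
print(",".join(finish))     # e,i,j,f,b,c,g,d,a
```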
Previous Work on DFS
Lexicographic DFS
                                  Time               Processors
Planar Graphs                     O(log^2 n)         O(n)
Directed Acyclic Graphs (DAGs)    O(log^2 n)         O(n^ω / log n)
Directed Graphs with Cycles       O(√n log^11 n)     O(n^3)
where ω < 2.373 is the matrix multiplication exponent
DFS is a building block for topological sort, bi-connectivity and planarity testing
DIRECTED TREES
Directed Tree
Phase 2: Bottom-Up Traversal
(tree figure omitted: root a with children b, c, d; b has children e, f; d has child g; f has children i, j)
 • Every leaf (c, e, g, i, j) starts with [0]
 • Each parent gathers the sub-tree sizes of its children, then applies a prefix sum
     f: [0,1,1] → prefix sum → [0,1,2]
     d: [0,1]
     b: [0,1,3] → prefix sum → [0,1,4]
     a: [0,5,1,2] → prefix sum → [0,5,6,8]
This phase is done, next phase is about to start …
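The bottom-up phase can be sketched sequentially as follows. This is a minimal illustration, not the GPU implementation from the talk: the name `bottom_up` and the dictionary-based tree are assumptions, and a real version would process nodes in parallel rather than one at a time.

```python
from itertools import accumulate


def bottom_up(children, root):
    """Return (size, prefix): sub-tree sizes and per-node prefix-sum lists."""
    # Collect nodes so that every parent appears before its descendants.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    size, prefix = {}, {}
    for u in reversed(order):                        # descendants first, root last
        counts = [size[v] for v in children.get(u, [])]
        prefix[u] = list(accumulate([0] + counts))   # e.g. [0,5,1,2] -> [0,5,6,8]
        size[u] = 1 + sum(counts)                    # the node plus its descendants
    return size, prefix


# Example tree assumed from the slides.
children = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
size, prefix = bottom_up(children, "a")
print(prefix["a"])  # [0, 5, 6, 8]
```

The last entry of each list equals the number of descendants of that node, which is the quantity the top-down phase needs for finish times.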
Directed Tree
Phase 3: Top-down Traversal
(tree figure omitted; per-node lists from Phase 2: a [0,5,6,8], b [0,1,4], f [0,1,2], d [0,1], leaves [0])
 • Offsets are passed from the root down
     e: offset 0    i: offset 1    d: offset 6
 • discovery = offset + depth
     e: 0+2    i: 1+3    d: 6+1
 • finish = offset + sub-tree size
     e: 0+0    i: 1+0    d: 6+1
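The two formulas above can be checked with a small sequential sketch. It assumes the prefix-sum lists from Phase 2 are already available (hard-coded here from the slides), that a child's offset is its parent's offset plus the matching prefix-sum entry, and that "sub-tree size" counts descendants only. The name `top_down` is illustrative.

```python
def top_down(children, prefix, root):
    """Return (discovery, finish) indices computed from the Phase 2 lists."""
    offset, depth = {root: 0}, {root: 0}
    discovery, finish = {}, {}
    stack = [root]
    while stack:
        u = stack.pop()
        discovery[u] = offset[u] + depth[u]
        finish[u] = offset[u] + prefix[u][-1]      # last entry = number of descendants
        for k, v in enumerate(children.get(u, [])):
            offset[v] = offset[u] + prefix[u][k]   # nodes in earlier sibling sub-trees
            depth[v] = depth[u] + 1
            stack.append(v)
    return discovery, finish


# Tree and prefix-sum lists as shown on the slides.
children = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
prefix = {"a": [0, 5, 6, 8], "b": [0, 1, 4], "f": [0, 1, 2], "d": [0, 1],
          "c": [0], "e": [0], "g": [0], "i": [0], "j": [0]}
discovery, finish = top_down(children, prefix, "a")
print("".join(sorted(discovery, key=discovery.get)))  # abefijcdg
print("".join(sorted(finish, key=finish.get)))        # eijfbcgda
```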
DIRECTED ACYCLIC GRAPHS
PATH-BASED VARIANT
Path-Based (for DAGs)
Phase 1
(DAG figure omitted: nodes a, b, c, d, e, f, g, i, j)
 • collision at f: left path [a,b,f], right path [a,d,f]
 • collision resolution
     • wait until all paths to a node are traversed
     • align path sequences
     • compare left-to-right and choose smallest
     left [a,b,f] (lexicographically smallest), right [a,d,f]
This phase is done
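The collision rule can be sketched sequentially as below. Several things here are assumptions rather than details from the talk: the example DAG adds an edge d → f to the tree (inferred from the right path [a,d,f]), node labels are compared directly (which presumes that label order matches child order), the root is the only node without parents, and the name `path_based_parents` is illustrative.

```python
from collections import deque


def path_based_parents(children, root):
    """Top-down pass over a DAG; each node keeps its lexicographically smallest path."""
    indegree = {}
    for u, vs in children.items():
        indegree.setdefault(u, 0)
        for v in vs:
            indegree[v] = indegree.get(v, 0) + 1

    best = {root: (root,)}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in children.get(u, []):
            candidate = best[u] + (v,)
            # Collision: compare the paths left to right and keep the smaller one.
            if v not in best or candidate < best[v]:
                best[v] = candidate
            indegree[v] -= 1
            if indegree[v] == 0:       # all paths to v have been traversed
                queue.append(v)

    parent = {v: p[-2] for v, p in best.items() if len(p) > 1}
    return best, parent


# Tree from the earlier slides plus the assumed extra edge d -> f.
children = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["f", "g"], "f": ["i", "j"]}
best, parent = path_based_parents(children, "a")
print(best["f"], parent["f"])  # ('a', 'b', 'f') b
```

A node is only expanded once its in-degree counter reaches zero, which mirrors the "wait until all paths to a node are traversed" step.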
OPTIMIZATIONS
Path Pruning
(figure omitted: paths [a,b,e,f] and [a,c,d,f] both reach f)
When two paths reach the same node
 • There exists a parent “a” where the path split [a,b,…] and [a,c,…]
 • It is the comparison between “b” and “c” that allows us to distinguish between paths
 • Parent node with a single edge will never be a decision point
 • No need to store nodes with such parents
     [a,c,d,f] → [a,c,f]
     [a,b,e,f] → [a,b,f]
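A possible reading of the pruning rule, sketched on the example above. Keeping the first and last node of every path is an assumption based on the pruned paths shown on the slide, and `prune_path` and `out_degree` are illustrative names.

```python
def prune_path(path, out_degree):
    """Drop interior nodes whose predecessor on the path has a single outgoing edge."""
    if len(path) <= 2:
        return list(path)
    kept = [path[0]]                       # always keep the start of the path
    for prev, node in zip(path[:-1], path[1:-1]):
        if out_degree[prev] > 1:           # prev is a possible decision point
            kept.append(node)
    kept.append(path[-1])                  # always keep the node being reached
    return kept


# Out-degrees for the assumed example: a -> b -> e -> f and a -> c -> d -> f.
out_degree = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1, "f": 0}
left = prune_path(["a", "b", "e", "f"], out_degree)
right = prune_path(["a", "c", "d", "f"], out_degree)
print(left, right)  # ['a', 'b', 'f'] ['a', 'c', 'f']
```

The comparison between the two pruned paths is still decided at the same position ("b" versus "c"), so the result of the collision resolution is unchanged while less data is stored.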
Phase Composition
SSSP-BASED VARIANT
SSSP-based (for DAGs)
Phase 1: Bottom-Up Traversal
Run the algorithm for Directed Trees, but
 • Propagate # of nodes to all the parents
 • Start prefix sum with 1 (instead of 0)
(DAG figure omitted)
     leaves (c, e, g, i, j): [1]
     f: [1,1,1] → prefix sum → [1,2,3]
     d: [1,1] → prefix sum → [1,2]
     b: [1,1,3] → prefix sum → [1,2,5]
     a: [1,5,1,2,1] → prefix sum → [1,6,7,9,10]
SSSP-based (for DAGs)
Assign # of nodes as the edge weight
(DAG figure omitted; edge weights shown: 6, 1, 1, 7, 9, 1, 2, 1, 2)
This phase is done, next phase is about to start …
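The weight assignment can be sketched as follows. The example DAG is an assumption: it is the earlier tree plus an extra edge a → j, which is what the weight 9 and the later comparison against 9 suggest. The name `edge_weights` is illustrative, and the recursion stands in for the level-by-level parallel traversal.

```python
def edge_weights(children, root):
    """Return (count, weight) for a DAG, with the prefix sum started at 1."""
    count, weight = {}, {}

    def visit(u):
        if u in count:                      # already counted via another parent
            return count[u]
        running = 1                         # prefix sum starts with 1 (the node itself)
        for v in children.get(u, []):
            weight[(u, v)] = running        # 1 + counts of the earlier children
            running += visit(v)
        count[u] = running                  # shared descendants may be counted twice
        return running

    visit(root)
    return count, weight


# Assumed example: the tree from the earlier slides plus the edge a -> j.
children = {"a": ["b", "c", "d", "j"], "b": ["e", "f"], "d": ["g"], "f": ["i", "j"]}
count, weight = edge_weights(children, "a")
print(weight[("a", "c")], weight[("a", "d")], weight[("a", "j")])  # 6 7 9
```

Because j is propagated to both of its parents, the count at a is 10 rather than the true number of nodes (9); the counts only need to be large enough to keep later siblings behind earlier ones.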
SSSP-based (for DAGs)
Phase 2: Top-down traversal
(DAG figure omitted; edge weights as assigned in Phase 1)
 • 1+2+2=5 < 9
 • Shortest Path is the DFS path
Phase 2: This phase is done
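The comparison 1+2+2=5 < 9 can be reproduced with any single-source shortest-path routine. The sketch below uses Dijkstra's algorithm as a stand-in, since the slides do not say which SSSP algorithm is used; the assignment of weights to edges is inferred from the figure values and should be read as an assumption.

```python
import heapq


def sssp(weight, root):
    """Dijkstra over weighted edges; returns (dist, parent)."""
    adj = {}
    for (u, v), w in weight.items():
        adj.setdefault(u, []).append((v, w))

    dist, parent = {root: 0}, {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:                    # stale heap entry
            continue
        for v, w in adj.get(u, []):
            if v not in dist or d + w < dist[v]:
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent


# Edge weights inferred from the slide (assumed mapping of values to edges).
weight = {("a", "b"): 1, ("a", "c"): 6, ("a", "d"): 7, ("a", "j"): 9,
          ("b", "e"): 1, ("b", "f"): 2, ("d", "g"): 1,
          ("f", "i"): 1, ("f", "j"): 2}
dist, parent = sssp(weight, "a")
print(dist["j"], parent["j"])  # 5 f
```

The direct edge a → j costs 9, so j is reached through a, b, f instead, which matches the parent of j in the sequential DFS.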
OPTIMIZATIONS
Discovery time
Phase 3a: Sorting
 • The length of shortest path defines an ordering of nodes
 • We can sort them to obtain discovery time
(figure omitted; node labels: a 0, b 1, e 2, f 3, i 4, j 5, c 6, d 7, g 8)
Discovery: a,b,e,f,i,j,c,d,g
Phase 3a: This phase is done
(Phase 3b will find the finish time)
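The sorting step amounts to ordering nodes by their shortest-path length. A minimal sketch, with the lengths copied from the figure labels (the pairing of labels and nodes is an assumption) and Python's built-in sort standing in for a parallel sort:

```python
# Shortest-path lengths per node, as labeled in the figure.
dist = {"a": 0, "b": 1, "e": 2, "f": 3, "i": 4, "j": 5, "c": 6, "d": 7, "g": 8}

order = sorted(dist, key=dist.get)                  # nodes by increasing path length
discovery = {v: t for t, v in enumerate(order)}     # rank in that order = discovery time
print(",".join(order))  # a,b,e,f,i,j,c,d,g
```

In this example the lengths already equal the discovery times; in general only their relative order is needed, which is why the ranks after sorting are used.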
Phase composition
EXPERIMENTS
Data
 #   Graph                n           m           Application
 1   coPapersDBLP         540487      15251812    Citations
 2   auto                 448696      3350678     Numeric Sim.
 3   hugebubbles-000…     18318144    30144175    Numeric Sim.
 4   delaunay_n24         16777217    52556391    Random Tri.
 5   il2010               451555      1166978     Census Data
 6   fl2010               484482      1270757     Census Data
 7   ca2010               710146      1880571     Census Data
 8   tx2010               914232      2403504     Census Data
 9   great-britain_osm    7733823     8523976     Road Network
10   germany_osm          11548846    12793527    Road Network
11   road_central         14081817    21414269    Road Network
12   road_usa             23947348    35246600    Road Network
When necessary, DAGs are created from general graphs by dropping back edges
Performance
Speedup (Parallel vs. Seq. DFS)
(bar chart omitted; legend: Path-based, SSSP-based, 3 BFS; y-axis: speedup, 0 to 4; annotations: 6x, 5x)
Results obtained with Nvidia Pascal TitanX GPU, Intel Core i7-3930K @3.2GHz CPU, Ubuntu 14.04 LTS OS, CUDA Toolkit 8.0
CONCLUSIONS
Conclusions
 • Parallel DFS for DAGs
     • Work-efficient O(m+n)
     • The algorithm takes O(z log n) steps, where z is the maximum depth of a node
 • Performance
     • Depends highly on the connectivity/sparsity pattern
     • Can achieve up to 6x speedup (but slowdown possible)
 • Details
     • M. Naumov, A. Vrielink and M. Garland, “Parallel Depth-First Search for Directed Acyclic Graphs”, Technical Report, NVR-2017-001, 2017
       https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs
Thank you
https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs