Parallel Depth First on GPU M. Naumov, A. Vrielink and M. Garland, GTC 2017 Introduction Directed Trees AGENDA Directed Acyclic Graphs (DAGs) Path- and SSSP-based variants Optimizations Performance Experiments 2 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j Discovery: e g f i Finish: j 3 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a Discovery: a,b e g f i Finish: j 4 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a, b, Discovery: a,b,e e g f i Finish: e j 5 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a, b,b Discovery: a,b,e,f e g f i Finish: e j 6 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a, b,b, ,f Discovery: a,b,e,f,i e g f i Finish: e,i j 7 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a, b,b, ,f,f Discovery: a,b,e,f,i,j e g f i Finish: e,i,j j 8 What is DFS? a d c b Node: Parent: a,b,c,d,e,f,g,i,j /,a,a,a,b,b,d,f,f Discovery: a,b,e,f,i,j,c,d,g e g f i Finish: e,i,j,f,b,c,g,d,a j 9 Previous Work on DFS Lexicographic DFS Planar Graphs Directed Acyclic Graphs (DAGs) Directed Graphs with Cycles Time O(log2n) Processors O(n) Time O(log2n) Processors O(nω/log n) Time O( 𝑛 log11n) Processors O(n3) where ω < 2.373 is the matrix multiplication exponent 10 Previous Work on DFS Lexicographic DFS Planar Graphs Directed Acyclic Graphs (DAGs) Directed Graphs with Cycles Time O(log2n) Processors O(n) Time O(log2n) Processors O(nω/log n) Time O( 𝑛 log11n) Processors O(n3) topological sort, bi-connectivity and planarity testing where ω < 2.373 is the matrix multiplication exponent 11 DIRECTED TREES 12 Directed Tree a c b [0] f e d g [0] [0] i [0] j [0] Phase 2: Bottom-Up Traversal 13 Directed Tree a c b f e [0] [0] [0,1,1] d [0,1] g [0] i [0] j [0] Phase 2: Bottom-Up Traversal 14 Directed Tree a c b f e [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] prefix sum Phase 2: Bottom-Up Traversal 15 Directed Tree a b c [0,1,3] f e [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] Phase 2: Bottom-Up Traversal 16 Directed Tree a b c [0,1,4] f e [0] [0,1,2] [0,1] g [0] i prefix sum [0] d [0] j [0] Phase 2: Bottom-Up Traversal 17 Directed Tree [0,5,1,2] a b c [0,1,4] f e [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] Phase 2: Bottom-Up Traversal 18 Directed Tree [0,5,6,8] a b c [0,1,4] f e [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] Phase 2: Bottom-Up Traversal 19 Directed Tree [0,5,6,8] a b c [0,1,4] f e [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] This phase is done, next phase is about to start … 20 Directed Tree [0,5,6,8] a b c [0,1,4] f e offset 0 [0] [0] [0,1,2] d [0,1] g [0] i [0] j [0] Phase 3: Top-down Traversal 21 Directed Tree [0,5,6,8] a b c [0,1,4] f e offset 0 [0] [0,1,2] [0,1] g [0] i offset 1 [0] d [0] j [0] Phase 3: Top-down Traversal 22 Directed Tree [0,5,6,8] a b c [0,1,4] f e offset 0 [0] [0,1,2] [0,1] offset 6 g [0] i offset 1 [0] d [0] j [0] Phase 3: Top-down Traversal 23 Directed Tree [0,5,6,8] a b c [0,1,4] f e discovery 0+2 [0] [0] [0,1,2] d [0,1] discovery 6+1 g [0] i discovery 1+3 [0] j [0] discovery = offset + depth Phase 3: Top-down Traversal 24 Directed Tree [0,5,6,8] a b c [0,1,4] f e finish 0+0 [0] [0,1,2] [0,1] finish 6+1 g [0] i finish 1+0 [0] d [0] j [0] finish = offset + sub-tree size Phase 3: Top-down Traversal 25 DIRECTED ACYCLIC GRAPHS PATH-BASED VARIANT 26 Path-Based (for DAGs) a b c d e f g i j collision left [a,b,f] Phase 1 right f [a,d,f] 27 Path-Based (for DAGs) a b c d e f g i • wait until all paths to a node are traversed • align path sequences left [a,b,f] (lexicographically smallest) right [a,d,f] • compare left-to-right and choose smallest j collision left resolution Phase 1 [a,b,f] right f [a,d,f] 28 Path-Based (for DAGs) a b c d e f g i j This phase is done 29 OPTIMIZATIONS 30 Path Pruning a b c d e [a,c,d,f] f [a,b,e,f] 31 Path Pruning When two paths reach the same node There exists a parent “a” where the path split [a,b,…] and [a,c,…] a b c d e [a,c,d,f] f [a,b,e,f] 32 Path Pruning When two paths reach the same node There exists a parent “a” where the path split [a,b,…] and [a,c,…] a b c d e [a,c,d,f] f It is the comparison between “b” and “c” that allows us to distinguish between paths [a,b,e,f] 33 Path Pruning When two paths reach the same node There exists a parent “a” where the path split [a,b,…] and [a,c,…] a b c d e [a,c,d,f] f It is the comparison between “b” and “c” that allows us to distinguish between paths Parent node with a single edge will never be a decision point [a,b,e,f] 34 Path Pruning When two paths reach the same node There exists a parent “a” where the path split [a,b,…] and [a,c,…] a b c d e [a,c,f] f [a,b,f] It is the comparison between “b” and “c” that allows us to distinguish between paths Parent node with a single edge will never be a decision point No need to store nodes with such parents 35 Path Pruning 36 Phase Composition 37 SSSP-BASED VARIANT 38 SSSP-based (for DAGs) a b c e f [1] d g [1] [1] i [1] j [1] Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 39 SSSP-based (for DAGs) a b c e f [1] [1] [1,1,1] d [1,1] g [1] i [1] j [1] Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 40 SSSP-based (for DAGs) a b c e f [1] [1] [1,2,3] d [1,2] g [1] i [1] j [1] prefix sum Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 41 SSSP-based (for DAGs) a b c [1,1,3] f e [1] [1] [1,2,3] d [1,2] g [1] i [1] j [1] Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 42 SSSP-based (for DAGs) a b c [1,2,4] f e [1] prefix sum i [1] [1] [1,2,3] d [1,2] g [1] j [1] Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 43 SSSP-based (for DAGs) [1,5,1,2,1] a b c [1,2,4] f e [1] [1] [1,2,3] d [1,2] g [1] i [1] j [1] Run Runthe thealgorithm algorithmfor forDirected DirectedTrees, Trees,but but Propagate Propagate##ofofnodes nodestotoall allthe theparents parents Start Startprefix prefixsum sumwith with11(instead (insteadofof0)0) Phase 1: Bottom-Up Traversal 44 SSSP-based (for DAGs) [1,6,7,9,10] a b c [1,2,4] f e [1] [1] [1,2,3] d [1,2] g [1] i [1] j [1] Run the algorithm for Directed Trees, but Propagate # of nodes to all the parents Start prefix sum with 1 (instead of 0) Phase 1: Bottom-Up Traversal 45 SSSP-based (for DAGs) a 6 1 c b 1 7 9 d 1 2 f e 1 i g 2 j Assign # of nodes as the edge weight This phase is done, next phase is about to start … 46 SSSP-based (for DAGs) a 6 1 c b 1 7 9 d 1 2 f e 1 i g 2 j 1+2+2=5 < 9 Phase 2: Top-down traversal 47 SSSP-based (for DAGs) a 6 1 c b 1 7 9 d 1 2 f e 1 i g 2 j 1+2+2=5 < 9 Shortest Path is the DFS path Phase 2: Top-down traversal 48 SSSP-based (for DAGs) a b c d e f g i j Phase 2: This phase is done 49 OPTIMIZATIONS 50 Discovery time The length of shortest path defines an ordering of nodes 0 1 b 6 2 e 3 a c 7 f 8 i j 4 5 Phase 3a: Sorting d g 51 Discovery time The length of shortest path defines an ordering of nodes We can sort them to obtain discovery time 0 1 b 6 2 e 3 a c 7 f 8 i j 4 5 Phase 3a: Sorting d g 52 Discovery time The length of shortest path defines an ordering of nodes We can sort them to obtain discovery time Discovery: a,b,e,f,i,j,c,d,g 0 1 b 6 2 e 3 a c 7 f 8 i j 4 5 Phase 3a: This phase is done (Phase 3b will find the finish time) d g 53 Phase composition 54 EXPERIMENTS 55 Data # Graph n m Application 1 coPapersDBLP 540487 15251812 Citations 2 auto 448696 3350678 Numeric Sim. 3 hugebubbles-000… 18318144 30144175 Numeric Sim. 4 delaunay_n24 16777217 52556391 Random Tri. 5 il2010 451555 1166978 Census Data 6 fl2010 484482 1270757 Census Data 7 ca2010 710146 1880571 Census Data 8 tx2010 914232 2403504 Census Data 9 great-britain_osm 7733823 8523976 Road Network 10 germanu_osm 11548846 12793527 Road Network 11 road_central 14081817 21414269 Road Network 12 road_usa 23947348 35246600 Road Network When necessary DAGs are created from general graphs by dropping back edges 56 Speedup (Parallel vs. Seq. DFS) Performance 6x 4 3.5 3 5x Path-based SSSP-based 3 BFS 2.5 2 1.5 1 0.5 0 Results obtained with Nvidia Pascal TitanX GPU, Intel Core i7-3930K @3.2GHz CPU, Ubuntu 14.04 LTS OS, CUDA Toolkit 8.0 57 CONCLUSIONS 58 Conclusions Parallel DFS for DAGs Work-efficient O(m+n) The algorithm takes O(z log n) steps, where z is the maximum depth of a node Performance Depends highly on the connectivity/sparsity pattern Can achieve up to 6x speedup (but slowdown possible) Details M. Naumov, A. Vrielink and M. Garland, “Parallel Depth-First Search for Directed Acyclic Graphs”, Technical Report, NVR-2017-001, 2017 https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs 59 Thank you https://research.nvidia.com/publication/parallel-depth-first-search-directed-acyclic-graphs 60
© Copyright 2026 Paperzz