Parallel Breadth First Search on GPU Clusters
http://mapgraph.io

This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029. The authors would like to thank Dr. White, NVIDIA, and the MVAPICH group at Ohio State University for their support of this work.

Z. Fu, H.K. Dasari, B. Bebee, M. Berzins, B. Thompson. Parallel Breadth First Search on GPU Clusters. IEEE Big Data. Bethesda, MD. 2014.

SYSTAP™, LLC © 2006-2015 All Rights Reserved. http://BlazeGraph.com | http://MapGraph.io

Many-Core is Your Future

Graphs are everywhere, and the need for graph analytics is growing rapidly.
• Communication and social networks
• Human brain and biological networks
• E-commerce and online service analytics
• Real-time fraud detection

• Facebook has ~1 trillion edges in its graph.
• Over 30 minutes per iteration using 200 machines.
• All computations require multiple iterations (6-50).
• We could do it in seconds on a cluster of GPUs.

SYSTAP, LLC – Small Business, Founded 2006, 100% Employee Owned
• Graph database: high performance, scalable
  – 50B edges/node
  – High-level query language
  – Efficient graph traversal
  – High-9s solution
  – Open source, with subscriptions
• GPU analytics: extreme performance
  – 100s of times faster than CPUs
  – 10,000x faster than graph databases
  – 30,000,000,000 edges/sec on 64 GPUs
  – DARPA funding
  – Disruptive technology: early adopters, huge ROIs

Customers Powering Their Graphs with SYSTAP
• Information management / retrieval
• Genomics / precision medicine
• Defense, intel, cyber
SYSTAP is focused on building software that enables graphs at speed and scale.

• Blazegraph™ for Property and RDF Graphs
  – High Availability (HA) architecture with horizontal scaling
• Mapgraph™ GPU-accelerated data-parallel graph analytics
  – Vertex-centric API
  – Single or multi-GPU
  – 10,000x faster than Accumulo, Neo4j, Titan, etc.
  – 3x faster than the Cray XMT-2 at 1/3rd the price

GPU Hardware Trends
• K40 GPU (today)
  – 12 GB RAM/GPU
  – 288 GB/s memory bandwidth
  – PCIe Gen 3
• Pascal GPU (Q1 2016)
  – 32 GB RAM/GPU
  – 1 TB/s memory bandwidth
  – Unified memory across CPU and GPUs
  – NVLINK: high-bandwidth access to host memory

MapGraph: Extreme Performance
• GTEPS is billions (10^9) of traversed edges per second, the basic measure of performance for graph traversal.

Configuration                                                                   Cost        GTEPS   $/GTEPS
4-core CPU                                                                      $4,000      0.2     $5,333
4-core CPU + K20 GPU                                                            $7,000      3.0     $2,333
XMT-2 (rumored price)                                                           $1,800,000  10.0    $188,000
64 GPUs (32 nodes with 2x K20 GPUs per node and InfiniBand DDRx4 – today)       $500,000    30.0    $16,666
16 GPUs (2 nodes with 8x Pascal GPUs per node and InfiniBand DDRx4 – Q1 2016)   $125,000    >30.0   <$4,166
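For reference, GTEPS in the table above is just traversed edges divided by traversal time, in units of 10^9 edges per second. The minimal sketch below recomputes it from the scale-27 weak-scaling run reported later in this deck; the code and its constants are purely illustrative.

#include <cstdint>
#include <cstdio>

int main() {
    // Numbers from the scale-27 weak-scaling row later in the deck
    // (4,294,967,296 directed edges traversed in 0.1478 s on 64 GPUs).
    const std::uint64_t traversed_edges = 4294967296ULL;
    const double seconds = 0.1478;
    const double gteps = static_cast<double>(traversed_edges) / (seconds * 1e9);
    std::printf("%.1f GTEPS\n", gteps);  // prints 29.1, matching the table
    return 0;
}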
SYSTAP's MapGraph APIs Roadmap
• Vertex-centric API
  – Same performance as CUDA
• Ease of use
  – Schema-flexible data model
  – Property Graph / RDF
  – Reduce import hassles
  – Operations over merged graphs
• Graph pattern matching language
• DSL/Scala => CUDA code generation

Breadth First Search
• Fundamental building block for graph algorithms
  – Including SPARQL (JOINs)
• In level-synchronous steps, label the visited vertices
  – For this graph: iteration 0, iteration 1, iteration 2
  [Figure: example graph with each vertex labeled by the BFS iteration (0-3) in which it is visited.]
• Hard problem!
  – Basis for the Graph 500 benchmark

GPUs – A Game Changer for Graph Analytics
• Graphs are a hard problem
  – Non-locality
  – Data-dependent parallelism
  – Memory bus and communication bottlenecks
• GPUs deliver effective parallelism
  – 10x+ memory bandwidth
  – Dynamic parallelism
  – ~3 GTEPS on a single GPU: a 10x speedup
[Figure: breadth-first search rate (millions of traversed edges per second) vs. average traversal depth for an NVIDIA Tesla C2050, a multicore socket, and a sequential implementation. Graphic: Merrill, Garland, and Grimshaw, "GPU Sparse Graph Traversal", GPU Technology Conference, 2012.]

BFS Results: MapGraph vs. GraphLab
• CPU vs. GPU: the GPU is 15x-300x faster
• More CPU cores does not help
[Figure: MapGraph speedup vs. GraphLab BFS with 2, 4, 8, and 12 CPU cores (GL-2 through GL-12) on the Webbase, Delaunay, Bitcoin, Wiki, and Kron graphs.]

PageRank: MapGraph vs. GraphLab
• CPU vs. GPU: the GPU is 5x-90x faster
• The CPU slows down with more cores!
[Figure: MapGraph speedup vs. GraphLab PageRank with 2, 4, 8, and 12 CPU cores on the same graphs.]

2D Partitioning (aka Vertex Cuts)
• p x p compute grid, one partition per GPU
  – Grid row: out-edges
  – Grid column: in-edges
• Parallelism: work must be distributed and balanced
• Memory bandwidth: memory, not disk, is the bottleneck
  – Edges stored in rows/columns
• Minimize messages: log(p) communication steps (versus p^2)
• Batch parallel operation
• Representative frontiers (Query, BFS, PageRank)
[Figure: 4x4 partition grid G11..G44 with source vertices along the rows and target vertices along the columns; each row produces an Out frontier and each column consumes an In frontier.]
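As a concrete illustration of the 2D layout above, the sketch below shows one simple way an edge can be mapped onto the p x p grid: vertices are split into p contiguous blocks, the source block selects the grid row, and the target block selects the grid column. The block layout and the Grid2D/owner names are illustrative assumptions, not MapGraph's actual data structures.

#include <cstdint>
#include <utility>

// Hypothetical 2D (vertex-cut) placement. Vertices are split into p contiguous
// blocks; the directed edge (u, v) is stored on grid cell (block(u), block(v)),
// i.e. on the GPU at row block(u) and column block(v). Each grid row therefore
// holds the out-edges of its vertex block and each grid column holds the
// in-edges of its block, matching the row/column roles on the slide above.
struct Grid2D {
    std::uint64_t num_vertices;  // total number of vertices in the graph
    int p;                       // the grid is p x p, one partition per GPU

    int block_of(std::uint64_t v) const {
        const std::uint64_t block_size = (num_vertices + p - 1) / p;  // ceiling division
        return static_cast<int>(v / block_size);
    }

    // Grid coordinates (row, column) of the GPU that stores edge u -> v.
    std::pair<int, int> owner(std::uint64_t u, std::uint64_t v) const {
        return { block_of(u), block_of(v) };
    }
};

With a layout like this, a BFS frontier only has to be exchanged along a single row and a single column of GPUs rather than among all p^2 partitions, which is what the "minimize messages" bullet above refers to.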
Scale 25 Traversal
• Work spans multiple orders of magnitude.
[Figure: frontier size per BFS iteration on a log scale, growing from a single vertex at iteration 0 to roughly 10,000,000 vertices in the central iterations and collapsing again by iteration 6.]

Distributed BFS Algorithm
1:  procedure BFS(Root, Predecessor)
2:    In^0_ij ← LocalVertex(Root)                            ▹ Starting vertex
3:    for t ← 0 do
4:      Expand(In^t_i, Out^t_ij)                             ▹ Data-parallel 1-hop expand on all GPUs
5:      LocalFrontier^t ← Count(Out^t_ij)                    ▹ Local frontier size
6:      GlobalFrontier^t ← Reduce(LocalFrontier^t)           ▹ Global frontier size
7:      if GlobalFrontier^t > 0 then                         ▹ Termination check
8:        Contract(Out^t_ij, Out^t_j, In^(t+1)_i, Assign_ij) ▹ Global frontier contraction (the "wave")
9:        UpdateLevels(Out^t_j, t, level)                    ▹ Record the level of vertices in the In/Out frontier
10:     else
11:       UpdatePreds(Assign_ij, Preds_ij, level)            ▹ Compute predecessors from local In/Out levels (no communication)
12:       break                                              ▹ Done
13:     end if
14:     t++                                                  ▹ Next iteration
15:   end for
16: end procedure
(Figure 3 of the paper: the distributed Breadth First Search algorithm.)

• Key differences
  – log(p) parallel scan (vs. a sequential "wave")
  – GPU-local computation of predecessors
  – 1 partition per GPU
  – GPUDirect (vs. RDMA)
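Below is a host-side sketch of the level-synchronous driver loop in Figure 3. The expand/contract/update functions are hypothetical stand-ins for the device-side steps, and the MPI_Allreduce termination check follows the description in the paper text; this illustrates the control flow only, not MapGraph's implementation.

#include <mpi.h>
#include <cstdint>

// Placeholder device-side steps; each stands in for a line of Figure 3.
// In a real implementation these would launch CUDA kernels on this GPU's partition.
static std::uint64_t expand_frontier()   { return 0; }  // lines 4-5: 1-hop expand, returns |Out^t_ij|
static void contract_frontier(int /*t*/) {}             // line 8: row scan + row/column broadcasts
static void update_levels(int /*t*/)     {}             // line 9: record level of frontier vertices
static void update_preds()               {}             // line 11: GPU-local predecessor pass
static void seed_frontier(std::uint64_t) {}             // line 2: place the root in In^0

static void bfs(std::uint64_t root) {
    seed_frontier(root);
    for (int t = 0; ; ++t) {
        std::uint64_t local = expand_frontier();   // data-parallel expand on every GPU
        std::uint64_t global = 0;                  // global frontier size (termination check)
        MPI_Allreduce(&local, &global, 1, MPI_UINT64_T, MPI_SUM, MPI_COMM_WORLD);
        if (global == 0) {                         // frontier exhausted: traversal is done
            update_preds();                        // no further communication required
            break;
        }
        contract_frontier(t);                      // the "wave": build In^{t+1} on all GPUs
        update_levels(t);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    bfs(0);                                        // hypothetical root vertex 0
    MPI_Finalize();
    return 0;
}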
Expand
1:  procedure EXPAND(In^t_i, Out^t_ij)
2:    L_in ← convert(In^t_i)
3:    L_out ← ∅
4:    for all v ∈ L_in in parallel do
5:      for i ← RowOff[v], RowOff[v + 1] do
6:        c ← ColIdx[i]
7:        L_out ← L_out ∪ {c}
8:      end for
9:    end for
10:   Out^t_ij ← convert(L_out)
11: end procedure
(Figure 4 of the paper: the Expand algorithm.)
• The GPU implementation uses multiple strategies to handle data-dependent parallelism. See our SIGMOD 2014 paper for details.

Global Frontier Contraction and Communication
1:  procedure CONTRACT(Out^t_ij, Out^t_j, In^(t+1)_i, Assign_ij)
2:    Prefix^t_ij ← ExclusiveScan_j(Out^t_ij)               ▹ Distributed prefix sum; Prefix is the set of vertices discovered by your left neighbors (log p steps)
3:    Assigned_ij ← Assigned_ij ∪ (Out^t_ij − Prefix^t_ij)  ▹ Vertices first discovered by this GPU
4:    if i = p then
5:      Out^t_j ← Out^t_ij ∪ Prefix^t_ij                    ▹ The rightmost column has the global frontier for the row
6:    end if
7:    Broadcast(Out^t_j, p, ROW)                            ▹ Broadcast the frontier over the row
8:    if i = j then
9:      In^(t+1)_i ← Out^t_j
10:   end if
11:   Broadcast(In^(t+1)_i, i, COL)                         ▹ Broadcast the frontier over the column
12: end procedure
(Figure 5 of the paper: global frontier contraction and communication.)

• The broadcasts use communicators set up along every row and column of the grid during initialization to support the parallel scan and broadcast operations.
• Termination is checked each iteration by testing whether the global frontier is zero (line 6 of Figure 3); the global frontier size is currently computed with an MPI_Allreduce after the edge expansion and before the global contraction phase. An unexplored optimization is to overlap the termination check with the contraction and communication phase.
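Below is a bare-bones CUDA sketch of the expand step in Figure 4: one thread per frontier vertex walks its CSR adjacency range and marks the discovered targets in a bitmap. The real kernels use the multiple load-balancing strategies from the SIGMOD 2014 paper, so treat this naive one-thread-per-vertex version, and all of its names, as illustrative only.

// Naive expand kernel. frontier[] holds the vertices of In^t assigned to this
// GPU, row_off/col_idx are the partition's local CSR arrays, and out_bitmap is
// one bit per local target vertex representing Out^t_ij.
__global__ void expand_naive(const unsigned* __restrict__ frontier,
                             unsigned frontier_size,
                             const unsigned* __restrict__ row_off,
                             const unsigned* __restrict__ col_idx,
                             unsigned* out_bitmap) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    unsigned v = frontier[i];
    // Lines 5-8 of Figure 4: visit every out-edge of v.
    for (unsigned e = row_off[v]; e < row_off[v + 1]; ++e) {
        unsigned c = col_idx[e];
        // Set bit c of the output frontier; duplicate discoveries are harmless.
        atomicOr(&out_bitmap[c >> 5], 1u << (c & 31));
    }
}

// Example launch, one thread per frontier vertex:
//   expand_naive<<<(n + 255) / 256, 256>>>(d_frontier, n, d_row_off, d_col_idx, d_out_bitmap);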
Global Frontier Contraction and Communication (continued)
• Prefix^t_ij is the bitmap obtained by performing a bitwise-OR exclusive scan across each row. Because the bitmap union is associative, a tree-based scan needs only log(p) communication steps, even though it performs p·log(p) bitmap unions.
• We use a parallel scan that minimizes communication steps (rather than work).
• This improves the overall scaling efficiency by 30%.
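The row-wise exclusive scan in line 2 of Figure 5 maps naturally onto MPI: a bitwise-OR MPI_Exscan over the frontier bitmaps of a row communicator leaves each GPU with exactly the vertices already discovered by its left neighbors, in O(log p) communication steps. The sketch below assumes host-resident bitmaps and a pre-built row communicator; MapGraph's own scan operates on GPU buffers (see the GPUDirect bullet earlier), so this only illustrates the shape of the communication pattern.

#include <mpi.h>
#include <algorithm>
#include <vector>

// out_bitmap: this GPU's Out^t_ij as a bitmap over the row's target vertices.
// row_comm:   communicator over the p GPUs of one grid row, ranked left to right.
// On return, prefix_bitmap has a bit set for every vertex discovered by a GPU
// strictly to the left (line 2 of Figure 5); out & ~prefix then gives the
// vertices first discovered here (line 3).
void row_exclusive_scan(const std::vector<unsigned>& out_bitmap,
                        std::vector<unsigned>& prefix_bitmap,
                        MPI_Comm row_comm) {
    prefix_bitmap.resize(out_bitmap.size());
    MPI_Exscan(out_bitmap.data(), prefix_bitmap.data(),
               static_cast<int>(out_bitmap.size()),
               MPI_UNSIGNED, MPI_BOR, row_comm);
    int rank = 0;
    MPI_Comm_rank(row_comm, &rank);
    // MPI leaves rank 0's receive buffer undefined for Exscan: its prefix is empty.
    if (rank == 0) std::fill(prefix_bitmap.begin(), prefix_bitmap.end(), 0u);
}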
Global Frontier Contraction and Communication (completed)
• After the row and column broadcasts, all GPUs now have the new frontier.

Update Levels
1: procedure UPDATELEVELS(Out^t_j, t, level)
2:   for all v ∈ Out^t_j in parallel do
3:     level[v] ← t
4:   end for
5: end procedure
(Figure 6 of the paper: updating the levels.)
• Unlike the Blue Gene/Q implementation, we do not send messages to update vertex predecessors and levels during the traversal. Instead, we store both the In and Out levels associated with the frontier vertices.
• This allows us to compute the predecessors in a GPU-local manner.

Predecessor Computation
1:  procedure UPDATEPREDS(Assigned_ij, Preds_ij, level)
2:    for all v ∈ Assigned_ij in parallel do
3:      Pred[v] ← -1
4:      for i ← ColOff[v], ColOff[v + 1] do
5:        if level[v] == level[RowIdx[i]] + 1 then
6:          Pred[v] ← RowIdx[i]
7:        end if
8:      end for
9:    end for
10: end procedure
(Figure 7 of the paper: the predecessor update.)
• Predecessors are computed after the traversal is complete, using node-local In/Out levels.
• No inter-node communication is required.
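A CUDA sketch of the GPU-local predecessor pass in Figure 7: one thread per assigned vertex scans its in-edges (CSC layout, ColOff/RowIdx) and selects an in-neighbor whose level is exactly one smaller. The array names mirror the figure, but the kernel itself is an illustrative assumption rather than the shipped implementation.

// One thread per vertex in Assigned_ij (vertices first discovered by this GPU).
__global__ void update_preds(const unsigned* __restrict__ assigned,  // vertex ids in Assigned_ij
                             unsigned assigned_size,
                             const unsigned* __restrict__ col_off,   // CSC column offsets (in-edges)
                             const unsigned* __restrict__ row_idx,   // CSC row indices (in-neighbors)
                             const int* __restrict__ level,          // BFS level of every local vertex
                             int* pred) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= assigned_size) return;
    unsigned v = assigned[i];
    pred[v] = -1;                                  // sentinel: no predecessor found yet
    for (unsigned e = col_off[v]; e < col_off[v + 1]; ++e) {
        unsigned u = row_idx[e];
        if (level[v] == level[u] + 1) {            // u was reached exactly one level earlier
            pred[v] = static_cast<int>(u);         // any such u is a valid BFS parent
            break;                                 // the first match suffices
        }
    }
}

// Example launch: update_preds<<<(n + 255) / 256, 256>>>(d_assigned, n, d_col_off, d_row_idx, d_level, d_pred);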
Weak Scaling
• Scaling the problem size with more GPUs (fixed problem size per GPU).
• Maximum scale 27 (4.3B edges)
  – 64 K20 GPUs => 0.147 s => 29 GTEPS
  – 64 K40 GPUs => 0.135 s => 32 GTEPS
  – The K40 has a faster memory bus.

#GPUs  Scale  Vertices     Edges          Time (s)  GTEPS
1      21     2,097,152    67,108,864     0.0254    2.5
4      23     8,388,608    268,435,456    0.0429    6.3
16     25     33,554,432   1,073,741,824  0.0715    15.0
64     27     134,217,728  4,294,967,296  0.1478    29.1

Central Iteration Costs (Weak Scaling)
• Communication costs are not constant: the 2D design implies the cost grows as 2 log(2p) / log(p).
• How to scale?
  – Overlapping
  – Compression (graph and message)
  – Hybrid partitioning
  – Heterogeneous computing
[Figure: per-GPU computation, communication, and termination time (seconds) of the central iteration on 4, 16, and 64 GPUs.]

Costs During BFS Traversal
• The chart shows the different costs for each GPU in each iteration (64 GPUs).
• Wave time is essentially constant, as expected.
• Compute time peaks during the central iterations.
• Costs are reasonably well balanced across all GPUs after the 2nd iteration.

Strong Scaling
• Speedup on a constant problem size with more GPUs.
• Problem scale 25
  – 2^25 vertices (33,554,432)
  – 2^30 directed edges (1,073,741,824)
• Strong scaling efficiency of 48% (versus 44% for BG/Q)

GPUs  GTEPS  Time (s)
16    15.2   0.071
25    18.2   0.059
36    20.5   0.053
49    21.8   0.049
64    22.7   0.047

Directions to Improve Scaling
• Overlap computation with communication
  – Multiple partitions per GPU
  – Frontier compression
• Hybrid partitioning
  – Degree-aware data layout + bottom-up search optimization (this also requires asynchronous communication and per-target-node buffers)
  – Graph-aware partitioning plus the 2D data layout
• Uintah-style data warehouse
  – Hand off tasks to workers (Uintah)
  – Hybrid CPU/GPU computation strategies (TOTEM)

Concept: Accelerate Key-Value Stores
• Accumulo + MapReduce: 1,200 nodes, 1T edges.
• 1 BFS step @ 270 MTEPS => 2 hours.
• Assuming a 10,000x speedup => 0.72 s per step.

MapGraph Timings (single GPU)
• Orkut social network: 2.3M vertices, 92M edges. Most of the time is *load* on the CPU.
• The next step eliminates that overhead: 62,500 ms => 63 ms (1,000x faster).
  – Load graph (CPU): parse 47,000 ms; build CSR 15,000 ms; total 62,000 ms. This cost can be eliminated.
  – Build CSC: COO => CSC 244 ms; transfer to GPU 232 ms.
  – BFS (GPU): 63 ms, 45M traversed edges, 725 MTEPS.
• Once the graph is on the GPU, it is very cheap to run other algorithms, different starting vertices, etc.

Current and Future Code Streams
• Multi-GPU MapGraph (implemented): 32 GTEPS on 64 GPUs on a 4.3-billion-edge graph. Extreme performance.
• MapGraph + Java (implemented): single GPU, JNI interface, schema-flexible data model (PG/RDF).
• Current activity (horizontal scaling): overlap computation with communication for better scaling; multiple partitions per GPU.
• Next steps: updates pushed to the GPU in the background, with no ETL latency impact and full GPU acceleration.

MapGraph Beta Customer?
Contact Bradley Bebee, [email protected]
http://BlazeGraph.com | http://MapGraph.io
SYSTAP™, LLC © 2006-2015 All Rights Reserved