Parallel Breadth First Search on GPU Clusters
http://mapgraph.io

This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029.
The authors would like to thank Dr. White, NVIDIA, and the MVAPICH group at Ohio State University for their support of this work.
Z. Fu, H.K. Dasari, B. Bebee, M. Berzins, B. Thompson. "Parallel Breadth First Search on GPU Clusters." IEEE Big Data, Bethesda, MD, 2014.
Many-Core is Your Future
Graphs are everywhere, and the need for graph analytics is growing rapidly.
•  Communication and social networks
•  Human brain and biological networks
•  E-commerce and online service analytics
•  Real-time fraud detection

•  Facebook has ~1 trillion edges in its graph.
•  Over 30 minutes per iteration using 200 machines.
•  All computations require multiple iterations (6-50).
•  We could do it in seconds on a cluster of GPUs.
SYSTAP, LLC: Small Business, Founded 2006, 100% Employee Owned

Graph Database
•  High performance, scalable
  –  50B edges/node
  –  High-level query language
  –  Efficient graph traversal
  –  High-9s solution
•  Open source
  –  Subscriptions

GPU Analytics
•  Extreme performance
  –  100s of times faster than CPUs
  –  10,000x faster than graph databases
  –  30,000,000,000 edges/sec on 64 GPUs
•  DARPA funding
•  Disruptive technology
  –  Early adopters
  –  Huge ROIs
Customers Powering Their Graphs with SYSTAP
•  Information Management / Retrieval
•  Genomics / Precision Medicine
•  Defense, Intel, Cyber

SYSTAP is focused on building software that enables graphs at speed and scale.
•  Blazegraph™ for Property and RDF Graphs
  –  High Availability (HA) architecture with horizontal scaling
•  Mapgraph™ GPU-accelerated data-parallel graph analytics
  –  Vertex-centric API
  –  Single or multi-GPU
  –  10,000x faster than Accumulo, Neo4J, Titan, etc.
  –  3x faster than Cray XMT-2 at 1/3rd the price
GPU Hardware Trends
•  K40 GPU (today)
  –  12 GB RAM/GPU
  –  288 GB/s bandwidth
  –  PCIe Gen 3
•  Pascal GPU (Q1 2016)
  –  32 GB RAM/GPU
  –  1 TB/s bandwidth
  –  Unified memory across CPU and GPUs
  –  NVLINK
  –  High-bandwidth access to host memory
MapGraph: Extreme Performance
•  GTEPS is billions (10^9) of Traversed Edges Per Second (see the example below).
  –  This is the basic measure of performance for graph traversal.

Configuration                                                                 | Cost       | GTEPS | $/GTEPS
4-Core CPU                                                                    | $4,000     | 0.2   | $5,333
4-Core CPU + K20 GPU                                                          | $7,000     | 3.0   | $2,333
XMT-2 (rumored price)                                                         | $1,800,000 | 10.0  | $188,000
64 GPUs (32 nodes with 2x K20 GPUs per node and InfiniBand DDRx4 – today)     | $500,000   | 30.0  | $16,666
16 GPUs (2 nodes with 8x Pascal GPUs per node and InfiniBand DDRx4 – Q1 2016) | $125,000   | >30.0 | <$4,166
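To make the metric concrete, here is a minimal Python sketch of how GTEPS and cost-per-GTEPS are derived. The helper names are ours, not part of MapGraph, and the example numbers are taken from the tables in this deck.

import numpy as np  # not required, kept for consistency with later sketches

def gteps(traversed_edges: int, seconds: float) -> float:
    """Billions (10^9) of traversed edges per second."""
    return traversed_edges / seconds / 1e9

def dollars_per_gteps(cost_usd: float, gteps_value: float) -> float:
    return cost_usd / gteps_value

# Example: the 64-GPU configuration above, traversing a scale-27 graph
# (4,294,967,296 edges) in roughly 0.148 s (see the weak-scaling slide).
g = gteps(4_294_967_296, 0.1478)                       # ~29 GTEPS
print(round(g, 1), round(dollars_per_gteps(500_000, g)))  # ~29.1 GTEPS, ~$17K/GTEPS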
SYSTAP's MapGraph APIs Roadmap
•  Vertex-centric API
  –  Same performance as CUDA
•  Ease of use
  –  Schema-flexible data model
  –  Property Graph / RDF
  –  Reduce import hassles
  –  Operations over merged graphs
•  Graph pattern matching language
•  DSL/Scala => CUDA code generation
Breadth First Search
•  Fundamental building block for graph algorithms
  –  Including SPARQL (JOINs)
•  Proceeds in level-synchronous steps (see the sketch below)
  –  Label visited vertices with their level
•  For this graph
  –  Iteration 0, Iteration 1, Iteration 2, ...
  [Figure: example graph with each vertex labeled by its BFS level, 0 through 3.]
•  Hard problem!
  –  Basis for the Graph 500 benchmark
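As a reference point for the level-synchronous formulation on this slide, here is a minimal single-threaded Python sketch over a CSR graph. It is not MapGraph code, just the textbook algorithm that the GPU implementation parallelizes.

import numpy as np

def bfs_levels(row_off, col_idx, root):
    """Level-synchronous BFS over a CSR graph: label each vertex with its level."""
    n = len(row_off) - 1
    level = np.full(n, -1, dtype=np.int64)
    level[root] = 0
    frontier = [root]
    t = 0
    while frontier:                      # one iteration per BFS level
        next_frontier = []
        for v in frontier:               # expand every vertex in the frontier
            for e in range(row_off[v], row_off[v + 1]):
                w = col_idx[e]
                if level[w] == -1:       # contract: keep only unvisited vertices
                    level[w] = t + 1
                    next_frontier.append(w)
        frontier = next_frontier
        t += 1
    return level

# Tiny directed example: 0->1, 0->2, 1->3
row_off = np.array([0, 2, 3, 3, 3])
col_idx = np.array([1, 2, 3])
print(bfs_levels(row_off, col_idx, 0))   # [0 1 1 2]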
GPUs – A Game Changer for Graph Analytics
•  Graphs are a hard problem
  –  Non-locality
  –  Data-dependent parallelism
  –  Memory bus and communication bottlenecks
•  GPUs deliver effective parallelism
  –  10x+ memory bandwidth
  –  Dynamic parallelism
  –  3 GTEPS
  –  10x speedup on GPUs
[Chart: "Breadth-First Search on Graphs", millions of traversed edges per second vs. average traversal depth (10 to 100,000), comparing an NVIDIA Tesla C2050, multicore per socket, and sequential CPU code. Graphic: Merrill, Garland, and Grimshaw, "GPU Sparse Graph Traversal", GPU Technology Conference, 2012.]
BFS Results: MapGraph vs. GraphLab
•  CPU vs. GPU: the GPU is 15x-300x faster.
•  More CPU cores does not help.
[Chart: MapGraph speedup vs. GraphLab (BFS) on the Webbase, Delaunay, Bitcoin, Wiki, and Kron graphs, for GraphLab with 2, 4, 8, and 12 cores (GL-2 .. GL-12) and MapGraph (MPG); log-scale speedup from 0.10 to 1,000.]
PageRank Results: MapGraph vs. GraphLab
•  CPU vs. GPU: the GPU is 5x-90x faster.
•  The CPU slows down with more cores!
[Chart: MapGraph speedup vs. GraphLab (PageRank) on the Webbase, Delaunay, Bitcoin, Wiki, and Kron graphs, for GL-2 .. GL-12 and MapGraph (MPG); log-scale speedup from 0.10 to 100.]
2D Partitioning (aka Vertex Cuts)
•  p x p compute grid (see the sketch after this slide)
  –  One partition per GPU
  –  Edges are spread over grid rows/columns
  –  Grid row: out-edges; grid column: in-edges
•  Parallelism: work must be distributed and balanced
  –  Batch-parallel operation
  –  Representative frontiers
•  Memory bandwidth: memory, not disk, is the bottleneck
•  Minimize messages
  –  log(p) (versus p^2)
[Figure: a 4x4 partition grid G11..G44 over source and target vertices, with per-row and per-column frontier vectors In^t and Out^t feeding the Query/BFS/PageRank kernels; grid rows hold out-edges and grid columns hold in-edges.]
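A minimal sketch (our own illustration, not MapGraph's data layout) of how an edge list might be assigned to a p x p grid of partitions, one partition per GPU, so that each GPU owns the edges whose sources fall in its grid row and whose targets fall in its grid column:

import numpy as np

def partition_2d(src, dst, num_vertices, p):
    """Assign each edge (src, dst) to a cell of a p x p partition grid.

    The row index comes from the source vertex and the column index from the
    target vertex, so grid row i holds the out-edges of the i-th vertex block
    and grid column j holds the in-edges of the j-th vertex block.
    """
    block = (num_vertices + p - 1) // p          # vertices per block (last block may be short)
    rows = src // block                          # grid row of each edge
    cols = dst // block                          # grid column of each edge
    grid = [[[] for _ in range(p)] for _ in range(p)]
    for e, (i, j) in enumerate(zip(rows, cols)):
        grid[i][j].append(e)                     # edge ids owned by GPU (i, j)
    return grid

src = np.array([0, 0, 3, 5, 7])
dst = np.array([1, 4, 6, 2, 0])
grid = partition_2d(src, dst, num_vertices=8, p=2)
print([[len(cell) for cell in row] for row in grid])   # [[1, 2], [2, 0]] edges per GPU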
Scale 25 Traversal
•  The work per iteration spans multiple orders of magnitude.
[Chart: frontier size (log scale, 1 to 10,000,000) vs. BFS iteration (0-6) for a scale-25 graph.]
Distributed BFS Algorithm

 1: procedure BFS(Root, Predecessor)
 2:   In^0_ij <- LocalVertex(Root)                          ▹ Starting vertex
 3:   for t <- 0 do
 4:     Expand(In^t_i, Out^t_ij)                            ▹ Data-parallel 1-hop expand on all GPUs
 5:     LocalFrontier_t <- Count(Out^t_ij)                   ▹ Local frontier size
 6:     GlobalFrontier_t <- Reduce(LocalFrontier_t)          ▹ Global frontier size
 7:     if GlobalFrontier_t > 0 then                         ▹ Termination check
 8:       Contract(Out^t_ij, Out^t_j, In^{t+1}_i, Assign_ij) ▹ Global frontier contraction ("wave")
 9:       UpdateLevels(Out^t_j, t, level)                    ▹ Record level of vertices in the In/Out frontier
10:     else
11:       UpdatePreds(Assign_ij, Preds_ij, level)            ▹ Compute predecessors from local In/Out levels (no communication)
12:       break                                              ▹ Done
13:     end if
14:     t++                                                  ▹ Next iteration
15:   end for
16: end procedure

Figure 3. Distributed Breadth First Search algorithm
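For readers who prefer runnable code to pseudocode, here is a hedged single-process Python sketch of the driver loop in Figure 3. Plain Python data structures stand in for the GPUs and MPI collectives, so Expand, Reduce, and Contract collapse into local operations; it shows the control flow only, not the distributed implementation.

def distributed_bfs(edges, num_vertices, root):
    adj = {v: [] for v in range(num_vertices)}
    for s, d in edges:
        adj[s].append(d)

    level = [-1] * num_vertices
    level[root] = 0
    in_frontier = {root}                       # In^0: the starting vertex
    t = 0
    while True:
        # Expand: data-parallel 1-hop expansion (one kernel per GPU in MapGraph)
        out_frontier = {d for v in in_frontier for d in adj[v] if level[d] == -1}
        # Reduce: global frontier size (MPI_Allreduce in the real implementation)
        global_frontier = len(out_frontier)
        if global_frontier > 0:                # termination check
            # Contract + UpdateLevels: record the level of newly discovered vertices
            for v in out_frontier:
                level[v] = t + 1
            in_frontier = out_frontier         # becomes In^{t+1}
        else:
            break                              # UpdatePreds would run here, GPU-locally
        t += 1
    return level

print(distributed_bfs([(0, 1), (0, 2), (1, 3)], 4, 0))   # [0, 1, 1, 2]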
Distributed BFS Algorithm (continued)

(Figure 3, repeated from the previous slide.)

•  Key differences
  –  log(p) parallel scan (vs. a sequential wave)
  –  GPU-local computation of predecessors
  –  1 partition per GPU
  –  GPUDirect (vs. RDMA)
Expand

 1: procedure EXPAND(In^t_i, Out^t_ij)
 2:   L_in <- convert(In^t_i)
 3:   L_out <- ∅
 4:   for all v ∈ L_in in parallel do
 5:     for i <- RowOff[v], RowOff[v + 1] do
 6:       c <- ColIdx[i]
 7:       L_out <- L_out ∪ c
 8:     end for
 9:   end for
10:   Out^t_ij <- convert(L_out)
11: end procedure

Figure 4. Expand algorithm

•  The GPU implementation uses multiple strategies to handle data-dependent parallelism. See our SIGMOD 2014 paper for details. A simple sketch follows.
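As an illustration of what Expand does per partition, here is a minimal NumPy sketch (our own, not MapGraph code) that gathers the neighbors of every frontier vertex from a CSR partition. The GPU version performs the same gather with load-balanced CUDA kernels.

import numpy as np

def expand(row_off, col_idx, in_frontier, num_targets):
    """Gather all targets of edges whose source is in the input frontier.

    row_off/col_idx are the CSR arrays of one partition; in_frontier is a
    boolean mask over the partition's source vertices. Returns a boolean
    mask over target vertices (the unfiltered output frontier Out^t_ij).
    """
    out = np.zeros(num_targets, dtype=bool)
    for v in np.flatnonzero(in_frontier):          # parallel over v on the GPU
        out[col_idx[row_off[v]:row_off[v + 1]]] = True
    return out

row_off = np.array([0, 2, 3, 3, 3])
col_idx = np.array([1, 2, 3])
mask = np.array([True, False, False, False])
print(np.flatnonzero(expand(row_off, col_idx, mask, 4)))   # [1 2]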
Global Frontier Contraction and Communication

 1: procedure CONTRACT(Out^t_ij, Out^t_j, In^{t+1}_i, Assign_ij)
 2:   Prefix^t_ij <- ExclusiveScan_j(Out^t_ij)                ▹ Distributed prefix sum. Prefix is the vertices discovered by your left neighbors (log p)
 3:   Assigned_ij <- Assigned_ij ∪ (Out^t_ij - Prefix^t_ij)   ▹ Vertices first discovered by this GPU
 4:   if i = p then
 5:     Out^t_j <- Out^t_ij ∪ Prefix^t_ij                     ▹ Right-most column has the global frontier for the row
 6:   end if
 7:   Broadcast(Out^t_j, p, ROW)                              ▹ Broadcast frontier over the row
 8:   if i = j then
 9:     In^{t+1}_i <- Out^t_j
10:   end if
11:   Broadcast(In^{t+1}_i, i, COL)                           ▹ Broadcast frontier over the column
12: end procedure

Figure 5. Global Frontier contraction and communication algorithm
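A hedged NumPy sketch of the contraction step, with one grid row of p partitions held in a single process so that ExclusiveScan and the broadcasts become local array operations. The scan here runs sequentially for clarity; the real code performs it as a log(p)-step parallel scan over MPI using bitmaps.

import numpy as np

def contract_row(out_row, assigned_row):
    """Contract one grid row. out_row[i] and assigned_row[i] are boolean
    frontier bitmaps for the i-th GPU in the row (i = 0..p-1)."""
    p = len(out_row)
    # Exclusive bitwise-OR scan across the row: prefix[i] = OR of out_row[0..i-1],
    # i.e. the vertices already discovered by the left neighbors.
    prefix = [np.zeros_like(out_row[0])]
    for i in range(1, p):
        prefix.append(prefix[i - 1] | out_row[i - 1])
    # Vertices first discovered by GPU i (used later for predecessor computation).
    for i in range(p):
        assigned_row[i] |= out_row[i] & ~prefix[i]
    # The right-most GPU now holds the whole row's frontier; it is then broadcast over the row.
    row_frontier = out_row[p - 1] | prefix[p - 1]
    return row_frontier, assigned_row

out_row = [np.array([1, 0, 0, 0], bool), np.array([1, 1, 0, 0], bool), np.array([0, 0, 1, 0], bool)]
assigned = [np.zeros(4, bool) for _ in range(3)]
row_frontier, assigned = contract_row(out_row, assigned)
print(row_frontier.astype(int))                     # [1 1 1 0]
print([a.astype(int).tolist() for a in assigned])   # GPU 0 owns v0, GPU 1 owns v1, GPU 2 owns v2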
Global Frontier Contraction and Communication (illustrated)

[Figure, built up over several slides: the 4x4 partition grid G11..G44 over source and target vertices, with row/column frontier vectors In^t and Out^t, stepping through the exclusive scan, the row broadcast, and the column broadcast of Figure 5.]

•  We use a parallel scan that minimizes communication steps (vs. work); see the sketch after this slide.
•  This improves the overall scaling efficiency by 30%.
•  At the end of the contraction, all GPUs have the new frontier.
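To make the "minimizes communication steps" point concrete, here is a small illustrative comparison (ours, with made-up bitmap data) between a sequential wave, which needs p - 1 communication steps across a row, and a Hillis-Steele-style inclusive OR-scan, which needs only ceil(log2 p) steps. Figure 5 uses an exclusive scan, but the step-count argument is the same; the exact MPI schedule in MapGraph may differ.

import math
import numpy as np

def wave_or_scan(bitmaps):
    """Sequential wave: each partition waits for its left neighbor (p - 1 steps)."""
    acc, steps = [bitmaps[0].copy()], len(bitmaps) - 1
    for b in bitmaps[1:]:
        acc.append(acc[-1] | b)
    return acc, steps

def hillis_steele_or_scan(bitmaps):
    """Inclusive OR-scan in log2(p) rounds; in round r, partition i combines
    with partition i - 2^r, so all partitions work in parallel each round."""
    acc = [b.copy() for b in bitmaps]
    p, steps, stride = len(bitmaps), 0, 1
    while stride < p:
        acc = [acc[i] | acc[i - stride] if i >= stride else acc[i] for i in range(p)]
        stride *= 2
        steps += 1
    return acc, steps

bitmaps = [np.array(b, bool) for b in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1])]
(wave, s1), (scan, s2) = wave_or_scan(bitmaps), hillis_steele_or_scan(bitmaps)
assert all((a == b).all() for a, b in zip(wave, scan))   # same result either way
print(s1, s2, math.ceil(math.log2(len(bitmaps))))        # 3 steps vs. 2 steps for p = 4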
Update Levels

 1: procedure UPDATELEVELS(Out^t_j, t, level)
 2:   for all v ∈ Out^t_j in parallel do
 3:     level[v] <- t
 4:   end for
 5: end procedure

Figure 6. Update levels

•  We store both the In and Out levels.
•  This allows us to compute the predecessors in a GPU-local manner.
Predecessor Computation

 1: procedure UPDATEPREDS(Assigned_ij, Preds_ij, level)
 2:   for all v ∈ Assigned_ij in parallel do
 3:     Pred[v] <- -1
 4:     for i <- ColOff[v], ColOff[v + 1] do
 5:       if level[v] == level[RowIdx[i]] + 1 then
 6:         Pred[v] <- RowIdx[i]
 7:       end if
 8:     end for
 9:   end for
10: end procedure

Figure 7. Predecessor update

•  Predecessors are computed after the traversal is complete, using node-local In/Out levels (see the sketch below).
•  No inter-node communication is required.
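A minimal Python sketch (ours) of the same idea: once every vertex has a BFS level, a predecessor for v is any in-neighbor whose level is exactly level[v] - 1, so the search can run per GPU with no communication. The CSC arrays col_off/row_idx stand in for a partition's in-edges.

import numpy as np

def update_preds(col_off, row_idx, level, assigned):
    """For each assigned vertex, pick an in-neighbor one level closer to the root."""
    pred = np.full(len(level), -1, dtype=np.int64)
    for v in np.flatnonzero(assigned):                 # parallel over v on the GPU
        for e in range(col_off[v], col_off[v + 1]):    # scan v's in-edges
            u = row_idx[e]
            if level[v] == level[u] + 1:               # u is one BFS level above v
                pred[v] = u
    return pred

# In-edges (CSC): v1 <- v0, v2 <- v0, v3 <- v1; levels from a BFS rooted at v0.
col_off = np.array([0, 0, 1, 2, 3])
row_idx = np.array([0, 0, 1])
level = np.array([0, 1, 1, 2])
print(update_preds(col_off, row_idx, level, np.array([False, True, True, True])))  # [-1 0 0 1]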
Weak Scaling
•  Scaling the problem size with more GPUs
  –  Fixed problem size per GPU
•  Maximum scale 27 (4.3B edges)
  –  64 K20 GPUs => 0.147 s => 29 GTEPS
  –  64 K40 GPUs => 0.135 s => 32 GTEPS
•  The K40 has a faster memory bus.

GPUs | Scale | Vertices    | Edges         | Time (s) | GTEPS
1    | 21    | 2,097,152   | 67,108,864    | 0.0254   | 2.5
4    | 23    | 8,388,608   | 268,435,456   | 0.0429   | 6.3
16   | 25    | 33,554,432  | 1,073,741,824 | 0.0715   | 15.0
64   | 27    | 134,217,728 | 4,294,967,296 | 0.1478   | 29.1

[Chart: GTEPS (0-35) vs. number of GPUs (1, 4, 16, 64) for the weak-scaling runs.]
Central Iteration Costs (Weak Scaling)
•  Communication costs are not constant.
  –  The 2D design implies the cost grows as 2 log(2p) / log(p); see the worked numbers below.
•  How to scale?
  –  Overlapping
  –  Compression (graph and message)
  –  Hybrid partitioning
  –  Heterogeneous computing
[Chart: per-iteration computation, communication, and termination time (0 to 0.012 s) vs. number of GPUs (4, 16, 64).]
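For reference, a quick evaluation of the growth factor quoted above for a few row/column sizes p of the p x p GPU grid. This is just arithmetic on the slide's expression; its derivation is in the paper and is not reproduced here.

import math

# Evaluate the communication growth factor 2*log(2p)/log(p) quoted on the slide.
for p in (2, 4, 8, 16):
    factor = 2 * math.log(2 * p) / math.log(p)
    print(f"p={p:2d}  gpus={p * p:3d}  2*log(2p)/log(p) = {factor:.2f}")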
Costs During BFS Traversal
•  The chart shows the per-GPU costs in each iteration (64 GPUs).
•  Wave time is essentially constant, as expected.
•  Compute time peaks during the central iterations.
•  Costs are reasonably well balanced across all GPUs after the 2nd iteration.
Strong Scaling
•  Speedup on a constant problem size with more GPUs
•  Problem scale 25
  –  2^25 vertices (33,554,432)
  –  2^30 directed edges (1,073,741,824)
•  Strong scaling efficiency of 48%
  –  Versus 44% for BG/Q

GPUs | GTEPS | Time (s)
16   | 15.2  | 0.071
25   | 18.2  | 0.059
36   | 20.5  | 0.053
49   | 21.8  | 0.049
64   | 22.7  | 0.047
Directions to Improve Scaling
•  Overlap computation with communication
  –  Multiple partitions per GPU
  –  Frontier compression
•  Hybrid partitioning
  –  Degree-aware data layout + bottom-up search optimization
    •  This also requires asynchronous communication and per-target-node buffers.
  –  Graph-aware partitioning plus 2D data layout
•  Uintah-style data warehouse
  –  Hand off tasks to workers (Uintah)
  –  Hybrid CPU/GPU computation strategies (TOTEM)
Concept: Accelerate Key-Value Stores
•  Baseline: Accumulo + MapReduce, 1,200 nodes, 1T edges
  –  1 BFS step @ 270 MTEPS => 2 hours
•  Assuming a 10,000x speedup => 0.72 s/step
MapGraph Timings (single GPU)
•  Orkut social network: 2.3M vertices, 92M edges. Most of the time is *load* on the CPU.
•  Eliminating that overhead takes the next step from 62,500 ms to 63 ms (1,000x faster).
•  Load graph (CPU): parse 47,000 ms; build CSR 15,000 ms; total 62,000 ms. This cost can be eliminated.
•  Build CSC from COO: 244 ms; transfer to GPU: 232 ms.
•  BFS on the GPU: 63 ms; 45M traversed edges (TE); 725M TEPS.
•  Once the graph is on the GPU, it is very cheap to run other algorithms, different starting vertices, etc.
Current and Future Code Streams

Current activity (implemented)
•  Multi-GPU MapGraph: 32 GTEPS on 64 GPUs, 4.3-billion-edge graph
•  MapGraph + Java: single GPU, JNI interface, schema-flexible data model (PG/RDF)

Next steps
•  Overlap computation with communication for better scaling
  –  Multiple partitions per GPU
•  Updates pushed to the GPU in the background
  –  No ETL latency impact
  –  Full GPU acceleration

Themes: Extreme Performance, Horizontal Scaling
MapGraph Beta Customer?
Contact Bradley Bebee, [email protected]
http://BlazeGraph.com
http://MapGraph.io

SYSTAP™, LLC. © 2006-2015 All Rights Reserved.