Hybrid Breadth-First Search
on a Single-Chip FPGA-CPU
Heterogeneous Platform
Yaman Umuroglu, Donn Morrison, Magnus Jahre
FPL 2015, London
Norwegian University of Science and Technology
Executive Summary
● Breadth-first search (BFS) on large, sparse, small-world graphs
● Trading redundant computation for DRAM bandwidth
● Hybrid execution across CPU and FPGA
● ~8x better than CPU-only, ~2x better than FPGA-only
What, that BFS?
● Classical «textbook» graph algorithm
● Backbone for other graph algorithms
  – counting connected components
  – shortest path
  – minimum spanning tree
  – ...
● In this work:
  – compute #hops from root on unweighted graphs (dist[] array)
Why accelerate BFS?
● Exciting new frontiers with huge sparse graphs
  – social network analysis
  – simulating spread of disease
  – electronic design automation
  – protein-protein interactions
● Hard to make efficient, hard to parallelize / accelerate
  – Input-dependent, irregular memory accesses
  – Lots of data for a little bit of computation, variable parallelism
  – Performance bounded by DRAM bandwidth
Why accelerate BFS with FPGAs?
● Can customize memory system
  – To effectively deal with irregular accesses
● BFS performance limited by DRAM bandwidth
  – Low clock speed but high parallelism is OK
● Runtime adaptability
  – No single «silver bullet» for all graphs
  – Our work: CPU-FPGA
Large, Sparse, Small-World Graphs
● We consider large, sparse, small-world graphs
● In a sparse graph, the average degree is small
  – i.e. every node is connected to «a few» other nodes
  – Large real-world graphs are usually sparse
  – Typically stored as a sparse adjacency matrix
● In a small-world graph, the diameter is small
  – i.e. few hops between any two nodes in the graph
  – e.g. «six degrees of separation»
  – Especially common in social graphs
  – Gives a peculiar BFS frontier profile...
Frontier profile in small-world BFS
● big frontier, lots of parallelism => use highly parallel engine (FPGA)
● small frontier, little parallelism => use (single) CPU core
● Previously used for CPU-GPGPU systems (Hong et al., PACT 2011)
BFS as Matrix-Vector Multiplication
● y (result vector) = A (adjacency matrix) «times» x (input vector)
  – Replace multiply with AND, add with OR
  – All numbers are a single bit
● Use result vector as input vector to next step
● Repeat until convergence (input == output)
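The matrix-vector view of a BFS step can be sketched in a few lines of Python. This is an illustrative sketch only, not the accelerator implementation; `adj` is a hypothetical adjacency-list input where `adj[i]` lists the neighbors of node `i`:

```python
# One BFS "step" is y = A·x with AND for multiply and OR for add,
# where x marks the nodes reached so far. We keep the old 1's
# (y starts as a copy of x) and repeat until input == output.

def bfs_matvec(adj, root):
    n = len(adj)
    x = [False] * n          # input vector: nodes reached so far
    x[root] = True
    while True:
        y = x[:]             # carry previously reached nodes forward
        for i in range(n):
            if x[i]:                 # column i contributes only if x[i] == 1
                for j in adj[i]:     # A[j][i] == 1 for each neighbor j of i
                    y[j] = True      # OR into the result vector
        if y == x:           # convergence: input == output
            return x
        x = y
```

Note how 0's in x contribute nothing (0 AND anything = 0), which is exactly the observation the next slides exploit.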
Avoiding Redundant Work
● Where can 1's in the result vector come from?
  – From 1's in the input vector (because 0 AND x = 0)
  – So we can consider only the 1's in the input vector
● Where can new 1's in the result vector come from?
  – Only from 1's generated in the previous step
● We can treat the input vector as sparse
  – Sparse matrix-sparse vector multiplication
  – Avoids redundant computation
● This is what traditional BFS does already!
  – Frontier = {indices of 1's generated in the previous step}
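The sparse-x view is exactly the textbook frontier-queue BFS. A minimal Python sketch (a hypothetical helper, not the paper's code) makes the correspondence concrete:

```python
from collections import deque

# Frontier-based BFS: only the 1's generated in the previous step
# (the frontier) are expanded, so each edge is examined at most
# once -- the sparse matrix-sparse vector view of the same search.

def bfs_frontier(adj, root):
    dist = [-1] * len(adj)   # -1 marks unvisited
    dist[root] = 0
    frontier = deque([root])
    while frontier:
        i = frontier.popleft()
        for j in adj[i]:
            if dist[j] == -1:            # a new 1 in the result vector
                dist[j] = dist[i] + 1
                frontier.append(j)       # becomes part of the next frontier
    return dist
```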
DRAM Bandwidth
● Want to sustain peak DRAM bandwidth
● Peak DRAM bandwidth requires:
  – High request rate
  – Bursts
  – «Friendly» access pattern (e.g. sequential)
● How does BFS with sparse x behave?
Memory Requests for Sparse x
- get node status
- (if visited) get neighbor pointers
- get neighbors
Memory Requests for Sparse x
● Request rate?
  – response-dependent
● Bursts?
  – limited by avg. degree
● Access pattern?
  – sequential with jumps
● Avoiding redundant work may result in lower bandwidth utilization!
Embracing Redundant Work?
● What if we treat the input vector as dense?
  – Still gives correct result
  – Lots of redundant work if input vector is actually sparse
  – But the FPGA is going to work on big frontiers anyway
● Now we can just stream the entire matrix!
  – Simple, sequential, large bursts
● Trade-off: more bandwidth, but more redundant work
  – Does it pay off?
Processing Element Architecture
[Diagram: backend reads data, frontend performs the computation]
● sparse x: no input vector (implicit 1)
● dense x: 3 streaming DMA channels
Processing Element Architecture
●
write 1's to visited locations in result vector
●
Need stall-free, fine-grained writes to random addresses
●
We can keep entire result vector in BRAM
–
Need only 1 bit per element, even ZedBoard can do 2M nodes!
–
Dual port: can do 2 edge traversals per cycle per PE
Norwegian University of Science and Technology
30
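For scale: at 1 bit per element, 2M nodes need 2M bits = 250 KB, which fits comfortably in the Z7020's 560 KB of BRAM. A software bitmap sketch of the same idea (illustrative only; the real structure is a dual-ported BRAM, not a Python object):

```python
# Bit-per-node result vector: packed so that n nodes take
# ceil(n/8) bytes, supporting fine-grained set/get at random
# addresses -- the software analogue of the BRAM result vector.

class BitVector:
    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)   # 1 bit per node

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)     # mark node i visited

    def get(self, i):
        return (self.bits[i >> 3] >> (i & 7)) & 1
```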
Experimental Setup
● Platform: ZedBoard
  – Zynq Z7020 with 560 KB BRAM, bare-metal
  – AXI HP sustained DRAM bandwidth: ~3.2 GB/s
  – Up to 4 accelerators at 150 MHz (DRAM BW limited)
● HW design: Chisel + Vivado IP Integrator
● Workload: synthetically generated RMAT graphs
  – Used in the Graph500 rankings
  – 6 to 8 BFS levels
  – Row-wise partitioned among PEs
CPU, Sparse and Dense
● Graph: half a million nodes, ~16 million edges
● Want to pick the best-performing method for each step
● Fastest method varies with step (& frontier size)
  – CPU best for first and last steps
  – Dense x best for the middle steps
Sparse vs Dense
[Plot: total read bandwidth utilization (%) vs. accelerator PE count (1 to 5), for sparse x at steps 3 and 4 and for dense x (average)]
● More PEs = more pressure
● Dense: up to ~80% of peak
● Sparse: up to ~20% of peak
● Redundant work?
  – BW gain > redundant work!
Putting It All Together
● Start execution on CPU
  – Bitmap optimization
● Switch to FPGA (dense x)
  – When: use performance model
  – Dense x: «flat», predictable perf.
● Switch back to CPU
  – When: frontier size < T
  – T = 5% of graph works well in practice
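The switching policy can be sketched as follows. All names here are hypothetical placeholders (`model_says_fpga` stands in for the paper's performance model), and the function only decides which engine runs each step:

```python
# Hybrid schedule: start on the CPU, switch to the dense-x FPGA
# engine when the performance model predicts it wins, switch back
# once the frontier shrinks below T = 5% of the nodes.

def hybrid_bfs_schedule(num_nodes, frontier_sizes, model_says_fpga):
    """Return which engine runs each BFS step, given the frontier
    size observed before that step."""
    T = 0.05 * num_nodes          # switch-back threshold
    schedule = []
    on_fpga = False
    for size in frontier_sizes:
        if not on_fpga and model_says_fpga(size):
            on_fpga = True        # big frontier coming: go parallel
        elif on_fpga and size < T:
            on_fpga = False       # tail of the search: back to CPU
        schedule.append("fpga-dense" if on_fpga else "cpu")
    return schedule
```

On a typical small-world frontier profile this yields CPU steps at the start and end, with the dense FPGA engine covering the big middle steps.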
Performance & Efficiency
● Metric: MTEPS
  – million traversed edges per second
● CPU-only: ~22 MTEPS
● FPGA-only: ~80 MTEPS
● Hybrid: ~170 MTEPS
  – ~2x over FPGA-only, ~7.8x over CPU-only
  – ~2x traversals per unit of bandwidth
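MTEPS is simply edges traversed divided by runtime. A trivial helper for completeness (the numbers in the example are made up for illustration, not measurements from the paper):

```python
# MTEPS: million traversed edges per second, the standard BFS
# throughput metric (as used in the Graph500 rankings).

def mteps(edges_traversed, seconds):
    return edges_traversed / seconds / 1e6
```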
Conclusion
● Representation can reveal (or hide) new options
  – Keeping the random-access component small & on-chip
  – Dense x trades redundant work for more bandwidth
● Adapting to available parallelism pays off
  – Hybrid: ~8x over CPU-only, ~2x over FPGA-only
● Source code on GitHub (see paper)
Thank you for listening!
Questions?
Distance Generation
● «Visited» not enough, we want the distance to root
  – i.e. number of hops to reach each element from root
● After each step, compare input and result vectors
  – Elements that went from 0 to 1 have distance = <step number>
● Low overhead (easily parallelizable)
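A sketch of the comparison step described above (illustrative Python; the real implementation is parallelized, and the helper name is hypothetical):

```python
# After BFS step `step`, any element that flipped from 0 to 1
# between the input and result vectors was reached for the first
# time this step, so its distance from the root is `step`.

def update_distances(prev_x, new_x, dist, step):
    for i, (was, now) in enumerate(zip(prev_x, new_x)):
        if now and not was:      # newly visited this step
            dist[i] = step
    return dist
```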
Execution Time Breakdown
● FPGA performance scales well with more PEs
● Switching overheads: ~10%
● «Amdahl's Law»: the CPU eventually becomes the bottleneck