Enterprise
Breadth-First Graph Traversal on GPUs
Hang Liu
H. Howie Huang
November 19th, 2015
Graphs are Ubiquitous
Breadth-First Search (BFS) is Important
❖ Wide Range of Applications
   ❖ Single Source Shortest Path (SSSP)
   ❖ Connectivity Detection
   ❖ Distance Oracle
   ❖ Reachability Problem
   ❖ Centrality Problems, e.g., Betweenness & Closeness Centralities
Graphics Processing Unit (GPU)
[Figure: GPU architecture: n SMXs, each with instruction schedulers, a register file, cores, and shared memory (L1 cache), connected through an interconnect to the L2 cache and global memory]
❖ Memory Hierarchy:
   ❖ L1 cache (KB), ~20 cycles
   ❖ L2 cache (MB), ~100 cycles
   ❖ Global memory (GB), ~600 cycles
❖ Thread Granularity:
   ❖ Thread -> Warp (32 threads) -> Block (~256 threads) -> Grid
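To make the thread granularity concrete, here is a minimal CUDA sketch (not taken from the slides); the kernel name touch_all, the array data, and the launch sizes are illustrative assumptions.

#include <cuda_runtime.h>

// Each block (CTA) holds ~256 threads, i.e., 8 warps of 32; the grid holds many blocks.
__global__ void touch_all(int *data, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index in the grid
    int lane = threadIdx.x & 31;                       // position inside the 32-thread warp
    if (tid < n) data[tid] += lane;                    // one element per thread
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));
    int threads = 256;                                 // threads per block (CTA)
    int blocks  = (n + threads - 1) / threads;         // blocks per grid
    touch_all<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}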
Enterprise Innovations
❖ Streamlined GPU Thread Scheduling
❖ GPU Workload Balancing
❖ Hub Vertex Based Optimization
❖ Ranked No. 46 in Graph500 with two GPUs
❖ No. 1 in the Green Graph500 Small Data Category for the last two years
Top-Down BFS
[Figure: Top-down BFS example on a 10-vertex graph.
 Status Array (SA): 0 F U U F U U U U U for vertex IDs 0-9; Next Status Array (NSA): 0 1 F U 1 U U F U U
 Frontier Queue (FQ): {1, 4}; Next Frontier Queue (NFQ): {2, 7}, built with atomic operations]
The Status Array (SA) approach needs to assign threads to non-frontier vertices, while the Frontier Queue (FQ) approach requires atomic operations.
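As a concrete reference, here is a minimal top-down expansion kernel in CUDA, assuming a CSR graph (row_offsets, col_indices), a status array sa where UNVISITED is a sentinel, and an atomically grown next frontier queue; the names and layout are an illustrative sketch, not the exact Enterprise code.

#define UNVISITED (-1)

// One thread per frontier vertex: expand its neighbors and append newly
// visited vertices to the next frontier queue via an atomic counter.
__global__ void td_expand(const int *row_offsets, const int *col_indices,
                          int *sa, const int *fq, int fq_size,
                          int *nfq, int *nfq_size, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= fq_size) return;
    int v = fq[i];                                        // a frontier vertex
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        if (sa[u] == UNVISITED &&
            atomicCAS(&sa[u], UNVISITED, level + 1) == UNVISITED) {
            int pos = atomicAdd(nfq_size, 1);             // the atomic operation noted above
            nfq[pos] = u;
        }
    }
}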
Challenge #1: Putting GPU Threads to Good Use
[Figure: Frontier percentage (%) per BFS level for FB, FR, HW, KR0-KR4, LJ, OR, PK, RM, TW, and WK]
❖ Average frontier ratio per level for all graphs is very low: ~9%
❖ Need a more efficient way to use GPU threads
Bottom-Up BFS
[Figure: Bottom-up BFS example at level 3 on the same 10-vertex graph.
 Status Array (SA): 0 1 2 F 1 F F 2 F F for vertex IDs 0-9; after the step, NSA: 0 1 2 3 1 3 F 2 3 F
 Each unvisited vertex checks its neighbors and terminates early once a parent in the frontier is found]
❖ Moves from unvisited to visited vertices
❖ Reduces workload through early termination (sketched below)
❖ The level at which to switch direction is decided heuristically
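A minimal bottom-up step in the same sketch style, reusing the CSR arrays and the UNVISITED sentinel from the top-down sketch; each still-unvisited vertex scans its neighbors and stops at the first one found in the current frontier.

// One thread per vertex: an unvisited vertex joins the next level as soon as
// it finds any neighbor that belongs to the current frontier.
__global__ void bu_step(const int *row_offsets, const int *col_indices,
                        int *sa, int num_vertices, int level) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || sa[v] != UNVISITED) return;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        if (sa[col_indices[e]] == level) {   // neighbor is in the current frontier
            sa[v] = level + 1;               // visit v
            break;                           // early termination: skip remaining neighbors
        }
    }
}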
Technique #1: Streamlined GPU Thread Scheduling
[Figure: Enterprise combines the Frontier Queue (which requires atomic operations) with the Status Array by compacting the SA into a frontier queue]
Streamlined GPU Thread Scheduling – Top-Down Workflow
[Figure: Top-down workflow. Threads (e.g., Thread 0 and Thread 1) scan disjoint portions of the SA (0 F U U F U U U U U for vertices 0-9), write the frontiers they find into private thread bins, and the bins are combined into the NFQ ({4, 1}) for the follow-up traversal of the next level]
Frontier Order Matters!
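The atomic-free queue generation sketched above could look roughly like the following, under these assumptions: each thread scans a contiguous chunk of the SA into a fixed-size private bin, the per-thread counts are prefix-summed with a library scan (e.g., thrust::exclusive_scan) into bin_offsets, both kernels use the same launch configuration, and bins, bin_counts, and bin_offsets are sized to the total thread count. BIN_CAP, the chunk sizing, and the kernel names are illustrative.

#define BIN_CAP 64   // assumed per-thread bin capacity

// Phase 1: thread t scans SA[t*chunk, (t+1)*chunk) and records the frontiers
// of the current level in its private bin (no atomics, order preserved per thread).
__global__ void scan_sa_into_bins(const int *sa, int num_vertices, int level,
                                  int chunk, int *bins, int *bin_counts) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;
    int end   = min(begin + chunk, num_vertices);
    int count = 0;
    for (int v = begin; v < end; ++v)
        if (sa[v] == level && count < BIN_CAP)
            bins[tid * BIN_CAP + count++] = v;
    bin_counts[tid] = count;
}

// Phase 2: after an exclusive prefix sum of bin_counts into bin_offsets,
// copy each bin to its slot in the NFQ, still without atomic operations.
__global__ void bins_to_nfq(const int *bins, const int *bin_counts,
                            const int *bin_offsets, int *nfq) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = bin_offsets[tid];
    for (int i = 0; i < bin_counts[tid]; ++i)
        nfq[base + i] = bins[tid * BIN_CAP + i];
}

Because each thread writes its bin in scan order, the resulting NFQ order is deterministic, which is one way to read the "Frontier Order Matters!" note on the slide.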
Streamlined GPU Thread Scheduling – Direction-Switching Workflow
[Figure: Direction-switching workflow. Threads (e.g., Thread 0 and Thread 1) scan disjoint portions of the SA (0 1 2 F 1 F F 2 F F for vertices 0-9), collect the unvisited vertices into thread bins, and the bins are combined into the NFQ ({3, 5, 6, 8, 9}) for the follow-up traversal]
Challenge #2: Balancing Workload Between GPU Threads
[Figure: Out-degree (log10) vs. percentile of vertices for Gowalla and Orkut, with reference lines at out-degree = 32 and out-degree = 256]
❖ Different graphs have different out-degree distributions
   ❖ Gowalla: 87% of vertices have fewer than 32 edges, and 99.5% have fewer than 256 edges
❖ Not every frontier is created equal.
Technique #2: GPU Workload Balancing
[Figure: Frontier queue generation and classification from the Status Array]
   Out-degree < 32            -> SmallQueue   -> 1 thread per frontier
   Out-degree in (32, 256)    -> MiddleQueue  -> 1 warp (32 threads) per frontier
   Out-degree in (256, 65536) -> LargeQueue   -> 1 CTA (256 threads) per frontier
   Out-degree > 65536         -> ExtremeQueue -> entire grid (65,536 threads)
Two steps (see the sketch below):
❖ Classify frontiers while generating the FQ from the SA
❖ Assign a different number of threads to different frontiers
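A sketch of the two steps under simplifying assumptions: classification here uses atomic counters for brevity (the bin-based generation above could avoid them), the out-degree thresholds follow the slide, and only the warp-granularity expansion kernel is shown as an example; all names are illustrative.

// Step 1: classify each frontier by out-degree into one of four queues.
__global__ void classify_frontiers(const int *sa, const int *row_offsets,
                                   int num_vertices, int level,
                                   int *small_q,   int *small_n,
                                   int *middle_q,  int *middle_n,
                                   int *large_q,   int *large_n,
                                   int *extreme_q, int *extreme_n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || sa[v] != level) return;
    int deg = row_offsets[v + 1] - row_offsets[v];
    if      (deg < 32)    small_q[atomicAdd(small_n, 1)]     = v;  // 1 thread each
    else if (deg < 256)   middle_q[atomicAdd(middle_n, 1)]   = v;  // 1 warp each
    else if (deg < 65536) large_q[atomicAdd(large_n, 1)]     = v;  // 1 CTA each
    else                  extreme_q[atomicAdd(extreme_n, 1)] = v;  // whole grid
}

// Step 2 (warp granularity shown): the 32 lanes of a warp cooperatively
// expand the neighbor list of one middle-queue frontier.
__global__ void expand_middle(const int *row_offsets, const int *col_indices,
                              int *sa, const int *middle_q, int middle_n, int level) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x & 31;
    if (warp_id >= middle_n) return;
    int v = middle_q[warp_id];
    for (int e = row_offsets[v] + lane; e < row_offsets[v + 1]; e += 32) {
        int u = col_indices[e];
        if (sa[u] == UNVISITED) sa[u] = level + 1;  // benign race: all writers store the same value
    }
}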
Facebook Execution Timeline
[Figure: BFS execution timeline on Facebook. (a) Status array only: CTA kernel, 490 ms. (b) Streamlined GPU thread scheduling: 419 ms. (c) GPU workload balancing: FQ generation plus Thread, Warp, and CTA kernels, 337 ms.]
Challenge #3: Making Bottom-Up BFS GPU-Aware
❖ Bottom-up BFS:
   ❖ The direction-switching level is decided heuristically (one possible rule is sketched below), and
   ❖ A large portion of the status array is accessed
❖ CPUs have a large LLC (e.g., 35 MB on a Xeon E5)
❖ GPUs have small caches and shared memory (64 KB per SMX), but
   ❖ The shared memory is manually controllable
❖ We have developed a graph-aware, software-controlled caching strategy
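The slides do not spell out the switching rule, so here is one common host-side heuristic as an illustration (an assumption, not necessarily the rule Enterprise uses): go bottom-up once the frontier exceeds a fraction of the vertices, and return to top-down when it shrinks again; alpha and beta are made-up thresholds.

// Host-side sketch: choose the traversal direction for the next BFS level.
enum Direction { TOP_DOWN, BOTTOM_UP };

Direction choose_direction(long long frontier_size, long long num_vertices,
                           Direction current) {
    const double alpha = 0.05;  // assumed: switch to bottom-up above 5% of the vertices
    const double beta  = 0.01;  // assumed: switch back to top-down below 1%
    double ratio = (double)frontier_size / (double)num_vertices;
    if (current == TOP_DOWN  && ratio > alpha) return BOTTOM_UP;
    if (current == BOTTOM_UP && ratio < beta)  return TOP_DOWN;
    return current;
}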
Challenge #3: Making Bottom-up BFS GPU-aware (Cont'd)
[Figure: CDF of total edges vs. percentile of vertices for YouTube, Kron-24-32, and Wiki-Talk, zoomed in on the very top of the vertex ranking]
❖ A small number of hub vertices holds a considerable share of the edges
   ❖ YouTube: 300 (0.03%) vertices -> 10% of the edges
   ❖ Kron-24-32: 770 (0.005%) vertices -> 10% of the edges
   ❖ Wiki-Talk: 96 (0.004%) vertices -> 20% of the edges
❖ Hub vertices are extremely important in bottom-up BFS
Technique #3: Graph-Aware, Software Controlled GPU Cache
[Figure: Hub cache example. The HubCache in shared memory holds the vertex IDs of just-visited hub vertices (here 2 and 7). While the vertices in FQ = {3, 5, 6, 8, 9} check their neighbors (e.g., 3's neighbors are 2, 5, and 6; 6's neighbor is 3) against SA = 0 1 2 3 1 3 F 2 3 F, each lookup first probes the cache and only reads global memory on a miss]
Steps (see the sketch below):
❖ Keep the vertex IDs of just-visited hub vertices in shared memory
❖ Load each frontier's neighbors in-core
❖ Check: Neighbor ID == Cached Vertex ID?
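A sketch of the hub cache inside a bottom-up step, assuming a small direct-mapped cache of frontier hub-vertex IDs held in shared memory and a precomputed list of those hubs (hub_list, hub_n); HUB_CACHE_SIZE and the modulo mapping are illustrative assumptions.

#define HUB_CACHE_SIZE 256   // assumed number of cache slots per CTA

// Bottom-up step with a software-managed, per-CTA cache of just-visited hub vertices.
__global__ void bu_step_hub_cached(const int *row_offsets, const int *col_indices,
                                   int *sa, int num_vertices, int level,
                                   const int *hub_list, int hub_n) {
    __shared__ int hub_cache[HUB_CACHE_SIZE];
    for (int i = threadIdx.x; i < HUB_CACHE_SIZE; i += blockDim.x)
        hub_cache[i] = -1;                       // empty slot
    __syncthreads();
    // Load the frontier's hub vertex IDs into shared memory (direct-mapped by ID).
    for (int i = threadIdx.x; i < hub_n; i += blockDim.x)
        hub_cache[hub_list[i] % HUB_CACHE_SIZE] = hub_list[i];
    __syncthreads();

    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || sa[v] != UNVISITED) return;
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        // Check Neighbor ID == Cached Vertex ID first; read global memory only on a miss.
        if (hub_cache[u % HUB_CACHE_SIZE] == u || sa[u] == level) {
            sa[v] = level + 1;                   // parent found: visit v
            break;                               // early termination
        }
    }
}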
Evaluation
❖ Hardware:
   ❖ C2070 and K40 are from our own cluster
   ❖ M2090 and K20 are from Keeneland and Stampede of XSEDE
❖ Metrics:
   ❖ GTEPS: billions of traversed edges per second
❖ Software:
   ❖ g++ 4.4.7, CUDA 5.0
   ❖ NVIDIA profilers: nvprof, nvvp
   ❖ Compilation flag: -O3
❖ All results are reported as the average of 64 runs
Graph Datasets
[Figure: Graph datasets by vertex count (million, up to ~18M) and edge count (million, up to ~1200M): HW, PK, OR, WK, FB, TW, LJ, FR, RM, and KR0-KR4]
Different Optimizations
[Figure: TEPS (billion, log scale) on FB, FR, HW, KR0-KR4, LJ, OR, PK, RM, TW, and WK for BaseLine (BL), BL+Thread Scheduling (TS), BL+TS+Workload Balancing (WB), and BL+TS+WB+HubCaching (HC)]
❖ TS improves performance by 2x to 37.5x
❖ WB further increases performance by 2x
❖ HC further improves performance by up to 50%
❖ Overall speedup: 3.3x (KR0) to 105.5x (TW)
Scalability
[Figure: TEPS (billion, up to ~100) vs. GPU count (1, 2, 4, 8) for weak-vertex scaling, weak-edge scaling, and strong scaling]
GPU Counter Analysis
[Figure: GPU power (W) on FB, FR, HW, LJ, OR, PK, TW, WK, and KR0 for BL, BL+TS, BL+TS+WB, and BL+TS+WB+HC]
❖ Power saving: 86 W -> 77 W
❖ Contribution distribution:
   ❖ TS: 6 W
   ❖ WB: 2 W
   ❖ HC: 1 W
Conclusion & Future Work
❖ Techniques
   ❖ Streamlined GPU Thread Scheduling
   ❖ GPU Thread Workload Balancing
   ❖ Hub Vertex Based Optimization
❖ Possible extensions
   ❖ Different Workload Balancing Heuristics
   ❖ Theoretical Support for Direction-Switching Based on Hub Vertices
Acknowledgements
Thank You
{asherliu, howie}@gwu.edu