Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping
Satoshi Imamura*, Yuichiro Yasui*, Koji Inoue*, Takatsugu Ono*, Hiroshi Sasaki**, Katsuki Fujisawa*
*Kyushu University, **Columbia University
HPGDMP '16, Salt Lake City, November 13, 2016

Graph Analysis & BFS
• Graph analysis is applied in various real services
  - Social networking services
  - Road traffic analysis, etc.
• Graphs
  - Consist of vertices and edges
  - Can represent relationships between vertices
• Breadth-first search (BFS)
  - Many graph algorithms are based on BFS (see the sketch below)
[Figure: BFS explores the graph level by level, expanding a frontier of visited vertices]
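To make the "level" and "frontier" terms in the figure concrete, here is a minimal sketch of a conventional level-synchronous (top-down) BFS over an adjacency-list graph; the function name and data layout are illustrative and are not taken from the implementation discussed in this talk.

```cpp
#include <vector>

// Minimal level-synchronous (top-down) BFS.
// level[v] is the BFS depth of vertex v; -1 means "unvisited".
std::vector<int> bfs_top_down(const std::vector<std::vector<int>>& adj, int root) {
    std::vector<int> level(adj.size(), -1);
    std::vector<int> frontier{root};          // vertices discovered at the previous level
    level[root] = 0;
    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;                // vertices discovered at this level
        for (int u : frontier)
            for (int v : adj[u])
                if (level[v] == -1) {         // v is reached for the first time
                    level[v] = depth;
                    next.push_back(v);
                }
        frontier.swap(next);
    }
    return level;
}
```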
Power Problem on Servers
• Performance has been improved by increasing the amount of hardware resources
• Power consumption also continues to grow
  - Proportional to the number of active transistors
• Power is one of the most critical design constraints

Summary of This Work
• Objective: improve the power efficiency (i.e., performance per watt) of a state-of-the-art BFS implementation
• Contributions:
  - Investigate the memory access pattern of its main algorithm using a simulator
  - Reveal that conventional address mapping schemes of memory controllers do NOT efficiently exploit DRAM
  - Propose a novel scheme that improves DRAM power efficiency by 30.3%

Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency

Yasui16 Implementation [Yasui+, HPGP'16]
• Achieved the best performance for a single-node system in the June 2016 Graph500 list
• Applies several tuning techniques
  - NUMA-aware data layout [Yasui+, BigData'13]
  - Adjacency list and vertex sorting [Yasui+, HPCS'15]
  - Direction optimizing [Beamer+, SC'12]
    ✓ Significantly reduces the number of edge traversals by switching between two algorithms at each level: top-down or bottom-up (see the sketch below)
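The switch between the two per-level kernels follows the direction-optimizing idea of [Beamer+, SC'12]. A rough sketch of the switching test is below; the exact cost model and the constant alpha are illustrative assumptions and are not taken from the Yasui16 code.

```cpp
#include <cstdint>

// Direction-switching test in the spirit of [Beamer+, SC'12] (sketch).
// m_frontier: number of edges incident to the current frontier.
// m_unvisited: number of edges incident to still-unvisited vertices.
// alpha is a tunable constant; 14 is used here purely as an illustrative value.
bool use_bottom_up(std::uint64_t m_frontier, std::uint64_t m_unvisited, double alpha = 14.0) {
    // Top-down work grows with the frontier's edges, while bottom-up work shrinks
    // as fewer vertices remain unvisited, so switch once the frontier gets "heavy".
    return static_cast<double>(m_frontier) > static_cast<double>(m_unvisited) / alpha;
}
```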
Time Breakdown of BFS
[Figure: per-level execution time (ms) of BFS on a scale-22 graph, run on a 10-core Haswell machine, split into top-down and bottom-up phases; the bottom-up time is further broken down into its 1st and 2nd loops]
• Over 99% of the execution time is spent in the bottom-up algorithm
• We aim to improve the power efficiency of the bottom-up algorithm

Bottom-up Algorithm
[Figure: example graph with a frontier, unvisited vertices, and the child vertices claimed during the search; the graph is stored in compressed sparse row (CSR) format, i.e., a pointer array indexed by vertex ID plus a concatenated adjacency list]
• Each unvisited vertex scans its adjacency list in the CSR structure and, if any neighbor is in the frontier, becomes a child of that neighbor and is marked visited
• This search is parallelized using OpenMP
• The algorithm is memory-intensive (see the sketch below)
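A minimal sketch of one bottom-up level over a CSR graph, parallelized with OpenMP as described above. The array names, element types, and the byte-per-vertex frontier representation are illustrative assumptions and are not taken from the Yasui16 code.

```cpp
#include <cstdint>
#include <vector>

// One bottom-up BFS level over a CSR graph (sketch).
// row_ptr/col_idx: CSR pointer array and concatenated adjacency list.
// frontier[v] != 0 if v was discovered in the previous level.
// parent[v] == -1 means v is still unvisited.
void bottom_up_step(const std::vector<std::int64_t>& row_ptr,
                    const std::vector<std::int32_t>& col_idx,
                    const std::vector<char>& frontier,
                    std::vector<char>& next_frontier,
                    std::vector<std::int32_t>& parent) {
    const std::int64_t n = static_cast<std::int64_t>(parent.size());
    #pragma omp parallel for schedule(dynamic, 1024)
    for (std::int64_t v = 0; v < n; ++v) {
        if (parent[v] != -1) continue;                    // already visited
        for (std::int64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            const std::int32_t u = col_idx[e];            // neighbor of v
            if (frontier[u]) {                            // neighbor is in the frontier
                parent[v] = u;                            // v becomes a child of u
                next_frontier[v] = 1;
                break;                                    // stop scanning v's list
            }
        }
    }
}
```

The scattered reads of frontier[u] across a huge vertex range are what make this kernel memory-intensive.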
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency

Multibank architecture
[Figure: the CPU's memory controller drives multiple channels; each channel contains ranks, each rank contains banks, and each bank is an array of rows with a row buffer accessed by column]
• Multiple banks can be accessed in parallel → bank parallelism

Three Types of Row Buffer Accesses
[Figure: an access to data in row 2, with the row buffer empty, already holding row 2, or holding row 1]
• Hit: the requested row is already in the row buffer → shortest access latency
• Miss: the row buffer is empty → the requested row must be activated before the access
• Conflict: a different row is in the row buffer → it must be precharged and the requested row activated → longest access latency
(A minimal model of these three cases is sketched below.)
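A minimal sketch of a per-bank row buffer model that classifies each access as a hit, miss, or conflict under an open-page policy; the struct and enum names are illustrative and this is not DRAMSim2's API.

```cpp
#include <cstdint>
#include <optional>

enum class RowBufferAccess { Hit, Miss, Conflict };

// Single bank with an open-page row buffer (sketch).
// A closed-page policy would clear 'open_row' after every access,
// turning every subsequent access into a miss.
struct Bank {
    std::optional<std::uint32_t> open_row;   // empty = row buffer precharged

    RowBufferAccess access(std::uint32_t row) {
        if (!open_row) {                     // row buffer empty: activate, then access
            open_row = row;
            return RowBufferAccess::Miss;
        }
        if (*open_row == row)                // requested row is already open
            return RowBufferAccess::Hit;
        open_row = row;                      // precharge the old row, activate the new one
        return RowBufferAccess::Conflict;
    }
};
```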
Row Buffer Management Policies
• Open-page policy
  - Row buffers are NOT precharged (closed) after being accessed
    ✓ Pros: accesses to the same row become "hits"
    ✓ Cons: accesses to different rows become "conflicts"
• Closed-page policy
  - Row buffers are closed immediately after being accessed
    ✓ Pros/Cons: all accesses become "misses"
• AMD's adaptive open-page policy (used in this work)
  - Row buffers are closed 128 cycles after the last access

Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency

Address Mapping Schemes
• Determine the location of data in DRAM based on a physical address
• Implemented in memory controllers
• Conventional schemes:
  - Per-cache-line channel interleaving (PCL)
    ✓ 64 B blocks are interleaved across channels
    ✓ Applied in Intel Nehalem processors [Park+, ASPLOS'13]
  - Per-2-cache-line channel interleaving (P2CL)
    ✓ 128 B blocks are interleaved across channels
    ✓ Applied in Intel Sandy Bridge and Haswell processors

Breakdown of Physical Addresses (35-bit physical address)
• PCL: row block ID [34:25] | rank ID [24:23] | row block offset [22:18] | bank ID [17:15] | column ID [14:8] | channel ID [7:6] | block offset [5:3] | byte offset [2:0]
  - 6 bits below the channel ID → 64 B interleaving granularity
• P2CL: row block ID [34:25] | rank ID [24:23] | row block offset [22:18] | bank ID [17:15] | column ID [14:9] | channel ID [8:7] | block offset [6:3] | byte offset [2:0]
  - 7 bits below the channel ID → 128 B interleaving granularity
(A decoding sketch for these two layouts follows.)
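A minimal sketch of how a memory controller could slice these fields out of a physical address under PCL and P2CL, following the bit positions listed above; it is an illustration of the mapping, not actual controller or simulator code.

```cpp
#include <cstdint>

struct DramAddress {
    std::uint32_t channel, rank, bank, row, column;
};

// Extract 'width' bits starting at bit position 'lo' of a physical address.
static std::uint32_t bits(std::uint64_t addr, unsigned lo, unsigned width) {
    return static_cast<std::uint32_t>((addr >> lo) & ((1ULL << width) - 1));
}

// Per-cache-line channel interleaving (PCL): channel ID at bits [7:6],
// so consecutive 64 B blocks land on different channels.
DramAddress decode_pcl(std::uint64_t pa) {
    DramAddress d;
    d.channel = bits(pa, 6, 2);
    d.column  = bits(pa, 8, 7);                               // high part of the column address
    d.bank    = bits(pa, 15, 3);
    d.rank    = bits(pa, 23, 2);
    d.row     = (bits(pa, 25, 10) << 5) | bits(pa, 18, 5);    // row block ID | row block offset
    return d;
}

// Per-2-cache-line channel interleaving (P2CL): channel ID at bits [8:7],
// so 128 B blocks are interleaved across channels.
DramAddress decode_p2cl(std::uint64_t pa) {
    DramAddress d;
    d.channel = bits(pa, 7, 2);
    d.column  = bits(pa, 9, 6);
    d.bank    = bits(pa, 15, 3);
    d.rank    = bits(pa, 23, 2);
    d.row     = (bits(pa, 25, 10) << 5) | bits(pa, 18, 5);
    return d;
}
```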
Simulator Setup
• Cycle-accurate multicore simulator: MARSSx86
• DRAM simulator: DRAMSim2
• Processor: 16 out-of-order cores, 3.0 GHz
• Cache: private 32 KB L1 I/D, private 256 KB L2, shared 16 MB L3
• Memory controller: adaptive open-page policy (128 cycles), FR-FCFS scheduling policy [Rixner+, ISCA'00]
• DRAM: 32 GB, DDR3-1333, 8 KB rows, 4 channels, 4 ranks/channel, 8 banks/rank (128 banks in total)

Memory Access Trace of Bottom-up with PCL
[Figure: activates, accesses, and precharges plotted per bank ID (0-127) over elapsed cycles]
• Almost all threads access distinct banks
• Almost all accesses are row buffer misses
Memory Access Trace of Bottom-up with PCL (1 thread)
[Figure: a single thread's accesses spread over 32 banks, with roughly 50-cycle and 220-cycle access intervals annotated]
• The row buffer has already been closed when a bank is accessed again → row buffer misses
• Bank parallelism is not actually utilized within a thread → multiple banks are wastefully used

Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency

Idea
• If each thread sequentially accesses an entire row in the same bank...
  - The row buffer hit ratio (RBHR) should be improved
  - Bank parallelism should be exploited among different threads
  - The number of banks used at a time should be reduced

Per-Row Channel Interleaving (PR)
• Interleaves a contiguous memory region across channels per row size (8 KB in this work)
• PR: row block ID [34:25] | rank ID [24:23] | row block offset [22:18] | bank ID [17:15] | channel ID [14:13] | column ID [12:7] | block offset [6:3] | byte offset [2:0]
  - 13 bits below the channel ID → 8 KB interleaving granularity, so each 8 KB region maps to a single row of a single bank (see the sketch below)
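Continuing the decoding sketch, PR only moves the channel-ID bits higher so that the 13 low-order bits (one 8 KB row) stay on one channel and bank. The short self-contained comparison below prints the channel chosen for consecutive cache lines under the three schemes; the bit positions follow the slides, everything else is illustrative.

```cpp
#include <cstdint>
#include <cstdio>

// Extract 'width' bits starting at bit position 'lo' of a physical address.
static std::uint32_t bits(std::uint64_t a, unsigned lo, unsigned width) {
    return static_cast<std::uint32_t>((a >> lo) & ((1ULL << width) - 1));
}

// Channel selection under the three schemes (bit positions from the slides).
static std::uint32_t chan_pcl(std::uint64_t pa)  { return bits(pa, 6, 2);  }  // 64 B granularity
static std::uint32_t chan_p2cl(std::uint64_t pa) { return bits(pa, 7, 2);  }  // 128 B granularity
static std::uint32_t chan_pr(std::uint64_t pa)   { return bits(pa, 13, 2); }  // 8 KB granularity

int main() {
    // Walk 8 consecutive 64 B cache lines: PCL/P2CL bounce between channels,
    // while PR keeps the whole 8 KB row on one channel (and therefore one bank).
    for (std::uint64_t pa = 0; pa < 8 * 64; pa += 64)
        std::printf("line %llu: PCL ch=%u  P2CL ch=%u  PR ch=%u\n",
                    static_cast<unsigned long long>(pa / 64),
                    chan_pcl(pa), chan_p2cl(pa), chan_pr(pa));
    return 0;
}
```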
Memory Access Trace with PR
[Figure: activates, accesses, and precharges plotted per bank ID over elapsed cycles with per-row channel interleaving]
• RBHR is improved
• Different threads access different banks
• The number of banks used at a time is reduced

Agenda
• Graph analysis applications and Graph500
• State-of-the-art Graph500 implementation
• DRAM mechanisms
• Memory access analysis with conventional address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency

Row Buffer Access Ratio
[Figure: breakdown of row buffer accesses into hits, misses, and conflicts for PCL, P2CL, and PR]
• Row buffer hit ratio: 4.0% with PCL, 43.9% with P2CL, and 70.4% with PR
Performance & Power
[Figure: normalized performance (higher is better) and average processor/DRAM power in watts (lower is better) for PCL, P2CL, and PR]
• Performance improves by 6.0% with P2CL and by 18.5% with PR (relative to PCL)
• Average power decreases as well; reductions of 10.6% and 18.6% are annotated in the power chart
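Power efficiency in this talk means performance per watt (as defined in the summary), so the normalized efficiencies on the next slide follow directly from the performance and average power shown above:

$$\text{Power efficiency} = \frac{\text{Performance}}{\text{Average power [W]}}$$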
Power Efficiency
[Figure: normalized power efficiency (higher is better) for processor+DRAM and for DRAM alone under PCL, P2CL, and PR, with improvements of 10.5%, 21.3%, 32.5%, and 30.3% annotated]
• PR achieves the best power efficiency, improving DRAM power efficiency by 30.3%

Conclusions
• Improving the power efficiency of BFS is a big challenge
• Conventional address mapping schemes do not efficiently utilize DRAM in the bottom-up algorithm
• We propose per-row channel interleaving
  - It improves the RBHR and raises DRAM power efficiency by 30.3%
• Future work
  - Evaluate PR with other graphs, algorithms, and applications with various memory access patterns

Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping
Satoshi Imamura*, Yuichiro Yasui*, Koji Inoue*, Takatsugu Ono*, Hiroshi Sasaki**, Katsuki Fujisawa*
*Kyushu University, **Columbia University