Power-Efficient Breadth-First
Search with DRAM Row Buffer
Locality-Aware Address Mapping
Satoshi Imamura*, Yuichiro Yasui*, Koji Inoue*,
Takatsugu Ono*, Hiroshi Sasaki**, Katsuki Fujisawa*
*Kyushu University, **Columbia University
HPGDMP ’16, Salt Lake City, November 13, 2016
Graph Analysis & BFS
• Graph analysis can be applied to various real-world services
  - Social networking services
  - Road traffic analysis, etc.
• Graphs
  - Consist of vertices and edges
  - Can represent relationships between vertices
• Breadth-first search (BFS)
  - Many graph algorithms are based on BFS
[Figure: example graph traversed level by level, annotated with the visited vertices, the level of each vertex, and the current frontier]
2
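To make the terms in the figure concrete, here is a minimal sketch of a level-synchronous top-down BFS that tracks visited vertices, per-vertex levels, and the current frontier. The tiny example graph is an assumption for illustration; this is not the implementation discussed later in the talk.

```cpp
// Minimal level-synchronous (top-down) BFS; illustrative only.
#include <cstdio>
#include <vector>

int main() {
    // Small example graph as an adjacency list (undirected).
    std::vector<std::vector<int>> adj = {
        {1, 2}, {0, 2}, {0, 1, 3}, {2, 4}, {3}
    };
    std::vector<int> level(adj.size(), -1);   // -1 means "unvisited"
    std::vector<int> frontier = {0};          // start BFS from vertex 0
    level[0] = 0;

    // Expand one level at a time: every unvisited neighbor of the frontier
    // is marked visited and becomes part of the next frontier.
    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int v : frontier)
            for (int w : adj[v])
                if (level[w] == -1) { level[w] = depth; next.push_back(w); }
        frontier.swap(next);
    }
    for (int v = 0; v < (int)adj.size(); ++v)
        std::printf("vertex %d: level %d\n", v, level[v]);
}
```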
Power Problem on Servers
• Performance has improved as the amount of hardware resources has increased
• Power consumption also continues to grow
  - Proportional to the number of active transistors
• Power is one of the most critical design constraints
3
Summary of This Work
• Objective: to improve the power efficiency (i.e., performance per watt) of a state-of-the-art BFS implementation
• Contributions:
  - Investigate the memory access pattern of its main algorithm using a simulator
  - Reveal that conventional address mapping schemes of memory controllers do NOT efficiently exploit DRAM
  - Propose a novel scheme and improve DRAM power efficiency by 30.3%
4
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional
address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
5
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional
address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
6
Yasui16 Implementation [Yasui+, HPGP'16]
• Achieved the best performance for a single-node system in the June 2016 Graph500 list
• Applies several tuning techniques
  - NUMA-aware data layout [Yasui+, BigData'13]
  - Adjacency list and vertex sorting [Yasui+, HPCS'15]
  - Direction optimizing [Beamer+, SC'12]
    ✓ Significantly reduces the number of edge traversals by switching between two algorithms at each level: top-down or bottom-up (see the sketch below)
7
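As a rough illustration of the direction-optimizing switch, the helper below decides the direction for the next level from frontier statistics. The thresholds alpha and beta are the commonly cited defaults from Beamer's heuristic and are assumptions here; the actual Yasui16 tuning may differ.

```cpp
// Illustrative sketch of the direction-optimizing switch [Beamer+, SC'12].
// alpha and beta are assumed default thresholds, not the Yasui16 tuning.
#include <cstdio>
#include <cstdint>

enum class Direction { TopDown, BottomUp };

Direction choose_direction(Direction current,
                           std::uint64_t frontier_edges,    // edges out of the frontier
                           std::uint64_t unexplored_edges,  // edges out of unvisited vertices
                           std::uint64_t frontier_vertices,
                           std::uint64_t total_vertices) {
    const std::uint64_t alpha = 14, beta = 24;
    if (current == Direction::TopDown) {
        // Frontier got "heavy": checking unvisited vertices directly is cheaper.
        if (frontier_edges > unexplored_edges / alpha) return Direction::BottomUp;
    } else {
        // Frontier shrank again: switch back to top-down.
        if (frontier_vertices < total_vertices / beta) return Direction::TopDown;
    }
    return current;
}

int main() {
    Direction d = choose_direction(Direction::TopDown, 500000, 1000000, 4000, 1u << 22);
    std::printf("next level runs %s\n", d == Direction::BottomUp ? "bottom-up" : "top-down");
}
```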
Time Breakdown of BFS
[Figure: per-level execution time (ms) of the top-down and bottom-up phases, with the bottom-up time broken down into its 1st and 2nd loops; BFS is executed with scale 22 on a 10-core Haswell machine]
• Over 99% of the execution time is spent in the bottom-up algorithm
• We aim to improve the power efficiency of bottom-up
8
Bottom-up Algorithm
[Figure: example graph (vertices 0-8) with the current frontier, the unvisited vertices, and the newly found child vertices, shown alongside its compressed sparse row (CSR) representation: vertex IDs, pointers, and the adjacency list]
• Each unvisited vertex scans its CSR adjacency list and becomes a child of the first neighbor it finds in the frontier (see the sketch below)
• This search is parallelized using OpenMP
• Memory-intensive
9
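The loop below is a minimal sketch of one bottom-up level over a CSR graph, parallelized with OpenMP as the slide notes. The array names and the tiny example graph are illustrative assumptions; the tuned Yasui16 kernel applies further optimizations such as the NUMA-aware layout listed earlier.

```cpp
// One bottom-up BFS level over a CSR graph, parallelized with OpenMP.
// Minimal sketch with illustrative names; not the tuned Yasui16 kernel.
#include <cstdio>
#include <vector>
#include <cstdint>

int main() {
    // CSR for a 5-vertex example: row_ptr indexes into adj.
    std::vector<std::int64_t> row_ptr = {0, 2, 4, 7, 9, 10};
    std::vector<std::int64_t> adj     = {1, 2, 0, 2, 0, 1, 3, 2, 4, 3};
    const std::int64_t n = 5;

    std::vector<std::int64_t> parent(n, -1);       // -1 means "unvisited"
    std::vector<char> in_frontier(n, 0), in_next(n, 0);
    parent[0] = 0; in_frontier[0] = 1;             // root = vertex 0

    // Each unvisited vertex scans its adjacency list; if any neighbor is in
    // the current frontier, the vertex adopts it as parent and joins the
    // next frontier. This is the memory-intensive loop analyzed in the talk.
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::int64_t v = 0; v < n; ++v) {
        if (parent[v] != -1) continue;             // already visited
        for (std::int64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            if (in_frontier[adj[e]]) { parent[v] = adj[e]; in_next[v] = 1; break; }
    }
    for (std::int64_t v = 0; v < n; ++v)
        std::printf("vertex %lld: parent %lld\n", (long long)v, (long long)parent[v]);
}
```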
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional
address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
10
Multibank Architecture
[Figure: DRAM organization: the memory controller in the CPU drives multiple channels; each channel connects multiple ranks, each rank contains multiple banks, and each bank is an array of rows and columns fronted by a row buffer]
• Multiple banks can be accessed in parallel → bank parallelism
11
Three Types of Row Buffer Accesses
[Figure: access latency of an access to data in row 2, for each state of the row buffer]
• Hit: row 2 is already in the row buffer → Access only (shortest latency)
• Miss: the row buffer is empty → Activate, then Access
• Conflict: another row (e.g., row 1) occupies the row buffer → Precharge, Activate, then Access (longest latency)
A toy model that classifies accesses this way is sketched below.
12
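The sketch below classifies each access as a hit, miss, or conflict purely from the per-bank open-row state, mirroring the three cases on this slide. It is a simplified model under assumed names; real controllers also track timing constraints.

```cpp
// Toy classification of DRAM accesses into row buffer hits, misses, and
// conflicts, keyed on the per-bank open row. Simplified sketch only.
#include <cstdio>
#include <cstdint>

enum class RBAccess { Hit, Miss, Conflict };

struct Bank {
    bool open = false;        // is any row latched in the row buffer?
    std::uint32_t row = 0;    // which row, if open
};

RBAccess access(Bank& b, std::uint32_t row) {
    if (b.open && b.row == row) return RBAccess::Hit;   // access only
    RBAccess r = b.open ? RBAccess::Conflict             // precharge + activate + access
                        : RBAccess::Miss;                // activate + access
    b.open = true; b.row = row;                          // row buffer now holds 'row'
    return r;
}

int main() {
    Bank bank;                // open-page behavior: the row stays latched
    const char* names[] = {"hit", "miss", "conflict"};
    for (std::uint32_t row : {2u, 2u, 1u, 2u})
        std::printf("access row %u -> %s\n", row, names[(int)access(bank, row)]);
}
```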
Row Buffer Management Policies
• Open-page policy
  - Row buffers are NOT precharged (closed) after being accessed
  ✓ Pros: accesses to the same row become "hits"
  ✓ Cons: accesses to different rows become "conflicts"
• Closed-page policy
  - Row buffers are closed immediately after being accessed
  ✓ Pros/Cons: all accesses become "misses"
• AMD's adaptive open-page policy (used in this work)
  - Row buffers are closed 128 cycles after the last access (see the timeout sketch below)
13
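A tiny sketch of the adaptive open-page idea: the row stays open, but the bank is precharged once 128 idle cycles have passed. The structure and names are illustrative assumptions, not DRAMsim2's actual model.

```cpp
// Sketch of an adaptive open-page policy: a bank's row buffer is
// precharged (closed) once 128 cycles pass without an access to that bank.
#include <cstdio>

struct AdaptiveBank {
    bool open = false;
    unsigned row = 0;
    unsigned long long last_access = 0;            // cycle of the most recent access
    static constexpr unsigned kCloseAfter = 128;   // idle cycles before precharge

    void tick(unsigned long long now) {            // called every cycle
        if (open && now - last_access >= kCloseAfter) open = false;  // precharge
    }
    void access(unsigned long long now, unsigned r) {
        open = true; row = r; last_access = now;   // activate (if needed) and access
    }
};

int main() {
    AdaptiveBank bank;
    bank.access(0, 2);
    for (unsigned long long cyc = 1; cyc <= 200; ++cyc) bank.tick(cyc);
    // After 200 idle cycles the row buffer has been closed, so the next
    // access to row 2 will be a miss rather than a hit.
    std::printf("row buffer open after 200 idle cycles? %s\n", bank.open ? "yes" : "no");
}
```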
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional
address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
14
Address Mapping Schemes
• Determine the location of data in DRAM based on a physical address
• Implemented in memory controllers
• Conventional schemes:
  - Per-cache-line channel interleaving (PCL)
    ✓ 64 B blocks are interleaved across channels
    ✓ Applied to Intel Nehalem processors [Park+, ASPLOS'13]
  - Per-2-cache-line channel interleaving (P2CL)
    ✓ 128 B blocks are interleaved across channels
    ✓ Applied to Intel SandyBridge and Haswell processors
15
Breakdown of Physical Addresses
Per-cache-line channel interleaving (PCL) — channel ID at bits 7-6, so the 6 low-order bits (64 B) map to one channel:
  [34:25] Row block ID | [24:23] Rank ID | [22:18] Row block offset | [17:15] Bank ID | [14:8] Column ID | [7:6] Chan ID | [5:3] Block offset | [2:0] Byte offset
Per-2-cache-line channel interleaving (P2CL) — channel ID at bits 8-7, so the 7 low-order bits (128 B) map to one channel:
  [34:25] Row block ID | [24:23] Rank ID | [22:18] Row block offset | [17:15] Bank ID | [14:9] Column ID | [8:7] Chan ID | [6:3] Block offset | [2:0] Byte offset
A decoding sketch for both layouts follows below.
16
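The sketch below extracts DRAM coordinates from a physical address using the PCL and P2CL layouts above. The bit positions are taken from the slide's diagrams and should be read as illustrative of the interleaving granularity rather than as a definitive controller implementation.

```cpp
// Decode a physical address into DRAM coordinates for the PCL and P2CL
// layouts shown above; bit positions follow the slide's diagrams.
#include <cstdio>
#include <cstdint>

struct DramCoord { unsigned chan, rank, bank, rowBlock, rowBlockOff, col; };

static unsigned bits(std::uint64_t a, unsigned hi, unsigned lo) {
    return (unsigned)((a >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

// PCL: channel ID at bits 7-6 -> every 64 B block switches channel.
DramCoord decode_pcl(std::uint64_t a) {
    return { bits(a, 7, 6), bits(a, 24, 23), bits(a, 17, 15),
             bits(a, 34, 25), bits(a, 22, 18), bits(a, 14, 8) };
}

// P2CL: channel ID at bits 8-7 -> every 128 B block switches channel.
DramCoord decode_p2cl(std::uint64_t a) {
    return { bits(a, 8, 7), bits(a, 24, 23), bits(a, 17, 15),
             bits(a, 34, 25), bits(a, 22, 18), bits(a, 14, 9) };
}

int main() {
    for (std::uint64_t addr = 0; addr < 4 * 64; addr += 64) {   // four cache lines
        DramCoord p = decode_pcl(addr), q = decode_p2cl(addr);
        std::printf("addr %3llu: PCL chan %u, P2CL chan %u\n",
                    (unsigned long long)addr, p.chan, q.chan);
    }
}
```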
Simulator Setup
• Cycle-accurate multicore simulator: MARSSx86
• DRAM simulator: DRAMsim2

Parameter         | Setting
Processor         | 16 out-of-order cores, 3.0 GHz
Cache             | Private 32 KB L1 I/D, private 256 KB L2, shared 16 MB L3
Memory controller | Adaptive open-page policy (128 cycles), FR-FCFS scheduling policy [Rixner+, ISCA'00]
DRAM              | 32 GB, DDR3-1333, 8 KB rows, 4 channels, 4 ranks/channel, 8 banks/rank (128 banks total)
17
Memory Access Trace of Bottom-up with PCL
[Figure: activate, access, and precharge events plotted per bank ID (0-140) over elapsed cycles; the precharge-activate-access sequences indicate row buffer misses]
• Almost all accesses are row buffer misses
• Almost all threads access distinct banks
18
Memory Access Trace of Bottom-up with PCL (1 thread)
[Figure: access trace of a single thread across bank IDs over elapsed cycles; the thread touches 32 banks, issuing accesses at roughly 50-cycle intervals and returning to the same bank only after roughly 220 cycles, by which time the adaptive open-page policy has already closed the row buffer → row buffer misses]
• Bank parallelism is not actually utilized within a thread
  → Multiple banks are wastefully used
19
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional
address mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
20
Idea
• If each thread sequentially accessed an entire row in the same bank…
  - The row buffer hit ratio (RBHR) would improve
  - Bank parallelism would be exploited across different threads
  - The number of banks used at a time would be reduced
21
Per-Row Channel Interleaving (PR)
• Interleaves a contiguous memory region across channels per row size (8 KB in this work)
PR — channel ID at bits 14-13, so the 13 low-order bits (8 KB, one DRAM row) map to one channel:
  [34:25] Row block ID | [24:23] Rank ID | [22:18] Row block offset | [17:15] Bank ID | [14:13] Chan ID | [12:7] Column ID | [6:3] Block offset | [2:0] Byte offset
A decoding sketch for PR follows below.
22
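The PR counterpart of the earlier decoding sketch is below: with the channel bits moved above the row-size boundary, every address within one 8 KB row maps to the same channel, bank, and row, so a thread streaming a row keeps hitting the same row buffer. Bit positions follow the slide's diagram and are illustrative.

```cpp
// Per-row channel interleaving (PR): channel ID at bits 14-13, so the
// 13 low-order bits (8 KB, one DRAM row) stay on one channel.
#include <cstdio>
#include <cstdint>

static unsigned bits(std::uint64_t a, unsigned hi, unsigned lo) {
    return (unsigned)((a >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

int main() {
    // Walk one 8 KB region cache line by cache line: channel and bank stay
    // constant while only the column advances, i.e., row buffer hits.
    for (std::uint64_t addr = 0; addr < 8192; addr += 64) {
        unsigned chan = bits(addr, 14, 13);
        unsigned bank = bits(addr, 17, 15);
        unsigned col  = bits(addr, 12, 7);
        if (addr % 2048 == 0)
            std::printf("addr %5llu: chan %u bank %u col %2u\n",
                        (unsigned long long)addr, chan, bank, col);
    }
}
```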
Memory Access Trace with PR
[Figure: access trace across bank IDs over elapsed cycles with PR]
• RBHR is improved
• Different threads access different banks
• The number of banks used at a time is reduced
23
Agenda
• State-of-the-art BFS implementation
• DRAM mechanisms
• Memory access analysis with conventional address
mapping schemes
• Proposed: per-row channel interleaving
• Evaluation of power efficiency
24
Row Buffer Access Ratio
[Figure: breakdown of row buffer accesses into hits, misses, and conflicts for each scheme]
• Row buffer hit ratio: 4.0% (PCL), 43.9% (P2CL), 70.4% (PR)
25
Performance & Power
[Figure: normalized performance (higher is better) and average power in watts, split into processor and DRAM (lower is better), for PCL, P2CL, and PR]
• Performance improves by 6.0% (P2CL) and 18.5% (PR)
• Average power is reduced by 10.6% and 18.6%
26
Power Efficiency
[Figure: normalized power efficiency (higher is better) of processor+DRAM and of DRAM alone, for PCL, P2CL, and PR]
• Processor+DRAM power efficiency improves by 10.5% and 21.3%
• DRAM power efficiency improves by 30.3% and 32.5%
27
Conclusions
• Improving the power efficiency of BFS is a major challenge
• Conventional address mapping schemes do not efficiently utilize DRAM for the bottom-up algorithm
• We propose per-row channel interleaving (PR)
  - It improves the RBHR and improves DRAM power efficiency by 30.3%
• Future work
  - Evaluate PR with other graphs, algorithms, and applications with various memory access patterns
28
Power-Efficient Breadth-First
Search with DRAM Row Buffer
Locality-Aware Address Mapping
Satoshi Imamura*, Yuichiro Yasui*, Koji Inoue*,
Takatsugu Ono*, Hiroshi Sasaki**, Katsuki Fujisawa*
*Kyushu University, **Columbia University