
A Requests Bundling DRAM Controller
for Mixed-Criticality Systems
April 23, 2017
RTAS 2017
by: Danlu Guo, Rodolfo Pellizzoni
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 2
Introduction
 Multicore architecture
- Shared DRAM main memory
- Inter-core memory interference
 Real-Time system
- Hard Real-Time (HRT) applications
- Soft Real-Time (SRT) applications
 What do we want from DRAM
- A tighter upper bound on latency for HRT requests
- A better lower bound on bandwidth for SRT requests
 Solution:
- Innovative predictable DRAM controllers
[Figure: multicore system with Cores 0 to N (each with a CPU and private caches), a shared LL cache, the DRAM controller, and the DRAM main memory]
PAGE 3
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 4
DRAM Background
 Organization
- Channel: independent DRAM controller
- Rank: shares the command/address and data buses
- Bank: can be accessed in parallel
- Row, Column, Row Buffer: data cells
[Figure: a DRAM channel: the DRAM controller drives the address/command and data buses to DRAM Ranks 0 to N, each rank built from DRAM Chips 0 to 7, each chip containing Banks 0 to N with a row decoder, rows, columns, and a row buffer]
PAGE 5
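As a rough illustration of how a controller maps a physical address onto these levels, here is a minimal C++ sketch; the field widths (10 column bits, 3 bank bits, 14 row bits) and the single-channel, single-rank layout are assumptions for illustration, not values from the slides.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical single-channel, single-rank mapping:
// | row (14 bits) | bank (3 bits) | column (10 bits) | byte offset (3 bits) |
struct DramAddr {
    uint32_t row;
    uint32_t bank;
    uint32_t column;
};

DramAddr decode(uint64_t phys) {
    DramAddr a;
    a.column = (phys >> 3)  & 0x3FF;   // 10 column bits (8-byte data words)
    a.bank   = (phys >> 13) & 0x7;     // 3 bank bits -> 8 banks per rank
    a.row    = (phys >> 16) & 0x3FFF;  // 14 row bits -> 16384 rows per bank
    return a;
}

int main() {
    DramAddr a = decode(0x12345678ULL);
    std::printf("row=%u bank=%u column=%u\n", a.row, a.bank, a.column);
    return 0;
}
```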
DRAM Background
 Operation
- Activate (ACT): retrieve data into the row buffer
- Column-Access-Strobe (RD/WR): access data in the row buffer
- Precharge (PRE): restore data
- Timing Constraints (refer to the DDR specifications)
 Example: RD [0,0,1]
[Figure: read timing on Bank 0: A (ACT), tRCD, R (RD), tRL until Data; tRTP before P (PRE)]
PAGE 6
DRAM Background
 Page Policy (example: RD[0,0,1], RD[0,0,0])
- Close-Page: Precharge (PRE) after each access (CAS)
- Open-Page: Precharge (PRE) only when required
[Figure: close-page: A, tRCD, R, tRL, Data, tRTP, P, then tRC before the next A; open-page: a row hit needs only R, tRL, Data, while a row miss (close) pays P, tRP, A, tRCD, R, tRL, Data]
PAGE 7
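The trade-off can be summarized symbolically from the timing parameters in the diagrams above (ignoring bus transfer and refresh):

```latex
\begin{align*}
L_{\text{close-page}} &= t_{RCD} + t_{RL}           && \text{every access re-activates the row}\\
L_{\text{open, hit}}  &= t_{RL}                     && \text{row already in the row buffer}\\
L_{\text{open, miss}} &= t_{RP} + t_{RCD} + t_{RL}  && \text{precharge the old row, then activate}
\end{align*}
```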
DRAM Background
 Data Allocation
 Private Bank
- Allows isolation between cores/banks
- Limits data sharing
 Shared Banks
- Allows data sharing among cores
- Contention on the same bank
[Figure: the same data blocks 0 to 17 mapped either into per-core private banks (Core 0 in Bank 0, Core 1 in Bank 1) or interleaved across the shared Banks 0 and 1]
PAGE 8
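A minimal C++ sketch of the two allocation schemes; the bank count and the interleaving granularity are assumptions for illustration, not values from the slides.

```cpp
#include <cstdint>

constexpr unsigned kNumBanks = 8;   // assumed number of banks
constexpr unsigned kRowShift = 13;  // assumed 8 KiB interleaving granularity

// Private-bank allocation: each core is pinned to its own bank,
// which isolates cores from each other but limits data sharing.
unsigned privateBank(unsigned coreId) {
    return coreId % kNumBanks;
}

// Shared-bank allocation: the bank is selected by address bits, so every
// core can reach every bank, at the cost of contention on the same bank.
unsigned sharedBank(uint64_t physAddr) {
    return (physAddr >> kRowShift) % kNumBanks;
}
```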
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 9
Predictable DRAM Controllers Evaluation
 Shared bank + Close-Page
- The N requestors re-activate the same bank in turn: N - 1 re-activations, separated by tRC
[Figure: Cores 0 to 3 accessing Bank 0 back to back with close-page sequences (A, tRCD, R/W, P), tRC between consecutive activations; Banks 1 to 3 idle]
 Private Bank + Open-Page
- N - 1 PRE and N - 1 ACT commands on the other banks, plus N - 1 CAS switching delays (tRTW / tWTR) on the shared bus
[Figure: Banks 0 to 3 each issuing P (tRP), A (tRCD), and a read or write CAS; read and write CAS commands separated by tRTW and tWTR, writes followed by tWL before Data]
PAGE 10
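Reading the two timelines as worst-case interference on one request from its N - 1 competitors gives, roughly (a symbolic sketch that ignores refresh and uses the usual DDR write-to-read relation for the switching gap):

```latex
\[
\Delta_{\text{shared, close-page}} \;\approx\; (N-1)\, t_{RC},
\qquad
\Delta_{\text{private, open-page}} \;\approx\; \sum_{i=1}^{N-1} t^{(i)}_{\text{CAS-CAS}},
\quad
t_{\text{CAS-CAS}} \in \{\, t_{CCD},\; t_{RTW},\; t_{WL} + t_{Bus} + t_{WTR} \,\}
\]
```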
Predictable DRAM Controllers Evaluation
 Private Bank + Open-Page
- Example: DDR3-1600H CAS-to-CAS distances: RD-RD: 4, RD-WR: 7, WR-RD: 18 (cycles)
- With alternating read/write CAS commands the example suffers 32 cycles of switching delay
[Figure: Banks 0 to 3 issuing PRE (tRP), ACT (tRCD), and read/write CAS commands; the shared bus alternates R and W with tRTW and tWTR gaps; 32 cycles]
 Private Bank + Open-Page + CAS reordering [L. Ecco and R. Ernst, RTSS'15]
- Bundling CAS commands of the same type reduces the example to 15 cycles
[Figure: reordered CAS stream: reads separated by tCCD, a single read-to-write switch (tRTW) before the write (tWL, Data); 15 cycles]
PAGE 11
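One decomposition that is consistent with the 32- and 15-cycle annotations for DDR3-1600H (an illustrative reading of the two diagrams, not a statement of the formal analysis):

```latex
\[
\underbrace{7}_{\text{RD-WR}} + \underbrace{18}_{\text{WR-RD}} + \underbrace{7}_{\text{RD-WR}} = 32 \text{ cycles (alternating CAS types)},
\qquad
\underbrace{4}_{\text{RD-RD}} + \underbrace{4}_{\text{RD-RD}} + \underbrace{7}_{\text{RD-WR}} = 15 \text{ cycles (same-type CAS bundled)}
\]
```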
Predictable DRAM Controllers Evaluation
 Current Analytical Model
- Charges each command from an assumed arrival time, not the actual command arrival time
[Figure: per-bank PRE (tRP), ACT (tRCD), and CAS timelines; the analyzed A command is charged later than it actually arrives]
 Pipeline System
- Overlapping the PRE/ACT/CAS streams of different banks hides part of the interference
- Objective: tighter HRT latency
[Figure: the same commands pipelined across Banks 0 to 3, shortening the analyzed HRT latency]
PAGE 12
Predictable DRAM Controllers Evaluation
 Mixed-Criticality System
 HRT and SRT applications co-exist on different cores
 Fixed priority can guarantee the HRT latency but limits the SRT bandwidth, up to starving the SRT requests
- Objective: guaranteed SRT bandwidth
[Figure: per-bank HRT request queues are always served first, so the pending SRT requests starve]
PAGE 13
Objective Summary
 HRT Latency:
- Apply pipelining to cover the overlapped interference.
- Apply reordering to avoid the repetitive CAS switching; however, reordering CAS commands breaks the execution sequence.
 SRT Bandwidth:
- Co-schedule SRT and HRT requests to avoid starvation.
=> Proposed solution: Requests Bundling DRAM Controller
PAGE 14
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 15
Requests Bundling (REQBundle) DRAM Controller
HRT Latency:
 Isolation
- Private bank
 Pipelining and Reordering
- Close-Page => fixed command sequence
- Reordering on the request level => avoid multiple switching => fixed request sequence
SRT Bandwidth:
 Fast Access
- Shared bank + Open-Page
 Co-schedule SRT and HRT requests
- Fixed SRT execution slots before the HRT requests
PAGE 16
Command Scheduler
 InRound Scheduler (HRT banks and SRT banks)
- Schedules HRT and SRT commands
- Bundles requests of the same type (all reads or all writes)
- Switches the access type between rounds
 OutRound Scheduler (SRT banks)
- Schedules SRT commands only
[Figure: a write InRound, an OutRound, and a read InRound back to back; RD/WR requests on Banks 0 to 3 plus the SRT bank, with the round start/end points and the type switches marked]
PAGE 17
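A rough C++ sketch of this round structure, assuming plain request queues and ignoring all DDR timing constraints; the class and function names (ReqBundleScheduler, runInRound, issue, ...) are illustrative and not taken from the controller's actual implementation.

```cpp
#include <deque>
#include <vector>

enum class Type { Read, Write };

struct Request { Type type; unsigned bank; bool hrt; };

// One scheduling round: an InRound bundles all pending HRT requests of the
// current type (plus fixed SRT slots); an OutRound serves SRT requests only.
class ReqBundleScheduler {
public:
    void tick() {
        if (inRound_) {
            runInRound();
            inRound_ = false;                 // InRound ends, OutRound starts
        } else {
            runOutRound();
            inRound_ = true;                  // switch the access type between rounds
            type_ = (type_ == Type::Read) ? Type::Write : Type::Read;
        }
    }
    std::deque<Request> hrtQueue, srtQueue;

private:
    void runInRound() {
        issueSrtSlot();                       // fixed SRT slot before the HRT bundle
        // Snapshot: take every pending HRT request of the current type (N of them).
        std::vector<Request> bundle;
        for (auto it = hrtQueue.begin(); it != hrtQueue.end();) {
            if (it->type == type_) { bundle.push_back(*it); it = hrtQueue.erase(it); }
            else ++it;
        }
        for (const Request& r : bundle) issue(r);   // same-type CAS stream, no switching
    }
    void runOutRound() { issueSrtSlot(); }          // SRT commands only
    void issueSrtSlot() {
        if (!srtQueue.empty()) { issue(srtQueue.front()); srtQueue.pop_front(); }
    }
    void issue(const Request&) { /* drive ACT/CAS/PRE for this request (omitted) */ }

    bool inRound_ = true;
    Type type_ = Type::Read;
};
```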
InRound Scheduler
 Execution Time of an InRound (example with N = 2 HRT reads)
- t_snap: time to determine the number of HRT requests (N)
- t_SRT: time to issue the last SRT CAS
- t_ACT: time to issue the last HRT ACT
- Execution time: R(N) = max(t_SRT + (N - 1) * tCCD, t_snap + t_ACT)
[Figure: one read InRound: round start, SRT ACTs, HRT ACT/RD pairs on Banks 0 and 2 (Bank 1: not care), the SRT CAS (W) on the SRT bank, round end]
PAGE 18
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 19
Request Arrival Time and Latency
 Case 0: the request arrives before the snapshot of a round of the same type
 LReq = R(N0) + tRL + tBus
[Figure: a read request served within round R0: its ACT and RD are issued inside the round, followed by tRL and the data transfer (tBus)]
PAGE 20
Request Arrival Time and Latency
 Case 1: the request arrives before/after the snapshot of a round of the different type
 LReq = R(N0) + R(N1) + tRL + tBus
[Figure: the request waits out round R0 (the other type) and is served in round R1, followed by tRL and the data transfer (tBus)]
PAGE 21
Request Arrival Time and Latency
 Case 2: the request arrives after the snapshot of a round of the same type
 LReq = R(N0) + R(N1) + R(N2) + tRL + tBus (worst case)
[Figure: the request misses the snapshot of R0, waits through R1 (the other type), and is served in round R2]
PAGE 22
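A small C++ helper that evaluates these three bounds, treating the round lengths R(N0), R(N1), R(N2) and the constants tRL, tBus as inputs; the function and parameter names are mine, not from the paper's analysis code.

```cpp
#include <algorithm>

// Per-case HRT request latency bound from the slides:
//   case 0: arrives before the snapshot of a same-type round   -> R(N0) + tRL + tBus
//   case 1: arrives around a round of the other type            -> R(N0) + R(N1) + tRL + tBus
//   case 2: arrives after the snapshot of a same-type round     -> R(N0) + R(N1) + R(N2) + tRL + tBus
struct LatencyBounds {
    unsigned case0, case1, case2, worst;
};

LatencyBounds hrtLatencyBound(unsigned r0, unsigned r1, unsigned r2,
                              unsigned tRL, unsigned tBus) {
    LatencyBounds b;
    b.case0 = r0 + tRL + tBus;
    b.case1 = r0 + r1 + tRL + tBus;
    b.case2 = r0 + r1 + r2 + tRL + tBus;          // worst case per the slides
    b.worst = std::max({b.case0, b.case1, b.case2});
    return b;
}
```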
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 23
Evaluation
 Implemented in a general DRAM controller simulation framework in C++
 [DRAMController Demo RTSS’16]
 EEMBC benchmark memory traces generated from MACsim
 CPU 1GHz
 Private L1/2 Cache
 Shared L3 Cache
 Evaluated against the Command Bundling (CMDBundle) DRAM Controller
 [L. Ecco and R. Ernst, RTSS'15]
 Burst Mode
 Non-Burst Mode
PAGE 24
Benchmark Worst-Case Execution Time (8 HRTs)
 HRT0 runs a benchmark trace while the other 7 HRTs run memory-intensive traces
 Normalized to CMDBundle (non-burst)
[Figure: normalized execution time (0.5 to 1) of REQBundle and CMDBundle (burst) on the EEMBC traces a2time, cacheb, basefp, irrflt, aifirf, and tblook]
PAGE 25
Worst-Case HRT Request Latency (8 HRTs)
 RD and WR requests
[Figure: worst-case read and write latency (0 to 400 ns) of CMDBundleH and CMDBundleM (burst and non-burst) and REQBundle for DDR3-800D, 1066E, 1333G, 1600H, 1866K, and 2133L]
PAGE 26
Worst-Case SRT Request Bandwidth (8 HRTs)
 RD and WR bandwidth
[Figure: SRT read bandwidth (0 to 3 GB/s) and SRT write bandwidth (0 to 7 GB/s) for SRT requestors 0 to 4 on DDR3-1066E, 1333G, 1600H, 1866K, and 2133L]
PAGE 27
Mixed-Criticality System (8 HRTs, 8 SRTs)
 HRT Latency and SRT Bandwidth
[Figure: HRT request latency (0 to 180 cycles) and SRT bandwidth (0 to 5 GB/s) of REQBundle and CMDBundle (burst) across the configurations 0 to 4 on the x-axis]
 A virtual HRT requestor mechanism is implemented for CMDBundle
- Virtual requestors are treated as HRT cores in the system
- All SRT requests share the virtual requestors
PAGE 28
Outline
 Introduction
 DRAM Background
 Predictable DRAM Controller Evaluation
 Requests Bundling DRAM Controller
 Worst Case Latency Analysis
 Evaluation
 Conclusion
PAGE 29
Conclusion
 Employing request bundling with pipelining improves the worst-case request latency.
 Accounting for the gaps left by the command timing constraints provides a good trade-off between SRT bandwidth and HRT latency.
 Comparison with a state-of-the-art real-time memory controller shows that the balance point depends on the row-hit ratio of a task.
- The measured row-hit ratio is below 50%; a guaranteed row-hit ratio requires static analysis and is lower than the measured ratio.
PAGE 30
THANK YOU
EVALUATION (backup): CMDBundle burst vs. non-burst insertion
 Burst mode: a newly inserted RD request (RD u.a.) can still be executed in the current round
[Figure: command registers cr and icr0 to icr2; the previous RD executes, the inserted RD u.a. follows within the round, CAS commands separated by tCCD, with tRTW / tWR-RD at the type switch between Round 0 and Round 1; tDelay = tRL + tBus]
 Non-burst mode: the inserted RD u.a. waits until the next round
[Figure: the write CAS commands of Round 0 proceed tCCD apart; the RD u.a. inserted into the command registers is only executed after the tWR-RD switch in Round 1]
PAGE 32