A Requests Bundling DRAM Controller for Mixed-Criticality Systems
RTAS 2017, April 23, 2017
Danlu Guo, Rodolfo Pellizzoni

Outline
- Introduction
- DRAM Background
- Predictable DRAM Controllers: Evaluation
- Requests Bundling DRAM Controller
- Worst Case Latency Analysis
- Evaluation
- Conclusion

Introduction
Multicore architecture:
- Shared DRAM main memory
- Inter-core memory interference
Real-time systems:
- Hard Real-Time (HRT) applications
- Soft Real-Time (SRT) applications
What we want from DRAM:
- A tighter upper bound on latency for HRT requests
- A better lower bound on bandwidth for SRT requests
Solution: innovative predictable DRAM controllers
(Figure: multicore system with per-core CPUs and caches, a shared last-level cache, the DRAM controller, and DRAM main memory.)

DRAM Background: Organization
- Channel: independent DRAM controller
- Rank: shares the command/data bus
- Bank: accessed in parallel
- Row, column, row buffer: arrays of data cells
(Figure: a DRAM channel with ranks of chips on shared address/command and data buses; each bank has a row decoder and a row buffer.)

DRAM Background: Operation
- Activate (ACT): retrieve a row into the row buffer
- Column-Access-Strobe (CAS, RD/WR): access data in the row buffer
- Precharge (PRE): restore the data to the array
- Timing constraints: refer to the DDR specifications
(Figure: timing of RD [0,0,1] as ACT, tRCD, RD, tRL, data, tRTP, PRE.)

DRAM Background: Page Policy
- Close-Page: precharge (PRE) after every access (CAS)
- Open-Page: precharge (PRE) only when required; a row hit skips the ACT, a row miss pays PRE + ACT first
(Figure: close-page timing for RD[0,0,1], RD[0,0,0] with tRC between activations, versus open-page hit and miss timing with tRP, tRCD, tRL, tRTP.)

DRAM Background: Data Allocation
- Shared banks: allow data sharing among cores, but suffer contention on the same bank
- Private banks: isolate cores from each other, but limit data sharing
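As a rough sketch of how the page policy affects access latency, the following uses illustrative DDR3-style timing values; the specific numbers are assumptions for demonstration, not taken from any particular speed grade.

```python
# Illustrative DDR3-style timing parameters, in memory-clock cycles.
# These specific values are assumptions, not from any one speed grade.
tRCD = 11  # ACT -> CAS delay
tRL  = 11  # CAS -> first data (read latency)
tRP  = 11  # PRE -> ACT delay (precharge time)
tBUS = 4   # data burst transfer time on the bus

def close_page_latency():
    # Close-page: the row is always closed, so every access pays ACT + CAS.
    return tRCD + tRL + tBUS

def open_page_latency(row_hit):
    if row_hit:
        # Row hit: the row is already in the row buffer, only the CAS is needed.
        return tRL + tBUS
    # Row miss: precharge the old row, activate the new one, then CAS.
    return tRP + tRCD + tRL + tBUS

print(close_page_latency())      # 26
print(open_page_latency(True))   # 15
print(open_page_latency(False))  # 37
```

With these placeholder values, an open-page hit is the fastest path and an open-page miss the slowest, with close-page fixed in between, which is the trade-off the page-policy slide illustrates.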
(Figure: cache lines 0-17 mapped onto Bank 0 and Bank 1, shared between Core 0 and Core 1 versus allocated privately per core.)

Predictable DRAM Controllers: Evaluation
Shared bank + Close-Page:
- N − 1 re-activations of the same bank (tRC each) ahead of a request
Private Bank + Open-Page:
- N − 1 interfering PRE, ACT, and CAS commands, plus read/write switching (tRTW, tWTR) between CAS commands
(Figure: command timelines for four cores on Bank0-Bank3 under each policy.)

Example: DDR3-1600H CAS-to-CAS spacing is RD→RD: 4, RD→WR: 7, WR→RD: 18 cycles
- Private bank + open-page: 32 cycles of CAS switching
- Private bank + open-page with CAS reordering [L. Ecco & R. Ernst, RTSS'15]: 15 cycles

Current analytical model:
- Charges each request's commands at assumed points that are not the actual ACT command arrival times
Pipelined system:
- Overlapping the PRE/ACT of one bank with the CAS of another is the HRT latency objective
(Figure: per-bank PRE/ACT/CAS timelines under the current analytical model versus a pipelined schedule.)

Mixed-Criticality System:
- HRT and SRT applications co-exist on different cores
- Fixed priority can guarantee the HRT latency, but limits SRT bandwidth and can starve SRT requests; SRT bandwidth is the objective
(Figure: per-bank request queues in which SRT requests starve behind HRT requests.)

Objective Summary
- Reordering CAS commands breaks the execution sequence
HRT latency:
- Pipelining covers the overlapping interference
- Reordering avoids the repetitive CAS switching
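Using the DDR3-1600H CAS-to-CAS spacings quoted above (RD→RD: 4, RD→WR: 7, WR→RD: 18 cycles), a small sketch shows why bundling same-type CAS commands beats alternating reads and writes. The WR→WR spacing and the request patterns are assumptions added for illustration.

```python
# CAS-to-CAS minimum spacings for DDR3-1600H, in cycles, from the slides.
# ("W","W") is an assumption: same-type writes are also spaced by tCCD = 4.
SWITCH = {("R", "R"): 4, ("W", "W"): 4, ("R", "W"): 7, ("W", "R"): 18}

def cas_schedule_cycles(ops):
    """Cycles from the first CAS to the last CAS of the sequence."""
    total = 0
    for prev, cur in zip(ops, ops[1:]):
        total += SWITCH[(prev, cur)]
    return total

# Hypothetical pattern of 8 requests: alternating versus bundled by type.
alternating = ["R", "W", "R", "W", "R", "W", "R", "W"]
bundled     = ["R", "R", "R", "R", "W", "W", "W", "W"]

print(cas_schedule_cycles(alternating))  # 4*7 + 3*18 = 82
print(cas_schedule_cycles(bundled))      # 6*4 + 1*7  = 31
```

Bundling pays the expensive read/write turnaround once instead of on every request, which is the effect the reordering slide quantifies (32 vs. 15 cycles for its four-core example).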
SRT bandwidth:
- Co-scheduling SRT and HRT requests avoids the starvation

Requests Bundling (REQBundle) DRAM Controller
HRT latency:
- Isolation: private banks
- Pipelining and reordering: close-page gives a fixed command sequence; reordering at the request level avoids multiple switches and gives a fixed request sequence
SRT bandwidth:
- Fast access: shared banks + open-page
- Co-scheduling of SRT and HRT requests: fixed SRT execution slots before the HRT requests

Command Scheduler
- InRound scheduler: schedules both HRT and SRT commands, bundles requests of the same type, and switches the access type between rounds
- OutRound scheduler: schedules SRT commands only
(Figure: HRT banks feed the InRound scheduler and SRT banks the OutRound scheduler; write and read InRounds alternate, with SRT-only service in the OutRound between them.)

InRound Scheduler
Execution time of an InRound:
- Snapshot: the time to determine the number of HRT requests (N)
- The time to issue the last SRT CAS
- The time to issue the last HRT ACT
- Execution time: R(N) = max(… + (N − 1) · …, … + …), i.e., the later of the path through the last SRT CAS and the path through the last HRT ACT
(Figure: a round opening with SRT ACT and CAS slots, followed by the bundled HRT ACTs and RD CAS commands on Bank0-Bank3.)

Request Arrival Time and Latency
Case 0: the request arrives before the snapshot of a round of the same type:
LReq = R(N0) + tRL + tBus
Case 1: the request arrives before/after the snapshot of a round of a different type:
LReq = R(N0) + R(N1) + tRL + tBus
(Figures: round timelines showing the request's ACT, CAS, tRL, and tBus relative to the round boundaries.)
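The arrival cases compose one, two, or three round times with the read latency and the bus transfer. A minimal sketch of that composition, treating R(N) as an abstract per-round cost; the concrete round lengths and the tRL/tBus values below are made-up placeholders, not derived timing:

```python
def request_latency(round_lengths, tRL, tBus):
    """Per the slides, a request waits through some number of full rounds,
    then pays the read latency plus the bus transfer:
    LReq = sum(R(N_i)) + tRL + tBus."""
    return sum(round_lengths) + tRL + tBus

tRL, tBus = 11, 4                 # placeholder cycle counts
R = {0: 30, 1: 28, 2: 30}         # placeholder round execution times R(N_i)

case0 = request_latency([R[0]], tRL, tBus)                 # same-type round, before snapshot
case1 = request_latency([R[0], R[1]], tRL, tBus)           # different-type round
case2 = request_latency([R[0], R[1], R[2]], tRL, tBus)     # same-type, after snapshot (worst)
print(case0, case1, case2)  # 45 73 103
```

The worst case adds a whole extra round each time the request misses the snapshot of a round that could have served it, which is why the bound grows from one to three round terms.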
Case 2: the request arrives after the snapshot of a round of the same type (the worst case):
LReq = R(N0) + R(N1) + R(N2) + tRL + tBus
(Figure: three consecutive rounds R0, R1, R2 elapse before the request's data is returned.)

Evaluation
- Implemented in a general DRAM controller simulation framework in C++ [DRAMController Demo, RTSS'16]
- EEMBC benchmark memory traces generated from MACsim: 1 GHz CPU, private L1/L2 caches, shared L3 cache
- Evaluated against the Command Bundling (CMDBundle) DRAM controller [L. Ecco and R. Ernst, RTSS'15], in burst and non-burst modes

Benchmark Worst Case Execution Time (8 HRTs)
- HRT0 runs the benchmark trace; the other 7 HRTs run memory-intensive traces
- Normalized to CMDBundle (non-burst)
(Figure: normalized execution time of REQBundle and CMDBundle (burst) on a2time, cache, basefp, irrflt, aifirf, and tblook, ranging roughly from 0.5 to 1.)

Worst Case HRT Request Latency (8 HRTs)
(Figure: worst-case read and write latencies in ns for REQBundle and for CMDBundle hit/miss in burst and non-burst modes, across DDR3 speed grades 800D to 2133L.)

Worst Case SRT Request Bandwidth (8 HRTs)
(Figure: SRT read bandwidth up to about 3 GB/s and SRT write bandwidth up to about 7 GB/s for SRT 0 through SRT 4, across DDR3 speed grades 1066E to 2133L.)

Mixed-Criticality System (8 HRTs, 8 SRTs)
(Figure: HRT request latency in cycles and SRT bandwidth in GB/s for REQBundle versus CMDBundle.)
- Implement a virtual HRT requestor
mechanism for CMDBundle: the virtual requestors are treated as HRT cores in the system, and all SRT requests share them

Conclusion
- Request bundling with pipelining improves the worst-case request latency.
- Accounting for the gaps left by the command timing constraints provides a good trade-off between SRT bandwidth and HRT latency.
- Compared with a state-of-the-art real-time memory controller, the balance point depends on a task's row-hit ratio: measured row-hit ratios are below 50%, and a guaranteed row-hit ratio requires static analysis and is lower still.

THANK YOU

Backup: Evaluation
(Figure: CMDBundle command-register timelines in burst and non-burst modes, showing the tCCD, tRTW, and tWR-RD spacings between RD/WR commands across Round 0 and Round 1.)
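The row-hit-ratio balance point mentioned in the conclusion can be sketched numerically: under open-page, the average CAS cost is a mix of hit and miss latencies, and the hit ratio at which it matches a close-page design marks the break-even. The timing values are illustrative assumptions, and the resulting 0.5 depends on them (here specifically on tRP = tRCD), so it only loosely echoes the 50% figure above.

```python
# Illustrative DDR3-style timings in cycles (assumed, not from the paper).
tRCD, tRL, tRP, tBUS = 11, 11, 11, 4

close_page = tRCD + tRL + tBUS        # every access pays ACT + CAS
open_hit   = tRL + tBUS               # row already open: CAS only
open_miss  = tRP + tRCD + tRL + tBUS  # PRE + ACT + CAS

def open_page_expected(hit_ratio):
    # Average open-page access cost for a task with the given row-hit ratio.
    return hit_ratio * open_hit + (1 - hit_ratio) * open_miss

# Break-even hit ratio h: h*open_hit + (1-h)*open_miss = close_page
h_balance = (open_miss - close_page) / (open_miss - open_hit)
print(round(h_balance, 2))  # 0.5
```

Tasks whose guaranteed hit ratio falls below this point are served at least as well by a close-page design, which is the trade-off the conclusion draws.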