3D-DRAM Circuit Design, Modeling and Exploration for Computer Memory Hierarchy
Rakesh Anigu, Hongbin Sun, James J.-Q. Lu, Ken Rose, and Tong Zhang
Electrical, Computer and Systems Engineering Department
Rensselaer Polytechnic Institute

Motivation
  but…
  - TSV size/pitch
  - Thermal
  - Yield loss
  - EDA tools
  - Equipment
  - Cost
  - …

Motivation
  Naturally embraces the immaturity of 3D integration
  - TSV size/pitch: ✓ Coarse-grained die-to-die interconnect only
  - Thermal: ✓ Inherently low power and less heat
  - Yield loss: ✓ Easy to achieve very high defect tolerance
  - EDA tools: ✓ Minimal departure from 2D design
  - Equipment: ✓ Big $$$ market
  - Cost: ✓ Higher-end, definitely not commodity

Why 3D Processor-DRAM Integration
  Memory Wall & Bandwidth Wall
  [Figure: overall performance vs. time (Dr. Phil Emma @ IBM)]
  Move more memory closer to processor cores at minimal extra cost!
  3D Processor-DRAM Integration

Why 3D Processor-DRAM Integration
  [Figure labels: DRAM dies, Processor die]
  - Almost no yield loss
  - 2D design know-how
  - Coarse-grained TSVs
  - Thermal friendly
  - Justifiable cost
  To break the memory & bandwidth wall!
  Quantitatively evaluate the potential

Outline
  - Motivation
  - 3D DRAM Architecture Design
  - 3D Processor-DRAM Integration
  - Conclusions

3D DRAM Architecture Design
  [Figure labels: Stacked commodity DRAM dies, Processor die]
  - L2 cache ⇔ main memory
  - Bandwidth
  - Latency
  - Area
  - CACTI 5 → 1Gb 2D DRAM @ 65nm
  - Latency
  - Energy

3D DRAM Architecture Design
  Stacked commodity DRAM → Customized 3D DRAM
  At which granularity should we carry out 3D mapping?
  - Intra-sub-array 3D mapping: fine-grained TSVs
  - Inter-sub-array 3D mapping: coarse-grained TSVs

Inter-Sub-Array 3D Mapping
  [Figure: top view; labels: TSV, I/Os]

3D Sub-Array Set
  Distributed across dies
  [Figure labels: 2D sub-array, Data bus, Address bus, TSVs bundle]
  Multi-layer data access (MLDA)
  - All 2D sub-arrays are activated
  - Each handles a portion of data
  - TSVs
  - Energy
  Single-layer data access (SLDA)
  - Only one 2D sub-array is activated
  - One 2D sub-array handles all data
  - TSVs
  - Energy

3D DRAM Architecture Design
  Inter-sub-array 3D mapping
  - Small number of TSVs (1K~10K)
  - Intact individual DRAM sub-array design
  - Distributed global routing → performance gain
  Modified CACTI 5 to support inter-sub-array 3D mapping
  Case study: 1Gb with 8 banks and 256-bit I/O @ 65nm
  - 2D vs. 3D die packaging (i.e., no TSVs) vs. 3D DRAM SLDA vs. 3D DRAM MLDA

Defect Tolerance
  One more dimension for redundancy repair
  [Figure labels: Sub-Array, Redundancy]
  Inter-die inter-sub-array redundancy repair

Inter-Die Inter-Sub-Array Redundancy Repair
  1024x256 sub-array, defect density: 0.05%, repair-most algorithm

Outline
  - Motivation
  - 3D DRAM Architecture Design
  - 3D Processor-DRAM Integration
  - Conclusions

Current Design Practice
  [Figure labels: Core w/ L1 (multiple), Shared L2 Cache (SRAM), DDRx channel, Commodity DRAM]
  - L2 capacity & L1↔L2 bandwidth
  - L2 ↔ main memory bandwidth
  3D Integration
  - High-density DRAM
  - High-speed DRAM

Heterogeneous 3D DRAM
  Stacked commodity DRAM → Customized 3D DRAM
  Heterogeneous 3D-DRAM L2 cache + main memory structure
  - Each core has its private 2D-SRAM L1 cache & 3D-DRAM L2 cache
  DRAM density vs. speed trade-off
  [Figure labels: Sub-Array, Density, Speed]
  - Integrate both high-threshold & low-threshold MOSFETs

Evaluation
  M5 full system simulator with Linux (U. of Mich.)
  Four 4.0GHz cores with 8-layer 3D-DRAM at 45nm node
  - 3D-DRAM L2 cache per core: 2MB
  - 3D-DRAM main memory: 1GB
  [Figure labels: Processor Die, Core w/ L1 (x4)]
  Configurations: Baseline, Without multi-Vt, With multi-Vt

Instructions Per Cycle (IPC) Gain over Baseline
  [Figure: IPC gain over baseline]

One Step Further
  - Decentralized distributed main memory structure
  - Fastlane between L2 cache and its closest main memory block
  - Reduced L2 cache miss penalty

One Step Further
  [Figure]

Conclusions
  3D multi-core processor DRAM integration
  3D DRAM Design
  - Simple but effective inter-sub-array 3D mapping strategy
  - Simple but effective 3D redundancy repair
  - Good memory performance gain
  Integration of processor and 3D DRAM
  - Heterogeneous 3D DRAM architecture
  - Great computing system performance gain
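
Illustrative sketch: SLDA vs. MLDA sub-array activation
  The two access modes of a 3D sub-array set can be sketched in a few lines of Python. This is only an illustration of the descriptions given on the 3D Sub-Array Set slide (MLDA activates all 2D sub-arrays, each handling a portion of the data; SLDA activates one 2D sub-array that handles all the data). The function name `sub_array_activity`, its parameters, and the assumption that MLDA splits the data evenly across layers are hypothetical and not taken from the slides or from CACTI.

```python
def sub_array_activity(mode, num_layers, data_bits, target_layer=0):
    """Return the number of data bits served by each layer's 2D sub-array.

    Hypothetical helper for illustration only. Assumes MLDA divides the
    access evenly across layers, which the slides do not state explicitly.
    """
    if num_layers <= 0 or data_bits <= 0:
        raise ValueError("num_layers and data_bits must be positive")

    if mode == "MLDA":
        # Multi-layer data access: every 2D sub-array is activated and
        # each one handles a portion of the data.
        if data_bits % num_layers != 0:
            raise ValueError("data_bits must divide evenly across layers")
        return [data_bits // num_layers] * num_layers

    if mode == "SLDA":
        # Single-layer data access: only one 2D sub-array is activated
        # and it handles all of the data.
        if not 0 <= target_layer < num_layers:
            raise ValueError("target_layer out of range")
        return [data_bits if layer == target_layer else 0
                for layer in range(num_layers)]

    raise ValueError("mode must be 'MLDA' or 'SLDA'")


def active_sub_arrays(bits_per_layer):
    """Count how many 2D sub-arrays are activated for one access."""
    return sum(1 for bits in bits_per_layer if bits > 0)


# Example with a 256-bit access over 8 layers (numbers borrowed from the
# case-study and evaluation slides purely for illustration).
mlda = sub_array_activity("MLDA", num_layers=8, data_bits=256)
slda = sub_array_activity("SLDA", num_layers=8, data_bits=256, target_layer=3)
print("MLDA bits per layer:", mlda, "active:", active_sub_arrays(mlda))
print("SLDA bits per layer:", slda, "active:", active_sub_arrays(slda))
```

  In this toy model, the count of activated sub-arrays stands in for the activation cost of an access, and the bits carried per layer hint at how much data must cross the TSV bundle from each die. Actual energy and TSV counts depend on the circuit-level modeling, which this sketch does not attempt.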