3D-DRAM Circuit Design, Modeling and
Exploration for Computer Memory Hierarchy
Rakesh Anigu, Hongbin Sun, James J.-Q. Lu, Ken Rose, and
Tong Zhang
Electrical, Computer and Systems Engineering Department
Rensselaer Polytechnic Institute
Motivation
TSV size/pitch
but…
Thermal
Yield loss
EDA tools
Equipment
Cost
…
Motivation
Naturally embraces the immaturity of 3D integration
TSV size/pitch
✓ Coarse-grained die-to-die interconnect only
Thermal
✓ Inherently low power and less heat
Yield loss
✓ Easy to achieve very high defect tolerance
EDA tools
✓ Minimal departure from 2D design
Equipment
✓ Big $$$ market
Cost
✓ Higher-end, definitely not commodity
Why 3D Processor-DRAM Integration
[Figure: overall performance vs. time, limited by the memory wall & bandwidth wall (Dr. Phil Emma @ IBM)]
Move more memory closer to processor cores at minimal extra cost!
3D Processor-DRAM Integration
Why 3D Processor-DRAM Integration
[Figure: DRAM dies stacked on a processor die]
Almost no yield loss
2D design know-how
Coarse-grained TSVs
Thermal friendly
Justifiable cost
To break the memory & bandwidth wall!
Quantitatively evaluate the potential
Outline
• Motivation
• 3D DRAM Architecture Design
• 3D Processor-DRAM Integration
• Conclusions
3D DRAM Architecture Design
Stacked commodity DRAM dies on a processor die
L2 cache ⇔ main memory: bandwidth, latency, area
CACTI 5 → 1Gb 2D DRAM @ 65nm: latency, energy
3D DRAM Architecture Design
Stacked commodity DRAM → Customized 3D DRAM
At which granularity should we carry out 3D mapping?
Intra-sub-array 3D mapping: fine-grained TSVs
Inter-sub-array 3D mapping: coarse-grained TSVs
Inter-Sub-Array 3D Mapping
[Figure: top view showing TSV I/Os]
3D Sub-Array Set
Distributed across dies
[Figure: stacked 2D sub-arrays, data bus, address bus, TSV bundle]
Multi-layer data access (MLDA)
• All 2D sub-arrays are activated
• Each handles a portion of data
Single-layer data access (SLDA)
• Only one 2D sub-array is activated
• One 2D sub-array handles all data
Compared in terms of: TSVs, energy
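The two access modes can be sketched in a few lines of Python. This is only an illustrative model, not the authors' design: the names `AccessPlan` and `plan_access`, the even split of data bits across dies in MLDA, and the `target_layer` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AccessPlan:
    """Which dies activate their 2D sub-array, and how many bits each handles."""
    active_layers: Tuple[int, ...]
    bits_per_layer: int


def plan_access(mode: str, num_layers: int, data_bits: int,
                target_layer: int = 0) -> AccessPlan:
    """Hypothetical model of one access to a 3D sub-array set.

    MLDA: all 2D sub-arrays are activated, each handles a portion of the data
          (assumed here to be an even split).
    SLDA: only one 2D sub-array is activated and handles all the data.
    """
    if num_layers <= 0 or data_bits <= 0:
        raise ValueError("num_layers and data_bits must be positive")
    if mode == "MLDA":
        if data_bits % num_layers != 0:
            raise ValueError("data_bits must divide evenly across layers")
        return AccessPlan(tuple(range(num_layers)), data_bits // num_layers)
    if mode == "SLDA":
        if not 0 <= target_layer < num_layers:
            raise ValueError("target_layer out of range")
        return AccessPlan((target_layer,), data_bits)
    raise ValueError(f"unknown access mode: {mode}")


if __name__ == "__main__":
    # Example: 8 stacked dies and a 256-bit access (values chosen for illustration).
    print(plan_access("MLDA", num_layers=8, data_bits=256))
    print(plan_access("SLDA", num_layers=8, data_bits=256, target_layer=3))
```

The sketch only captures which sub-arrays are activated and how the data is divided; it makes no claim about the resulting TSV count or energy.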
3D DRAM Architecture Design
Inter-sub-array 3D mapping
Small number of TSVs (1K~10K)
Intact individual DRAM sub-array design
Distributed global routing → performance gain
Modified CACTI 5 to support inter-sub-array 3D mapping
Case study: 1Gb with 8 banks and 256-bit I/O @ 65nm
2D vs. 3D die packaging (i.e., no TSVs) vs. 3D DRAM (SLDA, MLDA)
Defect Tolerance
One more dimension for redundancy repair
[Figure: stacked sub-arrays, each with its own redundancy]
Inter-die inter-sub-array redundancy repair
Inter-Die Inter-Sub-Array Redundancy Repair
1024x256 sub-array, defect density: 0.05%, repair-most algorithm
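A minimal sketch of how inter-die repair could be modeled, assuming a greedy repair-most heuristic (repeatedly replace the row or column holding the most remaining defects) and assuming that inter-die repair lets the stacked sub-arrays draw from one pooled spare budget. The function names, the spare counts, and the pooling model are hypothetical; the slides do not give these details.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a defective cell


def repair_most(defects: Iterable[Cell], spare_rows: int,
                spare_cols: int) -> Tuple[bool, int, int]:
    """Greedy repair-most: return (repaired, rows_used, cols_used)."""
    remaining: Set[Cell] = set(defects)
    rows_used = cols_used = 0
    while remaining:
        row_counts = Counter(r for r, _ in remaining)
        col_counts = Counter(c for _, c in remaining)
        # Only consider a line type if a spare of that type is still available.
        best_row = row_counts.most_common(1)[0] if rows_used < spare_rows else None
        best_col = col_counts.most_common(1)[0] if cols_used < spare_cols else None
        if best_row is None and best_col is None:
            return False, rows_used, cols_used  # spares exhausted
        if best_col is None or (best_row is not None and best_row[1] >= best_col[1]):
            remaining = {(r, c) for r, c in remaining if r != best_row[0]}
            rows_used += 1
        else:
            remaining = {(r, c) for r, c in remaining if c != best_col[0]}
            cols_used += 1
    return True, rows_used, cols_used


def repair_stack(defects_per_die: List[Set[Cell]], spare_rows: int,
                 spare_cols: int, share_across_dies: bool) -> bool:
    """Repair every die's sub-array; optionally pool spares across dies (assumed model)."""
    if not share_across_dies:
        return all(repair_most(d, spare_rows, spare_cols)[0] for d in defects_per_die)
    rows_left = spare_rows * len(defects_per_die)
    cols_left = spare_cols * len(defects_per_die)
    for defects in defects_per_die:
        ok, rows, cols = repair_most(defects, rows_left, cols_left)
        if not ok:
            return False
        rows_left -= rows
        cols_left -= cols
    return True


if __name__ == "__main__":
    # One die has three scattered defects, the other is defect-free.
    stack = [{(0, 0), (1, 1), (2, 2)}, set()]
    print(repair_stack(stack, spare_rows=1, spare_cols=1, share_across_dies=False))
    print(repair_stack(stack, spare_rows=1, spare_cols=1, share_across_dies=True))
```

In this toy case the defective die cannot be repaired with its own spares alone, but can be once the idle spares of the other die are available, which is the intuition behind adding one more dimension for redundancy repair.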
Current Design Practice
[Figure: cores w/ L1 sharing an SRAM L2 cache; commodity DRAM attached over a DDRx channel]
L2 capacity & L1 ↔ L2 bandwidth
L2 ↔ main memory bandwidth
3D integration: high-density DRAM, high-speed DRAM
Heterogeneous 3D DRAM
Stacked commodity DRAM → Customized 3D DRAM
• Heterogeneous 3D-DRAM L2 cache + main memory structure
• Each core has its private 2D-SRAM L1 cache & 3D-DRAM L2 cache
DRAM density vs. speed trade-off
[Figure: sub-array configurations trading density against speed]
Integrate both high-threshold & low-threshold MOSFETs
Evaluation
• M5 full system simulator with Linux (U. of Mich.)
• Four 4.0GHz cores with 8-layer 3D-DRAM at 45nm node
  ◦ 3D-DRAM L2 cache per core: 2MB
  ◦ 3D-DRAM main memory: 1GB
[Figure: processor die with four cores w/ L1; configurations shown: baseline, without multi-Vt, with multi-Vt]
Instructions per Cycle (IPC) Gain over Baseline
One Step Further
Decentralized distributed main memory structure
• Fast lane between L2 cache and its closest main memory block
Reduced L2 cache miss penalty
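Why a reduced L2 miss penalty matters can be illustrated with the standard average-access-time relation (hit time plus miss rate times miss penalty). The function name and every number below are hypothetical placeholders, not results from this work.

```python
def average_access_time(hit_time: float, miss_rate: float,
                        miss_penalty: float) -> float:
    """Average access time seen at the L2 cache, in the same unit as the inputs."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to show the trend.
    far_block = average_access_time(hit_time=10, miss_rate=0.1, miss_penalty=200)
    near_block = average_access_time(hit_time=10, miss_rate=0.1, miss_penalty=120)
    print(far_block, near_block)
```

With the hit time and miss rate held fixed, any reduction in miss penalty lowers the average access time in proportion to the miss rate.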
Conclusions
3D multi-core processor DRAM integration
• 3D DRAM Design
Simple but effective inter-sub-array 3D mapping strategy
Simple but effective 3D redundancy repair
Good memory performance gain
• Integration of processor and 3D DRAM
Heterogeneous 3D DRAM architecture
Great computing system performance gain