Compute Cache Architecture

In-Situ Compute Memory Systems
Reetuparna Das
Assistant Professor, EECS Department
Massive amounts of data generated each day
Near Data Computing
Move compute near storage
Evolution...
1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
Emergence of Big Data: data movement dominates Joules/Op, and 3D memories become available
2012: Resurgence of Processing in Memory (PIM): logic layer near memory, enabled by 3D technology
2014: Automata Processor: associative memory with custom interconnects
2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays
Problem 1: Memory is big and inactive
Memory (largely cache) consumes most of the aggregate die area
Problem 2: Significant energy is wasted moving data through the memory hierarchy
Data movement costs 1000-2000 pJ versus 1-50 pJ for an operation (20-40x)
Problem 3: General-purpose processors are inefficient for data-parallel applications
Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, 1000x larger?
Problem summary: Conventional systems process data inefficiently
Core energy is dominated by data movement and an inefficient ALU (Problems 2 and 3), and die area is dominated by memory rather than cores (Problem 1)
Key Idea
Memory = Storage + In-place compute
Proposal: Repurpose memory logic for compute
Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X)
Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan
Proposal: Compute Caches
Architecture: cores (CORE0..CORE3) with private L1/L2 caches and a shared, sliced L3 (L3-Slice0..L3-Slice3); each in-place compute SRAM sub-array computes results (= A op B) within its bank
Challenges: memory disambiguation support, cache controller extension, interconnect, data orchestration, managing parallelism, coherence and consistency
Opportunity 1: Large vector computation
Operation width = row size
A cache is built from many smaller sub-arrays, and each sub-array can compute in parallel
Parallelism available in a 16 MB L3: 512 sub-arrays * 64 B = a 32 KB operand (128X)
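A quick back-of-the-envelope check of these numbers, as a Python sketch; the 8-core, 32-byte-SIMD baseline used to arrive at 128X is an assumption, not something stated on the slide:

subarrays = 512                          # in-place compute sub-arrays in a 16 MB L3
row_bytes = 64                           # one 64 B cache block per activated row
operand_bytes = subarrays * row_bytes    # 512 * 64 B = 32,768 B = 32 KB operand
baseline_bytes = 8 * 32                  # assumed baseline: 8 cores x 32 B SIMD
print(operand_bytes // 1024)             # 32  (KB of data operated on at once)
print(operand_bytes // baseline_bytes)   # 128 (x wider than the assumed baseline)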
Opportunity 2: Save data movement energy
A significant portion of cache energy (60-80%) is wire energy, largely in the H-tree interconnect
Computing in the L3 saves this wire energy and the energy of moving data up to the higher cache levels (L2, L1)
Compute Cache Architecture
Cores (CORE0..CORE3) with private L1/L2 caches and a shared, sliced L3 (L3-Slice0..L3-Slice3)
Extensions: memory disambiguation for large vector operations, a cache controller extension, interconnect support, and in-place compute SRAM
More details in the upcoming HPCA paper
SRAM array operation: read
For a read, the bit-line pairs (BL/BLB) are precharged, the row decoder activates one wordline of the sub-array for the given address, and differential sense amplifiers (SA) resolve the stored value on each bit-line pair.
In-place compute SRAM: changes
Two changes to the array: an additional row decoder (Row Decoder-O) so that two wordlines can be activated at once, and single-ended sense amplifiers referenced against Vref in place of the differential sense amplifiers.
Activating the rows holding operands A and B together and sensing the bit-lines yields A AND B.
In-place compute SRAM: operation
With the wordlines for A and B active simultaneously, a bit-line stays high only if both activated cells in that column store 1, so the single-ended sense amplifiers (compared against Vref) read out A AND B.
SRAM Prototype Test Chip
Compute Cache ISA So Far
cc_copy(a, b, n)
cc_search(a, k, r, n)
cc_logical(a, b, c, n)
cc_buz(a, n)
cc_cmp(a, b, r, n)
cc_not(a, b, n)
cc_clmul(a, k, r, n)
Applications modeled using Compute Caches
Text processing: StringMatch, Wordcount
In-memory checkpointing
FastBit bitmap database (see the sketch below)
Bit matrix multiplication
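To illustrate why a workload like the FastBit bitmap database maps well onto these operations: a bitmap-index query is essentially a wide bitwise AND over bit vectors. The snippet below is a plain-Python stand-in for what cc_logical would do in place over cache lines; the column names and data are made up for illustration:

# Two bitmap indexes, one bit per database row (a 64 B line covers 512 rows)
age_30_40 = bytearray(64)      # rows whose age falls in [30, 40)
state_is_mi = bytearray(64)    # rows whose state is MI
age_30_40[0] = 0b10110000
state_is_mi[0] = 0b10010001

# The query "age in [30, 40) AND state == MI" is one bitwise AND over the bitmaps.
# With Compute Caches, each 64 B line of this AND could be computed inside an L3
# sub-array (e.g., via cc_logical) instead of being streamed through the core.
matches = bytes(x & y for x, y in zip(age_30_40, state_is_mi))
print(bin(matches[0]))         # 0b10010000: two rows satisfy both predicates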
Compute Cache Results Summary: 1.9X performance improvement, 2.4X energy savings, 4% overhead
Compute Cache Summary
Empower caches to compute
Performance: large vector parallelism
Energy: reduced on-chip data movement
Enablers: in-place compute SRAM; data placement and cache geometry for increased operand locality
Results: 8% area overhead; 2.1X performance, 2.7X energy savings
Future: Compute Memory System Stack
Applications (adapt): FSA, machine learning, data analytics, crypto, image processing, bioinformatics, graphs
PL/Compiler and OS primitives (redesign): express computation using in-situ operations and express parallelism; data-flow languages, RAPID [20], Google's TensorFlow [14], Java/C++, OpenCL
ISA (design): rich operation set; logical, data migration, comparison, and search; addition, multiplication, convolution, FSM
Architecture (design): where to compute? customize the hierarchy (cache, SRAM, DRAM); data orchestration; manage parallelism; coherence and consistency; locality
Compute Memories (design): in-situ technique (bit-line, parallel automaton); operation set (large SIMD, data-flow); volatile (SRAM, DRAM) and non-volatile (Re-RAM, STT-RAM, MRAM, Flash) technologies
In-Situ Compute Memory Systems
Thank You!
Reetuparna Das