In-Situ Compute Memory Systems
Reetuparna Das, Assistant Professor, EECS Department

Massive amounts of data are generated each day. Near-data computing responds by moving compute near storage.

Evolution:
- 1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
- Emergence of big data: data movement comes to dominate Joules/Op, and 3D memories become available.
- 2012: Resurgence of Processing in Memory (PIM): a logic layer near memory, enabled by 3D technology.
- 2014: Automata Processor: associative memory with custom interconnects.
- 2015: Compute memories (bit-line computing): in-situ computation inside memory arrays.

Problem 1: Memories are big and inactive. Memory, including cache, consumes most of the aggregate die area.

Problem 2: Significant energy is wasted moving data through the memory hierarchy. Data movement costs 1000-2000 pJ, some 20-40x the 1-50 pJ of a compute operation.

Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, 1000x larger?

Problem summary: Conventional systems process data inefficiently. Core energy goes to data movement rather than ALU work (Problem 3), while memory occupies most of the die area yet only stores (Problem 1).

Key idea: Memory = storage + in-place compute.
Proposal: repurpose memory logic for compute. Benefits: massive parallelism (up to 100X), memory efficiency, and energy savings (up to 20X).

Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian

Proposal: Compute Caches. SRAM sub-arrays inside each L3 slice compute in place (= A op B), supported by a cache controller extension, memory disambiguation support, and the on-chip interconnect. Challenges: data orchestration, managing parallelism, and coherence and consistency.

Opportunity 1: Large vector computation. Operation width = row size. The L3 is built from many smaller sub-arrays, and every sub-array can compute in parallel. In a 16 MB L3, 512 sub-arrays * 64B yields a 32KB operand: 128X parallelism.

Opportunity 2: Save data-movement energy. A significant portion of cache energy (60-80%) is wire energy in the H-tree. Computing in place saves that wire energy, along with the energy of moving data to the higher cache levels (L2 and L1).

Compute Cache architecture: cores CORE0..CORE3 with private L1s and L2s, and L3 slices L3-Slice0..L3-Slice3 on the interconnect. It adds memory disambiguation for large vector operations, a cache controller extension, and in-place compute SRAM in each slice. More details in the upcoming HPCA paper.

SRAM array operation (read): the row decoder asserts one wordline of the sub-array; the precharged bitline pairs (BL0/BLB0 through BLn/BLBn) develop a differential signal from the accessed row's cells, which the per-column differential sense amplifiers resolve into the stored 0s and 1s.

In-place compute SRAM changes: a second row decoder (Row Decoder-O) lets two wordlines, holding operands A and B, be activated simultaneously, and single-ended sense amplifiers referenced to Vref replace the differential ones. With both rows active, a bitline stays high only if both cells in that column store 1, so sensing BL against Vref directly yields A AND B. (A functional model of this sensing scheme follows the summary below.)

SRAM prototype test chip.

Compute Cache ISA so far: copy, bulk zeroing, comparison, search, and logical (AND, OR, XOR, NOT) operations over cache blocks.

Applications modeled using Compute Caches:
- Text processing: StringMatch, WordCount
- In-memory checkpointing
- FastBit bitmap database
- Bit-matrix multiplication

Compute Cache results summary: 1.9X performance, 2.4X energy savings, 4% overhead.

Compute Cache summary:
- Empower caches to compute.
- Performance: large vector parallelism.
- Energy: reduced on-chip data movement.
- In-place compute SRAM.
- Data placement and cache geometry for increased operand locality.
- 8% area overhead; 2.1X performance, 2.7X energy savings.
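To make the sensing argument concrete, below is a minimal functional model in C of the in-place compute SRAM described above. It is a software sketch, not a circuit model, and rests on one assumption from the deck: a precharged bitline (BL) discharges whenever any activated cell in its column stores 0, so a single-ended sense amplifier on BL reads out the AND of the two activated rows; symmetrically, the complement bitline (BLB) discharges whenever any activated cell stores 1, so sensing BLB yields NOR (and OR by inversion). The sub-array size and row contents are illustrative.

#include <stdio.h>
#include <stdint.h>

#define COLS 8            /* columns per sub-array row (illustrative) */

/* Tiny functional model of a compute-enabled 6T SRAM sub-array:
 * one bit per array entry, four rows, COLS columns. */
typedef struct {
    uint8_t rows[4][COLS];
} subarray_t;

/* Activate wordlines ra and rb simultaneously and sense both bitlines.
 * BL stays precharged only if every activated cell stores 1 -> AND.
 * BLB stays precharged only if every activated cell stores 0 -> NOR. */
static void bitline_compute(const subarray_t *s, int ra, int rb,
                            uint8_t *and_out, uint8_t *nor_out)
{
    for (int c = 0; c < COLS; c++) {
        uint8_t a = s->rows[ra][c], b = s->rows[rb][c];
        and_out[c] = a & b;
        nor_out[c] = !(a | b);
    }
}

int main(void)
{
    subarray_t s = { .rows = {
        { 0, 1, 0, 1, 1, 0, 0, 1 },   /* operand row A */
        { 0, 0, 1, 1, 1, 1, 0, 0 },   /* operand row B */
    } };
    uint8_t and_out[COLS], nor_out[COLS];

    bitline_compute(&s, 0, 1, and_out, nor_out);
    for (int c = 0; c < COLS; c++)
        printf("col %d: AND=%d NOR=%d OR=%d\n",
               c, and_out[c], nor_out[c], !nor_out[c]);
    return 0;
}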
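The applications above map onto these block-wide operations directly. The following sketch shows a hypothetical software view of one of them, a FastBit-style bitmap intersection. The cc_and() helper and CACHE_BLOCK constant are illustrative stand-ins, not the paper's ISA: cc_and() is emulated here with a plain loop so the sketch runs anywhere, whereas the real instruction would operate in place on cache-block-aligned operands inside the L3 sub-arrays.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK 64   /* bytes; one sub-array row computes one block */

/* Hypothetical block-granularity compute-cache AND, emulated on the
 * core. A real cc_and would execute inside the cache, one cache block
 * per sub-array, across many sub-arrays in parallel. */
static void cc_and(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                   size_t nblocks)
{
    for (size_t i = 0; i < nblocks * CACHE_BLOCK; i++)
        dst[i] = a[i] & b[i];
}

/* FastBit-style bitmap intersection: AND two large bitmaps, using the
 * bulk in-cache operation for the block-aligned body, scalar tail after. */
static void bitmap_intersect(uint8_t *out, const uint8_t *bm1,
                             const uint8_t *bm2, size_t bytes)
{
    size_t body = (bytes / CACHE_BLOCK) * CACHE_BLOCK;
    cc_and(out, bm1, bm2, body / CACHE_BLOCK);
    for (size_t i = body; i < bytes; i++)
        out[i] = bm1[i] & bm2[i];
}

int main(void)
{
    uint8_t a[130], b[130], r[130];
    for (int i = 0; i < 130; i++) { a[i] = 0xF0; b[i] = 0x3C; }
    bitmap_intersect(r, a, b, sizeof r);
    printf("r[0] = 0x%02X\n", r[0]);   /* 0xF0 & 0x3C = 0x30 */
    return 0;
}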
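The ISA so far is limited to copy, search, comparison, and logical operations, but richer arithmetic can be composed from row-wide logical primitives if data is laid out bit-sliced, with bit i of every element stored in row i. The sketch below illustrates the idea under that layout assumption: a ripple-carry addition decomposes into exactly the XOR/AND/OR steps a bit-line compute memory provides. Each uint64_t stands in for one row holding the same bit position of 64 elements.

#include <stdio.h>
#include <stdint.h>

#define BITS 8            /* element width in the bit-sliced layout */

/* Add two bit-sliced vectors: row i holds bit i of all 64 elements.
 * Each loop iteration is three row-wide logical operations, the kind
 * an in-place compute memory performs across a whole sub-array. */
static void bitsliced_add(uint64_t sum[BITS], const uint64_t a[BITS],
                          const uint64_t b[BITS])
{
    uint64_t carry = 0;                         /* one carry bit per element */
    for (int i = 0; i < BITS; i++) {            /* LSB first */
        uint64_t axb = a[i] ^ b[i];             /* row-wide XOR */
        sum[i] = axb ^ carry;                   /* row-wide XOR */
        carry  = (a[i] & b[i]) | (axb & carry); /* row-wide AND/OR */
    }
}

int main(void)
{
    uint64_t a[BITS] = { 0 }, b[BITS] = { 0 }, s[BITS];

    /* Place the values 5 and 3 in element lane 0 (bit i -> row i). */
    for (int i = 0; i < BITS; i++) {
        a[i] = (5u >> i) & 1u;
        b[i] = (3u >> i) & 1u;
    }
    bitsliced_add(s, a, b);

    unsigned v = 0;                             /* reassemble lane 0 */
    for (int i = 0; i < BITS; i++)
        v |= (unsigned)(s[i] & 1u) << i;
    printf("5 + 3 = %u\n", v);                  /* prints 8 */
    return 0;
}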
Future Compute Memory System Stack Redesign

- Applications: FSA, machine learning, data analytics, crypto, image processing, bioinformatics, graphs. Adapt applications to the new stack.
- PL/Compiler: data-flow languages such as RAPID [20] and Google's TensorFlow [14], alongside Java/C++ and OpenCL. Express computation using in-situ operations; express parallelism.
- OS primitives: data orchestration, coherence & consistency, managing parallelism.
- ISA design: a rich operation set: logical, data migration, comparison, search, addition, multiplication, convolution, FSM. Operation styles: large SIMD, data-flow, parallel automaton, bit-line.
- Architecture design: where to compute? Customize the hierarchy; exploit locality.
- Compute memories: the in-situ (bit-line) technique applied to volatile (SRAM cache, DRAM) and non-volatile (ReRAM, STT-RAM, MRAM, Flash) memories.

In-Situ Compute Memory Systems
Thank You!
Reetuparna Das