Time-based Snoop Filtering in Chip Multiprocessors

Iman Faraji, Amirkabir University of Technology, Tehran, Iran
Amirali Baniasadi, University of Victoria, Victoria, Canada

This work: Reducing redundant snoops in chip multiprocessors
  Our Goal: Improving energy efficiency of WT-based CMP
  Our Motivation: There are long time intervals where snooping fails, wasting energy and bandwidth.
  Our Solution: Detect such intervals and avoid snoops
  Key Results:
    Memory Energy: 18%
    Snoop Traffic: 93%
    Performance: 3.8%

Conventional Snooping
  [Figure: CPUs with private data caches (D$) connected through an interconnect and a controller; the numbered steps of the snoop sequence are not recoverable]

WB vs. WT
  Write-back configuration: low memory traffic, sophisticated coherency mechanism
  Write-through configuration: high memory traffic, simple coherency mechanism
  [Figure: relative memory energy consumption]

Previous Work: Snoop Filters
  Eliminate redundant snoop (local & global) requests.
    Local: one core fails to provide data
    Global: all cores fail
  Examples:
    RegionScout: Detects Memory Regions Not Shared (Moshovos)
    Selective Snoop Request: Predicts Supplier (Atoofian & Baniasadi)
    Serial Snooping: Requests Nodes One by One (Saldanha & Lipasti)
  Good snoop filter:
    1. Fast & simple
    2. Accurate and effective

Our Work
  Time-based Snoop Filtering
  Motivation: There are long intervals where snooping fails consecutively
  But how long & how often?

Our Work (Cont.)
  Global Read Miss (GRM): Occurs whenever the last snoop by all processors fails
  Local Read Miss (LRM): A redundant snoop; the snoop by a single processor fails

Distribution
  [Figure: (a) LRM distribution for different processors, (b) GRM distribution]
  Periods of data scarcity are usually long

Time-based Global Miss predictor (TGM)
  TGM Goals:
    1. Detect GRM intervals
    2. Shut down snooping in all processors but one (the surviving node)
  TGM Types:
    1. TGM-First: The first processor that has failed snooping survives.
    2. TGM-Last: The last processor that has failed snooping survives.
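The GRM definition and the two TGM survivor policies can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the hardware described in the slides: the function names, the `(node_id, hit)` outcome format, and the `"first"`/`"last"` policy strings are all hypothetical, and the time-based detection of GRM intervals is not modelled because the slides do not specify it here.

```python
from typing import List, Optional, Tuple

# Hypothetical format: (node_id, hit) pairs, listed in the order in which
# the remote nodes answered a single snoop request.
Outcomes = List[Tuple[int, bool]]


def is_global_read_miss(outcomes: Outcomes) -> bool:
    """A snoop counts as a GRM here when every snooped node failed."""
    return len(outcomes) > 0 and all(not hit for _, hit in outcomes)


def pick_survivor(outcomes: Outcomes, policy: str) -> Optional[int]:
    """Choose the node that keeps snooping after a GRM.

    policy "first" mirrors TGM-First (first failing node survives),
    policy "last" mirrors TGM-Last (last failing node survives).
    Returns None when the snoop was not a GRM, so nothing is shut down.
    """
    if not is_global_read_miss(outcomes):
        return None
    failed = [node for node, hit in outcomes if not hit]
    if policy == "first":
        return failed[0]
    if policy == "last":
        return failed[-1]
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, with three remote nodes answering in the order 2, 0, 3 and all missing, `pick_survivor` returns node 2 under the "first" policy and node 3 under the "last" policy; if any node hits, it returns `None`.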
TGM implementation
  [Figure: TGM-enhanced CMP]

TGM
  [Figure: (a) Coverage, (b) Accuracy]

Time-based Local Miss predictor (TLM)
  Goal: Detect LRMs
  How?
    1. Count consecutive snoop misses in a node
    2. Disable snooping when the count exceeds a threshold
    3. Restart snooping after a number of cycles

TLM implementation
  [Figure: TGM-enhanced CMP. Each processor contains a Processing Unit (PU), a First Level Cache, a Redundant SNoop (RSN) Counter, a Predictor, and a ReStarT (RST) Counter]

TLM features
  [Figure: (a) Coverage, (b) Accuracy]

Methodology
  Our Simulator: SESC
  Benchmarks: SPLASH-2
  Energy evaluation: CACTI 6.5
  System used: Quad-Core CMP

  System Parameters (Processor, Interconnection Network)
    IL1: 64KB / 2 way
    Memory Frequency: 5 GHz
    DL1: 64KB / 4 way / Write Through
    Technology: 68 nm
    Access Time: 1 cycle
    Branch Predictor: 16K entry bimodal and gshare
    Fetch/Issue/Commit: 4/4/5
    Branch Penalty: 17 cycles
    RAS: 32 entries
    BTB: 2K entries, 2 way
    Block Size: 64
    Data Interconnect: crossbar
    Interconnect Width: 64 B
    Cache line size: 32
    L2: 512KB / 8 way / Write Through
    Access Time: 11 cycles
    Block Size: 64
    Memory: 1GB
    Access Time: 70 cycles
    Page Size: 4 Kbit

  SPLASH-2 Benchmarks and Input Parameters
    Benchmark        Input Parameters
    Barnes           16K particles
    Cholesky         tk29.O
    FFT              1024K complex data points
    Ocean            258x258 ocean
    Volrend          Head
    Water-Nsqrd      512 molecules
    Water-spatial    512 molecules

Relative Snoop Traffic Reduction
  TGM-F: 58%   TGM-L: 57%   TLM: 77%

Relative Memory Energy
  TGM-F: 8%   TGM-L: 8.5%   TLM: 11%

Relative Memory Delay
  TGM-F: 1.1%   TGM-L: 2.1%   TLM: 1.7%

Relative Performance
  TGM-F: No change   TGM-L: 0.4%   TLM: 0.3%

Summary
  We showed:
    Long data scarcity periods (DSPs) exist during workload runtime
    During DSPs, redundant snoops happen frequently and consecutively
  Our solutions:
    TGM: Uses snoop behavior on all processors to detect and filter redundant snoops
      Shuts down snooping on as many processors as possible
    TLM: Redundant snoops are filtered in a single node
      Counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops
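The TLM idea of counting consecutive redundant snoops in a node, disabling snooping past a threshold, and restarting after a number of cycles can be sketched as a small per-node state machine. This is a minimal Python sketch under assumptions, not the authors' implementation: the class name `TLMFilter`, its method names, and the example values for `threshold` and `restart_cycles` are hypothetical; only the RSN and RST counter names come from the slides.

```python
from dataclasses import dataclass


@dataclass
class TLMFilter:
    """Per-node snoop filter sketch (names and values are illustrative)."""
    threshold: int        # hypothetical: misses tolerated before disabling
    restart_cycles: int   # hypothetical: cycles snooping stays disabled
    rsn: int = 0          # Redundant SNoop counter: consecutive misses
    rst: int = 0          # ReStarT counter: cycles left until snooping resumes

    def snooping_enabled(self) -> bool:
        return self.rst == 0

    def record_snoop(self, hit: bool) -> None:
        """Record the outcome of a snoop performed by this node."""
        if hit:
            self.rsn = 0          # a hit breaks the run of consecutive misses
            return
        self.rsn += 1
        if self.rsn > self.threshold:
            # Data scarcity period assumed: stop snooping for a while.
            self.rst = self.restart_cycles
            self.rsn = 0

    def tick(self) -> None:
        """Advance one cycle; snooping restarts when RST reaches zero."""
        if self.rst > 0:
            self.rst -= 1
```

With `TLMFilter(threshold=2, restart_cycles=3)`, the third consecutive miss disables snooping, and three calls to `tick()` enable it again. Picking real values for the two parameters would trade coverage against accuracy, which the slides report only as figures.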
  Simulation Results:
    Snoop Reduction: TGM-F: 58%   TGM-L: 57%   TLM: 77%
    Memory Energy: TGM-F: 8%   TGM-L: 8.5%   TLM: 11%
    Memory Delay: TGM-F: 1.1%   TGM-L: 2.1%   TLM: 1.7%
    Performance: TGM-F: No change   TGM-L: 0.4%   TLM: 0.3%

Thanks for your attention

Backup Slides

Discussion
  How do the characteristics of the benchmarks affect the memory energy/delay reduction achieved by our solution?
    1. True detection of redundant snoops
    2. Share of redundant snoops

Memory Energy and Delay
  Memory Energy = energy consumed to provide the requested data
  Memory Delay = time required to provide the requested data

Volrend Benchmark
  While running, Volrend rarely sends snoop requests
  This application renders a three-dimensional volume.
  It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary slightly in viewpoint
    High temporal locality
  Volrend does load distribution very well
    High spatial locality