Time-based Snoop Filtering in Chip Multiprocessors

Iman Faraji
Amirkabir University of Technology
Tehran, Iran
Amirali Baniasadi
University of Victoria
Victoria, Canada
This work:
Reducing redundant snoops in chip multiprocessors
Our Goal
Improving the energy efficiency of write-through (WT) based CMPs
Our Motivation
There are long time intervals where snooping fails, wasting energy and bandwidth.
Our Solution
Detect such intervals and avoid redundant snoops
Key Results
Memory energy: 18% · Snoop traffic: 93% · Performance: 3.8%
Conventional Snooping
[Diagram: a four-core CMP; each CPU has a private data cache (D$), all connected through an interconnect to a memory controller; numbered arrows trace the steps of a snoop transaction across the cores]
WB vs. WT
Write-back configuration: low memory traffic; sophisticated coherency mechanism
Write-through configuration: high memory traffic; simple coherency mechanism
[Chart: relative memory energy consumption]
Previous Work: Snoop Filters
Eliminate redundant (local & global) snoop requests.
Local: one core's snoop fails to provide data.
Global: all cores' snoops fail.
Examples:
RegionScout: Detects Memory Regions Not Shared (Moshovos)
Selective Snoop Request: Predicts Supplier (Atoofian & Baniasadi)
Serial Snooping: Requests Nodes One by One (Saldanha & Lipasti)
A good snoop filter is:
1. Fast & simple
2. Accurate and effective
Our Work
- Time-based snoop filtering
- Motivation: there are long intervals where snooping fails consecutively
- But how long & how often?
Our Work (Cont.)
- Global Read Miss (GRM): occurs whenever the snoops by all processors fail
- Local Read Miss (LRM): a redundant snoop occurring when a single processor's snoop fails
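The two miss types can be illustrated with a small classifier. This is a minimal sketch, assuming we can observe each processor's per-transaction snoop outcome; the function and names are illustrative, not from the paper's simulator.

```python
# Sketch: classify one snoop round as a GRM or LRM (illustrative only).
# hits[i] is True if processor i's snoop found the requested block.

def classify_snoop_round(hits):
    if not any(hits):
        return "GRM"      # Global Read Miss: every processor's snoop failed
    if not all(hits):
        return "LRM"      # Local Read Miss(es): some processors snooped in vain
    return "ALL_HIT"      # no redundant snoops in this round

# Quad-core example: only core 2 supplies the data.
print(classify_snoop_round([False, False, True, False]))   # → LRM
print(classify_snoop_round([False, False, False, False]))  # → GRM
```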
Distribution
[Figure: (a) LRM distribution for different processors; (b) GRM distribution]
Periods of data scarcity are usually long.
Time-based Global Miss predictor (TGM)
TGM goals:
1. Detect GRM intervals
2. Shut down snooping in all processors but one (the surviving node)
TGM types:
1. TGM-First: the first processor whose snoop failed survives
2. TGM-Last: the last processor whose snoop failed survives
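The two survivor-selection policies can be sketched as follows, assuming each GRM interval exposes the order in which the processors' snoops failed; the function names and the mask representation are illustrative, not the paper's implementation.

```python
# Sketch of TGM survivor selection (illustrative, not the paper's RTL).

def pick_survivor(fail_order, policy):
    """Return the processor that keeps snooping during a GRM interval.

    fail_order: processor IDs in the order their snoops failed.
    policy: "TGM-First" or "TGM-Last".
    """
    if policy == "TGM-First":
        return fail_order[0]   # first to fail keeps snooping
    if policy == "TGM-Last":
        return fail_order[-1]  # last to fail keeps snooping
    raise ValueError(policy)

def snoop_enable_mask(n_procs, fail_order, policy):
    """Disable snooping on every processor except the survivor."""
    survivor = pick_survivor(fail_order, policy)
    return [p == survivor for p in range(n_procs)]

# Quad-core example: snoops failed in the order 2, 0, 3, 1.
print(snoop_enable_mask(4, [2, 0, 3, 1], "TGM-First"))  # only core 2 snoops
print(snoop_enable_mask(4, [2, 0, 3, 1], "TGM-Last"))   # only core 1 snoops
```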
TGM implementation
[Figure: TGM-enhanced CMP]
TGM
[Figure: (a) coverage; (b) accuracy]
Time-based Local Miss predictor (TLM)
- Goal: detect LRMs
- How?
1. Count consecutive snoop misses in a node
2. Disable snooping when the count exceeds a threshold
3. Restart snooping after a set number of cycles
TLM implementation
[Figure: TLM-enhanced CMP; each processor contains a Processing Unit (PU), a first-level cache, a Redundant SNoop (RSN) counter, a predictor, and a ReStarT (RST) counter]
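A minimal sketch of how the RSN and RST counters could drive the per-node predictor; the threshold and restart-window values are illustrative assumptions, since the slides do not fix them.

```python
# Sketch of a per-node TLM predictor (illustrative parameter values).

class TLMPredictor:
    def __init__(self, rsn_threshold=8, rst_window=1000):
        self.rsn_threshold = rsn_threshold  # consecutive misses before disabling
        self.rst_window = rst_window        # cycles to wait before re-enabling
        self.rsn = 0                        # Redundant SNoop (RSN) counter
        self.rst = 0                        # ReStarT (RST) counter
        self.snooping = True

    def on_snoop_result(self, hit):
        """Step 1: count consecutive snoop misses in this node."""
        if hit:
            self.rsn = 0                    # a useful snoop resets the count
            return
        self.rsn += 1
        if self.rsn >= self.rsn_threshold:
            self.snooping = False           # step 2: disable snooping
            self.rst = self.rst_window

    def tick(self):
        """Step 3: restart snooping once the RST window expires."""
        if not self.snooping:
            self.rst -= 1
            if self.rst <= 0:
                self.snooping = True
                self.rsn = 0
```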
TLM features
[Figure: (a) coverage; (b) accuracy]
Methodology
- Simulator: SESC
- Benchmarks: SPLASH-2
- Energy evaluation: CACTI 6.5
- System: quad-core CMP
System Parameters
Processor: frequency: 5 GHz; technology: 68 nm; fetch/issue/commit: 4/4/5; branch predictor: 16K-entry bimodal and gshare; branch penalty: 17 cycles; RAS: 32 entries; BTB: 2K entries, 2-way
IL1: 64KB, 2-way; DL1: 64KB, 4-way, write-through; access time: 1 cycle; block size: 64
L2: 512KB, 8-way, write-through; access time: 11 cycles; block size: 64
Interconnection network: crossbar data interconnect; interconnect width: 64 B; cache line size: 32
Memory: 1 GB; access time: 70 cycles; page size: 4 Kbit

SPLASH-2 Benchmarks and Input Parameters
Barnes: 16K particles
Cholesky: tk29.O
FFT: 1024K complex data points
Ocean: 258x258 ocean
Volrend: head
Water-Nsqrd: 512 molecules
Water-Spatial: 512 molecules
Relative Snoop Traffic Reduction
TGM-F: 58%
TGM-L: 57%
TLM: 77%
Relative Memory Energy
TGM-F: 8%
TGM-L: 8.5%
TLM: 11%
Relative Memory Delay
TGM-F: 1.1%
TGM-L: 2.1%
TLM: 1.7%
Relative Performance
TGM-F: No Change
TGM-L: 0.4%
TLM: 0.3%
Summary
We showed:
- Long data scarcity periods (DSPs) exist during workload runtime
- During DSPs, redundant snoops happen frequently and consecutively
Our solutions:
- TGM: uses snoop behavior on all processors to detect and filter redundant snoops; shuts down snooping on as many processors as possible
- TLM: filters redundant snoops within a single node; counts recent redundant snoops to detect data scarcity periods and filter upcoming ones
Simulation results:
Snoop Reduction: TGM-F: 58% TGM-L: 57% TLM: 77%
Memory Energy: TGM-F: 8% TGM-L: 8.5% TLM: 11%
Memory Delay: TGM-F: 1.1% TGM-L: 2.1% TLM: 1.7%
Performance: TGM-F: no change TGM-L: 0.4% TLM: 0.3%
Thanks for your attention
Backup Slides
Discussion
How do benchmark characteristics affect the memory energy/delay reductions achieved by our solution?
1. True detection of redundant snoops
2. Share of redundant snoops
Memory Energy × Delay
Memory Energy = energy consumed to provide the requested data
Memory Delay = time required to provide the requested data
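These two definitions combine into a memory energy-delay product. A minimal sketch, using the TLM figures reported earlier (11% energy, 1.7% delay) and assuming the energy figure is a reduction and the delay figure an increase:

```python
# Sketch: relative memory energy-delay product vs. a baseline of 1.0.
# Sample numbers are the reported TLM results; directions are assumed.

def relative_energy_delay(energy_reduction, delay_increase):
    rel_energy = 1.0 - energy_reduction  # e.g. 11% less energy -> 0.89
    rel_delay = 1.0 + delay_increase     # e.g. 1.7% more delay -> 1.017
    return rel_energy * rel_delay

print(round(relative_energy_delay(0.11, 0.017), 3))  # → 0.905
```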
Volrend Benchmark
- While running, Volrend rarely sends snoop requests
- This application renders a three-dimensional volume, rendering several frames from changing viewpoints
- Consecutive frames in rotation sequences often vary only slightly in viewpoint → high temporal locality
- Volrend distributes its load very well → high spatial locality