Lazy Eager Oracle

Flexible Snooping:
Adaptive Forwarding and Filtering of Snoops
in Embedded-Ring Multiprocessors
Karin Strauss, Xiaowei Shen*, Josep Torrellas
University of Illinois at Urbana-Champaign
*IBM Research
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
http://iacoma.cs.uiuc.edu
Motivation
• CMPs are becoming standard components
• cheaper to build medium size machines
– 32 to 128 cores (multi-CMP)
• shared memory, cache coherent
– easier to program, easier to manage
• supporting cache coherence is difficult
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
2
Cache coherence solutions
strategy
ordered
network?
pros
cons
snoopy
broadcast bus
yes
simple
difficult to
scale
directory
based protocol
no
scalable
indirection,
extra hardware
snoopy
embedded ring
no
simple
long latencies
• other proposals (e.g. token coherence)
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
3
Contributions
• family of adaptive coherence protocols for rings
• two were chosen as best options
compared to fastest state-of-the-art scheme
high performance scheme
performance
energy
consumption
Superset Aggressive
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
energy conscious scheme
performance
energy
consumption
Superset Conservative
Flexible Snooping
4
Multi-CMP multiprocessor
CMP
Proc + L1 + L2
local network
memory
• coherence protocol used: only one supplier if line is cached
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
5
Ring in action
R
S
S
S
Lazy
cmp
supplier
predictor
R
R
Oracle
Eager
snoop
request
response
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
6
Ring in action
R
S
S
S
Lazy
R
R
Oracle
Eager
latency
snoops
messages
• goal: adaptive schemes that approximate Oracle’s behavior
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
7
Primitive snooping actions
• snoop and then forward
+ fewer messages
• forward and then snoop
+ shorter latency
• forward only
+ fewer snoops
+ shorter latency
– false negative predictions not allowed
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
X
Flexible Snooping
X
8
Predictors and algorithms
set of addresses:
node can supply
in predictor
predictor /
algorithm
action on
negative
prediction
action on
positive
prediction
Subset
forward
then snoop
snoop
Super
set
Con
forward
forward
then snoop
Agg
Exact
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
snoop then
forward
forward
Flexible Snooping
snoop
9
Algorithms
algorithm
negative
Subset
forward
then
snoop
S
u
p
e
r
set
C
o
n
forward
A
g
g
Exact
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
positive
snoop
Lazy
Eager
snoop
then
forward
Oracle / Exact
Subset
forward
then
snoop
forward
Per miss service:
SupersetCon
number of
messages
SupersetAgg
snoop
Karin Strauss
Flexible Snooping
10
Predictor implementation
• Subset
– associative table:
subset of addresses that can be supplied by node
• Superset
– bloom filter: superset of addresses that can be supplied by node
– associative table (exclude cache):
addresses that recently suffered false positives
• Exact
– associative table: all addresses that can be supplied by node
– downgrading: if address has to be evicted from predictor table,
corresponding line in node has to be downgraded
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
11
Downgrading
S
E
A
Negative effects:
• writes by this node
need to snoop other
nodes
B
A
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
• reads and writes by
other nodes need to
fetch line from
memory
Karin Strauss
Flexible Snooping
12
Experiments
• 8 CMPs, 4 ooo cores each = 32 cores
– private L2 caches
• on-chip bus interconnect
• off-chip 2D torus interconnect with embedded unidirectional ring
• per node predictors: latency of 3 processor cycles
• sesc simulator (sesc.sourceforge.net)
• SPLASH-2, SPECjbb, SPECweb
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
13
Execution time

Normalized
execution time
1.2

1
0.8
0.6
0.4
0.2
0
SPLASH-2
SPECjbb
SPECweb
Lazy
Eager
Oracle
Subset
SupersetCon
SupersetAgg
Exact
• performance of most flexible snooping algorithms is similar to Eager
• the fastest of all algorithms is SupersetAgg
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
14


Normalized
energy consumption
Miss service energy
2
3.22
Lazy
Eager
Oracle
Subset
SupersetCon
1.5
1
0.5
SupersetAgg
Exact
0
SPLASH-2
SPECjbb
SPECweb
• algorithms that eagerly forward messages use more energy
• SupersetCon is least energy-hungry algorithm
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
15



Normalized
energy consumption

Normalized
execution time
Most cost-effective algorithms
1.2
1
0.8
0.6
0.4
0.2
0
2
SPLASH-2
SPECjbb
SPECweb
3.22
1.5
1
0.5
0
SPLASH-2
SPECjbb
SPECweb
Lazy
Eager
Oracle
Subset
SupersetCon
SupersetAgg
Exact
Lazy
Eager
Oracle
Subset
SupersetCon
SupersetAgg
Exact
• high performance: Superset Aggressive
• faster than Eager at lower energy consumption
• energy conscious: Superset Conservative
• slightly slower than Eager at much lower energy consumption
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
16
Most cost-effective algorithms
compared to fastest state-of-the-art scheme (Eager)
Superset Aggressive
high performance scheme
energy
performance
consumption
Superset Conservative
energy conscious scheme
energy
performance
consumption
can be combined by only changing forwarding policy
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
17
Conclusions
• proposed flexible snooping, a family of adaptive
protocols for embedded rings
• two chosen protocols
– high performance: Superset Aggressive
– energy conservation: Superset Conservative
– can be selected dynamically
• embedded-ring protocols more attractive
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
18
Arch map
http://iacoma.cs.uiuc.edu/students/archmap/archmap.html
Google:
architecture conference map
(1st hit)
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Karin Strauss
Flexible Snooping
19
Flexible Snooping:
Adaptive Forwarding and Filtering of Snoops
in Embedded-Ring Multiprocessors
Karin Strauss, Xiaowei Shen*, Josep Torrellas
University of Illinois at Urbana-Champaign
*IBM Research
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
http://iacoma.cs.uiuc.edu