Shared-Memory Systems and Cache Coherence

Foundations

What is the meaning of shared memory when you have multiple access ports into global memory? What if you have caches?
[Figure: processors Pa, Pb, and Pc issue interleaved reads and writes (wa3, ra2, ra1; wb3, wb2, rb1; rc4, rc3, wc2, wc1) against a shared memory.]
Sequential consistency: the final state (of memory) is as if all reads and writes were executed in some fixed serial order, with each processor's program order also maintained (Lamport).
[This notion borrows from similar consistency notions in transaction-processing systems.]
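Lamport's definition can be checked mechanically on tiny programs: enumerate every serial order that preserves each processor's program order and see which outcomes are possible. The following is an illustrative sketch (two processors and made-up addresses x and y), not part of the original slides:

```python
def interleavings(p1, p2):
    """All serial orders that preserve each processor's program order."""
    if not p1:
        yield list(p2); return
    if not p2:
        yield list(p1); return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest

def run(order):
    """Execute ('w', addr, val) / ('r', addr) ops; return the values read."""
    mem, reads = {}, []
    for op in order:
        if op[0] == 'w':
            mem[op[1]] = op[2]
        else:
            reads.append(mem.get(op[1], 0))
    return tuple(reads)

# Pa writes x then y; Pb reads y then x (all locations start at 0).
pa = [('w', 'x', 1), ('w', 'y', 1)]
pb = [('r', 'y'), ('r', 'x')]
outcomes = {run(o) for o in interleavings(pa, pb)}

# Sequential consistency forbids (ry, rx) == (1, 0): seeing y == 1
# implies the program-order-earlier write to x already happened.
assert outcomes == {(0, 0), (0, 1), (1, 1)}
```

The outcome (1, 0) is exactly what a machine that reorders or loses writes could produce; SC rules it out.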
6.173, Fall 2010, Agarwal
Foundations: a hardware designer's physical perspective of sequential consistency

[Figure: memory holds foo1-foo4 at "foo home"; two processors each hold cached copies of foo1-foo4.]

Flush foo* from cache, wait till done.
Does it always work?

[Figure: processors Pa, Pb, and Pc again issue interleaved reads and writes (wa3, ra2, ra1; wb3, wb2, rb1; rc4, rc3, wc2, wc1) to shared memory.]

Key: using a fence to wait until the flush is done is the mechanism that guarantees sequential consistency.
We will revisit this in more detail shortly.
One other cache nasty to watch out for

[Figure: memory holds one cache line containing foo1-foo4 at "foo home". One processor updates foo1 to yyy in its cached copy of the line; the other processor's cached copy of the same line still holds the stale value xxx. Each then does "flush, wait till done". Correct final value: foo1 = yyy. Wrong final value: foo1 = xxx, which happens if the stale line is flushed last.]

Problem called "False Sharing"
  Leads to bugs with SW coherence
  Leads to poor performance with HW coherence

Solutions? Pad shared data structures so multiple shared items do not fall into the same cache line.

Summary of New Multicore Instructions
• Send message
• Receive message
• Synchronization
  – Barrier
  – Test and set
  – F&A and relatives (e.g., F&Op, CmpXch)
• Flush cache line
• Memory fence
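The padding fix for false sharing shows up directly in struct layout. A minimal sketch using Python's ctypes, assuming a 64-byte cache line; the struct and field names are illustrative:

```python
import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes

class Unpadded(ctypes.Structure):
    # Two independently shared counters land in the same cache line.
    _fields_ = [("a", ctypes.c_long), ("b", ctypes.c_long)]

PAD = CACHE_LINE - ctypes.sizeof(ctypes.c_long)

class Padded(ctypes.Structure):
    # Pad each counter out to a full line so the two never share one.
    _fields_ = [("a", ctypes.c_long), ("pad_a", ctypes.c_char * PAD),
                ("b", ctypes.c_long), ("pad_b", ctypes.c_char * PAD)]

assert ctypes.sizeof(Unpadded) <= CACHE_LINE  # a and b share a line
assert Padded.b.offset % CACHE_LINE == 0      # b starts on its own line
```

In C the same effect is usually achieved with alignment attributes or explicit pad arrays; the cost is memory (here 128 bytes instead of 16) in exchange for eliminating line ping-ponging.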
Recall, Shared Memory Algorithmic Model

[Figure: processors P ... P read and write a single shared memory.]

Outline
• Memory architecture
• Cache coherence in small multicores
• Cache coherence in manycores
Shared Memory Structures in Parallel Computers

[Figure: three organizations. Monolithic: processors with caches share one memory over a network. Distributed: memory modules M sit across the network from the processor-cache (P, C) pairs. Distributed-local: each node couples a processor, cache, and memory module on the network. Like Legos, you can move the Ps, Cs, and Ms around.]

But what about multicore chips?

Shared-Memory Structure in Cutting-Edge Multicores

[Figure: a multicore chip whose P-C tiles sit on an on-chip ring network, attached through memory controllers to off-chip memories; a distributed variant places an M with each tile.]
Shared-Memory Structure in Cutting-Edge Multicores

[Figure: the Tile processor, 64 cores: a mesh on-chip network of P-C tiles, with memory controllers on the chip edges connecting to off-chip memories; a distributed organization on a mesh.]

Caches and Cache Coherence
A World Without Caches

[Figure: every rd travels over the network from a processor P to a distributed memory module M.]

With Caches

[Figure: the same system with a cache C at each processor; a rd can hit in the local cache instead of crossing the network.]
How are Caches Different from Fast Local Memory (SRAM)?

Key insight: why use a cache when local memory exists?

Anatomy of a common-case LD operation (LD A):
  If A is replicated in the local store, fetch it from the local store
    (HW: 1 cycle; SW: 10 cycles)
  Else, send a message to get A from DRAM
    (HW: 100 cycles; SW: 110 cycles)

Can do all of this in hardware too; this is what typical caches do. When done in HW, we call the store a cache!

Discuss
Cache Coherence Problem

[Figure: one processor writes a location while another processor holds a cached copy; the stale copy is the coherence problem.]

Solving the Coherence Problem
– Small multicores
  > Software coherence
  > Snooping caches
– Manycores
  > Software coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · singly linked
    · doubly linked
  > LimitLESS schemes
  > Hierarchical methods

We will study:
  Coherence structures
  Coherence protocols
  Cache-side state diagrams
  Directory-side state diagrams
Software Coherence
Saw this before

[Figure: memory holds foo1-foo4 at "foo home"; each processor caches foo1-foo4 and uses flush and fence.]

GET_foo_LOCK
.
.
MUNGE
.
.
Flush foo* from cache
Fence: wait till the changes that result from the flush are visible to everyone
RELEASE_foo_LOCK

Can stick the locking, flushes, and fences in library code to provide clean abstractions.

Hardware Cache Coherence
Snooping Caches

[Figure: processors with dual-ported caches sit on a bus or ring below shared memory. (1) A processor writes a; (2) the address is broadcast on the bus; (3) the other caches snoop the broadcast; (4) a cache matches a in its tags; (5) the matching copy is invalidated.]

• Works for small multicores (mem off chip)
• Broadcast address on shared write
• Everyone listens (snoops) on bus/ring to see if any of their own addresses match
• Invalidate copy on match
• How do you know when to broadcast, invalidate?
  – State associated with each cache line
  – Key benefit: no global state in main mem

Let's look at this in more detail next…
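The lock / munge / flush / fence discipline can be modeled with per-core caches backed by one shared memory. A toy sketch; the class and method names are illustrative, and the fence is modeled by flush() not returning until its write-backs are globally visible:

```python
memory = {'foo1': 0, 'foo2': 0}   # foo's home in shared memory

class Core:
    def __init__(self):
        self.cache = {}
    def read(self, a):
        if a not in self.cache:
            self.cache[a] = memory[a]   # fill from foo's home
        return self.cache[a]
    def write(self, a, v):
        self.cache[a] = v               # dirties the local copy only
    def flush(self, addrs):
        for a in addrs:
            if a in self.cache:
                memory[a] = self.cache.pop(a)  # write back and drop
        # fence: by the time flush() returns, the write-backs are
        # visible to every other core (they re-fetch from home)

p0, p1 = Core(), Core()
p0.write('foo1', 1)             # MUNGE while holding foo's lock
p0.flush(['foo1', 'foo2'])      # flush foo*, fence waits till done
assert memory['foo1'] == 1
assert p1.read('foo1') == 1     # p1 sees the update after the fence
```

Without the flush, p0's write would stay in its private cache and p1 would read the stale 0; wrapping read/write/flush in library code is what gives programs a clean shared-memory abstraction.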
Hardware Cache Coherence
Invalidate versus Update Snooping Caches

[Figure: the same snooping setup; a write to a is broadcast on the bus or ring, snooping caches match a in their dual-ported tags and then either invalidate or update their copies.]

• Broadcast address on shared write
• Everyone listens (snoops) on bus/ring to see if any of their own addresses match
• If the address matches
  – Invalidate the local copy (called an invalidate or ownership protocol)
  OR
  – Update the local copy with the new data from the bus (the writer must broadcast the value along with the address)

Only a cache-side state machine is needed.

Update versus Invalidate Protocols
Tradeoffs between
• Update protocols
• Ownership protocols

Update is better when write locality is poor; invalidate is better otherwise.

Competitive snooping idea:
– Do write updates
– If more than a "few" updates, then use ownership
"Few" → switch mode when the cost of all updates so far equals the cost of an invalidation.
The cost of this approach is no worse than twice the optimal (try to prove this).
"Competitive algorithms are cool"

Discuss paper
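The competitive rule can be simulated to check the 2x bound over a run of writes with no intervening remote read. The unit costs below are illustrative, not from the slides:

```python
# Competitive snooping: keep sending write updates until their
# accumulated cost equals the cost of one invalidation, then invalidate
# (after which further local writes cost nothing on the bus).
UPDATE, INVALIDATE = 1, 4   # assumed cost units

def competitive_cost(n_writes):
    cost, updates = 0, 0
    for _ in range(n_writes):
        if updates * UPDATE < INVALIDATE:
            cost += UPDATE          # still in update mode
            updates += 1
        else:
            cost += INVALIDATE      # switch to ownership; rest are free
            return cost
    return cost

def optimal_cost(n_writes):
    # Offline optimum: update every write if the run is short,
    # invalidate up front if it is long.
    return min(n_writes * UPDATE, INVALIDATE)

for n in range(1, 20):
    assert competitive_cost(n) <= 2 * optimal_cost(n)
```

Short runs pay exactly the optimal update cost; long runs pay at most the invalidation cost twice (once as accumulated updates, once as the switch), which is where the factor of two comes from.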
Snooping Caches
Definitions

[Figure: the snooping-cache setup; a write to a is broadcast on the bus or ring and matched against the dual-ported cache tags.]

• For each address, store one of three states with the cache tags:
  "Invalid"
  "Shared" (read-clean, shared data)
  "Modified" (write-dirty)
• Assume the cache block size is one word for now; let's deal with the cache-block complexity later.

This is "MSI"; variants such as MESI and MOESI add more states.

State diagram for ownership protocols
Cache-side state machine; transitions are driven by my local requests and responses, external bus requests, and my bus responses.
State diagram for a cache block in ownership protocols
(a: address)

  invalid --(Local Read: fetch block)--> read-clean
  invalid --(Local Write: broadcast a; fetch block)--> write-dirty
  read-clean --(Local Write: broadcast a)--> write-dirty
  read-clean --(Remote Write or local replace)--> invalid
  write-dirty --(Remote Read: update memory)--> read-clean
  write-dirty --(Remote Write or local replace: update memory)--> invalid

In an ownership protocol, the writer owns an exclusive copy.

State diagram for update protocols
(a: address, <a>: value)

  invalid --(Local Read: fetch block)--> read-clean
  invalid --(Local Write: broadcast a, <a>; fetch block)--> write-dirty
  read-clean --(Local Write: broadcast a, <a>)--> write-dirty
  read-clean --(Remote Write: update local copy)--> read-clean
  write-dirty --(Local Write: broadcast a, <a>)--> write-dirty
  write-dirty --(Remote Write: update local copy)--> read-clean
  write-dirty --(Local replace: update memory)--> invalid
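The ownership-protocol diagram can be written down as a cache-side transition table. A sketch; the event and action strings are informal labels for the diagram's edges, not a real hardware interface:

```python
# MSI transition table for one cache block, using the slides' state names.
INVALID, READ_CLEAN, WRITE_DIRTY = 'invalid', 'read-clean', 'write-dirty'

# (state, event) -> (next state, bus action)
msi = {
    (INVALID,     'local_read'):   (READ_CLEAN,  'fetch block'),
    (INVALID,     'local_write'):  (WRITE_DIRTY, 'broadcast a; fetch block'),
    (READ_CLEAN,  'local_read'):   (READ_CLEAN,  None),   # hit, no bus traffic
    (READ_CLEAN,  'local_write'):  (WRITE_DIRTY, 'broadcast a'),
    (READ_CLEAN,  'remote_write'): (INVALID,     None),
    (WRITE_DIRTY, 'local_read'):   (WRITE_DIRTY, None),   # owner reads freely
    (WRITE_DIRTY, 'local_write'):  (WRITE_DIRTY, None),   # owner writes freely
    (WRITE_DIRTY, 'remote_read'):  (READ_CLEAN,  'update memory'),
    (WRITE_DIRTY, 'remote_write'): (INVALID,     'update memory'),
}

# Walk one block through a write, a remote read, then a remote write.
state = INVALID
state, action = msi[(state, 'local_write')]
assert state == WRITE_DIRTY and 'broadcast' in action
state, action = msi[(state, 'remote_read')]
assert state == READ_CLEAN and action == 'update memory'
state, _ = msi[(state, 'remote_write')]
assert state == INVALID
```

Note how the owner's repeated reads and writes in write-dirty generate no bus traffic; that silence is the whole point of the ownership protocol.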
Maintaining coherence in manycores
• Software coherence – saw this before
• Hardware coherence
  > Full-map directories
  > Limited pointers
  > Chained pointers
    · singly linked
    · doubly linked
  > LimitLESS schemes
  > Hierarchical methods
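Of these, the full-map directory is the simplest to sketch: home memory keeps one presence bit per cache plus a dirty bit for every block. A toy model; the names are illustrative, and protocol messages are only noted in comments:

```python
N_CORES = 4

class DirEntry:
    def __init__(self):
        self.sharers = set()   # full map: which caches hold the block
        self.dirty = False     # True when exactly one cache owns it

directory = {}

def read(core, block):
    e = directory.setdefault(block, DirEntry())
    if e.dirty:
        e.dirty = False        # would force the owner to write back
    e.sharers.add(core)        # record the new sharer in the map

def write(core, block):
    e = directory.setdefault(block, DirEntry())
    for s in e.sharers - {core}:
        pass                   # would send an invalidation to cache s
    e.sharers = {core}         # writer becomes the sole owner
    e.dirty = True

read(0, 'a'); read(1, 'a'); read(2, 'a')
assert directory['a'].sharers == {0, 1, 2}
write(3, 'a')
assert directory['a'].sharers == {3} and directory['a'].dirty
```

The cost that motivates the other schemes in the list is visible here: the map needs one bit per cache per block, which is why limited-pointer, chained, and LimitLESS variants compress it.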