High-Speed Memory Systems: Architecture and Performance Analysis
CS-590.26, Spring 2014, Lecture H
Bruce Jacob, University of Crete

SLIDE 1: Coherence and Consistency
SLIDE 2: The Problem is Multi-Fold

Cache Consistency (a term taken from the web-cache community): in the presence of a cache, reads and writes behave (to a first order) no differently than if the cache were not there.
Three main issues:
• Consistent with the backing store
• Consistent with self
• Consistent with other clients of the same backing store
SLIDE 3: Consistency with the Backing Store

For example, write-through vs. write-back. A write buffer is commonly used in write-through caches:

[Figure: (a) CPU, cache, backing store; (b) CPU, cache, write buffer/cache, backing store.]
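To make the write-through vs. write-back distinction concrete, here is a minimal sketch (not from the slides) of the two write policies; the cache and backing-store helpers (cache_update, write_buffer_enqueue, etc.) are assumed hooks, not a real API.

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed helpers for one cache level and its backing store. */
    void cache_update(uint64_t addr, uint64_t val);          /* update cached copy      */
    bool cache_hit(uint64_t addr);
    void cache_mark_dirty(uint64_t addr);
    void write_buffer_enqueue(uint64_t addr, uint64_t val);  /* drains to backing store */

    /* Write-through: every store is sent toward the backing store immediately.
       A write buffer hides the latency so the CPU need not stall, but the cache
       never holds data newer than what is (or will shortly be) in the backing store. */
    void store_write_through(uint64_t addr, uint64_t val) {
        if (cache_hit(addr))
            cache_update(addr, val);
        write_buffer_enqueue(addr, val);     /* backing store catches up later */
    }

    /* Write-back: the store updates only the cache and marks the block dirty.
       The backing store sees the new value only when the dirty block is evicted
       or flushed, so cache and backing store can disagree for a long time. */
    void store_write_back(uint64_t addr, uint64_t val) {
        /* (a real cache would first allocate the block on a miss) */
        cache_update(addr, val);
        cache_mark_dirty(addr);
    }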
SLIDE 4: Consistency with Self

The virtual cache synonym problem and hardware solutions.

Shared memory breaks the simple model of virtual memory: if a datum is allowed to have several equivalent names, it is possible for the datum to reside in multiple cache locations, which can easily cause inconsistencies (for example, when one writes values to two locations that map to the same datum). Virtual memory is a mapping between two namespaces; as long as there is a one-to-one mapping between names, no inconsistencies can arise and virtual caches can be used without fear. As soon as shared memory allows one-to-many mappings, the simple cache model breaks down, requiring careful cache management to prevent inconsistencies from occurring.

[Figure 4.2: The synonym problem of virtual caches. Two address spaces (A and B) map through a virtual cache to physical memory.]
[Figure 4.3: Simple hardware solutions to page aliasing: a direct-mapped virtual cache no larger than the page size, or a set-associative virtual cache with physical tags.]
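To see why synonyms cause trouble, here is a small sketch (not from the slides) showing how two virtual aliases of the same physical page land in different sets of a virtually indexed cache; the sizes (4-KB pages, 32-KB direct-mapped cache, 64-byte lines) are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE   4096u                 /* assumed page size           */
    #define LINE_SIZE   64u                   /* assumed cache-line size     */
    #define CACHE_SIZE  (32u * 1024u)         /* assumed direct-mapped cache */
    #define NUM_SETS    (CACHE_SIZE / LINE_SIZE)

    /* Set index taken from the virtual address (virtually indexed cache). */
    static unsigned v_index(uint64_t vaddr) {
        return (unsigned)((vaddr / LINE_SIZE) % NUM_SETS);
    }

    int main(void) {
        /* Two virtual aliases (synonyms) of the same physical page,
           mapped one page apart. */
        uint64_t va1    = 0x10000;   /* alias 1 of the shared page */
        uint64_t va2    = 0x11000;   /* alias 2, one page higher   */
        uint64_t offset = 0x40;      /* same byte within the page  */

        printf("alias 1 -> set %u\n", v_index(va1 + offset));   /* set 1  */
        printf("alias 2 -> set %u\n", v_index(va2 + offset));   /* set 65 */
        /* Because the cache (32 KB) is larger than a page (4 KB), the index
           uses bits above the page offset, so the two aliases select different
           sets: the same datum can live in two cache locations at once.
           If the cache were no larger than a page, or physically indexed,
           both aliases would select the same set and the problem disappears. */
        return 0;
    }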
SLIDE 5: Consistency with Self

Operating system solutions to the aliasing problem.

[Figure 4.4: Synonym problem solved by operating system policy. (a) The SPUR and OS/2 solutions; (b) the SunOS solution.]
SLIDE 6: Consistency with Self

Segmentation as a solution to the aliasing problem.

Figure 31.15 illustrates an example mechanism. The segmentation granularity is 4 MB, and the 4-GB address space is divided into 1024 segments. The top 10 bits of the 32-bit effective address select a segment register; its 30-bit segment ID is concatenated with the remaining 22 bits of segment and page offset to form a 52-bit virtual address, which is then translated by the TLB and page table before the cache access. Each segment is mapped by 1024 PTEs, a collective structure small enough (4 KB, if each PTE is 4 bytes) to wire down in physical memory for every running process; this is the per-user root page table of Figure 31.16. In addition, every process needs a table of 1024 segment IDs and per-segment protection information.

[Figure 31.15: Segmentation mechanism used in the discussion. Segment registers map the 32-bit effective address (10-bit segment number plus 22-bit segment and page offsets) to a 52-bit virtual address; NULL entries mark segments that are only partially used.]
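A small sketch (not from the slides) of the address formation just described; the register-file representation and masks are illustrative assumptions.

    #include <stdint.h>

    #define NUM_SEGS 1024u               /* 2^10 segments of 4 MB each */

    /* Per-process segment registers: each holds a 30-bit global segment ID
       (a NULL/invalid marker would flag unmapped segments). */
    static uint32_t seg_regs[NUM_SEGS];

    /* Form the 52-bit virtual address from a 32-bit effective address. */
    static uint64_t to_virtual(uint32_t eaddr) {
        uint32_t segno  = eaddr >> 22;                    /* top 10 bits            */
        uint32_t offset = eaddr & 0x3FFFFFu;              /* 22-bit seg+page offset */
        uint64_t segid  = seg_regs[segno] & 0x3FFFFFFFu;  /* 30-bit segment ID      */

        /* The 52-bit result is translated by the TLB and page table before
           the cache access; two processes sharing a segment load the same
           segment ID, so shared data has exactly one virtual name (no synonyms). */
        return (segid << 22) | offset;
    }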
SLIDE 7: Consistency with Self

Segmentation as a solution to the aliasing problem (continued): processes A, B, and C map their address spaces through segments into a single global virtual space of paged segments, which is then mapped onto physical memory.

[Figure: a 52-bit global virtual space of paged segments shared by processes A, B, and C and backed by physical memory; NULL entries mark segments that are only partially used.]

Segmentation can also serve as an effective protection mechanism, obviating the need to flush the TLB on a context switch, provided the segment registers are protected from modification by user-level processes (a possible implementation is shown in Figure 31.23).
SLIDE 8: Consistency with Self

ASID remapping. Tagging TLB entries with multiple IDs requires more chip area but alleviates the problem of multiple TLB entries per physical page. Alternatively, if multiple IDs were associated with each process rather than with each page (as in the scheme used in PA-RISC), a different ID could be created for each instance of sharing.

[Figure 31.21: An implementation of a single-owner, single-ID architecture. The virtual address (ASID, virtual page number, page offset) is matched against the ASID/VPN/PFN entries in the TLB to produce the physical address (page frame number, page offset); the ASID acts as a process ID.]
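A minimal sketch (not from the slides) of the single-owner, single-ID lookup in Figure 31.21: each TLB entry is tagged with an ASID, and a translation hits only if both the ASID and the virtual page number match. The entry layout and sizes are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64                       /* assumed TLB size */

    typedef struct {
        bool     valid;
        uint16_t asid;                           /* acts as a process ID */
        uint64_t vpn;                            /* virtual page number  */
        uint64_t pfn;                            /* page frame number    */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Translate (asid, virtual address) to a physical address.
       Returns true on a TLB hit; 12-bit page offset (4-KB pages, assumed). */
    bool tlb_lookup(uint16_t asid, uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpn    = vaddr >> 12;
        uint64_t offset = vaddr & 0xFFFu;

        for (int i = 0; i < TLB_ENTRIES; i++) {
            /* Because every entry is owned by a single ASID, a physical page
               shared by several processes needs several TLB entries, one per
               ASID; that duplication is what the slide's alternatives address. */
            if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].pfn << 12) | offset;
                return true;
            }
        }
        return false;                            /* TLB miss: walk the page table */
    }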
SLIDE 9: Consistency with Other Clients

i.e., cache coherence and the various consistency models. First, a look at some of the things that can go wrong just inside a SINGLE CHIP:

[Diagram: a single chip containing multiple cores, a GPU, DMA, and I/O, all sharing a memory controller (CTL) attached to DRAM.]
SLIDE 10: Proc B reads data from dev A, signals proc C when done (producer-consumer pair)

Process B (producer):

    global char data[SIZE];
    global int ready = 0;

    int fd = open("dev A");
    while (1) {
        while (ready)
            ;
        dma( fd, data, SIZE );
        ready = 1;
    }

Process C (consumer):

    global char data[SIZE];
    global int ready = 0;

    while (1) {
        while (!ready)
            ;
        process( data );
        ready = 0;
    }
SLIDE 11: Proc B reads data from dev A, signals proc C when done (producer-consumer pair), more detail

Whereas the simple picture makes it hard to deduce causality between the events, in this more detailed picture the memory system observes all of the relevant events. In particular, the following event orderings are seen in a sequentially consistent system:

1. A writes data, followed by A writing done (seen by A and thus seen by all)
2. A writes done, followed by B reading done or some other variable written by the device driver (seen by the device driver and B, and thus seen by all)
…

The sequence of events in the figure:

1. After 'ready' is set to 0, device A transfers data into the memory system (a 4K data buffer)
2. A communicates with the device driver
3. B is signaled via the driver (variable 'done' in the driver)
4. B updates the 4-byte synchronization variable 'ready' to 1
5. C uses both the data buffer and the synchronization variable:

       while (!ready)   // spin
           ;
       x = data[i];     // read data buffer

[Figure 4.10: Race-condition example, more detail. The previous model ignored (intentionally) the method by which device A communicates to process B: the device driver. Communication with the device driver is through the memory system, for example via memory-mapped I/O registers.]
SLIDE 12: Proc B reads data from dev A, signals proc C when done (producer-consumer pair)

[Diagram: the single-chip system again. Steps 1 through 6 trace the data buffer, the driver sync, and the "ready" variable through the I/O device, the DMA engine, the memory controller (CTL), DRAM, and the cores.]
SLIDE 13: Proc B reads data from dev A, signals proc C when done (producer-consumer pair)

[Diagram: a failure case. The data is delayed in the DMA engine while the driver sync and the "ready" variable race ahead; the consumer core reads stale data, and the data sent by DMA (step 7) arrives too late.]
SLIDE 14: Proc B reads data from dev A, signals proc C when done (producer-consumer pair)

[Diagram: another failure case. The data is held in the memory controller (CTL) while the driver sync and the "ready" variable race ahead; the consumer core reads stale data, and the data is sent to DRAM (step 7) too late.]
SLIDE 15: Problem: causal relationships

[Diagram: device A, process B, and process C linked through the variables data, done, and ready.]
SLIDE 16: Problem scales with the system size

[Diagram: a distributed multiprocessor in which each CPU has its own memory controller (MC) and DRAM; the backing store is spread across the whole DRAM system.]
Solve a system of linear equations iteratively: x_{i+1} = A x_i + b

    while (!converged) {
        doparallel(N) {
            int i = myid();
            xtemp[i] = b[i];
            for (j = 0; j < N; j++) {
                xtemp[i] += A[i,j] * x[j];
            }
        }                        // implicit barrier sync
        doparallel(N) {
            int i = myid();
            x[i] = xtemp[i];
        }
    }
SLIDE 17: Some Consistency Models

Strict Consistency: a read operation shall return the value written by the most recent store operation.

Sequential Consistency: the result of an execution is the same as a single interleaving of sequential, program-order accesses from the different processors.

Processor Consistency: writes from a process are observed by other clients to be in program order; all clients observe a single interleaving of writes from different processors.
SLIDE 18: Strict Consistency

Fails to satisfy strict consistency: A writes 1; within the propagation delay Δt, B reads 0 and only later reads 1.

Satisfies strict consistency: A writes 1; B reads 1, even when the read falls within Δt of the write.

[Figure 4.8: Strict consistency. Each timeline shows a sequence of read/write events to a particular location. For strict consistency to hold, B must read what A wrote to a given memory location, no matter how soon after the write the read occurs. Model tenets adapted from Tanenbaum [1995].]
SLIDE 19: Sequential Consistency

The first timeline from the previous slide fails to satisfy strict consistency but does satisfy sequential consistency; the second timeline satisfies both strict and sequential consistency.

[Figure 4.8, revisited: A writes 1, B reads 0 and then reads 1 (sequentially but not strictly consistent) versus A writes 1, B reads 1 (both).]
SLIDE 20: Sequential Consistency

Sequential consistency handles our earlier problem: the causal chain from device A through B to C (data, done, ready) is preserved because every client observes the writes in the same order.

Note: for this to work, the memory controller may reorder internally, but not externally.

[Diagram: data movement and causality among A, B, and C through the variables data, done, and ready.]
SLIDE 21: Sequential Consistency

Requirements:
• Everyone can reorder internally but not externally
• All I/O and memory references must go through the same synchronization point (e.g., memory-mapped I/O)
• The write of the data and the driver signal must come from the same client
• Write buffering presents significant problems
• Reads must be delayed by the system latency

… let's look at this last one more closely.
SLIDE 22: Really Famous Example (Goodman 1989)

Initially, A = 0 and B = 0.

Process P1:

    A = 1;
    if (B == 0) {
        kill P2;
    }

Process P2:

    B = 1;
    if (A == 0) {
        kill P1;
    }

Sequential Consistency allows 0 or 1 processes to die (not both). The symmetric argument holds equally well.
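As a concrete illustration (not from the slides), the same litmus test can be written with C11 atomics and C11 threads (threads.h, where supported). With memory_order_seq_cst the outcome in which both processes see the other's flag as 0 is forbidden; with memory_order_relaxed it is allowed. The thread bodies and variable names are my own.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    /* Shared flags, both initially 0 (as in Goodman's example). */
    atomic_int A = 0, B = 0;
    int r1, r2;                 /* what each thread observed of the other's flag */

    int p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_seq_cst);
        r1 = atomic_load_explicit(&B, memory_order_seq_cst);  /* "if (B == 0) kill P2" */
        return 0;
    }

    int p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_seq_cst);
        r2 = atomic_load_explicit(&A, memory_order_seq_cst);  /* "if (A == 0) kill P1" */
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        /* Under sequential consistency (seq_cst), r1 == 0 && r2 == 0 is impossible:
           at most one process "kills" the other. With memory_order_relaxed, the
           hardware/compiler may reorder the store past the load, and both can be 0. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }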
SLIDE 23: Race-Condition Example

Fails to satisfy sequential consistency:

[Timeline: P1 performs write A (A=1) and then write B (B=1); P2 performs read B and then read A. The values P2 observes are not consistent with any single interleaving of the two programs, so the execution is not sequentially consistent.]
SLIDE 24: Race-Condition Example

Satisfies sequential consistency:

[Timeline: P1 performs write A (A=1) and then write B (B=1); P2's read A and read B return values consistent with a single interleaving of the two programs, so the execution is sequentially consistent.]
SLIDE 25: Race-Condition Example

In practice:
• speculate
• throw an exception if a problem occurs

HOWEVER, from Jaleel & Jacob [HPCA 2005]:
• increasing the reorder buffer from 80 to 512 entries results in a 6x increase in memory traps and an increase in total execution overhead of 10-40%
• reordering memory instructions increases L1 data cache accesses by 10-60% and L1 data cache misses by 10-20%
SLIDE 26: Processor Consistency

Also called Total Store Order: all writes appear in program order, while reads are freely reordered. Processor consistency simply removes the read-ordering restriction imposed by sequential consistency: a processor is free to reorder reads ahead of writes without waiting for the write data to be propagated to the other clients in the system. Reads need not block on preceding writes; they may even execute ahead of them. As a result, the racing-threads example, if executed on an implementation of processor consistency, can result in both processes killing each other.

Both of these scenarios are satisfied in this model:

[Figure 4.13: Processor consistency and racing threads. Two timelines, each satisfying processor consistency. Processor consistency allows each processor or client within the system to freely reorder reads with respect to writes; as a result, the racing-threads example can easily result in both processes killing each other.]
SLIDE 27: Some Other Consistency Models

Partial store order: a processor can freely reorder local writes with respect to other local writes.

Weak consistency: a processor can freely reorder local writes ahead of local reads.

Release consistency: different classes of synchronization; it enforces synchronization only with respect to acquire/release operations. On an acquire, the memory system updates all protected variables before continuing; on a release, it propagates changes to the protected variables out to the rest of the system.
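As an illustration (not part of the slides), the earlier producer-consumer "ready" flag maps naturally onto acquire/release operations in C11; the function signatures are my own, and the variable names follow the earlier pseudocode.

    #include <stdatomic.h>
    #include <stddef.h>

    #define SIZE 4096

    static char data[SIZE];                  /* the shared data buffer          */
    static atomic_int ready = 0;             /* the 4-byte synchronization flag */

    /* Producer: fill the buffer, then publish it with a release store.
       The release guarantees the buffer writes become visible no later
       than the write of ready = 1. */
    void produce(void (*fill)(char *, size_t)) {
        fill(data, SIZE);
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    /* Consumer: spin with acquire loads; once ready == 1 is observed, the
       acquire guarantees that the buffer contents written before the
       matching release are visible here. */
    void consume(void (*process)(const char *, size_t)) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                /* spin */
        process(data, SIZE);
        atomic_store_explicit(&ready, 0, memory_order_release);
    }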
SLIDE 28: Cache Coherence Schemes

Ways to implement a consistency model:
• in software (e.g., in the virtual memory system, via the page table)
• in hardware
• by combining hardware and software

The hardware component is called "cache coherence."
SLIDE 29: Coherence Implementations

Cache-block states:
I (Invalid)
M (Modified): read-writable, forwardable, dirty
S (Shared): read-only (can be clean or dirty)
E (Exclusive): read-writable, clean
O (Owned): read-only, forwardable, dirty
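A tiny sketch (not from the slides) encoding these five states and the permissions listed above; the predicate names are purely illustrative.

    #include <stdbool.h>

    typedef enum {
        ST_INVALID, ST_SHARED, ST_EXCLUSIVE, ST_OWNED, ST_MODIFIED
    } coh_state_t;

    /* Readable in every state except Invalid. */
    static bool can_read(coh_state_t s)    { return s != ST_INVALID; }

    /* Writable only when this cache holds the sole up-to-date copy. */
    static bool can_write(coh_state_t s)   { return s == ST_MODIFIED || s == ST_EXCLUSIVE; }

    /* States responsible for eventually updating the backing store.
       (A Shared copy can also be dirty, but the Owner, not the sharer,
       carries the write-back responsibility.) */
    static bool is_dirty(coh_state_t s)    { return s == ST_MODIFIED || s == ST_OWNED; }

    /* States that may forward (supply) the block to another requester. */
    static bool can_forward(coh_state_t s) { return s == ST_MODIFIED || s == ST_OWNED; }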
SLIDE 30: Coherence Implementations: SI

Works with write-through caches. A block is either present (Shared) or not (Invalid). Write operations cause one of two results:
• write-invalidate (only one writable copy extant)
• write-update (can have multiple writers)

Both schemes require broadcast or multicast of coherence information and/or write data.

Note: write-update and sequential consistency don't play nicely together.

Problem: nobody wants to use write-through caches.
SLIDE 31: Coherence Implementations: MSI

Write-back caches add a dirty bit (the Modified state). The implementation must ensure that no situation arises that would be deemed incorrect by the chosen consistency model; Figure 4.15 illustrates a possible state machine. Whereas a write-through cache is permitted to evict a block silently at any time (e.g., to make room), MSI requires two steps: when a client acquires a block on a read miss, the block enters the Shared state, and to write it the client must then follow up with a write-invalidate broadcast so that it can place the block in the Modified state and overwrite it with new data.

Problem: when the app reads data and then writes it, the cache sends a second broadcast (a read-miss broadcast followed by a write-invalidate broadcast).

[Figure 4.15: State machine for an MSI protocol, based on Archibald and Baer [1986]. A read miss transmits a read miss and moves the block to Shared; a write miss, or a write hit on a Shared block, transmits a write miss (which acts as the invalidate message) and moves the block to Modified; observed bus read-misses and write-misses cause a Modified holder to send its cached data and downgrade or invalidate. Boldface font in the original indicates observed bus actions.]
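A minimal sketch (not from the slides) of the local-controller side of such an MSI state machine; the event names and the broadcast hooks (xmit_read_miss and friends) are assumptions standing in for a real bus interface.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

    typedef enum {
        CPU_READ, CPU_WRITE,            /* requests from the local processor   */
        BUS_READ_MISS, BUS_WRITE_MISS   /* requests observed on the shared bus */
    } msi_event_t;

    /* Assumed hooks onto the shared bus / backing store. */
    void xmit_read_miss(void);
    void xmit_write_miss(void);         /* also serves as the invalidate message */
    void send_cached_data(void);

    /* One step of the per-block MSI controller (sketch of Figure 4.15). */
    msi_state_t msi_step(msi_state_t s, msi_event_t e) {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  { xmit_read_miss();  return SHARED;   }
            if (e == CPU_WRITE) { xmit_write_miss(); return MODIFIED; }
            return INVALID;                     /* bus traffic: nothing cached here */
        case SHARED:
            if (e == CPU_WRITE)      { xmit_write_miss(); return MODIFIED; } /* the 2nd broadcast */
            if (e == BUS_WRITE_MISS) return INVALID;    /* another client wants to write */
            return SHARED;                              /* CPU_READ, BUS_READ_MISS: no change */
        case MODIFIED:
            if (e == BUS_READ_MISS)  { send_cached_data(); return SHARED;  }
            if (e == BUS_WRITE_MISS) { send_cached_data(); return INVALID; }
            return MODIFIED;                            /* local read/write hits */
        }
        return s;
    }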
SLIDE 32: Coherence Implementations: MESI

MESI reduces write broadcasts. A client may not write a cache block unless it first acquires a copy of the block in the Exclusive state, and only one system-wide copy of that block may be marked Exclusive. Even if a client has a readable copy of the block in its data cache (e.g., in the Shared state), it may not write the block unless it first acquires the block in an Exclusive state. The written data becomes visible to the rest of the system later, when the block is evicted from the owner's cache and written back to the backing store; alternatively, the cache-coherence mechanism may prompt the Exclusive owner to propagate its changes back to the backing store early.

Problem: when you ask for a block, potentially many clients may respond.

[Figure 4.16: State machine for a MESI protocol, based on Archibald and Baer [1986]. On a read miss, the block enters Exclusive if the response comes from main memory and Shared if it comes from another cache; a write hit on an Exclusive block moves it to Modified, while writes from Shared or Invalid transmit a write miss/invalidate; observed bus read-misses cause holders to send cached data, and observed bus invalidates force blocks to Invalid. Boldface font in the original indicates observed bus actions.]
SLIDE 33: Coherence Implementations: MESIF

The Shared state is broken into two: Shared (one or more copies) and Forwardable (exactly one copy, the one designated to respond to requests).

[Figure: MESI state diagram (left) vs. MESIF state diagram (right).]

Problem: all coherence information still goes through a central point, the backing store.
SLIDE 34: Coherence Implementations: MOESI

[Figure: MOESI state diagram. The Owned state lets a dirty block be shared read-only, with the owner responsible for forwarding it and eventually writing it back.]
SLIDE 35: MESI vs. MOESI (AMD) vs. MESIF (Intel)

[Figure: side-by-side comparison of the MESI, MOESI, and MESIF state diagrams.]
SLIDE 36: Some System Configurations

UMA: several CPUs, each with its own cache, share a single system & memory controller in front of one backing store (the DRAM system).

NUMA: each CPU (or cluster of CPUs) has its own system/memory controller and local DRAM, so the backing store is distributed across the nodes.

[Figure 4.18: The many faces of backing store. The backing store in a multiprocessor system can take on many forms; a primary characteristic is whether the backing store is distributed or not, and the choices within a distributed organization are just as varied.]
SLIDE 37: Bus-Based, Hierarchical

Bus-based and hierarchical configurations, and even this: clusters of CPUs behind shared caches and system/memory controllers, themselves connected to a distributed backing store of per-node memory controllers and DRAM.

[Figure 4.18, continued: The many faces of backing store. The two organizations on the bottom right are different implementations of the design on the bottom left.]

Background, from the text: in a snoopy protocol, all coherence-related activity is observed by every client on the shared interconnect, each client indicating ownership and returning the requested data if that is the appropriate response. A directory protocol instead tracks each cache block in a data structure associated with that block; the data structure contains such information as the block's ownership, its sharing status, etc. These data structures are held together in a directory, which can be centralized or distributed, and a client's request for a cache block is sent to the directory rather than broadcast.
SLIDE 38: Directory-Based Protocols

A directory protocol can run on any configuration; the main idea is to eliminate the need to broadcast every coherence event.

[Figure 4.18, repeated: the backing-store configurations on which a directory protocol can run, from a single shared system & memory controller to fully distributed memory.]
SLIDE 39: Directory-Based Protocols

Each memory block has a directory entry:
• P+1 bits, where P is the number of processors
• one dirty bit per directory entry
• if the dirty bit is on, then only one presence bit can be on
• nodes only communicate with the other nodes that have the memory block

A sketch of such a directory entry follows.
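This is a minimal sketch (not from the slides) of a P+1-bit directory entry, here packed into a 64-bit presence word for up to 63 processors; the field and function names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>
    #include <assert.h>

    #define NUM_PROCS 63            /* P; one presence bit per processor */

    /* One directory entry per memory block: P presence bits plus a dirty bit. */
    typedef struct {
        uint64_t presence;          /* bit p set => processor p has a copy       */
        bool     dirty;             /* set => only one presence bit may be on    */
    } dir_entry_t;

    /* Record that processor p has fetched the block for reading. */
    void dir_add_sharer(dir_entry_t *e, int p) {
        assert(p >= 0 && p < NUM_PROCS);
        assert(!e->dirty);                      /* must be clean to add sharers */
        e->presence |= (uint64_t)1 << p;
    }

    /* Record that processor p now holds the only, dirty copy. */
    void dir_set_owner(dir_entry_t *e, int p) {
        assert(p >= 0 && p < NUM_PROCS);
        e->presence = (uint64_t)1 << p;         /* drop all other sharers */
        e->dirty = true;
    }

    /* Which nodes must be contacted for this block? Only those with a copy. */
    uint64_t dir_sharers(const dir_entry_t *e) {
        return e->presence;
    }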
SLIDE 40: Directory-Based Protocols

Flavors:
• Centralized directory, centralized memory: poor scalability, but better than bus-based …
• Decentralized directory, distributed memory: each node stores the small piece of the entire directory corresponding to the memory resident at that node. Directory queries are sent to the node that the block's address corresponds to, i.e., its home node (see the sketch after this list).
• Clustered: presence bits are coarse-grained
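A minimal sketch (not from the slides) of how a block address might be mapped to its home node in such a distributed directory; the interleave-by-block-number scheme and the constants are illustrative assumptions.

    #include <stdint.h>

    #define BLOCK_BITS 6             /* 64-byte cache blocks (assumed)          */
    #define NUM_NODES  16            /* number of nodes in the system (assumed) */

    /* Home node of a physical block address: blocks are interleaved across
       nodes by block number, so every node holds the directory entries (and
       the memory) for 1/NUM_NODES of the address space. */
    static inline int home_node(uint64_t paddr) {
        uint64_t block = paddr >> BLOCK_BITS;
        return (int)(block % NUM_NODES);
    }

    /* A directory query for 'paddr' is sent to home_node(paddr); only that
       node consults and updates the entry and forwards the request to the
       sharers recorded there. */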
SLIDE 41: Problem with Scalability

What if the latency to the other processors is greater than the latency to local DRAM?

[Figure 4.23: Abstract illustration of the coherence point. A transaction request from the processor must reach the coherence point before the most recent data can be returned.]

[Figure 4.24: Abstract illustration of "transaction latency." After the transaction request is transmitted, the coherence check at the other processor(s) and the data retrieval from DRAM proceed as parallel tasks, with bus-level error checking along the way; the results serialize at the data return.]

[Figure 4.25: Coherence point in different system topologies. Depending on whether the system controller, the memory controller, or the other CPUs sit at the coherence point, the coherence check and the DRAM read serialize at different places; the respective points of coherence synchronization differ.]