ECE 486/586
Computer Architecture
Chapter 3
UP Data Cache
Herbert G. Mayer, PSU
Status 1/15/2017
1
Data Cache in
UP Microprocessor
2
Syllabus UP Caches
 Intro: Purpose, Design Parameters, Architecture
 Effective Time teff
 Single-Line Degenerate Cache
 Multi-Line, Single-Set Cache
 Single-Line, Multi-Set Cache, Blocked Mapping
 Single-Line, Multi-Set, Cyclic Mapping
 Multi-Line per Set (Associative), Multi-Set Cache
 Replacement Policies
 LRU Sample
 Compute Cache Size
 Trace Cache
 Characteristic Cache Curve
 Bibliography
3
Intro: Purpose of Cache
 Cache is logically part of Memory Subsystem, but
physically part of microprocessor, i.e. on the same
silicon die
 Purpose: render slow memory into a fast one
 With minimal cost despite high cost per bit, since the
cache is just a few % of total physical main store
 Works well only if locality is good; else performance
is the same as memory access, or worse, depending on
architecture
 With poor locality, i.e. when there is a random
distribution of memory accesses, any cache can
slow down if:
teff = tcache + (1-h) * tmem   and not:   teff = max( tcache, (1-h) * tmem )
4
Intro: Purpose of Cache
 With good locality, cache delivers available data in
close to unit cycle time
 In MP systems, caches must cooperate with other
processors’ caches, memory, some peripherals
 Even on a UP system there are multiple agents
that access memory and thus impact caches, e.g.
the DMA and memory controller
 Cache must cooperate with VMM of memory
subsystem to jointly render a physically small,
slow memory into a virtually large, fast memory at
small cost of added HW (silicon), and system SW
 L1 cache access time ideally should be within a
single machine cycle; realized on many CPUs
5
Intro: Trend of Speeds
6
Intro: Growing Caches
Intel Haswell-E Die, Center Showing 20 MB Shared L3 Cache
7
From Definitions in Appendix
Line
 Storage area in cache able to hold a copy of a
contiguous block of memory cells, AKA paragraph
 The portion of memory stored in that line is
aligned on a memory address modulo line size
 For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first
byte has 6 trailing zeros: i.e. it is evenly divisible
by 64, or we say it is 64-byte aligned
 Such known zeros don’t need to be stored in the
tag; they are implied, they are known a-priori!
 This shortens the tag, rendering cache simpler,
and cheaper to manufacture: less HW bits!
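To make the implied-zeros argument concrete, here is a minimal Python sketch (mine, not from the slides), assuming 32-bit byte addresses and 64-byte lines: the 6 offset bits are exactly the trailing zeros that never need to be stored in the tag.

# Sketch with assumed parameters: 32-bit byte addresses, 64-byte lines.
LINE_BYTES = 64
OFFSET_BITS = 6                       # log2(64): the implied trailing zeros

def split_address(addr):
    """Return (line_address, offset) for a byte address."""
    offset = addr & (LINE_BYTES - 1)  # low 6 bits: byte within the line
    line_addr = addr >> OFFSET_BITS   # bits the tag (and set index) must cover
    return line_addr, offset

# The first byte of any line is 64-byte aligned, so its offset is 0:
assert split_address(0x12345 * 64) == (0x12345, 0)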
8
From Definitions in Appendix
Set
 A logically connected region of memory, mapped to a specific area of cache (line), is a set; memory is partitioned into N sets
 Elements of a set don’t need to be physically contiguous in memory; if contiguous, the leftmost log2(N) bits are known a-priori and don’t need to be stored; if cyclic distribution, then the rightmost log2(N) bits are known a-priori
 The number of sets is conventionally labeled N
 A degenerate case maps all memory onto the whole cache, in which case only a single set exists: N = 1; i.e. one set; not a meaningful method!
 Notion of set is meaningful only if there are multiple sets. Again: a memory region belonging to one set can be a physically contiguous block, or a cyclically distributed part of memory
 The former case is called blocked, the latter cyclic. The cache area into which such a portion of memory is mapped is also called a set
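A minimal Python sketch (assumed parameters, not from the slides) contrasting the two mappings: with N sets, blocked mapping draws the set index from the high-order address bits, cyclic mapping from the low-order bits just above the line offset.

# Assumed: 32-bit addresses, N = 4 sets, 64-byte lines.
ADDR_BITS, N_SETS, LINE_BYTES = 32, 4, 64
SET_BITS = 2      # log2(N_SETS)
OFF_BITS = 6      # log2(LINE_BYTES)

def set_index_blocked(addr):
    # High-order bits pick the set: memory is cut into N contiguous blocks.
    return addr >> (ADDR_BITS - SET_BITS)

def set_index_cyclic(addr):
    # Bits just above the line offset pick the set: consecutive paragraphs
    # rotate through the sets round-robin.
    return (addr >> OFF_BITS) & (N_SETS - 1)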
9
Intro: Cache Design Parameters
 Number of lines in set: K
Quick test: K is how large in a direct-mapped cache?
 Number of units –bytes– in a line is named L, AKA
Length of line L
 Number of sets in memory, and hence in the cache: N
 Policy upon store miss: cache write policy
 Policy upon load miss: cache read policy
 What to do, when an empty line is needed for the
next paragraph to be streamed-in, but none is
available? That action is the: replacement policy
10
Intro: Cache Design Parameters
 Size here is the total size of a cache; unit being
discussed can be bits or bytes, be careful!
 Size (bits) = K * ( 8 * L + bits for tag and control bits ) * N
 Ratio of cache size to physical memory is generally
a very small percentage, e.g. < 1 %
 Cache access time, typically close to 1 cycle for L1
cache
 Number of processors with cache: 1 in UP, M in MP
architecture
 Levels of caches, L1, L2, L3 … Last one referred to
as LLC, for last level cache
11
Intro: Cache Architecture
 Cache-related definitions used throughout are
common, though not all manufacturers apply the
same nomenclature
 Initially we discuss cache designs for single-processor architectures
 In MP cache lecture we may progress to more
complex cache designs, covering the MESI
protocol for a two-processor system with external
L2 cache
 Focus here: L1 data cache
12
Effective Time teff
 Starting with teff = tcache + ( 1 - h ) * tmem we observe:
 No matter how many hits (H) we experience during
repeated memory access, the effective cycle time is
never less than tcache
 No matter how many misses (M) we experience, the
effective cycle time to access a datum is never
more than tcache + tmem
 Desirable to have teff = tmem in case of a cache miss
 Another way to compute effective access time is to
add all memory-access times, and divide them by
the total number of accesses, and thus compute
the effective time, or the average time teff
13
Effective Time teff
Average time per access:
teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
If memory is accessed immediately upon a cache miss:
teff = h * tcache + m * ( tcache + tmem )
     = ( h + m ) * tcache + m * tmem
     = tcache + m * tmem
 Assume an access time of 1 cycle to reference data in the cache; best case, at times feasible
 Assume an access time of 10 cycles for data in memory; is unrealistically fast!!
 Assume that a memory access is initiated after a cache miss; then:
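As a quick cross-check, a small Python sketch (mine) evaluating both teff models with the assumed 1-cycle cache and 10-cycle memory from this slide; the max() variant models a memory access started in parallel with the cache lookup.

def t_eff_sequential(h, t_cache=1, t_mem=10):
    # Memory access starts only after the cache reports a miss.
    m = 1.0 - h
    return t_cache + m * t_mem

def t_eff_overlapped(h, t_cache=1, t_mem=10):
    # Memory access starts in parallel with the cache lookup.
    m = 1.0 - h
    return max(t_cache, m * t_mem)

for h in (1.0, 0.99, 0.97, 0.90, 0.0):
    print(h, t_eff_sequential(h), t_eff_overlapped(h))
# e.g. h = 0.90: sequential 2.0 cycles, overlapped 1.0 cycle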
14
Effective Time teff
15
Effective Time teff
Symb.   Name             Explanation
H       Hits             Number of successful cache accesses
M       Misses           Number of failed cache accesses
A       All              All accesses: A = H + M
T       Total time       Time for A memory accesses
tcache  Cache time       Time to access data once via the data cache
tmem    Mem time         Time to access data via memory once
teff    Effective tm.    Average time over all memory accesses
h       Hit rate         H / A = h = 1 - m
m       Miss rate        M / A = m = 1 - h
h+m     Total rate = 1   Total rate, either hit or miss, probability is 1
16
Effective Time teff
17
Effective Time teff
 Compare teff, the effective memory access time in L1
data cache at 99% hit rate vs. 97% hit rate
 Time for hit thit = 1 cycle, time for miss tmiss = 100
cycles; then compare 99 and 97 percent hit rates:
 Given a 99% hit rate:
   1 miss costs 100 cycles
   99 hits cost 99 cycles total
   teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
 Given a 97% hit rate: Students compute here!
18
Effective Time teff
 Compare teff, the effective memory access time in L1
data cache at 99% hit rate vs. 97% hit rate
 Time for hit thit = 1 cycle, time for miss tmiss = 100
cycles; then compare 99 and 97 percent hit rates:
 Given a 99% hit rate:
   1 miss costs 100 cycles
   99 hits cost 99 cycles total
   teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
 Given a 97% hit rate:
   3 misses cost 300 cycles
   97 hits cost 97 cycles total
   teff = ( 300 + 97 ) / 100 = 397 / 100 = 3.97 ≈ 4 cycles per average access
 Or 100% additional cycles for loss of 2% hit accuracy!
19
Actual Cache Data
Intel Core i7 with 3 levels of cache, L1 access > 1 cycle, L3 access
costing dozens of cycles, still way faster than memory access!
20
Complexity of Opteron Data Cache – Taken from Wikipedia
21
Deriving
Cache Performance Parameters
22
Single-Line Degenerate Cache
23
Single-Line Degenerate Cache
 Quick test: what is the minimum size (in number of bits) of
the tag for this degenerate cache? (assume 32-bit
architecture, and 64-byte lines)
 The single-line cache, shown here, stores multiple words
 Can improve memory access if extremely good locality
exists within a very narrow address range
 Upon miss cache initiates a stream-in operation
 Is a direct mapped cache: all memory locations know a
priori where they’ll reside in cache; there is but one line,
one option for them
 Is a single-set cache
24
Single-Line Degenerate Cache
 As data cache: exploits only locality of near-by
addresses in the same paragraph
 As instruction cache: Exploits locality of tight
loops that completely fit inside the address range
of a single cache line
 However, there will be a cache-miss as soon as an
address makes reference outside of line’s range
 For example, tight loop with a function call will
cause cache miss
 Stream-in time is time to load a line of data from
memory
 Total overhead: tag bits + valid bit + dirty bit (if
write-back)
 Not advisable to build this cache subsystem 
25
Dual-Line, Single-Set Cache
26
Dual-Line, Single-Set Cache
 Next cache has 1 set, multiple lines; here 2 lines shown
Quick test: minimum size of tag on byte-addressable, 32-bit
architecture with 2 lines, 1 set, line size of 16 bytes?
 Each line holds multiple, contiguous addressing units, 4
words, 16 bytes shown
 Thus 2 disparate areas of memory can be cached at the
same time
 Is associative cache; all lines (i.e. 2 lines) in single set
must be searched to determine, whether a memory
element is present in cache
 Is single-set associative cache, since all of memory
(singleton set) is mapped onto the same cache lines
27
Dual-Line, Single-Set Cache
 Some tight loops with a function call can be
completely cached in an I-cache, assuming the loop
body fits into one line and the callee fits into the other line
 Also would allow one larger loop to be cached,
whose total body does not fit into a single line, but
would fit into 2 lines
 Applies to more realistic programs
 But if number of lines K >> 1, the time to search all
tags (in set) can grow beyond unit cycle time
 Not advisable to build this cache subsystem 
28
Single-Line, Dual-Set Cache
29
Single-Line, Dual-Set Cache
 This cache architecture has multiple sets, 2 shown, 2 distinct areas of memory, each being mapped onto separate cache lines: N = 2, K = 1
Quick test: minimum size of the tag on 4-byte per word, 32-bit architecture with 16-byte lines?
 Each set has a single line, in this case 4 memory words; AKA paragraph in memory
 Thus 2 disparate areas of memory can be cached at the same time
 But these areas must reside in separate memory sets, each contiguous, each having only 1 option
 Is direct mapped; all memory locations know a priori where they’ll reside in cache
 Is multi-set cache, since parts of memory have their own portion of cache
30
Single-Line, Dual-Set Cache
 Allows one larger loop to be cached, whose total
body does not fit into a single line of an I-cache, but
would fit into two lines
 But only if by some great coincidence both parts of
that loop reside in different memory sets
 If used as instruction cache, all programs
consuming half of memory or less never use the
line in the second set. Hence this cache
architecture would be a bad idea!
 If used as data cache, all data areas that fit into first
block will never utilize second set of cache
 Problem specific to blocked mapping; try cyclic
instead
 Not advisable to build this type cache 
31
Dual-Set, Single-Line, Cyclic
32
Dual-Set, Single-Line, Cyclic
 This cache architecture below also has 2 sets, N = 2
 Each set has a single line, each holding 4 contiguous memory units, 4 words, 16 bytes, K = 1
 Thus 2 disparate areas of memory can be cached at the same time
Quick test: tag size on 32-bit, 4-byte architecture?
 Disparate areas (of line size, equal to paragraph size) are scattered cyclically throughout memory
 Cyclically distributed memory areas associated with each respective set
 Is direct mapped; all memory locations know a priori where they’ll reside in cache, as each set has a single line
 Is multi-set cache: different locations of memory are mapped onto different cache lines, the sets
33
Dual-Set, Single-Line, Cyclic
 Also allows one larger loop to be cached, whose
total body does not fit into a single line, but would
fit into two lines
 Even if parts of loop belong to different sets
 If used as instruction cache, small code section
can use the total cache
 If used as data cache, small data areas can utilize
complete cache
 Cyclic mapping of memory areas to sets is
generally superior to blocked mapping
 Still not advisable to build this cache subsystem 
34
Multi-Line, Multi-Set, Cache
35
Multi-Line, Multi-Set, Cache
 Reminder: Tag is that minimal number of address
bits to be stored in cache lines
Quick test: minimum size (in bits) of the tag?
 Here 32-bit architecture, byte addressable, 2 sets
cyclic, line length 16 bytes, 2-way set associative
 Two sets, memory will be mapped cyclically, AKA in
a round-robin fashion
 Each set has two lines, each line holding 16 bytes;
i.e. paragraph length of memory is 16 bytes in this
example!
 Note: direct mapped caches, i.e. caches with one
line per set, are also common; AKA non-associative
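To make the lookup concrete, a small Python sketch (my illustration, using this slide's parameters: 32-bit byte addresses, 2 sets cyclic, 16-byte lines, 2 lines per set): the set index is taken from the address, and only the tags within that one set are searched associatively.

# Assumed geometry from the slide: N = 2 sets (cyclic), K = 2 ways, L = 16 bytes.
N_SETS, K_WAYS, LINE_BYTES = 2, 2, 16
OFF_BITS = LINE_BYTES.bit_length() - 1    # 4 offset bits
SET_BITS = N_SETS.bit_length() - 1        # 1 set-index bit

# cache[set_index] is a list of K_WAYS entries: (valid, tag)
cache = [[(False, 0)] * K_WAYS for _ in range(N_SETS)]

def lookup(addr):
    """Return True on a hit, False on a miss (only one set's tags are searched)."""
    set_idx = (addr >> OFF_BITS) & (N_SETS - 1)   # cyclic mapping
    tag     = addr >> (OFF_BITS + SET_BITS)       # remaining 27 address bits
    return any(valid and t == tag for valid, t in cache[set_idx])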
36
Multi-Line, Multi-Set, Cache
 Associative cache: once set is known, search all
tags for the memory address in all lines of that set
 In earlier example p. 34, line 2 of set 2 is unused,
AKA invalid in MESI terminology
 By now you know: sets, lines, associative, non-associative, direct mapped, and more terms!
37
Replacement Policy
 The replacement policy is the rule that determines,
when all lines are valid (i.e. already busy with other,
good data), and a new line must be streamed in:
 Which of the valid lines in a cache is to be replaced,
AKA removed?
 Removal can be low cost, if the modified bit (AKA
dirty bit) is clear = 0; this means: data in memory
and cache line are identical!
 Otherwise removal may be costly: If the dirty bit is
set = 1, data have to be written back into memory,
costing a memory access!
 We call this copying to memory: stream out
38
Replacement Policy
1. LRU: Replaces the Least Recently Used cache line; requires keeping track of relative “ages” of lines. Retire the line that has remained unused for the longest time of all candidate lines. Speculate that that line will remain unused for the longest time in the future.
2. LFU: Replaces the Least Frequently Used cache line; requires keeping track of the number m of times this line was used over the last n >= m uses. Depending on how long we track the usage, this may require many bits.
3. FIFO: First In First Out: the first of the lines in the set that was streamed in is the first to be retired, when it comes time to find a candidate. Has the advantage that no further update is needed while all lines are in use.
4. Random: Pick a random line from the candidate set for retirement; not as bad as this irrational algorithm might suggest. Reason: the other methods are not too good either 
5. Optimal: If a cache were omniscient, it could predict which line will remain unused for the longest time in the future. Of course, that is not computable. However, for creating the perfect reference point, we can do this with past memory access patterns, and use the optimal access pattern to compare how well our chosen policy rates vs. the optimal strategy!
39
LRU Sample 1
Assume the following cache architecture:
• N = 16 sets, cyclic distribution
• K = 4 lines per set
• 32-bit architecture, byte-addressable
• write back (dirty bit)
• valid line indicator (valid bit)
• L = 64 bytes per line; AKA line length
• LRU replacement; uses 2 bits (4 lines per
set), to store relative ages
• This results in a tag size of ???? bits
40
LRU Sample 1
Assume the following cache architecture:
• This results in a tag size of 22 bits
• What is the overhead size per line in bits?
––
41
LRU Sample 1
Assume the following cache architecture:
• Tag size = 22 bits
• 2 LRU bits (4 lines per set), to store relative
ages of the 4 lines in each set
• Dirty bit needed, AKA Modified bit = 1
• Valid bit needed = 1
• Overhead per line: 22 + 2 + 1 + 1 = 26 bits
42
LRU Sample 2
 Sample 2 focuses on one particular Set:
 Let the 4 lines be numbered 0..3
 Set is accessed in the order: line 0 miss, line 1 miss,
line 0 hit, line 2 miss, line 0 hit again, line 3 miss, line
0 hit again, and another miss
 Now cache is full, now find an available line by
eviction, to have a line for another miss!
 Assume initially a cold cache, all lines in the cache
were free before these accesses
 Problem: Once all lines are filled (Valid bit is 1 for all
4 lines) some line must be retired to make room for
the new access that missed, but which?
 Answer is based on the LRU policy (Least Recently
Used line), which here is line 1
43
LRU Sample 2
 The access order, assuming all memory accesses are just reads (loads), no writes (no stores), i.e. dirty bit is always clear:
 Read miss, all lines invalid, stream paragraph in line 0
 Read miss (implies new address), stream paragraph in line 1
 Read hit on line 0
 Read miss to a new address, stream paragraph into line 2
 Read hit, access line 0
 Read miss, stream paragraph into line 3
 Read hit, access line 0
 Now another Read Miss, all lines valid, find line to retire, AKA to evict
 Note that LRU age 00₂ is youngest, for cache line 0,
and 11₂ is the oldest line (AKA the least recently
used line), for cache line 1, of the 4 relative ages of
the 4 total lines
44
LRU Sample 2
45
LRU Sample 2
1. Initially, in a partly cold cache, if we experience a
miss, there will be an empty line (partly cold
cache), the paragraph is streamed into the empty
line, its relative age is set to 0, and all other ages
are incremented by 1
2. In a warm cache (all lines are used) when a line of
age X experiences a hit, its new age becomes 0.
Ages of all other lines whose age is < X are
incremented by 1; i.e. older ones remain “older”
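A small Python simulation (mine, not from the slides) of these two aging rules for one 4-line set; replaying the access pattern of LRU Sample 2 indeed ends with line 1 as the oldest, i.e. the LRU victim.

K = 4                        # lines per set
valid = [False] * K
age   = [0] * K              # relative ages, 0 = youngest

def touch(line, fill=False):
    """Aging rules: the touched/filled line becomes age 0, younger lines get older."""
    x = K if fill else age[line]          # a fill ages every other valid line
    for i in range(K):
        if i != line and valid[i] and age[i] < x:
            age[i] += 1
    age[line], valid[line] = 0, True

def access(hit_line=None):
    """hit_line = None models a miss: fill a free line, or evict the LRU line."""
    if hit_line is not None:
        touch(hit_line)
        return hit_line
    free = [i for i in range(K) if not valid[i]]
    victim = free[0] if free else max(range(K), key=lambda i: age[i])
    touch(victim, fill=True)
    return victim

# LRU Sample 2: miss, miss, hit 0, miss, hit 0, miss, hit 0, then one more miss
for step in (None, None, 0, None, 0, None, 0):
    access(step)
print(access(None))          # prints 1: line 1 is the least recently used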
46
Compute Cache Size
Typical Cache Design Parameters:
1. Number of lines in every set: K
2. Number of bytes in a line, i.e. the Length of line: L
3. Number of sets in memory, and hence in cache: N
4. Policy upon memory write (cache write policy)
5. Policy upon read miss (cache read policy)
6. Replacement policy (e.g. LRU, random, FIFO, etc.)
7. Size (bits) = K * ( 8 * L + tag + control bits ) * N
47
Compute Cache Size
Compute the minimum number of bits for an 8-way, set-associative cache with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte addressable, with 32-bit addresses:

Tag                       = 32 - 5 - 6         = 21 bits
LRU, 8 ways               =                      3 bits
Dirty bit                 =                      1 bit
Valid bit                 =                      1 bit
Overhead per line         = 21 + 3 + 1 + 1     = 26 bits
# of lines                = K * N = 64 * 8     = 2^9 lines
Data bits per cache line  = 32 * 8             = 2^8 bits
Total cache size          = 2^9 * ( 26 + 2^8 ) = 144,384 bits
Size in bytes approx.     ≈ 17.6 kBytes
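The same arithmetic as a small Python helper (my sketch; the parameter names are mine). It reproduces the 21-bit tag, 26 overhead bits, and 144,384-bit total of this example, and also the 22-bit tag of LRU Sample 1.

from math import log2

def cache_bits(addr_bits, n_sets, k_ways, line_bytes, lru_bits, ctrl_bits=2):
    """Total cache size in bits: K * (8*L + tag + control bits) * N."""
    off_bits = int(log2(line_bytes))             # byte offset within a line
    set_bits = int(log2(n_sets))                 # set index (cyclic mapping)
    tag_bits = addr_bits - set_bits - off_bits
    overhead = tag_bits + lru_bits + ctrl_bits   # ctrl = dirty + valid
    return k_ways * n_sets * (8 * line_bytes + overhead)

# This slide's example: 8-way, 64 sets, 32-byte lines, 3 LRU bits -> 144384 bits
print(cache_bits(32, 64, 8, 32, lru_bits=3))
# LRU Sample 1: 4-way, 16 sets, 64-byte lines, 2 LRU bits -> tag is 22 bits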
48
Trace Cache
 Trace Cache is a special-purpose cache that does not
hold (raw) instruction bits, but instead stores predecoded operations AKA micro-ops
 The old AMD K5 uses a Trace Cache; see [1]
 Intel’s Pentium® P4 uses a 12 k micro-op Trace Cache
 Advantages: faster access to executable bits at every
cached instruction
 Disadvantage: less dense storage, i.e. wasted cache
bits, when compared to a regular I-cache
 Note that cache bits are way more costly than
memory bits; several decimal orders of magnitude!
 Trace caches have been falling out of favor since the 2000s
49
Trace Cache
50
Characteristic Cache Curve
 In the graph below we use a relative number of
cache misses [RM] to avoid an infinitely high abscissa
 RM = 0 is ideal case: No misses, all hits
 RM = 1 is worst case: All memory accesses are
cache misses
 If a program exhibits good locality, relative cache
size of 1 results in good performance; we use this
as the reference point:
 Very coarsely, in some ranges, doubling the
cache’s size results in 30% fewer cache misses
 In other ranges of the characteristic curve,
doubling the cache results in just a few % of
reduced misses: beyond the sweet spot!
51
Characteristic Cache Curve
52
Cache vs. Core on 22 nm Die
53
UP Cache Summary
 Cache is a special HW storage, allowing fast access
to small areas of memory, copied into cache lines
 Built with expensive technology, hence the size of a
cache relative to memory size is small; cache holds
only a small subset of memory, typically < 1 %
 Frequently used data (or instructions in an I-cache)
are copied to cache, with the hope that the data
present in the cache are accessed frequently
 Miraculously  that is generally true, so caches in
general do speed up execution despite slow
memories: Exploiting what is known as locality
 Caches are organized into sets, with each set having
1 or more lines; multiple lines require searching
 Defined portions of memory get mapped into any
one of these sets
54
Bibliography UP Caches
1. Shen, John Paul, and Mikko H. Lipasti: Modern
Processor Design, Fundamentals of Superscalar
Processors, McGraw Hill, ©2005
2. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
3. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The
Cache Performance and Optimizations of Blocked
Algorithms," ACM 0-89791-380-9/91, p. 63-74.
4. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
5. MESI: http://en.wikipedia.org/wiki/Cache_coherence
6. Kilburn, T., et al.: “One-level storage system”, IRE
Transactions on Electronic Computers, EC-11, 2, 1962, p. 223-235
55
Definitions
For Caches
56
Definitions
Aging
 A cache line’s age is tracked; only in associative
cache, doesn’t apply for direct-mapped cache
 Aging tracks, when a cache line was accessed,
relative to the other lines in this set
 This implies that ages are compared
 Generally, the relative ages are of interest, such
as: am I older than you? Rather than the absolute
age, e.g.: I was accessed at cycle such and such
 Think about the minimum number of bits needed
to store the relative ages of, say, 8 cache lines!
 Memory access addresses only one line, hence all
lines in a set have distinct (relative) ages
57
Definitions
Alignment
 Alignment is a spacing requirement, i.e. the
restriction that an address adhere to a specific
placement condition
 For example, even-alignment means that an
address is even, that it be divisible by 2
 E.g. address 3 is not even-aligned, but address
1000 is; thus the rightmost address bit will be 0
 In VMM, page addresses are aligned on page boundaries. If a page-frame has size 4k, then page
addresses that adhere to page-alignment are
evenly divisible by 4k
 As a result, the low-order (rightmost) 12 bits are 0.
Knowledge of alignment can be exploited to save
storing address bits in VMM, caching, etc.
58
Definitions
Allocate-on-Write
 If a store instruction experiences a cache miss,
and as a result a cache line is filled, then the
allocate-on-write cache policy is used
 If the write miss causes the paragraph from
memory to be streamed into a data cache line, we
say the cache uses allocate-on-write
 Pentium processors, for example, do not use
allocate-on-write
 Antonym: write-by
59
Definitions
Associativity
 If a cache has multiple lines per set, we call it k-way
associative; k stands for number of lines in a set
 Having a cache with multiple lines (i.e. k > 1) does
require searching, or address comparing; search
checks, whether some referenced object is in fact
present
 Another way of saying this is: In an associative
cache any memory object has more cache lines
than just one, where it might live
 Antonym: direct mapped; if only a single line (per
set) exists, the search is reduced to a simple, single
tag comparison
60
Definitions
Back-Off
 If processor P1 issues a store to a data address
shared with another processor P2, and P2 has
cached and modified the same data, a chance for
data inconsistency arises
 To avoid this, P2 with the modified cache line must
snoop for other processors’ accesses, to guarantee
delivery of the newest data
 Once the snoop detects the access request from
P1, P1 must be prevented from getting ownership
of the data; accomplished by temporarily
preventing P1 bus access
 This bus denial for the sake of preserving data
integrity is called back-off
61
Definitions
Blocking Cache
 Let a cache miss result in streaming-in a line
 If during that stream-in no further accesses
can be made to this cache until the data
transfer is complete, this cache is called
blocking
 Antonym: non-blocking
 Generally, a blocking cache yields lower
performance than a non-blocking
62
Definitions
Bus Master
 Only one of the devices connected to a system bus has the right to send signals across the bus; this ownership is called being the bus master
 Initially Memory & IO Controller (MIOC) is bus master; chipset may include special-purpose bus arbiter
 Over time, all processors –or their caches– may request to become bus master for some number of bus cycles
 The MIOC can grant this right; yet each of the processors pi (more specifically: its cache) can request a back-off for pj, even if otherwise pj would be bus master
63
Definitions
Critical Chunk First
 The number of bytes in a line is generally larger
than the number of bytes that can be brought to
the cache across the bus in 1 step, requiring
multiple bus transfers to fill a line completely
 Would be efficient, if the actually needed bytes
resided in the first chunk brought across the bus
 Deliberate policy that accomplishes just that is the
Critical Chunk First policy
 This allows the cache to be unblocked after the
first transfer, though line is not completely loaded
 Other parts of the line may be used later, but the
critical byte can thus be accessed right away
64
Definitions
Direct Mapped
 If each memory address has just one possible
location (i.e. one single line, K = 1) in the cache
where it could possibly reside, then that cache is
called direct mapped
 Antonym: associative, or fully associative
 Synonym: non-associative
65
Definitions
Directory
 The collection of all tags is referred to as the cache directory
 In addition to the directory and the actual data there may be further overhead bits in a data cache
Dirty Bit
 Dirty bit is a data structure associated with a cache line. This bit expresses whether a write hit has occurred on a system applying write-back
 Synonym: Modified bit
 There may be further overhead bits in a data cache
66
Definitions
Effective Cycle Time teff
 Let the cache hit rate h be the number of hits divided
by the number of all memory accesses, with an ideal
hit rate being 1; m being the miss rate = 1-h; thus:
teff = tcache + (1-h) * tmem = tcache + m * tmem
 Alternatively, the effective cycle time might be
teff = max( tcache, m * tmem )
 The latter holds, if a memory access to retrieve the
data is initiated simultaneously to the cache access
 tcache = time to access a datum in the cache, ideally 1
cycle, while tmem is the time to access a data item in
memory; generally not a constant value
 The hit rate h varies from 0.0 to 1.0
67
Definitions
Exclusive
 State in MESI protocol. The E state indicates that
the current cache is not aware of any other cache
sharing the same information, and that the line is
unmodified
 E allows that in the future another line may
contain a copy of the same information, in which
case the E must transition to another state
 Possible that a higher-level cache (L1 for example
viewed from an L2) may actually have a shared
copy of the line in exclusive state; however that
level of sharing is transparent to other potentially
sharing agents outside the current processor
68
Definitions
Fully Associative Cache
 Possible to not partition cache into sets
 In that case, all lines need to be searched for a
cache hit or miss
 We call this a fully associative cache
 Generally works for small caches, since the
search may become costly in time or HW if the
cache were large
69
Definitions
Hit Rate h
 The hit rate h is the number of memory accesses
(read/writes, or load/stores) that hit the cache,
over the total number of memory accesses
 By contrast H is the total number of hits
 A hit rate h = 1 means: all accesses are from the
cache, while h = 0 means, all are from memory, i.e.
none hit the cache
 Conventional notations are: hr and hw for read and
write hits
 See also miss rate
70
Definitions
Invalid
 State in the MESI protocol
 State I indicates that its cache line is invalid, and
consequently holds no valid data; it is ready for
use
 It is desirable to have I lines: Allows the stream-in
of a paragraph without evicting another cache line
 Invalid (I) state is always set for any cache line
after a system reset
71
Definitions
Line
 Storage area in cache able to hold a copy of a
contiguous block of memory cells, i.e. a paragraph
 The portion of memory stored in that line is
aligned on an address modulo the line size
 For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first
byte has 6 trailing zeros: evenly divisible by 64, it
is 64-byte aligned
 Such known zeros don’t need to be stored in the
tag, the address bits stored in the cache; they are
implied
 This shortens the tag, rendering cache cheaper to
manufacture: less HW bits!
72
Definitions
LLC
 Acronym for Last Level Cache. This is the largest
cache in the memory hierarchy, the one closest to
physical memory, or furthest from the processor
 Typical on multi-core architectures
 Typical cache sizes: 4 MB to 32 MB
 Common to have one LLC be shared between all
cores of an MCP (Multi-Core Processor), but have
option of separating (by fusing) and creating
dedicated LLC caches, with identical total size
73
Definitions
LRU
 Acronym for Least Recently Used
 Cache replacement policy (also page replacement
policy discussed under VMM) that requires aging
information for the lines in a set
 Each time a cache line is accessed, that line
becomes the youngest one touched
 Other lines of the same set do age by one unit, i.e.
get older by 1 event: event is a memory access
 Relative ages are sufficient for LRU tracking; no
need to track exact ages!
 Antonym: last recently used!
74
Definitions
Locality of Data
 A surprising, beneficial attribute of memory access
patterns: when an address is referenced, there is a
good chance that in the near future another access
will happen at or near that same address
 I.e. memory accesses tend to cluster, also
observable in hashing functions and memory page
accesses
 Antonym: Randomly distributed, or normally
distributed
75
Definitions
MESI
 Acronym for Modified, Exclusive, Shared and
Invalid
 This is an ancient protocol to ensure cache
coherence on the family of Pentium processors. A
protocol is necessary, if multiple processors have
copy of common data with right to modify
 Through the MESI protocol data coherence is
ensured no matter which of the processors
performs writes
 AKA the Illinois protocol due to its origin at the
University of Illinois at Urbana-Champaign
76
Definitions
Miss Rate
 Miss rate is the number of memory (read/write)
accesses that miss the cache over total number of
accesses, denoted m
 Clearly the miss rate, like the hit rate, varies
between 0.0 .. 1.0
 The miss rate m = 1 - h
 Antonym: hit rate h
77
Definitions
Modified
 State in MESI protocol
 M state implies that the cache line found by a write
hit was exclusive, and that the current processor
has modified the data
 The modified state expresses: Currently not
shared, exclusively owned data have been
modified
 In a UP system, this is generally expressed by the
dirty bit
78
Definitions
Paragraph
 Conceptual, aligned, fixed-size area of the logical
address space that can be streamed into the
cache
 Holding area in the cache of paragraph-size is
called a line
 In addition to the actual data, a line in cache has
further information, including the dirty and valid
bit (in UP systems), the tag, LRU information, and
in MP systems the MESI bits
 The MESI M state corresponds to the dirty bit in a
UP system
79
Definitions
Replacement Policy
 A replacement policy is a convention that
defines which line is to be retired when a new
line must be loaded but none is free in the set,
so one has to be evicted
 Ideally, the line that would remain unused for the
longest time in the future should be replaced and
its contents overwritten with new data
 Generally we do not know which line will stay
unreferenced for the longest time in the future
 In a direct-mapped cache, the replacement policy
is trivial, it is moot, as there will be just 1 line
80
Definitions
Set
 A logically connected region of memory, to be mapped onto a specific area of cache (line), is a set; there are N sets in memory
 Elements of a set don’t need to be physically contiguous in memory; if contiguous, the leftmost log2(N) bits are known a-priori; if cyclic distribution, then the rightmost log2(N) bits after alignment are known a-priori
 The number of sets is conventionally labeled N
 A degenerate case is to map all memory onto the whole cache, in which case only a single set exists: N = 1; i.e. one set
 Notion of set is meaningful only if there are multiple sets. A memory region belonging to one set can be physically contiguous or distributed cyclically
 In the former case the distribution is called blocked, the latter cyclic. The cache area into which a portion of memory is mapped is also called a set
81
Definitions
Set-Associative
 A cached system in which each set has multiple
cache lines is called set-associative
 For example, 4-way set associative means that
there are multiple sets (could be 4 sets, 256 sets,
1024 sets, or any other number of sets) and each
of those sets has 4 lines
 Integral powers of 2 are good  to use
 That’s what the 4 refers to in a 4-way cache
 Antonym: non-associative, AKA direct-mapped
82
Definitions
Shared
 State in the MESI protocol
 S state expresses that the hit line is present in
more than one cache. Moreover, the current cache
(with the shared state) has not modified the line
after stream-in
 Another cache of the same processor may be
such a sharing agent. For example, in a two level
cache, the L2 cache will hold all data present in
the L1 cache
 Similarly, another processor’s L2 cache may share
data with the current processor’s L2 cache
83
Definitions
Stale Memory
 A valid cache line may be overwritten with new data
 The write-back policy records such overwriting
 At the moment of a cache write with write-back,
cache and memory are out of synch; we say
memory is stale
 Poses no danger, since the dirty bit (or modified bit)
reflects that memory eventually must be updated
 But until this happens, memory is stale
 Note that if two processors’ caches share memory
and one cache renders memory stale, the other
processor should no longer have access to that
portion of shared memory
84
Definitions
Stream-Out
 Streaming out a line refers to the movement of one
line of modified data, out of the cache and back
into a memory paragraph
Stream-In
 The movement of one paragraph of data from
memory into a cache line. Since line length
generally exceeds the bus width (i.e. exceeds the
number of bytes that can be moved in a single bus
transaction), a stream-in process requires multiple
bus transactions in a row
 Possible that the byte actually needed will arrive
last in a cache line during a sequence of bus
transactions; can be avoided with the critical
chunk first policy
85
Definitions
Snooping
 After a line write hit in a cache using write-back, the data in cache and memory are no longer identical. In accordance with the write-back policy, memory will be written eventually, but until then memory is stale
 The modifier (the cache that wrote) must pay attention to other bus masters trying to access the same line. If this is detected, action must be taken to ensure data integrity
 This paying attention is called snooping. The right action may be forcing a back-off, or snarfing, or yet something else that ensures data coherence
 Snooping starts with the lowest-order cache, here the L2 cache. If appropriate, L2 lets L1 snoop for the same address, because L1 may have further modified the line
86
Definitions
Squashing
 Starting with a read-miss:
 In a non-blocking cache, a subsequent memory access may be issued after a read-miss, even if that previous miss results in a stream-in that is still under way
 That subsequent memory access will be a miss again, which is being queued. Whenever an access references an address for which a request is already outstanding, the duplicate request to stream-in can be skipped
 Not entering this in the queue is called squashing
 The second and any further outstanding memory access can be resolved, once the first stream-in results in the line being present in the cache
87
Definitions
Strong Write Order
 A policy ensuring that memory writes occur in the
same order as the store operations in the
executing object code
 Antonym: Weak order
 The advantage of weak ordering can be speed
gain, allowing a compiler or cache policy to
schedule instructions out of order; this requires
some other policy to ensure data integrity
88
Definitions
Stream-In
 The movement of a paragraph from memory into a
cache line
 Since line length generally exceeds the bus width
(i.e. exceeds the number of bytes that can be
moved in a single bus transaction), a stream-in
process requires multiple bus transactions
 It is possible that the byte actually needed arrives
last (or first) in a cache line during a sequence of
bus transactions
 Antonym: Stream-out
89
Definitions
Stream-Out
 The movement of one line of modified data from
cache into a memory paragraph
 Antonym: Stream-in
 Note that unmodified data don’t need to be
streamed-out from cache to memory; they are
already present in memory
90
Definitions
Trace Cache
 Special-purpose cache that holds predecoded instructions, AKA micro-ops
 Advantage: Repeated decoding for
instructions is not needed
 Trace caches have fallen out of favor in the
2000s
91
Definitions
Valid Bit
 Single-bit data structure per cache line, indicating,
whether or not the line is free; free means invalid
 If a line is not valid (i.e. if valid bit is 0), it can be
filled with a new paragraph upon a cache miss
 Else, (valid bit 1), the line holds valid information
 After a system reset, all valid bits of the whole
cache are set to 0
 The I bit in the MESI protocol takes on that role on
an MP cache subsystem
 To be discussed in MP-cache coherence topic
92
Definitions
Weak Write Order
 A memory-write policy allowing (a compiler or
cache) that memory writes may occur in a
different order than their originating store
operations
 Antonym: Strong Write Order
 The advantage of weak ordering is potential speed
gain
93
Definitions
Write-Back
 Cache write policy that keeps a line of data (a
paragraph) in the cache even after a write, i.e. after
a modification
 The changed state must be remembered via the
dirty bit, AKA Modified state, or modified bit
 Memory is temporarily stale in such a case
 Upon retirement, any dirty line must be copied
back into memory; called write-back
 Advantage: only one stream-out, no matter how
many write hits did occur to that same line
94
Definitions
Write-By
 Cache write policy, in which the cache is not
accessed on a write miss, even if there are cache
lines in I state
 A cache using write-by “hopes” that soon there
may be a load, which will result in a miss and then
stream-in the appropriate line; if not, it was not
necessary to stream-in the line in the first place
 Antonym: allocate-on-write
95
Definitions
Write-Once
 Cache write policy that starts out as write-through
and changes to write-back after the first write hit to
a line
 Typical policy imposed onto a higher level L1 cache
by the L2 cache
 Advantage: The L1 cache places no unnecessary
traffic onto the system bus upon a cache-write hit
 Lower level L2 cache can remember that a write
has occurred by setting the MESI state to modified
96
Definitions
Write-Through
 Cache write policy that writes data to memory
upon a write hit. Thus, cache and main memory
are in synch
 Disadvantage: repeated memory access traffic on
the bus
97
Bibliography –Other Topics
1. Don Anderson and Shanley, T., MindShare [1995]. Pentium™
Processor System Architecture, Addison-Wesley Publishing
Company, Reading MA, PC System Architecture Series. ISBN 0-201-40992-5
2. Pentium Pro Developer’s Manual, Volume 1: Specifications,
1996, one of a set of 3 volumes
3. Pentium Pro Developer’s Manual, Volume 2: Programmer's
Reference Manual, Intel document, 1996, one of a set of 3
volumes
4. Pentium Pro Developer’s Manual, Volume 3: Operating Systems
Writer’s Manual, Intel document, 1996, one of a set of 3
volumes
5. Y. Sheffer:
http://webee.technion.ac.il/courses/044800/lectures/MESI.pdf
6. MOESI protocol: http://en.wikipedia.org/wiki/MOESI_protocol
7. MESIF protocol: http://en.wikipedia.org/wiki/MESIF_protocol
98