Cache Memories
Oct. 10, 2002
15-213: “The course that gives CMU its Zip!”
class14.ppt

Topics
- Generic cache memory organization
- Direct mapped caches
- Set associative caches
- Impact of caches on performance

Cache Memories

Cache memories are small, fast SRAM-based memories managed
automatically in hardware.
- Hold frequently accessed blocks of main memory

The CPU looks first for data in L1, then in L2, then in main memory.

Typical bus structure:
[Figure: the CPU chip contains the register file, the ALU, and the L1
cache. A cache bus connects the L1 cache to the L2 cache. The bus
interface connects over the system bus to an I/O bridge, which
connects over the memory bus to main memory.]
Inserting an L1 Cache Between the CPU and Main Memory

- The tiny, very fast CPU register file has room for four 4-byte
  words.
- The transfer unit between the CPU register file and the cache is a
  4-byte block.
- The small fast L1 cache has room for two 4-word blocks.
- The transfer unit between the cache and main memory is a 4-word
  block (16 bytes).
- The big slow main memory has room for many 4-word blocks.
[Figure: the register file and the two cache lines (line 0, line 1)
each hold blocks; main memory holds many numbered blocks, e.g.
block 10, block 21, block 30, with example contents abcd, pqrs,
wxyz.]

General Org of a Cache Memory

- The cache is an array of sets: S = 2^s sets.
- Each set contains one or more lines: E lines per set.
- Each line holds a block of data: B = 2^b bytes per cache block.
- Each line also carries 1 valid bit and t tag bits.
[Figure: sets 0 through S-1, each containing E lines; each line
consists of a valid bit, a tag, and data bytes 0 through B-1.]

Cache size: C = B x E x S data bytes. For example, S = 1,024 sets
with E = 1 line per set and B = 32 bytes per block gives C = 32 KB.
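For concreteness, the organization parameters map naturally onto C
declarations. This is a minimal sketch with assumed names and example
values, not code from the lecture:

#define B 2                    /* B = 2^b bytes per cache block (example) */

typedef unsigned long addr_t;  /* an m-bit address */

typedef struct {
    int valid;                 /* 1 valid bit per line */
    addr_t tag;                /* t tag bits per line  */
    unsigned char data[B];     /* one block of data    */
} cache_line_t;

/* A cache is then an S x E array of such lines, holding
   C = B * E * S data bytes in total. */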
Addressing Caches

Address A (m bits): <tag: t bits> <set index: s bits> <block offset:
b bits>, with the tag in the high-order bits (m-1 downward), the set
index below it, and the block offset in the low b bits.

- The word at address A is in the cache if the tag bits in one of the
  <valid> lines in set <set index> match <tag>.
- The word contents begin at offset <block offset> bytes from the
  beginning of the block.
[Figure: sets 0 through S-1, each line showing a valid bit, a tag,
and bytes 0 through B-1.]

Direct-Mapped Cache

- Simplest kind of cache
- Characterized by exactly one line per set (E = 1 line per set)
[Figure: sets 0 through S-1, each holding a single line: valid bit,
tag, cache block.]
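The address decomposition can be written directly as bit operations;
a minimal sketch, assuming the addr_t type from the earlier sketch
and the slide's s and b parameters:

addr_t block_offset(addr_t A, int b)     { return A & ((1UL << b) - 1); }
addr_t set_index(addr_t A, int s, int b) { return (A >> b) & ((1UL << s) - 1); }
addr_t tag_bits(addr_t A, int s, int b)  { return A >> (s + b); }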
Accessing Direct-Mapped Caches

Set selection
- Use the set index bits to determine the set of interest.
[Figure: the s set-index bits of the address (e.g. 00001) select one
of sets 0 through S-1.]

Line matching and word selection
- Line matching: find a valid line in the selected set with a
  matching tag.
- Word selection: then extract the word.

(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the
    address.
(3) If (1) and (2), then cache hit, and the block offset selects the
    starting byte.
[Figure: selected set i holds valid = 1, tag = 0110, and bytes 0-7
containing words w0-w3; the address has tag = 0110, set index = i,
and block offset = 100, so the access hits and selects the word
starting at byte 4 (w2).]
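Steps (1)-(3) translate into a short lookup routine; a sketch under
the assumptions of the earlier sketches (direct-mapped, so each set
is a single cache_line_t):

#define S 4                    /* S = 2^s sets (example value) */

cache_line_t cache[S];         /* direct-mapped: E = 1 line per set */

/* Return a pointer to the addressed byte on a hit, NULL on a miss. */
unsigned char *lookup(addr_t A, int s, int b)
{
    cache_line_t *line = &cache[set_index(A, s, b)];  /* set selection     */
    if (line->valid &&                                /* (1) valid bit set */
        line->tag == tag_bits(A, s, b))               /* (2) tags match    */
        return &line->data[block_offset(A, b)];       /* (3) hit: offset   */
    return NULL;                                      /* miss              */
}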
Direct-Mapped Cache Simulation

M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1
entry/set. Addresses split as <t=1 tag bit> <s=2 set-index bits>
<b=1 offset bit>.

Address trace (reads, addresses shown in binary):
  0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

(1) 0 [0000] (miss): set 00 loads block M[0-1] (tag 0).
(2) 1 [0001] (hit): set 00 already holds M[0-1] (tag 0).
(3) 13 [1101] (miss): set 10 loads block M[12-13] (tag 1).
(4) 8 [1000] (miss): set 00 loads block M[8-9] (tag 1), evicting M[0-1].
(5) 0 [0000] (miss): set 00 reloads block M[0-1] (tag 0), evicting M[8-9].

Why Use Middle Bits as Index?

High-Order Bit Indexing
- Adjacent memory lines would map to the same cache entry
- Poor use of spatial locality

Middle-Order Bit Indexing
- Consecutive memory lines map to different cache lines
- Can hold a C-byte region of the address space in the cache at one
  time
[Figure: a 4-line cache beside the 16 memory lines 0000-1111. With
high-order bit indexing, a run of adjacent lines (e.g. 0000-0011)
all map to the same cache line; with middle-order bit indexing,
consecutive lines cycle through cache lines 0, 1, 2, 3.]
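The simulated trace can be replayed with the lookup() sketch from the
previous pages, using s = 2 and b = 1 so that the middle two address
bits select the set. A hypothetical driver, not from the lecture:

#include <stdio.h>

int main(void)
{
    addr_t trace[] = {0, 1, 13, 8, 0};     /* the read trace above */
    for (int i = 0; i < 5; i++) {
        addr_t A = trace[i];
        if (lookup(A, 2, 1)) {
            printf("%2lu hit\n", A);
        } else {                           /* miss: install the block */
            cache_line_t *line = &cache[set_index(A, 2, 1)];
            line->valid = 1;
            line->tag   = tag_bits(A, 2, 1);   /* (data fill omitted) */
            printf("%2lu miss\n", A);
        }
    }
    return 0;   /* prints: miss, hit, miss, miss, miss */
}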
Set Associative Caches

- Characterized by more than one line per set (here E = 2 lines per
  set).
[Figure: sets 0 through S-1, each containing two lines; each line has
a valid bit, a tag, and a cache block.]

Accessing Set Associative Caches

Set selection
- Identical to a direct-mapped cache: the set index bits determine
  the set of interest.
[Figure: the s set-index bits of the address (e.g. 00001) select one
of sets 0 through S-1.]
Accessing Set Associative Caches

Line matching and word selection
- Must compare the tag in each valid line in the selected set.

(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits
    in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the
    starting byte.
[Figure: selected set i holds two valid lines, one with tag 1001 and
one with tag 0110 whose bytes 0-7 contain words w0-w3; the address
has tag = 0110, set index = i, and block offset = 100, so the second
line hits.]
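For E > 1 the only change is that the tag comparison scans every line
in the selected set (real hardware compares the E tags in parallel).
A sketch under the same assumptions as the earlier sketches:

#define E 2                         /* E lines per set (example value) */

cache_line_t sets[S][E];            /* E-way set associative cache */

unsigned char *sa_lookup(addr_t A, int s, int b)
{
    cache_line_t *set = sets[set_index(A, s, b)];     /* set selection   */
    for (int e = 0; e < E; e++)                       /* check each line */
        if (set[e].valid && set[e].tag == tag_bits(A, s, b))
            return &set[e].data[block_offset(A, b)];  /* hit             */
    return NULL;                                      /* miss            */
}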
Multi-Level Caches

Options: separate data and instruction caches, or a unified cache.

[Figure: the processor (regs, L1 d-cache, L1 i-cache) connects to a
unified L2 cache, then to memory, then to disk. Moving away from the
processor, each level is larger, slower, and cheaper:]

             size          speed   $/Mbyte    line size
  Regs       200 B         3 ns               8 B
  L1 caches  8-64 KB       3 ns               32 B
  L2 cache   1-4 MB SRAM   6 ns    $100/MB    32 B
  Memory     128 MB DRAM   60 ns   $1.50/MB   8 KB
  disk       30 GB         8 ms    $0.05/MB
Intel Pentium Cache Hierarchy

Processor chip:
- Regs.
- L1 Data: 16 KB, 4-way assoc, write-through, 32 B lines, 1 cycle
  latency
- L1 Instruction: 16 KB, 4-way assoc, 32 B lines
Off chip:
- L2 Unified: 128 KB - 2 MB, 4-way assoc, write-back, write
  allocate, 32 B lines
- Main Memory: up to 4 GB

Cache Performance Metrics

Miss Rate
- Fraction of memory references not found in the cache
  (misses/references)
- Typical numbers:
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
- Time to deliver a line in the cache to the processor (includes the
  time to determine whether the line is in the cache)
- Typical numbers:
  - 1 clock cycle for L1
  - 3-8 clock cycles for L2

Miss Penalty
- Additional time required because of a miss
  - Typically 25-100 cycles for main memory
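These three metrics combine in the standard average-access-time
identity (a textbook formula, not stated explicitly on the slide):

  average access time = hit time + miss rate x miss penalty

For example, a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss
penalty give 1 + 0.05 x 50 = 3.5 cycles per access on average.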
Writing Cache Friendly Code

- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)

Examples (cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%: the stride-1 scan misses once per 4-word
block, then hits on the next three words.

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%: successive accesses are a full row (stride N)
apart, so every reference touches a different block and misses.
The Memory Mountain

- Read throughput (read bandwidth): the number of bytes read from
  memory per second (MB/s)
- Memory mountain
  - Measured read throughput as a function of spatial and temporal
    locality.
  - Compact way to characterize memory system performance.

Memory Mountain Test Function

/* The test function */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
/* fcyc2(), mhz(), and init_data() are supplied by the course's
   measurement code: fcyc2() times a call to test() in CPU cycles,
   and mhz() estimates the clock rate. */
#include <stdio.h>
#include <stdlib.h>

#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];          /* The array we'll be traversing */

int main()
{
    int size;    /* Working set size (in bytes) */
    int stride;  /* Stride (in array elements) */
    double Mhz;  /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0);              /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}
The Memory Mountain

Measured machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1
d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache.

[Figure: 3D surface of read throughput (MB/s, 0-1200) as a function
of stride (s1-s15, in words) and working set size (1 KB to 8 MB).
Ridges of temporal locality run along the working-set axis, marking
the L1, L2, and main memory regions; slopes of spatial locality run
along the stride axis.]

Ridges of Temporal Locality

Slice through the memory mountain with stride = 1
- illuminates the read throughputs of the different caches and of
  main memory
[Figure: read throughput (MB/s, 0-1200) vs. working set size (1 KB
to 8 MB), showing distinct L1 cache, L2 cache, and main memory
regions.]
A Slope of Spatial Locality

Slice through the memory mountain with size = 256 KB
- shows the cache block size: throughput falls as the stride grows,
  until there is one access per cache line.
[Figure: read throughput (MB/s, 0-800) vs. stride (s1-s16, in
words).]

Matrix Multiplication Example

Major Cache Effects to Consider
- Total cache size
  - Exploit temporal locality and keep the working set small (e.g.,
    by using blocking)
- Block size
  - Exploit spatial locality

Description:
- Multiply N x N matrices
- O(N^3) total operations
- Accesses
  - N reads per source element
  - N values summed per destination
    - but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;                     /* Variable sum held in register */
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32 B (big enough for 4 64-bit words)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis Method:
- Look at the access pattern of the inner loop
[Figure: A is scanned along row i (index k), B is scanned down
column j (index k), and C is accessed at the fixed element (i,j).]

Layout of C Arrays in Memory (review)

C arrays are allocated in row-major order
- each row in contiguous memory locations

Stepping through columns in one row:
  for (i = 0; i < N; i++)
      sum += a[0][i];
- accesses successive elements
- if block size (B) > 4 bytes, exploit spatial locality
  - compulsory miss rate = 4 bytes / B

Stepping through rows in one column:
  for (i = 0; i < n; i++)
      sum += a[i][0];
- accesses distant elements
- no spatial locality!
  - compulsory miss rate = 1 (i.e., 100%)
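The row-major rule is just address arithmetic; a one-line
illustration (standard C semantics; the helper name is hypothetical
and N must be a known dimension):

/* a[i][j] sits (i*N + j) elements past a[0][0], so j loops walk
   memory with stride 1 and i loops with stride N. */
int *addr_of(int a[][N], int i, int j) { return &a[i][j]; }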
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.

Misses per Inner Loop Iteration:
    A      B      C
    0.25   1.0    0.0

(a is read along a row: with 4 words per 32-B line, one miss every 4
iterations = 0.25; b is read down a column: every access misses =
1.0; c[i][j] is held in sum throughout the inner loop = 0.0.)

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.

Misses per Inner Loop Iteration:
    A      B      C
    0.25   1.0    0.0
Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.

Misses per Inner Loop Iteration:
    A      B      C
    0.0    0.25   0.25

(b and c are both scanned row-wise: 0.25 each; a[i][k] is held in r:
0.0.)

Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.

Misses per Inner Loop Iteration:
    A      B      C
    0.0    0.25   0.25

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.

Misses per Inner Loop Iteration:
    A      B      C
    1.0    0.0    1.0

(a and c are both scanned column-wise: a miss on every access = 1.0
each; b[k][j] is held in r: 0.0.)

Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.

Misses per Inner Loop Iteration:
    A      B      C
    1.0    0.0    1.0
Summary of Matrix Multiplication

ijk (& jik):
- 2 loads, 0 stores
- misses/iter = 1.25

  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

kij (& ikj):
- 2 loads, 1 store
- misses/iter = 0.5

  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

jki (& kji):
- 2 loads, 1 store
- misses/iter = 2.0

  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Pentium Matrix Multiply Performance

- Miss rates are helpful but not perfect predictors.
  - Code scheduling matters, too.
[Figure: cycles/iteration (0-60) vs. array size (n) for the six
versions kji, jki, kij, ikj, jik, ijk.]
Improving Temporal Locality by Blocking

Example: Blocked matrix multiplication
- “block” (in this context) does not mean “cache block”.
- Instead, it means a sub-block within the matrix.
- Example: N = 8; sub-block size = 4

  | A11 A12 |   | B11 B12 |   | C11 C12 |
  | A21 A22 | X | B21 B22 | = | C21 C22 |

  C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
  C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
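The blocked code assumes a min() helper, which standard C does not
provide for ints; one conventional definition (an assumption, not
shown on the slide):

#define min(x, y) ((x) < (y) ? (x) : (y))  /* note: evaluates an argument twice */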
Blocked Matrix Multiply Analysis

- The innermost loop pair multiplies a 1 x bsize sliver of A by a
  bsize x bsize block of B and accumulates into a 1 x bsize sliver
  of C.
- The loop over i steps through n row slivers of A & C, using the
  same block of B.

  Innermost loop pair:
  for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
      sum = 0.0;
      for (k=kk; k < min(kk+bsize,n); k++) {
        sum += a[i][k] * b[k][j];
      }
      c[i][j] += sum;
    }
  }

[Figure: the row sliver of A is accessed bsize times; the block of B
is reused n times in succession; successive elements of the C sliver
are updated.]

Pentium Blocked Matrix Multiply Performance

- Blocking (bijk and bikj) improves performance by a factor of two
  over the unblocked versions (ijk and jik).
- Performance is relatively insensitive to array size.
[Figure: cycles/iteration (0-60) vs. array size (n = 25 to 400) for
kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize =
25).]

Concluding Observations

Programmer can optimize for cache performance
- How data structures are organized
- How data are accessed
  - Nested loop structure
  - Blocking is a general technique

All systems favor “cache friendly code”
- Getting absolute optimum performance is very platform specific
  - Cache sizes, line sizes, associativities, etc.
- Can get most of the advantage with generic code
  - Keep the working set reasonably small (temporal locality)
  - Use small strides (spatial locality)