Data Locality
CS 524 – High-Performance Computing
Motivating Example: Vector Multiplication
1 GHz processor (1 ns clock), 4 instructions per cycle,
2 multiply-add units (4 FPUs)
100 ns memory latency, no cache, 1 word transfer size
Peak performance: 4 / 1 ns = 4 GFLOPS
Performance on vector multiply
Each iteration fetches two words (2 * 100 ns), so
every 200 ns only 2 floating-point operations occur
Performance = 2 FLOPs / 200 ns = 10 MFLOPS
CS 524 (Au 05-06)- Asim Karim @ LUMS
Motivating Example: Vector Multiplication
Assume there is a 32 K cache with a 1 ns latency, 1-word
cache lines, and an optimal replacement policy
Multiply two 1 K vectors
No. of floating point operations = 2 K (2n: n = size of vector)
Latency to fill cache = 2 K * 100 ns = 200 microsec
Time to execute = 2 K / 4 * 1 ns = 0.5 microsec
Performance = 2 K / 200.5 microsec = about 10 MFLOPS
No increase in performance! Why?
Motivating Example: Matrix-Matrix Multiply
Assume the same memory system characteristics as on the
previous two slides
Multiply two 32 x 32 matrices
No. of floating point operations = 64 K (2n³: n = dim of matrix)
Latency to fill cache = 2 K * 100 ns = 200 microsec
Time to execute = 64 K / 4 * 1 ns = 16 microsec
Performance = 64 K / 216 microsec = about 300 MFLOPS
Repeated use of data in cache helps improve
performance: temporal locality
Motivating Example: Vector Multiply
Assume the same memory system characteristics as on the
first slide (no cache)
Assume 4 word transfer size (cache line size)
Multiply two vectors
Each access brings in four words of a vector
In 200 ns, four words of each vector are brought in
8 floating point operations can be performed in 200 ns
Performance: 8 / 200 ns = 40 MFLOPS
Performance increased by the effective increase in memory
bandwidth
Spatial locality
Locality of Reference
Application data space is often too large to fit entirely
into cache
Key is locality of reference
Types of locality
Register locality
Cache-temporal locality
Cache-spatial locality
Attempt (in order of preference) to:
Keep all data in registers
Order algorithms to do all ops on a data element before it is
overwritten in cache
Order algorithms such that successive ops are on physically
contiguous data
Single Processor Performance Enhancement
Two fundamental issues
Adequate op-level parallelism (to keep superscalar, pipelined
processor units busy)
Minimize memory access costs (locality in cache)
Three useful techniques to improve locality of
reference (loop transformations)
Loop permutation
Loop blocking (tiling)
Loop unrolling
Access Stride and Spatial Locality
Access stride: separation between successively
accessed memory locations
Unit access stride maximizes spatial locality (only one miss
per cache line)
2-D arrays have different linearized representations in
C and FORTRAN
FORTRAN’s column-major representation favors column-wise access of data
C’s representation favors row-wise access of data
Matrix:
a b c
d e f
g h i
Row major order (C): a b c d e f g h i
Column major order (FORTRAN): a d g b e h c f i
Matrix-Vector Multiply (MVM)
do i = 1, N
do j = 1, N
y(i) = y(i)+A(i,j)*x(j)
end do
end do
Total no. of references = 4N²
Two questions:
Can we predict the miss ratio of different variations of this
program for different cache models?
What transformations can we do to improve performance?
Cache Model
Fully associative cache (so no conflict misses)
LRU replacement algorithm (ensures temporal locality)
Cache size
Large cache model: no capacity misses
Small cache model: misses depending on problem size
Cache block size
1 floating-point number
b floating-point numbers
MVM: Dot-Product Form (Scenario 1)
Loop order: i-j
Cache model
Small cache: assume cache can hold fewer than (2N+1)
numbers
Cache block size: 1 floating-point number
Cache misses
Matrix A: N² misses (cold misses)
Vector x: N cold misses + N(N-1) capacity misses
Vector y: N cold misses
Miss ratio = (2N² + N)/4N² → 0.5
MVM: Dot-Product Form (Scenario 2)
Cache model
Large cache: cache can hold more than (2N+1) numbers
Cache block size: 1 floating-point number
Cache misses
Matrix A: N² cold misses
Vector x: N cold misses
Vector y: N cold misses
Miss ratio = (N² + 2N)/4N² → 0.25
MVM: Dot-Product Form (Scenario 3)
Cache block size: b floating-point numbers
Matrix storage order: row-major (C)
Cache misses (small cache)
Matrix A: N²/b cold misses
Vector x: N/b cold misses + N(N-1)/b capacity misses
Vector y: N misses (its cache line is evicted between uses)
Miss ratio = (1/2b + 1/4N) → 1/2b
Cache misses (large cache)
Matrix A: N²/b cold misses
Vector x: N/b cold misses
Vector y: N/b cold misses
Miss ratio = (1/4 + 1/2N)*(1/b) → 1/4b
MVM: SAXPY Form
do j = 1, N
do i = 1, N
y(i) = y(i) +A(i,j)*x(j)
end do
end do
Total no. of references = 4N²
MVM: SAXPY Form (Scenario 4)
Cache block size: b floating-point numbers
Matrix storage order: row-major (C)
Cache misses (small cache)
Matrix A: N² cold misses
Vector x: N misses
Vector y: N/b cold misses + N(N-1)/b capacity misses
Miss ratio = 0.25(1 + 1/b) + 1/4N → 0.25(1 + 1/b)
Cache misses (large cache)
Matrix A: N² cold misses
Vector x: N/b cold misses
Vector y: N/b cold misses
Miss ratio = 1/4 + 1/2Nb → 1/4
Loop Blocking (Tiling)
Loop blocking enhances temporal locality by ensuring
that data remain in cache until all operations are
performed on them
Concept
Divide or tile the loop computation into chunks or blocks that
fit into cache
Perform all operations on a block before moving to the next
block
Procedure
Add outer loop(s) to tile inner loop computation
Select the block size so that you have a large cache model
(no capacity misses, temporal locality maximized)
MVM: Blocked
do bi = 1, N/B
  do bj = 1, N/B
    do i = (bi-1)*B+1, bi*B
      do j = (bj-1)*B+1, bj*B
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
  end do
end do
MVM: Blocked (Scenario 5)
Cache size greater than 2B + 1 floating-point numbers
Cache block size: b floating-point numbers
Matrix storage order: row-major (C)
Cache misses per block
Matrix A: B²/b cold misses
Vector x: B/b
Vector y: B/b
Miss ratio = (1/4 + 1/2B)*(1/b) → 1/4b
Lesson: proper loop blocking eliminates capacity
misses. Performance essentially becomes independent
of cache size (as long as cache size is > 2B + 1 fp
numbers)
Access Stride Analysis of MVM
c Dot-product form
do i = 1, N
do j = 1, N
y(i) = y(i) + A(i,j)*x(j)
end do
end do
c SAXPY form
do j = 1, N
do i = 1, N
y(i) = y(i) + A(i,j)*x(j)
end do
end do
Access stride for arrays:

                   C    Fortran
Dot-product form
  A                1    N
  x                1    1
  y                0    0
SAXPY form
  A                N    1
  x                0    0
  y                1    1
Loop Permutation
Changing the order of nested loops can improve spatial
locality
Choose the loop ordering that minimizes miss ratio, or
Choose the loop ordering that minimizes array access strides
Loop permutation issues
Validity: Loop reordering should not change the meaning of
the code
Performance: Does the reordering improve performance?
Array bounds: What should be the new array bounds?
Procedure: Is there a mechanical way to perform loop
permutation?
These are of primary concern in compiler design. We will
discuss some of these issues when we cover dependences,
from an application software developer’s perspective.
Matrix-Matrix Multiply (MMM)
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[k][i]*B[k][j]; } }

Access strides (C compiler: row-major) + Performance (MFLOPS)

          ijk   jik   ikj   kij   jki   kji
C(i,j)     0     0     1     1     N     N
A(k,i)     N     N     0     0     1     1
B(k,j)     N     N     1     1     0     0
P4 1.6    48    45    77    77     8     8
P3 800    45    48    55    37    10    10
suraj    3.6   3.6   4.4   4.3   4.0   4.0
Loop Unrolling
Reduce number of iterations of loop but add
statement(s) to loop body to do work of missing
iterations
Increase amount of operation-level parallelism in loop
body
Outer-loop unrolling changes the order of access of
memory elements and could reduce number of memory
accesses and cache misses
do i = 1, 2n
  do j = 1, m
    loop-body(i,j)
  end do
end do

do i = 1, 2n, 2
  do j = 1, m
    loop-body(i,j)
    loop-body(i+1,j)
  end do
end do
MVM: Dot-Product 4-Outer Unrolled Form
/* Assume n is a multiple of 4 */
for (i = 0; i < n; i+=4) {
for (j = 0; j < n; j++) {
y[i] = y[i] + A[i][j]*x[j];
y[i+1] = y[i+1] + A[i+1][j]*x[j];
y[i+2] = y[i+2] + A[i+2][j]*x[j];
y[i+3] = y[i+3] + A[i+3][j]*x[j];
} }
Performance (MFLOPS)

          32 x 32   1024 x 1024
P4 1.6       23.0           8.5
P3 800       73.1           6.6
suraj        38.0          39.2
MVM: SAXPY 4-Outer Unrolled Form
/* Assume n is a multiple of 4 */
for (j = 0; j < n; j+=4) {
for (i = 0; i < n; i++) {
y[i] = y[i] + A[i][j]*x[j]
+ A[i][j+1]*x[j+1]
+ A[i][j+2]*x[j+2]
+ A[i][j+3]*x[j+3];
} }
Performance (MFLOPS)

          32 x 32   1024 x 1024
P4 1.6        5.0           2.1
P3 800       20.5           3.5
suraj         9.8           8.6
Summary
Loop permutation
Enhances spatial locality
C and FORTRAN compilers have different access patterns
for matrices, resulting in different performance
Performance increases when the inner loop index has the
minimum access stride
Loop blocking
Enhances temporal locality
Ensure that 2B is less than the cache size
Loop unrolling
Enhances superscalar, pipelined operations