CS 524 – High-Performance Computing

Data Locality
Motivating Example: Vector Multiplication

- 1 GHz processor (1 ns clock), 4 instructions per cycle, 2 multiply-add units (4 FPUs)
- 100 ns memory latency, no cache, 1 word transfer size
- Peak performance: 4 flops / 1 ns = 4 GFLOPS
- Performance on matrix-vector multiply:
  - Every 200 ns, 2 floating-point operations occur (each multiply-add must first fetch two words at 100 ns each)
  - Performance = 1 flop / 100 ns = 10 MFLOPS
CS 524 (Au 05-06)- Asim Karim @ LUMS
Motivating Example: Vector Multiplication

- Assume there is a 32 K cache with a 1 ns latency, 1-word cache lines, and an optimal replacement policy
- Multiply two 1 K vectors
  - No. of floating-point operations = 2 K (2n: n = size of vector)
  - Latency to fill cache = 2 K * 100 ns = 200 microsec
  - Time to execute = 2 K / 4 * 1 ns = 0.5 microsec
  - Performance = 2 K flops / 200.5 microsec ≈ 10 MFLOPS
- No increase in performance! Why?
Motivating Example: Matrix-Matrix Multiply

- Assume the same memory system characteristics as on the previous two slides
- Multiply two 32 x 32 matrices
  - No. of floating-point operations = 64 K (2n^3: n = dim of matrix)
  - Latency to fill cache = 2 K * 100 ns = 200 microsec
  - Time to execute = 64 K / 4 * 1 ns = 16 microsec
  - Performance = 64 K flops / 216 microsec ≈ 300 MFLOPS
- Repeated use of data in cache helps improve performance: temporal locality
Motivating Example: Vector Multiply

- Assume the same memory system characteristics as on the first slide (no cache)
- Assume a 4-word transfer size (cache line size)
- Multiply two vectors
  - Each access brings in four words of a vector
  - In 200 ns, four words of each vector are brought in
  - 8 floating-point operations can be performed in 200 ns
  - Performance: 8 flops / 200 ns = 40 MFLOPS
- Performance increased by the effective increase in memory bandwidth: spatial locality
Locality of Reference

- Application data space is often too large to all fit into cache
- Key is locality of reference
- Types of locality
  - Register locality
  - Cache-temporal locality
  - Cache-spatial locality
- Attempt (in order of preference) to:
  - Keep all data in registers
  - Order algorithms to do all ops on a data element before it is overwritten in cache
  - Order algorithms such that successive ops are on physically contiguous data
Single Processor Performance Enhancement

- Two fundamental issues
  - Adequate op-level parallelism (to keep superscalar, pipelined processor units busy)
  - Minimizing memory access costs (locality in cache)
- Three useful techniques to improve locality of reference (loop transformations)
  - Loop permutation
  - Loop blocking (tiling)
  - Loop unrolling
Access Stride and Spatial Locality

- Access stride: separation between successively accessed memory locations
  - Unit access stride maximizes spatial locality (only one miss per cache line)
- 2-D arrays have different linearized representations in C and FORTRAN
  - FORTRAN's column-major representation favors column-wise access of data
  - C's representation favors row-wise access of data

For the matrix

  a b c
  d e f
  g h i

row-major order (C) stores a b c d e f g h i; column-major order (FORTRAN) stores a d g b e h c f i.
Matrix-Vector Multiply (MVM)

do i = 1, N
  do j = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

(Figure: A traversed row by row; x indexed by j, y indexed by i)

- Total no. of references = 4N^2 (each of the N^2 iterations reads y, A, and x, and writes y)
- Two questions:
  - Can we predict the miss ratio of different variations of this program for different cache models?
  - What transformations can we do to improve performance?
Cache Model

- Fully associative cache (so no conflict misses)
- LRU replacement algorithm (ensures temporal locality)
- Cache size
  - Large cache model: no capacity misses
  - Small cache model: misses depend on problem size
- Cache block size
  - 1 floating-point number
  - b floating-point numbers
MVM: Dot-Product Form (Scenario 1)

- Loop order: i-j
- Cache model
  - Small cache: assume cache can hold fewer than (2N + 1) numbers
  - Cache block size: 1 floating-point number
- Cache misses
  - Matrix A: N^2 misses (cold misses)
  - Vector x: N cold misses + N(N-1) capacity misses
  - Vector y: N cold misses
- Miss ratio = (2N^2 + N)/4N^2 → 0.5
MVM: Dot-Product Form (Scenario 2)

- Cache model
  - Large cache: cache can hold more than (2N + 1) numbers
  - Cache block size: 1 floating-point number
- Cache misses
  - Matrix A: N^2 cold misses
  - Vector x: N cold misses
  - Vector y: N cold misses
- Miss ratio = (N^2 + 2N)/4N^2 → 0.25
MVM: Dot-Product Form (Scenario 3)

- Cache block size: b floating-point numbers
- Matrix access order: row-major access (C)
- Cache misses (small cache)
  - Matrix A: N^2/b cold misses
  - Vector x: N/b cold misses + N(N-1)/b capacity misses
  - Vector y: N cold misses
  - Miss ratio = (1/2b + 1/4N) → 1/2b
- Cache misses (large cache)
  - Matrix A: N^2/b cold misses
  - Vector x: N/b cold misses
  - Vector y: N/b cold misses
  - Miss ratio = (1/4 + 1/2N) * (1/b) → 1/4b
MVM: SAXPY Form

do j = 1, N
  do i = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

(Figure: A traversed column by column; x indexed by j, y indexed by i)

- Total no. of references = 4N^2
MVM: SAXPY Form (Scenario 4)

- Cache block size: b floating-point numbers
- Matrix access order: row-major access (C)
- Cache misses (small cache)
  - Matrix A: N^2 cold misses
  - Vector x: N cold misses
  - Vector y: N/b cold misses + N(N-1)/b capacity misses
  - Miss ratio = 0.25(1 + 1/b) + 1/4N → 0.25(1 + 1/b)
- Cache misses (large cache)
  - Matrix A: N^2 cold misses
  - Vector x: N/b cold misses
  - Vector y: N/b cold misses
  - Miss ratio = 1/4 + 1/2Nb → 1/4
Loop Blocking (Tiling)

- Loop blocking enhances temporal locality by ensuring that data remain in cache until all operations are performed on them
- Concept
  - Divide or tile the loop computation into chunks or blocks that fit into cache
  - Perform all operations on a block before moving to the next block
- Procedure
  - Add outer loop(s) to tile the inner loop computation
  - Select the block size so that you have a large cache model (no capacity misses, temporal locality maximized)
MVM: Blocked

do bi = 1, N/B
  do bj = 1, N/B
    do i = (bi-1)*B+1, bi*B
      do j = (bj-1)*B+1, bj*B
        y(i) = y(i) + A(i, j)*x(j)
      end do
    end do
  end do
end do

(Figure: a B x B tile of A, with the corresponding B-element sections of x and y)
MVM: Blocked (Scenario 5)

- Cache size greater than 2B + 1 floating-point numbers
- Cache block size: b floating-point numbers
- Matrix access order: row-major access (C)
- Cache misses per block
  - Matrix A: B^2/b cold misses
  - Vector x: B/b
  - Vector y: B/b
  - Miss ratio = (1/4 + 1/2B) * (1/b) → 1/4b
- Lesson: proper loop blocking eliminates capacity misses. Performance essentially becomes independent of cache size (as long as cache size is > 2B + 1 fp numbers)
Access Stride Analysis of MVM

c Dot-product form
do i = 1, N
  do j = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

c SAXPY form
do j = 1, N
  do i = 1, N
    y(i) = y(i) + A(i,j)*x(j)
  end do
end do

Access stride for arrays (inner-loop stride):

                     C    Fortran
  Dot-product form
    A                1    N
    x                1    1
    y                0    0
  SAXPY form
    A                N    1
    x                0    0
    y                1    1
Loop Permutation

- Changing the order of nested loops can improve spatial locality
  - Choose the loop ordering that minimizes the miss ratio, or
  - Choose the loop ordering that minimizes array access strides
- Loop permutation issues
  - Validity: loop reordering should not change the meaning of the code
  - Performance: does the reordering improve performance?
  - Array bounds: what should the new array bounds be?
  - Procedure: is there a mechanical way to perform loop permutation?
- These issues are of primary concern in compiler design. We will discuss some of them, from an application software developer's perspective, when we cover dependences.
Matrix-Matrix Multiply (MMM)

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[k][i]*B[k][j]; } }

Access strides (C compiler: row-major) + Performance (MFLOPS):

            ijk   jik   ikj   kij   jki   kji
  C(i,j)     0     0     1     1     N     N
  A(k,i)     N     N     0     0     1     1
  B(k,j)     N     N     1     1     0     0
  P4 1.6    48    45    77    77     8     8
  P3 800    45    48    55    37    10    10
  suraj     3.6   3.6   4.4   4.3   4.0   4.0
Loop Unrolling

- Reduce the number of iterations of a loop, but add statement(s) to the loop body to do the work of the missing iterations
- Increases the amount of operation-level parallelism in the loop body
- Outer-loop unrolling changes the order of access of memory elements and can reduce the number of memory accesses and cache misses

Original:
  do i = 1, 2n
    do j = 1, m
      loop-body(i,j)
    end do
  end do

Unrolled by 2:
  do i = 1, 2n, 2
    do j = 1, m
      loop-body(i,j)
      loop-body(i+1, j)
    end do
  end do
MVM: Dot-Product 4-Outer Unrolled Form

/* Assume n is a multiple of 4 */
for (i = 0; i < n; i += 4) {
  for (j = 0; j < n; j++) {
    y[i]   = y[i]   + A[i][j]*x[j];
    y[i+1] = y[i+1] + A[i+1][j]*x[j];
    y[i+2] = y[i+2] + A[i+2][j]*x[j];
    y[i+3] = y[i+3] + A[i+3][j]*x[j];
  } }

Performance (MFLOPS):

                P4 1.6   P3 800   suraj
  32 x 32        23.0     73.1    38.0
  1024 x 1024     8.5      6.6    39.2
MVM: SAXPY 4-Outer Unrolled Form

/* Assume n is a multiple of 4 */
for (j = 0; j < n; j += 4) {
  for (i = 0; i < n; i++) {
    y[i] = y[i] + A[i][j]*x[j]
                + A[i][j+1]*x[j+1]
                + A[i][j+2]*x[j+2]
                + A[i][j+3]*x[j+3];
  } }

Performance (MFLOPS):

                P4 1.6   P3 800   suraj
  32 x 32         5.0     20.5     9.8
  1024 x 1024     2.1      3.5     8.6
Summary

- Loop permutation
  - Enhances spatial locality
  - C and FORTRAN compilers have different access patterns for matrices, resulting in different performance
  - Performance increases when the inner loop index has the minimum access stride
- Loop blocking
  - Enhances temporal locality
  - Ensure that 2B + 1 is less than the cache size
- Loop unrolling
  - Enhances superscalar, pipelined operations