Lecture 15
Loop Transformations
Chapter 11.10-11.11
Carnegie Mellon
Dror E. Maydan
CS243: Loop Optimization and Array Analysis
Loop Optimization
• Domain
  – Loops: change the order in which we iterate through loops
• Goals
  – Minimize inner loop dependences that inhibit software pipelining
  – Minimize loads and stores
  – Parallelism
    • SIMD vector today; in general, multiprocessor as well
  – Minimize cache misses
  – Minimize register spilling
• Tools
  – Loop interchange
  – Fusion
  – Fission
  – Outer loop unrolling
  – Cache tiling
  – Vectorization
• Algorithm for putting it all together
Loop Interchange
Original:
  for i = 1 to n
    for j = 1 to n
      A[j][i] = A[j-1][i-1] * b[i]
Interchanged:
  for j = 1 to n
    for i = 1 to n
      A[j][i] = A[j-1][i-1] * b[i]
• Should I interchange the two loops?
  – Stride-1 accesses are better for caches
  – But one more load in the inner loop (b[i] is no longer invariant there)
  – But one less register needed to hold the result of the loop
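A minimal sketch of the two orderings, to check that this interchange preserves results. The function names, the all-ones initial contents of A, and the sample b are assumptions for illustration. Row 0 and column 0 serve as boundary values so that A[j-1][i-1] is always in range.

```python
def interchange_original(n, b):
    # i outer, j inner: the inner loop walks down a column of A.
    A = [[1.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[j][i] = A[j - 1][i - 1] * b[i]
    return A


def interchange_swapped(n, b):
    # j outer, i inner: the inner loop walks along a row of A (stride-1),
    # but b[i] now changes on every inner iteration.
    A = [[1.0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            A[j][i] = A[j - 1][i - 1] * b[i]
    return A


b_sample = [0.0, 2.0, 3.0, 4.0]  # index 0 is unused padding
same_result = interchange_original(3, b_sample) == interchange_swapped(3, b_sample)
```

Both orders give the same array here because the only dependence has distance (1, 1), which stays lexicographically positive when the loops are swapped.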
Loop Interchange
for i = 1 to n
  for j = 1 to n
    A[j][i] = A[j+1][i-1] * b[i]
[Figure: iteration space with axes i and j]
Distance vector is (delta_i, delta_j) = (1, -1)
Direction vector is (>, <)
• A dependence represents that one reference, a_w, must happen before another, a_r
• To permute loops, permute direction vectors in the same manner
  – Permutation is legal iff all permuted direction vectors are lexicographically positive
• Special case: fully permutable loop nest
  – Either the dependence is “carried” by a loop outside of the nest, or all components are > or =
    • All the loops in the nest can be arbitrarily permuted
  – (>, >, <): inner two loops are fully permutable
  – (>=, =, >): all three loops are fully permutable
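The legality rule can be sketched as a small check over direction vectors. This uses the slide's convention that ">" marks a positive distance. The function names are hypothetical. Treating an all-"=" vector as acceptable is an assumption: such a dependence is loop independent and unaffected by permutation.

```python
from itertools import permutations


def lex_nonnegative(vec):
    """True if no instance of the direction vector can be lexicographically negative."""
    for d in vec:
        if d == ">":
            return True          # first nonzero component is positive
        if d in ("=", ">="):
            continue             # may be zero here, so keep scanning
        return False             # "<", "<=", or "*" could be negative
    return True                  # all "=": loop-independent dependence


def permutation_legal(vectors, perm):
    """perm[k] is the original loop placed at depth k after permutation."""
    return all(lex_nonnegative([v[p] for p in perm]) for v in vectors)


def fully_permutable(vectors):
    depth = len(vectors[0])
    return all(permutation_legal(vectors, p) for p in permutations(range(depth)))
```

For the example above, `permutation_legal([(">", "<")], (1, 0))` is false, so the i and j loops cannot be interchanged.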
Loop Interchange
for i = 1 to n
  for j = 1 to i
    …
[Figure: triangular iteration space with axes i and j]
• How do I interchange?
    for j = 1 to n
      for i = j to n
  – In general ugly but doable
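A quick way to convince yourself that the rewritten bounds are right is to enumerate both iteration spaces and compare them as sets. A minimal sketch with hypothetical function names:

```python
def triangular_original(n):
    # for i = 1 to n, for j = 1 to i
    return [(i, j) for i in range(1, n + 1) for j in range(1, i + 1)]


def triangular_interchanged(n):
    # for j = 1 to n, for i = j to n
    return [(i, j) for j in range(1, n + 1) for i in range(j, n + 1)]


points_match = sorted(triangular_original(6)) == sorted(triangular_interchanged(6))
```

The two loops visit the same (i, j) points in a different order; whether that reordering is legal is still decided by the direction vectors.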
Non Perfectly Nested loops
for i = 1 to n
  for j = 1 to n
    S1
  for j = 1 to n
    S2
• Can’t always interchange
  – Can be expensive when you can
Loop Fusion
Before:
  for i = 1 to n
    for j = 1 to n
      S1
    for j = 1 to n
      S2
After:
  for i = 1 to n
    for j = 1 to n
      S1
      S2
• Moving S2 across “j” iterations but not any of “i” iterations
• Pretend to fuse
• Legal as long as there is no direction vector from S2 to S1 with “=” in all the outer loops and > in one of the inner (=, =, …, =, >, …)
  – That would imply that S2 is now before S1
Loop Fusion
for i = 1 to n
  for j = 1 to n
    a[i][j] = …
  for j = 1 to n
    … = a[i][j+1]
• Legal as long as there is no direction vector from the read to the write with “=” in all the outer loops and > in one of the inner (=, =, …, =, >, …)
  – Here the vector is (=, 1), so we can’t fuse
Loop Fusion
Original:
  for i = 1 to n
    for j = 1 to n
      a[i][j] = …
    for j = 1 to n
      … = a[i][j+1]
Peel the first write and the last read:
  for i = 1 to n
    a[i][1] = …
    for j = 2 to n
      a[i][j] = …
    for j = 1, n-1
      … = a[i][j+1]
    … = a[i][n+1]
Shift the read loop by one:
  for i = 1 to n
    a[i][1] = …
    for j = 2 to n
      a[i][j] = …
    for j = 2, n
      … = a[i][j]
    … = a[i][n+1]
Fused:
  for i = 1 to n
    a[i][1] = …
    for j = 2 to n {
      a[i][j] = …
      … = a[i][j]
    }
    … = a[i][n+1]
• If the first “+” (positive) distance is always a small literal constant, can skew the loop and allow fusion
• Bonus: can get rid of a load and maybe a store
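A minimal sketch comparing the original loops, the skewed and fused form, and a naive fusion without the skew. The value written (10*i + j), the -1 marker for elements never written, and the function names are assumptions for illustration. Each function returns the sequence of values the reads observe.

```python
def fusion_unfused(n):
    a = [[-1] * (n + 2) for _ in range(n + 1)]   # column n+1 is never written
    reads = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i][j] = 10 * i + j
        for j in range(1, n + 1):
            reads.append(a[i][j + 1])
    return reads


def fusion_skewed(n):
    a = [[-1] * (n + 2) for _ in range(n + 1)]
    reads = []
    for i in range(1, n + 1):
        a[i][1] = 10 * i + 1              # peeled first write
        for j in range(2, n + 1):
            a[i][j] = 10 * i + j
            reads.append(a[i][j])         # reads the value just stored
        reads.append(a[i][n + 1])         # peeled last read
    return reads


def fusion_naive(n):
    a = [[-1] * (n + 2) for _ in range(n + 1)]
    reads = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i][j] = 10 * i + j
            reads.append(a[i][j + 1])     # runs before a[i][j+1] is written
    return reads
```

The skewed version reads each element in the same iteration that writes it, which is where the saved load comes from.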
Loop Fission
Before:
  for i = 1 to n
    for j = 1 to n
      S1
    for j = 1 to n
      S2
After:
  for i = 1 to n
    for j = 1 to n
      S1
  for i = 1 to n
    for j = 1 to n
      S2
• Moving S2 across all later “i” iterations
  – Legal as long as no dependences from S2 to S1 with > in the fissioned outer loops
Loop Fission
Before:
  for i = 1 to n
    for j = 1 to n
      … = a[i-1][j]
    for j = 1 to n
      a[i][j] = …
After:
  for i = 1 to n
    for j = 1 to n
      … = a[i-1][j]
  for i = 1 to n
    for j = 1 to n
      a[i][j] = …
• Dependence from the write to the read with distance (1)
• Moving S2 across all later “i” iterations
  – Legal as long as no dependences from S2 to S1 with > in the fissioned outer loops
  – Here the write (S2) feeds the read (S1) one “i” iteration later, so this fission is not legal
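A minimal sketch showing what goes wrong if this fission is applied anyway. The value written (the row index i), the zero initial contents, and the function names are assumptions for illustration.

```python
def fission_original(n):
    a = [[0] * (n + 1) for _ in range(n + 1)]   # row 0 holds the initial data
    seen = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            seen.append(a[i - 1][j])            # S1: read the previous row
        for j in range(1, n + 1):
            a[i][j] = i                         # S2: write the current row
    return seen


def fission_split(n):
    a = [[0] * (n + 1) for _ in range(n + 1)]
    seen = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            seen.append(a[i - 1][j])            # every read runs before any write
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i][j] = i
    return seen
```

In the split version the reads of rows 1 and above see stale zeros, because the writes they depend on have been moved after them.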
Inner Loop Fission
Before:
  for i = 1 to n
    for j = 1 to n
      … = h[i];
      … = h[i+1];
      …
      … = h[i+49];
      … = h[i+50];
After:
  for i = 1 to n
    for j = 1 to n
      … = h[i];
      …
      … = h[i+25];
    for j = 1 to n
      … = h[i+26];
      …
      … = h[i+50];
• Legal as long as there is no dependence from an S2 to an S1 where the first “>” is in the “j” loop
Inner Loop Fission
for j = 1 to n
  S1
  S2
  S3
[Figure: dependence graph over S1, S2, S3; edges labeled =, =, and >]
• Look at the edges carried by the innermost loop
• Strongly connected components cannot be fissioned
• Everything else can be fissioned as long as loops are emitted in topological order
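The SCC rule can be sketched with Tarjan's algorithm, which finds the components in reverse topological order. The example graph below is hypothetical (the edges in the slide's figure are not recoverable from the text), and the function names are assumptions.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each statement to its successors.
    Components come out in reverse topological order of the condensed graph."""
    index, low = {}, {}
    stack, on_stack, result = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            result.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return result


def fission_plan(graph):
    """One loop per component, emitted in topological order."""
    return list(reversed(strongly_connected_components(graph)))


# Hypothetical dependences: S1 -> S2, and a cycle between S2 and S3.
example_graph = {"S1": ["S2"], "S2": ["S3"], "S3": ["S2"]}
example_plan = fission_plan(example_graph)
```

With this graph, S1 gets its own loop and S2 and S3 must stay together in a second loop that follows it.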
Outer Loop Unrolling
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];
• How many loads in the inner loop? How many MACs?
Outer Loop Unrolling
for i = 1 to n by 2
  for j = 1 to n by 2
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];
      c[i][j+1] += a[i][k] * b[k][j+1];
      c[i+1][j] += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];
• Is it legal?
Outer Loop Unrolling
for i = 1 to 2 by 2
  for j = 1 to 2 by 2
    for k = 1 to 2
      c[i][j] += a[i][k] * b[k][j];
      c[i][j+1] += a[i][k] * b[k][j+1];
      c[i+1][j] += a[i+1][k] * b[k][j];
      c[i+1][j+1] += a[i+1][k] * b[k][j+1];
If n = 2
– Original order was
  • (1, 1, 1) (1, 1, 2) (1, 2, 1) (1, 2, 2) (2, 1, 1) (2, 1, 2) (2, 2, 1) (2, 2, 2)
– New order is
  • (1, 1, 1) (1, 2, 1) (2, 1, 1) (2, 2, 1) (1, 1, 2) (1, 2, 2) (2, 1, 2) (2, 2, 2)
• Equivalent to permuting the loops into
    for k = 1 to 2
      for i = 1 to 2
        for j = 1 to 2
• If loops are fully permutable, can also outer loop unroll
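A minimal sketch of the 2 x 2 outer loop unrolling on integer matrices, so that results can be compared exactly. It uses 0-based indices rather than the slides' 1-based ones and assumes n is even (no remainder loop). The function names are hypothetical.

```python
def matmul_plain(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]     # 2 operand loads per MAC
    return c


def matmul_unrolled_2x2(a, b, n):
    assert n % 2 == 0, "sketch assumes an even n"
    c = [[0] * n for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            for k in range(n):
                a0, a1 = a[i][k], a[i + 1][k]    # 2 loads of a
                b0, b1 = b[k][j], b[k][j + 1]    # 2 loads of b
                c[i][j] += a0 * b0               # 4 MACs reuse the 4 loads
                c[i][j + 1] += a0 * b1
                c[i + 1][j] += a1 * b0
                c[i + 1][j + 1] += a1 * b1
    return c


n_demo = 4
a_demo = [[i + j for j in range(n_demo)] for i in range(n_demo)]
b_demo = [[i * j + 1 for j in range(n_demo)] for i in range(n_demo)]
unroll_matches = matmul_plain(a_demo, b_demo, n_demo) == matmul_unrolled_2x2(a_demo, b_demo, n_demo)
```

The unrolled body performs four MACs for four operand loads, against one MAC for two operand loads in the plain loop.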
Unrolling Trapezoidal Loops
for i = 1 to n by 2
  for j = 1 to i
[Figure: trapezoidal iteration space with axes i and j]
• Ugly
• We unroll two-level trapezoidal loops, but the details are very ugly
Trapezoidal Example
Original:
  for (i=0; i<n; i++) {
    for (j=2*i; j<n-i; j++) {
      a[i][j] += 1;
    }
  }
After unrolling the outer loop by 2 (compiler output):
  for(i = 0; i <= (n + -2); i = i + 2) {
    lstar = (i * 2) + 2;            /* first j shared by rows i and i+1 */
    ustar = (n - (i + 1)) + -1;     /* last j shared by rows i and i+1 */
    if(((i * 2) + 2) < (n - (i + 1))) {
      /* j values before the shared range */
      for(r2d_i = i; r2d_i <= (i + 1); r2d_i = r2d_i + 1) {
        for(j = r2d_i * 2; j <= ((i * 2) + 1); j = j + 1) {
          a[r2d_i][j] = a[r2d_i][j] + 1;
        }
      }
      /* shared range: both rows in one loop body */
      for(j0 = lstar; ustar >= j0; j0 = j0 + 1) {
        a[i][j0] = a[i][j0] + 1;
        a[i + 1][j0] = a[i + 1][j0] + 1;
      }
      /* j values after the shared range */
      for(r2d_i0 = i; r2d_i0 <= (i + 1); r2d_i0 = r2d_i0 + 1) {
        for(j1 = n - (i + 1); j1 < (n - r2d_i0); j1 = j1 + 1) {
          a[r2d_i0][j1] = a[r2d_i0][j1] + 1;
        }
      }
    } else {
      /* no shared range: run the two rows separately */
      for(r2d_i1 = i; r2d_i1 <= (i + 1); r2d_i1 = r2d_i1 + 1) {
        for(j2 = r2d_i1 * 2; j2 < (n - r2d_i1); j2 = j2 + 1) {
          a[r2d_i1][j2] = a[r2d_i1][j2] + 1;
        }
      }
    }
  }
  /* leftover row when n is odd */
  if(n > i) {
    for(j3 = i * 2; j3 < (n - i); j3 = j3 + 1) {
      a[i][j3] = a[i][j3] + 1;
    }
  }
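One way to gain confidence in code like this is to port both versions and compare them over a range of sizes. The sketch below is a hand-written Python port with hypothetical function names; it is not part of the compiler output.

```python
def trapezoid_original(n):
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(2 * i, n - i):
            a[i][j] += 1
    return a


def trapezoid_unrolled(n):
    a = [[0] * n for _ in range(n)]
    i = 0
    while i <= n - 2:
        lstar = 2 * i + 2                 # first j shared by rows i and i+1
        ustar = n - (i + 1) - 1           # last j shared by rows i and i+1
        if lstar < n - (i + 1):
            for r in (i, i + 1):          # j values before the shared range
                for j in range(2 * r, 2 * i + 2):
                    a[r][j] += 1
            for j in range(lstar, ustar + 1):   # shared range, both rows
                a[i][j] += 1
                a[i + 1][j] += 1
            for r in (i, i + 1):          # j values after the shared range
                for j in range(n - (i + 1), n - r):
                    a[r][j] += 1
        else:
            for r in (i, i + 1):          # no shared range
                for j in range(2 * r, n - r):
                    a[r][j] += 1
        i += 2
    if n > i:                             # leftover row when n is odd
        for j in range(2 * i, n - i):
            a[i][j] += 1
    return a


all_sizes_match = all(trapezoid_original(n) == trapezoid_unrolled(n) for n in range(20))
```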
Cache Tiling
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c[i][j] += a[i][k] * b[k][j];
• How many cache misses?
Cache Tiling
for jb = 1 to n by b
  for kb = 1 to n by b
    for i = 1 to n
      for j = jb to jb+b-1
        for k = kb to kb+b-1
          c[i][j] += a[i][k] * b[k][j];
• How many cache misses?
  – Order b reuse for each array (b in the loop bounds is the tile size, not the array b)
• If loops are fully permutable, can cache tile
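A minimal sketch of the tiled loop nest checked against the untiled one on integer matrices. It uses 0-based indices, names the tile size `tile` to avoid clashing with the array b, and clamps the tile bounds so that n need not be a multiple of the tile size. The function names are hypothetical.

```python
def matmul_untiled(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    c = [[0] * n for _ in range(n)]
    for jb in range(0, n, tile):
        for kb in range(0, n, tile):
            for i in range(n):
                # Within one (jb, kb) tile, the tile x tile block of b is
                # reused for every i.
                for j in range(jb, min(jb + tile, n)):
                    for k in range(kb, min(kb + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c


n_tile = 5
a_tile = [[i - j for j in range(n_tile)] for i in range(n_tile)]
b_tile = [[2 * i + j for j in range(n_tile)] for i in range(n_tile)]
tiling_matches = matmul_untiled(a_tile, b_tile, n_tile) == matmul_tiled(a_tile, b_tile, n_tile, 2)
```

Python lists do not model a cache, so this only checks that the reordering computes the same result, not the miss counts.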
Vectorization: SIMD
Scalar:
  for i = 1 to n
    for j = 1 to n
      a[j][i] = 0;
Vectorized:
  for i = 1 to n by 8
    for j = 1 to n
      a[j][i:i+7] = 0;
• N-way parallel where N is the SIMD width of the machine
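A minimal sketch of the strip-mined loop, with a Python slice assignment standing in for one vector store. The width of 8 follows the slide; the clamped last strip, the all-ones initial contents, and the function names are assumptions.

```python
def zero_scalar(n):
    a = [[1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a[j][i] = 0
    return a


def zero_vectorized(n, width=8):
    a = [[1] * n for _ in range(n)]
    for i in range(0, n, width):
        hi = min(i + width, n)             # last strip may be shorter
        for j in range(n):
            a[j][i:hi] = [0] * (hi - i)    # one "vector store" per row strip
    return a


vector_matches = zero_scalar(10) == zero_vectorized(10)
```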
Vectorization: SIMD
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      S1
      …
      SM
• We have moved later iterations of S1 ahead of earlier iterations of S2, …, SM
• Legal as long as there is no dependence from a later S to an earlier S where that dependence is carried by the vector loop
  – E.g., legal to vectorize “j” above if there is no dependence from a later S to an earlier S with direction (=, >, *)
Putting It All Together
• Three-phase algorithm
  1. Use fission and fusion to build perfectly nested loops
     – We prefer fusion, but it is not obvious that that is right
  2. Enumerate possibilities for unrolling, interchanging, cache tiling and vectorizing
  3. Use inner loop fission if necessary to minimize register pressure
Phase 2
Choose a loop to vectorize
  All references that refer to the vector loop must be stride-1
For each possible inner loop
  Compute best possible unrollings for each outer loop
  Compute best possible ordering and tiling
To compute best possible unrolling
  Try all combinations of unrolling up to a max product of 16
  For each possible unrolling
    Estimate the machine cycles for the inner loop (ignoring cache)
    Estimate the register pressure
      Don’t unroll more if too much register pressure
To compute best possible ordering and tiling
  Consider only loops with “reuse”
  Choose best three
  Iterate over all orderings of three with a binary search on cache tile size
Note and record the total cycle time for this configuration. Pick the best
Estimating cycles
  Could compile every combination, but …
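The unrolling search can be sketched as an enumeration of unroll factors whose product stays within 16, scored by a cost model. The toy model below is an assumption made up for the matrix multiply example (loads per MAC, with registers counted as accumulators plus loaded operands). It is not the lecture's machine model, and all function names are hypothetical.

```python
from itertools import product


def candidate_unrollings(num_outer, max_product=16):
    """All unroll-factor tuples for the outer loops with product <= max_product."""
    out = []
    for combo in product(range(1, max_product + 1), repeat=num_outer):
        total = 1
        for u in combo:
            total *= u
        if total <= max_product:
            out.append(combo)
    return out


def toy_matmul_cost(ui, uj):
    """Toy estimate for the matmul inner loop unrolled ui x uj (assumed model)."""
    loads = ui + uj                      # a[i..][k] and b[k][j..]
    macs = ui * uj
    registers = ui * uj + ui + uj        # accumulators plus loaded operands
    return loads / macs, registers


def best_unrolling(register_budget):
    best_key, best_combo = None, None
    for ui, uj in candidate_unrollings(2):
        cost, regs = toy_matmul_cost(ui, uj)
        if regs > register_budget:
            continue                     # too much register pressure
        key = (cost, regs)
        if best_key is None or key < best_key:
            best_key, best_combo = key, (ui, uj)
    return best_combo
```

Under this toy model a budget of 32 registers selects a 4 x 4 unrolling, while a budget of 10 falls back to 2 x 2.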
Machine Modeling
• Recall that software pipelining had resource limits and latency limits
  – Map high-level IR to machine resources
  – Unroll high-level IR operations
    • Remove duplicate loads and stores
    • Count machine resources
    • Build a latency graph of unrolled operations
  – Iterate over inner loop cycles and find worst cycle
    • Assume performance is worst of two limits
• Model register pressure
  – Count loop-invariant loads and stores
  – Count address streams
  – Count cross-iteration CSEs, e.g. … = a[i] + a[i-2]
  – Add machine-dependent constant
Cache Modeling
• Given a loop ordering and a set of tile factors
• Combine array references that differ by a constant, e.g. a[i][j] and a[i+1][j+1]
• Estimate capacity of all array references, multiply by a fudge factor for interference, and stop increasing block sizes if capacity is larger than cache
• Estimate quantity of data that must be brought into cache
Phase 3: Inner loop fission
• Does the inner loop use too many registers?
  – Break down into SCCs
  – Pick biggest SCC
    • Does it use too many registers?
      – If yes, too bad
      – If no, search for other SCCs to merge in
        » Pick one with most commonality
        » Keep merging while enough registers
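The greedy merge can be sketched as follows. Everything concrete here is an assumption for illustration: each SCC is described only by the set of arrays it references, register need is modeled as the number of distinct arrays in a group, "commonality" is the number of shared arrays, and the procedure is repeated for the SCCs left over. The sketch also ignores the topological ordering constraint between the resulting loops.

```python
def plan_inner_fission(sccs, budget):
    """sccs maps an SCC name to the set of arrays it references (hypothetical input).
    Returns groups of SCCs; each group becomes one inner loop."""

    def regs(group):
        return len(set().union(*(sccs[s] for s in group)))

    remaining = set(sccs)
    loops = []
    while remaining:
        # Start from the biggest remaining SCC; if it is over budget, it stays alone.
        seed = max(sorted(remaining), key=lambda s: len(sccs[s]))
        remaining.remove(seed)
        group = [seed]
        while True:
            used = set().union(*(sccs[s] for s in group))
            fits = [s for s in sorted(remaining) if regs(group + [s]) <= budget]
            if not fits:
                break
            pick = max(fits, key=lambda s: len(sccs[s] & used))   # most commonality
            group.append(pick)
            remaining.remove(pick)
        loops.append(group)
    return loops


example_sccs = {"A": {"x", "y", "z"}, "B": {"x", "y"}, "C": {"p", "q"}, "D": {"q"}}
example_loops = plan_inner_fission(example_sccs, budget=3)
```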
Extra: Reductions
for i = 1 to n
  for j = 1 to n
    a[j] += b[i][j];
Can I unroll?
for i = 1 to n by 2
  for j = 1 to n
    a[j] += b[i][j];
    a[j] += b[i+1][j];
• Legal?
  – Integer: yes
  – Floating point: maybe
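One plausible reading of the "maybe" is that the unrolled form invites the compiler to combine the two updates into a single `a[j] += b[i][j] + b[i+1][j]`, which changes the order of the additions. The sketch below assumes that combined form and an even number of rows; the function names and sample data are made up for illustration.

```python
def reduce_in_order(b, zero):
    a = [zero] * len(b[0])
    for i in range(len(b)):
        for j in range(len(a)):
            a[j] += b[i][j]
    return a


def reduce_unrolled_combined(b, zero):
    assert len(b) % 2 == 0, "sketch assumes an even number of rows"
    a = [zero] * len(b[0])
    for i in range(0, len(b), 2):
        for j in range(len(a)):
            a[j] += b[i][j] + b[i + 1][j]    # the two rows are added first
    return a


int_rows = [[10**16], [0], [1], [1]]
float_rows = [[1e16], [0.0], [1.0], [1.0]]
ints_agree = reduce_in_order(int_rows, 0) == reduce_unrolled_combined(int_rows, 0)
floats_agree = reduce_in_order(float_rows, 0.0) == reduce_unrolled_combined(float_rows, 0.0)
```

With integers the two orders agree. With doubles they differ on this input, because adding 1.0 to 1e16 rounds back to 1e16 while adding 2.0 does not.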
Extra: Outer Loop Invariants
for i
  for j
    a[i][j] += b[i] * cos(c[j])
Can replace with
for j
  t[j] = cos(c[j])
for i
  for j
    a[i][j] += b[i] * t[j];
• Need to integrate with model
  – Model must assume that invariant computation will be replaced with loads
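A minimal sketch of the hoisting, counting calls to `cos` in each version. The function names and sample data are assumptions for illustration.

```python
import math


def invariant_direct(b, c):
    a = [[0.0] * len(c) for _ in b]
    calls = 0
    for i in range(len(b)):
        for j in range(len(c)):
            a[i][j] += b[i] * math.cos(c[j])   # cos recomputed for every i
            calls += 1
    return a, calls


def invariant_hoisted(b, c):
    t = [math.cos(x) for x in c]               # one cos per j, computed once
    a = [[0.0] * len(c) for _ in b]
    for i in range(len(b)):
        for j in range(len(c)):
            a[i][j] += b[i] * t[j]             # inner loop now does a load
    return a, len(c)


b_vals = [1.0, 2.0, 3.0]
c_vals = [0.0, 0.5, 1.0, 1.5]
direct_a, direct_calls = invariant_direct(b_vals, c_vals)
hoisted_a, hoisted_calls = invariant_hoisted(b_vals, c_vals)
```

The results are identical, but the inner loop trades a `cos` call for a load of `t[j]`, which is the cost the model needs to account for.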