Loop Transformations

Loop Mangling
With both thanks and apologies to Allen and Kennedy
Optimizing Compilers for Modern Architectures
Preliminaries
• Want to parallelize programs, so concentrate on loops
• Types of parallelism
  — Vectorization
  — Instruction-level parallelism
  — Coarser-grain parallelism
    – Shared memory
    – Distributed memory
    – Message passing
• Formalism based upon dependence analysis
Transformations
• We call a transformation safe if the transformed program has the same "meaning" as the original program
• But what is the "meaning" of a program? For our purposes:
• Two computations are equivalent if, on the same inputs:
  — They produce the same outputs in the same order
Reordering Transformations
• A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements
Properties of Reordering Transformations
• A reordering transformation does not eliminate dependences
• A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence
• However, it can reverse the order of a dependence, which will lead to incorrect behavior
Fundamental Theorem of Dependence
• Fundamental Theorem of Dependence:
  — Any reordering transformation that preserves every dependence in a program preserves the meaning of that program
• Proof by contradiction; Theorem 2.2 in the Allen and Kennedy text
Fundamental Theorem of Dependence
• A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.
Parallelization and Vectorization
• It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
• Want to convert loops like:
    DO I = 1, N
       X(I) = X(I) + C
    ENDDO
  to X(1:N) = X(1:N) + C (Fortran 77 to Fortran 90)
• However:
    DO I = 1, N
       X(I+1) = X(I) + C
    ENDDO
  is not equivalent to X(2:N+1) = X(1:N) + C
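The second loop carries a level-1 dependence, so its slice form computes different values. A minimal Python sketch of the two semantics (the values of N and C are arbitrary):

```python
N, C = 5, 1.0

# Sequential loop: iteration I reads the X(I) written by iteration I-1
x = [0.0] * (N + 2)          # 1-based indexing; x[0] unused
for i in range(1, N + 1):
    x[i + 1] = x[i] + C

# Fortran 90 array assignment: all of X(1:N) is read before anything is written
y = [0.0] * (N + 2)
y[2:N + 2] = [y[i] + C for i in range(1, N + 1)]

print(x == y)   # False: the loop carries a dependence the slice ignores
```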
Loop Distribution
• Can statements in loops which carry dependences be vectorized?
    DO I = 1, N
S1     A(I+1) = B(I) + C
S2     D(I) = A(I) + E
    ENDDO
• Dependence S1 δ1 S2 (true dependence, carried at level 1) can be converted to:
S1  A(2:N+1) = B(1:N) + C
S2  D(1:N) = A(1:N) + E
Loop Distribution
    DO I = 1, N
S1     A(I+1) = B(I) + C
S2     D(I) = A(I) + E
    ENDDO
• transformed to:
    DO I = 1, N
S1     A(I+1) = B(I) + C
    ENDDO
    DO I = 1, N
S2     D(I) = A(I) + E
    ENDDO
• leads to:
S1  A(2:N+1) = B(1:N) + C
S2  D(1:N) = A(1:N) + E
Loop Distribution
• Loop distribution fails if there is a cycle of dependences
    DO I = 1, N
S1     A(I+1) = B(I) + C
S2     B(I+1) = A(I) + E
    ENDDO
  S1 δ1 S2 and S2 δ1 S1
• What about:
    DO I = 1, N
S1     B(I) = A(I) + E
S2     A(I+1) = B(I) + C
    ENDDO
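One way to probe the second loop is to simulate it against a naively distributed version (a Python sketch; the values chosen for E and C are arbitrary). The results differ, which is consistent with the cycle formed by the loop-independent dependence S1 δ S2 and the carried dependence S2 δ1 S1:

```python
N = 5

def original():
    A, B = [0.0] * (N + 2), [0.0] * (N + 1)
    for i in range(1, N + 1):
        B[i] = A[i] + 2.0          # S1, with E = 2.0
        A[i + 1] = B[i] + 1.0      # S2, with C = 1.0
    return A, B

def distributed():
    A, B = [0.0] * (N + 2), [0.0] * (N + 1)
    for i in range(1, N + 1):      # all of S1 first
        B[i] = A[i] + 2.0
    for i in range(1, N + 1):      # then all of S2
        A[i + 1] = B[i] + 1.0
    return A, B

print(original() == distributed())   # False: the loop still contains a cycle
```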
Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming
Motivational Example
    DO J = 1, M
       DO I = 1, N
          T = 0.0
          DO K = 1, L
             T = T + A(I,K) * B(K,J)
          ENDDO
          C(I,J) = T
       ENDDO
    ENDDO
However, by scalar expansion, we can get:
    DO J = 1, M
       DO I = 1, N
          T$(I) = 0.0
          DO K = 1, L
             T$(I) = T$(I) + A(I,K) * B(K,J)
          ENDDO
          C(I,J) = T$(I)
       ENDDO
    ENDDO
Motivational Example II
• Loop Distribution gives us:
    DO J = 1, M
       DO I = 1, N
          T$(I) = 0.0
       ENDDO
       DO I = 1, N
          DO K = 1, L
             T$(I) = T$(I) + A(I,K) * B(K,J)
          ENDDO
       ENDDO
       DO I = 1, N
          C(I,J) = T$(I)
       ENDDO
    ENDDO
Motivational Example III
Finally, interchanging the I and K loops, we get:
    DO J = 1, M
       T$(1:N) = 0.0
       DO K = 1, L
          T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
       ENDDO
       C(1:N,J) = T$(1:N)
    ENDDO
• A couple of new transformations used:
  — Loop interchange
  — Scalar Expansion
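The whole pipeline can be checked end-to-end with a small Python sketch (list-based; sizes and data are arbitrary) comparing the original triply nested loop against the expanded, distributed, and interchanged form:

```python
M, N, L = 3, 4, 5
A = [[float(i * L + k) for k in range(L)] for i in range(N)]   # N x L
B = [[float(k * M + j) for j in range(M)] for k in range(L)]   # L x M

# Original: scalar T accumulated inside the K loop
C1 = [[0.0] * M for _ in range(N)]
for j in range(M):
    for i in range(N):
        t = 0.0
        for k in range(L):
            t += A[i][k] * B[k][j]
        C1[i][j] = t

# After scalar expansion + loop distribution + interchange:
# whole columns of T$ are updated at once
C2 = [[0.0] * M for _ in range(N)]
for j in range(M):
    T = [0.0] * N                                            # T$(1:N) = 0.0
    for k in range(L):
        T = [T[i] + A[i][k] * B[k][j] for i in range(N)]     # vector update
    for i in range(N):
        C2[i][j] = T[i]                                      # C(1:N,J) = T$(1:N)

print(C1 == C2)   # True: the transformed nest computes the same product
```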
Loop Interchange
    DO I = 1, N
       DO J = 1, M
S         A(I,J+1) = A(I,J) + B
       ENDDO
    ENDDO
• DV: (=, <)
• Applying loop interchange:
    DO J = 1, M
       DO I = 1, N
S         A(I,J+1) = A(I,J) + B
       ENDDO
    ENDDO
• DV: (<, =)
• leads to:
    DO J = 1, M
S      A(1:N,J+1) = A(1:N,J) + B
    ENDDO
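Interchange is legal here because the dependence stays within a single row of A, so the source still executes before the sink in either loop order. A quick Python check (taking B = 1.0; sizes arbitrary):

```python
N, M = 4, 3

def run(interchanged):
    # A(I,J) with the statement A(I,J+1) = A(I,J) + B, taking B = 1.0
    A = [[float(i + j) for j in range(M + 2)] for i in range(N + 1)]
    if interchanged:                      # J outer, I inner
        for j in range(1, M + 1):
            for i in range(1, N + 1):
                A[i][j + 1] = A[i][j] + 1.0
    else:                                 # I outer, J inner (original)
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                A[i][j + 1] = A[i][j] + 1.0
    return A

print(run(False) == run(True))   # True: source and sink order is preserved
```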
Scalar Expansion
    DO I = 1, N
S1     T = A(I)
S2     A(I) = B(I)
S3     B(I) = T
    ENDDO
• Scalar Expansion:
    DO I = 1, N
S1     T$(I) = A(I)
S2     A(I) = B(I)
S3     B(I) = T$(I)
    ENDDO
    T = T$(N)
• leads to:
S1  T$(1:N) = A(1:N)
S2  A(1:N) = B(1:N)
S3  B(1:N) = T$(1:N)
    T = T$(N)
Scalar Expansion
• However, it is not always profitable. Consider:
    DO I = 1, N
       T = T + A(I) + A(I+1)
       A(I) = T
    ENDDO
• Scalar expansion gives us:
    T$(0) = T
    DO I = 1, N
S1     T$(I) = T$(I-1) + A(I) + A(I+1)
S2     A(I) = T$(I)
    ENDDO
    T = T$(N)
• The recurrence on T$ (S1 depends on itself at level 1) remains, so expansion alone does not enable vectorization
Scalar Renaming
    DO I = 1, 100
S1     T = A(I) + B(I)
S2     C(I) = T + T
S3     T = D(I) - B(I)
S4     A(I+1) = T * T
    ENDDO
• Renaming scalar T:
    DO I = 1, 100
S1     T1 = A(I) + B(I)
S2     C(I) = T1 + T1
S3     T2 = D(I) - B(I)
S4     A(I+1) = T2 * T2
    ENDDO
Scalar Renaming
• will lead to:
S3  T2$(1:100) = D(1:100) - B(1:100)
S4  A(2:101) = T2$(1:100) * T2$(1:100)
S1  T1$(1:100) = A(1:100) + B(1:100)
S2  C(1:100) = T1$(1:100) + T1$(1:100)
    T = T2$(100)
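A Python sketch of why the reordered vector form is safe: S3/S4 run first because S4 writes the A values that S1 reads on later iterations. A small N stands in for 100, and the data values are arbitrary:

```python
N = 10
A0 = [0.0] * (N + 2); A0[1] = 1.0
B = [float(i) for i in range(N + 1)]
D = [float(2 * i) for i in range(N + 1)]

# Sequential original: one scalar T reused for two different values
A1, C1 = A0[:], [0.0] * (N + 1)
for i in range(1, N + 1):
    T = A1[i] + B[i]              # S1
    C1[i] = T + T                 # S2
    T = D[i] - B[i]               # S3
    A1[i + 1] = T * T             # S4
t_seq = T

# Renamed and vectorized: S3/S4 first, since S4 writes A(I+1) read by S1
A2, C2 = A0[:], [0.0] * (N + 1)
T2 = [D[i] - B[i] for i in range(1, N + 1)]            # S3: T2$(1:N)
for i in range(1, N + 1):
    A2[i + 1] = T2[i - 1] * T2[i - 1]                  # S4: A(2:N+1)
T1 = [A2[i] + B[i] for i in range(1, N + 1)]           # S1: T1$(1:N)
for i in range(1, N + 1):
    C2[i] = T1[i - 1] + T1[i - 1]                      # S2: C(1:N)
t_vec = T2[N - 1]                                      # T = T2$(N)

print(A1 == A2 and C1 == C2 and t_seq == t_vec)   # True
```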
Array Renaming
    DO I = 1, N
S1     A(I) = A(I-1) + X
S2     Y(I) = A(I) + Z
S3     A(I) = B(I) + C
    ENDDO
• Dependences: S1 δ S2 (loop-independent true), S2 δ-1 S3 (anti), S3 δ1 S1 (carried true), S1 δ0 S3 (output)
• Rename A(I) to A$(I) in S1 and S2:
    DO I = 1, N
S1     A$(I) = A(I-1) + X
S2     Y(I) = A$(I) + Z
S3     A(I) = B(I) + C
    ENDDO
• Dependences remaining: S1 δ S2 and S3 δ1 S1
Seen So Far...
• Uncovering potential vectorization in loops by:
  — Loop Interchange
  — Scalar Expansion
  — Scalar and Array Renaming
• Safety and profitability of these transformations
And Now ...
• More transformations:
  — Node Splitting
  — Recognition of Reductions
  — Index-Set Splitting
  — Run-time Symbolic Resolution
• Unified framework to generate vector code
  — Left to the implementer (get a copy of the Allen and Kennedy book)
Node Splitting
• Sometimes renaming fails:
    DO I = 1, N
S1:    A(I) = X(I+1) + X(I)
S2:    X(I+1) = B(I) + 32
    ENDDO
• The recurrence is kept intact by the renaming algorithm
Node Splitting
    DO I = 1, N
S1:    A(I) = X(I+1) + X(I)
S2:    X(I+1) = B(I) + 32
    ENDDO
• Break the critical antidependence
• Make a copy of the node from which the antidependence emanates:
    DO I = 1, N
S1’:   X$(I) = X(I+1)
S1:    A(I) = X$(I) + X(I)
S2:    X(I+1) = B(I) + 32
    ENDDO
• Recurrence broken
• Vectorized to:
    X$(1:N) = X(2:N+1)
    X(2:N+1) = B(1:N) + 32
    A(1:N) = X$(1:N) + X(1:N)
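A Python sketch checking that the split-and-vectorized form computes the same A as the original loop (initial values arbitrary):

```python
N = 6
X0 = [float(i) for i in range(N + 2)]
B = [float(10 + i) for i in range(N + 1)]

# Original loop: S1 must read X(I+1) before S2 overwrites it
X, A1 = X0[:], [0.0] * (N + 1)
for i in range(1, N + 1):
    A1[i] = X[i + 1] + X[i]      # S1
    X[i + 1] = B[i] + 32         # S2

# Node splitting: copy the old X(I+1) values first, then vectorize freely
X, A2 = X0[:], [0.0] * (N + 1)
Xcopy = [X[i + 1] for i in range(1, N + 1)]        # X$(1:N) = X(2:N+1)
X[2:N + 2] = [B[i] + 32 for i in range(1, N + 1)]  # X(2:N+1) = B(1:N) + 32
for i in range(1, N + 1):
    A2[i] = Xcopy[i - 1] + X[i]                    # A(1:N) = X$(1:N) + X(1:N)

print(A1 == A2)   # True
```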
Node Splitting
• Determining a minimal set of critical antidependences is NP-complete
• A perfect job of node splitting is therefore difficult
• Heuristic:
  — Select an antidependence
  — Delete it to see if the graph becomes acyclic
  — If acyclic, apply node splitting
Recognition of Reductions
• Sum reduction, min/max reduction, count reduction
• Vector ---> single element
    S = 0.0
    DO I = 1, N
       S = S + A(I)
    ENDDO
• Not directly vectorizable
Recognition of Reductions
• Assuming commutativity and associativity:
    S = 0.0
    DO k = 1, 4
       SUM(k) = 0.0
       DO I = k, N, 4
          SUM(k) = SUM(k) + A(I)
       ENDDO
       S = S + SUM(k)
    ENDDO
• Distribute the k loop:
    S = 0.0
    DO k = 1, 4
       SUM(k) = 0.0
    ENDDO
    DO k = 1, 4
       DO I = k, N, 4
          SUM(k) = SUM(k) + A(I)
       ENDDO
    ENDDO
    DO k = 1, 4
       S = S + SUM(k)
    ENDDO
Recognition of Reductions
• After loop interchange:
    DO I = 1, N, 4
       DO k = I, min(I+3,N)
          SUM(k-I+1) = SUM(k-I+1) + A(k)
       ENDDO
    ENDDO
• Vectorize:
    DO I = 1, N, 4
       SUM(1:4) = SUM(1:4) + A(I:I+3)
    ENDDO
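In Python, the 4-way partial-sum form looks like this (N divisible by 4, as the vectorized slice assumes; reassociation can perturb floating-point results, which is why commutativity and associativity must be assumed):

```python
N = 16                      # assume N divisible by 4, as the vector slice does
A = [float(i) for i in range(1, N + 1)]

# Direct sequential reduction
s_seq = 0.0
for a in A:
    s_seq += a

# Four partial sums, updated four elements at a time
SUM = [0.0] * 4
for i in range(0, N, 4):
    SUM = [SUM[k] + A[i + k] for k in range(4)]   # SUM(1:4) = SUM(1:4) + A(I:I+3)
s_vec = sum(SUM)

print(s_seq == s_vec)   # equal here; reassociation may differ in floating point
```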
Recognition of Reductions
• Properties of reductions:
  — Reduce a vector/array to one element
  — No use of intermediate values
  — The reduction operates on the vector and nothing else
Index-set Splitting
• Subdivide the loop into different iteration ranges to achieve partial parallelization
  — Threshold Analysis [Strong SIV, Weak Crossing SIV]
  — Loop Peeling [Weak Zero SIV]
  — Section-Based Splitting [a variation of loop peeling]
Threshold Analysis
    DO I = 1, 20
       A(I+20) = A(I) + B
    ENDDO
  Vectorize to:
    A(21:40) = A(1:20) + B

    DO I = 1, 100
       A(I+20) = A(I) + B
    ENDDO
  Strip mine to:
    DO I = 1, 100, 20
       DO i = I, I+19
          A(i+20) = A(i) + B
       ENDDO
    ENDDO
  and vectorize the inner loop.
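The dependence distance (the threshold) is 20, so each 20-iteration strip is dependence-free. A Python sketch comparing the sequential loop with the strip-mined vector form (taking B = 1.0):

```python
N, TH = 100, 20             # dependence distance (threshold) is 20
A0 = [float(i) for i in range(N + TH + 1)]

# Sequential original: A(I+20) = A(I) + B, with B = 1.0
A1 = A0[:]
for i in range(1, N + 1):
    A1[i + TH] = A1[i] + 1.0

# Strip-mined: within each 20-iteration strip no write reaches a read,
# so each strip can be executed as one vector operation
A2 = A0[:]
for I in range(1, N + 1, TH):
    A2[I + TH:I + 2 * TH] = [A2[i] + 1.0 for i in range(I, I + TH)]

print(A1 == A2)   # True
```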
Loop Peeling
• Source of dependence is a single iteration
    DO I = 1, N
       A(I) = A(I) + A(1)
    ENDDO
  Loop peeled to:
    A(1) = A(1) + A(1)
    DO I = 2, N
       A(I) = A(I) + A(1)
    ENDDO
  Vectorize to:
    A(1) = A(1) + A(1)
    A(2:N) = A(2:N) + A(1)
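A Python sketch: once iteration I = 1 is peeled off, the remaining iterations read a value of A(1) that no longer changes (initial values arbitrary):

```python
N = 6
A0 = [float(i) for i in range(N + 1)]   # 1-based; A0[0] unused

# Sequential: iteration I = 1 updates A(1), which every iteration reads
A1 = A0[:]
for i in range(1, N + 1):
    A1[i] = A1[i] + A1[1]

# Peeled: after the first iteration runs alone, A(1) is loop-invariant
A2 = A0[:]
A2[1] = A2[1] + A2[1]                                    # peeled iteration I = 1
A2[2:N + 1] = [A2[i] + A2[1] for i in range(2, N + 1)]   # A(2:N) = A(2:N) + A(1)

print(A1 == A2)   # True
```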
Run-time Symbolic Resolution
• "Breaking Conditions"
    DO I = 1, N
       A(I+L) = A(I) + B(I)
    ENDDO
  Transformed to:
    IF (L.LE.0) THEN
       A(1+L:N+L) = A(1:N) + B(1:N)
    ELSE
       DO I = 1, N
          A(I+L) = A(I) + B(I)
       ENDDO
    ENDIF
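A Python sketch of why L ≤ 0 is the breaking condition: for L ≤ 0 the only dependence is an antidependence, which Fortran 90 array-assignment semantics (evaluate the whole right-hand side first) preserve; for L > 0 the carried true dependence makes the vector form wrong. B(I) is taken as 1.0 and the data values are arbitrary:

```python
N = 8

def run_seq(L):
    A = [float(i) for i in range(-N, 2 * N + 1)]   # pad so A(1+L..N+L) is in range
    off = N                                        # A[off + i] holds A(i)
    for i in range(1, N + 1):
        A[off + i + L] = A[off + i] + 1.0          # B(I) taken as 1.0
    return A

def run_vec(L):
    A = [float(i) for i in range(-N, 2 * N + 1)]
    off = N
    rhs = [A[off + i] + 1.0 for i in range(1, N + 1)]   # read all of A(1:N) first
    A[off + 1 + L:off + N + 1 + L] = rhs                # A(1+L:N+L) = A(1:N) + B(1:N)
    return A

# The vector form agrees exactly when the breaking condition L <= 0 holds
print(run_seq(-2) == run_vec(-2), run_seq(0) == run_vec(0), run_seq(2) == run_vec(2))
```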
Run-time Symbolic Resolution
• Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete
• Heuristic:
  — Identify when a critical dependence can be conditionally eliminated via a breaking condition
Putting It All Together
• Good part
  — Many transformations imply more choices to exploit parallelism
• Bad part
  — Choosing the right transformation
  — How to automate the transformation selection process?
  — Interference between transformations
Putting It All Together
• Any algorithm that tries to tie all the transformations together must
  — Take a global view of the transformed code
  — Know the architecture of the target machine
• Goal of our algorithm
  — Find ONE good vector loop [works well for most vector register architectures]
Compiler Improvement of Register Usage
Overview
• Improving memory hierarchy performance by compiler transformations
  — Scalar Replacement
  — Unroll-and-Jam
• Saving memory loads & stores
• Making good use of the processor registers
Motivating Example
    DO I = 1, N
       DO J = 1, M
          A(I) = A(I) + B(J)
       ENDDO
    ENDDO
• A(I) can be left in a register throughout the inner loop
• Coloring-based register allocation fails to recognize this
    DO I = 1, N
       T = A(I)
       DO J = 1, M
          T = T + B(J)
       ENDDO
       A(I) = T
    ENDDO
• All loads and stores to A in the inner loop have been saved
• High chance of T being allocated a register by the coloring algorithm
Scalar Replacement
• Convert array references to scalar references to improve the performance of the coloring-based allocator
• Our approach is to use dependences to achieve these memory hierarchy transformations
Unroll-and-Jam
    DO I = 1, N*2
       DO J = 1, M
          A(I) = A(I) + B(J)
       ENDDO
    ENDDO
• Can we achieve reuse of references to B?
• Use a transformation called unroll-and-jam:
    DO I = 1, N*2, 2
       DO J = 1, M
          A(I) = A(I) + B(J)
          A(I+1) = A(I+1) + B(J)
       ENDDO
    ENDDO
• Unroll the outer loop twice, then fuse the copies of the inner loop
• This brings two uses of B(J) together
Unroll-and-Jam
    DO I = 1, N*2, 2
       DO J = 1, M
          A(I) = A(I) + B(J)
          A(I+1) = A(I+1) + B(J)
       ENDDO
    ENDDO
• Apply scalar replacement on this code:
    DO I = 1, N*2, 2
       s0 = A(I)
       s1 = A(I+1)
       DO J = 1, M
          t = B(J)
          s0 = s0 + t
          s1 = s1 + t
       ENDDO
       A(I) = s0
       A(I+1) = s1
    ENDDO
• Half the number of loads as the original program
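A Python sketch that both checks equivalence and counts loads under a simple model (one load per array reference executed; sizes and data arbitrary):

```python
N2, M = 8, 5                # N*2 = 8; sizes arbitrary
A0 = [float(i) for i in range(N2 + 1)]
B = [float(j) for j in range(M + 1)]
loads = {"orig": 0, "uj": 0}

# Original: A(I) and B(J) are loaded on every inner-loop iteration
A1 = A0[:]
for i in range(1, N2 + 1):
    for j in range(1, M + 1):
        loads["orig"] += 2                 # load A(I) and B(J)
        A1[i] = A1[i] + B[j]

# Unroll-and-jam + scalar replacement: each B(J) load now feeds two I's,
# and A(I), A(I+1) live in scalars across the inner loop
A2 = A0[:]
for i in range(1, N2 + 1, 2):
    loads["uj"] += 2                       # load A(I) and A(I+1) once per strip
    s0, s1 = A2[i], A2[i + 1]
    for j in range(1, M + 1):
        loads["uj"] += 1                   # load B(J) once, use it twice
        t = B[j]
        s0 += t
        s1 += t
    A2[i], A2[i + 1] = s0, s1

print(A1 == A2, loads)   # True {'orig': 80, 'uj': 28}
```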
Legality of Unroll-and-Jam
• Is unroll-and-jam always legal?
    DO I = 1, N*2
       DO J = 1, M
          A(I+1,J-1) = A(I,J) + B(I,J)
       ENDDO
    ENDDO
• Apply unroll-and-jam:
    DO I = 1, N*2, 2
       DO J = 1, M
          A(I+1,J-1) = A(I,J) + B(I,J)
          A(I+2,J-1) = A(I+1,J) + B(I+1,J)
       ENDDO
    ENDDO
• This is wrong!!!
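A Python sketch makes the violation concrete: the jammed second statement reads A(I+1,J) before the fused copy of the first statement has written it (initial values arbitrary):

```python
N2, M = 4, 4

def init():
    return [[float(10 * i + j) for j in range(M + 1)] for i in range(N2 + 2)]

def original():
    A, B = init(), init()
    for i in range(1, N2 + 1):
        for j in range(1, M + 1):
            A[i + 1][j - 1] = A[i][j] + B[i][j]
    return A

def unroll_jam():
    A, B = init(), init()
    for i in range(1, N2 + 1, 2):
        for j in range(1, M + 1):
            A[i + 1][j - 1] = A[i][j] + B[i][j]
            A[i + 2][j - 1] = A[i + 1][j] + B[i + 1][j]   # reads A(I+1,J) too early
    return A

print(original() == unroll_jam())   # False: the (<,>) dependence is violated
```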
Legality of Unroll-and-Jam
• The direction vector in this example was (<, >)
  — This makes loop interchange illegal
• But does loop interchange being illegal imply unroll-and-jam is illegal? NO
  — Unroll-and-jam is loop interchange, followed by unrolling the inner loop, followed by another loop interchange
Conditions for legality of unroll-and-jam
• Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together.
• Theorem: An unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<, >) whose distance for the outer loop is less than n.
Conclusion
• We have learned two memory hierarchy transformations:
  — scalar replacement
  — unroll-and-jam
• They reduce the number of memory accesses by making maximum use of the processor registers
Loop Fusion
• Consider the following:
    DO I = 1, N
       A(I) = C(I) + D(I)
    ENDDO
    DO I = 1, N
       B(I) = C(I) - D(I)
    ENDDO
• If we fuse these loops, we can reuse operands in registers:
    DO I = 1, N
       A(I) = C(I) + D(I)
       B(I) = C(I) - D(I)
    ENDDO
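A Python sketch counting loads under a simple model (one load per array reference executed; data arbitrary) shows fusion halving the loads of C and D:

```python
N = 6
C = [float(i) for i in range(N + 1)]
D = [float(2 * i) for i in range(N + 1)]
loads = {"separate": 0, "fused": 0}

# Two separate loops: C(I) and D(I) are each loaded twice
A1, B1 = [0.0] * (N + 1), [0.0] * (N + 1)
for i in range(1, N + 1):
    loads["separate"] += 2
    A1[i] = C[i] + D[i]
for i in range(1, N + 1):
    loads["separate"] += 2
    B1[i] = C[i] - D[i]

# Fused: C(I) and D(I) are loaded once and reused from registers
A2, B2 = [0.0] * (N + 1), [0.0] * (N + 1)
for i in range(1, N + 1):
    loads["fused"] += 2
    c, d = C[i], D[i]
    A2[i] = c + d
    B2[i] = c - d

print(A1 == A2 and B1 == B2, loads)   # True {'separate': 24, 'fused': 12}
```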
Example
    DO J = 1, N
       DO I = 1, M
          A(I,J) = C(I,J) + D(I,J)
       ENDDO
       DO I = 1, M
          B(I,J) = A(I,J-1) - E(I,J)
       ENDDO
    ENDDO
• Scalar Replacement:
    DO I = 1, M
       r = A(I,0)
       DO J = 1, N
          B(I,J) = r - E(I,J)
          r = C(I,J) + D(I,J)
          A(I,J) = r
       ENDDO
    ENDDO
Fusion and Interchange
    DO I = 1, M
       DO J = 1, N
          B(I,J) = A(I,J-1) - E(I,J)
          A(I,J) = C(I,J) + D(I,J)
       ENDDO
    ENDDO
• We've saved (N-1)M loads of A
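A Python sketch verifying that the fused, interchanged, scalar-replaced form produces the same A and B (data values arbitrary):

```python
M, N = 3, 4
A0 = [[float(i + j) for j in range(N + 1)] for i in range(M + 1)]
C = [[1.0] * (N + 1) for _ in range(M + 1)]
D = [[2.0] * (N + 1) for _ in range(M + 1)]
E = [[0.5] * (N + 1) for _ in range(M + 1)]

# Original loop nest: B(I,J) reads the A(I,J-1) computed one J earlier
A1 = [row[:] for row in A0]
B1 = [[0.0] * (N + 1) for _ in range(M + 1)]
for j in range(1, N + 1):
    for i in range(1, M + 1):
        A1[i][j] = C[i][j] + D[i][j]
    for i in range(1, M + 1):
        B1[i][j] = A1[i][j - 1] - E[i][j]

# Interchanged, fused, scalar-replaced: r carries A(I,J-1) in a register
A2 = [row[:] for row in A0]
B2 = [[0.0] * (N + 1) for _ in range(M + 1)]
for i in range(1, M + 1):
    r = A2[i][0]                  # the only load of A per I iteration
    for j in range(1, N + 1):
        B2[i][j] = r - E[i][j]
        r = C[i][j] + D[i][j]
        A2[i][j] = r

print(A1 == A2 and B1 == B2)   # True
```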
Ordering Transformations
• Recommended order:
— Loop Interchange
— Loop alignment and fusion
— Unroll-and-jam
— Scalar Replacement