
Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy
Optimizing Compilers for Modern Architectures
Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming
• Node Splitting
Recall the vectorization procedure:

procedure codegen(R, k, D);
// R is the region for which we must generate code.
// k is the minimum nesting level of possible parallel loops.
// D is the dependence graph among statements in R.
find the set {S1, S2, ..., Sm} of maximal strongly-connected
  regions in the dependence graph D restricted to R;
construct Rp from R by reducing each Si to a single node and
  compute Dp, the dependence graph naturally induced on Rp by D;
let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order
  consistent with Dp (use topological sort to do the numbering);
for i = 1 to m do begin
  if pi is cyclic then begin   // we can fail to vectorize here
    generate a level-k DO statement;
    let Di be the dependence graph consisting of all dependence
      edges in D that are at level k+1 or greater and are internal to pi;
    codegen(pi, k+1, Di);
    generate the level-k ENDDO statement;
  end
  else
    generate a vector statement for pi in r(pi)-k+1 dimensions,
      where r(pi) is the number of loops containing pi;
end
end
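
The recursive structure of codegen maps cleanly onto a graph library. Below is a minimal Python sketch of that structure (not the book's implementation), assuming a networkx digraph whose edges carry a "level" attribute giving the carrier level of each dependence; loop-independent dependences are modeled by leaving the attribute unset.

import networkx as nx

def codegen(D, k, emit):
    # condense D into its maximal strongly-connected regions (the Si)
    Rp = nx.condensation(D)
    # visit the regions in an order consistent with the induced graph Dp
    for p in nx.topological_sort(Rp):
        stmts = Rp.nodes[p]["members"]
        cyclic = len(stmts) > 1 or any(D.has_edge(s, s) for s in stmts)
        if cyclic:
            # cannot vectorize at level k: emit a sequential loop and
            # recur on the dependences at level k+1 or greater
            emit(f"DO  ! level {k}")
            Di = D.subgraph(stmts).copy()
            Di.remove_edges_from([(u, v) for u, v, lvl
                                  in Di.edges(data="level")
                                  if lvl is not None and lvl <= k])
            codegen(Di, k + 1, emit)
            emit(f"ENDDO  ! level {k}")
        else:
            # acyclic region: a vector statement suffices
            emit(f"vector statement for {sorted(stmts)}")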
Can we do better?
• codegen tries to find parallelism using only two transformations: loop distribution and statement reordering
• If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
• Goal in Chapter 5: to explore other transformations that exploit parallelism
Motivational Example
DO J = 1, M
  DO I = 1, N
    T = 0.0
    DO K = 1,L
      T = T + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T
  ENDDO
ENDDO
codegen will not uncover any vector operations. However, by
scalar expansion, we can get:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T$(I)
  ENDDO
ENDDO
Motivational Example II
• Loop distribution gives us:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
  ENDDO
  DO I = 1, N
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
  DO I = 1, N
    C(I,J) = T$(I)
  ENDDO
ENDDO
Motivational Example III
Finally, interchanging the I and K loops, we get:

DO J = 1, M
  T$(1:N) = 0.0
  DO K = 1,L
    T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
  ENDDO
  C(1:N,J) = T$(1:N)
ENDDO
• Two new transformations used:
  — Loop interchange
  — Scalar expansion
Loop Interchange
DO I = 1, N
  DO J = 1, M
S     A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
• DV: (=, <)
• Applying loop interchange:

DO J = 1, M
  DO I = 1, N
S     A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
• DV: (<, =)
• leads to:

DO J = 1, M
S  A(1:N,J+1) = A(1:N,J) + B
ENDDO
Loop Interchange
• Loop interchange is a reordering transformation
• Why?
  — Think of statements as being parameterized with the corresponding iteration vector
  — Loop interchange merely changes the execution order of these statements
  — It does not create new instances or delete existing ones
DO J = 1, M
  DO I = 1, N
S     <some statement>
  ENDDO
ENDDO
• In the original order, S(1,2) executes before S(2,1); if interchanged, S(2,1) will execute before S(1,2)
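
A tiny enumeration makes the reordering concrete. In this sketch (mine, with N = M = 2) instances are labeled S(J, I) in the original loop order:

# instances of S in execution order, before and after interchange
original     = [(j, i) for j in range(1, 3) for i in range(1, 3)]
interchanged = [(j, i) for i in range(1, 3) for j in range(1, 3)]
print(original)      # S(1,1) S(1,2) S(2,1) S(2,2): S(1,2) before S(2,1)
print(interchanged)  # S(1,1) S(2,1) S(1,2) S(2,2): S(2,1) before S(1,2)
# both orders contain exactly the same four instances: interchange
# reorders execution but creates and deletes nothing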
Loop Interchange: Safety
• Safety: not all loop interchanges are safe
DO J = 1, M
  DO I = 1, N
    A(I,J+1) = A(I+1,J) + B
  ENDDO
ENDDO
• Direction vector: (<, >)
• If we interchange the loops, we violate the dependence
Loop Interchange: Safety
• A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
Loop Interchange: Safety
• A dependence is interchange-sensitive if it is carried by the same loop after interchange; that is, an interchange-sensitive dependence moves with its original carrier loop to the new level.
• Example: Interchange-sensitive?
• Example: Interchange-insensitive?
Loop Interchange: Safety
• Theorem 5.1: Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
• The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.
Loop Interchange: Safety
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• The direction matrix for the loop nest is:

      <  <  =
      <  =  >

• Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
• Follows from Theorem 5.1 and Theorem 2.3.
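
Theorem 5.2 translates directly into a small test. A minimal Python sketch (the function name and list-of-lists representation are mine):

def permutation_is_legal(direction_matrix, perm):
    # legal iff no row, after column permutation, has '>' as its
    # leftmost non-'=' direction
    for row in direction_matrix:
        for d in (row[p] for p in perm):
            if d == "<":
                break          # leftmost non-'=' is '<': row is fine
            if d == ">":
                return False   # leftmost non-'=' is '>': illegal
        # a row of all '=' is loop-independent and always fine
    return True

dm = [["<", "<", "="], ["<", "=", ">"]]     # the matrix above (I, J, K)
print(permutation_is_legal(dm, [0, 1, 2]))  # True: original order
print(permutation_is_legal(dm, [2, 1, 0]))  # False: K outermost exposes '>'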
Loop Interchange: Profitability
• Profitability depends on architecture
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
S       A(I+1,J+1,K) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO
• For SIMD machines with a large number of functional units:
DO I = 1, N
S  A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
ENDDO
• Not suitable for vector register machines
Loop Interchange: Profitability
• For vector machines, we want to vectorize loops with stride-one memory access
• Since Fortran stores arrays in column-major order:
  — it is useful to vectorize the I-loop
• Thus, transform to:
DO J = 1, M
  DO K = 1, L
S     A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
ENDDO
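
The stride-one claim is easy to check empirically. A small sketch using numpy's Fortran-order layout (the array shape is arbitrary):

import numpy as np

A = np.zeros((100, 60), order="F")  # column-major, as Fortran stores A
print(A.strides)  # (8, 800): consecutive I values sit 8 bytes apart,
                  # consecutive J values 800 bytes apart, so vectorizing
                  # over the first subscript (the I loop) walks memory
                  # with stride one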
Loop Interchange: Profitability
• For MIMD machines with vector execution units, we want to cut down synchronization costs
• Hence, shift the K-loop to the outermost level:
PARALLEL DO K = 1, L
  DO J = 1, M
    A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
END PARALLEL DO
Scalar Expansion
DO I = 1, N
S1   T = A(I)
S2   A(I) = B(I)
S3   B(I) = T
ENDDO

• Scalar expansion:

DO I = 1, N
S1   T$(I) = A(I)
S2   A(I) = B(I)
S3   B(I) = T$(I)
ENDDO
T = T$(N)

• leads to:

S1  T$(1:N) = A(1:N)
S2  A(1:N) = B(1:N)
S3  B(1:N) = T$(1:N)
T = T$(N)
Scalar Expansion
• However, scalar expansion is not always profitable. Consider:
DO I = 1, N
  T = T + A(I) + A(I+1)
  A(I) = T
ENDDO
• Scalar expansion gives us:
T$(0) = T
DO I = 1, N
S1   T$(I) = T$(I-1) + A(I) + A(I+1)
S2   A(I) = T$(I)
ENDDO
T = T$(N)
• The carried true dependence of S1 on itself reflects reuse of values, not just of a memory location, so expansion leaves the recurrence in place and enables no vectorization.
Scalar Expansion: Safety
• Scalar expansion is always safe
• When is it profitable?
  — Naïve approach: expand all scalars, vectorize, shrink all unnecessary expansions
  — However, we want to predict when expansion is profitable
  — The key distinction is dependences due to reuse of a memory location vs. reuse of values
  — Dependences due to reuse of values must be preserved
  — Dependences due to reuse of a memory location can be deleted by expansion
Scalar Expansion: Drawbacks
• Expansion increases memory requirements
• Solutions:
  — Expand in a single loop
  — Strip mine the loop before expansion
  — Forward substitution:
DO I = 1, N
  T = A(I) + A(I+1)
  A(I) = T + B(I)
ENDDO

becomes:

DO I = 1, N
  A(I) = A(I) + A(I+1) + B(I)
ENDDO
Scalar Renaming
DO I = 1, 100
S1   T = A(I) + B(I)
S2   C(I) = T + T
S3   T = D(I) - B(I)
S4   A(I+1) = T * T
ENDDO
• Renaming scalar T:
DO I = 1, 100
S1   T1 = A(I) + B(I)
S2   C(I) = T1 + T1
S3   T2 = D(I) - B(I)
S4   A(I+1) = T2 * T2
ENDDO
Scalar Renaming
• will lead to (after expansion and vectorization):

S3  T2$(1:100) = D(1:100) - B(1:100)
S4  A(2:101) = T2$(1:100) * T2$(1:100)
S1  T1$(1:100) = A(1:100) + B(1:100)
S2  C(1:100) = T1$(1:100) + T1$(1:100)
T = T2$(100)
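
Renaming itself is a simple def-use rewrite. A minimal Python sketch of renaming a scalar within a straight-line loop body (the tuple representation and function name are mine, not from the text):

def rename_scalar(body, scalar):
    # give each assignment to the scalar a fresh name and rewrite
    # later uses to refer to the most recent definition
    current, fresh, out = scalar, 0, []
    for lhs, rhs_vars in body:
        rhs = [current if v == scalar else v for v in rhs_vars]
        if lhs == scalar:
            fresh += 1
            current = f"{scalar}{fresh}"   # T -> T1, T2, ...
            lhs = current
        out.append((lhs, rhs))
    return out

body = [("T", ["A(I)", "B(I)"]),       # S1
        ("C(I)", ["T", "T"]),          # S2
        ("T", ["D(I)", "B(I)"]),       # S3
        ("A(I+1)", ["T", "T"])]        # S4
print(rename_scalar(body, "T"))  # S2 now uses T1, S4 uses T2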
Node Splitting
• Sometimes renaming fails:
DO I = 1, N
S1:  A(I) = X(I+1) + X(I)
S2:  X(I+1) = B(I) + 32
ENDDO
• The recurrence is kept intact by the renaming algorithm
Node Splitting
DO I = 1, N
S1:  A(I) = X(I+1) + X(I)
S2:  X(I+1) = B(I) + 32
ENDDO

• Break the critical antidependence
• Make a copy of the node from which the antidependence emanates:

DO I = 1, N
S1': X$(I) = X(I+1)
S1:  A(I) = X$(I) + X(I)
S2:  X(I+1) = B(I) + 32
ENDDO

• Recurrence broken
• Vectorized to:

X$(1:N) = X(2:N+1)
X(2:N+1) = B(1:N) + 32
A(1:N) = X$(1:N) + X(1:N)
Node Splitting
• Determining the minimal set of critical antidependences is NP-complete
• A perfect job of node splitting is therefore difficult
• Heuristic (sketched below):
  — Select an antidependence
  — Delete it to see whether the dependence graph becomes acyclic
  — If acyclic, apply node splitting
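
A hedged Python sketch of this heuristic, using networkx; the "kind" edge attribute that marks antidependences is my representation, not from the text:

import networkx as nx

def try_node_splitting(D):
    # candidate edges: all antidependences in the dependence graph
    antideps = [(u, v) for u, v, kind in D.edges(data="kind")
                if kind == "anti"]
    trial = D.copy()
    trial.remove_edges_from(antideps)
    if nx.is_directed_acyclic_graph(trial):
        # deleting the antidependences breaks every cycle, so copying
        # the source of each one (as with S1' above) breaks the
        # recurrence; report the edges to split
        return antideps
    return None  # a cycle of true dependences remains: splitting fails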