Loop Mangling

With both thanks and apologies to Allen and Kennedy, Optimizing Compilers for Modern Architectures


Preliminaries

• We want to parallelize programs, so we concentrate on loops
• Types of parallelism
  — Vectorization
  — Instruction-level parallelism
  — Coarser-grain parallelism
    – Shared memory
    – Distributed memory
    – Message passing
• Formalism based upon dependence analysis


Transformations

• We call a transformation safe if the transformed program has the same "meaning" as the original program
• But what is the "meaning" of a program? For our purposes, two computations are equivalent if, on the same inputs, they produce the same outputs in the same order


Reordering Transformations

• A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements


Properties of Reordering Transformations

• A reordering transformation does not eliminate dependences; however, it can reverse the order of a dependence (running the sink before the source), which leads to incorrect behavior
• A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence


Fundamental Theorem of Dependence

• Fundamental Theorem of Dependence (Theorem 2.2 in the Allen and Kennedy text):
  — Any reordering transformation that preserves every dependence in a program preserves the meaning of that program
• Proof is by contradiction
• A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program


Parallelization and Vectorization

• It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence
• We want to convert loops like:
      DO I = 1, N
         X(I) = X(I) + C
      ENDDO
  to (Fortran 77 to Fortran 90):
      X(1:N) = X(1:N) + C
• However:
      DO I = 1, N
         X(I+1) = X(I) + C
      ENDDO
  is not equivalent to
      X(2:N+1) = X(1:N) + C
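To see why the two forms differ, here is a minimal self-checking sketch (not from the original slides; the program name and the choices N = 3, C = 1.0 are illustrative). The sequential loop reuses each freshly written element, while the array assignment reads all of X(1:N) before storing anything:

      PROGRAM carried_dep
        INTEGER, PARAMETER :: N = 3
        REAL, PARAMETER :: C = 1.0
        REAL :: X(N+1), Y(N+1)
        INTEGER :: I
        X = 1.0
        Y = 1.0
        DO I = 1, N               ! sequential: X(I+1) sees the value just written
           X(I+1) = X(I) + C
        ENDDO
        Y(2:N+1) = Y(1:N) + C     ! vector: all of Y(1:N) is read before any store
        PRINT *, 'loop  :', X     ! prints 1.0 2.0 3.0 4.0 -- C accumulates
        PRINT *, 'vector:', Y     ! prints 1.0 2.0 2.0 2.0 -- each element bumped once
      END PROGRAM carried_dep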
Loop Distribution

• Can statements in loops that carry dependences be vectorized?
      DO I = 1, N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO
• Dependence: S1 δ1 S2 (a true dependence carried by the loop); this can be converted to:
S1    A(2:N+1) = B(1:N) + C
S2    D(1:N) = A(1:N) + E


Loop Distribution

      DO I = 1, N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO
• transformed to:
      DO I = 1, N
S1       A(I+1) = B(I) + C
      ENDDO
      DO I = 1, N
S2       D(I) = A(I) + E
      ENDDO
• leads to:
S1    A(2:N+1) = B(1:N) + C
S2    D(1:N) = A(1:N) + E


Loop Distribution

• Loop distribution fails if there is a cycle of dependences:
      DO I = 1, N
S1       A(I+1) = B(I) + C
S2       B(I+1) = A(I) + E
      ENDDO
  since S1 δ1 S2 and S2 δ1 S1
• What about:
      DO I = 1, N
S1       B(I) = A(I) + E
S2       A(I+1) = B(I) + C
      ENDDO
  Here S1 δ S2 is loop-independent and S2 δ1 S1 is loop-carried, so the statements still form a cycle and distribution still fails.


Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming


Motivational Example

      DO J = 1, M
         DO I = 1, N
            T = 0.0
            DO K = 1, L
               T = T + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T
         ENDDO
      ENDDO
However, by scalar expansion, we can get:
      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T$(I)
         ENDDO
      ENDDO


Motivational Example II

• Loop distribution gives us:
      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
         ENDDO
         DO I = 1, N
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
         ENDDO
         DO I = 1, N
            C(I,J) = T$(I)
         ENDDO
      ENDDO


Motivational Example III

• Finally, interchanging the I and K loops and vectorizing on I, we get:
      DO J = 1, M
         T$(1:N) = 0.0
         DO K = 1, L
            T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
         ENDDO
         C(1:N,J) = T$(1:N)
      ENDDO
• A couple of new transformations used:
  — Loop interchange
  — Scalar expansion


Loop Interchange

      DO I = 1, N
         DO J = 1, M
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO
  DV: (=, <)
• Applying loop interchange:
      DO J = 1, M
         DO I = 1, N
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO
  DV: (<, =)
• leads to:
      DO J = 1, M
S        A(1:N,J+1) = A(1:N,J) + B
      ENDDO


Scalar Expansion

      DO I = 1, N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO
• Scalar expansion:
      DO I = 1, N
S1       T$(I) = A(I)
S2       A(I) = B(I)
S3       B(I) = T$(I)
      ENDDO
      T = T$(N)
• leads to:
S1    T$(1:N) = A(1:N)
S2    A(1:N) = B(1:N)
S3    B(1:N) = T$(1:N)
      T = T$(N)
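A minimal self-checking sketch of this swap example (not from the slides; the program name and N = 5 are illustrative), confirming that the expanded, vectorized form plus the copy-out T = T$(N) reproduces the sequential results:

      PROGRAM expand_check
        INTEGER, PARAMETER :: N = 5
        REAL :: A1(N), B1(N), A2(N), B2(N), TS(N), T
        INTEGER :: I
        A1 = (/ (REAL(I), I = 1, N) /)
        B1 = (/ (REAL(10*I), I = 1, N) /)
        A2 = A1
        B2 = B1
        DO I = 1, N        ! original: the scalar T creates anti/output dependences
           T = A1(I)
           A1(I) = B1(I)
           B1(I) = T
        ENDDO
        TS(1:N) = A2(1:N)  ! expanded and vectorized
        A2(1:N) = B2(1:N)
        B2(1:N) = TS(1:N)
        T = TS(N)          ! copy-out keeps T correct if it is live after the loop
        PRINT *, ALL(A1 == A2) .AND. ALL(B1 == B2)   ! prints T (true)
      END PROGRAM expand_check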
Scalar Expansion

• However, scalar expansion is not always profitable. Consider:
      DO I = 1, N
         T = T + A(I) + A(I+1)
         A(I) = T
      ENDDO
• Scalar expansion gives us:
      T$(0) = T
      DO I = 1, N
S1       T$(I) = T$(I-1) + A(I) + A(I+1)
S2       A(I) = T$(I)
      ENDDO
      T = T$(N)
• The recurrence on T$ remains, so the loop still cannot be vectorized; expansion has only added memory traffic


Scalar Renaming

      DO I = 1, 100
S1       T = A(I) + B(I)
S2       C(I) = T + T
S3       T = D(I) - B(I)
S4       A(I+1) = T * T
      ENDDO
• Renaming scalar T:
      DO I = 1, 100
S1       T1 = A(I) + B(I)
S2       C(I) = T1 + T1
S3       T2 = D(I) - B(I)
S4       A(I+1) = T2 * T2
      ENDDO
• After expansion and vectorization, this will lead to:
S3    T2$(1:100) = D(1:100) - B(1:100)
S4    A(2:101) = T2$(1:100) * T2$(1:100)
S1    T1$(1:100) = A(1:100) + B(1:100)
S2    C(1:100) = T1$(1:100) + T1$(1:100)
      T = T2$(100)


Array Renaming

      DO I = 1, N
S1       A(I) = A(I-1) + X
S2       Y(I) = A(I) + Z
S3       A(I) = B(I) + C
      ENDDO
• Dependences: S1 δ S2 (true), S2 δ^-1 S3 (anti), S3 δ1 S1 (true, carried), S1 δ^0 S3 (output)
• Rename A(I) to A$(I):
      DO I = 1, N
S1       A$(I) = A(I-1) + X
S2       Y(I) = A$(I) + Z
S3       A(I) = B(I) + C
      ENDDO
• Dependences remaining: S1 δ S2 and S3 δ1 S1


Seen So Far...

• Uncovering potential vectorization in loops by
  — Loop Interchange
  — Scalar Expansion
  — Scalar and Array Renaming
• Safety and profitability of these transformations


And Now ...

• More transformations
  — Node Splitting
  — Recognition of Reductions
  — Index-Set Splitting
  — Run-time Symbolic Resolution
• A unified framework to generate vector code
  — Left to the implementer (get a copy of the Allen and Kennedy book)


Node Splitting

• Sometimes renaming fails:
      DO I = 1, N
S1:      A(I) = X(I+1) + X(I)
S2:      X(I+1) = B(I) + 32
      ENDDO
• The recurrence is kept intact by the renaming algorithm


Node Splitting

• Break the critical antidependence: make a copy of the node from which the antidependence emanates
      DO I = 1, N
S1':     X$(I) = X(I+1)
S1:      A(I) = X$(I) + X(I)
S2:      X(I+1) = B(I) + 32
      ENDDO
• The recurrence is broken, and the loop is vectorized to:
      X$(1:N) = X(2:N+1)
      X(2:N+1) = B(1:N) + 32
      A(1:N) = X$(1:N) + X(1:N)


Node Splitting

• Determining a minimal set of critical antidependences is NP-complete, so doing a perfect job of node splitting is difficult
• Heuristic:
  — Select an antidependence
  — Delete it to see if the dependence graph becomes acyclic
  — If so, apply node splitting to it


Recognition of Reductions

• Sum reduction, min/max reduction, count reduction
• Vector ---> single element:
      S = 0.0
      DO I = 1, N
         S = S + A(I)
      ENDDO
• Not directly vectorizable


Recognition of Reductions

• Assuming commutativity and associativity:
      S = 0.0
      DO k = 1, 4
         SUM(k) = 0.0
         DO I = k, N, 4
            SUM(k) = SUM(k) + A(I)
         ENDDO
         S = S + SUM(k)
      ENDDO
• Distribute the k loop:
      S = 0.0
      DO k = 1, 4
         SUM(k) = 0.0
      ENDDO
      DO k = 1, 4
         DO I = k, N, 4
            SUM(k) = SUM(k) + A(I)
         ENDDO
      ENDDO
      DO k = 1, 4
         S = S + SUM(k)
      ENDDO


Recognition of Reductions

• After loop interchange:
      DO I = 1, N, 4
         DO k = I, MIN(I+3, N)
            SUM(k-I+1) = SUM(k-I+1) + A(k)
         ENDDO
      ENDDO
• Vectorize (when N is a multiple of 4):
      DO I = 1, N, 4
         SUM(1:4) = SUM(1:4) + A(I:I+3)
      ENDDO


Recognition of Reductions

• Properties of reductions:
  — Reduce a vector/array to one element
  — No use of intermediate values
  — The reduction operates on the vector and nothing else
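A minimal runnable sketch of the four-way partial-sum idea (not from the slides; the program name, N = 10, and the array name SUM4, chosen to avoid shadowing the SUM intrinsic, are illustrative), with MIN handling the tail when 4 does not divide N:

      PROGRAM reduction_demo
        INTEGER, PARAMETER :: N = 10
        REAL :: A(N), S, SSEQ, SUM4(4)
        INTEGER :: I, K
        A = (/ (REAL(I), I = 1, N) /)
        SSEQ = 0.0
        DO I = 1, N                      ! sequential reduction
           SSEQ = SSEQ + A(I)
        ENDDO
        SUM4 = 0.0
        DO I = 1, N, 4                   ! vectorizable partial sums
           K = MIN(I+3, N)
           SUM4(1:K-I+1) = SUM4(1:K-I+1) + A(I:K)
        ENDDO
        S = SUM(SUM4)                    ! combine the four partial sums
        PRINT *, SSEQ, S                 ! both print 55.0 (reassociation is exact here)
      END PROGRAM reduction_demo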
Index-Set Splitting

• Subdivide the loop into different iteration ranges to achieve partial parallelization:
  — Threshold analysis [strong SIV, weak crossing SIV]
  — Loop peeling [weak zero SIV]
  — Section-based splitting [a variation of loop peeling]


Threshold Analysis

      DO I = 1, 20
         A(I+20) = A(I) + B
      ENDDO
  vectorizes to:
      A(21:40) = A(1:20) + B
• With a longer iteration range:
      DO I = 1, 100
         A(I+20) = A(I) + B
      ENDDO
  strip mine to:
      DO I = 1, 100, 20
         DO i = I, I+19
            A(i+20) = A(i) + B
         ENDDO
      ENDDO
  and vectorize the inner loop


Loop Peeling

• The source of the dependence is a single iteration:
      DO I = 1, N
         A(I) = A(I) + A(1)
      ENDDO
  Loop peeled to:
      A(1) = A(1) + A(1)
      DO I = 2, N
         A(I) = A(I) + A(1)
      ENDDO
  Vectorize to:
      A(1) = A(1) + A(1)
      A(2:N) = A(2:N) + A(1)


Run-time Symbolic Resolution

• "Breaking conditions":
      DO I = 1, N
         A(I+L) = A(I) + B(I)
      ENDDO
  Transformed to:
      IF (L .LE. 0) THEN
         A(1+L:N+L) = A(1:N) + B(1:N)
      ELSE
         DO I = 1, N
            A(I+L) = A(I) + B(I)
         ENDDO
      ENDIF
• Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete
• Heuristic:
  — Identify when a critical dependence can be conditionally eliminated via a breaking condition


Putting It All Together

• Good part
  — Many transformations imply more choices to exploit parallelism
• Bad part
  — Choosing the right transformation
  — How to automate the transformation selection process?
  — Interference between transformations
• Any algorithm that tries to tie all the transformations together must
  — Take a global view of the transformed code
  — Know the architecture of the target machine
• Goal of our algorithm: find ONE good vector loop [works well for most vector register architectures]


Compiler Improvement of Register Usage


Overview

• Improving memory hierarchy performance by compiler transformations
  — Scalar Replacement
  — Unroll-and-Jam
• Saving memory loads and stores
• Making good use of the processor registers


Motivating Example

      DO I = 1, N
         DO J = 1, M
            A(I) = A(I) + B(J)
         ENDDO
      ENDDO
• A(I) can be left in a register throughout the inner loop, but coloring-based register allocation fails to recognize this
• With scalar replacement:
      DO I = 1, N
         T = A(I)
         DO J = 1, M
            T = T + B(J)
         ENDDO
         A(I) = T
      ENDDO
• All loads and stores to A in the inner loop have been saved
• High chance of T being allocated a register by the coloring algorithm


Scalar Replacement

• Convert array references to scalar references to improve the performance of the coloring-based allocator
• Our approach is to use dependences to drive these memory hierarchy transformations
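As a further illustration (a hypothetical example, not from the slides): when dependence analysis shows a true dependence of distance 1 on A, the value loaded as A(I) can be reused as A(I-1) in the next iteration through a rotating pair of scalars:

      PROGRAM replace_demo
        INTEGER, PARAMETER :: N = 6
        REAL :: A(0:N), B(N), t0, t1
        INTEGER :: I
        A = (/ (REAL(I), I = 0, N) /)
        ! Original form B(I) = A(I) + A(I-1) needs two loads of A per iteration
        t0 = A(0)
        DO I = 1, N
           t1 = A(I)       ! one load of A per iteration
           B(I) = t1 + t0
           t0 = t1         ! carries A(I) forward as the next iteration's A(I-1)
        ENDDO
        PRINT *, B         ! prints 1.0 3.0 5.0 7.0 9.0 11.0
      END PROGRAM replace_demo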
Unroll-and-Jam

      DO I = 1, N*2
         DO J = 1, M
            A(I) = A(I) + B(J)
         ENDDO
      ENDDO
• Can we achieve reuse of the references to B? Use the transformation called unroll-and-jam: unroll the outer loop twice, then fuse the copies of the inner loop:
      DO I = 1, N*2, 2
         DO J = 1, M
            A(I) = A(I) + B(J)
            A(I+1) = A(I+1) + B(J)
         ENDDO
      ENDDO
• This brings two uses of B(J) together


Unroll-and-Jam

• Apply scalar replacement to this code:
      DO I = 1, N*2, 2
         s0 = A(I)
         s1 = A(I+1)
         DO J = 1, M
            t = B(J)
            s0 = s0 + t
            s1 = s1 + t
         ENDDO
         A(I) = s0
         A(I+1) = s1
      ENDDO
• Half the number of loads of the original program


Legality of Unroll-and-Jam

• Is unroll-and-jam always legal?
      DO I = 1, N*2
         DO J = 1, M
            A(I+1,J-1) = A(I,J) + B(I,J)
         ENDDO
      ENDDO
• Apply unroll-and-jam:
      DO I = 1, N*2, 2
         DO J = 1, M
            A(I+1,J-1) = A(I,J) + B(I,J)
            A(I+2,J-1) = A(I+1,J) + B(I+1,J)
         ENDDO
      ENDDO
• This is wrong!!!


Legality of Unroll-and-Jam

• The direction vector in this example was (<, >)
  — This makes loop interchange illegal
• But does "loop interchange is illegal" imply "unroll-and-jam is illegal"? NO
  — Unroll-and-jam is loop interchange, followed by unrolling the inner loop, followed by another loop interchange


Conditions for Legality of Unroll-and-Jam

• Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together
• Theorem: Unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<, >) such that the distance for the outer loop is less than n


Conclusion

• We have learned two memory hierarchy transformations:
  — scalar replacement
  — unroll-and-jam
• They reduce the number of memory accesses by making maximum use of the processor registers


Loop Fusion

• Consider the following:
      DO I = 1, N
         A(I) = C(I) + D(I)
      ENDDO
      DO I = 1, N
         B(I) = C(I) - D(I)
      ENDDO
• If we fuse these loops, we can reuse the values of C(I) and D(I) in registers:
      DO I = 1, N
         A(I) = C(I) + D(I)
         B(I) = C(I) - D(I)
      ENDDO


Example

      DO J = 1, N
         DO I = 1, M
            A(I,J) = C(I,J) + D(I,J)
         ENDDO
         DO I = 1, M
            B(I,J) = A(I,J-1) - E(I,J)
         ENDDO
      ENDDO


Fusion and Interchange

      DO I = 1, M
         DO J = 1, N
            B(I,J) = A(I,J-1) - E(I,J)
            A(I,J) = C(I,J) + D(I,J)
         ENDDO
      ENDDO


Scalar Replacement

      DO I = 1, M
         r = A(I,0)
         DO J = 1, N
            B(I,J) = r - E(I,J)
            r = C(I,J) + D(I,J)
            A(I,J) = r
         ENDDO
      ENDDO
• We've saved (N-1)*M loads of A


Ordering Transformations

• Recommended order:
  — Loop interchange
  — Loop alignment and fusion
  — Unroll-and-jam
  — Scalar replacement
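To close, a minimal self-checking sketch of the fusion example above (not from the slides; the program name, the sizes M = 4 and N = 5, and the constant data are illustrative, and A is dimensioned A(M,0:N) so that the A(I,0) references are defined):

      PROGRAM fusion_check
        INTEGER, PARAMETER :: M = 4, N = 5
        REAL :: A(M,0:N), A2(M,0:N), B(M,N), B2(M,N)
        REAL :: C(M,N), D(M,N), E(M,N), r
        INTEGER :: I, J
        C = 1.0
        D = 2.0
        E = 0.5
        A = 0.0
        A2 = 0.0
        DO J = 1, N                       ! original loop nest
           DO I = 1, M
              A(I,J) = C(I,J) + D(I,J)
           ENDDO
           DO I = 1, M
              B(I,J) = A(I,J-1) - E(I,J)
           ENDDO
        ENDDO
        DO I = 1, M                       ! fused, interchanged, scalar-replaced
           r = A2(I,0)
           DO J = 1, N
              B2(I,J) = r - E(I,J)
              r = C(I,J) + D(I,J)
              A2(I,J) = r
           ENDDO
        ENDDO
        PRINT *, ALL(A == A2) .AND. ALL(B == B2)   ! prints T (true)
      END PROGRAM fusion_check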