Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures

Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming
• Node Splitting

Recall the vectorization procedure:

    procedure codegen(R, k, D);
    // R is the region for which we must generate code.
    // k is the minimum nesting level of possible parallel loops.
    // D is the dependence graph among statements in R.
      find the set {S1, S2, ..., Sm} of maximal strongly-connected regions
          in the dependence graph D restricted to R;
      construct Rp from R by reducing each Si to a single node, and compute
          Dp, the dependence graph naturally induced on Rp by D;
      let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order
          consistent with Dp (use topological sort to do the numbering);
      for i = 1 to m do begin
        if pi is cyclic then begin          // we can fail here
          generate a level-k DO statement;
          let Di be the dependence graph consisting of all dependence edges
              in D that are at level k+1 or greater and are internal to pi;
          codegen(pi, k+1, Di);
          generate the level-k ENDDO statement;
        end
        else
          generate a vector statement for pi in r(pi)-k+1 dimensions, where
              r(pi) is the number of loops containing pi;
      end
    end

Can we do better?

• codegen tries to find parallelism using the transformations of loop
  distribution and statement reordering.
• If we deal with loops containing cyclic dependences early on in the loop
  nest, we can potentially vectorize more loops.
• Goal in Chapter 5: explore other transformations to exploit parallelism.

Motivational Example

    DO J = 1, M
      DO I = 1, N
        T = 0.0
        DO K = 1, L
          T = T + A(I,K) * B(K,J)
        ENDDO
        C(I,J) = T
      ENDDO
    ENDDO

codegen will not uncover any vector operations.
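As a sanity check, the loop nest above can be transcribed directly into Python/NumPy (a sketch; the function and variable names are ours, not the text's):

```python
import numpy as np

def matmul_loops(A, B):
    """Direct transcription of the motivational loop nest.  The scalar T
    carries a dependence cycle at every loop level, which is why codegen
    uncovers no vector operations here.  (Names are illustrative.)"""
    N, L = A.shape
    L2, M = B.shape
    assert L == L2
    C = np.empty((N, M))
    for J in range(M):              # DO J = 1, M
        for I in range(N):          # DO I = 1, N
            T = 0.0                 # T = 0.0
            for K in range(L):      # DO K = 1, L
                T = T + A[I, K] * B[K, J]
            C[I, J] = T             # C(I,J) = T
    return C

rng = np.random.default_rng(0)
A, B = rng.random((4, 3)), rng.random((3, 5))
assert np.allclose(matmul_loops(A, B), A @ B)   # the nest computes C = A * B
```

The scalar expansion, loop distribution, and loop interchange steps that follow rewrite exactly this nest into vector form.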
However, by scalar expansion, we can get:

    DO J = 1, M
      DO I = 1, N
        T$(I) = 0.0
        DO K = 1, L
          T$(I) = T$(I) + A(I,K) * B(K,J)
        ENDDO
        C(I,J) = T$(I)
      ENDDO
    ENDDO

Motivational Example II

• Loop distribution gives us:

    DO J = 1, M
      DO I = 1, N
        T$(I) = 0.0
      ENDDO
      DO I = 1, N
        DO K = 1, L
          T$(I) = T$(I) + A(I,K) * B(K,J)
        ENDDO
      ENDDO
      DO I = 1, N
        C(I,J) = T$(I)
      ENDDO
    ENDDO

Motivational Example III

Finally, interchanging the I and K loops, we get:

    DO J = 1, M
      T$(1:N) = 0.0
      DO K = 1, L
        T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
      ENDDO
      C(1:N,J) = T$(1:N)
    ENDDO

• A couple of new transformations were used:
  — Loop interchange
  — Scalar expansion

Loop Interchange

    DO I = 1, N
      DO J = 1, M
S       A(I,J+1) = A(I,J) + B      ! DV: (=, <)
      ENDDO
    ENDDO

• Applying loop interchange:

    DO J = 1, M
      DO I = 1, N
S       A(I,J+1) = A(I,J) + B      ! DV: (<, =)
      ENDDO
    ENDDO

• leads to:

    DO J = 1, M
S     A(1:N,J+1) = A(1:N,J) + B
    ENDDO

Loop Interchange

• Loop interchange is a reordering transformation. Why?
  — Think of statements as being parameterized with the corresponding
    iteration vector.
  — Loop interchange merely changes the execution order of these statement
    instances.
  — It does not create new instances or delete existing instances.

    DO J = 1, M
      DO I = 1, N
S       <some statement>
      ENDDO
    ENDDO

• If interchanged, S(2, 1) will execute before S(1, 2).

Loop Interchange: Safety

• Safety: not all loop interchanges are safe.

    DO J = 1, M
      DO I = 1, N
        A(I,J+1) = A(I+1,J) + B
      ENDDO
    ENDDO

• Direction vector: (<, >)
• If we interchange the loops, we violate the dependence.

• A dependence is interchange-preventing with respect to a given pair of
  loops if interchanging those loops would reorder the endpoints of the
  dependence.

• A dependence is interchange-sensitive if it is carried by the same loop
  after interchange; that is, an interchange-sensitive dependence moves with
  its original carrier loop to the new level.
• Example: interchange-sensitive?
• Example: interchange-insensitive?

• Theorem 5.1: Let D(i,j) be a direction vector for a dependence in a
  perfect nest of n loops. Then the direction vector for the same dependence
  after a permutation of the loops in the nest is determined by applying the
  same permutation to the elements of D(i,j).
• The direction matrix for a nest of loops is a matrix in which each row is
  the direction vector for some dependence between statements contained in
  the nest, and every such direction vector is represented by a row.

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
          A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
        ENDDO
      ENDDO
    ENDDO

• The direction matrix for this loop nest is:

    < < =
    < = >

• Theorem 5.2: A permutation of the loops in a perfect nest is legal if and
  only if the direction matrix, after the same permutation is applied to its
  columns, has no ">" direction as the leftmost non-"=" direction in any row.
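Theorem 5.2 can be expressed as a small legality check (a sketch; the function name and the encoding of directions as '<'/'='/'>' strings are ours):

```python
def interchange_legal(direction_matrix, perm):
    """Legality test from Theorem 5.2: apply the loop permutation to the
    columns of the direction matrix and reject it if any row then has '>'
    as its leftmost non-'=' entry."""
    for row in direction_matrix:
        for d in (row[p] for p in perm):
            if d == '=':
                continue
            if d == '>':
                return False   # the would-be carrier loop runs backwards
            break              # leftmost non-'=' is '<': this row is satisfied
    return True

# Direction matrix of the example nest: rows (<, <, =) and (<, =, >).
D = [['<', '<', '='],
     ['<', '=', '>']]
assert interchange_legal(D, [0, 1, 2])        # original order (I, J, K)
assert interchange_legal(D, [1, 0, 2])        # interchange I and J: legal
assert not interchange_legal(D, [2, 0, 1])    # K outermost: '>' leftmost, illegal
```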
• This follows from Theorem 5.1 and Theorem 2.3.

Loop Interchange: Profitability

• Profitability depends on the architecture.

    DO I = 1, N
      DO J = 1, M
        DO K = 1, L
S         A(I+1,J+1,K) = A(I,J,K) + B
        ENDDO
      ENDDO
    ENDDO

• For SIMD machines with a large number of functional units:

    DO I = 1, N
S     A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
    ENDDO

• This is not suitable for vector register machines.

• For vector machines, we want to vectorize loops with stride-one memory
  access. Since Fortran stores arrays in column-major order, it is useful to
  vectorize the I-loop. Thus, transform to:

    DO J = 1, M
      DO K = 1, L
S       A(2:N+1,J+1,K) = A(1:N,J,K) + B
      ENDDO
    ENDDO

• For MIMD machines with vector execution units, we want to cut down
  synchronization costs. Hence, shift the K-loop to the outermost level:

    PARALLEL DO K = 1, L
      DO J = 1, M
        A(2:N+1,J+1,K) = A(1:N,J,K) + B
      ENDDO
    END PARALLEL DO

Scalar Expansion

    DO I = 1, N
S1    T = A(I)
S2    A(I) = B(I)
S3    B(I) = T
    ENDDO

• Scalar expansion gives:

    DO I = 1, N
S1    T$(I) = A(I)
S2    A(I) = B(I)
S3    B(I) = T$(I)
    ENDDO
    T = T$(N)

• which leads to:

S1  T$(1:N) = A(1:N)
S2  A(1:N) = B(1:N)
S3  B(1:N) = T$(1:N)
    T = T$(N)

• However, scalar expansion is not always profitable. Consider:

    DO I = 1, N
      T = T + A(I) + A(I+1)
      A(I) = T
    ENDDO

• Scalar expansion gives us:

    T$(0) = T
    DO I = 1, N
S1    T$(I) = T$(I-1) + A(I) + A(I+1)
S2    A(I) = T$(I)
    ENDDO
    T = T$(N)

• The expanded loop still carries a recurrence through T$, so nothing is
  gained here.

Scalar Expansion: Safety

• Scalar expansion is always safe.
• The key distinction is between dependences due to reuse of a memory
  location and dependences due to reuse of values.
• When is it profitable?
  — Naïve approach: expand all scalars, vectorize, then shrink all
    unnecessary expansions.
  — However, we want to predict when expansion is profitable.
  — Dependences due to reuse of values must be preserved.
  — Dependences due to reuse of a memory location can be deleted by
    expansion.

Scalar Expansion: Drawbacks

• Expansion increases memory requirements.
• Solutions:
  — Expand in a single loop only.
  — Strip-mine the loop before expansion.
  — Use forward substitution:

    DO I = 1, N
      T = A(I) + A(I+1)
      A(I) = T + B(I)
    ENDDO

  becomes

    DO I = 1, N
      A(I) = A(I) + A(I+1) + B(I)
    ENDDO

Scalar Renaming

    DO I = 1, 100
S1    T = A(I) + B(I)
S2    C(I) = T + T
S3    T = D(I) - B(I)
S4    A(I+1) = T * T
    ENDDO

• Renaming scalar T:

    DO I = 1, 100
S1    T1 = A(I) + B(I)
S2    C(I) = T1 + T1
S3    T2 = D(I) - B(I)
S4    A(I+1) = T2 * T2
    ENDDO

• which, after expansion and vectorization, will lead to:

S3  T2$(1:100) = D(1:100) - B(1:100)
S4  A(2:101) = T2$(1:100) * T2$(1:100)
S1  T1$(1:100) = A(1:100) + B(1:100)
S2  C(1:100) = T1$(1:100) + T1$(1:100)
    T = T2$(100)

Node Splitting

• Sometimes renaming fails:

    DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = B(I) + 32
    ENDDO

• The recurrence is kept intact by the renaming algorithm.
• Node splitting breaks the critical antidependence by making a copy of the
  node from which the antidependence emanates:

    DO I = 1, N
S1':  X$(I) = X(I+1)
S1:   A(I) = X$(I) + X(I)
S2:   X(I+1) = B(I) + 32
    ENDDO

• The recurrence is broken, and the loop can be vectorized to:

    X$(1:N) = X(2:N+1)
    X(2:N+1) = B(1:N) + 32
    A(1:N) = X$(1:N) + X(1:N)

Node Splitting

• Determining a minimal set of critical antidependences is NP-complete, so
  doing a perfect job of node splitting is difficult.
• Heuristic:
  — Select an antidependence.
  — Delete it to see if the graph becomes acyclic.
  — If acyclic, apply node splitting.
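The node-splitting example can be checked in Python/NumPy with a 0-based rendering of the loops (a sketch; function names are ours, not the book's):

```python
import numpy as np

def split_loop_sequential(X, B):
    """Original loop: S1 reads X(I+1) one iteration before S2 overwrites it,
    so an antidependence and a true dependence form a recurrence."""
    X = X.copy()
    N = len(B)
    A = np.empty(N)
    for i in range(N):          # DO I = 1, N (0-based here)
        A[i] = X[i + 1] + X[i]  # S1
        X[i + 1] = B[i] + 32    # S2
    return A, X

def split_loop_vectorized(X, B):
    """After node splitting: the copy X$(1:N) = X(2:N+1) captures the values
    S1 needs before S2 overwrites them, so every statement vectorizes."""
    X = X.copy()
    N = len(B)
    Xcopy = X[1:N + 1].copy()   # S1': X$(1:N)  = X(2:N+1)
    X[1:N + 1] = B + 32         # S2:  X(2:N+1) = B(1:N) + 32
    A = Xcopy + X[0:N]          # S1:  A(1:N)   = X$(1:N) + X(1:N)
    return A, X

rng = np.random.default_rng(0)
X0, B0 = rng.random(7), rng.random(6)
for seq, vec in zip(split_loop_sequential(X0, B0),
                    split_loop_vectorized(X0, B0)):
    assert np.allclose(seq, vec)
```

Note that the vectorized form reads the updated X(1:N) in the last statement, which matches the sequential order in which S2 of iteration I-1 feeds S1 of iteration I.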