Program Optimization and Parallelization Using Idioms

SHLOMIT S. PINTER, Technion—Israel Institute of Technology
and
RON Y. PINTER, IBM Israel Science and Technology

Programs in languages such as Fortran, Pascal, and C were designed and written for a sequential machine model. During the last decade, several methods to vectorize such programs and recover other forms of parallelism that apply to more advanced machine architectures have been developed (particularly for Fortran, due to its pointer-free semantics). We propose and demonstrate a more powerful translation technique for making such programs run efficiently on parallel machines that support facilities such as parallel prefix operations in addition to parallel and vector capabilities. This technique, which is global in nature and involves a modification of the traditional definition of the program dependence graph (PDG), is based on the extraction of parallelizable program structures ("idioms") from the given (sequential) program. The benefits of our technique extend beyond the above-mentioned architectures, and it can be viewed as a general program optimization method, applicable in many other situations. We show a few examples in which our method indeed outperforms existing analysis techniques.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—compilers; optimization

General Terms: Algorithms, Languages, Performance

Additional Key Words and Phrases: Array data flow analysis, computational idioms, dependence analysis, graph rewriting, intermediate program representation, parallelism, parallel prefix, reduction, scan operations

This research was performed in part while on sabbatical leave at the Department of Computer Science, Yale University, and was supported in part by NSF grant number DCR-8405478 and by ONR grant number N00014-89-J-1906. A preliminary version of this article appeared in the Proceedings of the 18th Annual ACM Symposium on Principles of Programming Languages (Jan. 1991).

Authors' addresses: S. Pinter, Department of Electrical Engineering, Technion—Israel Institute of Technology, Haifa 32000, Israel; R. Pinter, IBM Science and Technology, MATAM Advanced Technology Center, Haifa 31905, Israel; email: pinter@haifasc3.vnet.ibm.com.

ACM Transactions on Programming Languages and Systems, Vol. 16, No. 3, May 1994, Pages 305-327.

1. INTRODUCTION

Many of the classical compiler optimization techniques [Aho et al. 1986] comprise the application of local transformations to (intermediate) code: the process of replacing suboptimal program fragments with better ones is repeated until a fixed point is found or some other criterion is met, and one hopes, as is often the case, that this will result in an overall better program. To a large extent, attempts to vectorize—and otherwise parallelize—sequential code [Wolfe 1989] are also based on the detection of local relationships among data items and the control structures that enclose them. These approaches, however, are local in nature and do not recognize the structure of the computation being carried out.
A computation structure sometimes spans a fragment of the program and may include irrelevant details that obscure the picture, both to the human eye and to automatic detection algorithms. Such structures, which we call "idioms" [Perlis and Rugaber 1979], are not always directly expressible as constructs in conventional high-level languages; they include inner-product calculations and recurrence equations in numeric code, data structure traversal in a symbolic processing context, and selective updating of shared values in a distributed application. Recognizing them amounts to "lifting" the level of the classical, local methods.

Once recognized, a fragment can be replaced by a new piece of code that is tailored for the task at hand and whose optimization potential is far beyond that which can be obtained by an expert programmer. This potential is most appealing in view of highly parallel target machines, for which code that was originally optimized for sequential computing can be replaced lock, stock, and barrel. For example, a loop computing the convolution of two vectors, which uses array references and pointers, can be replaced by an equivalent assembler fragment or by a call to a BLAS [Dongarra et al. 1988; Lawson et al. 1979] routine that does the same job on the appropriate target machine. Better yet, if the machine at hand—for example, TMC's CM-2 [Thinking Machines Corp. 1987]—supports summary operators, such as reduction and scan, which work in time O(log n) using a parallel prefix implementation [Blelloch 1989; Ladner and Fischer 1980] rather than O(n) on a sequential machine (where n is the length of the vectors involved), then the gains could be even more dramatic.
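The logarithmic-time claim for reduction and scan rests on the standard parallel prefix schemes cited above. As a rough illustration only (our own sketch, not code from this article or from any of the cited libraries), the following shows a prefix computation whose depth is O(log n) when the two recursive halves and the final combining step run concurrently:

    # Illustrative sketch of a parallel prefix (inclusive scan).  Written
    # sequentially here; on a parallel machine the two recursive calls and the
    # final element-wise combine run concurrently, giving O(log n) depth.
    def scan(xs, op):
        n = len(xs)
        if n <= 1:
            return list(xs)
        half = n // 2
        left = scan(xs[:half], op)        # the two halves are independent
        right = scan(xs[half:], op)
        carry = left[-1]                  # total of the left half
        return left + [op(carry, r) for r in right]

    def reduce(xs, op):
        return scan(xs, op)[-1]           # a reduction is the last scan element

    print(scan([1, 2, 3, 4, 5], lambda a, b: a + b))    # [1, 3, 6, 10, 15]
    print(reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15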
In this article we propose a method for extracting parallelizable idioms from scientific programs. We cast our techniques in terms of the construction and analysis of the computation graph, which is a modified extension of the program dependence graph (PDG) [Ferrante et al. 1987], since PDG-based analysis methods seem to be most flexible and amenable to extensions. Forming this new type of graph from a source program—as is necessary for the required analyses—and using it for optimization transformations involves three major steps:

—When necessary, loops are unrolled so as to reach a normal form(1), whose precise definition and the algorithms for its formation are an additional contribution of this article.

—For each innermost loop in the program, we construct the computation graph for the basic block constituting its body. Certain conditionals (which are classified as "data filters") are handled gracefully, whereas others we disallow.

—The graph of each loop is replicated two additional times, and together with the initial and final iterations, we obtain the computation graph of the whole loop. This process is repeated as we go up the nesting structure as far as possible (until we hit branching control).

(1) In the sense of Munshi and Simons [1987], not that of Allen and Kennedy [1987].

Once the graph is built, we apply transformations that indeed provide a variety of speedups. Using this method, we were able to extract structures that other methods fail to recognize, as reported in the literature and as verified by experiments with existing tools, thereby providing an additional potential for speedup. Its power can be demonstrated on the following small example, taken from Allen et al. [1987], who gave up on trying to parallelize it (we examined it with noncommercial tools that implemented transformations for this purpose, and even KAP [Huson et al. 1986] (Version 6) and VAST-2 [Brode 1981], which recognize second- and third-order linear recurrences, cannot vectorize it):

Example 1:
      DO 100 I = 2, N-1
        C(I) = A(I) + B(I)
        B(I+1) = C(I-1) * A(I)
100   CONTINUE

Known methods fail here because the "middle" computation hides a problematic dependence. A human observer can, however, recognize recurrences that can be computed independently using the method of Kogge and Stone [1973] in time O(log n) (n being the value of N) rather than O(n). Using our tools, as we go up the structure we obtain a computation graph from which this kind of transformation can be extracted. We believe that such transformations may also be effected by other analysis frameworks, such as symbolic evaluation.
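To see concretely why a recurrence solver applies to Example 1 (this is our own worked illustration; the article derives the recognition mechanically in Sections 3 and 4), substitute the first statement into the second: for I >= 3, B(I+1) = C(I-1)*A(I) = (A(I-1) + B(I-1))*A(I), so the even-indexed and the odd-indexed entries of B form two independent first-order linear recurrences, each solvable by the Kogge-Stone scheme, after which C is a fully parallel element-wise sum. The following check compares the original loop with that view:

    import random

    def original_loop(A, B, C, N):
        A, B, C = A[:], B[:], C[:]
        for I in range(2, N):                 # DO 100 I = 2, N-1
            C[I] = A[I] + B[I]
            B[I + 1] = C[I - 1] * A[I]
        return B, C

    def recurrence_view(A, B, C, N):
        A, B, C = A[:], B[:], C[:]
        B[3] = C[1] * A[2]                    # the only use of an initial C value
        for start in (4, 5):                  # even chain and odd chain: independent
            for J in range(start, N + 1, 2):
                # B(J) = (A(J-2) + B(J-2)) * A(J-1): a first-order linear
                # recurrence over one parity class, amenable to a parallel scan
                B[J] = (A[J - 2] + B[J - 2]) * A[J - 1]
        for I in range(2, N):                 # element-wise, fully parallel
            C[I] = A[I] + B[I]
        return B, C

    N = 12
    A = [random.random() for _ in range(N + 2)]
    B = [random.random() for _ in range(N + 2)]
    C = [random.random() for _ in range(N + 2)]
    assert original_loop(A, B, C, N) == recurrence_view(A, B, C, N)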
Another advantage of our approach is that it handles data and control uniformly: our method encapsulates the effect of inner loops so that idioms spanning deep nesting can be identified. Moreover, by incorporating data filters that abstract certain forms of conditionals, idioms for the selective manipulation of vector elements (which are useful on vector and SIMD machines) can be extracted. Finally, due to the careful definition of the framework, our method can uniformly serve a wide variety of target architectures, ranging from vector to distributed machines, by an appropriate selection of transformation rules.

In what follows we first review other work on compiler transformations and evaluate its limitations for purposes of idiom extraction. Then, in Section 3, we provide a formal presentation of our technique, including the definition and construction of computation graphs, loop normalization, and the necessary transformation algorithms. In Section 4 we exemplify our techniques, stressing the advantages over existing methods. Section 5 tries to put our work in perspective, discusses applications other than parallelism, and proposes further research.

2. EXISTING PARALLELIZATION METHODS

Much of the traditional program analysis work culminated in the definition and use of the above-mentioned PDG or its constituents, the control and data dependence graphs. Such a graph defines the flow of control in the program as well as the dependence between statements; similar graphs were presented in Kuck et al. [1981], together with many compiler transformations used mainly for improving vectorization and loop parallelization. Analyzing the graph enables the detection of loop-invariant assignments and loop-carried dependence, and when applied to programs in static single assignment (SSA) form [Cytron et al. 1989] it can be quite effective. However, the analysis methods, e.g., as embodied in the PTRAN system [Allen et al. 1988], are very much tied to the structure of the program. Moreover, the focus is on following dependence as they are reflected at the syntactic level of variable names (among statements) rather than tracking the flow of data (values) being used in the program's computation. For example, the PDG of the loop shown in Section 1 (Example 1) contains two nodes on a directed cycle, implying a circular dependency. This representation is more conservative than necessary, since the two reductions hidden in this computation do not depend on each other. Thus, the PDG framework, which classifies dependence at the statement level, is not quite strong enough to allow us to extract patterns of data modification that are not readily apparent from the source program; deeper semantic analysis is needed. In our framework it is the computation graph which will expose the potential for parallelization in this loop.

Partial success in recovering reduction-type operators is reported in the work on KAP [Huson et al. 1986], Parafrase [Lee et al. 1985; Polychronopoulos et al. 1986], and VAST [Brode 1981; Pacific-Sierra 1990] (all these tools also do a very good job of vectorization and loop parallelization). The restructuring techniques are in the spirit of the PDG-based work, but the methods for detecting reduction operators are different from ours and somewhat ad hoc. The KAP system (the strongest in recovering these types of operations) can recognize second- and third-order linear recurrences, yet when such computations are somewhat intermixed, as in our previous example, it fails to do so.

Another transformation-based approach is that of capturing the data dependence among array references by means of dependence vectors [Chen 1986; Karp et al. 1967; Lamport 1974]. Then methods from linear algebra and number theory can be brought to bear to extract wavefronts of the computation. The problem with this framework is that only certain types of dependence are modeled; there is no way to talk about specific operations (and their algebraic properties), and in general it is hard to extend. We note that the work of Banerjee [1988] on data dependence analysis and that of Wolfe [1989] can be viewed as extending the above-mentioned dependence graph framework by using direction and distance vectors. More recently, Callahan [1991] provided a combination of algebraic and graph-based methods for recognizing and optimizing a generalized class of recurrences. This method is indeed effective in exploiting this type of computation, but it is strongly tied to a specific framework and does not lend itself to generalizations to nonrecurrence-type transformations.

Finally, various symbolic-evaluation methods have been proposed [Jouvelot and Dehbonei 1989; Letovsky 1988; Rich and Waters 1988], mostly for plan analysis of programs. These methods all follow the structure of the program rigorously, and even though the program is transformed into some normal form up front and all transformations preserve this property, they remain highly sensitive to "noise" in the source program. More severely, the reasoning about array references (which are the mainstay of scientific code) is quite limited.

3. COMPUTATION GRAPHS AND ALGORITHMS FOR IDIOM EXTRACTION

In this section we propose a new graph-theoretic model in which to represent a program, namely, the computation graph. To allow effective usage of this model for purposes of idiom extraction, programs must be preprocessed and, most importantly, loops must be normalized; this is presented in the beginning of the section. Next we define the computation graph formally and list some of its important properties. Finally, we provide an algorithmic framework that uses this new abstraction to analyze the structure of data usage and identify computational patterns; specific transformations and their replacements, for reduction operations and for finding values of recurrences, are provided. To guide the reader through this section we use the sample program of Figure 1, which is written in Fortran. This example is merely meant to illustrate the definitions and algorithms, not to demonstrate the advantages of our approach over other work; that will be done in Section 4.
3.1 Preprocessing and Loop Normalization

Before the computation graph can be applied to a given program we must transform the program so as to allow our analysis to effectively capture its structure for purposes of idiom extraction. Some of these "preprocessing" transformations are straightforward and well-known optimizations, as we shall point out in passing, but others require some new techniques that we explain. At this stage we assume that programs contain only IF statements of the data filter kind, as in

      IF (A(I) .EQ. 0) B(I) = 1.

Some other conditionals can be pulled out of loops or converted to filters using standard techniques, but those that cannot are outside the scope of our techniques. Finally, we assume that all array references depend linearly (with integral coefficients) on loop variables.

      DO 10 I=1,N
        T = I+4
C       assume M is even
        DO 10 J=3,M
          A(I,J) = A(I,J) + T*A(I,J-2)
10    CONTINUE

Fig. 1. A sample Fortran program.

The preprocessing transformations are as follows:

—Standard basic block optimizations (such as dead-code elimination, common subexpression and dead-store detection, etc.) are performed per Aho et al. [Section 9.4, 1986], so that the resulting set of assignments does not contain redundancies. This implies that the DAG that is obtained represents only the def-use relationship among values that are live upon entry to each block and those that are live at the exit. We further assume that the expression defining each value is "simple," meaning that it contains at most two values and one operator or that it is of some predefined linear form; which restriction is imposed depends on the type of recognition algorithm we want to apply later (in Section 3.3).

—To extend the scope of the technique, it is applied to as many values as possible.

—Loops are normalized with respect to their index, as defined below. Munshi and Simons [1987] have observed that by sufficient unrolling all loop-carried dependence can be made to occur only between consecutive iterations. The exact conditions stating how much unrolling is required and the technique for transforming a loop to normal form have never been spelled out; thus—in what follows—we provide the necessary details.

Normalization comprises two steps: determination of loop-carried dependence (specifically among different references of arrays) and unrolling. If data dependence analysis does not provide the exact dependence we may unroll the loop more than is necessary without affecting the correctness of the process.
Customarily, e.g., in Allen et al. [1987], Ferrante et al. [1987], and Wolfe [1989], data dependence is defined between statements rather than between scalar values and array references. We provide the following definition for data dependence among values in a program (and from now on we use only this notion of data dependence):

Definition 1. Value A is data dependent on value B if they both refer to the same memory location and either (i) a use of A depends on the definition of B (flow dependence), (ii) the definition of A redefines B which was used in an expression (antidependence), or (iii) the definition of A redefines a previous definition of B (output dependence).

A dependence is loop carried if it occurs between two different iterations of a loop. In the following example, array A has two references (and so does B); the dependence between the two references of B is loop carried:

Example 2:
      DO 20 I=2,M
        B(I) = 2*A(I)
        A(I+M) = B(I-1)
20    CONTINUE

Conceptually we can view the array A as if it were partitioned into two (pairwise disjoint) subarrays, one for each reference, since 2 <= i, j <= m and i + m != j for all iterations i and j (where m denotes the value of M); thus there is no loop-carried dependence between the two references of A. The loop-carried flow dependence on B involves only consecutive iterations. Now we are ready to say when a loop is in normal form:

Definition 2. A given loop is in normal form if each of its loop-carried dependence involves only two consecutive iterations.

Any loop can be normalized as follows. Find all loop-carried dependence, and let the span of the original loop be the largest integer k such that some value set in iteration i depends directly on a value set in iteration i - k. Then, to normalize a loop with span k > 1, the loop's body is copied k - 1 times. The (unrolled) loop's index is adjusted appropriately at each iteration so that it starts at 1 and is incremented by k, and loops are normalized from the most deeply nested outward.

In the code of Figure 1, if m > 4 then the inner loop needs to be normalized by unrolling it once, whereas the outer loop is vacuously in normal form, since it carries no dependence once the temporary scalar T has been expanded; we thus obtain the program in Figure 2. In order to compute the span it is not enough to find the distance (in the loop iteration space) of a dependence. In the following example the span is 1 although the distance in the first assignment is 2:

Example 3:
      DO 10 I=1,N
        A(I+2)=A(I)+X
        A(I+1)=A(I+2)+Y
10    CONTINUE

Here there is only one loop-carried flow dependence, from the value set in the second statement to its use by the first statement in the following iteration. At the current stage the scope of our work includes only loops for which the span is a constant. Note that the loop can still include array references which use its bound in some cases. For example, the span of the following loop is 2:

Example 4:
      DO 10 I=2,N-1
        A(N-I) = A(N-I) + A(N+2-I)
10    CONTINUE

      DO 10 I=1,N
        T(I) = I+4
C       assume M is even
        DO 10 J=1,M-2,2
          A(I,J+2) = A(I,J+2) + T(I)*A(I,J)
          A(I,J+3) = A(I,J+3) + T(I)*A(I,J+1)
10    CONTINUE

Fig. 2. The program of Figure 1 in normal form.
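As a minimal sketch of the span-and-unroll procedure just described (our own illustration, not the authors' implementation; it assumes the distances of the direct loop-carried dependence have already been produced by an array data-flow analysis, rewrites indices textually, and omits bound adjustment):

    import re

    def span(direct_distances):
        # The span is the largest k such that a value set in iteration i
        # depends *directly* on a value set in iteration i - k; distances of
        # overwritten (indirect) definitions, as in Example 3, are excluded.
        return max(direct_distances, default=1)

    def normalize(body_stmts, index_var, direct_distances):
        """Replicate the body `span` times; copy i refers to index_var + i,
        and the loop stride becomes the span (bound adjustment omitted)."""
        k = span(direct_distances)
        unrolled = []
        for i in range(k):
            repl = index_var if i == 0 else f"({index_var}+{i})"
            unrolled += [re.sub(rf"\b{index_var}\b", repl, s) for s in body_stmts]
        return unrolled, k

    # The inner loop of Figure 1 has a single direct dependence of distance 2,
    # so its body is copied twice and the stride becomes 2, as in Figure 2.
    body = ["A(I,J) = A(I,J) + T(I)*A(I,J-2)"]
    print(normalize(body, "J", [2]))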
An immediate result of normalization is that the patterns of loop-carried dependence among iterations become uniform. This observation is used later, in the computation graph, to explicitly reveal all the dependence patterns among iterations of the loop and its interaction with the surrounding context; thus, we can match patterns in order to enable the transformations. Finally, note that a similar notion appears in Aiken and Nicolau [1988], where the unrolling of a loop is carried out until a steady state is reached.

Comments on Normalization. We cannot handle indirect array references, such as A(X(I)), without employing partial evaluation techniques. Indirect referencing is mostly used to generate access patterns which are irregular and data dependent; since such patterns are known only at run-time, compile-time analysis cannot take advantage of them, and this limitation is inherent to all compiler-based approaches.

There is a relation between the span of a loop and Banerjee's [1988] loop-carried dependence distance, which defines for a given dependence the distance between the two points of the iteration space that cause the dependence. In particular, if a loop is being normalized then there is at least one dependence with distance equal to the span. In current transformations, dependence distances are mainly used for introducing proper synchronization (of some sort); the resulting delay must cause a wait which lasts through the execution of as many operations as may be generated by unrolling using the span. Both methods can handle (at the moment) only constant spans or distances, which may not be the same for all the statements in the loop.

At first it may seem that unrolling a loop for normalization results in extraneous space utilization. However, we point out that unrolling loop bodies is done by many optimizing compilers in order to better utilize pipeline-based machines; thus space usage and other resource issues are similar for both cases.

3.2 The Computation Graph

Given a program satisfying the above assumptions, we define its computation graph (CG) in stages: first we handle basic blocks; then we show how to represent normalized loops; and finally we show how to incorporate data filters.
Lastly, relationship the u to u if u is output we draw an antidependence edge from u to v if the definition of u uses a value on which u is antidependent (unless this edge is a self-loop which is redundant since the fetch is done before the assignment). —Each being node u is labeled by the expression according to which the value is computed. The labeling uses a numbering that is imposed on the clef-use edges entering for initial Using the array references constituting the body tion shown graph the node (i.e., the arguments). The label L is used values. in verbatim of the inner Figure 3. as atomic loop of Figure Since we variables, 2 gives shall be the basic block rise to the computalooking for potential reduction and scan operations, which apply to linear forms, we allow the functions at nodes to be ternary multiply-add combinations. Note also that the nodes denoting the initial values of the variables could be replaced when the graph is embedded in a larger context, as we shall see. Next we define computation graphs for normalized loops. Such obtained body 2 as follows: by replicating the DAGs representing the enclosed —Three copies of the graph representing the body are generated: represents the initial iteration; one is a typical middle iteration; stands for the final iteration. the tag of the replicated middle, or final). Each node with node is uniquely the unrolled identified ACM Transactions on Programming Languages and Systems, Vol are one copy and one by extending copy it belongs 2At the innermost level these are basic blocks, but as we go up the structure computation graphs. graphs to (initial, these are general 16, No 3, May 1994 314 S. S Plnter and R. Y. Plnter . t. (1 Fig. 3. The computation graph of the basic block constituting the inner loop of the program in Figure 2, In case there are both a flow and an output dependence we draw only the flow edge, —Loop-carried data dependence between the different copies. edges (cross iteration edges) are drawn —Each instance of the loop variable (like in expressions of array references) is replaced with ini t, mid, and fin depending on its appearance in the initial, middle, or final copy, respectively. —Two nodes in different copies that represent location (like A ( ini t + 2 ) and A (mid) when coalesced, them below unless (in which there exists case there a flow a value in the same memory the stride of the loop is 2) are or an output are possibly two dependence different edge between values; see also note on antidependences). —The graph necessary uses). The graph is linked to its surroundings by data (i.e., from ini~ializations outside the is constructed on a loop-nest dependence edges where loop and to subsequent by loop-nest basis, i.e., there is no need to build it for the whole program. In fact, one might decide to limit the extent of the graph and its application in order to reduce compile time or curb the potential space complexity. Note. If a node v is a target of an antidependence edge then v is also a target of an output dependence edge. Consider the existence of an antidependence edge from u to v, i.e., the value that was stored in the same memory location as u and that was used in the definition of u is now being destroyed. This means that there is a node w which constitutes the old value that was used in the definition of u; thus there is an output dependence edge between w and v in addition to the clef-use edge from w to u. 
In general, if the meaning of an output dependence edge, say (u, v), is that all the uses of t~—on clef-use edges leaving u—must be executed before v, then the antidependence Figure in Figure Node ACM edges are redundant. 4 shows the computation 2. The graph S4 of the middle Transactmns comprises graph of the inner three copies copy is coalesced on Programmmg Languages with and Systems, of the loop (J) of the example graph from node S 1 of the initial Vol. 16, No. 3, May 1994 Figure 3. copy since Optimization Fig. 4. The computation they both represent Using Idioms graph of the inner loop of the program the same value and there are Similarly, node no loop-carried S5 of the middle copy. The pattern same and Parallelization (the stride is repeated between the final t + 2 = mid), and the middle in Figure 5. of the triple unfolding copies. same value for all three the outer I loop) would consist of three copies of what is shown in Figure annotation of the nodes. A more general example that includes anti, output, is presented contribution ini 2. output dependence between these nodes. copy is coalesced with node S2 of the initial Node SO represents the value of T ( I ) which is the copies. The graph for the whole program (including dependence The main in Figure of J is 2, thus, 315 . 4, with and of loops flow is that appropriate loop-carried when com- bined with normalization the computation graph represents explicitly all possible dependence. The computation graph spans a “signature” of the whole loops structure, and we call are transformed the body into of the enclosing yield another summary The following lemma foundation for the it the their summary summary loops and may—in form. summarizes applicability this of the form form, of the they turn—be loop. Once are treated replicated inner as part similarly observation, and it will transformations described serve of to as the in the next Section. ~EMMA dependence PROOF. 1. Three iterations structure. Recall that of a normalized all loop-carried loop dependence same for all the middle iterations, i.e., those that its entire are only between tive iterations. Thus, having three nodes to represent loop assures the property that dependence involving one. capture data consecu- a value in a normalized the second copy are the are not the first or the last ❑ Note, that when embedding a loop in a program fragment we need to cover the cases in which the loop’s body is executed fewer than three times. Since we are interested in parallelizing loops that must be iterated more than twice we omit the description of such an embedding. ACM TransactIons on Programming Languages and Systems, Vol. 16, No. 3, May 1994. 316 . DO I S. S. Pinter and R, Y Pinter = 1,1 S1 : B(I) = A(I) S2: D(I) = D(I-1) S3 : C(I) = B(I-1) * 5 + C(I+I) EEDDO Fig. 5. A program and Its CG. Dashed and dotted arrows are anti respectively; the vertical dashed lines separate the three copies. The structs computation graph like data filters. Definition constituent can also represent some 3. A data filter is a conditional values is involved in any loop-carried types and output dependence, of conditional con- expression—none of whose dependence—controlling an assignment. A data filter is represented a double circle. 
The immediate predecessors of this node are the arguments of the Boolean expression constituting the predicate of the conditional, the previous definition of the variable to which the assignment is being made, and the values involved in the expression on the right-hand side of the assignment. The outgoing edge from the filter node goes to the node corresponding to the value whose assignment is being controlled. The filter node is labeled by a conditional expression whose then value is the right-hand side of the original assignment statement, and whose else value is the previous (old) value of the variable to which the assignment is being made. This expression is written in terms of the incoming edges, as Figure 6 demonstrates: in statement 20, the assignment to S is controlled by a predicate on the value of A(I), hence the filter node in the graph.

      S = 0.0
      DO 20 I=1,N
20      IF (A(I) .NE. 0) S = S + B(I)

Fig. 6. A program with a data filter and the computation graph of its loop's body.

Figure 7 shows the computation graph of the program of Figure 6.

Fig. 7. The computation graph of the program of Figure 6.

Finally, note that due to normalization, the only cross iteration edges are from the initial copy to the middle copy and from the middle copy to the final copy; there cannot be edges from the final copy to the middle copy nor from the middle copy to the initial copy, and there are no cross iteration edges between the initial and final copies. These properties hold also when the loop's body contains a filter node. The following lemma summarizes the main structural property of the CG; these properties are used both in designing the transformation patterns and in assuring the correctness of their application.

LEMMA 2. A CG representing a normalized loop of a sequential program without parallel constructs is acyclic.

Computation graphs are acyclic by construction. This is obvious for basic blocks and is not hard to show for loops as well as for data filters.

3.3 Algorithms

Having constructed the computation graph, the task of finding computational idioms in the program amounts to recognizing certain patterns in the graph. These patterns comprise graph structures such as paths or other particular subgraphs, depending on the idiom, and some additional information pertaining to the labeling of the nodes. The idiom recognition algorithm constitutes both a technique for identifying the patterns in the given graph as well as the conditions for when they apply, i.e., checking whether the context in which they are found is one where the idiom can indeed be used.
These graph of the program the sure generated 6 operators, of course), thus generating no side effects are lost by denoting per Rosendahl and Mankwald a list of idioms the” graph-rewriting include structures and such equations, transposition, reflection, and thereof, such as inner product, convolution, basic of Figure as a preprocessing stage. Notice [1979]. as reduction, rules that scan, recurrence FFT butterflies. Compositions and other permutations, can be that data filters can be part of a rule. —At the top level, we (repeatedly) match patterns from the given list according to a predetermined application schedule until no more changes are applicable that we need govern similar (or some other termination to establish a precedence the order in which to the conventional they are applied. This greedy application of optimization [Aho et al. 1986], is just one alternative. the merits of transformations and find graph at each stage, and then iterate. We next elaborate on each details. First we discuss rize information, these condition is met). This means relation among rules that will of the graph-rewriting will be mostly One could a minimum above items, tactic, which transformations is assign costs that reflect cost cover of the whole filling in the necessary rules. Since we are trying to summareduction rules, i.e., shrinking sub- graphs into smaller ones. More importantly, there are three characteristics that must be matched besides the skeletal graph structure: the operators (functions), the array references, and context, i.e., relationship to the enclosing structure. The first two items can be handled by looking at the labels of the vertices. The third involves finding a particular subgraph that can be replaced by a new structure and making sure it does not interfere with the rest of the computation. For example, to identify a reduction operation on an array, we need to find a path of nodes all having the same associative-operator label (e.g., multiplication, ACM Transactions addition, on Programmmg or an appropriate Languages and Systems, linear Vol combination 16, No, 3, May 1994 thereof) Optimization and Parallelization . Using Idioms 319 Jvar Q == > @--c) Fig. 8. Matching and replacement rule for reduction. and using consecutive (i.e., initial, middle, and final) entries in the update the same summary variable; we also need to ascertain intervening computation is going on (writing to or reading from array that to no the summary variable). Both the annotations computations are part of the nodes as well as the guards against intervening of the graph grammar productions defining the re- placement rules. Figures 8 and 9 provide clear. Here we assume that vectorization expansion, patterns; using have occurred, the vectorization a different similarly with of Figure appropriate data rely on transformations tool or can be part 1992]). To accommodate the rule so we two such rules transformations, filters, results themselves of our rules expanding 10 takes their additional when rules rules for first by can be formulated [Pinter are necessary. reduction. notion scalar looking can be applied base (they rewrite care of a filtered to make this including Notice and Mizrachi For example, the similarity to the rule of Figure 8, if we splice out the filter nodes and redirect the edges; the scan rule of Figure 9 can be similarly used to derive a rule for a filtered scan. Once such rules are applied, the computation graph contains new types of nodes, namely, summary nodes representing idioms. 
These nodes can, of course, appear themselves as candidates side of a rule), thereby enabling apply the rules for lower-level for replacement further optimization. optimizations first (on the others, but one should not get the impression that this procedure follows the loop structure of the given program. On the contrary, tation graph as constructed allows the detection of structures otherwise be obscured, as can be seen in Section 4. ACM Transactions on Programming Languages left-hand Obviously, we would and only then use the and Systems, necessarily the computhat might Vol. 16, No 3, May 1994 320 . S. S. Plnter and R. Y. Plnter o L Y(init) -1 Va’r x(lnl~+A) 1 —— ——> scan(f) X(mid+A) Y a % 1 X( fin+A) Fig. 9. Matching and replacement rule for scan, —— ——> Fig, 10. A matching and replacement rule for a filtered The result of applying the rule of Figure 9 to the graph in Figure 11. If we had started with a graph for the entire program the inner loop (as represented in Figure 4), then to the result would have generated the following 10 ACM DOALL 10 1=1, T(I) =I+4 SCA]J(A(I, *), SCAIV(A(I, *), CO~JTIPJUE Transactions reduction of Figure 4 is shown of Figure 2, not just applying a vectorization program: N 1, 2, on Programming Pi–i, M, 2, Languages 2, T(I), ‘*,1, ‘, +,,) T(I), “*”, “+”) and Systems, Vol. 16, No 3. May 1994. rule Optimization Fig. 11. The computation graph in Figure 4. graph resulting We use an arbitrary information, including and Parallelization from applying Using Idioms the transformations template for SCAN which includes array bounds, strides inside vectors operations to be performed in reverse Polish notation. In the case of Figures 6 and 7, the resulting problem T(l:N)=O. O WHERE(A(l:N). S= REDUCE(T, where program 90 and others In general, of Figure 321 9 to the all the necessary or minors, and the would be NE. O) T(l:N)=B(l:N) 1, N, 1, “+”) T is a newly resultant . defined array. (This is due to the way somewhat most awkward parallel [Guzzi et al. 1990], are defined.) the order in which rules are phrasing dialects, applied such and the of the as Fortran termination condition depend on the rule set. If the rules are Church-Rosser [Church and Rosser 1936] then this is immaterial, but often they are competing (in the sense that the application the or are contradictory other) beyond of one would the scope of this on properties article, of rewriting outrule (creating the consequent potential and we defer oscillation). its discussion application of This is issue to general studies systems. 4. EXAMPLES To exemplify three cases the advantages in which other of our idiom methods recovery cannot speed method, the we present computation here up but ours can (if combined properly with standard methods). We do not include the complete derivation in each case, but we outline the major steps and point out the key items. The first example is one of the loops used in a Romberg integration routine. The original Hampton, Example code VA) 5: looks (as released by the Computer Sciences Corporation of as follows: DO 40 N=2, M+l KN2. K+N–2 KNM2 . KN2 + M KNM1 = KNM2 + 1 F=l. o/(4 .o**(N–1)–l. ACM TransactIons on Programming Languages o) and Systems, Vol. 16, No. 3, May 1994. 322 . S, S Plnter and R. Y. Plnter TEMP1=WK(KNM2) –WK(K~C2) WK(K~~Ml) =WF(&-NM2) +F*TEMPI CONTIllUE 40 After substitution loop bounds, which of temporaries, scalar are all straightforward expansion DO 40 I= K+M, K+2*M–1 F(I) =l. 
O/(4 .0** (l– K–M +1)–1.0) WK(I+l) =WK(I)+ F(I) *( WK(I)-WK(l CO~TTIITUE 40 Notice that no normalization of F, and transformations, renaming of we obtain -M)) is necessary in this case since from data dependence analysis the only loop-carried data dependence between references to WK is of distance 1 which implies a span of 1. Furthermore, observe that the computation can be deduced The using loop computing loop containing of F ( I ) can be performed the well-known technique F can be easily a single statement vectorized, which in a separate of “loop loop, as distribution.” so all we are left computes the we with is a WK ( I + 1 ) . If we draw the computation graph for this loop and allow operators to be linear combinations with a coefficient (F ( I ) in this case), we can immediately recognize the scan operation that is taking place and thus WK1(K+M: K+2*P-l)=WK (K: SCAN (WK, l;+M, K+2*~4–1, The from second Allen example is the et al. [ 1987], who the loop is already required, we detect including the computing computation equations K+ M–l) 1, WK1, program form. computation presented F, appearing of the in Kogge “*”, in the “+”) introduction to parallelize chains even3 in the entries in and Stone [ 1973], taking three computation ~ and the data are arranged using the method for (taken it). Notice Once the loop is unfolded independent the odd entries. If can be performed the following: “-”, gave up on trying in normal two produce that times, graph, c, and the as one other properly, computing the whole recurrence time n ) (n being O(log the value of N) rather than 0(n). The computation graph can be seen in Figure 12. Notice that it is completely different from the PDG which consists of two nodes on a directed cycle. Finally, we parallelize a program computing the inner product of two vectors, which is a commonly occurring computation: program (in p.o DOIOOI =1, N P= P+:<(I) *Y(I) 100 COBTTINTJE Traditional analysis places the loop’s body of the by preparation for vectorization) re- ‘T(I) =X( I) *Y(I) P= P+ T(I) 3 We use “even” and “od~ here just to distinguish does not need to know or prove this fact at all ACM TransactIons on Pro~ammmg Languages between the chains; the recognition and Systems, Vol 16, No 3, May 1994 procedure Optimization ( C(init) Fig. 12. ) ( and Parallelization A(mid) ) The computation Using Idioms . 323 (B(ini graph of example 1 Q J- P Fig. 13. Transforming an inner product to a reduction, Figure 13 shows the computation graphs of this transformed basic block and that of the relevant part of the loop. The vectorizable portion of the loop can be transformed into a statement of the form T = x *Y, and what is left can be matched by the left-hand side of the rule of Figure 8. Applying the rule produces 1, the graph “ + “ ). The recognition corresponding of an inner to the product statement using our P = REDUCE (T, techniques will 1, N, not be disturbed by enclosing contexts such as a matrix multiplication program. The framed part of the program in Figure 14 produces the same reduced graph that appears replicated after treating the two outermost loops. ACM Transactions on Programming Languages and Systems, Vol. 16, No, 3, May 1994, 324 S. S. Pinter and R. Y. Pinter . Fig. 14, tion The portion program formation that of a matrix is recognized of Figure DO 100 1=1, !J DD 100 J=l ,1 C(I, multiplica- J)=O by the trans- 12. DO 100 K=l ,1 C(I, 100 J)=C(I, J)+A(I, K)* B(K, J) COHTIHUE 5, DISCUSSION We have proposed tion parallelization. and grams, more thereby globally. 
methods a new program It adding extracts to existing When comparing analysis method more methods our for purposes information the method from ability to of optimiza- the source to recognize the other most pro- idioms common we can say the following: —Constructing and then using the program dependence graph (PDG) or a similar structure focuses on following dependence as they are reflected at the syntactic level alone. Thus, the PDG is more conservative than our computation graph (CG) which further potential for parallelization. —There was partial analysis) 1986]. contains success in recovering in the work on Parafrase The techniques employed, more reduction information and operators (using [Lee et al. 1985; Polychronopoulos however, are somewhat exposes PDG-like et al. ad hoc compared to ours. —The places where normalization is carried out are, in general, ble for known parallelizing transformations extension to automatic restructuring theory). —Capturing dependence solving the data dependence vectors [Chen 1986; a system computation of linear are limited (thereby not applica- providing between array references by Karp et al. 1967; Wolfe 1989] equations to functional to (side extract the effect free) wavefronts a proper means of and then of the sets of expressions. —Symbolic and partial evaluation methods [Jouvelot and Dehbonei 1989; Letovsky 1988; Rich and Waters 1988] all follow the structure of the program rigorously, and even when the program is transformed into some normal form these methods are still highly sensitive to “noise” in the source program. Their major drawback is the lack of specific semantics for array references, limiting their suitability for purposes of finding reduction operations on arrays. Our primary data parallelism. ACM Transactions objective is to cater for machines with effective support of Our techniques, however, are not predicated on any particuon Programming Languages and Systems, Vol. 16, No. 3, May 1994 Optimization lar hardware, language but can rather constructs formalisms that are APL and Parallelization be targeted reflect Crystal, C*, and functional *lisp which to an abstract such features. and the BLAS all support reductions might appear in conjunction on algorithmic (such as fixed-point implementing the lelization that package (which has and scans. with other parallel than and Mizrachi 1992]; such and synchronization steps computations. repeated Also, application further of the rules computations) is also necessary. Finally, we are currently techniques mentioned here as part of an existing paral- system would methods or to of such higher-level both methodological and algorithinvestigated. In addition to the can be expressed as computation graph patterns and be identified as idioms [Pinter constructs can be used to eliminate communication that 325 a variety of machines), vector-matrix et al. [ 1989], and languages such as All in all, we feel that this article makes mic contributions that should be further constructs mentioned above, many others work . architecture Examples idioms, many efficient implementations on primitives as suggested in Agrawal Using Idioms [Lempel indicate et al. 1992] how prevalent and plan to obtain experimental each of the transformations results is. ACKNOWLEDGMENTS The authors would like for helpful discussions, comments on an earlier anonymous referees to thank as well draft for their Ron Cytron, as Jeanne of this article. 
REFERENCES

AGRAWAL, A., BLELLOCH, G. E., KRAWITZ, R. L., AND PHILLIPS, C. A. 1989. Four vector-matrix primitives. In the Symposium on Parallel Algorithms and Architectures. ACM, New York, 292-302.

AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers—Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass.

AIKEN, A. AND NICOLAU, A. 1988. Optimal loop parallelization. In the SIGPLAN'88 Conference on Programming Language Design and Implementation. ACM, New York, 308-317.

ALLEN, R. AND KENNEDY, K. 1987. Automatic translation of FORTRAN programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.

ALLEN, F. E., BURKE, M., CHARLES, P., CYTRON, R., AND FERRANTE, J. 1988. An overview of the PTRAN analysis system for multiprocessing. J. Parall. Distrib. Comput. 5, 5 (Oct.), 617-640.

ALLEN, R., CALLAHAN, D., AND KENNEDY, K. 1987. Automatic decomposition of scientific programs for parallel execution. In the 14th Annual Symposium on the Principles of Programming Languages. ACM, New York, 63-76.

BANERJEE, U. 1988. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, New York.

BLELLOCH, G. E. 1989. Scans as primitive parallel operations. IEEE Trans. Comput. C-38, 11 (Nov.), 1526-1539.

BRODE, M. 1981. Precompilation of FORTRAN programs to facilitate array processing. Computer 14, 9 (Sept.), 46-51.

CALLAHAN, D. 1991. Recognizing and parallelizing bounded recurrences. In the 4th Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 589. Springer-Verlag, New York, 169-185.
H., PADUA, D. A,, LEASURE, B., AND WOLFE, M. ACM Symposium and compiler optimizations In the 18th Annual gramming Languages. 1974. Dependence of a graphs on the Principles of Pro- ACM, New York, 207-218. LADNER, R. E. AND FISCHER, M. J. 831-838 LAMPORT, L. 1981. for The parallel 1980. Parallel execution prefix computation. of do loops. Commun. ACM J. ACM 27, 4 17, 2 (Feb.), 89-93. LAWSON, C. L., HANSON, R. J., KINCAID, D R., AND KROGH, F. T. 1979, Basic linear subprograms for FORTRAN usage. ACM Trans. Math. Soft. 5, 3 (Sept.), 308–323. LEE, G., KRUSKAL, C. P., AND KUCK, D. J. of nonnumerical programs for parallel (oct.), algebra 1985. An empirical study of automatic restructuring Trans. Comput. C-34, 10 (Ott ), processors. IEEE 927-933. L~MPEL, O., PINTER, S. S., AND TURIEL, E. 1992. Parallelizing a C dialect for distributed on Languages and Compilers for Parallel memory MIMD machines. In the 5th Workshop Computmg. Lecture Notes in Computer Science, vol. 757. Springer-Verlag, New York, 369-390 LETOVSKY, S. I. 1988 Plan analysis of programs. Ph.D. thesis, YALEU/CSD/TR-662, Dept. of Computer Science, Yale Univ., New Haven, Corm. MUNSHI, A. AND SIMONS, B. 1987. Scheduling sequential loops on parallel processors. Tech. Rep RJ 5546, IBM Almaden Research Center, San Jose, Calif. Research, Los Angeles, PACIFIC-SIERRA RESEARCH. 1990. VAST for RS / 6000. Pacific-Sierra Calif. PF.RI.IS, A. .J. ANTI RITGARRR, S. 1979. Programming with idioms in APL. In APL ‘7.9 ACM. New York, 232–235. Also, APL Quote Quad, vol. 9, no. 4. PINTER, S. S. AND MIZBACHI, L. 1992. Using the computation dependence graph for compile-time of the Workshop on Advanced Compilation optimization and parallelization. In Proceedings Techruques for Nouel Architectures. Springer-Verlag, New York. POLYCHRONOPOULOS,C. D., KUCK, D. J., AND PADUA, D. A. 1986. Execution of parallel loops on Conference on Parallel Processing. IEEE, New parallel processor systems. In the International York, 519–527, RICH, C. AND WATERS) R. C. puter 21, 11 (Nov.), 10–25. 1988. The programmer’s ROSENDAHL, M. AND MANKWALD, K. P. ACM TransactIons on Programmmg Languages 1979. Analysis and Systems, apprentice: A research of programs overview by reduction Vol. 16, No, 3, May 1994 Com- of their Optimization structure. In Graph-Grammars and Their and Parallelization Applications to Using Idioms Computer Science . and 327 Biology. Lecture Notes in Computer Science, vol. 73, Springer-Verlag, New York, 409-417. THINKING MACHINES CORPORATION. 1987. Connection Machine Model CM-2 technical summary. Tech. Rep. HA87-4. Thinking Machines Corp., Cambridge, Mass. Supercompilers for Supercomputers. MIT Press, Cambridge, WOLFE, M. J. 1989. Optimizing Mass. Received May 1990; revised ACM October Transactions 1993; accepted October on Programming Languages 1993 and Systems, Vol. 16, No. 3, May 1994.