Effects of Parallelism Degree on Run-Time Parallelization of Loops

Chengzhong Xu
Department of Electrical and Computer Engineering
Wayne State University, Detroit, MI 48202
http://www.pdcl.eng.wayne.edu/czxu

Abstract

Due to the overhead of exploiting and managing parallelism, run-time loop parallelization techniques that aim to maximize parallelism do not necessarily lead to the best performance. In this paper, we present two parallelization techniques that exploit different degrees of parallelism for loops with dynamic cross-iteration dependences. The DOALL approach exploits iteration-level parallelism. It restructures the loop into a sequence of do-parallel loops, separated by barrier operations. Iterations of a do-parallel loop are run in parallel. By contrast, the DOACROSS approach exposes fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences. The DOACROSS approach has variants that identify different amounts of parallelism among consecutive reads to the same memory location. We evaluate the algorithms on symmetric multiprocessors for loops with various structures, memory access patterns, and computational workloads. The algorithms are scheduled using block cyclic decomposition strategies. The experimental results show that the DOACROSS technique outperforms the DOALL, even though the latter is widely used in compile-time parallelization of loops. Of the DOACROSS variants, the algorithm allowing partially concurrent reads performs best because it incurs only slightly more overhead than the algorithm disallowing concurrent reads. The benefit from allowing fully concurrent reads is significant for small loops that do not have enough parallelism. However, it is likely to be outweighed by its cost for large loops or loops with light workload.

1 Introduction

Loop parallelization exploits parallelism among instruction sequences or loop iterations. Techniques for exploiting instruction-level parallelism are prevalent in today's microprocessors. On multiprocessors, loop parallelization techniques focus on loop-level parallelism. They partition and allocate loop iterations among processors with respect to cross-iteration dependences. Their primary objective is to expose enough parallelism to keep processors busy all the time while minimizing synchronization overheads.

On multiprocessors, there are two important parallelization techniques that exploit different degrees of parallelism. The DOALL technique treats loop iterations as the basic scheduling and execution units [2, 3]. It decomposes the iterations into a sequence of subsets, called wavefronts. Iterations within the same wavefront are run in parallel. A barrier synchronization operation is used to preserve cross-iteration dependences between two wavefronts. The DOALL technique reduces the run-time scheduling overhead at the sacrifice of a certain amount of parallelism. By contrast, the DOACROSS technique exploits fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences among memory references. The DOACROSS technique maximizes parallelism at the expense of frequent synchronization. Note that in the literature the terms DOACROSS and DOALL often refer to the loops with and without cross-iteration dependences, respectively.
We borrow the terms as names of parallelization techniques in this paper because the DOALL technique essentially restructures a loop into a sequence of DOALL loops. Both the DOALL and DOACROSS techniques are used to parallelize DOACROSS loops. Chen and Yew [5] studied programs from the PERFECT benchmark suite and revealed the significant advantages of parallelizing DOACROSS loops.

DOACROSS loops can be characterized as static or dynamic in terms of the time when cross-iteration dependence information is available (at compile-time or at run-time). Figure 1 shows an example of a dynamic loop, due to the presence of indirect access patterns on the data array X. Dynamic loops appear frequently in scientific and engineering applications [16]. Examples include SPICE for circuit simulation, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [4].

    for i=1 to N do
        ... = X[v[i]] + ...
        X[u[i]] = ...
    endfor

Figure 1: A general form of loops with indirect access patterns, where u and v are input-dependent functions.

For parallelizing static loops, the DOALL technique plays a dominant role because it employs a simple execution model after exploiting parallelism at compile-time [8]. For dynamic loops, however, this may not be the case. Since parallelism in a dynamic loop has to be identified at run-time, the cost of building wavefronts in the DOALL technique becomes a major source of run-time overhead. The DOACROSS technique incurs run-time overhead for analyzing reference-wise dependences, but it provides more flexibility to subsequent scheduling.

This paper compares the DOALL and DOACROSS approaches for parallelizing loops at run-time, focusing on the effect of parallelism degree on the performance of parallelization. In [18], we devised two DOACROSS algorithms exploiting different amounts of parallelism and demonstrated their effectiveness on symmetric multiprocessors. This paper presents a DOALL algorithm that allows parallel construction of the wavefronts and compares this algorithm with the DOACROSS algorithms, focusing on the influence of parallelism degree. We show that the DOACROSS algorithms have advantages over the DOALL, even though the latter is preferred for compile-time parallelization. Of the DOACROSS variants, the algorithm that allows concurrent reads may over-expose parallelism for large loops. Its benefits are likely to be outweighed by the run-time cost of exploiting and managing the extra amount of parallelism for large loops or loops with light workload.

The rest of the paper is organized as follows. Section 2 reviews run-time parallelization techniques and qualitatively compares the DOACROSS and DOALL techniques. Section 3 briefly presents three DOACROSS algorithms that expose different amounts of parallelism. Section 4 presents a DOALL algorithm. Section 5 evaluates the algorithms, focusing on the effects of parallelism degree and granularity. Section 6 concludes the paper with a summary of the evaluation results.

2 Run-time Parallelization Techniques

In the past, many run-time parallelization algorithms have been developed for different types of loops on both shared-memory and distributed-memory machines [6, 9, 14]. Most of the algorithms follow a so-called INSPECTOR/EXECUTOR approach. With this approach, a loop under consideration is transformed at compile-time into an inspector routine and an executor routine.
At run-time, the inspector detects cross-iteration dependences and produces a parallel schedule; the executor performs the actual loop operations in parallel based on the dependence information exploited by the inspector. The key to success with this approach is to shorten the time spent on dependence analysis without losing valuable parallelism, and to reduce the synchronization overhead in the executor.

An alternative to the INSPECTOR/EXECUTOR approach is a speculative execution scheme that was recently proposed by Rauchwerger, Amato, and Padua [13]. In the speculative execution scheme, the target loop is first executed as a doall regardless of its inherent parallelism degree. If a subsequent run-time test finds that the loop was not fully parallel, the whole computation is rolled back and executed sequentially. Although speculative execution yields good results when the loop is in fact executable as a doall, it fails in most applications that have partially parallel loops.

The INSPECTOR/EXECUTOR scheme provides a run-time parallelization framework and leaves the strategies for dependence analysis and scheduling unspecified. The scheme can also be restructured to decouple the scheduling function from the inspector and to merge it with the executor. The scheduling function can even be extracted to serve as a stand-alone routine between the inspector and the executor. There are many run-time parallelization algorithms belonging to the INSPECTOR/EXECUTOR scheme. They differ from each other mainly in their structures and in the strategies used in each routine, in addition to the type of target loops considered.

Pioneering work on using the INSPECTOR/EXECUTOR scheme for run-time parallelization is due to Saltz and his colleagues [15]. They considered loops without output dependences (i.e., the indexing function used in the assignments of the loop body is an identity function), and proposed an effective DOALL INSPECTOR/EXECUTOR scheme. Its inspector partitions the set of iterations into a number of wavefronts, which maintain cross-iteration flow dependences. Iterations within the same wavefront can be executed concurrently, but those in different wavefronts must be processed in order. The executor of the DOALL scheme enforces anti-flow dependences during the execution of iterations in the same wavefront. The DOALL INSPECTOR/EXECUTOR scheme has been shown to be effective in many real applications. It is applicable, however, only to loops without output dependences. The basic scheme was recently generalized by Leung and Zahorjan to allow general cross-iteration dependences [10]. In their algorithm, the inspector generates a wavefront-based schedule and maintains output and anti-flow dependences as well as flow dependences; the executor simply performs the loop operations according to the wavefronts of iterations.

Note that the inspector in the above scheme is sequential. It requires time commensurate with that of a serial loop execution. Parallelization of the inspector loop was also investigated by Saltz et al. [15] and Leung and Zahorjan [9]. Their techniques respect flow dependences, but ignore anti-flow and output dependences. Most recently, Rauchwerger, Amato, and Padua presented a parallel inspector algorithm for a general form of loops [13]. They extracted the scheduling function and explicitly presented an inspector/scheduler/executor scheme.
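To make the division of labor concrete, the sketch below outlines one possible shape of the transformed code for the loop of Figure 1, in the wavefront style of the DOALL schemes discussed in this section. It is an illustration only; the helper names (build_wavefronts, barrier_wait, my_id, nprocs) are hypothetical and do not come from the cited work.

    /* Minimal INSPECTOR/EXECUTOR sketch for the loop of Figure 1.
     * Hypothetical helper names; not the code of any cited algorithm. */

    #define N 1024                  /* number of loop iterations (assumed)  */

    extern int  my_id, nprocs;      /* SPMD thread identity (assumed)       */
    extern int  build_wavefronts(const int *u, const int *v, int n, int *wf);
    extern void barrier_wait(void);

    static int wf[N];               /* wavefront level of each iteration    */
    static int num_levels;

    /* Inspector: only the index arrays are examined; no work on X is done. */
    void inspector(const int *u, const int *v)
    {
        num_levels = build_wavefronts(u, v, N, wf);
    }

    /* Executor: run the iterations wavefront by wavefront; iterations in
     * the same wavefront are independent and may run on different threads. */
    void executor(double *X, const int *u, const int *v)
    {
        for (int level = 0; level < num_levels; level++) {
            for (int i = my_id; i < N; i += nprocs) {   /* cyclic distribution */
                if (wf[i] == level) {
                    double tmp = X[v[i]];               /* ... = X[v[i]] + ... */
                    X[u[i]] = tmp;                      /* X[u[i]] = ...       */
                }
            }
            barrier_wait();   /* preserve dependences between wavefronts */
        }
    }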
DOALL INSPECTOR/EXECUTOR schemes take a loop iteration as the basic scheduling unit in the inspector and the basic synchronization object in the executor. An alternative is the class of DOACROSS INSPECTOR/EXECUTOR parallelization techniques, which take a memory reference in the loop body as the basic unit of scheduling and synchronization. Processors running the executor are assigned iterations in a wrapped manner, and each spin-waits as needed for the operations that are necessary for its execution. An early study of DOACROSS run-time parallelization techniques was conducted by Zhu and Yew [20]. They proposed a scheme that integrates the functions of dependence analysis and scheduling into a single executor. Later, the scheme was improved by Midkiff and Padua to allow concurrent reads of the same array element by several iterations [12]. Even though the integrated scheme allows concurrent analysis of cross-iteration dependences, the tight coupling of the dependence analysis and the executor causes high synchronization overhead in the executor. Most recently, Chen et al. developed the DOACROSS technique further by decoupling the dependence analysis from the executor [6]. We refer to their technique as the CTY (Chen-Torrellas-Yew) algorithm. Separation of the inspector and executor not only reduces synchronization overhead in the executor, but also provides the possibility of reusing the dependence information developed in the inspector across multiple invocations of the same loop. Their inspector is parallel at the sacrifice of concurrent reads of the same array element. Their algorithm was recently further improved by Xu and Chaudhary by allowing concurrent reads of the same array element in different iterations and by increasing the overlap of dependent iterations [18].

DOALL and DOACROSS are two competing techniques for run-time loop parallelization. DOALL parallelizes loops at the iteration level, while DOACROSS supports parallelism at a fine-grained memory access level. Consider the loop and index arrays shown in Figure 2.

    for i=1 to N do
        if ( exp(i) )
            X[u1[i]] = F(X[v1[i]], ...)
        else
            X[u2[i]] = F(X[v2[i]], ...)
    endfor

    u1 = [5, 7, ...]    u2 = [3, 3, ...]
    v1 = [7, 2, ...]    v2 = [4, 5, ...]

Figure 2: An example of loops with conditional cross-iteration dependences, where F is an arbitrary operator.

The first two iterations can be either independent (when exp(1) is false and exp(2) is true), flow dependent (when exp(1) is true and exp(2) is false), anti-flow dependent (when both exp(1) and exp(2) are true), or output dependent (when both exp(1) and exp(2) are false). The nondeterministic cross-iteration dependences are due to control dependences between statements in the loop body. We call such dependences conditional cross-iteration dependences. Control dependences can be converted into data dependences by an if-conversion technique at compile-time [1]. The compile-time technique, however, may not be helpful for loop-carried dependence analysis at run-time. With the DOALL technique, loops with conditional cross-iteration dependences must be handled sequentially. The DOACROSS technique, however, can handle this class of loops easily. At run-time, the executor, upon testing a branch condition, may mark all operands in the non-taken branch as available so as to release processors waiting for those operands.
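As an illustration of this point, the following sketch shows one way a DOACROSS executor could release the operands of the non-taken branch for the loop of Figure 2. It is a hypothetical fragment: the synchronization helpers (wait_for_operand, wait_for_write_slot, mark_done) and exp_cond are assumed names, not the paper's code.

    /* Sketch: releasing the operands of the non-taken branch in a
     * DOACROSS executor for the loop of Figure 2.  Illustrative only. */

    extern double X[];
    extern int    u1[], u2[], v1[], v2[];
    extern int    exp_cond(int i);                 /* evaluates exp(i)      */
    extern double F(double x);                     /* F(X[v[i]], ...)       */
    extern void   wait_for_operand(int elem, int iter);
    extern void   wait_for_write_slot(int elem, int iter);
    extern void   mark_done(int elem, int iter);

    void execute_iteration(int i)
    {
        if (exp_cond(i)) {
            /* The else-branch will never execute: mark its references as
             * done so processors waiting on X[u2[i]] or X[v2[i]] proceed. */
            mark_done(v2[i], i);
            mark_done(u2[i], i);

            wait_for_operand(v1[i], i);        /* flow dependence          */
            double tmp = X[v1[i]];
            mark_done(v1[i], i);               /* release anti-dep waiters */

            wait_for_write_slot(u1[i], i);     /* anti/output dependences  */
            X[u1[i]] = F(tmp);
            mark_done(u1[i], i);
        } else {
            /* symmetric: release u1[i] and v1[i], then run the else branch */
        }
    }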
Furthermore, the DOACROSS technique overlaps dependent iterations. The first two iterations in Figure 2 have an anti-flow dependence when both exp(1) and exp(2) are true. The memory access to X[4] in the second iteration, however, can be overlapped with the execution of iteration 1 without violating the anti-flow dependence. The DOACROSS INSPECTOR/EXECUTOR parallelization technique thus provides the potential to exploit fine-grained parallelism across loop iterations. Fine-grained parallelism, however, does not necessarily lead to overall performance gains without an efficient implementation of the executor. One main contribution of this paper is to show that multi-threaded implementations favor the DOACROSS technique.

3 The Time-Stamp DOACROSS Algorithms

This section briefly presents three DOACROSS algorithms that feature parallel dependence analysis and scheduling. They expose different amounts of parallelism among consecutive reads to the same memory location. For more details, please see [18].

Consider the general form of loops in Figure 1. It defines a two-dimensional iteration-reference space. The inspector of a time-stamp algorithm examines the memory references in a loop and constructs a dependence chain for each data array element of the loop. In addition to the precedence order, the inspector also assigns a stamp to each reference in a dependence chain, which indicates its earliest access time relative to the other references in the chain. A reference can be activated if and only if the preceding references are finished. The executor schedules references in a chain through a logical clock. At a given time, only those references whose stamps do not exceed the current time are allowed to proceed. Dependence chains are associated with clocks ticking at different speeds.

Assume the stamps of references are discrete integers. The stamps are stored in a two-dimensional array stamp. Let (i, j) denote the j-th access of the i-th iteration; stamp[i][j] represents the stamp of reference (i, j). By scanning through the iteration space sequentially, processors can easily construct a time-stamp array that has the following features. The stamp difference between any two directly connected references is one, except for pairs of write-after-read and read-after-read references; both reads in a read-after-read pair have the same stamp, and in a write-after-read pair the difference is the size of the preceding read group. Figure 3 shows an example derived from the target loop, assuming

    u = [15, 5, 5, 14, 10, 14, 12, 11, 3, 12, 4, 8, 3, 10, 10, 3]
    v = [3, 13, 10, 15, 0, 8, 10, 10, 1, 10, 10, 15, 3, 15, 11, 0].

Figure 3: Sequentially constructed dependence chains labeled by array element. The numbers in parentheses are the stamps of references.

Based on such time-stamp arrays, a simple clock rule in the executor can preserve all dependences and allow consecutive reads to be run in parallel.
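For concreteness, the following sequential sketch constructs such a stamp array for the loop of Figure 1 according to the rules just described; for the index arrays u and v above it reproduces the stamps of Figure 3. It is our illustration, not the inspector code of [18], and the helper arrays are hypothetical.

    /* Sequential construction of the time-stamp array for the loop of
     * Figure 1.  Each iteration makes two references: (i,0) is the read
     * X[v[i]] and (i,1) is the write X[u[i]].  Illustrative sketch.     */

    #define NITER 16    /* iterations in the example of Figure 3 */
    #define NELEM 16    /* distinct array elements               */

    int stamp[NITER][2];            /* stamp[i][j]: stamp of reference (i,j) */

    void sequential_inspector(const int u[NITER], const int v[NITER])
    {
        int last_stamp[NELEM];      /* stamp of the last reference per element */
        int open_reads[NELEM];      /* size of the currently open read group   */
        for (int e = 0; e < NELEM; e++) { last_stamp[e] = 0; open_reads[e] = 0; }

        for (int i = 0; i < NITER; i++) {
            int r = v[i];                          /* read reference (i,0)     */
            if (open_reads[r] > 0) {
                stamp[i][0] = last_stamp[r];       /* read-after-read: same    */
                open_reads[r]++;
            } else {
                stamp[i][0] = last_stamp[r] + 1;   /* read-after-write: plus 1 */
                last_stamp[r] = stamp[i][0];
                open_reads[r] = 1;
            }

            int w = u[i];                          /* write reference (i,1)    */
            stamp[i][1] = last_stamp[w] + (open_reads[w] > 0 ? open_reads[w] : 1);
            last_stamp[w] = stamp[i][1];           /* write after a read group
                                                      jumps by the group size  */
            open_reads[w] = 0;                     /* a write closes the group */
        }
    }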
3.1 A Parallel Algorithm Allowing Partially Concurrent Reads (PCR)

Building dependence chains requires an examination of all references in the loop at least once. To construct the time-stamp array in parallel, one key issue is how to stamp the references of a dependence chain across iteration regions on different processors. Since no processor (except the first) has knowledge about the references in previous regions, it cannot stamp its local references in a local chain without the assignment of the chain's head. To allow processors to continue examining other references in their local regions in parallel, the time-stamp inspector uses a conservative approach: it assigns a conservative number to the second reference of a local chain and leaves the first one to be decided in a subsequent global analysis.

Using this conservative approach, most of the stamp table can be constructed in parallel. Upon completion of the local analysis, processors communicate with each other to determine the stamps of the undecided references in the stamp table. Figure 4 shows the complete dependence chains associated with array elements 3, 10, and 15.

Figure 4: A fully stamped dependence chain labeled by array element. Numbers in parentheses are stamps of references.

Processor 3 temporarily assigns 26 to the reference (12, 1), assuming all 24 accesses in regions 0 to 2 are in the same dependence chain. In the subsequent cross-processor analysis, processor 2 sets stamp[8][1] to 2 after communicating with processor 0 (processor 1 makes no reference to the same location). At the same time, processor 3 communicates with processor 2, but gets an undecided stamp on the reference (8, 1), and hence assigns another conservative number, 16 plus 1, to reference (12, 0), assuming all accesses in regions 0 and 1 are in the same dependence chain. The extra one accounts for the number of preceding dependent references in region 2. Note that the communications from processor 3 to processor 2 and from processor 2 to processor 1 proceed in parallel. Processor 2 can provide processor 3 with only the number of references in its local region until its communication with processor 0 has completed.

Accordingly, the time-stamp algorithm uses a special clocking rule that sets the clock of a dependence chain to n + 2 if an activated reference in region r is a local head, where n is the total number of accesses from region 0 to region r - 1. For example, the reference (2, 0) in Figure 4 first triggers the reference (4, 1). Activation of reference (4, 1) will set the clock to 10 because there are 8 accesses in the first region, which consequently triggers the following two reads.

Note that this parallel inspector algorithm only allows consecutive reads in the same region to be performed in parallel. Read operations in different regions must be performed sequentially even though they are totally independent of each other. In the dependence chain on element 10 in Figure 4, for example, the reads (9, 0) and (10, 0) are activated after reads (6, 0) and (7, 0). We could assign reads (9, 0) and (10, 0) the same stamp as reads (6, 0) and (7, 0), and assign the reference (13, 1) a stamp of 14. Such a dependence chain, however, would destroy the anti-flow dependences from (6, 0) and (7, 0) to (14, 0) in the executor if reference (9, 0) or (10, 0) started earlier than one of the reads in region 1.
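A minimal sketch of how an executor could apply this special clocking rule is shown below. The helper functions (is_local_head, region_of, accesses_before) are assumed names, and the spin-waiting of processors on the clock is omitted.

    /* Sketch of the PCR executor's clock advance for one dependence chain,
     * applying the special clocking rule described above.  Illustrative
     * only; synchronization is not shown.                                 */

    #define NELEM 16                          /* number of array elements   */

    extern volatile int clock_of[NELEM];      /* logical clock per chain    */
    extern int is_local_head(int i, int j);   /* head of its region's chain? */
    extern int region_of(int i);              /* region of iteration i       */
    extern int accesses_before(int r);        /* accesses in regions 0..r-1  */

    /* Called when reference (i, j) on element e completes. */
    void advance_clock_pcr(int e, int i, int j)
    {
        if (is_local_head(i, j))
            clock_of[e] = accesses_before(region_of(i)) + 2;  /* jump to n + 2 */
        else
            clock_of[e] += 1;                                 /* ordinary tick */
    }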
3.2 A Parallel Algorithm Allowing Fully Concurrent Reads (FCR)

The basic idea of this algorithm is to treat write operations and groups of consecutive reads as macro-references. For a write reference or the first read operation of a read group in a dependence chain, the inspector stamps the reference with the total number of macro-references ahead. Other accesses in a read group are assigned the same stamp as the first read. Correspondingly, in the executor, the clock of a dependence chain is incremented by one time unit on a write reference and by a fraction of a time unit on a read operation. The magnitude of an increment on a read operation is the reciprocal of its read group size.

Figure 5 presents sequentially stamped dependence chains. In addition to the stamp, each read reference is also associated with an extra data item recording its read group size. In an implementation, the variable for the read group size can be combined with the variable for the stamp. For simplicity of presentation, however, they are declared as two separate integers. Look at the dependence chain on element 10. The reference (4, 1) triggers four subsequent reads, (6, 0), (7, 0), (9, 0), and (10, 0), simultaneously. Activation of each of these reads increments the clock time by 1/4. After all of them are finished, the clock time reaches 4, which in turn activates the reference (13, 1). The following are the details of the algorithm.

Figure 5: Sequentially constructed dependence chains labeled by array element. Numbers in parentheses are stamps of references.

As in the PCR algorithm, the inspector first partitions the iteration space of a target loop into a number of regions. Each region is assigned to a different processor. Each processor first stamps its local references, except the head macro-references at the beginning of the local dependence chains. References next to head macro-references are stamped with conservative numbers using the conservative approach of the PCR algorithm. Processors then communicate with each other to merge consecutive read groups and stamp the undecided macro-references in the stamp table. The base of the algorithm is a three-dimensional stamp table: stamp[i][j][0] records the stamp of reference (i, j) in the iteration-reference space, and stamp[i][j][1] stores the size of the read group to which reference (i, j) belongs. If reference (i, j) is a write, stamp[i][j][1] is not used.

The local inspector at each processor passes once through its iteration-reference region, forms read groups, and assigns appropriate stamps to its inner references. Inner references are those whose stamps can be determined locally using conservative approaches. All write operations except the heads of local dependence chains are inner references. All reads of a read group are also inner references if the group is neither the head nor the tail of any local dependence chain. Figure 6 presents a partially stamped dependence chain on element 10. From this figure, it can be seen that region 1 establishes a right-open single read group; region 2 establishes a right-open group with two members; region 3 builds a group open at both ends. The local inspector stamps all inner references of each region.

Figure 6: A partially stamped dependence chain constructed in parallel.
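As noted above, the read-group size can be combined with the stamp in a single variable; one possible packed encoding is sketched below. This is purely illustrative; the algorithm itself keeps stamp[i][j][0] and stamp[i][j][1] as separate integers.

    /* One possible packed encoding of an FCR stamp entry, combining the
     * stamp and the read-group size in a single integer.  Illustrative
     * only; field widths are assumptions.                               */
    #include <stdint.h>

    #define GROUP_BITS 8                       /* assume read groups < 256 */

    static inline uint32_t pack_entry(uint32_t stamp, uint32_t group_size)
    {
        return (stamp << GROUP_BITS) | group_size;
    }

    static inline uint32_t entry_stamp(uint32_t e)
    {
        return e >> GROUP_BITS;
    }

    static inline uint32_t entry_group_size(uint32_t e)
    {
        return e & ((1u << GROUP_BITS) - 1);
    }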
A subsequent global inspector merges consecutive read groups that are open to each other and assigns appropriate stamps to the newly formed closed read groups and the undecided writes. Figure 7 is a fully stamped chain evolved from Figure 6.

Figure 7: A fully stamped dependence chain constructed in parallel.

The executor uses an extra clocking rule to allow concurrent reads while preserving anti-flow dependences. That is, if the reference is a read in region r, the clock of its dependence chain is incremented by 1/b if the read is not a head of region r; otherwise, it is set to n + 1/b + frac, where b is the read group size and frac is the fraction part of the current clock. For example, look at the dependence chain associated with memory location 10 in Figure 7. Activation of reference (2, 0) increments time[10] by one because it is a single read group. Reference (4, 1) triggers all four subsequent reads simultaneously by setting time[10] to 10. Suppose the four references are performed in the following order: (6, 0), (9, 0), (7, 0), and (10, 0). Reference (6, 0) increments time[10] by 1/4. Reference (9, 0), however, sets time[10] to 16.5. The subsequent two reads add 1/4 to time[10] each. Upon completion of all reads, their subsequent write (13, 1) is activated. The purpose of the fractional part of the clock is to record the number of activated reads in a group.

4 A DOALL Algorithm

In this section, we present a DOALL INSPECTOR/EXECUTOR algorithm for run-time parallelization. The algorithm breaks down the parallelization functions into three routines: inspector, scheduler, and executor. The inspector examines the memory references in a loop and constructs a reference-wise dependence chain for each data element accessed in the loop. The scheduler then derives more restrictive iteration-wise dependence relations from the reference-wise dependence chains. Finally, the executor restructures the loop into a sequence of wavefronts and executes iterations accordingly. Iteration-wise dependence information is usually represented by a vector of integers, denoted by wf.
Each element of the vector, wf[i], indicates the earliest invocation time of iteration i. The wavefront vector bridges the scheduler and executor, being their output and input, respectively.

The reference-wise dependence chains can be represented in different ways. Different data structures will lead to different inspector and scheduler algorithms. Desirable features of the structure are low memory overhead, simple parallel construction, and ease of generating wavefront vectors in the scheduler. In [14], Rauchwerger et al. presented an algorithm (RAP, for short) that uses a reference array R to collect all the references to an element in iteration order and a hierarchy vector H to record the index of the first reference of each wavefront level in the array. For example, for the loop in Figure 1 and the memory access pattern defined by u and v in Section 3, the reference array of element 10 and its hierarchy vector are shown in Figure 8. H[2] = 2 and H[3] = 6 indicate that iteration 6 (R[2]) and iteration 13 (R[6]) are the first references of the third and fourth wavefront levels, respectively. The RAP scheduler uses these two data structures as look-up tables for determining the predecessors and successors of all the references.

    (a) Reference array R:
        Iteration:  2  4  6  7  9 10 13 14
        Type:       R  W  R  R  R  R  W  W
        Level:      1  2  3  3  3  3  4  5

    (b) Hierarchy vector H:  0  1  2  6  7

Figure 8: The reference array (a) and hierarchy vector (b) of element 10 for the loop in Figure 1 with the memory access pattern defined by u and v.

Since the reference array in the RAP algorithm stores the reference levels of a dependence chain, it is hard to construct in parallel using the ideas from the DOACROSS algorithms. We cannot assign conservative levels to references because their levels will be used to derive consecutive iteration-wise wavefront levels. In the following, we present a new algorithm that simply uses a dependence table, as shown in Figure 9, to record the memory-reference dependences. The table is different from the stamp array of the time-stamp algorithms. First, each dependence chain of the table reflects only the order of precedence of its memory references; there are no levels associated with the references. Second, not all memory references have a role in the dependence chains. Since the DOALL approach concerns only iteration-wise dependences, a read operation can be ignored if there is another write in the same iteration that accesses the same memory location. For example, the read in iteration 12 is overruled by the write in the same iteration.

Figure 9: Dependence table for the loop in Figure 1 with the memory access pattern defined by u and v.

Each cell of the table is defined as

    struct Cell {
        int  iter;      /* Current iteration index   */
        int  elm;       /* Item to be referenced     */
        char type;      /* Reference type: RD/WR     */
        int  *xleft;    /* (xleft, yleft) points to  */
        int  *yleft;    /*   its predecessor         */
    };

For parallel construction of the dependence table in the inspector phase, each processor maintains a pair of pointers (head[i], tail[i]) for each memory location i, which point to the head and tail of its associated dependence chain, respectively. As in the DOACROSS algorithms, the inspector needs cross-processor communication to connect adjacent local dependence chains associated with the same memory location. Processor k can find the predecessor (successor) of its local dependence chain i by scanning tail[i] (head[i]) of processors from k - 1 down to 0 (from k + 1 up to N - 1).

Based on the dependence table, processors then construct a wavefront vector in parallel by synchronizing accesses to each element of the wavefront vector. Specifically, for iteration i, a processor determines the dependence levels of all its references and sets wf[i] to the highest level. The dependence level of a reference is calculated according to the following rules. For a reference r of iteration i:

(S1) if the reference r is a write and its immediate predecessor is a read, the processor examines all consecutive reads in the preceding read group and sets the reference level to max{wf[k]} + 1, where k is an iteration that contains one of those reads;
(S2) if the reference r is a write and its immediate predecessor is a write (say, in iteration k), the reference level is set to wf[k] + 1;

(S3) if the reference r is a read, the processor backtracks along the reference's dependence chain until it meets a write (say, in iteration k). The reference level is then set to wf[k] + 1.

Notice that synchronization of accesses to wavefront vector elements requires a blocking read if its target has an invalid value. The waiting reads associated with an element should be woken up by each write to the element. Applying the rules to the dependence table in Figure 9, we obtain a wavefront vector of

    1 1 2 2 3 3 4 4 2 5 4 4 3 6 7 4.

Taking the wavefront vector as input, the executor can be described as follows:

    for k=0 to d-1 do
        forall i such that wf[i]=k
            perform iteration i
        endfor
        barrier
    endfor

where d is the number of wavefronts in a schedule.
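As a concrete illustration of rules (S1)-(S3), the following sequential sketch computes the wavefront vector directly from the index arrays u and v of Section 3 and reproduces the vector shown above. It is a simplification under assumptions: the paper constructs wf in parallel from the dependence table with blocking reads, whereas this version processes iterations in order so that every predecessor's level is already known; the helper arrays and names are ours.

    /* Sequential illustration of rules (S1)-(S3) for the loop of Figure 1. */

    #define NITER 16                   /* iterations     */
    #define NELEM 16                   /* array elements */

    static const int u[NITER] = {15,5,5,14,10,14,12,11,3,12,4,8,3,10,10,3};
    static const int v[NITER] = {3,13,10,15,0,8,10,10,1,10,10,15,3,15,11,0};

    int wf[NITER];

    void doall_scheduler(void)
    {
        int last_write[NELEM];   /* iteration of the last write, -1 if none      */
        int read_level[NELEM];   /* highest wf[] in the currently open read group */
        for (int e = 0; e < NELEM; e++) { last_write[e] = -1; read_level[e] = 0; }

        for (int i = 0; i < NITER; i++) {
            int level = 1;

            /* (S3): the read X[v[i]] is one level above the last write to
             * v[i]; it is ignored if this iteration also writes v[i].      */
            if (v[i] != u[i] && last_write[v[i]] >= 0)
                level = wf[last_write[v[i]]] + 1;

            /* (S1)/(S2): the write X[u[i]] is one level above the preceding
             * read group, or above the preceding write if no reads follow it. */
            int wl = (read_level[u[i]] > 0) ? read_level[u[i]] + 1
                   : (last_write[u[i]] >= 0) ? wf[last_write[u[i]]] + 1 : 1;
            if (wl > level) level = wl;

            wf[i] = level;

            /* Update the per-element chains: the read joins (or opens) a
             * read group; the write closes the read group on its element.  */
            if (v[i] != u[i] && wf[i] > read_level[v[i]])
                read_level[v[i]] = wf[i];
            last_write[u[i]] = i;
            read_level[u[i]] = 0;
        }
        /* For the index arrays above this yields
         * wf = 1 1 2 2 3 3 4 4 2 5 4 4 3 6 7 4, matching the text. */
    }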
5 Performance Evaluations

We implemented the DOACROSS and DOALL run-time parallelization algorithms on a SUN Enterprise E3001. It is a symmetric multiprocessor configured with four 167MHz UltraSPARC processors and 512 MB of memory. Each processor module has an external cache of 1 MB. Each DOACROSS algorithm was implemented as a sequence of three routines: a local inspector, a global inspector, and an executor. The implementation of the DOALL algorithm has an extra scheduler between its global inspector and executor. All routines are separated by barrier synchronization operations. They were programmed in the Single-Program-Multiple-Data (SPMD) paradigm as multi-threaded codes. At the beginning, threads were created in bound mode so that threads are bound to different CPUs and run to completion.

Performance of a run-time parallelization algorithm depends on a number of factors. One is the structure of the target loops. This experiment targets the same synthetic loop structure used in [6, 18]. It comprises a sequence of interleaved reads and writes in the loop body. Each iteration executes a delay() function, reflecting the workload caused by its memory references. In [18], another so-called multiple-reads-single-write loop structure was considered in a preliminary evaluation of the DOACROSS algorithms. It was found that the FCR and PCR algorithms gain few benefits from the extra reads in that loop structure.

Another major factor affecting the overall performance is the memory access pattern defined by the index arrays. Uniform and non-uniform memory access patterns were considered. A uniform access pattern (UNIFORM, for short) assumes all array elements have the same probability of being accessed by a memory reference. A non-uniform access pattern (NUNIFORM, for short) refers to a pattern where 90% of the references are to 10% of the array elements. Non-uniform access patterns reflect hot reference spots and result in long dependence chains.

In addition to the loop structure and memory access pattern, the performance of a parallelization algorithm is also critically affected by its implementation strategies. One of the major issues is loop distribution. For the local and global inspectors, an iteration-wise block decomposition is straightforward because it not only ensures a balanced workload among processors but also allows processors to run in parallel. Loops in the scheduler and executor can be distributed in different ways: cyclic, block, or block cyclic. Their effects on the performance will be evaluated, too. The experiments measure the overall run-time execution time for a given loop and memory access pattern. Each data point obtained in the experiments is the average of five runs, using different seeds for the generation of pseudo-random numbers. Since a seed determines a pseudo-random sequence, the algorithms can be evaluated under the same test instances.

5.1 Impact of access patterns and loop sizes

Figure 10 presents speedups of the different parallelization algorithms over serial loops ranging from 16 to 16384 iterations. Each iteration has four memory references and contains 500 microseconds of workload in the delay function. Loops in the scheduler and executor are decomposed in a cyclic way; Section 5.3 will show that the cyclic decomposition yields the best performance for all algorithms.

Figure 10: Speedups of parallelization algorithms over serial loops of different sizes, where load = 500 µs: (a) uniform access patterns; (b) non-uniform access patterns.

Overall, the figure shows that the DOACROSS algorithms significantly outperform the DOALL algorithm, even though both techniques are capable of gaining speedups for loops that exhibit uniform and non-uniform memory access patterns. The gap between their speedup curves indicates the importance of exploiting fine-grained reference-level parallelism at run-time. Reference-level parallelism is especially important to small loops because they normally have very limited degrees of iteration-level parallelism. We refer to the degree of iteration-level parallelism as the number of iterations in a wavefront of the DOALL algorithm. Figure 11 plots average degrees of parallelism in loops of different sizes. As can be observed, the degree of parallelism is proportional to the loop size. A loop that exhibits non-uniform access patterns has at most four degrees of parallelism until its loop size grows beyond 512. This implies that iteration-level parallelization techniques for such small loops won't gain any benefits on a system with four processors. This is in agreement with the speedup curve of the DOALL algorithm in Figure 10(b). Similar observations can be made for loops with uniform access patterns from the plot of parallelism degrees in Figure 11 and the speedup curves in Figure 10.

Figure 11: Average degree of parallelism exploited by the DOALL algorithm (uniform and non-uniform access patterns).

It is worth noting that higher degrees of parallelism do not necessarily lead to better performance. From Figure 10(a), it can be seen that the speedup of the DOALL algorithm starts to drop when the loop size grows beyond 4096. The average degree of parallelism at that point is 128, which is evidently excessive on a system with only four processors. The performance degradation is due to the cost of barrier synchronizations.
The DOACROSS technique delivers high speedups for small loops because it is able to exploit enough parallelism to keep all processors busy all the time. Expectedly, the amount of fine-grained parallelism quickly becomes excessive as the loop size increases. An interesting feature of the DOACROSS algorithms is that their speedups stabilize at the highest level when parallelism is over-exposed, rather than declining as in the DOALL technique. This is because the cost of the point-to-point synchronizations used in the DOACROSS executor is independent of the parallelism degree. Consequently, the execution time of the executor is proportional to the loop size.

Of the DOACROSS variants, the FCR and PCR algorithms are superior to the CTY for small loops, while the CTY algorithm is preferred for large ones. The FCR and PCR algorithms improve on the CTY by allowing consecutive reads to the same location to be run simultaneously. The extra amount of parallelism does benefit small loops. For large loops, however, the benefit can be outweighed by the cost of exploiting and managing the parallelism. Since the FCR algorithm incurs more overhead than the CTY in the global inspector, it obtains lower speedups for large loops. In contrast, the PCR algorithm obtains almost the same speedup as the CTY because the PCR algorithm incurs only slightly more overhead in the local inspector.

To better understand the relative performance of the algorithms, we break down their execution into a number of principal tasks and present their percentages of the overall execution time in Figure 12. The Initialization curve indicates the cost of memory allocation, initialization, and thread creation. The cost spent in the local inspector, global inspector, or scheduler is indicated by the range between the curve of its predecessor and its own curve. The remainders above the global inspector and scheduler curves are due to the executors of the DOACROSS and DOALL algorithms, respectively. Figure 12(d) shows that the DOALL algorithm spends a high percentage of time in the scheduler for generating iteration-wise dependences from reference-wise dependence information. The percentage decreases as the loop size increases because the cost of the barrier synchronization operations in the executor increases.

From the figure, it can also be seen that both the CTY and PCR algorithms spend approximately 5% of their time in initialization, and about 1.0% each in their local and global inspectors for large loops. The FCR algorithm spends about 10% of its time in initialization because it uses a three-dimensional stamp table (instead of the two-dimensional stamp array of the CTY and PCR algorithms) and different auxiliary data structures, head and tail. All these structures need to be allocated at run-time. The FCR algorithm spends more time in the global inspector for exploiting concurrent reads across different regions. Compared with the CTY and PCR algorithms, the FCR reduces the time spent in the executor significantly. In cases where the dependence analysis can be reused across multiple loop invocations, the FCR algorithm is expected to achieve even better performance.
Figure 12: Breakdown of execution time in percentages at different stages: (a) CTY algorithm; (b) PCR algorithm; (c) FCR algorithm; (d) DOALL algorithm.

5.2 Impact of loop workload

It is known that the cost of a run-time parallelization algorithm in the inspector and scheduler is independent of the workload of iterations. The time spent in the executor, however, is proportional to the amount of loop workload. The larger the workload, the smaller the cost percentage in the inspector and scheduler. The experiments above assumed 500 microseconds of workload at each iteration. Figure 13 presents speedups of the algorithms under the assumption of 50 microseconds of iteration workload instead. From the figure, it can be seen that all algorithms lose a certain amount of speedup. However, the relative performance of the algorithms remains the same as revealed by Figure 10. The performance gap between the FCR algorithm and the CTY and PCR algorithms is enlarged because the relative cost of global analysis in the FCR algorithm increases as the workload decreases.

Figure 13: Speedups of algorithms over serial code for various loops, where load = 50 µs: (a) uniform access patterns; (b) non-uniform access patterns.

5.3 Impact of loop distributions

Generally, a loop iteration space can be assigned to threads in either a static or a dynamic approach. Static approaches assign iterations to threads prior to their execution, while dynamic approaches make decisions at run-time. Their advantages and disadvantages were discussed in [19] in the general context of task mapping and load balancing. In this experiment, we tried three simple static assignment strategies, cyclic, block, and block cyclic, because their simplicity lends them to efficient implementation at run-time. Let b denote the block size. A block cyclic distribution assigns iteration i to thread ⌊(i mod bM)/b⌋, where M is the number of threads. It reduces to a cyclic distribution if b = 1 and to a block distribution if b = N/M, where N is the number of iterations.
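For concreteness, the assignment rule can be written as the small helper below (the function name is ours). For example, with M = 4 threads and b = 2, iterations 0..7 map to threads 0, 0, 1, 1, 2, 2, 3, 3, and the pattern then repeats.

    /* Block cyclic assignment of iteration i to one of M threads with
     * block size b: thread = floor((i mod b*M) / b).  With b == 1 this
     * is a cyclic distribution; with b == N/M it becomes a block
     * distribution.                                                    */
    static inline int owner_of(int i, int b, int M)
    {
        return (i % (b * M)) / b;
    }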
Figure 14 shows the effect of block cyclic decompositions on the execution time of a loop with 1024 iterations. The figure shows that the DOACROSS algorithms are very sensitive to the block size. They prefer cyclic or small block cyclic distributions because these lead to good load balance among processors. Each plot of the DOACROSS algorithms has a knee, beyond which its execution time increases sharply. The FCR and PCR algorithms have the largest knees, reflecting the fact that these algorithms exploit the largest degree of parallelism. Given more processors, they are projected to perform better than the CTY. The DOALL algorithm uses a block cyclic decomposition in the scheduler. Its overall execution time is insensitive to the block size. In the case of non-uniform access patterns, a large block cyclic decomposition is slightly superior to the cyclic distribution.

Figure 14: The effect of block cyclic decompositions on total execution time for different block sizes: (a) uniform dependence patterns; (b) non-uniform dependence patterns.

6 Conclusions

In this paper, we considered two run-time loop parallelization techniques, DOALL and DOACROSS, which expose different granularities and degrees of parallelism in a loop. The DOALL technique exploits iteration-level parallelism. It restructures the loop into a sequence of doall loops, separated by barrier operations. The DOACROSS technique supports fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting synchronization operations to preserve dependences. The DOACROSS technique has variants, CTY, PCR, and FCR, which expose different amounts of parallelism among concurrent reads to the same memory location. Both approaches follow a so-called INSPECTOR/EXECUTOR scheme. We evaluated the performance of the algorithms on a symmetric multiprocessor for loops with different structures, memory access patterns, and computational workloads. In the executor, loops are distributed among processors in a block cyclic way. The experimental results showed that:

- The DOACROSS technique outperforms the DOALL, even though the latter plays a dominant role in compile-time loop parallelization. This is because the DOALL algorithm spends a high percentage of time in the scheduler routine for the construction of iteration-wise wavefronts. Hot reference spots have more negative effects on the DOALL algorithm due to limited iteration-level parallelism. The DOACROSS technique identifies fine-grained reference-level parallelism at the cost of frequent synchronization in the executor. Multithreaded implementations reduced the synchronization overhead of the algorithms.

- Of the DOACROSS variants, the PCR algorithm performs best because it incurs only slightly more overhead than the CTY algorithm. The FCR algorithm improves on the CTY algorithm for small loops that do not have enough parallelism. For large loops or loops with light workload, its benefits are likely to be outweighed by its extra run-time cost.

- Loops that are to be executed repeatedly favor the DOALL and FCR algorithms, because these spend a high percentage of their time in dependence analysis and scheduling, and that time can be saved in subsequent loop invocations.

- The DOACROSS algorithms are sensitive to the block size of the loop distribution. Cyclic and small block cyclic distributions yield better performance.

Future work includes examining the issues of load balancing and exploiting locality in parallelizing loops that are to be executed repeatedly, and evaluating the algorithms on loops from real applications.

Acknowledgements

This work was supported in part by a startup grant from Wayne State University. The author would like to thank Roy Sumit for his help with the experiments and Vipin Chaudhary for his insights into the experimental data. Thanks also go to Loren Schwiebert for his advice on the presentation.
References

[1] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, "Conversion of control dependence to data dependence", in Proc. of the 10th ACM Symposium on Principles of Programming Languages, ACM Press, New York, pages 177-189, Jan. 1983.

[2] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing", ACM Computing Surveys, 26(4):345-420, Dec. 1994.

[3] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua, "Automatic program parallelization", Proceedings of the IEEE, 81(2):211-231, Feb. 1993.

[4] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland, "Massively parallel methods for engineering and science problems", Comm. ACM, 37(4):31-41, April 1994.

[5] D. K. Chen and P. C. Yew, "An empirical study on DOACROSS loops", in Proc. of Supercomputing '91, pages 620-632.

[6] D. K. Chen, P. C. Yew, and J. Torrellas, "An efficient algorithm for the run-time parallelization of doacross loops", in Proc. of Supercomputing '94, pages 518-527, Nov. 1994.

[7] R. Cytron, "DOACROSS: beyond vectorization for multiprocessors", in Proc. of the International Conference on Parallel Processing, pages 836-844, 1986.

[8] J. Ju and V. Chaudhary, "Unique sets oriented partitioning of nested loops with non-uniform dependences", in Proc. of the International Conference on Parallel Processing, 1996.

[9] S.-T. Leung and J. Zahorjan, "Improving the performance of runtime parallelization", in Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83-91, May 1993.

[10] S.-T. Leung and J. Zahorjan, "Extending the applicability and improving the performance of runtime parallelization", Technical Report, Department of Computer Science, University of Washington, 1995.

[11] J. T. Lim, A. R. Hurson, K. Kavi, and B. Lee, "A loop allocation policy for DOACROSS loops", in Proc. of the Symposium on Parallel and Distributed Processing, pages 240-249, 1996.

[12] S. Midkiff and D. Padua, "Compiler algorithms for synchronization", IEEE Trans. on Computers, C-36(12), Dec. 1987.

[13] L. Rauchwerger and D. Padua, "The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization", in Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1995.

[14] L. Rauchwerger, N. M. Amato, and D. A. Padua, "Run-time methods for parallelizing partially parallel loops", Technical Report, UIUC, 1995.

[15] J. Saltz, R. Mirchandaney, and K. Crowley, "Run-time parallelization and scheduling of loops", IEEE Trans. Comput., 40(5), May 1991.

[16] Z. Shen, Z. Li, and P. C. Yew, "An empirical study on array subscripts and data dependencies", in Proc. of ICPP, pages II-145 to II-152, 1989.

[17] SunSoft, Multithreaded Programming Guide, 1995.

[18] C. Xu and V. Chaudhary, "Time-stamping algorithms for parallelization of loops at run-time", in Proc. of the International Symposium on Parallel Processing, April 1997. Also available at http://www.pdcl.eng.wayne.edu/czxu.

[19] C. Xu and L. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, ISBN 0-7923-9819-X, Nov. 1996.

[20] C. Zhu and P. C. Yew, "A scheme to enforce data dependence on large multiprocessor systems", IEEE Trans. Softw. Eng., 13(6):726-739, 1987.