Effects of Parallelism Degree on Run-Time Parallelization of Loops
Chengzhong Xu
Department of Electrical and Computer Engineering
Wayne State University, Detroit, MI 48202
http://www.pdcl.eng.wayne.edu/czxu
Abstract
Due to the overhead of exploiting and managing parallelism,
run-time loop parallelization techniques that aim at maximizing parallelism do not necessarily lead to the best performance. In this paper, we present two parallelization techniques
that exploit different degrees of parallelism for loops with dynamic cross-iteration dependences. The DOALL approach exploits iteration-level parallelism. It restructures the loop into a
sequence of do-parallel loops, separated by barrier operations. Iterations of a do-parallel loop are run in parallel. By contrast, the
DOACROSS approach exposes fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by
inserting point-to-point synchronization operations to preserve dependences. The DOACROSS approach has variants that identify different amounts of parallelism among consecutive reads to
the same memory location. We evaluate the algorithms on symmetric multiprocessors for loops
with various structures, memory access patterns, and computational workloads. The loops
are scheduled using block cyclic decomposition strategies. The
experimental results show that the DOACROSS technique outperforms the DOALL, even though the latter is widely used in
compile-time parallelization of loops. Of the DOACROSS variants, the algorithm allowing partially concurrent reads performs
best because it incurs only slightly more overhead than the algorithm disallowing concurrent reads. The benefit from allowing fully
concurrent reads is significant for small loops that do not have
enough parallelism. However, it is likely to be outweighed by its
cost for large loops or loops with light workload.
1 Introduction
Loop parallelization exploits parallelism among instruction sequences or loop iterations. Techniques for exploiting instruction-level parallelism are prevalent in today’s microprocessors. On multiprocessors, loop parallelization techniques focus on loop-level
parallelism. They partition and allocate loop iterations among processors with respect to cross-iteration dependences. Their primary
objective is to expose enough parallelism to keep processors busy
all the time while minimizing synchronization overheads.
On multiprocessors, there are two important parallelization
techniques that exploit different degrees of parallelism. The
DOALL technique assumes loop iterations as the basic scheduling and execution units [2, 3]. It decomposes the iterations into
a sequence of subsets, called wavefronts. Iterations within the
same wavefront are run in parallel. A barrier synchronization operation is used to preserve cross-iteration dependences between
two wavefronts. The DOALL technique reduces the run-time
scheduling overhead at the sacrifice of a certain amount of parallelism. By contrast, the DOACROSS technique exploits fine-grained reference-level parallelism. It allows dependent iterations
to be run concurrently by inserting point-to-point synchronization
operations to preserve dependences among memory references.
The DOACROSS technique maximizes parallelism at the expense
of frequent synchronization.
Note that in the literature the terms DOACROSS and DOALL
often refer to loops with and without cross-iteration dependences, respectively. We borrow the terms for parallelization techniques in this paper because the DOALL technique essentially
restructures a loop into a sequence of DOALL loops. Both
the DOALL and DOACROSS techniques are used to parallelize
DOACROSS loops. Chen and Yew [5] studied programs from the
PERFECT benchmark suite and revealed the significant advantages
of parallelizing DOACROSS loops.
DOACROSS loops can be characterized as static and dynamic
in terms of the time when cross-iteration dependence information is
available (at compile-time or run-time). Figure 1 shows an example
of dynamic loops due to the presence of indirect access patterns on
data array X. Dynamic loops appear frequently in scientific and
engineering applications [16]. Examples include SPICE for circuit
simulation, CHARMM and DISCOVER for molecular dynamics
simulation of organic systems, and FIDAP for modeling complex
fluid flows [4].
for i=1 to N do
    ... = X[v[i]] + ...
    X[u[i]] = ...
endfor
Figure 1: A general form of loops with indirect access patterns,
where u and v are input-dependent functions.
For parallelizing static loops, the DOALL technique plays a
dominant role because it employs a simple execution model after exploiting parallelism at compile-time [8]. For dynamic loops,
however, this may not be the case. Since parallelism in a dynamic
loop has to be identified at run-time, the cost of building wavefronts in the DOALL technique becomes a major source of run-time overhead. The DOACROSS technique incurs run-time overhead for
analyzing reference-wise dependences, but it provides more flexibility to subsequent scheduling. This paper compares the DOALL
and DOACROSS approaches for parallelizing loops at run-time,
focusing on the effect of parallelism degree on the performance of
parallelization.
In [18], we devised two DOACROSS algorithms exploiting
different amounts of parallelism and demonstrated the effectiveness of the DOACROSS algorithms on symmetric multiprocessors. This paper presents a DOALL algorithm that allows parallel
construction of the wavefronts and compares this algorithm with
the DOACROSS algorithms, focusing on the influence of parallelism degree. We show that the DOACROSS algorithms have
advantages over the DOALL, even though the latter is preferred for
compile-time parallelization. Of the DOACROSS variants, the algorithm that allows fully concurrent reads may over-expose parallelism; its benefit is likely to be outweighed by the run-time
cost of exploiting and managing the extra parallelism
for large loops or loops with light workloads.
The rest of the paper is organized as follows. Section 2 reviews
run-time parallelization techniques and qualitatively compares the
DOACROSS and DOALL techniques. Section 3 briefly presents
three DOACROSS algorithms that expose different amounts of
parallelism. Section 4 presents a DOALL algorithm. Section 5
evaluates the algorithms, focusing on the effects of parallelism degree and granularity. Section 6 concludes the paper with a summary
of evaluation results.
2 Run-time Parallelization Techniques
In the past, many run-time parallelization algorithms have been developed for different types of loops on both shared-memory and
distributed-memory machines [6, 9, 14]. Most of the algorithms
follow a so-called Inspector/Executor approach. With this approach, a loop under consideration is transformed at compile-time
into an inspector routine and an executor routine. At run-time, the
inspector detects cross-iteration dependences and produces a parallel schedule; the executor performs the actual loop operations in
parallel based on the dependence information exploited by the inspector. The keys to success with this approach are to shorten the
time spent on dependence analyses without losing valuable parallelism and to reduce the synchronization overhead in the executor.
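To make this division of labor concrete, the following C sketch shows the generic shape of an Inspector/Executor pair for the loop of Figure 1, assuming a wavefront-style schedule; the helpers build_wavefronts, run_iteration, and barrier_wait are hypothetical placeholders rather than routines from any particular algorithm.

    /* Sketch only: the helpers declared extern are hypothetical. */
    extern void build_wavefronts(int n, const int *u, const int *v,
                                 int *wf, int *nwave);
    extern void run_iteration(int i);   /* the original loop body for iteration i */
    extern void barrier_wait(void);     /* global barrier among worker threads    */

    /* Inspector: detect cross-iteration dependences at run time and produce a
     * schedule; wf[i] is the earliest wavefront in which iteration i may run. */
    void inspector(int n, const int *u, const int *v, int *wf, int *nwave)
    {
        build_wavefronts(n, u, v, wf, nwave);
    }

    /* Executor: perform the loop operations wavefront by wavefront. */
    void executor(int n, int nwave, const int *wf)
    {
        for (int k = 0; k < nwave; k++) {
            for (int i = 0; i < n; i++)     /* iterations of one wavefront in parallel */
                if (wf[i] == k)
                    run_iteration(i);
            barrier_wait();                 /* preserve dependences between wavefronts */
        }
    }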
An alternative to the Inspector/Executor approach is a speculative execution scheme that was recently proposed by Rauchwerger, Amato, and Padua [13]. In the speculative execution scheme,
the target loop is first handled as a doall regardless of its inherent
parallelism degree. If a subsequent test at run-time finds that the
loop was not fully parallel, the whole computation is then rolled
back and executed sequentially. Although the speculative execution yields good results when the loop is in fact executable as a
doall, it fails in most applications that have partially parallel loops.
The Inspector/Executor scheme provides a run-time parallelization framework, and leaves strategies for dependence analysis
and scheduling unspecified. The scheme can also be restructured to
decouple the scheduling function from the inspector and to merge
it with the executor. The scheduling function can even be extracted
to serve as a stand-alone routine between the inspector and the ex-
ecutor. There are many run-time parallelization algorithms belonging to the Inspector/Executor scheme. They differ from each
other mainly in their structures and strategies used in each routine,
in addition to the type of target loops considered.
Pioneering work on using the Inspector/Executor scheme
for run-time parallelization is due to Saltz and his colleagues [15].
They considered loops without output dependences (i.e. the indexing function used in the assignments of the loop body is an
identity function), and proposed an effective DOALL Inspector/Executor scheme. Its inspector partitions the set of iterations into a number of wavefronts, which maintain cross-iteration
flow dependences. Iterations within the same wavefront can be executed concurrently, but those in different wavefronts must be processed in order. The executor of the DOALL scheme enforces
anti-flow dependences during the execution of iterations in the
same wavefront. The DOALL Inspector/Executor scheme
has been shown to be effective in many real applications. It is applicable, however, only to those loops without output dependences.
The basic scheme was recently generalized by Leung and Zahorjan to allow general cross-iteration dependences [10]. In their algorithm, the inspector generates a wavefront-based schedule and
maintains output and anti-flow dependences as well as flow dependences; the executor simply performs the loop operations according
to the wavefronts of iterations.
Note that the inspector in the above scheme is sequential. It
requires time commensurate with that of a serial loop execution.
Parallelization of the inspector loop was also investigated by Saltz et al. [15] and by Leung and Zahorjan [9]. Their techniques respect
flow dependences, but ignore anti-flow and output dependences.
Most recently, Rauchwerger, Amato and Padua presented a parallel inspector algorithm for a general form of loops [13]. They
extracted the function of scheduling and explicitly presented an inspector/scheduler/executor scheme.
DOALL Inspector/Executor schemes assume a loop iteration as the basic scheduling unit in the inspector and the basic synchronization object in the executor. An alternative to this scheme is
DOACROSS Inspector/Executor parallelization techniques,
which assume a memory reference in the loop body as the basic
unit of scheduling and synchronization. Processors running the executor are assigned iterations in a wrapped manner, and each spin-waits as needed for the operations that are necessary for its execution.
An early study of DOACROSS run-time parallelization techniques
was conducted by Zhu and Yew [20]. They proposed a scheme that
integrates the functions of dependence analysis and scheduling into
a single executor. Later, the scheme was improved by Midkiff and
Padua to allow concurrent reads to the same array element by several iterations [12]. Even though the integrated scheme allows concurrent analysis of cross-iteration dependences, tight coupling of
the dependence analysis and the executor causes high synchronization overhead in the executor. Most recently, Chen et al. developed the DOACROSS technique by decoupling the function of the
dependence analysis from the executor [6]. We refer to their technique as the CTY algorithm. Separation of the inspector and executor not only reduces synchronization overhead in the executor, but
also provides the possibility of reusing the dependence information
developed in the inspector across multiple invocations of the same
loop. Their inspector is parallel at the sacrifice of concurrent reads
to the same array element. Their algorithm was recently further
improved by Xu and Chaudhary by allowing concurrent reads of
the same array element in different iterations and by increasing the
overlap of dependent iterations [18].
DOALL and DOACROSS are two competing techniques for
run-time loop parallelization. DOALL parallelizes loops at the iteration level, while DOACROSS supports parallelism at a fine-grained memory-access level. Consider the loop and index arrays shown in Figure 2.

    for i=1 to N do
        if ( exp(i) )
            X[u1[i]] = F(X[v1[i]], ...)
        else
            X[u2[i]] = F(X[v2[i]], ...)
    endfor

    u1 = [5, 7, ...]    v1 = [7, 2, ...]
    u2 = [3, 3, ...]    v2 = [4, 5, ...]

Figure 2: An example of loops with conditional cross-iteration dependences, where F is an arbitrary operator.

The first two iterations can be either independent (when exp(1) is false and exp(2) is true), flow dependent (when exp(1) is true and exp(2) is false), anti-flow dependent (when both exp(1) and exp(2) are true), or output dependent (when both exp(1) and exp(2) are false). The nondeterministic cross-iteration dependences are due to control dependences
between statements in the loop body. We call such dependences
conditional cross-iteration dependences. Control dependences can
be converted into data dependences by an if-conversion technique at
compile-time [1]. The compile-time technique, however, may not
be helpful for loop-carried dependence analysis at run-time. With the
DOALL technique, loops with conditional cross-iteration dependences must be handled sequentially. However, the DOACROSS
technique can handle this class of loops easily. At run-time, the
executor, upon testing a branch condition, may set all operands in
the non-taken branch available so as to release processors waiting
for those operands. Furthermore, the DOACROSS technique overlaps dependent iterations. The first two iterations in Figure 2 have
an anti-flow dependence when both exp(1) and exp(2) are true.
The memory access to X[4] in the second iteration, however, can
be overlapped with the execution of iteration 1 without destroying
the anti-flow dependences.
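As a rough illustration (not the paper's implementation) of how an executor can exploit this, the sketch below processes one iteration of the loop in Figure 2; wait_ref and post_ref are hypothetical helpers that stand in for the stamp/clock machinery of Section 3, blocking on and signalling completion of a reference to a given element, and cond(i) plays the role of exp(i).

    extern void   wait_ref(int iter, int elem);  /* wait for the reference's predecessors */
    extern void   post_ref(int iter, int elem);  /* mark the reference as completed       */
    extern int    cond(int i);                   /* the run-time branch condition exp(i)  */
    extern double F(double x);

    void doacross_iteration(int i, double *X,
                            const int *u1, const int *v1,
                            const int *u2, const int *v2)
    {
        int taken  = cond(i);
        int v_take = taken ? v1[i] : v2[i];   /* read operand of the taken branch   */
        int u_take = taken ? u1[i] : u2[i];   /* write target of the taken branch   */
        int v_skip = taken ? v2[i] : v1[i];   /* references of the non-taken branch */
        int u_skip = taken ? u2[i] : u1[i];

        /* Release processors that may be waiting on operands of the non-taken branch. */
        post_ref(i, v_skip);
        post_ref(i, u_skip);

        wait_ref(i, v_take);                  /* respect flow dependences on the read         */
        double tmp = F(X[v_take]);
        post_ref(i, v_take);

        wait_ref(i, u_take);                  /* respect anti/output dependences on the write */
        X[u_take] = tmp;
        post_ref(i, u_take);
    }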
The DOACROSS Inspector/Executor parallelization
technique provides the potential to exploit fine-grained parallelism
across loops. Fine-grained parallelism does not necessarily lead to
overall performance gains without an efficient implementation of
the executor. One main contribution of this paper is to show that
multi-threaded implementations favor the DOACROSS technique.
3 The Time-Stamp DOACROSS Algorithms
This section briefly presents three DOACROSS algorithms that
feature parallel dependence analysis and scheduling. They expose
different amounts of parallelism among consecutive reads to the
same memory location. For more details, please see [18].
Consider the general form of loops in Figure 1. It defines a
two dimensional iteration-reference space. The inspector of a time-
stamp algorithm examines the memory references in a loop and
constructs a dependence chain for each data array element of the
loop. In addition to the precedence order, the inspector also assigns
a stamp to each reference in a dependence chain, which indicates
its earliest access time relative to the other references in the chain.
A reference can be activated if and only if the preceding references
are finished. The executor schedules references to a chain through a
logical clock. At a given time, only those references whose stamps
are equal to or smaller than the current time are allowed to proceed. Dependence chains are associated with clocks ticking at different speeds.
Assume the stamps of references are discrete integers. The stamps
are stored in a two-dimensional array stamp. Let (i, j) denote the
j-th access of the i-th iteration; stamp[i][j] represents the stamp of
reference (i, j). By scanning through the iteration space sequentially, processors can easily construct a time-stamp array with
the following features: the stamp difference between any two directly connected references is one, except for pairs of write-after-read and read-after-read references; both reads in a read-after-read pair have the same stamp; and in a write-after-read pair, their difference is the size of the read group minus one. Figure 3 shows an example derived from the target loop assuming
u = [15, 5, 5, 14, 10, 14, 12, 11, 3, 12, 4, 8, 3, 10, 10, 3] and
v = [3, 13, 10, 15, 0, 8, 10, 10, 1, 10, 10, 15, 3, 15, 11, 0].
Figure 3: Sequentially constructed dependence chains labeled by array elements. The numbers in parentheses are the stamps of references.
Based on such time-stamp arrays, a simple clock rule in the executor can preserve all dependences and allow consecutive reads to
be run in parallel.
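As an illustration of this discipline, the following C sketch shows the executor-side wait and advance steps for the simplest variant, in which the stamps along a chain are consecutive integers and every completed reference advances the chain's clock by one (i.e., concurrent reads are not exploited); the array names, the sizes, and the helper elem_of are our own assumptions, and the PCR and FCR rules of Sections 3.1 and 3.2 change only the advance step.

    #define NITER 16                  /* illustrative sizes                                */
    #define NREF   2
    #define NELEM 16

    volatile int clock_[NELEM];       /* logical clock of each element's dependence chain;
                                         assumed to be initialized to 1 before execution   */
    int stamp[NITER][NREF];           /* stamps produced by the inspector                  */
    extern int elem_of(int i, int j); /* hypothetical: element accessed by reference (i,j) */

    /* Spin until reference (i, j) may proceed, i.e., until its stamp is equal to
     * or smaller than the clock of its dependence chain. */
    void wait_turn(int i, int j)
    {
        int x = elem_of(i, j);
        while (clock_[x] < stamp[i][j])
            ;                         /* busy-wait; threads are bound to CPUs on the SMP */
    }

    /* Signal completion of reference (i, j): advance the chain's clock by one so
     * that the next reference in the chain may start.  In a real executor this
     * update must be atomic or each chain must be advanced by a single owner. */
    void done_turn(int i, int j)
    {
        int x = elem_of(i, j);
        clock_[x] = clock_[x] + 1;
    }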
3.1 A Parallel Algorithm Allowing Partially Concurrent Reads (PCR)
Building dependence chains requires examining every reference in the loop at least once. To construct the time-stamp array in parallel, one key issue is how to stamp the references in a
dependence chain across iteration regions on different processors.
Since no processors (except the first) have knowledge about the
references in previous regions, they are unable to stamp their local references in a local chain without the assignment of its head.
To allow processors to continue with the examination of other references in their local regions in parallel, the time-stamp inspector
uses a conservative approach: it assigns a conservative number to
the second reference of a local chain and leaves the first one to be
decided in a subsequent global analysis.
Figure 4: A fully stamped dependence chain labeled by array element. Numbers in parentheses are stamps of references.
Using this conservative approach, most of the stamp table can
be constructed in parallel. Upon completion of the local analysis,
processors communicate with each other to determine the stamps of
undecided references in the stamp table. Figure 4 shows the complete dependence chains associated with array elements 3, 10 and
15. Processor 3 temporarily assigns 26 to the reference (12, 1),
assuming all 24 accesses in regions from 0 to 2 are in the same
dependence chain. In the subsequent cross-processor analysis, processor 2 sets stamp[8][1] to 2 after communicating with processor
0 (processor 1 marks no reference to the same location). At the
same time, processor 3 communicates with processor 2, but gets an
undecided stamp on the reference (8, 1), and hence assigns another
conservative number, 16 plus 1, to reference (12, 0), assuming all
accesses in regions 0 and 1 are in the same dependence chain. The
extra one is due to the total number of dependent references in region 2. Note that the communications from processor 3 to processor 2 and from processor 2 to processor 1 proceed in parallel. Until its own communication with processor 0 completes, processor 2 can provide processor 3 with only the number of references in its local region.
Accordingly, the time-stamp algorithm presents a special clocking rule that sets the clock of a dependence chain to n + 2 if an
activated reference in region r is a local head, where n is the total
number of accesses from region 0 to region r-1. For example,
the reference (2, 0) in Figure 4 first triggers the reference (4, 1).
Activation of reference (4, 1) will set the clock to 10 because there
are 8 accesses in the first region, which consequently triggers the
following two reads.
Note that this parallel inspector algorithm only allows consecutive reads in the same region to be performed in parallel. Read operations in different regions must be performed sequentially even
though they are totally independent of each other. In the dependence chain on element 10 in Figure 4, for example, the reads (9, 0)
and (10, 0) are activated after reads (6, 0) and (7, 0). We are able
to assign reads (9, 0) and (10, 0) the same stamp as reads (6, 0)
and (7, 0), and assign the reference (13, 1) a stamp of 14. This dependence chain, however, will destroy the anti-flow dependences
from (6, 0) and (7, 0) to (14, 0) in the executor if reference (9, 0)
or (10, 0) starts earlier than one of the reads in region 1.
3.2 A Parallel Algorithm Allowing Fully Concurrent Reads (FCR)
The basic idea of the algorithm is to treat write operations and
groups of consecutive reads as a macro-reference. For a write reference or the first read operation in a read group in a dependence
chain, the inspector stamps the reference with the total number of
macro-references ahead. Other accesses in a read group are assigned the same stamp as the first read. Correspondingly, in the
executor, the clock of a dependence chain is incremented by one
time unit on a write reference and by a fraction of a time unit on a
read operation. The magnitude of an increment on a read operation
is the reciprocal of its read group size.
Figure 5 presents sequentially stamped dependence chains. In
addition to the stamp, each read reference is also associated with
an extra data item recording its read group size. In an implementation, the variable for read group size can be combined with the
variable for stamp. For simplicity of presentation, however, they
are declared as two separate integers. Look at the dependence chain
on element 10. The reference (4, 1) triggers four subsequent reads,
(6, 0), (7, 0), (9, 0) and (10, 0), simultaneously. Activation of each
of these reads increments the clock time by 1/4. After all of them
are finished, the clock time reaches 4, which in turn activates the
reference (13, 1). Following are the details of the algorithm.
Figure 5: Sequentially constructed dependence chains labeled by array element. Numbers in parentheses are stamps of references (and, for reads, read-group sizes).
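Before turning to the parallel construction, the following C sketch shows one way the sequential stamping pass could be written; for the index arrays u and v of this section it reproduces the stamps and read-group sizes shown in Figure 5. The use of three flat arrays instead of the three-dimensional stamp table, and the assumption of one read followed by one write per iteration (the loop of Figure 1), are our own simplifications.

    #include <string.h>

    #define N     16      /* iterations, as in the example of this section */
    #define NELEM 16      /* array elements                                */

    /* rd_stamp[i] / rd_gsize[i] correspond to stamp[i][0][0] / stamp[i][0][1],
     * and wr_stamp[i] to stamp[i][1][0], of the stamp table described below. */
    int rd_stamp[N], rd_gsize[N], wr_stamp[N];

    void stamp_sequential(const int v[N], const int u[N])
    {
        int macro[NELEM];    /* macro-references assigned so far in each chain   */
        int first[NELEM];    /* first iteration of the currently open read group */
        int size[NELEM];     /* size of the currently open read group            */

        memset(macro, 0, sizeof macro);
        for (int x = 0; x < NELEM; x++) { first[x] = -1; size[x] = 0; }

        for (int i = 0; i < N; i++) {
            int x = v[i];                    /* the read X[v[i]]                    */
            if (first[x] < 0) {              /* the read opens a new read group     */
                first[x] = i;
                macro[x]++;                  /* a read group is one macro-reference */
            }
            rd_stamp[i] = macro[x];
            size[x]++;

            x = u[i];                        /* the write X[u[i]]                   */
            if (first[x] >= 0) {             /* the write closes a pending group    */
                for (int k = first[x]; k <= i; k++)
                    if (v[k] == x)
                        rd_gsize[k] = size[x];
                first[x] = -1;
                size[x]  = 0;
            }
            macro[x]++;                      /* a write is one macro-reference      */
            wr_stamp[i] = macro[x];
        }

        for (int x = 0; x < NELEM; x++)      /* close groups that reach loop's end  */
            if (first[x] >= 0)
                for (int k = first[x]; k < N; k++)
                    if (v[k] == x)
                        rd_gsize[k] = size[x];
    }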
As in the PCR algorithm, the inspector first partitions the iteration space of a target loop into a number of regions. Each region
is assigned to a different processor. Each processor first stamps local references, except the head macro-references at the beginning
of the local dependence chains. References next to head macro-references are stamped with conservative numbers using the conservative approach as in the PCR algorithm. Processors then communicate with each other to merge consecutive read groups and
stamp undecided macro-references in the time table.
The base of the algorithm is a three-dimensional stamp table:
stamp. stamp[i][j][0] records the stamp of reference (i, j) in
the iteration-reference space, and stamp[i][j][1] stores the size of the
read group to which reference (i, j) belongs. If reference (i, j)
is a write, stamp[i][j][1] is not used. The local inspector at each
processor passes once through its iteration-reference region, forms
read groups, and assigns appropriate stamps onto its inner references. Inner references refer to those whose stamps can be determined locally using conservative approaches. All write operations
except for the heads of local dependence chains are inner references. All reads of a read group are also inner references if the
group is neither the head nor the tail of any local dependence chain.
Figure 6 presents a partially stamped dependence chain on element 10. From this figure, it can be seen that region 1 establishes
a right-open single read group; region 2 establishes a right-open
group with two members; region 3 builds a group open at both
ends.
The local inspector stamps all inner references of each region.
A subsequent global inspector merges consecutive read groups that
are open to each other and assigns appropriate stamps to newly
Figure 6: A partially stamped dependence chain constructed in parallel.
formed closed read groups and undecided writes. Figure 7 is a
fully stamped chain evolved from Figure 6.
Figure 7: A fully stamped dependence chain constructed in parallel.
The executor uses an extra clocking rule to allow concurrent
reads and meanwhile preserve anti-flow dependences. That is, if
the reference is a read in region r, the clock of its dependence chain
is incremented by 1/b if the read is not the head of a local chain in region r; otherwise,
it is set to n + 1 + 1/b + frac, where b is the read group size, n is the total number of accesses in regions 0 through r-1, and
frac is the fractional part of the current clock.
For example, look at the dependence chain associated with
memory location 10 in Figure 7. Activation of reference (2, 0)
increments time[10] by one because it forms a single-read group. Reference (4, 1) triggers all four subsequent reads simultaneously by
setting time[10] to 10. Suppose the four references are performed
in the following order: (6, 0), (9, 0), (7, 0) and (10, 0). Reference (6, 0) increments time[10] by 1/4. Reference (9, 0), however, sets time[10] to 16.5. The subsequent two reads add 1/4 to
time[10] each. Upon completion of all reads, their subsequent
write (13, 1) is activated. The fractional part of the clock thus records the number of activated reads in a group.
4 A DOALL algorithm
In this section, we present a DOALL Inspector/Executor algorithm for run-time parallelization. The algorithm breaks down
parallelization functions into three routines: inspector, scheduler
and executor. The inspector examines the memory references in
a loop and constructs a reference-wise dependence chain for each
data element accessed in the loop. The scheduler then derives more
restrictive iteration-wise dependence relations from the reference-wise dependence chains. Finally, the executor restructures the loop
into a sequence of wavefronts and executes iterations accordingly.
Iteration-wise dependence information is usually represented by
a vector of integers, denoted by wf . Each element of the vector, wf [i], indicates the earliest invocation time of iteration i. The
wavefront vector bridges the scheduler and executor, being their
output and input, respectively. The reference-wise dependence
chains can be represented in different ways. Different data structures will lead to different inspector and scheduler algorithms. Desirable features of such a structure are low memory overhead, simple parallel construction, and easy generation of wavefront vectors by the scheduler.

In [14], Rauchwerger et al. presented an algorithm (RAP, for short) that uses a reference array R to collect all the references to an element in iteration order and a hierarchy vector H to record the index of the first reference of each wavefront level in the array. For example, for the loop in Figure 1 and the memory access pattern defined by u and v in Section 3, the reference array of element 10 and its hierarchy vector are shown in Figure 8. H[2] = 2 and H[3] = 6 indicate that iteration 6 (R[2]) and iteration 13 (R[6]) are the first references of the third and fourth wavefront levels, respectively. The scheduler uses these two data structures as look-up tables for determining the predecessors and successors of all the references.
Figure 8: The reference array (a) and the hierarchy vector (b) of element 10 for the loop in Figure 1 with the memory access pattern defined by u and v. The reference array lists, in iteration order, the iterations 2, 4, 6, 7, 9, 10, 13, 14 with types R, W, R, R, R, R, W, W and wavefront levels 1, 2, 3, 3, 3, 3, 4, 5; the hierarchy vector is H = [0, 1, 2, 6, 7].
Since the reference array in the RAP algorithm stores the reference levels of a dependence chain, it is hard to construct in parallel
using the ideas from the DOACROSS algorithms. We cannot assign conservative levels to references because their levels will be
used to derive consecutive iteration-wise wavefront levels. In the
following, we present a new algorithm that simply uses a dependence table, as shown in Figure 9, to record the memory-reference
dependences. The table is different from the stamp array of the
time-stamp algorithms. First, each dependence chain of the table reflects only the order of precedence of its memory references.
There are no levels associated with the references. Second, not all
memory references have a role in the dependence chains. Since
the DOALL approach concerns only iteration-wise dependences,
a read operation can be ignored if there is another write in the same
iteration that accesses the same memory location. For example, the read in iteration 12 is overruled by the write in the same
iteration.
Figure 9: Dependence table for the loop in Figure 1 with the memory access pattern defined by u and v.
Each cell of the table is defined as
    struct Cell {
        int   iter;     /* Current iteration index      */
        int   elm;      /* Item to be referenced        */
        char  type;     /* Reference type: RD/WR        */
        int  *xleft;    /* (xleft, yleft) points to     */
        int  *yleft;    /*   its predecessor            */
    };
For parallel construction of the dependence table in the inspector
phase, each processor maintains a pair of pointers (head[i]; tail[i])
for each memory location i, which point to the head and tail of its
associated dependence chain, respectively. As in the DOACROSS
algorithm, the inspector needs cross-processor communication to
connect those adjacent local dependence chains associated with
the same memory location. Processor k can find the predecessor (successor) for its local dependence chain i by scanning tail[i]
(head[i]) of processors from k-1 to 0 (from k+1 to N-1).
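A sketch of this cross-processor step in C (the table sizes and names are illustrative; head[] is handled symmetrically by scanning processors k+1 upward):

    #include <stddef.h>

    #define NELEM 16                  /* illustrative number of array elements */

    struct Cell;                      /* the cell structure defined above      */

    /* tail[p][x] is the last cell of processor p's local dependence chain for
     * element x, or NULL if region p contains no reference to x. */
    struct Cell *find_predecessor(int k, int x, struct Cell *tail[][NELEM])
    {
        for (int q = k - 1; q >= 0; q--)      /* scan regions k-1 down to 0 */
            if (tail[q][x] != NULL)
                return tail[q][x];            /* last reference to x in an earlier region */
        return NULL;                          /* no earlier reference: global chain head  */
    }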
Based on the dependence table, processors then construct a
wavefront vector in parallel through synchronizing accesses to each
element of the wavefront vector. Specifically, for iteration i, a processor determines the dependence levels of all its references and
sets wf[i] to the highest level. The dependence level of a reference
is calculated according to the following rules.
For a reference r of iteration i,
(S1) if the reference r is a write and its immediate predecessor is a
read, the processor examines all following consecutive reads
and sets the reference level to max{wf[k]} + 1, where k is
an iteration that contains one of the reads;
(S2) if the reference r is a write and its immediate predecessor
is a write (say in iteration k), the reference level is set to
wf[k] + 1;
(S3) if the reference r is a read, the processor backtracks the reference's dependence chain until it meets a write (say in iteration k). The reference level is then set to wf[k] + 1.
Notice that synchronization of accesses to wavefront vector elements requires a blocking read if its target has an invalid value.
The waiting reads associated with an element should be woken up
by each write to the element.
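The following C sketch captures rules S1-S3 (the wavefront level of iteration i is then the maximum of ref_level over its references); for readability each reference is linked to its immediate predecessor through a single pointer rather than the (xleft, yleft) pair of the cell structure, and the blocking synchronization on wf[] is omitted.

    #include <stddef.h>

    struct Ref { int iter; char type; struct Ref *pred; };   /* type: 'R' or 'W' */

    int ref_level(const struct Ref *r, const int *wf)
    {
        const struct Ref *p = r->pred;
        if (p == NULL)                              /* head of a dependence chain */
            return 1;
        if (r->type == 'W' && p->type == 'R') {     /* S1: write after read(s)    */
            int lvl = 1;
            for (const struct Ref *q = p; q != NULL && q->type == 'R'; q = q->pred)
                if (wf[q->iter] + 1 > lvl)
                    lvl = wf[q->iter] + 1;          /* max{wf[k]} + 1 over the reads */
            return lvl;
        }
        if (r->type == 'W')                         /* S2: write after write       */
            return wf[p->iter] + 1;
        while (p != NULL && p->type != 'W')         /* S3: read - back up the      */
            p = p->pred;                            /*     chain to the last write */
        return (p != NULL) ? wf[p->iter] + 1 : 1;
    }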
Applying the rules to the dependence table in Figure 9, we obtain a wavefront vector of
1 1 2 2 3 3 4 4 2 5 4 4 3 6 7 4.
Taking the wavefront vector as an input, the executor can be
described as follows:
    for k=0 to d-1 do
        forall i such that wf[i]=k
            perform iteration i
        endfor
        barrier
    endfor

where d is the number of wavefronts in a schedule.

5 Performance Evaluations
We implemented the DOACROSS and DOALL run-time parallelization algorithms on a SUN Enterprise E3001. It is a symmetric
multiprocessor, configured with four 167MHz UltraSPARC processors and a memory of 512 MB. Each processor module has an
external cache of 1 MB.
Each DOACROSS algorithm was implemented as a sequence
of three routines: a local inspector, a global inspector, and an executor. The implementation of the DOALL algorithm has an extra scheduler between its global inspector and executor. All routines are separated by barrier synchronization operations. They
were programmed in the Single-Program-Multiple-Data (SPMD)
paradigm and as multi-threaded codes. At the beginning, threads
were created in a bound mode so that threads are bound to different
CPUs and run to completion.
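On Solaris, bound-mode creation can be expressed with the threads library of [17], roughly as in the sketch below; NPROC, worker, and the surrounding structure are our own assumptions, and binding a thread to a particular CPU would additionally require processor_bind (not shown).

    #include <thread.h>                  /* Solaris threads library, cf. [17] */

    #define NPROC 4

    extern void *worker(void *arg);      /* SPMD body: local/global inspector,
                                            (scheduler,) and executor          */

    /* Create the worker threads in bound mode: THR_BOUND ties each thread to
     * its own LWP for the lifetime of the run.  Sketch only. */
    void start_workers(void)
    {
        thread_t tid[NPROC];

        for (long p = 0; p < NPROC; p++)
            thr_create(NULL, 0, worker, (void *)p, THR_BOUND, &tid[p]);

        for (int p = 0; p < NPROC; p++)
            thr_join(tid[p], NULL, NULL);    /* wait for all workers to finish */
    }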
Performance of a run-time parallelization algorithm is dependent on a number of factors. One is the structure of target loops.
This experiment targets the same synthetic loop structure used in
[6, 18]. It comprises a sequence of interleaved reads and writes in
the loop body. Each iteration executes a delay() function, reflecting
the workload caused by its memory references. In [18], another so-called multiple-reads-single-write loop structure was considered in
a preliminary evaluation of the DOACROSS algorithms. It was
found that the FCR and the PCR algorithms gain few benefits from
the extra reads in that loop structure.
Another major factor affecting the overall performance is memory access patterns defined by index arrays. Uniform and nonuniform memory access patterns were considered. A uniform access pattern (UNIFORM, for short) assumes all array elements have
the same probability of being accessed by a memory reference. A
non-uniform access pattern (NUNIFORM, for short) refers to the
pattern where 90% of references are to 10% of array elements.
Non-uniform access patterns reflect hot reference spots and result
in long dependence chains.
In addition to the loop structure and memory access pattern, the
performance of a parallelization algorithm is also critically affected
by its implementation strategies. One of the major issues is loop
distribution. For the local and global inspectors, an iteration-wise
block decomposition is straightforward because it not only ensures
balanced workload among processors but also allows processors
to be run in parallel. Loops in the scheduler and executor can be
distributed in different ways: cyclic, block, or block cyclic. Their
effects on the performance will be evaluated, too.
The experiments measure the overall run-time execution time
for a given loop and a memory access pattern. Each data point obtained in the experiments is the average of five runs, using different
seeds for the generation of pseudo-random numbers. Since a seed
determines a pseudo-random sequence, algorithms are able to be
evaluated under the same test instances.
5.1 Impact of access patterns and loop sizes
where d is the number of wavefronts in a schedule.
Figure 10 presents speedups of different parallelization algorithms
over serial loops ranging from 16 to 16384 iterations. Each iteration has four memory references and carries a 500-microsecond workload in the delay function. Loops in the scheduler and executor are decomposed in a cyclic way. Section 5.3
scheduler and executor are decomposed in a cyclic way. Section 5.3
will show that the cyclic decomposition yields the best performance
for all algorithms.
Figure 10: Speedups of the parallelization algorithms (FCR: fully concurrent reads; PCR: partially concurrent reads; CTY: the Chen-Torrellas-Yew algorithm; DOALL parallelization) over serial loops of different sizes, where load = 500 microseconds: (a) uniform access patterns; (b) non-uniform access patterns.
Overall, the figure shows that the DOACROSS algorithms significantly outperform the DOALL algorithm, even though both
techniques are capable of gaining speedups for loops that exhibit
uniform and non-uniform memory access patterns. The gap between their speedup curves indicates the importance of exploiting fine-grained reference-level parallelism at run-time. Referencelevel parallelism is especially important to small loops because they
normally have very limited degrees of iteration-level parallelism.
We refer to the degree of iteration-level parallelism as the number
of iterations in a wavefront of the DOALL algorithm. Figure 11
plots average degrees of parallelism in loops of different sizes. As
can be observed, the degree of parallelism is proportional to the
loop size. A loop that exhibits non-uniform access patterns has at
most four degrees of parallelism until its loop size reaches beyond
512. This implies that iteration-level parallelization techniques for such small loops will not gain any benefit on a system with four processors. This is in agreement with what the speedup curve of the DOALL algorithm shows in Figure 10(b). Similar observations can be made for loops with uniform access patterns from the plot of parallelism degrees in Figure 11 and the speedup curves in Figure 10.

Figure 11: Average degree of parallelism exploited by the DOALL algorithm (uniform and non-uniform access patterns, SRSW loop structure).

It is worth noting that a higher degree of parallelism does not necessarily lead to better performance. From Figure 10(a), it can be seen that the speedup of the DOALL algorithm starts to drop when the loop size grows beyond 4096. The average degree of parallelism at that point is 128, which is clearly excessive on a system with only four processors. The performance degradation is due to the cost of barrier synchronizations.
The DOACROSS technique delivers high speedups for small
loops because it is able to exploit enough parallelism to keep all
processors busy all the time. As expected, the amount of fine-grained parallelism quickly becomes excessive as the loop size
increases. An interesting feature with the DOACROSS algorithms
is that their speedups stabilize at the highest level when parallelism
is over-exposed, rather than declining as in the DOALL technique.
This is because the cost of point-to-point synchronizations used in the
DOACROSS executor is independent of the parallelism degree.
Consequently, the execution time of the executor is proportional to
loop size.
Of the DOACROSS variants, the FCR and PCR algorithms are
superior to the CTY for small loops, while the CTY algorithm is
preferred for large ones. The FCR and PCR algorithms improve
on the CTY by allowing consecutive reads to the same location to
be run simultaneously. The extra amount of parallelism does benefit small loops. For large loops, however, the benefit could be
outweighed by the cost of exploiting and managing the parallelism.
Since the FCR algorithm incurs more overhead than the CTY in
the global inspector, it obtains lower speedups for large loops. In
contrast, the PCR algorithm can obtain almost the same speedup
as the CTY because the PCR algorithm causes only slightly more
overhead in the local inspector.
To better understand the relative performance of the algorithms,
we break down their execution into a number of principal tasks and present their percentages of the overall execution time in Figure 12. The initialization curve indicates the cost of memory allocation, initialization, and thread creation. The cost spent in the local inspector, global inspector, or scheduler is indicated by the range between its curve and that of its predecessor. The remainders above the global inspector and scheduler curves are due to the executors of the DOACROSS and DOALL algorithms, respectively.
From the figure, it can be seen that both the CTY and PCR algorithms spend approximately 5% of time in their initializations, and
1.0% in their local and global inspectors each for large loops. The
FCR algorithm spends about 10% of time in initialization because it
uses a three-dimensional stamp table (instead of a two-dimensional
stamp array in the CTY and PCR algorithms) and different auxiliary data structures: head and tail. All these structures need to
be allocated at run-time. The FCR algorithm spends more time
in the global inspector for exploiting concurrent reads across different regions. Compared with the CTY and PCR algorithms, the
FCR reduces the time spent in the executor significantly. When dependence analyses can be reused across multiple loop
invocations, the FCR algorithm is expected to achieve even better
performance.
Figure 12(d) shows that the DOALL algorithm spends a high percentage of time in the scheduler for generating iteration-wise dependences from reference-wise dependence information. The percentage decreases as the loop size increases because the cost of
barrier synchronization operations in the executor increases.
Figure 12: Breakdown of execution time, in percentages of the total, over the initialization, local analysis, global analysis, and scheduler stages: (a) CTY algorithm; (b) PCR algorithm; (c) FCR algorithm; (d) DOALL algorithm.

5.2 Impact of loop workload
It is known that the cost of a run-time parallelization algorithm in
the inspector and scheduler is independent of the workload of iterations. The time spent in the executor, however, is proportional to
the amount of loop workload. The larger the workload, the smaller
is the cost percentage in the inspector and scheduler. The experiments above assumed 500 microseconds workload at each iteration.
Figure 13 presents speedups of the algorithms under the assumption
of 50 microseconds iteration workload, instead. From the figure, it
can be seen that all algorithms lose a certain amount of speedup.
However, the relative performance of the algorithms remains the same
as revealed by Figure 10. The performance gap between the FCR
algorithm and the CTY and PCR algorithms is enlarged because
the relative cost of global analysis in the FCR algorithm increases
as the workload decreases.
Figure 13: Speedups of the algorithms over serial code for loops of different sizes, where load = 50 microseconds: (a) uniform access patterns; (b) non-uniform access patterns.
5.3 Impact of loop distributions
Generally, a loop iteration space can be assigned to threads in either
static or dynamic approaches. Static approaches assign iterations to
threads prior to their executions, while dynamic approaches make
decisions at run-time. Their advantages and disadvantages were
discussed in [19] in a general context of task mapping and load
balancing.
In this experiment, we tried three simple static assignment
strategies: cyclic, block, and block-cyclic, because their simplicity allows them to be implemented efficiently at run-time. Let
b denote the block size. A block cyclic distribution algorithm as-
signs iteration i to thread floor((i mod bM) / b), where M is the
number of threads. It reduces to a cyclic distribution if b = 1
and to a block distribution if b = N/M, where N is the number of
iterations.
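For reference, the three static assignment strategies can be written as the following C expressions (a sketch; M is the number of threads, b the block size, and N is assumed to be a multiple of M for the block case):

    int owner_cyclic(int i, int M)              { return i % M; }
    int owner_block(int i, int N, int M)        { return i / (N / M); }
    int owner_block_cyclic(int i, int b, int M) { return (i % (b * M)) / b; }
    /* owner_block_cyclic reduces to owner_cyclic for b = 1 and to
     * owner_block for b = N / M. */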
Figure 14 shows the effect of block cyclic decompositions on
the execution time of a loop with 1024 iterations. The figure shows
that the DOACROSS algorithms are very sensitive to the block
size. They prefer cyclic or small block cyclic distributions because
they lead to good load balance among processors. Each plot of the
DOACROSS algorithms has a knee, beyond which its execution
time increases sharply. The FCR and PCR algorithms have the
largest knees, reflecting the fact that the algorithms exploit the largest
degree of parallelism. Given more processors, they are projected to
perform better than the CTY.
The DOALL algorithm uses a block cyclic decomposition in the
scheduler. Its overall execution time is insensitive to the block size.
In the case of non-uniform access patterns, a large block cyclic
decomposition is slightly superior to the cyclic distribution.
Figure 14: The effect of block cyclic decompositions on the total execution time of a loop with 1024 iterations: (a) uniform dependence patterns; (b) non-uniform dependence patterns.
6 Conclusions
In this paper, we considered two run-time loop parallelization techniques, DOALL and DOACROSS, which expose different granularities and degrees of parallelism in a loop. The DOALL technique exploits iteration-level parallelism. It restructures the loop
into a sequence of doall loops, separated by barrier operations.
The DOACROSS technique supports fine-grained reference-level
parallelism. It allows dependent iterations to be run concurrently
through inserting synchronization operations to preserve dependences. The DOACROSS technique has three variants, CTY,
PCR, and FCR, which expose different amounts of parallelism
among concurrent reads to the same memory location. Both approaches follow a so-called Inspector/Executor scheme.
We evaluated the performance of the algorithms on a symmetric
multiprocessor for loops with different structures, memory access
patterns, and computational workload. In the executor, loops are
distributed among processors in a block cyclic way. The experimental results showed that:
The DOACROSS technique outperforms the DOALL even
though the latter plays a dominant role in compile-time
loop parallelization. This is because the DOALL algorithm
spends a high percentage of time in the scheduler routine for
the construction of iteration-wise wavefronts. Hot reference
spots have more negative effects on the DOALL algorithm
due to limited iteration-level parallelism. The DOACROSS
technique identifies fine-grained reference parallelism at the
cost of frequent synchronization in the executor. Multithreaded implementations reduced the synchronization overhead of the algorithms.
Of the DOACROSS variants, the PCR algorithm performs
best because it incurs only slightly more overhead than the
CTY algorithm. The FCR algorithm improves on the CTY
algorithm for small loops that do not have enough parallelism. For large loops or loops with light workload, its benefits are likely to be outweighed by its extra run-time cost.
Loops that are to be executed repeatedly favor the DOALL
and FCR algorithms. This is because they spend a high percentage of time in dependence analysis and scheduling and the
time can be saved in subsequent loop invocations.
The DOACROSS algorithms are sensitive to the block size
of loop distribution. Cyclic and small block cyclic distributions yield better performance.
Future work includes examining load balancing and locality issues in parallelizing loops that are executed repeatedly, and evaluating the algorithms on loops from real applications.
Acknowledgements
This work was supported in part by a startup grant from Wayne
State University. The author would like to thank Roy Sumit for
his help with the experiments and Vipin Chaudhary for his insights into
the experimental data. Thanks also go to Loren Schwiebert for his
advice on the presentation.
References
[1] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, "Conversion of control dependence to data dependence", in Proc. of the 10th ACM Symposium on Principles of Programming Languages, pages 177-189, Jan. 1983.
[2] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing", ACM Computing Surveys, Vol. 26, No. 4, pages 345-420, Dec. 1994.
[3] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua, "Automatic program parallelization", Proceedings of the IEEE, Vol. 81, No. 2, pages 211-231, February 1993.
[4] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland, "Massively parallel methods for engineering and science problems", Comm. ACM, 37(4):31-41, April 1994.
[5] D. K. Chen and P. C. Yew, "An empirical study on DOACROSS loops", in Proc. of Supercomputing '91, pages 620-632.
[6] D. K. Chen, P. C. Yew, and J. Torrellas, "An efficient algorithm for the run-time parallelization of doacross loops", in Proc. of Supercomputing 1994, pages 518-527, Nov. 1994.
[7] R. Cytron, "DOACROSS: beyond vectorization for multiprocessors", in Proc. of the International Conference on Parallel Processing, pages 836-844, 1986.
[8] J. Ju and V. Chaudhary, "Unique sets oriented partitioning of nested loops with non-uniform dependences", in Proc. of the International Conference on Parallel Processing, 1996.
[9] S.-T. Leung and J. Zahorjan, "Improving the performance of runtime parallelization", in Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 83-91, May 1993.
[10] S.-T. Leung and J. Zahorjan, "Extending the applicability and improving the performance of runtime parallelization", Technical Report, Department of Computer Science, University of Washington, 1995.
[11] J. T. Lim, A. R. Hurson, K. Kavi, and B. Lee, "A loop allocation policy for DOACROSS loops", in Proc. of the Symposium on Parallel and Distributed Processing, pages 240-249, 1996.
[12] S. Midkiff and D. Padua, "Compiler algorithms for synchronization", IEEE Trans. on Computers, C-36(12), December 1987.
[13] L. Rauchwerger and D. Padua, "The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization", in Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1995.
[14] L. Rauchwerger, N. M. Amato, and D. A. Padua, "Run-time methods for parallelizing partially parallel loops", Technical Report, UIUC, 1995.
[15] J. Saltz, R. Mirchandaney, and K. Crowley, "Run-time parallelization and scheduling of loops", IEEE Trans. Comput., 40(5), May 1991.
[16] Z. Shen, Z. Li, and P. C. Yew, "An empirical study on array subscripts and data dependencies", in Proc. of ICPP, pages II-145 to II-152, 1989.
[17] SunSoft, Multithreaded Programming Guide, 1995.
[18] C. Xu and V. Chaudhary, "Time-stamping algorithms for parallelization of loops at run-time", in Int. Symposium of Parallel Processing, April 1997. Also available at http://www.pdcl.eng.wayne.edu/czxu.
[19] C. Xu and L. Lau, "Load Balancing in Parallel Computers: Theory and Practice", Kluwer Academic Publishers, ISBN 0-7923-9819-X, Nov. 1996.
[20] C. Zhu and P. C. Yew, "A scheme to enforce data dependence on large multiprocessor systems", IEEE Trans. Softw. Eng., 13(6):726-739, 1987.