Program Optimization and Parallelization Using Idioms

SHLOMIT S. PINTER
Technion — Israel Institute of Technology
and
RON Y. PINTER
IBM Israel Science and Technology
Programs in languages such as Fortran, Pascal, and C were designed and written for a
sequential machine model. During the last decade, several methods to vectorize such programs
and recover other forms of parallelism that apply to more advanced machine architectures have
been developed (particularly for Fortran, due to its pointer-free semantics). We propose and
demonstrate a more powerful translation technique for making such programs run efficiently on
parallel machines which support facilities such as parallel prefix operations as well as parallel
and vector capabilities. This technique, which is global in nature and involves a modification of
the traditional definition of the program dependence graph (PDG), is based on the extraction of
parallelizable program structures (“idioms”) from the given (sequential) program. The benefits of
our technique extend beyond the above-mentioned architectures and can be viewed as a general
program optimization method, applicable in many other situations. We show a few examples in
which our method indeed outperforms existing analysis techniques.
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—compilers; optimization

General Terms: Algorithms, Languages, Performance

Additional Key Words and Phrases: Array data flow analysis, computational idioms, dependence analysis, graph rewriting, intermediate program representation, parallelism, parallel prefix, reduction, scan operations
This research was performed in part while on sabbatical leave at the Department of Computer Science, Yale University, and was supported in part by NSF grant number DCR-8405478 and by ONR grant number N00014-89-J-1906. A preliminary version of this article appeared in the Proceedings of the 18th Annual ACM Symposium on Principles of Programming Languages (Jan. 1991).
Authors' addresses: S. Pinter, Department of Electrical Engineering, Technion—Israel Institute of Technology, Haifa 32000, Israel; email: [email protected]; R. Pinter, IBM Science and Technology, MATAM Advanced Technology Center, Haifa 31905, Israel; email: [email protected].
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0164-0925/94/0500-0305 $03.50
1. INTRODUCTION

Many of the classical compiler optimization techniques [Aho et al. 1986] comprise the repeating application of local transformations to (intermediate) code, replacing suboptimal program fragments with better ones. One hopes, as is often the case, that this process (until a fixed point is found or some other criterion is met) will result in an overall better program.

To a large extent, the attempts to vectorize—and otherwise parallelize—sequential code [Wolfe 1989] are also based on the detection of local relationships among data items and the control structures that enclose them. These approaches, however, are local in nature and do not recognize the structure of the computation being carried out. A computation structure sometimes spans a fragment of the program which may include some irrelevant details that obscure the picture, both to the human eye and to automatic detection of parallelism (that is, the structure of the underlying algorithms). Such "idioms" [Perlis and Rugaber 1979], which are not expressible directly as constructs in conventional high-level languages, include inner-product calculations and recurrence equations in numeric code, data structure traversal in a symbolic context, updating shared values in a distributed application, and selective processing of vector elements. Recognizing them amounts to "lifting" the level of processing beyond that of the classical, local methods.

Once recognized, such structures can be replaced, lock, stock, and barrel, by a new piece of code that is most appropriate for the task at hand. The potential is most appealing in view of the gains of vector and parallel computing, which are far beyond those that pertain to sequential machines. For example, a loop computing the convolution of two vectors, which uses array references and pointers, can be replaced by an equivalent assembler fragment that has been highly optimized by an expert programmer and tailored to the target machine, or by a call to a BLAS [Dongarra et al. 1988; Lawson et al. 1979] routine that does the same job on the appropriate target machine. Better yet, if the machine at hand—for example, TMC's CM-2 [Thinking Machines Corp. 1987]—supports summary operators, such as reduction and scan, which work in time O(log n) using a parallel prefix implementation [Blelloch 1989; Ladner and Fischer 1980] rather than O(n) on a sequential machine (where n is the length of the vectors involved), then the gains could be even more dramatic.
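To make the O(log n) claim concrete, the following small sketch (ours, in Python; it is not part of the original presentation) simulates the recursive-doubling scheme that parallel prefix implementations use: each of the log2(n) rounds consists of independent element-wise updates, so a machine with enough processors executes every round in constant time.

    # Illustrative sketch (not from the article): an inclusive "+"-scan by
    # recursive doubling.  Each round's updates are mutually independent,
    # so a parallel machine executes a round in one step and the whole
    # scan in O(log n) rounds; the Python loop only simulates the rounds.
    def scan_plus(x):
        y = list(x)
        d = 1
        while d < len(y):
            # element i (for i >= d) adds in the value d positions to its left
            y = [y[i] if i < d else y[i - d] + y[i] for i in range(len(y))]
            d *= 2
        return y

    def reduce_plus(x):
        # a reduction is just the last element of the inclusive scan
        return scan_plus(x)[-1] if x else 0

    assert scan_plus([1, 2, 3, 4]) == [1, 3, 6, 10]
    assert reduce_plus([1, 2, 3, 4]) == 10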
In this article we propose a method for extracting parallelizable idioms from scientific programs. We cast our techniques in terms of the construction and analysis of the computation graph, which is a modified extension of the program dependence graph (PDG) [Ferrante et al. 1987], since PDG-based analysis methods seem to be most flexible and amenable for extensions.

Forming this new type of graph from a source program—as is necessary for the required analyses—and using it for optimization transformations involves three major steps:
—When necessary, loops are unrolled so as to reach a normal form¹, whose precise definition and the algorithms for its formation are an additional contribution of this article.

¹ In the sense of Munshi and Simons [1987], not that of Allen and Kennedy [1987].
—For each innermost loop in the program, we construct the computation graph for the basic block constituting its body. Certain conditionals (which are classified as "data filters") are handled gracefully, whereas others we disallow.
—The graph of each loop is replicated three times, providing the initial, "middle," and final iterations, and the computation graph of the whole loop is built up with this structure. The process is repeated as we go up the control nesting structure as far as possible (until we hit a problematic branching node).

Once the graph is built, we apply to it a variety of transformations using the appropriate kind of algorithms that we provide for this purpose. Using this method, we were able to extract structures that other methods fail to recognize, thereby providing an additional potential for optimization and speedup. Its power can be demonstrated by the following small example, taken from a larger program (we examined this loop using existing noncommercial tools that implemented the methods of Allen et al. [1987], who gave up on trying to parallelize it, and with the KAP [Huson et al. 1986] and VAST-2 Version 6 [Brode 1981] tools, which recognize second- and third-order linear recurrences and other systems). As reported in the literature and as verified by our experiments, known methods cannot realize the potential of this kind of structure, and even a human observer who is not fitted with the appropriate plan may fail to recognize that the loop hides two independent recurrences (it cannot be vectorized):

Example 1:

      DO 100 I = 2, N-1
        C(I) = A(I) + B(I)
        B(I+1) = C(I-1) * A(I)
  100 CONTINUE

This computation can be effected in time O(log N) rather than O(N) using the method of Kogge and Stone [1973].
We believe that our techniques can also be used together with other analysis frameworks, such as symbolic evaluation and related analysis methods. Another advantage of our approach is that it handles data and control uniformly: our method encapsulates the effect of inner loops so that idioms spanning deep nesting can be identified. Moreover, by incorporating data filters that abstract certain forms of conditionals, idioms for the selective manipulation of vector elements (which are useful on vector and SIMD machines) can be extracted.
Finally, due to the careful definition of the framework and the appropriate selection of transformation rules, our method can uniformly serve a wide variety of target architectures, ranging from vector architectures to distributed machines.

In what follows we first review other work on compiler transformations and evaluate its limitations for purposes of idiom extraction. Then, in Section 3, we provide a formal presentation of our technique, including the construction of computation graphs, the definition of loop normalization, and the necessary transformation algorithms. In Section 4 we exemplify our techniques, stressing the advantages over existing methods. Section 5 tries to put our work in perspective, discusses applications other than parallelism, and proposes further research.
2. EXISTING PARALLELIZATION METHODS

Much of the traditional program analysis work culminated in the definition and use of the above-mentioned PDG or its constituents, the control and data dependence graphs. Such a graph defines the flow of control in the program as well as the dependence between statements; the dependence are reflected at the syntactic level (variable names) rather than at the level of the values being used in the computation. Analyzing the graph enables the detection of loop-carried dependence and improves vectorization, e.g., as embodied in the original methods of Kuck et al. [1981] and in Wolfe [1989], together with many similar graphs used in the program's vectorization. These methods were used mainly for recovering loop-invariant assignments and internal ones, and when applied to programs in single static assignment (SSA) form [Cytron et al. 1989] they can be quite effective. However, the analysis methods, as well as the compiler transformations allowed by the PTRAN system [Allen et al. 1988], classify the dependence by following statements and are very much tied to the structure of the program; the focus is on tracking names rather than values. Thus, this approach does not allow us to extract patterns of data modification that are not readily apparent from the source program and that require a deeper semantic analysis. For example, the PDG of the loop shown in Section 1 (Example 1) contains two nodes on a directed cycle, implying a circular dependency. This representation is not accurate since the two reductions do not depend on each other; the PDG is not quite strong enough to expose the parallelism hidden in this computation. Thus, the PDG is more conservative than our graph, which will expose the potential for parallelization in this loop.

Partial success in recovering reduction-type operators is reported in the work on KAP [Huson et al. 1986], Parafrase [Lee et al. 1985; Polychronopoulos et al. 1986], and VAST [Brode 1981; Pacific-Sierra 1990] (all these tools do a very good job in vectorization and loop parallelization). The restructuring transformations are in the spirit of the PDG-based work, but the techniques for detecting reduction operators are different from ours and somewhat ad hoc. The KAP system (the strongest in recovering these types of operations) can recognize second- and third-order linear recurrences, yet when such computations are somewhat intermixed as in our previous example it fails to do so.
transformation-based
approach
is that of capturing
the data dependence
among
array
references
by means
of dependence
vectors
[Chen
1986; Karp et al. 1967; Lamport
1974]. Then methods
from linear algebra and
number
theory can be brought
to bear to extract
wavefronts
of the computation. The problem
with this framework
is that only certain
dence
are modeled;
there is no way to talk about specific
ACM
TransactIons
on Programmmg
Languages
and Systems.
Vol
16, No
3, May
types of depenoperations
(and
1994
Optimization
and Parallelization
Using Idioms
their algebraic
properties),
and in general
it is hard to extend.
the work of Banerjee
[1988] on data dependence
analysis
and
[1989] can be viewed
as extending
the above-mentioned
framework
by using direction
and distance
vectors.
More recently,
Callahan
[1991]
provided
a combination
graph-based
recurrences.
tation,
We note that
that of Wolfe
dependence
graph
of algebraic
and
methods
for recognizing
and optimizing
a generalized
class of
This method
is indeed effective
in exploiting
this type of compu-
but
it is strongly
to generalizations
Finally,
309
.
tied
to a specific
framework
to nonrecurrence-type
various
and does not lend
itself
transformations.
symbolic-evaluation
methods
have been proposed
[Jouvelot
and Dehbonei
1989; Letovsky
1988; Rich and Waters
1988], mostly for plan
analysis
of programs.
These methods
all follow the structure
of the program
rigorously,
and even though
the program
is transformed
into some normal
form up front
and all transformations
preserve
this property,
still
these
methods
are highly
sensitive
to “noise”
the reasoning
about array references
code) is quite
limited
3. COMPUTATION GRAPHS AND ALGORITHMS FOR IDIOM EXTRACTION

In this section we propose a new graph-theoretic model to represent a program, namely, the computation graph. To allow effective usage of this model, programs must be preprocessed and, most importantly, loops must be normalized; this technique will be presented in the beginning of this section. Next we define the computation graph formally and list some of its important properties. Finally we present an algorithmic framework that uses this new abstraction to analyze the structure of programs and identify computation patterns that can be parallelized or otherwise optimized; specific patterns and their replacements, for purposes of finding reduction operations on arrays and data filters, are provided. To guide the reader through this section and illustrate the definitions and algorithms, we use the sample program of Figure 1, which is written in Fortran. This example is merely meant to explain the flow of our approach, not to demonstrate its advantages over other methods; this will be done in Section 4.

3.1 Preprocessing and Loop Normalization

Before our analysis can be applied to a given program we must transform it so that the computation graph can effectively capture the program's structure for purposes of idiom extraction. Some of these "preprocessing" transformations are straightforward and well-known optimizations, as we shall point out in passing, but others require some new techniques that we shall explain. At this stage we assume that programs contain only IF statements of the data filter kind, as in

      IF (A(I) .EQ. 0) B(I) = 1.

Some of the other conditionals can be pulled out of loops or converted to filters using standard techniques, but those that cannot are outside the scope of our techniques. Finally, we assume that all array references depend linearly (with integral coefficients) on loop variables.
      DO 10 I=1,N
        T = I+4
C       assume M is even
        DO 10 J=3,M
          A(I,J) = A(I,J) + T*A(I,J-2)
   10 CONTINUE

Fig. 1. A sample Fortran program.
The preprocessing transformations are as follows:

—Standard basic block optimizations (such as dead-code elimination, common subexpression and dead-store detection, etc.) are performed per Aho et al. [Section 9.4, 1986], so that the resulting set of assignments does not contain redundancies. This implies that the expression DAG that is obtained for each block represents only the def-use relationship between variables that are live upon entry to the block and those whose values are live at the exit. We further assume that each defining expression is "simple," meaning that it contains at most two values and one operator or that it is of some predefined linear form. Which of these typical restrictions is imposed depends on the type of recognition algorithm we want to apply later (in Section 3.3).

—To extend the scope of the technique to values, loops are recognized wherever possible.

—Loops are normalized with respect to their index, as defined below.

Munshi and Simons [1987] have observed that by sufficient unrolling all loop-carried dependence can be made to occur only between consecutive iterations. The exact conditions stating how much unrolling is required and the technique for transforming a loop to normal form have never been spelled out; thus—in what follows—we provide the necessary details.

Normalization comprises two steps: determination of loop-carried dependence (specifically among different references of arrays) and unrolling. If data dependence analysis does not provide the exact dependence we may unroll the loop more than is necessary without affecting the correctness of the process.

Customarily, e.g., Allen et al. [1987], Ferrante et al. [1987], and Wolfe [1989], data dependence is defined between statements, rather than between scalar values and array references. We provide the following definition for data dependence among values in a program (and from now on we use only this notion of data dependence):
Definition 1. Value A is data dependent on value B if they both refer to the same memory location and either (i) a use of A depends on the definition of B (flow dependence), (ii) the definition of A redefines B which was used in an expression (antidependence), or (iii) the definition of A redefines the previous definition of B (output dependence).
A dependence is loop carried if it occurs between values in two different iterations of a loop. In the following example, array A has two references (and so does B):

Example 2:

      DO 20 I=2,M
        B(I) = 2*A(I)
        A(I+M) = B(I-1)
   20 CONTINUE

Conceptually one can view the array A as if it were partitioned into two (pairwise disjoint) subarrays, one for each reference. Between the two references of B there is a loop-carried flow dependence which involves only consecutive iterations. Any dependence between the references of A is between iterations i, j < m, where m is the value of M; since i + m ≠ j for all 2 ≤ i, j ≤ m, there is no loop-carried dependence between the references of A. Now we are ready to say when a loop is in normal form:

Definition 2. A given loop is in normal form if each of its loop-carried dependence involves only two consecutive iterations.

Any loop can be normalized as follows. Find all loop-carried dependence, and let the span of the original loop be the largest integer k such that some value set in iteration i depends directly on a value set in iteration i - k. Then, to normalize a loop with span k > 1, the loop's body is copied k - 1 times. The loop's index is adjusted so it starts at 1 and is incremented appropriately at each (unrolled) iteration. Loops are normalized from the most deeply nested outward.

In the code of Figure 1, if m > 4 (where m denotes the value of M) then the inner loop needs to be normalized, which can be done by unrolling it once, whereas the outer loop is already in normal form (it is vacuously so, since there are no loop-carried dependence between its references). We also assume that the temporary scalar T has been expanded, thus obtaining the program in Figure 2.

In order to compute the span it is not enough to find the distance (in the loop iteration space) of a dependence. In the following example the span is 1 although the distance in the first assignment is 2:

Example 3:

      DO 10 I=1,N
        A(I+2) = A(I) + X
        A(I+1) = A(I+2) + Y
   10 CONTINUE

Here there is only one loop-carried flow dependence, from the value set by the second statement to its use in the first statement in the following iteration. At the current stage the scope of our work includes only loops for which the span is a constant. Note that the loop can still include array references which use its bound in some cases. For example the span of the following loop is 2:

Example 4:

      DO 10 I=2,N-1
        A(N-I) = A(N-I) + A(N+2-I)
   10 CONTINUE
      DO 10 I=1,N
        T(I) = I+4
C       assume M is even
        DO 10 J=1,M-2,2
          A(I,J+2) = A(I,J+2) + T(I)*A(I,J)
          A(I,J+3) = A(I,J+3) + T(I)*A(I,J+1)
   10 CONTINUE

Fig. 2. The program of Figure 1 in normal form.
An immediate result of normalization is that loop-carried dependence patterns become uniform. This observation is, later on, used in the computation graph to explicitly reveal all the possible interaction patterns among iterations of the loop and its surrounding context. Thus, we can match patterns in the computation graph in order to enable the transformations. Finally, note that a similar notion appears in Aiken and Nicolau [1988], where the normalization of a loop is carried out until a steady state is reached.
Comments on Normalization. We cannot handle indirect array references, such as A(X(I)), without employing partial evaluation techniques. Indirect referencing is mostly used to generate access patterns which are irregular and data dependent; since such patterns are known only at run-time, this limitation is inherent to all current compiler-based approaches.

There is a relation between the span of a loop and Banerjee's [1988] loop-carried dependence distance, which defines for a given dependence the distance between the two points of the iteration space that cause the dependence. In particular, if a loop is being normalized then there is at least one dependence with distance equal to the span. In current target machines, dependence distance values are mainly used for introducing the proper synchronization delay (of some sort) and for transformations that take advantage of it. This delay must cause a wait which lasts through the duration of the execution of as many operations as may be generated by unrolling using the span. Both methods can handle (at the moment) only constant span or distances, which may not be the same for all the statements in the loop.

At first it may seem that unrolling a loop for normalization results in extraneous space utilization. However, we point out that unrolling loop bodies is done by many optimizing compilers in order to better utilize pipeline-based machines; thus space usage and other resource issues are similar for both cases.
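To illustrate the unrolling step itself, here is a minimal sketch (ours, not the authors' implementation). It assumes the span k has already been produced by a separate dependence analysis and that the loop body is given as plain statement strings over a symbolic index; re-basing the index origin, which Figure 2 also performs, is only noted in a comment.

    # Minimal sketch of span-based normalization: replicate the body k-1
    # extra times, substituting I, (I+1), ..., (I+k-1) for the index, and
    # let the loop step by k.  (Assumption for illustration: statements
    # are strings and the span comes from dependence analysis.)
    def normalize(body, index, lo, hi, span):
        if span <= 1:
            return (index, lo, hi, 1, list(body))      # already in normal form
        unrolled = []
        for offset in range(span):
            repl = index if offset == 0 else f"({index}+{offset})"
            unrolled += [stmt.replace(index, repl) for stmt in body]
        return (index, lo, hi, span, unrolled)         # i.e., DO index = lo, hi, span

    # The inner loop of Figure 1 has span 2:
    idx, lo, hi, step, stmts = normalize(
        ["A(I,J) = A(I,J) + T*A(I,J-2)"], "J", 3, "M", 2)
    # stmts now holds the two statements of Figure 2, up to the shift of
    # the index origin (J versus J+2) that a real implementation performs.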
3.2 The Computation Graph

Given a program satisfying the above assumptions, we define its computation graph (CG) in two stages: first we handle basic blocks; then we show how to represent normalized loops; and finally we show how to incorporate data filters. The CG for a basic block is a directed acyclic graph G = (V, E), defined as follows:
—There is a node v ∈ V representing each value that is defined in the block; there is also a node in V for each value that is live [Aho et al. 1986] at the entry of the block and that is used in it (initial value). Each node is uniquely identified by a tag; nodes that correspond to values set in assignments are tagged by the corresponding statement number, and for nodes that represent initial values additional (new) tags are generated.

—If the loop control variable is used in the computations within its body we assume it has a proper initial value.

—The meaning of an edge (u, v) in the graph is that the computation of u must end before the computation of v starts. There are three types of edges drawn between nodes: one is the classical def-use relationship [Aho et al. 1986] (which represents the flow of data between values), and the other two represent two kinds of data dependence (per Definition 1). We draw a def-use edge from u to v if the value u is used in the definition of v. We draw an output dependence edge from u to v if v is output dependent on u as per Definition 1. Lastly, we draw an antidependence edge from u to v if the definition of u uses a value on which v is antidependent (unless this edge is a self-loop, which is redundant since the fetch is done before the assignment).

—Each node v is labeled by the expression according to which the value is being computed. The labeling uses a numbering that is imposed on the def-use edges entering the node (i.e., the arguments). The label L is used for initial values. Array references are used verbatim as atomic variables.

Using the basic block constituting the body of the inner loop of Figure 2 gives rise to the computation graph shown in Figure 3. Since we shall be looking for potential reduction and scan operations, which apply to linear forms, we allow the functions at nodes to be ternary multiply-add combinations. Note also that the nodes denoting the initial values of the variables could be replaced when the graph is embedded in a larger context, as we shall see.
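The construction just described can be sketched in a few lines; the code below is our simplified illustration (it ignores the argument numbering on def-use edges and, as in Figure 3, draws only the flow edge when both a flow and an output dependence exist).

    # Simplified sketch of the basic-block CG.  A statement is given as
    # (assigned location, operator, argument locations); every assignment
    # makes a fresh value node, and a location read before being defined
    # gets an "initial value" node labeled L.
    def basic_block_cg(stmts):
        nodes, edges = [], []           # nodes: (tag, label); edges: (src, dst, kind)
        current = {}                    # location -> tag of its latest value node

        def value_of(loc):
            if loc not in current:
                current[loc] = f"init:{loc}"
                nodes.append((current[loc], "L"))
            return current[loc]

        for num, (loc, op, args) in enumerate(stmts, 1):
            arg_tags = [value_of(a) for a in args]      # fetches happen first
            old, tag = current.get(loc), f"S{num}"
            nodes.append((tag, op))
            edges += [(a, tag, "def-use") for a in arg_tags]
            if old is not None:
                if old not in arg_tags:                  # otherwise the flow edge wins
                    edges.append((old, tag, "output"))
                for (src, dst, kind) in list(edges):     # readers of the old value
                    if kind == "def-use" and src == old and dst != tag:
                        edges.append((dst, tag, "anti"))
            current[loc] = tag
        return nodes, edges

    # Body of the inner loop of Figure 2 (ternary multiply-add nodes):
    body = [("A(I,J+2)", "madd", ["A(I,J+2)", "T(I)", "A(I,J)"]),
            ("A(I,J+3)", "madd", ["A(I,J+3)", "T(I)", "A(I,J+1)"])]
    nodes, edges = basic_block_cg(body)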
Next we define computation graphs for normalized loops. Such graphs are obtained by replicating the DAGs representing the enclosed body² as follows:

—Three copies of the graph representing the body are generated: one copy represents the initial iteration; one is a typical middle iteration; and one stands for the final iteration. Each node is uniquely identified by extending the tag of the replicated node with the unrolled copy it belongs to (initial, middle, or final).

—Loop-carried data dependence edges (cross iteration edges) are drawn between the different copies.

—Each instance of the loop variable (like in expressions of array references) is replaced with init, mid, and fin depending on its appearance in the initial, middle, or final copy, respectively.

—Two nodes in different copies that represent a value in the same memory location (like A(init+2) and A(mid) when the stride of the loop is 2) are coalesced, unless there exists a flow or an output dependence edge between them (in which case there are possibly two different values; see also the note below on antidependences).

—The graph is linked to its surroundings by data dependence edges where necessary (i.e., from initializations outside the loop and to subsequent uses).

² At the innermost level these are basic blocks, but as we go up the structure these are general computation graphs.

The graph is constructed on a loop-nest by loop-nest basis, i.e., there is no need to build it for the whole program. In fact, one might decide to limit the extent of the graph and its application in order to reduce compile time or curb the potential space complexity.
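A rough sketch of the replication step (ours; coalescing of nodes that denote the same memory location is left out for brevity):

    # Replicate a body CG into initial/middle/final copies and add cross
    # iteration edges for each loop-carried dependence, given as (source
    # tag, target tag) pairs between consecutive iterations.
    def replicate(nodes, edges, carried):
        copies = ("init", "mid", "fin")
        N = [(f"{tag}@{c}", label) for c in copies for (tag, label) in nodes]
        E = [(f"{u}@{c}", f"{v}@{c}", kind) for c in copies for (u, v, kind) in edges]
        for earlier, later in (("init", "mid"), ("mid", "fin")):
            E += [(f"{u}@{earlier}", f"{v}@{later}", "cross") for (u, v) in carried]
        return N, E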
Note. If a node v is a target of an antidependence edge then v is also a target of an output dependence edge. Consider the existence of an antidependence edge from u to v, i.e., the value that was stored in the same memory location as v and that was used in the definition of u is now being destroyed. This means that there is a node w which constitutes the old value that was used in the definition of u; thus there is an output dependence edge between w and v in addition to the def-use edge from w to u. In general, if the meaning of an output dependence edge, say (u, v), is that all the uses of u—on def-use edges leaving u—must be executed before v, then the antidependence edges are redundant.

Figure 4 shows the computation graph of the inner loop (J) of the example in Figure 2. The graph comprises three copies of the graph from Figure 3. Node S4 of the middle copy is coalesced with node S1 of the initial copy since they both represent the same value (the stride of J is 2; thus, init + 2 = mid) and there is no output dependence between these nodes. Similarly, node S5 of the middle copy is coalesced with node S2 of the initial copy. The same pattern is repeated between the final and the middle copies. Node S0 represents the value of T(I), which is the same value for all three copies. The graph for the whole program (including the outer I loop) would consist of three copies of what is shown in Figure 4, with appropriate annotation of the nodes. A more general example that includes anti, output, and loop-carried flow dependence is presented in Figure 5.

The main contribution of the triple unfolding of loops is that when combined with normalization the computation graph represents explicitly all possible dependence. The computation graph spans a "signature" of the whole loop structure, and we call it the summary form of the loop. Once inner loops are transformed into their summary form, they are treated as part of the body of the enclosing loops and may—in turn—be replicated similarly to yield another summary form. The following lemma summarizes this observation, and it will serve as the foundation for the applicability of the transformations described in the next section.

LEMMA 1. Three iterations of a normalized loop capture its entire data dependence structure.

PROOF. Recall that all loop-carried dependence of a normalized loop are only between consecutive iterations. Thus, having three nodes to represent a value in a normalized loop assures the property that the dependence involving the second copy are the same for all the middle iterations, i.e., those that are not the first or the last one. ❑

Note that when embedding a loop in a program fragment we need to cover the cases in which the loop's body is executed fewer than three times. Since we are interested in parallelizing loops that must be iterated more than twice we omit the description of such an embedding.
      DO I = 1,N
S1:     B(I) = A(I)
S2:     D(I) = D(I-1)
S3:     C(I) = B(I-1) * 5 + C(I+1)
      ENDDO

Fig. 5. A program and its CG. Dashed and dotted arrows are anti- and output dependences, respectively; the vertical dashed lines separate the three copies.
The computation graph can also represent some types of conditional constructs like data filters.

Definition 3. A data filter is a conditional expression—none of whose constituent values is involved in any loop-carried dependence—controlling an assignment.

A data filter is represented in the graph by a special filter node, denoted by a double circle. The immediate predecessors of this node are the arguments of the Boolean expression constituting the predicate of the conditional, the previous definition of the variable to which the assignment is being made, and the values involved in the expression on the right-hand side of the assignment. The outgoing edge from the filter node goes to the node corresponding to the value whose assignment is being controlled. The filter node is labeled by a conditional expression whose then value is the right-hand side of the original assignment statement, and the else value is the previous (old) value of the variable to which the assignment is being made.
      S = 0.0
      DO 20 I=1,N
   20 IF (A(I) .NE. 0) S = S + B(I)

Fig. 6. A program with a data filter and the computation graph of its body.
This expression is written in terms of the incoming edges, as Figure 6 demonstrates: in statement 20, the assignment to S is controlled by a predicate on the value of A(I); hence the graph of the body contains a filter node. Figure 7 shows the computation graph of the program of Figure 6.

Fig. 7. The computation graph of the program of Figure 6.

Finally, note that when the CG represents a program without filter constructs, then due to normalization the only cross iteration edges are from the initial copy to the middle copy and from the middle copy to the final copy; there cannot be edges from the middle copy to the initial copy nor from the final copy to the middle copy, and there are no cross iteration edges between the initial and the final copies. The following lemma summarizes the main structural properties of the CG; these properties are used both in designing the transformation patterns and in assuring the correctness of their application.

LEMMA 2. A CG representing a normalized loop of a sequential program is acyclic, as is the computation graph of its loop's body.

PROOF. Computation graphs for basic blocks are acyclic by construction. This is obvious, and it is not hard to show that the property holds for loops as well as for data filters. ❑
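Operationally, the filter node of Figure 6 is just a conditional selection between the new right-hand side and the old value of the assigned variable; a small illustration (ours) of that semantics:

    # The filter node of Figure 6 behaves like a conditional expression:
    # the "then" value is S + B(I), the "else" value is the old S.
    def filtered_sum(a, b):
        s = 0.0
        for ai, bi in zip(a, b):
            s = s + bi if ai != 0 else s
        return s

    assert filtered_sum([1, 0, 2], [10.0, 20.0, 30.0]) == 40.0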
3.3 Algorithms

Having constructed the computation graph, the task of finding computational idioms in the program amounts to recognizing certain patterns in the graph. These patterns comprise graph structures such as paths or other particular subgraphs, depending on the idiom, and some additional information pertaining to the labeling of the nodes. The idiom recognition algorithm constitutes both a technique for identifying the subgraph patterns in the given graph as well as the conditions for when they apply, i.e., checking whether the context in which they are found is one where the idiom can indeed be used.

Overall, the optimization procedure consists of the following algorithmic ingredients:
—Matching and replacement of individual patterns is achieved by graph rewriting; we also use graph grammars to describe the rewrite rules. While rewriting, we transform the labels (including the operators, of course), thus generating the target idioms. We make sure no side effects are lost by denoting forbidden entries and exits per Rosendahl and Mankwald [1979].

—We need to provide a list of idioms and the graph-rewriting rules that replace them. These include basic structures such as reduction, scan, recurrence equations, and FFT butterflies. Compositions thereof, such as inner product, convolution, transposition, reflection, and other permutations, can be generated as a preprocessing stage. Notice that data filters can be part of a rule.
—At the top level, we (repeatedly) match patterns from the given list according to a predetermined application schedule until no more changes are applicable (or some other termination condition is met). This means that we need to establish a precedence relation among rules that will govern the order in which they are applied. This greedy tactic, which is similar to the conventional application of optimization transformations [Aho et al. 1986], is just one alternative. One could assign costs that reflect the merits of transformations and find a minimum cost cover of the whole graph at each stage, and then iterate.

We next elaborate on each of the above items, filling in the necessary details. First we discuss the graph-rewriting rules. Since we are trying to summarize information, these rules will be mostly reduction rules, i.e., shrinking subgraphs into smaller ones. More importantly, there are three characteristics that must be matched besides the skeletal graph structure: the operators (functions), the array references, and the context, i.e., the relationship to the enclosing structure. The first two items can be handled by looking at the labels of the vertices. The third involves finding a particular subgraph that can be replaced by a new structure and making sure it does not interfere with the rest of the computation. For example, to identify a reduction operation on an array, we need to find a path of nodes all having the same associative-operator label (e.g., multiplication, addition, or an appropriate linear combination thereof) and using consecutive (i.e., initial, middle, and final) entries in the array to update the same summary variable; we also need to ascertain that no intervening computation is going on (writing to or reading from the summary variable). Both the annotations of the nodes as well as the guards against intervening computations are part of the graph grammar productions defining the replacement rules.

Fig. 8. Matching and replacement rule for reduction.

Figures 8 and 9 provide two such rules to make this clear. Here we assume that vectorization transformations have occurred, similarly with scalar expansion, including their appropriate rules for expanding data filters, so we can rely on the vectorization results when looking for patterns; these transformations can be formulated using a different notion of rule base (they can be applied first by a different tool or can be part of our rules themselves [Pinter and Mizrachi 1992]). To accommodate data filters, additional rewrite rules are necessary. For example, the rule of Figure 10 takes care of a filtered reduction. Notice the similarity to the rule of Figure 8, if we splice out the filter nodes and redirect the edges; the scan rule of Figure 9 can be similarly used to derive a rule for a filtered scan.
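As an indication of what such a rule checks, the following sketch (ours, heavily simplified relative to a real graph grammar) tests whether a chain of three replicated nodes forms a reduction: all three carry the same associative operator, each step consumes exactly the previous partial result plus one array entry, and the intermediate partial results have no other uses.

    # Simplified reduction check on a replicated CG (our illustration, not
    # the article's graph-grammar machinery).
    def is_reduction(nodes, edges, chain, assoc_ops=("+", "*")):
        label = dict(nodes)
        preds, succs = {}, {}
        for (u, v, kind) in edges:
            if kind == "def-use":
                preds.setdefault(v, []).append(u)
                succs.setdefault(u, []).append(v)
        op = label.get(chain[0])
        if op not in assoc_ops or any(label.get(n) != op for n in chain):
            return False
        for prev, node in zip(chain, chain[1:]):
            ins = preds.get(node, [])
            if prev not in ins or len(ins) != 2:         # previous sum + one array entry
                return False
            if any(s != node for s in succs.get(prev, [])):
                return False                             # intervening use of the sum
        return True

    # Toy instance of the inner-product body after vectorization (T = X*Y):
    nodes = [("P@init", "+"), ("P@mid", "+"), ("P@fin", "+"),
             ("T(init)", "L"), ("T(mid)", "L"), ("T(fin)", "L"), ("P0", "L")]
    edges = [("P0", "P@init", "def-use"), ("T(init)", "P@init", "def-use"),
             ("P@init", "P@mid", "def-use"), ("T(mid)", "P@mid", "def-use"),
             ("P@mid", "P@fin", "def-use"), ("T(fin)", "P@fin", "def-use")]
    assert is_reduction(nodes, edges, ["P@init", "P@mid", "P@fin"])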
Once such rules are applied, the computation graph contains new types of nodes, namely, summary nodes representing idioms. These nodes can, of course, appear themselves as candidates for replacement (on the left-hand side of a rule), thereby enabling further optimization. Obviously, we would apply the rules for lower-level optimizations first and only then use the others, but one should not get the impression that this procedure necessarily follows the loop structure of the given program. On the contrary, the computation graph as constructed allows the detection of structures that might otherwise be obscured, as can be seen in Section 4.
Fig. 9. Matching and replacement rule for scan.

Fig. 10. A matching and replacement rule for a filtered reduction.

The result of applying the rule of Figure 9 to the graph of Figure 4 is shown in Figure 11. If we had started with a graph for the entire program of Figure 2, not just the inner loop (as represented in Figure 4), then applying a vectorization rule to the result would have generated the following program:

      DOALL 10 I=1, N
        T(I) = I+4
        SCAN(A(I,*), 1, M-1, 2, T(I), "*", "+")
        SCAN(A(I,*), 2, M, 2, T(I), "*", "+")
   10 CONTINUE
Fig. 11. The computation graph resulting from applying the transformations of Figure 9 to the graph in Figure 4.

We use an arbitrary template for SCAN which includes all the necessary information, including array bounds, strides inside vectors or minors, and the operations to be performed in reverse Polish notation. In the case of Figures 6 and 7, the resulting program would be

      T(1:N) = 0.0
      WHERE (A(1:N) .NE. 0) T(1:N) = B(1:N)
      S = REDUCE(T, 1, N, 1, "+")

where T is a newly defined array. (This somewhat awkward phrasing of the resultant program is due to the way most parallel dialects, such as Fortran 90 and others [Guzzi et al. 1990], are defined.)

In general, the order in which rules are applied and the termination condition depend on the rule set. If the rules are Church-Rosser [Church and Rosser 1936] then this is immaterial, but often they are competing (in the sense that the application of one would rule out the consequent application of the other) or are contradictory (creating the potential for oscillation). This issue is beyond the scope of this article, and we defer its discussion to general studies on properties of rewriting systems.
4. EXAMPLES

To exemplify the advantages of our idiom recovery method, we present here three cases in which other methods cannot speed the computation up but ours can (if combined properly with standard methods). We do not include the complete derivation in each case, but we outline the major steps and point out the key items.

The first example is one of the loops used in a Romberg integration routine. The original code (as released by the Computer Sciences Corporation of Hampton, VA) looks as follows:

Example 5:

      DO 40 N=2, M+1
        KN2 = K+N-2
        KNM2 = KN2 + M
        KNM1 = KNM2 + 1
        F = 1.0/(4.0**(N-1)-1.0)
        TEMP1 = WK(KNM2) - WK(KN2)
        WK(KNM1) = WK(KNM2) + F*TEMP1
   40 CONTINUE
After substitution of temporaries, scalar expansion of F, and renaming of loop bounds, which are all straightforward transformations, we obtain

      DO 40 I=K+M, K+2*M-1
        F(I) = 1.0/(4.0**(I-K-M+1)-1.0)
        WK(I+1) = WK(I) + F(I)*(WK(I)-WK(I-M))
   40 CONTINUE

Notice that no normalization is necessary in this case since from data dependence analysis we obtain that the only loop-carried data dependence between references to WK is of distance 1, which implies a span of 1. Furthermore, observe that the computation of F(I) can be performed in a separate loop, as can be deduced using the well-known technique of "loop distribution." The loop computing F can be easily vectorized, so all we are left with is a loop containing a single statement which computes WK(I+1). If we draw the computation graph for this loop and allow operators to be linear combinations with a coefficient (F(I) in this case), we can immediately recognize the scan operation that is taking place and thus produce the following:

      WK1(K+M:K+2*M-1) = WK(K:K+M-1)
      SCAN(WK, K+M, K+2*M-1, 1, WK1, "-", "*", "+")
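The scan being recognized here composes affine maps: each iteration applies x -> (1 + F(I))*x - F(I)*WK(I-M) to the running value, and composition of affine maps is associative, which is exactly what a parallel SCAN exploits. The following check (ours, with made-up data; the SCAN template above is the article's own) confirms that the composed-map view reproduces the sequential loop.

    # Each iteration of the loop above is the affine map
    #     x -> (1 + F(I))*x - F(I)*WK(I-M),
    # applied to the running value WK(I); composing such maps is
    # associative, so the prefix values can be computed as a scan.
    def compose(g, f):                      # apply f first, then g
        (a1, b1), (a2, b2) = f, g
        return (a2 * a1, a2 * b1 + b2)

    def scan_affine(maps, x0):
        out, acc = [], (1.0, 0.0)           # identity map
        for m in maps:
            acc = compose(m, acc)           # prefix composition (sequential here)
            out.append(acc[0] * x0 + acc[1])
        return out

    # Made-up data standing in for WK and F, with M = 3 and K = 0:
    M, K = 3, 0
    WK = [float(i + 1) for i in range(2 * M + 2)]
    F = {i: 1.0 / (4.0 ** (i - K - M + 1) - 1.0) for i in range(K + M, K + 2 * M)}

    maps = [(1.0 + F[i], -F[i] * WK[i - M]) for i in range(K + M, K + 2 * M)]
    via_scan = scan_affine(maps, WK[K + M])

    for i in range(K + M, K + 2 * M):       # the sequential loop, for reference
        WK[i + 1] = WK[i] + F[i] * (WK[i] - WK[i - M])

    assert all(abs(WK[K + M + 1 + j] - via_scan[j]) < 1e-9 for j in range(M))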
The second example is the program appearing in the introduction (taken from Allen et al. [1987], who gave up on trying to parallelize it). Notice that the loop is already in normal form. Once the loop is unfolded three times, as required, we detect two independent computation chains in the computation graph, one computing the even³ entries of C and the other computing the odd entries. If the data are arranged properly, computing the whole recurrence can be performed, using the method of Kogge and Stone [1973], in time O(log n) (n being the value of N) rather than O(n). The computation graph can be seen in Figure 12. Notice that it is completely different from the PDG, which consists of two nodes on a directed cycle.

³ We use "even" and "odd" here just to distinguish between the chains; the recognition procedure does not need to know or prove this fact at all.
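To see the two chains concretely, here is a small check (ours, on made-up data): after substituting the definition of B, the loop of Example 1 becomes C(I) = A(I) + A(I-1)*C(I-2), so the even- and odd-indexed values of C are two independent first-order recurrences, each amenable to the logarithmic-time scheme.

    # Made-up data; arrays are treated as 1-indexed, index 0 unused.
    N = 9
    A = [0.0] + [float(i) for i in range(1, N + 1)]
    B = [0.0] + [2.0 * i for i in range(1, N + 1)]
    C = [0.0] * (N + 1)

    Bs, Cs = B[:], C[:]                     # the sequential loop, for reference
    for i in range(2, N):                   # DO 100 I = 2, N-1
        Cs[i] = A[i] + Bs[i]
        Bs[i + 1] = Cs[i - 1] * A[i]

    # Chain form: C(I) = A(I) + A(I-1)*C(I-2) for I >= 4, so the even and
    # odd positions of C never interact and can be computed separately.
    Cc = C[:]
    Cc[2] = A[2] + B[2]
    Cc[3] = A[3] + C[1] * A[2]              # uses B(3) = C(1)*A(2) from iteration 2
    for i in range(4, N):
        Cc[i] = A[i] + A[i - 1] * Cc[i - 2]

    assert Cc[2:N] == Cs[2:N]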
Finally, we parallelize a program computing the inner product of two vectors, which is a commonly occurring computation:

      P = 0.
      DO 100 I=1, N
        P = P + X(I)*Y(I)
  100 CONTINUE

Traditional analysis (in preparation for vectorization) replaces the loop's body by

      T(I) = X(I)*Y(I)
      P = P + T(I)

Fig. 12. The computation graph of Example 1.

Fig. 13. Transforming an inner product to a reduction.

Figure 13 shows the computation graphs of this transformed basic block and that of the relevant part of the loop. The vectorizable portion of the loop can be transformed into a statement of the form T = X*Y, and what is left can be matched by the left-hand side of the rule of Figure 8. Applying the rule produces the graph corresponding to the statement P = REDUCE(T, 1, N, 1, "+").

The recognition of an inner product using our techniques will not be disturbed by enclosing contexts such as a matrix multiplication program. The framed part of the program in Figure 14 produces the same reduced graph that appears replicated after treating the two outermost loops.
      DO 100 I=1,N
        DO 100 J=1,N
          C(I,J) = 0
          DO 100 K=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
  100 CONTINUE

Fig. 14. The portion of a matrix multiplication program that is recognized by the transformation of Figure 12.
5. DISCUSSION

We have proposed a new program analysis method for purposes of optimization and parallelization. It extracts more information from the source programs, thereby adding to existing methods the ability to recognize idioms more globally. When comparing our method to the other most common methods we can say the following:

—Constructing and then using the program dependence graph (PDG) or a similar structure focuses on following dependence as they are reflected at the syntactic level alone. Thus, the PDG is more conservative than our computation graph (CG), which contains more information and exposes further potential for parallelization.

—There was partial success in recovering reduction operators (using PDG-like analysis) in the work on Parafrase [Lee et al. 1985; Polychronopoulos et al. 1986]. The techniques employed, however, are somewhat ad hoc compared to ours.

—The places where normalization is carried out are, in general, not applicable for known parallelizing transformations (thereby providing a proper extension to automatic restructuring theory).

—Capturing the data dependence between array references by means of dependence vectors [Chen 1986; Karp et al. 1967; Wolfe 1989] and then solving a system of linear equations to extract the wavefronts of the computation are limited to functional (side effect free) sets of expressions.

—Symbolic and partial evaluation methods [Jouvelot and Dehbonei 1989; Letovsky 1988; Rich and Waters 1988] all follow the structure of the program rigorously, and even when the program is transformed into some normal form these methods are still highly sensitive to "noise" in the source program. Their major drawback is the lack of specific semantics for array references, limiting their suitability for purposes of finding reduction operations on arrays.

Our primary objective is to cater for machines with effective support of data parallelism. Our techniques, however, are not predicated on any particular hardware, but can rather be targeted to an abstract architecture or to language constructs that reflect such features. Examples of such higher-level formalisms are APL, Crystal, C*, and *lisp (which all support reductions and scans), other functional formalisms, the BLAS package (which has efficient implementations on a variety of machines), and the vector-matrix primitives suggested in Agrawal et al. [1989].

In addition to the constructs mentioned above, many other idioms can be expressed as computation graph patterns and be identified [Pinter and Mizrachi 1992]; such constructs can be used to eliminate communication and synchronization steps or to work on algorithmic computations (such as fixed-point computations) in which repeated application of the rules is also necessary. Finally, we are currently implementing the techniques mentioned here as part of an existing parallelization system [Lempel et al. 1992] and plan to obtain experimental results that would indicate how prevalent each of the transformations is.

All in all, we feel that this article makes both methodological and algorithmic contributions that should be further investigated.
ACKNOWLEDGMENTS
The authors would like to thank Ron Cytron, Ilan Efrat, and David Bernstein for helpful discussions, as well as Jeanne Ferrante and Liron Mizrachi for insightful comments on an earlier draft of this article. We would also like to thank the anonymous referees for their reviews.
REFERENCES
AGRAWAL, A., BLELLOCH, G. E., KRAWITZ, R. L., AND PHILLIPS, C. A. 1989. Four vector-matrix primitives. In the Symposium on Parallel Algorithms and Architectures. ACM, New York, 292-302.
AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers—Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass.
AIKEN, A. AND NICOLAU, A. 1988. Optimal loop parallelization. In the SIGPLAN'88 Conference on Programming Language Design and Implementation. ACM, New York, 308-317.
ALLEN, R. AND KENNEDY, K. 1987. Automatic translation of FORTRAN programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.
ALLEN, F. E., BURKE, M., CHARLES, P., CYTRON, R., AND FERRANTE, J. 1988. An overview of the PTRAN analysis system for multiprocessing. J. Parall. Distrib. Comput. 5, 5 (Oct.), 617-640.
ALLEN, R., CALLAHAN, D., AND KENNEDY, K. 1987. Automatic decomposition of scientific programs for parallel execution. In the 14th Annual Symposium on the Principles of Programming Languages. ACM, New York, 63-76.
BANERJEE, U. 1988. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, New York.
BLELLOCH, G. E. 1989. Scans as primitive parallel operations. IEEE Trans. Comput. C-38, 11 (Nov.), 1526-1539.
BRODE, M. 1981. Precompilation of FORTRAN programs to facilitate array processing. Computer 14, 9 (Sept.), 46-51.
CALLAHAN, D. 1991. Recognizing and parallelizing bounded recurrences. In the 4th Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 589. Springer-Verlag, New York, 169-185.
CHEN, M. C. 1986. A parallel language and its compilation to multiprocessor machines or VLSI. In the 13th Annual Symposium on the Principles of Programming Languages. ACM, New York, 131-139.
CHURCH, A. AND ROSSER, J. B. 1936. Some properties of conversion. Trans. Am. Math. Soc. 39, 472-482.
CYTRON, R., FERRANTE, J., ROSEN, B. K., WEGMAN, M. N., AND ZADECK, F. K. 1989. An efficient method of computing static single assignment form. In the 16th Annual Symposium on the Principles of Programming Languages. ACM, New York, 25-35.
DONGARRA, J. J., CROZ, J. D., HAMMARLING, S., AND HANSON, R. J. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft. 14, 1 (Mar.), 1-17.
FERRANTE, J., OTTENSTEIN, K. J., AND WARREN, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July), 319-349.
GUZZI, M. D., PADUA, D. A., HOEFLINGER, J. P., AND LAWRIE, D. H. 1990. Cedar FORTRAN and other vector and parallel FORTRAN dialects. J. Supercomput. 4, 1 (Mar.), 37-62.
HUSON, C., MACKE, T., DAVIS, J. R., WOLFE, M. J., AND LEASURE, B. 1986. The KAP/205: An advanced source-to-source vectorizer for the Cyber 205 supercomputer. In the International Conference on Parallel Processing. CRC Press, Inc., 827-832.
JOUVELOT, P. AND DEHBONEI, B. 1989. A unified semantic approach for the vectorization and parallelization of generalized reductions. In the International Conference on Supercomputing. ACM, New York, 186-194.
KARP, R. M., MILLER, R. E., AND WINOGRAD, S. 1967. The organization of computations for uniform recurrence equations. J. ACM 14, 3 (July), 563-590.
KOGGE, P. M. AND STONE, H. S. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22, 8 (Aug.), 786-793.
KUCK, D., KUHN, R. H., PADUA, D. A., LEASURE, B., AND WOLFE, M. 1981. Dependence graphs and compiler optimizations. In the 8th Annual ACM Symposium on the Principles of Programming Languages. ACM, New York, 207-218.
LADNER, R. E. AND FISCHER, M. J. 1980. Parallel prefix computation. J. ACM 27, 4 (Oct.), 831-838.
LAMPORT, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (Feb.), 83-93.
LAWSON, C. L., HANSON, R. J., KINCAID, D. R., AND KROGH, F. T. 1979. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Soft. 5, 3 (Sept.), 308-323.
LEE, G., KRUSKAL, C. P., AND KUCK, D. J. 1985. An empirical study of automatic restructuring of nonnumerical programs for parallel processors. IEEE Trans. Comput. C-34, 10 (Oct.), 927-933.
LEMPEL, O., PINTER, S. S., AND TURIEL, E. 1992. Parallelizing a C dialect for distributed memory MIMD machines. In the 5th Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 757. Springer-Verlag, New York, 369-390.
LETOVSKY, S. I. 1988. Plan analysis of programs. Ph.D. thesis, YALEU/CSD/TR-662, Dept. of Computer Science, Yale Univ., New Haven, Conn.
MUNSHI, A. AND SIMONS, B. 1987. Scheduling sequential loops on parallel processors. Tech. Rep. RJ 5546, IBM Almaden Research Center, San Jose, Calif.
PACIFIC-SIERRA RESEARCH. 1990. VAST for RS/6000. Pacific-Sierra Research, Los Angeles, Calif.
PERLIS, A. J. AND RUGABER, S. 1979. Programming with idioms in APL. In APL '79. ACM, New York, 232-235. Also, APL Quote Quad, vol. 9, no. 4.
PINTER, S. S. AND MIZRACHI, L. 1992. Using the computation dependence graph for compile-time optimization and parallelization. In Proceedings of the Workshop on Advanced Compilation Techniques for Novel Architectures. Springer-Verlag, New York.
POLYCHRONOPOULOS, C. D., KUCK, D. J., AND PADUA, D. A. 1986. Execution of parallel loops on parallel processor systems. In the International Conference on Parallel Processing. IEEE, New York, 519-527.
RICH, C. AND WATERS, R. C. 1988. The programmer's apprentice: A research overview. Computer 21, 11 (Nov.), 10-25.
ROSENDAHL, M. AND MANKWALD, K. P. 1979. Analysis of programs by reduction of their structure. In Graph-Grammars and Their Applications to Computer Science and Biology. Lecture Notes in Computer Science, vol. 73. Springer-Verlag, New York, 409-417.
THINKING MACHINES CORPORATION. 1987. Connection Machine Model CM-2 technical summary. Tech. Rep. HA87-4. Thinking Machines Corp., Cambridge, Mass.
WOLFE, M. J. 1989. Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, Mass.

Received May 1990; revised October 1993; accepted October 1993