Parallelism in Sequential Functional Languages

Guy Blelloch (blelloch@cs.cmu.edu)
John Greiner (jdg@cs.cmu.edu)

School of Computer Science
Carnegie Mellon University
Abstract

This paper formally studies the question of how much parallelism is available in ML-like (call-by-value) functional languages with no parallel extensions (i.e., the pure functional subsets of ML or Scheme). In particular we are interested in placing bounds on how much parallelism is available for various problems. To capture this parallelism we introduce a complexity model, the PAL, based on the call-by-value lambda-calculus. The model is defined in terms of a profiling semantics that measures two quantities for each computation: the total work and the parallel depth. We describe simulations of the A-PAL (the PAL extended with arithmetic constants) on various parallel machines, including the PRAM, the butterfly, and the hypercube, and prove bounds on these simulations. In particular, the simulations are work-efficient (the time-processor product is within a constant factor of the work on the A-PAL), and the slowdown in depth on p processors is at most logarithmic in p. Based on the model we prove bounds for several algorithms: for example, we describe versions of quicksort and mergesort that run in O(n log n) work (expected, in the case of quicksort) and O(log^2 n) depth on inputs of size n.
1 Introduction

Purely functional languages have an inherent advantage for parallel implementation: since the languages lack side effects, the arguments to a function — and, more generally, independent subexpressions — can safely be evaluated in parallel. Many researchers have argued that this is an important advantage of functional languages [20, 30], and it has motivated much of the research on parallel implementation techniques for them, including data-flow languages and parallel graph reduction [14, 28]. Because the parallelism is inherent in the language, it is not necessary to add explicit parallel constructs to get parallel execution, although various researchers have suggested adding such constructs [38, 39].

There has been little formal study, however, of how much parallelism can actually be achieved in functional languages for various problems, or of how this parallelism relates to standard parallel complexity models.
Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
FPCA '95 La Jolla, CA USA
(c) 1995 ACM 0-89791-7/95/0006...$3.50
This raises several questions. Does evaluating subexpressions (or the arguments of an application e1 e2) in parallel offer adequate parallelism for various problems, or are more explicit parallel constructs needed? Does it matter whether the language is strict (call-by-value), such as ML, or lazy, such as Haskell? How would the asymptotic complexity of algorithms written in a functional language relate to standard parallel complexity models such as the PRAM? Before these questions can be addressed, we first need a formal model that specifies the complexity of a computation in terms of the language itself rather than in terms of a particular compiler or implementation.

This paper makes the following contributions:

1. We introduce a complexity model, the PAL, based on an operational (profiling) semantics of the call-by-value lambda-calculus. For each evaluation of a program the semantics returns two complexity measures: the total work — the total number of operations executed — and the depth — the longest chain of sequential dependences, i.e., the time to evaluate the program assuming an unbounded number of processors and no overheads.
2. We prove bounds on simulating the A-PAL (the PAL extended with a set of arithmetic constants) on various parallel machine models, including the exclusive-write (CREW) and concurrent-read, concurrent-write (CRCW) PRAM, the butterfly, and the hypercube. The simulations are work-efficient — the processor-time product on p processors is within a constant factor of the total work on the A-PAL — and the slowdown in depth is at most a logarithmic factor. The results are summarized in Figure 1. The proofs are based on a machine model, the P-ECD, a parallel version of Landin's SECD machine [25] in which a state consists of a set of substates that are evaluated concurrently, together with a scheduling argument in the style of Brent [9].

3. For various problems, including sorting and merging, we analyze algorithms written in the A-PAL and compare the resulting bounds to those in the parallel-algorithms literature. For example, we describe versions of quicksort and mergesort that run in O(n log n) work (expected, in the case of quicksort) and O(log^2 n) depth. Both sorts are optimal in work and within a polylogarithmic factor of the best known depth bounds. The algorithms assume the input sequences of size n are stored as balanced trees rather than as lists, since traversing a list would require O(n) depth; this accentuates the importance of the choice of data structures when analyzing algorithms in functional languages.
    Machine Model          Time
    CREW PRAM              O(w/p + d log p)
    CRCW PRAM              O(w/p + d log log p)
    CRCW PRAM              O(w/p + d log* p)    (randomized)
    Butterfly              O(w/p + d log p)     (randomized)
    Hypercube              O(w/p + d log p)     (randomized)

Figure 1: The mapping of the proposed model (the A-PAL) onto various machine models, for a computation of work w and depth d. The number of processors of the machine is p. For the randomized simulations the specified time bounds hold with very high probability.
We chose to base the model on call-by-value evaluation rather than on call-by-need (lazy) evaluation or on fully speculative evaluation, which we refer to as call-by-speculation.(1) The problem with basing a complexity model on lazy languages is that the available parallelism depends critically on strictness analysis or on heuristics for speculation: to formally track the parallelism, the model would have to track what the analysis or heuristics can discover, and a speculative implementation must additionally keep track of which computations are still needed and garbage collect those that are not. In these models it is unclear how to place bounds that do not depend on the details of a particular implementation. An interesting line of future work would be to extend our model to include some version of speculation.

(1) We use the term to mean a fully speculative implementation, in which subexpressions are evaluated in parallel whether or not they are needed.
The remainder of this paper is organized as follows. Section 2 describes the PAL and A-PAL models and their profiling semantics. Sections 3 and 4 relate the model to various machine models and prove the simulation bounds. Section 5 gives examples of analyzing algorithms in the model, and Section 6 discusses related work.

2 The PAL Model

The PAL is based on the untyped call-by-value lambda-calculus. Expressions have the syntax

    e ::= c | x | \x.e | e1 e2

where the meta-variable c ranges over constants and x over variables. The profiling semantics of the PAL is an operational semantics augmented with two cost measures; it is defined by the rules in Figure 2. The judgment

    E |- e => v; w, d

reads: "In environment E, evaluating the expression e results in the value v with work w and depth d."
The work of a computation is the total number of operations executed — the time to evaluate the program on one processor — and the depth is the longest chain of sequential dependences — the time to evaluate it with an unbounded number of processors. The two subexpressions of an application e1 e2 can always be evaluated in parallel, since in the absence of side effects neither can affect the other. The work of evaluating e1 e2 is therefore the sum of the work of evaluating e1, the work of evaluating e2, and the work of applying the result of e1 to the result of e2, plus a constant; the depth is the maximum of the depths of evaluating e1 and e2, plus the depth of the application, plus a constant.

We chose the call-by-value (applicative-order) lambda-calculus as the basis of the model since its semantics is much simpler and cleaner than that of call-by-need, normal-order (call-by-name) evaluation, or call-by-speculation, and since it is the standard evaluation order of languages such as ML and Scheme. An applicative-order program may do more work than a normal-order one, but for the kinds of algorithms we consider — sorting, merging, and similar problems — this makes no significant asymptotic difference, and there is plenty of parallelism without any speculation. Much of what we describe would also apply to lenient (eager, non-strict) languages [23], whose implementations borrow ideas from data-flow evaluation.
The rules use the following conventions. An environment E is a set of bindings of values to variable names: E(x) denotes the value associated with the variable x in E, and E[x -> v] denotes E extended with a binding of the value v to x (shadowing any old binding of x). Evaluation of a program starts with an empty environment. Values are either constants or closures cl(E, x, e), which pair a lambda-expression \x.e with the environment E in which it was defined. Finally, we assume that the number of distinct variable names in a program is a constant: it is fixed by the program text and is therefore independent of the size of the input. This assumption, which is used to bound the cost of environment operations, will be discussed further in Section 3.
    E |- c => c; 1, 1                                        (CONST)

    E |- \x.e => cl(E, x, e); 1, 1                           (LAM)

    E(x) = v
    -----------------                                        (VAR)
    E |- x => v; 1, 1

    E |- e1 => cl(E', x, e'); w1, d1
    E |- e2 => v2; w2, d2
    E'[x -> v2] |- e' => v; w3, d3
    ------------------------------------------------------   (APP)
    E |- e1 e2 => v; w1+w2+w3+2, max(d1, d2)+d3+2

    E |- e1 => c; w1, d1
    E |- e2 => v2; w2, d2
    delta(c, v2) = v
    ------------------------------------------------------   (APPC)
    E |- e1 e2 => v; w1+w2+2, max(d1, d2)+2

Figure 2: The profiling semantics of the PAL.
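The rules of Figure 2 can be read off directly as an interpreter. The following is a minimal sketch of ours (not from the paper) of the PAL profiling semantics: eval_pal returns a triple (value, work, depth) following the (CONST), (VAR), (LAM), and (APP) rules, with closures and expressions represented as tuples. Constant application ((APPC) and delta) is omitted here.

```python
# Sketch of the PAL profiling semantics (Figure 2).
# Expressions: ("const", c) | ("var", x) | ("lam", x, e) | ("app", e1, e2)
# Values: constants, or closures ("cl", env, x, e).  Environments are dicts.

def eval_pal(env, e):
    """Return (value, work, depth) for expression e in environment env."""
    tag = e[0]
    if tag == "const":                      # (CONST): work 1, depth 1
        return e[1], 1, 1
    if tag == "var":                        # (VAR): work 1, depth 1
        return env[e[1]], 1, 1
    if tag == "lam":                        # (LAM): build a closure
        return ("cl", env, e[1], e[2]), 1, 1
    if tag == "app":                        # (APP)
        v1, w1, d1 = eval_pal(env, e[1])    # evaluate the function ...
        v2, w2, d2 = eval_pal(env, e[2])    # ... and argument (in parallel)
        _, cenv, x, body = v1
        v3, w3, d3 = eval_pal({**cenv, x: v2}, body)
        # the works add; the depths of the two branches combine by max:
        return v3, w1 + w2 + w3 + 2, max(d1, d2) + d3 + 2
    raise ValueError(tag)
```

For example, evaluating (\x.x) 5 under these rules gives value 5 with work 1+1+1+2 = 5 and depth max(1,1)+1+2 = 4.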
Adding constants. The pure PAL has no data types other than functions. In this section we extend it with a set of arithmetic constants to obtain the A-PAL (Arithmetic-PAL) model. The syntax of expressions is unchanged, but the constants c now include the integers and a set of primitive functions on them; applying a primitive constant to a value is defined by the function delta used in the (APPC) rule, and the delta rules are given in Figure 3. The primitives include addition, multiplication, negation, division by two, and a test for positive. The two-argument primitives are curried: applying add to an integer i yields the intermediate constant add_i, and applying add_i to i' yields i + i'. As usual, booleans and conditionals can be encoded as lambda-expressions (closures) [7], so no further extension of the semantics is required.
For the complexity bounds it is important that each application of a primitive constant requires only constant work and depth, both in the model and on the machines that simulate it. For this reason we assume the integers are restricted to some fixed range, so that each can be stored in a constant number of machine words; integers of unbounded length cannot be processed in constant time (some are incompressible), and allowing them would invalidate the simulation bounds. This assumption matches the constant-time arithmetic of the machine models, which likewise operate on fixed-length words. The particular choice of primitives is otherwise not important: any set of functions that can be computed in constant time on the machine models could be used instead.
    c in Constants ::= i | add | add_i | mul | mul_i | neg | div2 | pos?

    delta(add, i)    =  add_i
    delta(add_i, i') =  i + i'
    delta(mul, i)    =  mul_i
    delta(mul_i, i') =  i * i'
    delta(neg, i)    =  -i
    delta(div2, i)   =  floor(i/2)
    delta(pos?, i)   =  true,  if i > 0
                        false, otherwise

where i ranges over the (range-restricted) integers, and the booleans are encoded as the closures true = cl({}, x, \y.x) and false = cl({}, x, \y.y).

Figure 3: The constants of the A-PAL and their delta rules.
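One possible transcription of these delta rules is the following sketch (ours, not the paper's; the integer-range restriction is elided, and curried primitives such as add are represented by intermediate tuples like ("add_i", i)):

```python
# Sketch of the A-PAL delta rules (Figure 3).
# Booleans are encoded as closures: true = \x.\y.x, false = \x.\y.y.
TRUE = ("cl", {}, "x", ("lam", "y", ("var", "x")))
FALSE = ("cl", {}, "x", ("lam", "y", ("var", "y")))

def delta(c, v):
    """Apply constant c to value v, in constant work and depth."""
    if c == "add":
        return ("add_i", v)              # delta(add, i) = add_i
    if isinstance(c, tuple) and c[0] == "add_i":
        return c[1] + v                  # delta(add_i, i') = i + i'
    if c == "mul":
        return ("mul_i", v)              # delta(mul, i) = mul_i
    if isinstance(c, tuple) and c[0] == "mul_i":
        return c[1] * v                  # delta(mul_i, i') = i * i'
    if c == "neg":
        return -v                        # negation
    if c == "div2":
        return v // 2                    # floor division by two
    if c == "pos?":
        return TRUE if v > 0 else FALSE  # test for positive
    raise ValueError(c)
```

Plugged into the (APPC) case of an interpreter, each call to delta contributes constant work and depth, as the model requires.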
The profiling semantics is not intended as an exact accounting of the costs of an implementation; its purpose is to make the asymptotic work and depth of a computation well defined. The assumption that constants, variables, and lambda-expressions are evaluated with unit work and depth, and the particular constants added by the (APP) and (APPC) rules, therefore do not matter: we are interested only in asymptotic bounds, and other constants would give the same results (see Section 3).
3 Simulating the A-PAL on Various Machines

To prove the simulation bounds we introduce the P-ECD machine, a parallel variant of the SECD machine [25, 31]. We first describe a serial variant of the SECD machine and show that evaluating a program on it takes a number of transitions equal to the work of the computation, and that each transition can be simulated on a RAM within a constant factor. We then describe the P-ECD machine. Its state consists of a set of substates, each similar to a state of the serial machine; on each step, every substate is transformed into zero, one, or two new substates, all of which can be processed in parallel. We show that the number of steps taken by the P-ECD machine is equal to the depth of the computation, and that the total number of substates processed is equal to the work. Finally, using an appropriate scheduling of substates onto a fixed number of processors, we show how the P-ECD machine can be mapped onto the various machine models.

The serial machine. A state of our variant of the SECD machine is a 4-tuple (S, E, C, D), where S is a stack of values, E is an environment, C is a control — a list of the expressions to be evaluated and occurrences of the symbol @ (apply), used to sequence the evaluation — and D is a "dump", a list of (S, E, C) triples representing suspended function calls. To evaluate an expression e, the machine starts in the state (nil, nil, [e], nil) and takes transitions until the control and dump are empty, at which point the single value on the stack is the result. The transitions are standard [25, 44, 2] and are omitted here.

Lemma 1  If {} |- e => v; w, d, then the SECD machine evaluates e to v in a transition sequence of w states.

Proof outline: Generalizing the statement to an arbitrary starting state, the proof is a straightforward structural induction on the derivation, relating each rule of the profiling semantics to the transitions it causes.

Lemma 2  Each SECD transition can be simulated on a RAM in no more than k lg|E| time, for some constant k, where |E| is the size of the current environment.

Proof outline: All transitions other than those involving the environment require only a constant number of list manipulations, which take constant time. Environment lookup and extension can be implemented in logarithmic time by representing environments as balanced trees [7].
In the worst case an environment can grow as large as the number of transitions, but |E| is bounded by V_e, the number of distinct variable names in the expression e, since a new binding for a name shadows any old one. V_e is determined by the program text and is therefore a constant, independent of the size of the input.

One subtlety is that an evaluation can extend the same environment with different bindings in different subcomputations — for example, the two branches of an application may each extend E. Since environments are represented as trees, extending an environment creates a new environment that shares almost all of its structure with the old one; the old environment is never changed, so there is no conflict and no renaming of variables is needed. This sharing of environments is also what allows the parallel machine, described below, to process its substates independently.

Corollary 1  Each SECD transition can be simulated on a RAM in no more than k_ve time, where k_ve = k lg V_e is a constant independent of the input size.

Theorem 1  If {} |- e => v; w, d, then e can be evaluated on a RAM in no more than k_ve * w time.

Proof: Follows from Lemma 1 and Corollary 1.
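The tree-represented, shared environments used above can be sketched as follows (our illustration, not the paper's code): a purely functional binary search tree keyed on variable names, in which extension copies only the search path and shares everything else, so two parallel branches can extend the same environment without interfering. Rebalancing (e.g., AVL) is omitted for brevity; with it, both operations are O(log |E|).

```python
# Sketch: persistent environments as binary search trees keyed on variable
# names.  extend() copies only the root-to-leaf search path and shares the
# rest of the tree, so the old environment remains valid and unchanged.
# A node is (key, value, left, right); None is the empty environment.

def extend(env, x, v):
    """Return a new environment: env with x bound to v."""
    if env is None:
        return (x, v, None, None)
    key, val, left, right = env
    if x == key:
        return (x, v, left, right)          # shadow the old binding
    if x < key:
        return (key, val, extend(left, x, v), right)
    return (key, val, left, extend(right, x, v))

def lookup(env, x):
    """Return the value bound to x, following one search path."""
    while env is not None:
        key, val, left, right = env
        if x == key:
            return val
        env = left if x < key else right
    raise KeyError(x)
```

For instance, two branches can build e2 = extend(e1, "y", 2) and e3 = extend(e1, "y", 3) from the same e1; each sees its own binding for y while sharing the rest of e1.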
The parallel machine. A state of the P-ECD machine is a pair (Q, M), where Q is a set of substates and M is a shared array of slots used to communicate intermediate results between substates. A substate is either an evaluation substate (E, e, D), meaning the expression e is to be evaluated in the environment E, or a result substate res(v, D), carrying the value v of a completed subcomputation. In both, D is a dump describing where the result should go: it is either nil, meaning v is the result of the whole program, or a list whose first element is fn(E, i) or arg(E, i), indicating that the value is the function or the argument, respectively, of the application whose partial results are stored in slot i of M.

On each step, a transition is applied to every substate in Q simultaneously; each transition produces zero, one, or two new substates (we write 0S, 1S, and 2S for the three cases). Evaluating an application e1 e2 creates two new substates, one for e1 and one for e2, along with a new slot i of M initialized to noval; the two branches are evaluated in parallel. Whichever branch completes first writes its value into the slot and exits (a 0S transition); the branch that completes second finds the other's value in the slot and initiates the application (a 1S transition). To guarantee that the two branches do not both find noval, each step is split into three substeps — eval, valf, and vala — so that function results and argument results are written and read at different substeps. (With an atomic test-and-set this synchronization could be avoided.) The transitions are given in Figure 4, and Figure 5 shows an example.
eval substep:
    1S((E, c, D))      :=  res(c, D)                               constant
    1S((E, \x.e, D))   :=  res(cl(E, x, e), D)                     lambda
    1S((E, x, D))      :=  res(E(x), D)                            variable
    2S((E, e1 e2, D))  :=  (E, e1, fn(E,i)::D), (E, e2, arg(E,i)::D)
                           where i is new                          application
    1S((E, @(cl(E', x, e), v), D))  :=  (E'[x -> v], e, D)         func-call
    1S((E, @(c, v), D))             :=  res(delta(c, v), D)        prim-call

valf substep, for res(v, fn(E, i)::D):
    case M_i of  noval     =>  M_i := valf(v); 0S                  left-return
                 vala(v')  =>  1S((E, @(v, v'), D))

vala substep, for res(v, arg(E, i)::D):
    case M_i of  noval     =>  M_i := vala(v); 0S                  right-return
                 valf(v')  =>  1S((E, @(v', v), D))

    res(v, nil)  :=  Exit(v)                                       exit

Figure 4: The transitions (0S, 1S, or 2S) for each kind of substate of the P-ECD machine.
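The transitions can be animated with a small sketch (ours, not the paper's machine) for a tiny fragment with constants, variables, lambda, application, and the curried add primitive. Since this Python step function processes substates one after another, the substep synchronization discussed above is unnecessary here, and the environment component of fn/arg dump entries is omitted because this fragment never needs it.

```python
# Sketch of the P-ECD step function.  Substates are ("eval", env, e, dump)
# or ("res", v, dump); dump entries ("fn", i) / ("arg", i) name a slot of
# the shared array M, whose entries are pairs [fn-result, arg-result].
NOVAL = "noval"

def step(Q, M, results):
    """One P-ECD step: transform every substate into 0, 1, or 2 new ones."""
    newQ = []
    for s in Q:
        if s[0] == "eval":
            _, env, e, D = s
            if e[0] == "const":
                newQ.append(("res", e[1], D))
            elif e[0] == "var":
                newQ.append(("res", env[e[1]], D))
            elif e[0] == "lam":
                newQ.append(("res", ("cl", env, e[1], e[2]), D))
            else:                    # application: fork into two substates
                i = len(M)
                M.append([NOVAL, NOVAL])
                newQ.append(("eval", env, e[1], [("fn", i)] + D))
                newQ.append(("eval", env, e[2], [("arg", i)] + D))
        else:                        # ("res", v, D)
            _, v, D = s
            if not D:
                results.append(v)    # Exit(v)
                continue
            (role, i), rest = D[0], D[1:]
            slot, other = (0, 1) if role == "fn" else (1, 0)
            if M[i][other] is NOVAL:   # partner not done: record v, exit
                M[i][slot] = v
            else:                      # partner done: apply fn to arg
                f, a = (v, M[i][other]) if role == "fn" else (M[i][other], v)
                newQ.append(apply_val(f, a, rest))
    return newQ

def apply_val(f, a, D):
    if f == "add":
        return ("res", ("add_i", a), D)
    if isinstance(f, tuple) and f[0] == "add_i":
        return ("res", f[1] + a, D)
    _, env, x, body = f              # closure: func-call transition
    return ("eval", dict(env, **{x: a}), body, D)

def run(e):
    Q, M, results = [("eval", {}, e, [])], [], []
    while Q:
        Q = step(Q, M, results)
    return results[0]
```

Running it on add (add 1 2) (add 3 4) (with add curried, as in Figure 5) exercises both the left-return and right-return cases.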
As an example, Figure 5 shows the steps taken by the P-ECD machine in evaluating the expression add (add 1 2) (add 3 4). [The figure lists, for each of the nine steps, the substates in Q — the expressions being evaluated and the res(v, D) substates — together with the contents of the slot array M; the two argument calculations proceed in parallel, intermediate constants such as add_1 and add_3 appear as the primitives are partially applied, and the final step applies add_3 to 7, giving 10.]

Figure 5: An example of the steps of the P-ECD machine evaluating add (add 1 2) (add 3 4).

Note that the two branches of an application need not complete at the same step — the lengths of the two calculations vary — and that the slot entries in M, together with the valf and vala substeps, are all that is needed to sequence their interaction.
We now show that the P-ECD machine exactly tracks the two cost measures of the profiling semantics.

Lemma 3  For all expressions e, if E |- e => v; w, d, then a P-ECD machine whose state contains the substate (E, e, D) calculates the value v, resulting in res(v, D), in d steps, and the calculation processes a total of w substates.

Proof: Both parts are by induction on the structure of the derivation of E |- e => v; w, d.

Depth: We show that if (E, e, D) is in Q at step s, then step s + d - 1 results in res(v, D).

CONST, LAM, or VAR: The eval substep of step s results in res(v, D), and d = 1, so the claim is true.

APP: Step s transforms the substate into (E, e1, fn(E,i)::D) and (E, e2, arg(E,i)::D), which are in Q at step s + 1. By the induction hypothesis, the calculation for e1 results in res(v1, fn(E,i)::D) at step s + d1, and the calculation for e2 results in res(v2, arg(E,i)::D) at step s + d2. If the calculation for e1 completes before that for e2 (i.e., d1 < d2), then when the calculation for e2 completes it finds valf(v1) in slot i, so (E, @(v1, v2), D) is in Q at step s + d2 + 1. Otherwise, (E, @(v1, v2), D) is in Q at step s + max(d1, d2) + 1. Either way, at the beginning of step s + max(d1, d2) + 2 the substate (E'[x -> v2], e', D) is in Q, where cl(E', x, e') = v1. By the induction hypothesis, step (s + max(d1, d2) + 2) + d3 - 1 results in res(v, D). Since d = max(d1, d2) + d3 + 2 by the profiling semantics, this is step s + d - 1, which shows the desired result.

APPC: The argument is similar to the previous case, except that the eval substep of step s + max(d1, d2) + 1 applies the delta rule directly, so that step results in res(v, D); since d = max(d1, d2) + 2, this is again step s + d - 1.

Work: We show that the calculation for e processes exactly w substates.

CONST, LAM, or VAR: Exactly one substate, with the expression e, is processed, and w = 1.

APP: By induction, the calculations for e1, e2, and the closure body e' process w1, w2, and w3 substates, respectively. In addition, one substate with the expression e1 e2 and one with @(cl(E', x, e'), v2) are processed, so the total is w1 + w2 + w3 + 2, which is exactly the work given by the (APP) rule.

APPC: Similar to the previous case, except that the substate with @(c, v2) leads directly to a result, so the total is w1 + w2 + 2.
We now need to show how each step of the P-ECD machine can be executed on the various machine models. For the PRAM models we use the standard definitions. For the butterfly we assume a machine in which p processors are connected to p memory banks through a butterfly network with p lg p switches; memory references are pipelined through the network, so each reference has a latency of O(lg p) but a processor can have many references in flight (this is similar to the memory system of machines such as the CM-5). We also assume the switches contain simple integer adders, so that a prefix sum over one value per processor can be executed through the network in O(log p) time. For the hypercube we assume a multiport machine, in which each processor can send and receive a message on each of its wires on each step.

Lemma 4  Each step of the P-ECD machine with q substates can be executed with p processors in the following time bounds, for some constant k_ve:

    Machine Model         Time
    CREW PRAM             k_ve([q/p] + log p)
    CRCW PRAM             k_ve([q/p] + log log p)
    CRCW PRAM (rand.)     k_ve([q/p] + log* p)
    Butterfly (rand.)     k_ve([q/p] + log p)
    Hypercube (rand.)     k_ve([q/p] + log p)

where [q/p] denotes the ceiling of q/p, and the randomized bounds hold with high probability.
Proof: For each model we keep the q substates in an array, and processor i is responsible for the elements [iq/p, ..., (i+1)q/p - 1] of the array, i.e., for at most [q/p] substates. Executing a step consists of three parts: executing the transition for each substate, allocating space for the new substates, and writing the new substates into a new array.

Each transition (0S, 1S, or 2S) is a simple constant-time calculation plus at most a constant number of references to shared memory: reading the expression, reading or writing a slot of M, and looking up or extending an environment (which, as in the serial machine, takes k_ve time under our assumption on V_e). None of the transitions require concurrent writes, but reading shared data — for example, two substates referring to the same environment — can require concurrent reads. On the PRAM models a processor can therefore execute the transitions for its [q/p] substates in O(q/p) time.

Each substate creates zero, one, or two new substates, and these must be packed into the array for the next step. On the CREW PRAM the allocation is done with a prefix sum over the number of new substates contributed by each old substate, which takes O(q/p + log p) time; each processor then writes its new substates at the offsets returned by the prefix sum, and all positions are distinct. This gives the CREW bound.

On the CRCW PRAM we can do better using approximate compaction: given an array of n cells of which some contain distinct objects, place the objects into an array of size km, for some constant k > 1, where m is the number of objects. Gil, Matias, and Vishkin [16] have shown that approximate compaction (with ARBITRARY concurrent write) can be solved in O(n/p + log* p) expected time with high probability, and Goldberg and Zwick [17] have shown it can be solved deterministically in O(n/p + log log p) time. Using compaction instead of an exact prefix sum, the new array can contain holes — cells holding no substate — but since its length is within a constant factor of the number of new substates, the bounds are not affected: processors simply skip the holes, and the invariant that the array is at most a constant factor larger than the number of live substates is maintained from step to step. This gives the two CRCW bounds.

For the butterfly, the memory references made by each processor are pipelined through the network, so the [q/p] references complete in O(q/p + lg p) time, and the prefix sum for allocation is executed by the integer adders on the switches in O(lg p) time. The randomization is needed to distribute the references among the banks: we assume the memory is spread using a random hash function so that, with high probability, no bank receives more than its share of the references [27, 33]. The hypercube simulation is similar, using randomized routing of the messages; as in the previous case the bounds hold with high probability.
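The allocation part of the proof — each old substate contributes 0, 1, or 2 new substates, which must be packed into a fresh contiguous array — is exactly a prefix-sum computation. A sketch of ours of the CREW version:

```python
# Sketch: packing new substates with a prefix sum (the CREW PRAM case).
# counts[i] is the number (0, 1, or 2) of new substates produced by old
# substate i; the exclusive prefix sum gives each group its offset in the
# new array, so all writes land at distinct positions.

def exclusive_prefix_sum(counts):
    out, total = [], 0
    for c in counts:
        out.append(total)
        total += c
    return out, total

def pack(new_substates_per_old):
    """new_substates_per_old: list of lists, each of length 0, 1, or 2."""
    counts = [len(g) for g in new_substates_per_old]
    offsets, total = exclusive_prefix_sum(counts)
    new_array = [None] * total
    for g, off in zip(new_substates_per_old, offsets):
        for j, s in enumerate(g):        # on a real machine, in parallel
            new_array[off + j] = s
    return new_array
```

On p processors the prefix sum itself takes O(q/p + log p) time, matching the lemma's CREW bound; the randomized CRCW bound replaces it with approximate compaction [16], leaving holes in the output array instead.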
Two further details are needed. First, the substeps of a step (eval, valf, and vala) must be separated by a synchronization of the processors. On the PRAMs we simply assume the processors run synchronously; on the butterfly and hypercube a synchronization takes O(log p) time, which is within the stated bounds. Second, the transitions within a substep are independent: no substate reads a value that another substate writes in the same substep — this is guaranteed by processing the function returns and argument returns in different substeps, as discussed earlier — so the order in which the processors execute the transitions for which they are responsible does not matter.
Theorem 2  If {} |- e => v; w, d, then e can be evaluated on each of the machine models within the time bounds given in Figure 1; for example, on a p-processor CREW PRAM it can be evaluated in k_ve(w/p + d log p) time, for some constant k_ve.

Proof: We prove the CREW PRAM case; the other cases are analogous, using the corresponding line of Lemma 4, and the argument is otherwise identical. The proof is an instance of Brent's scheduling principle [9]. Let q_s be the number of substates processed at step s of the P-ECD machine. We know from Lemma 3 that the machine takes d steps and that the q_s sum to the total work w. We also know from Lemma 4 that step s can be executed in k'_ve([q_s/p] + log p) time.
The total time to evaluate e is then

    T  =  k'_ve * sum_{s<d} ([q_s/p] + log p)
       <= k'_ve * sum_{s<d} (q_s/p + 1 + log p)
       <= 2 k'_ve (w/p + d log p)
       =  k_ve (w/p + d log p),

since the q_s sum to w and 1 <= log p.
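The inequality chain can be sanity-checked numerically (our sketch, with log p taken as log2 p and p >= 2): for any step profile q_0, ..., q_{d-1} summing to w, the scheduled time sum_s (ceil(q_s/p) + log p) never exceeds 2(w/p + d log p).

```python
import math
import random

def scheduled_time(qs, p):
    """Sum over steps of ceil(q_s / p) + log2(p), the per-step cost."""
    return sum(math.ceil(q / p) + math.log2(p) for q in qs)

def brent_bound(qs, p):
    """The closed-form bound 2 * (w/p + d * log2(p))."""
    w, d = sum(qs), len(qs)
    return 2 * (w / p + d * math.log2(p))

# Check the inequality on random step profiles.
random.seed(0)
for _ in range(100):
    qs = [random.randint(1, 1000) for _ in range(random.randint(1, 50))]
    p = random.choice([2, 4, 16, 256])
    assert scheduled_time(qs, p) <= brent_bound(qs, p)
```

The slack comes from ceil(q/p) <= q/p + 1 and from log2(p) >= 1 when p >= 2, which together absorb the "+1" per step into the d log p term.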
now
In
this
a PRAM
section
PAL.
The
EREW,
we
prefix
[32] and
terms
of work
takes
(this
we don’t
base
our
how
[4].
the
to simulate
on the
most
results
each
that
needed
access
into
[2]).
the
CRCW
PRAM).
MP
PRAM
PAL
for
3 A program
some
Proof:
state
and
assume
that
memovg
k~
and
simulate
P is state
for
all
counter).
Let
c =
and
and
of the
can
work
c < m.
on
the
kdt log m log p
the
on state
code,
processors
ICI,
Each
and
based
C is the
P are stored
PRAM,
be gimzdated
and
m =
IMI,
as balanced
state
the
binary
transition
processors
=
Analyzing
Algorithms
call
W
of
on
the
in parallel
of W.
This
write-tree
leaf
the
of the
each
split
means
the
all
maximum
we ob-
can
be
write-tree)
m).
total
of the
values
down
done
at
most
which
total
p)
work
by O(p log p).
each
The
to split
O(log
node
work
to
is of
bound
work,
takes
is bound
The
= O(plog
M
is therefore
at each
m)).
~mple-
down
Since
on the
This
there
be
O (p log p) work
1 since
check
reauired.
~an
way.
bounds
than
de~th
following
complexity
(O(log
+ log m))
based
a key
by
or send
since
tree
we
branch
original
and
on
split
of the
O(p(logp
In this
alyze
and
IPI.
We
trees
and
section
and
techniques
of the
We fist
its
232
input
mergesort.
in the
note
how
As examples,
some
algorithms
be strictly
we examine
algorithms.
of quicksort
corresponds
will
M
and
work
M
tree
to
for
one
these
p separate
is at most
is there-
❑
A.
mem-
(i. e., registers
andp
as the
must
same
other
appropriate
write-tree
is the
leaf
so,
the
parts
in the
W
depth,
transitions
M’ is the
per
and
value
modify
split
depth
more
branches.
is O (p log m)
If
W
call
work
the
of size
original
(one
fore
t on a p processor
take
checks
5
a PRAM
where
m,
(C, M,
in time
kd.
P)
C, M,
p <
a step
m
runs
kWp log m
constants
program
to
with
We will
on the
ory,
u~ing
model
that
depth
to
also
in W belong
tree.
the
two
along
To prove
chains
as deep
Theorem
total
we have
other
and
to the
and
splitting
work
addresses
M
based
leaf,
the
case
the
we split
are p — 1 of them.
other
which
modify
with
total
work
cannot
to split
only
or the
we
trees
p pieces
are
minimum
of A4 with
the
address
W
leaf.
the
there
of whether
Since
models,
and
The
it
in
the
of &f
way
that
p)
node,
and
the
the
that
so that
we can access
To insert
W into
ikf,
all addresses
two
the
into
root
their
p log m).
tree
it
weaker
(MP
the
in
model,
sums
serve
the
multi-
is because
random
powerful
multiprefix
for
in
If not,
into
O(log
lg m,
by O(log
A-
is optimal
machines
the
an
greatest
of
consi~er
appropriate
work
This
as for pointer
for
on
as for the
simulation
variants.
to do better
unit-time
same
as well
The
PRAM
same
results
with
the
work
is the
know
PRAM
models
all
logarithmic
memory
will
scan
gives
a PRAM
CRC W PRAM
in
depth
simulating
we use
and
for
on an A-PAL
consider
simulation
CREW,
the
find
modlfv
W
of the
stores
at
and
addresses
together
of M
since
eventually
mented
Simulating
at the
be
of a
values
all
branch
returns.
can
requests
value
if all the
that
back
children
tree
❑
on
call
stored
two
Splitting
4
the
result
the
algorithm
+ dlogp)
set k = 2k’.
the
when
on the
M
where
put
address
+logp))
of
recursively
or
have
this
left-to-right,
we simply
to
left
W ) ). We assume
associated
W
is a single
we check
M
+ d(l
of
and
from
that
sorted
(M,
from
the
Otherwise,
just
of the
and
node
node,
the
we call
write
are sorted
of its descendants,
work
and depth.
if &l
in
which
write-tree
(modify
assume
return.
and
+ kp))
*=O
=
check
modify
k’w((zdd + 41
each
be a single
M
We
that
the
insertion
contain
p requests
sorted
each
p) depth.
addresses
branch.
the
sort
as discussed
addresses
addresses
the
address
constant
we first
<
the
nodes
in
also
the
memory
ordering
mammum
these in
,=0
the
O(log
at
a concurrent
in
by
requests
and ‘are
nodes
processors
reauests.
implement
and
consider
sorted
balanced
To
Since
work
stores
leaves,
out
can
depth
just
es are the
can be imple-
the
the
This
writ
and
from
We
p)
concurrent
depth.
The
splitting
the
combine
W into
requests
0(log2
tree.
address.
in
write-tree
+
to process
assume
we
same
done
q, =
We
start
in
multiprefix,
pro-
~~~~
and
write-tree,
m)
trees.
them.
work
section.
right
write
as we insert
O(p log p)
O(log
and
to implement,
recursively
then
We now
cesses q, substates.
instruction
and
the
and
can be imple-
p) depth,
appropriate
address
next
are
the
the
of M
O(log
work
by sorting
in
We
proofs
traversing
instructions
and
O(p log m)
interesting
to
CREW
work
mented
node
be calculated
in kv,(w/p
O(p)
with
requires
we get
Register-to-register
with
reads
time.
p processors
constant
can
substeps,
machines.
with
synchronous.
and
problem
models.
proof
for
almost
some
other
the
u; w, d,
PRAM
Goldberg
the
+ log log p)
times
I + e ~
for
the
that
the
e on a CREW
dlogp)
solution).
shown
in O(n/p
bounds
Theorem
from
a randomized
recently
the
model
can be used
we describe
These
necessary
two
for
parallel
algorithms
programming
to an-
versions
illustrate
efficient
model.
that
as a list
any
requires
sorting
depth
algorithm
at
least
that
represents
proportional
to
7
& &
4
2
6
3
2
2
91513518
Unsorted
8
&
7
4
2
2
3
3
2
18
35691315
Tree
Sorted
4
Tree
4
2
datatype 'a Tree =
  Empty | Leaf of 'a
| Node of int * 'a Tree * 'a Tree
Figure 6: Representing
sequences as trees. The values are
stored at the leaves and each internal
node stores the size of
its subtree (the number of leaves below it).
input size—this
is the time required
just to look at all
the elements.
In fact a simple mergesort
that makes its
two recursive calls in parallel
will match this lower bound
for depth.
To derive parallel
algorithms
that are sublinear
in the input
size requires
that the input
and output
are
represented
as trees. This section shows how trees can be
used to derive effective
parallel
versions of quicksort
and
mergesort
and analyzes these versions in the PAL model.
The tree representation
we will use is given in Figure 6. We
assume that the ordering for sorted sequences is specified by
a left-to-right
traversal
of the tree.
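To make the representation concrete, here is a Python sketch (the helper names are ours, and the paper's figures use ML) that builds a size-annotated tree from a sequence and reads it back with a left-to-right traversal:

```python
# A sequence stored as a balanced binary tree: values at the leaves,
# each internal node carrying the size of its subtree.
def build(xs):
    if len(xs) == 1:
        return ("leaf", xs[0])
    half = len(xs) // 2
    return ("node", len(xs), build(xs[:half]), build(xs[half:]))

def size(t):
    return 1 if t[0] == "leaf" else t[1]

def to_list(t):
    # A left-to-right traversal recovers the sequence order.
    if t[0] == "leaf":
        return [t[1]]
    return to_list(t[2]) + to_list(t[3])

t = build([3, 5, 6, 9, 13, 15, 18, 35])
assert size(t) == 8
assert to_list(t) == [3, 5, 6, 9, 13, 15, 18, 35]
```

Storing the size at each node is what later lets routines such as elt and select_kth navigate by rank in time proportional to the tree depth.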
fun append Empty b = b
  | append a Empty = a
  | append a b = Node (size(a) + size(b), a, b)

fun select f Empty = Empty
  | select f (Leaf x) = if f x then Leaf x else Empty
  | select f (Node (_, left, right)) =
      append (select f left) (select f right)
Parallel Quicksort:
The code for our quicksort
algorithm
is
given in Figure 7. The function qsort_rec returns a sorted
qsort -.rec returns a sorted
tree, but in general it will not be perfectly
balanced, so the
function
rebalance
rebalances it. The function
qsort_rec
is
similar to the sequential
version of quicksort
on lists, except
that elt,
select,
and append are implemented
on trees.
The function
elt can be implemented
by traversing
the tree
down to the appropriate
leaf, and for a tree of depth d, this
requires O(d) work and depth (there is no parallelism).
The
function
select
is implemented
by calling itself recursively
in parallel
on both branches and putting
the results back
together.
Assuming
the function
f has constant
work and
depth, select
on a tree of size n and depth d requires O(n)
work and O(d) depth.
We note that the tree returned
by
select
is generally not going to be balanced,
which is why
we do not assume that d = lg n. The append function
simply
puts its two arguments
together in a tree node and therefore
has constant work and depth.
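A Python sketch of this behavior (illustrative; the names are ours, and the two recursive calls are written sequentially although the model evaluates them in parallel):

```python
def size(t):
    return 1 if t[0] == "leaf" else t[1]

def node(l, r):
    # append: drop empty sides, otherwise pair under a sized node
    if l is None:
        return r
    if r is None:
        return l
    return ("node", size(l) + size(r), l, r)

def select(f, t):
    # Keep the leaves satisfying f; the result preserves left-to-right
    # order but is generally unbalanced.
    if t is None or t[0] == "leaf":
        return t if (t is not None and f(t[1])) else None
    return node(select(f, t[2]), select(f, t[3]))

t = ("node", 3, ("leaf", 2), ("node", 2, ("leaf", 7), ("leaf", 4)))
kept = select(lambda x: x < 5, t)
assert size(kept) == 2   # leaves 2 and 4 survive, order preserved
```

Note that the result's depth never exceeds the input's depth, which is the property the quicksort analysis below relies on.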
We first present a general theorem that bounds work and
depth for our quicksort
in the expected case for any input
tree, even if not balanced,
and as a corollary
give the bounds
for balanced input.
fun qsort_rec a =
  if (size a) < 2 then a
  else
    let val pivot   = elt a ((size a) div 2)
        val less    = select (fn x => x < pivot) a
        val equal   = select (fn x => x = pivot) a
        val greater = select (fn x => x > pivot) a
    in
      append (qsort_rec less)
             (append equal (qsort_rec greater))
    end
fun rebalance Empty = Empty
  | rebalance (Leaf x) = Leaf x
  | rebalance a =
      let val half = (size a) div 2
      in
        append (rebalance (take a half))
               (rebalance (drop a half))
      end
fun quicksort a = rebalance (qsort_rec a)

Theorem 4 The quicksort algorithm specified in Figure 7, when applied to a tree with n leaves and depth d, will execute in O(n log n) work and O(d log n) depth on the λ-PAL model, both expected case (i.e., average over all possible inputs of that depth and size).

Proof: We first consider qsort_rec. We note that since the pivots in quicksort will not perfectly split the data, some recursive paths will be longer than others. We call the longest path of recursive calls for qsort_rec on a particular input
Figure 7: An example diagram of the select function and the code for the parallel quicksort algorithm.
the recursion
depth for that input.
We note that the worst
case recursion
depth is 0(n) and that fewer than 1 out of n
of the possible inputs will lead to a recursion
depth greater than k log n [34]. To determine the total computational depth of qsort_rec, we need to consider the computational
depth along the longest path. We claim that this computational depth is at most O(d) times the recursion depth since
each node along the recursion tree will require at most O(d)
depth.
This is because elt and select will run in O(d) depth.² Since a fraction of only 1/n of the inputs will have
a recursion depth greater than O(log n), and these cases will
have recursion
depth at most O(n), the average (expected
case) computation
depth of qsortrec
is
D(n) = O(d(log n + (1/n)·n)) = O(d log n).
To see that the work is expected to be O(n log n), we simply
note that all steps do no more than a constant fraction more
work than a list-based
sequential
implementation.
We now briefly consider the routine rebalance.
We note
that the depth of the tree returned by qsort_rec is at most a
is at most a
constant times the recursion depth. The function
rebalance
is implemented
by splitting
the tree along the path that
separates the tree into two equal size pieces (or off by 1),
recursively
calls itself on the two parts, and appends
the
results. We claim that for a tree of size n and depth d it will
run with O(n log n) work and O(d log n) depth (worst case).
Given the above bounds on the recursion
depth, this gives
an expected depth of O(log² n). ❑
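For intuition, the structure of Figure 7 can be transliterated into Python over flat lists (this models only the control structure; the tree representation, and hence the depth behavior, is not modeled):

```python
def qsort_rec(xs):
    if len(xs) < 2:
        return xs
    pivot = xs[len(xs) // 2]           # elt a ((size a) div 2)
    less    = [x for x in xs if x < pivot]   # select (fn x => x < pivot) a
    equal   = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    # The two recursive calls are independent and could run in parallel.
    return qsort_rec(less) + equal + qsort_rec(greater)

assert qsort_rec([9, 15, 13, 5, 18, 3]) == [3, 5, 9, 13, 15, 18]
```

In the tree version, each of the three selects and the appends cost O(d) depth per level of recursion, which is where the O(d log n) expected depth comes from.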
Corollary 2 The quicksort algorithm specified in Figure 7, when applied to a balanced tree with n leaves, will execute in O(n log n) work and O(log² n) depth on the λ-PAL model, both expected case.

datatype 'a Tree =
  Empty | Leaf of 'a
| Node of int * 'a * 'a Tree * 'a Tree

fun select_kth k (Leaf v1) (Leaf v2) =
      if v2 > v1 then (if k = 0 then v1 else v2)
      else (if k = 0 then v2 else v1)
  | select_kth k (Leaf v1) (Node (n2,v2,l2,r2)) =
      if v2 > v1 then
        (if k > n2 then select_kth (k-n2) (Leaf v1) r2
         else select_kth k (Leaf v1) l2)
      else
        (if n2 > k then select_kth k (Leaf v1) l2
         else select_kth (k-n2) (Leaf v1) r2)
  | select_kth k (Node (n1,v1,l1,r1)) (Leaf v2) =
      if v2 > v1 then
        (if n1 > k then select_kth k l1 (Leaf v2)
         else select_kth (k-n1) r1 (Leaf v2))
      else
        (if k > n1 then select_kth (k-n1) r1 (Leaf v2)
         else select_kth k l1 (Leaf v2))
  | select_kth k (Node (n1,v1,l1,r1)) (Node (n2,v2,l2,r2)) =
      if v2 > v1 then
        (if k > (n1+n2) then select_kth (k-n1) r1 (Node (n2,v2,l2,r2))
         else select_kth k (Node (n1,v1,l1,r1)) l2)
      else
        (if k > (n1+n2) then select_kth (k-n2) (Node (n1,v1,l1,r1)) r2
         else select_kth k l1 (Node (n2,v2,l2,r2)))

fun merge (Leaf x) b = insert x b
  | merge a (Leaf y) = insert y a
  | merge a b =
      let val k = ((size a) + (size b)) div 2
          val median = select_kth k a b
      in
        append (merge (take_less a median) (take_less b median))
               (merge (drop_less a median) (drop_less b median))
      end
Parallel Mergesort:
We first consider the problem of merging two sorted trees. We use n to refer to the sum of the sizes
of the two trees.
We assume that each internal
node of
the input trees contains
the maximum
value of its descendants, as well as its size. This is clearly easy to generate
in O(n) work and O(log n) depth.
The main component of the parallel algorithm is a routine select_kth which, given two ordered trees a and b, returns the kth smallest value from the combination of the two sequences (see Figure 8). It is implemented using a dual binary search in which we go down a branch from one of the two sequences on each step, using the maximal element at each node for navigation. Assuming the depths of the two trees are da and db, the work and depth complexity of this routine is O(da + db).
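The dual binary search can be illustrated over sorted lists instead of trees (a Python sketch under that simplification; the tree version navigates with the stored maxima and sizes rather than by indexing):

```python
def select_kth(k, a, b):
    # k-th smallest (0-indexed) of the union of sorted lists a and b,
    # discarding part of one list on each step.
    if not a:
        return b[k]
    if not b:
        return a[k]
    ia, ib = len(a) // 2, len(b) // 2
    if a[ia] <= b[ib]:
        if k > ia + ib:
            # a[:ia+1] are all among the k smallest; discard them.
            return select_kth(k - ia - 1, a[ia + 1:], b)
        return select_kth(k, a, b[:ib])
    else:
        if k > ia + ib:
            return select_kth(k - ib - 1, a, b[ib + 1:])
        return select_kth(k, a[:ia], b)

merged = sorted([3, 6, 9, 15] + [5, 13, 18, 35])
assert all(select_kth(k, [3, 6, 9, 15], [5, 13, 18, 35]) == merged[k]
           for k in range(8))
```

Each call discards about half of one input, giving O(log|a| + log|b|) steps, the list analogue of the O(da + db) bound.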
To merge two trees, we use select_kth to find their combined median element. We then select the elements less and greater, respectively, than the median for each tree with the functions take_less and drop_less. These can be implemented with O(log n) work and depth since the trees are sorted and balanced (it just requires going down a tree splitting along the way).
Recursively
merging
the two trees of
lesser elements and the two trees of greater elements gives us
two sorted trees which are guaranteed
to be the same size (or
off by one) by construction.
So, joining
them under a new
node produces a balanced
sorted tree. As a whole, merging
in this manner takes O(n) work and O(log² n) depth since we recurse for the lg n depth of the trees.

fun merge_sort a =
  if (size a) < 2 then a
  else
    let val half = (size a) div 2
    in merge (merge_sort (take a half))
             (merge_sort (drop a half))
    end

Figure 8: Code of the parallel mergesort algorithm.
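The merge strategy can likewise be sketched over sorted lists (Python, illustrative; finding the partition point plays the role of select_kth, and slicing plays the role of take_less/drop_less; the two recursive merges could run in parallel):

```python
def split_point(a, b, k):
    # Smallest ia such that a[:ia] and b[:k-ia] together are the k
    # smallest elements of the union (binary search on ia).
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        ia = (lo + hi) // 2
        if a[ia] < b[k - ia - 1]:
            lo = ia + 1
        else:
            hi = ia
    return lo

def merge(a, b):
    if not a:
        return b
    if not b:
        return a
    k = (len(a) + len(b)) // 2
    ia = split_point(a, b, k)
    ib = k - ia
    # Merge the two halves independently and concatenate.
    return merge(a[:ia], b[:ib]) + merge(a[ia:], b[ib:])

assert merge([3, 6, 9, 15], [5, 13, 18, 35]) == [3, 5, 6, 9, 13, 15, 18, 35]
```

By construction the two recursive merges get inputs of equal size (or off by one), which is what keeps the result balanced in the tree version.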
Theorem 5 The mergesort algorithm specified in Figure 8, when applied to a balanced tree with n leaves, will execute in O(n log n) work and O(log³ n) depth on the λ-PAL model.
Proof: We can write the following recurrences for work and depth:

  W(n) = 2W(n/2) + W_merge(n) = 2W(n/2) + O(n) = O(n log n)
  D(n) = D(n/2) + D_merge(n) = D(n/2) + O(log² n) = O(log³ n)  ❑

²Note that although select does not return balanced trees, it will never return a tree with depth greater than the original tree, which has depth d.
This version of mergesort is not as efficient as the quicksort previously described. However, if merging uses n/lg n splitters, rather than just the median, the depth complexities of merging and mergesort can each be improved by a factor of log n [7].
6 Related Work
Several researchers have used cost-augmented
semantics for
automatic
time analysis of serial programs
[3, 38, 39, 45].
This work was concerned with serial running time, and since
they were primarily
interested
in automatically
analyzing
programs rather than defining
complexity,
they each altered
the semantics
of functions
to simplify
such analysis.
Furthermore,
none related their complexity
models to more traditional
machine models, although
since the languages are
serial this should not be hard.
Roe [36, 37] and Zimmerman
[46, 47] both studied profiling semantics for parallel languages.
Roe formally
defined
a profiling semantics for an extended λ-calculus with lenient
evaluation.
In his semantics,
the two subexpressions
of a
special let expression
plet x = e1 in e2 evaluate in parallel such that the evaluation
of an occurrence
of x in e2 is
delayed until its value is available.
To define when this is
the case, he augmented
the standard denotational
semantics
with the time that each expression begins and ends evaluation.
He did not show any complexity
bounds resulting
from his definition
or relate this model to any other. Zimmerman introduced
a profiling
semantics for a data-parallel
language for the purpose of automatically
analyzing
PRAM
algorithms.
The language therefore almost directly
modeled
the PRAM
by adding a set of PRAM-like
primitive
operations. Complexity
was measured in terms of time and number of processors,
as it is measured for the PRAM.
It was
not shown, however, whether the model exactly modeled the
PRAM.
In particular
since it is not known until execution
how many processors are needed, it is not clear whether the
scheduling
could be done on the fly.
Hudak and Anderson
[19] suggest modeling parallelism
in
functional
languages using an extended
operational
semantics based on partially
ordered
multisets
(pomsets).
The
semantics can be thought of as keeping a trace of the computation as a partial order specifying
what had to be computed
before what else. Although
significantly
more complicated,
their call-by-value
semantics are related to the A-PAL model
in the following
way. The work in the λ-PAL model is within
a constant
factor of the number
of elements in the pomset, and the number of steps is within
a constant factor of the longest
chain in the pomset.
They did not relate their model to
other models of parallelism
or describe how it would affect
algorithms.
Previous work on formally
relating
language-based
models (languages
with cost-augmented
semantics)
to machine
models is sparse. Jones [21] related the time-augmented semantics of a simple while-loop language to that of an equivalent machine language in order to study the effect of constant factors in time complexity.
Seidl and Wilhelm
[40] provide
complexity
bounds for an implementation
of graph reduction
on the PRAM.
However, their implementation only considers a single step and requires that you know which graph nodes to execute in parallel in that step and that the graph has constant in-degree. Under these conditions they show how to process n nodes in O(n/p + p log p) time (which is a factor of p worse than our bounds in the second term,
see Lemma 4). There have also been several experimental
studies of how much parallelism
is available
in sequential
functional
languages [11, 8, 10].
The work-step
paradigm
has been used for many years
for informally
describing
parallel
algorithms
[42, 22]. It was
first included in a formal model by Blelloch in the VRAM
[5].
NESL [6], a data-parallel
functional
language, includes complexity
measures based on work and steps and has been
used for describing
and teaching parallel
algorithms.
Skillicorn [43] also introduced
cost measures specified in terms
of work and steps for a data-parallel
language based on the
Bird-Meertens
formalism.
In both cases the languages were
not based on the pure λ-calculus
but instead included
array primitives.
Also, neither formally showed the relationship of their models to machine models.
Part of the motivation
of the work described
in this paper was to formalize
the
mapping
of complexity
to machine models and to see how
much parallelism
is available
without
adding
data-parallel
primitives.
Dornic,
et al. [13] and Reistad and Gifford
[35] explore
adding time information
to a functional
language type system. But for type inference to terminate,
only special forms
of recursion can be used, such as those of the Bird-Meertens
formalism.
There has been much work on comparing
machine models within
traditional
complexity
theory.
The most closely
related is that of Ben-Amram
and Galil [2], who show that
a pointer machine incurs logarithmic
overhead to simulate a
RAM. The pointer machine [24, 44] is similar to the SECD machine in that it addresses memory only through pointers, but it lacks direct support for implementing higher-order functions. We borrow from them the parameterization of models over incompressible data types and operations. Paige [29] also compares models similar to those used by Ben-Amram and Galil.
Goodrich and Kosaraju
[18] introduced
a parallel pointer
machine (PPM),
but this is quite different
from our model
since it assumes a fixed number of processors and allows side
effecting of pointers.
Another
parallel
version of the SECD
machine was introduced
by Abramsky
and Sykes [1], but
their Secd-m machine was non-deterministic
and based on
the fair merge.
7 Conclusions
This paper has discussed a complexity
model based on the
λ-calculus and shown various simulation
results.
A goal of
this work is to bring a closer tie between parallel algorithms
and functional
languages.
We believe that language-based
complexity
models, such as the ones suggested in this paper,
could be a useful way for describing
and thinking
about
parallel
algorithms
directly,
rather than always needing to
translate
to a machine model.
This paper leaves several open questions,
including
We mentioned
that a call-by-speculation
implementation of normal-order
evaluation
might
allow for improved depth bounds for various problems.
In particular it allows for pipelined
execution.
Does this help,
and on what problems?
● Is it possible to sort within d = o(log² n) and w = O(n log n)?

● Our simulations are memory inefficient. Can good bounds be placed on the use of memory?
[9] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201–206, 1974.
● Can the bounds for simulating the λ-PAL on a PRAM be improved? The bounds for the butterfly network are tight.

● Because it lacks random access, can the λ-PAL model be simulated more efficiently than the PRAM on machines that have less powerful communication (e.g., fixed-topology networks, parallel I/O models, or the LogP model [12]), and can the complexity model be augmented to capture the notion of locality for these machines?
8 Acknowledgments
This research was sponsored in part by the Wright
Laboratory, Aeronautical
Systems Center, Air Force Materiel
Command, USAF, and the Advanced
Research Projects Agency
(ARPA)
under grant number F33615-93-1-1330
and contract
number F19628-91-C-0168.
It was also supported
in part by
an NSF Young Investigator
Award and by Finmeccanica.
The views and conclusions
contained
in this document
are
those of the authors and should not be interpreted
as necessarily representing
the official policies or endorsements,
either expressed or implied,
of Wright Laboratory
or the U. S.
Government.
References
[1] Samson Abramsky and R. Sykes. Secd-m: A virtual machine for applicative programming. In Jean-Pierre Jouannaud, editor, Proceedings 2nd International Conference on Functional Programming Languages and Computer Architecture, number 201 in Lecture Notes in Computer Science, pages 81–98, 1985.

[2] Amir M. Ben-Amram and Zvi Galil. On pointers versus addresses. Journal of the ACM, 39(3):617–648, July 1992.

[3] Bror Bjerner and Soren Holmstrom. A compositional approach to time analysis of first order lazy functional programs. In Proceedings 4th International Conference on Functional Programming Languages and Computer Architecture. Springer-Verlag, September 1989.

[4] Guy Blelloch. An L1 manual (draft version), November 1989.

[5] Guy E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, 1990.

[6] Guy E. Blelloch. NESL: A nested data-parallel language (version 2.6). Technical Report CMU-CS-93-129, School of Computer Science, Carnegie Mellon University, April 1993.

[7] Guy E. Blelloch and John Greiner. A parallel complexity model for functional languages. Technical Report CMU-CS-94-196, Carnegie Mellon University, October 1994.

[8] A. P. Willem Bohm and R. E. Hiromoto. The dataflow time and space complexity of FFTs. Journal of Parallel and Distributed Computing, 18(3):301–313, July 1993.

[10] I. Checkland and C. Runciman. Perfect hash functions made parallel — lazy functional programming on a distributed multiprocessor. In T. N. Mudge, V. Milutinovic, and L. Hunter, editors, Proceedings 26th Hawaii International Conference on System Sciences, volume 2, pages 397–406, January 1993.

[11] C. D. Clack and Simon Peyton Jones. Generating parallelism from strictness analysis. Technical Report Internal Note 1679, Dept. Comp. Sci., University College London, February 1985.

[12] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proceedings 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[13] Vincent Dornic, Pierre Jouvelot, and David K. Gifford. Polymorphic time systems for estimating program complexity. ACM Letters on Programming Languages and Systems, 1(1):33–45, March 1992.

[14] John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the Sisal language project. Journal of Parallel and Distributed Computing, 10(4):349–366, December 1990.

[15] Steven Fortune and James Wyllie. Parallelism in random access machines. In Proceedings 10th ACM Symposium on Theory of Computing, pages 114–118, 1978.

[16] J. Gil, Yossi Matias, and Uzi Vishkin. Towards a theory of nearly constant time parallel algorithms. In Proceedings Symposium on Foundations of Computer Science, pages 698–710, October 1991.

[17] T. Goldberg and U. Zwick. Optimal deterministic processor allocation. In Proceedings 6th ACM-SIAM Symposium on Discrete Algorithms, January 1995.

[18] Michael T. Goodrich and S. Rao Kosaraju. Sorting on a parallel pointer machine with applications to set expression evaluation. In 30th Annual Symposium on Foundations of Computer Science, pages 190–195, November 1989.

[19] Paul Hudak and Steve Anderson. Pomset interpretations of parallel functional programs. In Proceedings 3rd International Conference on Functional Programming Languages and Computer Architecture, number 274 in Lecture Notes in Computer Science, pages 234–256. Springer-Verlag, September 1987.

[20] Paul Hudak and Eric Mohr. Graphinators and the duality of SIMD and MIMD. In ACM Conference on Lisp and Functional Programming, pages 224–234, July 1988.

[21] Neil D. Jones. Constant time factors do matter (extended abstract). In Proceedings 25th ACM Symposium on Theory of Computing, pages 602–611, 1993.

[22] R. M. Karp and V. Ramachandran. Parallel algorithms for shared memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science — Volume A: Algorithms and Complexity. MIT Press, Cambridge, Mass., 1990.

[23] Richard Kennaway. A conflict between call-by-need computation and parallelism (extended abstract). In Proceedings Conditional Term Rewriting Systems-94, February 1994.

[24] Donald E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley Publishing Company, Reading, MA, 1968.

[25] P. J. Landin. The mechanical evaluation of expressions. Computer Journal, 6:308–320, 1964.

[26] Yossi Matias and Uzi Vishkin. Converting high probability into nearly-constant time—with applications to parallel hashing. In Proceedings ACM Symposium on Theory of Computing, pages 307–316, May 1991.

[27] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memory. Acta Informatica, 21:339–374, 1984.

[28] Rishiyur S. Nikhil. ID Version 90.0 Reference Manual. Computation Structures Group Memo 284-1, Laboratory for Computer Science, Massachusetts Institute of Technology, July 1990.

[29] Robert Paige. Real-time simulation of a set machine on a RAM. In W. Koczkodaj, editor, Proceedings International Conference on Computing and Information, volume 2, pages 68–73, 1989.

[30] S. L. Peyton Jones. Parallel implementations of functional programming languages. The Computer Journal, 32(2):175–186, 1989.

[31] Gordon D. Plotkin. Call-by-name, call-by-value and the lambda calculus. Theoretical Computer Science, 1:125–159, 1975.

[32] Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, Department of Computer Science, New Haven, CT, 1989.

[33] Abhiram G. Ranade. How to emulate shared memory. Journal of Computer and System Sciences, 42(3):307–326, June 1991.

[34] R. Reischuk. Probabilistic parallel algorithms for sorting and selection. SIAM Journal of Computing, 14(2):396–409, 1985.

[35] Brian Reistad and David K. Gifford. Static dependent costs for estimating execution time. In Proceedings ACM Conference on LISP and Functional Programming, pages 65–78, July 1994.

[36] Paul Roe. Calculating lenient programs' performance. In Simon L. Peyton Jones, Graham Hutton, and Carsten Kehler Holst, editors, Proceedings Functional Programming, Glasgow 1990, Workshops in Computing. Springer-Verlag, August 1990.

[37] Paul Roe. Parallel Programming using Functional Languages. PhD thesis, Department of Computing Science, University of Glasgow, February 1991.

[38] Mads Rosendahl. Automatic complexity analysis. In Proceedings 4th International Conference on Functional Programming Languages and Computer Architecture, September 1989.

[39] David Sands. Calculi for Time Analysis of Functional Programs. PhD thesis, University of London, Imperial College, September 1990.

[40] Helmut Seidl and Reinhard Wilhelm. Probabilistic load balancing for parallel graph reduction. In Proceedings TENCON '89, 4th IEEE Region 10 International Conference, pages 879–884, November 1989.

[41] Yossi Shiloach and Uzi Vishkin. Finding the maximum, merging and sorting in a parallel computation model. Journal of Algorithms, 2(1):88–102, 1981.

[42] Yossi Shiloach and Uzi Vishkin. An O(n² log n) parallel Max-Flow algorithm. J. Algorithms, 3:128–146, 1982.

[43] David B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. To appear in the Journal of Parallel and Distributed Computing.

[44] Robert E. Tarjan. A class of algorithms which require nonlinear time to maintain disjoint sets. J. Comput. System Sci., 18:110–127, 1979.

[45] Philip Wadler. Strictness analysis aids time analysis. In Proceedings 15th ACM Symposium on Principles of Programming Languages, January 1988.

[46] Wolf Zimmerman. Automatic worst case complexity analysis of parallel programs. Technical Report TR-90-066, International Computer Science Institute, December 1990.

[47] Wolf Zimmerman. Complexity issues in the design of functional languages with explicit parallelism. In Proceedings International Conference on Computer Languages, pages 34–43, 1992.