Architecture-Independent Parallelism for Both Shared- and Distributed-Memory Machines using the Filaments Package
David K. Lowenthal
Department of Computer Science
University of Georgia
[email protected]
Vincent W. Freeh
Department of Computer Science and Engineering
University of Notre Dame
[email protected]
March 3, 1999
Abstract
This paper presents the Filaments package, which can be used to create architecture-independent parallel programs, that is, programs that are portable and efficient across vastly different
parallel machines. Filaments virtualizes the underlying machine in terms of the number of
processors and the interconnection, allowing fine-grain, shared-memory programs to be written
or generated. Furthermore, Filaments uses a carefully designed API along with machine-specific
runtime libraries and preprocessing that allow programs to run unchanged on both shared- and distributed-memory machines. Performance is not sacrificed, as almost all kernels and
applications we tested achieve a speedup of over 4 on 8 processors of both an SGI Challenge
and a cluster of Pentium Pros.
Keywords: fine-grain parallelism, architecture independence
1 Introduction
Recent improvements in VLSI and in fast networking have accelerated the growth of shared- and
distributed-memory parallel computers. Soon small-scale shared-memory multiprocessors will be
on desktops, and large clusters of workstations with fast communication will be in every building.
This increase in processing power presents an opportunity to computational scientists who are
always in need of parallel machines to execute programs. It is important for these scientists to be
able to execute their program on any available parallel machine: shared, distributed, or a hybrid.
This makes it critical to provide a parallel programming model that is architecture independent,
which means that it can be implemented efficiently on vastly different parallel architectures. In
other words, it must be simultaneously portable and efficient. We have designed and implemented
a library package called Filaments [FLA94, LFA96, LFA98] that provides such a model and an
efficient implementation.[1]
The Filaments package achieves architecture independence in two ways:
the use and efficient implementation of fine-grain parallelism and shared-variable communication, and
machine-specific libraries and preprocessing that allow programs to run on both shared- and
distributed-memory machines without any changes.
[1] The Filaments package is freely available at http://www.cs.uga.edu/~dkl/filaments/dist.html.
The granularity of a parallel program refers to the amount of computation in each process. The
coarsest-grain program is a sequential program with one process that does all the work. However,
a parallel program must partition the work between processors; a coarse-grain parallel program has
one process per processor. On the other hand, a fine-grain program creates one process (thread)
for each logical unit of work. Thus, a fine-grain program creates the abstract parallelism inherent
in the algorithm, whereas a coarse-grain program creates the concrete parallelism of the machine.
Fine-grain programs are simpler to write because they exhibit the natural parallelism of an
algorithm; indeed, parallelism is expressed in terms of the application and problem size, not in
terms of the number of processors that are used to execute the program. Clustering of independent
units of work into a fixed set of larger tasks is not necessary.
Just as fine-grain parallelism provides the natural level of parallelism, the natural model for
communication is shared variables. Processes reference a variable without regard to its location,
just as in a sequential program. Remote variables are brought into the local address space transparently when they are referenced; hence, processes can communicate by just reading and writing
shared variables. Most algorithms are expressed in terms of shared variables [NL91]; furthermore,
programming with messages is tricky despite the various solutions that have been proposed [Tan95].
Compared to shared-variable programs, message passing adds many complications to a program,
the worst of which is significant program reorganization. Shared-variable communication allows
a user to write a sequential program, profile it, and parallelize the parts that are computationally
expensive, without dramatic changes to the rest of the program.
The combination of fine-grain parallelism and shared-variable communication provides a portable substrate for parallel programming and is used in the Filaments package. The physical dimensions of the machine are not part of the application program, because the processors and
interconnect are virtualized. This programming model makes it easy to write or compile applications for both shared- and distributed-memory parallel computers. Moreover, this model
extends naturally to networks of shared-memory multiprocessors.
Even with a portable substrate, allowing programs to run unchanged on vastly different architectures is a challenge. The shared and distributed implementations have vast differences. For
example, reduction variables are implemented as scalars on a distributed-memory machine and as
an array on a shared-memory machine. Filaments uses a carefully designed API and machine-specific libraries and preprocessing to allow application programs to run unchanged on a variety of
shared- and distributed-memory machines. Currently the Filaments package runs on SGI and
SPARC multiprocessors and on SPARC and Pentium (both Solaris and Linux) clusters.
The contributions of this paper are:
demonstration that fine-grain parallelism and shared-variable communication provide a simple
programming model;
a system, Filaments, that supports fine-grain, shared-memory programs that run unchanged
on both shared- and distributed-memory machines, and
an efficient implementation of Filaments on both shared- and distributed-memory machines.
The next section illustrates how architecture-independent programs are written in Filaments.
Section 3 discusses Filaments implementations on both shared- and distributed-memory machines
and how portability is achieved. Section 4 examines the performance of Filaments on both shared- and distributed-memory machines. Section 5 discusses related work, and the last section summarizes
the paper.
[Figure 1: Three parallel computer architectures. (a) Multiprocessor: one node, four processors sharing one memory over a shared bus. (b) Multicomputer: four nodes, four processors; each node has its own memory, and nodes communicate over an interconnection network. (c) Hybrid: two nodes, four processors; each node has two processors and one memory.]
2 Architecture-Independent Parallel Programming in Filaments
The Filaments package is a software kernel that supports efficient execution of fine-grain parallelism
and shared-variable communication on a range of parallel machines. Filaments programs make
no mention of coarsening small threads into larger processes or of sending and receiving messages
between processors. A filament is a very lightweight thread. Each filament can be quite small, as
in the computation of an average in Jacobi iteration; medium size, as in the computation of an
inner product in matrix multiplication; or large, as in a coarse-grain program with one process per
processor. The Filaments package provides a small set of primitives that are sufficient to implement
architecture-independent programs for the vast majority of scientific applications.
The Filaments package runs on both shared- and distributed-memory machines. Both architectures have multiple physical processors; however, a multiprocessor has one physical address space,
and a multicomputer has one per processor. The term processor refers to a physical processor,
whether it be in a multiprocessor or a multicomputer. The term node refers to an address space.
Figure 1 shows three configurations of four processors. Figure 1(a) shows a shared-memory multiprocessor with four processors on one node. Figure 1(b) shows a distributed-memory multicomputer
with four nodes, each containing one processor. Figure 1(c) shows a network of multiprocessors
with 4 processors and 2 nodes. (The Filaments implementation for this hybrid machine is under
development.)
The Filaments package supports two kinds of filaments. Iterative filaments execute repeatedly, with a global reduction operation (and hence a barrier synchronization) occurring after each
execution of all filaments. The package also supports sequences of phases of iterative filaments,
which are used when applications or loop bodies have multiple components. Iterative filaments are
used in applications such as Jacobi iteration, LU decomposition, and Water-Nsquared. Fork/join
filaments recursively create new filaments and wait for them to return results. They are used in
divide-and-conquer applications such as adaptive quadrature, quicksort, and recursive FFT.
Iterative filaments execute in phases. All filaments in a phase are independent and can execute
in any order. Within a phase, there are one or more pools, which are groups of filaments that
reference data with spatial locality. Multiple pools per phase provide for more efficient use of the
cache and/or the distributed shared memory; however, it is important to note that using multiple
pools per phase is not necessary for correctness. A programmer places each iterative filament in a
pool (within a phase) when it is created. A phase consists of a pointer to the filament code and a
pointer to the post-phase code, which is a function that is called after each execution of all filaments
in the phase. The post-phase code synchronizes the processors and determines whether a phase is
finished.
On the other hand, fork/join filaments are created dynamically and in parallel. When a processor forks a filament, the filament is placed on that processor's list; however, any processor (on
any node) may execute the filament. Fork/join applications do not have inherent locality. Hence,
pools are not used because the data-reference patterns of fork/join filaments in general cannot be
determined.
Below we describe both iterative and fork/join applications and how they are programmed
using Filaments. A program that uses Filaments contains three additional components relative to
a sequential program:
declarations of variables that are to be located in shared memory,
functions containing the code for each filament, and
a section that creates the filaments, places them on processors and/or nodes, and controls
their execution.
The names of Filaments library calls have been shortened for brevity. Additionally, for clarity, some
details of the code fragments are omitted. Note that in general we expect a compiler to generate
Filaments code; in fact, we have a prototype of a modified SUIF [AALT95] compiler that translates
sequential C programs to Filaments programs.
2.1 Jacobi Iteration: Iterative Filaments
Laplace's equation in two dimensions is the partial differential equation

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \]
Jacobi iteration is one way to approximate the solution to this equation; it works by repeatedly
computing a new value for each grid point; the new value for a point is the average of the values
of its four neighbors from the previous iteration. The computation terminates when all new values
are within some tolerance, epsilon, of their respective old values. Because there are two grids, the
n^2 updates are all independent computations; hence, all new values can be computed in parallel.
For this application, the key shared variables are the two n by n arrays, new and old, and the
reduction variable maxdiff. A reduction variable is a special kind of variable with one copy per
node. The local copy of a reduction variable can be accessed directly using the filGetVal macro.
Such variables are also used in calls to filReduce, which (1) atomically combines the private copy
on each processor into a single copy using a binary, associative operator (such as add or maximum)
and (2) copies the reduced value into each private copy. A call of filReduce also results in a barrier
synchronization between nodes.
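As a small illustration (a sketch assembled from calls that appear elsewhere in this paper, not a fragment of the actual Jacobi program), a reduction variable is declared with the filDeclareRedVar macro described in Section 3.1 and then manipulated through filGetVal and filReduce:

filDeclareRedVar(double, maxdiff);  /* one private copy per node */
...
filGetVal(maxdiff) = 0.0;           /* read or write the local copy directly */
filReduce(maxdiff, MAX);            /* combine all copies with MAX; implies a barrier */

After the call to filReduce, every copy of maxdiff holds the reduced (maximum) value.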
The code executed by each filament computes an average and a difference:
void jacobi(int i, int j) {
    double temp;
    new[i][j] = (old[i-1][j] + old[i+1][j] +
                 old[i][j-1] + old[i][j+1]) * 0.25;
    temp = absval(new[i][j] - old[i][j]);
    if (filGetVal(maxdiff) < temp)
        filGetVal(maxdiff) = temp;
}
After computing the new value of grid point (i,j), jacobi computes the difference between the
old and new values of that point. If the difference is larger than the maximum difference observed
thus far in this iteration, then maxdiff is updated.
After all grid points are updated, the following procedure is called to check for convergence and
to swap grids:
int check() {
    filReduce(maxdiff, MAX);
    if (filGetVal(maxdiff) < epsilon)
        return F_DONE;
    swap(old, new);
    filGetVal(maxdiff) = 0.0;
    return F_CONTINUE;
}
One processor executes this code on every node at the end of every update phase, i.e., after every
filament in the phase has been executed once. If check returns F_CONTINUE, the next iteration is
performed; each filament will update a point on the next iteration. If it returns F_DONE, then the
computation terminates.
The initialization section is executed on each node, because each address space must be initialized.
void main() {
    /* create and initialize the shared variables */
    filInit();
    startRow = filMyNode() * n/filNumNodes();
    endRow = startRow + n/filNumNodes() - 1;
    phase = filCreatePhase(jacobi, check);
    pool = filCreatePool(phase, numFil);
    for (i = startRow; i <= endRow; i++) {
        /* determine which pool to use for this row */
        poolId = i * filNumProcessors()/n;
        for (j = 1; j < n-1; j++)
            filCreateFilament(phase, pool[poolId], i, j);
    }
    filStart();
}
The call to filInit initializes the Filaments package. The call of filCreatePhase creates a phase,
which contains filaments that execute the function jacobi. The other argument is a pointer to a
function defining the post-phase code. The filCreatePool call creates an array of pools, one per
processor, where each pool contains space for numFil filaments. The filCreateFilament routine
creates a single filament. Each filament is defined by i and j, which are passed as arguments
to jacobi. Variables poolId, startRow, and endRow are discussed below. The final Filaments
package call, filStart, starts the parallelism. All previously created phases are completed before
filStart returns.

[Figure 2: Distributing 4 × 4 filaments in pools on various machines. The phase holds pointers to the filament code (jacobi) and the post-phase code (check); each filament descriptor is an (i, j) pair from (0,0) to (3,3). (a) Uniprocessor: 1 node, 1 processor; one 1 × 16 element pool. (b) Multiprocessor: 1 node, 4 processors; one 4 × 4 element pool. (c) Multicomputer: 4 nodes, 4 processors; four 1 × 4 element pools. (d) Hybrid: 2 nodes, 4 processors; two 2 × 4 element pools.]
Architecture-Independent Assignment of Filaments to Processors and Nodes

A Filaments program must distribute the filaments among the processors and nodes in order to balance
the load. An efficient distribution for Jacobi iteration is block, in which every processor updates
a contiguous strip of rows in the new array. This results in a load-balanced program with good
locality. The initialization code above shows a general way to distribute filaments in blocks for multiprocessors, multicomputers, and networks of multiprocessors. The values startRow and endRow
are the start and end of the block assigned to the node. The variable poolId contains the
processor that is assigned to the row. This simple method of assigning filaments to pools can
support all regular distributions on multiprocessors, multicomputers, and hybrid machines. More
importantly, it makes the code architecture independent; no change is needed when porting the
code. Additional information on automatic filament assignment can be found in [LA96, Low98].
Figure 2 illustrates ways to distribute filaments between processors and nodes using phases
and pools on several different architectures. The example shows the block distribution of a 4 × 4
matrix. It uses one phase and one pool per node. The phase contains the filament code pointer
(jacobi) and the post-phase code pointer (check). The pool is an array, with one element for every
processor. Figure 2(a) shows the pools on a uniprocessor. There is a one-element pool containing
all 16 filament descriptors. In the code fragment above that shows initialization, startRow and
endRow would be 0 and 3, and poolId would always be 0. Figure 2(b) shows the pools on a
multiprocessor with four processors. There is an array of four pools, where each array element
contains four filament descriptors; startRow and endRow would again be 0 and 3, but poolId now
will vary between 0 and 3 because filNumProcessors() is 4. Figure 2(c) shows the pools on a
multicomputer with four nodes. There are four distinct pools; each pool contains four filament
descriptors. Here filNumNodes() is 4 and filNumProcessors() is 1. Therefore, on each node,
startRow equals endRow, which equals filMyNode(). On all nodes, poolId is always 0. Figure 2(d)
shows the pools on a hybrid machine with two nodes and four processors; filNumNodes() and
filNumProcessors() are both 2, which results in values for startRow and endRow of 0 and 1 on
one node and 2 and 3 on the other. Variable poolId varies between 0 and 1 on both nodes. There
are two arrays of two pools each.
The basic method for pool and node assignment above easily extends to any distribution and
any number of pools and pool sets, because every regular distribution can be described with simple
algebraic equations. For example, another common distribution is cyclic, where filaments are
assigned to processors and nodes in a modulo fashion. This distribution is common in linear
algebra solvers such as LU Decomposition, Gaussian elimination, and QR factorization. A cyclic
distribution can be realized by (1) setting startRow equal to the node's id, endRow to the upper
bound of the loop, and incrementing the loop index by the number of nodes, and (2) setting poolId
to i modulo the number of processors. Note that step (1) is necessary to ensure that filaments are
distributed cyclically to nodes and step (2) is necessary to ensure that filaments are distributed
cyclically to processors.
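The fragment below is a sketch of how the block-distribution initialization shown earlier might be adapted to a cyclic distribution. It is illustrative rather than taken from the Filaments distribution; it assumes the same fil* calls and argument order as the Jacobi fragment above, and body and post stand for the filament and post-phase functions.

phase = filCreatePhase(body, post);
pool = filCreatePool(phase, numFil);
/* step (1): deal rows to nodes in round-robin order */
for (i = filMyNode(); i < n; i += filNumNodes()) {
    /* step (2): deal this node's rows to processors in round-robin order */
    poolId = i % filNumProcessors();
    for (j = 1; j < n-1; j++)
        filCreateFilament(phase, pool[poolId], i, j);
}
filStart();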
2.2 Adaptive Quadrature: Fork/Join Filaments
This section introduces fork/join filaments for programming divide-and-conquer applications. Consider the problem of approximating the integral

\[ \int_a^b f(x)\,dx. \]
One method to solve this problem is adaptive quadrature, which works as follows. Divide an interval
in half, approximate the areas of both halves and of the whole interval, and then compare the sum
of the two halves to the area of the whole interval. If the difference of these two values is not within
a specified tolerance, recursively compute the areas of both intervals and add them.
The simplest way to program adaptive quadrature is to use a divide-and-conquer approach.
Because subintervals are independent, a new filament computes each subinterval. Hence this application uses fork/join filaments. The more rapidly the function is changing, the smaller the interval
needed to obtain the desired accuracy. Therefore, the work load is not uniformly distributed over
the entire interval.
The computational routine for adaptive quadrature is:
double quad(double a, double b, double fa, double fb, double area) {
    double left, right, fm, m, aleft, aright;
    /* compute midpoint m and areas under f() from a to m (aleft)
     * and m to b (aright) */
    if (close enough)
        return aleft + aright;
    else {
        /* recurse, forking two new filaments */
        filFjFork(quad, left, a, m, fa, fm, aleft);
        filFjFork(quad, right, m, b, fm, fb, aright);
        /* wait for children to complete */
        filFjJoin();
        return left + right;
    }
}
The algorithm above evaluates f() at each point just once and evaluates the area of each interval
just once. Previously computed values and areas are passed to new filaments to avoid recomputing
the function and areas. If the computed estimate is not close enough, the program forks a filament
for each of the two subintervals and then waits for them to complete. The results are stored in
left and right.
The initialization section for adaptive quadrature is similar to that of Jacobi iteration; for
fork/join applications the filStart routine serves as an implicit join to the initial filFjFork call.
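A sketch of such an initialization is shown below; it is illustrative only. The argument order of filFjFork follows the quad fragment above, and the interval endpoints a and b, the function f, and the initial estimate are assumptions made for the example.

void main() {
    double total, area0;
    filInit();
    /* area0 is a simple trapezoidal estimate for the whole interval [a, b] */
    area0 = (b - a) * (f(a) + f(b)) / 2.0;
    /* fork the root filament; its result is returned in total */
    filFjFork(quad, total, a, b, f(a), f(b), area0);
    /* filStart serves as the implicit join on this initial fork */
    filStart();
}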
3 Filaments Implementation
A Filaments program runs efficiently on shared- and distributed-memory machines without any
changes to the source code. Conceptually, this is accomplished with fine-grain parallelism and
shared-variable communication, which virtualize the machine in terms of the number of processors
and the communication mechanism. Physically, this is accomplished through a well-designed API
and an efficient implementation of the Filaments package.
Although a fine-grain, shared-variable model plus a well-designed API provides a simple and
portable way to program, achieving efficiency, and hence architecture independence, requires careful
implementation. The potentially high costs of process creation, process synchronization, context switching, and excess memory usage present obstacles to efficient implementations of fine-grain parallelism. Furthermore, efficiently implementing shared-variable communication must contend with potentially
long latencies for reading or writing remote values on a distributed-memory machine. Fortunately,
fine-grain parallelism integrates well with shared-variable communication, which presents many opportunities. In particular, fine-grain programs typically create many units of work, which makes
it easier to mask the aforementioned communication latency by overlapping communication with
computation (as it is easier to find useful work to do while waiting for a response). Additionally, the
creation of many units of work makes it easier to balance the load automatically. Filaments achieves
efficiency through three key techniques: very lightweight context switches, runtime "coarsening"
to eliminate overhead, and a multithreaded distributed shared memory to tolerate memory latency
on a distributed-memory machine.
The following section discusses the Filaments API. Section 3.2 describes the implementation
of fine-grain parallelism. The last section contains a description of the multithreaded distributed
shared memory system that implements the shared-variable communication on a distributed-memory machine.
3.1 Filaments Programming Model
The first key to providing architecture independence is to provide the correct level of abstraction.
As discussed above, fine-grain parallelism and implicit communication are an excellent abstraction.
This section describes the manifestation of the abstraction through the Filaments programming
model.
Section 2 gives the flavor of the Filaments programming model. It consists of library calls and
preprocessor macros. Every Filaments operation is of the form:

fil<Verb>[<Nouns>]([<args>]);

All operations begin with fil and are followed by an action, a subject, and arguments (the last two are
optional). Verb is the action, such as Get, Allocate, or Declare. Nouns is a string of zero or more
nouns that determine the subject of the action. And of course, there may be some call-dependent
arguments. Examples are filGetVal, filAllocateMemory, and filStart.
The Filaments API is a separate layer that lies between the application program and the different Filaments subsystems: filaments, threads, and distributed shared memory (see below). The
API masks architecture differences, completely freeing the programmer from low-level details. For
example, filAllocateMemory has a different implementation on shared- and distributed-memory
machines; the API calls the appropriate routine. The API also provides a uniform interface to the
package, despite the fact that there are three largely independent subsystems.
There are two different implementations of the Filaments package. Shared Filaments (SF) runs
on shared-memory multiprocessors and Distributed Filaments (DF) runs on distributed-memory
multicomputers. However, every Filaments source uses the same programming model, regardless
of whether it will execute on a shared- or distributed-memory machine. All machine-specific functionality is provided by the package. The difference between a shared- and a distributed-memory
program is only how each is compiled.
The Filaments programming model consists of both function calls and macros. Libraries provide
the Filaments function calls and are different for SF and DF; therefore, the linker uses either the
SF or the DF version as appropriate.

A macro, however, differs in the source code. Therefore, the program defines either the SF or DF compiler
variable via a command-line compiler option (e.g., -DSF) to select the correct macro definition.
Inside the Filaments package, a typical macro looks like:
#ifdef SF
... /* SF-specific definition */
#else
... /* DF-specific definition */
#endif
Although a macro often is used as an "optimized" function call, in Filaments the macro preprocessing also is necessary for machine-specific code. For example, recall that a reduction variable is
a special variable with one copy per node. The following macro defines a reduction variable of type
double and name sum:
filDeclareRedVar(double, sum);
On a distributed-memory machine, address spaces are independent, so the declaration simply expands to:
double sum;
However, on a shared-memory machine, all processors must share a reduction variable. To avoid
locking, all processors access a private copy by indexing into a shared array using their processor
ID. Furthermore, the elements are padded to the size of a cache block in order to avoid false sharing.
Thus, the macro expansion in SF is:
union {
    double value;
    int pad[F_STRIDE];
} sum[F_MAX_PROCS];
Because the internal structure of a reduction variable is different, all program-level accesses occur
through macros as well. The following Filaments macro provides an l-value or r-value for the
reduction.
#ifdef SF
#define filGetVal(r) r[fMyProcessor()].value
#else
#define filGetVal(r) r
#endif
Thus, the use of a reduction looks like:
filGetVal(sum) = 5.0;
which expands to either
sum = 5.0;
in DF or
sum[fMyProcessor()].value = 5.0;
in SF. The key point is that the application source does not change.
Above we explain how both a shared-memory and distributed-memory Filaments program can
be created from the same source program. The SF and DF runtimes also are created from the
same source code. There are two levels of differences; the first is between SF and DF. For example,
implementing load balancing is different on shared-memory machines than on distributed-memory
ones. We also must handle differences between different architectures and operating systems within
SF and within DF. The most significant such difference is the context switching code. Also, to a
lesser extent, message passing and signal processing differ between distributed-memory machines.
The Filaments package uses two techniques to create different implementations from the same
source. The first is conditional compilation, and the second is symbolic links. For small, localized
differences conditional compilation is sufficient and manageable. This is the primary way that the
differences between SF and DF are handled. However, for vastly different implementations it is
simpler to use different machine-specific source files and create a symbolic link to the appropriate
file for a particular instance of the implementation. For example, message passing code differs
greatly, so the distribution has a communication file for each different architecture. When the
instance of the implementation is specified, the configuration script creates a symbolic link from a
generic file name to the machine-specific file. Accessing the generic file locates the machine-specific
file; hence, those definitions of the function calls are used to build the libraries.
The runtime consists of three different libraries.
Filaments (libf.a) contains code for creating, scheduling, executing, and synchronizing filaments.
Threads (libt.a) provides traditional light-weight threads optimized for the Filaments package.
Distributed Shared Memory (libdsm.a) implements logically shared memory on a distributed-memory machine.
The filaments library, libf.a, is almost completely independent of the processor and the operating
system. However, SF and DF use different run-time data structures that are optimized for each
implementation. Conditional compilation is sufficient for this library.
In contrast, libt.a is heavily dependent on the architecture and operating system because it contains
the context switching code. Therefore, the system uses a symbolic link to the appropriate
source for the context switching code.
The last library, libdsm.a, is exclusively for DF. It is dependent on the operating system,
specifically sockets and signals. This library uses both of the above techniques. For example, the
location of the address that caused a segmentation violation (page fault) depends on both the
operating system and the type of processor. To handle this, the code uses conditional compilation.
However, it is easier to use a symbolic link to select the different message-passing implementation,
as discussed above.
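As a concrete, if hypothetical, illustration of this kind of conditional compilation, the fragment below extracts the faulting address in a SIGSEGV handler; siginfo_t and its si_addr field are the standard Solaris and Linux interfaces, while the DF_SOLARIS/DF_LINUX macros and the dsmHandleFault and faultAddrFromContext routines are assumptions made for the example.

#include <signal.h>

extern void dsmHandleFault(void *addr);             /* hypothetical DSM entry point */
extern void *faultAddrFromContext(void *context);   /* hypothetical fallback */

static void segvHandler(int sig, siginfo_t *info, void *context) {
#if defined(DF_SOLARIS) || defined(DF_LINUX)
    void *faultAddr = info->si_addr;    /* both OSes report the address in siginfo */
#else
    void *faultAddr = faultAddrFromContext(context);
#endif
    dsmHandleFault(faultAddr);
}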
3.2 Implementing Efficient Fine-Grain Parallelism
Achieving efficient fine-grain parallelism requires careful implementation. This section first describes the elements common to both SF and DF. Next, it describes the SF-specific aspects. Finally,
it describes those same aspects in the context of DF.
3.2.1 Common Filaments Concepts
A filament does not have a private stack; it consists only of a (shared) function pointer, arguments
to that function, and, for fork/join filaments, a few other fields such as a parent pointer and the
number of children. Filaments are executed one at a time by server threads, which are traditional
lightweight threads with a private stack. Each processor creates at least one server thread. The
server thread's stack provides a place for temporary storage so that filaments can compute partial
results without using a private stack per filament (which would necessitate expensive context switching).
The SF system creates one, and only one, thread per processor. These threads execute filaments and
coordinate with each other. In contrast, DF initially creates one thread on each node. More threads
are created on the nodes as needed to overlap communication and computation.
Fork/join filaments create work dynamically; therefore they are quite different from iterative
filaments. Most importantly, fork/join filaments must be capable of blocking and later resuming
execution; this requires some method of saving state. The server stack stores suspended filaments,
which eliminates the need for private filament stacks. However, suspended filaments must be resumed
in inverse order. As with iterative filaments, we want one server thread to execute many fork/join
filaments to reduce the overhead of execution. Therefore, when a filament executes a join, it loops
waiting for its child counter to become zero. But while waiting, the parent can execute other
threads. In other words, the parent becomes the server thread.
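An illustrative sketch of this join behavior follows (it is not Filaments source code; the descriptor type and the helper routines are hypothetical).

typedef struct FilDesc FilDesc;
struct FilDesc { volatile int numChildren; /* other fields omitted */ };

extern FilDesc *nextRunnableFilament(void);  /* hypothetical scheduler hook */
extern void runFilament(FilDesc *f);         /* hypothetical */

void joinSketch(FilDesc *self) {
    /* While waiting for its children, the parent acts as the server thread
     * and runs other filaments rather than spinning idly. */
    while (self->numChildren > 0) {
        FilDesc *other = nextRunnableFilament();
        if (other != NULL)
            runFilament(other);
    }
}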
Fork/join programs tend to employ a divide-and-conquer strategy. The computation starts on
just one processor. To get all processors involved in the computation, new work (from forks) is
given to idle processors. Then, once all processors have work, additional load balancing may be
required to keep the processors busy.
Many Filaments programs attain good performance with little or no optimization. For example,
in matrix multiplication, each filament performs a significant amount of work (O(n) multiplications
and additions), which amortizes the filament overheads. However, achieving good performance
for iterative applications that possess many filaments that perform very little work (e.g., Jacobi
iteration) requires using implicit coarsening, which allows filaments in a pool to be executed as if
the application were written as a coarse-grain program.[2] This reduces the cost of running filaments,
reduces the working set size to make more efficient use of the cache, and uses code that is easier
for compilers to optimize. To implement implicit coarsening, we use two techniques: inlining and
pattern recognition.
Inlining consists of directly executing the body of each filament rather than making a procedure
call. In particular, when processing a pool, a server thread executes a loop, the body of which is
the code for the filaments in the pool. This eliminates a function call for each filament, but the
server thread still has to traverse the list of filament descriptors in order to load the arguments.
The second technique is to recognize common patterns of filaments at run time. Filaments
recognizes regular patterns of filaments assigned to the same pool. In such cases, the package at
run time switches to code that iterates over the filaments, generating the arguments in registers
rather than reading the filament descriptors from memory. Filaments currently recognizes a few
common patterns that support a large subset of regular problems; however, the technique is capable
of supporting any number of other patterns. Both inlining and pattern recognition are implemented
without compiler support; instead, they use special macros that cause multiple versions of functions
to be generated. At run time the proper version is invoked.

[2] Systems such as Chores [EZ93] and the Uniform System [TC88] have a fine-grain specification and a coarse-grain
execution model, but use preprocessor support to create a machine-specific executable at compile time. Filaments
generates different codes at compile time and chooses among them at run time depending on the machine.
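The contrast between plain inlining and pattern-recognized execution can be sketched as follows for the Jacobi filament body of Section 2.1; this is illustrative only, and the descriptor layout and loop bounds are hypothetical.

typedef struct { int i, j; } FilArgs;

/* Inlined execution: the jacobi body is expanded into the loop, but each
 * argument pair is still loaded from a filament descriptor in memory. */
void runPoolInlined(FilArgs *desc, int count, double **new, double **old) {
    for (int k = 0; k < count; k++) {
        int i = desc[k].i, j = desc[k].j;
        new[i][j] = (old[i-1][j] + old[i+1][j] +
                     old[i][j-1] + old[i][j+1]) * 0.25;
    }
}

/* Pattern-recognized execution: the pool holds a dense block of rows, so the
 * (i, j) arguments are generated in registers and no descriptors are read. */
void runPoolPattern(int loRow, int hiRow, int n, double **new, double **old) {
    for (int i = loRow; i <= hiRow; i++)
        for (int j = 1; j < n-1; j++)
            new[i][j] = (old[i-1][j] + old[i+1][j] +
                         old[i][j-1] + old[i][j+1]) * 0.25;
}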
For fork/join filaments the complementary optimization is dynamic pruning. When enough
work has been created to keep all processors busy, forks are turned into procedure calls and joins
into returns. This avoids excessive overheads due to filament creation and synchronization. There
is a danger with pruning, however: it is possible for a processor to traverse an entire subtree
sequentially after other processors become idle. To avoid this, whenever another processor needs
work, a processor executing sequentially returns to executing in parallel and creates new work.
This is implemented by having a server thread check a flag (each time) before deciding whether
to fork filaments or call them directly. The above optimizations are discussed in greater detail in
[LFA96].
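A sketch of what pruning looks like for the adaptive quadrature filament is given below; it is illustrative rather than actual Filaments code, workRequested is a hypothetical flag that the package sets when some processor runs out of work, and the midpoint computation is elided exactly as in the quad fragment of Section 2.2.

extern volatile int workRequested;   /* hypothetical: set when a processor is idle */

double quadPruned(double a, double b, double fa, double fb, double area) {
    double left, right, fm, m, aleft, aright;
    /* compute midpoint m and areas aleft and aright, as in quad() */
    if (close enough)
        return aleft + aright;
    if (!workRequested) {
        /* enough parallelism exists: a fork becomes a call and a join a return */
        left = quadPruned(a, m, fa, fm, aleft);
        right = quadPruned(m, b, fm, fb, aright);
    } else {
        /* another processor is idle: resume creating filaments */
        filFjFork(quadPruned, left, a, m, fa, fm, aleft);
        filFjFork(quadPruned, right, m, b, fm, fb, aright);
        filFjJoin();
    }
    return left + right;
}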
Implicit coarsening and pruning reduce the cost of running filaments, reduce the working set
size to make more efficient use of the cache, and use code that is easier for compilers to optimize.
They enable efficient execution of very fine-grain programs.
3.2.2 Elements Specific to Shared Filaments
Three elements of the Filaments package have a different implementation in Shared Filaments (SF)
than in Distributed Filaments: memory allocation, reductions, and acquiring filaments from other
processors or nodes.
Memory allocation in SF is very straightforward because there is hardware support for shared
memory. Consequently, Filaments is able to use the vendor-supplied memory allocation routine to
allocate shared memory.
There are two primary ways to implement reduction variables in SF. On the machines
that currently support Filaments, we have found that the most efficient way to represent reduction
variables is by an array of P elements, where P is the number of processors. (The alternative
way is to use one shared variable and protect it with a lock. Variable access is cheaper, but
locking overhead is more expensive.) Processors update the array element corresponding to their
id. When all processors reach the reduction point, processor 0 reduces all the elements of the array
into one element and then writes this value into all elements. All processors perform a barrier
synchronization at the beginning and end of the reduction, thus ensuring both that processor 0
does not start reducing too soon and that the other processors do not read the value before it is
final.
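The reduction step itself can be sketched as follows, using the padded per-processor array from Section 3.1 and a MAX operator; this is an illustration rather than Filaments source, and barrier() and P stand in for the package's barrier primitive and processor count.

barrier();                             /* every processor has written its private copy */
if (fMyProcessor() == 0) {
    double result = sum[0].value;
    for (int p = 1; p < P; p++)        /* reduce all copies into one value */
        if (sum[p].value > result)
            result = sum[p].value;
    for (int p = 0; p < P; p++)        /* write the reduced value into every copy */
        sum[p].value = result;
}
barrier();                             /* no processor reads before the value is final */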
Implementing fork/join filaments efficiently requires acquiring filaments from other processors.
Because filaments in SF are created in shared memory, acquisition is relatively straightforward. An
idle processor scans all other processors' lists and takes a filament off the list that has the most
filaments. SF takes care to ensure the integrity of the lists; a test-and-test-and-set method is used.
The scan proceeds without any mutual exclusion, but when a processor decides to remove filaments
from another list, it locks the list, rechecks to make sure the appropriate number of filaments is
still on the list, removes them, and then unlocks the list.
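An illustrative sketch of this scan follows (not Filaments source; the list type, its fields, and the locking routines are hypothetical, and only a single filament is removed for brevity).

typedef struct FilDesc FilDesc;          /* filament descriptor (opaque here) */
typedef struct { int locked; } Lock;     /* stand-in for the package's lock type */
typedef struct { int count; Lock lock; FilDesc *head; } FilList;

extern void lock(Lock *l);
extern void unlock(Lock *l);
extern FilDesc *removeFilament(FilList *list);

FilDesc *stealFilament(FilList *lists, int P) {
    int best = -1, most = 0;
    for (int p = 0; p < P; p++)          /* unlocked scan: find the longest list */
        if (lists[p].count > most) { most = lists[p].count; best = p; }
    if (best < 0)
        return NULL;                     /* nothing to steal */
    lock(&lists[best].lock);             /* test-and-test-and-set: lock, then re-check */
    FilDesc *f = (lists[best].count > 0) ? removeFilament(&lists[best]) : NULL;
    unlock(&lists[best].lock);
    return f;
}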
3.2.3 Elements Specific to Distributed Filaments
This section describes the implementation of the same three elements (memory allocation, reductions, and acquiring filaments from other processors) in Distributed Filaments (DF).
Memory allocation in DF is quite different from that in SF because there are shared and private
sections of memory. The shared section contains the user data that is shared between the nodes.
It is imperative that each node allocate the shared data identically, in order that pointers into
the shared section have the same meaning on all nodes. Shared memory in DF is provided by a
multithreaded DSM, which is discussed in the next section. To ensure that all nodes start at the
same address, DF performs an internal max reduction on the start of the DSM area.
As in SF, a reduction implies a barrier synchronization, but in DF a barrier is between the nodes;
therefore, DF uses messages to perform reductions. DF allocates a local copy of the reduction
variable in each processor, similar to SF. After the local copies have been merged into one copy,
values on the nodes are merged. At the reduction, DF combines the value at each node into one
global value using a tournament reduction. There are O(log2 N) steps in the reduction (N is the
number of nodes). In each step, half of the participating nodes send a reduction message (containing
the local reduction value) to another node and then drop out of the tournament. When a node
receives a reduction message, it merges the value in the message with its local value. After the last
step, all the local values have been reduced into one value, which is then disseminated back to all
the nodes in the reduction reply message.
A tournament reduction scales well and has low expected latency [HFM88]. There are 2(N - 1)
messages in the reduction, which is minimal. The latency is O(log2 N) steps. Although there are
other methods with less latency, each requires more messages. For example, each node could send
its value to every other node. This has a minimal latency of 1 step; however, there are N^2 messages.
(A dissemination barrier has O(log2 N) steps, but has O(N log2 N) messages.)
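The tournament itself can be sketched as follows (illustrative only, not Filaments source; sendValue, recvValue, and mergeMax stand in for the package's message-passing and merge routines). Node 0 ends up holding the reduced value and sends the reply messages, giving the 2(N - 1) message total mentioned above.

extern void sendValue(int node, double v);   /* hypothetical: send v to node */
extern double recvValue(int node);           /* hypothetical: blocking receive from node */

static double mergeMax(double a, double b) { return a > b ? a : b; }

double tournamentReduce(int myId, int N, double local) {
    for (int step = 1; step < N; step *= 2) {
        if (myId & step) {
            sendValue(myId - step, local);   /* send local value up and drop out */
            return recvValue(0);             /* wait for the reply from node 0 */
        }
        if (myId + step < N)
            local = mergeMax(local, recvValue(myId + step));   /* merge incoming value */
    }
    for (int k = 1; k < N; k++)              /* node 0: disseminate the reduced value */
        sendValue(k, local);
    return local;
}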
3.3 The Multithreaded Distributed Shared Memory System
The Filaments DSM provides logically shared memory for Filaments programs executing on distributed-memory machines. It is multithreaded, extensible, and implemented entirely in software.
Multithreading enables the overlapping of communication and computation, which effectively mitigates message latency and, hence, greatly improves performance for many applications. Filaments
programs have numerous independent filaments that can be executed in any order, which enhances
the ability to overlap communication and computation.
The DSM supports extensibility by means of user-defined page consistency protocols (PCP). A
PCP provides page-based memory consistency. The DSM itself supports only the general framework; the specific
operations occur by means of upcalls to user-defined routines, as described in [FA96, Fre96]. It is
easy to create, test, and evaluate new PCPs, which facilitates experimentation and can lead to a
more efficient system.
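The general shape of such a protocol might look like the following table of upcalls; this is purely a hypothetical illustration, and the actual upcall set used by the Filaments DSM is defined in [FA96, Fre96].

typedef struct Page Page;   /* per-page state kept by the DSM framework */

typedef struct {
    void (*onReadFault)(Page *pg);                   /* obtain a readable copy */
    void (*onWriteFault)(Page *pg);                  /* obtain write access or ownership */
    void (*onPageRequest)(Page *pg, int requester);  /* owner-side request service */
    void (*onPageReply)(Page *pg, void *data);       /* install an arriving page */
} PageConsistencyProtocol;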
The Filaments DSM is implemented entirely in software. Consequently, it is flexible, inexpensive, and portable. Software is more flexible and less expensive to develop and produce than
hardware. It requires some operating system support, but has been ported to many different
machines and operating systems (SPARC/Solaris, PentiumPro/Solaris, and PentiumPro/Linux).
The DSM divides the address space of each node into both shared and private sections. Shared
user data (matrices, linked lists, etc.) are stored in the shared section, while local user data
(program code, loop variables, etc.) and all system data structures (queues, page tables, etc.)
are stored in the private section. This scheme reduces the amount of data that must be shared.
Moreover, some data, such as the call frame, is not shared, so it belongs in the private memory.
A DSM is more efficient when nodes use private memory correctly.
To date, every PCP that has been developed for the Filaments package employs distributed page
management because this is significantly more efficient than using a centralized page manager, as
discussed in [LH89]. In particular, each DSM page is owned by a node. The owner of a page
services all page requests and maintains the state of the page. Page ownership migrates to a node
that is updating the page. When a single node exclusively accesses a page, it owns it. Moreover,
a node will not own a page it does not access. In this way, the overhead of managing the pages is
shared among the nodes.
Multithreading allows the DSM to overlap communication and computation in order to mitigate
the latency of remote page faults. On a multiprocessor this latency is very small. However, on a
distributed-memory machine the latency for servicing a remote page fault is large because servicing
a fault requires messages. By using multithreading, a ready thread can execute while another
thread is blocked waiting for a remote page fault to be serviced. Thus, progress can be made while
the page fault is outstanding. Fine-grain parallelism enables overlapping because of the tremendous
amount of parallelism available: when one filament faults, there are literally hundreds or thousands
of other filaments that could execute.
The multithreaded DSM provides overlap by multithreading server threads. After a remote page
request is made, the faulting server thread is suspended and a ready server thread is scheduled for
execution. This server thread executes filaments in a different pool, which access different data;
this avoids multiple faults on the same page. When the page arrives at the node, the server thread
that faulted (and any others that have also faulted on this page) is rescheduled. The DSM provides
mechanisms to suspend and resume threads, including the queues for holding ready, idle, and
suspended threads. However, the page-consistency protocol must provide overlap during the page
fault and message handlers.
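An illustrative sketch of this fault path follows; it is not Filaments source code, and the Page structure, queue type, and helper routines are hypothetical.

typedef struct ServerThread ServerThread;
typedef struct Queue Queue;

typedef struct {
    int id, owner;              /* page identity and owning node */
    int requestOutstanding;     /* has a request already been sent? */
    Queue *waiters;             /* server threads blocked on this page */
} Page;

extern void sendPageRequest(int owner, int pageId);
extern ServerThread *currentServerThread(void);
extern void enqueue(Queue *q, ServerThread *t);
extern ServerThread *dequeue(Queue *q);
extern int queueEmpty(Queue *q);
extern void makeReady(ServerThread *t);
extern void scheduleReadyServerThread(void);
extern void installPage(Page *pg, void *data);

/* Remote page fault: request the page, park the faulting server thread, and
 * run another server thread so computation overlaps the communication. */
void onRemotePageFault(Page *pg) {
    if (!pg->requestOutstanding) {
        sendPageRequest(pg->owner, pg->id);
        pg->requestOutstanding = 1;
    }
    enqueue(pg->waiters, currentServerThread());
    scheduleReadyServerThread();    /* its filaments work on a different pool */
}

/* Page reply: install the data and wake every thread that faulted on it. */
void onPageReply(Page *pg, void *data) {
    installPage(pg, data);
    pg->requestOutstanding = 0;
    while (!queueEmpty(pg->waiters))
        makeReady(dequeue(pg->waiters));
}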
4 Performance
This section evaluates the performance of the Filaments package. First we present results of eight
kernels on both a shared- and a distributed-memory machine. Then, we investigate the performance of a larger application, Water-Nsquared, which is from the Splash suite [PWG91]. Finally,
we examine one of the kernels, Jacobi iteration, in depth. Note that we report on application-independent overheads such as filament creation, barrier synchronization, and message overheads
elsewhere [FLA94]. The overheads have been found to be quite small.
All shared-memory programs were run on a Silicon Graphics Challenge multiprocessor, which
has twelve 100 MHz processors running IRIX, separate data and instruction caches of 16K each,
and a 1 megabyte secondary data cache. The (shared) main memory size is 256 megabytes. All
distributed-memory programs were run on a network of Pentium Pro workstations connected by a
100 Mbps Fast Ethernet. Each workstation has a 200 MHz processor running Solaris, a 64K mixed
instruction and data cache, and 64 megabytes of main memory. The cluster is isolated so that no
outside network traffic interferes with executing programs.
Each test was run at least three times, and the reported result is the median. The tests use
problem sizes so that the sequential time is between 50 and 100 seconds. All programs were compiled
with the vendor-supplied compiler with the optimization flag on and were run when no other user
was on the machine. In practice we have found test results to be very consistent; the other UNIX
daemon processes do not interfere.
4.1 Scientific Programming Kernels
This section describes the performance of 8 kernels: matrix multiplication, Jacobi iteration, LU
decomposition, Tomcatv, Mandelbrot set, adaptive quadrature, Fibonacci, and binary expression
trees. Programs are tested on both the SGI and the cluster of Pentium Pros. Again, we stress
that the programs are run unchanged on the different machines. Only a few lines in the Filaments
configuration file (to indicate shared- or distributed-memory and which type of architecture and
operating system is used) must be changed.
Kernel                     Work Load   Data Sharing   Synchronization   Notes
Matrix Multiplication      Static      Light          None              1
Jacobi Iteration           Static      Medium         Medium            2
LU Decomposition           Static      Heavy          Heavy             3
Adaptive Quadrature        Dynamic     None           Medium            4
Tomcatv                    Static      Medium         Medium            5
Mandelbrot                 Dynamic     Light          None              6
Fibonacci                  Dynamic     None           Medium            7
Expression Trees           Dynamic     Heavy          Medium            8

Notes:
1. Only synchronization is termination detection. All shared data is read-only.
2. Edge sharing with neighbors. Compute maximum change between iterations.
3. Decreasing work. Every iteration disseminates values to all.
4. Fork/join parallelism. No data sharing.
5. Edge sharing with neighbors.
6. Variable amount of work.
7. Fork/join parallelism. No data sharing.
8. Fork/join parallelism; constant work per filament.

Table 1: Application kernels and their characteristics.
Table 1 summarizes the eight kernels and shows three properties of each: work load, data
sharing, and synchronization. Work load is a characterization of whether the number of tasks
or the amount of work per task can be determined statically at compile time or whether it has
to be determined at run time. Data sharing is a measure of the extent to which data is shared.
Synchronization is a measure of the amount of interprocess synchronization.
For each kernel test, we present the time in seconds to execute it with 1, 2, 4, and 8 processors.
Speedup is given relative to a sequential program, which is a separate uniprocessor program that
contains no calls to the Filaments library. Care was taken to make the sequential program as
efficient as possible.
Table 2 shows the performance of the kernels on an SGI Challenge. All programs except for
adaptive quadrature attain a speedup of at least 4.7 on 8 processors. Speedup tapers off because
the problem sizes are fixed, which leads to decreasing amounts of work per processor as the number
of processors grows [SHG93]. The poor performance of adaptive quadrature is unexplained.
It is an anomaly that is specific to the 12-processor SGI Challenge; on an older, 4-processor SGI
Challenge, adaptive quadrature achieves good speedup on 2 and 4 processors.
Note that on the grid-based, iterative applications, the sequential programs are between 5 and
10% faster than the Filaments one-processor program. This is primarily because of better register
allocation. The vendor-supplied compiler will never place global variables in registers, and the
Filaments programs must make certain variables global. For example, in Jacobi iteration, the
outermost timestep variable and the reduction variable must be global, because they are accessed
in multiple procedures. In the sequential program, the corresponding variables are local.
SGI kernel                                       Sequential   P=1    P=2    P=4    P=8
Matrix Multiplication, 600 × 600                    67.2      70.7   42.4   24.1   14.1
Jacobi Iteration, 720 × 720, 100 iterations         56.0      60.4   31.6   16.7   10.0
Tomcatv, 300 × 300                                  65.3      70.5   36.6   18.8   9.91
LU Decomposition, 720 × 720                         54.0      60.2   31.1   16.2   11.1
Mandelbrot, 920 × 920                               50.3      51.2   26.8   14.3   8.10
Binary Expression Trees, 150 × 150, 64 levels       61.7      62.1   32.5   19.9   11.3
Adaptive Quadrature, interval 1-24                  67.2      72.0   51.8   38.2   33.5
Fibonacci, 39                                       54.8      55.4   32.7   16.5   10.4

Table 2: Performance of kernels on SGI (times in seconds).

Pentium Pro cluster kernel                       Sequential   P=1    P=2    P=4    P=8
Matrix Multiplication, 920 × 920                    58.8      62.0   37.3   20.2   14.2
Jacobi Iteration, 800 × 800, 360 iterations         53.1      62.8   37.7   21.8   16.1
Tomcatv, 512 × 512                                  56.8      57.7   31.8   17.4   10.7
LU Decomposition, 1200 × 1200                       81.7      85.2   50.0   27.1   17.7
Mandelbrot, 1600 × 1600                             74.8      75.3   42.6   21.0   10.7
Binary Expression Trees, 360 × 360, 32 levels       97.4      97.6   60.4   33.9   24.7
Adaptive Quadrature, interval 1-25                  62.0      62.5   31.4   20.8   9.50
Fibonacci, 42                                       82.3      83.5   41.8   23.0   15.3

Table 3: Performance of kernels on Pentium Pro cluster (times in seconds).

Table 3 shows the performance of the kernels on a cluster of Pentium Pros. The difference
between the sequential program and the one-processor Filaments program arises for the same
reasons as with the multiprocessor tests. All programs other than Jacobi iteration and expression
trees attain a speedup of over 4 on 8 nodes. The speedup for Jacobi is poor on 8 nodes because of
the significant time to initially distribute data (via page faults). This overhead is amortized over the
number of iterations executed, so increasing the number of iterations improves speedup. Also, as
with the shared-memory tests, the fixed problem size causes speedup to level off. Furthermore, the
sequential Jacobi program was significantly rewritten to get the best performance. The expression
tree kernel gets only modest speedup because of the significant data movement in sequential parts
of the code (near the top of the tree). In these sections of code, no overlap of communication and
computation is possible.
4.2 Water-Nsquared
Water-Nsquared is an N-body molecular dynamics application. It evaluates the forces and potentials
in a system of water molecules in the liquid state. The computation iterates for some number of
time-steps. Every time-step solves the Newtonian equations of motion for molecules in a cubical box
with periodic boundary conditions. To avoid computing the n^2/2 pairwise interactions, a spherical
cutoff is used. The principal shared data structure is a large array of records; each element of
the array corresponds to one molecule and holds information about the molecule's position and
potential [SWG92].
The Filaments program consists of eight phases. The first three phases initialize the potentials of
the molecules and, consequently, are executed once at the beginning of the program. The remaining
phases are executed each time-step. There is one filament for each molecule in every phase. There is
read-sharing between filaments (and processors) within a phase; however, there is no write-sharing
(or race conditions) because barriers are used.

Water-Nsquared                          Sequential   P=1    P=2    P=4    P=8
SGI, 512 molecules, 10 steps               53.7      55.2   29.2   16.0   8.98
Pentium Pro, 512 molecules, 15 steps       74.8      79.6   44.6   25.8   16.3

Table 4: Performance of Water-Nsquared on SGI and Pentium (times in seconds).

Jacobi Iteration on SGI        P=1    P=2    P=4    P=8
Filaments (SGI)                60.4   31.6   16.7   10.0
Coarse-Grain (SGI)             57.9   30.1   16.1   9.33

Table 5: Performance of Jacobi iteration on SGI, Filaments versus coarse-grain (times in seconds).
Table 4 shows the execution times for Water-Nsquared. As before, we compare the Filaments
program to a sequential program as the baseline. The shared-memory programs compute for 10
time-steps on 512 molecules. The distributed-memory programs use 15 time-steps. Speedup on
the SGI is very good: 3.9 and 6.0 on 4 and 8 processors, respectively. Speedup on the cluster is
respectable: 2.9 and 4.6 on 4 and 8 processors, respectively.
The previous section demonstrates that Filaments runs very different scientific kernels efficiently. This section demonstrates that a scientific application, consisting of many different
computations, also executes efficiently using Filaments. Hence, Filaments can be used to program
large-scale scientific applications.
4.3 Jacobi iteration
This section examines the performance of one kernel, Jacobi iteration, in depth. Tables 5 and
6 show the performance of the Filaments Jacobi program versus a coarse-grain program on the
SGI and the cluster, respectively. The coarse-grain program provides a baseline as a very efficient
parallel implementation. The coarse-grain program uses one process per processor; furthermore,
on the cluster it uses explicit message passing (not the DSM) and overlaps communication and
computation. The Filaments programs are very competitive with the coarse-grain program in both
cases, performing no more than 7.5% worse on 8 processors. This is due to the efficient mechanisms in
the Filaments package. As with the sequential programs, most of the overhead is due to register
allocation.
The Jacobi iteration times reported above take advantage of two performance enhancements:
the implicit-invalidate page consistency protocol (PCP) and multiple pools. The communication
overhead is reduced by using the implicit-invalidate PCP, which has fewer messages than the write-invalidate PCP [FLA94]. The write-invalidate PCP requires invalidate messages to be sent, received, and acknowledged. The performance improvement, 6.8% and 13.6% on 4 and 8 nodes, can
be seen by comparing the results in the first two entries of Table 7.[3] The times for single-pool,
non-overlapping Jacobi iteration are shown in the last line of Table 7. Overlapping communication
leads to a 3.2% and 4.9% improvement on 4 and 8 nodes. It should be noted that the benefit
of overlapping depends on the particular architecture and can be much greater; previously, on a
cluster of Sun IPCs, we found the benefit to be over 20% [FLA94].

[3] The one-node performance of the write-invalidate program is slower than the other programs. We have profiled
the code as well as examined the assembly code, and observe no difference.

Jacobi Iteration on Pentium Pro Cluster     P=1    P=2    P=4    P=8
Filaments (Cluster)                         62.8   37.7   21.8   16.1
Coarse-Grain (Cluster)                      54.1   34.5   19.5   15.9

Table 6: Performance of Jacobi iteration on cluster, Filaments versus coarse-grain (times in seconds).

Jacobi Iteration on Pentium Pro Cluster     P=1    P=2    P=4    P=8
Multiple Pools, Implicit-Invalidate         62.8   37.7   21.8   16.1
Multiple Pools, Write-Invalidate            66.1   39.2   23.3   18.3
Single Pool, Implicit-Invalidate            62.8   38.1   22.5   16.9

Table 7: Performance of Jacobi iteration using one pool and write-invalidate page consistency
protocol (times in seconds).
5 Related Work
The Filaments package builds on a wide range of work, primarily consisting of efficient threads packages, distributed shared memory, and overlapping communication and computation. Furthermore,
it is one of many approaches to architecture independence.
5.1 Threads
There are many general-purpose thread packages, including Threads [Doe87], Presto [BLL88],
the µSystem [BS90], µC++ [BDSY92], and Sun Lightweight Processes [SS92]. All of the above packages
support pre-emption to provide fairness, which requires each thread to maintain a private stack.
Consequently, the package has to perform a full context switch when it switches between threads,
which makes efficient fine-grain parallelism impossible. Several researchers have proposed ways
to make standard thread packages more efficient, including Anderson et al. [ALL89, ABLL92],
Schoenberg and Hummel [HS91], and Keppel [Kep93]. WorkCrews [VR88] supports fork/join parallelism on small-scale, shared-memory multiprocessors, and introduced the concepts of pruning
and of ordering queues to favor larger threads, concepts borrowed by Filaments. Cilk-5 [FLR98]
compiles a fast-clone and slow-clone for each parallel function, and then executes the slow clone
when all processors are busy, which is a similar idea to pruning. Markatos et al. [MB92] present
a thorough study of the tradeos between load balancing and locality in shared memory machines
with respect to thread scheduling.
The most closely related threads packages that provide efficient fine-grain parallelism are the Uniform
System [TC88], Chores [EZ93], and TAM [CDG+93]. The first two do not support fork/join parallelism or a distributed-memory machine, and the latter is oriented towards functional programming
and a distributed-memory machine. Two other subsequent thread packages use an approach similar
to Filaments: uThread [Shu95] and the virtual processor approach [NS95].
5.2 Distributed Shared Memory
There is a wealth of related work on distributed shared memory systems. These can largely be
divided into three classes: hardware implementations, kernel implementations, and user implementations. Hardware implementations include MemNet [Del88], Plus [BR90], Alewife [KCA91], DASH
[LLJ+93], and FLASH [KOH+94]. The hardware detects reads and writes to shared locations and
maintains consistency. This is efficient but expensive.
Kernel implementations modify the operating system kernel to provide the DSM; these include
Ivy [LH89], Munin [CBZ91], and Mirage [FP89]. While less efficient than hardware mechanisms,
these implementations provide reasonable fault-handling times.
Modifying a kernel is complex and error-prone; as a result, user implementations have recently become much more common. These include TreadMarks [KDCZ94], Midway [BZS93], and CRL [JKW95]. Most of these systems differ from the Filaments DSM in that they are single-threaded on each node and thus cannot mitigate communication latency. Alewife is multi-threaded, but in hardware. A few later DSM systems also provide multi-threaded nodes, such as CVM [TK97].
5.3 Overlap of Communication and Computation
The idea of overlapping communication and computation is not new; the technique has been used
in hardware and operating systems for years. With the relatively long remote memory latencies in
parallel computing, latency mitigation becomes very important. There are two primary methods of overlap: explicit and implicit.
Split-C supports explicit overlap of communication and computation [CDG+93]: the programmer issues a remote request early and uses the result later. The challenge is to place the prefetch sufficiently far in advance that the reference occurs after the reply is received; the time between a request and the first reference must be greater than the latency. However, the prefetch can only precede the access by a limited amount, which depends on the program. CHARM is a fine-grain, explicit message-passing threads package [FRS+91]. It provides overlap of communication and computation, as well as dynamic load balancing. CHARM has a distributed-memory programming model that can be run efficiently on both shared- and distributed-memory machines.
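The placement constraint on explicit prefetching can be seen in the sketch below. The split-phase primitives get_async and wait_for are hypothetical stand-ins rather than actual Split-C syntax; the point is only that the local work issued between the request and the wait must take at least as long as the remote latency.

    /* Explicit overlap with hypothetical split-phase primitives. */
    typedef int handle_t;
    extern handle_t get_async(double *dst, const double *remote_src);  /* assumed */
    extern void     wait_for(handle_t h);                              /* assumed */

    double compute_with_prefetch(const double *remote_src,
                                 const double *local, int n)
    {
        double remote_val;
        handle_t h = get_async(&remote_val, remote_src); /* issue the request early   */

        double sum = 0.0;
        for (int i = 0; i < n; i++)                      /* independent local work;   */
            sum += local[i];                             /* should outlast the latency */

        wait_for(h);                                     /* complete the access         */
        return sum + remote_val;                         /* first reference after reply */
    }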
In contrast, Filaments implicitly overlaps communication with any local computation. This difference is the result of the fundamentally different ways in which the two systems achieve the overlap of communication and computation. The Filaments approach is more general (works on more problems) and simpler (requires no source code modifications). Also, Filaments provides functionality similar to CHARM using a shared-memory programming model. Finally, Filaments is implemented entirely in software as opposed to hardware.
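For comparison, a minimal sketch of the implicit style is shown below, again with hypothetical names rather than the actual Filaments DSM interface. On a page fault, the run-time requests the page asynchronously and then simply runs other ready filaments, so whatever local work exists overlaps the fetch without any change to the application source.

    /* Implicit overlap inside a multithreaded DSM (hypothetical names). */
    typedef struct filament filament_t;

    extern filament_t *current_filament;
    extern void request_page_async(void *page_addr);            /* assumed */
    extern void block_on_page(filament_t *f, void *page_addr);  /* assumed */
    extern void run_next_ready_filament(void);                  /* assumed */

    void dsm_page_fault(void *page_addr)
    {
        request_page_async(page_addr);               /* ask the page's owner        */
        block_on_page(current_filament, page_addr);  /* park the faulting filament  */
        run_next_ready_filament();                   /* overlap: keep the node busy */
    }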
5.4 Other Approaches to Architecture-Independence
The combination of fine-grain parallelism and shared-variable communication is one approach to architecture independence; there are many others. Four approaches that have received significant attention are parallelizing compilers, functional languages, annotated languages, and message-passing libraries.
The simplest path to a parallel program from a programmer's point of view is to write a sequential program and use a parallelizing compiler such as SUIF [AALT95] or Polaris [BDE+96]. The compiler computes data dependences, detects where communication is needed, coarsens granularity, and determines data distributions. If the compiler is retargeted efficiently to several different architectures (which requires significant effort), it may achieve architecture independence: the programmer writes a sequential program, and it is compiled to an efficient executable on several different parallel machines. The problem is that compilers can only handle restricted application domains efficiently, and the best parallel algorithm may be different from the best sequential algorithm.
Another approach is to use a functional language, which is sufficiently high-level that it is independent of the machine. Sisal (Streams and Iteration in a Single Assignment Language) is a general-purpose functional language [FCO90]. It is implemented using a dataflow model, meaning that program execution is determined by the availability of the data, not the static ordering of expressions in the source code. The compiler can schedule the execution of expressions in any order, including concurrently, as long as it preserves data dependencies. Sisal is arguably the most successful high-level parallel language; Sisal programs execute on numerous sequential, vector, shared-memory, and distributed-memory machines. However, if the compiler does not produce efficient code (due to, for example, the failure of complicated update-in-place analysis), it is very difficult to tune the program because of the vast difference between the source and the executable. Other functional languages include Multilisp [Hal86] and Id [Nik89].
Another possibility is to use an annotated language such as HPF [HPF93] or Dino [RSW91]. The programmer uses annotations to specify certain implementation-related aspects of the program, such as which sections of code can be parallelized or how data should be distributed. The compiler uses this information to generate an efficient parallel program for different architectures. Although the annotations may change with the architecture, the hope is that these changes are fairly localized. The problem with annotated languages is that although they lessen the burden on the programmer compared to writing explicitly parallel programs using subroutine libraries, they still require the programmer to make the key decisions. Furthermore, the annotations may have to change significantly between architectures.
Finally, PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are two approaches that provide portability through an API. PVM virtualizes the underlying machine, even allowing heterogeneous computing [Sun90]. MPI supports portable message-passing programs on a large number of machines [For94]. Both packages primarily target distributed-memory machines and have large installed bases. However, as previously stated, writing a correct, efficient message-passing program is quite difficult.
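As an illustration of the bookkeeping the message-passing style imposes, the short MPI program below exchanges one value around a ring of processes; the ranks, tags, and matched send/receive pairs are all the programmer's responsibility. It is a generic MPI example written for this discussion, not code from the Filaments distribution.

    /* Ring exchange in MPI: even one value per neighbor requires the
     * programmer to manage ranks, tags, and matching sends and receives. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double mine, from_left = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        mine = (double) rank;
        if (size > 1) {
            int right = (rank + 1) % size;
            int left  = (rank - 1 + size) % size;
            /* MPI_Sendrecv pairs the send and the receive, avoiding the
             * deadlock a careless ordering of blocking calls can cause. */
            MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                         &from_left, 1, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, &status);
        }
        printf("process %d received %.1f from its left neighbor\n",
               rank, from_left);
        MPI_Finalize();
        return 0;
    }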
6 Conclusion
This paper has discussed the design and implementation of Filaments, which provides an architecture-independent substrate for parallel programming. In particular, Filaments programs run unchanged and efficiently on both shared- and distributed-memory machines. We accomplish this by using a well-designed API and an efficient implementation of fine-grain parallelism and shared-variable communication.
Section 2 described Filaments programs for two applications. Both were programmed in an architecture-independent manner, so that they could be executed on any machine. Section 3 presented details of the implementation of the Filaments API, fine-grain parallelism, and shared-variable communication. The key techniques are very lightweight threads (filaments), run-time optimizations, and a multithreaded distributed shared memory. Section 4 showed that Filaments programs attained good speedups on a shared-memory multiprocessor (SGI Challenge) and a distributed-memory multicomputer (cluster of Pentium Pros). Furthermore, one application was examined in detail, which showed that the performance of Filaments was very competitive with a hand-written coarse-grain program.
In conclusion, Filaments demonstrates that fine-grain parallelism and shared-variable communication create a simple, general model for portable parallel programming. Additionally, the Filaments package shows that machine-specific libraries and preprocessing are sufficient to allow programs to be created once and run unchanged on vastly different machines. Lastly, this paper demonstrates that performance does not have to be sacrificed for portability; that is, Filaments delivers architecture-independent parallel computing.
Acknowledgements
The authors are grateful to Sandhya Dwarkadas and the University of Rochester for access to their SGI Challenge.
References
[AALT95] Saman P. Amarasinghe, Jennifer M. Anderson, Monica S. Lam, and Chau-Wen Tseng.
The SUIF compiler for scalable parallel machines. In Proceedings of the Seventh SIAM
Conference on Parallel Processing for Scientific Computing, February 1995.
[ABLL92] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.
[ALL89] T.E. Anderson, E.D. Lazowska, and H.M. Levy. The performance implications of thread
management alternatives for shared-memory multiprocessors. IEEE Transactions on
Computers, 38(12):1631–1644, December 1989.
[BDE+96] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger,
Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence
Rauchwerger, and Peng Tu. Parallel programming with Polaris. IEEE Computer,
29(12):78–82, December 1996.
[BDSY92] Peter A. Buhr, Glen Ditchfield, R.A. Stroobosscher, and B.M. Younger. uC++: concurrency in the object-oriented language C++. Software: Practice and Experience, 22(2):137–172, February 1992.
[BLL88] B.N. Bershad, E.D. Lazowska, and H.M. Levy. PRESTO: a system for object-oriented
parallel programming. Software: Practice and Experience, 18(8):713–732, August 1988.
[BR90] Roberto Bisiani and Mosur Ravishankar. Plus: A distributed shared memory system. In
17th Annual International Symposium on Computer Architecture, pages 115–124, May
1990.
[BS90] Peter A. Buhr and R.A. Stroobosscher. The uSystem: providing light-weight concurrency on shared-memory multiprocessor computers running UNIX. Software: Practice and Experience, pages 929–964, September 1990.
[BZS93] Brian N. Bershad, Matthew J. Zekauskas, and Wayne A. Sawdon. The Midway distributed shared memory system. In COMPCON '93, pages 528–537, 1993.
[CBZ91] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152–164, October 1991.
[CDG+93] David Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven
Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel programming in Split-C.
In Proceedings of Supercomputing '93, November 1993.
[Del88] G. Delp. The Architecture and Implementation of MemNet: A High-Speed Shared Memory Computer Communication Network. PhD thesis, University of Delaware, Newark,
Delaware, 1988.
[Doe87] Thomas W. Doeppner. Threads: a system for the support of concurrent programming.
Technical Report CS-87-11, Brown University, June 1987.
[EZ93] Derek L. Eager and John Zahorjan. Chores: Enhanced run-time support for shared-memory parallel computing. ACM Transactions on Computer Systems, 11(1):1–32, February 1993.
[FA96] Vincent W. Freeh and Gregory R. Andrews. Dynamically controlling false sharing in distributed shared memory. In Proceedings of the 5th Symposium on High Performance Distributed Computing, August 1996.
[FCO90] John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the SISAL language
project. Journal of Parallel and Distributed Computing, 10(4):349–366, December 1990.
[FLA94] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In First Symposium
on Operating Systems Design and Implementation, pages 201{212, November 1994.
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the
Cilk-5 multithreaded language. In 1998 ACM SIGPLAN Conference on Programming
Language Design and Implementation (to appear), June 1998.
[For94] MPI Forum. MPI: A message-passing interface standard. International Journal of
Supercomputer Applications, 8(3/4):165{416, 1994.
[FP89] Brett D. Fleisch and Gerald J. Popek. Mirage: a coherent distributed shared memory
design. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 211–223,
December 1989.
[Fre96] Vincent W. Freeh. Software Support for Distributed and Parallel Computing. PhD
thesis, The University of Arizona, June 1996.
[FRS+91] W. Fenton, B. Ramkumar, V. A. Saletore, A. B. Sinha, and L. V. Kale. Supporting
machine independent programming on diverse parallel architectures. In Proceedings of
the 1991 International Conference on Parallel Processing, volume II, Software, pages
II-193–II-201, Boca Raton, FL, August 1991. CRC Press.
[Hal86] Robert H. Halstead. An assessment of Multilisp: Lessons from experience. International
Journal of Parallel Programming, 15(6):459–501, December 1986.
[HFM88] D. Hansgen, R. Finkel, and U. Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1–18, February 1988.
[HPF93] High Performance Fortran language specification. October 1993.
[HS91] S.F. Hummel and E. Schonberg. Low-overhead scheduling of nested parallelism. IBM Journal of Research and Development, 35(5):743–765, September 1991.
[JKW95] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. In Fifteenth Symposium on Operating Systems Principles, pages 213–228, December 1995.
[KCA91] Kiyoshi Kurihara, David Chaiken, and Anant Agarwal. Latency tolerance through multithreading in large-scale multiprocessors. In International Symposium on Shared Memory Multiprocessing, pages 91–101, April 1991.
[KDCZ94] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.
[Kep93] David Keppel. Tools and Techniques for Building Fast Portable Threads Packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.
[KOH+94] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302–313, April 1994.
[LA96] David K. Lowenthal and Gregory R. Andrews. An adaptive approach to data placement. In Proceedings of the 10th International Symposium on Parallel Processing, pages 349–353, April 1996.
[LFA96] David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Using fine-grain threads and run-time decision making in parallel computing. Journal of Parallel and Distributed Computing, 37:41–54, November 1996.
[LFA98] David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Efficient support for fine-grain parallelism on shared-memory machines. Concurrency: Practice and Experience, 10(3):157–173, March 1998.
[LH89] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.
[LLJ+93] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH prototype: Logic overhead and performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41–61, January 1993.
[Low98] David K. Lowenthal. Local and global data distribution in the Filaments package. In PDPTA '98 (to appear), July 1998.
[MB92] Evangelos P. Markatos and Thomas J. LeBlanc. Load balancing vs. locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, volume I, Architecture, pages I:258–267, Boca Raton, Florida, August 1992. CRC Press.
[Nik89] Rishiyur S. Nikhil. The parallel programming language Id and its compilation for parallel machines. In Proceedings of the Workshop on Massive Parallelism: Hardware, Programming and Applications, October 1989.
[NL91] Bill Nitzberg and Virginia Lo. Distributed shared memory: A survey of issues and algorithms. IEEE Computer, pages 52–60, August 1991.
[NS95] Richard Neves and Robert B. Schnabel. Runtime support for execution of fine grain parallel code on coarse-grain multiprocessors. In Fifth Symposium on the Frontiers of Massively Parallel Computing, February 1995.
[PWG91] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Technical Report CSL-TR-91-469, Department of Electrical Engineering and Computer Science, Stanford University, April 1991.
[RSW91] Matthew Rosing, Robert Schnabel, and Robert Weaver. The Dino parallel programming language. Journal of Parallel and Distributed Computing, 13(1):30–42, September 1991.
[SHG93] Jaswinder Pal Singh, John L. Hennessy, and Anoop Gupta. Scaling parallel programs for multiprocessors: Methodology and examples. Computer, 26(7):42–50, July 1993.
[Shu95] Wei Shu. Runtime support for user-level ultra lightweight threads on massively parallel distributed memory machines. In Fifth Symposium on the Frontiers of Massively Parallel Computing, February 1995.
[SS92] D. Stein and D. Shah. Implementing lightweight threads. In USENIX 1992, June 1992.
[Sun90] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4), 1990.
[SWG92] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):5–44, March 1992.
[Tan95] Andrew S. Tanenbaum. Distributed Operating Systems. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1995.
[TC88] Robert H. Thomas and Will Crowther. The Uniform System: An approach to runtime support for large scale shared memory parallel processors. In 1988 Conference on Parallel Processing, pages 245–254, August 1988.
[TK97] Kritchalach Thitikamol and Pete Keleher. Multi-threading and remote latency in software DSMs. In Proceedings of the 17th International Conference on Distributed Computing Systems, May 1997.
[VR88] M. Vandevoorde and E. Roberts. WorkCrews: an abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347–366, August 1988.