Architecture-Independent Parallelism for Both Shared- and Distributed-Memory Machines using the Filaments Package

David K. Lowenthal
Department of Computer Science
University of Georgia
[email protected]

Vincent W. Freeh
Department of Computer Science and Engineering
University of Notre Dame
[email protected]

March 3, 1999

Abstract

This paper presents the Filaments package, which can be used to create architecture-independent parallel programs, that is, programs that are portable and efficient across vastly different parallel machines. Filaments virtualizes the underlying machine in terms of the number of processors and the interconnection, allowing fine-grain, shared-memory programs to be written or generated. Furthermore, Filaments uses a carefully designed API along with machine-specific runtime libraries and preprocessing that allow programs to run unchanged on both shared- and distributed-memory machines. Performance is not sacrificed, as almost all kernels and applications we tested achieve a speedup of over 4 on 8 processors of both an SGI Challenge and a cluster of Pentium Pros.

Keywords: fine-grain parallelism, architecture independence

1 Introduction

Recent improvements in VLSI and in fast networking have accelerated the growth of shared- and distributed-memory parallel computers. Soon small-scale shared-memory multiprocessors will be on desktops, and large clusters of workstations with fast communication will be in every building. This increase in processing power presents an opportunity to computational scientists, who are always in need of parallel machines on which to execute programs. It is important for these scientists to be able to execute their programs on any available parallel machine: shared memory, distributed memory, or a hybrid. This makes it critical to provide a parallel programming model that is architecture independent, which means that it can be implemented efficiently on vastly different parallel architectures. In other words, it must be simultaneously portable and efficient.

We have designed and implemented a library package called Filaments [FLA94, LFA96, LFA98] that provides such a model and an efficient implementation. (The Filaments package is freely available at http://www.cs.uga.edu/~dkl/filaments/dist.html.) The Filaments package achieves architecture independence in two ways: the use and efficient implementation of fine-grain parallelism and shared-variable communication, and machine-specific libraries and preprocessing that allow programs to run on both shared- and distributed-memory machines without any changes.

The granularity of a parallel program refers to the amount of computation in each process. The coarsest-grain program is a sequential program with one process that does all the work. However, a parallel program must partition the work between processors; a coarse-grain parallel program has one process per processor. On the other hand, a fine-grain program creates one process (thread) for each logical unit of work. Thus, a fine-grain program creates the abstract parallelism inherent in the algorithm, whereas a coarse-grain program creates the concrete parallelism of the machine. Fine-grain programs are simpler to write because they exhibit the natural parallelism of an algorithm; indeed, parallelism is expressed in terms of the application and problem size, not in terms of the number of processors that are used to execute the program. Clustering of independent units of work into a fixed set of larger tasks is not necessary.
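As a concrete illustration of the difference, consider updating every element of an n x n grid. The following fragment is not Filaments code; it is a generic, hypothetical C sketch (the names update_point, spawn_task, coarse_grain_worker, and fine_grain_create are invented for this example). The coarse-grain version bakes the number of processors into the loop structure, whereas the fine-grain version simply creates one unit of work per grid point and leaves the mapping to a runtime.

    #include <stddef.h>

    /* One logical unit of work: recompute a single grid point (body elided). */
    static void update_point(double *grid, size_t n, size_t i, size_t j) {
        (void)grid; (void)n; (void)i; (void)j;   /* placeholder body */
    }

    /* Stand-in for a runtime call that records one unit of work; a real
     * runtime would enqueue the task, but this sketch runs it immediately. */
    static void spawn_task(void (*fn)(double *, size_t, size_t, size_t),
                           double *grid, size_t n, size_t i, size_t j) {
        fn(grid, n, i, j);
    }

    /* Coarse grain: the processor count shapes the code; each process
     * executes one contiguous block of rows. */
    static void coarse_grain_worker(double *grid, size_t n, int my_id, int nprocs) {
        size_t lo = (size_t)my_id * (n / (size_t)nprocs);
        size_t hi = lo + n / (size_t)nprocs;
        for (size_t i = lo; i < hi; i++)
            for (size_t j = 0; j < n; j++)
                update_point(grid, n, i, j);
    }

    /* Fine grain: one task per grid point; the number of processors
     * never appears in the program. */
    static void fine_grain_create(double *grid, size_t n) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                spawn_task(update_point, grid, n, i, j);
    }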
Just as fine-grain parallelism provides the natural level of parallelism, the natural model for communication is shared variables. Processes reference a variable without regard to its location, just as in a sequential program. Remote variables are brought into the local address space transparently when they are referenced; hence, processes can communicate by just reading and writing shared variables. Most algorithms are expressed in terms of shared variables [NL91]; furthermore, programming with messages is tricky despite the various solutions that have been proposed [Tan95]. Compared to shared-variable programs, message passing adds many complications to a program, the worst of which is significant program reorganization. Shared-variable communication allows a user to write a sequential program, profile it, and parallelize the parts that are computationally expensive, without dramatic changes to the rest of the program.

The combination of fine-grain parallelism and shared-variable communication provides a portable substrate for parallel programming and is used in the Filaments package. The physical dimensions of the machine are not part of the application program, because the processors and interconnect are virtualized. This programming model makes it easy to write or compile applications for both shared- and distributed-memory parallel computers. Moreover, this model extends naturally to networks of shared-memory multiprocessors.

Even with a portable substrate, allowing programs to run unchanged on vastly different architectures is a challenge. The shared and distributed implementations have vast differences. For example, reduction variables are implemented as scalars on a distributed-memory machine and as an array on a shared-memory machine. Filaments uses a carefully designed API and machine-specific libraries and preprocessing to allow application programs to run unchanged on a variety of shared- and distributed-memory machines. Currently the Filaments package runs on SGI and SPARC multiprocessors and on SPARC and Pentium (both Solaris and Linux) clusters.

The contributions of this paper are: a demonstration that fine-grain parallelism and shared-variable communication provide a simple programming model; a system, Filaments, that supports fine-grain, shared-memory programs that run unchanged on both shared- and distributed-memory machines; and an efficient implementation of Filaments on both shared- and distributed-memory machines.

The next section illustrates how architecture-independent programs are written in Filaments. Section 3 discusses the Filaments implementations on both shared- and distributed-memory machines and how portability is achieved. Section 4 examines the performance of Filaments on both shared- and distributed-memory machines. Section 5 discusses related work, and the last section summarizes the paper.

[Figure 1: Three parallel computer architectures. (a) Multiprocessor: one node, four processors sharing one memory over a shared bus. (b) Multicomputer: four nodes, four processors, each with its own memory, connected by an interconnection network. (c) Hybrid: two nodes, four processors, with two processors sharing the memory on each node.]

2 Architecture-Independent Parallel Programming in Filaments

The Filaments package is a software kernel that supports efficient execution of fine-grain parallelism and shared-variable communication on a range of parallel machines.
Filaments programs make no mention of coarsening small threads into larger processes or of sending and receiving messages between processors. A filament is a very lightweight thread. Each filament can be quite small, as in the computation of an average in Jacobi iteration; medium size, as in the computation of an inner product in matrix multiplication; or large, as in a coarse-grain program with one process per processor. The Filaments package provides a small set of primitives that is sufficient to implement architecture-independent programs for the vast majority of scientific applications.

The Filaments package runs on both shared- and distributed-memory machines. Both architectures have multiple physical processors; however, a multiprocessor has one physical address space, and a multicomputer has one per processor. The term processor refers to a physical processor, whether it be in a multiprocessor or a multicomputer. The term node refers to an address space. Figure 1 shows three configurations of four processors. Figure 1(a) shows a shared-memory multiprocessor with four processors on one node. Figure 1(b) shows a distributed-memory multicomputer with four nodes, each containing one processor. Figure 1(c) shows a network of multiprocessors with four processors and two nodes. (The Filaments implementation for this hybrid machine is under development.)

The Filaments package supports two kinds of filaments. Iterative filaments execute repeatedly, with a global reduction operation (and hence a barrier synchronization) occurring after each execution of all filaments. The package also supports sequences of phases of iterative filaments, which are used when applications or loop bodies have multiple components. Iterative filaments are used in applications such as Jacobi iteration, LU decomposition, and Water-Nsquared. Fork/join filaments recursively create new filaments and wait for them to return results. They are used in divide-and-conquer applications such as adaptive quadrature, quicksort, and recursive FFT.

Iterative filaments execute in phases. All filaments in a phase are independent and can execute in any order. Within a phase, there are one or more pools, which are groups of filaments that reference data with spatial locality. Multiple pools per phase provide for more efficient use of the cache and/or the distributed shared memory; however, it is important to note that using multiple pools per phase is not necessary for correctness. A programmer places each iterative filament in a pool (within a phase) when it is created. A phase consists of a pointer to the filament code and a pointer to the post-phase code, which is a function that is called after each execution of all filaments in the phase. The post-phase code synchronizes the processors and determines whether a phase is finished.

On the other hand, fork/join filaments are created dynamically and in parallel. When a processor forks a filament, the filament is placed on that processor's list; however, any processor (on any node) may execute the filament. Fork/join applications do not have inherent locality; hence, pools are not used, because the data-reference patterns of fork/join filaments in general cannot be determined.

Below we describe both iterative and fork/join applications and how they are programmed using Filaments.
A program that uses Filaments contains three additional components relative to a sequential program: declarations of variables that are to be located in shared memory, functions containing the code for each filament, and a section that creates the filaments, places them on processors and/or nodes, and controls their execution. The names of Filaments library calls have been shortened for brevity. Additionally, for clarity, some details of the code fragments are omitted. Note that in general we expect a compiler to generate Filaments code; in fact, we have a prototype of a modified SUIF [AALT95] compiler that translates sequential C programs to Filaments programs.

2.1 Jacobi Iteration: Iterative Filaments

Laplace's equation in two dimensions is the partial differential equation

    d^2u/dx^2 + d^2u/dy^2 = 0.

Jacobi iteration is one way to approximate the solution to this equation; it works by repeatedly computing a new value for each grid point, where the new value for a point is the average of the values of its four neighbors from the previous iteration. The computation terminates when all new values are within some tolerance, epsilon, of their respective old values. Because there are two grids, the n^2 updates are all independent computations; hence, all new values can be computed in parallel.

For this application, the key shared variables are the two n-by-n arrays, new and old, and the reduction variable maxdiff. A reduction variable is a special kind of variable with one copy per node. The local copy of a reduction variable can be accessed directly using the filGetVal macro. Such variables are also used in calls to filReduce, which (1) atomically combines the private copy on each processor into a single copy using a binary, associative operator (such as add or maximum) and (2) copies the reduced value into each private copy. A call of filReduce also results in a barrier synchronization between nodes. The code executed by each filament computes an average and a difference:

    void jacobi(int i, int j) {
        double temp;

        new[i][j] = (old[i-1][j] + old[i+1][j] +
                     old[i][j-1] + old[i][j+1]) * 0.25;
        temp = absval(new[i][j] - old[i][j]);
        if (filGetVal(maxdiff) < temp)
            filGetVal(maxdiff) = temp;
    }

After computing the new value of grid point (i,j), jacobi computes the difference between the old and new values of that point. If the difference is larger than the maximum difference observed thus far on this iteration of the entire computation, then maxdiff needs to be updated. After all grid points are updated, the following procedure is called to check for convergence and to swap grids:

    int check() {
        filReduce(maxdiff, MAX);
        if (filGetVal(maxdiff) < epsilon)
            return F_DONE;
        swap(old, new);
        filGetVal(maxdiff) = 0.0;
        return F_CONTINUE;
    }

One processor on every node executes this code at the end of every update phase, that is, after every filament in the phase has been executed once. If check returns F_CONTINUE, the next iteration is performed; each filament will update a point on the next iteration. If it returns F_DONE, the computation terminates.

The initialization section is executed on each node, because each address space must be initialized.
    void main() {
        /* create and initialize the shared variables */
        filInit();
        startRow = filMyNode() * n/filNumNodes();
        endRow = startRow + n/filNumNodes() - 1;
        phase = filCreatePhase(jacobi, check);
        pool = filCreatePool(phase, numFil);
        for (i = startRow; i <= endRow; i++) {
            /* determine which pool to use for this row */
            poolId = i * filNumProcessors()/n;
            for (j = 1; j < n-1; j++)
                filCreateFilament(phase, pool[poolId], i, j);
        }
        filStart();
    }

The call to filInit initializes the Filaments package. The call of filCreatePhase creates a phase, which contains filaments that execute the function jacobi; the other argument is a pointer to a function defining the post-phase code. The filCreatePool call creates an array of pools, one per processor, where each pool contains space for numFil filaments. The filCreateFilament routine creates a single filament. Each filament is defined by i and j, which are passed as arguments to jacobi. The variables poolId, startRow, and endRow are discussed below. The final Filaments package call, filStart, starts the parallelism. All previously created phases are completed before filStart returns.

Architecture-Independent Assignment of Filaments to Processors and Nodes

A Filaments program must distribute the filaments among the processors and nodes in order to balance the load. An efficient distribution for Jacobi iteration is block, in which every processor updates a contiguous strip of rows in the new array. This results in a load-balanced program with good locality. The initialization code above shows a general way to distribute filaments in blocks for multiprocessors, multicomputers, and networks of multiprocessors. The values startRow and endRow are the start and end of the block assigned to the node. The variable poolId identifies the processor that is assigned to the row. This simple method of assigning filaments to pools can support all regular distributions on multiprocessors, multicomputers, and hybrid machines. More importantly, it makes the code architecture independent; no change is needed when porting the code. Additional information on automatic filament assignment can be found in [LA96, Low98].

[Figure 2: Distributing the 4 x 4 filaments in pools on various machines. Each filament descriptor holds the indices (i, j); the phase holds pointers to the filament code (jacobi) and the post-phase code (check). (a) Uniprocessor: 1 node, 1 processor. (b) Multiprocessor: 1 node, 4 processors. (c) Multicomputer: 4 nodes, 4 processors. (d) Hybrid: 2 nodes, 4 processors.]

Figure 2 illustrates ways to distribute filaments between processors and nodes using phases and pools on several different architectures. The example shows the block distribution of a 4 x 4 matrix. It uses one phase and one pool array per node. The phase contains the filament code pointer (jacobi) and the post-phase code pointer (check). The pool is an array, with one element for every processor. Figure 2(a) shows the pools on a uniprocessor. There is a one-element pool array containing all 16 filament descriptors. In the initialization code above, startRow and endRow would be 0 and 3, and poolId would always be 0. Figure 2(b) shows the pools on a multiprocessor with four processors. There is an array of four pools, where each array element contains four filament descriptors; startRow and endRow would again be 0 and 3, but poolId now varies between 0 and 3 because filNumProcessors() is 4. Figure 2(c) shows the pools on a multicomputer with four nodes.
There are four distinct pools; each pool contains four filament descriptors. Here filNumNodes() is 4 and filNumProcessors() is 1. Therefore, on each node, startRow equals endRow, which equals filMyNode(). On all nodes, poolId is always 0. Figure 2(d) shows the pools on a hybrid machine with two nodes and four processors; filNumNodes() and filNumProcessors() are both 2, which results in values for startRow and endRow of 0 and 1 on one node and 2 and 3 on the other. The variable poolId varies between 0 and 1 on both nodes. There are two arrays of two pools each.

The basic method for pool and node assignment above easily extends to any distribution and any number of pools and pool sets, because every regular distribution can be described with simple algebraic equations. For example, another common distribution is cyclic, where filaments are assigned to processors and nodes in a modulo fashion. This distribution is common in linear algebra solvers such as LU decomposition, Gaussian elimination, and QR factorization. A cyclic distribution can be realized by (1) setting startRow equal to the node's id, setting endRow to the upper bound of the loop, and incrementing the loop index by the number of nodes, and (2) setting poolId to i modulo the number of processors. Step (1) ensures that filaments are distributed cyclically to nodes, and step (2) ensures that filaments are distributed cyclically to processors.

2.2 Adaptive Quadrature: Fork/Join Filaments

This section introduces fork/join filaments for programming divide-and-conquer applications. Consider the problem of approximating the integral of f(x) from a to b. One method to solve this problem is adaptive quadrature, which works as follows. Divide an interval in half, approximate the areas of both halves and of the whole interval, and then compare the sum of the two halves to the area of the whole interval. If the difference of these two values is not within a specified tolerance, recursively compute the areas of both subintervals and add them.

The simplest way to program adaptive quadrature is to use a divide-and-conquer approach. Because subintervals are independent, a new filament computes each subinterval; hence, this application uses fork/join filaments. The more rapidly the function is changing, the smaller the interval needed to obtain the desired accuracy. Therefore, the work load is not uniformly distributed over the entire interval. The computational routine for adaptive quadrature is:

    double quad(double a, double b, double fa, double fb, double area) {
        double left, right, fm, m, aleft, aright;

        /* compute midpoint m and areas under f() from a to m (aleft)
         * and m to b (aright) */
        if (close enough)
            return aleft + aright;
        else {
            /* recurse, forking two new filaments */
            filFjFork(quad, left, a, m, fa, fm, aleft);
            filFjFork(quad, right, m, b, fm, fb, aright);
            filFjJoin();    /* wait for children to complete */
            return left + right;
        }
    }

The algorithm above evaluates f() at each point just once and evaluates the area of each interval just once. Previously computed values and areas are passed to new filaments to avoid recomputing the function and areas. If the computed estimate is not close enough, the program forks a filament for each of the two subintervals and then waits for them to complete. The results are stored in left and right.

The initialization section for adaptive quadrature is similar to that of Jacobi iteration; for fork/join applications, the filStart routine serves as an implicit join for the initial filFjFork call.
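The paper does not show that initialization code; the following is a minimal sketch of what it might look like, assuming the root filament is created with the same filFjFork call used inside quad. The integrand f(), the bounds a and b, the initial trapezoid estimate, and the variables total and whole are invented for this illustration and are not part of the Filaments API described above.

    /* Hypothetical initialization for adaptive quadrature (not from the
     * paper); it mirrors the structure of the Jacobi main() above. */
    void main() {
        double a = 1.0, b = 25.0;           /* example integration bounds */
        double fa, fb, whole, total;

        filInit();                          /* initialize the Filaments package */
        fa = f(a);                          /* f() is the user-supplied integrand */
        fb = f(b);
        whole = (fa + fb) * (b - a) / 2;    /* crude initial estimate (trapezoid) */
        filFjFork(quad, total, a, b, fa, fb, whole);
        filStart();                         /* acts as the implicit join of the initial fork */
    }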
3 Filaments Implementation

A Filaments program runs efficiently on shared- and distributed-memory machines without any changes to the source code. Conceptually, this is accomplished with fine-grain parallelism and shared-variable communication, which virtualize the machine in terms of the number of processors and the communication mechanism. Physically, it is accomplished through a well-designed API and an efficient implementation of the Filaments package.

Although a fine-grain, shared-variable model plus a well-designed API provides a simple and portable way to program, achieving efficiency, and hence architecture independence, requires careful implementation. The potentially high costs of process creation, process synchronization, context switching, and excess memory usage present obstacles to efficient implementations of fine-grain parallelism. Furthermore, shared-variable communication potentially incurs long latencies for reading or writing remote values on a distributed-memory machine. Fortunately, fine-grain parallelism integrates well with shared-variable communication, which presents many opportunities. In particular, fine-grain programs typically create many units of work, which makes it easier to mask the aforementioned communication latency by overlapping communication with computation (it is easier to find useful work to do while waiting for a response). Additionally, the creation of many units of work makes it easier to balance the load automatically. Filaments achieves efficiency through three key techniques: very lightweight context switches, runtime "coarsening" to eliminate overhead, and a multithreaded distributed shared memory to tolerate memory latency on a distributed-memory machine.

The following section discusses the Filaments API. Section 3.2 describes the implementation of fine-grain parallelism. The last section contains a description of the multithreaded distributed shared memory system that implements shared-variable communication on a distributed-memory machine.

3.1 Filaments Programming Model

The first key to providing architecture independence is to provide the correct level of abstraction. As discussed above, fine-grain parallelism and implicit communication are an excellent abstraction. This section describes the manifestation of this abstraction through the Filaments programming model.

Section 2 gives the flavor of the Filaments programming model. It consists of library calls and preprocessor macros. Every Filaments operation is of the form:

    fil<Verb>[<Nouns>]([<args>]);

All operations begin with fil and are followed by an action, a subject, and arguments (the last two are optional). Verb is the action, such as Get, Allocate, or Declare. Nouns is a string of zero or more nouns that determine the subject of the action. And of course, there may be some call-dependent arguments. Examples are filGetVal, filAllocateMemory, and filStart.

The Filaments API is a separate layer that lies between the application program and the different Filaments subsystems: filaments, threads, and distributed shared memory (see below). The API masks architecture differences, completely freeing the programmer from low-level details. For example, filAllocateMemory has a different implementation on shared- and distributed-memory machines; the API calls the appropriate routine. The API also provides a uniform interface to the package, despite the fact that there are three largely independent subsystems.

There are two different implementations of the Filaments package.
Shared Filaments (SF) runs on shared-memory multiprocessors, and Distributed Filaments (DF) runs on distributed-memory multicomputers. However, every Filaments source file uses the same programming model, regardless of whether it will execute on a shared- or distributed-memory machine. All machine-specific functionality is provided by the package. The difference between a shared- and a distributed-memory program is only how each is compiled.

The Filaments programming model consists of both function calls and macros. Libraries provide the Filaments function calls and are different for SF and DF; therefore, the linker uses either the SF or the DF version as appropriate. Macro definitions, on the other hand, differ in the source itself. Therefore, the program defines either the SF or the DF compiler variable via a command-line compiler option (e.g., -DSF) to select the correct macro definition. Inside the Filaments package, a typical macro looks like:

    #ifdef SF
        ... /* SF-specific definition */
    #else
        ... /* DF-specific definition */
    #endif

Although a macro often is used as an "optimized" function call, in Filaments the macro preprocessing also is necessary for machine-specific code. For example, recall that a reduction variable is a special variable with one copy per node. The following macro defines a reduction variable of type double and name sum:

    filDeclareRedVar(double, sum);

On a distributed-memory machine, address spaces are independent, so the declaration simply expands to:

    double sum;

However, on a shared-memory machine, all processors must share a reduction variable. To avoid locking, all processors access a private copy by indexing into a shared array using their processor ID. Furthermore, the elements are padded to the size of a cache block in order to avoid false sharing. Thus, the macro expansion in SF is:

    union {
        double value;
        int pad[F_STRIDE];
    } sum[F_MAX_PROCS];

Because the internal structure of a reduction variable is different, all program-level accesses occur through macros as well. The following Filaments macro provides an l-value or r-value for the reduction variable:

    #ifdef SF
    #define filGetVal(r) r[fMyProcessor()].value
    #else
    #define filGetVal(r) r
    #endif

Thus, the use of a reduction variable looks like:

    filGetVal(sum) = 5.0;

which expands to either sum = 5.0; in DF or sum[fMyProcessor()].value = 5.0; in SF. The key point is that the application source does not change.

Above we explain how both a shared-memory and a distributed-memory Filaments program can be created from the same source program. The SF and DF runtimes also are created from the same source code. There are two levels of differences; the first is between SF and DF. For example, implementing load balancing is different on shared-memory machines than on distributed-memory ones. We also must handle differences between different architectures and operating systems within SF and within DF. The most significant such difference is the context-switching code. Also, to a lesser extent, message passing and signal processing differ between distributed-memory machines.

The Filaments package uses two techniques to create different implementations from the same source. The first is conditional compilation, and the second is symbolic links. For small, localized differences, conditional compilation is sufficient and manageable. This is the primary way that the differences between SF and DF are handled. However, for vastly different implementations it is simpler to use different machine-specific source files and create a symbolic link to the appropriate file for a particular instance of the implementation.
For example, message-passing code differs greatly, so the distribution has a communication file for each different architecture. When the instance of the implementation is specified, the configuration script creates a symbolic link from a generic file name to the machine-specific file. Accessing the generic file locates the machine-specific file; hence, those definitions of the function calls are used to build the libraries.

The runtime consists of three different libraries:

- Filaments (libf.a) contains code for creating, scheduling, executing, and synchronizing filaments.
- Threads (libt.a) provides traditional lightweight threads optimized for the Filaments package.
- Distributed Shared Memory (libdsm.a) implements logically shared memory on a distributed-memory machine.

The filaments library, libf.a, is almost completely independent of the processor and the operating system. However, SF and DF use different run-time data structures that are optimized for each implementation; conditional compilation is sufficient for this library. In contrast, libt.a is greatly dependent on the architecture and operating system because of the context-switching code. Therefore, the system uses a symbolic link to the appropriate source for the context-switching code. The last library, libdsm.a, is exclusively for DF. It is dependent on the operating system, specifically sockets and signals. This library uses both of the above techniques. For example, the location of the address that caused a segmentation violation (page fault) depends on both the operating system and the type of processor; to handle this, the code uses conditional compilation. However, it is easier to use a symbolic link to select among the different message-passing implementations, as discussed above.

3.2 Implementing Efficient Fine-Grain Parallelism

Achieving efficient fine-grain parallelism requires careful implementation. This section first describes the elements common to both SF and DF. Next, it describes the SF-specific aspects. Finally, it describes those same aspects in the context of DF.

3.2.1 Common Filaments Concepts

A filament does not have a private stack; it consists only of a (shared) function pointer, arguments to that function, and, for fork/join filaments, a few other fields such as a parent pointer and the number of children. Filaments are executed one at a time by server threads, which are traditional lightweight threads with a private stack. Each processor creates at least one server thread. The server thread's stack provides a place for temporary storage so that filaments can compute partial results without using a private stack per filament (which would necessitate expensive context switching). The SF system creates one, and only one, server thread per processor. These threads execute filaments and coordinate with each other. In contrast, DF initially creates one thread on each node; more threads are created on the nodes as needed to overlap communication and computation.

Fork/join filaments create work dynamically; therefore they are quite different from iterative filaments. Most importantly, fork/join filaments must be capable of blocking and later resuming execution, which requires some method of saving state. The server stack stores suspended filaments, which eliminates the need for private filament stacks; however, suspended filaments must be resumed in inverse order. As with iterative filaments, we want one server thread to execute many fork/join filaments to reduce the overhead of execution. Therefore, when a filament executes a join, it loops waiting for its child counter to become zero.
But while waiting, the parent can execute other filaments; in other words, the parent becomes the server thread.

Fork/join programs tend to employ a divide-and-conquer strategy. The computation starts on just one processor. To get all processors involved in the computation, new work (from forks) is given to idle processors. Then, once all processors have work, additional load balancing may be required to keep the processors busy.

Many Filaments programs attain good performance with little or no optimization. For example, in matrix multiplication, each filament performs a significant amount of work (O(n) multiplications and additions), which amortizes the filament overheads. However, achieving good performance for iterative applications that possess many filaments that each perform very little work (e.g., Jacobi iteration) requires implicit coarsening, which allows filaments in a pool to be executed as if the application were written as a coarse-grain program. (Systems such as Chores [EZ93] and the Uniform System [TC88] also have a fine-grain specification and a coarse-grain execution model, but they use preprocessor support to create a machine-specific executable at compile time; Filaments instead generates different code versions at compile time and chooses among them at run time depending on the machine.) Implicit coarsening reduces the cost of running filaments, reduces the working-set size to make more efficient use of the cache, and uses code that is easier for compilers to optimize.

To implement implicit coarsening, we use two techniques: inlining and pattern recognition. Inlining consists of directly executing the body of each filament rather than making a procedure call. In particular, when processing a pool, a server thread executes a loop, the body of which is the code for the filaments in the pool. This eliminates a function call for each filament, but the server thread still has to traverse the list of filament descriptors in order to load the arguments. The second technique is to recognize common patterns of filaments at run time. Filaments recognizes regular patterns of filaments assigned to the same pool. In such cases, the package switches at run time to code that iterates over the filaments, generating the arguments in registers rather than reading the filament descriptors from memory. Filaments currently recognizes a few common patterns that support a large subset of regular problems; however, the technique is capable of supporting any number of other patterns. Both inlining and pattern recognition are implemented without compiler support; instead, they use special macros that cause multiple versions of functions to be generated. At run time the proper version is invoked.

For fork/join filaments the complementary optimization is dynamic pruning. When enough work has been created to keep all processors busy, forks are turned into procedure calls and joins into returns. This avoids excessive overhead due to filament creation and synchronization. There is a danger with pruning, however: it is possible for a processor to traverse an entire subtree sequentially after other processors become idle. To avoid this, whenever another processor needs work, a processor executing sequentially returns to executing in parallel and creates new work. This is implemented by having a server thread check a flag (each time) before deciding whether to fork filaments or call them directly. The above optimizations are discussed in greater detail in [LFA96].
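To make inlining and pattern recognition concrete, the following is a minimal sketch, not the actual Filaments runtime code: the descriptor layout and the names run_pool_inlined and run_pool_pattern are invented for this illustration. The first loop walks the descriptor list and executes the filament body directly; the second exploits a recognized dense-block pattern, so the (i, j) arguments are generated by the loop itself rather than read from filament descriptors.

    /* Hypothetical descriptor for a two-argument iterative filament. */
    typedef struct {
        int i, j;                    /* arguments recorded by filCreateFilament() */
    } fil_desc_t;

    extern void jacobi(int i, int j);    /* filament body from Section 2.1 */

    /* Inlining only: no per-filament procedure call in the real system (the
     * body is expanded into the loop), but the arguments are still loaded
     * from the descriptor list in memory. */
    static void run_pool_inlined(fil_desc_t *descs, int count) {
        for (int k = 0; k < count; k++)
            jacobi(descs[k].i, descs[k].j);
    }

    /* Pattern recognized at run time: the pool holds a dense block of rows,
     * so the arguments are regenerated in registers and the descriptors are
     * never touched. */
    static void run_pool_pattern(int firstRow, int lastRow, int n) {
        for (int i = firstRow; i <= lastRow; i++)
            for (int j = 1; j < n - 1; j++)
                jacobi(i, j);
    }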
Implicit coarsening and pruning reduce the cost of running filaments, reduce the working-set size to make more efficient use of the cache, and use code that is easier for compilers to optimize. They enable efficient execution of very fine-grain programs.

3.2.2 Elements Specific to Shared Filaments

Three elements of the Filaments package have a different implementation in Shared Filaments (SF) than in Distributed Filaments: memory allocation, reductions, and acquiring filaments from other processors or nodes.

Memory allocation in SF is very straightforward because there is hardware support for shared memory. Consequently, Filaments is able to use the vendor-supplied memory allocation routine to allocate shared memory.

There are two primary ways to implement reduction variables in SF. On the machines that currently support Filaments, we have found that the most efficient way to represent reduction variables is with an array of P elements, where P is the number of processors. (The alternative is to use one shared variable and protect it with a lock; variable access is cheaper, but the locking overhead is more expensive.) Processors update the array element corresponding to their ID. When all processors reach the reduction point, processor 0 reduces all the elements of the array into one element and then writes this value into all elements. All processors perform a barrier synchronization at the beginning and end of the reduction, thus ensuring both that processor 0 does not start reducing too soon and that the other processors do not read the value before it is final.

Implementing fork/join filaments efficiently requires acquiring filaments from other processors. Because filaments in SF are created in shared memory, acquisition is relatively straightforward. An idle processor scans all other processors' lists and takes a filament off the list that has the most filaments. SF takes care to ensure the integrity of the lists; a test-and-test-and-set method is used. The scan proceeds without any mutual exclusion, but when a processor decides to remove filaments from another list, it locks the list, rechecks to make sure the appropriate number of filaments is still on the list, removes them, and then unlocks the list.

3.2.3 Elements Specific to Distributed Filaments

This section describes the implementation of the same three elements, memory allocation, reductions, and acquiring filaments from other processors, in Distributed Filaments (DF).

Memory allocation in DF is much different than in SF because there are shared and private sections of memory. The shared section contains the user data that is shared between the nodes. It is imperative that each node allocate the shared data identically, so that pointers into the shared section have the same meaning on all nodes. Shared memory in DF is provided by a multithreaded DSM, which is discussed in the next section. To ensure that all nodes start at the same address, DF performs an internal max reduction on the start of the DSM area.

As in SF, a reduction implies a barrier synchronization, but in DF the barrier is between nodes; therefore, DF uses messages to perform reductions. DF allocates a local copy of the reduction variable on each processor, similar to SF. After the local copies on a node have been merged into one copy, the values on the nodes are merged: DF combines the value at each node into one global value using a tournament reduction. There are O(log_2 N) steps in the reduction (N is the number of nodes).
In each step, half of the participating nodes send a reduction message (containing the local reduction value) to another node and then drop out of the tournament. When a node receives a reduction message, it merges the value in the message with its local value. After the last step, all the local values have been reduced into one value, which is then disseminated back to all the nodes in a reduction reply message. A tournament reduction scales well and has low expected latency [HFM88]. There are 2(N - 1) messages in the reduction, which is minimal, and the latency is O(log_2 N) steps. Although there are other methods with lower latency, each requires more messages. For example, each node could send its value to every other node. This has a minimal latency of one step; however, it requires N^2 messages. (A dissemination barrier has O(log_2 N) steps, but O(N log_2 N) messages.)

3.3 The Multithreaded Distributed Shared Memory System

The Filaments DSM provides logically shared memory for Filaments programs executing on distributed-memory machines. It is multithreaded, extensible, and implemented entirely in software. Multithreading enables the overlapping of communication and computation, which effectively mitigates message latency and, hence, greatly improves performance for many applications. Filaments programs have numerous independent filaments that can be executed in any order, which enhances the ability to overlap communication and computation.

The DSM supports extensibility by means of user-defined page consistency protocols (PCPs). A PCP provides page-based memory consistency. The DSM supplies only the general framework; the specific operations occur by means of upcalls to user-defined routines, as described in [FA96, Fre96]. It is easy to create, test, and evaluate new PCPs, which facilitates experimentation and can lead to a more efficient system.

The Filaments DSM is implemented entirely in software. Consequently, it is flexible, inexpensive, and portable. Software is more flexible and less expensive to develop and produce than hardware. The DSM requires some operating system support, but it has been ported to many different machines and operating systems (SPARC/Solaris, Pentium Pro/Solaris, and Pentium Pro/Linux).

The DSM divides the address space of each node into shared and private sections. Shared user data (matrices, linked lists, etc.) is stored in the shared section, while local user data (program code, loop variables, etc.) and all system data structures (queues, page tables, etc.) are stored in the private section. This scheme reduces the amount of data that must be shared. Moreover, some data, such as the call frame, is not shared, so it belongs in private memory. A DSM is more efficient when nodes use private memory correctly.

To date, every PCP that has been developed for the Filaments package employs distributed page management, because this is significantly more efficient than using a centralized page manager, as discussed in [LH89]. In particular, each DSM page is owned by a node. The owner of a page services all page requests and maintains the state of the page. Page ownership migrates to a node that is updating the page. When a single node exclusively accesses a page, it owns it; moreover, a node will not own a page it does not access. In this way, the overhead of managing the pages is shared among the nodes.

Multithreading allows the DSM to overlap communication and computation in order to mitigate the latency of remote page faults. On a multiprocessor this latency is very small.
However, on a distributed-memory machine the latency for servicing a remote page fault is large, because servicing a fault requires messages. With multithreading, a ready thread can execute while another thread is blocked waiting for a remote page fault to be serviced. Thus, progress can be made while the page fault is outstanding. Fine-grain parallelism enables overlapping because of the tremendous amount of parallelism available: when one filament faults, there are literally hundreds or thousands of other filaments that could execute.

The multithreaded DSM provides overlap by multithreading server threads. After a remote page request is made, the faulting server thread is suspended and a ready server thread is scheduled for execution. This server thread executes filaments in a different pool, which access different data; this avoids multiple faults on the same page. When the page arrives at the node, the server thread that faulted (and any others that have also faulted on this page) is rescheduled. The DSM provides the mechanisms to suspend and resume threads, including the queues for holding ready, idle, and suspended threads; however, the page consistency protocol must provide the overlap during the page fault and message handlers.

4 Performance

This section evaluates the performance of the Filaments package. First we present results for eight kernels on both a shared- and a distributed-memory machine. Then we investigate the performance of a larger application, Water-Nsquared, which is from the Splash suite [PWG91]. Finally, we examine one of the kernels, Jacobi iteration, in depth. Note that we report on application-independent overheads, such as filament creation, barrier synchronization, and message overheads, elsewhere [FLA94]; these overheads have been found to be quite small.

All shared-memory programs were run on a Silicon Graphics Challenge multiprocessor, which has twelve 100 MHz processors running IRIX, separate data and instruction caches of 16K each, and a 1-megabyte secondary data cache. The (shared) main memory size is 256 megabytes. All distributed-memory programs were run on a network of Pentium Pro workstations connected by 100 Mb/s Fast Ethernet. Each workstation has a 200 MHz processor running Solaris, a 64K mixed instruction and data cache, and 64 megabytes of main memory. The cluster is isolated so that no outside network traffic interferes with executing programs.

Each test was run at least three times, and the reported result is the median. The tests use problem sizes such that the sequential time is between 50 and 100 seconds. All programs were compiled with the vendor-supplied compiler with the optimization flag on and were run when no other user was on the machine. In practice we have found test results to be very consistent; the other UNIX daemon processes do not interfere.

4.1 Scientific Programming Kernels

This section describes the performance of eight kernels: matrix multiplication, Jacobi iteration, LU decomposition, Tomcatv, Mandelbrot set, adaptive quadrature, Fibonacci, and binary expression trees. Programs are tested on both the SGI and the cluster of Pentium Pros. Again, we stress that the programs are run unchanged on the different machines. Only a few lines in the Filaments configuration file (to indicate shared or distributed memory and which type of architecture and operating system is used) must be changed.
    Kernel                  Work Load  Data Sharing  Synch.  Note
    Matrix Multiplication   Static     Light         None    1
    Jacobi Iteration        Static     Medium        Medium  2
    LU Decomposition        Static     Heavy         Heavy   3
    Adaptive Quadrature     Dynamic    None          Medium  4
    Tomcatv                 Static     Medium        Medium  5
    Mandelbrot              Dynamic    Light         None    6
    Fibonacci               Dynamic    None          Medium  7
    Expression Trees        Dynamic    Heavy         Medium  8

    Notes: 1. Only synchronization is termination detection; all shared data is read-only. 2. Edge sharing with neighbors; compute maximum change between iterations. 3. Decreasing work; every iteration disseminates values to all. 4. Fork/join parallelism; no data sharing. 5. Edge sharing with neighbors. 6. Variable amount of work. 7. Fork/join parallelism; no data sharing. 8. Fork/join parallelism; constant work per filament.

    Table 1: Application kernels and their characteristics.

Table 1 summarizes the eight kernels and shows three properties of each: work load, data sharing, and synchronization. Work load characterizes whether the number of tasks and the amount of work per task can be determined statically at compile time or must be determined at run time. Data sharing is a measure of the extent to which data is shared. Synchronization is a measure of the amount of interprocess synchronization.

For each kernel test, we present the time in seconds to execute it with 1, 2, 4, and 8 processors. Speedup is given relative to a sequential program, which is a separate uniprocessor program that contains no calls to the Filaments library. Care was taken to make the sequential program as efficient as possible.

Table 2 shows the performance of the kernels on an SGI Challenge. All programs except for adaptive quadrature attain a speedup of at least 4.7 on 8 processors. Speedup tapers off because the problem sizes are fixed, which leads to decreasing amounts of work per processor as the number of processors grows [SHG93]. The poor performance of adaptive quadrature on the SGI is unexplained. It is an anomaly that is specific to the 12-processor SGI Challenge; on an older, 4-processor SGI Challenge, adaptive quadrature achieves good speedup on 2 and 4 processors.

Note that on the grid-based, iterative applications, the sequential programs are between 5 and 10% faster than the Filaments one-processor programs. This is primarily because of better register allocation. The vendor-supplied compiler will never place global variables in registers, and the Filaments programs must make certain variables global. For example, in Jacobi iteration, the outermost timestep variable and the reduction variable must be global, because they are accessed in multiple procedures. In the sequential program, the corresponding variables are local.

    SGI                                            Sequential   P=1   P=2   P=4   P=8
    Matrix Multiplication, 600 x 600                     67.2  70.7  42.4  24.1  14.1
    Jacobi Iteration, 720 x 720, 100 iterations          56.0  60.4  31.6  16.7  10.0
    Tomcatv, 300 x 300                                   65.3  70.5  36.6  18.8  9.91
    LU Decomposition, 720 x 720                          54.0  60.2  31.1  16.2  11.1
    Mandelbrot, 920 x 920                                50.3  51.2  26.8  14.3  8.10
    Binary Expression Trees, 150 x 150, 64 levels        61.7  62.1  32.5  19.9  11.3
    Adaptive Quadrature, interval 1-24                   67.2  72.0  51.8  38.2  33.5
    Fibonacci, 39                                        54.8  55.4  32.7  16.5  10.4

    Table 2: Performance of kernels on SGI (times in seconds).
Table 3 shows the performance of the kernels on the cluster of Pentium Pros. The difference between the sequential program and the one-processor Filaments program arises for the same reasons as in the multiprocessor tests. All programs other than Jacobi iteration and expression trees attain a speedup of over 4 on 8 nodes. The speedup for Jacobi is poor on 8 nodes because of the significant time needed to initially distribute the data (via page faults). This overhead is amortized over the number of iterations executed, so increasing the number of iterations improves speedup. Also, as with the shared-memory tests, the fixed problem size causes speedup to level off. Furthermore, the sequential Jacobi program was significantly rewritten to get the best performance. The expression tree kernel gets only modest speedup because of the significant data movement in the sequential parts of the code (near the top of the tree); in these sections of code, no overlap of communication and computation is possible.

    Pentium Pro Cluster                            Sequential   P=1   P=2   P=4   P=8
    Matrix Multiplication, 920 x 920                     58.8  62.0  37.3  20.2  14.2
    Jacobi Iteration, 800 x 800, 360 iterations          53.1  62.8  37.7  21.8  16.1
    Tomcatv, 512 x 512                                   56.8  57.7  31.8  17.4  10.7
    LU Decomposition, 1200 x 1200                        81.7  85.2  50.0  27.1  17.7
    Mandelbrot, 1600 x 1600                              74.8  75.3  42.6  21.0  10.7
    Binary Expression Trees, 360 x 360, 32 levels        97.4  97.6  60.4  33.9  24.7
    Adaptive Quadrature, interval 1-25                   62.0  62.5  31.4  20.8  9.50
    Fibonacci, 42                                        82.3  83.5  41.8  23.0  15.3

    Table 3: Performance of kernels on Pentium Pro cluster (times in seconds).

4.2 Water-Nsquared

Water-Nsquared is an N-body molecular dynamics application. It evaluates the forces and potentials in a system of water molecules in the liquid state. The computation iterates for some number of time-steps. Every time-step solves the Newtonian equations of motion for molecules in a cubical box with periodic boundary conditions. To avoid computing all n^2/2 pairwise interactions, a spherical cutoff is used. The principal shared data structure is a large array of records; each element of the array corresponds to one molecule and holds information about the molecule's position and potential [SWG92].

The Filaments program consists of eight phases. The first three phases initialize the potentials of the molecules and, consequently, are executed once at the beginning of the program. The remaining phases are executed each time-step. There is one filament for each molecule in every phase. There is read-sharing between filaments (and processors) within a phase; however, there is no write-sharing (or race conditions) because barriers are used.

Table 4 shows the execution times for Water-Nsquared. As before, we compare the Filaments program to a sequential program as the baseline. The shared-memory programs compute for 10 time-steps on 512 molecules; the distributed-memory programs use 15 time-steps. Speedup on the SGI is very good: 3.9 and 6.0 on 4 and 8 processors, respectively. Speedup on the cluster is respectable: 2.9 and 4.6 on 4 and 8 processors, respectively.

                                           Sequential   P=1   P=2   P=4   P=8
    SGI, 512 molecules, 10 steps                 53.7  55.2  29.2  16.0  8.98
    Pentium Pro, 512 molecules, 15 steps         74.8  79.6  44.6  25.8  16.3

    Table 4: Performance of Water-Nsquared on SGI and Pentium Pro (times in seconds).

The previous section demonstrated that Filaments runs very different scientific kernels efficiently. This section demonstrates that a scientific application, consisting of many different computations, also executes efficiently using Filaments. Hence, Filaments can be used to program large-scale scientific applications.

4.3 Jacobi Iteration

This section examines the performance of one kernel, Jacobi iteration, in depth.

    Jacobi Iteration on SGI      P=1   P=2   P=4   P=8
    Filaments (SGI)             60.4  31.6  16.7  10.0
    Coarse-Grain (SGI)          57.9  30.1  16.1  9.33

    Table 5: Performance of Jacobi iteration on SGI, Filaments versus coarse-grain (times in seconds).
Tables 5 and 6 show the performance of the Filaments Jacobi program versus a coarse-grain program on the SGI and the cluster, respectively. The coarse-grain program provides a baseline, as it is a very efficient parallel implementation: it uses one process per processor, and on the cluster it uses explicit message passing (not the DSM) and overlaps communication and computation. The Filaments programs are very competitive with the coarse-grain programs in both cases, performing no more than 7.5% worse on 8 processors. This is due to the efficient mechanisms in the Filaments package. As with the sequential programs, most of the remaining overhead is due to register allocation.

    Jacobi Iteration on Pentium Pro Cluster    P=1   P=2   P=4   P=8
    Filaments (Cluster)                       62.8  37.7  21.8  16.1
    Coarse-Grain (Cluster)                    54.1  34.5  19.5  15.9

    Table 6: Performance of Jacobi iteration on cluster, Filaments versus coarse-grain (times in seconds).

The Jacobi iteration times reported above take advantage of two performance enhancements: the implicit-invalidate page consistency protocol (PCP) and multiple pools. The communication overhead is reduced by using the implicit-invalidate PCP, which requires fewer messages than the write-invalidate PCP [FLA94]; the write-invalidate PCP requires invalidate messages to be sent, received, and acknowledged. The performance improvement, 6.8% and 13.6% on 4 and 8 nodes, can be seen by comparing the results in the first two rows of Table 7. (The one-node performance of the write-invalidate program is slower than that of the other programs; we have profiled the code as well as examined the assembly code, and observe no difference.) The times for single-pool, non-overlapping Jacobi iteration are shown in the last row of Table 7. Overlapping communication leads to a 3.2% and 4.9% improvement on 4 and 8 nodes. It should be noted that the benefit of overlapping depends on the particular architecture and can be much greater; previously, on a cluster of Sun IPCs, we found the benefit to be over 20% [FLA94].

    Jacobi Iteration on Pentium Pro Cluster    P=1   P=2   P=4   P=8
    Multiple Pools, Implicit-Invalidate       62.8  37.7  21.8  16.1
    Multiple Pools, Write-Invalidate          66.1  39.2  23.3  18.3
    Single Pool, Implicit-Invalidate          62.8  38.1  22.5  16.9

    Table 7: Performance of Jacobi iteration using one pool and the write-invalidate page consistency protocol (times in seconds).

5 Related Work

The Filaments package builds on a wide range of work, primarily consisting of efficient threads packages, distributed shared memory, and overlapping communication and computation. Furthermore, it is one of many approaches to architecture independence.

5.1 Threads

There are many general-purpose thread packages, including Threads [Doe87], Presto [BLL88], uSystem [BS90], uC++ [BDSY92], and Sun Lightweight Processes [SS92]. All of the above packages support pre-emption to provide fairness, which requires each thread to maintain a private stack. Consequently, the package has to perform a full context switch when it switches between threads, which makes efficient fine-grain parallelism impossible. Several researchers have proposed ways to make standard thread packages more efficient, including Anderson et al. [ALL89, ABLL92], Schoenberg and Hummel [HS91], and Keppel [Kep93]. WorkCrews [VR88] supports fork/join parallelism on small-scale, shared-memory multiprocessors, and introduced the concepts of pruning and of ordering queues to favor larger threads, concepts borrowed by Filaments.
Cilk-5 [FLR98] compiles a fast clone and a slow clone for each parallel function, and executes the fast clone while all processors are busy, which is similar in spirit to pruning. Markatos et al. [MB92] present a thorough study of the tradeoffs between load balancing and locality in thread scheduling on shared-memory machines. The threads packages most closely related to Filaments that provide efficient fine-grain parallelism are the Uniform System [TC88], Chores [EZ93], and TAM [CDG+93]. The first two do not support fork/join parallelism or distributed-memory machines, and the latter is oriented towards functional programming and distributed-memory machines. Two other, subsequent thread packages use an approach similar to Filaments: uThread [Shu95] and the virtual processor approach [NS95].

5.2 Distributed Shared Memory

There is a wealth of related work on distributed shared memory systems. These systems can largely be divided into three classes: hardware implementations, kernel implementations, and user implementations. Hardware implementations include MemNet [Del88], Plus [BR90], Alewife [KCA91], DASH [LLJ+93], and FLASH [KOH+94]. The hardware detects reads and writes to shared locations and maintains consistency. This is efficient but expensive. Kernel implementations modify the operating system kernel to provide the DSM; these include Ivy [LH89], Munin [CBZ91], and Mirage [FP89]. While less efficient than hardware mechanisms, these implementations provide reasonable fault-handling times. Modifying a kernel is complex and error-prone; as a result, user-level implementations have recently become much more common. These include Treadmarks [KDCZ94], Midway [BZS93], and CRL [JKW95]. Most of these systems differ from the Filaments DSM in that they are single-threaded on each node and thus cannot mitigate communication latency. Alewife is multithreaded, but in hardware. A few later DSM systems also provide multithreaded nodes, such as CVM [TK97].

5.3 Overlap of Communication and Computation

The idea of overlapping communication and computation is not new; the technique has been used in hardware and operating systems for years. With the relatively long remote-memory latencies in parallel computing, such mitigation becomes very important. There are two primary methods of overlap: explicit and implicit.

Split-C supports explicit overlap of communication and computation [CDG+93]. The language allows the programmer to overlap communication and computation. The challenge is to place the prefetch sufficiently far in advance that the reference occurs after the reply is received: the time between a request and the first reference must be greater than the latency. However, the prefetch can only precede the access by a limited amount, which depends on the program. CHARM is a fine-grain, explicit message-passing threads package [FRS+91]. It provides overlap of communication and computation, and dynamic load balancing. CHARM has a distributed-memory programming model that can be run efficiently on both shared- and distributed-memory machines. In contrast, Filaments implicitly overlaps communication with any local computation. This difference is the result of the fundamentally different ways in which the overlap of communication and computation is achieved in the two systems. The Filaments approach is more general (it works on more problems) and simpler (it requires no source code modifications). Also, Filaments provides functionality similar to CHARM using a shared-memory programming model.
5.4 Other Approaches to Architecture Independence

The combination of fine-grain parallelism and shared-variable communication is one approach to architecture independence; there are many others. Four approaches that have received significant attention are parallelizing compilers, functional languages, annotated languages, and message-passing libraries.

The simplest path to a parallel program from a programmer's point of view is to write a sequential program and use a parallelizing compiler such as SUIF [AALT95] or Polaris [BDE+96]. The compiler computes data dependences, detects where communication is needed, coarsens granularity, and determines data distributions. If the compiler is retargeted efficiently to several different architectures (which requires significant effort), it may achieve architecture independence: the programmer writes a sequential program, and it is compiled to an efficient executable on several different parallel machines. The problem is that compilers can handle only restricted application domains efficiently, and the best parallel algorithm may be different from the best sequential algorithm.

Another approach is to use a functional language, which is sufficiently high level to be independent of the machine. Sisal (Streams and Iteration in a Single Assignment Language) is a general-purpose functional language [FCO90]. It is implemented using a dataflow model, meaning that program execution is determined by the availability of data, not by the static ordering of expressions in the source code. The compiler can schedule the execution of expressions in any order, including concurrently, as long as it preserves data dependences. Sisal is arguably the most successful high-level parallel language; Sisal programs execute on numerous sequential, vector, shared-memory, and distributed-memory machines. However, if the compiler does not produce efficient code (due to, for example, the failure of complicated update-in-place analysis), it is very difficult to tune the program because of the vast difference between the source and the executable. Other functional languages include Multilisp [Hal86] and Id [Nik89].

Another possibility is to use an annotated language such as HPF [HPF93] or Dino [RSW91]. The programmer uses annotations to specify certain implementation-related aspects of the program, such as which sections of code can be parallelized or how data should be distributed. This information is used by the compiler to generate an efficient parallel program for different architectures. Although the annotations may change with the architecture, the hope is that these changes are fairly localized. The problem with annotated languages is that although they lessen the burden on the programmer compared to writing explicitly parallel programs using subroutine libraries, they still require the programmer to make the key decisions. Furthermore, the annotations may have to change significantly between architectures.

Finally, PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are two approaches that provide portability through an API. PVM virtualizes the underlying machine, even allowing heterogeneous computing [Sun90]. MPI supports portable message-passing programs on a large number of machines [For94]. Both packages primarily target distributed-memory machines and have large installed bases. However, as previously stated, writing a correct, efficient message-passing program is quite difficult; even a simple nearest-neighbor computation requires the explicit bookkeeping sketched below.
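To illustrate that bookkeeping, the fragment below performs the ghost-row exchange that one step of a block-row Jacobi iteration would need under MPI. It is only a sketch: the grid layout, ghost-row convention, tags, and the NCOLS constant are assumptions of this example, not code from the paper or from the coarse-grain programs measured above.

```c
/* Boundary-row exchange for one Jacobi step under a block-row
 * distribution; illustrative only.  The programmer must manage ranks,
 * neighbors, tags, ghost rows, and buffer layout explicitly. */
#include <mpi.h>

#define NCOLS 512              /* assumed grid width */

void exchange_boundaries(double *grid, int local_rows)
{
    int rank, size, up, down;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbors; MPI_PROC_NULL turns edge exchanges into no-ops. */
    up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Rows 1..local_rows are owned; rows 0 and local_rows+1 are ghosts. */

    /* Send first owned row up, receive ghost row from below. */
    MPI_Sendrecv(&grid[1 * NCOLS],                NCOLS, MPI_DOUBLE, up,   0,
                 &grid[(local_rows + 1) * NCOLS], NCOLS, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, &status);

    /* Send last owned row down, receive ghost row from above. */
    MPI_Sendrecv(&grid[local_rows * NCOLS],       NCOLS, MPI_DOUBLE, down, 1,
                 &grid[0 * NCOLS],                NCOLS, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, &status);
}
```

A shared-variable version of the same step simply reads the neighboring rows and lets the runtime fetch any remote data, which is the simplification Filaments is designed to preserve.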
6 Conclusion

This paper has discussed the design and implementation of Filaments, which provides an architecture-independent substrate for parallel programming. In particular, Filaments programs run unchanged and efficiently on both shared- and distributed-memory machines. We accomplish this by using a well-designed API and an efficient implementation of fine-grain parallelism and shared-variable communication.

Section 2 described Filaments programs for two applications. Both were programmed in an architecture-independent manner, so that they could be executed on any machine. Section 3 presented details of the implementation of the Filaments API, fine-grain parallelism, and shared-variable communication. The key techniques are very lightweight threads (filaments), run-time optimizations, and a multithreaded distributed shared memory. Section 4 showed that Filaments programs attain good speedups on a shared-memory multiprocessor (SGI Challenge) and a distributed-memory multicomputer (cluster of Pentium Pros). Furthermore, one application was examined in detail, which showed that the performance of Filaments is very competitive with that of a hand-written coarse-grain program.

In conclusion, Filaments demonstrates that fine-grain parallelism and shared-variable communication create a simple, general model for portable parallel programming. Additionally, the Filaments package shows that machine-specific libraries and preprocessing are sufficient to allow programs to be created once and run unchanged on vastly different machines. Lastly, this paper demonstrates that performance does not have to be sacrificed for portability; that is, Filaments delivers architecture-independent parallel computing.

Acknowledgements

The authors are grateful to Sandhya Dwarkadas and the University of Rochester for access to their SGI Challenge.

References

[AALT95] Saman P. Amarasinghe, Jennifer M. Anderson, Monica S. Lam, and Chau-Wen Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February 1995.

[ABLL92] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.

[ALL89] T. E. Anderson, E. D. Lazowska, and H. M. Levy. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631–1644, December 1989.

[BDE+96] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel programming with Polaris. IEEE Computer, 29(12):78–82, December 1996.

[BDSY92] Peter A. Buhr, Glen Ditchfield, R. A. Stroobosscher, and B. M. Younger. uC++: Concurrency in the object-oriented language C++. Software: Practice and Experience, 22(2):137–172, February 1992.

[BLL88] B. N. Bershad, E. D. Lazowska, and H. M. Levy. PRESTO: A system for object-oriented parallel programming. Software: Practice and Experience, 18(8):713–732, August 1988.

[BR90] Roberto Bisiani and Mosur Ravishankar. Plus: A distributed shared memory system. In 17th Annual International Symposium on Computer Architecture, pages 115–124, May 1990.

[BS90] Peter A. Buhr and R. A. Stroobosscher. The uSystem: Providing light-weight concurrency on shared memory multiprocessor computers running UNIX. Software: Practice and Experience, pages 929–964, September 1990.
[BZS93] Brian N. Bershad, Matthew J. Zekauskas, and Wayne A. Sawdon. The Midway distributed shared memory system. In COMPCON '93, pages 528–537, 1993.

[CBZ91] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152–164, October 1991.

[CDG+93] David Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel programming in Split-C. In Proceedings of Supercomputing '93, November 1993.

[Del88] G. Delp. The Architecture and Implementation of MemNet: A High-Speed Shared Memory Computer Communication Network. PhD thesis, University of Delaware, Newark, Delaware, 1988.

[Doe87] Thomas W. Doeppner. Threads: A system for the support of concurrent programming. Technical Report CS-87-11, Brown University, June 1987.

[EZ93] Derek L. Eager and John Zahorjan. Chores: Enhanced run-time support for shared memory parallel computing. ACM Transactions on Computer Systems, 11(1):1–32, February 1993.

[FA96] Vincent W. Freeh and Gregory R. Andrews. Dynamically controlling false sharing in distributed shared memory. In Proceedings of the 5th Symposium on High Performance Distributed Computing, August 1996.

[FCO90] John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the SISAL language project. Journal of Parallel and Distributed Computing, 10(4):349–366, December 1990.

[FLA94] Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In First Symposium on Operating Systems Design and Implementation, pages 201–212, November 1994.

[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (to appear), June 1998.

[For94] MPI Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8(3/4):165–416, 1994.

[FP89] Brett D. Fleisch and Gerald J. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 211–223, December 1989.

[Fre96] Vincent W. Freeh. Software Support for Distributed and Parallel Computing. PhD thesis, The University of Arizona, June 1996.

[FRS+91] W. Fenton, B. Ramkumar, V. A. Saletore, A. B. Sinha, and L. V. Kale. Supporting machine independent programming on diverse parallel architectures. In Proceedings of the 1991 International Conference on Parallel Processing, volume II, Software, pages II-193–II-201, Boca Raton, FL, August 1991. CRC Press.

[Hal86] Robert H. Halstead. An assessment of Multilisp: Lessons from experience. International Journal of Parallel Programming, 15(6):459–501, December 1986.

[HFM88] D. Hensgen, R. Finkel, and U. Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1–18, February 1988.

[HPF93] High Performance Fortran language specification, October 1993.

[HS91] S. F. Hummel and E. Schonberg. Low-overhead scheduling of nested parallelism. IBM Journal of Research and Development, 35(5):743–765, September 1991.

[JKW95] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. In Fifteenth Symposium on Operating Systems Principles, pages 213–228, December 1995.
[KCA91] Kiyoshi Kurihara, David Chaiken, and Anant Agarwal. Latency tolerance through multithreading in large-scale multiprocessors. In International Symposium on Shared Memory Multiprocessing, pages 91–101, April 1991.

[KDCZ94] Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.

[Kep93] David Keppel. Tools and Techniques for Building Fast Portable Threads Packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.

[KOH+94] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302–313, April 1994.

[LA96] David K. Lowenthal and Gregory R. Andrews. An adaptive approach to data placement. In Proceedings of the 10th International Symposium on Parallel Processing, pages 349–353, April 1996.

[LFA96] David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Using fine-grain threads and run-time decision making in parallel computing. Journal of Parallel and Distributed Computing, 37:41–54, November 1996.

[LFA98] David K. Lowenthal, Vincent W. Freeh, and Gregory R. Andrews. Efficient support for fine-grain parallelism on shared-memory machines. Concurrency: Practice and Experience, 10(3):157–173, March 1998.

[LH89] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4), November 1989.

[LLJ+93] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH prototype: Logic overhead and performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41–61, January 1993.

[Low98] David K. Lowenthal. Local and global data distribution in the Filaments package. In PDPTA '98 (to appear), July 1998.

[MB92] Evangelos P. Markatos and Thomas J. LeBlanc. Load balancing vs. locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, volume I, Architecture, pages I:258–267, Boca Raton, Florida, August 1992. CRC Press.

[Nik89] Rishiyur S. Nikhil. The parallel programming language Id and its compilation for parallel machines. In Proceedings of the Workshop on Massive Parallelism: Hardware, Programming and Applications, October 1989.

[NL91] Bill Nitzberg and Virginia Lo. Distributed shared memory: A survey of issues and algorithms. IEEE Computer, pages 52–60, August 1991.

[NS95] Richard Neves and Robert B. Schnabel. Runtime support for execution of fine-grain parallel code on coarse-grain multiprocessors. In Fifth Symposium on the Frontiers of Massively Parallel Computing, February 1995.

[PWG91] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Technical Report CSL-TR-91-469, Department of Electrical Engineering and Computer Science, Stanford University, April 1991.

[RSW91] Matthew Rosing, Robert Schnabel, and Robert Weaver. The Dino parallel programming language. Journal of Parallel and Distributed Computing, 13(1):30–42, September 1991.

[SHG93] Jaswinder Pal Singh, John L. Hennessy, and Anoop Gupta. Scaling parallel programs for multiprocessors: Methodology and examples. Computer, 26(7):42–50, July 1993.
[Shu95] Wei Shu. Runtime support for user-level ultra lightweight threads on massively parallel distributed memory machines. In Fifth Symposium on the Frontiers of Massively Parallel Computing, February 1995.

[SS92] D. Stein and D. Shah. Implementing lightweight threads. In USENIX 1992, June 1992.

[Sun90] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4), 1990.

[SWG92] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):5–44, March 1992.

[Tan95] Andrew S. Tanenbaum. Distributed Operating Systems. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1995.

[TC88] Robert H. Thomas and Will Crowther. The Uniform System: An approach to runtime support for large scale shared memory parallel processors. In 1988 International Conference on Parallel Processing, pages 245–254, August 1988.

[TK97] Kritchalach Thitikamol and Pete Keleher. Multi-threading and remote latency in software DSMs. In Proceedings of the 17th International Conference on Distributed Computing Systems, May 1997.

[VR88] M. Vandevoorde and E. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347–366, August 1988.