Programming for Malleability with Hybrid MPI-2 and OpenMP – Experiences with a Simulation Program for Global Water Prognosis

Claudia Leopold, Michael Süß, Jens Breitbart
University of Kassel, Research Group Programming Languages / Methodologies
Wilhelmshöher Allee 73, 34121 Kassel, Germany
leopold,[email protected], [email protected]

Keywords—Parallelization of Simulation, Libraries and Programming Environments, Message Passing

Abstract—This paper reports on our experiences in parallelizing WaterGAP, an originally sequential C++ program for the global assessment and prognosis of water availability. The parallel program runs on a heterogeneous SMP cluster and combines different parallel programming paradigms: First, at its outer level, it uses master/slave communication implemented with MPI. Second, within the slave processes, multiple threads are spawned by OpenMP directives to exploit data parallelism. Time measurements show that the hybrid scheme pays off. It adapts to the heterogeneity of the cluster by using multiple threads only for the largest tasks and mapping these to multiprocessor nodes. Third, the program is malleable, which has been accomplished with the dynamic process management facilities of MPI-2, based on the MPICH2 implementation. In particular, it is possible to increase the number of processes while the program is running. Malleability is an important feature in both batch systems and grid environments. We discuss the support that MPI-2 provides for malleability.

I. INTRODUCTION

Parallel programming today is dominated by the Message Passing Interface MPI [3], [9], OpenMP [10], and lower-level threading libraries [7], [14]. While MPI is most appropriate for distributed-memory architectures, OpenMP has been designed for shared-memory machines. A common architecture nowadays is the SMP cluster, i.e., a cluster of multiprocessor nodes, which combines shared memory within each node with distributed memory between the nodes. Accordingly, hybrid forms of programming, which use MPI at the outer level of the program and OpenMP for decomposing individual processes into threads, have been proposed [11], [12].

Other properties of current architectures, especially heterogeneity and dynamic behavior, have so far received far less consideration in the parallel programming community. We speak of heterogeneity if an architecture is composed of different computing resources, for instance if the nodes of an SMP cluster differ in their number of processors. We speak of dynamic behavior if the number of processors available to an application varies during the program's execution. The term malleability has been coined to denote the ability of programs to adapt to changes in the number of processors [15]. Malleability is helpful in batch systems, where it gives the scheduler more freedom in assigning jobs, but it is particularly important in grids.

This paper evaluates the opportunities that MPI and OpenMP provide for making use of modern architectures, based on an example program from the simulation domain. The program is called WaterGAP, which stands for "Water – Global Assessment and Prognosis". It has been developed at the Center for Environmental Systems Research of the University of Kassel, and is explained in Section II of this paper. Briefly stated, WaterGAP partitions the surface area of the continents into equal-sized grid cells.
Based on input data for climate, vegetation etc., it simulates the flow of water, both vertically (precipitation, transpiration) and horizontally (routing through river networks). Presently, the input data are being refined, and to cope with the resulting increase in computational expense, the program has been parallelized.

From a computational point of view, the vertical simulation in WaterGAP gives rise to data parallelism, as grid cells are computed independently. The horizontal simulation requires communication, but one can observe that the world is partitioned into independent basins. Independent means, in this context, that there is no flow of water between basins, a property that is inherent in the hydrological model behind WaterGAP. Thus, different basins can be computed by different processes, without any need for communication. The basins differ in size, from a few very large basins to many small ones. The size of the basins is determined by natural conditions, and thus cannot be changed.

As target architecture for our parallelization efforts, we consider the compute cluster of the University of Kassel. This Linux cluster consists of 58 double-processor nodes and one 8-processor node that comprises 4 dual-core chips. Jobs are submitted through the Torque batch system, which is an OpenPBS derivative. The cluster usually operates at full capacity, with the majority of jobs being sequential and long-running. Before a WaterGAP run can be started, it may take several hours until the requested number of nodes has become free. During this time, the batch system assigns processors to WaterGAP as they become available, but it will not start the program until all requested nodes are available. Thus, the already assigned processors are idle for several hours, which is a waste of resources. On the cluster, both MPI-2, with the MPICH2 implementation, and OpenMP, with the Portland compiler, are available.

The parallelization has been accomplished in three steps:

1. Master/slave structure: This standard pattern has been implemented with MPI. It is used to distribute the basins to be computed among processes. The resulting program scales almost linearly up to about 8 slaves. Beyond that, the computation of the largest basin prevents any further speedups.

2. Hybrid parallelism: Keeping the master/slave structure unchanged, we used OpenMP to split individual slaves into multiple threads. These threads exploit the data parallelism among grid cells within a basin, during the vertical and part of the horizontal simulation. The speedup from threading is sublinear, but the hybrid scheme can be used to overcome the bottleneck at the largest basin. On our heterogeneous cluster, the scheme is particularly efficient if the largest basin is mapped to the 8-processor node, and all but the two largest basins are computed by a single thread only.

3. Malleability: To avoid idling of processors before the batch system starts WaterGAP, we rewrote the application so that it can be started with a small number of slaves and later incorporates more and more slaves as processors become available. We implemented this feature using the dynamic process management facilities of MPI-2.

While master/slave parallelism and hybrid parallelism are already quite well understood, we do not know of any previous work on the implementation of malleability with MPI-2. This paper describes our experiences, and discusses design decisions in our program.

The paper is organized as follows.
Sect. II describes the WaterGAP application. Then, the main part consists of Sections III–V, which are devoted to the three steps of parallelization outlined above. Sect. VI reviews related work, and Sect. VII summarizes the paper and mentions directions for future research.

II. WATERGAP

The WaterGAP program [1], [6], [16] has been developed at the University of Kassel, with the goal of investigating current and future water availability worldwide. Several projects have already been carried out with this program, including studies on the impacts of global changes in climate as well as of water withdrawals by households, factories, irrigated farms, and so on.

The WaterGAP program operates on several hundred input files, and generates about 15 output files per year of the simulation period. A typical simulation period is 30 years. Input data refer, among others, to climate, vegetation, and water use. Most input data are available for each grid cell; some of the data are given per day, others per month or per year. Among the input data is a flow direction map that assigns to each grid cell the single neighbouring cell into which its surplus water flows. Output data include results on various specific measures such as groundwater runoff and snow cover, and are collected per grid cell. Most outputs are generated annually, some monthly.

The WaterGAP program has been written in C++ and comprises about 15,000 lines of code. The program can be run in different modes (e.g. calibration mode), of which this paper only considers the global computation mode, i.e., the simulation is carried out for each grid cell worldwide. After an initialization phase, the main loop of WaterGAP iterates through the days of the simulation period. For each day, it carries out a vertical simulation first, and several steps of horizontal simulation thereafter. During the vertical simulation, transpiration and water balance are computed for each grid cell, based on data for precipitation, temperature, etc. The horizontal simulation calculates the flow of water through the networks of rivers, lakes etc. Whereas the vertical simulation runs once per day, the horizontal simulation is carried out several times per day, with the exact number of time steps depending on river velocity.

During each vertical simulation step, the computations for different grid cells are independent. During the horizontal simulation, water is exchanged between grid cells, but this exchange is restricted to cells of the same basin. For hydrological reasons, the world is divided into a total of about 10,000 basins. A few of these are very large, such as the Amazon area, but the majority of basins comprise only a few grid cells. The basins are stable, i.e., the set of basins does not depend on any inputs other than the flow direction map, and even the relative computational expense of the basins depends only weakly on the input.

Currently, the input data of WaterGAP are being refined to increase the spatial resolution from 0.5 degrees (30 arc-minutes) to 5 arc-minutes side length of the grid cells. With the 0.5-degree version, WaterGAP had a running time of about 10 hours on a standard PC, for a simulation period of 30 years. With the new resolution, the running time increases by a factor of about 36, which can only be handled in a reasonable amount of time on a parallel machine.

III. MASTER/SLAVE PARALLELISM AND I/O

The first step of our parallelization uses the fact that basins are computationally independent. It takes the simulation of each basin, over the whole simulation period, as a task. Since task sizes differ, we use the master/slave scheme for mapping tasks to processes. This well-known scheme deploys one master process and several slaves. The master starts by sending a task to each slave. Whenever a slave has finished its work, it reports the result to the master and gets the next task, until all tasks have been processed.
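The following sketch illustrates this task distribution in the master. It is a generic outline, not the WaterGAP code: the encoding of a task as a basin index, the message tag, and the use of -1 as a stop marker are assumptions made for the example; in WaterGAP, the master additionally collects output data and hands out basins in decreasing order of size.

    // Generic master loop of the master/slave scheme (sketch with an
    // assumed task encoding: one int per basin, -1 as stop marker).
    #include <mpi.h>
    #include <vector>
    #include <cstddef>

    void master(const std::vector<int>& basins, int num_slaves) {
        std::size_t next = 0;
        int active = 0;
        // Hand out an initial basin to each slave (ranks 1..num_slaves).
        for (int s = 1; s <= num_slaves; ++s) {
            int task = (next < basins.size()) ? basins[next++] : -1;
            MPI_Send(&task, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
            if (task != -1) ++active;
        }
        // Whenever a slave reports a result, send it the next basin
        // or the stop marker.
        while (active > 0) {
            int result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            int task = (next < basins.size()) ? basins[next++] : -1;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            if (task == -1) --active;
        }
    }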
In a preprocessing step, we group grid cells into basins, referring to the information in the flow direction map. The result contains many small basins, for which the overhead of master/slave communication and I/O would outweigh any gains in performance. Therefore, we group the small basins into larger working units to be distributed to the slaves, and store the result in a file. For ease of presentation, we denote the working units as basins as well. The preprocessing does not need to be repeated in each program run, but only when the flow direction map changes.

Table I depicts the distribution of sizes among basins. The numbers have been obtained by grouping natural basins of up to 100 grid cells together into working units. The threshold of 100 has been selected experimentally to maximize the overall speedup, and is used throughout this paper. As can be seen in the table, there are a few very large basins, with the largest basin taking more than twice the compute time of the second largest. The master assigns the basins in decreasing order of size, i.e., the largest basin is assigned first.

TABLE I: Size distribution of basins (working units). Running time has been measured for a simulation period of two years on a single processor.

Number of basins   Size (in grid cells)   Time (in sec.)
1                  3165                   275
1                  1946                   120
5                  1000 – 1700            50 – 90
400                100 – 1000             2 – 40

Another aspect of the master/slave parallelization is I/O. Both the input files and the output files follow a standard format that allows their contents to be visualized with a tool. The format requires a particular arrangement of grid cells, which is not compatible with the assignment to basins. Thus, in each file, the grid cells of any single basin are scattered throughout the file. Therefore, each slave must read and write discontinuous data that do not follow a regular pattern. MPI-2 supports efficient access of multiple processes to the same file. The focus, however, is on regular patterns that can be described by an MPI datatype. Although explicit positioning of file pointers is possible, the use of MPI-2 I/O slowed down our program considerably. Therefore, we implemented a conservative approach using language-level I/O (fread, fwrite etc.). Input files are read by all processes, while output data are collected and written by the master.

The master/slave scheme scales almost linearly up to about 8 slaves (see Sect. IV). With more slaves, the largest basin becomes a bottleneck and prevents any further speedup.

IV. HYBRID PARALLELISM AND HETEROGENEITY

To scale the number of slaves beyond 8, we have to parallelize the computation of basins internally. As already stated, the program has potential for data parallelism, since grid cells are independent during the vertical simulation. While this potential can, in principle, be exploited with MPI, the distributed-memory programming model of MPI would force us to partition data structures among processes, and to realize all communication between grid cells through explicit message passing. Therefore, we chose OpenMP for intra-basin parallelization. OpenMP has a shared-memory model, and so there was no need to change the existing data structures. Since the two parallel programming systems are used for orthogonal aspects of parallelism (inter-basin vs. intra-basin), their combination only slightly increases the complexity of the program. In fact, the OpenMP parallelization was easy. We identified and parallelized some central loops: the main loop of the vertical simulation, and several loops over grid cells during the horizontal simulation.
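As an illustration of this kind of loop-level parallelism, the sketch below shows how a loop over the grid cells of one basin can be distributed among the threads of a slave with a single OpenMP directive. The data structure and the computations inside the loop are invented placeholders; the actual WaterGAP loop bodies are considerably more involved.

    // Sketch of intra-basin data parallelism with OpenMP
    // (Cell and the loop body are placeholders, not WaterGAP code).
    #include <vector>

    struct Cell { double storage, runoff; };

    // One vertical simulation step: every cell is independent, so the
    // loop iterations can be shared among the threads of one slave.
    void vertical_step(std::vector<Cell>& cells,
                       const std::vector<double>& precipitation) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(cells.size()); ++i) {
            cells[i].storage += precipitation[i];   // placeholder water balance
            cells[i].runoff   = 0.1 * cells[i].storage;
            cells[i].storage -= cells[i].runoff;
        }
    }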
Figure 1 shows timing results of the hybrid MPI/OpenMP program with different numbers of slaves and threads. In addition to the slaves, one single-threaded master process was used. Running times are given in seconds and refer to a simulation period of two years. The programs have been compiled with MPICH 1.2.7, using the Portland compiler with the optimization flags -fast -Mconcur. Measurements were carried out on three architectures: the compute cluster of the University of Kassel (double-processor AMD Opteron 248 nodes), the compute cluster of the University of Frankfurt (double-processor AMD Opteron 244 nodes), and a more powerful node of the Kassel cluster (4 AMD Opteron 875 dual-core processors). Jobs have been submitted through the batch system so that each process and thread had exclusive access to a processor (except for the largest run on the 8-processor node).

Fig. 1. Running time of the hybrid program.

The program scales almost linearly up to 8 or 9 slaves, both in the pure MPI version and in the hybrid version with two threads. Beyond that, the running time remains constant; the computation of the largest basin forms a bottleneck (not shown in the figure, but observed in the experiments). The numbers show that OpenMP alone contributes a significant, although sublinear, performance gain, and that this gain is orthogonal to the gain from the MPI parallelization.

Since inter-basin parallelism yields higher speedups than intra-basin parallelism, the hybrid scheme pays off most if it is coupled with heterogeneity, i.e., if different processes use different numbers of threads. Heterogeneity must be supported by the batch system. The Torque system accepts requests such as: 4 nodes with 2 processors, plus 1 node with 8 processors, plus 1 processor from any node. To start a heterogeneous run, some programming is required in the batch script to control the assignment of MPI processes to nodes.

The WaterGAP program profits from heterogeneity in three ways: First, in our realization of the master/slave scheme, the master does not participate in the computation (we chose this variant for simplicity). Instead, it concentrates on communication with the slaves and on I/O, which is not compute-intensive. Therefore, it is sufficient to use a single thread for the master. The numbers in Fig. 1 have been generated this way. Additional experiments with a double-threaded master led to almost identical results, despite the higher consumption of resources. Second, the largest basin is mapped to the most powerful node, in our case the 8-processor node. As the batch script controls the assignment of MPI processes to nodes, the master program was easily adapted to distribute the work accordingly. Third, only the largest basins profit from the use of multiple threads to overcome bottlenecks, whereas the smaller basins gain more if the two processors of a node are used for two processes (since inter-basin parallelism yields higher speedups than intra-basin parallelism).
Using heterogeneity, we reduced the running time of WaterGAP to a minimum of 99 seconds on the Kassel cluster, which corresponds to a speedup of 22 as compared to a sequential run on the fastest node. This speedup is obtained with a total of 32 processors: 6 processors of the powerful node for the largest basin, 2 processors of the powerful node for the second largest basin, 2 processors of another node for the third largest basin, and 22 single-threaded processes for all other basins and the master. In this run, the bottleneck was observed at the largest basin. When 8 threads were used for this basin, however, the second largest basin became the bottleneck. Further speedups therefore depend on the availability of a second multiprocessor node with more than two processors.

V. MALLEABILITY WITH MPI-2

As stated before, another problem on our target cluster is the idling of processors, because the batch system has to wait for the availability of all requested processors before application startup. We solved this problem by making the program malleable, i.e., the program starts with a small number of processes, and later incorporates more and more processes as the processors become available. The master/slave pattern is well suited to malleability, as the master can easily distribute the tasks to an increasing number of slaves. When a slave dies, the master could also reassign its task, but this aspect is not relevant here. Instead of submitting a single job to the batch system, we submit several jobs, each of which requests part of the resources. For instance, we submit one job per MPI process.

We used MPI-2 for the implementation, and therefore start this section with a brief introduction to the relevant parts of MPI-2. After that, we describe our implementation and discuss various design decisions as well as the MPI-2 support. Briefly stated, we use a client/server scheme that complements the master/slave scheme already explained. The overall structure is depicted in Fig. 2, where edges marked 1 denote process startup, edges marked 2 represent master/slave communication for the distribution of basins, and edges marked 3 represent connection establishment by the MPI-2 client/server routines. Note that master and server are different processes. The server is started in advance, while all other processes are started through the batch system. The first process takes the role of the master, and the others are slaves.

Fig. 2. Structure of the malleable program.

A. Dynamic Process Management in MPI-2

Whereas the original MPI-1 standard required the number of processes to be fixed at program startup, the more recent MPI-2 standard supports dynamic process management with two sets of functions: First, MPI_Comm_spawn and related functions allow an MPI process to dynamically spawn new MPI processes. However, there is no obvious way to inform an existing MPI process about the fact that new processors have become available, and therefore we do not use these functions.

Second, MPI-2 defines a set of functions for client/server communication, somewhat similar to socket communication in networks. On the server side, these include MPI_Open_port to establish a network address, MPI_Comm_accept to wait for a connection request from a client, and MPI_Publish_name to register the port information with a name server. On the client side, they include MPI_Comm_connect to establish communication with a server whose port information is known, and MPI_Lookup_name to retrieve the port information from the name server.
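A minimal sketch of the server side of such a connection is shown below. The service name "watergap_server", the fixed number of expected clients, and the process counter sent to each client are assumptions made for the example; the counter anticipates the process-numbering scheme described in Sect. V-B, and the notification of the master is omitted.

    // Server-side sketch of the MPI-2 client/server functions
    // (service name and message layout are assumed for the example).
    #include <mpi.h>
    #include <cstdlib>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        const int expected = (argc > 1) ? std::atoi(argv[1]) : 1;  // clients to accept

        char port[MPI_MAX_PORT_NAME];
        MPI_Open_port(MPI_INFO_NULL, port);                        // obtain a port address
        MPI_Publish_name("watergap_server", MPI_INFO_NULL, port);  // register it with the name server

        for (int counter = 0; counter < expected; ++counter) {
            MPI_Comm client;
            // Blocks until some client calls MPI_Comm_connect with the matching port.
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
            MPI_Send(&counter, 1, MPI_INT, 0, 0, client);          // hand out a process number
            MPI_Comm_disconnect(&client);
            // In the real program, the server would also notify the master here.
        }

        MPI_Unpublish_name("watergap_server", MPI_INFO_NULL, port);
        MPI_Close_port(port);
        MPI_Finalize();
    }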
After a successful connection, the matching accept and connect calls each return an intercommunicator, through which server and client can send and receive messages in both directions. The concept of communicators was already introduced in MPI-1. Communicators are data structures that store all information required for communication within a group of processes. MPI distinguishes intracommunicators and intercommunicators. Intracommunicators number processes consecutively, for easy access. At program startup, the intracommunicator MPI_COMM_WORLD is predefined and comprises all processes started. Many MPI programs rely solely on MPI_COMM_WORLD. The communicator returned after connection establishment is an intercommunicator, i.e., a bridge between two groups of processes. Note that both client and server may be composed of multiple processes. In this case, MPI_Comm_accept and MPI_Comm_connect have to be invoked by all processes of the respective group, and the calls return only after successful connection establishment (in MPI terms, the functions are collective and blocking). Intercommunicators are more difficult to handle than intracommunicators, but MPI provides a function called MPI_Intercomm_merge to transform an intercommunicator into an intracommunicator. This function is collective and blocking over both groups. Communicators can be released using the function MPI_Comm_disconnect.

B. Master and Server Processes

We now return to the malleable WaterGAP version. From the running application's point of view, new processes may become available at any point in time, and should enter the computation as soon as possible. The master cannot call MPI_Comm_accept itself, as the function is blocking and would therefore prevent it from doing other work. Thus, the function must be called in a separate thread of control, the server (see Fig. 2). There are two options for the implementation: master and server are different threads of the same process, or master and server are separate processes. In both cases, the master must be informed when a new slave becomes available, i.e., communication is required from the server to the master. This communication is rather difficult to accomplish in OpenMP, as it requires repeated inquiries and synchronization on the master's side. In MPI, the communication is more natural: According to the master/slave scheme, the master runs a loop and receives a message from some slave in each loop iteration. The scheme can easily be extended to let the master receive a message from either a slave or the server. With MPI communication between master and server, the only gain from running both in the same process would be the immediate availability, to the master, of the intercommunicator returned by MPI_Comm_accept. This gain is outweighed, however, by the opportunity to start the server in advance, which the second scheme provides (we will explain later why this is useful). Therefore, master and server run in different processes.

We use a separate executable for the server, but the same executable for master and slaves. Thus, a job can dynamically decide to take over the role of master or slave. When an MPI process is started, it first connects to the server, which sends back the current value of a process counter. If this value is zero, the new process branches into the master code, otherwise into the slave code. In either case, the value is stored as a unique process number.
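A corresponding startup sketch for the executable shared by master and slaves is given below; it pairs with the server sketch in Sect. V-A. The service name is the same assumed placeholder, run_master and run_slave stand in for the existing program parts, and the handling of the master-server intracommunicator (via MPI_Intercomm_merge) is omitted.

    // Startup sketch: connect to the server, receive a process number,
    // and branch into master or slave code (names are placeholders).
    #include <mpi.h>

    void run_master()      { /* master main loop, cf. Sect. III */ }
    void run_slave(int id) { /* basin computations of slave 'id' */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        char port[MPI_MAX_PORT_NAME];
        MPI_Lookup_name("watergap_server", MPI_INFO_NULL, port);   // ask the name server

        MPI_Comm server;
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);

        int process_number;
        MPI_Recv(&process_number, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
        MPI_Comm_disconnect(&server);

        if (process_number == 0)
            run_master();              // the first process becomes the master
        else
            run_slave(process_number); // all later processes become slaves

        MPI_Finalize();
    }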
C. Communicators

In the non-malleable WaterGAP version, all communication was accomplished through one intracommunicator: MPI_COMM_WORLD. In particular, the master used MPI_Iprobe(...MPI_ANY_SOURCE...) in its main loop to wait for a message from any slave. MPI does not support a similar wait for a message from any communicator. To avoid major changes in the existing code, it would therefore be desirable to have a communicator for the group of all processes, as a replacement for MPI_COMM_WORLD. Although MPI provides constructor functions for communicators, these must be invoked in a collective and blocking call, i.e., all processes must carry out the functions at the same time, and the already existing slaves would have to interrupt their work for that. As a possible solution, the constructor could be invoked by a separate thread of each slave. In the hybrid MPI/OpenMP program, however, this thread is unrelated to the existing OpenMP threads, which would lead to an involved program structure with, e.g., nested parallelism. Therefore, we decided against a global communicator, despite the drawback that we had to go through the whole program and adapt all MPI functions to the new communicator structure. This structure uses as many communicators as there are slaves, each of them comprising the master and one slave, plus another communicator for master and server. All communicators are intracommunicators. We now describe how they are constructed.

As stated before, any new process first connects to the server and is assigned a process number. If the new process is the master, the returned intercommunicator is transformed into an intracommunicator and the former is released. If the new process is a slave, the server informs the master about its existence by sending a message. In the main loop of the master, we have replaced the original MPI_Iprobe(...MPI_ANY_SOURCE...) by a loop that cycles through all communicators, thereby checking whether a message from either the server or a slave is available. When the master receives such a message from the server, it invokes MPI_Comm_accept to establish a separate connection with the new slave and to build a communicator that connects master and slave only.
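The resulting polling step of the master can be sketched as follows. The sketch shows one sweep over the communicators; the message contents, the handling of a new slave, and the bookkeeping of open tasks are simplified placeholders for the actual code.

    // One polling sweep of the master over its set of communicators
    // (simplified; message contents and bookkeeping are placeholders).
    #include <mpi.h>
    #include <vector>
    #include <cstddef>

    void poll_once(MPI_Comm server_comm, std::vector<MPI_Comm>& slave_comms) {
        int flag = 0;
        MPI_Status st;

        // 1. Has the server announced a newly connected slave?
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, server_comm, &flag, &st);
        if (flag) {
            int announcement;
            MPI_Recv(&announcement, 1, MPI_INT, st.MPI_SOURCE, st.MPI_TAG,
                     server_comm, &st);
            // Here the master would call MPI_Comm_accept and append the
            // resulting communicator (after merging) to slave_comms.
        }

        // 2. Replacement for MPI_Iprobe(MPI_ANY_SOURCE) on MPI_COMM_WORLD:
        //    cycle explicitly over the per-slave communicators.
        for (std::size_t i = 0; i < slave_comms.size(); ++i) {
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, slave_comms[i], &flag, &st);
            if (flag) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, st.MPI_SOURCE, st.MPI_TAG,
                         slave_comms[i], &st);
                // ... hand out the next basin or a stop marker, as in Sect. III
            }
        }
    }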
D. MPICH2 in the Batch System

In addition to incorporating MPI-2 functions into the program, we had to deal with process startup through the batch system. The MPI-2 standard leaves process startup to the implementation. The MPICH2 implementation [4] of MPI-2 includes a process management environment. Before an application is started with mpiexec, an MPI daemon, called mpd, must be invoked on each participating machine. Moreover, the set of daemons must form a ring. Daemons are started with the user commands mpdboot and mpd, where mpdboot starts a ring of daemons on a set of machines, and mpd starts a single daemon. As a parameter to mpd, one can specify the hostname/port of an existing mpd ring, which connects the new daemon to this ring. The port information can be obtained by invoking the user command mpdtrace -l on a machine that is already included in the ring.

As a first approach, we considered starting a cluster-wide mpd ring in advance, to avoid the need to start a new daemon whenever the batch system assigns a job. With this approach, however, many daemons run idle (especially on a large cluster), and therefore we decided to start an mpd only after a node has been assigned to WaterGAP by the batch system. These daemons are started with mpd, and require the hostname/port information to connect to the existing ring. This information must be provided in the job script, but it is not available before an initial mpd ring has been set up. For ease of implementation, we therefore decided to start the server (including its mpd) in advance, using a cluster node with permission for interactive access. The malleable MPI program is not bound to a particular job size, i.e., in the sequence of jobs that form a WaterGAP run, any job script may request any number of nodes. If a script requests several nodes, the mpd's are started one at a time, and afterwards all processes are started with a single call to mpiexec.

E. Evaluation of MPI-2

In summary, our experiments have shown that it is possible to build malleable applications with MPI-2 and the MPICH2 implementation, and to run these applications through a batch system. Performance is difficult to measure, as it depends on the cluster load. While our context only requires increasing the number of processors, we observed in experiments that the program can, to some extent, also cope with slaves being killed during program execution. In this case, the master proceeds with its main loop, communicating with existing and new processes, even though MPI_Iprobe still looks for messages from the dead slave.

To achieve malleability, we had to apply major changes to the original MPI communication structure. Instead of using the single communicator MPI_COMM_WORLD, we had to introduce a set of communicators (one per process), and therefore could not use the convenient MPI_Recv(... MPI_ANY_SOURCE ...) calls anymore. The alternative of constructing one communicator for all processes proved difficult, as communicator constructor functions are collective and blocking. The use of an additional thread would have complicated the hybrid MPI/OpenMP structure. In summary, we missed non-blocking communicator constructor and destructor functions, as well as a non-blocking variant of MPI_Comm_accept. Another drawback of using a set of communicators instead of MPI_COMM_WORLD is the lacking support for collective communication. In the WaterGAP program, we had to replace an initial broadcast of a parameter from the master to the slaves by pairwise communications.

Altogether, malleable programming requires a higher programming effort. Among the reasons for that (besides the ones already mentioned) is the need to manage the ring of mpd's. Moreover, debugging errors in the communicator constructor and destructor calls (e.g. a missing MPI_Comm_disconnect) is time-consuming.
VI. RELATED WORK

Hybrid MPI/OpenMP programming has already been studied for several applications. Smith and Bull [12] identify situations in which hybrid programming is superior to a pure MPI approach, among them load balancing problems with MPI, and ease of implementation for a parallelization in multiple dimensions. Rabenseifner [11] classifies hybrid programs; WaterGAP belongs to the hybrid masteronly category. Spiegel and an Mey [13] dynamically vary the number of threads in different processes to improve load balancing on a shared-memory architecture. Numerical programs have been adapted to a heterogeneous cluster by Aliaga et al. [2].

The need for malleability in the context of grid computing was pointed out by, e.g., Mayes et al. [8], who designed a performance control system for applications that consist of malleable components. They implemented malleability by interrupting an MPI program, saving its state, and later restarting the program with a different number of processes. Hungershöfer [5] has shown that the availability of malleable jobs improves the performance of supercomputer schedulers. He referred to multithreaded programs.

VII. CONCLUSIONS

In this paper, we have described our experiences in parallelizing a simulation program for global water prognosis. The parallelization has been accomplished in three steps. First, in the base version of the program, we implemented a simple master/slave scheme in MPI. This program did not scale beyond 8 slaves, because of a bottleneck at the largest task. Therefore, in the second step, we mixed MPI and OpenMP to exploit two levels of parallelism. The hybrid scheme was most efficient when used with a varying number of threads depending on task size, yielding a maximum speedup of 22 with 32 processors. In the third step, we made the program malleable, so as to better use the batch system and avoid wasting hardware resources. The paper discussed our deployment of the MPI-2 dynamic process management functions for malleability, and pointed out difficulties such as the lack of non-blocking communicator constructor functions.

The performance gain from malleability is difficult to quantify, as it depends on the cluster load. Its investigation is therefore left for future research. In this context, it would also be interesting to dynamically adapt the master's task assignment to differences in the computational power of the available and requested processors. The WaterGAP program can be further improved, in particular with respect to I/O, and maybe intra-basin parallelism. Another avenue for future research is evaluating MPI-2's support for malleability with other applications.

ACKNOWLEDGEMENTS

We thank the Center for Environmental Systems Research, especially Kerstin Schulze, Frank Kaspar and Lucas Menzel, for providing the WaterGAP code and answering our questions about WaterGAP. We are grateful to our colleague Björn Knafla for proofreading the paper. The University Computing Centers in Kassel and Frankfurt provided the computing facilities used for our experiments.

REFERENCES

[1] J. Alcamo et al. Development and testing of the WaterGAP 2 global model of water use and availability. Hydrological Science, 48(3):317–337, 2003.
[2] J. I. Aliaga et al. Parallelization of the GNU scientific library on heterogeneous systems. In Int. Symp. on Parallel and Distributed Computing, HeteroPar Workshop, pages 338–345, 2004.
[3] W. Gropp et al. MPI: The Complete Reference. MIT Press, 1998.
[4] W. Gropp et al. MPICH2 User's Guide, Version 1.0.3, November 2005. Available at http://www-unix.mcs.anl.gov/mpi/mpich2.
[5] J. Hungershöfer. On the combined scheduling of malleable and rigid jobs. In IEEE Symp. on Computer Architecture and High Performance Computing, pages 206–213, 2004.
[6] F. Kaspar. Entwicklung und Unsicherheitsanalyse eines globalen hydrologischen Modells. PhD thesis, Universität Kassel, 2003.
[7] C. Leopold. Parallel and Distributed Computing: A Survey of Models, Paradigms, and Approaches. John Wiley & Sons, 2001.
[8] K. R. Mayes et al. Towards performance control on the grid. Philosophical Transactions of the Royal Society, 363(1833):1793–1805, 2005.
[9] MPI-2: Extensions to the message-passing interface, 1997. Available at http://www-unix.mcs.anl.gov/mpi.
[10] OpenMP Application Programming Interface, Version 2.5, 2005. Available at http://www.openmp.org.
[11] R. Rabenseifner. Hybrid parallel programming on HPC platforms. In European Workshop on OpenMP, pages 185–194, 2003.
[12] L. Smith and M. Bull. Development of mixed mode MPI/OpenMP applications. Scientific Programming, 9(2–3):83–98, 2001.
[13] A. Spiegel and D. an Mey. Hybrid parallelization with dynamic thread balancing on a ccNUMA system. In European Workshop on OpenMP, pages 77–82, 2004.
[14] M. Süß and C. Leopold. Observations on the Publicity and Usage of Parallel Programming Systems and Languages: A Survey Approach, 2006. In preparation.
[15] G. Utrera, J. Corbalán, and J. Labarta. Implementing malleability on MPI jobs. In Proc. Parallel Architectures and Compilation Techniques, pages 215–224, 2004.
[16] WaterGAP Home Page. http://www.usf.uni-kassel.de/usf/forschung/projekte/watergap.en.htm.