Charm++

Laxmikant V. Kalé
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
[email protected]

DEFINITION

Charm++ is a C++-based parallel programming system that implements a message-driven migratable-objects programming model, supported by an adaptive runtime system.

DISCUSSION

Charm++ [1] is a parallel programming system developed at the University of Illinois at Urbana-Champaign. It is based on a message-driven migratable-objects programming model, and consists of a C++-based parallel notation, an adaptive runtime system (RTS) that automates resource management, a collection of debugging and performance-analysis tools, and an associated family of higher-level languages. It has been used to program several highly scalable parallel applications.

Motivation and Design Philosophy

One of the main motivations behind Charm++ is the desire to create an optimal division of labor between the programmer and the system: i.e., to design a programming system so that programmers do what they can do best, while leaving to the "system" what it can automate best. It was observed that deciding what to do in parallel is relatively easy for the application developer to specify; conversely, it has proven very difficult for a compiler (for example) to automatically parallelize a given sequential program. On the other hand, automating resource management (deciding which subcomputation to carry out on what processor, and which data to store on a particular processor) is something the system may be able to do better than a human programmer, especially as the complexity of the resource-management task increases. Another motivation is to emphasize the importance of data locality in the language, so that the programmer is made aware of the cost of non-local data references.

Programming Model in Abstract

A Charm++ computation consists of collections of objects that interact via asynchronous method invocations. Each object is called a chare. The chares are assigned to processors by an adaptive runtime system, with an optional override by the programmer.

A chare is a special kind of C++ object. Its behavior is specified by a C++ class that is "special" only in the sense that it must have at least one method designated as an "entry" method. Designating a method as an entry method signifies that it can be invoked from a remote processor. The signatures of the entry methods (i.e., the type and structure of their parameters) are specified in a separate interface file, to allow the system to generate code for packing and unpacking the parameters into messages. (This is also called "serializing" or "marshalling" the parameters.) Other than the existence of the interface files, a Charm++ program is written in a manner very similar to standard C++ programs, and thus feels very familiar to C++ programmers.

The chare objects interact with each other through asynchronous method invocations. Such a method invocation does not return any value to the caller, and the caller continues with its own execution. Of course, the called chare may choose to send a value back by invoking another method in the caller object. Each chare has a globally valid ID (its proxy), which can be passed around via method invocations. Note that the programmer refers to the target chare only by its global ID, and not by the processor on which it resides. Thus, in the baseline Charm++ model, the processor is not a part of the ontology of the programmer.

Chares can also create other chares. Such creations are also asynchronous, in that the caller does not wait until the new object is created. Programmers typically do not specify the processor on which the new chare is to be created; the system makes this decision at runtime. The number of chares may vary over time, and is typically much larger than the number of processors.
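To make this concrete, here is a minimal sketch of a chare and an asynchronous invocation through its proxy. It is not part of the original entry: the module, class, and method names (hello, Main, Greeter, greet) are invented for illustration, and the interface-file portion is shown in comments.

    // hello.ci (hypothetical interface file):
    //   mainmodule hello {
    //     mainchare Main { entry Main(CkArgMsg *m); };
    //     chare Greeter {
    //       entry Greeter(void);
    //       entry void greet(int from);
    //     };
    //   };

    // hello.C: the corresponding C++ code
    #include "hello.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        // Create a chare; the RTS decides which processor it lives on.
        CProxy_Greeter g = CProxy_Greeter::ckNew();
        g.greet(17);   // asynchronous: enqueues a message and returns immediately
        // Main continues here without waiting for greet() to execute.
      }
    };

    class Greeter : public CBase_Greeter {
     public:
      Greeter() {}
      void greet(int from) {   // entry method; runs when the message is scheduled
        ckout << "greeted with tag " << from << endl;
        CkExit();              // shut down the parallel run
      }
    };

    #include "hello.def.h"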
Message-Driven Scheduler

Figure 1: Message-driven scheduler (each processor runs its own scheduler over its own queue of pending messages)

Since there are many chare objects per processor, with potentially multiple asynchronous method invocations pending for each, their execution requires a scheduler. The Charm++ runtime system employs a simple user-level scheduler for this purpose, as shown in Figure 1. The scheduler is user-level in the sense that the operating system is not aware of it. Normal Charm++ methods are non-preemptive: once a method begins execution, it returns control to the scheduler only after it has completed execution. The scheduler works with a queue of pending entry methods; there is one scheduler on each processor. Note that this queue may include asynchronous method invocations for chares located on this processor, as well as "seeds" for new chares, which can be thought of as invocations of the constructor entry method. The scheduler repeatedly selects a message (a method invocation) along with the proxy of the chare to which it is targeted, identifies the target object (creating it if necessary), unpacks the parameters from the message if necessary, and then invokes the specified method with those parameters. Only when the method returns does the scheduler select the next message and repeat the process.

Chare-Arrays and Iterative Computations

The model described so far, with its support for dynamic creation of work, is well suited for expressing divide-and-conquer as well as divide-and-divide computations; the latter occur in state-space search. Charm++ and its C-based precursors, Charm and the Chare Kernel [2], were used in the late 1980s for implementing parallel Prolog [3] as well as several combinatorial search [4] applications.

Science and engineering computations typically involve domain decomposition. The simple chares above can be used to create networks of chares. For example, one can organize chares in a two-dimensional mesh-like network and, through some additional message passing, ensure that each chare knows the IDs of its four neighboring chares. However, this method of creating a network of chares is quite cumbersome. Instead, Charm++ supports indexed collections of chares, called chare-arrays. A Charm++ computation may include multiple chare-arrays. Each chare-array is a collection of chares of the same type, and each chare is identified by an index that is unique within its collection. Thus, an individual chare belonging to a chare-array is completely identified by the ID of the collection and its own index within the collection. Common index structures include dense as well as sparse multidimensional arrays, but arbitrary indices such as strings or bit vectors are also possible. Elements in a chare-array may be created all at once, or inserted one at a time. A method invocation can be broadcast to an entire chare-array by omitting the index in the call. Reductions over arrays are also supported: each chare in a chare-array contributes a value, and all submitted values are combined via a commutative-associative operation. Unlike in other programming models, where a reduction is a collective operation that blocks all callers, reductions in Charm++ are non-blocking, i.e., asynchronous. The contribute call simply deposits the value created by the calling chare into the system and continues on. At some later point, after all the values have been combined, the system delivers the result to a user-specified callback. The callback could, for example, be a broadcast to an entry method of the same chare-array.

Charm++ does not allow generic global variables, but it does allow "specifically shared variables". The simplest of these are read-only variables, which are initialized in the main chare's constructor and are treated as constants for the remainder of the program. The runtime system (RTS) makes sure that a copy of each read-only variable is available on all physical processors.
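The following sketch illustrates a chare-array with a broadcast and an asynchronous reduction. It is not from the original entry: the Cell class, its methods, and the computeLocalUpdate helper are hypothetical, and the interface-file declarations are shown in comments.

    // grid.ci (hypothetical):
    //   array [2D] Cell {
    //     entry Cell(void);
    //     entry void relax(void);
    //     entry void done(CkReductionMsg *m);   // reduction target
    //   };

    class Cell : public CBase_Cell {
     public:
      Cell() {}
      void relax() {
        double err = computeLocalUpdate();   // hypothetical sequential kernel
        // Deposit this chare's contribution; the RTS combines all
        // contributions with a max operation and then invokes the callback,
        // here a broadcast of done() to the whole chare-array.
        CkCallback cb(CkIndex_Cell::done(NULL), thisProxy);
        contribute(sizeof(double), &err, CkReduction::max_double, cb);
        // relax() returns immediately; nothing blocks here.
      }
      void done(CkReductionMsg *m) {
        double maxErr = *(double *)m->getData();
        delete m;
        if (maxErr > 1e-6) relax();   // start the next iteration (plain local call)
      }
     private:
      double computeLocalUpdate() { return 0.0; /* placeholder */ }
    };

    // Creation and broadcast, e.g., from the main chare:
    //   CProxy_Cell cells = CProxy_Cell::ckNew(16, 16);   // a 16x16 chare-array
    //   cells.relax();   // omitting the index broadcasts to all elements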
Benefits of Message-Driven Execution

The message-driven execution model confers several performance and productivity benefits.

Automatic and adaptive overlap of computation and communication

Since objects are scheduled based on the availability of messages, no single object can hold the processor while it waits for remote data. Instead, objects that have asynchronous method invocations (messages) waiting for them in the scheduler's queue are allowed to execute. This leads to a natural overlap of communication and computation, without extra work from the programmer. For example, a chare may send a message to a remote chare and wait for another message from it before continuing. The ensuing communication time, which would otherwise be an idle period, is naturally and automatically filled in (i.e., overlapped) by the scheduler with useful computation: the processing of another message from the scheduler's queue, destined for another chare.

Concurrent composition

The ability to compose in parallel two individually parallel modules is referred to as concurrent composition. Consider two modules P and Q that are both ready to execute and have no direct dependencies between them. With other programming models (e.g., MPI) that associate work directly with processors, one typically has two options: either divide the set of processors so that a subset executes one module (P) while the remaining processors execute the other (Q), or sequentialize the modules, executing P first, followed by Q, on all processors. Neither alternative is efficient. Allowing the two modules to interleave their execution on all processors is often beneficial, but is hard to express even with wild-card receives, and it breaks abstraction boundaries between the modules in any case. With message-driven execution, such interleaving happens naturally, allowing idle time in one module to be overlapped with computation in the other. Coupled with the ability of the adaptive runtime system to migrate communicating objects closer to each other, this adds up to strong support for concurrent composition, and thereby for increased modularity.

Prefetching data and code

Since the message-driven scheduler can peek ahead at its queue, it knows which objects are scheduled to execute next and which methods they will execute. This information can be used to asynchronously prefetch data for those objects while the system executes the current object. This idea is used by the Charm++ runtime system to increase efficiency in various contexts, including on accelerators such as the Cell processor, for out-of-core execution, and for prefetching data from DRAM to cache.
Capabilities of the Adaptive Runtime System Based on Migratability of Chares

Other capabilities of the runtime system arise from its ability to migrate chares across processors, and to place chares on processors of its choice when they are created.

Supporting Task Parallelism with Seed Balancers

When a program calls for the creation of a singleton chare, the RTS simply creates a seed for the new chare, which includes the constructor arguments and the class information needed to create it. Typically, these seeds are initially stored on the same processor where they are created, but they may be passed from processor to processor under the control of a runtime component called the seed balancer. Many different seed-balancer strategies are available to be linked in. For example, one strategy monitors the queue sizes on its own processor and on neighboring processors, and balances the queues of seeds as it sees fit. Another strategy requests work from a random processor when the local processor goes idle — a work-stealing scheme [5, 6]. Charm++ also includes strategies that balance priorities and workloads simultaneously, trying to ensure that high-priority work gets executed faster over the entire system.

Migration-based load balancers

Elements of chare-arrays can be migrated across processors, either explicitly by the programmer or by the runtime system. The Charm++ RTS leverages this capability to provide a suite of dynamic load-balancing strategies. One class of such strategies is based on the principle of persistence: in most science and engineering applications expressed in terms of their natural objects, computational loads and communication patterns tend to persist over time, even for dynamically evolving applications. Thus, the recent past is a reasonable predictor of the near future. Since the runtime system mediates communication and schedules computation, it can automatically instrument the execution so as to measure computational loads and communication patterns accurately. Load-balancing strategies can use these measurements or, alternatively, any other mechanism for predicting such patterns. Multiple load-balancing strategies are available to choose from. The choice may depend on the machine context and the application, although one can always use the default strategy provided. Programmers can also write their own strategy, either to specialize it to the specific needs of the application or in the hope of doing better than the provided strategies.
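Migration rests on each chare being able to serialize its state. The sketch below is not from the original entry (the Worker class and its members are invented); it shows the usual shape of a migratable chare-array element, assuming the standard PUP (pack/unpack) framework and AtSync-style measurement-based load balancing:

    class Worker : public CBase_Worker {
      int n;
      double *data;
     public:
      Worker() : n(1000), data(new double[n]) {
        usesAtSync = true;        // opt in to measurement-based load balancing
      }
      Worker(CkMigrateMessage *m) : data(NULL) {}   // constructor used on migration
      ~Worker() { delete[] data; }

      // Serialize/deserialize the chare's state; the RTS calls this to pack
      // the object on the old processor and unpack it on the new one.
      void pup(PUP::er &p) {
        CBase_Worker::pup(p);     // forward base-class state
        p | n;
        if (p.isUnpacking()) data = new double[n];
        PUParray(p, data, n);
      }

      void step() {
        // ... do one unit of computation ...
        AtSync();                 // a safe point: the RTS may now migrate this chare
      }
      void ResumeFromSync() {     // invoked by the RTS when load balancing is done
        thisProxy[thisIndex].step();   // continue with the next step
      }
    };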
Shrinking or Expanding the Sets of Processors

A Charm++ program can be asked, at runtime, to change the set of processors it is using. This requires no effort from the programmer: the RTS accomplishes it by migrating objects and adjusting its runtime data structures, such as the spanning trees used in its collective operations. Thus, a 1024x1024x1024 cube of data partitioned into a 16x16x16 array of chares, each holding a 64x64x64 subcube, can be shrunk from 256 cores to 247 cores (to pick an arbitrary number) without significantly losing efficiency: each core will then house at most 17 objects, instead of the 16 it housed before.

Fault tolerance

Charm++ provides multiple levels of support for fault tolerance, including alternative competing strategies. At a basic level, it supports automated application-level checkpointing, leveraging its ability to migrate objects. With this, it is possible to create a checkpoint of the program without writing extra user code. More interestingly, it is also possible to use a checkpoint created on P processors to restart the computation on a different number of processors.

On appropriate machines, and with job schedulers that permit it, Charm++ can also automatically detect and recover from faults; this requires that the job scheduler not kill the job when one of its nodes fails. At the time of this writing, these schemes are available on workstation clusters. The most basic strategy uses the checkpoint created on disk, as described above, to effect recovery. A second strategy avoids using disks for checkpointing, instead keeping copies of the chare objects in the memories of two processors. It is suitable for applications whose memory footprint at the point of checkpointing is relatively small compared with the available memory; fortunately, many applications, such as molecular dynamics and computational astronomy, fall into this category. When it applies, it is very fast, often accomplishing a checkpoint in less than a second and recovery in a few seconds.

Both of the above strategies send all the processors back to their checkpoints even when just one processor out of a million has failed. This wastes all the computation performed since the last checkpoint, and is ultimately not sustainable as the number of processors increases and the mean time between failures (MTBF) decreases sufficiently. A third, experimental strategy in Charm++ sends only the failed processor(s) back to their checkpoints by using a message-logging scheme, and leverages the overdecomposition and migratability of Charm++ objects to parallelize the restart process. That is, the objects from the failed processors are reincarnated on multiple other processors, where they re-execute in parallel from their checkpoints, using the logged messages. Charm++ also provides a fourth, proactive strategy to handle situations where a future fault can be predicted, based, say, on heat sensors or on a rising rate of corrected cache errors. The runtime simply migrates objects away from such a processor and readjusts its runtime data structures.
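For the disk-based scheme, the application typically just names a checkpoint directory and a callback at which to resume. The sketch below is illustrative only: it assumes the CkStartCheckpoint call and the +restart runtime option as described in the Charm++ manual, and the Main methods, step counter, CHECKPOINT_PERIOD, and mainProxy read-only are invented names.

    // Inside the main chare, at the end of each time step:
    void Main::endOfStep() {
      if (step % CHECKPOINT_PERIOD == 0) {
        // Ask the RTS to checkpoint all chares to disk. When the checkpoint
        // is complete, the RTS invokes the callback, resuming the computation.
        CkCallback cb(CkIndex_Main::startNextStep(), mainProxy);
        CkStartCheckpoint("ckpt", cb);
      } else {
        startNextStep();
      }
    }

    // After a failure, the job is relaunched from the saved state, possibly
    // on a different number of processors, with something like:
    //   ./charmrun +p247 ./app +restart ckpt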
Associated Tools

The Charm++ system has several tools associated with it. LiveViz allows one to inject messages into a running program and to display attributes and images from a running simulation. Projections supports performance analysis and visualization, including live visualization, parallel on-line analysis, and log-based post-mortem analysis. CharmDebug is a parallel debugger that understands Charm++ constructs and provides online access to runtime data structures; it also supports a sophisticated record-replay scheme and provisional message delivery for debugging nondeterministic bugs. The communication required by these tools is integrated into the runtime system, leveraging the message-driven scheduler, so no separate monitoring processes are necessary.

Code Example

Figures 2 and 3 show fragments from a simple Charm++ example program, to give a flavor of the programming model. The program is a simple Lennard-Jones molecular dynamics code: the computation is decomposed into a 1-dimensional array of LJ objects, each holding a subset of atoms. The interface file (Fig. 2) describes the main Charm-level entities, their types, and their signatures. The program has a read-only integer called numChares that holds the size of the LJ chare-array. In this particular program, the main chare is called Main and has only one method, namely its constructor. The LJ class is declared as constituting a 1-dimensional array of chares. It has two entry methods in addition to its constructor. Note the others[n] notation used to specify a parameter that is an array of size n, where n itself is another integer parameter; this allows the system to generate code to serialize the parameters into a message when necessary. CkReductionMsg is a system-defined type, used for entry methods that serve as targets of reductions.

    mainmodule ljdyn {
      readonly int numChares;
      ...
      mainchare Main {
        entry Main(CkArgMsg *m);
      };

      array [1D] LJ {
        entry LJ(void);
        entry void passOn(int home, int n, Particle others[n]);
        entry void startNextStep(CkReductionMsg *m);
      };
    };

Figure 2: A simple molecular dynamics program: interface file

Some important fragments from the C++ file that defines the program itself are shown in Figure 3; sequential code not important for understanding the program is omitted. The classes Main and LJ inherit from classes generated by a translator based on the interface file. The program execution consists of a number of time steps. In each time step, each chare sends its particles on a round trip visiting all the other chares, as depicted in the ring diagram accompanying Figure 3 (the chares LJ[0] through LJ[n-1] form a ring, with each packet of particles forwarded via LJ[k].passOn(...)). Whenever a packet of particles visits a chare via the passOn method, the chare calculates the forces on each of the visiting (others) particles due to each of its own particles. Thus, when the particles return home after the round trip, they have accumulated the forces due to all the other particles in the system. A sequential call (integrate) then adds local forces to the accumulated forces, and calculates new velocities and positions for each owned particle. At this point, to make sure the time step is truly finished for all chares, the program uses an "asynchronous barrier" via the contribute call. To emphasize the asynchronous nature of this call, the example makes the call before integrate: the contribute call simply deposits the contribution into the system and continues on to integrate. The contribute call specifies that after all the array elements of LJ have contributed, a callback will be made; in this case, the callback is a broadcast to all the members of the chare-array at the entry method startNextStep. The inherited variable thisProxy is a proxy to the entire chare-array. Similarly, thisIndex is the index of the calling chare within the chare-array to which it belongs.

    /*readonly*/ int numChares;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        // Process command-line arguments
        ...
        numChares = atoi(m->argv[1]);
        ...
        CProxy_LJ arr = CProxy_LJ::ckNew(numChares);
      }
    };

    class LJ : public CBase_LJ {
      int timeStep, numParticles, next;
      Particle *myParticles;
      ...
     public:
      LJ() {
        ...
        myParticles = new Particle[numParticles];
        ...   // initialize particle data
        next = (thisIndex + 1) % numChares;
        timeStep = 0;
        startNextStep((CkReductionMsg *)NULL);
      }

      void startNextStep(CkReductionMsg* m) {
        if (++timeStep < MAXSTEPS)
          thisProxy[next].passOn(thisIndex, numParticles, myParticles);
        else if (thisIndex == 0) {
          ckout << "Done" << endl;
          CkExit();
        }
      }

      void passOn(int homeIndex, int n, Particle* others) {
        if (thisIndex != homeIndex) {
          interact(n, others);   // add forces on "others" due to my particles
          thisProxy[next].passOn(homeIndex, n, others);
        } else {                 // particles are home, with accumulated forces
          CkCallback cb(CkIndex_LJ::startNextStep(NULL), thisProxy);
          contribute(cb);        // asynchronous barrier
          integrate(n, others);  // add forces and update positions
        }
      }

      void interact(int n, Particle* others) {
        /* add forces on "others" due to my particles */
      }

      void integrate(int n, Particle* others) {
        /* ... apply forces, update positions ... */
      }
    };

Figure 3: A simple molecular dynamics program: fragments from the C++ file
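For readers who want to try such an example, the typical build-and-run sequence uses the charmc compiler wrapper and the charmrun launcher that ship with Charm++ (details vary by installation; the command lines below are a sketch):

    charmc ljdyn.ci             # generates ljdyn.decl.h and ljdyn.def.h from the interface file
    charmc -c ljdyn.C           # compile the C++ code
    charmc -o ljdyn ljdyn.o     # link against the Charm++ runtime
    ./charmrun +p4 ./ljdyn 16   # run on 4 processors with a 16-element LJ chare-array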
Language Extensions and Features

The baseline programming model described so far is adequate to express most parallel interaction structures. However, for programming convenience, increased productivity, and/or efficiency, Charm++ supports a few additional features.

For example, individual entry methods can be marked as "threaded". This results in the creation of a user-level, lightweight thread whenever the entry method is invoked. Unlike normal entry methods, which always complete their execution and return control to the scheduler before other entry methods are executed, threaded entry methods can block their execution; of course, they do so without blocking the processor they are running on. In particular, they can wait for a "future", wait until another entry method unblocks them, or make blocking method invocations. An entry method can be tagged as blocking (actually called a "sync" method), and such a method is capable of returning values, unlike normal entry methods, which are asynchronous and therefore have a return type of void.

Often, threaded entry methods are used to describe the life cycle of a chare. Another notation within Charm++, called "structured dagger", accomplishes the same effect without the need for a separate stack (and its associated memory) for each user-level thread, and without the admittedly small overhead of thread context switches. However, it requires that all dependencies on remote data be expressed in this notation within the text of an entry method. In contrast, a threaded entry method may block waiting for remote data even within functions called from it.

Charm++ as described so far does not bring the notion of a "processor" into the programming model. However, some low-level constructs that refer to processors are also provided, for programmers and especially for library writers. For example, when a chare is created, one can optionally specify the processor on which to create it. Similarly, when a chare-array is created, one can specify its initial mapping to processors. One can also create specialized chare-arrays, called groups, that have exactly one member on each processor; these are useful for implementing services such as load balancers.
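The threaded and sync attributes are declared in the interface file. The fragment below is a hypothetical sketch (Driver, DataStore, ValueMsg, and storeProxy are invented names) of a threaded driver making a blocking call to a sync method; note that a sync method returns its value packaged in a message:

    // Interface-file fragment (hypothetical):
    //   array [1D] Driver {
    //     entry [threaded] void run(void);         // executes in a lightweight user-level thread
    //   };
    //   chare DataStore {
    //     entry [sync] ValueMsg *fetch(int key);   // blocking call that returns a message
    //   };

    void Driver::run() {
      // This invocation blocks the user-level thread (but not the processor,
      // which keeps scheduling other chares) until the reply arrives.
      ValueMsg *reply = storeProxy.fetch(42);   // storeProxy: a CProxy_DataStore set up elsewhere
      int v = reply->value;
      delete reply;
      // ... continue the chare's life cycle using v ...
    }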
Languages in the Charm family

AMPI, or Adaptive MPI, is an implementation of the Message Passing Interface standard on top of the Charm++ runtime system. Each MPI process is implemented as a user-level thread embedded inside a Charm++ object, in effect a threaded entry method. These objects can be migrated across processors, as is usual for Charm++ objects, thus bringing the benefits of the Charm++ adaptive runtime system, such as dynamic load balancing and fault tolerance, to traditional MPI programs. Since there may be multiple MPI "processes" on each core, in keeping with the overdecomposition strategy of Charm++ applications, MPI programs need to be modified in a mechanical, systematic fashion to avoid conflicts among their global variables; Adaptive MPI provides tools for automating this process to some extent. As a result, a standard Adaptive MPI program is also a legal MPI program, whereas the converse is true only if the use of global variables has been handled via such modifications. In addition, Adaptive MPI provides primitives, such as asynchronous collectives, that are not part of the MPI-2 standard. An asynchronous reduction, for example, carries out the communication associated with a reduction in the background while the main program continues with its computation; a blocking call is then used to fetch the result of the reduction.
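To illustrate the kind of mechanical transformation involved, the sketch below (an invented example, in plain MPI-style C) moves a global variable into a per-rank structure, so that the many virtual MPI processes that AMPI multiplexes onto one core no longer share it:

    #include <mpi.h>
    #include <stdlib.h>

    /* Before: a true global such as
     *     int iterCount;
     * would be shared, and corrupted, by all the MPI "processes" that AMPI
     * runs as user-level threads within one address space. */

    /* After: per-process state lives in a structure owned by each rank. */
    typedef struct {
      int iterCount;
    } RankState;

    static void step(RankState *st) {
      st->iterCount++;          /* safe: each virtual process has its own copy */
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      RankState *st = (RankState *)calloc(1, sizeof(RankState));
      for (int i = 0; i < 10; i++)
        step(st);               /* pass the state explicitly instead of using a global */
      free(st);
      MPI_Finalize();
      return 0;
    }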
Two recent languages in the Charm++ family are Multiphase Shared Arrays (MSA) and Charisma. These are part of the Charm++ strategy of creating a toolbox consisting of incomplete languages that capture some interaction modes elegantly, and frameworks that capture the needs of specific domains or data structures, both backed by complete languages such as Charm++ and AMPI. The compositionality afforded by message-driven execution ensures that modules written using multiple paradigms can be efficiently composed in a larger application.

MSA is designed to support disciplined use of a shared address space. The computation consists of collections of threads and multiple user-defined data arrays, each partitioned into user-defined pages; both kinds of entities are implemented as migratable objects (i.e., chares) visible to the Charm++ runtime system. The threads can access the data in the arrays, but each array is in only one of a restrictive set of access modes at a time. Read-only, exclusive-write, and accumulate are examples of the access modes supported by MSA. At designated synchronization points, a program may change the access modes of one or more arrays.

Charisma, another language implemented on top of Charm++, is designed to support computations that exhibit a static data-flow pattern among a set of Charm++ objects. Such a pattern, where the flow of messages remains the same from iteration to iteration even though the content and length of the messages may change, is extremely common in science and engineering applications. For such applications, Charisma provides a convenient syntax that clearly captures the flow of values and control across multiple collections of objects. In addition, Charisma provides a clean separation of sequential and parallel code, which is convenient for collaborative application development involving parallel programmers and domain specialists.

Frameworks atop Charm++

In addition to its use as a language for implementing applications directly, Charm++ serves as a backend for higher-level frameworks and languages such as those described above. Its utility in this context arises from the interoperability and runtime features it provides, which one can leverage to put together a new domain-specific framework relatively quickly. An example of such a framework is ParFUM, which is aimed at unstructured-mesh applications. ParFUM allows developers of sequential codes based on such meshes to retarget them to parallel machines with relatively few changes. It automates several commonly needed functions, including the exchange of boundary nodes (or boundary layers, in general) with neighboring objects. Once a code is ported to ParFUM, it automatically benefits from other Charm++ features such as load balancing and fault tolerance.

Applications

Some of the highly scalable applications developed using Charm++ are in extensive use by scientists on national supercomputers. These include NAMD (for biomolecular simulations), OpenAtom (for electronic structure simulations), and ChaNGa (for astrophysical N-body simulations).

Origin and History

The Chare Kernel, a precursor of the Charm++ system, arose from work on parallel Prolog at the University of Illinois at Urbana-Champaign in the late 1980s. The implementation mechanism required a collection of computational entities, one for each active clause (i.e., its activation record) of the underlying logic program. Each such entity typically received multiple responses from each of its children in the proof tree, and needed to create new nodes in the proof tree, which in turn fired new "tasks" for their own active clauses. The implementation led to a message-driven scheduler and the dynamic creation of seeds of work. Following the terminology used by an earlier parallel functional-languages project, RediFlow, these entities were called chares. The work on the Chare Kernel essentially separated this implementation mechanism from its parallel Prolog context, into a C-based parallel programming paradigm of its own. Charm had similarities, especially its message-driven execution, with the earlier research of Agha and Yonezawa reworking Hewitt's Actor framework [7, 8]; however, its intellectual progenitors were parallel logic and functional languages. With the increasing popularity of C++, Charm++ was created as the C++ version of Charm, a natural fit for its object-based abstraction. As attention in parallel computing shifted to scientific computations, indexed collections of migratable chares were developed in Charm++ in the mid-1990s to simplify addressing chares. In the late 1990s, Adaptive MPI was developed in the context of applications under development at the Center for Simulation of Advanced Rockets at Illinois. Charm++ continues to be developed and maintained at the University of Illinois, and its applications are in regular use on many supercomputers around the world.

Availability and Usage

Charm++ runs on most parallel machines available at the time this article was written, including multicore desktops, clusters, and large-scale proprietary supercomputers, running Linux, Windows, and other operating systems. Charm++ and its associated software tools and libraries can be downloaded in source-code and binary forms from http://charm.cs.illinois.edu under a license that allows free use for non-commercial purposes.

RELATED ENTRIES

Actors
NAMD
ChaNGa
Parallel Combinatorial Search

BIBLIOGRAPHIC NOTES AND FURTHER READING

One of the earliest papers on the Chare Kernel, the precursor of Charm++, was published in 1989 [9], followed by more detailed descriptions of the model and its load balancers [10, 2]. This work arose out of earlier work on parallel Prolog [3]. The message-driven aspects of Charm++ make it an Actor-like model, as developed in earlier research by Hewitt, Agha, and Yonezawa [7, 8]. The C++-based version was described in an OOPSLA paper in 1993 [1], and was expanded upon in a book on parallel C++ [11]. Some of the early work on quantifying the benefits of the programming model is summarized in a later paper [12]. Some of the early applications using Charm++ were in symbolic computing, and specifically in parallel combinatorial search [4]. A scalable framework for supporting migratable arrays of chares is described in [13], which is also useful for understanding the programming model.
An early paper [14] describes support for migrating chares for dynamic load balancing. More recent papers provide an overview of Charm++ [15] and describe its applications [16].

BIBLIOGRAPHY

[1] L. V. Kalé and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In A. Paepcke, editor, Proceedings of OOPSLA '93, pages 91–108. ACM Press, September 1993.

[2] W. W. Shu and L. V. Kalé. Chare Kernel - a runtime support system for parallel computations. Journal of Parallel and Distributed Computing, 11:198–211, 1990.

[3] L. V. Kalé. Parallel execution of logic programs: the REDUCE-OR process model. In Proceedings of the Fourth International Conference on Logic Programming, pages 616–632, May 1987.

[4] L. V. Kale, B. Ramkumar, V. Saletore, and A. B. Sinha. Prioritization in parallel symbolic computing. In T. Ito and R. Halstead, editors, Lecture Notes in Computer Science, volume 748, pages 12–41. Springer-Verlag, 1993.

[5] Yow-Jian Lin and Vipin Kumar. And-parallel execution of logic programs on a shared-memory multiprocessor. Journal of Logic Programming, 10(1/2/3&4):155–178, 1991.

[6] M. Frigo, C. E. Leiserson, and K. H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), volume 33 of ACM SIGPLAN Notices, pages 212–223, Montreal, Quebec, Canada, June 1998.

[7] G. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.

[8] A. Yonezawa, J.-P. Briot, and E. Shibayama. Object-oriented concurrent programming in ABCL/1. ACM SIGPLAN Notices, Proceedings of OOPSLA '86, 21(11):258–268, November 1986.

[9] L. V. Kalé and W. Shu. The Chare Kernel base language: Preliminary performance results. In Proceedings of the 1989 International Conference on Parallel Processing, pages 118–121, St. Charles, IL, August 1989.

[10] L. V. Kale. The Chare Kernel parallel programming language and system. In Proceedings of the International Conference on Parallel Processing, volume II, pages 17–25, August 1990.

[11] L. V. Kale and Sanjeev Krishnan. Charm++: Parallel Programming with Message-Driven Objects. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming Using C++, pages 175–213. MIT Press, 1996.

[12] Attila Gursoy and L. V. Kalé. Performance and Modularity Benefits of Message-Driven Execution. Journal of Parallel and Distributed Computing, 64:461–480, 2004.

[13] Orion Sky Lawlor and L. V. Kalé. Supporting dynamic parallel object arrays. Concurrency and Computation: Practice and Experience, 15:371–393, 2003.

[14] Robert K. Brunner and Laxmikant V. Kalé. Handling application-induced load imbalance using parallel objects. In Parallel and Distributed Computing for Symbolic and Irregular Applications, pages 167–181. World Scientific Publishing, 2000.

[15] Laxmikant V. Kale and Gengbin Zheng. Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects. In M. Parashar, editor, Advanced Computational Infrastructures for Parallel and Distributed Applications, pages 265–282. Wiley-Interscience, 2009.

[16] Laxmikant V. Kale, Eric Bohm, Celso L. Mendes, Terry Wilmarth, and Gengbin Zheng. Programming Petascale Applications with Charm++ and AMPI. In D. Bader, editor, Petascale Computing: Algorithms and Applications, pages 421–441. Chapman & Hall / CRC Press, 2008.