Parallel Juxtaposition for Bulk Synchronous Parallel ML

Frédéric Loulergue
Laboratory of Algorithms, Complexity and Logic
61, avenue du général de Gaulle – 94010 Créteil cedex – France
[email protected]

This work is supported by the ACI Grid program of the French Ministry of Research, under the project Caraml (www.caraml.org).

Abstract. The BSMLlib library is a library for Bulk Synchronous Parallel (BSP) programming with the functional language Objective Caml. It is based on an extension of the λ-calculus by parallel operations on a parallel data structure named parallel vector. An earlier attempt to add a parallel composition to this approach led to a non-confluent calculus and to a restricted form of parallel composition. This paper presents a new, simpler and more general semantics for parallel composition.

1 Introduction

Declarative parallel languages are one possible way to ease the programming of massively parallel architectures. These languages do not enforce an order of evaluation in the syntax itself and thus appear more amenable to automatic parallelization. Functional languages are often considered. Nevertheless, even if some problems encountered in the parallelization of sequential imperative languages are avoided, some remain (for example, two different but denotationally equivalent programs may lead to very different parallel programs) and some are added, for example the fact that in those languages data structures are always dynamic. This often makes the amount and/or the grain of parallelism too low, or difficult to control in the case of speculation.

An opposite direction of research is to give the programmer full control over parallelism: message-passing facilities are added to functional languages. In this case, however, the obtained parallel languages are either non-deterministic [10] or non-functional (i.e. referential transparency is lost) [3]. The design of parallel programming languages is therefore a tradeoff between: the possibility of expressing the parallel features necessary for predictable efficiency, which make programs more difficult to write, to prove and to port; and the abstraction of such features, which makes parallel programming easier but must not hinder efficiency and performance prediction. An intermediate approach is to offer only a set of algorithmic skeletons [1, 11] that are implemented in parallel. Among researchers interested in declarative parallel programming, there is a growing interest in execution cost models taking into account global hardware parameters such as the number of processors and the bandwidth.

The Bulk Synchronous Parallel [12] execution and cost model offers such possibilities, and with similar motivations we have designed BSP extensions of the λ-calculus [8] and a library for the Objective Caml language, called BSMLlib, implementing those extensions. This framework is a good tradeoff for parallel programming for two main reasons. First, we defined a confluent calculus, so we can design purely functional parallel languages from it. Without side effects, programs are easier to prove and to reuse (the semantics is compositional), and we can choose any evaluation strategy for the language; an eager language allows good performance. Secondly, this calculus is based on BSP operations, so programs are easy to port, their costs can be predicted, and those costs are also portable because they are parametrized by the BSP parameters of the target architecture.
A BSP algorithm is said to be in direct mode [5] when its physical process structure is made explicit. Such algorithms offer predictable and scalable performance, and BSML expresses them with a small set of primitives taken from the confluent BSλ-calculus [8]: a constructor of parallel vectors, asynchronous parallel function application, synchronous global communication and a synchronous global conditional. These operations are flat: it is impossible to express parallel divide-and-conquer algorithms directly. Nevertheless, many algorithms are expressed as parallel divide-and-conquer algorithms [13] and it is difficult to transform them into flat algorithms. In a previous work we proposed an operation called parallel composition [7], but it was limited to the composition of two terms whose evaluations require the same number of BSP super-steps. In this paper we present a new version of this operation, together with a new operation, which allow the evaluations of two terms to be juxtaposed on subsets of the parallel machine (this is the reason why we renamed the operation parallel juxtaposition) even if they require different numbers of BSP super-steps.

We first present the flat BSMLlib library (section 2). Section 3 addresses the semantics of parallel juxtaposition. We give a small example in section 4, discuss related and future work in section 5 and conclude in section 6.

2 Functional Bulk Synchronous Parallelism

There is currently no implementation of a full Bulk Synchronous Parallel ML language but rather a partial implementation as a library for Objective Caml. This so-called BSMLlib library is based on the following elements. It gives access to the BSP parameters of the underlying architecture. In particular, it offers the function:

  bsp_p: unit -> int

such that the value of bsp_p() is p, the static number of processes of the parallel machine. This value does not change during execution, but this holds for "flat" programming only: it is no longer true once parallel juxtaposition is added to the language. There is also an abstract polymorphic type 'a par which represents the type of p-wide parallel vectors of objects of type 'a, one per process. The nesting of par types is prohibited; we have a type system [4] which enforces this restriction.

The BSML parallel constructs operate on parallel vectors. Parallel vectors are created by:

  mkpar: (int -> 'a) -> 'a par

so that (mkpar f) stores (f i) on process i, for i between 0 and (p − 1). We usually write f as fun pid -> e to show that the expression e may be different on each processor. This expression e is said to be local. The expression (mkpar f) is a parallel object and it is said to be global.

A BSP algorithm is expressed as a combination of asynchronous local computations and phases of global communication with global synchronization. Asynchronous phases are programmed with mkpar and with:

  apply: ('a -> 'b) par -> 'a par -> 'b par

apply (mkpar f) (mkpar e) stores (f i) (e i) on process i. The communication and synchronization phases are expressed by:

  put: (int -> 'a option) par -> (int -> 'a option) par

where 'a option is defined by: type 'a option = None | Some of 'a. Consider the expression put (mkpar (fun i -> fs_i)) (∗). To send a value v from process j to process i, the function fs_j at process j must be such that (fs_j i) evaluates to Some v. To send no value from process j to process i, (fs_j i) must evaluate to None. Expression (∗) evaluates to a parallel vector containing, on every process, a function fd_i of delivered messages. At process i, (fd_i j) evaluates to None if process j sent no message to process i, or to Some v if process j sent the value v to process i.
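As an illustration of these three primitives, the following sketch (ours, not part of BSMLlib; the name bcast is only chosen for the example) broadcasts the value held at process 0 to every process in one super-step:

  (* broadcast the value held at process 0, using one communication phase *)
  let bcast (v : 'a par) : 'a par =
    (* process 0 offers its value to every destination, the others send nothing *)
    let senders =
      apply (mkpar (fun i x _dst -> if i = 0 then Some x else None)) v in
    let received = put senders in                      (* one super-step *)
    (* every process extracts the message coming from process 0 *)
    apply (mkpar (fun _ fd -> match fd 0 with Some x -> x | None -> assert false))
          received

Each process builds its sending function with mkpar and apply, put performs the exchange together with the global synchronization, and a last apply reads, at every process, the message received from process 0.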
A global conditional operation:

  ifat: (bool par) * int * 'a * 'a -> 'a

would also be contained in the full language. It is such that ifat (v, i, v1, v2) evaluates to v1 or v2 depending on the value of v at process i. But Objective Caml is an eager language and this synchronous conditional operation cannot be defined as a function. That is why the core BSMLlib contains the function:

  at: bool par -> int -> bool

to be used only in the construction if (at vec pid) then ... else ..., where (vec: bool par) and (pid: int). The if ... at construction expresses a communication and synchronization phase. Without it, the global control cannot take into account data computed locally. A global conditional is necessary to express algorithms of the form:

  Repeat  Parallel Iteration  Until  Max of local errors < ε

Due to lack of space, we will omit it in the following sections.
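Before turning to the semantics, the following sketch (ours, not part of BSMLlib) shows how such a loop can be written with at. Both step and globally_converged are hypothetical user-supplied functions: step performs one parallel iteration, and globally_converged builds a bool par which is true everywhere exactly when the maximum of the local errors is below the desired bound (it therefore hides a put-based reduction):

  (* repeat a parallel iteration until the global convergence test holds *)
  let rec iterate step globally_converged (vec : 'a par) : 'a par =
    let vec' = step vec in
    if at (globally_converged vec') 0   (* synchronous global decision, taken at process 0 *)
    then vec'
    else iterate step globally_converged vec'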
3 Semantics of Parallel Juxtaposition

Parallel juxtaposition allows the network to be divided into two subsets. If we add a notion of parallel juxtaposition to the BSλ-calculus, confluence is lost [8]. Parallel juxtaposition has to preserve the BSP execution structure. Moreover, the number of processors is no longer a constant and depends on the context of the term.

"Flat" terms. The local terms (denoted by lowercase characters) represent "sequential" values held locally by each processor. They are classical λ-expressions, given by the following grammar:

  e ::= n            integer constant
      | b            boolean constant
      | ẋ            local variable
      | λẋ. e        lambda abstraction
      | e e          application
      | (e → e, e)   conditional
      | ⊕            operations

The set of processor names N is a set of p closed local terms in normal form. In the following we will consider that N = {0, . . . , p − 1}. The "flat" global terms (denoted by uppercase characters) represent parallel values. They are given by the grammar:

  Ef ::= x̄ | Ef Ef | Ef e | λx̄. Ef | λẋ. Ef | π e | Ef # Ef | Ef ? Ef

π e is the intentional parallel vector which contains the local term (e i) at processor i. In the BSλp-calculus, the number of processors is constant; in this case π e is syntactic sugar for ⟨ e 0, . . . , e (p − 1) ⟩, where the width of the vector is p. In the practical programming language we will no longer use the enumerated form of parallel vectors. Nevertheless, in the evaluation semantics we will use enumerated vectors as values; their widths can range from 1 to p. E1 # E2 is the point-wise parallel application: E1 must be a vector of functions and E2 a vector of values, and on each process the function is applied to the corresponding value. E1 ? E2 is the get operation used to communicate values between processors. E2 is a parallel vector of processor names: if at processor i the value held by E2 is j, then the result of the get operation (a parallel vector) will contain at processor i the value held by parallel vector E1 at processor j. The execution of this operation corresponds to a communication and synchronization phase of a BSP super-step. With the get operation, a processor can only obtain data from one other processor. There exists a more general communication operation put, which allows any BSP communication schema using only one super-step: with this operation a processor can send at most p messages to the p processors of the parallel machine, and each message can have a different content. The put operation is of course a bit more complicated, and its interaction with parallel juxtaposition is the same as that of the get operation. For the sake of simplicity we will only consider the get operation in this section.

Terms with parallel juxtaposition. E1 ∥_m E2 is the parallel juxtaposition, which means that the first m processors will evaluate E1 and the others will evaluate E2. From the point of view of E1, the network has m processors named 0, . . . , m − 1. From the point of view of E2, the network has p − m processors (where p is the number of processors of the current network) named 0, . . . , p − m − 1 (processor m is renamed 0, etc.). We want to preserve the flat BSP super-step structure and avoid subset synchronization [6]. The proposition of this paper relies on the sync operation. Consider parjux m E1 E2. The evaluation of E1 and E2 will be as described below for flat terms, but respectively on a network of m and p − m processors, except that the synchronization barriers of the put and if ... at operations will concern the whole network. If, for example, E2 has fewer super-steps and thus fewer synchronization barriers than E1, then we need additional synchronization barriers, useless from the point of view of E2, to preserve the BSP execution model. This is why we have added the sync operation. Each operation using a synchronization barrier labels its calls to the synchronization barrier. With this label, each processor knows what kind of operation the other processors have called to end the super-step. The sync operation repeatedly requests synchronization barriers. When each processor requests a synchronization barrier labeled by sync, it means all the processors have reached the sync and the evaluation of sync ends. The sync operation introduces an additional synchronization barrier, but it allows a more flexible use of parallel juxtaposition.
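The following sketch is one possible reading of this labelled-barrier mechanism; it is not code from the BSMLlib implementation. We assume a hypothetical low-level primitive request_barrier that joins a global synchronization barrier with the label of the requesting operation and returns the labels requested by all processors for that barrier; the label type is also ours:

  type label = Put_op | Ifat_op | Sync_op

  (* request_barrier : label -> label array is a hypothetical primitive *)
  let rec sync_barrier (request_barrier : label -> label array) =
    let labels = request_barrier Sync_op in
    if Array.for_all (fun l -> l = Sync_op) labels
    then ()   (* every processor has reached its own sync: the evaluation of sync ends *)
    else sync_barrier request_barrier
         (* some processors are still inside put / if...at super-steps:
            keep supplying barriers on their behalf *)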
The complete syntax for global terms E is:

  E ::= Ef | sync(Es)

where

  Es ::= x̄ | Es Es | Es e | λx̄. Es | λẋ. Es | Es # Es | Es ? Es | π e | Es ∥_e Es

It is important to notice that this syntax does not allow the nesting of parallel objects: a local term cannot contain a global term, thus a parallel vector cannot contain a parallel vector.

Semantics. The semantics starts from a term and returns a value. The local values are the constants and the local λ-abstractions. The global values are the enumerated vectors of local values and the λ-abstractions. The evaluation semantics for local terms, e ⇓loc v, is a classical call-by-value evaluation (omitted here for the sake of conciseness). The evaluation ⇓s of global terms with parallel juxtaposition uses an environment for global variables: a parallel vector can be bound to a variable in a subnetwork with p processors and this variable can be used in a subnetwork with p′ processors (p′ ≤ p). In this case we are only interested in p′ of the p values of the vector. So the global evaluation depends on the number of processors in the current subnetwork and on the name (in the whole network) of the first processor of this subnetwork. Judgments of this semantics have the form Es ⊢_f^{p′} Es ⇓s V, which means "in global environment Es, on a subnetwork with p′ processors whose first processor has name f in the whole network, the term Es evaluates to value V". The environment Es for the evaluation of global terms with parallel juxtaposition can be seen as an association list which associates global variables to pairs composed of an integer and a global value. In the following we will use the notation [v0; . . . ; vn] for lists and in particular for environments. We will also write Es(x̄) for the first value associated with the global variable x̄ in Es.

The first two rules are used to retrieve a value in the global environment. Note that there are two different rules: one for parallel vectors and one for other values, i.e. global λ-abstractions. In this environment, a variable is not only bound to its value but to a pair whose first component is the name of the first processor of the network on which the value was produced. This information is only useful for the first rule. This rule says that if a parallel vector was bound to a variable in a subnetwork with p1 processors and first processor f1, then this variable can be used in a subnetwork with p2 processors (p2 ≤ p1) and first processor f2 (f2 ≥ f1). In this case we are only interested in p2 components of the vector, starting at component number f2 − f1:

    Es(x̄) = (f1, ⟨ v_0, . . . , v_{p1−1} ⟩)
  ──────────────────────────────────────────────────────
    Es ⊢_{f2}^{p2} x̄ ⇓s ⟨ v_{f2−f1}, . . . , v_{f2−f1+p2−1} ⟩

    Es(x̄) = (f1, λx.Es)
  ──────────────────────────
    Es ⊢_{f2}^{p2} x̄ ⇓s λx.Es

The next rule says that to evaluate an intentional parallel vector term π e, one has to evaluate the expression (e i) on each processor i of the subnetwork. This evaluation corresponds to a pure computation phase of a BSP super-step:

    ∀i ∈ 0 . . . p′−1,  (e i) ⇓loc v_i
  ────────────────────────────────────────────
    Es ⊢_f^{p′} π e ⇓s ⟨ v_0, . . . , v_{p′−1} ⟩

The next three rules are for application and point-wise parallel application. Observe that in the rule for application, the pair (f, V2), rather than a single value, is bound to the variable x̄:

    Es ⊢_f^{p′} Es2 ⇓s V2      (x̄ ↦ (f, V2)) :: Es ⊢_f^{p′} Es1 ⇓s V1
  ────────────────────────────────────────────────────────────────────
    Es ⊢_f^{p′} (λx̄.Es1) Es2 ⇓s V1

    e ⇓loc v      Es ⊢_f^{p′} Es[x ← v] ⇓s V
  ──────────────────────────────────────────
    Es ⊢_f^{p′} (λẋ.Es) e ⇓s V

    Es ⊢_f^{p′} Es1 ⇓s ⟨ λx.e_0, . . . , λx.e_{p′−1} ⟩
    Es ⊢_f^{p′} Es2 ⇓s ⟨ v_0, . . . , v_{p′−1} ⟩
    ∀i ∈ 0 . . . p′−1,  (λx.e_i) v_i ⇓loc w_i
  ─────────────────────────────────────────────────────
    Es ⊢_f^{p′} Es1 # Es2 ⇓s ⟨ w_0, . . . , w_{p′−1} ⟩

The next rule is for the operation that needs communications:

    Es ⊢_f^{p′} Es1 ⇓s ⟨ v_0, . . . , v_{p′−1} ⟩      Es ⊢_f^{p′} Es2 ⇓s ⟨ n_0, . . . , n_{p′−1} ⟩
  ──────────────────────────────────────────────────────────────────────────────────────────────
    Es ⊢_f^{p′} Es1 ? Es2 ⇓s ⟨ v_{n_0}, . . . , v_{n_{p′−1}} ⟩

The last important rule is the one for parallel juxtaposition. The left side of a parallel juxtaposition is evaluated on a subnetwork with m processors, the right side is evaluated on the remaining subnetwork containing p′ − m processors:

    e ⇓loc m      0 < m < p′
    Es ⊢_f^{m} Es1 ⇓s ⟨ v_0, . . . , v_{m−1} ⟩
    Es ⊢_{f+m}^{p′−m} Es2 ⇓s ⟨ v_m, . . . , v_{p′−1} ⟩
  ─────────────────────────────────────────────────────── (1)
    Es ⊢_f^{p′} Es1 ∥_e Es2 ⇓s ⟨ v_0, . . . , v_{p′−1} ⟩

The remaining rule is a classical one: Es ⊢_f^{p′} λx.Es ⇓s λx.Es, where x is either ẋ or x̄.

We now need to define the evaluation of global terms E (with parallel juxtaposition enclosed in a sync operation). The environment E for the evaluation of global terms E can be seen as an association list which associates global variables to global values. This evaluation has judgments of the form E ⊢ E ⇓ V, which means "in global environment E, on the whole network, the term E evaluates to value V". The first rule is the rule used to evaluate a parallel composition in the scope of a sync:

    Es ⊢_0^{p} E ⇓s ⟨ v_0, . . . , v_{p−1} ⟩
  ──────────────────────────────────────────
    E ⊢ sync(E) ⇓ ⟨ v_0, . . . , v_{p−1} ⟩

where Es(x̄) = (0, V) if and only if E(x̄) = V. The remaining rules are similar to the previous ones, except that the number of processors and the name of the first processor of the subnetwork are no longer useful. Due to lack of space we omit these rules.
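As an illustration (ours, not taken from the paper), consider a network of p = 4 processors and the term sync((π succ) ∥_2 (π succ)), where succ abbreviates the local term λẋ. ẋ + 1. By the rule for sync we must derive [] ⊢_0^{4} (π succ) ∥_2 (π succ) ⇓s V. Rule (1) applies with m = 2: on the left subnetwork we have [] ⊢_0^{2} π succ ⇓s ⟨ 1, 2 ⟩, since (succ 0) ⇓loc 1 and (succ 1) ⇓loc 2; on the right subnetwork we have [] ⊢_2^{2} π succ ⇓s ⟨ 1, 2 ⟩ for the same reason, because the processors of the right subnetwork are renamed 0 and 1. Rule (1) therefore yields ⟨ 1, 2, 1, 2 ⟩, and the sync rule gives sync((π succ) ∥_2 (π succ)) ⇓ ⟨ 1, 2, 1, 2 ⟩.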
4 Example

Objective Caml is an eager language. To express parallel juxtaposition as a function we have to "freeze" the evaluation of its parallel arguments. Thus parallel juxtaposition must have the following type:

  parjux: int -> (unit -> 'a par) -> (unit -> 'a par) -> 'a par

In the BSMLlib library the sync operation has the type 'a -> 'a, even if it is only useful for global terms. The following example is a divide-and-conquer version of the scan program, which is defined by scan ⊕ ⟨ v0, . . . , v_{p−1} ⟩ = ⟨ v0, . . . , v0 ⊕ v1 ⊕ . . . ⊕ v_{p−1} ⟩:

  let rec scan op vec =
    if bsp_p() = 1 then vec
    else
      let mid = bsp_p()/2 in
      (* recursively scan the two halves of the network *)
      let vec' = parjux mid (fun () -> scan op vec) (fun () -> scan op vec) in
      (* the last processor of the first half sends its value to every
         processor of the second half *)
      let msg vec =
        apply (mkpar (fun i v ->
                 if i = mid-1
                 then fun dst -> if dst >= mid then Some v else None
                 else fun dst -> None)) vec
      (* parfun2 applies a two-argument function point-wise to two parallel vectors *)
      and parop = parfun2 (fun x y -> match x with None -> y | Some v -> op v y) in
      (* combine the received value, if any, with the local one *)
      parop (apply (put (msg vec')) (mkpar (fun i -> mid-1))) vec'

The network is divided into two parts and the scan is recursively applied to those two parts. The value held by the last processor of the first part is broadcast to all the processors of the second part, then this value and the value held locally are combined by the operator op on each processor of the second part. To use this function at top level, it must be put inside a sync operation, for example:

  let this = mkpar (fun pid -> pid) in sync (scan (+) this)

With p = 4 processors, this expression evaluates to ⟨ 0, 1, 3, 6 ⟩.

5 Related Work

[14] presents another way to divide-and-conquer in the framework of an object-oriented language. There is no formal semantics and no implementation for now. Furthermore, the proposed operation is rather a parallel superposition, in which several BSP threads use the whole network, than a parallel juxtaposition. The same author advocates in [9] a new extension of the BSP model in order to ease the programming of divide-and-conquer BSP algorithms; it adds another level to the BSP model, with new parameters to describe the parallel machine. [15] is an algorithmic skeletons language based on the BSP model and offers divide-and-conquer skeletons. Nevertheless, its cost model is not really the BSP model but the D-BSP model [2], which allows subset synchronization. We follow [6] in rejecting such a possibility.

6 Conclusions and Future Work

Compared to a previous attempt [7], the parallel juxtaposition does not have the two main drawbacks of its predecessor: (a) the two sides of the spatial parallel composition may now have different numbers of synchronization barriers; this enhancement comes at the price of an additional synchronization barrier, but only one barrier even in the case of nested parallel compositions; (b) the cost model is compositional. The remaining drawback is that the semantics is strategy-dependent. The use of the spatial parallel composition changes the number of processors: it is an imperative feature. Thus there is no hope of obtaining, in this way, a confluent calculus with a parallel juxtaposition. The next released implementation of the BSMLlib library will include parallel juxtaposition. Its ease of use will be evaluated by implementing BSP algorithms described as divide-and-conquer algorithms in the literature.

References

1. M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1989.
2. P. de la Torre and C. P. Kruskal. Submachine locality in the bulk synchronous setting. In Euro-Par'96, number 1123-1124. Springer Verlag, 1996.
3. C. Foisy and E. Chailloux. Caml Flight: a portable SPMD extension of ML for distributed memory multiprocessors. In A. W. Böhm and J. T. Feo, editors, Workshop on High Performance Functional Computing, 1995.
4. F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Seventh International Conference on Parallel Computing Technologies (PaCT 2003), LNCS. Springer Verlag, 2003. To appear.
5. A. V. Gerbessiotis and L. G. Valiant. Direct Bulk-Synchronous Parallel Algorithms. Journal of Parallel and Distributed Computing, 22:251–267, 1994.
6. G. Hains. Subset synchronization in BSP computing. In H. R. Arabnia, editor, PDPTA'98, pages 242–246. CSREA Press, 1998.
7. F. Loulergue. Parallel Composition and Bulk Synchronous Parallel Functional Programming. In S. Gilmore, editor, Trends in Functional Programming, Volume 2, pages 77–88. Intellect Books, 2001.
8. F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253–277, 2000.
9. J. Martin and A. Tiskin. BSP Algorithms Design for Hierarchical Supercomputers. Submitted for publication, 2002.
10. P. Panangaden and J. Reppy. The essence of Concurrent ML. In F. Nielson, editor, ML with Concurrency, Monographs in Computer Science. Springer, 1996.
11. S. Pelagatti. Structured Development of Parallel Programs. Taylor & Francis, 1998.
12. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249–274, 1997.
13. A. Tiskin. The Design and Analysis of Bulk-Synchronous Parallel Algorithms. PhD thesis, Oxford University Computing Laboratory, 1998.
14. A. Tiskin. A New Way to Divide and Conquer. Parallel Processing Letters, (4), 2001.
15. A. Zavanella. Skeletons and BSP: Performance Portability for Parallel Programming. PhD thesis, Università degli Studi di Pisa, 1999.