Parallel Juxtaposition for Bulk Synchronous Parallel ML
Frédéric Loulergue*
Laboratory of Algorithms, Complexity and Logic
61, avenue du général de Gaulle – 94010 Créteil cedex – France
[email protected]
* This work is supported by the ACI Grid program from the French Ministry of Research, under the project Caraml (www.caraml.org).
Abstract. The BSMLlib library is a library for Bulk Synchronous Parallel (BSP) programming with the functional language Objective Caml.
It is based on an extension of the λ-calculus by parallel operations on
a parallel data structure named parallel vector. An attempt to add a
parallel composition to this approach led to a non-confluent calculus and
to a restricted form of parallel composition. This paper presents a new,
simpler and more general semantics for parallel composition.
1 Introduction
Declarative parallel languages are one possible way to ease the programming of massively parallel architectures. These languages do not enforce an order of evaluation in the syntax itself and thus appear more suitable for automatic parallelization. Functional languages are often considered. Nevertheless, even if some problems encountered in the parallelization of sequential imperative languages are avoided, some remain (for example, two different but denotationally equivalent programs may lead to very different parallel programs) and some are added, for example the fact that in these languages data structures are always dynamic. This often makes the amount and/or the grain of parallelism too low, or difficult to control in the case of speculation. An opposite direction of research is to give the programmer entire control over parallelism: message-passing facilities are added to functional languages. But in this case the obtained parallel languages are either non-deterministic [10] or non-functional (i.e. referential transparency is lost) [3].
The design of parallel programming languages is therefore a tradeoff between expressing the parallel features necessary for predictable efficiency (which makes programs more difficult to write, to prove and to port) and abstracting away such features to make parallel programming easier (which must not hinder efficiency and performance prediction).
An intermediate approach is to offer only a set of algorithmic skeletons [1, 11] that are implemented in parallel. Among researchers interested in declarative parallel programming, there is a growing interest in execution cost models taking into account global hardware parameters like the number of processors and the bandwidth. The Bulk Synchronous Parallel [12] execution and cost model offers such possibilities; with similar motivations we have designed BSP extensions of the λ-calculus [8] and a library for the Objective Caml language, called BSMLlib, implementing those extensions.
This framework is a good tradeoff for parallel programming for two main reasons. First, we defined a confluent calculus, so we can design purely functional parallel languages from it. Without side-effects, programs are easier to prove and to re-use (the semantics is compositional), and we can choose any evaluation strategy for the language; an eager language allows good performance. Second, this calculus is based on BSP operations, so programs are easy to port and their costs can be predicted; these costs are also portable because they are parametrized by the BSP parameters of the target architecture.
A BSP algorithm is said to be in direct mode [5] when its physical process structure is made explicit. Such algorithms offer predictable and scalable performance and BSML expresses them with a small set of primitives taken from the confluent BSλ-calculus [8]: a constructor of parallel vectors, asynchronous parallel function application, synchronous global communications and a synchronous global conditional. These operations are flat: it is impossible to express parallel divide-and-conquer algorithms directly. Nevertheless many algorithms are expressed as parallel divide-and-conquer algorithms [13] and it is difficult to transform them into flat algorithms. In a previous work, we proposed an operation called parallel composition [7], but it was limited to the composition of two terms whose evaluations require the same number of BSP super-steps.
In this paper we present a new version of this operation and a new operation which allow the evaluations of two terms to be juxtaposed (this is why we renamed this operation parallel juxtaposition) on subsets of the parallel machine, even if they require different numbers of BSP super-steps. We first present the flat BSMLlib library (Section 2). Section 3 addresses the semantics of parallel juxtaposition. We also give a small example in Section 4, discuss related and future work in Section 5 and conclude in Section 6.
2 Functional Bulk Synchronous Parallelism
There is currently no implementation of a full Bulk Synchronous Parallel ML
language but rather a partial implementation as a library for Objective Caml.
The so-called BSMLlib library is based on the following elements.
It gives access to the BSP parameters of the underlying architecture. In particular, it offers the function bsp_p: unit -> int such that the value of bsp_p() is p, the static number of processes of the parallel machine. This value does not change during execution, but this holds for “flat” programming only: it is no longer true when parallel juxtaposition is added to the language.
There is also an abstract polymorphic type ’a par which represents the type
of p-wide parallel vectors of objects of type ’a, one per process. The nesting of
par types is prohibited. We have a type system [4] which enforces this restriction.
The BSML parallel constructs operate on parallel vectors. These parallel vectors are created by mkpar: (int -> 'a) -> 'a par, so that (mkpar f) stores (f i) on process i, for i between 0 and (p − 1). We usually write f as fun pid -> e to show that the expression e may be different on each processor. This expression e is said to be local. The expression (mkpar f) is a parallel object and it is said to be global.
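For instance (a minimal sketch of ours, assuming only the primitives described above; the name this is also used in the example of Section 4, and the commented-out binding only illustrates the nesting restriction):

  (* The parallel vector < 0, 1, ..., p-1 >: processor i holds its own name. *)
  let this = mkpar (fun pid -> pid)

  (* Rejected by the type system of [4]: its type would be int par par,
     i.e. a nested parallel vector.
     let nested = mkpar (fun i -> mkpar (fun j -> i + j))
  *)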
A BSP algorithm is expressed as a combination of asynchronous local computations and phases of global communication with global synchronization. Asynchronous phases are programmed with mkpar and with:
apply: (’a -> ’b) par ->’a par -> ’b par
apply (mkpar f) (mkpar e) stores (f i) (e i) on process i.
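As a small illustration (our own sketch, not taken from the BSMLlib documentation), point-wise application can combine a vector of functions with a vector of values:

  (* Add the process identifier to each local value of a vector of integers. *)
  let add_pid (v : int par) : int par =
    apply (mkpar (fun pid x -> pid + x)) v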
The communication and synchronization phases are expressed by:
put: (int -> 'a option) par -> (int -> 'a option) par
where 'a option is defined by: type 'a option = None | Some of 'a.
Consider the expression put(mkpar(fun i -> fs_i)) (∗). To send a value v from process j to process i, the function fs_j at process j must be such that (fs_j i) evaluates to Some v. To send no value from process j to process i, (fs_j i) must evaluate to None. Expression (∗) evaluates to a parallel vector containing, on every process, a function fd_i of delivered messages. At process i, (fd_i j) evaluates to None if process j sent no message to process i, or to Some v if process j sent the value v to process i.
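A minimal sketch (ours, under the assumptions above) of a complete communication phase built with put: a right rotation of the values of a parallel vector, performed in one BSP super-step.

  (* Rotate the values one position to the right: processor i receives the value
     held by processor (i-1) mod p. *)
  let shift_right (v : 'a par) : 'a par =
    let p = bsp_p () in
    (* On processor j, send the local value x only to destination (j+1) mod p. *)
    let to_send =
      apply (mkpar (fun j x dst -> if dst = (j + 1) mod p then Some x else None)) v in
    (* received holds, on each processor, the function of delivered messages. *)
    let received = put to_send in
    (* On processor i, look up the message sent by processor (i-1+p) mod p. *)
    apply
      (apply (mkpar (fun _ fd src -> match fd src with Some x -> x | None -> assert false))
             received)
      (mkpar (fun i -> (i - 1 + p) mod p))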
A global conditional operation ifat: (bool par) * int * 'a * 'a -> 'a would also be contained in the full language. It is such that ifat (v,i,v1,v2) will evaluate to v1 or v2 depending on the value of v at process i. But Objective Caml is an eager language and this synchronous conditional operation cannot be defined as a function. That is why the core BSMLlib contains the function at: bool par -> int -> bool, to be used only in the construction if (at vec pid) then ... else ..., where (vec: bool par) and (pid: int). The if ... at construction expresses communication and synchronization phases. Without it, the global control cannot take into account data computed locally. The global conditional is necessary to express algorithms like:

Repeat Parallel Iteration Until (Max of local errors) < ε

Due to lack of space, we will omit it in the following sections.
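As an illustration of this pattern (a sketch of our own; step and errors are hypothetical functions, not part of BSMLlib, and a real program would first reduce the local errors to the chosen processor):

  (* Iterate a parallel computation until the error held at processor 0 is
     below epsilon; every processor takes the same branch thanks to at. *)
  let rec iterate epsilon step errors (x : 'a par) : 'a par =
    let x' = step x in
    let small = apply (mkpar (fun _ e -> e < epsilon)) (errors x') in
    if at small 0 then x' else iterate epsilon step errors x'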
3 Semantics of Parallel Juxtaposition
Parallel juxtaposition allows the network to be divided into two subsets. If we add a notion of parallel juxtaposition to the BSλ-calculus, confluence is lost [8]. Parallel juxtaposition has to preserve the BSP execution structure. Moreover, the number of processors is no longer a constant and depends on the context of the term.
“Flat” terms. The local terms (denoted by lowercase characters) represent “sequential” values held locally by each processor. They are classical λ-expressions, given by the following grammar:

e ::= ẋ           local variable
    | n           integer constant
    | b           boolean constant
    | ⊕           operations
    | λẋ. e       lambda abstraction
    | e e         application
    | (e → e, e)  conditional

The set of processor names N is a set of p closed local terms in normal form. In the following we will consider that N = {0, . . . , p − 1}.
The “flat” global terms (denoted by uppercase characters) represent parallel values. They are given by the grammar:

Ef ::= x̄ | Ef Ef | Ef e | λx̄. Ef | λẋ. Ef | π e | Ef # Ef | Ef ? Ef

π e is the intentional parallel vector which contains the local term e i at processor i. In the BSλp-calculus, the number of processors is constant; in this case, π e is syntactic sugar for ⟨ e 0, . . . , e (p − 1) ⟩, where the width of the vector is p. In the practical programming language we will no longer use the enumerated form of parallel vectors. Nevertheless, in the evaluation semantics, we will use enumerated vectors as values. Their widths can range from 1 to p.
E1 # E2 is the point-wise parallel application. E1 must be a vector of functions and E2 a vector of values. On each process, the function is applied to the
corresponding value.
E1 ? E2 is the get operation used to communicate values between processors. E2 is a parallel vector of processor names. If at processor i the value held by E2 is j, then the result of the get operation (a parallel vector) will contain at processor i the value held by parallel vector E1 at processor j. The execution of this operation corresponds to a communication and synchronization phase of a BSP super-step. With the get operation, a processor can only obtain data from one other processor. There exists a more general communication operation, put, which allows any BSP communication schema using only one super-step. With this operation a processor can send at most p messages to the p processors of the parallel machine, and each message can have a different content. The put operation is of course a bit more complicated, and its interaction with parallel juxtaposition is the same as for the get operation. For the sake of simplicity we will only consider the get operation in this section.
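Although the rest of this section uses get, the library itself exposes put (Section 2). A rough sketch of ours of a get-like function written with put (deliberately wasteful: every processor sends its value to everybody so as to stay within one super-step, which preserves the result of get but not its communication cost):

  (* get vs vi : processor i receives the value held by vs at the processor
     named by vi at i. *)
  let get (vs : 'a par) (vi : int par) : 'a par =
    let to_send = apply (mkpar (fun _ x _dst -> Some x)) vs in
    let received = put to_send in
    apply
      (apply (mkpar (fun _ fd j -> match fd j with Some x -> x | None -> assert false))
             received)
      vi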
Terms with parallel juxtaposition. E1 ‖m E2 is the parallel juxtaposition, which means that the first m processors will evaluate E1 and the others will evaluate E2. From the point of view of E1 the network will have m processors named 0, . . . , m − 1. From the point of view of E2 the network will have p − m processors (where p is the number of processors of the current network) named 0, . . . , (p − m − 1) (processor m is renamed 0, etc.). We want to preserve the flat BSP super-step structure and avoid subset synchronization [6].
The proposition of this paper relies on the sync operation. Consider parjux m E1 E2. The evaluations of E1 and E2 will be as described below for flat terms, but respectively on networks of m and p − m processors, except that the synchronization barriers of the put and if at operations will concern the whole network. If, for example, E2 has fewer super-steps and thus fewer synchronization barriers than E1, then we need additional synchronization barriers, useless from the point of view of E2, to preserve the BSP execution model. This is why we have added the sync operation. Each operation using a synchronization barrier labels its calls to the synchronization barrier. With this label, each processor knows what kind of operation the other processors have called to end the super-step. The sync operation repeatedly requests synchronization barriers. When each processor requests a synchronization barrier labeled by sync, it means all the processors have reached the sync and the evaluation of sync ends. The sync operation introduces an additional synchronization barrier, but it allows a more flexible use of parallel juxtaposition.
The complete syntax for global terms E is E ::= Ef | sync(Es) where

Es ::= x̄ | Es Es | Es e | λx̄. Es | λẋ. Es | Es # Es | Es ? Es | π e | Es ‖e Es
It is important to notice that this syntax does not allow nesting of parallel objects: a local term cannot contain a global term. Thus a parallel vector cannot contain a parallel vector.
Semantics. The semantics starts from a term and returns a value. The local values are the constants and the local λ-abstractions. The global values are the enumerated vectors of local values and the λ-abstractions.
The evaluation semantics for local terms, e ⇓loc v, is a classical call-by-value evaluation (omitted here for the sake of conciseness). The evaluation ⇓s of global terms with parallel juxtaposition uses an environment for global variables: a parallel vector can be bound to a variable in a subnetwork with p processors and this variable can be used in a subnetwork with p′ processors (p′ ≤ p). In this case we are only interested in p′ of the p values of the vector. So the global evaluation depends on the number of processors in the current subnetwork and the name (in the whole network) of the first processor of this subnetwork.

Judgments of this semantics have the form ℰs ⊢^p′_f Es ⇓s V, which means “in global environment ℰs, on a subnetwork with p′ processors whose first processor has name f in the whole network, the term Es evaluates to value V”.
The environment ℰs for the evaluation of global terms with parallel juxtaposition can be seen as an association list which associates global variables to pairs composed of an integer and a global value. In the following we will use the notation [v0; . . . ; vn] for lists and in particular environments. We will also write ℰs(x̄) for the first value associated with the global variable x̄ in ℰs.
The first two rules are used to retrieve a value in the global environment. Note that there are two different rules: one for parallel vectors and one for other values, i.e. global λ-abstractions. In this environment, a variable is not only bound to its value but to a pair whose first component is the name of the first processor of the network on which the value was produced. This information is only useful for the first rule. This rule says that if a parallel vector was bound to a variable in a subnetwork with p1 processors and with first processor f1, then this variable can be used in a subnetwork with p2 processors (p2 ≤ p1) and with first processor f2 (f2 ≥ f1). In this case we are only interested in p2 components of the vector, starting at the component numbered f2 − f1:

    ℰs(x̄) = (f1, ⟨ v0, . . . , v(p1−1) ⟩)
    ─────────────────────────────────────────────────────
    ℰs ⊢^p2_f2  x̄  ⇓s  ⟨ v(f2−f1), . . . , v(f2−f1+p2−1) ⟩

    ℰs(x̄) = (f1, λx.Es)
    ──────────────────────────
    ℰs ⊢^p2_f2  x̄  ⇓s  λx.Es
The next rule says that to evaluate an intentional parallel vector term π e, one has to evaluate the expression e i on each processor i of the subnetwork. This evaluation corresponds to a pure computation phase of a BSP super-step:

    ∀i ∈ 0 . . . p′−1,  e i ⇓loc vi
    ──────────────────────────────────────────
    ℰs ⊢^p′_f  π e  ⇓s  ⟨ v0, . . . , v(p′−1) ⟩
The next three rules are for application and point-wise parallel application. Observe that in the rule for application, the pair (f, V2), rather than a single value, is bound to the variable x̄:

    ℰs ⊢^p′_f Es2 ⇓s V2      (x̄ ↦ (f, V2)) :: ℰs ⊢^p′_f Es1 ⇓s V1
    ───────────────────────────────────────────────────────────────
    ℰs ⊢^p′_f  (λx̄.Es1) Es2  ⇓s  V1

    e ⇓loc v      ℰs ⊢^p′_f Es[x ← v] ⇓s V
    ────────────────────────────────────────
    ℰs ⊢^p′_f  (λẋ.Es) e  ⇓s  V

    ℰs ⊢^p′_f Es1 ⇓s ⟨ λx.e0, . . . , λx.e(p′−1) ⟩      ℰs ⊢^p′_f Es2 ⇓s ⟨ v0, . . . , v(p′−1) ⟩
    ∀i ∈ 0 . . . p′−1,  (λx.ei) vi ⇓loc wi
    ─────────────────────────────────────────────────────────────────────────────────────────
    ℰs ⊢^p′_f  Es1 # Es2  ⇓s  ⟨ w0, . . . , w(p′−1) ⟩
The next rule is for the operation that needs communications:

    ℰs ⊢^p′_f Es1 ⇓s ⟨ v0, . . . , v(p′−1) ⟩      ℰs ⊢^p′_f Es2 ⇓s ⟨ n0, . . . , n(p′−1) ⟩
    ───────────────────────────────────────────────────────────────────────────────────
    ℰs ⊢^p′_f  Es1 ? Es2  ⇓s  ⟨ v(n0), . . . , v(n(p′−1)) ⟩
The last important rule is the one for parallel juxtaposition. The left side of the parallel juxtaposition is evaluated on a subnetwork with m processors; the right side is evaluated on the remaining subnetwork containing p′ − m processors:

    e ⇓loc m    and    0 < m < p′
    ℰs ⊢^m_f Es1 ⇓s ⟨ v0, . . . , v(m−1) ⟩      ℰs ⊢^(p′−m)_(f+m) Es2 ⇓s ⟨ vm, . . . , v(p′−1) ⟩
    ─────────────────────────────────────────────────────────────────────────────────────  (1)
    ℰs ⊢^p′_f  Es1 ‖e Es2  ⇓s  ⟨ v0, . . . , v(p′−1) ⟩
The remaining rule is a classical one: ℰs ⊢^p′_f λx.Es ⇓s λx.Es, where x is either ẋ or x̄.
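As a small worked example (our own, not taken from the paper), consider a whole network of p = 4 processors and the term (π e1) ‖e (π e2) where e ⇓loc 2, e1 i ⇓loc 10 × i and e2 i ⇓loc i. By rule (1), π e1 is evaluated on the subnetwork {0, 1}, giving ℰs ⊢^2_0 π e1 ⇓s ⟨ 0, 10 ⟩, while π e2 is evaluated on the subnetwork {2, 3}, whose processors are locally renamed 0 and 1, giving ℰs ⊢^2_2 π e2 ⇓s ⟨ 0, 1 ⟩. The juxtaposition therefore evaluates to ⟨ 0, 10, 0, 1 ⟩.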
We now need to define the evaluation of global terms E (with parallel juxtaposition enclosed in a sync operation). The environment ℰ for the evaluation of global terms E can be seen as an association list which associates global variables to global values. This evaluation has judgments of the form ℰ ⊢ E ⇓ V, which means “in global environment ℰ, on the whole network, the term E evaluates to value V”. The first rule is the one used to evaluate a parallel composition in the scope of a sync:

    ℰs ⊢^p_0 Es ⇓s ⟨ v0, . . . , v(p−1) ⟩
    ─────────────────────────────────────
    ℰ ⊢ sync(Es) ⇓ ⟨ v0, . . . , v(p−1) ⟩

where ℰs(x̄) = (0, V) if and only if ℰ(x̄) = V. The remaining rules are similar to the previous ones, except that the number of processors and the name of the first processor of the subnetwork are no longer useful. Due to the lack of space we omit these rules.
4 Example
Objective Caml is an eager language. To express parallel juxtaposition as a function we have to “freeze” the evaluation of its parallel arguments. Thus parallel juxtaposition must have the following type:
parjux: int -> (unit -> 'a par) -> (unit -> 'a par) -> 'a par
In the BSMLlib library, the sync operation has the type 'a -> 'a even though it is only useful for global terms.
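A minimal usage sketch (ours): juxtapose a vector of process identifiers on the first two processors with a constant vector on the remaining ones; the whole expression is enclosed in sync as required.

  (* Processors 0 and 1 evaluate the first thunk (and see a 2-processor network);
     the remaining processors evaluate the second thunk. *)
  let v = sync (parjux 2 (fun () -> mkpar (fun pid -> pid))
                         (fun () -> mkpar (fun _ -> -1)))
  (* On a 4-processor machine, v is the vector < 0, 1, -1, -1 >. *)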
The following example is a divide-and-conquer version of the scan program, which is defined by scan ⊕ ⟨ v0, . . . , v(p−1) ⟩ = ⟨ v0, . . . , v0 ⊕ v1 ⊕ . . . ⊕ v(p−1) ⟩:
let rec scan op vec =
  if bsp_p () = 1 then vec
  else
    let mid = bsp_p () / 2 in
    (* Recursively scan the two halves of the network. *)
    let vec' = parjux mid (fun () -> scan op vec) (fun () -> scan op vec) in
    (* Processor mid-1 sends its partial result to every processor of the second half. *)
    let msg vec = apply (mkpar (fun i v ->
                    if i = mid - 1
                    then fun dst -> if dst >= mid then Some v else None
                    else fun dst -> None)) vec
    (* Combine a received value (if any) with the local one. *)
    and parop = parfun2 (fun x y -> match x with None -> y | Some v -> op v y) in
    parop (apply (put (msg vec')) (mkpar (fun i -> mid - 1))) vec'
The network is divided into two parts and the scan is recursively applied to those two parts. The value held by the last processor of the first part is broadcast to all the processors of the second part, then this value and the value held locally are combined by the operator op on each processor of the second part. To use this function at top level, it must be put into a sync operation. For example: let this = mkpar (fun pid -> pid) in sync (scan (+) this).
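On a 4-processor machine, for instance, this is the vector ⟨ 0, 1, 2, 3 ⟩ and the expression above evaluates to the prefix-sum vector ⟨ 0, 1, 3, 6 ⟩ (a worked instance added for illustration).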
5 Related Work
[14] presents another way to divide-and-conquer in the framework of an object-oriented language. There is no formal semantics and no implementation so far. Furthermore, the proposed operation is rather a parallel superposition (several BSP threads use the whole network) than a parallel juxtaposition. The same author advocates in [9] a new extension of the BSP model in order to ease the programming of divide-and-conquer BSP algorithms. It adds another level to the BSP model, with new parameters to describe the parallel machine.
[15] is an algorithmic skeletons language based on the BSP model which offers divide-and-conquer skeletons. Nevertheless, the cost model is not really the BSP model but the D-BSP model [2], which allows subset synchronization. We follow [6] in rejecting such a possibility.
6 Conclusions and Future Work
Compared to a previous attempt [7], the parallel juxtaposition does not have the two main drawbacks of its predecessor: (a) the two sides of the spatial parallel composition may now have different numbers of synchronization barriers. This enhancement comes at the price of an additional synchronization barrier, but only one barrier even in the case of nested parallel compositions; (b) the cost model is a compositional one. The remaining drawback is that the semantics is strategy-dependent. The use of the spatial parallel composition changes the number of processors: it is an imperative feature. Thus there is no hope of obtaining, in this way, a confluent calculus with a parallel juxtaposition.
The next released implementation of the BSMLlib library will include parallel juxtaposition. Its ease of use will be evaluated by implementing BSP algorithms described as divide-and-conquer algorithms in the literature.
References
1. M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation.
MIT Press, 1989.
2. P. de la Torre and C. P. Kruskal. Submachine locality in the bulk synchronous
setting. In Euro-Par’96, number 1123-1124. Springer Verlag, 1996.
3. C. Foisy and E. Chailloux. Caml Flight: a portable SPMD extension of ML for distributed memory multiprocessors. In A. W. Böhm and J. T. Feo, editors, Workshop on High Performance Functional Computing, 1995.
4. F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Seventh International Conference on Parallel Computing Technologies
(PaCT 2003), LNCS. Springer Verlag, 2003. to appear.
5. A. V. Gerbessiotis and L. G. Valiant. Direct Bulk-Synchronous Parallel Algorithms.
Journal of Parallel and Distributed Computing, 22:251–267, 1994.
6. G. Hains. Subset synchronization in BSP computing. In H.R.Arabnia, editor,
PDPTA’98, pages 242–246. CSREA Press, 1998.
7. F. Loulergue. Parallel Composition and Bulk Synchronous Parallel Functional
Programming. In S. Gilmore, editor, Trends in Functional Programming, Volume
2, pages 77–88. Intellect Books, 2001.
8. F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs.
Science of Computer Programming, 37(1-3):253–277, 2000.
9. J. Martin and A. Tiskin. BSP Algorithms Design for Hierarchical Supercomputers.
submitted for publication, 2002.
10. P. Panangaden and J. Reppy. The essence of concurrent ML. In F. Nielson, editor,
ML with Concurrency, Monographs in Computer Science. Springer, 1996.
11. S. Pelagatti. Structured Development of Parallel Programs. Taylor & Francis, 1998.
12. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about
BSP. Scientific Programming, 6(3):249–274, 1997.
13. A. Tiskin. The Design and Analysis of Bulk-Synchronous Parallel Algorithms. PhD
thesis, Oxford University Computing Laboratory, 1998.
14. A. Tiskin. A New Way to Divide and Conquer. Parallel Processing Letters, (4),
2001.
15. A. Zavanella. Skeletons and BSP: Performance Portability for Parallel Programming. PhD thesis, Università degli Studi di Pisa, 1999.