Programming for Multilevel Parallelism

Jörg Dümmler and Gudula Rünger
Chemnitz University of Technology
Department of Computer Science
Chair for Practical Computer Science
2nd Workshop of COST0805
Open Network for High-Performance Computing on Complex Environments
Timisoara, January 25–27, 2012
Outline
1. Programming with parallel tasks
   - The Programming Model
   - Compiler framework TwoL (Two-Level Parallelism)
2. CM-task programming model
3. Benchmarks
4. Summary
Motivation
Goal: efficient, scalable, adaptive, and flexible implementation of parallel application programs on multicore and cluster systems.
- Many application programs and algorithms have an inherent modular structure of cooperating subtasks calling each other; examples: environmental models, aircraft simulations.
- Parallel tasks or multiprocessor tasks (M-tasks) are a suitable programming abstraction:
  - Each module/component of the application is implemented as an M-task (parallelism inside).
  - Mixed task and data parallelism is possible.
  - Separation of specification and execution: portable performance.
The M-task Programming Model
- An application consists of a set of parallel tasks (M-tasks).
- M-tasks can be:
  - basic (data parallel implementations), e.g., a data parallel matrix multiplication;
  - composed (consisting of other M-tasks), e.g., matrix multiplication by Strassen's algorithm.
- Basic M-tasks can have different implementations, e.g., for different data distributions.
- M-tasks can run on an arbitrary number of processors.
- M-tasks can be independent of each other → a task parallel execution is possible.
- M-tasks can depend on other tasks (A → B) → consecutive execution of different parallel program parts.
- M-tasks have input and output parameters/data → communication between tasks.
- Data re-distribution operations are required if
  - the data distribution differs between producer task A and consumer task B, or
  - producer and consumer task are executed on different processor groups.
Application task graph
- An M-task application is represented by a directed acyclic graph (M-task dag) G = (V, E).
- A node v ∈ V corresponds to an M-task (either basic or composed).
- An edge e ∈ E symbolizes a dependency between two tasks.
- (Homogeneous) target platforms with P processors are considered.
- M-tasks v ∈ V can be executed on any nonempty processor group of a target platform with P processors.
Cost model for M-task applications
- Nodes v ∈ V are assigned a computation cost T_v : {1, . . . , P} → R⁺.
  - Representation by measured runtimes or by symbolic runtime formulas in closed form.
  - Derivation by runtime prediction techniques, or by fitting measured runtimes (simulation) to a function prototype.
- Edges e = (v1, v2) ∈ E have a communication cost arising from possible re-distribution operations.
  - Communication costs are given as a formula depending on the data size transferred between nodes v1 and v2 and on machine-dependent parameters, namely the startup time and the byte transfer time.
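For illustration, a minimal C sketch of such cost functions: a closed-form runtime prototype T_v(p) and a re-distribution cost built from startup time and byte transfer time. The function names and the concrete formulas are assumptions for this example, not part of the TwoL framework.

#include <stdio.h>

/* Hypothetical closed-form runtime prototype for an M-task v:
 * parallelizable work that scales with 1/p plus a communication
 * term that grows with the number of processors p. */
static double mtask_cost(double work, double comm, int p) {
    return work / p + comm * p;
}

/* Re-distribution cost between producer and consumer, modeled with
 * machine-dependent startup time and byte transfer time. */
static double redist_cost(long bytes, double startup, double byte_time) {
    return startup + (double)bytes * byte_time;
}

int main(void) {
    for (int p = 1; p <= 8; p *= 2)
        printf("T_v(%d) = %.3f s\n", p, mtask_cost(100.0, 0.05, p));
    printf("redistribution of 1 MB: %.6f s\n",
           redist_cost(1L << 20, 1e-4, 1e-9));
    return 0;
}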
Scheduling in the M-task programming model
Schedule
- Computed at compile time (static scheduling).
- Assigns each M-task v ∈ V a starting time and a processor group.

Feasible schedule
- Each processor executes at most one M-task at any point in time.
- Each M-task is executed by at least one processor.
- All input data are available before a task is started.

Makespan
- The makespan of the schedule is the point in time at which all M-tasks have finished.

Scheduling goal
- Minimal makespan.

[Figure: example schedule with processors 1–9 on one axis and time on the other]
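A minimal C sketch of how such a static schedule could be represented and its makespan computed; the struct layout and field names are assumptions for illustration, not the framework's data structures.

#include <stdio.h>

/* Hypothetical schedule entry: an M-task, its start time, its predicted
 * runtime, and the assigned processor group [first_proc, first_proc + group_size). */
typedef struct {
    int task_id;
    double start;
    double runtime;
    int first_proc;
    int group_size;
} ScheduleEntry;

/* Makespan: the latest finishing time over all scheduled M-tasks. */
static double makespan(const ScheduleEntry *s, int n) {
    double end = 0.0;
    for (int i = 0; i < n; i++) {
        double finish = s[i].start + s[i].runtime;
        if (finish > end) end = finish;
    }
    return end;
}

int main(void) {
    ScheduleEntry sched[] = {
        {0, 0.0, 2.0, 0, 4},   /* task 0 on processors 0-3             */
        {1, 0.0, 3.0, 4, 4},   /* task 1 on processors 4-7, concurrent  */
        {2, 3.0, 1.5, 0, 8},   /* task 2 on all 8 processors afterwards */
    };
    printf("makespan = %.1f\n", makespan(sched, 3));
    return 0;
}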
Compiler framework TwoL (Two-Level Parallelism)
Mixed task and data parallelism:

application algorithm → specification program → (program transformation) → coordination program → (translation) → parallel target program

Specification program: maximum degree of parallelism
- composed modules → medium grain task parallelism (coordination language)
- basic modules → fine grain data parallelism

Coordination program: exploited degree of parallelism
Overview of the transformation steps
[Figure: overview of the transformation steps: a module specification (module dependence graph) is refined into a module specification with scheduling (module execution graph) and finally into a module specification with scheduling and group sizes (module execution plan); the intermediate representations are labeled SDS and GDS]
Outline
1. Programming with parallel tasks
2. CM-task programming model
   - Motivation
   - Application model
   - Cost model
   - CM-task scheduling
   - Programming support for CM-task applications
3. Benchmarks
4. Summary
Extending the M-task programming model – motivation
- Drawback of M-tasks: M-tasks can only exchange data by using suitable input/output parameters;
  - interaction is only possible if the M-tasks are not running;
  - the granularity of the M-tasks is restricted.
- Possible solution: extension of M-tasks to communicating M-tasks (CM-tasks).
  - CM-tasks additionally support interaction between running tasks.
  - Advantages: CM-tasks do not need to be interrupted for data exchanges with concurrently executed CM-tasks; CM-tasks can have a coarser granularity compared to M-tasks → better structuring of applications with repeated data exchanges between program parts; optimized communication patterns can be exploited.
- Areas of application: time-stepping methods like ODE solvers, large simulation programs.
The CM-task programming model
- CM-tasks are parallel program fragments executable on an arbitrary number of processors.
- Basic CM-tasks are implemented directly by the application developer.
- Composed CM-tasks comprise activations of other CM-tasks and a coordination structure describing the dependencies between the CM-task activations.
- Two types of relations are possible between CM-tasks:
  - P-relations capture input/output dependencies as in the M-task model;
  - C-relations signal data exchanges between CM-tasks during their execution.
Representation of CM-task programs
- A CM-task program can be represented by a CM-task graph G = (V, E) with E = E_P ∪ E_C, where
  - the set of nodes V represents the CM-tasks,
  - the set of directed edges E_P represents the P-relations, and
  - the set of bidirectional edges E_C represents the C-relations between CM-tasks.
- The CM-task graph represents restrictions on the execution order:
  - CM-tasks connected by a path of directed edges have to be executed one after another;
  - CM-tasks connected by a path of bidirectional edges have to be executed concurrently;
  - CM-tasks not connected by a path are independent, i.e., they can be executed concurrently or one after another in an arbitrary order.
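The following C sketch shows one possible in-memory representation of such a graph, with separate edge lists for P-relations and C-relations; the types and names are hypothetical and not the framework's actual data structures.

#include <stdio.h>

/* Hypothetical representation of a CM-task graph G = (V, E_P ∪ E_C).
 * Directed P-edges and bidirectional C-edges are kept in separate lists
 * because the scheduler treats them differently. */
typedef struct { int from, to; } PEdge;  /* from must finish before to starts     */
typedef struct { int a, b; }    CEdge;   /* a and b must be executed concurrently  */

typedef struct {
    int num_tasks;            /* |V|          */
    int num_p, num_c;         /* |E_P|, |E_C| */
    const PEdge *p_edges;
    const CEdge *c_edges;
} CMTaskGraph;

int main(void) {
    /* Example: task 0 precedes tasks 1 and 2; tasks 1 and 2 exchange
     * data while running and therefore must run concurrently. */
    const PEdge p[] = { {0, 1}, {0, 2} };
    const CEdge c[] = { {1, 2} };
    CMTaskGraph g = { 3, 2, 1, p, c };
    printf("%d tasks, %d P-relations, %d C-relations\n",
           g.num_tasks, g.num_p, g.num_c);
    return 0;
}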
Cost annotations for CM-task graphs
- Costs for CM-tasks are represented by an analytical function T : V × {1, . . . , q} → R, where q is the number of processors of a homogeneous platform.
- The costs T for basic CM-tasks are given as measured runtimes or by symbolic runtime formulas in closed form.
- The costs for composed CM-tasks are derived by combining the cost information of the activated CM-tasks.
- The directed edges (P-relations) of the CM-task graph are associated with communication costs defined by a function T_P : E_P × {1, . . . , q} × {1, . . . , q} → R.
  - These costs result from possible data re-distribution operations.
  - The communication costs depend on the number of data elements to be transferred, the numbers of processors used to execute the participating CM-tasks, and platform-specific parameters, like the startup time and the byte transfer time of the network.
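A minimal C sketch of a re-distribution cost of the form T_P(e, p1, p2); the concrete cost formula, the function name, and the constants are assumptions for illustration only.

#include <stdio.h>

/* Hypothetical re-distribution cost T_P(e, p1, p2) for a P-relation:
 * data produced on p1 processors is redistributed to the p2 processors
 * of the consumer task. The cost model is an assumption. */
static double tp_cost(long elements, long elem_size, int p1, int p2,
                      double startup, double byte_time) {
    /* Assume each receiving processor takes part in at most p1 messages
     * and receives roughly elements/p2 data elements. */
    long bytes_per_receiver = (elements / p2) * elem_size;
    return p1 * startup + (double)bytes_per_receiver * byte_time;
}

int main(void) {
    /* 10^6 doubles redistributed from 8 to 16 processors. */
    printf("T_P = %.6f s\n", tp_cost(1000000L, 8L, 8, 16, 1e-4, 1e-9));
    return 0;
}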
CM-task schedule
- A CM-task schedule S maps each CM-task A_i ∈ V of a CM-task graph G = (V, E) to an execution time interval and a set of processors.
- The total execution time is the point in time when all CM-tasks have finished.
- A static schedule is computed at compile time.
- A feasible schedule has to assure that:
  - CM-tasks connected by a P-relation are executed one after another (taking data re-distribution costs into account);
  - CM-tasks connected by a C-relation are executed concurrently;
  - concurrently executed CM-tasks use disjoint sets of processors.
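A minimal C sketch of how the C-relation constraints (concurrent execution on disjoint processor groups) could be checked for a pair of scheduled CM-tasks; the schedule representation is the same hypothetical one used in the earlier sketch.

#include <stdio.h>

/* Hypothetical schedule entry, as in the earlier sketch. */
typedef struct {
    double start, runtime;
    int first_proc, group_size;  /* processors [first_proc, first_proc + group_size) */
} Entry;

static int overlap_in_time(const Entry *a, const Entry *b) {
    return a->start < b->start + b->runtime && b->start < a->start + a->runtime;
}

static int disjoint_groups(const Entry *a, const Entry *b) {
    return a->first_proc + a->group_size <= b->first_proc ||
           b->first_proc + b->group_size <= a->first_proc;
}

/* C-relation feasibility: the two CM-tasks must run at the same time
 * on disjoint processor groups. */
static int c_relation_ok(const Entry *a, const Entry *b) {
    return overlap_in_time(a, b) && disjoint_groups(a, b);
}

int main(void) {
    Entry t1 = {0.0, 4.0, 0, 4};
    Entry t2 = {0.0, 4.0, 4, 4};
    printf("C-relation feasible: %s\n", c_relation_ok(&t1, &t2) ? "yes" : "no");
    return 0;
}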
Programming support: CM-task framework (1)
[Figure: CM-task Compiler Framework: the application developer provides the Specification Program, the Platform Description, and the CM-task Implementations; the CM-task Compiler, supported by a Data Re-distribution Library and a Load Balancing Library, generates the Coordination Program (C+MPI)]
Input of the CM-task Compiler Framework:
- a platform-independent Specification Program that expresses the degree of task parallelism of an application;
- a Platform Description with platform-specific parameters (number of processors, computational performance, speed of the interconnection);
- implementations of the CM-tasks in the form of parallel functions (e.g., coded in C+MPI).
Programming support: CM-task framework (2)
[Figure: CM-task Compiler Framework, as on the previous slide]
Output of the CM-task Compiler Framework: a platform-dependent Coordination Program that includes
- a static schedule that is adapted to the specific target platform;
- data re-distribution operations to provide input data in the correct data distribution;
- management of the subsets of processors used (e.g., handling of the appropriate MPI communicators);
- execution of the user-provided implementations on the appropriate sets of processors.
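A rough C+MPI sketch of what such generated group management could look like: the schedule is assumed to split the processors into two groups that run two CM-tasks concurrently. The task function, the group layout, and the overall structure are assumptions, not actual compiler output.

#include <mpi.h>
#include <stdio.h>

/* User-provided CM-task implementation (hypothetical signature):
 * runs SPMD on the processors of the communicator it receives. */
static void cm_task_solve(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) printf("CM-task running on a group of %d processes\n", size);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Assumed schedule: split the processors into two groups that execute
     * two concurrent CM-tasks; a real coordination program would derive
     * the group sizes from the static schedule. */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group);

    cm_task_solve(group);   /* both groups run their task concurrently */

    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}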
CM-task Specification Language
Basic CM-tasks are specified by
- a parameter list with input, output and communication parameters, and
- a cost expression depending on the number of executing processors p.

Composed CM-tasks are specified by
- a parameter list similar to basic CM-tasks, and
- a hierarchical dependence expression M using the following grammar:

M → seq { M1 M2 ... Mn }           consecutive execution
  | par { M1 M2 ... Mn }           data independence
  | for (i = 1 : n) { M1 }         loop with data dependencies
  | while (cond)#It { M1 }         loop with data dependencies
  | parfor (i = 1 : n) { M1 }      loop without data dependencies
  | if (cond) { M1 }               conditional execution
  | if (cond) { M1 } else { M2 }   conditional execution
  | C
C → BC (a1, ..., an);              execution of a basic CM-task
  | CC (a1, ..., an);              execution of a composed CM-task
  | cpar { C1 C2 ... Cn }          concurrent execution
  | cparfor (i = 1 : n) { C1 }     concurrent execution of iterations
Example: Specification Program for the PABM method
const K = 8;    // number of stage vectors
const n = ...;  // ODE system size
// basic CM-task definitions
cmtask pabmstep(k:int, x, h:scalar:in, yk:vector:inout:block,
                yk1:vector:in:block, ort:vector:comm) runtime [...];
// [...]
// composed CM-task definitions
cmmain pabm(X:double, y:vecs:inout:replic) {
  // declaration of local variables
  var x, h : scalar;
  var ortcomm : vecs;
  // dependence expression
  seq {
    initstep(x, h);
    while (x[0] < X)#100 {
      seq {
        cparfor (k = 0 : K-1) {
          pabmstep(k, x, h, y[k], y[K-1], ortcomm); }
        updatestep(x, h);
  }}}}
Transformation steps of the CM-task Compiler
[Figure: CM-task Compiler pipeline: Specification Program → Data Flow Analyzer → Static Scheduler → Data Distribution → Code Generator → Coordination Program (C+MPI); the Platform Description is an additional input]

Translation into the final coordination program proceeds in four transformation steps:
- Data Flow Analyzer: detection of dependencies (P-relations and C-relations) between CM-task activations;
- Static Scheduler: computation of the execution order and assignment of subsets of processors to CM-task activations;
- Data Distribution: insertion of the required data re-distribution operations;
- Code Generator: generation of the final coordination program.
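As a rough illustration of the first step, the following C sketch derives P-relations and C-relations from the parameter lists of two activations: a P-relation if an output of one activation is an input of the other, a C-relation if both share a communication parameter. This is a strongly simplified assumption about the analysis, with hypothetical types and example data.

#include <stdio.h>
#include <string.h>

/* Strongly simplified, hypothetical view of the dependence detection:
 * each CM-task activation is described only by its parameter names. */
typedef struct {
    const char *name;
    const char *in[4];    /* input parameters         */
    const char *out[4];   /* output parameters        */
    const char *comm[4];  /* communication parameters */
} Activation;

static int contains(const char *const set[], const char *v) {
    for (int i = 0; i < 4 && set[i]; i++)
        if (strcmp(set[i], v) == 0) return 1;
    return 0;
}

/* P-relation: an output parameter of a is used as an input of b. */
static int p_relation(const Activation *a, const Activation *b) {
    for (int i = 0; i < 4 && a->out[i]; i++)
        if (contains(b->in, a->out[i])) return 1;
    return 0;
}

/* C-relation: a and b share a communication parameter. */
static int c_relation(const Activation *a, const Activation *b) {
    for (int i = 0; i < 4 && a->comm[i]; i++)
        if (contains(b->comm, a->comm[i])) return 1;
    return 0;
}

int main(void) {
    Activation init = {"initstep",  {0},           {"x", "h", 0}, {0}};
    Activation s0   = {"pabmstep0", {"x", "h", 0}, {"y0", 0},     {"ort", 0}};
    Activation s1   = {"pabmstep1", {"x", "h", 0}, {"y1", 0},     {"ort", 0}};
    printf("initstep -> pabmstep0: P-relation = %d\n", p_relation(&init, &s0));
    printf("pabmstep0 <-> pabmstep1: C-relation = %d\n", c_relation(&s0, &s1));
    return 0;
}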
Outline
1. Programming with parallel tasks
2. CM-task programming model
3. Benchmarks
4. Summary
Hardware Description
CHiC - Chemnitz High Performance Linux Cluster
- 530 compute nodes
- each node comprises 2 AMD Opteron dual-core processors (2.6 GHz)
- 10 GBit/s InfiniBand interconnection
- MVAPICH 1.0 MPI library
- Pathscale 3.1 compiler

JuRoPA - Jülich Research on Petaflop Architectures
- 2208 compute nodes
- each node comprises 2 Intel Xeon X5570 (Nehalem-EP) quad-core processors (2.93 GHz)
- 40 GBit/s InfiniBand network
- Parastation MPI 5.0
- Intel Compiler 11.0
Application – NAS-MZ parallel benchmarks
- Simplified versions of realistic solvers for flow equations from the area of computational fluid dynamics (CFD).
- The global 3-dimensional mesh is partitioned into independent zones.
- Data parallel version: compute the zones one after another using all processors.
- CM-task version: map a specific number of zones to each CM-task and execute the CM-tasks concurrently (border exchanges are modeled as C-relations).
- Three different NAS-MZ benchmarks are available; they differ in how the global mesh is partitioned into zones and in which solution method is used within a zone.

Benchmark | Zone partitioning
LU-MZ     | 16 equal sized zones
SP-MZ     | 256 (class C) or 1024 (class D) equal sized zones
BT-MZ     | 256 (class C) or 1024 (class D) zones with different sizes
Benchmark results for LU-MZ
[Figure: LU-MZ benchmark, total performance in GFlops/s over the number of processor cores on CHiC (64 to 896 cores) and on JuRoPA (64 to 1024 cores), comparing the data parallel version and the task parallel version with 16 CM-tasks for classes C and D]
- Data parallelism is superior for small numbers of processors, since communication operations for border exchanges can be avoided.
- Task parallelism outperforms data parallelism for large numbers of processors, since the amount of computation assigned to a processor for a single zone gets too small.
Benchmark results for SP-MZ
[Figure: SP-MZ benchmark (classes C and D), total performance in GFlops/s over the number of processor cores (256 to 1024) on CHiC, comparing the data parallel version with task parallel versions using 4 to 1024 CM-tasks]
- The zones are much smaller than in the LU-MZ benchmark → the potential parallelism of a single zone is too small to exploit a large number of processors → data parallelism leads to poor performance.
- The optimal degree of task parallelism that should be exploited increases with the number of processors.
Benchmark results for BT-MZ
[Figure: BT-MZ benchmark (classes C and D), total performance in GFlops/s over the number of processor cores (256 to 1024) on CHiC, comparing the data parallel version with task parallel versions using 16 to 1024 CM-tasks]
- The CM-tasks include different amounts of computation → the processor group sizes have to be adapted to the computational work.
- The program versions with 256 and 512 CM-tasks exhibit good scalability → the load balancing step of the CM-task scheduling algorithm computes a suitable processor group layout.
Outline
1. Programming with parallel tasks
2. CM-task programming model
3. Benchmarks
4. Summary
Conclusion
- Programming with multilevel parallelism is a suitable parallel programming model for applications with well-defined submodules.
- Programming support (compiler framework TwoL, its extension for CM-tasks, library TLib): specification, transformation, scheduling.
- Scheduling algorithms are available for parallel tasks (toolset and library); extended scheduling algorithms exist for CM-tasks.

Current work:
- Dynamic library approach TLib for new architectures.
- Hybrid MPI+OpenMP and mixed CPU+GPU implementations can be integrated into the CM-task approach.

More information:
http://www.tu-chemnitz.de/informatik/PI/forschung/projekte/cmtask/
http://www.tu-chemnitz.de/informatik/PI/forschung/projekte/genMTS/