FORTRAN and MPI
Message Passing Interface (MPI)
Day 4
Maya Neytcheva, IT, Uppsala University [email protected]
Course plan:
• MPI - General concepts
• Communications in MPI
  – Point-to-point communications
  – Collective communications
• Parallel debugging
• Advanced MPI: user-defined data types, functions
  – Linear Algebra operations
• Advanced MPI: communicators, virtual topologies
  – Parallel sort algorithms
• Parallel performance. Summary. Tendencies
Recall: Distributed memory machines
[Figure: several processors P, each with its own local memory M, connected
through an interprocessor communication network]
Recall: Interconnection network topologies
[Figure: common interconnection topologies - (a) linear array, (b) ring,
(c) star, (d) 2D mesh, (e) 2D toroidal mesh, (f) systolic array,
(g) completely connected, (h) chordal ring, (i) binary tree,
(j)-(k) 3D cube]
Communicators
Communicators
Creating communicators enables us to split the processors into groups.
Benefits:
• The different groups can perform independent tasks.
• Collective operations can be done on a subset of processors.
• The subsets are logically the same as the initial (complete) group of PEs,
i.e., each subgroup has its own P0. In this way, for example, recursive
algorithms can be implemented.
• Safety: isolating messages and avoiding conflicts between modules, etc.
Groups of processes
First we need to introduce the notion of a group.
A group is an ordered set of process identifiers (henceforth processes).
Each process in a group is associated with an integer rank. Ranks are
contiguous and start from zero.
Groups cannot be directly transferred from one process to another.
A group is used within a communicator to describe the participants in a
communication “universe” and to rank such participants (thus giving them
unique names within that “universe” of communication).
There is a special pre-defined group: MPI_GROUP_EMPTY, which is a group
with no members. The predefined constant MPI_GROUP_NULL is the value
used for invalid group handles.
Groups
MPI_COMM_GROUP(comm,group)
input: comm - communicator
output: group - group corresponding to the communicator
MPI_COMM_GROUP(COMM, GROUP, IERROR)
INTEGER COMM, GROUP, IERROR
Groups
Various functions are available for manipulating groups in MPI:
MPI_GROUP_UNION(group1, group2, newgroup)
MPI_GROUP_INTERSECTION(group1, group2, newgroup)
MPI_GROUP_DIFFERENCE(group1, group2, newgroup)
MPI_GROUP_INCL(group, n, ranks, newgroup)
MPI_GROUP_EXCL(group, n, ranks, newgroup)
MPI_GROUP_FREE(group)
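As a small illustration, here is a sketch (the variable names are ours, not
from the slides) that uses MPI_COMM_GROUP and MPI_GROUP_INCL to build a
communicator containing only the even-ranked processes; MPI_COMM_CREATE is
introduced on the next slide:

      program even_group
      implicit none
      include "mpif.h"
      integer ierror, rank, size, i, nevens
      integer WORLD_GROUP, EVEN_GROUP, EVEN_COMM
      integer, dimension(:), allocatable :: ranks

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
!     list the even ranks 0, 2, 4, ...
      nevens = (size+1)/2
      allocate(ranks(nevens))
      do i = 1, nevens
         ranks(i) = 2*(i-1)
      end do
!     extract the group of MPI_COMM_WORLD, keep only the even ranks
      call MPI_Comm_group(MPI_COMM_WORLD, WORLD_GROUP, ierror)
      call MPI_Group_incl(WORLD_GROUP, nevens, ranks,
     .                    EVEN_GROUP, ierror)
!     collective over MPI_COMM_WORLD; odd ranks get MPI_COMM_NULL
      call MPI_Comm_create(MPI_COMM_WORLD, EVEN_GROUP,
     .                     EVEN_COMM, ierror)
      if (EVEN_COMM.ne.MPI_COMM_NULL) then
!        collective operations on the even ranks only go here
         call MPI_Comm_free(EVEN_COMM, ierror)
      endif
      call MPI_Group_free(EVEN_GROUP, ierror)
      call MPI_Group_free(WORLD_GROUP, ierror)
      call MPI_Finalize(ierror)
      end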
Communicators
MPI_COMM_CREATE(comm, group, newcomm)
[ IN comm] communicator (handle)
[ IN group] Group, which is a subset of the group of comm (handle)
[ OUT newcomm] new communicator (handle)
int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
INTEGER COMM, GROUP, NEWCOMM, IERROR
The function creates a new communicator.
Communicators
Various functions are available for manipulating communicators in MPI:
MPI_COMM_SIZE(comm, size)
MPI_COMM_COMPARE(comm1, comm2, result)
MPI_COMM_SPLIT(comm, color, key, newcomm)
MPI_COMM_FREE(comm)
Communicators
MPI_COMM_SPLIT(comm, color, key, newcomm)
IN   comm     communicator (handle)
IN   color    control of subset assignment (integer)
IN   key      control of rank assignment (integer)
OUT  newcomm  new communicator (handle)
int MPI_Comm_split(MPI_Comm comm, int color, int key,
MPI_Comm *newcomm)
MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR
Communicators
What are color and key?
Partitions the group associated with comm into disjoint subgroups, one for
each value of color.
Each subgroup contains all processes of the same color.
Within each subgroup, the processes are ranked in the order defined by the
value of the argument key.
A new communicator is created for each subgroup and returned in newcomm.
A process may supply the color value MPI_UNDEFINED, in which case
newcomm returns MPI_COMM_NULL. This is a collective call, but each
process is permitted to provide different values for color and key.
Communicators
program Comm
implicit none
include "mpif.h"
integer ierror, rank, size, rankh, sizeh, key
integer ALL_GROUP, color, HALF_COMM
integer N, M
integer, dimension(MPI_STATUS_SIZE) :: status
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
N = rank
M = rank+10
call MPI_COMM_GROUP(MPI_COMM_WORLD, ALL_GROUP, ierror)
color=rank/2
key=0
call MPI_COMM_SPLIT(MPI_COMM_WORLD,color,key,HALF_COMM, ierror)
call MPI_Comm_rank(HALF_COMM, rankh, ierror)
call MPI_Comm_size(HALF_COMM, sizeh, ierror)
write(*,*) 'Global ', rank,' is now local rank ',rankh,' ',color
write(*,*) 'Global ',rank,' local ',rankh, ' has ',N,' and ',M
if (rankh.eq.0) then
!  MPI_SENDRECV must not use the same buffer for send and receive,
!  so the in-place exchange of N is done with MPI_SENDRECV_REPLACE
   call MPI_SENDRECV_REPLACE(N,1,MPI_INTEGER,1,1,
.                            1,2,HALF_COMM,status,ierror)
else
   call MPI_SENDRECV_REPLACE(N,1,MPI_INTEGER,0,2,
.                            0,1,HALF_COMM,status,ierror)
endif
write(*,*) 'Global ',rank,' local ',rankh, ' new ',N,' and ',M
call MPI_COMM_FREE(HALF_COMM, ierror)
call MPI_GROUP_FREE(ALL_GROUP, ierror)
call MPI_Finalize(ierror)
stop
end
Communicators
Possible output of the above code on 4 processes (sorted afterwards; two runs shown):

Global 3 local 1 has 3 and 13
Global 2 local 0 has 2 and 12
------------------------------------------
Global 1 local 1 new 0 and 11
Global 0 local 0 new 0 and 11
Global 2 local 0 new 2 and 13
Global 3 local 1 new 2 and 13
------------------------------------------
Global 2 local 0 has 2 and 12
Global 3 local 1 has 3 and 13
------------------------------------------
Global 0 local 0 new 1 and 10
Global 1 local 1 new 0 and 11
Global 2 local 0 new 3 and 12
Global 3 local 1 new 2 and 13

(In the actual runs the last two lines appeared interleaved, since the
processes write to the same terminal concurrently.)
Virtual interconnection topology
Virtual topologies
Algorithm −→ communication pattern −→ graph
The processes represent the nodes in that graph,
the edges connect processes that communicate with each other.
MPI provides message-passing between any pair of processes in a group.
However, it turns out to be convenient to describe the virtual communication
topology utilized by an algorithm.
The provided MPI functions for that are:
MPI_GRAPH_CREATE and MPI_CART_CREATE,
which are used to create general (graph) virtual topologies and Cartesian
topologies, respectively. These topology creation functions are collective.
Virtual topologies
The tool provided to describe Cartesian grids of processors is
MPI_CART_CREATE
Cartesian structures of arbitrary dimension are allowed.
In addition, for each coordinate direction one specifies whether the process
structure is periodic or not.
Note that an n-dimensional hypercube is an n-dimensional torus with 2
processes per coordinate direction. Thus, special support for hypercube
structures is not necessary.
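For instance, a sketch of the 3-dimensional case (HCUBE_COMM is our own
name, and the code assumes it is run on 8 processes): a hypercube is just a
periodic 2 x 2 x 2 grid.

      integer dims(3), HCUBE_COMM, ierror
      logical periods(3), reorder
      dims    = (/ 2, 2, 2 /)
      periods = .true.
      reorder = .false.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 3, dims, periods,
     .                     reorder, HCUBE_COMM, ierror)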
Process topologies
Cartesian topology functions
MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)

IN   comm_old   input communicator
IN   ndims      number of dimensions in a Cartesian grid
IN   dims       integer array of size ndims specifying the number
                of procs in each dimension
IN   periods    logical array of size ndims specifying whether the
                grid is periodic (.true.) or not (.false.) in each
                dimension
IN   reorder    ranks may be reordered (.true.) or not (.false.)
OUT  comm_cart  communicator with new Cartesian topology

Returns a handle to a new communicator. If reorder=.false., the rank of each
process in the new group is identical to its rank in the group of comm_old.
Virtual topologies
A 4 x 4 periodic Cartesian grid: process ranks and their (row, column) coordinates:

 0 (0,0)   1 (0,1)   2 (0,2)   3 (0,3)
 4 (1,0)   5 (1,1)   6 (1,2)   7 (1,3)
 8 (2,0)   9 (2,1)  10 (2,2)  11 (2,3)
12 (3,0)  13 (3,1)  14 (3,2)  15 (3,3)

...
dims(1)    = 4
dims(2)    = 4
periods(1) = .true.
periods(2) = .true.
reorder    = .false.
call MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,
.                    periods,reorder,GRID_COMM,ierror)
Virtual topologies
MPI_CART_SHIFT(comm, direction, disp, rank_src, rank_dest)

IN   comm       communicator with Cartesian structure
IN   direction  coordinate dimension of shift
IN   disp       displacement (> 0: upwards shift, < 0: downwards shift)
OUT  rank_src   rank of source process
OUT  rank_dest  rank of destination process
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
int *rank_source, int *rank_dest)
MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR
Virtual topologies
MPI_CART_SHIFT(GRID_COMM,1,1,src,dest,ierror)

On the 4 x 4 periodic grid above, a shift by 1 along dimension 1 gives each
process the source shown next to its rank (own coordinates below):

 0 (0,3)   1 (0,0)   2 (0,1)   3 (0,2)
  (0,0)     (0,1)     (0,2)     (0,3)
 4 (1,3)   5 (1,0)   6 (1,1)   7 (1,2)
  (1,0)     (1,1)     (1,2)     (1,3)
 8 (2,3)   9 (2,0)  10 (2,1)  11 (2,2)
  (2,0)     (2,1)     (2,2)     (2,3)
12 (3,3)  13 (3,0)  14 (3,1)  15 (3,2)
  (3,0)     (3,1)     (3,2)     (3,3)
Virtual topologies
....
C find process rank
      CALL MPI_COMM_RANK(comm, rank, ierr)
C find cartesian coordinates
      CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C compute shift source and destination
      CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C skew array
      CALL MPI_SENDRECV_REPLACE(A,1,MPI_REAL,dest,0,source,0,comm,
     +                          status, ierr)
OBS!:
In Fortran, the dimension indicated by DIRECTION = i has DIMS(i+1) nodes,
where DIMS is the array that was used to create the grid.
In C, the dimension indicated by direction = i is the dimension specified by dims[i].
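Assembled into a complete program, the skewing fragment above might look as
follows (the periodic 2 x 2 grid and the single REAL per process are our
assumptions for illustration, not from the slide):

      program skew
      implicit none
      include "mpif.h"
      integer ierr, rank, comm, maxdims
      integer dims(2), coords(2), source, dest
      logical periods(2), reorder
      integer, dimension(MPI_STATUS_SIZE) :: status
      real A

      call MPI_INIT(ierr)
C assumed: a periodic 2x2 process grid, i.e. 4 processes
      dims    = (/ 2, 2 /)
      periods = .true.
      reorder = .false.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods,
     +                     reorder, comm, ierr)
      maxdims = 2
      A = 1.0
C find process rank and Cartesian coordinates
      CALL MPI_COMM_RANK(comm, rank, ierr)
      CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C shift along dimension 0 by the second coordinate
      CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C skew array (here a single REAL per process)
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, dest, 0,
     +                          source, 0, comm, status, ierr)
      call MPI_COMM_FREE(comm, ierr)
      call MPI_FINALIZE(ierr)
      end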
Parallel Divide-and-Conquer techniques
• Partitionings:
  - data partitioning
  - functional partitioning
• Examples of Divide-and-Conquer approaches:
  - computing sums and scalar products, \sum_{i=1}^{n} a_i and \sum_{i=1}^{n} a_i b_i
  - Numerical integration
  - Solution of tridiagonal systems
  - Bucket sort algorithm
More examples of Divide-and-Conquer algorithms:
matrix manipulations     block splitting of matrices, multifrontal methods
the 'knapsack' problem   the problem to find a set of items, each with a weight w
                         and a value v, in order to maximize the total value while
                         not exceeding a fixed weight limit
'Mergehull'              an algorithm for determining the convex hull of n points
                         in a plane
integer factorization    an algorithm for decomposing positive integer numbers
                         into prime factors
set-covering             the problem to find a minimal set of subsets of a set S
                         which covers the properties of all elements of the set S
A trivial example of how to gain parallelism: summing eight numbers.

Sequential evaluation, depth 7:
((((((a1+a2)+a3)+a4)+a5)+a6)+a7)+a8

Balanced (tree) evaluation, depth 3:
((a1+a2)+(a3+a4))+((a5+a6)+(a7+a8))

[Figure: the two expressions drawn as binary trees of "+" nodes over
a1, ..., a8 - a chain of depth n-1 versus a balanced tree of depth log2 n]
Parallel Divide-and-Conquer techniques
Scalar product, summing up n numbers
FAN-IN, FAN-OUT algorithms

[Figure: cascade graph of the binary tree (fan-in) algorithm to compute
\sum_{i=1}^{n} x_i y_i, where n = 2^s (s = 3): at level 0 the products
p_i = x_i * y_i are formed in parallel; each subsequent level adds pairs of
partial sums (n/2, then n/4, ... partial sums), so the result is obtained
after log2(n) + 1 levels]
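In MPI one rarely codes the fan-in tree explicitly: the reduction operations
apply such a logarithmic-depth combining tree internally. A sketch of the
scalar product along these lines (the local length nloc and the data values
are our illustrative choices):

      program dotprod
      implicit none
      include "mpif.h"
      integer, parameter :: nloc = 1000
      integer ierror, i, rank
      double precision x(nloc), y(nloc), sloc, s

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
!     level 0: each process forms its local partial sum (the leaves)
      do i = 1, nloc
         x(i) = 1.0d0
         y(i) = 2.0d0
      end do
      sloc = 0.0d0
      do i = 1, nloc
         sloc = sloc + x(i)*y(i)
      end do
!     fan-in: combine the p partial sums in O(log p) steps
      call MPI_Reduce(sloc, s, 1, MPI_DOUBLE_PRECISION, MPI_SUM,
     .                0, MPI_COMM_WORLD, ierror)
      if (rank.eq.0) write(*,*) 's = ', s
      call MPI_Finalize(ierror)
      end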
Parallel Divide-and-Conquer techniques
Numerical integration
[Figure: the curve f(x) over the interval [a, b] on the x-axis, divided into
p panels of width h, handled by P0, P1, P2, P3]

Proc_0:  broadcast a and b
Proc_i:  h = (b - a)/p
         a_i = a + (i - 1)*h,   b_i = a + i*h
         area_local = (f(a_i) + f(b_i)) * h * 0.5
Proc_0:  gather area_local into area_local_i
         area = \sum_{i=0}^{p-1} area_local_i
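A minimal MPI realization of this trapezoidal scheme could look as follows
(the integrand and the interval are our choices for illustration, and the
gather-and-sum step is expressed with MPI_REDUCE):

      program trapez
      implicit none
      include "mpif.h"
      integer ierror, rank, p, i
      double precision a, b, h, ai, bi, area_local, area
      double precision f
      external f

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierror)
!     Proc 0 broadcasts a and b
      a = 0.0d0
      b = 1.0d0
      call MPI_BCAST(a, 1, MPI_DOUBLE_PRECISION, 0,
     .               MPI_COMM_WORLD, ierror)
      call MPI_BCAST(b, 1, MPI_DOUBLE_PRECISION, 0,
     .               MPI_COMM_WORLD, ierror)
!     each process handles its own panel (i = rank+1)
      h  = (b - a)/p
      i  = rank + 1
      ai = a + (i-1)*h
      bi = a + i*h
      area_local = (f(ai) + f(bi))*h*0.5d0
!     sum the local areas onto process 0
      call MPI_Reduce(area_local, area, 1, MPI_DOUBLE_PRECISION,
     .                MPI_SUM, 0, MPI_COMM_WORLD, ierror)
      if (rank.eq.0) write(*,*) 'area = ', area
      call MPI_Finalize(ierror)
      end

      double precision function f(x)
      double precision x
      f = x*x
      end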
Parallel Sort Algorithms
Parallel Sort Algorithms
A general-purpose parallel sorting algorithm must be able to sort a large sequence with a
relatively small number of processors.
Let p be the number of processors and n the number of elements to be sorted. Each
processor is assigned a block of n/p elements. Let A_0, A_1, ..., A_{p-1} be the blocks
assigned to processors P_0, P_1, ..., P_{p-1}, respectively. We say that A_i < A_j if every
element of A_i is smaller than every element of A_j. When the sorting finishes, each
processor P_i holds a set A_i' such that A_i' <= A_j' for i <= j, and
\bigcup_{i=0}^{p-1} A_i' = \bigcup_{i=0}^{p-1} A_i.
In a typical sorting process, each processor sorts the block it owns locally, then selects a
pivot, splits the block into two according to the pivot, exchanges half of the block with its
neighbors, and merges the received block with the block it retained. A sketch of the
exchange-and-merge (compare-split) step follows.
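The sketch below shows one compare-split step between two partner processes,
each holding a sorted block of m elements (the routine name, the partner rank
and the keep_low flag are our illustrative assumptions; each process keeps
either the lower or the upper half of the merged sequence):

!     exchange the local sorted block with 'partner' and keep the lower
!     (keep_low = .true.) or the upper half of the merged sequence
      subroutine compare_split(block, m, partner, keep_low, comm)
      implicit none
      include "mpif.h"
      integer m, partner, comm, ierror, i, j, k
      logical keep_low
      double precision block(m), recvbuf(m), merged(2*m)
      integer, dimension(MPI_STATUS_SIZE) :: status

!     trade blocks with the partner
      call MPI_Sendrecv(block, m, MPI_DOUBLE_PRECISION, partner, 0,
     .                  recvbuf, m, MPI_DOUBLE_PRECISION, partner, 0,
     .                  comm, status, ierror)
!     merge the two sorted blocks
      i = 1
      j = 1
      do k = 1, 2*m
         if (j .gt. m) then
            merged(k) = block(i)
            i = i + 1
         else if (i .gt. m) then
            merged(k) = recvbuf(j)
            j = j + 1
         else if (block(i) .le. recvbuf(j)) then
            merged(k) = block(i)
            i = i + 1
         else
            merged(k) = recvbuf(j)
            j = j + 1
         endif
      end do
!     keep the appropriate half
      if (keep_low) then
         block(1:m) = merged(1:m)
      else
         block(1:m) = merged(m+1:2*m)
      endif
      end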
From Sequential Sort to Parallel Sort
• Sequential sort
  – Selection sort, Insertion sort, Bubble sort: each has a complexity of O(n^2), where
    n is the number of elements
  – Quick sort, Merge sort, Heap sort: O(n log n); Quick sort is best on average
• Different approaches to designing a parallel sort
  – Use a sequential sort and adapt it. How well can it be done in parallel? Not all
    sequential algorithms can be parallelized easily. Sometimes a poor sequential
    algorithm can develop into a reasonable parallel algorithm (e.g. bubble sort).
  – Develop a different approach. More difficult, but may lead to better solutions.
Sequential sort algorithms
Name       Description                                      Complexity   Modifications
bubble     Sort by comparing each adjacent pair of          O(n^2)       bidirectional bubble
           items in a list in turn, swapping the items                   sort, exchange sort,
           if necessary, and repeating the pass through                  sink sort
           the list until no swaps are done.
insertion  Sort by repeatedly taking the next item and      O(n^2)       binary insertion
           inserting it into the final data structure                    sort
           in its proper order with respect to items
           already inserted. Run time is O(n^2)
           because of moves.
Sequential sort algorithms
Name    Description                                      Complexity       Modifications
bucket  A distribution sort where input elements are     O(n log log n)   bin sort,
        initially distributed to several (but                             range sort
        relatively few) buckets based on certain
        predefined value brackets. Each bucket is
        sorted if necessary, and the buckets'
        contents are concatenated.

Example: 29, 25, 3, 49, 9, 37, 21, 43

bucket  0-15:   3, 9
bucket 16-30:  29, 25, 21
bucket 31-49:  49, 37, 43
Sequential sort algorithms
Name        Description                                      Complexity   Modifications
quicksort   One element x of the list to be sorted           O(n log n)   balanced,
            is chosen and the other elements are split                    external,
            into those elements less than x and those                     hybrid
            greater than or equal to x. These two lists
            are then sorted recursively using the same
            algorithm until there is only one element
            in each list, at which point the sublists are
            recursively recombined in order, yielding
            the sorted list.
merge sort  A sort algorithm which splits the items to       O(n log n)   (non)balanced,
            be sorted into two groups, recursively sorts                  balanced two-way,
            each group, and merges them into a final,                     k-way
            sorted sequence.
Sequential sort algorithms
Name        Description                                        Complexity   Modifications
heap sort   A sort algorithm which builds a heap, then         O(n log n)   weak heap sort,
            repeatedly extracts the maximum item.                           adaptive heap
            Heap data structure: a tree where every                         sort
            node has a key more extreme (greater or
            less) than the key of its parent.
radix sort  A multiple-pass sort algorithm that                O(cn)        bottom-up radix
            distributes each item to a bucket according                     sort, radix
            to part of the item's key, beginning with                       quicksort
            the least significant part of the key. After
            each pass, items are collected from the
            buckets, keeping the items in order, then
            redistributed according to the next most
            significant part of the key (c depends on
            the size of the key and the number of
            buckets).
Sequential sort algorithms
Name                         Description                                        Complexity
diminishing increment sort   Sorts by insertion-sorting subsequences of         depends on the
(Shell sort)                 elements a fixed gap apart, with the gap           gap sequence
                             diminishing toward 1.
counting sort                Counts the occurrences of each key value, then     O(n + k)
                             places each item directly into its final
                             position (integer keys in the range 0..k).
Parallel algorithms on sequences and strings

Matching Parentheses: This is an interesting algorithm, since one might think that
matching parentheses is inherently sequential. For each location the algorithm returns the
index of the matching parenthesis. The algorithm is based on a scan and an integer
sort (rank). The scan returns the depth of each parenthesis and the sort groups them
into groups of equal depth. At this point we can simply switch the indices of neighbors.
Assuming a work-efficient radix sort, this algorithm does O(n) work, and its depth is
bounded by that of the sort.
function parentheses_match(string) =
let
    depth = plus_scan({if c==`( then 1 else -1 : c in string});
    depth = {d + (if c==`( then 1 else 0): c in string; d in depth};
    rnk = permute([0:#string], rank(depth));
    ret = interleave(odd_elts(rnk), even_elts(rnk))
in permute(ret, rnk);

parentheses_match("()(()())((()))");
Bucket sort algorithm

[Figure: n numbers passed through a filter that distributes them into the
buckets 0-25, 26-50, 51-75 and 76-100, handled by P0, P1, P2 and P3]

Quick sort in each bucket; serial complexity to sort k numbers: O(k log k)

T_1 = n + p (n/p) log(n/p) = n(1 + log(n/p)) = O(n)

T_p = n + (n/p) log(n/p)
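A hedged sketch of the parallel variant: each process filters its local share
into p range buckets and trades them with MPI_ALLTOALLV before sorting
locally (the value range 0..99, the local size nloc, and all variable names
are our illustrative assumptions):

      program bucketsort
      implicit none
      include "mpif.h"
      integer, parameter :: nloc = 1000, vmax = 100
      integer ierror, rank, p, i, b, ntot
      integer a(nloc)
      integer, dimension(:), allocatable :: scount, rcount,
     .                                      sdispl, rdispl,
     .                                      sendbuf, recvbuf

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierror)
      allocate(scount(p), rcount(p), sdispl(p), rdispl(p))
      allocate(sendbuf(nloc), recvbuf(p*nloc))
!     some local data in the range 0 .. vmax-1
      do i = 1, nloc
         a(i) = mod(i*(rank+7), vmax)
      end do
!     count how many local values fall into each bucket (one per process)
      scount = 0
      do i = 1, nloc
         b = (a(i)*p)/vmax + 1
         scount(b) = scount(b) + 1
      end do
!     pack the values bucket by bucket (rcount is reused as a cursor)
      sdispl(1) = 0
      do b = 2, p
         sdispl(b) = sdispl(b-1) + scount(b-1)
      end do
      rcount = 0
      do i = 1, nloc
         b = (a(i)*p)/vmax + 1
         rcount(b) = rcount(b) + 1
         sendbuf(sdispl(b) + rcount(b)) = a(i)
      end do
!     everybody learns how much it receives from everybody else
      call MPI_Alltoall(scount, 1, MPI_INTEGER,
     .                  rcount, 1, MPI_INTEGER,
     .                  MPI_COMM_WORLD, ierror)
      rdispl(1) = 0
      do b = 2, p
         rdispl(b) = rdispl(b-1) + rcount(b-1)
      end do
      ntot = rdispl(p) + rcount(p)
!     trade buckets: afterwards each process owns one value range
      call MPI_Alltoallv(sendbuf, scount, sdispl, MPI_INTEGER,
     .                   recvbuf, rcount, rdispl, MPI_INTEGER,
     .                   MPI_COMM_WORLD, ierror)
!     ... quick sort the ntot received values locally ...
      call MPI_Finalize(ierror)
      end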
Parallel sort algorithms
Divide-and-Conquer approaches for sorting algorithms:
• Merge sort
- collects sorted list onto one processor, merging as items come together
- maps well to tree structure, sorting locally on leaves, then merging up the tree
- as items approach root of tree, processors drop out of the process, limiting parallelism
• Quick sort
- maps well to hypercube
- divide list across dimensions of the hypercube, then sort locally
- selection of partition values is even more critical than for sequential version since it
affects load balancing
- hypercube version leaves different numbers of items on different nodes
Quick sort
• Basic idea of parallel quick sort algorithm on a hypercube
1: Select global partition value (pivot), split values across
highest cube dimension, high values going to upper side
and low values to lower side
2: Repeat on each of the lower dimensional cubes forming the
upper and lower halves of the original (divide)
3: Continue this process until the remaining cube is a single
processor, then sort locally
4: Each node contains a sorted list, and the lists from node to
node are in order, using Gray code numbering of nodes.
Quick sort
• How to implement it? Hyper quick sort (a sketch of one split-and-trade
  iteration follows the list):
  I.   Divide data equally among nodes
  II.  Sort locally on each node first
  III. Broadcast median value from node 0 as pivot
  IV.  Each list splits locally, then trades halves across the highest
       dimension
  V.   Apply the above two steps successively (and in parallel) to the
       lower dimensional cubes forming the two halves, and so on
       until the dimension reaches 0 (a single node)
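One split-and-trade iteration might be sketched as follows (the routine name
and the capacity argument nmax are our illustrative assumptions; the partner
across dimension dim is found by flipping bit dim of the rank):

!     one hyper quick sort iteration across hypercube dimension 'dim':
!     split the sorted local array a(1:n) at the pivot and trade halves
!     with the partner found by flipping bit 'dim' of the rank;
!     nmax is the capacity of a (assumed large enough for the result)
      subroutine split_and_trade(a, n, nmax, pivot, dim, comm)
      implicit none
      include "mpif.h"
      integer n, nmax, dim, comm
      integer ierror, rank, partner, nlow, nkeep, nrecv
      double precision a(nmax), pivot, recvbuf(nmax)
      integer, dimension(MPI_STATUS_SIZE) :: status

      call MPI_Comm_rank(comm, rank, ierror)
      partner = ieor(rank, 2**dim)
!     a is sorted, so a(1:nlow) <= pivot < a(nlow+1:n)
      nlow = 0
      do while (nlow .lt. n)
         if (a(nlow+1) .gt. pivot) exit
         nlow = nlow + 1
      end do
      if (iand(rank, 2**dim) .eq. 0) then
!        lower side of the cube: send high values, keep low ones
         call MPI_Sendrecv(a(nlow+1), n-nlow, MPI_DOUBLE_PRECISION,
     .        partner, 0, recvbuf, nmax, MPI_DOUBLE_PRECISION,
     .        partner, 0, comm, status, ierror)
         nkeep = nlow
      else
!        upper side of the cube: send low values, keep high ones
         call MPI_Sendrecv(a(1), nlow, MPI_DOUBLE_PRECISION,
     .        partner, 0, recvbuf, nmax, MPI_DOUBLE_PRECISION,
     .        partner, 0, comm, status, ierror)
         a(1:n-nlow) = a(nlow+1:n)
         nkeep = n - nlow
      endif
      call MPI_Get_count(status, MPI_DOUBLE_PRECISION, nrecv, ierror)
!     append the received values; merging them with the kept part
!     restores sorted order for the next iteration
      a(nkeep+1:nkeep+nrecv) = recvbuf(1:nrecv)
      n = nkeep + nrecv
      end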
Quick sort
• Run time for quicksort on a hypercube: the following components contribute to the run time.
1. Local sort (e.g. sequential quicksort): O((n/p) log(n/p))
2. Broadcasting a pivot during the i-th iteration takes O(d - (i - 1)), where d - (i - 1)
   is the dimension of the sub-hypercube.
   In a d-dimensional hypercube, a one-to-all broadcast can be done in d steps. Thus the
   total time spent in broadcasting pivots (d = log p iterations are needed) is
   \sum_{i=1}^{d} i = d(d+1)/2 = (log p)(log p + 1)/2 = O(log^2 p)
3. Partitioning the n/p local elements: O(n/p), repeated log p times
4. Exchanging the local list with the neighbor: O(n/p), repeated log p times
5. Merging the local list with the received one: O(n/p), repeated log p times
Quick sort
So the total run time is

T_p = O((n/p) log(n/p)) + O((n/p) log p) + O(log^2 p).

If the number of processors is p = n/log n, then n/p = log n, and the best run time becomes

O(log n * log(log n)) + O(log^2 n) = O(log^2 n).