FORTRAN and MPI
Message Passing Interface (MPI)
Day 4
Maya Neytcheva, IT, Uppsala University, [email protected]

Course plan:
• MPI - general concepts
• Communications in MPI
  – Point-to-point communications
  – Collective communications
• Parallel debugging
• Advanced MPI: user-defined data types, functions
  – Linear algebra operations
• Advanced MPI: communicators, virtual topologies
  – Parallel sort algorithms
• Parallel performance. Summary. Tendencies

Recall: Distributed memory machines

[Figure: processors P, each with its own local memory M, connected by an interprocessor communication network]

Recall: Interconnection network topologies

(a) linear array, (b) ring, (c) star, (d) 2D mesh, (e) 2D toroidal mesh, (f) systolic array, (g) completely connected, (h) chordal ring, (i) binary tree, (j) 3D cube, (k) 3D cube

Communicators

Creating communicators enables us to split the processes into groups. Benefits:
• The different groups can perform independent tasks.
• Collective operations can be done on a subset of processes.
• Each subset is logically the same as the initial (complete) group of PEs, i.e., there exists a P0 in each subgroup. In this way, for example, recursive algorithms can be implemented.
• Safety: isolating messages, avoiding conflicts between modules, etc.

Groups of processes

First we need to introduce the notion of a group.

A group is an ordered set of process identifiers (henceforth, processes). Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero. Groups cannot be directly transferred from one process to another.
A group is used within a communicator to describe the participants in a communication "universe" and to rank those participants (thus giving them unique names within that "universe" of communication).

There is a special predefined group, MPI_GROUP_EMPTY, which is a group with no members. The predefined constant MPI_GROUP_NULL is the value used for invalid group handles.

Groups

MPI_COMM_GROUP(comm, group)
  input:  comm  - communicator
  output: group - group corresponding to the communicator

MPI_COMM_GROUP(COMM, GROUP, IERROR)
INTEGER COMM, GROUP, IERROR

Groups

Various functions are available to manipulate groups in MPI:

MPI_GROUP_UNION(group1, group2, newgroup)
MPI_GROUP_INTERSECTION(group1, group2, newgroup)
MPI_GROUP_DIFFERENCE(group1, group2, newgroup)
MPI_GROUP_INCL(group, n, ranks, newgroup)
MPI_GROUP_EXCL(group, n, ranks, newgroup)
MPI_GROUP_FREE(group)

Communicators

MPI_COMM_CREATE(comm, group, newcomm)
  IN  comm     communicator (handle)
  IN  group    group, which is a subset of the group of comm (handle)
  OUT newcomm  new communicator (handle)

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
INTEGER COMM, GROUP, NEWCOMM, IERROR

The function creates a new communicator.
Communicators

Various functions are available to manipulate communicators in MPI:

MPI_COMM_SIZE(comm, size)
MPI_COMM_COMPARE(comm1, comm2, result)
MPI_COMM_SPLIT(comm, color, key, newcomm)
MPI_COMM_FREE(comm)

Communicators

MPI_COMM_SPLIT(comm, color, key, newcomm)
  IN  comm     communicator (handle)
  IN  color    control of subset assignment (integer)
  IN  key      control of rank assignment (integer)
  OUT newcomm  new communicator (handle)

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

Communicators

What are color and key?

The call partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all processes with the same color. Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken by their rank in comm. A new communicator is created for each subgroup and returned in newcomm. A process may supply the color value MPI_UNDEFINED, in which case newcomm returns MPI_COMM_NULL. This is a collective call, but each process is permitted to provide different values for color and key.
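To illustrate the color/key semantics, here is a minimal serial sketch in Python (not MPI code; the helper name comm_split is hypothetical). It partitions global ranks by color and orders each subgroup by key, mimicking what MPI_COMM_SPLIT computes:

```python
# Serial simulation of the color/key semantics of MPI_COMM_SPLIT.
# None stands in for MPI_UNDEFINED (such a process joins no new communicator).

def comm_split(colors, keys):
    """colors[r], keys[r]: values supplied by global rank r.
    Returns {color: [global ranks]}; a rank's position in its
    list is its rank in the new communicator."""
    groups = {}
    for rank, color in enumerate(colors):
        if color is None:
            continue
        groups.setdefault(color, []).append(rank)
    for members in groups.values():
        # new rank follows key, ties broken by the old rank
        members.sort(key=lambda r: (keys[r], r))
    return groups

# Four processes, color = rank // 2, key = 0, as in the Fortran example below:
print(comm_split([0, 0, 1, 1], [0, 0, 0, 0]))
# {0: [0, 1], 1: [2, 3]}
```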
Communicators

      program Comm
      implicit none
      include "mpif.h"
      integer ierror, rank, size, rankh, sizeh, key
      integer ALL_GROUP, color, HALF_COMM
      integer N, M
      integer, dimension(MPI_STATUS_SIZE) :: status

      call MPI_Init(ierror)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
      call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
      N = rank
      M = rank+10
      call MPI_COMM_GROUP(MPI_COMM_WORLD, ALL_GROUP, ierror)

Communicators

      color = rank/2
      key   = 0
      call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, key, HALF_COMM, ierror)
      call MPI_Comm_rank(HALF_COMM, rankh, ierror)
      call MPI_Comm_size(HALF_COMM, sizeh, ierror)
      write(*,*) 'Global ', rank, ' is now local rank ', rankh, ' ', color
      write(*,*) 'Global ', rank, ' local ', rankh, ' has ', N, ' and ', M

Communicators

C     exchange N within each pair; MPI_SENDRECV_REPLACE is used because
C     reusing N as both send and receive buffer in MPI_SENDRECV, as in the
C     original call, is not allowed by MPI
      if (rankh.eq.0) then
         call MPI_SENDRECV_REPLACE(N, 1, MPI_INTEGER, 1, 1,
     .        1, 2, HALF_COMM, status, ierror)
      else
         call MPI_SENDRECV_REPLACE(N, 1, MPI_INTEGER, 0, 2,
     .        0, 1, HALF_COMM, status, ierror)
      endif
      write(*,*) 'Global ', rank, ' local ', rankh, ' new ', N, ' and ', M

      call MPI_COMM_FREE(HALF_COMM, ierror)
      call MPI_GROUP_FREE(ALL_GROUP, ierror)
      call MPI_Finalize(ierror)
      stop
      end

Communicators

Possible output of the above code (sorted afterwards):

Global 0 local 0 has 0 and 10
Global 1 local 1 has 1 and 11
Global 2 local 0 has 2 and 12
Global 3 local 1 has 3 and 13
-------------------------------------------
Global 0 local 0 new 1 and 10
Global 1 local 1 new 0 and 11
Global 2 local 0 new 3 and 12
Global 3 local 1 new 2 and 13

Virtual interconnection topology

Virtual topologies

Algorithm −→ communication pattern −→ graph

The processes represent the nodes of that graph; the edges connect processes that communicate with each other.

MPI provides message passing between any pair of processes in a group. However, it often turns out to be convenient to describe the virtual communication topology utilized by an algorithm. The MPI functions provided for that are MPI_GRAPH_CREATE and MPI_CART_CREATE, which are used to create general (graph) virtual topologies and Cartesian topologies, respectively. These topology creation functions are collective.

Virtual topologies

The tool provided to describe Cartesian grids of processes is MPI_CART_CREATE. Cartesian structures of arbitrary dimension are allowed. In addition, for each coordinate direction one specifies whether the process structure is periodic or not.
Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary.

Process topologies

Cartesian topology functions

MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)
  IN  comm_old   input communicator
  IN  ndims      number of dimensions in the Cartesian grid
  IN  dims       integer array of size ndims specifying the number of processes in each dimension
  IN  periods    logical array of size ndims specifying whether the grid is periodic (.true.) or not (.false.) in each dimension
  IN  reorder    ranks may be reordered (.true.) or not (.false.)
  OUT comm_cart  communicator with the new Cartesian topology

Returns a handle to a new communicator. If reorder = .false., the rank of each process in the new communicator is identical to its rank in comm_old.

Virtual topologies

A 4 x 4 periodic grid; each process is labeled with its rank and its Cartesian coordinates:

 0 (0,0)   1 (0,1)   2 (0,2)   3 (0,3)
 4 (1,0)   5 (1,1)   6 (1,2)   7 (1,3)
 8 (2,0)   9 (2,1)  10 (2,2)  11 (2,3)
12 (3,0)  13 (3,1)  14 (3,2)  15 (3,3)

      ...
      dims(1) = 4
      dims(2) = 4
      periods(1) = .true.
      periods(2) = .true.
      reorder    = .false.
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims,
     .                     periods, reorder, GRID_COMM, ierror)

Virtual topologies

MPI_CART_SHIFT(comm, direction, disp, rank_src, rank_dest)
  IN  comm       communicator with Cartesian structure
  IN  direction  coordinate dimension of the shift
  IN  disp       displacement (> 0: upwards shift, < 0: downwards shift)
  OUT rank_src   rank of the source process
  OUT rank_dest  rank of the destination process

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
                   int *rank_source, int *rank_dest)

MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR

Virtual topologies

[Figure: the 4 x 4 periodic grid after a cyclic shift by 1 along dimension 1; each process is labeled with the coordinates of its source, e.g., process 0 at (0,0) has source (0,3).]

MPI_CART_SHIFT(GRID_COMM, 1, 1, src, dest, ierror)

Virtual topologies

      ....
C     find process rank
      CALL MPI_COMM_RANK(comm, rank, ierr)
C     find Cartesian coordinates
      CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C     compute shift source and destination
      CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C     skew array
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, dest, 0, source, 0, comm,
     +                          status, ierr)

OBS! In Fortran, the dimension indicated by DIRECTION = i has DIMS(i+1) nodes, where DIMS is the array that was used to create the grid. In C, the dimension indicated by direction = i is the dimension specified by dims[i].
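The rank/coordinate arithmetic behind MPI_CART_COORDS and MPI_CART_SHIFT can be sketched serially. The following Python helpers (hypothetical names, not MPI calls) assume reorder = .false., all dimensions periodic, and MPI's row-major rank ordering (last dimension varies fastest):

```python
# Serial sketch of the bookkeeping behind MPI_CART_COORDS / MPI_CART_SHIFT
# for a fully periodic Cartesian grid with reorder = false.

def cart_coords(rank, dims):
    """Rank -> Cartesian coordinates, row-major (last dim varies fastest)."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return coords[::-1]

def cart_rank(coords, dims):
    """Cartesian coordinates -> rank (inverse of cart_coords)."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

def cart_shift(rank, direction, disp, dims):
    """Source and destination ranks for a periodic shift along 'direction'."""
    src = cart_coords(rank, dims)
    dst = src[:]
    src[direction] = (src[direction] - disp) % dims[direction]
    dst[direction] = (dst[direction] + disp) % dims[direction]
    return cart_rank(src, dims), cart_rank(dst, dims)

# 4 x 4 periodic grid: shifting rank 0 by 1 along dimension 1 gives
# source rank 3 (coords (0,3)) and destination rank 1 (coords (0,1)),
# matching the shifted-grid figure above.
print(cart_shift(0, 1, 1, [4, 4]))
# (3, 1)
```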
Parallel Divide-and-Conquer techniques

• Partitionings:
  – data partitioning
  – functional partitioning
• Examples of Divide-and-Conquer approaches: \sum_{i=1}^{n} a_i, \sum_{i=1}^{n} a_i b_i
  – Numerical integration
  – Solution of tridiagonal systems
  – Bucket sort algorithm

More examples of Divide-and-Conquer algorithms:

matrix manipulations    block splitting of matrices, multifrontal methods
the 'knapsack' problem  find a set of items, each with a weight w and a value v, that maximizes the total value while not exceeding a fixed weight limit
'Mergehull'             an algorithm for determining the convex hull of n points in a plane
integer factorization   an algorithm for decomposing positive integer numbers into prime factors
set covering            find a minimal set of subsets of a set S which covers the properties of all elements of the set S

A trivial example of how to gain parallelism:

sequential:  ((((((a1+a2)+a3)+a4)+a5)+a6)+a7)+a8
tree-based:  ((a1+a2)+(a3+a4))+((a5+a6)+(a7+a8))

Parallel Divide-and-Conquer techniques

Scalar product, summing up n numbers: FAN-IN, FAN-OUT algorithms

[Figure: cascade (binary tree) graph; step 0 forms the local products p_i = x_i * y_i, and each subsequent step adds pairs of partial sums, halving the number of active processes.]
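The fan-in summation sketched in the figure can be simulated serially; this is a Python sketch (not MPI code) assuming n = 2^s, as on the slide:

```python
# Serial sketch of the fan-in (binary tree) reduction: compute sum x_i * y_i
# for n = 2^s values in log2(n) combining steps, pairing neighbors each step.

def fan_in_dot(x, y):
    partial = [xi * yi for xi, yi in zip(x, y)]   # step 0: local products
    while len(partial) > 1:
        # each pair combines; half the "processes" drop out per level
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 1, 1, 1, 1]
print(fan_in_dot(x, y))   # 36, reached in log2(8) = 3 combining steps
```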
Binary tree algorithm, represented by a cascade graph, to compute \sum_{i=1}^{n} x_i y_i, where n = 2^s (here s = 3).

Parallel Divide-and-Conquer techniques

Numerical integration

[Figure: f(x) over [a, b]; the interval is divided into p strips of width h, assigned to P0, P1, P2, P3.]

Proc0:  broadcast a and b
Proci:  h = (b - a)/p
        a_i = a + (i - 1)*h,  b_i = a + i*h
        area_local = (f(a_i) + f(b_i)) * h * 0.5
Proc0:  gather area_local into area_local_i
        area = \sum_{i=0}^{p-1} area_local_i

Parallel Sort Algorithms

Parallel Sort Algorithms

A general-purpose parallel sorting algorithm must be able to sort a large sequence with a relatively small number of processors. Let p be the number of processors and n the number of elements to be sorted. Each processor is assigned a block of n/p elements. Let A_0, A_1, ..., A_{p-1} be the blocks assigned to processors P_0, P_1, ..., P_{p-1}, respectively. We say that A_i < A_j if every element of A_i is smaller than every element of A_j. When the sorting finishes, each processor P_i holds a set A'_i such that A'_i < A'_j for i <= j, and \bigcup_{i=0}^{p-1} A'_i = \bigcup_{i=0}^{p-1} A_i.

In a typical sorting process, each processor sorts the block it owns locally, then selects a pivot, splits the block into two according to the pivot, exchanges half of the block with its neighbor, and merges the received block with the block it retained.

From Sequential Sort to Parallel Sort

• Sequential sort
  – Selection sort, insertion sort, bubble sort: each has a complexity of O(n^2), where n is the number of elements.
  – Quick sort, merge sort, heap sort: O(n log n); quick sort is best on average.
• Different approaches to designing a parallel sort
  – Use a sequential sort and adapt it. How well can it be done in parallel? Not all sequential algorithms can be parallelized easily.
  – Sometimes a poor sequential algorithm can develop into a reasonable parallel algorithm (e.g., bubble sort).
  – Develop a different approach: more difficult, but may lead to better solutions.

Sequential sort algorithms

Name: bubble sort
Description: sort by comparing each adjacent pair of items in a list in turn, swapping the items if necessary, and repeating the pass through the list until no swaps are done.
Complexity: O(n^2)
Modifications: bidirectional bubble sort, exchange sort, sink sort

Name: insertion sort
Description: sort by repeatedly taking the next item and inserting it into the final data structure in its proper order with respect to the items already inserted. Run time is O(n^2) because of the moves.
Complexity: O(n^2)
Modifications: binary insertion sort

Sequential sort algorithms

Name: bucket sort
Description: a distribution sort where the input elements are initially distributed to several (but relatively few) buckets based on certain predefined value brackets. Each bucket is sorted if necessary, and the buckets' contents are concatenated.
Complexity: O(n log log n)
Modifications: bin sort, range sort

Example: 29, 25, 3, 49, 9, 37, 21, 43
  bucket 0-15:  9, 3
  bucket 16-30: 29, 25, 21
  bucket 31-49: 49, 37, 43

Sequential sort algorithms

Name: quicksort
Description: one element x of the list to be sorted is chosen, and the other elements are split into those less than x and those greater than or equal to x. These two lists are then sorted recursively using the same algorithm until there is only one element in each list, at which point the sublists are recursively recombined in order, yielding the sorted list.
Complexity: O(n log n)
Modifications: balanced, external, hybrid

Name: merge sort
Description: a sort algorithm which splits the items to be sorted into two groups, recursively sorts each group, and merges them into a final, sorted sequence.
Complexity: O(n log n)
Modifications: (non)balanced, balanced two-way, k-way

Sequential sort algorithms

Name: heap sort
Description: a sort algorithm which builds a heap, then repeatedly extracts the maximum item. Heap data structure: a tree where every node has a key more extreme (greater or less) than the key of its parent.
Complexity: O(n log n)
Modifications: weak heap sort, adaptive heap sort

Name: radix sort
Description: a multiple-pass sort algorithm that distributes each item to a bucket according to part of the item's key, beginning with the least significant part of the key. After each pass, items are collected from the buckets, keeping the items in order, then redistributed according to the next most significant part of the key. (The constant c depends on the size of the key and the number of buckets.)
Complexity: O(cn)
Modifications: bottom-up radix sort, radix quicksort

Sequential sort algorithms

Name: shell sort (diminishing increment sort)
Description: repeatedly insertion-sorts subsequences of items a fixed gap apart, with the gap diminishing until a final pass with gap 1.

Name: counting sort
Description: counts the number of occurrences of each key value, then uses the counts to place each item directly in its final position.

Sequential sort algorithms

Parallel algorithms on sequences and strings: Matching parentheses

This is an interesting algorithm, since one might think that matching parentheses seems very sequential. For each location, the algorithm returns the index of the matching parenthesis. The algorithm is based on a scan and an integer sort (rank). The scan returns the depth of each parenthesis, and the sort groups them into groups of equal depth. At this point we can simply switch the indices of neighbors. Assuming a work-efficient radix sort, this algorithm does O(n) work, and its depth is bounded by that of the sort.
function parentheses_match(string) =
  let
    depth = plus_scan({if c==`( then 1 else -1 : c in string});
    depth = {d + (if c==`( then 1 else 0) : c in string; d in depth};
    rnk   = permute([0:#string], rank(depth));
    ret   = interleave(odd_elts(rnk), even_elts(rnk))
  in permute(ret, rnk);

parentheses_match("()(()())((()))");

Sequential sort algorithms

Bucket sort algorithm

[Figure: n numbers passed through a filter into buckets 0-25, 26-50, 51-75, 76-100, handled by P0, P1, P2, P3.]

Quick sort in each bucket; serial complexity to sort k numbers: O(k log k)

T_1 = n + p * (n/p) * log(n/p) = n(1 + log(n/p)) = O(n)
T_p = n + (n/p) * log(n/p)

Parallel sort algorithms

Divide-and-Conquer approaches for sorting algorithms:

• Merge sort
  – collects the sorted lists onto one processor, merging as items come together
  – maps well to a tree structure, sorting locally at the leaves, then merging up the tree
  – as items approach the root of the tree, processors drop out of the process, limiting parallelism
• Quick sort
  – maps well to a hypercube
  – divide the list across a dimension of the hypercube, then sort locally
  – the selection of partition values is even more critical than for the sequential version, since it affects load balancing
  – the hypercube version leaves different numbers of items on different nodes

Quick sort

• Basic idea of the parallel quick sort algorithm on a hypercube:
  1. Select a global partition value (pivot); split values across the highest cube dimension, high values going to the upper side and low values to the lower side.
  2. Repeat on each of the lower-dimensional cubes forming the upper and lower halves of the original (divide).
  3. Continue this process until the remaining cube is a single processor, then sort locally.
  4. Each node then contains a sorted list, and the lists from node to node are in order, using Gray code numbering of the nodes.
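The split step of this scheme (step 1 above) can be sketched serially. The following Python sketch (not MPI code; split_step is a hypothetical helper) assumes p = 2^d nodes, each holding its local list, with the partner across the highest dimension found by flipping the top rank bit:

```python
# Serial sketch of one split step of hypercube quicksort: given a pivot,
# each node keeps one half of its local list and sends the other half to
# its partner across the highest cube dimension (rank XOR 2^(d-1)).

def split_step(blocks, pivot, d):
    half = 1 << (d - 1)
    new_blocks = [[] for _ in blocks]
    for rank, block in enumerate(blocks):
        low = [v for v in block if v <= pivot]
        high = [v for v in block if v > pivot]
        if rank < half:                 # lower subcube keeps the low values
            new_blocks[rank] += low
            new_blocks[rank ^ half] += high
        else:                           # upper subcube keeps the high values
            new_blocks[rank] += high
            new_blocks[rank ^ half] += low
    return new_blocks

# d = 2, p = 4 nodes; after the split, nodes 0-1 hold values <= 5 and
# nodes 2-3 hold values > 5.
blocks = [[5, 1, 7], [2, 8, 3], [9, 0, 6], [4, 10, 11]]
print(split_step(blocks, 5, 2))
```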
Quick sort

• How to implement it? Hyperquicksort:
  I.   Divide the data equally among the nodes.
  II.  Sort locally on each node first.
  III. Broadcast the median value from node 0 as the pivot.
  IV.  Each node splits its list locally, then trades halves across the highest dimension.
  V.   Apply the above two steps successively (and in parallel) to the lower-dimensional cubes forming the two halves, and so on, until the dimension reaches 0 (a single node).

Quick sort

• Run time for quicksort on a hypercube; the following components contribute to the run time:
  1. Local sort (e.g., sequential quicksort): O((n/p) log(n/p)).
  2. Broadcasting a pivot during the i-th iteration takes O(d - (i - 1)), where d - (i - 1) is the dimension of the sub-hypercube. In a d-dimensional hypercube, a one-to-all broadcast can be done in d steps. Thus the total time spent broadcasting pivots (d = log p iterations are needed) is
       \sum_{i=1}^{d} i = d(d + 1)/2 = (log p)(log p + 1)/2 = O(log^2 p).
  3. Partitioning the n/p elements: O(n/p), but we need to do it log p times.
  4. Exchanging the local list with a neighbor: O(n/p), but we need to do it log p times.
  5. Merging the local list with the one received: O(n/p), but we need to do it log p times.

Quick sort

So the total run time is

    T_p = O((n/p) log(n/p)) + O((n/p) log p) + O(log^2 p).

If the number of processors p is equal to n / log n, then the best run time is

    O(log n * log(log n)) + O(log^2 n) = O(log^2 n).
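As a check on steps I-V, the whole scheme can be simulated serially. This Python sketch (not MPI code; hyperquicksort is a hypothetical helper) assumes p = 2^d nodes and takes the pivot as the median of the subcube's node 0, as in step III:

```python
# Serial simulation of hyperquicksort on p = 2^d "nodes": sort locally,
# then for each dimension from highest to lowest pick a pivot in each
# subcube, split, trade halves with the partner, and merge the sorted runs.

from heapq import merge

def hyperquicksort(blocks):
    p = len(blocks)
    d = p.bit_length() - 1                      # assumes p = 2^d
    blocks = [sorted(b) for b in blocks]        # step II: local sort
    dim = d
    while dim > 0:
        half = 1 << (dim - 1)
        for base in range(0, p, 1 << dim):      # each subcube of size 2^dim
            root = blocks[base]                 # step III: pivot from node 0
            pivot = root[len(root) // 2] if root else 0
            for r in range(base, base + half):
                partner = r + half
                low_r  = [v for v in blocks[r] if v <= pivot]
                high_r = [v for v in blocks[r] if v > pivot]
                low_p  = [v for v in blocks[partner] if v <= pivot]
                high_p = [v for v in blocks[partner] if v > pivot]
                # step IV: trade halves across the dimension, merge sorted runs
                blocks[r] = list(merge(low_r, low_p))
                blocks[partner] = list(merge(high_r, high_p))
        dim -= 1                                # step V: recurse on subcubes
    return blocks

data = [[15, 46, 48, 93], [39, 6, 72, 91], [14, 36, 69, 40], [89, 61, 97, 12]]
print(hyperquicksort(data))
# concatenating the node lists now yields a globally sorted sequence
```

Note how the node lists end up with different lengths, illustrating the load-balancing issue mentioned earlier for the hypercube version.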