Matrix-Vector Multiplication
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 13, 2012

Outline
- Matrix-vector multiplication
  - rowwise decomposition
  - columnwise decomposition
  - checkerboard decomposition
- Gather, scatter, alltoall
- Grid-oriented communications

Matrix-Vector Multiplication
Example, A × b = c:

    A = | 2 1 0 4 |      b = | 1 |      c = |  9 |
        | 3 2 1 1 |          | 3 |          | 14 |
        | 4 3 1 2 |          | 4 |          | 19 |
        | 3 0 2 0 |          | 1 |          | 11 |

Each element c_i is the inner product of row i of A with b.

Matrix Decomposition
- rowwise decomposition
- columnwise decomposition
- checkerboard decomposition
Storing vectors:
- divide the vector elements among processes, or
- replicate the vector elements
Vector replication is acceptable because vectors have only n elements, versus n² elements in matrices.
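As a point of reference for the parallel versions that follow, here is a minimal sequential C sketch (not part of the original slides) of the example above; the 4 × 4 matrix and the vectors are the ones on the slide, everything else is illustrative.

#include <stdio.h>

#define N 4

/* Sequential matrix-vector product c = A*b: Theta(n^2) multiply-adds. */
void matvec(int n, double A[][N], const double *b, double *c) {
    for (int i = 0; i < n; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] += A[i][j] * b[j];   /* one inner-product iteration */
    }
}

int main(void) {
    double A[N][N] = { {2, 1, 0, 4},
                       {3, 2, 1, 1},
                       {4, 3, 1, 2},
                       {3, 0, 2, 0} };
    double b[N] = {1, 3, 4, 1};
    double c[N];
    matvec(N, A, b, c);
    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);   /* prints 9, 14, 19, 11 */
    return 0;
}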
Rowwise Decomposition
- Primitive task i: row i of the matrix and the entire vector b.
- Each task computes one element of the result, c_i = (row i of A) · b.
- An all-gather then assembles the full result vector c on every task.

MPI_Allgatherv

int MPI_Allgatherv (
    void         *send_buffer,
    int           send_cnt,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator
)

MPI_Allgatherv (example)
Three processes hold the pieces "con" (send_cnt = 3), "cate" (send_cnt = 4) and "nate" (send_cnt = 4).
With receive_cnt = {3, 4, 4} and receive_disp = {0, 3, 7}, every process ends up with the full string
"concatenate" in its receive_buffer.
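A sketch (not from the original slides) of the rowwise algorithm using MPI_Allgatherv. It assumes A is already distributed by blocks of rows (process k owns rows ⌊kn/p⌋ to ⌊(k+1)n/p⌋ − 1, stored row-major in A_local, with local_rows equal to its own count) and that b is replicated on every process; the function name and parameters are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Rowwise matrix-vector product: each process computes the inner products
 * for its own rows, then MPI_Allgatherv gives every process the full c. */
void matvec_rowwise(int n, int local_rows, const double *A_local,
                    const double *b, double *c, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    double *c_local = malloc(local_rows * sizeof(double));
    for (int i = 0; i < local_rows; i++) {      /* local inner products */
        c_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            c_local[i] += A_local[i * n + j] * b[j];
    }

    /* Counts and displacements of every process's block of rows. */
    int *cnt  = malloc(p * sizeof(int));
    int *disp = malloc(p * sizeof(int));
    for (int k = 0; k < p; k++) {
        disp[k] = k * n / p;
        cnt[k]  = (k + 1) * n / p - disp[k];
    }

    MPI_Allgatherv(c_local, local_rows, MPI_DOUBLE,
                   c, cnt, disp, MPI_DOUBLE, comm);

    free(c_local); free(cnt); free(disp);
}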
Complexity Analysis (rowwise)
(for simplicity, assume a square n × n matrix)
- Sequential algorithm complexity: Θ(n²)
- Parallel algorithm computational complexity: Θ(n²/p)
- Communication complexity of the all-gather: Θ(log p + n)
- Overall complexity: Θ(n²/p + log p + n)

Algorithm Scalability (rowwise)
Isoefficiency analysis: T(n,1) ≥ C·T0(n,p)
- Sequential time complexity: T(n,1) = Θ(n²)
- Parallel overhead is dominated by the all-gather:
  T0(n,p) = Θ(p(log p + n)) → Θ(pn) for large n
- n² ≥ Cpn ⇒ n ≥ Cp
- Scalability function: M(f(p))/p with M(n) = n²
  M(Cp)/p = C²p²/p = C²p
⇒ The system is not highly scalable.

Analysis of the Parallel Algorithm (rowwise)
Let α be the time to compute one iteration.
- Sequential execution time: αn²
- Computation time of the parallel program: αn⌈n/p⌉
- The all-gather requires ⌈log p⌉ messages with latency λ; a total of n vector elements
  (8 bytes each, for doubles) is transmitted over a channel of bandwidth β.
- Total execution time: αn⌈n/p⌉ + λ⌈log p⌉ + 8n/β

Benchmarking (rowwise)

  p   Predicted   Actual   Speedup   Mflops
  1      63.4      63.4      1.00      31.6
  2      32.4      32.7      1.94      61.2
  3      22.3      22.7      2.79      88.1
  4      17.0      17.8      3.56     112.4
  5      14.1      15.2      4.16     131.6
  6      12.0      13.3      4.76     150.4
  7      10.5      12.2      5.19     163.9
  8       9.4      11.1      5.70     180.2
 16       5.7       7.2      8.79     277.8

(times in milliseconds)

Columnwise Decomposition
- Primitive task i: column i of the matrix and vector element b_i.
- Each task computes a vector of partial sums (column i of A scaled by b_i).
- An all-to-all redistributes the partial sums so that each task can add up the contributions
  for its own element of c, yielding c_i.

All-to-All Operation
In an all-to-all, each process starts with p blocks of data; block j of process i is delivered to
process j, so every process ends up with one block from each of the p processes.

MPI_Alltoallv

int MPI_Alltoallv (
    void         *send_buffer,
    int          *send_cnt,
    int          *send_disp,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator
)
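A sketch (not from the original slides) of the columnwise algorithm, assuming for simplicity that p divides n, so the blocks are equal and a plain MPI_Alltoall suffices; MPI_Alltoallv, shown above, handles the general uneven case. Each process owns n/p consecutive columns of A (stored column-major in A_local) and the matching elements of b; names and layout are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Columnwise matrix-vector product: form a full-length vector of partial
 * sums, exchange blocks with an all-to-all, then sum the p contributions
 * for the local block of c. */
void matvec_columnwise(int n, const double *A_local, const double *b_local,
                       double *c_local, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int local_n = n / p;                   /* columns (and c entries) per process */

    double *partial = malloc(n * sizeof(double));
    double *recv    = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) partial[i] = 0.0;
    for (int j = 0; j < local_n; j++)      /* partial sums over my columns */
        for (int i = 0; i < n; i++)
            partial[i] += A_local[j * n + i] * b_local[j];

    /* Block k of my partial vector goes to process k; I receive the block
     * of partial sums for my rows from every process. */
    MPI_Alltoall(partial, local_n, MPI_DOUBLE,
                 recv,    local_n, MPI_DOUBLE, comm);

    for (int i = 0; i < local_n; i++) {    /* add the p contributions */
        c_local[i] = 0.0;
        for (int k = 0; k < p; k++)
            c_local[i] += recv[k * local_n + i];
    }

    free(partial); free(recv);
}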
Complexity Analysis (columnwise)
(for simplicity, assume a square n × n matrix)
- Sequential algorithm complexity: Θ(n²)
- Parallel algorithm computational complexity: Θ(n²/p)
- Communication complexity of the all-to-all: Θ(p + n)
  (p − 1 messages, carrying a total of n elements)
- Overall complexity: Θ(n²/p + n + p)

Algorithm Scalability (columnwise)
Isoefficiency analysis: T(n,1) ≥ C·T0(n,p)
- Sequential time complexity: T(n,1) = Θ(n²)
- Parallel overhead is the all-to-all plus vector copying:
  T0(n,p) = Θ(p(p + n)) → Θ(pn) for large n
- n² ≥ Cpn ⇒ n ≥ Cp
- Scalability function: M(f(p))/p with M(n) = n²
  M(Cp)/p = C²p²/p = C²p
⇒ The system is not highly scalable.

Analysis of the Parallel Algorithm (columnwise)
Let α be the time to compute one iteration.
- Sequential execution time: αn²
- Computation time of the parallel program: αn⌈n/p⌉
- The all-to-all requires p − 1 messages, each of length at most n/p elements
  (8 bytes per double element).
- Total execution time: αn⌈n/p⌉ + (p − 1)(λ + 8n/(pβ))
Benchmarking (columnwise)

  p   Predicted   Actual   Speedup   Mflops
  1      63.4      63.8      1.00      31.4
  2      32.4      32.9      1.92      60.8
  3      22.2      22.6      2.80      88.5
  4      17.2      17.5      3.62     114.3
  5      14.3      14.5      4.37     137.9
  6      12.5      12.6      5.02     158.7
  7      11.3      11.2      5.65     178.6
  8      10.4      10.0      6.33     200.0
 16       8.5       7.6      8.33     263.2

(times in milliseconds)

Checkerboard Decomposition
- Primitive task associated with a rectangular block of the matrix (processes form a 2-D grid).
- Vector b is distributed by blocks among the processes in the first row of the grid.
- Each block of b is then copied to the processes in the same column of the grid.

Algorithm Steps (processes P0–P5 arranged in a 2-D grid)
1. Distribute b: scatter its blocks across the first row of the grid, then copy each block down its column.
2. Multiply: each process multiplies its block of A by its block of b.
3. Reduce across rows: the partial results in each grid row are summed to form the corresponding block of c.
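A sketch (not in the original slides) of the multiply and reduce steps, assuming √p divides n, that the block of b has already been copied down the grid column, and that a row_comm communicator like the one built with MPI_Comm_split on the following slides is available; all names are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Checkerboard matrix-vector product: the caller owns a blk x blk block of A
 * (blk = n / sqrt(p), row-major in A_block) and the matching block of b.
 * The reduce sums the partial results across the grid row; the block of c
 * lands on the process with rank 0 in row_comm (the first grid column). */
void matvec_checkerboard(int blk, const double *A_block, const double *b_block,
                         double *c_block, MPI_Comm row_comm) {
    double *partial = malloc(blk * sizeof(double));

    for (int i = 0; i < blk; i++) {        /* multiply: A_block * b_block */
        partial[i] = 0.0;
        for (int j = 0; j < blk; j++)
            partial[i] += A_block[i * blk + j] * b_block[j];
    }

    /* Reduce across the row: element-wise sum of the partial vectors. */
    MPI_Reduce(partial, c_block, blk, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    free(partial);
}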
Complexity Analysis (checkerboard)
(for simplicity, assume a square n × n matrix)
Also assume p is a square number, so the process grid is √p × √p and each block is (n/√p) × (n/√p).
- Sequential algorithm complexity: Θ(n²)
- Parallel algorithm computational complexity: Θ(n²/p)
  (each process multiplies an (n/√p) × (n/√p) submatrix)
- Communication complexity of the reduce: Θ(log √p · (n/√p)) = Θ(n log p / √p)
  (log √p messages, each with n/√p elements)
- Overall complexity: Θ(n²/p + n log p / √p)

Algorithm Scalability (checkerboard)
Isoefficiency analysis: T(n,1) ≥ C·T0(n,p)
- Sequential time complexity: T(n,1) = Θ(n²)
- Parallel overhead is the reduce plus vector copying:
  T0(n,p) = Θ(p · n log p / √p) = Θ(n √p log p)
- n² ≥ Cn √p log p ⇒ n ≥ C √p log p
- Scalability function: M(f(p))/p with M(n) = n²
  M(C √p log p)/p = C² p log² p / p = C² log² p
⇒ This system is much more scalable than the previous two implementations!

Creating Communicators
- Collective communications involve all processes in a communicator.
- Here we need reductions among subsets of processes.
- Processes are organized in a virtual 2-D grid.
- Create communicators for the processes in the same row or in the same column.

Creating a Virtual Grid of Processes
MPI_Dims_create()
- input parameters: total number of processes in the desired grid, number of grid dimensions
- returns the number of processes in each dimension
MPI_Cart_create()
- creates a communicator with Cartesian topology

MPI_Dims_create

int MPI_Dims_create (
    int  nodes,  /* In  - # procs in grid */
    int  dims,   /* In  - number of dims */
    int *size    /* I/O - size of each grid dimension */
)

MPI_Cart_create

int MPI_Cart_create (
    MPI_Comm  old_comm,   /* In  - old communicator */
    int       dims,       /* In  - grid dimensions */
    int      *size,       /* In  - # procs in each dim */
    int      *periodic,   /* In  - 1 if dim i wraps around; 0 otherwise */
    int       reorder,    /* In  - 1 if process ranks can be reordered */
    MPI_Comm *cart_comm   /* Out - new communicator */
)
Using MPI_Dims_create and MPI_Cart_create

MPI_Comm cart_comm;
int p;
int periodic[2];
int size[2];
...
size[0] = size[1] = 0;
MPI_Dims_create (p, 2, size);
periodic[0] = periodic[1] = 0;
MPI_Cart_create (MPI_COMM_WORLD, 2, size, periodic, 1, &cart_comm);

Useful Grid-related Functions
MPI_Cart_rank(): given the coordinates of a process in a Cartesian communicator, returns the process's rank.

int MPI_Cart_rank (
    MPI_Comm  comm,    /* In  - communicator */
    int      *coords,  /* In  - array containing the process's grid location */
    int      *rank     /* Out - rank of the process at those coordinates */
)

MPI_Cart_coords(): given the rank of a process in a Cartesian communicator, returns the process's coordinates.

int MPI_Cart_coords (
    MPI_Comm  comm,    /* In  - communicator */
    int       rank,    /* In  - rank of process */
    int       dims,    /* In  - dimensions in the virtual grid */
    int      *coords   /* Out - coordinates of the specified process in the virtual grid */
)

MPI_Comm_split
MPI_Comm_split():
- partitions the processes of a communicator into one or more subgroups
- constructs a communicator for each subgroup
- allows the processes in each subgroup to perform their own collective communications
- needed here for the columnwise scatter and the rowwise reduce

int MPI_Comm_split (
    MPI_Comm  old_comm,   /* In  - existing communicator */
    int       partition,  /* In  - partition number */
    int       new_rank,   /* In  - ranking order of processes in the new communicator */
    MPI_Comm *new_comm    /* Out - new communicator shared by processes in the same partition */
)

Example: Create Communicators for Process Rows

MPI_Comm grid_comm;     /* 2-D process grid */
int grid_coords[2];     /* location of process in grid */
MPI_Comm row_comm;      /* processes in same row */

MPI_Comm_split (grid_comm, grid_coords[0], grid_coords[1], &row_comm);

Analysis of the Parallel Algorithm (checkerboard)
Let α be the time to compute one iteration.
- Sequential execution time: αn²
- Computation time of the parallel program: α⌈n/√p⌉⌈n/√p⌉
- The reduce requires log √p messages, each of ⌈n/√p⌉ elements (8 bytes per double element),
  so each message costs λ + 8⌈n/√p⌉/β.
- Total execution time: α⌈n/√p⌉⌈n/√p⌉ + log √p (λ + 8⌈n/√p⌉/β)

Benchmarking (checkerboard)

  p   Predicted   Actual   Speedup   Mflops
  1      63.4      63.4      1.00      31.6
  4      17.8      17.4      3.64     114.9
  8       9.7       9.7      6.53     206.2
 16       6.2       6.2     10.21     322.6

(times in milliseconds)

Comparison of the Three Algorithms (a small numeric sketch follows at the end of these notes)
- Rowwise:      αn⌈n/p⌉ + λ⌈log p⌉ + 8n/β
- Columnwise:   αn⌈n/p⌉ + (p − 1)(λ + 8n/(pβ))
- Checkerboard: α⌈n/√p⌉⌈n/√p⌉ + log √p (λ + 8⌈n/√p⌉/β)

Review
- Matrix-vector multiplication
  - rowwise decomposition
  - columnwise decomposition
  - checkerboard decomposition
- Gather, scatter, alltoall
- Grid-oriented communications

Next Class
- load balancing
- termination detection
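As a closing illustration (not part of the original slides), a small stand-alone C sketch that plugs numbers into the three predicted-time formulas from the comparison slide. The values of α, λ, β and n below are hypothetical placeholders, not the parameters behind the benchmark tables above, and the checkerboard formula is evaluated only for square p.

#include <math.h>
#include <stdio.h>

/* Predicted times for the three decompositions, following the
 * "Comparison of the Three Algorithms" formulas.
 * a = alpha (time per multiply-add), l = lambda (message latency),
 * b = beta (bandwidth in bytes per second). */
static double t_row(int n, int p, double a, double l, double b) {
    return a * n * ceil((double)n / p) + l * ceil(log2((double)p)) + 8.0 * n / b;
}
static double t_col(int n, int p, double a, double l, double b) {
    return a * n * ceil((double)n / p) + (p - 1) * (l + 8.0 * n / (p * b));
}
static double t_checker(int n, int p, double a, double l, double b) {
    double blk = ceil(n / sqrt((double)p));   /* assumes p is a square number */
    return a * blk * blk + log2(sqrt((double)p)) * (l + 8.0 * blk / b);
}

int main(void) {
    int n = 1000;                                       /* hypothetical problem size */
    double alpha = 60e-9, lambda = 250e-6, beta = 1e8;  /* hypothetical machine */
    int ps[] = {1, 4, 16};
    for (int i = 0; i < 3; i++) {
        int p = ps[i];
        printf("p=%2d  rowwise=%.4f s  columnwise=%.4f s  checkerboard=%.4f s\n",
               p, t_row(n, p, alpha, lambda, beta),
               t_col(n, p, alpha, lambda, beta),
               t_checker(n, p, alpha, lambda, beta));
    }
    return 0;
}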