Lecture 6: Parallel Matrix Algorithms (part 3)
A Simple Parallel Dense Matrix-Matrix Multiplication
Let A = [a_ij]_{n×n} and B = [b_ij]_{n×n} be n × n matrices. Compute C = AB.
• Computational complexity of the sequential algorithm: O(n³).
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• Algorithm:
– Perform all-to-all broadcast of the blocks of A in each row of processes.
– Perform all-to-all broadcast of the blocks of B in each column of processes.
– Each process P_{i,j} computes C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j} (sketched in MPI below).
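This structure maps directly onto MPI collectives. The following is a minimal sketch, not from the lecture: it assumes p is a perfect square, q = √p, row-major (n/√p) × (n/√p) blocks, and that Arow and Bcol are caller-provided scratch arrays of q·nb·nb doubles; local_matmul and all other names are illustrative.

#include <mpi.h>

/* Serial kernel: C += A*B for nb x nb row-major blocks. */
void local_matmul(const double *A, const double *B, double *C, int nb)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

/* Each process enters with its own blocks Aij, Bij and a zeroed Cij. */
void simple_parallel_mm(const double *Aij, const double *Bij, double *Cij,
                        double *Arow, double *Bcol, int nb, int q)
{
    int dims[2] = {q, q}, periods[2] = {1, 1};
    int keep_col[2] = {0, 1};   /* row communicator: vary the column index */
    int keep_row[2] = {1, 0};   /* column communicator: vary the row index */
    MPI_Comm grid, row_comm, col_comm;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Cart_sub(grid, keep_col, &row_comm);
    MPI_Cart_sub(grid, keep_row, &col_comm);

    /* All-to-all broadcast of A blocks within each process row and of
       B blocks within each process column: afterwards this process
       holds A_{i,0..q-1} in Arow and B_{0..q-1,j} in Bcol. */
    MPI_Allgather(Aij, nb*nb, MPI_DOUBLE, Arow, nb*nb, MPI_DOUBLE, row_comm);
    MPI_Allgather(Bij, nb*nb, MPI_DOUBLE, Bcol, nb*nb, MPI_DOUBLE, col_comm);

    /* C_{i,j} = sum_k A_{i,k} B_{k,j} */
    for (int k = 0; k < q; k++)
        local_matmul(Arow + k*nb*nb, Bcol + k*nb*nb, Cij, nb);
}

Each MPI_Allgather here realizes one of the all-to-all broadcasts of the algorithm: after it, every process in a row (column) holds all q blocks of that row of A (column of B).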
Performance Analysis
• √p rows of all-to-all broadcasts, each among a group of √p processes, with message size n²/p; communication time: t_s log √p + t_w (n²/p)(√p − 1).
• √p columns of all-to-all broadcasts; communication time: t_s log √p + t_w (n²/p)(√p − 1).
• Computation time: √p × (n/√p)³ = n³/p.
• Parallel time: T_p = n³/p + 2(t_s log √p + t_w (n²/p)(√p − 1)).
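For example (illustrative numbers only): with p = 16 processes, √p = 4 and log √p = 2, so T_p = n³/16 + 2(2 t_s + 3 t_w n²/16) = n³/16 + 4 t_s + (3/8) t_w n².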
Memory Efficiency of the Simple Parallel Algorithm
• Not memory efficient:
– Each process P_{i,j} ends up holding 2√p blocks (A_{i,k} and B_{k,j}, 0 ≤ k < √p).
– Each process needs Θ(n²/√p) memory.
– Total memory over all the processes is Θ(n² × √p), i.e., √p times the memory of the sequential algorithm.
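For example (illustrative numbers only): with p = 16, each process holds 2√16 = 8 blocks of (n/4) × (n/4) elements, i.e., n²/2 numbers, so the 16 processes together store 8n² numbers, which is √p = 4 times the 2n² that the sequential algorithm needs for A and B.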
Cannon's Algorithm for Matrix-Matrix Multiplication
Goal: to improve the memory efficiency.
Let A = [a_ij]_{n×n} and B = [b_ij]_{n×n} be n × n matrices. Compute C = AB.
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• The contention-free formula:
C_{i,j} = Σ_{k=0}^{√p−1} A_{i,(i+j+k) mod √p} B_{(i+j+k) mod √p, j}
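For example, on a 3 × 3 grid (√p = 3), process P_{0,1} computes A_{0,1}B_{1,1} + A_{0,2}B_{2,1} + A_{0,0}B_{0,1}. At step k = 0 the three processes of row 0 use A_{0,0}, A_{0,1}, A_{0,2} respectively, so no two processes in the same row (or column) need the same block at the same time; this is what makes the schedule contention-free.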
Cannon's Algorithm
// make initial alignment
for i, j := 0 to √p − 1 do
    Send block A_{i,j} to process (i, (j − i + √p) mod √p) and block B_{i,j} to process ((i − j + √p) mod √p, j);
endfor;
Process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
// compute-and-shift: a sequence of one-step shifts pairs up A_{i,k} and B_{k,j}
// on process P_{i,j}, which accumulates C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
for step := 1 to √p − 1 do
    Shift A_{i,j} one step left (with wraparound) and B_{i,j} one step up (with wraparound);
    Process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
endfor;
Remark: the initial alignment shifts A_{i,j} to the left (with wraparound) by i steps and B_{i,j} upward (with wraparound) by j steps.
Cannon's Algorithm for 3 × 3 Matrices
[Figure: the blocks of A and B on a 3 × 3 process grid: initial layout, after the initial alignment, after shift step 1, and after shift step 2.]
Performance Analysis
• In the initial alignment step, the maximum distance over which a block shifts is √p − 1.
– The circular shift operations in the row and column directions take time t_comm = 2(t_s + t_w n²/p).
• Each of the √p single-step shifts in the compute-and-shift phase takes time t_s + t_w n²/p.
• Multiplying √p submatrices of size (n/√p) × (n/√p) takes time n³/p.
• Parallel time: T_p = n³/p + 2√p(t_s + t_w n²/p) + 2(t_s + t_w n²/p).
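Remark: comparing the two T_p expressions, both algorithms have essentially the same dominant t_w term (about 2 t_w n²/√p), so Cannon's advantage is memory rather than speed: each process keeps only one block of A and one block of B at a time, i.e., Θ(n²/p) memory instead of Θ(n²/√p).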
int MPI_Sendrecv_replace(void *buf, int count,
    MPI_Datatype datatype, int dest, int sendtag, int source,
    int recvtag, MPI_Comm comm, MPI_Status *status);
• Executes a blocking send and a blocking receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
• buf [in/out]: initial address of the send and receive buffer
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, left, right, i;
    int buffer[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Left and right neighbors on a ring of processes (with wraparound). */
    right = (myid + 1) % numprocs;
    left = (myid - 1 + numprocs) % numprocs;

    /* Fill the buffer so that the send is well defined. */
    for (i = 0; i < 10; i++)
        buffer[i] = myid;

    /* Send buffer to the left neighbor and replace its contents with
       the message received from the right neighbor. */
    MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123,
                         MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
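This ring pattern, sending the buffer to the left neighbor while receiving the replacement from the right neighbor, is exactly the one-step wraparound shift used in the compute-and-shift phase of Cannon's algorithm.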
DNS Algorithm
• The algorithm is named after Dekel, Nassimi, and Sahni.
• It is based on partitioning the intermediate data.
• It performs matrix multiplication in time O(log n) by using O(n³/log n) processes.

The sequential algorithm for C = A × B:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        C[i][j] = 0;
        for (k = 0; k < n; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }

Remark: the algorithm performs n³ scalar multiplications.
• Assume that n³ processes are available for multiplying two n × n matrices.
• Then each of the n³ processes is assigned a single scalar multiplication.
• The additions for all C_{ij} can be carried out simultaneously, in log n steps each.
• Arrange the n³ processes in a three-dimensional n × n × n logical array.
– The processes are labeled according to their location in the array, and the multiplication A_{ik}B_{kj} is assigned to process P[i,j,k] (0 ≤ i, j, k < n).
– After each process performs a single multiplication, the contents of P[i,j,0], P[i,j,1], …, P[i,j,n−1] are added to obtain C_{ij}.
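For example, with n = 4: C_{0,1} = A_{0,0}B_{0,1} + A_{0,1}B_{1,1} + A_{0,2}B_{2,1} + A_{0,3}B_{3,1}. Processes P[0,1,0], P[0,1,1], P[0,1,2], P[0,1,3] each compute one of the four scalar products, and the four partial results are then summed in log₂ 4 = 2 addition steps.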
[Figure: the n × n × n logical process array, with axes labeled i and j.]
• The vertical column of processes P[i,j,*] computes the dot product of row A_{i,*} and column B_{*,j}.
• The DNS algorithm has three main communication steps:
1. moving the rows of A and the columns of B to their respective places;
2. performing one-to-all broadcast along the j axis for A and along the i axis for B;
3. all-to-one reduction along the k axis.
A minimal MPI sketch of these three steps follows.
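The sketch below uses one matrix element per process and assumes n³ processes already arranged in an n × n × n Cartesian communicator cube (created elsewhere with MPI_Cart_create); dns_skeleton and all variable names are illustrative, not part of the lecture.

#include <mpi.h>

/* Entered by process P[i,j,k]: on the k = 0 plane, a = a_{ij} and
   b = b_{ij}; on exit, *c holds C_{ij} on P[i,j,0]. */
void dns_skeleton(MPI_Comm cube, double a, double b, double *c)
{
    int rank, coords[3];
    MPI_Comm_rank(cube, &rank);
    MPI_Cart_coords(cube, rank, 3, coords);
    int i = coords[0], j = coords[1], k = coords[2];

    /* One sub-communicator along each axis of the process cube. */
    int keep_i[3] = {1,0,0}, keep_j[3] = {0,1,0}, keep_k[3] = {0,0,1};
    MPI_Comm iaxis, jaxis, kaxis;
    MPI_Cart_sub(cube, keep_i, &iaxis);
    MPI_Cart_sub(cube, keep_j, &jaxis);
    MPI_Cart_sub(cube, keep_k, &kaxis);

    /* Step 1: along the k axis, move a_{ij} from P[i,j,0] to P[i,j,j]
       and b_{ij} from P[i,j,0] to P[i,j,i]. */
    if (k == 0 && j != 0) MPI_Send(&a, 1, MPI_DOUBLE, j, 0, kaxis);
    if (k == j && j != 0) MPI_Recv(&a, 1, MPI_DOUBLE, 0, 0, kaxis, MPI_STATUS_IGNORE);
    if (k == 0 && i != 0) MPI_Send(&b, 1, MPI_DOUBLE, i, 1, kaxis);
    if (k == i && i != 0) MPI_Recv(&b, 1, MPI_DOUBLE, 0, 1, kaxis, MPI_STATUS_IGNORE);

    /* Step 2: one-to-all broadcasts; afterwards a = a_{ik} and b = b_{kj}. */
    MPI_Bcast(&a, 1, MPI_DOUBLE, k, jaxis);   /* along the j axis, root j = k */
    MPI_Bcast(&b, 1, MPI_DOUBLE, k, iaxis);   /* along the i axis, root i = k */

    /* Step 3: multiply, then all-to-one reduction along the k axis. */
    double prod = a * b;                      /* a_{ik} * b_{kj} */
    MPI_Reduce(&prod, c, 1, MPI_DOUBLE, MPI_SUM, 0, kaxis);

    MPI_Comm_free(&iaxis); MPI_Comm_free(&jaxis); MPI_Comm_free(&kaxis);
}

Step 1 places a_{ij} on the plane k = j and b_{ij} on the plane k = i, so after the two broadcasts process P[i,j,k] holds exactly the operands a_{ik} and b_{kj} of its assigned multiplication, and the reduction along the k axis accumulates C_{ij} on P[i,j,0].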