Lecture 6: Parallel Matrix Algorithms (part 3)
A Simple Parallel Dense Matrix-Matrix Multiplication
Let A = [a_ij]_{n×n} and B = [b_ij]_{n×n} be n × n matrices. Compute C = AB.
• Computational complexity of the sequential algorithm: O(n³)
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• Algorithm:
  – Perform all-to-all broadcast of blocks of A in each row of processes
  – Perform all-to-all broadcast of blocks of B in each column of processes
  – Each process P_{i,j} performs C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j}
Performance Analysis
• √p rows of all-to-all broadcasts, each among a group of √p processes with message size n²/p; communication time: t_s log √p + t_w (n²/p)(√p − 1)
• √p columns of all-to-all broadcasts; communication time: t_s log √p + t_w (n²/p)(√p − 1)
• Computation time: √p × (n/√p)³ = n³/p
• Parallel time: T_P = n³/p + 2(t_s log √p + t_w (n²/p)(√p − 1))
Memory Efficiency of the Simple Parallel Algorithm
• Not memory efficient
  – Each process P_{i,j} holds 2√p blocks: A_{i,k} and B_{k,j} for 0 ≤ k < √p
  – Each process needs Θ(n²/√p) memory
  – Total memory over all the processes is Θ(n² × √p), i.e., √p times the memory of the sequential algorithm.
Cannon's Algorithm for Matrix-Matrix Multiplication
Goal: improve the memory efficiency.
Let A = [a_ij]_{n×n} and B = [b_ij]_{n×n} be n × n matrices. Compute C = AB.
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• The contention-free formula:
  C_{i,j} = Σ_{k=0}^{√p−1} A_{i,(i+j+k) mod √p} B_{(i+j+k) mod √p,j}
Cannon's Algorithm
// make initial alignment
for i, j := 0 to √p − 1 do
    Send block A_{i,j} to process (i, (j − i + √p) mod √p) and block B_{i,j} to process ((i − j + √p) mod √p, j);
endfor;
Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
// compute-and-shift: a sequence of one-step shifts pairs up A_{i,k} and B_{k,j}
// on process P_{i,j}, which computes C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
for step := 1 to √p − 1 do
    Shift A_{i,j} one step left (with wraparound) and B_{i,j} one step up (with wraparound);
    Each process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
endfor;
Remark: In the initial alignment, the send operation shifts A_{i,j} to the left (with wraparound) by i steps, and shifts B_{i,j} up (with wraparound) by j steps.
Cannon's Algorithm for 3 × 3 Matrices
[Figure: A and B initially; A and B after the initial alignment; A and B after shift step 1; A and B after shift step 2]
Performance Analysis
• In the initial alignment step, the maximum distance over which a block shifts is √p − 1.
  – The circular shift operations in the row and column directions take time: t_comm = 2(t_s + t_w n²/p)
• Each of the √p single-step shifts in the compute-and-shift phase takes time: t_s + t_w n²/p.
• Multiplying √p submatrices of size (n/√p) × (n/√p) takes time: n³/p.
• Parallel time: T_P = n³/p + 2√p(t_s + t_w n²/p) + 2(t_s + t_w n²/p)
int MPI_Sendrecv_replace(void *buf, int count,
    MPI_Datatype datatype, int dest, int sendtag, int source,
    int recvtag, MPI_Comm comm, MPI_Status *status);
• Executes a blocking send and receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
• buf [in/out]: initial address of the send and receive buffer
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, left, right;
    int buffer[10] = {0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* neighbors on a ring (with wraparound) */
    right = (myid + 1) % numprocs;
    left = (myid - 1 + numprocs) % numprocs;

    /* send to the left neighbor, receive from the right neighbor,
       reusing the same buffer */
    MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123,
                         MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}
DNS Algorithm
• The algorithm is named after Dekel, Nassimi and Sahni
• It is based on partitioning intermediate data
• It performs matrix multiplication in time O(log n) by using O(n³/log n) processes

The sequential algorithm for C = A × B:
for i = 0; i < n; i++
    for j = 0; j < n; j++
        C_{i,j} = 0
        for k = 0; k < n; k++
            C_{i,j} = C_{i,j} + A_{i,k} × B_{k,j}
Remark: The algorithm performs n³ scalar multiplications
• Assume that n³ processes are available for multiplying two n × n matrices.
• Then each of the n³ processes is assigned a single scalar multiplication.
• The additions for all C_{i,j} can be carried out simultaneously in log n steps each.
• Arrange the n³ processes in a three-dimensional n × n × n logical array.
  – The processes are labeled according to their location in the array, and the multiplication A_{i,k} B_{k,j} is assigned to process P[i,j,k] (0 ≤ i, j, k < n).
  – After each process performs a single multiplication, the contents of P[i,j,0], P[i,j,1], …, P[i,j,n−1] are added to obtain C_{i,j}.
• The vertical column of processes P[i,j,*] computes the dot product of row A_{i,*} and column B_{*,j}
• The DNS algorithm has three main communication steps:
  1. moving the rows of A and the columns of B to their respective places,
  2. performing a one-to-all broadcast along the j axis for A and along the i axis for B, and
  3. performing an all-to-one reduction along the k axis.