Parallel Programming: Techniques and
Applications Using Networked
Workstations and Parallel Computers
Barry Wilkinson and Michael Allen
Chapter 11: Numerical Algorithms
Sec 11.2: Implementing Matrix Multiplication
Modified 5/11/05 by T. O’Neil for 3460:4/577, Fall 2005, Univ. of
Akron.
Previous revision: 7/28/03.
(c) Prentice-Hall Inc., 2002. All rights reserved.
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n × n
matrices).
The sequential code to compute AB could simply be
for (i=0; i<n; i++)
for (j=0; j<n; j++) {
c[i][j] = 0;
for (k=0; k<n; k++)
c[i][j] = c[i][j] + a[i][k]* b[k][j];
}
This algorithm requires n³ multiplications and n³
additions, leading to a sequential time complexity of
O(n³). Very easy to parallelize.
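As a quick sanity check, the triple loop above can be wrapped in a small self-contained C function (the fixed size N = 3 is an illustrative assumption; the algorithm works for any n):

```c
#include <assert.h>

#define N 3  /* illustrative size; the algorithm works for any n */

/* c = a * b, exactly the triple loop from the text:
   n^3 multiplications and n^3 additions in total */
void matmul(int a[N][N], int b[N][N], int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
}
```

Each c[i][j] depends only on row i of a and column j of b, which is exactly why the instances of the two outer loops can be handed to separate processors.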
Wilkinson & Allen, Parallel Programming: Techniques & Applications Using Networked Workstations & Parallel
Computers, 1999.
Parallel Code
With n × n matrices, depending on the number of
processors, can obtain:
• Time complexity of O(n²) with n processors
Each instance of the inner loop is independent and can
be done by a separate processor. Cost optimal since
O(n³) = n · O(n²).
• Time complexity of O(n) with n² processors
One element of A and B assigned to each processor.
Cost optimal since O(n³) = n² · O(n).
• Time complexity of O(log n) with n³ processors
By parallelizing the inner loop. Not cost-optimal since
O(n³) ≠ n³ · O(log n).
O(log n) is the lower bound for parallel matrix
multiplication.
Partitioning into Submatrices
Suppose the matrix is divided into s² submatrices. Each
submatrix has n/s × n/s elements. Using notation Ap,q
as the submatrix in submatrix row p and submatrix
column q:
for (p=0; p<s; p++)
for (q=0; q<s; q++) {
/* clear elements of submatrix */
Cp,q = 0;
/* submatrix mult and */
/* add to accum submatrix */
for (r=0; r<s; r++)
Cp,q = Cp,q + Ap,r * Br,q;
}
Partitioning into Submatrices (cont)
for (p=0; p<s; p++)
for (q=0; q<s; q++) {
Cp,q = 0;
for (r=0; r<s; r++)
Cp,q = Cp,q + Ap,r * Br,q;
}
The line Cp,q = Cp,q + Ap,r * Br,q; means: multiply
submatrix Ap,r by Br,q using matrix multiplication
and add the result to submatrix Cp,q using matrix
addition. Known as block matrix multiplication.
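A sequential C sketch of block matrix multiplication (the sizes N = 4, S = 2 and the flat index arithmetic are illustrative assumptions; the inner four loops realize the submatrix multiply-and-accumulate Cp,q = Cp,q + Ap,r * Br,q):

```c
#define N 4          /* full matrix is N x N (assumed) */
#define S 2          /* S x S grid of submatrices      */
#define M (N / S)    /* each submatrix is M x M        */

void block_matmul(int a[N][N], int b[N][N], int c[N][N]) {
    for (int p = 0; p < S; p++)
        for (int q = 0; q < S; q++) {
            /* clear elements of submatrix C[p][q] */
            for (int i = 0; i < M; i++)
                for (int j = 0; j < M; j++)
                    c[p*M + i][q*M + j] = 0;
            /* C[p][q] += A[p][r] * B[r][q]: submatrix multiply,
               then matrix-add into the accumulating submatrix */
            for (int r = 0; r < S; r++)
                for (int i = 0; i < M; i++)
                    for (int j = 0; j < M; j++)
                        for (int k = 0; k < M; k++)
                            c[p*M + i][q*M + j] +=
                                a[p*M + i][r*M + k] * b[r*M + k][q*M + j];
        }
}
```

The six nested loops still perform exactly n³ scalar multiply-adds; the blocking only changes the order in which they happen, which is what makes the submatrix version suitable for distributing whole blocks to processors.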
Block Matrix Multiplication
Submatrix multiplication
Matrices
Submatrix multiplication
Multiplying A0,0 × B0,0 to obtain the first contribution to C0,0
Direct Implementation
One processor is needed to compute each element of
C, so n² processors would be needed. Each needs one
row of elements of A and one column of elements of
B, so some of the same elements are sent to more
than one processor. Can use submatrices.
Performance Improvement
Using a tree construction, n numbers can be added in
log n steps using n processors. This gives a
computational time complexity of O(log n) using n³
processors.
Recursive Implementation
Apply same algorithm on each submatrix recursively.
Excellent algorithm for a shared memory system
because of locality of operations.
Recursive Algorithm
mat_mult(App, Bpp, s)
{
/* if submatrix has one elt multiply */
if (s == 1) C = A * B;
/* continue to make recursive calls */
else {
s = s/2; /* no of elts in each row/column */
P0 = mat_mult(App, Bpp, s);
P1 = mat_mult(Apq, Bqp, s);
P2 = mat_mult(App, Bpq, s);
P3 = mat_mult(Apq, Bqq, s);
P4 = mat_mult(Aqp, Bpp, s);
P5 = mat_mult(Aqq, Bqp, s);
P6 = mat_mult(Aqp, Bpq, s);
P7 = mat_mult(Aqq, Bqq, s);
/* add submatrix products to form */
/* submatrices of final matrix */
Cpp = P0 + P1;
Cpq = P2 + P3;
Cqp = P4 + P5;
Cqq = P6 + P7;
}
return(C); /* return final matrix */
}
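The recursive algorithm can be rendered as runnable C by recursing on index offsets instead of the p/q subscript notation (N = 2, the offset parameters, and accumulation into a pre-zeroed C are assumptions of this sketch, not the book's exact code):

```c
#define N 2  /* assumed power of two */

/* Accumulate the s x s block product into
   c[cr..cr+s-1][cc..cc+s-1]; c must start zeroed. */
void rec_matmul(int a[N][N], int ar, int ac,
                int b[N][N], int br, int bc,
                int c[N][N], int cr, int cc, int s) {
    if (s == 1) {
        /* one-element submatrix: scalar multiply-add */
        c[cr][cc] += a[ar][ac] * b[br][bc];
    } else {
        int h = s / 2;  /* no. of elements in each half row/column */
        for (int i = 0; i < 2; i++)          /* block row of C    */
            for (int j = 0; j < 2; j++)      /* block column of C */
                for (int k = 0; k < 2; k++)  /* inner block index */
                    rec_matmul(a, ar + i*h, ac + k*h,
                               b, br + k*h, bc + j*h,
                               c, cr + i*h, cc + j*h, h);
    }
}
```

A call would look like rec_matmul(a, 0, 0, b, 0, 0, c, 0, 0, N) with c zero-initialized; the eight recursive calls per level correspond to P0 through P7 above, with the additions folded into the accumulation.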
Two-dimensional array (mesh)
interconnection network
Three-dimensional meshes are also used in some large
high-performance systems.
Mesh Implementations
• Cannon’s algorithm
• Fox’s algorithm
• Not in textbook but similar complexity
• Systolic array
All involve using processors arranged in a mesh and
shifting elements of the arrays through the mesh.
Accumulate the partial sums at each processor.
Mesh Implementations
Cannon’s Algorithm
Uses a mesh of processors with wraparound
connections (a torus) to shift the A elements (or
submatrices) left and the B elements (or
submatrices) up.
1. Initially processor Pi,j has elements ai,j and bi,j
(0 ≤ i, j < n).
2. Elements are moved from their initial position to an
“aligned” position. The complete ith row of A is
shifted i places left and the complete jth column of B
is shifted j places upward, with wraparound. This has
the effect of placing the element ai,(j+i) mod n and the
element b(i+j) mod n,j in processor Pi,j. These elements
are a pair of those required in the accumulation of ci,j.
Mesh Implementations
Cannon’s Algorithm (cont)
3. Each processor Pi,j multiplies its elements.
4. The ith row of A is shifted one place left, and the jth
column of B is shifted one place upward. This has
the effect of bringing together the adjacent elements
of A and B, which will also be required in the
accumulation.
5. Each processor Pi,j multiplies the elements brought
to it and adds the result to the accumulating sum.
6. Steps 4 and 5 are repeated until the final result is
obtained (n-1 shifts with n rows and n columns of
elements).
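The six steps can be simulated sequentially in C, with one array cell standing in for each processor (the working copies ta/tb and the size N = 3 are assumptions of this sketch; a real implementation would use message passing between neighbours):

```c
#include <string.h>

#define N 3  /* illustrative mesh size */

void cannon(int a[N][N], int b[N][N], int c[N][N]) {
    int ta[N][N], tb[N][N], t[N][N];

    /* Steps 1-2: alignment - row i of A shifted i places left,
       column j of B shifted j places up, with wraparound */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            ta[i][j] = a[i][(j + i) % N];
            tb[i][j] = b[(i + j) % N][j];
            c[i][j] = 0;
        }

    for (int step = 0; step < N; step++) {
        /* Steps 3 and 5: every processor multiplies its pair
           and adds to its accumulating sum */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += ta[i][j] * tb[i][j];
        /* Step 4: shift A one place left, B one place up */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                t[i][j] = ta[i][(j + 1) % N];
        memcpy(ta, t, sizeof t);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                t[i][j] = tb[(i + 1) % N][j];
        memcpy(tb, t, sizeof t);
    }
}
```

After the alignment, processor (i, j) always holds a matching pair a[i][k] and b[k][j] for the same k, so N multiply-accumulate rounds produce the full product.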
Movement of A and B elements
Step 2 – Alignment of elements of A and B
Step 4 – One-place shift of elements of A and B
Cannon’s Algorithm Example
The 3 × 3 example A × B = C:

    A = 3 1 5    B = 1 3 1    C = 30 28 29
        1 2 1        2 4 1        10 14  8
        4 3 2        5 3 5        20 30 17

Initial layout, one (a, b) pair per processor; the first
element of each pair is from A (red on the original slide),
the second from B (blue):

    (3,1) (1,3) (5,1)
    (1,2) (2,4) (1,1)
    (4,5) (3,3) (2,5)
Cannon’s Algorithm Example (cont)
    A elements:   B elements:   Accumulator:
      3 1 5         1 4 5          3  4 25
      2 1 1         2 3 1          4  3  1
      2 4 3         5 3 1         10 12  3

After realignment (and the first multiply step);
accumulator values on the right.
Cannon’s Algorithm Example (cont)
    A elements:   B elements:   Accumulator:
      1 5 3         2 3 1          5 19 28
      1 1 2         5 3 1          9  6  3
      4 3 2         1 4 5         14 24 13

After first shift.
Cannon’s Algorithm Example (cont)
    A elements:   B elements:   Accumulator:
      5 3 1         5 3 1         30 28 29
      1 2 1         1 4 5         10 14  8
      3 2 4         2 3 1         20 30 17

After second (final) shift.
Systolic array
Pseudocode
Each processor Pi,j repeatedly performs the same
algorithm (after c is initialized to zero):
recv(&a, Pi,j-1);   /* receive from left */
recv(&b, Pi-1,j);   /* receive from above */
c = c + a * b;      /* accumulate value */
send(&a, Pi,j+1);   /* send right */
send(&b, Pi+1,j);   /* send downwards */
which accumulates the required summations.
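The systolic dataflow can also be simulated sequentially in C: row i of A enters from the left delayed by i ticks, column j of B enters from the top delayed by j ticks, and zero padding stands in for "no data" (the register arrays ain/bin, the tick count, and N = 3 are all assumptions of this sketch):

```c
#define N 3  /* illustrative array size */

void systolic(int a[N][N], int b[N][N], int c[N][N]) {
    int ain[N][N] = {{0}}, bin[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
    /* value a[i][k] meets b[k][j] at processor (i,j) at
       tick i + j + k, so 3(N-1)+1 ticks suffice */
    for (int t = 0; t <= 3 * (N - 1); t++) {
        /* shift A values right, feeding the skewed row at the left edge */
        for (int j = N - 1; j > 0; j--)
            for (int i = 0; i < N; i++)
                ain[i][j] = ain[i][j - 1];
        for (int i = 0; i < N; i++)
            ain[i][0] = (t >= i && t < i + N) ? a[i][t - i] : 0;
        /* shift B values down, feeding the skewed column at the top edge */
        for (int i = N - 1; i > 0; i--)
            for (int j = 0; j < N; j++)
                bin[i][j] = bin[i - 1][j];
        for (int j = 0; j < N; j++)
            bin[0][j] = (t >= j && t < j + N) ? b[t - j][j] : 0;
        /* c = c + a * b at every processor, as in the pseudocode */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += ain[i][j] * bin[i][j];
    }
}
```

Because a zero operand contributes nothing to the sum, the padding makes the staggered timing safe without any extra bookkeeping.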
Matrix-Vector
Multiplication
Other Matrix Multiplication Methods
Strassen’s method (1969)
Suppose we wish to multiply n × n matrices A and B to
produce C.
Divide each of these into four n/2 × n/2 submatrices as
follows:

    A = A11 A12    B = B11 B12    C = C11 C12
        A21 A22        B21 B22        C21 C22
Strassen’s Method (cont)
A traditional recursive approach executes in O(n³) time,
so it is no faster.
Strassen discovered a different recursive approach that
executes in O(n^log 7) time, where log 7 ≈ 2.81 (log base 2):
1. Divide the input matrices into quarters as above.
2. Use O(n²) scalar additions and subtractions to
recursively compute seven matrix products Q1
through Q7.
3. Compute the desired submatrices of C by adding
and/or subtracting various combinations of the Qi
matrices, using O(n²) scalar operations.
Strassen’s Method (cont)
To this end set
• Q1 = (A11 + A22)·(B11 + B22)
• Q2 = (A21 + A22)·B11
• Q3 = A11·(B12 – B22)
• Q4 = A22·(B21 – B11)
• Q5 = (A11 + A12)·B22
• Q6 = (A21 – A11)·(B11 + B12)
• Q7 = (A12 – A22)·(B21 + B22)
As indicated above can be done recursively.
Strassen’s Method (cont)
Now
• C11 = Q1 + Q4 – Q5 + Q7
• C21 = Q2 + Q4
• C12 = Q3 + Q5
• C22 = Q1 + Q3 – Q2 + Q6
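One level of the recursion can be written out in C for 2 × 2 matrices, so that each "submatrix" is a single scalar (an illustrative sketch only; a full Strassen implementation applies these formulas recursively to blocks):

```c
/* One level of Strassen's method on 2x2 matrices:
   seven multiplications instead of the usual eight */
void strassen2x2(int a[2][2], int b[2][2], int c[2][2]) {
    int q1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    int q2 = (a[1][0] + a[1][1]) * b[0][0];
    int q3 = a[0][0] * (b[0][1] - b[1][1]);
    int q4 = a[1][1] * (b[1][0] - b[0][0]);
    int q5 = (a[0][0] + a[0][1]) * b[1][1];
    int q6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    int q7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = q1 + q4 - q5 + q7;
    c[1][0] = q2 + q4;
    c[0][1] = q3 + q5;
    c[1][1] = q1 + q3 - q2 + q6;
}
```

The saving of one multiplication per level, applied recursively, is what drops the exponent from 3 to log 7.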