Matrix-Vector Multiplication
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 13, 2012
Outline
Matrix-vector multiplication
rowwise decomposition
columnwise decomposition
checkerboard decomposition
Gather, scatter, alltoall
Grid-oriented communications

Matrix-Vector Multiplication

Worked example, A x b = c:

        | 2 1 0 4 |       | 1 |       |  9 |
    A = | 3 2 1 1 |   b = | 3 |   c = | 14 |
        | 4 3 1 2 |       | 4 |       | 19 |
        | 3 0 2 0 |       | 1 |       | 11 |

Each element c_i is the dot product of row i of A with the vector b.
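The sequential kernel is the usual row-by-row dot product. Below is a minimal sequential C sketch (not from the slides) that reproduces the example; running it prints c = (9, 14, 19, 11).

#include <stdio.h>

#define N 4

/* Sequential matrix-vector product c = A * b for the 4 x 4 example above. */
int main(void)
{
    double A[N][N] = { {2, 1, 0, 4},
                       {3, 2, 1, 1},
                       {4, 3, 1, 2},
                       {3, 0, 2, 0} };
    double b[N] = {1, 3, 4, 1};
    double c[N];

    for (int i = 0; i < N; i++) {          /* one dot product per row */
        c[i] = 0.0;
        for (int j = 0; j < N; j++)
            c[i] += A[i][j] * b[j];
    }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);   /* prints 9, 14, 19, 11 */
    return 0;
}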

Matrix Decomposition

  rowwise decomposition
  columnwise decomposition
  checkerboard decomposition

Storing vectors:
  divide vector elements among processes, or
  replicate vector elements

Vector replication is acceptable because vectors have only n elements,
versus n² elements in matrices.

Rowwise Decomposition

Task i is associated with:
  row i of the matrix
  the entire vector b

[Figure: row i of A multiplied by b yields element c_i; an all-gather then
assembles the full result vector c on every task.]

MPI Allgatherv

int MPI_Allgatherv (
    void         *send_buffer,
    int           send_cnt,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator
)
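A process calling MPI_Allgatherv must know the count and displacement of every process' block. The helper below is a sketch (not from the slides) of how those arrays could be filled for a block distribution of n elements over p processes; the function and variable names are illustrative.

#include <stdlib.h>

/* Number of elements owned by process 'rank' in a block distribution. */
static int block_size(int rank, int p, int n)
{
    int low  = rank * n / p;          /* first index owned by 'rank' */
    int high = (rank + 1) * n / p;    /* first index owned by 'rank'+1 */
    return high - low;
}

/* Fill receive_cnt / receive_disp arrays for MPI_Allgatherv. */
void make_mixed_xfer_arrays(int p, int n, int **cnt, int **disp)
{
    *cnt  = malloc(p * sizeof(int));
    *disp = malloc(p * sizeof(int));
    (*cnt)[0]  = block_size(0, p, n);
    (*disp)[0] = 0;
    for (int i = 1; i < p; i++) {
        (*disp)[i] = (*disp)[i - 1] + (*cnt)[i - 1];  /* prefix sum of counts */
        (*cnt)[i]  = block_size(i, p, n);
    }
}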

MPI Allgatherv

[Example: three processes gather the string "concatenate".]
  Process 0: send_buffer = "con",  send_cnt = 3
  Process 1: send_buffer = "cate", send_cnt = 4
  Process 2: send_buffer = "nate", send_cnt = 4
  All processes use receive_cnt = {3, 4, 4} and receive_disp = {0, 3, 7}.
  After the call, receive_buffer on every process holds "concatenate".
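Putting the pieces together, here is a hedged sketch of the rowwise algorithm. It assumes the matrix has already been distributed by block rows (local_A, row-major) and that every process holds the full vector b; it reuses the count/displacement helper sketched earlier, and all names are illustrative rather than taken from the slides.

#include <mpi.h>
#include <stdlib.h>

/* From the earlier sketch: fills counts and displacements for a block
 * distribution of n elements over p processes. */
void make_mixed_xfer_arrays(int p, int n, int **cnt, int **disp);

/* Rowwise c = A * b: each process multiplies its block of rows by b,
 * then an all-gather replicates the full result vector c everywhere. */
void rowwise_matvec(double *local_A, double *b, double *c,
                    int n, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    int low        = id * n / p;                 /* first local row */
    int local_rows = (id + 1) * n / p - low;     /* rows owned here */

    /* Local block of the result. */
    double *local_c = malloc(local_rows * sizeof(double));
    for (int i = 0; i < local_rows; i++) {
        local_c[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_c[i] += local_A[i * n + j] * b[j];
    }

    /* All-gather the blocks so every process ends up with all of c. */
    int *cnt, *disp;
    make_mixed_xfer_arrays(p, n, &cnt, &disp);
    MPI_Allgatherv(local_c, local_rows, MPI_DOUBLE,
                   c, cnt, disp, MPI_DOUBLE, comm);

    free(local_c); free(cnt); free(disp);
}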

Complexity Analysis

(for simplicity, assume a square n x n matrix)

Sequential algorithm complexity: Θ(n²)
Parallel algorithm computational complexity: Θ(n²/p)
Communication complexity of the all-gather: Θ(log p + n)
Overall complexity: Θ(n²/p + log p + n)

Algorithm Scalability

Isoefficiency analysis: T(n, 1) ≥ C T0(n, p)
Sequential time complexity: T(n, 1) = Θ(n²)
Parallel overhead is dominated by the all-gather:
  T0(n, p) = Θ(p(log p + n)) → Θ(pn) for large n

  n² ≥ Cpn  ⇒  n ≥ Cp

Scalability function: M(f(p))/p
  M(n) = n²  ⇒  M(Cp)/p = C²p²/p = C²p

⇒ The system is not highly scalable.

Analysis of the Parallel Algorithm

Let α be the time to compute an iteration, λ the message latency, and β the
channel bandwidth (bytes per unit time).

Sequential execution time: αn²
Computation time of the parallel program: αn⌈n/p⌉
The all-gather requires ⌈log p⌉ messages with latency λ.
Total vector elements transmitted: n (8 bytes per element, double)

Total execution time:  αn⌈n/p⌉ + λ⌈log p⌉ + 8n/β

Benchmarking

 p   Predicted   Actual   Speedup   Mflops
 1      63.4      63.4      1.00      31.6
 2      32.4      32.7      1.94      61.2
 3      22.3      22.7      2.79      88.1
 4      17.0      17.8      3.56     112.4
 5      14.1      15.2      4.16     131.6
 6      12.0      13.3      4.76     150.4
 7      10.5      12.2      5.19     163.9
 8       9.4      11.1      5.70     180.2
16       5.7       7.2      8.79     277.8

(times in milliseconds)

Columnwise Decomposition

Primitive task i is associated with:
  column i of the matrix
  vector element b_i

[Figure: each task multiplies column i of A by b_i, producing a vector of
partial results; an all-to-all exchange then routes the partial results so
that task i can sum up element c_i.]

All-to-All Operation

[Figure: processes P0–P3 before and after an all-to-all. Every process sends a
distinct block to every other process; afterwards, process j holds the j-th
block from each process.]

MPI Alltoallv

int MPI_Alltoallv (
    void         *send_buffer,
    int          *send_cnt,
    int          *send_disp,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator
)
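In the columnwise algorithm, each process multiplies its block of columns by its block of b and then uses MPI_Alltoallv to route block k of its partial result to process k, which sums the incoming contributions. The sketch below illustrates this under the same block-distribution assumptions as before; all names are illustrative, not from the slides.

#include <mpi.h>
#include <stdlib.h>

/* Columnwise c = A * b: local_A holds this process' block of columns,
 * stored column-major (local_A[j*n + i] is element (i, low + j)),
 * local_b the matching block of b, local_c the resulting block of c. */
void columnwise_matvec(double *local_A, double *local_b, double *local_c,
                       int n, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);

    int low        = id * n / p;
    int local_cols = (id + 1) * n / p - low;

    /* Partial result: this process' columns contribute to every c[i]. */
    double *partial = calloc(n, sizeof(double));
    for (int j = 0; j < local_cols; j++)
        for (int i = 0; i < n; i++)
            partial[i] += local_A[j * n + i] * local_b[j];

    /* Exchange: block k of the partial vector goes to process k. */
    int *scnt = malloc(p * sizeof(int)), *sdisp = malloc(p * sizeof(int));
    int *rcnt = malloc(p * sizeof(int)), *rdisp = malloc(p * sizeof(int));
    for (int k = 0; k < p; k++) {
        sdisp[k] = k * n / p;                      /* start of block k */
        scnt[k]  = (k + 1) * n / p - sdisp[k];     /* size of block k  */
        rcnt[k]  = local_cols;     /* every sender ships us our block size */
        rdisp[k] = k * local_cols;
    }
    double *recv = malloc(p * local_cols * sizeof(double));
    MPI_Alltoallv(partial, scnt, sdisp, MPI_DOUBLE,
                  recv, rcnt, rdisp, MPI_DOUBLE, comm);

    /* Sum the p received sub-blocks into this process' block of c. */
    for (int i = 0; i < local_cols; i++) {
        local_c[i] = 0.0;
        for (int k = 0; k < p; k++)
            local_c[i] += recv[k * local_cols + i];
    }
    free(partial); free(scnt); free(sdisp); free(rcnt); free(rdisp); free(recv);
}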

Complexity Analysis

(for simplicity, assume a square n x n matrix)

Sequential algorithm complexity: Θ(n²)
Parallel algorithm computational complexity: Θ(n²/p)
Communication complexity of the all-to-all: Θ(p + n)
  (p − 1 messages, and a total of n elements)
Overall complexity: Θ(n²/p + n + p)

Algorithm Scalability

Isoefficiency analysis: T(n, 1) ≥ C T0(n, p)
Sequential time complexity: T(n, 1) = Θ(n²)
The parallel overhead is the all-to-all and vector copying:
  T0(n, p) = Θ(p(p + n)) → Θ(pn) for large n

  n² ≥ Cpn  ⇒  n ≥ Cp

Scalability function: M(f(p))/p
  M(n) = n²  ⇒  M(Cp)/p = C²p²/p = C²p

⇒ The system is not highly scalable.

Analysis of the Parallel Algorithm

Let α be the time to compute an iteration.

Sequential execution time: αn²
Computation time of the parallel program: αn⌈n/p⌉
The all-to-all requires p − 1 messages, each of length at most n/p
(8 bytes per element, double).

Total execution time:  αn⌈n/p⌉ + (p − 1)(λ + 8n/(pβ))

Benchmarking

 p   Predicted   Actual   Speedup   Mflops
 1      63.4      63.8      1.00      31.4
 2      32.4      32.9      1.92      60.8
 3      22.2      22.6      2.80      88.5
 4      17.2      17.5      3.62     114.3
 5      14.3      14.5      4.37     137.9
 6      12.5      12.6      5.02     158.7
 7      11.3      11.2      5.65     178.6
 8      10.4      10.0      6.33     200.0
16       8.5       7.6      8.33     263.2

(times in milliseconds)

Checkerboard Decomposition

Primitive task associated with a rectangular block of the matrix
(processes form a 2-D grid).

Vector b:
  distributed by blocks among the processes in the first row of the grid
  each block is then copied to the processes in the same column of the grid

Algorithm Steps

[Figure: six processes P0–P5 arranged in a grid.]
  1. Distribute b among the processes (as on the previous slide).
  2. Multiply: each process multiplies its block of A by its block of b.
  3. Reduce across rows: the partial results in each grid row are summed to
     form that row's block of c.

Complexity Analysis

(for simplicity, assume a square n x n matrix)
Also, assume p is a square number: the process grid is √p x √p and each
block has size n/√p x n/√p.

Sequential algorithm complexity: Θ(n²)
Parallel algorithm computational complexity: Θ(n²/p)
  (each process computes an (n/√p) x (n/√p) submatrix product)
Communication complexity of the reduce: Θ(log √p · (n/√p)) = Θ(n log p / √p)
  (log √p messages, each with n/√p elements)
Overall complexity: Θ(n²/p + n log p / √p)

Algorithm Scalability

Isoefficiency analysis: T(n, 1) ≥ C T0(n, p)
Sequential time complexity: T(n, 1) = Θ(n²)
The parallel overhead is the reduce and vector copying:
  T0(n, p) = Θ(p · n log p / √p) = Θ(n √p log p)

  n² ≥ C n √p log p  ⇒  n ≥ C √p log p

Scalability function: M(f(p))/p
  M(n) = n²  ⇒  M(C √p log p)/p = C² p log² p / p = C² log² p

⇒ This system is much more scalable than the previous two implementations!

Creating Communicators

  collective communications involve all processes in a communicator
  need reductions among subsets of processes
  processes in a virtual 2-D grid
  create communicators for processes in the same row or same column

Creating a Virtual Grid of Processes

MPI_Dims_create()
  input parameters:
    total number of processes in the desired grid
    number of grid dimensions
  ⇒ returns the number of processes in each dimension

MPI_Cart_create()
  creates a communicator with a Cartesian topology

MPI Dims create

int MPI_Dims_create (
    int  nodes,   /* In  - # procs in grid */
    int  dims,    /* In  - Number of dims */
    int *size     /* I/O - Size of each grid dim */
)

MPI Cart create

int MPI_Cart_create (
    MPI_Comm  old_comm,   /* In  - old communicator */
    int       dims,       /* In  - grid dimensions */
    int      *size,       /* In  - # procs in each dim */
    int      *periodic,   /* In  - 1 if dim i wraps around; 0 otherwise */
    int       reorder,    /* In  - 1 if process ranks can be reordered */
    MPI_Comm *cart_comm   /* Out - new communicator */
)

Using MPI Dims create and MPI Cart create

MPI_Comm cart_comm;
int p;
int periodic[2];
int size[2];
...
size[0] = size[1] = 0;
MPI_Dims_create (p, 2, size);
periodic[0] = periodic[1] = 0;
MPI_Cart_create (MPI_COMM_WORLD, 2, size, periodic, 1, &cart_comm);

Useful Grid-related Functions

MPI_Cart_rank()
  given the coordinates of a process in a Cartesian communicator, returns the
  process' rank

int MPI_Cart_rank (
    MPI_Comm  comm,     /* In  - Communicator */
    int      *coords,   /* In  - Array containing process' grid location */
    int      *rank      /* Out - Rank of process at coordinates */
)

MPI_Cart_coords()
  given the rank of a process in a Cartesian communicator, returns the
  process' coordinates

int MPI_Cart_coords (
    MPI_Comm  comm,     /* In  - Communicator */
    int       rank,     /* In  - Rank of process */
    int       dims,     /* In  - Dimensions in virtual grid */
    int      *coords    /* Out - Coordinates of specified process
                                 in virtual grid */
)
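As a small usage sketch (variable names illustrative): once cart_comm has been created as on the earlier slide, each process can query its own position in the virtual grid.

int grid_id;
int grid_coords[2];

/* Rank within the Cartesian communicator; it may differ from the rank in
 * MPI_COMM_WORLD when reorder = 1 was passed to MPI_Cart_create. */
MPI_Comm_rank (cart_comm, &grid_id);

/* Row (grid_coords[0]) and column (grid_coords[1]) of this process. */
MPI_Cart_coords (cart_comm, grid_id, 2, grid_coords);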

MPI Comm split

MPI_Comm_split()
  partitions the processes of a communicator into one or more subgroups
  constructs a communicator for each subgroup
  allows the processes in each subgroup to perform their own collective
  communications
  needed for the columnwise scatter and the rowwise reduce

int MPI_Comm_split (
    MPI_Comm  old_comm,   /* In  - Existing communicator */
    int       partition,  /* In  - Partition number */
    int       new_rank,   /* In  - Ranking order of processes
                                   in new communicator */
    MPI_Comm *new_comm    /* Out - New communicator shared by
                                   processes in same partition */
)

Example: Create Communicators for Process Rows

MPI_Comm grid_comm;      /* 2-D process grid */
int grid_coords[2];      /* Location of process in grid */
MPI_Comm row_comm;       /* Processes in same row */

MPI_Comm_split (grid_comm, grid_coords[0], grid_coords[1], &row_comm);
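Continuing the example, splitting on grid_coords[1] instead gives one communicator per grid column. The fragment below is a sketch of how the "Multiply" and "Reduce across rows" steps of the checkerboard algorithm could then be written; it assumes local_A (local_rows x local_cols, row-major) and local_b are already in place, omits the initial distribution of b, and uses illustrative names throughout.

/* Processes with the same column index end up in the same communicator. */
MPI_Comm col_comm;
MPI_Comm_split (grid_comm, grid_coords[1], grid_coords[0], &col_comm);

/* Each process multiplies its block of A by its block of b ... */
double *partial = malloc (local_rows * sizeof(double));
double *local_c = malloc (local_rows * sizeof(double));
for (int i = 0; i < local_rows; i++) {
    partial[i] = 0.0;
    for (int j = 0; j < local_cols; j++)
        partial[i] += local_A[i * local_cols + j] * local_b[j];
}

/* ... and the partial results are summed across each grid row.  Rank 0 in
 * row_comm is the process in grid column 0 (it was keyed by grid_coords[1]),
 * so that process receives this row's block of c. */
MPI_Reduce (partial, local_c, local_rows, MPI_DOUBLE, MPI_SUM, 0, row_comm);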

Analysis of the Parallel Algorithm

Let α be the time to compute an iteration.

Sequential execution time: αn²
Computation time of the parallel program: α⌈n/√p⌉⌈n/√p⌉
The reduce requires log √p messages, each carrying ⌈n/√p⌉ elements
(8 bytes per element, double), i.e., λ + 8⌈n/√p⌉/β per message.

Total execution time:  α⌈n/√p⌉⌈n/√p⌉ + log √p (λ + 8⌈n/√p⌉/β)

Benchmarking

 p   Predicted   Actual   Speedup   Mflops
 1      63.4      63.4      1.00      31.6
 4      17.8      17.4      3.64     114.9
 8       9.7       9.7      6.53     206.2
16       6.2       6.2     10.21     322.6

(times in milliseconds)

Comparison of the Three Algorithms

Rowwise:       αn⌈n/p⌉ + λ⌈log p⌉ + 8n/β
Columnwise:    αn⌈n/p⌉ + (p − 1)(λ + 8n/(pβ))
Checkerboard:  α⌈n/√p⌉⌈n/√p⌉ + log √p (λ + 8⌈n/√p⌉/β)
Review
Matrix-vector multiplication
rowwise decomposition
columnwise decomposition
checkerboard decomposition
Gather, scatter, alltoall
Grid-oriented communications
Next Class
load balancing
termination detection