CSE 574 Parallel Processing

Lecture 6:
Message Passing Interface (MPI)
Parallel Programming Models
Message Passing Model

• Used on distributed-memory MIMD architectures
• Multiple processes execute in parallel, asynchronously
• Process creation may be static or dynamic
• Processes communicate by using send and receive primitives
Parallel Programming Models
Example: Pi calculation
π = ∫₀¹ f(x) dx = ∫₀¹ 4/(1+x²) dx ≈ w ∑ f(xᵢ)
f(x) = 4/(1+x²)
n = 10
w = 1/n
xᵢ = w(i−0.5), i = 1, …, n
(Figure: f(x) on [0, 1] approximated by n midpoint rectangles of width w, evaluated at the points xᵢ.)
Parallel Programming Models
Sequential Code
#include <stdio.h>

#define f(x) (4.0/(1.0+(x)*(x)))

int main(void){
    int n, i;
    float w, x, sum, pi;
    printf("n?\n");
    scanf("%d", &n);
    w = 1.0/n;              /* width of each rectangle */
    sum = 0.0;
    for (i = 1; i <= n; i++){
        x = w*(i-0.5);      /* midpoint of the i-th interval */
        sum += f(x);
    }
    pi = w*sum;
    printf("%f\n", pi);
    return 0;
}
Message-Passing Interface (MPI)
http://www.mpi-forum.org
SPMD Parallel MPI Code
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define f(x) (4.0/(1.0+(x)*(x)))

int main(int argc, char *argv[]){
    int myid, nproc, root, err;
    int n, i, start, end;
    float w, x, sum, pi;
    FILE *f1;

    err = MPI_Init(&argc, &argv);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "initialization error\n");
        exit(1);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);   /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* my rank */
    root = 0;

    if (myid == root) {                      /* only the root reads the input */
        f1 = fopen("indata", "r");
        fscanf(f1, "%d", &n);
        fclose(f1);
    }
    MPI_Bcast(&n, 1, MPI_INT, root, MPI_COMM_WORLD);   /* everyone gets n */

    w = 1.0/n;
    sum = 0.0;
    start = myid*(n/nproc) + 1;              /* each process sums its own block of intervals */
    end   = (myid+1)*(n/nproc);              /* (assumes nproc divides n) */
    for (i = start; i <= end; i++){
        x = w*(i-0.5);
        sum += f(x);
    }
    MPI_Reduce(&sum, &pi, 1, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    if (myid == root) {                      /* only the root writes the result */
        f1 = fopen("outdata", "w");
        fprintf(f1, "pi=%f\n", pi);
        fclose(f1);
    }
    MPI_Finalize();
    return 0;
}
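With a typical MPI installation, a program like this is compiled with the mpicc compiler wrapper and launched on several processes with mpirun (or mpiexec), e.g. mpirun -np 4 followed by the executable name; the input file indata must be readable by the root process.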
Message-Passing Interface (MPI)
• MPI_INIT (int *argc, char ***argv): Initiate an MPI computation.
• MPI_FINALIZE (): Terminate a computation.
• MPI_COMM_SIZE (comm, size): Determine number of processes.
• MPI_COMM_RANK (comm, pid): Determine my process identifier.
• MPI_SEND (buf, count, datatype, dest, tag, comm): Send a message.
• MPI_RECV (buf, count, datatype, source, tag, comm, status): Receive a message.
  • tag: message tag or MPI_ANY_TAG
  • source: process id of source process or MPI_ANY_SOURCE
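A minimal sketch of how these calls fit together, assuming two processes; the payload value and the tag 0 are arbitrary choices for illustration:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}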
Message-Passing Interface (MPI)
Deadlock:
• MPI_SEND and MPI_RECV are blocking.
• Consider the following program, in which two processes try to exchange data. Both processes call the send first; if the sends block until a matching receive is posted, neither process ever reaches its receive, and the exchange deadlocks.
...
if (rank .eq. 0) then
    call mpi_send( abuf, n, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr )
    call mpi_recv( buf, n, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, status, ierr )
else if (rank .eq. 1) then
    call mpi_send( abuf, n, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr )
    call mpi_recv( buf, n, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr )
endif
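A minimal sketch of one safe ordering in C, assuming two processes (the buffer names abuf, bbuf and the length N are illustrative): reversing the order on one side guarantees that each send meets a posted receive.

#include <mpi.h>

#define N 4                           /* illustrative message length */

int main(int argc, char *argv[]){
    int rank, abuf[N] = {0}, bbuf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 sends then receives, rank 1 receives then sends, so the calls always match. */
    if (rank == 0) {
        MPI_Send(abuf, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(bbuf, N, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        MPI_Recv(bbuf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(abuf, N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}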
Message-Passing Interface (MPI)
Communicators
• If two processes use different contexts for communication, there can be no danger of their communication being confused.
• Each MPI communicator contains a separate communication context; this defines a separate virtual communication space.
• Communicator handle: identifies the process group and context with respect to which the operation is to be performed.
• MPI_COMM_WORLD: contains all the processes in a parallel computation.
Message-Passing Interface (MPI)
Collective Operations
These operations are executed collectively: every process in the process group must call the communication routine.
• Barrier: Synchronize all processes.
• Broadcast: Send data from one process to all processes.
• Gather: Gather data from all processes to one process.
• Scatter: Scatter data from one process to all processes.
• Reduction operations: addition, multiplication, etc. of distributed data.
Message-Passing Interface (MPI)
Collective Operations
• MPI_BARRIER (comm): Synchronize all processes in the communicator's group.
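A minimal sketch of a typical use: making sure every process has finished its setup before a timed phase starts. MPI_Wtime (the standard MPI timer) is used only for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    int rank;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... each process does some local setup work here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* nobody continues until all processes have arrived */
    t0 = MPI_Wtime();              /* so the timers start together */

    /* ... timed parallel phase ... */

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("phase took %f seconds\n", MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}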
Message-Passing Interface (MPI)
Collective Operations
• MPI_BCAST (inbuf, incnt, intype, root, comm): 1-to-all
  Ex: MPI_BCAST(A, 5, MPI_INT, 0, MPI_COMM_WORLD);
(Figure: before the call only the root P0 holds A0–A4; after MPI_BCAST every process P0–P3 holds a copy of A0–A4.)
Message-Passing Interface (MPI)
Collective Operations
• MPI_SCATTER (inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm): 1-to-all
  Ex:
  int A[100], B[25];
  MPI_SCATTER(A, 25, MPI_INT, B, 25, MPI_INT, 0, MPI_COMM_WORLD);
(Figure: the root P0 holds A = A0 A1 A2 A3; after MPI_SCATTER, each process Pi receives chunk Ai in its buffer B.)
Message-Passing Interface (MPI)
Collective Operations
• MPI_GATHER (inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm): all-to-1
  Ex:
  int A[100], B[25];
  MPI_GATHER(B, 25, MPI_INT, A, 25, MPI_INT, 0, MPI_COMM_WORLD);
(Figure: each process Pi holds a chunk Bi in B; after MPI_GATHER, the root P0 holds A = B0 B1 B2 B3.)
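A minimal runnable sketch combining the two calls, assuming exactly 4 processes and an illustrative array of 100 integers split into chunks of 25:

#include <stdio.h>
#include <mpi.h>

#define N     100
#define CHUNK 25          /* N divided among 4 processes */

int main(int argc, char *argv[]){
    int rank, i;
    int A[N], B[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                      /* root fills the full array */
        for (i = 0; i < N; i++) A[i] = i;

    /* distribute 25-element chunks: process i gets A[25*i .. 25*i+24] in B */
    MPI_Scatter(A, CHUNK, MPI_INT, B, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < CHUNK; i++)         /* each process works on its own chunk */
        B[i] *= 2;

    /* collect the chunks back into A on the root, in rank order */
    MPI_Gather(B, CHUNK, MPI_INT, A, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("A[0]=%d A[99]=%d\n", A[0], A[N-1]);

    MPI_Finalize();
    return 0;
}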
Message-Passing Interface (MPI)
Collective Operations
• Reduction operations: Combine the values in the input buffers of all processes using an operator.
• Operations:
  • MPI_MAX, MPI_MIN
  • MPI_SUM, MPI_PROD
  • MPI_LAND, MPI_LOR, MPI_LXOR (logical)
  • MPI_BAND, MPI_BOR, MPI_BXOR (bitwise)
Message-Passing Interface (MPI)
Collective Operations
• MPI_REDUCE (inbuf, outbuf, count, type, op, root, comm)
  Returns the combined value to the output buffer of a single root process.
  Ex:
  int A[2], B[2];
  MPI_REDUCE(A, B, 2, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
(Figure: inputs A = {2, 4} on P0, {5, 7} on P1, {0, 3} on P2, {6, 2} on P3; an element-wise MPI_MIN reduction leaves B = {0, 2} on the root P0.)
Message-Passing Interface (MPI)
Collective Operations
• MPI_ALLREDUCE (inbuf, outbuf, count, type, op, comm)
  Returns the combined value to the output buffers of all processes (note: no root argument).
  Ex:
  int A[2], B[2];
  MPI_ALLREDUCE(A, B, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
(Figure: same inputs as above; after MPI_ALLREDUCE with MPI_MIN, every process P0–P3 holds B = {0, 2}.)
Message-Passing Interface (MPI)
Asynchronous Communication
• Data is distributed among processes, which must then poll periodically for pending read and write requests.
• Local computation may be interleaved with the processing of incoming messages.

Non-blocking send/receive
• MPI_ISEND (buf, count, datatype, dest, tag, comm, request): Start sending a message.
• MPI_IRECV (buf, count, datatype, source, tag, comm, request): Start receiving a message.
• MPI_WAIT (request, status): Complete a non-blocking operation.
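A minimal sketch of a non-blocking exchange between two processes. Both operations are posted before the wait, so the exchange cannot deadlock; MPI_WAITALL (the array form of MPI_WAIT) completes both requests. Buffer names, the length, and the tag are illustrative.

#include <mpi.h>

#define N 4

int main(int argc, char *argv[]){
    int rank, other, sendbuf[N] = {0}, recvbuf[N];
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* assumes exactly two processes */

    /* post both operations, then do useful work while they progress */
    MPI_Irecv(recvbuf, N, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... local computation can overlap with the communication here ... */

    MPI_Waitall(2, reqs, stats);      /* both transfers are complete after this */

    MPI_Finalize();
    return 0;
}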
Message-Passing Interface (MPI)
Asynchronous Communication
• MPI_IPROBE (source, tag, comm, flag, status): Polls for a pending message without receiving it, and sets a flag. The message can then be received by using MPI_RECV.
• MPI_PROBE (source, tag, comm, status): Blocks until the message is available.
• MPI_GET_COUNT (status, datatype, count): Determines the size of the message.
• status (must be set by a previous probe):
  • status.MPI_SOURCE
  • status.MPI_TAG
Message-Passing Interface (MPI)
Asynchronous Communication
Ex: Receive a message whose size and sender are not known in advance
int count, *buf, source;
MPI_Status status;
MPI_PROBE (MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
source = status.MPI_SOURCE;                 /* who sent the pending message */
MPI_GET_COUNT (&status, MPI_INT, &count);   /* how many MPI_INT elements it holds */
buf = malloc(count*sizeof(int));
MPI_RECV (buf, count, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
Message-Passing Interface (MPI)
Communicators
• Communicator handle: identifies the process group and context with respect to which the operation is to be performed.
• MPI_COMM_WORLD: contains all the processes in a parallel computation (default).
• New communicators are formed by either including or excluding processes from an existing communicator.
• MPI_COMM_SIZE (comm, size): Determine number of processes.
• MPI_COMM_RANK (comm, pid): Determine my process identifier.
Message-Passing Interface (MPI)
Communicators
• MPI_COMM_DUP (comm, newcomm): creates a new handle for the same process group
• MPI_COMM_SPLIT (comm, color, key, newcomm): creates a new handle for a subset of a given process group
• MPI_INTERCOMM_CREATE (comm, leader, peer, rleader, tag, inter): links processes in two groups
• MPI_COMM_FREE (comm): destroys a handle
Message-Passing Interface (MPI)
Communicators
Ex: Two processes communicating with a new handle
MPI_COMM newcomm;
MPI_COMM_DUP (MPI_COMM_WORLD, &newcomm);
if (myid == 0)
    MPI_SEND (A, 100, MPI_INT, 1, 0, newcomm);
else
    MPI_RECV (A, 100, MPI_INT, 0, 0, newcomm, &status);
MPI_COMM_FREE (&newcomm);
Message-Passing Interface (MPI)
Communicators
Ex: Creating a new group with 4 members
MPI_COMM comm, newcomm;
int myid, color;
...
MPI_COMM_RANK (comm, &myid);
if (myid < 4)
    color = 1;
else
    color = MPI_UNDEFINED;    /* these processes get MPI_COMM_NULL back */
MPI_COMM_SPLIT (comm, color, myid, &newcomm);
MPI_SCATTER (A, 10, MPI_INT, B, 10, MPI_INT, 0, newcomm);   /* called only by the 4 members of newcomm */
Processes:          P0  P1  P2  P3  P4  P5  P6  P7
Ranks in comm:       0   1   2   3   4   5   6   7
Color:               1   1   1   1   -   -   -   -   (- = MPI_UNDEFINED)
Ranks in newcomm:    0   1   2   3   -   -   -   -   (- = not a member, MPI_COMM_NULL)
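A minimal runnable sketch of the same split; processes that pass MPI_UNDEFINED receive MPI_COMM_NULL and must skip the collective call. The array sizes are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    int myid, newid, color, i;
    int A[40], B[10];
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    color = (myid < 4) ? 1 : MPI_UNDEFINED;
    MPI_Comm_split(MPI_COMM_WORLD, color, myid, &newcomm);

    if (newcomm != MPI_COMM_NULL) {              /* only the 4 group members take part */
        MPI_Comm_rank(newcomm, &newid);
        if (newid == 0)
            for (i = 0; i < 40; i++) A[i] = i;   /* root of newcomm provides the data */
        MPI_Scatter(A, 10, MPI_INT, B, 10, MPI_INT, 0, newcomm);
        printf("world rank %d is rank %d in newcomm, B[0]=%d\n", myid, newid, B[0]);
        MPI_Comm_free(&newcomm);
    }

    MPI_Finalize();
    return 0;
}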
Message-Passing Interface (MPI)
Communicators
Ex: Splitting processes into 3 independent groups
MPI_COMM comm, newcomm;
int myid, color;
...
MPI_COMM_RANK (comm, &myid);
color = myid % 3;
MPI_COMM_SPLIT (comm, color, myid, &newcomm);
Processes:          P0  P1  P2  P3  P4  P5  P6  P7
Ranks in comm:       0   1   2   3   4   5   6   7
Color (myid % 3):    0   1   2   0   1   2   0   1
Ranks in newcomm:    0   0   0   1   1   1   2   2

The three resulting groups: color 0 = {P0, P3, P6}, color 1 = {P1, P4, P7}, color 2 = {P2, P5}.
Message-Passing Interface (MPI)
Communicators
MPI_INTERCOMM_CREATE (comm, local_leader, peer_comm, remote_leader, tag, intercomm): links processes in two groups
• comm: intracommunicator (within the group)
• local_leader: rank of the leader within the group
• peer_comm: parent communicator
• remote_leader: rank of the other group's leader within the parent communicator
Message-Passing Interface (MPI)
Communicators
Ex: Communication of processes in two different groups
(Figure: group 0 = {P0, P2, P4, P6}, group 1 = {P1, P3, P5, P7}; each process communicates with its peer in the other group.)
MPI_COMM newcomm, intercomm;
int myid, newid, count, color;
...
MPI_COMM_SIZE (MPI_COMM_WORLD, &count);
if (count % 2 == 0){
    MPI_COMM_RANK (MPI_COMM_WORLD, &myid);
    color = myid % 2;
    MPI_COMM_SPLIT (MPI_COMM_WORLD, color, myid, &newcomm);
    MPI_COMM_RANK (newcomm, &newid);
    if (color == 0){
        // group 0: local_leader = 0 (P0), remote_leader = world rank 1 (P1)
        MPI_INTERCOMM_CREATE (newcomm, 0, MPI_COMM_WORLD, 1, 99, &intercomm);
        // destination newid = the peer with the same local rank in group 1
        MPI_SEND (msg, 1, type, newid, 0, intercomm);
    }
    else {
        // group 1: local_leader = 0 (P1), remote_leader = world rank 0 (P0)
        MPI_INTERCOMM_CREATE (newcomm, 0, MPI_COMM_WORLD, 0, 99, &intercomm);
        MPI_RECV (msg, 1, type, newid, 0, intercomm, &status);
    }
    MPI_COMM_FREE (&intercomm);
}
MPI_COMM_FREE (&newcomm);
Message-Passing Interface (MPI)
Communicators
Ex: Communication of processes in two different groups
(Figure: ranks in MPI_COMM_WORLD are 0-7 for P0-P7. After the split, one newcomm holds P0, P2, P4, P6 with newcomm ranks 0-3, and the other holds P1, P3, P5, P7 with newcomm ranks 0-3. P0 is group 0's local leader and group 1's remote leader; P1 is group 1's local leader and group 0's remote leader.)
Message-Passing Interface (MPI)
Derived Types
Allow noncontiguous data elements to be grouped together in a message.
Constructor functions:
• MPI_TYPE_CONTIGUOUS (): constructs a data type from contiguous elements
• MPI_TYPE_VECTOR (): constructs a data type from blocks separated by a stride
• MPI_TYPE_INDEXED (): constructs a data type with variable indices and sizes
• MPI_TYPE_COMMIT (): commits a data type so that it can be used in communication
• MPI_TYPE_FREE (): used to reclaim storage
Message-Passing Interface (MPI)
Derived Types
• MPI_TYPE_CONTIGUOUS (count, oldtype, newtype): constructs a data type from contiguous elements
  Ex: MPI_TYPE_CONTIGUOUS (10, MPI_REAL, &newtype);
• MPI_TYPE_VECTOR (count, blocklength, stride, oldtype, newtype): constructs a data type from blocks separated by a stride
  Ex: MPI_TYPE_VECTOR (5, 1, 4, MPI_FLOAT, &floattype);
  (Figure: floattype selects 5 blocks of 1 float each, 4 elements apart in memory, i.e. A[0], A[4], A[8], A[12], A[16].)
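A typical use of such a vector type is sending one column of a row-major matrix without copying it first; a minimal sketch, assuming a 5x4 float matrix (the matrix name and sizes are illustrative):

#include <mpi.h>

int main(int argc, char *argv[]){
    int rank;
    float A[5][4];                 /* row-major: elements of one column are 4 floats apart */
    MPI_Datatype coltype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 5 blocks of 1 float, stride 4 -> one column of the 5x4 matrix */
    MPI_Type_vector(5, 1, 4, MPI_FLOAT, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 0)
        MPI_Send(&A[0][1], 1, coltype, 1, 0, MPI_COMM_WORLD);   /* send column 1 */
    else if (rank == 1)
        MPI_Recv(&A[0][1], 1, coltype, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}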
Message-Passing Interface (MPI)
Derived Types
• MPI_TYPE_INDEXED (count, blocklengths, indices, oldtype, newtype): constructs a data type with variable indices and sizes
  Ex: MPI_TYPE_INDEXED (3, Blengths, Indices, MPI_INT, &newtype);
(Figure: with Blengths = {2, 3, 1} and Indices = {1, 5, 10}, the new type selects elements 1-2 of the data array as block 0, elements 5-7 as block 1, and element 10 as block 2.)
Message-Passing Interface (MPI)
Derived Types
• MPI_TYPE_COMMIT (type): commits a data type so that it can be used in communication
• MPI_TYPE_FREE (type): used to reclaim storage
Message-Passing Interface (MPI)
Derived Types
Ex:
MPI_TYPE_INDEXED (3, Blengths, Indices, MPI_INT, &newtype);
MPI_TYPE_COMMIT (&newtype);
MPI_SEND (A, 1, newtype, dest, 0, MPI_COMM_WORLD);   /* one element of newtype = the 6 selected ints of A */
MPI_TYPE_FREE (&newtype);
(Figure: same layout as on the previous slide: blocks of lengths 2, 3, 1 starting at indices 1, 5, 10 of A.)
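A self-contained sketch of the same pattern, sending the selected blocks of an array from process 0 to process 1 (the array size and fill values are illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    int rank, i;
    int A[11];
    int Blengths[3] = {2, 3, 1};          /* length of each block                */
    int Indices[3]  = {1, 5, 10};         /* starting index of each block        */
    MPI_Datatype newtype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_indexed(3, Blengths, Indices, MPI_INT, &newtype);
    MPI_Type_commit(&newtype);

    if (rank == 0) {
        for (i = 0; i < 11; i++) A[i] = i;                 /* fill with sample data */
        MPI_Send(A, 1, newtype, 1, 0, MPI_COMM_WORLD);     /* sends A[1], A[2], A[5..7], A[10] */
    } else if (rank == 1) {
        for (i = 0; i < 11; i++) A[i] = -1;
        MPI_Recv(A, 1, newtype, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < 11; i++) printf("%d ", A[i]);      /* only the selected slots change */
        printf("\n");
    }

    MPI_Type_free(&newtype);
    MPI_Finalize();
    return 0;
}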