
Parallel Processing (CS 667)
Lecture 9: Advanced Point-to-Point Communication
Jeremy R. Johnson
*Parts of this lecture were derived from Chapter 13 of Pacheco
Introduction
• Objective: To further examine message-passing communication patterns.
• Topics
– Implementing Allgather
• Ring
• Hypercube
– Non-blocking send/recv
• MPI_Isend
• MPI_Wait
• MPI_Test
Broadcast/Reduce Ring
[Figure: broadcast and reduce on a ring of four processes P0-P3; data travels around the ring one neighbor at a time]
Bi-directional Broadcast Ring
[Figure: bi-directional broadcast on a ring of four processes P0-P3; messages travel in both directions around the ring simultaneously]
Allgather Ring
[Figure: allgather on a ring of four processes; initially Pi holds xi, and in each of the p-1 steps every process passes its most recently received block to its successor, so at the end every process holds x0,x1,x2,x3]
AllGather
• int MPI_Allgather(
      void*         send_data     /* in  */,
      int           send_count    /* in  */,
      MPI_Datatype  send_type     /* in  */,
      void*         recv_data     /* out */,
      int           recv_count    /* in  */,
      MPI_Datatype  recv_type     /* in  */,
      MPI_Comm      communicator  /* in  */)

[Figure: Processes 0-3 each contribute one block xi; after the call every process holds x0, x1, x2, x3]
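As a concrete usage sketch (not from the slides), the program below gathers one float from every process into a full array on every rank; the variable names, buffer sizes, and the use of MPI_FLOAT are choices made only for this example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int p, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    float x = (float) my_rank;              /* each process contributes one value */
    float* y = malloc(p * sizeof(float));   /* every process receives all p values */

    /* send_count and recv_count both refer to the amount contributed by ONE process */
    MPI_Allgather(&x, 1, MPI_FLOAT, y, 1, MPI_FLOAT, MPI_COMM_WORLD);

    if (my_rank == 0)
        for (int i = 0; i < p; i++)
            printf("y[%d] = %f\n", i, y[i]);

    free(y);
    MPI_Finalize();
    return 0;
}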
Allgather_ring
void Allgather_ring(float x[], int blocksize, float y[], MPI_Comm comm) {
    int i, p, my_rank;
    int successor, predecessor;
    int send_offset, recv_offset;
    MPI_Status status;

    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);
    /* Copy the local block into its slot of the result array. */
    for (i = 0; i < blocksize; i++)
        y[i + my_rank*blocksize] = x[i];
    successor = (my_rank + 1) % p;
    predecessor = (my_rank - 1 + p) % p;
    /* In step i, pass along the block received in the previous step and
       receive the next block from the predecessor. */
    for (i = 0; i < p-1; i++) {
        send_offset = ((my_rank - i + p) % p)*blocksize;
        recv_offset = ((my_rank - i - 1 + p) % p)*blocksize;
        MPI_Send(y + send_offset, blocksize, MPI_FLOAT, successor, 0, comm);
        MPI_Recv(y + recv_offset, blocksize, MPI_FLOAT, predecessor, 0,
                 comm, &status);
    }
}
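A minimal driver for this routine might look like the sketch below; the block size of 1, the per-process values, and the fixed-size result array are assumptions made for the example, and it assumes Allgather_ring above is linked in.

#include <stdio.h>
#include <mpi.h>

void Allgather_ring(float x[], int blocksize, float y[], MPI_Comm comm);

int main(int argc, char* argv[]) {
    int p, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    float x[1] = { (float) my_rank };   /* one block of size 1 per process */
    float y[64];                        /* assumes p <= 64 for this sketch */

    Allgather_ring(x, 1, y, MPI_COMM_WORLD);

    if (my_rank == 0)
        for (int i = 0; i < p; i++)
            printf("y[%d] = %f\n", i, y[i]);

    MPI_Finalize();
    return 0;
}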
Hypercube
• Graph (recursively defined)
• An n-dimensional cube has 2^n nodes, with each node connected to n neighbors
• Binary labels of adjacent nodes differ in exactly one bit
[Figure: 1-, 2-, and 3-dimensional hypercubes with binary node labels (0/1, 00-11, 000-111); labels of adjacent nodes differ in one bit]
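Because adjacent labels differ in exactly one bit, a node's neighbors can be enumerated by XOR-ing its rank with each power of two. The following small sketch (illustrative only, not from the slides) prints the neighbors of a node.

#include <stdio.h>

/* Print the d neighbors of node `rank` in a d-dimensional hypercube.
   Each neighbor is obtained by flipping one of the d low-order bits with XOR. */
void print_neighbors(int rank, int d) {
    for (int i = 0; i < d; i++)
        printf("neighbor across dimension %d: %d\n", i, rank ^ (1 << i));
}

int main(void) {
    print_neighbors(5, 3);   /* node 101 in a 3-cube: neighbors 100, 111, 001 */
    return 0;
}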
Broadcast/Reduce
[Figure: broadcast/reduce on a 3-dimensional hypercube (nodes 000-111); data crosses one dimension of the cube per stage]
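This pattern can be written as d stages of sends and receives, one dimension per stage. The routine below is a common formulation of one-to-all broadcast from rank 0, given here as a sketch; the function name and the use of a single float are our assumptions, not code from the slides.

#include <mpi.h>

/* One-to-all broadcast of a single float from rank 0 over a d-dimensional
   hypercube of 2^d ranks, crossing one dimension per stage (high bit first). */
void hypercube_bcast(float* value, int d, MPI_Comm comm) {
    int my_rank;
    MPI_Status status;
    MPI_Comm_rank(comm, &my_rank);

    unsigned mask = (1u << d) - 1;
    for (int i = d - 1; i >= 0; i--) {
        mask ^= (1u << i);              /* clear bit i: ranks whose low bits are 0 act */
        if ((my_rank & mask) == 0) {
            int partner = my_rank ^ (1 << i);
            if ((my_rank & (1 << i)) == 0)
                MPI_Send(value, 1, MPI_FLOAT, partner, 0, comm);
            else
                MPI_Recv(value, 1, MPI_FLOAT, partner, 0, comm, &status);
        }
    }
}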
Allgather
[Figure/table: allgather on a 3-dimensional hypercube, process i initially holding xi. In each of the three stages, partners exchange everything they have accumulated so far across one dimension (highest bit first), so each process holds 2, then 4, then all 8 of the blocks x0-x7]
Allgather
[Figure: the same exchange pattern drawn on eight processes 0-7; each stage pairs processes across one dimension of the hypercube]
Allgather_cube
void Allgather_cube(float x[], int blocksize, float y[], MPI_Comm comm) {
    int i, d, p, my_rank;
    unsigned eor_bit, and_bits;
    int stage, partner;
    MPI_Datatype hole_type;
    int send_offset, recv_offset;
    MPI_Status status;
    int log_base2(int p);

    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);
    /* Copy the local block into its slot of the result array. */
    for (i = 0; i < blocksize; i++)
        y[i + my_rank*blocksize] = x[i];
    d = log_base2(p);
    eor_bit = 1 << (d-1);
    and_bits = (1 << d) - 1;
    for (stage = 0; stage < d; stage++) {
        partner = my_rank ^ eor_bit;
        send_offset = (my_rank & and_bits) * blocksize;
        recv_offset = (partner & and_bits) * blocksize;
        /* Derived datatype selecting the strided blocks exchanged at this stage. */
        MPI_Type_vector(1 << stage, blocksize, (1 << (d-stage))*blocksize,
                        MPI_FLOAT, &hole_type);
        MPI_Type_commit(&hole_type);
        MPI_Send(y + send_offset, 1, hole_type, partner, 0, comm);
        MPI_Recv(y + recv_offset, 1, hole_type, partner, 0, comm, &status);
        MPI_Type_free(&hole_type);
        eor_bit = eor_bit >> 1;
        and_bits = and_bits >> 1;
    }
}
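Allgather_cube declares but does not show log_base2. A minimal version consistent with how it is used here (p a power of two) might be:

/* Returns log2(p) for p a power of two, e.g. log_base2(8) == 3. */
int log_base2(int p) {
    int d = 0;
    while (p > 1) {
        p >>= 1;
        d++;
    }
    return d;
}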
Buffering Assumption
• The previous code is not safe: it relies on enough system buffer space being available that deadlock does not occur.
• MPI_Sendrecv can be used to guarantee that deadlock does not occur.
SendRecv
• int MPI_Sendrecv(
      void*         send_buf      /* in  */,
      int           send_count    /* in  */,
      MPI_Datatype  send_type     /* in  */,
      int           dest          /* in  */,
      int           send_tag      /* in  */,
      void*         recv_buf      /* out */,
      int           recv_count    /* in  */,
      MPI_Datatype  recv_type     /* in  */,
      int           source        /* in  */,
      int           recv_tag      /* in  */,
      MPI_Comm      communicator  /* in  */,
      MPI_Status*   status        /* out */)
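For example, the exchange loop of Allgather_ring can be rewritten with one combined call, so that MPI pairs each send with its receive and correctness no longer depends on system buffering. The sketch below follows the earlier routine; the name Allgather_ring_sr is ours, not from the slides.

#include <mpi.h>

void Allgather_ring_sr(float x[], int blocksize, float y[], MPI_Comm comm) {
    int i, p, my_rank, successor, predecessor, send_offset, recv_offset;
    MPI_Status status;

    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);
    for (i = 0; i < blocksize; i++)
        y[i + my_rank*blocksize] = x[i];
    successor = (my_rank + 1) % p;
    predecessor = (my_rank - 1 + p) % p;

    for (i = 0; i < p - 1; i++) {
        send_offset = ((my_rank - i + p) % p) * blocksize;
        recv_offset = ((my_rank - i - 1 + p) % p) * blocksize;
        /* One combined call: the send and the receive are paired by MPI. */
        MPI_Sendrecv(y + send_offset, blocksize, MPI_FLOAT, successor,   0,
                     y + recv_offset, blocksize, MPI_FLOAT, predecessor, 0,
                     comm, &status);
    }
}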
SendRecvReplace
• int MPI_Sendrecv_replace(
      void*         buffer        /* in/out */,
      int           count         /* in  */,
      MPI_Datatype  datatype      /* in  */,
      int           dest          /* in  */,
      int           send_tag      /* in  */,
      int           source        /* in  */,
      int           recv_tag      /* in  */,
      MPI_Comm      communicator  /* in  */,
      MPI_Status*   status        /* out */)
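As a usage illustration (not from the slides), the program below swaps one value between even/odd rank pairs with a single in-place call; it assumes an even number of processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int p, my_rank;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Pair even rank 2k with odd rank 2k+1 (assumes p is even). */
    int partner = (my_rank % 2 == 0) ? my_rank + 1 : my_rank - 1;
    float value = (float) my_rank;

    /* The same buffer is sent and then overwritten with the received value. */
    MPI_Sendrecv_replace(&value, 1, MPI_FLOAT, partner, 0, partner, 0,
                         MPI_COMM_WORLD, &status);

    printf("rank %d now holds %f\n", my_rank, value);
    MPI_Finalize();
    return 0;
}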
Nonblocking Send/Recv
• Allows overlap of communication and computation: the call does not wait for the buffer to be copied or for the receive to occur.
• The communication is posted and can be tested later for completion.

• int MPI_Isend(  /* Immediate */
      void*         buffer    /* in  */,
      int           count     /* in  */,
      MPI_Datatype  datatype  /* in  */,
      int           dest      /* in  */,
      int           tag       /* in  */,
      MPI_Comm      comm      /* in  */,
      MPI_Request*  request   /* out */)
Nonblocking Send/Recv
• int MPI_Irecv(
      void*         buffer    /* out */,
      int           count     /* in  */,
      MPI_Datatype  datatype  /* in  */,
      int           source    /* in  */,
      int           tag       /* in  */,
      MPI_Comm      comm      /* in  */,
      MPI_Request*  request   /* out */)

• int MPI_Wait(
      MPI_Request*  request   /* in/out */,
      MPI_Status*   status    /* out */)

• int MPI_Test(MPI_Request* request, int* flag, MPI_Status* status)
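To show how MPI_Test can be used, the sketch below posts a nonblocking receive and polls for completion while other work could be done; the surrounding program is invented for the example and assumes at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, flag = 0;
    float value = 0.0f;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) {
        value = 3.14f;
        MPI_Send(&value, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Irecv(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            /* ... overlap useful computation here ... */
            MPI_Test(&request, &flag, &status);   /* non-blocking completion check */
        }
        printf("received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}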
Allgather_ring (Overlapped)
send_offset = my_rank*blocksize;                    /* first iteration: send own block */
recv_offset = ((my_rank - 1 + p) % p)*blocksize;    /* and receive the predecessor's */
for (i = 0; i < p-1; i++) {
    MPI_Isend(y + send_offset, blocksize, MPI_FLOAT, successor,
              0, comm, &send_request);
    MPI_Irecv(y + recv_offset, blocksize, MPI_FLOAT, predecessor, 0,
              comm, &recv_request);
    /* Compute next iteration's offsets while the messages are in flight. */
    send_offset = ((my_rank - i - 1 + p) % p)*blocksize;
    recv_offset = ((my_rank - i - 2 + p) % p)*blocksize;
    MPI_Wait(&send_request, &status);
    MPI_Wait(&recv_request, &status);
}
Alltoall
• int MPI_Alltoall(
      void*         send_buffer   /* in  */,
      int           send_count    /* in  */,
      MPI_Datatype  send_type     /* in  */,
      void*         recv_buffer   /* out */,
      int           recv_count    /* in  */,
      MPI_Datatype  recv_type     /* in  */,
      MPI_Comm      communicator  /* in  */)

[Figure: MPI_Alltoall as a block transpose among Processes 0-3; each process sends its j-th block to process j and receives one block from every process, so the 4x4 array of blocks (labeled 00-33) is transposed]
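As a usage sketch (not from the slides), the program below has every process send one distinct float to every other process; the buffer names and values are assumptions made for the example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int p, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    float* sendbuf = malloc(p * sizeof(float));
    float* recvbuf = malloc(p * sizeof(float));
    for (int j = 0; j < p; j++)
        sendbuf[j] = 10.0f * my_rank + j;   /* block destined for process j */

    /* After the call, recvbuf[i] holds the block that process i sent to us. */
    MPI_Alltoall(sendbuf, 1, MPI_FLOAT, recvbuf, 1, MPI_FLOAT, MPI_COMM_WORLD);

    for (int i = 0; i < p; i++)
        printf("rank %d received %f from rank %d\n", my_rank, recvbuf[i], i);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}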
AlltoAll
• Sequence of permutations (cyclic shifts), each implemented with MPI_Sendrecv; see the sketch after the figure
[Figure: eight processes 0-7; in step k each process sends the block destined for the process k positions ahead, so each step is a cyclic shift of the data]
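One way to realize this sequence of shifts is the routine below, built from p-1 MPI_Sendrecv calls; the function name alltoall_ring and the block layout (block j of the send array destined for rank j, blocks of `blocksize` floats) are assumptions made for the sketch.

#include <string.h>
#include <mpi.h>

void alltoall_ring(float send[], float recv[], int blocksize, MPI_Comm comm) {
    int p, my_rank;
    MPI_Status status;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);

    /* Keep the block addressed to ourselves. */
    memcpy(recv + my_rank * blocksize, send + my_rank * blocksize,
           blocksize * sizeof(float));

    /* Step k: send to the rank k positions ahead, receive from k positions behind. */
    for (int k = 1; k < p; k++) {
        int dest = (my_rank + k) % p;
        int src  = (my_rank - k + p) % p;
        MPI_Sendrecv(send + dest * blocksize, blocksize, MPI_FLOAT, dest, 0,
                     recv + src  * blocksize, blocksize, MPI_FLOAT, src,  0,
                     comm, &status);
    }
}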
AlltoAll (2 way)
• Sequence of pairwise permutations implemented with MPI_Sendrecv; in step k partners are paired by rank XOR k (see the sketch after the figure)
[Figure: eight processes 0-7; in step k each process exchanges blocks with partner rank XOR k, so every step is a two-way (pairwise) exchange]
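A corresponding sketch of the 2-way variant is below; the name alltoall_pairwise, the assumption that p is a power of two, and the block layout (same as the one-way sketch above) are ours.

#include <string.h>
#include <mpi.h>

void alltoall_pairwise(float send[], float recv[], int blocksize, MPI_Comm comm) {
    int p, my_rank;
    MPI_Status status;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);

    /* Keep the block addressed to ourselves. */
    memcpy(recv + my_rank * blocksize, send + my_rank * blocksize,
           blocksize * sizeof(float));

    /* Step k: both sides of each pair compute the same partner via XOR. */
    for (int k = 1; k < p; k++) {
        int partner = my_rank ^ k;
        MPI_Sendrecv(send + partner * blocksize, blocksize, MPI_FLOAT, partner, 0,
                     recv + partner * blocksize, blocksize, MPI_FLOAT, partner, 0,
                     comm, &status);
    }
}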
Communication Modes
• Synchronous (MPI_Ssend): the send completes only after the matching receive has started
• Ready (MPI_Rsend): the matching receive must already be posted
• Buffered (MPI_Bsend): the user provides the buffer space (see the sketch below)
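As an illustration of buffered mode, the sketch below attaches user buffer space with MPI_Buffer_attach and sends with MPI_Bsend; it assumes at least two processes, and the buffer size is chosen only for the example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, bufsize;
    float value;
    char* buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* User-provided buffer space; MPI_Bsend can then complete locally. */
    bufsize = sizeof(float) + MPI_BSEND_OVERHEAD;
    buffer = malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);

    if (my_rank == 0) {
        value = 1.0f;
        MPI_Bsend(&value, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Recv(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %f\n", value);
    }

    MPI_Buffer_detach(&buffer, &bufsize);
    free(buffer);
    MPI_Finalize();
    return 0;
}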