Hypercube Prefix Sum
Design and Analysis of Parallel Algorithms
Due: Friday April 15 at 10:00
[Front-page figure: panels (A) through (D) illustrate the hypercube prefix sum on an eight-node hypercube; each node is labelled with the contents of its res buffer in [·] and its msg buffer in (·).]
1 Introduction
Parallel algorithms often exhibit communication patterns that involve more than two processes. These are known as collective communication operations, and a large number of MPI functions are devoted to this class of operations. The prototypical example is the (one-to-all) broadcast operation, in which one process (the so-called root) sends a single message M of size m to all other processes. One can naively implement the broadcast using p − 1 point-to-point communication operations from the root to each of the p − 1 remaining processes. However, the running time of this naive approach is Θ(p(t_s + t_w m)), where t_s + t_w m denotes the time required for a point-to-point communication of a message of size m under the standard linear model for communication.
A much more efficient algorithm for the broadcast operation is based on the so-called recursive doubling technique. The algorithm consists of ⌈log_2 p⌉ phases. Prior to phase k, a subset S_k of the p processes has a local copy of the message M. Initially, S_1 = {0} (for simplicity, we assume that the root is process 0). In phase k, each process in S_k sends the message M to some (unique) process not in S_k. Note that in this way, we have #S_{k+1} = 2 #S_k (hence the name recursive doubling). The choice of peers in each phase depends on the network topology, but a common choice is to let process i send to process i + p/2^k in phase k (for simplicity, we assume that p = 2^d for some integer d). The time complexity of this algorithm is Θ(log_2 p (t_s + t_w m)). It is important to emphasize at this point the rather big differences between p, √p, and log_2 p (see Table 1).
 a = p    b = √p    c = log_2 p    a/b       a/c      b/c
    16         4              4      4         4        1
    64         8              6      8      10.7      1.3
   256        16              8     16        32        2
  1024        32             10     32     102.4      3.2
  4096        64             12     64     341.3      5.3
 16384       128             14    128    1170.3      9.1

Table 1: The differences between p, √p, and log_2 p.
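To make the recursive doubling technique concrete, the following is a rough sketch (not part of the assignment) of the broadcast described above, assuming p = 2^d processes, root 0, and a single int as the message:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int msg = (rank == 0) ? 42 : 0;      /* only the root holds the message */

    /* Phase k sends over a distance of step = p / 2^k; the senders are the */
    /* processes that already hold the message, i.e. multiples of 2 * step. */
    for (int step = p / 2; step >= 1; step /= 2) {
        if (rank % (2 * step) == 0)
            MPI_Send(&msg, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD);
        else if (rank % (2 * step) == step)
            MPI_Recv(&msg, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    printf("rank %d has the message %d\n", rank, msg);
    MPI_Finalize();
    return 0;
}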
There are, besides the broadcast, a number of other frequently occurring collective communication operations. Some of the more common ones are listed in Table 2. In this assignment,
you will implement and evaluate the prefix sum operation.
One-to-all broadcast        All-to-one reduction
All-to-all broadcast        All-to-all reduction
All-reduce                  Prefix sum
Scatter                     Gather
All-to-all personalized

Table 2: Common collective communication operations.
2 Prefix sum
A hypercube prefix sum algorithm can be implemented very similarly to the all-to-all broadcast
on a hypercube. The algorithm is outlined in Algorithm 1.
Algorithm 1 Hypercube prefix sum
Require: An associative binary operator ⊕ and p = 2^d processes.
Ensure: Computes the partial reductions data_0 ⊕ · · · ⊕ data_k for k = 0, 1, . . . , p − 1, where p is the number of processes and data_k denotes the data initially stored on process k, and stores the k-th partial reduction on process k.
 1: Let me denote the rank (between 0 and p − 1) of the calling process.
 2: msg ← data_me
 3: res ← data_me
 4: for k = 0, 1, . . . , d − 1 do
 5:     partner ← me XOR 2^k
 6:     Send msg to partner and at the same time receive data from partner
 7:     msg ← msg ⊕ data
 8:     if partner < me then
 9:         res ← res ⊕ data
10:     end if
11: end for
12: Return res
See the figure on the front page for an illustration of the hypercube prefix sum algorithm
on an eight-node hypercube. The contents of the msg buffers are shown in (·) and the contents
of the res buffers are shown in [·].
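For concreteness, a minimal MPI sketch of Algorithm 1 is given below for the setting used in this assignment (one int per process, ⊕ = max). It assumes p = 2^d and uses the rank itself as dummy input data; the required implementation must, in addition, handle arbitrary numbers of processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int me, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int msg = me;                /* data_me: the rank serves as dummy input */
    int res = msg;

    /* One phase per bit of the rank, i.e. d = log_2 p phases (p = 2^d). */
    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = me ^ mask;                 /* me XOR 2^k */
        int data;
        MPI_Sendrecv(&msg, 1, MPI_INT, partner, 0,
                     &data, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (data > msg) msg = data;              /* msg <- msg ⊕ data (⊕ = max) */
        if (partner < me && data > res)          /* only lower-ranked partners   */
            res = data;                          /* contribute to the prefix sum */
    }

    printf("rank %d: prefix max = %d\n", me, res);
    MPI_Finalize();
    return 0;
}

Note that MPI_Sendrecv realizes the simultaneous exchange of line 6 in a single call, which avoids the deadlock that two unbuffered blocking sends could otherwise cause.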
3 Instructions
The aim is to implement, model, and evaluate the hypercube prefix sum algorithm. The
focus is on modelling and evaluating the performance. Specifically, you should at least do the
following:
1. Implement the hypercube prefix sum algorithm using point-to-point operations in MPI.
You must support arbitrary numbers of processes (not only p = 2^d).
2. Verify the correctness of your implementation by comparing against the corresponding
MPI function MPI_Scan (a minimal sketch of such a check is given after this list).
3. Derive an analytical model of the parallel runtime of your implementation.
4. Perform experiments on varying numbers of processes and varying message sizes.
5. Approximate the communication parameters t_s and t_w by fitting your model to the
empirical data.
6. Compare the performance of your implementation to that of MPI_Scan.
Assume that the data type is int (MPI_INT) and that the reduction operation is max (MPI_MAX).
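As an illustration of step 2, the following fragment compares your result against MPI_Scan with MPI_MAX. The variable names my_value (the local input), my_prefix (the result of your implementation), and rank are placeholders for your own buffers, e.g. inside a program structured like the sketch in Section 2:

/* Reference result computed by the library; must match my_prefix on every rank. */
int reference;
MPI_Scan(&my_value, &reference, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
if (reference != my_prefix)
    fprintf(stderr, "rank %d: mismatch (got %d, expected %d)\n",
            rank, my_prefix, reference);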