Figure 9.1 A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai, aj}, and Pj keeps max{ai, aj}.
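A minimal sketch, in Python, of what the compare-exchange of Figure 9.1 computes. The two "processes" are simulated sequentially; the function simply returns the value kept by each of them.

def compare_exchange(a_i, a_j):
    # P_i keeps the smaller of the two elements, P_j keeps the larger.
    return min(a_i, a_j), max(a_i, a_j)

print(compare_exchange(9, 4))   # -> (4, 9)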
Figure 9.2 A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements. (The two blocks in the example are ⟨1, 6, 8, 11, 13⟩ and ⟨2, 7, 9, 10, 12⟩.)
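A sketch of the compare-split operation, again as a sequential simulation. It assumes both blocks are already sorted, as the operation requires, and uses the two blocks of the example.

def compare_split(block_i, block_j):
    # The merged sequence (computed here by sorting the concatenation);
    # P_i keeps the smaller half, P_j keeps the larger half.
    merged = sorted(block_i + block_j)
    half = len(block_i)
    return merged[:half], merged[half:]

print(compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12]))
# -> ([1, 2, 6, 7, 8], [9, 10, 11, 12, 13])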
Figure 9.3 A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator. (In (a) the outputs are x = min{x, y} and y = max{x, y}; in (b) they are x = max{x, y} and y = min{x, y}.)
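A sketch of the two comparator types of Figure 9.3, with the direction given as a flag: the increasing (⊕) comparator places the minimum on its first output, and the decreasing (⊖) comparator places the maximum there.

def comparator(x, y, increasing=True):
    # Returns the pair (upper output, lower output) of the comparator.
    lo, hi = (x, y) if x <= y else (y, x)
    return (lo, hi) if increasing else (hi, lo)

print(comparator(7, 3), comparator(7, 3, increasing=False))   # -> (3, 7) (7, 3)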
Figure 9.4 A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of comparators connected in parallel. (The figure labels the input wires, the columns of comparators forming the interconnection network, and the output wires.)
Figure 9.5 Merging a 16-element bitonic sequence through a series of log 16 bitonic splits. (The figure shows the original sequence ⟨3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0⟩ and the result after each of the four splits.)
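A sketch of the bitonic merge that Figure 9.5 performs, assuming the input length is a power of two and the input sequence is bitonic; each level of recursion corresponds to one bitonic split.

def bitonic_merge(seq):
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    # One bitonic split: element-wise minima go to the lower half,
    # maxima to the upper half; both halves are again bitonic.
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    return bitonic_merge(lo) + bitonic_merge(hi)

# The 16-element bitonic sequence from Figure 9.5:
print(bitonic_merge([3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]))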
Figure 9.6 A bitonic merging network for n = 16. The input wires are numbered 0, 1, ..., n − 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order.
Figure 9.7 A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, ⊕BM[k] and ⊖BM[k] denote bitonic merging networks of input size k that use ⊕ and ⊖ comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example, n = 16.
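A sketch of the full bitonic sort that Figure 9.7 describes: the two halves of the input are recursively sorted in opposite directions (the role of the ⊕BM and ⊖BM stages), which yields a bitonic sequence, and a final directed merge sorts it. The code assumes the input length is a power of two.

def bitonic_sort(seq, ascending=True):
    if len(seq) == 1:
        return seq
    half = len(seq) // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    bitonic = (bitonic_sort(seq[:half], True)
               + bitonic_sort(seq[half:], False))
    return directed_merge(bitonic, ascending)

def directed_merge(seq, ascending):
    if len(seq) == 1:
        return seq
    half = len(seq) // 2
    for i in range(half):
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return (directed_merge(seq[:half], ascending)
            + directed_merge(seq[half:], ascending))

print(bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]))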
Figure 9.8 The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence. In contrast to Figure 9.6, the columns of comparators in each bitonic merging network are drawn in a single box, separated by a dashed line.
Figure 9.9 Communication during the last stage of bitonic sort. Each wire is mapped to a hypercube process; each connection represents a compare-exchange between processes.
Figure 9.10 Communication characteristics of bitonic sort on a hypercube. During each stage of the algorithm, processes communicate along the dimensions shown.
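A sketch, as a sequential simulation, of the communication pattern of Figures 9.9 and 9.10 for the one-element-per-process case (p processes, assumed to be a power of two): in stage i a process performs compare-exchanges with the partners that differ from it in bit positions i − 1, i − 2, ..., 0, and the direction of each comparison is fixed by bit i of the process label.

def bitonic_sort_hypercube(elements):
    p = len(elements)                       # one element per process
    d = p.bit_length() - 1                  # hypercube dimension
    for i in range(1, d + 1):               # stages 1, ..., d
        for j in range(i - 1, -1, -1):      # dimensions i-1, ..., 0
            for rank in range(p):
                partner = rank ^ (1 << j)
                if rank < partner:
                    # Bit i of the label decides increasing vs. decreasing.
                    ascending = ((rank >> i) & 1) == 0
                    a, b = elements[rank], elements[partner]
                    if (a > b) == ascending:
                        elements[rank], elements[partner] = b, a
    return elements

print(bitonic_sort_hypercube([10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]))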
Figure 9.11 Different ways of mapping the input wires of the bitonic sorting network to a mesh of processes: (a) row-major mapping, (b) row-major snakelike mapping, and (c) row-major shuffled mapping.
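A sketch of the first two mappings of Figure 9.11 for a mesh with a given number of columns; the row-major shuffled mapping of panel (c) additionally shuffles (interleaves) the bits of the wire label before placing it on the mesh, and is omitted here.

def row_major(wire, cols):
    # Wire i goes to row i // cols, column i % cols.
    return wire // cols, wire % cols

def row_major_snakelike(wire, cols):
    # Odd-numbered rows are traversed right to left.
    row, col = wire // cols, wire % cols
    return (row, col) if row % 2 == 0 else (row, cols - 1 - col)

print(row_major(6, 4), row_major_snakelike(6, 4))   # -> (1, 2) (1, 1)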
Figure 9.12 The last stage of the bitonic sort algorithm for n = 16 on a mesh, using the row-major shuffled mapping. During each step, process pairs compare-exchange their elements. Arrows indicate the pairs of processes that perform compare-exchange operations.
Figure 9.13 Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared. (The unsorted input in the figure is ⟨3, 2, 3, 8, 5, 6, 4, 1⟩; eight alternating odd and even phases produce the sorted sequence.)
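A sequential sketch of odd-even transposition sort as traced in Figure 9.13: n phases alternate between the odd pairs (a1, a2), (a3, a4), ... and the even pairs (a2, a3), (a4, a5), ..., each phase swapping out-of-order pairs.

def odd_even_transposition_sort(a):
    n = len(a)
    for phase in range(1, n + 1):
        # Odd phases start at the first element, even phases at the second.
        start = 0 if phase % 2 == 1 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))   # the input of Figure 9.13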
Figure 9.14 An example of the first phase of parallel shellsort on an eight-process array.
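A sketch of the first phase of parallel shellsort suggested by Figure 9.14. The pairing pattern is an assumption based on the figure: in each of the log p steps, every contiguous group of processes is folded so that a process compare-splits with its mirror image in the group, and the lower-numbered process keeps the smaller half. Each entry of blocks is one process's sorted block.

def shellsort_first_phase(blocks):
    p = len(blocks)
    group = p
    while group > 1:
        for base in range(0, p, group):
            for i in range(group // 2):
                lo, hi = base + i, base + group - 1 - i     # mirror pair
                merged = sorted(blocks[lo] + blocks[hi])
                cut = len(blocks[lo])
                blocks[lo], blocks[hi] = merged[:cut], merged[cut:]
        group //= 2
    return blocks

print(shellsort_first_phase([[9, 15], [3, 11], [8, 16], [1, 5],
                             [12, 14], [2, 7], [10, 13], [4, 6]]))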
Figure 9.15 Example of the quicksort algorithm sorting a sequence of size n = 8. (Panel (a) shows the input ⟨3, 2, 1, 5, 8, 4, 3, 7⟩; at each step the pivot and its final position are marked.)
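A generic sequential quicksort sketch of the kind Figure 9.15 traces. The pivot choice and partitioning scheme below (last element, Lomuto-style partition) are assumptions and differ in detail from the figure, but the invariant is the same: each partitioning step places the pivot in its final position.

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[hi]                           # pivot choice is an assumption
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]               # the pivot reaches its final position
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)
    return a

print(quicksort([3, 2, 1, 5, 8, 4, 3, 7]))  # the sequence of Figure 9.15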
Figure 9.16 A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.
Figure 9.17 The execution of the PRAM algorithm on the array ⟨33, 21, 13, 54, 82, 33, 40, 72⟩ shown in (a). The arrays leftchild and rightchild are shown in (c), (d), and (e) as the algorithm progresses. Figure (f) shows the binary tree constructed by the algorithm. Each node is labeled by the process (in square brackets) and the element stored at that process (in curly brackets); this element is the pivot. In each node, processes with smaller elements than the pivot are grouped on the left side of the node, and those with larger elements are grouped on the right side. These two groups form the two partitions of the original array. For each partition, a pivot element is selected at random from the two groups that form the children of the node.
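A sequential simulation sketch of the binary-tree construction performed by the PRAM quicksort of Figure 9.17. In the actual algorithm each group of processes races to write its element as the group's pivot and the winner becomes the node; here a random choice stands in for that race. An in-order traversal of the resulting tree gives the sorted order.

import random

def build_tree(a):
    leftchild = [None] * len(a)
    rightchild = [None] * len(a)

    def split(indices):
        if not indices:
            return None
        root = random.choice(indices)       # stands in for the concurrent-write race
        smaller = [i for i in indices if i != root and a[i] <= a[root]]
        larger = [i for i in indices if i != root and a[i] > a[root]]
        leftchild[root] = split(smaller)
        rightchild[root] = split(larger)
        return root

    return split(list(range(len(a)))), leftchild, rightchild

def inorder(node, leftchild, rightchild, a):
    if node is None:
        return []
    return (inorder(leftchild[node], leftchild, rightchild, a)
            + [a[node]]
            + inorder(rightchild[node], leftchild, rightchild, a))

data = [33, 21, 13, 54, 82, 33, 40, 72]     # the array of Figure 9.17(a)
root, lc, rc = build_tree(data)
print(inorder(root, lc, rc, data))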
Figure 9.18 An example of the execution of an efficient shared-address-space quicksort algorithm. (The figure traces four steps of pivot selection, local rearrangement, and global rearrangement across five processes P0 through P4.)
Figure 9.19 Efficient global rearrangement of the array. (The figure shows each process's counts |Si| and |Li| together with their prefix sums, which give the offsets at which each process places its smaller and larger elements.)
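A sketch of the global rearrangement of Figure 9.19, written sequentially. The assumption is that each process has already split its local block into Si (elements belonging to the smaller partition) and Li (the rest); the running offsets below play the role of the prefix sums over the |Si| and |Li|.

def global_rearrange(smaller, larger):
    # smaller[i] and larger[i] are the two pieces of process i's block.
    total_smaller = sum(len(s) for s in smaller)
    result = [None] * (total_smaller + sum(len(l) for l in larger))
    s_off, l_off = 0, total_smaller         # prefix sums give these offsets
    for s, l in zip(smaller, larger):
        result[s_off:s_off + len(s)] = s
        result[l_off:l_off + len(l)] = l
        s_off += len(s)
        l_off += len(l)
    return result

print(global_rearrange([[2, 1], [3], [5, 4]], [[9, 7], [8], [6]]))
# -> [2, 1, 3, 5, 4, 9, 7, 8, 6]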
Figure 9.20 An example of the execution of sample sort on an array with 24 elements on three processes. (The steps shown are the initial element distribution, local sort and sample selection, sample combining, global splitter selection, and the final element assignment.)
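A sequential sketch of sample sort as illustrated in Figure 9.20. The sampling and splitter-selection rules below (p − 1 evenly spaced elements from each sorted block, then every (p − 1)-th element of the combined sorted sample) are one reasonable reading of the figure and are assumptions; they presuppose that each block holds at least p elements.

def sample_sort(blocks):
    p = len(blocks)
    blocks = [sorted(b) for b in blocks]                 # local sort
    sample = []
    for b in blocks:
        step = len(b) // p
        sample.extend(b[step::step][:p - 1])             # p - 1 local samples
    sample.sort()
    splitters = sample[p - 1::p - 1][:p - 1]             # p - 1 global splitters
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            buckets[sum(x > s for s in splitters)].append(x)
    return [sorted(bucket) for bucket in buckets]        # final per-process sort

print(sample_sort([[22, 7, 13, 18, 2, 17, 1, 14],
                   [20, 6, 10, 24, 15, 9, 21, 3],
                   [16, 19, 23, 4, 11, 12, 5, 8]]))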
Figure 9.21 The execution of the hypercube formulation of quicksort for d = 3. The three splits, one along each communication link, are shown in (a), (b), and (c): (a) split along the third dimension, which partitions the sequence into two big blocks, one smaller and one larger than the pivot; (b) split along the second dimension, which partitions each subblock into two smaller subblocks; (c) split along the first dimension, after which the elements are sorted according to the global ordering imposed by the processors' labels onto the hypercube. The second column represents the partitioning of the n-element sequence into subcubes. The arrows between subcubes indicate the movement of larger elements. Each box is marked by the binary representation of the process labels in that subcube. A ∗ denotes that all the binary combinations are included.
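A sequential simulation sketch of the hypercube quicksort of Figure 9.21 for p processes (a power of two), one block per process. In each split the processes of a subcube agree on a pivot (the choice below is an arbitrary element and is an assumption), and each pair of partners along the current dimension exchanges elements so that the lower-labeled process keeps the elements no larger than the pivot. After d splits a local sort finishes the job, and concatenating the blocks in label order yields the sorted sequence.

def hypercube_quicksort(blocks):
    p = len(blocks)
    d = p.bit_length() - 1                  # p is assumed to be a power of two
    for k in range(d - 1, -1, -1):          # split along dimension d-1, ..., 0
        subcube = 1 << (k + 1)
        for base in range(0, p, subcube):
            group = [x for r in range(base, base + subcube) for x in blocks[r]]
            if not group:
                continue
            pivot = group[len(group) // 2]  # arbitrary pivot choice (assumption)
            for r in range(base, base + subcube // 2):
                partner = r | (1 << k)
                both = blocks[r] + blocks[partner]
                blocks[r] = [x for x in both if x <= pivot]
                blocks[partner] = [x for x in both if x > pivot]
    return [sorted(b) for b in blocks]      # final local sort

print(hypercube_quicksort([[7, 2], [13, 18], [5, 1], [20, 6],
                           [10, 15], [9, 3], [16, 19], [4, 11]]))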
Figure 9.22 (a) An arbitrary portion of a mesh that holds part of the sequence to be sorted at some point during the execution of quicksort, and (b) a binary tree embedded into the same portion of the mesh.