Journal of Parallel and Distributed Computing 62, 656–668 (2002)
doi:10.1006/jpdc.2001.1808, available online at http://www.idealibrary.com

Partitioned Parallel Radix Sort 1
Shin-Jae Lee, Minsoo Jeon, and Dongseung Kim
Department of Electrical Engineering, Korea University, Seoul 136-701, Korea
E-mail: [email protected]
and
Andrew Sohn
Department of Computer and Information Science, New Jersey Institute of Technology,
Newark, New Jersey 07102-1982
E-mail: [email protected]
Received August 28, 2000; revised June 18, 2001; accepted August 6, 2001
Load balanced parallel radix sort solved the load imbalance problem
present in parallel radix sort. By redistributing the keys in each round of
the radix sort, each processor holds exactly the same number of keys, thereby
reducing the overall sorting time. Load balanced radix sort is currently known
as the fastest internal sorting method for distributed-memory multiprocessors.
However, once the computation time is balanced, the communication time caused
by key redistribution emerges as the bottleneck of the overall sorting
performance. We present in this report a new parallel radix sort, called
partitioned parallel radix sort, that solves the communication problem of
balanced radix sort. The new method reduces the communication time by
eliminating the redistribution steps. The keys are first sorted in a top-down
fashion (left-to-right, as opposed to right-to-left) using a number of the
most significant bits. Once the keys are localized to each processor, the rest
of the sorting is confined within that processor, eliminating the need for
global redistribution of keys. This enables well-balanced communication and
computation across processors. The proposed method has been implemented on
three different distributed-memory platforms: IBM SP2, Cray T3E, and a PC
cluster. Experimental results with various key distributions indicate that
partitioned parallel radix sort indeed shows significant improvements over
balanced radix sort: 13% to 30% improvement in execution time on IBM SP2,
20% to 100% on Cray/SGI T3E, and over a 2.4-fold improvement on the PC
cluster. © 2002 Elsevier Science (USA)
1 The work was partially supported by KRF (1999-E00287), KOSEF (985-0900-003-2), STEPI (97NF0304-A-01), and NSF (INT-9722545). A preliminary version of this paper was presented at the Third International Symposium on High Performance Computing, Tokyo, Japan, October 2000.
Key Words: parallel sorting; radix sort; distributed-memory machines; load
balancing.
1. INTRODUCTION
Sorting is one of the fundamental problems in computer science. Its use can be
found almost everywhere, in scientific as well as nonnumeric computation [11, 12].
Sorting a given number of keys is often used to benchmark parallel computers or
to judge the performance of a specific algorithm run on the same parallel machine.
Serial sorts typically need O(N log N) time, and the time becomes significant as
the number of keys becomes large. Because of its importance, numerous parallel sorting algorithms have been
developed to reduce the overall sorting time, including bitonic sort [1, 7, 8], sample
sort [4, 5], and column sort [10]. In general, parallel sorts consist of multiple
rounds of serial sort, called local sort, performed in each processor in parallel,
followed by movement of keys among processors, called the redistribution step [6].
Local sort and data redistribution may be interleaved and iterated a few times
depending on the algorithms used. The time spent in local sort depends on the
number of keys. Parallel sort time is the sum of the times of local sort and the times
for data redistribution in all rounds. To make the sort fast, it is important to
distribute the keys as evenly as possible throughout the rounds, since the execution
time is determined by the most heavily loaded processor in each round [5, 14]. If a
parallel sort keeps its workload perfectly balanced in each round, there is no further
improvement to be gained in that part of the time. The communication time, however,
varies depending on the data redistribution scheme (e.g., all-to-all, one-to-many,
many-to-one), the amount of data, the frequency of communication (e.g., many short
messages or a few long messages), and the network topology (hypercube, mesh, fat
tree) [3, 11]. It has been reported that for a large number of keys the communication
time occupies a great portion of the sorting time [3, 15]. Load balanced parallel
radix sort [14] (abbreviated LBR) reduces the execution time by perfectly balancing
the load among processors in every round. Partitioned parallel radix sort (PPR),
proposed in this paper, further improves the performance by reducing the multiple
rounds of data redistribution to one. While PPR may introduce slight load imbalance
among processors because its key partitioning is not perfect, the overall gain can be
significant since it substantially reduces the communication time. The purpose of
this report is precisely to introduce this new algorithm, which features both
balanced computation and balanced communication.
The paper is organized as follows. Section 2 briefly explains balanced parallel
radix sort and identifies its deficiency in terms of communication. Section 3 presents
a new partitioned parallel radix sort and gives an analytical view of the new algorithm. Section 4 lists the experimental results of the algorithm on three different
distributed-memory parallel machines including SP2, T3E, and PC cluster. The last
section concludes this report.
2. PARALLEL RADIX SORT
Radix sort is a simple yet very efficient sorting method that outperforms many
well-known comparison-based algorithms for certain types of keys such as integers.
Suppose N keys are initially distributed evenly over P processors so that there are
n = N/P keys per processor. When the sort completes, we expect all keys to be ordered
according to the ranks of the processors P_0, P_1, ..., P_{P−1}, and the keys within
each processor to be sorted as well. Serial radix sort is implemented in two different
ways: radix exchange sort and straight radix sort [13]. Since parallel radix sorts are
typically derived from a serial radix sort, we first present the serial versions,
followed by the parallel ones. We define some symbols used later in this paper as
listed below:
• b is the number of bits of an integer key, so that a key is represented as (i_{b−1} i_{b−2} ··· i_1 i_0).
• g is the number of consecutive bits of a key scanned in each round.
• r = ⌈b/g⌉ is the number of rounds each key goes through.
Radix exchange sort generates and maintains an ordered queue. Initially, it reads
the g least significant bits (i.e., i_{g−1} i_{g−2} ··· i_1 i_0) of each key and stores
the key in a new queue at the location determined by those g bits. Once all keys have
been examined and placed in the new queue, the round completes, and the keys are
ordered according to their least significant g bits. The following round scans the
next g bits (i_{2g−1} i_{2g−2} ··· i_{g+1} i_g) to order the keys in a new queue. The
same operations are performed in the subsequent rounds, with keys moving back and
forth between queues. After r rounds, all bits have been scanned and the sort completes.
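To make the per-round digit scan concrete, the following C sketch performs one round as a stable counting sort on a g-bit digit and iterates it r = ⌈b/g⌉ times. It is a minimal illustration assuming unsigned integer keys; the function names and the counting-sort formulation are our choices, not the code used in the paper's experiments.

    #include <stdlib.h>
    #include <string.h>

    /* One round of the serial sort: a stable counting sort of n keys on the
     * g-bit digit that starts at bit position 'shift'; dst must hold n keys. */
    void radix_round(const unsigned *src, unsigned *dst, size_t n, int g, int shift)
    {
        size_t M = (size_t)1 << g;                 /* number of bins, 2^g       */
        size_t *count = calloc(M + 1, sizeof *count);

        for (size_t i = 0; i < n; i++)             /* histogram of the digit    */
            count[((src[i] >> shift) & (M - 1)) + 1]++;
        for (size_t j = 1; j <= M; j++)            /* prefix sums = bin offsets */
            count[j] += count[j - 1];
        for (size_t i = 0; i < n; i++)             /* stable placement          */
            dst[count[(src[i] >> shift) & (M - 1)]++] = src[i];

        free(count);
    }

    /* Full serial sort: r = ceil(b/g) rounds, least significant digit first. */
    void radix_sort(unsigned *keys, unsigned *tmp, size_t n, int b, int g)
    {
        for (int shift = 0; shift < b; shift += g) {
            radix_round(keys, tmp, n, g, shift);
            memcpy(keys, tmp, n * sizeof *keys);
        }
    }

For b = 32 and g = 8, this performs r = 4 rounds, matching the round count used throughout the paper.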
The main idea of load balanced radix sort (LBR) is to spill keys from any processor
that holds more than N/P keys to its neighboring processors. Each processor first
obtains the bin counts of all the keys locally stored by scanning g bits. The keys are
then put into appropriate buckets by re-scanning them. These two steps are local
operations, involving no communication between processors. An all-to-all transpose
operation is performed across all processors to find the global bin count. Each
processor now has the bin counts of all processors. The resulting transposed bin
count allows each and every processor to compute which processors get exactly how
many keys from what bins and what processors, to make the load balanced.
Overloaded processors will now be able to spill keys to their immediate neighbors.
Keys will move after all the bins and their keys are located in the global processor
space. A round therefore requires an all-to-all transpose operation. For 32-bit
integers with the radix of 8 bits, balanced radix sort needs four all-to-all transpose
operations of bin counts. LBR is reported to outperform the fastest parallel sorts by
up to 100% in execution time [14]. LBR, however, requires data redistribution across
processors in every round and thus spends a considerable amount of time in
communication.
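The all-to-all transpose of bin counts performed in each round can be realized with a standard MPI collective. The sketch below is an assumed illustration: the helper name and the use of MPI_Allgather (so that every processor ends up with every processor's histogram) are our choices, not taken from the LBR implementation in [14].

    #include <mpi.h>

    /* Exchange per-bucket key counts so that every processor learns how many
     * keys each processor holds in each of the M = 2^g buckets.
     * local_count[j]    : number of local keys whose digit equals j
     * all_counts[p*M+j] : count of bucket j on processor p, after the call */
    void transpose_bin_counts(const int *local_count, int *all_counts,
                              int M, MPI_Comm comm)
    {
        /* Every processor contributes its full histogram to all others. */
        MPI_Allgather(local_count, M, MPI_INT,
                      all_counts, M, MPI_INT, comm);
    }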
Straight radix sort initially uses M = 2^g buckets (first-level buckets) instead of
the ordered queues. It first bucket-sorts [13] using the g most significant bits
(i_{b−1} i_{b−2} ··· i_{b−g}) of each key. The bucket sort places each key into the
bucket whose index corresponds to those g bits, so keys with the same g bits gather
in the same bucket. In the second round, the keys in each bucket are bucket-sorted
again using the next g most significant bits (i_{b−g−1} i_{b−g−2} ··· i_{b−2g}),
generating M new second-level buckets, called subbuckets, per bucket. The remaining
rounds proceed in the same manner. In this scheme keys never leave the upper-level
bucket where they were placed in a previous round. One serious problem with the
scheme is that the total number of buckets (subbuckets) grows exponentially, and
there may be many buckets with few keys, wasting a lot of memory if not implemented
carefully.
In our parallel implementation, the first round is exactly the same as in the serial
straight radix sort. Then, according to the global histogram of the bucket populations
(key counts), each processor is assigned, and put in charge of, only a few consecutive
buckets obtained in the first round. Buckets of keys are then exchanged among
processors according to their indices, so that keys with the same g most significant
bits are collected from all processors into one processor. In the remaining r − 1
rounds, the bucket sort continues locally by radix exchange sort using the remaining
b − g bits. No further data exchange is done across processors. If the keys are evenly
distributed among buckets, each processor will hold M/P buckets on average. However,
some processors may be allocated buckets containing many keys while others receive
few, depending on the distribution characteristics of the keys. This static/naive
partitioning of keys may cause severe load imbalance among processors. PPR solves
this problem as described in the next section [9].
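A bucket exchange of this kind (which also appears in Phase I of PPR in the next section) maps naturally onto MPI_Alltoall/MPI_Alltoallv. The following sketch is an assumed illustration under the convention that the keys have already been grouped by destination processor; it is not the authors' code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Send every key to the processor that owns its bucket. send_keys must
     * already be grouped by destination, with send_count[p] keys destined
     * for processor p. Returns the received keys; *recv_total is their count. */
    unsigned *exchange_keys(const unsigned *send_keys, const int *send_count,
                            int P, MPI_Comm comm, int *recv_total)
    {
        int *recv_count = malloc(P * sizeof *recv_count);
        int *sdisp = malloc(P * sizeof *sdisp);
        int *rdisp = malloc(P * sizeof *rdisp);

        /* Everyone learns how many keys it will receive from everyone else. */
        MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, comm);

        sdisp[0] = rdisp[0] = 0;
        for (int p = 1; p < P; p++) {
            sdisp[p] = sdisp[p - 1] + send_count[p - 1];
            rdisp[p] = rdisp[p - 1] + recv_count[p - 1];
        }
        *recv_total = rdisp[P - 1] + recv_count[P - 1];

        unsigned *recv_keys = malloc(*recv_total * sizeof *recv_keys);
        MPI_Alltoallv(send_keys, send_count, sdisp, MPI_UNSIGNED,
                      recv_keys, recv_count, rdisp, MPI_UNSIGNED, comm);

        free(recv_count); free(sdisp); free(rdisp);
        return recv_keys;
    }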
3. PARTITIONED PARALLEL RADIX SORT
Assume that we use only M = 2^g buckets per processor throughout the sort. B_ij
denotes bucket j in processor P_i.
3.1. The Algorithm
PPR consists of two phases: key partitioning and local sort. PPR needs r = ⌈b/g⌉
rounds in all. Details are given below.
I. Key Partitioning. Each processor bucket-sorts its keys using the g most
significant bits. From now on, the leftmost g bits of a key are referred to as its
most significant digit (MSD). Each key is placed into the appropriate bucket; that
is, processor P_k stores a key into bucket B_kj, where j corresponds to the MSD of
the key. At the end of the bucket sort, all keys are partially sorted locally with
respect to their MSDs: the first bucket holds the smallest keys, the second the next
smallest, and so on, with the last bucket holding the largest. Then an all-to-all
transpose of key counts is performed to find a global key distribution map (which
corresponds to building a histogram of the keys over all processors), as follows and
as illustrated in Fig. 1.

For all j = 0, 1, ..., M − 1, the key counts of B_kj are added up to obtain G_j, the
global count of keys in buckets B_kj across processors P_k (k = 0, 1, ..., P − 1).
Then prefix sums of the global key counts G_j are computed.
FIG. 1. Local and global key count maps and bucket partitioning. (a) Bucket counts in individual
processors. (b) Global bucket count and the partitioning map.
Let us consider hypothetical buckets, called global buckets, GB_j, each of which is
the collection of the jth buckets B_kj from all processors P_0, P_1, ..., P_{P−1}.
Then G_j is the key count of global bucket GB_j. Taking into account the prefix sums
and the average number of keys per processor (n = N/P), the global buckets are
divided into P groups of one or more consecutive buckets, such that the key counts of
the groups are as equal as possible. The first group consists of the first few buckets
GB_0, GB_1, ..., GB_{k−1}, whose counts add up to approximately n; the second group
GB_k, GB_{k+1}, ..., GB_l again holds approximately n keys; and so on. The jth group
of buckets is assigned to P_j, which becomes the owner of those buckets
(j = 0, 1, ..., P − 1). All processors then send their buckets of keys to the
respective owners simultaneously. After this movement the keys are partially sorted
across processors, since any key in GB_i is smaller than any key in GB_j for i < j.
Note that the keys have not been sorted locally yet.
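A possible realization of this partitioning step is sketched below in C with MPI. The helper name, the use of MPI_Allreduce for the global counts, and the greedy threshold rule are our assumptions for illustration; they are not the authors' implementation.

    #include <mpi.h>
    #include <stdlib.h>

    /* Phase I partitioning sketch: from the local bucket counts, compute for
     * every global bucket GB_j the processor that will own it, so that each
     * processor receives roughly n = N/P keys. Whole buckets are never split. */
    void partition_buckets(const long *local_count, int *owner, int M,
                           MPI_Comm comm)
    {
        int P;
        MPI_Comm_size(comm, &P);

        long *global_count = malloc(M * sizeof *global_count);
        /* G_j = sum over all processors of the key count of bucket B_kj */
        MPI_Allreduce(local_count, global_count, M, MPI_LONG, MPI_SUM, comm);

        long total = 0;
        for (int j = 0; j < M; j++) total += global_count[j];
        long target = (total + P - 1) / P;   /* about n = N/P keys per group */

        /* Walk the prefix sums and start a new group whenever the running
         * count reaches the next multiple of the target.                    */
        long prefix = 0;
        int p = 0;
        for (int j = 0; j < M; j++) {
            owner[j] = p;
            prefix += global_count[j];
            if (prefix >= (long)(p + 1) * target && p < P - 1)
                p++;
        }
        free(global_count);
    }

Each processor would then send the keys of bucket j to owner[j], for example with a bucket exchange like the one sketched in Section 2.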
II. Local Sort. The keys now held by each processor are sorted locally,
concurrently on all processors, which puts all N keys in order. Serial radix exchange
sort is performed first with the rightmost g bits, then with the next g bits, and so
on, until all b − g remaining bits are used up. Only b − g bits need to be examined
because the leftmost g bits have already been used in Phase I. Phase II therefore
consists of ⌈(b − g)/g⌉ rounds.
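Continuing the earlier hypothetical sketches, Phase II could reuse the radix_round helper from Section 2, scanning only the low b − g bits of the keys received in Phase I:

    #include <string.h>

    /* Phase II sketch: local sort of the keys received in Phase I. Only the
     * low b - g bits are scanned; the top g bits are already equal within a
     * bucket and ordered across processors. Uses radix_round from the earlier
     * sketch. */
    void local_sort_phase(unsigned *keys, unsigned *tmp, size_t n, int b, int g)
    {
        for (int shift = 0; shift < b - g; shift += g) {  /* ceil((b-g)/g) rounds */
            radix_round(keys, tmp, n, g, shift);
            memcpy(keys, tmp, n * sizeof *keys);
        }
    }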
The performance of PPR relies on how evenly the keys are distributed in the first
phase. It is unlikely that each processor gets exactly the same number of keys after
the redistribution. The partitioning of keys can be refined in Phase I by further
dividing the buckets that lie on a partition boundary and contain excessive keys.
However, simply splitting such a bucket and allocating its parts to two neighboring
processors would not produce the desired sorted output in Phase II, since keys with
the same MSD would then reside in different processors. Thus, we avoid splitting
buckets arbitrarily; keys are always distributed to processors by whole buckets. The
refinement is explained below.
PPR resembles sample sort [4, 5] from the data partitioning and local sort
perspectives. In sample sort, after the keys have moved to their processors according
to the splitters (or pivots), they are partially ordered across processors, so no
further movement of keys across processors is needed. One significant difference is
that the global key distribution in sample sort is not known until the keys have
actually moved to the designated processors, while in PPR it is known before the
costly data movement. It is therefore possible to adjust the partitioning before the
actual key movement. If the current partitioning is not likely to give a satisfactory
work-load balance, PPR increases g so that the keys in the boundary buckets spread
out over a larger number of subbuckets, producing a more even partition. For example,
if g is increased by 2 bits, the keys in each boundary bucket are split into four
subbuckets, enabling finer partitioning. The process repeats until a satisfactory
partitioning is obtained.
3.2. Performance Analysis
Since the previous work on LBR included comparisons with other competitive sorts
[14], only LBR is used for performance comparison with PPR. Assume that both PPR and
LBR are executed on the same machine. The execution time of LBR reflects r = ⌈b/g⌉
iterations, each consisting of a local bucket sort, one transpose of key counts for
the histogram computation, and a set of key send/receive operations [14]. The
parallel time of PPR consists of three terms: the times for r rounds of local bucket
sort, one transpose of key counts, and one round of bucket movement. The execution
times T_LBR and T_PPR can be expressed respectively as
T_{LBR}(N, P) = r\,T_{seq}(N/P) + r\,T_{tp}(M, P) + \sum_{i=1}^{r} T_{move}(D_i, P),    (1)

T_{PPR}(N, P) = \sum_{j=1}^{r} T_{seq}\big((N/P)(1 + \Delta_j)\big) + T_{tp}(M, P) + T_{move}(D'_1, P),    (2)
where M = 2^g, T_seq(n) is the time for a serial radix sort of n keys on one
processor, T_tp(M, P) is the time for transposing the M per-processor bucket counts
over P processors, D_i and D'_j are the amounts of data per processor moved across
processors during the redistribution at round i of LBR and round j of PPR,
respectively, T_move(D_i, P) is the time for exchanging D_i keys per processor among
P processors, and Δ_j represents the maximum load imbalance relative to perfect
balance at the jth round. We assume that all processors are equally powerful and
have the same communication capability. The speedup of PPR over LBR, denoted γ, is
defined as the ratio of T_LBR to T_PPR,
\gamma = \frac{r\,T_{seq}(n) + r\,T_{tp}(M, P) + \sum_{i=1}^{r} T_{move}(D_i, P)}{\sum_{j=1}^{r} T_{seq}\big(n(1 + \Delta_j)\big) + T_{tp}(M, P) + T_{move}(D'_1, P)},    (3)
where n = N/P. Let us refine Eq. (3) under some assumptions. Suppose the values of
the input keys are evenly distributed throughout their range, for example, 0 to
2^b − 1 for positive keys (the uniform initialization introduced in the next section
generates keys like this). Under the assumption that keys are uniformly distributed
over all processors, the last term in Eq. (1) can be expressed as r T_move(D_E, P)
and the last term in Eq. (2) as T_move(D_E, P), where D_E = n(P − 1)/P. Now γ becomes
\gamma = \frac{r\,T_{seq}(n) + r\,T_{tp}(M, P) + r\,T_{move}(D_E, P)}{\sum_{j=1}^{r} T_{seq}\big(n(1 + \Delta_j)\big) + T_{tp}(M, P) + T_{move}(D_E, P)}.    (4)
FIG. 2. Comparison of communication times of PPR and LBR on SP2 with gaussian distribution.

FIG. 3. Comparison of communication times on T3E with uniform distribution.

FIG. 4. Percentage deviation of work load from perfect balance on SP2 with gaussian distribution.

If PPR keeps the load imbalance so small that Δ_j can be ignored, the first terms in
the numerator and the denominator of Eq. (4) are nearly equal. Setting Δ_j to zero
and dividing both the numerator and the denominator by r T_seq(n) yields the
following relationship,
\gamma \approx \frac{1 + \frac{T_{tp}(M, P) + T_{move}(D_E, P)}{T_{seq}(n)}}{1 + \frac{1}{r}\cdot\frac{T_{tp}(M, P) + T_{move}(D_E, P)}{T_{seq}(n)}} = \frac{1 + F(n, M, P)}{1 + \frac{1}{r}\,F(n, M, P)},    (5)
where F(n, M, P) = (T_tp(M, P) + T_move(D_E, P)) / T_seq(n). The speedup γ is greater
than 1.0 since F is positive; γ is an increasing function of F and grows
asymptotically to r. Notice that T_seq(n) is not a function of the communication
speed of the machine. If the communication speed of the machine is slow, the
numerator of F is large, so F is large and γ becomes large as well. In other words,
the improvement of PPR over LBR becomes more significant as the ratio of
communication time to overall execution time increases. Although we have assumed
particular key characteristics above, PPR also balances the work load reasonably well
for keys with other distribution characteristics; the experimental results in the
next section support this.

FIG. 5. Execution times on SP2 with uniform distribution.

FIG. 6. Execution times on SP2 with gaussian distribution.
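As a purely hypothetical numerical illustration of Eq. (5), with assumed values r = 4 and F(n, M, P) = 1 (neither is a measured figure),

\gamma \approx \frac{1 + F}{1 + F/r} = \frac{1 + 1}{1 + 1/4} = 1.6,

and as F grows without bound, γ approaches its upper limit r = 4.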
4. EXPERIMENTS AND DISCUSSION
PPR has been implemented on three different parallel machines: IBM SP2,
Cray T3E, and PC cluster. PC cluster is a set of 16 personal computers with
300 MHz Pentium-II CPUs interconnected by a 100 Mbps fast Ethernet switch. T3E is the
fastest of the three machines as far as computational speed is concerned. As inputs
to the sort, various sets of N/P keys are synthetically generated in each processor
with different distributions called uniform, gauss, and stagger [14]. Uniform creates
keys with a uniform distribution, gauss creates keys with a Gaussian distribution,
and stagger produces specially distributed keys as described in [4]. We run the
programs on up to 64 processors, each holding a maximum of 64M keys (1M = 2^20).
Keys are 32-bit integers on SP2 and the PC cluster, and 64-bit integers on T3E. The
code is written in C with the MPI communication library [16]. Among the many
experiments we have performed, only a few representative results are shown here.

FIG. 7. Execution times on T3E with uniform distribution.

FIG. 8. Execution times on T3E with gaussian distribution.
We first verify that PPR reduces the communication time while keeping the load
imbalance small. We expect the communication time to be cut to at most 1/4 and 1/8
of that of LBR for sorts of 32-bit and 64-bit integer keys (r = 4 and r = 8 rounds
with g = 8), respectively. As seen in Figs. 2 and 3, the reduction in communication
time is indeed large: it is now about 1/4 for 32-bit keys on SP2 and around 1/6 for
64-bit integers on T3E.
The load imbalance among processors is shown in Fig. 4. It is greatest for the gauss
distribution, with a maximum deviation of 5.2% from the perfectly balanced case,
which shows that the imbalance is not severe enough to significantly impair the
overall performance of PPR. The improved performance of PPR over LBR can be observed
in Figs. 5 and 6 for SP2, and in Figs. 7 and 8 for T3E.
We have found that on T3E the communication portion of the sorting time is greater
than on SP2. In addition, since the keys are 64-bit integers on T3E, more improvement
of PPR over LBR is expected due to the larger r, because we save r − 1 rounds of
interprocessor communication. Accordingly, a greater enhancement can be observed on
T3E in Figs. 7 and 8 than on SP2 in Figs. 5 and 6. Sorting times are shortened, and
the improvement ranges from 13% to 30% on SP2 and from 20% to 100% on T3E.
On the PC cluster, the network is so slow that both parallel sorts are slower than
the uniprocessor sort for P ≥ 8, as shown in Figs. 9 and 10. Nevertheless, PPR
delivers remarkable performance relative to LBR, since the communication time
dominates the computation time. Table 1 lists the speedups, all greater than 2.4.

FIG. 9. Execution times on PC cluster with uniform distribution.

FIG. 10. Execution times on PC cluster with stagger distribution.
5. CONCLUSION
We have proposed partitioned parallel radix sort, which removes the communication
bottleneck of balanced radix sort. The main idea is to divide the keys among the
processors in such a way that each processor holds keys that are sorted across
processors but not yet within each processor. Once the keys are localized to each
processor, a serial radix sort is applied in each to sort the assigned keys locally.
The method thus improves the overall performance by reducing the communication time
significantly. Experimental results on three distributed-memory machines indicate
that partitioned parallel radix sort always performs better than the previous scheme
regardless of the data size, the number of processors, and the key initialization
scheme.
TABLE 1
Execution Times and the Speedups on a 4-Processor PC Cluster

Keys        LBR        PPR        γ (speedup)

Uniform
1M          1.434      0.577      2.485
2M          2.967      1.094      2.712
4M          5.974      2.090      2.858
8M          14.30      5.166      2.768
16M         25.660     9.480      2.706

Stagger
1M          1.442      0.593      2.432
2M          3.008      1.101      2.732
4M          6.034      2.268      2.660
8M          14.129     4.279      3.302
16M         26.007     9.342      2.784
REFERENCES
1. K. E. Batcher, Sorting networks and their applications, in ‘‘Proc. AFIPS Conference, 1968,’’
pp. 307–314.
2. R. Beigel and J. Gill, Sorting n objects with k-sorter, IEEE Trans. Comput. 39(5) (1990), 714–716.
3. A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, Fast parallel sorting under LogP:
Experience with the CM-5, IEEE Trans. Parallel Distrib. Systems 7(8) (1996).
4. D. R. Helman, D. A. Bader, and J. JaJa, Parallel algorithms for personalized communication and
sorting with an experimental study, in ‘‘Proc. ACM Symposium on Parallel Algorithms and Architectures, Padua, Italy, 1996,’’ pp. 211–220.
5. J. S. Huang and Y. C. Chow, Parallel sorting and data partitioning by sampling, in ‘‘Proc. the 7th
Computer Software and Applications Conference, 1983,’’ pp. 627–631.
6. J. JaJa, ‘‘Introduction to Parallel Algorithms,’’ Addison–Wesley, Reading, MA, 1992.
7. Y. C. Kim, M. Jeon, D. Kim, and A. Sohn, Communication-efficient bitonic sort on a distributed
memory parallel computer, in ‘‘Proc. International Conference on Parallel and Distributed Systems,
(ICPADS’2001), Kyongju, Korea, June 26–29, 2001.’’
8. J.-D. Lee and K. E. Batcher, Minimizing communication in the bitonic sort, IEEE Trans. Parallel
Distrib. Systems 11(5) (2000), 459–474.
9. S.-J. Lee, ‘‘Partitioned Parallel Radix Sort,’’ MS thesis, Korea University, February 1999.
10. F. T. Leighton, Tight bounds on the complexity of parallel sorting, IEEE Trans. Comput. 34 (1985),
344–354.
11. F. T. Leighton, ‘‘Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,’’ Addison–Wesley/Morgan Kaufmann, Reading, MA, 1992.
12. W. A. Martin, Sorting, ACM Comput. Surveys 3(4) (1971), 147–174.
13. R. Sedgewick, ‘‘Algorithms,’’ Wiley, New York, 1990.
14. A. Sohn and Y. Kodama, Load balanced parallel radix sort, in ‘‘Proc. 12th ACM International
Conference on Supercomputing, Melbourne, Australia, July 14–17, 1998.’’
15. A. Sohn, Y. Kodama, M. Sato, H. Sakane, H. Yamada, S. Sakai, and Y. Yamaguchi, Identifying
the capability of overlapping computation with communication, in ‘‘Proc. ACM/IEEE Parallel
Architecture and Compilation Techniques, Boston, MA, Oct. 1996.’’
16. Message Passing Interface Forum, ‘‘MPI: A Message-Passing Interface Standard,’’ Technical
Report, University of Tennessee, Knoxville, TN, June 1995.
SHIN-JAE LEE received his B.S. and M.S. from the Department of Electrical Engineering of Korea
University, Seoul, Korea, in 1997 and 1999 respectively. He is currently a research staff member at LG
Telecommunications, Anyang, Korea.
MINSOO JEON received his B.S. and M.S. from the Department of Electrical Engineering of Korea
University, Seoul, Korea, in 1996 and 1998, respectively. He is currently a Ph.D. candidate at the same
school. His research interests include parallel and distributed algorithms.
DONGSEUNG KIM is a professor in the Department of Electrical Engineering at Korea University,
Seoul, Korea. He was an assistant professor at POSTECH, Pohang, Korea, from 1989 to 1995. He
received his Ph.D. from the University of Southern California, Los Angeles, his M.S. from KAIST, and
his B.S. from Seoul National University, Seoul, Korea, in 1988, 1980, and 1978, respectively. His
research interests include parallel/cluster computing and parallel algorithms.
ANDREW SOHN is an associate professor in the Computer Science Department at the New Jersey
Institute of Technology, Newark, New Jersey. He received his B.S., M.S., and Ph.D. from the University
of Southern California, Los Angeles. His area of research covers the design of scalable web servers, high-performance algorithms, and compilers.