Journal of Parallel and Distributed Computing 62, 656–668 (2002)
doi:10.1006/jpdc.2001.1808, available online at http://www.idealibrary.com

Partitioned Parallel Radix Sort^1

Shin-Jae Lee, Minsoo Jeon, and Dongseung Kim
Department of Electrical Engineering, Korea University, Seoul 136-701, Korea
E-mail: [email protected]

and

Andrew Sohn
Department of Computer and Information Science, New Jersey Institute of Technology, Newark, New Jersey 07102-1982
E-mail: [email protected]

Received August 28, 2000; revised June 18, 2001; accepted August 6, 2001

Load balanced parallel radix sort solved the load imbalance problem present in parallel radix sort. By redistributing the keys in each round of radix, each processor has exactly the same number of keys, thereby reducing the overall sorting time. Load balanced radix sort is currently known as the fastest internal sorting method for distributed-memory multiprocessors. However, once the computation time is balanced, the communication time caused by key redistribution emerges as the bottleneck of the overall sorting performance. We present in this report a new parallel radix sorter, called partitioned parallel radix sort, that solves the communication problem of balanced radix sort. The new method reduces the communication time by eliminating the redistribution steps. The keys are first sorted in a top-down fashion (left-to-right as opposed to right-to-left) using some of the most significant bits. Once the keys are localized to each processor, the rest of the sorting is confined within each processor, hence eliminating the need for global redistribution of keys. This enables well balanced communication and computation across processors. The proposed method has been implemented on three different distributed-memory platforms: IBM SP2, Cray T3E, and a PC cluster. Experimental results with various key distributions indicate that partitioned parallel radix sort indeed shows significant improvements over balanced radix sort. IBM SP2 shows 13% to 30% improvement in execution time, Cray/SGI T3E shows 20% to 100%, and the PC cluster shows over a 2.4-fold improvement. © 2002 Elsevier Science (USA)

^1 The work is partially supported by KRF (1999-E00287), KOSEF (985-0900-003-2), STEPI (97NF0304-A-01), and NSF (INT-9722545). A preliminary version of this paper was presented at the Third International Symposium on High Performance Computing, Tokyo, Japan, October 2000.

Key Words: parallel sorting; radix sort; distributed-memory machines; load balancing.

1. INTRODUCTION

Sorting is one of the fundamental problems in computer science. Its use can be found essentially everywhere, in scientific as well as nonnumeric computation [11, 12]. Sorting a given number of keys is commonly used to benchmark parallel computers, or to compare algorithms when they are run on the same parallel machine. Serial sorts typically need O(N log N) time, which becomes significant as the number of keys grows. Because of its importance, numerous parallel sorting algorithms have been developed to reduce the overall sorting time, including bitonic sort [1, 7, 8], sample sort [4, 5], and column sort [10]. In general, a parallel sort consists of multiple rounds of serial sort, called local sort, performed in each processor in parallel, followed by movement of keys among processors, called the redistribution step [6].
Local sort and data redistribution may be interleaved and iterated several times, depending on the algorithm used. The time spent in local sort depends on the number of keys, and the parallel sort time is the sum of the local sort times and the data redistribution times over all rounds. To make the sort fast, it is important to distribute the keys as evenly as possible throughout the rounds, since the execution time of each round is determined by the most heavily loaded processor [5, 14]. If a parallel sort keeps its workload perfectly balanced in every round, no further improvement can be made to the time spent in that part. The communication time, however, varies with the data redistribution scheme (e.g., all-to-all, one-to-many, many-to-one), the amount of data, the frequency of communication (e.g., many short messages or a few long messages), and the network topology (hypercube, mesh, fat-tree) [3, 11]. It has been reported that for a large number of keys the communication time occupies a large portion of the sorting time [3, 15].

Load balanced parallel radix sort [14] (abbreviated LBR) reduces the execution time by perfectly balancing the load among processors in every round. Partitioned parallel radix sort (PPR), proposed in this paper, further improves the performance by reducing the multiple rounds of data redistribution to one. While PPR may introduce slight load imbalance among processors due to its not-so-perfect key distribution, the overall performance gain can be significant since it substantially reduces the overall communication time. It is precisely the purpose of this report to introduce this new algorithm, which features balanced computation and balanced communication.

The paper is organized as follows. Section 2 briefly explains balanced parallel radix sort and identifies its deficiency in terms of communication. Section 3 presents the new partitioned parallel radix sort and gives an analytical view of the algorithm. Section 4 reports experimental results on three different distributed-memory parallel machines: SP2, T3E, and a PC cluster. The last section concludes this report.

2. PARALLEL RADIX SORT

Radix sort is a simple yet very efficient sorting method that outperforms many well-known comparison-based algorithms for certain key types such as integers. Suppose N keys are evenly distributed to P processors initially, so that there are n = N/P keys per processor. When the sort completes, all keys should be ordered according to the processor ranks P_0, P_1, ..., P_{P-1}, and the keys within each processor should also be sorted. Serial radix sort is implemented in two different ways: radix exchange sort and straight radix sort [13]. Since parallel radix sorts are typically derived from a serial radix sort, we first present serial radix sort, followed by a parallel version. We define some symbols used later in this paper:

• b is the number of bits of an integer key, so that a key is represented as (i_{b-1} i_{b-2} ... i_1 i_0).
• g is the number of consecutive bits of a key used at each round of scanning.
• r = ⌈b/g⌉ is the number of rounds each key goes through.

Radix exchange sort generates and maintains an ordered queue. Initially, it reads the g least significant bits (i.e., i_{g-1} i_{g-2} ... i_1 i_0) of each key and stores the key in a new queue at the location determined by those g bits. When all keys have been examined and placed in the new queue, the round completes, and the keys are ordered according to their least significant g bits. The following round scans the next g least significant bits (i_{2g-1} i_{2g-2} ... i_{g+1} i_g) to order the keys in a new queue. The subsequent rounds proceed in the same manner, with keys moving back and forth between queues. After r rounds, all bits have been scanned and the sort completes.
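Each round of this procedure amounts to a stable counting sort on g bits, and the full serial sort is r = ⌈b/g⌉ such rounds. The following minimal C sketch illustrates the idea for 32-bit unsigned keys; the function and variable names are ours, and it is an illustration of the scheme rather than code from the paper.

    #include <stdlib.h>
    #include <string.h>

    /* One round: stable counting sort of n keys on the g bits starting at
     * bit position 'shift'.  'out' must have room for n keys.             */
    static void radix_round(const unsigned *in, unsigned *out,
                            int n, int g, int shift)
    {
        int M = 1 << g;                              /* M = 2^g buckets        */
        int *count = calloc(M + 1, sizeof(int));
        for (int i = 0; i < n; i++)                  /* histogram of the digit */
            count[((in[i] >> shift) & (M - 1)) + 1]++;
        for (int j = 0; j < M; j++)                  /* prefix sums -> offsets */
            count[j + 1] += count[j];
        for (int i = 0; i < n; i++)                  /* stable placement       */
            out[count[(in[i] >> shift) & (M - 1)]++] = in[i];
        free(count);
    }

    /* Full serial sort: r = ceil(b/g) rounds, least significant bits first. */
    void serial_radix_sort(unsigned *keys, int n, int b, int g)
    {
        unsigned *tmp = malloc(n * sizeof(unsigned));
        for (int shift = 0; shift < b; shift += g) {
            radix_round(keys, tmp, n, g, shift);
            memcpy(keys, tmp, n * sizeof(unsigned));
        }
        free(tmp);
    }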
The main idea of load balanced radix sort (LBR) is to move keys from any processor holding more than N/P keys to its neighboring processors. Each processor first obtains the bin counts of its locally stored keys by scanning g bits, and the keys are then placed into the appropriate buckets by rescanning them. These two steps are local operations involving no communication between processors. An all-to-all transpose operation is then performed across all processors to find the global bin counts, so that each processor has the bin counts of all processors. The transposed bin counts allow each and every processor to compute which processors get exactly how many keys, from which bins and which processors, to balance the load. Overloaded processors can then spill keys to their immediate neighbors, and the keys move once all bins and their keys have been located in the global processor space. Every round therefore requires an all-to-all transpose operation; for 32-bit integers with a radix of 8 bits, balanced radix sort needs four all-to-all transposes of bin counts. LBR is reported to outperform the fastest parallel sorts by up to 100% in execution time [14]. LBR, however, requires data redistribution across processors in every round and thus spends a considerable amount of time in communication.
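The bin-count transpose described above can be expressed as a single collective call. The following C/MPI fragment is our own illustrative sketch under assumed data layouts (bucket counts stored contiguously, M divisible by P); it is not the authors' code.

    #include <mpi.h>

    /* Illustrative sketch of the bin-count transpose.  Each processor holds
     * local_count[M], the local key count of every one of the M = 2^g bins,
     * with the bins grouped into P contiguous blocks of M/P bins.  After the
     * call, recv_count[p*(M/P) + j] is the count processor p reported for the
     * j-th bin of the block this processor is responsible for.              */
    void transpose_bin_counts(int *local_count, int *recv_count,
                              int M, MPI_Comm comm)
    {
        int P;
        MPI_Comm_size(comm, &P);
        MPI_Alltoall(local_count, M / P, MPI_INT,
                     recv_count,  M / P, MPI_INT, comm);
    }

When every processor needs the complete P x M histogram, as LBR's balancing computation suggests, an MPI_Allgather of the full count array serves the same purpose at the cost of more received data.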
Straight radix sort initially uses M = 2^g buckets (first-level buckets) instead of the ordered queues. It first bucket-sorts [13] using the g most significant bits (i_{b-1} i_{b-2} ... i_{b-g}) of each key. Bucket sort places each key into the bucket whose index corresponds to its own g bits, so keys with the same g bits gather in the same bucket. In the second round, the keys in each bucket are bucket-sorted again using the next g most significant bits (i_{b-(g+1)} i_{b-(g+2)} ... i_{b-2g}), generating M new second-level buckets, called subbuckets, per bucket. The remaining rounds proceed in the same manner. In this scheme keys never leave the upper-level bucket in which they were placed in a previous round. One serious problem with the scheme is that the total number of buckets (subbuckets) grows exponentially, and unless it is implemented carefully there may be many buckets with few keys, wasting a great deal of memory.

In our parallel implementation, the first round is exactly the same as in the serial straight radix sort. Then, according to the global histogram of the bucket population (key counts), each processor is assigned, and becomes in charge of, a few consecutive buckets obtained in the first round. Buckets of keys are then exchanged among processors according to their indices, so keys with the same g most significant bits are collected from all processors into one processor. In the remaining r − 1 rounds, bucket sorting continues locally by radix exchange sort using the remaining b − g bits; no further data exchange is done across processors. If keys are evenly distributed among buckets, each processor holds M/P buckets on average. However, some processors may be allocated buckets with many keys while others receive few, depending on the distribution characteristics of the keys. This static/naive partitioning of keys may cause severe load imbalance among processors. PPR solves this problem as described in the next section [9].

3. PARTITIONED PARALLEL RADIX SORT

Assume that we use only M = 2^g buckets per processor throughout the sort. B_{ij} represents bucket j in processor P_i.

3.1. The Algorithm

PPR consists of two phases, key partitioning and local sort, and needs r = ⌈b/g⌉ rounds in all. Details are given below.

I. Key Partitioning. Each processor bucket-sorts its keys using the g most significant bits. From now on, the leftmost g bits of a key are referred to as its most significant digit (MSD). Each key is placed into the appropriate bucket; that is, processor P_k stores a key into bucket B_{kj}, where j corresponds to the MSD of the key. At the end of the bucket sort, all keys have been partially sorted locally with respect to their MSDs: the first bucket holds the smallest keys, the second the next smallest, ..., and the last the largest. Then an all-to-all transpose of key counts is performed to find a global key distribution map (which corresponds to finding a histogram of the keys over all processors), as illustrated by Fig. 1. For all j = 0, 1, ..., M − 1, the key counts of B_{kj} are added up to obtain G_j, the global count of keys in buckets B_{kj} across processors P_k (k = 0, 1, ..., P − 1). Then the prefix sums of the global key counts G_j are computed.

FIG. 1. Local and global key count maps and bucket partitioning. (a) Bucket counts in individual processors. (b) Global bucket count and the partitioning map.

Consider hypothetical buckets, called global buckets, GB_j, each of which is the collection of the jth buckets B_{kj} from all processors P_0, P_1, ..., P_{P-1}. Then G_j is the key count of bucket GB_j. Taking into account the prefix sums and the average number of keys (n = N/P), the global buckets are divided into P groups, each consisting of one or more consecutive buckets, in such a way that the key counts of the groups become as equal as possible. The first group consists of the first few buckets GB_0, GB_1, ..., GB_{k-1}, whose counts add up to approximately n; the second group GB_k, GB_{k+1}, ..., GB_l again holds approximately n keys; and so on. The jth group of buckets is assigned to P_j, which becomes the owner of those buckets (j = 0, 1, 2, ..., P − 1). Now all processors send their buckets of keys to the respective owners simultaneously. After this movement the keys are partially sorted across processors, since any key in GB_i is smaller than any key in GB_j for i < j. Note that the keys have not been sorted locally yet.

II. Local Sort. Keys are now sorted locally within each processor, by all processors at the same time, so that all N keys become ordered. Serial radix exchange sort is performed first with the rightmost g bits, then with the next rightmost g bits, and so on, until the remaining b − g bits are used up. Only b − g bits need to be examined because the leftmost g bits have already been used in Phase I. Phase II therefore consists of ⌈(b − g)/g⌉ rounds.

The performance of PPR relies on how evenly the keys are distributed in the first phase, and it is not very likely that each processor gets exactly the same number of keys after the redistribution. The partitioning of keys can be refined in Phase I by further dividing the buckets that lie on the partition boundaries and hold excessive keys. However, simply splitting such a bucket and allocating its halves to two neighboring processors would not produce the desired sorted output in Phase II, since keys having the same MSD would end up in different processors. Thus, we avoid splitting buckets: keys are always distributed to processors in whole buckets. The refinement is explained below.
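Before turning to that refinement, the basic Phase I cut can be made concrete: once the global counts G_j are known (for instance by summing the per-processor counts with an allreduce), the owners can be chosen by a single scan of the prefix sums. The C sketch below is our own illustration under hypothetical names; it shows only the greedy assignment and, as required above, always cuts on bucket boundaries.

    /* Greedy partition of M global buckets into P groups of consecutive
     * buckets holding roughly N/P keys each (illustrative sketch only).
     * G[j] is the global count of bucket j; owner_first[p] receives the
     * index of the first bucket assigned to processor p, so processor p
     * owns buckets owner_first[p] .. owner_first[p+1]-1.                 */
    void partition_buckets(const long *G, int M, int P, long N, int *owner_first)
    {
        long target = (N + P - 1) / P;   /* about N/P keys per processor     */
        long prefix = 0;
        int p = 0;
        owner_first[0] = 0;
        for (int j = 0; j < M && p < P - 1; j++) {
            prefix += G[j];
            /* close the current group once its cumulative count reaches the
             * target; buckets are never split, so the cut is on a boundary  */
            if (prefix >= (long)(p + 1) * target)
                owner_first[++p] = j + 1;
        }
        while (p < P - 1)                /* degenerate case: too few buckets */
            owner_first[++p] = M;
        owner_first[P] = M;              /* sentinel: one past the last bucket */
    }

Each processor can then ship bucket j to the owner p satisfying owner_first[p] <= j < owner_first[p+1], for example with MPI_Alltoallv, after which Phase II proceeds entirely locally.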
PPR resembles sample sort [4, 5] from the data partitioning and local sort perspectives. In sample sort, after the keys have moved to their processors according to the splitters (or pivots), they are partially ordered across processors, so no further movement of keys across processors is needed. One significant difference is that in sample sort the global key distribution statistics are not known until the keys have actually moved to the designated processors, while in PPR they are known before the costly data movement. It is therefore possible to adjust the partitioning before the actual key movement. If the current partitioning is not likely to give a satisfactory load balance, PPR increases g so that the keys in the boundary buckets spread out into a larger number of subbuckets, producing a more even partition. For example, if g is increased by 2 bits, the keys in each boundary bucket are split into four subbuckets, enabling finer partitioning. The process repeats until a satisfactory partitioning is obtained.

3.2. Performance Analysis

Since the previous work on LBR already includes comparisons with other competitive sorts [14], only LBR is used for performance comparison with PPR. Assume that both PPR and LBR are executed on the same machine. The execution time of LBR reflects r = ⌈b/g⌉ iterations, each consisting of a local bucket sort, one transpose of key counts for the histogram computation, and a set of key send/receive operations [14]. The parallel time of PPR consists of three terms: the time for r rounds of local bucket sort, one transpose of key counts, and one round of bucket movement. The execution times T_{LBR} and T_{PPR} can be expressed respectively as

    T_{LBR}(N, P) = r T_{seq}(N/P) + r T_{tp}(M, P) + \sum_{i=1}^{r} T_{move}(D_i, P),    (1)

    T_{PPR}(N, P) = \sum_{j=1}^{r} T_{seq}((N/P)(1.0 + Δ_j)) + T_{tp}(M, P) + T_{move}(D'_1, P),    (2)

where M = 2^g, T_{seq}(n) is the time for the serial radix sort of n keys in a processor, T_{tp}(M, P) is the time for transposing M key counts of buckets per processor, D_i and D'_j are the amounts of data per processor moved across processors during redistribution at round i of LBR and round j of PPR, respectively, T_{move}(D_i, P) is the time for exchanging D_i keys per processor on P processors at round i, and Δ_j is the maximum deviation from perfect load balance at the jth round. We assume that all processors are equally powerful and have the same communication capability. The speedup of PPR over LBR, denoted γ, is defined as the ratio of T_{LBR} to T_{PPR}:

    γ = [r T_{seq}(n) + r T_{tp}(M, P) + \sum_{i=1}^{r} T_{move}(D_i, P)] / [\sum_{j=1}^{r} T_{seq}(n(1.0 + Δ_j)) + T_{tp}(M, P) + T_{move}(D'_1, P)],    (3)

where n = N/P. Let us refine Eq. (3) under some assumptions. Suppose the values of the input keys are evenly distributed throughout their range, for example, 0 to 2^b − 1 for positive keys (the uniform initialization introduced in the next section generates such keys). Under the assumption that the keys are uniformly distributed over all processors, the last term of Eq. (1) can be expressed as r T_{move}(D_E, P) and the last term of Eq. (2) as T_{move}(D_E, P), where D_E = n(P − 1)/P. Then γ becomes

    γ = [r T_{seq}(n) + r T_{tp}(M, P) + r T_{move}(D_E, P)] / [\sum_{j=1}^{r} T_{seq}(n(1.0 + Δ_j)) + T_{tp}(M, P) + T_{move}(D_E, P)].    (4)

If PPR keeps the load imbalance so small that Δ_j can be ignored, the first terms in the numerator and the denominator of Eq. (4) are nearly equal. Setting Δ_j to zero and dividing both numerator and denominator by r T_{seq}(n) yields

    γ ≈ [1 + F(n, M, P)] / [1 + (1/r) F(n, M, P)],    (5)

where F(n, M, P) = (T_{tp}(M, P) + T_{move}(D_E, P)) / T_{seq}(n). The speedup γ is greater than 1.0 since F is positive, and γ is an increasing function of F that approaches r asymptotically. Notice that T_{seq}(n) does not depend on the communication speed of the machine. If the communication speed of the machine is slow, the numerator of F is large, so F is large and γ grows. In other words, the improvement of PPR over LBR becomes more significant as the ratio of communication time to overall execution time increases.
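To give a feel for Eq. (5), consider some hypothetical numbers, chosen by us purely for illustration and not taken from the measurements. With r = 4 rounds (32-bit keys, g = 8) and F = 1, i.e., the per-round transpose and key movement together cost about as much as one round of local sorting,

    γ ≈ (1 + 1) / (1 + 1/4) = 1.6,

a 60% improvement; with r = 8 (64-bit keys) and the same F, γ ≈ 2 / 1.125 ≈ 1.78. As F grows without bound, which models a communication-dominated platform such as a cluster on a slow network, γ approaches its limit r.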
Although particular key characteristics have been assumed above, PPR also balances the workload reasonably well for keys with other distribution characteristics; the experimental results in the next section support this.

4. EXPERIMENTS AND DISCUSSION

PPR has been implemented on three different parallel machines: IBM SP2, Cray T3E, and a PC cluster. The PC cluster is a set of 16 personal computers with 300 MHz Pentium II CPUs interconnected by a 100 Mbps Fast Ethernet switch. T3E is the fastest machine among them as far as computational speed is concerned. As sort inputs, various sets of N/P keys are synthetically generated in each processor with different distributions called uniform, gauss, and stagger [14]. Uniform creates keys with a uniform distribution, Gauss creates keys with a Gaussian distribution, and Stagger produces specially distributed keys as described in [4]. We ran the programs on up to 64 processors, each with a maximum of 64M keys (1M = 2^20). Keys are 32-bit integers for SP2 and the PC cluster, and 64-bit integers for T3E. The code is written in C with the MPI communication library [16]. Among the many experiments we have performed, only a few representative results are shown here.

We first verify that PPR reduces the communication time while keeping the load imbalance small. We expect the communication time to be cut down to at most 1/4 and 1/8 compared to LBR for sorts of 32-bit and 64-bit integer keys, respectively, since PPR replaces r = 4 and r = 8 rounds of key redistribution with a single one. As seen in Figs. 2 and 3, there is a great reduction in communication times: they are now about 1/4 for 32-bit keys on SP2 and around 1/6 for 64-bit integers on T3E.

FIG. 2. Comparison of communication times of PPR and LBR on SP2 with gaussian distribution.
FIG. 3. Comparison of communication times on T3E with uniform distribution.

The load imbalance among processors is shown in Fig. 4. It is greatest for the Gauss case, with a maximum deviation of 5.2% from the perfectly balanced case, which shows that the imbalance is not severe enough to significantly impair the overall performance of PPR.

FIG. 4. Percentage deviation of work load from perfect balance on SP2 with gaussian distribution.
Improved performance of PPR over LBR can be observed in Figs. 5 and 6 for SP2 and in Figs. 7 and 8 for T3E. We have found that on T3E the communication portion of the sorting time is greater than on SP2. In addition, since the keys are 64-bit integers on T3E, more improvement of PPR over LBR is expected due to the larger r, because we save r − 1 rounds of interprocessor communication. Accordingly, more enhancement can be observed on T3E in Figs. 7 and 8 than on SP2 in Figs. 5 and 6, respectively. Sorting times are shortened, and the improvement ranges from 13% to 30% on SP2 and from 20% to 100% on T3E.

FIG. 5. Execution times on SP2 with uniform distribution.
FIG. 6. Execution times on SP2 with gaussian distribution.
FIG. 7. Execution times on T3E with uniform distribution.
FIG. 8. Execution times on T3E with gaussian distribution.

On the PC cluster, the network is so slow that both parallel sorts are slower than the uniprocessor sort for P ≥ 8, as shown in Figs. 9 and 10. Nevertheless, PPR delivers remarkable performance over LBR since the communication time dominates the computation time. Table 1 lists execution times and speedups on a four-processor PC cluster; the speedups are all greater than 2.4.

FIG. 9. Execution times on PC cluster with uniform distribution.
FIG. 10. Execution times on PC cluster with stagger distribution.

TABLE 1
Execution Times and Speedups on a 4-Processor PC Cluster

Uniform
Keys    LBR      PPR      γ (speedup)
1M      1.434    0.577    2.485
2M      2.967    1.094    2.712
4M      5.974    2.090    2.858
8M      14.30    5.166    2.768
16M     25.660   9.480    2.706

Stagger
Keys    LBR      PPR      γ (speedup)
1M      1.442    0.593    2.432
2M      3.008    1.101    2.732
4M      6.034    2.268    2.660
8M      14.129   4.279    3.302
16M     26.007   9.342    2.784

5. CONCLUSION

We have proposed partitioned parallel radix sort, which removes the communication bottleneck of balanced radix sort. The main idea is to distribute the keys to processors in such a way that the keys are sorted across processors but not yet within each processor. Once the keys are localized to the processors, a serial radix sort is applied in each processor to sort the assigned keys locally. The method thus improves the overall performance by reducing the communication time significantly. Experimental results on three distributed-memory machines indicate that partitioned parallel radix sort always performs better than the previous scheme regardless of the data size, the number of processors, and the key initialization scheme.

REFERENCES

1. K. E. Batcher, Sorting networks and their applications, in ‘‘Proc. AFIPS Conference, 1968,’’ pp. 307–314.
2. R. Beigel and J. Gill, Sorting n objects with a k-sorter, IEEE Trans. Comput. 39(5) (1990), 714–716.
3. A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, Fast parallel sorting under LogP: Experience with the CM-5, IEEE Trans. Parallel Distrib. Systems 7(8) (1996).
4. D. R. Helman, D. A. Bader, and J. JaJa, Parallel algorithms for personalized communication and sorting with an experimental study, in ‘‘Proc. ACM Symposium on Parallel Algorithms and Architectures, Padua, Italy, 1996,’’ pp. 211–220.
5. J. S. Huang and Y. C. Chow, Parallel sorting and data partitioning by sampling, in ‘‘Proc. 7th Computer Software and Applications Conference, 1983,’’ pp. 627–631.
6. J. JaJa, ‘‘Introduction to Parallel Algorithms,’’ Addison–Wesley, Reading, MA, 1992.
7. Y. C. Kim, M. Jeon, D. Kim, and A. Sohn, Communication-efficient bitonic sort on a distributed memory parallel computer, in ‘‘Proc. International Conference on Parallel and Distributed Systems (ICPADS 2001), Kyongju, Korea, June 26–29, 2001.’’
8. J.-D. Lee and K. E. Batcher, Minimizing communication in the bitonic sort, IEEE Trans. Parallel Distrib. Systems 11(5) (2000), 459–474.
9. S.-J. Lee, ‘‘Partitioned Parallel Radix Sort,’’ MS thesis, Korea University, February 1999.
10. F. T. Leighton, Tight bounds on the complexity of parallel sorting, IEEE Trans. Comput. 34 (1985), 344–354.
11. F. T. Leighton, ‘‘Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes,’’ Addison–Wesley/Morgan Kaufmann, Reading, MA, 1992.
12. W. A. Martin, Sorting, ACM Comput. Surveys 3(4) (1971), 147–174.
13. R. Sedgewick, ‘‘Algorithms,’’ Wiley, New York, 1990.
14. A. Sohn and Y. Kodama, Load balanced parallel radix sort, in ‘‘Proc. 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 14–17, 1998.’’
15. A. Sohn, Y. Kodama, M. Sato, H. Sakane, H. Yamada, S. Sakai, and Y. Yamaguchi, Identifying the capability of overlapping computation with communication, in ‘‘Proc. ACM/IEEE Parallel Architectures and Compilation Techniques, Boston, MA, Oct. 1996.’’
16. Message Passing Interface Forum, ‘‘MPI: A Message-Passing Interface Standard,’’ Technical Report, University of Tennessee, Knoxville, TN, June 1995.

SHIN-JAE LEE received his B.S. and M.S. from the Department of Electrical Engineering of Korea University, Seoul, Korea, in 1997 and 1999, respectively. He is currently a research staff member at LG Telecommunications, Anyang, Korea.

MINSOO JEON received his B.S. and M.S. from the Department of Electrical Engineering of Korea University, Seoul, Korea, in 1996 and 1998, respectively. He is currently a Ph.D. candidate at the same school. His research interests include parallel and distributed algorithms.

DONGSEUNG KIM is a professor in the Department of Electrical Engineering at Korea University, Seoul, Korea. He was an assistant professor at POSTECH, Pohang, Korea, from 1989 to 1995. He received his Ph.D. from the University of Southern California, Los Angeles, his M.S. from KAIST, and his B.S. from Seoul National University, Seoul, Korea, in 1988, 1980, and 1978, respectively. His research interests include parallel/cluster computing and parallel algorithms.

ANDREW SOHN is an associate professor in the Computer Science Department at the New Jersey Institute of Technology, Newark, New Jersey. He received his B.S., M.S., and Ph.D. from the University of Southern California, Los Angeles. His research covers the design of scalable web servers, high-performance algorithms, and compilers.