Fast Parallel GPU-Sorting Using a Hybrid Algorithm
Erik Sintorn and Ulf Assarsson
Department of Computer Science and Engineering
Chalmers University of Technology
Gothenburg, Sweden
Presented by Mike Patterson

Motivation
• Quickly sorting large sequences (more than 512K elements)
• Distance sorting scenes
  – Transparency, particles, etc.
  – Depth buffer is flawed
    • Z-fighting
    • Precision weighted towards near clip plane
  – Painter’s algorithm is correct, but expensive

Related Work
• Bitonic Sort Θ(n·log²(n))
• GPU-ABiSort (Adaptive Bitonic Sorting) O(n·log(n))
• Radix Sort with CUDA O(k·N)
• Merge Sort O(n·log(n))

Parallelism on GeForce 8800 GTX
• 128 x 1.35 GHz processors (16 x 8-core multiprocessors)
• 128-bit word size (i.e., four 32-bit floats)
• Ideally 2 threads/core (swap “technically” takes 0 clocks)

Overview of the Algorithm
1. Randomly shuffle elements O(N)
2. Compute initial pivot points O(N)
3. Bucket Sort O(N)
4. Vector Merge Sort O(N·log(L))

Initial Pivot Points
• Calculate minimum and maximum
  – glMinMax (~1% of total execution time)
• Choose initial pivot points
  – Linearly interpolate from min to max
  – L-1 pivot points, where L ≥ 2P
    • With 32 SP (GeForce 8600), L=1024 was best

Bucket Sort
1. Compute histogram (optimized bucket counter)
   • Relies on linearly interpolated pivot points
2. Refine pivot points
   • Assume uniform distribution over range of each bucket
3. Count elements per bucket
   • ∀x | 1 ≤ x ≤ L : size(sublist[x]) ≈ N/L
4. Reposition elements (float4 aligned)
   • ∀a ∈ sublist[l+1], b ∈ sublist[l] : a > b

Vector Merge Sort
    // get a.xyzw
    // get b.xyzw
    the four lowest floats  = (a.xyzw <  b.wzyx) ? a.xyzw : b.wzyx
    the four highest floats = (b.xyzw >= a.wzyx) ? b.xyzw : a.wzyx

    sortElements(float4 r)
        r = (r.xyzw > r.yxwz) ? r.yyww : r.xxzz
        r = (r.xyzw > r.zwxy) ? r.zwzw : r.xyxy
        r = (r.xyzw > r.xzyw) ? r.xzzw : r.xyyw

[Chart: Pentium D @ 2.8 GHz vs. GeForce 8600 GTS; labeled timings 8.6s and 0.561s]
• GPU-based hybrid algorithm 6-14 times faster for 1-8M elements respectively

[Chart: Pentium D @ 2.8 GHz vs. GeForce 8600 GTS; labeled timing 0.561s]
• Transferring elements to/from the GPU using CUDA takes ~10% of total time

[Chart: Pentium D @ 2.8 GHz vs. GeForce 8600 GTS; labeled timing 0.561s, w/ random shuffling step]
• GPUSort requires POT
• GPUSorti is virtual

Distance Sorting
• Two Different Scenes (timings in milliseconds)

Conclusion
• GPU-based sorting algorithm
• Supports lists of arbitrary size
• Ideal for more than 512K elements
• Sensitive to input distribution
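
The float4 merge step from the Vector Merge Sort slide can be tried out on the CPU. The following is a minimal sketch, not the authors' GPU implementation: plain Python tuples stand in for float4 registers, and the helper names (swizzle, select, greater, sort_elements, merge_float4) are hypothetical, introduced only for illustration. It assumes both inputs are already sorted in ascending order.

```python
# Illustrative CPU sketch of the float4 merge step; tuples stand in for float4.
# All helper names here are hypothetical and not taken from the original slides.

IDX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(v, pattern):
    """Return the components of v named by pattern, e.g. swizzle(v, 'wzyx')."""
    return tuple(v[IDX[c]] for c in pattern)


def select(mask, if_true, if_false):
    """Component-wise (mask ? if_true : if_false)."""
    return tuple(t if m else f for m, t, f in zip(mask, if_true, if_false))


def greater(a, b):
    """Component-wise a > b."""
    return tuple(p > q for p, q in zip(a, b))


def sort_elements(r):
    """Sort four values using the three compare/select rounds from the slide."""
    # Round 1: order the pairs (x, y) and (z, w).
    r = select(greater(r, swizzle(r, "yxwz")), swizzle(r, "yyww"), swizzle(r, "xxzz"))
    # Round 2: move the overall minimum to x and the overall maximum to w.
    r = select(greater(r, swizzle(r, "zwxy")), swizzle(r, "zwzw"), swizzle(r, "xyxy"))
    # Round 3: order the two middle components y and z.
    r = select(greater(r, swizzle(r, "xzyw")), swizzle(r, "xzzw"), swizzle(r, "xyyw"))
    return r


def merge_float4(a, b):
    """Merge two ascending 4-tuples into (four lowest, four highest), each sorted."""
    a_rev = swizzle(a, "wzyx")
    b_rev = swizzle(b, "wzyx")
    # Comparing a against reversed b picks the four smallest of the eight values.
    lowest = select(tuple(p < q for p, q in zip(a, b_rev)), a, b_rev)
    # Comparing b against reversed a picks the four largest of the eight values.
    highest = select(tuple(p >= q for p, q in zip(b, a_rev)), b, a_rev)
    return sort_elements(lowest), sort_elements(highest)
```

For example, merge_float4((1, 4, 6, 9), (2, 3, 7, 8)) returns ((1, 2, 3, 4), (6, 7, 8, 9)). The selects only partition the eight values into a low and a high half; each half still needs the sort_elements pass before it is in order.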