Fast Parallel GPU-Sorting Using a Hybrid Algorithm

Erik Sintorn and Ulf Assarsson
Department of Computer Science and Engineering
Chalmers University of Technology
Gothenburg, Sweden
Presented by Mike Patterson
Motivation
• Quickly sorting large sequences (more than 512K elements)
• Distance sorting scenes
– Transparency, particles, etc.
– Depth buffer is flawed
• Z-fighting
• Precision weighted towards near clip plane
– Painter’s algorithm is correct, but expensive
Related Work
• Bitonic Sort: Θ(n·log²(n))
• GPU-ABiSort (Adaptive Bitonic Sorting): O(n·log(n))
• Radix Sort with CUDA: O(k·N)
• Merge Sort: O(n·log(n))
Parallelism on GeForce 8800 GTX
• 128 x 1.35 GHz processors (16 x 8-core multiprocessors)
• 128-bit word size (i.e., four 32-bit floats)
• Ideally 2 threads/core (swap “technically” takes 0 clocks)
Overview of the Algorithm
1. Randomly shuffle elements O(N)
2. Compute initial pivot points O(N)
3. Bucket Sort O(N)
4. Vector Merge Sort O(N·log(L))
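The four steps above can be sketched sequentially as follows. This is a CPU-side illustration only, not the authors' CUDA implementation: the name `hybrid_sort_sketch`, the `num_sublists` and `seed` parameters, and the use of Python's built-in `sorted` as a stand-in for the vector merge sort are all assumptions made for the example.

```python
import random


def hybrid_sort_sketch(values, num_sublists=8, seed=0):
    """Sequential walk-through of the four steps (illustrative, not GPU code)."""
    data = list(values)
    if len(data) < 2:
        return data

    # Step 1: randomly shuffle the elements, O(N).
    random.Random(seed).shuffle(data)

    # Step 2: min and max define linearly interpolated pivot points, O(N).
    lo, hi = min(data), max(data)
    if lo == hi:
        return data  # all elements are equal, nothing to sort

    # Step 3: bucket pass, O(N). With evenly spaced pivots the bucket index
    # can be computed arithmetically instead of searching the pivot list.
    scale = num_sublists / (hi - lo)
    buckets = [[] for _ in range(num_sublists)]
    for v in data:
        index = min(int((v - lo) * scale), num_sublists - 1)
        buckets[index].append(v)

    # Step 4: sort every sublist independently. sorted() is a stand-in
    # for the vector merge sort that the slides describe.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

Because every element in bucket i is no larger than any element in bucket i+1, concatenating the independently sorted buckets yields a fully sorted list. That independence is what lets the sublists be sorted in parallel.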
Initial Pivot Points
• Calculate minimum and maximum
– glMinMax (~1% of total execution time)
• Choose initial pivot points
– Linearly interpolate from min to max
– L-1 pivot points, with L ≥ 2P (L sublists, P processors)
• With 32 SP (GeForce 8600), L=1024 was best
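A minimal sketch of the linear interpolation, assuming the minimum and maximum are already known. The function name `initial_pivots` and its parameters are hypothetical and chosen only for this example.

```python
def initial_pivots(lo, hi, num_sublists):
    """Return num_sublists - 1 pivots evenly spaced between lo and hi."""
    if num_sublists < 2:
        raise ValueError("need at least two sublists")
    step = (hi - lo) / num_sublists
    # Pivot i separates sublist i-1 from sublist i.
    return [lo + step * i for i in range(1, num_sublists)]
```

For example, `initial_pivots(0.0, 8.0, 4)` returns `[2.0, 4.0, 6.0]`, and L = 1024 gives 1023 pivots.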
Bucket Sort
1. Compute histogram (optimized bucket counter)
• Relies on linearly interpolated pivot points
2. Refine pivot points
• Assume uniform distribution over range of each bucket
3. Count elements per bucket
• ∀x | 1 ≤ x ≤ L : size(sublist[x]) ≈ N/L
4. Reposition elements (float4 aligned)
• ∀a ∈ sublist[l + 1], b ∈ sublist[l] : a > b
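The bucket sort steps can be illustrated with the sequential sketch below. It is one possible reading of the slide, not the authors' kernel: the helper names (`histogram`, `refine_pivots`, `reposition`) are hypothetical, the refinement interpolates inside each old bucket under the stated uniform-distribution assumption, and float4 alignment of the sublists is not modeled.

```python
from bisect import bisect_right


def histogram(data, pivots):
    """Count how many elements fall into each of len(pivots) + 1 buckets."""
    counts = [0] * (len(pivots) + 1)
    for v in data:
        counts[bisect_right(pivots, v)] += 1
    return counts


def refine_pivots(lo, hi, pivots, counts):
    """Move the pivots so each bucket should hold about N/L elements,
    assuming values are uniformly spread inside each old bucket."""
    num_buckets = len(counts)
    target = sum(counts) / num_buckets
    edges = [lo] + list(pivots) + [hi]
    refined, seen, b = [], 0, 0  # seen = elements in old buckets before b
    for k in range(1, num_buckets):
        want = k * target
        # Advance to the old bucket that contains the want-th element.
        while b < num_buckets - 1 and seen + counts[b] < want:
            seen += counts[b]
            b += 1
        frac = (want - seen) / counts[b] if counts[b] else 0.0
        refined.append(edges[b] + frac * (edges[b + 1] - edges[b]))
    return refined


def reposition(data, pivots):
    """Scatter elements so every bucket occupies a contiguous range."""
    counts = histogram(data, pivots)
    offsets = [0] * len(counts)
    for i in range(1, len(counts)):  # exclusive prefix sum of the counts
        offsets[i] = offsets[i - 1] + counts[i - 1]
    out, cursor = [None] * len(data), list(offsets)
    for v in data:
        b = bisect_right(pivots, v)
        out[cursor[b]] = v
        cursor[b] += 1
    return out, offsets
```

When the data is far from uniform inside a bucket, the refined pivots still leave the sublists unbalanced, which is consistent with the algorithm being sensitive to the input distribution.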
Vector Merge Sort
// get the four lowest floats
a.xyzw = (a.xyzw < b.wzyx) ? a.xyzw : b.wzyx
// get the four highest floats
b.xyzw = (b.xyzw >= a.wzyx) ? b.xyzw : a.wzyx
sortElements(float4 r)
r = (r.xyzw > r.yxwz) ? r.yyww : r.xxzz
r = (r.xyzw > r.zwxy) ? r.zwzw : r.xyxy
r = (r.xyzw > r.xzyw) ? r.xzzw : r.xyyw
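To make the swizzle operations easier to follow, here is a sketch that emulates them with Python 4-tuples. The function names `sort_elements` and `merge_vectors` are assumptions for this example; the real code operates on float4 registers. Note that both selections in the merge read the original `a` and `b`, before either is overwritten.

```python
def sort_elements(r):
    """Sort a 4-tuple using the three compare-and-select steps."""
    x, y, z, w = r
    # r = (r.xyzw > r.yxwz) ? r.yyww : r.xxzz  -> order the pairs (x,y), (z,w)
    x, y, z, w = min(x, y), max(x, y), min(z, w), max(z, w)
    # r = (r.xyzw > r.zwxy) ? r.zwzw : r.xyxy  -> order the pairs (x,z), (y,w)
    x, y, z, w = min(x, z), min(y, w), max(x, z), max(y, w)
    # r = (r.xyzw > r.xzyw) ? r.xzzw : r.xyyw  -> order the middle pair (y,z)
    y, z = min(y, z), max(y, z)
    return (x, y, z, w)


def merge_vectors(a, b):
    """Merge two sorted 4-tuples into (four lowest, four highest)."""
    # a.xyzw compared against b.wzyx: keep the smaller of each pair.
    low = tuple(ai if ai < bi else bi for ai, bi in zip(a, b[::-1]))
    # b.xyzw compared against a.wzyx: keep the larger of each pair.
    high = tuple(bi if bi >= ai else ai for bi, ai in zip(b, a[::-1]))
    # Each half holds the right elements but not yet in order.
    return sort_elements(low), sort_elements(high)
```

For example, `merge_vectors((1, 4, 6, 9), (2, 3, 7, 8))` returns `((1, 2, 3, 4), (6, 7, 8, 9))`. Comparing one sorted vector against the reverse of the other is the same idea as a bitonic split: the element-wise minima are the four smallest of the eight values, and the maxima are the four largest.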
[Chart: Pentium D @ 2.8 GHz vs. GeForce 8600 GTS; chart labels: 8.6s, 0.561s]
GPU-based hybrid algorithm is 6-14 times faster for 1-8M elements, respectively
Transferring elements to/from the GPU using CUDA takes ~10% of total time
w/ random shuffling step
GPUSort requires power-of-two (POT) input sizes
GPUSorti is virtual
Distance Sorting Two Different Scenes (timings in milliseconds)
Conclusion
• GPU-based sorting algorithm
• Supports lists of arbitrary size
• Ideal for more than 512K elements
• Sensitive to input distribution