GRACE: GPU Accelerated Ray Tracing for Astrophysics
Sam Thomson*, Martin Rüfenacht and Eric Tittley
*[email protected]

Abstract
Ray tracing is the most accurate method for numerically modelling the transport of radiation, visible light or otherwise. However, its high computational cost makes coupled simulations of (ray-traced) radiation transport, gas dynamics and gravity prohibitively expensive. Such simulations are nonetheless necessary if we are to accurately model the early universe on cosmological, galactic and stellar scales. The cost can be somewhat mitigated by the highly parallel nature of the problem, which makes it well suited to computation on GPUs. We describe here our implementation of a GPU ray-tracing library, using NVIDIA's CUDA platform, targeted at astrophysical smoothed particle hydrodynamics (SPH) simulations. grace is a template library and may be easily adapted to other datasets. We achieve favourable performance: a factor of four faster on one GPU than an optimized CPU code running on 32 cores.

The Problem
Put simply, we are given a set of overlapping spheres (the SPH dataset) and a source (a point in the simulation which emits radiation); we must shoot rays out from this source and, for each ray, obtain a list of all intersected spheres. Additional processing may occur at an intersection; we integrate optical depth over a ray's path through each sphere.

The Bounding Volume Hierarchy
We use a binary-tree bounding volume hierarchy (BVH) to accelerate the traversal. The dataset is divided into two child (inner) nodes, which are themselves divided into two further children; this process terminates when a child contains < φ particles, at which point it becomes a leaf node. Each node has an axis-aligned bounding box (AABB), tightly containing all of the particles within it. These AABBs are tested for intersection against the rays to quickly eliminate many particles from the search.

BVH Construction on GPUs
An efficient method of building hierarchies on the GPU was first presented by Lauterbach et al. (2009). At a high level, this proceeds as follows.
1. Compute each particle's Morton key and sort (see Fig. 1).
2. Split this 1D curve at the point where the most-significant bit changes from 0 to 1 (∼ spatial mid-point in x).
3. Split each half at the change in the second-most-significant bit (∼ spatial mid-point in y).
4. Continue, recursively, cycling through x, y and z (see Fig. 3).
To compute an N-bit key from co-ordinates x, y and z in [0, 1), we interleave the N/3 highest bits of their binary representations; a minimal encoding sketch follows the figure captions below.

Figure 1: The Morton curve passing through all points in a slice of a 128³-particle, (10 Mpc)³ simulation (1 pc = 3.09 × 10¹⁶ metres). The line colour, from red (lowest key values) to pink (highest key values), is based on the full set of keys.

Figure 3: The first three levels of a binary-tree BVH. Coloured blocks are nodes, along with their key prefixes, and arrows show parent-child relationships. All Morton keys starting with the bits 00 are contained in the left child of the root node, and those starting with 01 are contained in the right. This (approximately) cuts the set of keys into two halves. These two halves are also subdivided, and the process repeats recursively.
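As a concrete illustration of the key encoding described above, the sketch below computes a 30-bit Morton key (10 bits per axis) by interleaving the bits of the scaled co-ordinates. This is a minimal sketch and not the grace implementation; the function names, the 10-bits-per-axis choice and the helper expand_bits_10() are illustrative.

// Minimal sketch (not the grace source): 30-bit Morton key from co-ordinates
// in [0, 1), using 10 bits per axis. Names are illustrative.
#include <cstdint>

// Spread the lowest 10 bits of v so that they occupy every third bit
// (bit i of v moves to bit 3i of the result).
__host__ __device__ inline uint32_t expand_bits_10(uint32_t v)
{
    v &= 0x000003FFu;                      // keep 10 bits
    v = (v | (v << 16)) & 0xFF0000FFu;
    v = (v | (v <<  8)) & 0x0300F00Fu;
    v = (v | (v <<  4)) & 0x030C30C3u;
    v = (v | (v <<  2)) & 0x09249249u;
    return v;
}

// x, y and z are assumed to be pre-normalized to [0, 1). The x bits take the
// most-significant position of each triplet, so the key's most-significant
// bit corresponds to the spatial mid-point in x.
__host__ __device__ inline uint32_t morton_key_30bit(float x, float y, float z)
{
    const uint32_t xi = static_cast<uint32_t>(x * 1024.0f);  // 10-bit integer co-ordinates
    const uint32_t yi = static_cast<uint32_t>(y * 1024.0f);
    const uint32_t zi = static_cast<uint32_t>(z * 1024.0f);
    return (expand_bits_10(xi) << 2) | (expand_bits_10(yi) << 1) | expand_bits_10(zi);
}

In practice the keys would be computed for all particles in a single kernel and sorted (e.g. with thrust::sort_by_key) before construction begins.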
BVH Construction Implementation
We implement the O(n)-complexity algorithm presented by Apetrei (2014), computing the hierarchy and AABBs simultaneously. Node m is defined to split the hierarchy between particles m and m + 1, and building proceeds from the bottom up, as shown in Fig. 2.

Figure 2: Here δ(i) ≡ kᵢ ⊕ kᵢ₊₁, and δ(−1) = δ(3) ≡ ∞. Red squares are nodes and green circles are leaves/particles; the example leaf keys are k₀ = 001, k₁ = 010, k₂ = 011 and k₃ = 101. Each thread is assigned to a leaf and compares its key, kᵢ, to its neighbours'. The neighbour with the 'closer' key, determined via δ, defines the parent index. For example, leaf 1 computes δ(0) = 001 ⊕ 010 = 011 and δ(1) = 010 ⊕ 011 = 001; since δ(1) < δ(0), node 1 is its parent. δ may be any commutative function; in practice, we use δ(i) ≡ ‖rᵢ − rᵢ₊₁‖₂.

Two threads will compute the same parent index; the second to do so continues up the tree. This is realized with atomicAdd() and an array of per-node counters. The tree-climb kernel is iterative; each call adds new layers to the tree. This allows us to store the per-node counters in fast shared memory. A thread exits the tree climb after reaching a node containing > φ particles, and that node's children become wide leaves. Once all wide leaves are found, they form the base of a new tree climb, completing the tree. We find φ ∼ 32 optimal.

Traversing the Tree
Each ray is assigned to a single thread. Traversal then proceeds as follows, starting at the root node:
1. If the current node's AABB is intersected by the ray, its left child becomes the current node and its right child is pushed to a stack. Repeat this step.
2. If the current node is missed, pop the stack.
3. If the stack is empty, we are done.
A stackless implementation, where each node contains go-to indices, performs less well than a local stack in light of the optimizations below.

Traversal Optimization
Ray tracing typically exhibits scattered memory accesses. Our key optimizations for improving memory coherence are:
• Rays are sorted first by their origin, then sub-sorted by the Morton key of their normalized direction vector.
• Rays are processed in warp-sized packets: if any ray within a packet wishes to visit a node, the entire packet must do so. This also prevents thread divergence.
• Each node is packed into 64 bytes (cf. an L2 cache line of 32 bytes), containing the child node indices and both child AABBs.
Fast CPU methods for testing a ray against an AABB contain substantial branching; we instead implement the standard Smits (1998) test using only min and max functions. A sketch of the traversal loop and this test is given below.
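The following is a minimal sketch, under assumed data layouts, of a per-thread stack traversal together with a branchless Smits-style slab test built only from min/max. Because each node stores both child AABBs (see the packing optimization above), the children's boxes are tested from the parent. The structure definitions, names and the placeholder leaf handling are illustrative rather than the grace API, and the packet and ray-sorting optimizations are omitted for brevity.

// Illustrative sketch only; layouts and names are assumptions, not grace's API.
#include <cuda_runtime.h>

struct AABB { float3 lo, hi; };

struct Node
{
    int  left, right;                 // child indices
    AABB left_box, right_box;         // both child AABBs stored in the parent
    bool left_is_leaf, right_is_leaf; // leaf encoding is illustrative
};

struct Ray
{
    float3 o;      // origin
    float3 inv_d;  // 1 / direction, precomputed per ray
    float  t_max;  // maximum distance along the ray
};

// Branchless slab test: only min/max operations, no per-axis branching.
__device__ bool hit_aabb(const Ray& r, const AABB& b)
{
    float tx1 = (b.lo.x - r.o.x) * r.inv_d.x, tx2 = (b.hi.x - r.o.x) * r.inv_d.x;
    float ty1 = (b.lo.y - r.o.y) * r.inv_d.y, ty2 = (b.hi.y - r.o.y) * r.inv_d.y;
    float tz1 = (b.lo.z - r.o.z) * r.inv_d.z, tz2 = (b.hi.z - r.o.z) * r.inv_d.z;

    float tmin = fmaxf(fmaxf(fminf(tx1, tx2), fminf(ty1, ty2)), fminf(tz1, tz2));
    float tmax = fminf(fminf(fmaxf(tx1, tx2), fmaxf(ty1, ty2)), fmaxf(tz1, tz2));

    return tmax >= fmaxf(tmin, 0.0f) && tmin <= r.t_max;
}

// One ray per thread, with a small local stack of node indices.
__device__ void trace_ray(const Ray& ray, const Node* nodes, int root, int* hit_counter)
{
    int stack[64];
    int top = 0;
    stack[top++] = root;

    while (top > 0)
    {
        const Node node = nodes[stack[--top]];

        if (hit_aabb(ray, node.left_box))
        {
            if (node.left_is_leaf) atomicAdd(hit_counter, 1);  // placeholder for per-leaf work
            else                   stack[top++] = node.left;
        }
        if (hit_aabb(ray, node.right_box))
        {
            if (node.right_is_leaf) atomicAdd(hit_counter, 1); // placeholder for per-leaf work
            else                    stack[top++] = node.right;
        }
    }
}

In the real code, the per-leaf work would be, for example, accumulating optical depth through the intersected particles, and the node layout would follow the 64-byte packing described above.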
Performance
• For Nparticles = 128³, tree construction takes ∼ 20 ms.
• Accumulating optical depth along N ≳ 10³ rays, on a single Tesla M2090 we achieve ∼ 4× the performance of an OpenMP, SIMD-optimized CPU code running on two 16-core 2.3 GHz AMD Opteron 6276s.
• Computing data for every ray-particle intersection, in order, along each ray, we compare our code in Fig. 4 to the SPH radiative transfer code sphray (Altay et al., 2008).

Figure 4: The time taken to trace rays out from the centre of an Nparticles = 128³ SPH dataset, where all per-ray intersection data is output (in order). Run time in seconds is plotted against Nrays for sphray (1-core and 6-core) and for grace on M2090, K20 and GTX 970 GPUs. The dashed green line is an estimated time for a hypothetical, parallelized version of sphray running on six cores. The CPU used was a 2.66 GHz Intel Xeon X5650.

Outlook
We have presented a new GPU ray-tracing code for SPH datasets, attaining a performance increase of more than an order of magnitude over previous efforts. This is currently being combined with a GPU radiative transfer code, which achieves similar gains over its CPU ancestor. The result will be coupled to the hydrodynamics code GIZMO.

References
Altay, G., Croft, R., and Pelupessy, I. (2008). Mon. Not. R. Astron. Soc., 386(4):1931–1946.
Apetrei, C. (2014). In Borgo, R. and Tang, W., editors, Comput. Graph. Vis. Comput.
Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., and Manocha, D. (2009). Comput. Graph. Forum, 28(2):375–384.
Smits, B. (1998). J. Graph. Tools, 3(2):1–14.

Acknowledgements
We acknowledge support from the Science and Technology Facilities Council. The beamerposter template for this poster is due to Nathaniel Johnston (www.nathanieljohnston.com).