GRACE: GPU Accelerated Ray Tracing for Astrophysics

Sam Thomson*, Martin Rüfenacht and Eric Tittley
*[email protected]

Abstract
Ray tracing is the most accurate method for numerically modelling the transport of radiation, visible light or otherwise. However, its high computational cost makes coupled simulations of (ray-traced) radiation transport, gas dynamics and gravity prohibitively expensive. Such simulations are nonetheless necessary if we are to accurately model the early universe on cosmological, galactic and stellar scales. The cost can be somewhat mitigated by the highly parallel nature of the problem, which makes it well suited to computation on GPUs. We describe here our implementation of a GPU ray tracing library, using NVIDIA’s CUDA platform, targeted at astrophysical smoothed particle hydrodynamics (SPH) simulations. grace is a template library and may be easily adapted to other datasets. We achieve favourable performance: a factor of four faster on one GPU than an optimized CPU code running on 32 cores.

The Problem
Put simply, we are given a set of overlapping spheres (the
SPH dataset) and a source (a point in the simulation which
emits radiation); we must shoot rays out from this source
and, for each ray, obtain a list of all intersected spheres.
Additional processing may occur at an intersection; we integrate optical depth over a ray’s path through each sphere.
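The basic geometric test implied here can be sketched in a few lines. The following is a stand-alone illustration, not the grace interface; the type and function names are ours, and the particle's smoothing radius is written h.

```cuda
// Hypothetical types for illustration only; grace's own types differ.
struct Ray3   { float ox, oy, oz; float dx, dy, dz; };  // direction normalized
struct Sphere { float x, y, z, h; };                    // centre and radius h

// True if the semi-infinite ray passes within distance h of the centre.
__host__ __device__ bool ray_hits_sphere(const Ray3& r, const Sphere& s)
{
    // Vector from the ray origin to the sphere centre.
    float px = s.x - r.ox, py = s.y - r.oy, pz = s.z - r.oz;
    // Distance along the ray of the closest approach to the centre.
    float t = px * r.dx + py * r.dy + pz * r.dz;
    if (t < 0.0f) {
        // Closest approach lies behind the origin: a hit only if the origin
        // itself is inside the sphere.
        return px * px + py * py + pz * pz < s.h * s.h;
    }
    // Squared perpendicular distance from the centre to the ray.
    float ex = px - t * r.dx, ey = py - t * r.dy, ez = pz - t * r.dz;
    return ex * ex + ey * ey + ez * ez < s.h * s.h;
}
```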

The Bounding Volume Hierarchy
We use a binary tree bounding volume hierarchy (BVH) to
accelerate the traversal. The dataset is divided into two child
(inner) nodes, which are themselves divided into two further
children; this process terminates when a child contains < φ
particles, at which point it becomes a leaf node. Each node
has an axis-aligned bounding box (AABB), tightly containing all of the particles within it. These AABBs are tested for
intersection against the rays to quickly eliminate many particles
from the search.

BVH Construction on GPUs
An efficient method of building hierarchies on the GPU was
first presented by Lauterbach et al. (2009). At a high level,
this proceeds as follows.
1 Compute each particle’s Morton key and sort (see Fig. 1 and the sketch after this list).
2 Split this 1D curve at the point where the most-significant
bit changes from 0 to 1 (∼ spatial mid-point in x).
3 Split each half at the change in the second-most significant
bit (∼ spatial mid-point in y).
4 Continue, recursively, cycling through x, y and z (see
Fig. 3).
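A minimal sketch of the key computation in step 1, following the bit-interleaving described in the caption of Fig. 1; this is an illustration rather than the grace code, and the names are ours. Each co-ordinate in [0, 1) is quantized to its 10 highest bits, which are then interleaved into a 30-bit key.

```cuda
#include <cstdint>

// Spread the lowest 10 bits of x so that two zero bits separate each
// original bit (standard bit-interleaving masks).
__host__ __device__ inline uint32_t expand_bits_10(uint32_t x)
{
    x &= 0x3ffu;
    x = (x | (x << 16)) & 0x030000ffu;
    x = (x | (x <<  8)) & 0x0300f00fu;
    x = (x | (x <<  4)) & 0x030c30c3u;
    x = (x | (x <<  2)) & 0x09249249u;
    return x;
}

// 30-bit Morton key for a point with co-ordinates in [0, 1): take the 10
// highest fractional bits of each co-ordinate and interleave them as x, y, z.
__host__ __device__ inline uint32_t morton_key_30(float x, float y, float z)
{
    uint32_t xi = static_cast<uint32_t>(x * 1024.0f);   // 2^10 cells per axis
    uint32_t yi = static_cast<uint32_t>(y * 1024.0f);
    uint32_t zi = static_cast<uint32_t>(z * 1024.0f);
    return (expand_bits_10(xi) << 2) | (expand_bits_10(yi) << 1) | expand_bits_10(zi);
}
```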

Figure 1: The Morton curve passing through all points in a slice of a 128^3-particle, (10 Mpc)^3 simulation (1 pc = 3.09 × 10^16 metres). The line colour, from red (lowest key values) to pink (highest key values), is based on the full set of keys. To compute an N-bit key from co-ordinates x, y and z in [0, 1), we interleave the N/3 highest bits of their binary representations.

Figure 3: The first three levels of a binary tree BVH. Coloured blocks are nodes, along with their key prefixes, and arrows show parent-child relationships. All Morton keys starting with the bits 00 are contained in the left child of the root node, and those starting with 01 are contained in the right. This (approximately) cuts the set of keys into two halves. These two halves are also subdivided, and the process repeats recursively.

BVH Construction Implementation
We implement the O(n)-complexity algorithm presented by Apetrei (2014), computing the hierarchy and AABBs simultaneously. Node m is defined to split the hierarchy between particles m and m + 1, and building proceeds from the bottom up, as shown in Fig. 2.

Figure 2: An example of the bottom-up construction for four leaves with keys k_0 = 001, k_1 = 010, k_2 = 011 and k_3 = 101, where δ(i) ≡ k_i ⊕ k_{i+1} and δ(−1) = δ(3) ≡ ∞. Red squares are nodes and green circles are leaves/particles. Each thread is assigned to a leaf and compares its key, k_i, to its neighbours’. The neighbour with the ‘closer’ key, determined via δ, defines the parent index. E.g. leaf 1 computes δ(0) = 001 ⊕ 010 = 011 and δ(1) = 010 ⊕ 011 = 001, finding δ(1) < δ(0), hence node 1 is the parent. δ may be any commutative function; in practice, we use δ(i) ≡ ‖r_i − r_{i+1}‖_2.
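The parent-selection rule in the caption can be written down directly. The helper below is a hypothetical illustration, not grace code: it takes the range of keys [first, last] covered by a leaf or node, uses XOR for δ (as in the worked example), and stands in UINT32_MAX for the ∞ assigned to the out-of-range neighbours.

```cuda
#include <cstdint>

// Given the range of keys [first, last] covered by a node (or a single leaf,
// first == last), return the index of its parent node: the neighbour with
// the smaller ('closer') delta wins, exactly as in the Figure 2 example.
__host__ __device__ int parent_index(const uint32_t* keys, int n_keys,
                                     int first, int last)
{
    // delta(-1) and delta(n_keys - 1) are treated as infinity.
    uint32_t d_left  = (first == 0)         ? UINT32_MAX : keys[first - 1] ^ keys[first];
    uint32_t d_right = (last == n_keys - 1) ? UINT32_MAX : keys[last] ^ keys[last + 1];
    return (d_right < d_left) ? last : first - 1;
}
```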
Two threads will compute the same parent index; the second to do so continues up the tree. This is realized with atomicAdd() and an array of per-node counters. The tree-climb kernel is iterative; each call adds new layers to the tree. This allows us to store the per-node counters in fast shared memory.
A thread exits the tree-climb after reaching a node containing
> φ particles, and that node’s children become wide leaves.
Once all wide leaves are found, they form the base of a new tree
climb, completing the tree. We find φ ∼ 32 optimal.
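A minimal sketch of the atomicAdd()-guarded climb described above, under assumed data layouts (zero-initialized per-node counters and per-node range arrays); it is not the grace kernel: the counters live in global rather than shared memory, the wide-leaf cut-off at φ particles is omitted, and the AABB merging is left as a comment. parent_index() is the helper sketched after Fig. 2.

```cuda
#include <cstdint>

// One thread per leaf: climb towards the root. At each internal node the
// first thread to arrive records its half of the node's key range and exits;
// the second thread finds both halves present and carries on, so every node
// is processed exactly once, and only after both of its children are done.
__global__ void tree_climb(int n_leaves, const uint32_t* keys,
                           int* range_first, int* range_last, int* counters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_leaves) return;

    int first = i, last = i;                        // a leaf covers one key
    while (!(first == 0 && last == n_leaves - 1))   // stop at the root
    {
        int node = parent_index(keys, n_leaves, first, last);

        // Tell the parent which end of its range we supply.
        if (node == last) range_first[node] = first;  // we are its left child
        else              range_last[node]  = last;   // we are its right child
        __threadfence();                              // make the write visible

        // First arrival reads 0 and exits; second arrival reads 1.
        if (atomicAdd(&counters[node], 1) == 0)
            return;

        // Second to arrive: both ends are now known; adopt the full range.
        first = range_first[node];
        last  = range_last[node];

        // ... merge the two (now finished) child AABBs into this node ...
    }
}
```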

Traversing the Tree
Each ray is assigned to a single thread. Traversal then proceeds as follows, starting at the root node (a minimal sketch follows the list):
1 If the current node’s AABB is intersected by the ray, its left
child becomes the current node and its right child is pushed
to a stack. Repeat this step.
2 If the current node is missed, pop the stack.
3 If the stack is empty, we are done.
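A minimal sketch of this loop with a small per-thread stack; the node layout, ray type and intersection test are placeholders rather than grace's own.

```cuda
// Hypothetical, simplified node layout: child indices, a leaf flag and the
// node's own AABB (grace instead packs both child AABBs into the parent; see
// Traversal Optimization).
struct Node { int left, right; bool is_leaf; float lo[3], hi[3]; };
struct Ray  { float o[3], d[3]; };

// Placeholder: always 'hit'. A real implementation would be an AABB test
// such as the min/max slab test sketched under Traversal Optimization.
__device__ bool intersects_aabb(const Ray&, const Node&) { return true; }

__device__ void trace_ray(const Node* nodes, const Ray& ray)
{
    int stack[64];                    // small per-thread traversal stack
    int top  = 0;                     // number of deferred nodes
    int node = 0;                     // start at the root

    while (true)
    {
        if (intersects_aabb(ray, nodes[node]))
        {
            if (!nodes[node].is_leaf)
            {
                stack[top++] = nodes[node].right;  // push the right child
                node = nodes[node].left;           // descend into the left
                continue;                          // step 1: repeat
            }
            // ... test the particles in this leaf against the ray ...
        }
        if (top == 0) break;          // step 3: stack empty, this ray is done
        node = stack[--top];          // step 2: missed (or leaf done), pop
    }
}
```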
A stackless implementation, where each node contains go-to indices, performs less well than a local stack in light of the optimizations described below.

Traversal Optimization
Ray tracing typically exhibits scattered memory accesses. Our key optimizations for improving memory coherence are:
• Rays are sorted first by their origin, then sub-sorted by the Morton key of their normalized direction vector.
• Rays are processed in warp-sized packets: if any ray within a packet wishes to visit a node, the entire packet must do so. This also prevents thread divergence.
• Each node is packed into 64 bytes (cf. an L2 cache line of 32 bytes), containing child node indices and both child AABBs.
Fast CPU methods for testing a ray against an AABB contain substantial branching; we instead implement the standard Smits (1998) test using only min and max functions, as sketched below.
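One branch-free way to write such a min/max slab test (an illustration in the spirit of Smits (1998), not the grace code; the reciprocal ray direction is assumed precomputed):

```cuda
struct AABB { float lo[3], hi[3]; };

// Returns true if the ray origin + t*dir, 0 <= t <= t_max, enters the box.
// inv_dir[k] is 1 / dir[k], precomputed once per ray.
__device__ bool ray_hits_aabb(const float origin[3], const float inv_dir[3],
                              float t_max, const AABB& box)
{
    float t_near = 0.0f;
    float t_far  = t_max;
    for (int k = 0; k < 3; ++k)
    {
        // Distances to the two slab planes on this axis.
        float t0 = (box.lo[k] - origin[k]) * inv_dir[k];
        float t1 = (box.hi[k] - origin[k]) * inv_dir[k];
        t_near = fmaxf(t_near, fminf(t0, t1));   // latest entry so far
        t_far  = fminf(t_far,  fmaxf(t0, t1));   // earliest exit so far
    }
    return t_near <= t_far;   // the three slab intervals overlap
}
```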

Performance
• For N_particles = 128^3, tree construction takes ∼ 20 ms.
• Accumulating optical depth along N ≳ 10^3 rays, on a single Tesla M2090 we achieve ∼ 4× the performance of an OpenMP, SIMD-optimized CPU code running on two 16-core 2.3 GHz AMD Opteron 6276s.
• Computing data for every ray-particle intersection, in-order, along each ray, we compare in Fig. 4 our code to the SPH radiative transfer code sphray (Altay et al., 2008).

Figure 4: The time taken to trace rays out from the centre of an N_particles = 128^3 SPH dataset as a function of the number of rays, N_rays, where all per-ray intersection data is output (in-order), for sphray on a single core and for grace on M2090, K20 and GTX 970 GPUs. The dashed green line is an estimated time for a hypothetical, parallelized version of sphray running on six cores. The CPU used was a 2.66 GHz Intel Xeon X5650.

Outlook
We have presented a new GPU ray tracing code for SPH
datasets, attaining a performance increase over previous efforts of more than an order of magnitude. This is currently
being combined with a GPU radiative transfer code, which
achieves similar gains over its CPU ancestor. The result
will be coupled to the hydrodynamics code GIZMO.

References
Altay, G., Croft, R., and Pelupessy, I. (2008). Mon. Not. R. Astron. Soc., 386(4):1931–1946.
Apetrei, C. (2014). In Borgo, R. and Tang, W., editors, Comput. Graph. Vis. Comput.
Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., and Manocha, D. (2009). Comput. Graph. Forum, 28(2):375–384.
Smits, B. (1998). J. Graph. Tools, 3(2):1–14.

Acknowledgements
We acknowledge support from the Science and Technology Facilities Council. The beamerposter template for this poster is due to Nathaniel Johnston (www.nathanieljohnston.com).