Poster

Parallel Pipelined Traversal Unit for Hardware Accelerated Ray Tracing
Jin-Woo Kim1, Won-Jong Lee2, Min-Woo Lee1, Tack-Don Han1
1Yonsei University, Korea
2SAIT, Samsung Electronics, Korea
Traversal Operation of Ray Tracing
Ray tracing generates more realistic images than rasterization, but it requires tremendous computational power for traversal and ray-primitive intersection tests. Traversal is the process of searching an acceleration structure (AS), such as a kd-tree or bounding volume hierarchy (BVH), to find a small subset of the primitives to test against the ray.
Our Approach: Parallel-pipeline-based Traversal Unit
The traversal operation consists of sub-pipelines for the node fetch/leaf node test, the node AABB-ray intersection test, and the stack operation; the sub-pipeline a ray enters depends on its state.
Our Traversal Algorithms
Child BVH [Aila et al., 2009]
This method stores the two child nodes of a node consecutively, so that both can be fetched and traversed together when tracing rays.
→ reduces memory traffic for stack operations
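The consecutive child layout can be sketched as follows. This is a minimal illustration, not the authors' data structure: the `BVHNode` class, the `first_child` field, and the `fetch_children` helper are hypothetical names, and bounding boxes are omitted to keep the focus on the memory layout.

```python
from dataclasses import dataclass


@dataclass
class BVHNode:
    label: str
    first_child: int = -1  # index of the left child; -1 marks a leaf (assumed encoding)


def fetch_children(nodes, index):
    """Return both children of an inner node with one contiguous read."""
    first = nodes[index].first_child
    if first < 0:
        raise ValueError("leaf nodes have no children")
    # Siblings are adjacent, so a single slice covers both of them.
    left, right = nodes[first:first + 2]
    return left, right


# Flat node array in which siblings always sit next to each other.
nodes = [
    BVHNode("root", first_child=1),
    BVHNode("A", first_child=3),  # left child of root
    BVHNode("B"),                 # right child of root (leaf)
    BVHNode("A0"),                # left child of A (leaf)
    BVHNode("A1"),                # right child of A (leaf)
]

left, right = fetch_children(nodes, 0)
```

Because one index addresses both children, an inner node needs to store only a single child reference.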
Single pipeline: A single pipeline connects these operations serially to maximize the overall throughput.
Parallel pipeline: The sub-pipelines are independent, and their outputs feed a crossbar through which rays are fed back to the input stage. After each operation, the processed ray is either fed back through the output crossbar or passed on for shading or the intersection test, depending on the next operation to be processed.
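The feedback decision made at the output crossbar can be sketched as a lookup from the ray's next state to a destination. The state names (TRV, LCHD, RCHD, PRM, POST) follow the poster; the sub-pipeline names and the `IST`/`SHADE` exit states are assumptions made for illustration only.

```python
# Sub-pipeline serving each ray state. State names follow the poster;
# the sub-pipeline names are hypothetical.
SUBPIPELINE_FOR_STATE = {
    "TRV": "leaf_test",
    "LCHD": "aabb_test",
    "RCHD": "aabb_test",
    "PRM": "aabb_test",
    "POST": "stack_op",
}


def route(next_state):
    """Crossbar decision: feed the ray back to a sub-pipeline, or let it leave the unit."""
    if next_state in SUBPIPELINE_FOR_STATE:
        return ("feedback", SUBPIPELINE_FOR_STATE[next_state])
    # Any other state (assumed here: "IST" or "SHADE") leaves the traversal unit.
    return ("exit", next_state)


def trace_route(states):
    """Return the destination chosen for each state in a ray's state sequence."""
    return [route(state) for state in states]


hops = trace_route(["TRV", "LCHD", "RCHD", "POST", "IST"])
```

In this sketch a ray only ever visits the sub-pipeline that matches its next state, which is the property the parallel design relies on.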
• Less hardware-intensive: the buffers and crossbar are small (4 to 8 entries)
Short Stack Using Restart Trail [Laine, 2010]
This method stores one bit per hierarchy level to represent whether the near node has been visited or not.
→ reduces memory traffic for stack operations
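The idea of one bit per level can be illustrated with a deliberately simplified, fully stackless traversal. This is a sketch of the principle, not Laine's full algorithm: it assumes both children of every inner node must be visited, treats the left child as the near child, and omits the short stack, so every pop restarts from the root. The tuple-based tree encoding is also an assumption.

```python
def traverse_with_restart_trail(root):
    """Visit every leaf of a binary tree using only a per-level bit trail.

    Inner nodes are (near, far) tuples; anything else is a leaf.
    """
    trail = []    # trail[d] == 1 means the near child at depth d is already done
    visited = []
    while True:
        node, depth = root, 0
        # Restart from the root and let the trail bits steer the descent.
        while isinstance(node, tuple):
            if depth == len(trail):
                trail.append(0)
            node = node[1] if trail[depth] else node[0]
            depth += 1
        visited.append(node)
        # "Pop": find the deepest level on this path whose far child is still pending.
        d = depth - 1
        while d >= 0 and trail[d] == 1:
            d -= 1
        if d < 0:
            return visited  # every level is finished
        trail[d] = 1        # near child at depth d is done; take the far child next
        del trail[d + 1:]   # clear the bits of all deeper levels


tree = (("A", "B"), ("C", ("D", "E")))
order = traverse_with_restart_trail(tree)
```

The trail replaces the traversal stack with a few bits of state, at the cost of re-descending from the root; the short stack in the cited method exists to avoid most of those restarts.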
• Reduces inessential data transfer: a ray is fed back to its sub-pipeline as soon as its state changes
• The configurability of this structure enables us to improve performance in a cost-effective manner
Intersection Test Culling Using primAABBs
This method performs a ray-primAABB test using the existing traversal unit before sending the ray to the intersection unit.
→ reuses the ray-AABB test unit
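A minimal sketch of the culling step, assuming that a primAABB is the axis-aligned bounding box of a single primitive and that the AABB test is the usual slab test. The function names and the tuple-based ray and box representation are hypothetical.

```python
def ray_hits_aabb(origin, direction, box_min, box_max, t_max=float("inf")):
    """Slab test: does the ray hit the box for some t in [0, t_max]?"""
    t_near, t_far = 0.0, t_max
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def cull_primitives(origin, direction, prim_aabbs):
    """Return indices of primitives whose AABB the ray hits.

    Only these would be forwarded to the intersection unit.
    """
    return [
        i for i, (lo, hi) in enumerate(prim_aabbs)
        if ray_hits_aabb(origin, direction, lo, hi)
    ]


prim_aabbs = [
    ((1.0, -1.0, -1.0), (2.0, 1.0, 1.0)),    # in front of the ray
    ((1.0, 2.0, 2.0), (2.0, 3.0, 3.0)),      # off to the side
    ((-3.0, -1.0, -1.0), (-2.0, 1.0, 1.0)),  # behind the ray origin
]
survivors = cull_primitives((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), prim_aabbs)
```

Primitives whose box is missed never reach the more expensive ray-primitive test, and the box test itself runs on hardware the traversal unit already has.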
A full paper version will be announced in the near future.
Simulation Setup
Traversal Pipeline
The traversal operation consists of a leaf node test, an AABB-ray intersection test, and a stack operation.
Inner node: TRV → LCHD → RCHD → POST
Leaf node: TRV → PRM → PRM → … → POST
1. Leaf node test (STATE_TRV)
   → Performed only once, at the first step
2. AABB-ray intersection test (STATE_LCHD, STATE_RCHD, STATE_PRM)
   Result of leaf node test == true (leaf node)
   → the AABB test is repeated once per primitive
   Result of leaf node test == false (inner node)
   → the AABB tests of the left and right children are performed
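[Figure: traversal state diagram. From STATE_TRV, a leaf node goes to STATE_PRM (N-primitive iteration, or none for 0 primitives) and a non-leaf node goes to STATE_LCHD and STATE_RCHD; STATE_TRV_POST performs the stack pop/push.]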
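The state sequences described above can be generated with a few lines of code. This is only a sketch of the sequencing: the function name and its parameters are hypothetical, and the case of a leaf with no primitives going directly from TRV to POST is an interpretation of the "0 Primitive" label in the diagram.

```python
def state_sequence(is_leaf, num_primitives=0):
    """Return the list of states a ray passes through for one node visit."""
    if is_leaf:
        # One AABB test (PRM) per primitive in the leaf, then the stack operation.
        return ["TRV"] + ["PRM"] * num_primitives + ["POST"]
    # Inner node: test the left child, then the right child, then the stack operation.
    return ["TRV", "LCHD", "RCHD", "POST"]


inner = state_sequence(is_leaf=False)
leaf = state_sequence(is_leaf=True, num_primitives=3)
empty_leaf = state_sequence(is_leaf=True, num_primitives=0)
```

The length of the leaf sequence depends on data that is known only after the leaf node test, which is why the ray's path through the unit cannot be fixed in advance.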
To verify the proposed architecture, we used the cycle-accurate simulator of SGRT [Lee et al. 2012]. This simulator reads scene and ray data from files and simulates the execution of our architecture.
• Core clock: 1 GHz
• Memory clock: GDDR3, 1 GHz (simulator from GPGPU-Sim [Bakhoda et al. 2009])
• Resolution: 1024x720
• Ray types: primary rays and ambient occlusion (AO) rays (data from Aila’s GPU ray tracer [Aila et al., 2009])
3. Stack operation (STATE_POST)
   → Performed only once, at the last step
• 16 TRV (traversal) units with 8 KB cache
Problem
Existing hardware engines are based on a single deep pipeline in order to increase ray-processing throughput [Nah et al. 2011][Lee et al. 2012]. Traversal operations, however, involve nondeterministic changes in the state of a ray. Therefore, in some cases, a ray may be unnecessarily transferred between pipeline stages, thereby increasing the overall latency.
[Figure: Traversal Pipeline. Labels: Node Fetch/Leaf Node Test, Result. Example leaf-node state sequence: TRV → PRM PRM → PRM PRM → POST.]
• Single pipeline, total: 18 cycles
• Parallel pipeline, leaf node test: 2 cycles
• Parallel pipeline, AABB test: 12 cycles
• Parallel pipeline, stack operation: 4 cycles
• 1 IST (intersection) unit with 8 KB cache, 36 cycles
• Test scenes
Scene            Triangles   Nodes
Conference       190K        202K
Crytek sponza    279K        276K
Fairy Forest     174K        169K
Sibenik          80K         79K
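The cycle counts in the setup (2, 12, and 4 cycles for the three sub-pipelines, 18 cycles for the single pipeline) allow a rough per-node latency comparison. The following is a toy model, not the cycle-accurate simulator used for the results. It assumes that in the parallel design each state costs only the latency of its own sub-pipeline, and, pessimistically, that in the single pipeline every state costs one full 18-cycle pass. The poster does not state the second assumption, so the single-pipeline figure should be read as an upper bound. Memory stalls and queueing are ignored.

```python
# Sub-pipeline latency per state, in cycles (values from the simulation setup).
STATE_LATENCY = {"TRV": 2, "LCHD": 12, "RCHD": 12, "PRM": 12, "POST": 4}
SINGLE_PIPELINE_PASS = 18  # total depth of the single pipeline, in cycles


def parallel_latency(states):
    """Each state pays only for the sub-pipeline that serves it."""
    return sum(STATE_LATENCY[s] for s in states)


def single_latency_upper_bound(states):
    """ASSUMPTION: every state requires one full pass through the single pipeline."""
    return SINGLE_PIPELINE_PASS * len(states)


inner_node = ["TRV", "LCHD", "RCHD", "POST"]
parallel_cycles = parallel_latency(inner_node)           # 2 + 12 + 12 + 4
single_cycles = single_latency_upper_bound(inner_node)   # 4 passes of 18 cycles
```

Under these assumptions an inner-node visit takes 30 cycles in the parallel design against at most 72 in the single pipeline. The measured throughput gain is smaller than this ratio suggests, since throughput also depends on how many rays are in flight.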
According to the results, our approach achieves a performance improvement of up to 30% with coherent rays.
Performance (Mray/s)
Scene                       Ray type   Single pipeline   Parallel pipeline
Conference (190K tris)      Primary    230.71            253.02 (9.8%)
                            AO         655.70            734.69 (12.1%)
Crytek sponza (279K tris)   Primary     86.55            112.68 (30.2%)
                            AO         157.78            196.65 (24.6%)
Fairy Forest (174K tris)    Primary    162.75            185.43 (13.9%)
                            AO         331.91            362.44 (0.2%)
Sibenik (79K tris)          Primary    192.35            236.40 (22.9%)
                            AO         377.98            426.54 (12.9%)
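The percentages in the table can be recomputed from the two throughput columns as (parallel − single) / single. The sketch below does this with the values listed above; the variable and function names are hypothetical. Most recomputed values agree with the listed ones to within about 0.1 percentage point, but the Fairy Forest AO entry comes out at roughly 9.2% rather than the listed 0.2%.

```python
# Throughput in Mray/s, in table order:
# Conference, Crytek sponza, Fairy Forest, Sibenik; Primary then AO for each.
single = [230.71, 655.70, 86.55, 157.78, 162.75, 331.91, 192.35, 377.98]
parallel = [253.02, 734.69, 112.68, 196.65, 185.43, 362.44, 236.40, 426.54]


def improvement_percent(single_mrays, parallel_mrays):
    """Relative throughput gain of the parallel pipeline, in percent."""
    return (parallel_mrays - single_mrays) / single_mrays * 100.0


gains = [improvement_percent(s, p) for s, p in zip(single, parallel)]
best = max(range(len(gains)), key=lambda i: gains[i])  # index of the largest gain

for s, p, g in zip(single, parallel, gains):
    print(f"{s:8.2f} -> {p:8.2f}  {g:5.1f}%")
```

The largest gain is the Crytek sponza primary-ray case, consistent with the stated maximum of about 30% for coherent rays.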
According to the results, the AABB sub-pipeline is heavily used. This means that the configurability of this structure enables us to improve performance in a cost-effective manner.
Reference
• AILA, T., AND LAINE, S. 2009. Understanding the efficiency of ray traversal on GPUs. In Proc. of High Performance Graphics.
• LAINE, S. 2010. Restart trail for stackless BVH traversal. In Proc. of High Performance Graphics.
• LEE, W.-J., LEE, S.-H., NAH, J.-H., KIM, J.-W., SHIN, Y., LEE, J., AND JUNG, S.-Y. 2012. SGRT: a scalable mobile GPU architecture based on ray tracing. In ACM SIGGRAPH 2012 Talks, ACM, New York, NY, USA, SIGGRAPH ’12, 2:1–2:1.
• NAH, J.-H., PARK, J.-S., PARK, C., KIM, J.-W., JUNG, Y.-H., PARK, W.-C., AND HAN, T.-D. 2011. T&I engine: traversal and intersection engine for hardware accelerated ray tracing. ACM Trans. Graph. 30, 6 (Dec.), 160:1–160:10.
• BAKHODA, A., YUAN, G., FUNG, W., WONG, H., AND AAMODT, T. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software 2009, 163–174.
This work was supported by Samsung Electronics Co., Ltd.