Parallel Pipelined Traversal Unit for Hardware Accelerated Ray Tracing

Jin-Woo Kim1, Won-Jong Lee2, Min-Woo Lee1, Tack-Don Han1
1Yonsei University, Korea
2SAIT, Samsung Electronics, Korea

Traversal Operation of Ray Tracing

Ray tracing generates more realistic images than rasterization, but it requires tremendous computational power for traversal and ray-primitive intersection tests. Traversal is the process of searching an acceleration structure (AS), such as a kd-tree or bounding volume hierarchy (BVH), to find a small subset of the primitives to test against the ray. The traversal operation consists of sub-pipelines for the node fetch/leaf-node test, the node AABB-ray intersection test, and the stack operation; which of them a ray needs next depends on the state of the ray.

Problem

Existing hardware engines [Nah et al. 2011] [Lee et al. 2012] are based on a single deep pipeline in order to increase the throughput of ray processing per unit time. Traversal operations, however, involve nondeterministic changes in the state of a ray. In some cases the ray is therefore transferred unnecessarily between pipeline stages, which increases the overall latency.

Our Approach: Parallel-pipeline-based Traversal Unit

Single pipeline: the node fetch/leaf-node test, the AABB-ray intersection test, and the stack operation are connected serially to maximize the overall throughput.

Parallel pipeline: the sub-pipelines are independent, and their outputs are connected to a crossbar through which rays are fed back to the input stage. After each operation, the processed ray is either fed back through the output crossbar or passed on for shading or the intersection test, according to the next operation to be processed.

• Less hardware-intensive: the buffers and the crossbar are small (4~8 entries).
• Reduced inessential data transfer: a ray is immediately fed back to the sub-pipeline it needs when its state changes.
• Configurability: the structure can be configured to improve performance in a cost-effective manner.

Our Traversal Algorithms

1. Child BVH [Aila et al., 2009]: this method stores the two child nodes consecutively, then fetches and traverses both child nodes when tracing rays. Benefit: reduced memory traffic for stack operations.
2. Short stack using restart trail [S. Laine, 2010]: this method stores one bit per hierarchy level to record whether the near node has been visited. Benefit: reduced memory traffic for stack operations.
3. Intersection test culling using primAABBs: this method performs a ray-primAABB test using the existing traversal unit before sending the ray to the intersection unit. Benefit: reuse of the ray-AABB test unit.

A full paper version will be announced in the near future.

Simulation Setup

To verify the proposed architecture, we used the cycle-accurate simulator of SGRT [Lee et al. 2012]. The simulator collects scene and ray data from files and simulates the execution of our architecture.

• Core clock: 1 GHz
• Memory clock: GDDR3 at 1 GHz (simulator from GPGPU-Sim [Bakhoda et al. 2009])
• Resolution: 1024x720
• Ray type: primary ray and ambient occlusion (AO); data from Aila's GPU ray tracer [Aila et al., 2009]
• 16 TRV (traversal) units with 8 KB cache

Traversal Pipeline

The traversal operation consists of the leaf-node test, the AABB-ray intersection test, and the stack operation.

1. Leaf-node test (STATE_TRV): performed only once, at the beginning.
2. AABB-ray intersection test (STATE_LCHD, STATE_RCHD, STATE_PRM): if the result of the leaf-node test is true (leaf node), the AABB test is iterated once per primitive (STATE_PRM). If the result is false (inner node), the AABB tests of the left and right children are performed (STATE_LCHD, STATE_RCHD).
3. Stack operation (STATE_POST, shown as STATE_TRV_POST in the state diagram): stack pop/push, performed only once, at the end.

State sequences:
Inner node: TRV -> LCHD -> RCHD -> POST
Leaf node: TRV -> PRM -> PRM -> … -> POST (one PRM per primitive)
Example from the pipeline figure (leaf node with four primitives): TRV -> PRM -> PRM -> PRM -> PRM -> POST
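The state sequences of the traversal pipeline, combined with the sub-pipeline latencies listed on the poster (leaf-node test 2 cycles, AABB test 12 cycles, stack operation 4 cycles, 18 cycles for the single pipeline), can be turned into a small latency model. The following Python sketch is illustrative only and is not the authors' simulator. It assumes that in the single pipeline every state visit costs one full 18-cycle pass, while in the parallel pipeline a state visit costs only the latency of the sub-pipeline it uses. Crossbar, buffer, and cache delays are ignored, and all function and constant names are hypothetical.

```python
from enum import Enum


class State(Enum):
    """Ray states named on the poster."""
    TRV = "TRV"    # leaf-node test
    LCHD = "LCHD"  # AABB test of the left child
    RCHD = "RCHD"  # AABB test of the right child
    PRM = "PRM"    # AABB test of one primitive (primAABB)
    POST = "POST"  # stack pop/push


# Sub-pipeline latencies in cycles, as listed on the poster.
LEAF_TEST_CYCLES = 2
AABB_TEST_CYCLES = 12
STACK_OP_CYCLES = 4
# The single pipeline chains all three stages: 2 + 12 + 4 = 18 cycles.
SINGLE_PIPELINE_CYCLES = LEAF_TEST_CYCLES + AABB_TEST_CYCLES + STACK_OP_CYCLES

# Which sub-pipeline latency each state pays in the parallel organization.
SUBPIPELINE_CYCLES = {
    State.TRV: LEAF_TEST_CYCLES,
    State.LCHD: AABB_TEST_CYCLES,
    State.RCHD: AABB_TEST_CYCLES,
    State.PRM: AABB_TEST_CYCLES,
    State.POST: STACK_OP_CYCLES,
}


def node_states(is_leaf, num_primitives=0):
    """Return the state sequence a ray goes through at one node."""
    if is_leaf:
        return [State.TRV] + [State.PRM] * num_primitives + [State.POST]
    return [State.TRV, State.LCHD, State.RCHD, State.POST]


def parallel_latency(states):
    """Each state visit only pays for the sub-pipeline it actually uses."""
    return sum(SUBPIPELINE_CYCLES[s] for s in states)


def single_latency(states):
    """ASSUMPTION: each state visit is one full pass through all stages."""
    return len(states) * SINGLE_PIPELINE_CYCLES


if __name__ == "__main__":
    leaf = node_states(is_leaf=True, num_primitives=4)
    inner = node_states(is_leaf=False)
    print([s.value for s in leaf])
    print("leaf :", single_latency(leaf), "vs", parallel_latency(leaf))
    print("inner:", single_latency(inner), "vs", parallel_latency(inner))
```

Under these assumptions, a leaf node with four primitives takes 108 cycles in the single pipeline and 54 cycles in the parallel pipeline, and an inner node takes 72 versus 30 cycles. The model only captures per-ray latency; it says nothing about throughput, which depends on how many rays are in flight.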
Pipeline latency:
• Single pipeline: 18 cycles in total
• Parallel pipeline: leaf-node test 2 cycles, AABB test 12 cycles, stack operation 4 cycles

Intersection: 1 IST (intersection) unit with 8 KB cache, 36 cycles.

Test scenes:

            Conference   Crytek Sponza   Fairy Forest   Sibenik
Triangles   190K         279K            174K           80K
Nodes       202K         276K            169K           79K

According to the results, our approach shows a performance improvement of up to 30% with coherent rays.

Performance (Mray/s):

Scene                       Ray type   Single pipeline   Parallel pipeline
Conference (190K tris)      Primary    230.71            253.02 (9.8%)
Conference (190K tris)      AO         655.70            734.69 (12.1%)
Crytek Sponza (279K tris)   Primary    86.55             112.68 (30.2%)
Crytek Sponza (279K tris)   AO         157.78            196.65 (24.6%)
Fairy Forest (174K tris)    Primary    162.75            185.43 (13.9%)
Fairy Forest (174K tris)    AO         331.91            362.44 (9.2%)
Sibenik (79K tris)          Primary    192.35            236.40 (22.9%)
Sibenik (79K tris)          AO         377.98            426.54 (12.9%)

According to the results, the AABB sub-pipeline has high usage. This means that the configurability of this structure enables us to improve the performance in a cost-effective manner.

Reference

• AILA, T., AND LAINE, S. 2009. Understanding the efficiency of ray traversal on GPUs. In Proc. of High Performance Graphics.
• BAKHODA, A., YUAN, G., FUNG, W., WONG, H., AND AAMODT, T. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software 2009, 163-174.
• LAINE, S. 2010. Restart trail for stackless BVH traversal. In Proc. of High Performance Graphics.
• LEE, W.-J., LEE, S.-H., NAH, J.-H., KIM, J.-W., SHIN, Y., LEE, J., AND JUNG, S.-Y. 2012. SGRT: a scalable mobile GPU architecture based on ray tracing. In ACM SIGGRAPH 2012 Talks, ACM, New York, NY, USA, SIGGRAPH '12, 2:1-2:1.
• NAH, J.-H., PARK, J.-S., PARK, C., KIM, J.-W., JUNG, Y.-H., PARK, W.-C., AND HAN, T.-D. 2011. T&I engine: traversal and intersection engine for hardware accelerated ray tracing. ACM Trans. Graph. 30, 6 (Dec.), 160:1-160:10.

This work was supported by Samsung Electronics Co., Ltd.
[email protected] [email protected] [email protected] [email protected]