A Ray Tracing Hardware Architecture for Dynamic Scenes

by Sven Woop

A thesis submitted in partial fulfillment of the requirements for the degree of Diplom-Informatiker (Diploma in Computer Science)

Completed under the supervision of Jörg Schmittler and Prof. Dr.-Ing. Philipp Slusallek at the Universität des Saarlandes, Fachrichtung 6.2 - Informatik, Computer Graphik, Im Stadtwald - Geb. 36.1, Raum 018, 66123 Saarbrücken

March 29, 2004

[email protected]

Copyright © 2004 by Sven Woop

Statutory Declaration

I hereby declare under oath that I have written this thesis independently and have used no aids other than those indicated.

Saarbrücken, March 29, 2004
Sven Woop

Acknowledgements

I would like to thank Jörg Schmittler for his assistance and for spending several nights getting the prototype to work. Thanks to Prof. Slusallek for his support and constructive criticism.

Abstract

This thesis describes a ray tracing hardware architecture for dynamic scenes that makes it possible to ray trace highly complex scenes in real time. At first glance, efficient ray tracing of dynamic scenes does not seem possible, because ray tracing requires an acceleration structure whose creation is very costly. The well-known solution to this problem is to partition the scene into movable objects, which requires a top-level acceleration structure over the objects and a bottom-level acceleration structure within each object. The presented architecture efficiently supports such partitioned scenes by using one transformation unit for both the triangle intersection and the object space transformation. A prototype of the hardware architecture has been implemented on an FPGA; it is the first working special purpose real-time ray tracing hardware available today. The performance and implementation of this prototype are discussed in detail at the end of this thesis.
Contents

1 Introduction
2 Previous Work
3 The Basic Ray Tracing Algorithm
    3.1 k-D Trees
        3.1.1 k-D Tree Creation
        3.1.2 Recursive k-D Tree Traversal
        3.1.3 Packet k-D Tree Traversal
4 The Dynamic Ray Tracing Algorithm
    4.1 Top-Level k-D Tree Creation
    4.2 Bounding Box Clipping
    4.3 Overlapping Objects
        4.3.1 Hierarchical k-D Trees
        4.3.2 Mailboxing
        4.3.3 Multiple Scenes
    4.4 Ray Transformation
    4.5 Hit-Distance Transformation
    4.6 Normal Transformation
5 Triangle Intersection
    5.1 Affine Triangle Transformation
        5.1.1 Memory Efficient Triangle Transformation
        5.1.2 Normal Consistent Triangle Transformation
    5.2 Unit Triangle Intersection
6 The Dynamic SaarCOR Architecture
    6.1 Dynamic Ray Tracing Core
        6.1.1 Traversal Unit
        6.1.2 Mailboxed List Unit
        6.1.3 Transformation Unit
        6.1.4 Intersection Unit
        6.1.5 Balancing
    6.2 Shading Unit
        6.2.1 Primary Rays
        6.2.2 Light Rays
        6.2.3 Reflection Rays
7 FPGA Prototype
    7.1 Implementation Statistics
        7.1.1 Gate Count
        7.1.2 Complexity
    7.2 Performance Statistics
        7.2.1 Hardware Quality Index
        7.2.2 Graphics Hardware Quality Index
        7.2.3 Usage
        7.2.4 Cache Hit Rate
        7.2.5 Memory Bandwidth
        7.2.6 Performance
8 Conclusion
9 Future Work
10 Appendix A
    10.1 Office
    10.2 Gael
    10.3 Conference
    10.4 Trees4000

List of Figures

3.1 Ray Tracing Basics
3.2 k-D Tree Semantics
3.3 k-D Tree Example
3.4 k-D Tree Traversal Example
3.5 Hit-Distance Computation
3.6 Traversal Decisions
3.7 Packet Traversal
3.8 Example of an Invalid Packet
4.1 Dynamic Acceleration Structure
4.2 Ray Transformation into Object Space
4.3 Bounding Box of Object Instances
4.4 Bounding Box Clipping
4.5 Bounding Box Clipping Example
4.6 Overlapping Objects
4.7 Room Problem
4.8 Hierarchical k-D Trees as Solution to the Room Problem
4.9 Normal Transformation
5.1 Unit Triangle Intersection
6.1 Dynamic Ray Tracing Architecture
6.2 Traversal Unit
6.3 Mailboxed List Unit
6.4 Transformation Unit
6.5 Compressable Packets
6.6 Reflection Matrix Illustration
7.1 ADMXRC Development Platform
7.2 ADMXRC Top-Level Flowchart
7.3 Dynamic SaarCOR Prototype
7.4 Hardware Optimized Hilbert Curve
7.5 Cache Hit Rate using the Hardware Optimized Hilbert Curve
7.6 Hardware Quality Index
7.7 Graphics Hardware Quality Index
7.8 Usage of Units
7.9 Frame Rate
7.10 Cache Hit Rate
7.11 Memory Bandwidth
7.12 Scalability

List of Tables

4.1 Millions of operations for various strategies
7.1 Maximum Cache Size per Unit
7.2 Gate Count Computation
7.3 Complexity of one Ray Tracing Pipeline
7.4 Gate Count and Memory Bits per Unit using 32 Packets
7.5 Gate Count and Memory Bits per Unit using 512 Cache Lines

Chapter 1
Introduction

Ray tracing is one of the most popular rendering techniques for creating highly realistic images. However, because it is a computationally expensive recursive algorithm that requires large memory bandwidth, implementing it in hardware is a challenging task. As a consequence, the state of the art in interactive 3D computer graphics is still rasterization hardware.

The rasterization algorithm is efficient for scenes consisting of few triangles, while ray tracing is not. Today's computer graphics hardware can handle scenes with several hundred thousand triangles. This is made possible by high memory bandwidth and high floating point performance. For instance, Nvidia's GeForce 3 [1] offers 76 GFlops at a clock rate of 200 MHz and has a 256 bit wide memory interface running at 230 MHz, delivering a memory bandwidth of 7.2 GB/s.

In recent years the scenes of standard computer games have become more and more detailed. Computer games are of course developed for the current graphics card standard, but rasterization hardware will become a limiting factor in the near future. Because the main concept behind rasterization hardware is to project each triangle of the scene to a frame- and z-buffer, the rasterization algorithm scales linearly in the number of triangles of the scene. Furthermore, it is difficult to parallelize the rasterization algorithm, as the bandwidth to the frame- and z-buffer becomes critical: each triangle that is projected onto the image plane involves many memory accesses to the frame- and z-buffer. If the triangles of the scene are large, the performance consequently drops. For a detailed description of the rasterization algorithm see any standard textbook on computer graphics, for example that by Shirley [2].

Ray tracing does not suffer from these problems. Tracing single rays can trivially be parallelized because the rays do not depend on each other. Moreover, it can be shown that the ray tracing algorithm scales logarithmically in the number of triangles in the scene [3]. The main difficulties are that the initial hardware cost for ray tracing is high and that the memory interface to the scene database has to deliver sufficient bandwidth to the ray tracing units working in parallel. Later in this thesis it will be shown that it is possible to deliver the required bandwidth using fairly small caches.

A main advantage of the ray tracing algorithm is that it simulates reality by supporting different kinds of lighting effects like reflections, refractions, shadows and even real-time global illumination [4]. For a human viewer these effects are very important for understanding the three-dimensional relation between the objects of the scene. Here, shadows play an especially important role. Rasterization hardware does support some of these effects, but only by using multi-pass rasterization tricks to fake them. These multi-pass techniques (to produce shadows, for instance) are often non-obvious and difficult to implement. In contrast, ray tracing offers an extremely simple and intuitive shading model. For instance, it is simple to shoot a ray from a point in the scene to a light source to check whether the point lies in the shadow of the light source or not.

Until recently, the large number of memory accesses (which are more or less randomly distributed over the scene data) and the expensive computation made it impossible to create a real-time ray tracing system. Lately a lot of work has been done to cope with these problems.
Taking advantage of the coherence between neighboring rays to reduce the memory bandwidth, and using a cluster of processors to provide enough computational power, made real-time ray tracing possible. Such a software based real-time ray tracing system has been developed by the Computer Graphics Lab of Saarland University [5, 6, 7]. However, these techniques require a lot of costly, albeit standard, hardware.

The SaarCOR project takes a different approach. Instead of using standard PC hardware for the computation, it is more efficient to create special purpose hardware that is optimized for the ray tracing application. Jörg Schmittler designed such an architecture, which is called SaarCOR (Saarbrücken's Coherence Optimized Ray Tracer). This architecture has been fully simulated with very good results [8].

Up to now SaarCOR has been limited to static scenes and to a standard k-D tree as acceleration structure. In such a static scene the camera can be moved around, but the objects themselves cannot be moved. This is a severe limitation which makes it impossible, for instance, to develop a computer game for the standard SaarCOR architecture.

In this thesis a ray tracing hardware architecture for dynamic scenes is presented, based on the SaarCOR architecture. As ray tracing heavily relies on precomputations, ray tracing dynamic scenes appears difficult. Thus a data structure is required that allows as many precomputations as possible while still allowing objects in the scene to be moved. This can be achieved by partitioning the scene into movable objects and building a top-level acceleration structure over them. This top-level acceleration structure needs to be recomputed each time an object has been moved. Each object itself contains a precomputed bottom-level acceleration structure that never changes. To traverse a ray through the bottom-level acceleration structure of an object, the ray has to be transformed to the object's local coordinate system.
This requires a transformation unit, which can also serve the triangle intersection when the new intersection method described in Section 5 is used; the transformation then acts as a kind of precomputation. With the structured scene representation it is also possible to share geometry by placing the same object at several positions. This reduces the size of the representation of most scenes.

A prototype of the hardware architecture has been implemented on an FPGA; it is the first working special purpose real-time ray tracing hardware available today. Most of the concepts of this thesis can be understood without detailed knowledge of FPGAs or ASICs, but a short description is given here.

An FPGA (field programmable gate array) can be seen as an array of CLBs (configurable logic blocks) with some programmable routing resources to connect the individual CLBs. The internal structure of these CLBs differs from architecture to architecture. In Xilinx FPGAs, the CLBs mainly consist of some registers and LUTs (look up tables). LUTs are programmable 4-to-1 function generators that can be used together with the routing resources to encode any circuit. The circuit in the FPGA can be reconfigured arbitrarily often. For a detailed description of FPGAs see the book “Field Programmable Gate Arrays” [9].

In contrast, an ASIC (application specific integrated circuit) consists of an array of NAND gates. The interconnection between different gates is done by some extra silicon layers that are added to the chip. Thus a main difference from FPGAs is that ASICs are in no way reconfigurable. The advantages of ASICs are their high gate capacity, their low price when produced in large quantities, and their high speed compared to FPGAs. A description of ASIC design can be found in the book “Application-Specific Integrated Circuits” [10].

At the beginning of this thesis the basics of the ray tracing algorithm using k-D trees are explained.
To achieve dynamics, the standard k-D trees are extended to 2-level k-D trees, and the transformations needed for the 2-level traversal algorithm are discussed, as well as some problems that might occur. The next Section describes a new triangle intersection method that is used in the hardware architecture. These Sections form the basis for understanding the ray tracing hardware architecture for dynamic scenes, presented in the following Chapter. The prototype implementation of the architecture is then described and a detailed analysis of its performance is given. The last part summarizes this thesis and shows areas of future work.

Chapter 2
Previous Work

The state of the art in interactive ray tracing is software based systems. Several approaches have already been realized on MIMD and SIMD architectures [11, 12, 13], exploiting the coherence between neighboring rays. By parallelizing the algorithm on supercomputers [14, 15, 16, 17, 18, 19] and, more recently, on standard PCs [6, 7], interactive ray tracing has become possible.

Besides these software based ray tracing systems some special purpose hardware has been developed. As the most costly operation of the ray tracing algorithm is the ray triangle intersection, the first commercially available ray tracing accelerator performed this operation only [20, 21]. This accelerator has no hardware support for the traversal operation and is therefore not able to do ray tracing in real time. In 1999, Pfister et al. published the VolumePro 500 architecture, which is a single-chip real-time volume rendering hardware [22].

A different approach is to map the ray tracing application to a multiprocessor architecture on a single chip [23], which should be available in the near future. Purcell has simulated a ray tracer for such an architecture delivering real-time performance [24]. A kind of multi-processor vector architecture is present in today's high end graphics cards too, in the form of programmable pixel shaders.
It has been shown that the ray tracing application can be mapped to these shaders [25].

Ray tracing of dynamic scenes is a new topic of research. The paper “Distributed Interactive Ray Tracing of Dynamic Scenes” [26] discusses the basics of the 2-level ray tracing algorithm for dynamic scenes that is used in the hardware architecture presented in this thesis. Instead of rebuilding the acceleration structure to achieve dynamics, it is possible to use special algorithms to update it. For example, with a hierarchical grid as acceleration structure, an object's position in the scene can be updated in constant time [27].

Chapter 3
The Basic Ray Tracing Algorithm

Ray tracing is a simulation technique for creating realistic images of three-dimensional scenes. This is done by shooting imaginary rays through a scene and interpreting the resulting intersections, as described in this Section.

In a real environment light is emitted by some light sources and then distributed in the scene in a manner consistent with physical laws. If a camera is placed in this environment, some light enters it and an image is projected onto the image plane. The physical theory of this light distribution is well known today, but simulating it exactly is difficult, since the available computational power is strongly limited. Thus in practice some approximations need to be made.

In contrast to reality, ray tracing goes the opposite way and follows the light back from the camera to the light sources. This is done by shooting so-called primary rays for each pixel of the image from the camera into the scene and computing the closest object that is hit by the ray, the hit-object. This shooting of a ray to determine the hit-object is called ray casting, and the origin of the primary rays is the projection center of the camera. The point in 3D space where the object is hit is called the hit-point of the ray (see Figure 3.1).
After computing the hit-object and hit-point of a primary ray it is known which object is visible through the pixel; the algorithm thus performs a kind of visible surface computation. At this stage a shader corresponding to the material of the hit-object is called, which has the task of computing the color of the pixel using the intersection results of the ray with the hit-object. The shader computes the pixel color based on several material properties of the hit-object, such as the object's color, surface normal, reflectivity and transparency, and on scene properties such as the light sources.

Figure 3.1: A 2 dimensional example of the ray tracing algorithm. For each pixel of the image a primary ray is shot into the scene and the closest object that is hit by the ray is computed.

More advanced shaders shoot secondary rays to simulate several light effects. For example, light rays can be shot from the light sources of the scene to the hit-point of the ray to compute whether the hit-point lies in the shadow of a light source or not. Reflections can be computed by using the surface normal to construct a reflection ray that determines which geometry is seen through the reflective surface. The shading computation in detail is beyond the scope of this thesis. For further information about shading see the book “Fundamentals of Computer Graphics” [2].

A costly part of the algorithm is the ray casting operation to find the closest hit-object. To do this efficiently, a data structure is required which subdivides the space of the scene into subspaces, so that the objects at a given location can be found efficiently. Such a data structure is called an acceleration structure as it accelerates the ray casting operation. Many acceleration structures exist; some are recursive and others are flat data structures [28]. In the hardware architecture presented in this thesis only k-D trees are used as acceleration structure, thus the basics of k-D trees are explained in the next Section.

3.1 k-D Trees

A k-D tree is an acceleration structure that is typically used for ray tracing to accelerate the ray casting operation. It recursively subdivides a k-dimensional space containing some objects into subspaces along axis aligned planes, and stores for each of these subspaces the geometry it contains. Because ray tracing is applied to a 3D space, only this case will be discussed here. The scene subdivision is encoded as a binary tree, the k-D tree. Each leaf node of this tree specifies one of the subspaces and contains a list of all objects that lie in the subspace.

Using this recursive data structure it is possible to efficiently find the closest object hit by a ray. This is done by determining the subspaces through which the ray passes. In the order in which the ray passes through these subspaces, it is intersected with the geometry in each subspace. This walk through the subspaces is called the traversal operation, and it terminates as soon as a hit-point in the current subspace has been found. The traversal operation is very efficient, as only the geometry in the subspaces the ray traverses needs to be considered in the intersection calculation. Geometry far away from the ray will never be touched if the subdivision is fine enough.

Definition 3.1.1. A plane $h$ in $\mathbb{R}^3$ can be defined by an implicit function $H(x) = n \cdot x - d = 0$ with $n \neq 0$, $n \in \mathbb{R}^3$ and $d \in \mathbb{R}$. We define $h^+ = \{x \in \mathbb{R}^3 \mid H(x) \ge 0\}$ and $h^- = \{x \in \mathbb{R}^3 \mid H(x) < 0\}$ to be the positive and negative half-space bounded by $h$, respectively. Let $k \in \{1, 2, 3\}$ be the so-called splitting axis and $n = e_k$ the $k$-th unit vector; then we call the plane $h$ an axis aligned splitting plane and the value $d$ the splitting position.

Definition 3.1.2. A k-D tree $T$ is defined by the following grammar:

$T ::= \mathrm{Node}((k, d), T_{left}, T_{right}) \mid \mathrm{Leaf}(\{Object_1, \ldots, Object_n\})$, where each $Object_i \subset \mathbb{R}^3$ is closed.

A node of a k-D tree is either a normal Node containing an axis aligned splitting plane $(k, d)$ and a left and a right subtree ($T_{left}$ and $T_{right}$), or a Leaf node containing a set of objects. This set is empty if the number of objects $n$ is 0. An object is a closed subset of $\mathbb{R}^3$. In practice mostly triangles or cubes will be used as objects.

The semantics of the k-D tree assigns a subspace $S(T)$ to each node $T$ of a k-D tree. The subspace of the root node is defined as $\mathbb{R}^3$. If $S(T)$ is the subspace of the node $T = \mathrm{Node}((k, d), T_{left}, T_{right})$ and $h$ the plane defined by $(k, d)$, then the subspace of the left subtree is $S(T_{left}) = S(T) \cap h^-$ and the subspace of the right subtree is $S(T_{right}) = S(T) \cap h^+$. Figure 3.2 shows this subdivision scheme of the space.

Figure 3.2: This Figure shows how the space is recursively subdivided by k-D trees. The large box is the subspace of node T and is split into two halves by the splitting plane p1 = (1, d). The normal of this splitting plane is parallel to the x-axis, and the plane passes through the point (d, 0, 0).

As the splitting planes in the nodes of the k-D tree are axis aligned, the k-D tree is also called an axis aligned BSP tree (binary space partitioning tree). It is possible to use non axis aligned splitting planes too, which leads to general BSP trees and more complex traversal computations. In the following only the case of axis aligned splitting planes will be considered.

3.1.1 k-D Tree Creation

The task of the k-D tree creation algorithm is to build a k-D tree for a scene consisting of several objects. Thus it has to subdivide the space of the scene recursively into subspaces. It starts with the complete space $\mathbb{R}^3$ containing all the geometry of the scene.
Then an axis aligned splitting plane is selected which splits the space into the left and the right subspace according to the semantics of the k-D tree. For each of the two subspaces the objects that intersect it are computed. Note that objects can belong to both subspaces. The subspaces together with the objects intersecting them are handled recursively by the algorithm. If some termination criterion is fulfilled, the subdivision of the current subspace is terminated and a leaf node containing all objects in it is created.

This is the main concept of every k-D tree creation algorithm. Different algorithms mostly differ only in the heuristics that are used to search for the splitting plane and in the termination criterion. Algorithm 3.1.3 defines the createKDTree function in an abstract way. It takes a subspace S and a set O of objects and returns a k-D tree. The subspaces can be represented as simple bounding boxes (that are possibly infinite) and the set of objects as arrays or lists. For an example of a simple k-D tree in 2 dimensions see Figure 3.3.

Figure 3.3: Figure (b) shows a k-D tree for the simple 2D scene of Figure (a). The labels on the inner nodes of the k-D tree give the splitting plane, and the leaf nodes contain a list of objects.

Algorithm 3.1.3. k-D Tree Creation

    function createKDTree(S, O)
    begin
        if termination criterion is fulfilled then
            return Leaf(O)
        select an axis aligned splitting plane h by some criterion
        S_left  = S ∩ h−
        S_right = S ∩ h+
        O_left  = {x ∈ O | x ∩ S_left ≠ ∅}
        O_right = {x ∈ O | x ∩ S_right ≠ ∅}
        T_left  = createKDTree(S_left, O_left)
        T_right = createKDTree(S_right, O_right)
        return Node(h, T_left, T_right)
    end

There are two issues we have not dealt with yet. The first one is how to select the splitting plane and the second one is what the termination criterion looks like.
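Before turning to these two issues, the recursion of Algorithm 3.1.3 can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the creation algorithm used for the prototype: the scene space is a finite axis aligned box rather than all of R3, objects are represented by their axis aligned bounding boxes, and both open choices are filled with simple placeholders (a midpoint split of the largest extent, and a depth or object-count limit). The names `Box`, `overlaps`, `max_depth` and `min_objects` are hypothetical, and axes are counted from 0 instead of 1.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# (min corner, max corner) of an axis aligned box; hypothetical helper type.
Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]

@dataclass
class Leaf:
    objects: List[Box]

@dataclass
class Node:
    axis: int      # splitting axis k, 0-based here (the text counts 1..3)
    pos: float     # splitting position d
    left: "Tree"
    right: "Tree"

Tree = Union[Node, Leaf]

def overlaps(obj: Box, space: Box) -> bool:
    """True if the closed boxes obj and space share at least one point.

    This is conservative: an object that only touches the splitting plane
    is kept on both sides.
    """
    return all(obj[0][i] <= space[1][i] and obj[1][i] >= space[0][i]
               for i in range(3))

def create_kd_tree(space: Box, objects: List[Box], depth: int = 0,
                   max_depth: int = 8, min_objects: int = 2) -> Tree:
    # Placeholder termination criterion: depth limit or few objects left.
    if depth >= max_depth or len(objects) <= min_objects:
        return Leaf(objects)
    lo, hi = space
    # Placeholder heuristic: split the largest extent exactly in the middle.
    axis = max(range(3), key=lambda i: hi[i] - lo[i])
    pos = 0.5 * (lo[axis] + hi[axis])
    left_hi = tuple(pos if i == axis else hi[i] for i in range(3))
    right_lo = tuple(pos if i == axis else lo[i] for i in range(3))
    left_space, right_space = (lo, left_hi), (right_lo, hi)
    # An object straddling the plane is referenced from both children.
    left_objs = [o for o in objects if overlaps(o, left_space)]
    right_objs = [o for o in objects if overlaps(o, right_space)]
    return Node(axis, pos,
                create_kd_tree(left_space, left_objs, depth + 1,
                               max_depth, min_objects),
                create_kd_tree(right_space, right_objs, depth + 1,
                               max_depth, min_objects))
```

Calling `create_kd_tree(scene_box, boxes)` returns either a `Leaf` or a `Node` whose children cover the two halves of `scene_box`; swapping in a different heuristic only changes the two lines that choose `axis` and `pos`.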
As a simple approach, the splitting plane can be selected such that the largest dimension of the subspace is split exactly in the middle. It can be shown that this is not very efficient, especially if the objects in the scene are not evenly distributed [3]. As termination criterion a maximal tree depth in conjunction with a minimal number of objects in the leaves can be used, for instance.

A different, more advanced approach is to search for the optimal splitting plane with respect to a cost function. Such a function was proposed by Havran [3] and can be used as a termination criterion too, by comparing the cost of splitting with the cost of not splitting.

3.1.2 Recursive k-D Tree Traversal

The reason why we introduced k-D trees was to optimize the ray casting operation, which means computing the closest hit-point of a ray with the scene. The k-D tree subdivides the scene into subspaces. Thus the sequence of subspaces a ray traverses can be determined in order to intersect the ray with the geometry stored in them. The algorithm that performs this enumeration of the subspaces is called the k-D tree traversal algorithm. In conjunction with an object intersection algorithm, the closest hit-point of the ray with the scene can be computed.

Definition 3.1.4. A ray $R$ is represented by a tuple $R = (org, dir) \in (\mathbb{R}^3)^2$. The first component $org$ of the tuple is a point of $\mathbb{R}^3$ and represents the origin of the ray. The second component $dir$ is a vector of $\mathbb{R}^3$ and specifies the direction of the ray. The points on the ray can be computed by $R(x) = org + x \cdot dir$ for $x \ge 0$.

Definition 3.1.5. Such a ray $R$ hits an object $obj$ if there is a $\lambda \in [0, +\infty)$ such that $R(\lambda) \in obj$. The minimal $\lambda$ with this property is called the hit-distance of the ray to the object and $R(\lambda)$ the hit-point. Because an axis aligned splitting plane is a closed subset of $\mathbb{R}^3$, we can define the terms hit-distance and hit-point the same way for rays and splitting planes.
Figure 3.4: The ray R of Figure (a) is traversed through the k-D tree as shown in Figure (b).

Since a k-D tree is a recursive data structure, the k-D tree traversal algorithm is a recursive algorithm as well. It works recursively on the nodes of the k-D tree and makes a traversal decision at each node. The traversal decision determines whether the ray traverses the subspace of the left and/or right subtree, and the order in which it traverses them. Using this traversal decision the algorithm follows the ray through the k-D tree data structure by working on the subtree that is traversed first and putting the other one onto the stack. If a leaf node is reached, the intersection algorithm is called to intersect the ray with each object stored in the leaf node, and the closest hit-point is determined. If this hit-point lies in the subspace of the leaf node, a valid hit-point has been found and the ray is called a terminated ray. In such a case, or if the stack is empty, the algorithm terminates. Otherwise, it continues by obtaining the next node from the stack.

To compute the traversal decision the algorithm needs the near and far values, which are the distances to the entry-point and exit-point of the ray with the subspace of the current node. Using these near and far values together with the distance $d$ to the splitting plane of the current node, the traversal decision can be computed. If $\delta$ is the splitting position and $k$ the splitting axis, then the intersection distance $d$ of the ray $R = (org, dir)$ to the splitting plane can be computed according to the formula of Figure 3.5.

$d = \frac{\delta - org_k}{dir_k}$

Figure 3.5: Hit-Distance Computation

To compute the traversal order the algorithm determines the half-space of the splitting plane that is closer to the origin of the ray. If $org_k \le \delta$ this is the negative half-space (corresponding to the left subtree), otherwise the positive half-space.
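As a small worked illustration of Definition 3.1.4 and the formula of Figure 3.5, the following Python sketch evaluates a point on a ray, the hit-distance to an axis aligned splitting plane, and the traversal order. The names `Ray`, `plane_hit_distance` and `traversal_order` are hypothetical, axes are counted from 0 instead of 1, and the degenerate case of a ray parallel to the plane (a zero direction component) is deliberately left unhandled here.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Ray:
    org: Vec3   # origin of the ray
    dir: Vec3   # direction of the ray

    def at(self, x: float) -> Vec3:
        """Point R(x) = org + x * dir, meaningful for x >= 0."""
        return tuple(o + x * d for o, d in zip(self.org, self.dir))

def plane_hit_distance(ray: Ray, k: int, delta: float) -> float:
    """d = (delta - org_k) / dir_k for splitting axis k and position delta.

    Assumes dir_k != 0; a negative result means the plane lies behind
    the ray origin.
    """
    return (delta - ray.org[k]) / ray.dir[k]

def traversal_order(ray: Ray, k: int, delta: float) -> str:
    """The half-space containing the origin is the closer one."""
    return "left-to-right" if ray.org[k] <= delta else "right-to-left"

# Example: a ray starting at the origin, plane x = 3.
ray = Ray((0.0, 0.0, 0.0), (1.0, 2.0, 0.0))
d = plane_hit_distance(ray, 0, 3.0)   # (3 - 0) / 1
hit_point = ray.at(d)                 # point where the ray crosses x = 3
```

In this example the origin lies in the negative half-space of the plane x = 3, so the left subtree would be the closer one.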
The closer subspace is traversed first, if the ray intersects it. The farther one follows later. If the negative half-space is the closer one, the so-called traversal order is from left to right, otherwise from right to left.

Algorithm 3.1.6. k-D Tree Traversal

    function traverseKDTree(R, T)
    begin
        λ = ∞
        near = −∞
        far = ∞
        while true
        begin
            while T is of Node((k, split), T_left, T_right)
            begin
                d = (split − R.org_k) / R.dir_k
                if R.org_k ≤ split then
                    T_near = T_left, T_far = T_right
                else
                    T_far = T_left, T_near = T_right
                go_near = d ≥ near ∨ d ≤ 0
                go_far = d ≤ far ∧ d ≥ 0
                if go_near ∧ go_far then
                    push far and T_far to the stack
                    T = T_near, far = d
                else if go_near ∧ not go_far then
                    T = T_near
                else if not go_near ∧ go_far then
                    T = T_far
            end
            T is of Leaf({Object_1, ..., Object_n})
            compute closest hit-distance λ for {Object_1, ..., Object_n}
            if λ ≤ far then return λ
            if stack is empty then return λ
            near = far
            pop far and T from stack
        end
    end

Figure 3.6: Traversal Decisions

Whether the ray really traverses through the nearer and/or farther side is computed by the following formulas, which are illustrated in Figure 3.6.

    go_near = d ≥ near ∨ d ≤ 0
    go_far = d ≤ far ∧ d ≥ 0

One important invariant of the algorithm is that the near and far values are exactly the distances to the entry and exit points of the ray with the subspace of the current node. This property is essential and has to be maintained throughout the complete algorithm. Thus the near and far values have to be updated at each traversal step of the algorithm. If only one of the subtrees is traversed by the ray, then the near and far values stay the same (see Figure 3.6), but if both children have to be traversed, the near and far values need to be updated. As the algorithm first traverses into the closer child node, the near value can be kept, but the far value has to be set to the hit distance d.
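The loop of Algorithm 3.1.6 can be sketched in Python as follows. This is a software illustration only, not the hardware traversal unit. Two points are assumptions that go beyond the text: spheres are used as a stand-in object type so that the example has something to intersect, and a ray whose direction component along the splitting axis is zero simply continues into the child on its own side, whereas the hardware avoids this case through its number representation. All names are hypothetical and axes are counted from 0.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple, Union

Vec3 = Tuple[float, float, float]

@dataclass
class Sphere:  # stand-in object type, not taken from the thesis
    center: Vec3
    radius: float

    def hit_distance(self, org: Vec3, direction: Vec3) -> float:
        """Smallest lambda >= 0 with org + lambda * direction on the sphere."""
        oc = [o - c for o, c in zip(org, self.center)]
        a = sum(d * d for d in direction)
        b = 2.0 * sum(d * o for d, o in zip(direction, oc))
        c = sum(o * o for o in oc) - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for lam in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if lam >= 0.0:
                return lam
        return math.inf

@dataclass
class Leaf:
    objects: List[Sphere]

@dataclass
class Node:
    axis: int      # splitting axis k, 0-based
    split: float   # splitting position
    left: "Tree"
    right: "Tree"

Tree = Union[Node, Leaf]

def traverse_kd_tree(org: Vec3, direction: Vec3, tree: Tree) -> float:
    """Return the closest hit-distance, or math.inf if nothing is hit."""
    near, far = -math.inf, math.inf
    stack: List[Tuple[float, Tree]] = []
    node = tree
    while True:
        while isinstance(node, Node):
            k = node.axis
            if org[k] <= node.split:
                t_near, t_far = node.left, node.right
            else:
                t_near, t_far = node.right, node.left
            if direction[k] == 0.0:
                node = t_near  # assumption: parallel ray stays on its side
                continue
            d = (node.split - org[k]) / direction[k]
            go_near = d >= near or d <= 0.0
            go_far = d <= far and d >= 0.0
            if go_near and go_far:
                stack.append((far, t_far))  # remember far value and far child
                node, far = t_near, d
            elif go_near:
                node = t_near
            else:
                node = t_far
        lam = min((s.hit_distance(org, direction) for s in node.objects),
                  default=math.inf)
        if lam <= far or not stack:
            return lam
        near = far
        far, node = stack.pop()
```

For a two-leaf tree split at x = 0 with one sphere on each side, a ray that starts between the left sphere and the plane finds no hit in the left leaf, pops the right leaf from the stack and returns the hit with the right sphere.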
To restore the far value later, it is pushed onto the far-stack and the farther node onto the node-stack. If later a leaf node is reached and no hit has been found in it, a node is popped from the node-stack and the near and far values are updated by setting near = far and taking the far value from the far-stack as the new far value.

Using the near and far values it is possible to determine whether there is a valid hit-point, which is necessary to terminate the ray. A valid hit-point is found if a leaf is encountered and the hit-distance to the current closest hit-point is smaller than the current far value, since then the found hit-point lies in (or before) the leaf node's subspace. Alternatively, the ray can be terminated at the next traversal step by testing if the closest hit-distance is smaller than the current near value.

Figure 3.6 shows the most important situations that occur in the traversal algorithm. Besides these cases there are some degenerate ones that have to be handled carefully. These cases occur if the ray does not have a well-defined single hit-point with the splitting plane. In that case the hit-distance cannot be computed and the traversal decision formulas cannot be applied. This can happen if the ray is parallel to the splitting plane or if it lies completely in it. The hardware approach described later solves this problem by using a normalized floating point representation that cannot represent the value zero. Thus each ray has a hit-point with each possible splitting plane.

3.1.3 Packet k-D Tree Traversal

A drawback of ray tracing is the large memory bandwidth that is needed for the computation. This bandwidth can be reduced by exploiting the ray coherence between rays corresponding to neighboring pixels on the screen. This coherence derives from the fact that rays traversing a similar region of 3D space traverse similar nodes of the k-D tree and intersect many of the same objects.
It is possible to take advantage of this ray coherence by traversing a packet of neighboring rays in parallel as if they were one single ray. This strategy reduces the required memory bandwidth, as data is fetched for a complete packet of rays instead of a single ray. Furthermore, when such a packet traversal algorithm is implemented in software, the SIMD architectures available in today's standard PCs can be taken advantage of. Because these SIMD architectures allow 4 computations to be done in parallel, packets of 4 rays can be handled efficiently using these special instructions.

The packet traversal algorithm is closely related to the standard traversal algorithm, but instead of computing a traversal decision for a single ray it computes a similar packet traversal decision for a packet of rays. Only the so-called active rays of the packet are involved in the computation of this packet traversal decision. A ray of a packet is active in the current node if it is not terminated and if it intersects the subspace of the node. Because this active value is required for each ray in the packet, an active vector for the packet is needed. Although this active vector needs to be recomputed at each traversal step, this is quite simple: a ray is active in the left child of a node if it is active in the current node and if it wants to traverse the left child. The same holds for the right child.

If one of the active rays of the packet wants to traverse the left child, then the packet traverses the left child as well. The same holds for the right child. The traversal order for the packet is inherited from the active rays of the packet that traverse both children, provided it is the same for each of these rays. The packet is terminated if each of its rays is terminated. If a pop operation is done, the active vector has to be restored, and therefore it needs to be pushed onto the stack together with the node.
A further situation that might occur is that a node is reached where ray R1 traverses both children and ray R2 only the farther child. Here, the farther node is pushed onto the stack and R1 is traversed through the nearer child. Later a pop operation obtains the farther node from the stack, and both rays are active in this node. However, ray R1 needs to update its near and far values, as it traversed the nearer and the farther child, unlike R2. Thus a so-called both vector needs to be pushed onto the stack as well, indicating for each ray whether it wants to traverse both children, so that the near and far values can be updated correctly.

Figure 3.7: The packet is traversed from left to right, as the rays R1 and R2 traverse from left to right. Thus the right node is pushed onto the stack and the operation continues in the left child. The rays R1 and R2 are active in both children, but R3 only in the right one.

A problem occurs if the traversal order is not the same for each active ray of the packet that wants to traverse both children. Such a packet is called an invalid packet. It is invalid since no valid packet traversal decision can be computed. No matter which child is handled first, there is always a ray in the packet that wants to handle the other one first. If the packet terminates in the first traversed child, a possibly closer hit-point in the other child is missed (see Figure 3.8). In practice this case occurs very rarely, and it can be shown that it does not happen if no two rays of the packet cross in at least one of the 3 projections to the xy-, yz- or xz-plane. This never occurs for primary rays and light rays, since rays with the same origin never cross. Therefore the algorithm can handle these types of packets correctly.

Figure 3.8: The Figure shows a situation in which no packet traversal decision exists. No matter which child is handled first, either ray R1 or ray R2 is not intersected with triangle tri3.
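The packet traversal decision described above can be illustrated with a small Python sketch. It is a hedged illustration, not the method of the thesis: the function `packet_decision`, the `PacketDecision` record and the per-ray list layout are assumptions made for this example. For one node it computes the active vectors of both children, the both vector, and the packet order, and it flags the packet as invalid if the rays that traverse both children disagree on the order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PacketDecision:
    active_left: List[bool]     # rays active in the left child
    active_right: List[bool]    # rays active in the right child
    both: List[bool]            # rays that traverse both children
    left_first: Optional[bool]  # packet order; None if no ray needs both
    valid: bool                 # False if the rays disagree on the order


def packet_decision(active: List[bool], org_k: List[float],
                    dir_k: List[float], near: List[float],
                    far: List[float], split: float) -> PacketDecision:
    """Compute the packet traversal decision for one k-D tree node.

    org_k and dir_k hold, per ray, the origin and direction component
    along the splitting axis (dir_k is assumed to be nonzero).
    """
    n = len(active)
    act_l, act_r, both = [False] * n, [False] * n, [False] * n
    orders = set()  # traversal orders requested by rays that need both
    for i in range(n):
        if not active[i]:
            continue  # inactive rays take no part in the decision
        d = (split - org_k[i]) / dir_k[i]
        go_near = d >= near[i] or d <= 0
        go_far = d <= far[i] and d >= 0
        left_is_near = org_k[i] <= split
        act_l[i] = go_near if left_is_near else go_far
        act_r[i] = go_far if left_is_near else go_near
        both[i] = go_near and go_far
        if both[i]:
            orders.add(left_is_near)
    valid = len(orders) <= 1
    left_first = next(iter(orders)) if len(orders) == 1 else None
    return PacketDecision(act_l, act_r, both, left_first, valid)
```

The packet visits a child if any entry of the corresponding active vector is set. When `left_first` is `None` and the packet is valid, no ray needs both children, so the order in which the children are visited does not affect the result.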
If no such packet traversal decision exists, the situation can be handled as a special case. If a node for which no packet traversal decision exists is reached, the left child is traversed first. The right child is remembered and traversed later by treating it as a special case.

A different possibility is to split the packet before the traversal into sub packets in which the rays do not cross as explained above. To split the packet this way, only the signs of the three components of the ray directions must be compared. If there are two rays whose direction signs differ in one dimension, then the rays may cross and have to be put into different sub packets.

One of these solutions needs to be applied only if a shader produces invalid packets. This can happen, for instance, if a packet is reflected by a curved surface. However, if only primary rays and light rays are allowed, the problem can never occur. In the hardware architecture described later only primary rays are used, and the problem of crossing rays can be safely ignored.

Chapter 4

The Dynamic Ray Tracing Algorithm

In this Chapter a ray tracing algorithm for dynamic scenes is presented that allows the movement of a huge number of triangles in the scene. At first glance, efficient ray tracing of dynamic scenes does not seem to be possible, since fast ray tracing relies so much on precomputations. In particular, the precomputed acceleration structure is a problem, since it has to be rebuilt or updated whenever the geometry of the scene has changed. For a dynamic real-time ray tracing system this update must work even if the complete scene consists of several million triangles. Here standard acceleration structures cannot be used, since the construction of a k-D tree, for instance, takes at least linear time in the number of triangles in the scene (each triangle has to be visited at least once).
It is possible to build acceleration structures that allow updating the position of a triangle in constant time [27], but several million triangles cannot be moved around this way.

There exists a simple solution to this problem if the scene is restricted to some kind of structured motion [26]. The case of unstructured motion, in which triangles are moved around arbitrarily, is not covered in this thesis. In contrast to unstructured motion, structured motion means that groups of triangles are moved around as if each group were one single object. For instance, in a scene consisting of a table and a chair, normally all triangles of the chair or of the table are moved at once.

For such structured motion, the structure of the motion can be exploited by packing the triangles into movable objects. These objects stay static internally, thus a local bottom-level acceleration structure and a local bounding volume can be precomputed for them. The local bounding volume contains all the geometry of the object. The object can be positioned, rotated and scaled in the scene by an affine transformation. Such a positioned object is called an object instance and consists of the affine transformation used and a reference to the object. This concept of having some objects and one or more object instances of each object leads to a kind of geometry sharing, as an object needs to be stored only once.

To traverse rays efficiently through the object instances, a dynamic top-level acceleration structure must be built over them. Only this top-level acceleration structure needs to be updated if the position of an object instance has changed. This is possible as long as the number of objects in the scene stays small. As there is a dynamic top-level acceleration structure over the object instances and a bottom-level acceleration structure in each object, this is a kind of 2-level acceleration structure (see Figure 4.1).
In the example of the chair and table, two objects have to be modeled: one chair and one table. Internally these two objects stay static over time, but they can be instantiated at several positions in the scene. Thus the top-level acceleration structure is quite simple (it consists of few objects) but the objects themselves can be fairly complex.

Figure 4.1: The Figure shows a dynamic top-level acceleration structure over 4 object instances i1, ..., i4 of 3 objects o1, o2, o3. Each object contains its static bottom-level acceleration structure.

The traversal algorithm for 2-level acceleration structures first traverses the top-level acceleration structure until an object instance needs to be intersected. This is done by transforming the ray to the local coordinate system of the object and traversing the local acceleration structure to find the hit-triangle in the object. The transformation of the ray to the local coordinate system is necessary, as the acceleration structure of the object is only valid in the coordinate system in which it has been created. The positioning of the object instance therefore needs to be reversed by transforming the ray, which requires the inverse of the transformation that was used to position the object.

An important property of the concept is that the internal geometry of the object is hidden from the rest of the world. From outside, the object's geometry is represented only by its local bounding volume, which needs to be as accurate as possible to avoid unnecessary ray object intersections, which are normally very costly.

Figure 4.2: Figure (a) shows a simple scene consisting of two instances of the same object. The drawn ray hits the left chair, thus it is transformed to the chair's local coordinate system, as can be seen in Figure (b). There the splitting planes are again axis aligned, so that the traversal can be continued in the object.
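The intersection with an object instance can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the `ObjectInstance` record stores the already inverted positioning transformation as `inv_a` and `inv_b` (hypothetical names), and the bottom-level traversal is replaced by a stand-in function `unit_square_z0` that intersects a unit square in the local plane z = 0. The ray origin is transformed as a point (with translation) and the ray direction as a vector (without translation).

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vec = Tuple[float, float, float]
Mat = Sequence[Sequence[float]]


def mat_vec(m: Mat, v: Vec) -> Vec:
    """Multiply a 3x3 matrix with a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


@dataclass
class ObjectInstance:
    inv_a: Mat  # linear part of the world-to-local transformation
    inv_b: Vec  # translation part of the world-to-local transformation
    # Stand-in for the bottom-level traversal: takes a local ray and
    # returns the hit distance, or math.inf on a miss.
    intersect_local: Callable[[Vec, Vec], float]


def intersect_instance(inst: ObjectInstance, org: Vec, direction: Vec) -> float:
    """Transform the ray into local coordinates and intersect it there."""
    moved = mat_vec(inst.inv_a, org)
    local_org = tuple(moved[i] + inst.inv_b[i] for i in range(3))  # point
    local_dir = mat_vec(inst.inv_a, direction)                     # vector
    return inst.intersect_local(local_org, local_dir)


def unit_square_z0(org: Vec, direction: Vec) -> float:
    """Hypothetical local object: the square [0,1] x [0,1] in the plane z = 0."""
    if direction[2] == 0.0:
        return math.inf
    t = -org[2] / direction[2]
    x = org[0] + t * direction[0]
    y = org[1] + t * direction[1]
    inside = 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
    return t if t > 0.0 and inside else math.inf
```

For example, an instance that scales the square by 2 and moves it by (10, 0, 0) has the world-to-local transformation x ↦ 0.5·x + (−5, 0, 0). A world ray starting at (11, 1, 5) with direction (0, 0, −1) then becomes the local ray with origin (0.5, 0.5, 2.5) and direction (0, 0, −0.5).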
The concepts of the dynamic ray tracing algorithm do not depend on a particular acceleration structure or kind of local bounding volume, but in the following only k-D trees and axis aligned bounding volumes will be used. Furthermore, no update strategy for the top-level k-D tree will be used: it is simply rebuilt each time the object positions have changed.

The following Sections describe some details of the dynamic ray tracing algorithm. Some special properties of the top-level k-D tree creation will be discussed, as well as problems that might occur when using local bounding volumes. As affine transformations are used to position objects in the scene, the way a ray is transformed under an affine transformation needs to be analysed. Furthermore, we show that the hit-distance is maintained under an arbitrary affine transformation, which dramatically simplifies an implementation of the algorithm. As most shading models need the normal of the geometry in the world coordinate system, normal transformation is also discussed.

4.1 Top-Level k-D Tree Creation

The basic k-D tree creation algorithm has been described in Section 3.1.1. This algorithm can be applied in the same way to compute a top-level k-D tree for a set of object instances by using the transformed local bounding volume of each object instance as its simplified geometry. Because this transformed bounding volume is no longer axis aligned, determining whether it intersects a subspace is very costly.

As the top-level acceleration structure usually has to be rebuilt for each frame, some optimization is needed to speed up the top-level k-D tree construction. This is done by computing the smallest axis aligned bounding box that encloses the transformed bounding volume. This box is called the instance bounding volume and is used as the geometry of the object instance in the k-D tree creation algorithm (see Figure 4.3).
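A minimal sketch of how such an instance bounding volume could be computed is given below. It assumes that the local bounding volume is an axis aligned box given by its corners `lo` and `hi`, and that the instance transformation is f(x) = a·x + b. The function name `instance_bounds` is hypothetical. Since an affine transformation maps the box to a parallelepiped whose extreme points are images of the box corners, it suffices to transform the 8 corners and take their component-wise minimum and maximum.

```python
from itertools import product
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def instance_bounds(lo: Vec, hi: Vec,
                    a: Sequence[Sequence[float]], b: Vec) -> Tuple[Vec, Vec]:
    """Smallest axis aligned box enclosing the transformed local box."""
    # The 8 corners: every combination of lo/hi per axis.
    corners = product(*zip(lo, hi))
    points = [
        tuple(sum(a[r][c] * p[c] for c in range(3)) + b[r] for r in range(3))
        for p in corners
    ]
    new_lo = tuple(min(p[i] for p in points) for i in range(3))
    new_hi = tuple(max(p[i] for p in points) for i in range(3))
    return new_lo, new_hi
```

For a rotation by a multiple of 90 degrees the resulting box is as tight as the local one. For other rotations it grows, which is the loss of accuracy illustrated in Figure 4.3.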
Computing the intersection of the axis aligned instance bounding volume and the subspace (which can also be represented as a possibly infinite axis aligned bounding box) is trivial.

Figure 4.3: Figure (a) shows an object with its bounding box. In Figure (b) this object is instantiated using a rotation. The estimated bounding box for the object instance is drawn dotted. Figure (c) shows the best possible bounding box estimation for the object instance if the exact geometry of the object is used in the estimation.

This simplification has some disadvantages, since the axis aligned instance bounding volume is not optimal (see Figure 4.3). Although a best estimation for the axis aligned instance bounding volume exists, it is not a good idea to compute it, because then the internal structure of the object would have to be involved in the computation, which might be too costly.

What can be done is to search for a better representation of the local bounding volume of an object. Instead of an axis aligned box, an ellipsoid can be used, which often is a better approximation. Such an ellipsoidal bounding volume of an object can be computed in O(n) [29]. A different optimization would be to rotate the object in such a way that its initial bounding box fits as well as possible. A situation like in the leftmost image of Figure 4.3 is in fact the worst case.

4.2 Bounding Box Clipping

Intersections with object instances are usually very expensive, as each requires one ray transformation and some traversal steps in the object. One possibility to avoid and optimize ray object intersections is to perform a kind of bounding box clipping against the instance bounding volume in the top-level k-D tree and against the local bounding volume of the object at the beginning of the bottom-level k-D tree.

Figure 4.4: Figure (a) shows a 2 dimensional rectangle with its clipping planes. The corresponding clipping tree is shown in Figure (b).
Figure (c) shows the clipping tree for a box in 3 dimensions.

This bounding box clipping uses traversal steps to determine whether the ray intersects an axis aligned bounding box or not. This can be done by using 6 clipping planes that correspond exactly to the bounding planes of the axis aligned bounding box (see Figure 4.4).

The bounding box clipping against the instance bounding volume in the top-level k-D tree guarantees that the bounding box of the object instance is really hit if a leaf node containing this object is encountered. The bounding box clipping at the beginning of the bottom-level traversal is useful too, as the local bounding box available there is much more accurate than the bounding box of the instance.

Furthermore, this bottom-level bounding box clipping should be performed since many unnecessary traversal steps can be avoided. Otherwise the infinitely large empty space around the object is not handled optimally, as the clipping planes at the border of the object reach to infinity. This causes many traversal steps if the ray does not hit the object and traverses to infinity (see Figure 4.5).

Figure 4.5: Figure (a) shows a chair without bounding box clipping, whose clipping planes reach to infinity. Here the drawn ray would traverse many subspaces of the acceleration structure. In Figure (b) some bold extra clipping planes clip against the bounding box of the chair. Here the ray traverses only 2 subspaces of the acceleration structure.

For the same reason it is better to perform a kind of scene bounding clipping at the beginning of the top-level acceleration structure; otherwise ray losses (rays that traverse to infinity and produce no hit) will be costly. This scene bounding clipping should be omitted only if no ray losses can occur in the scene.
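The decision made by the six clipping planes can also be computed directly. The sketch below uses the common slab test rather than the traversal-step formulation of this section, so it is a swapped-in technique shown only to make the result of the clipping concrete. The function name `clip_ray_to_box` and its default near and far values are assumptions made for this example.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


def clip_ray_to_box(org: Vec, direction: Vec, lo: Vec, hi: Vec,
                    near: float = 0.0,
                    far: float = math.inf) -> Optional[Tuple[float, float]]:
    """Clip the ray interval [near, far] against an axis aligned box.

    Returns the clipped (near, far) interval, or None if the ray misses
    the box. Each axis contributes two of the six clipping planes.
    """
    for k in range(3):
        if direction[k] == 0.0:
            # Ray parallel to this pair of planes: it must start between them.
            if not lo[k] <= org[k] <= hi[k]:
                return None
            continue
        t0 = (lo[k] - org[k]) / direction[k]
        t1 = (hi[k] - org[k]) / direction[k]
        if t0 > t1:
            t0, t1 = t1, t0
        near = max(near, t0)
        far = min(far, t1)
        if near > far:
            return None
    return near, far
```

The returned interval corresponds to the near and far values with which a bottom-level traversal would start after the clipping, so a ray that misses the box is rejected before any traversal step inside the object is spent on it.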
4.3 Overlapping Objects

Overlapping objects play a crucial role in 2-level k-D trees, since in the overlapping area each of the objects needs to be intersected. Consider a scene consisting of n objects that overlap completely. A ray that intersects this region in space needs to traverse each of the n objects. For such worst case scenes, the dynamic ray tracing algorithm scales linearly in the number of objects. Thus overlapping of objects should be avoided as far as possible when modelling a scene.

If two objects overlap only slightly, it is usually best to partition the scene in such a way that the area filled by both objects is separated by the clipping planes (see Figure 4.6). Thus both objects need to be intersected only in the overlapping area. The overlapping area cannot be handled more efficiently, since either object could generate the closer hit-point, which is not known in advance.

Figure 4.6: Figure (a) shows two object instances that overlap a bit. The overlapping area is separated by the clipping planes h1, ..., h4. The corresponding k-D tree is shown in Figure (b).

Much more critical is the case where many objects lie inside another object, as in Figure 4.7. If the standard algorithm to create an acceleration structure is used, then the large object o1 (which contains the other ones) is in each leaf node of the tree. This is a problem, as during traversal the algorithm intersects with object o1 each time a leaf node is encountered, although one intersection with it would be sufficient. This problem is called the room problem, as it typically occurs if a room is modeled with some objects inside. The resulting k-D tree for such a scene can be seen as a degenerate case of the space subdivision, because after each subdivision the object o1 is in both subspaces.

Figure 4.7: Figure (a) shows a large object o1 containing 3 other objects.
The corresponding k-D tree in Figure (b) has object o1 in each leaf node.

4.3.1 Hierarchical k-D Trees

Several possible solutions to the room problem exist. One possibility is to allow objects to be stored in the inner nodes of the k-D tree too and not only in the leaves. This concept is called hierarchical k-D trees, as the hierarchy of the objects is encoded in the k-D tree.

Figure 4.8: Figure (a) shows the same scene as in Figure 4.7. The corresponding hierarchical k-D tree can be seen in Figure (b). The difference to a normal k-D tree is that object o1 is in the inner object list of the root node of the hierarchical k-D tree.

Figure 4.8 shows a hierarchical k-D tree. Each of its nodes has a set of so-called inner objects, which are intersected if this node is handled during traversal. The structure of the k-D tree in Figure 4.8 forces the traversal algorithm to intersect object o1 exactly once. Note that the size of the hierarchical k-D tree is reduced compared to the k-D tree of Figure 4.7, as the leaves are smaller. This is a general property of the concept: whenever an object is in all or almost all leaf nodes reachable from a node N, it is better to put the object into the inner object list of node N, which reduces the size of the tree.

4.3.2 Mailboxing

A different solution to the problem is known as mailboxing, which is a kind of object intersection cache. The objects the ray has already been intersected with are saved in a small cache. If an object needs to be intersected by the traversal algorithm, the mailbox system looks it up in the cache. If the ray has already been intersected with this object, no further intersection is done. Otherwise the object is intersected and added to the cache.

There are several possible strategies to manage the cache. The most popular is to save the last n objects that were intersected. Another would be to use a hashing function to map the objects to slots.
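The "last n objects" strategy can be sketched with a small per-ray cache. The class name `Mailbox`, the method `should_intersect` and the use of integer object identifiers are assumptions made for this illustration. The hardware mailbox described later in the thesis is not modelled here.

```python
from collections import deque


class Mailbox:
    """Per-ray cache of the most recently intersected objects."""

    def __init__(self, size: int) -> None:
        # A deque with maxlen drops the oldest entry automatically.
        self._recent = deque(maxlen=size)

    def should_intersect(self, object_id: int) -> bool:
        """Return False if the ray has recently been intersected with
        this object, otherwise record the object and return True."""
        if object_id in self._recent:
            return False
        self._recent.append(object_id)
        return True
```

Because the cache is small, an object can be evicted and intersected a second time. This costs only performance, since a repeated intersection yields the same hit distance again.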
The mailboxing approach has been shown to be more efficient than using hierarchical k-D trees. The reason is that hierarchical k-D trees alone only solve the special room problem. However, there are many more situations where an object is intersected more than once, since each object is usually in several leaf nodes.

4.3.3 Multiple Scenes

In some cases it is sufficient to use a much simpler solution to the problem. Imagine a level of a standard shooting game, where there is usually a large main scene, modelled as a single object, and perhaps some dynamic objects. The main scene object contains a lot of other objects, which is a problem, as described earlier. Instead of putting the main scene object into the root node of a k-D tree (as the hierarchical k-D tree concept would do), first the main scene is traversed and then the other geometry. If there is only one large object containing many other ones, this concept is nearly equivalent to the hierarchical k-D tree concept but simpler.

The following table lists some simulation results for the conference scene at a resolution of 1024x768. The first row shows a simulation without any of the optimizations, followed by hierarchical k-D trees. Mailboxing is simulated such that the last 8 objects are saved, and in the last simulation the main scene object (room) of the conference scene is traversed before the objects (chairs). The table shows the number of traversal operations (Trav-Ops), object intersection operations (Obj-Int-Ops) and triangle intersection operations (Triangle-Int-Ops).

Optimization            Trav-Ops   Obj-Int-Ops   Triangle-Int-Ops
None                       295.4           6.4               57.6
Hierarchical k-D tree       70.3           2.0               10.9
Mailboxing                  63.2           1.6               10.3
Multiple Scenes             71.5           2.1               11.3

Table 4.1: Millions of operations for various strategies

It can be seen that mailboxing is the best of the three optimizations, thus the hardware architecture described later implements this strategy.
4.4 Ray Transformation

If an object instance is hit during traversal, the ray is first transformed into the local coordinate system of that object. This is done by applying an affine transformation to the ray. In this Section we show how a ray has to be transformed using such an affine transformation.

The affine transformation is given by

\[ f(v) = A v + B \quad \text{with } A \in Mat_{\mathbb{R}}(3 \times 3) \text{ and } B \in \mathbb{R}^3 \]

and maps points of $\mathbb{R}^3$ to points of $\mathbb{R}^3$. The ray is given by a tuple $R = (org, dir) \in (\mathbb{R}^3)^2$. The origin of the ray can easily be transformed by plugging it into v, as it is a point.

The direction of the ray represents a vector, not a point, thus it has to be transformed in a different way. As vectors represent directions, this property has to be maintained by the transformation. Assume two points X and Y are given, then there is a vector V = Y − X connecting X to Y. The transformed vector f(V) has to fulfill the equation:

\[ f(V) = f(Y) - f(X) = A Y + B - (A X + B) = A (Y - X) = A V \]

Thus the transformation of a complete ray looks like:

\[ f(R) = (f(org), f(dir)) = (A \cdot org + B,\; A \cdot dir) \]

4.5 Hit-Distance Transformation

Some hit-point information needs to be computed during the traversal algorithm: the hit-point with the splitting plane and the hit-point with the scene. One possibility would be to save the hit-point as a real point of $\mathbb{R}^3$, but this has the disadvantage that it has to be transformed back to the world coordinate system if the hit-point lies in an instantiated object. A much better way is to store a hit-point with a ray R = (org, dir) indirectly as a λ-value or hit-distance such that the real hit-point $H \in \mathbb{R}^3$ fulfills the following equation:

\[ H = R(\lambda) = org + \lambda \cdot dir \]

On the one hand this hit-distance can be used to compute a traversal decision (see Section 3.1.2), and on the other hand no back transformation of the hit-distance is required, which the following equations show.
Let f be an affine transformation f(x) = A · x + B, where the direction is transformed as a vector, f(dir) = A · dir. Then:

\[
\begin{aligned}
f(H) &= f(org + \lambda \cdot dir) \\
     &= A \cdot (org + \lambda \cdot dir) + B \\
     &= (A \cdot org + B) + \lambda \cdot A \cdot dir \\
     &= f(org) + \lambda \cdot f(dir) \\
\Longrightarrow \quad org + \lambda \cdot dir &= f^{-1}(f(org) + \lambda \cdot f(dir))
\end{aligned}
\]

This means that the same λ-value can be used to represent the hit-point in both coordinate systems. Computing the hit-point in the object and transforming it back to the world coordinate system is the same as using the same λ to compute the hit-point in the world coordinate system. Thus the value λ is in some sense invariant under the application of affine transformations.

With this background it can be explained why only affine transformations are used in the 2-level k-D tree algorithm. The relevant point is that when intersecting with an object instance, it is not the object in the instance that is transformed but the ray itself. If the object's geometry were transformed (which is too costly) it would be possible to use an arbitrary transformation. But transforming the ray has to result in a ray again. Affine transformations fulfill this property and map rays to rays, as the above equations show.

4.6 Normal Transformation

Most shading models (like Phong shading, for example) need the normal of the geometry at the hit-point to approximate the surface lighting behavior. This normal is needed in the world coordinate system, but normals are present only in the local coordinate system of the hit-object. Therefore, the shader has to transform these normals back to the world coordinate system using the inverse of the transformation that was used to position the object.

Thus we need to analyse how a normal is transformed under an arbitrary affine transformation f(x) = Ax + B. As for vectors, this transformation has to be applied in a special way to preserve the normal property. It is easy to see that affine transformations map tangents to tangents.
This fact will be used to derive a transformation for normals.

Figure 4.9: Figure (a) shows a box and the normal n of the right side. If the box is transformed under an affine transformation like in Figure (b), the correctly transformed normal $n_f$ is different from $A n$.

The tangent in the source space is called t and the transformed one in the destination space $t_f$, which is equal to $A t$ as tangents are vectors. Analogously, the normal in the source space is called n and the sought normal in the destination space $n_f$. As n is a normal, $n_f$ is in general not the same as $A n$, as seen in Figure 4.9. The following shows that a matrix $A'$ can be found such that $n_f = A' n$.

The vectors n and t are perpendicular, which means that their scalar product is zero: $n^T t = 0$. Rewriting this product yields:

\[ n^T t = n^T A^{-1} A\, t = (n^T A^{-1})(A\, t) = \left((n^T A^{-1})^T\right)^T t_f = \left((A^{-1})^T n\right)^T t_f = 0 \]

This equation shows that $(A^{-1})^T n$ is a vector that is perpendicular to $t_f$, thus it has to be the sought normal. The transformation matrix $A'$ is given by $A' = (A^{-1})^T$ and the complete mapping of a normal looks like:

\[ n_f = (A^{-1})^T n \]

Even if the normal n was normalized, $n_f$ is usually not normalized.

Chapter 5

Triangle Intersection

In order to decrease the required floating point resources of the hardware architecture described later in this thesis, I developed a special triangle intersection method that is based on affine ray transformations. Because such affine ray transformations are necessary in the dynamic ray tracing algorithm anyway, using this intersection method makes it possible to save a lot of hardware resources by sharing one transformation unit for two purposes.

The so-called unit triangle intersection method consists of two stages.
First the ray is transformed, using a triangle specific affine triangle transformation, to a coordinate system in which the triangle looks like the unit triangle $\Delta_{unit}$ with the vertices (1, 0, 0), (0, 1, 0) and (0, 0, 0). In the second stage, a simple intersection test of the transformed ray with the unit triangle is done.

Figure 5.1: Unit Triangle Intersection

5.1 Affine Triangle Transformation

The affine triangle transformation of a triangle ∆ = (a, b, c) is an affine transformation $T_\Delta(x) = m \cdot x + n$ with $m \in Mat_{\mathbb{R}}(3 \times 3)$ and $n \in \mathbb{R}^3$ that maps the triangle ∆ to the unit triangle $\Delta_{unit}$. The inverse $T_\Delta^{-1}(x) = m' \cdot x + n'$ of $T_\Delta$ can easily be described by the following equations:

\[
T_\Delta^{-1}\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = a
\qquad
T_\Delta^{-1}\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = b
\qquad
T_\Delta^{-1}\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = c
\]

These equations map the vertices of the unit triangle to the vertices of the triangle. If $q \in \mathbb{R}^3$ is an arbitrary vector, then the solution $T_\Delta^{-1}$ of the equations takes the form:

\[
m' = \begin{pmatrix}
a_x - c_x & b_x - c_x & q_x \\
a_y - c_y & b_y - c_y & q_y \\
a_z - c_z & b_z - c_z & q_z
\end{pmatrix}
\qquad
n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}
\]

The vector q is not determined by these equations, but there are two useful ways to choose it. The first one minimizes the memory needed to store a triangle matrix, and the second one allows some dot product computations to be done for free.

5.1.1 Memory Efficient Triangle Transformation

The representation of the triangle transformation can be minimized by choosing q in such a way that the triangle transformation matrix m of $T_\Delta$ has its first column equal to $(1, 1, 1)^T$, which can be achieved by setting $q = -(a - c) - (b - c) + (1, 0, 0)^T$. Then it is not necessary to store the first column of the matrix.

\[
m' = \begin{pmatrix}
a_x - c_x & b_x - c_x & -(a_x - c_x) - (b_x - c_x) + 1 \\
a_y - c_y & b_y - c_y & -(a_y - c_y) - (b_y - c_y) + 0 \\
a_z - c_z & b_z - c_z & -(a_z - c_z) - (b_z - c_z) + 0
\end{pmatrix}
\qquad
n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}
\]

It needs to be shown that the inverse $T_\Delta$ of $T_\Delta^{-1}$ is of the form:
\[
m = \begin{pmatrix}
1 & \beta_x & \gamma_x \\
1 & \beta_y & \gamma_y \\
1 & \beta_z & \gamma_z
\end{pmatrix}
\qquad
n = \begin{pmatrix} \delta_x \\ \delta_y \\ \delta_z \end{pmatrix}
\]

Using properties of affine transformations, it can be shown that $n = -m'^{-1} \cdot n'$. Thus it is equivalent to prove:

\[
T_\Delta \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + n
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - m'^{-1} \cdot n'
\]

This can be shown by applying the inverse $T_\Delta^{-1}$ to the right-hand side. By the choice of q, the three columns of $m'$ sum to $(1, 0, 0)^T$, so:

\[
T_\Delta^{-1}\left( \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - m'^{-1} \cdot n' \right)
= m' \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - n' + n'
= \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
\]

This proof requires the existence of $T_\Delta$, and it turns out that this inverse does not always exist. Geometrically, the choice of q means that the normal $N_{unit} = (0, 0, 1)^T$ of the unit triangle is mapped to the point $T_\Delta^{-1}(N_{unit}) = -(a - c) - (b - c) + e_1 + c$. The part $-(a - c) - (b - c) + c$ of this sum lies in the triangle plane. Thus the triangle transformation does not exist if the triangle normal is perpendicular to $e_1$, since then $-(a - c) - (b - c) + e_1 + c$ lies in the triangle plane too. This problem can be solved by choosing q in such a way that one of the other two columns of m is $(1, 1, 1)^T$. The i-th column of m is equal to $(1, 1, 1)^T$ if $q = -(a - c) - (b - c) + e_i$. The proof of this is analogous to the one above.

To store the minimized representation of the triangle transformation it is necessary to save the index i of the column that is equal to $(1, 1, 1)^T$. This index can simply be encoded in 2 bits. Furthermore, a criterion is required that chooses the column to be set to $(1, 1, 1)^T$. This is quite simple, since i is optimal if the normal $N_{unit}$ is mapped to a point as far away from the triangle plane as possible. Thus i is chosen such that the angle between $e_i$ and the normal of the triangle ∆ is minimal.

5.1.2 Normal Consistent Triangle Transformation

A different possibility is to choose q in such a way that the normalized normal $N = (a - c) \times (b - c) / |(a - c) \times (b - c)|$ of the triangle is mapped to the normal of the unit triangle.
\[ T_\Delta^{-1}(N_{unit}) = T_\Delta^{-1}\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = N \]

The solution for $T_\Delta^{-1}$ looks like:

\[ m' = \begin{pmatrix} a_x - c_x & b_x - c_x & N_x \\ a_y - c_y & b_y - c_y & N_y \\ a_z - c_z & b_z - c_z & N_z \end{pmatrix}, \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix} \]

The transformation $T_\Delta^{-1}$ is completely defined, and its inversion again yields an affine transformation if the triangle is not degenerate. Thus $T_\Delta$ exists for each non-degenerate triangle $\Delta$.

5.2 Unit Triangle Intersection

To intersect a ray R = (org, dir) with a triangle $\Delta$, the ray R is transformed to the unit triangle space using $T_\Delta$. The intersection distance $\lambda$ and the barycentric (u,v)-coordinates do not change under an arbitrary bijective affine transformation. As the triangle transformation is bijective for non-degenerate triangles, it is equivalent to compute the ray-triangle intersection in the world coordinate system between R and $\Delta$, or in the unit triangle coordinate system between the transformed ray R' and $\Delta_{unit}$. The advantage of the second method is that the intersection of a ray with the unit triangle is quite simple to compute, since the unit triangle lies in the xy-plane. Let $R' = T_\Delta(R) = T_\Delta(org, dir) = (m \cdot org + n, m \cdot dir) = (org', dir')$ be the ray transformed to the unit triangle space. Then the intersection can be computed by:

\[ \lambda = -\frac{org'_z}{dir'_z}, \qquad u = \lambda \cdot dir'_x + org'_x, \qquad v = \lambda \cdot dir'_y + org'_y \]

The hit-point lies in the triangle if the so-called in-triangle test $u \geq 0 \wedge v \geq 0 \wedge u + v \leq 1$ is fulfilled, and it then has the barycentric triangle coordinates (u, v, 1 - u - v).

If the second triangle transformation is used, which maps the geometry normal of the triangle to the normal of the unit triangle, the dot product between the ray direction and the triangle normal can be computed in either coordinate system. In the unit triangle system the computation is extremely simple:

\[ dir' \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = dir'_z \]

Thus the z-component of the transformed ray direction is the dot product between the ray direction and the geometry normal of the triangle.
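The two stages described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the hardware implementation: the function names (`triangle_transform`, `transform_ray`, `intersect_unit_triangle`) are hypothetical, the matrix m is stored as three rows, and the inverse of m' is computed with cross products, which is one of several possible ways to invert a 3x3 matrix. Range checks on the hit distance are left to the caller, as the text does not discuss them here.

```python
# Sketch of the normal consistent triangle transformation (Section 5.1.2)
# and the unit triangle intersection (Section 5.2). All names are illustrative.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def triangle_transform(a, b, c):
    """Return (m, n) such that T(x) = m*x + n maps triangle (a, b, c)
    to the unit triangle; m is a tuple of three rows."""
    e1, e2 = sub(a, c), sub(b, c)
    nrm = cross(e1, e2)
    length = dot(nrm, nrm) ** 0.5
    if length == 0.0:
        raise ValueError("degenerate triangle")
    N = (nrm[0] / length, nrm[1] / length, nrm[2] / length)
    # m' has the columns e1, e2, N. Its inverse has the rows
    # (e2 x N)/det, (N x e1)/det, (e1 x e2)/det.
    det = dot(e1, cross(e2, N))
    rows = (cross(e2, N), cross(N, e1), cross(e1, e2))
    m = tuple(tuple(x / det for x in r) for r in rows)
    n = tuple(-dot(r, c) for r in m)  # n = -m'^{-1} * n' with n' = c
    return m, n

def transform_ray(m, n, org, direction):
    """Origins are transformed as points, directions as vectors."""
    org_t = tuple(dot(r, org) + ni for r, ni in zip(m, n))
    dir_t = tuple(dot(r, direction) for r in m)
    return org_t, dir_t

def intersect_unit_triangle(org_t, dir_t):
    """Return (lam, u, v, dir_z) or None if the unit triangle is missed."""
    if dir_t[2] == 0.0:  # ray parallel to the triangle plane
        return None
    lam = -org_t[2] / dir_t[2]
    u = lam * dir_t[0] + org_t[0]
    v = lam * dir_t[1] + org_t[1]
    if u >= 0.0 and v >= 0.0 and u + v <= 1.0:
        return lam, u, v, dir_t[2]
    return None
```

For a triangle with the vertices (2, 0, 1), (0, 2, 1), (0, 0, 1) and a ray from (0.5, 0.5, 3) in direction (0, 0, -1), the sketch yields the hit distance 2, the coordinates u = v = 0.25 and the value -1 for the transformed z-component, which agrees with the dot product of the ray direction and the triangle normal (0, 0, 1).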
If the ray direction of R was normalized, then $dir'_z$ is exactly the cosine of the angle between the ray direction and the normal vector. It is not obvious that the dot product is maintained under the unit triangle transformation, but this special transformation has this property as it can be written as:

\[ T_\Delta = T_{xy} \circ T_R \circ T_T \]

The transformation $T_T$ is a translation that maps the triangle edge point c to (0, 0, 0). The rotation $T_R$ rotates the triangle into the xy-plane, and the last transformation $T_{xy}$ is a composition of transformations that maps the triangle in the xy-plane to the correct form. This last transformation does not change the z-component of its input vector. The translation and rotation do not change any angle or length, and the transformation $T_{xy}$ does not change the result of the dot product with the normal vector (0, 0, 1), as it acts only perpendicular to the normal. Thus the complete triangle transformation does not change the dot product. Note that because the last transformation $T_{xy}$ changes the length of vectors, the angle between the ray direction and the normal is not maintained by the triangle transformation, only the dot product.

The described method can be used to compute the cosine between the ray direction and the triangle normal only if the direction of the initial ray is normalized. Thus, in conjunction with the dynamic ray tracing algorithm, the only transformations that can be used to instantiate objects are compositions of translation and rotation matrices, as otherwise the length of the direction is changed.

Of course this concept of transforming the ray first and then intersecting it with a unit object can be applied to many other types of objects too, like ellipses or rectangles. An advantage is that only one representation is required for a wide range of objects, as the transformation to the unit object is described by an affine transformation in each case.
Additionally, only the type of the object has to be stored in order to call the correct unit intersection function.

A drawback of this triangle intersection method is that the triangle matrix depends on each of the edge points of the triangle. Thus, because of limited computational accuracy, rays can be shot through the gap between two adjacent triangles that have two vertices in common. This problem can be solved by using a small epsilon in the comparisons of the in-triangle test. Nevertheless, most triangle intersection methods suffer from this problem.

Chapter 6

The Dynamic SaarCOR Architecture

The architecture presented in this Section is a general approach for a dynamic ray tracing hardware architecture which has many aspects in common with the standard SaarCOR architecture [8]. A main difference is that the Dynamic SaarCOR Architecture supports dynamic scenes, whereas the standard SaarCOR architecture does not. Dynamics is achieved by partitioning the scene into movable objects as described in Section 4. The geometry in the objects remains static but the objects themselves can be moved around. This requires rebuilding a top-level acceleration structure over the objects in each frame in which some objects have been moved. The architecture gives no hardware support for rebuilding the top-level acceleration structure, as this can be done fast enough on the host PC if the number of objects is less than 50000. Hardware support is given for the triangle intersection, the traversal through the dynamic 2-level acceleration structure and the shading computation, as these are the most expensive operations. To support this, a costly affine ray transformation unit is required that transforms rays to the local coordinate system of an object. Because this unit is of almost the same complexity as a standard triangle intersection unit, a naive approach would double the required chip area.
But using the special unit triangle intersection method described in Section 5, it is possible to share the transformation unit for two purposes. Furthermore the shader can use the transformation unit to perform the primary or secondary ray computation.

The reason why the special triangle intersection method is used is to share the transformation unit, mainly for the object space transformation and the triangle intersection. In principle, it would be possible to separate the transformation and the intersection using two independent units. But this has some disadvantages, because the transformation unit would be used only 20% of the time if it were fully pipelined. A lot of computational power would be wasted this way. Increasing the usage of the transformation unit would be possible if the operation were done sequentially, such that approximately 5 cycles are required per ray transformation. But then the transformation could slow down the complete pipeline if it is used much more frequently in some parts of the scene. This slowing down is a typical behavior if too many special purpose units are used in the design.

To exploit coherence between neighboring rays, the architecture handles packets of rays as described in Section 3.1.3. By doing so, data is always accessed for a whole packet of rays, which reduces the required size of the memory interface. At any given time there are several independent packets in the ray tracing system to increase the usage of the units. This is necessary as the special purpose pipelines needed for the computation are fairly deep. In addition, memory latency can be hidden, since during a memory request of one packet the other packets can perform operations in the chip. Because each packet can be seen as a single thread running in the system, this concept is a kind of multi-threading.
Each packet corresponds to a complete data-set in the chip, consisting of the near and far values, stacks and other required internal data. In order to guarantee that each packet accesses only its own data-set, a unique packet-id (pid) identifies it and is used to address the correct data-set. This packet-id is passed from unit to unit, as a kind of job-passing. If the traversal unit reaches a leaf node, for instance, the packet-id is delivered to a different unit that handles the list of objects.

A very important topic in ray tracing is the shading computation. Due to the variety of possible shading models, the corresponding shading hardware should be a fully programmable special purpose CPU. As shading is outside the scope of this thesis, it will be mentioned only marginally.

The Dynamic Ray Tracing Architecture (see Fig. 6.1) consists of one or more Dynamic Ray Tracing Pipelines (DynRTP), which are subdivided into a Ray Generation and Shading unit (RGS) and the Dynamic Ray Tracing Core (DynRTC). The main task of the RGS unit is to do the shading computations, using the Dynamic Ray Tracing Core (DynRTC) to shoot rays through the scene, and to compute primary rays.

Figure 6.1: Dynamic Ray Tracing Architecture

The Dynamic Ray Tracing Core consists of four main parts. First there is the traversal unit that traverses a packet of rays through the acceleration structure. The lists of objects of the acceleration structure are handled by the list unit. The transformation unit applies an affine transformation to a packet of rays, and the intersection unit intersects rays with the unit triangle.

The Ray Generation Controller tells the DynRTP units which pixels to render next. The scene data as well as some other configuration data (camera position, acceleration structure, etc.) are sent through a PCI or AGP interface to the chip. Each Dynamic Ray Tracing Pipeline has access to the scene data through a cache interface.
This cache interface consists of four independent caches for each type of data that is used.

6.1 Dynamic Ray Tracing Core

The ray tracing core is the basic ray casting unit of the architecture. Thus it is responsible for tracing packets of rays through the scene and returning information about the object that was hit. As a fundamental concept of the dynamic ray tracing approach is the partitioning of the scene into movable objects, the dynamic ray tracing core has to traverse the packet through a top-level acceleration structure to find a possible hit-object and then transform it to the local coordinate system of that object. There the traversal needs to be continued to find a possible hit-triangle.

The Dynamic Ray Tracing Core is used by the shader unit (RGS) to shoot rays through the scene. To do so, the shader first needs to initialize the dynamic ray tracing core by sending the k-D tree root node for the next packet and the transformation to apply first. For a primary ray, this transformation is a simple camera transformation. After that, the shader sends the rays of the packet in sequence to the pipeline. The packet always passes the transformation unit first, which applies the stored transformation to it. Because the transformed ray has to be traversed through the scene, it is saved in the traversal and transformation units for later use.

The traversal unit performs the top-level traversal of the packet until a leaf node is reached. It sends the list of objects saved in the leaf node to the list unit, whose task is to handle the list. The list unit reads the first entry out of the list and sends it to the transformation unit. The transformation unit fetches the object, stores the object's root node in the traversal unit and applies the stored inverse object transformation to the packet of rays. At this point the inverse of the object transformation is required, since we do not position the object, but transform the ray into the object.
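The transformation of a ray into object space can be illustrated with a small Python sketch. It is not taken from the thesis: the helper names are hypothetical, and the object transformation is assumed to be a rotation about the z-axis followed by a translation, so that its inverse can be written down directly. The origin is transformed as a point (with the translational part) and the direction as a vector (without it).

```python
import math

def rigid_transform(angle_z, translation):
    """Object-to-world transformation x -> m*x + n (assumed form:
    rotation about the z-axis, then translation)."""
    c, s = math.cos(angle_z), math.sin(angle_z)
    m = ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
    return m, tuple(translation)

def invert_rigid(m, n):
    """Inverse of x -> m*x + n for a rotation matrix m: x -> m^T*x - m^T*n."""
    mt = tuple(tuple(m[j][i] for j in range(3)) for i in range(3))
    nt = tuple(-sum(mt[i][k] * n[k] for k in range(3)) for i in range(3))
    return mt, nt

def transform_point(m, n, p):
    return tuple(sum(m[i][k] * p[k] for k in range(3)) + n[i] for i in range(3))

def transform_vector(m, v):
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def ray_to_object_space(m_inv, n_inv, org, direction):
    """Apply the inverse object transformation to a world-space ray."""
    return (transform_point(m_inv, n_inv, org),
            transform_vector(m_inv, direction))
```

Because only rotations and translations are used in this sketch, the length of the direction vector is preserved, so a hit distance found in object space is also valid in world space.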
The transformed ray is now in the local coordinate system of the object and is saved in the traversal and transformation units. The traversal unit then performs the bottom-level traversal in the object with the transformed ray until a leaf node is reached. The list unit handles the list again, but the transformation unit now reads unit triangle matrices out of memory and applies these transformations to the packet. The packet transformed to the unit triangle space is intersected with the unit triangle by the intersection unit. The intersection result is stored in the traversal unit, which in particular needs the hit-distance to compute the ray termination correctly. If the list of triangles has not yet been processed completely, the operation is continued at the list unit, otherwise at the traversal unit.

6.1.1 Traversal Unit

The traversal unit traverses packets of rays in parallel through the scene. This is done using a k-D tree and the k-D tree traversal algorithm as explained in Sections 3.1.2 and 3.1.3.

The traversal unit consists of a memory interface, which loads k-D tree nodes, and a special purpose pipeline. The pipeline is internally subdivided into several traversal slices that handle the single rays of the packet in parallel. In each pass through the pipeline one packet traversal step is computed.

Figure 6.2: The Figure shows the traversal unit consisting of the memory interface, 4 traversal slices, a packet traversal decision unit and the collect hits unit. For each of the units the necessary internal data is shown.

Figure 6.2 shows the internal structure of the traversal unit. The operation always starts at the memory interface, which fetches the next or first k-D tree node out of memory. If this node is a leaf, the packet together with the list address is sent to the list unit to compute intersection results. Otherwise the node is sent to the traversal slices, which compute a traversal decision for each of the rays in the packet.
These single traversal decisions are combined into a packet traversal decision by the packet traversal decision unit. The packet traversal decision is sent to the memory interface and back to the traversal slices, as these have to do stack operations depending on it. Using the packet traversal decision, the memory interface can fetch the next node to process and do push/pop operations on the nodes.

Because the memory interface is responsible for the computation of the node addresses, it saves the current node and handles the node stack. In contrast, because the traversal slices compute the traversal decision for a ray, they need to store and update the near and far values and the ray, and to handle the far-stack needed for the computations.

The collect hits unit computes the closest intersection for each ray of the packet. If this unit gets a new intersection result, it determines whether the new hit-distance is closer than the one saved. If so, the new intersection result is saved and the stored one is discarded. The intersection result typically consists of the hit-distance, hit-object and hit-triangle. Because the local barycentric uv-coordinates of the hit-point are required to support textures, they need to be saved as part of the intersection result too. As the special unit triangle intersection method is used, the cosine between the ray direction and the triangle normal can be computed for free. Therefore it is saved as part of the intersection result for later use in the shader.

An important point is that the collect hits unit gives the traversal slices access to the current hit-distance of their ray of the packet. Using this information the traversal slices can terminate a ray. A ray is terminated if there is a hit closer than the far value of the leaf node where the hit occurred. As the traversal unit terminates the ray at the next traversal step, the hit-distance is compared against the current near value.
If it is before the near value, each further hit would be farther away than the stored one. If each ray of the packet is finished or the stack is empty, the traversal operation is finished.

6.1.2 Mailboxed List Unit

The mailboxed list unit has the task of handling a list of object or triangle addresses, filtering the addresses in a kind of intersection cache (mailbox) and sending the addresses that pass the filter to the transformation unit. This mailboxing is necessary as most objects are present in several leaf nodes of the k-D tree. Therefore it can happen that an object is intersected several times, which greatly reduces the performance (see Section 4.3). The room problem in particular decreases the performance. Therefore multiple intersections with the same objects and triangles have to be avoided. This is the task of the mailbox unit, which saves already intersected objects in slots and prevents packets from being intersected twice with the same object.

Figure 6.3: Mailboxed List Unit

The list unit gets a job from the traversal unit consisting of a single address of the list to handle. The first entry of the list is read and sent to the mailbox unit. This is a packet based mailbox which checks whether the packet has already been intersected with this object. If so, control is returned to the list unit to read the next list entry, or to the traversal unit if this was the last list entry. If the list entry has not yet been intersected, it is sent to the transformation unit to be intersected. The operation at the list unit is continued when a triangle intersection or object intersection operation is done and the list has further entries. If the list is exhausted, the traversal operation is continued.

6.1.3 Transformation Unit

An essential part of the algorithm is the ray transformation, which is done by a specialized transformation unit.
This unit performs the transformation of the rays to the object's coordinate system and transforms the rays to the unit triangle system as a kind of precomputation for the intersection unit. Furthermore the shader can use the transformation unit to apply the camera transformation to compute a primary ray and to compute secondary rays like light rays or reflection rays.

The transformation of a packet is done sequentially, which allows for a good balancing between the traversal unit and the transformation unit (see Section 6.1.5). Because most ray packets have a single ray origin, this origin needs to be transformed only once. The transformation unit exploits this property by a kind of packet compression that turns a packet of n rays into n + 1 vectors. The first vector is the common ray origin and the others are the direction vectors of the packet. Such a compressed packet can be transformed by a fairly cheap transformation unit for vectors and decoded to a normal packet of rays by a decompression unit.

Figure 6.4: Transformation Unit

A transformation job starts at the load matrix unit, which reads the matrix of an object or triangle column by column out of memory and stores the columns in the transform unit. Once the matrix has been read completely, the send packet unit gets the job. This unit has a copy of the rays of the packet to process and sends these to the compress packet unit. The compress packet unit compresses the packet and sends the vectors and points to be transformed sequentially to the transform unit, which applies the previously stored affine transformation to its input vectors. Finally, the packet is combined into a valid packet again by the decompress packet unit.

There is an important path of the transformed packet to the send packet unit.
This path is needed if a packet was transformed to the local coordinate system of an object, because then the transformed ray needs to be saved to be intersected with the triangles in the object later.

There exist two modes for the transformation, one to transform points and a different one to transform vectors. This is important as both have to be transformed differently, as explained in Section 4.4. Furthermore, there exist two compression modes that indicate whether the packet has a common origin or not. If so, the packet is compressed. Otherwise, each origin and direction of the packet is transformed, resulting in 2n transformations. The compression mode is set by the shader, as it has the necessary information about the type of the packet.

Figure 6.5: The Figure shows that primary rays as well as light rays are types of packets with a single origin. Even reflections at planar surfaces maintain this property, as the virtual origin can be seen as the common origin of the packet.

It turns out that most kinds of packets can be compressed (see Figure 6.5). Packets of primary rays are trivially compressible, since their origin is the projection center of the camera. Light rays that are shot from the light source to the hit-points have the light source as common origin. Even reflected packets of rays retain their single origin if the packet was reflected by exactly one planar surface. The reflection at curved surfaces yields a compressible packet only in special cases.

6.1.4 Intersection Unit

The intersection unit is a simple pipeline that intersects rays with the unit triangle, applying the formulas of Section 5.2. As inputs it gets rays transformed to the unit triangle space and computes an intersection result consisting of the hit-distance, the barycentric coordinates and the dot product between the ray direction and the triangle normal vector.
This intersection result is combined with the hit-object and hit-triangle and then saved in the collect hits unit of the traversal unit.

6.1.5 Balancing

The subdivision of an algorithm into special purpose units may become a problem if the units are too specialized and used very rarely. Thus the balancing between the individual units of the dynamic ray tracing core needs to be analysed. The most expensive units of the design are the traversal unit and the transformation unit. Simulations showed that a balancing of 4 to 1 between the traversal and intersection operation is optimal for the k-D tree algorithm [8]. The same ratio can be used between the traversal and the transformation unit too, which means that 4 times more traversal operations than ray transformations need to be done. This ratio can approximately be achieved using a packet size of 4 rays per packet, which are traversed in parallel and transformed sequentially. The transformation unit then requires five times more cycles to handle a packet than the traversal unit if the packet can be compressed (one common origin plus four directions), and eight times more otherwise (four origins plus four directions). Thus we have a ratio of 5 to 1 if the packet can be compressed, or 8 to 1 otherwise. This ratio of 5 to 1 has been shown to be optimal for the dynamic architecture, as can be seen in the usage statistics in Appendix A.

6.2 Shading Unit

The shading unit should consist of several programmable special purpose shading CPUs, because of the wide range of possible shading models. This concept of the programmable shading unit will not be discussed here, but rather the interface between the shader unit and the Dynamic RTC.

This interface consists (besides a channel to send the k-D tree root node) of a channel to store a transformation in the RTC. Because this stored transformation is always applied to the packet sent to the ray tracing core, the shader can compute primary rays, light rays or reflection rays using the transformation unit. Each of these computations can be performed by storing a suitable transformation in the RTC and by sending a special ray to be transformed. Once all rays of the packet have been transformed, the RTC starts with the traversal operation.

6.2.1 Primary Rays

Primary rays are rays from the camera into the scene, which are computed for each pixel of the image. A camera can be represented by three orthogonal vectors u, v, w and its position p. The vectors u, v and w define the local coordinate system of the camera, such that u points to the right, v to the top and w in the viewing direction of the camera. The primary ray belonging to a pixel (x, y) on the screen is:

\[ x' = \frac{x}{x_{max}} - \frac{1}{2}, \qquad y' = \frac{y}{y_{max}} - \frac{1}{2}, \qquad prim\_ray = (p,\; x' \cdot u + y' \cdot v + w) \]

This primary ray can also be computed by the following ray transformation:

\[ T_{shear} = \begin{pmatrix} \frac{1}{x_{max}} & 0 & -\frac{1}{2} & 0 \\ 0 & \frac{1}{y_{max}} & -\frac{1}{2} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad T_c = \begin{pmatrix} u_x & v_x & w_x & p_x \\ u_y & v_y & w_y & p_y \\ u_z & v_z & w_z & p_z \end{pmatrix} \]

\[ pre\_prim\_ray = \left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \right), \qquad prim\_ray = T_c(T_{shear}(pre\_prim\_ray)) \]

The shown 4x3 matrices represent affine transformations where the left 3x3 minor stands for the linear part and the fourth column for the affine part. The transformation $T_{shear}$ is a shearing transformation that performs the mapping of the pixel coordinates to the x' and y' values. The transformation $T_c$ performs the affine composition of the u, v, w and p vectors with the x', y' values. If the special ray pre_prim_ray is transformed first with $T_{shear}$ and then with $T_c$, the primary ray computation is performed. Thus the RGS unit stores the camera matrix $T_c \circ T_{shear}$ as a transformation in the RTC and sends the pre-primary rays pre_prim_ray to it.

6.2.2 Light Rays

Light rays are secondary rays that are computed to determine the amount of light that illuminates, for instance, the hit-point of a primary ray. Such a light ray goes from the light source to the hit-point of the primary ray.
To compute a light ray for a primary ray, the shader has to read back the primary ray R = (org, dir) and the intersection result from the RGC. The intersection result contains, among other things, the hit-distance $\lambda$, which is needed to compute the hit-point $R(\lambda)$. If L is the position of the light source, the light ray can be computed by:

\[ R_{light} = (L,\; R(\lambda) - L) = (L,\; org + \lambda \cdot dir - L) \]

The same computation can be done by the following ray transformation:

\[ T_{light} = \begin{pmatrix} org_x & dir_x & L_x & L_x \\ org_y & dir_y & L_y & L_y \\ org_z & dir_z & L_z & L_z \end{pmatrix}, \qquad R' = \left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ \lambda \\ -1 \end{pmatrix} \right) \]

The transformation of the ray R' by $T_{light}$ yields the light ray from the light source to the hit-point of the ray. Note that because this transformation $T_{light}$ depends on the ray R, the shader has to load a special matrix for each of the rays of the packet. Furthermore, the real hit-point of the ray does not need to be computed in the shader.

6.2.3 Reflection Rays

Reflection rays are computed to simulate reflective surfaces. A ray that hits a reflective surface is reflected by it and traversed further in the reflection direction. The geometry that the reflected ray hits is exactly what is seen in the reflective surface.

The reflection of a ray at a planar surface can be performed by an affine reflection transformation. Such a reflection transformation can be precomputed for each triangle of the scene using the normal consistent triangle transformation. The concept is to transform the ray first to the unit triangle space, then to reflect it at the xy-plane and finally to transform it back again. This precomputation can be done by the following composition of 3 affine transformations:

\[ T_{reflect} = T_\Delta^{-1} \circ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \end{pmatrix} \circ T_\Delta \]

This reflection transformation depends on the triangle and maps each ray to the reflected ray. The reflected ray starts at the reflected origin and has the reflected ray direction, as can be seen in Figure 6.6.
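The constructions of Section 6.2 all follow the same pattern: a suitable 4x3 matrix is stored and a special ray is sent through it. The following Python sketch illustrates this pattern for primary, light and reflection rays. It is a minimal sketch under assumptions, not the shader implementation: the function names are hypothetical, a matrix is stored as three rows of four entries, and the triangle transformation and its inverse are passed in as ready-made matrices.

```python
def apply_affine(t, v, w):
    """Apply matrix t (3 rows, 4 entries each) to v.
    w = 1.0 treats v as a point, w = 0.0 as a direction vector."""
    return tuple(r[0] * v[0] + r[1] * v[1] + r[2] * v[2] + r[3] * w for r in t)

def transform_ray(t, ray):
    org, direction = ray
    return apply_affine(t, org, 1.0), apply_affine(t, direction, 0.0)

def compose(t2, t1):
    """Matrix of the composition x -> t2(t1(x))."""
    rows = []
    for r in t2:
        lin = [sum(r[k] * t1[k][j] for k in range(3)) for j in range(3)]
        aff = sum(r[k] * t1[k][3] for k in range(3)) + r[3]
        rows.append(tuple(lin) + (aff,))
    return tuple(rows)

def camera_matrix(u, v, w, p, xmax, ymax):
    """T_c o T_shear as described in Section 6.2.1."""
    t_shear = ((1.0 / xmax, 0.0, -0.5, 0.0),
               (0.0, 1.0 / ymax, -0.5, 0.0),
               (0.0, 0.0, 1.0, 0.0))
    t_c = tuple((u[i], v[i], w[i], p[i]) for i in range(3))
    return compose(t_c, t_shear)

def primary_ray(cam, x, y):
    """Send the pre-primary ray ((0,0,0), (x,y,1)) through the camera matrix."""
    return transform_ray(cam, ((0.0, 0.0, 0.0), (x, y, 1.0)))

def light_ray(org, direction, lam, light):
    """Build T_light for one ray and apply it to R' = ((0,0,0), (1,lam,-1))."""
    t_light = tuple((org[i], direction[i], light[i], light[i]) for i in range(3))
    return transform_ray(t_light, ((0.0, 0.0, 0.0), (1.0, lam, -1.0)))

# Mirror at the xy-plane of the unit triangle space.
MIRROR_Z = ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, -1.0, 0.0))

def reflection_matrix(t_tri, t_tri_inv):
    """T_reflect = T_tri^{-1} o mirror o T_tri."""
    return compose(t_tri_inv, compose(MIRROR_Z, t_tri))
```

As a usage example, for a triangle in the plane z = 1 the reflection matrix built this way maps a ray with origin (0, 0, 3) and direction (1, 0, -1) to a ray with the mirrored origin (0, 0, -1) and direction (1, 0, 1).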
To use a ray reflected this way, the traversal of the reflected ray has to start at the hit-distance of the unreflected ray. This can be done by setting the near value of the traversal algorithm to that hit-distance and ignoring each hit that is closer than this distance.

If the triangle lies in an object (which is always the case for the dynamic ray tracing algorithm) two additional transformations have to be done. First the ray has to be transformed into the object coordinate system, then it has to be reflected, and finally it has to be transformed back to the world coordinate system. These additional transformations increase the cost of a reflection ray, but can be performed in 3 passes by the transformation unit as well.

Figure 6.6: The Figure shows how the reflection matrix reflects a packet of rays at a surface.

Chapter 7

FPGA Prototype

In this Section the prototype implementation of the dynamic ray tracing architecture is described. The ADM-XRC-II PCI board from Alpha Data [30] has been used as development platform. This board contains a Xilinx Virtex-II 6000-4 [31] FPGA, 6 SRAM chips each with 4 MB of memory, a PCI controller and some IO-adapters.

Figure 7.1: ADMXRC Development Platform

Figure 7.2: Flowchart ADMXRC Top-Level

These IO-adapters are used as a VGA-out interface by generating a digital RGB-signal in the chip, which is translated by an external digital-to-analog converter into an analog video signal.

My work on the prototype was the development of the dynamic ray tracing core, which has been completely developed using JHDL [32] as hardware description language. JHDL has been used as it has a powerful debugging infrastructure that allows simulating the complete RTC in one piece and logging data buses into files. The system was completely developed under Linux.

Some compromises had to be made when mapping the architecture to the Xilinx Virtex-II 6000 FPGA.
Unfortunately there were only enough resources to implement one ray tracing pipeline. The main problem was the strongly limited memory resources (block RAMs) in the chip. Another limitation was the dedicated multipliers of the Virtex-II platform, which are only 18 bits wide. Thus a floating point representation with a 16-bit mantissa, a 7-bit exponent and 1 sign bit is used. It turned out that this accuracy is sufficient to do ray tracing even for complex and highly detailed standard scenes.

The number of packets in the ray tracing chip can be adjusted from 1 to 64 packets for simulation purposes. Later it is shown that a number of 32 packets in the system is in some sense optimal.

Because the prototype is not capable of rebuilding the top-level k-D tree on the chip, it has to be computed by the host PC the PCI card is connected to. After each frame, the updated top-level k-D tree is written to the ray tracing prototype.

Figure 7.3: The Figure shows the Dynamic SaarCOR Prototype Top-Level Chart. The numbers at the busses are the used data and address bits.

The traversal unit is subdivided into 4 traversal slices, as packets of 4 rays are handled in parallel. The two traversal levels (top-level and bottom-level) are distinguished by an internal depth bit. This bit is 0 in the top-level operation and 1 in the bottom-level operation. Some of the internal registers need to be duplicated for both traversal levels, since they are more or less unrelated. Because even the stack is duplicated, both the top-level and the bottom-level operations support a stack depth of 31 entries. If one of the stacks is full, the traversal operation cannot be continued correctly. This problem can partially be solved by doing no further push operations and continuing the traversal operation. This strategy works quite well, as errors occur only in tiny details of the scene. The traversal unit works on k-D tree nodes of 64 bit width.
Thus a 64 bit wide memory interface is required, delivering a bandwidth of 0.68 GB/s at 85 MHz.

The list unit reads 19 bit wide addresses out of a list and is one of the most trivial units of the design, as it mainly consists of an address counter. A special bit marks the last list entry.

The mailbox unit is implemented as a mailbox with 8 slots. Each time an object is handled which is not already present in the mailbox, it is saved into an empty slot. Because no strategy is implemented to clear the slots again, a full mailbox stays full. This simple mailbox has been very efficient in the prototype. It is used at the top-level and the bottom-level, and thus works for objects and triangles.

The transformation unit can store an affine transformation for each packet in the system. This strategy is wasteful, but allows transformations to be read out of memory independently of the transformation, which simplified the low level design. The object and triangle transformations are represented by a 4x3 matrix, and only normal consistent triangle matrices, which map the triangle normal to the unit triangle normal, are used. Thus the cosine between the ray direction and the triangle normal can be computed in the intersection unit.

The memory interface consists of three caches: one for the k-D tree nodes, one for the lists and one for the matrices. The FPGA has access to six 32 bit wide SRAM chips with a 20 bit address space. Three of these SRAM chips are used by the ray tracing core. The matrix columns are mapped to all three SRAMs, the 64 bit wide nodes to two of the SRAMs and the 32 bit wide list entries to one SRAM, as shown in Figure 6.1.

Thus the prototype has the following limitations for the scene size. The maximum number of k-D tree nodes as well as the number of list entries is limited to 524288. Triangles and/or objects can be 131072 in total.
Note that it is possible to support scenes with more than 131072 triangles by using objects and instantiating the same object several times. In this way, scenes with several billion triangles can be visualized.

The small direct mapped caches used (see Table 7.1) proved to be sufficient for a wide range of scenes. The cache size can be adjusted in 10 steps from $2^0$ to $2^9$ cache lines for simulation purposes. A direct mapped cache was chosen (as opposed to a 2-way cache, for instance) because of the coarse internal granularity of the FPGA memory, which is organized in 2 kB blocks.

Unit              Cache
Traversal          4 kB
List               2 kB
Transformation     6 kB
Total             12 kB

Table 7.1: Maximum Cache Sizes per Unit (without index structure)

The prototype shader is a simple eye light shading pipeline that uses one color per triangle and the cosine between the ray direction and the triangle normal, which is computed by the RTC. Light rays and reflection rays are supported in the latest version too. The standard resolution of the prototype is 512x384.

To increase the cache hit rate, the RGC unit performs no scanline ray generation, but uses a kind of hardware optimized Hilbert curve. Computing the image line by line results in bad cache hit rates, as the 2D image space is not scanned locally. If there is a triangle on the left of the image, it is very probable that it is no longer in the cache when the complete line is finished. Therefore it is important to work locally on the image, as the Hilbert curve does. The Hilbert curve itself, however, is too complicated to be computed efficiently in hardware.

Figure 7.4: Figure (a) shows the recursive pattern that is used to compute the hardware optimized Hilbert curve in Figure (b).

The curve used in the prototype can be computed efficiently in hardware but fulfills the same purpose as the Hilbert curve. The curve is computed by a simple counter whose bits are interpreted as $\ldots y_3 x_3 y_2 x_2 y_1 x_1 y_0 x_0$.
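Read literally, the counter interpretation above is a bit de-interleaving: even counter bits form x and odd counter bits form y. This ordering is commonly known as Morton or Z-order. The following sketch shows the mapping in software; the function name counter_to_pixel and the default of 4 bits per coordinate are assumptions made for illustration, and the exact pattern of the prototype (Figure 7.4) may differ in detail.

```python
def counter_to_pixel(counter: int, bits: int = 4) -> tuple[int, int]:
    """Split a counter ... y1 x1 y0 x0 into the pixel coordinates (x, y)."""
    x = 0
    y = 0
    for i in range(bits):
        x |= ((counter >> (2 * i)) & 1) << i      # even bits -> x
        y |= ((counter >> (2 * i + 1)) & 1) << i  # odd bits  -> y
    return x, y


# The first four counter values visit a 2x2 block before moving on,
# so consecutive primary rays stay close together in the image.
first_block = [counter_to_pixel(c) for c in range(4)]
```

In hardware this mapping needs no logic at all, since it is only a rewiring of the counter bits.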
The coordinates (x[3:0], y[3:0]) generate a curve like the one in Figure 7.4. By using this curve to generate primary rays, the cache hit rate is increased by approximately 10% to 20%, especially for the list and matrix caches (see Figure 7.5).

Figure 7.5: Cache hit rate as a function of the number of cache lines for the scene Gael, with scanline ray generation on the left and the hardware optimized Hilbert curve on the right. Both plots show the hit rate of the traversal, list and transformation caches.

7.1 Implementation Statistics

This section gives some statistics about the complexity of the ray tracing prototype. The numbers presented are in each case worst case numbers that are computed from statistics of the Xilinx routing software.

7.1.1 Gate Count

The complexity of hardware circuits is usually measured in number of gates. This gate count states how many NAND gates are necessary to implement the circuit. In the following sections gate counts are stated for the prototype, which are computed using the following mapping.

Unit                              gate count
full adder                        9
D flip-flop                       6
D flip-flop with clock enable     8
4-input LUT                       1 to 9
3-input LUT                       1 to 6
memory bit                        4

Table 7.2: Gate Count Computation

The source of this data is the Xilinx application note XAPP059 [33]. In addition, dual port memory bits are counted as two single port memory bits, and the embedded 18-bit multipliers are counted with 7000 gates per unit. In the computations the worst case gate count for the LUTs is used, and gates necessary to address the memory bits are ignored.

7.1.2 Complexity

Table 7.3 lists the complexity of one ray tracing pipeline, measured in the number of floating-point units for addition, multiplication, division and comparison, respectively.
The rightmost column additionally lists the amount of internal memory each unit uses to store ray data, stacks and further internal data.

Unit                           Add   Mul   Div   Comp   Mem
Traversal                        4     0     4     13   44.5 kB
List                             0     0     0      0    0.8 kB
Transformation                   9     9     0      0    9.3 kB
Intersection                     3     2     1      3    0.0 kB
Cache (with index structure)     0     0     0      0   15.6 kB
Total                           16    11     5     16   70.2 kB

Table 7.3: Complexity of one ray tracing pipeline with 32 packets and 512 cache lines (dual port memory bits counted as 2 bits)

DynamicRTC                  logic gates   bits per packet   memory bits   memory gates
DynamicRTC                       21,338                 0             0              0
Traversal                         8,470                 0             0              0
TraversalMemoryInterface          5,060             1,292        41,344        165,376
TraversalStackPointer             2,568                12           384          1,536
TraversalSlice0                  43,107             2,352        75,264        301,056
TraversalSlice1                  43,107             2,352        75,264        301,056
TraversalSlice2                  43,107             2,352        75,264        301,056
TraversalSlice3                  43,107             2,352        75,264        301,056
PacketTraversalDecision             309                 0             0              0
CollectHits                       4,155               688        22,016         88,064
List                              2,743                76         2,432          9,728
Mailbox                           7,108               136         4,352         17,408
LoadObject                        4,557                19           608          2,432
SendPacket                        2,262             1,152        36,864        147,456
PacketEncoder                     1,316                 0             0              0
Transformation                  148,040             1,152        36,864        147,456
PacketDecoder                       694                72         2,304          9,216
Intersection                    105,972                 0             0              0
Total                           487,020            14,007       448,224      1,792,896

Table 7.4: Gate Count and Memory Bits per Unit using 32 Packets

MemoryInterface     logic gates   bits per cache line   cache memory bits   cache memory gates
MemoryInterface           4,323                     0                   0                    0
NodeCache                 4,152                    83              42,496              169,984
ListCache                 3,704                    51              26,112              104,448
MatrixCache               5,624                   115              58,880              235,520
Total                    17,803                   249             127,488              509,952
Total gates           2,807,671

Table 7.5: Gate Count and Memory Bits per Unit using 512 Cache Lines

Table 7.4 shows the estimated number of gates for each of the units of the design. It also shows the number of memory bits required per packet in the system, as well as the memory gates required for the on-chip memory of a system with 32 packets.
Table 7.5 shows the gate count of the memory interface and caches, as well as the number of bits required per cache line. A system with 512 cache lines and 32 packets requires at most 2,807,671 gates. If $P$ is the number of packets in the system and $CL$ the number of cache lines, then the gate count $C_{RTC}$ for the complete Dynamic RTC can be estimated by the following formula:

$C_{RTC} = 487{,}020 + 56{,}028 \cdot P + 996 \cdot CL$

Note that the constant term covers only the logic gates of Table 7.4; adding the 17,803 logic gates of the memory interface from Table 7.5 reproduces the 2,807,671 gates stated above for $P = 32$ and $CL = 512$. The necessary internal memory bits can be computed by:

$Bits_{RTC} = 14{,}007 \cdot P + 249 \cdot CL$

7.2 Performance Statistics

This section discusses the performance achieved with the ray tracing prototype. Besides the maximal performance, it presents an analysis that estimates the quality of the design. These quality estimates are based on gate level computations and are thus only of interest for a mapping to an ASIC, not to an FPGA. The section describes several kinds of statistics, which are listed in Appendix A for 4 test scenes.

7.2.1 Hardware Quality Index

It is easy to develop arithmetic units in hardware, but keeping these units fed with data is very difficult. To feed them, on-chip memory in the form of registers, stacks and caches is required. This on-chip memory is necessary, but in contrast to the arithmetic units most of its gates are idle during the computations. The following hardware quality index $Q_{HW}$ therefore describes the percentage of gates in the chip that are working.

$Q_{HW} = \frac{U_{AU} \cdot C_{AU}}{C_{AU} + C_{IM}} \cdot 100$

The value $U_{AU}$ is the usage ratio of the arithmetic units and $C_{AU}$ is their cost in gates. Analogously, $C_{IM}$ is the cost of the internal memory in gates. The hardware quality index can be used to compare two different versions of the same hardware algorithm. The version with the higher quality index is to be preferred, as it uses the gates more efficiently. Optimal system parameters, such as the cache size and the number of internal packets, can be computed using this index.
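The estimates above can be checked with a few lines of arithmetic. The sketch below uses only constants from Tables 7.2, 7.4 and 7.5; the function and constant names are hypothetical. It also makes visible that the formula for the gate count, taken as printed, is 17,803 gates (the logic of the memory interface) below the total stated for 32 packets and 512 cache lines.

```python
LOGIC_GATES = 487_020            # Table 7.4, total logic gates
GATES_PER_MEMORY_BIT = 4         # Table 7.2
BITS_PER_PACKET = 14_007         # Table 7.4
BITS_PER_CACHE_LINE = 249        # Table 7.5
MEMORY_INTERFACE_LOGIC = 17_803  # Table 7.5, not part of the printed formula


def memory_bits(packets: int, cache_lines: int) -> int:
    """Internal memory bits for P packets and CL cache lines."""
    return BITS_PER_PACKET * packets + BITS_PER_CACHE_LINE * cache_lines


def gate_count(packets: int, cache_lines: int) -> int:
    """Gate count as printed: 487,020 + 56,028 * P + 996 * CL."""
    return LOGIC_GATES + GATES_PER_MEMORY_BIT * memory_bits(packets, cache_lines)


def quality_index(usage: float, c_au: float, c_im: float) -> float:
    """Hardware quality index in percent; usage is a ratio between 0 and 1."""
    return usage * c_au / (c_au + c_im) * 100


printed = gate_count(32, 512)                   # 2,789,868
with_interface = printed + MEMORY_INTERFACE_LOGIC  # 2,807,671
```

The coefficients 56,028 and 996 are simply 4 gates per memory bit times the 14,007 bits per packet and the 249 bits per cache line.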
Figure 7.6 shows the hardware quality index as a function of the number of packets in the system for two scenes. The best gate usage of about 9.5% is achieved with 32 packets in the system. This means that it is more efficient to put several ray tracing pipelines with 32 packets each onto the chip than a smaller number of pipelines with more than 32 packets. The same holds in the other direction: it is better to use pipelines with 32 packets than a larger number of pipelines with fewer packets each. The computed maximum is not optimal for an FPGA architecture, as there the cost should not be counted in gates. This is because today's FPGAs contain (besides CLBs) special resources like block RAMs and multiplier blocks. Thus memory can be much cheaper if these memory blocks can be used efficiently by the design.

Figure 7.6: Hardware Quality Index of the Dynamic Ray Tracing Core for the scenes Gael and Conference (512x384, 85 MHz) as a function of the number of packets in the system.

The optimal values of the system parameters depend on each other. For the ray tracing architecture it is therefore necessary to take into account the available memory bandwidth, the memory latency and delay, the cache size, the number of packets in the system, the depth of the internal pipelines and the kind of scene to be handled efficiently. In practice it is thus difficult to build the perfect system, but the described index makes it possible to compare different configurations of the hardware.
7.2.2 Graphics Hardware Quality Index

The hardware quality index described in the last section has the disadvantage that it makes no statement about the quality of the ray tracing algorithm used, only about whether the algorithm is computed efficiently. A different ray tracing algorithm might require fewer traversal steps to achieve the same result, but many more idle memory resources. Nevertheless it could be the better choice. The following graphics hardware quality index $Q_{GHW}$ can be used to compare different kinds of ray tracing and rasterization hardware algorithms, since it takes into account the performance in rays shot per cycle achieved by the algorithm.

$Q_{GHW} = \frac{\text{rays per cycle} \cdot 1{,}000{,}000}{C_{AU} + C_{IM}}$

The index $Q_{GHW}$ describes the number of rays a single gate of the circuit can shoot through the scene in 1,000,000 clock cycles. For rasterization hardware, the number of rays shot per cycle has to be replaced by the number of pixels rendered per cycle.

Figure 7.7: Graphics Hardware Quality Index of the Dynamic Ray Tracing Core for the scenes Gael and Conference (512x384, 85 MHz) as a function of the number of packets in the system.

Figure 7.7 shows the ray tracing quality of the prototype for two scenes. The maximal quality is again achieved with 32 packets in the system. As the rays shot per cycle are proportional to the usage of the arithmetic units, the hardware quality index and the graphics hardware quality index reach their maximum at the same position.
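As a rough plausibility check, the graphics hardware quality index can be evaluated for one configuration. The sketch below is a worked example under assumptions: the frame rate of 20 frames per second is picked from the typical range reported for the scene Gael, only one primary ray per pixel is counted, and the total of 2,807,671 gates for 32 packets and 512 cache lines is used as $C_{AU} + C_{IM}$. The function names are hypothetical.

```python
def rays_per_cycle(fps: float, width: int, height: int,
                   clock_hz: float, rays_per_pixel: float = 1.0) -> float:
    """Average number of rays shot per clock cycle."""
    return fps * width * height * rays_per_pixel / clock_hz


def graphics_quality_index(rays_per_clock: float, total_gates: float) -> float:
    """Rays that a single gate shoots in 1,000,000 clock cycles."""
    return rays_per_clock * 1_000_000 / total_gates


# Assumed operating point: 20 fps at 512x384 and 85 MHz, primary rays only.
rate = rays_per_cycle(fps=20, width=512, height=384, clock_hz=85_000_000)
index = graphics_quality_index(rate, total_gates=2_807_671)
```

Under these assumptions the index comes out at roughly 0.016, which is of the same order as the values plotted in Figure 7.7.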
Unfortunately, it is difficult to compute a fair quality index for today's rasterization hardware, as these chips support many extra features besides the simple rasterization of triangles. In general, however, it can be said that for scenes consisting of few triangles the quality index of rasterization hardware will be much higher. In contrast, for scenes with several million triangles ray tracing will become more efficient at some point.

7.2.3 Usage

The usage of a unit is the percentage of cycles in which it is working. This usage can be computed for the 4 most important units of the design, and it directly corresponds to the achieved performance. It is therefore important to adjust the system parameters such that the usage is fairly high. The usage can be increased by using more packets in the system to fill the pipeline stages, or by larger caches to prevent long wait cycles for memory requests. Both parameters have to be increased carefully, as too much internal memory is a drawback too, since the gates it requires compute nothing. Figure 7.8 shows the usage of the individual units as a function of the number of packets in the system. The usage increases with the number of packets, as each packet can fill stages of the pipelines.

Figure 7.8: Usage of Units. Usage of the traversal, list, transformation and intersection units over the number of packets in the system for the scenes Gael and Conference (512x384, 85 MHz).

There are several pipelines in the system that are separated by FIFO (first in, first out) queues and memory interfaces, and each of them is filled differently by one packet. A packet fills one pipeline stage in the traversal unit, since the 4 rays of the packet are traversed in parallel, but normally 5 stages in the transformation pipeline.
This is because the rays of the packet are transformed in sequence: first the ray origin and then the 4 ray directions. It seems that the usage scales linearly with the number of packets in the system. But this is only true if there are few packets in the system, as the total number of pipeline stages limits the linear scaling. Once the usage of one unit reaches nearly 100%, the usage cannot be increased any further. Even the usage of the other units, which is normally far below 100%, cannot be increased any more, as there is always a fixed ratio between the usage values of the units for a given image. The curves of Figure 7.8 approach the maximal theoretical usage of each unit in the limit, and there is a fixed factor between any 2 curves that is independent of the number of packets in the system. The frame rate as a function of the number of packets in the system directly corresponds to the usage of the individual units, because the usage of the units is proportional to the performance achieved.

Figure 7.9: Frame Rate. Frames per second over the number of packets in the system for the scenes Gael and Conference (512x384, 85 MHz).

7.2.4 Cache Hit Rate

The cache hit rates are an important aspect of ray tracing hardware algorithms, since the bandwidth required behind the caches determines the number of units working in parallel that can be connected to the available memory interface.

Figure 7.10: Cache Hit Rate. Hit rate of the traversal, list and transformation caches over the number of cache lines for the scenes Gael and Conference (512x384, 85 MHz).

Figure 7.10 shows the cache hit rate of the 3 types of caches as a function of the number of cache lines for the Gael and Conference scenes.
The size of the required direct mapped caches is extremely low, especially for the nodes. There are two reasons for this. First, 4 cache lines are required to map a complete matrix, but only one to map a k-D tree node. Second, the coherence of k-D tree nodes at the top of the k-D tree is much higher than for nodes at the bottom, because the subspace represented by a node at the top of the tree is much larger than that of a node near the leaves. The cache hit rate for the triangle matrices is not satisfactory, but it can be improved using more advanced cache strategies. Thus 2-way or 4-way caches should achieve much better cache hit rates in an ASIC implementation of the design.

7.2.5 Memory Bandwidth

One of the most critical parts of most types of hardware is the memory interface, as it has to deliver the required bandwidth; otherwise the chip cannot work at its limit. One strategy that the ray tracing prototype uses to decrease the required memory bandwidth is to traverse packets of rays in parallel. Here the k-D tree nodes, list entries and matrices are fetched only once for the 4 rays of a packet. In spite of this optimization, the required memory bandwidth is fairly high. Therefore it is necessary to use caches for each of the units in the pipeline.

Figure 7.11: Achieved performance (frames per second over the memory bandwidth in MB/s) for the scenes Gael and Conference (512x384, 85 MHz) using 64 packets and 512 cache lines, if the memory bandwidth is scaled by the memory clock ratio factor.

A point of interest is the memory bandwidth needed behind the caches, which is analysed in Figure 7.11. The maximal memory bandwidth of the RTC to the 3 SRAM chips is 1.02 GB/s at 85 MHz. The figure shows how the performance drops if the memory bandwidth behind the caches is reduced to the specified value.
Note that for most scenes it is possible to use 4 ray tracing pipelines in parallel, as scaling the memory bandwidth by a factor of 1/4 reduces the performance by only about 20%. The Conference scene is an exception, as its performance drops sharply if the memory bandwidth is limited. This shows that larger or more efficient caches are required for this scene.

The data of Figure 7.11 can be used to compute a worst case frame rate for several parallel prototype RTCs that, together with their small caches, are connected to the 1.02 GB/s memory interface. For instance, the performance of two RTC units at the 1.02 GB/s memory interface is higher than twice the performance that one unit reaches at a 0.5 GB/s memory interface. Figure 7.12 shows the performance that would be possible if the specified number of pipelines worked in parallel on the 1.02 GB/s memory interface.

Figure 7.12: Scalability. Frames per second over the number of RTC units for the scenes Gael and Conference (512x384, 85 MHz).

7.2.6 Performance

The ray tracing prototype achieves a real-time performance of 10 to 30 frames per second for a wide range of scenes at a resolution of 512x384. For a detailed overview of the performance reached for 4 test scenes see Appendix A. Depending on the routing achieved by the Xilinx software, maximal frequencies of 85 to 92 MHz are possible. For the statistics in Appendix A and the following performance values, the lower value of 85 MHz is used. At a frequency of 85 MHz, the prototype has a floating point performance of 4.08 billion flops, which is a fairly low value compared to today's rasterization hardware.
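The headline figures quoted in this section appear to follow from simple per-cycle arithmetic at 85 MHz. The sketch below reproduces them under the assumption that every floating point unit of Table 7.3 delivers one result per cycle, that one packet traversal step is completed per cycle and that a packet of 4 rays occupies the transformation unit for 5 cycles. The variable names are illustrative only.

```python
CLOCK_HZ = 85_000_000  # lower bound of the achieved frequency range

# Floating point units of one pipeline (Table 7.3): add, mul, div, compare.
FP_UNITS = 16 + 11 + 5 + 16
flops = FP_UNITS * CLOCK_HZ  # 48 units, one operation each per cycle

# Three 32-bit wide SRAM chips, one access per cycle each.
sram_bytes_per_second = 3 * 4 * CLOCK_HZ

# The 64-bit wide node interface alone (two of the three SRAMs).
node_bytes_per_second = 8 * CLOCK_HZ

# One packet traversal step per cycle, 4 rays per packet.
packet_steps_per_second = CLOCK_HZ
ray_steps_per_second = 4 * packet_steps_per_second

# A packet needs 5 transformation cycles: 1 shared origin + 4 directions.
rays_transformed_per_second = CLOCK_HZ * 4 // 5
```

These values correspond to 4.08 billion flops, 1.02 GB/s, 0.68 GB/s, 340 million ray traversal steps and 68 million transformed rays per second.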
The frequency of the prototype cannot be increased much further, because the internal 18-bit wide multiplier blocks used allow scaling to at most 110 MHz. At most 85 million packet traversal steps per second can be performed, which is equivalent to 340 million single ray traversal steps. The transformation unit can transform approximately 68 million rays per second (if the packets are compressible), and consequently the same number of triangle intersections can be performed.

Chapter 8 Conclusion

This thesis has shown that creating special purpose real-time hardware for ray tracing is possible, even on FPGAs with their limited CLB and memory resources. The FPGA used is not the best available today, as there are new FPGA chips with about 60% more CLBs and four times more memory and multiplier blocks. These memory and multiplier resources in particular have been the most limiting factor in the prototype. Thus, using these new FPGAs, a ray tracing chip with two or four ray tracing pipelines should be possible.

By mapping the architecture to an ASIC it would be possible to do ray tracing at a resolution of 1024x768 in real time, even if some secondary rays are shot. This is because the capacity of today's high-end ASICs is in the range of 52 million gates using a 0.095 µm silicon gate CMOS process. Since a ray tracing pipeline requires 2.8 million gates, at most 18 ray tracing units could be placed on the chip. But because programmable shaders are required, as well as some larger caches to feed the units working in parallel, a number of 8 ray tracing pipelines per ASIC would be realistic. In conjunction with an increase of the frequency to about 266 MHz, a high-end ASIC implementation would achieve about 20 times the performance of the prototype.

Because the described hardware architecture supports structured motion, the scene has to be partitioned into movable objects.
A main part of the traversal algorithm for such partitioned scenes is to transform the ray to the local coordinate system of the object in order to continue the traversal in it. This operation requires an affine ray transformation unit, which is fairly costly. To reduce the floating point resources required on the chip, this transformation unit is also used to intersect rays with triangles. This is possible using the described unit triangle intersection method. One further optimization exploits the fact that most packets of rays have a common ray origin, which then needs to be transformed only once per packet.

The last sections showed how optimal values for several system parameters, like the number of packets and cache lines, can be computed. This is important when mapping the architecture to an ASIC, since for cost reduction purposes it is necessary to use the available gates as efficiently as possible.

In spite of the small caches, it would be possible to use 2 or 4 ray tracing cores in parallel at the described memory interface delivering 1.02 GB per second. Using larger, more advanced caches and some cache hierarchy, it will be possible to use many more units in parallel.

Chapter 9 Future Work

Of course, the development of the ray tracing prototype is not yet finished. To support larger scenes, cheaper DRAM resources should be used as a scene database. The Alpha Data development platform used contains 256 MB of DRAM on a 64-bit wide interface, but because of the simpler protocol only the SRAM resources have been used.

In spite of the fact that the top-level k-D tree was rebuilt fast enough on the host PC for our test scenes, hardware support for this operation should be provided, especially if the number of objects gets too large.
This hardware support should be available for k-D trees consisting of triangles too, because then vertex shaders can be used to modify the positions of the triangle vertices, followed by a k-D tree reconstruction.

Up to now the ray tracing prototype supports only a simple fixed eye light shading model. This shader should be replaced by programmable special purpose shading CPUs that perform the color and secondary ray computation. Shading CPUs are necessary because of the wide range of shading models available for ray tracing applications.

The prototype uses a k-D tree as acceleration structure, but no analysis has been done of whether this is the best choice for a hardware ray tracing approach. The k-D tree algorithm seems to be the best choice in software based systems [3], but some other acceleration structures can be implemented using fairly simple traversal units. For the regular grid acceleration structure, for instance, there exist simple traversal algorithms based on integer arithmetic. This integer arithmetic leads to a much flatter traversal unit, which consequently requires fewer packets in the system. Furthermore, no stack is required in the grid traversal algorithm.

Chapter 10 Appendix A

The following sections show statistics of four test scenes used to test the prototype. The statistic diagrams shown are discussed in detail in Section 7.2. Unless specified otherwise, the standard configuration for the statistics is a resolution of 512x384, 64 packets in the system, 512 cache lines and the hardware optimized Hilbert curve. The last two statistics of each scene show a walk through the scene to illustrate the typical frame rates that are achieved.
10.1 Office

Objects               1
Total Triangles       34,313
FPGA Scene Size       3.7 MB
Typical Frame Rate    20-30 fps
Resolution            512x384

Plots for the scene Office (512x384, 85 MHz): frames per second and unit usage (traversal, list, transformation, intersection) over the number of packets; hardware quality index and ray tracing quality index over the number of packets; scalability (frames per second over the number of RTC units); cache hit rate (traversal, list, transformation) over the number of cache lines; frames per second and unit usage over the memory bandwidth in MB/s; frames per second and unit usage over the frame number of a walk through the scene.
10.2 Gael

Objects               1
Total Triangles       68,624
FPGA Scene Size       7.0 MB
Typical Frame Rate    17-25 fps
Resolution            512x384

Plots for the scene Gael (512x384, 85 MHz): frames per second and unit usage (traversal, list, transformation, intersection) over the number of packets; hardware quality index and ray tracing quality index over the number of packets; scalability (frames per second over the number of RTC units); cache hit rate (traversal, list, transformation) over the number of cache lines; frames per second and unit usage over the memory bandwidth in MB/s; frames per second and unit usage over the frame number of a walk through the scene.
10.3 Conference

Objects               54
Total Triangles       282,801
FPGA Scene Size       5.3 MB
Typical Frame Rate    17-20 fps
Resolution            512x384

Plots for the scene Conference (512x384, 85 MHz): frames per second and unit usage (traversal, list, transformation, intersection) over the number of packets; hardware quality index and ray tracing quality index over the number of packets; scalability (frames per second over the number of RTC units); cache hit rate (traversal, list, transformation) over the number of cache lines; frames per second and unit usage over the memory bandwidth in MB/s; frames per second and unit usage over the frame number of a walk through the scene.
10.4 Trees4000

Objects               4,000
Total Triangles       20 Million
FPGA Scene Size       3.4 MB
Typical Frame Rate    8-14 fps
Resolution            512x384

Plots for the scene trees4000 (512x384, 85 MHz): frames per second and unit usage (traversal, list, transformation, intersection) over the number of packets; hardware quality index and ray tracing quality index over the number of packets; scalability (frames per second over the number of RTC units); cache hit rate (traversal, list, transformation) over the number of cache lines; frames per second and unit usage over the memory bandwidth in MB/s; frames per second and unit usage over the frame number of a walk through the scene.

Bibliography

[1] http://www.nvidia.com. Geforce3 - the world’s most advanced processor, 2001.
[2] Peter Shirley. Fundamentals of Computer Graphics. A K Peters Ltd, June 2002.
[3] Vlastimil Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, http://www.cgg.cvut.cz/~havran/phdthesis.html, November 2000.
[4] Ingo Wald, Thomas Kollig, Carsten Benthin, Alexander Keller, and Philipp Slusallek. Interactive Global Illumination using Fast Ray Tracing. Rendering Techniques 2002, pages 15–24, 2002. (Proceedings of the 13th Eurographics Workshop on Rendering).
[5] Ingo Wald and Philipp Slusallek. State-of-the-Art in Interactive Ray Tracing. In State of the Art Reports, Eurographics 2001, pages 21–42, 2001.
[6] Ingo Wald, Carsten Benthin, Markus Wagner, and Philipp Slusallek. Interactive Rendering with Coherent Ray Tracing. Computer Graphics Forum (Proceedings of EUROGRAPHICS 2001), 20(3), 2001.
[7] Ingo Wald, Philipp Slusallek, and Carsten Benthin. Interactive Distributed Ray Tracing of Highly Complex Models. In Proceedings of the 12th EUROGRAPHICS Workshop on Rendering, June 2001. London.
[8] Jörg Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR – A Hardware Architecture for Ray Tracing. In Proceedings of Eurographics Workshop on Graphics Hardware, pages 27–36, 2002.
[9] John V. Oldfield and Richard C. Dorf. Field Programmable Gate Arrays. Wiley-Interscience, January 1995.
[10] Michael John Sebastian Smith. Application-Specific Integrated Circuits. Addison-Wesley, June 1997.
[11] Stuart A. Green and Derek J. Paddon. Exploiting coherence for multiprocessor ray tracing. IEEE Computer Graphics and Applications, 9(6):12–26, 1989.
[12] Stuart A. Green and Derek J. Paddon. A highly flexible multiprocessor solution for ray tracing. The Visual Computer, 6(2):62–73, 1990.
[13] Tony T.Y. Lin and Mel Slater. Stochastic Ray Tracing Using SIMD Processor Arrays. The Visual Computer, pages 187–199, 1991.
[14] Michael J. Muuss. Towards real-time ray-tracing of combinatorial solid geometric models. In Proceedings of BRL-CAD Symposium ’95, June 1995.
[15] M. J. Keates and Roger J. Hubbold. Interactive ray tracing on a virtual shared-memory parallel computer. Computer Graphics Forum, 14(4):189–202, 1995.
[16] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike Sloan. Interactive ray tracing. In Interactive 3D Graphics (I3D), pages 119–126, April 1999.
[17] Steven Parker, Michael Parker, Yarden Livnat, Peter Pike Sloan, Chuck Hansen, and Peter Shirley. Interactive ray tracing for volume visualization. IEEE Transactions on Visualization and Computer Graphics, 5(3), 1999.
[18] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike Sloan. Interactive ray tracing for isosurface rendering. In IEEE Visualization ’98, 1998.
[19] Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Rendering complex scenes with memory-coherent ray tracing. Computer Graphics, 31(Annual Conference Series):101–108, August 1997.
[20] Advanced Rendering Technologies. http://www.art-render.com.
[21] D. Hall. The AR350: Today’s ray trace rendering processor. In Proceedings of the Eurographics/SIGGRAPH workshop on Graphics hardware - Hot 3D Session 1, 2001.
[22] Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, and Larry Seiler. The VolumePro real-time ray-casting system. In Computer Graphics 31, pages 251–260, 1999.
[23] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz. Smart Memories: A Modular Reconfigurable Architecture. IEEE International Symposium on Computer Architecture, 2000.
[24] Timothy Purcell. The SHARP Ray Tracing Architecture. SIGGRAPH course on Interactive Ray Tracing, 2001.
[25] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray Tracing on Programmable Graphics Hardware. In Proceedings of SIGGRAPH 2002, 2002.
[26] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed Interactive Ray Tracing of Dynamic Scenes. In Proceedings of the IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG), 2003.
[27] Erik Reinhard, Brian Smits, and Chuck Hansen. Dynamic acceleration structures for interactive ray tracing. In Proceedings of SIGGRAPH, 2002.
[28] Allen Y. Chang. A Survey of Geometric Data Structures for Ray Tracing. Technical report, Polytechnic University, October 2001.
[29] Emo Welzl. Smallest enclosing disks (balls and ellipsoids), chapter New Results and New Trends in Computer Science (H. Maurer, ed.), pages 359–370. 1991.
[30] Alphadata. www.alpha-data.com.
[31] Xilinx, Virtex2-6000 FPGA. www.xilinx.com/virtex2.
[32] Peter Bellows and Brad Hutchings. JHDL - An HDL for Reconfigurable Systems. Technical report, Department of Electrical and Computer Engineering, www.jhdl.org.
[33] Xilinx. www.xilinx.com.