A Ray Tracing
Hardware Architecture
for Dynamic Scenes
by
Sven Woop
A thesis submitted in partial fulfillment of the requirements
for the degree of
Diplom-Informatiker
(Diploma in Computer Science)
Completed under the supervision of
Jörg Schmittler and Prof. Dr.-Ing. Philipp Slusallek
at the
Universität des Saarlandes
Fachrichtung 6.2 - Informatik
Computer Graphik
Im Stadtwald - Geb. 36.1, Raum 018
66123 Saarbrücken
March 29, 2004
Copyright © 2004, by Sven Woop
Statutory Declaration
I hereby declare in lieu of an oath that I have written this thesis
independently and have used no aids other than those stated.
Saarbrücken, March 29, 2004
Sven Woop
Acknowledgements
I would like to thank Jörg Schmittler for his assistance and for spending
several nights to get the prototype working. Thanks to Prof. Slusallek for
his support and constructive criticism.
Abstract
This thesis describes a ray tracing hardware architecture for dynamic
scenes that makes it possible to ray trace highly complex scenes in real
time. At first glance, ray tracing of dynamic scenes does not seem to be
efficiently possible, because ray tracing requires an acceleration structure
whose creation is very costly. The well-known solution to this problem is to
partition the scene into movable objects, which leads to a top-level
acceleration structure over the objects and a bottom-level acceleration
structure in each object. The presented architecture efficiently supports such
partitioned scenes by using one transformation unit for both the triangle
intersection and the object space transformation. A prototype of the hardware
architecture has been implemented on an FPGA; it is the first working
special-purpose real-time ray tracing hardware available today. The
performance and implementation details of this prototype are discussed at the
end of this thesis.
Contents
1 Introduction
2 Previous Work
3 The Basic Ray Tracing Algorithm
3.1 k-D Trees
3.1.1 k-D Tree Creation
3.1.2 Recursive k-D Tree Traversal
3.1.3 Packet k-D Tree Traversal
4 The Dynamic Ray Tracing Algorithm
4.1 Top-Level k-D Tree Creation
4.2 Bounding Box Clipping
4.3 Overlapping Objects
4.3.1 Hierarchical k-D Trees
4.3.2 Mailboxing
4.3.3 Multiple Scenes
4.4 Ray Transformation
4.5 Hit-Distance Transformation
4.6 Normal Transformation
5 Triangle Intersection
5.1 Affine Triangle Transformation
5.1.1 Memory Efficient Triangle Transformation
5.1.2 Normal Consistent Triangle Transformation
5.2 Unit Triangle Intersection
6 The Dynamic SaarCOR Architecture
6.1 Dynamic Ray Tracing Core
6.1.1 Traversal Unit
6.1.2 Mailboxed List Unit
6.1.3 Transformation Unit
6.1.4 Intersection Unit
6.1.5 Balancing
6.2 Shading Unit
6.2.1 Primary Rays
6.2.2 Light Rays
6.2.3 Reflection Rays
7 FPGA Prototype
7.1 Implementation Statistics
7.1.1 Gate Count
7.1.2 Complexity
7.2 Performance Statistics
7.2.1 Hardware Quality Index
7.2.2 Graphics Hardware Quality Index
7.2.3 Usage
7.2.4 Cache Hit Rate
7.2.5 Memory Bandwidth
7.2.6 Performance
8 Conclusion
9 Future Work
10 Appendix A
10.1 Office
10.2 Gael
10.3 Conference
10.4 Trees4000
List of Figures
3.1 Ray Tracing Basics
3.2 k-D Tree Semantics
3.3 k-D Tree Example
3.4 k-D Tree Traversal Example
3.5 Hit-Distance Computation
3.6 Traversal Decisions
3.7 Packet Traversal
3.8 Example of an Invalid Packet
4.1 Dynamic Acceleration Structure
4.2 Ray Transformation into Object Space
4.3 Bounding Box of Object Instances
4.4 Bounding Box Clipping
4.5 Bounding Box Clipping Example
4.6 Overlapping Objects
4.7 Room Problem
4.8 Hierarchical k-D Trees as Solution to the Room Problem
4.9 Normal Transformation
5.1 Unit Triangle Intersection
6.1 Dynamic Ray Tracing Architecture
6.2 Traversal Unit
6.3 Mailboxed List Unit
6.4 Transformation Unit
6.5 Compressable Packets
6.6 Reflection Matrix Illustration
7.1 ADMXRC Development Platform
7.2 ADMXRC Top-Level Flowchart
7.3 Dynamic SaarCOR Prototype
7.4 Hardware Optimized Hilbert Curve
7.5 Cache Hit Rate using the Hardware Optimized Hilbert Curve
7.6 Hardware Quality Index
7.7 Graphics Hardware Quality Index
7.8 Usage of Units
7.9 Frame Rate
7.10 Cache Hit Rate
7.11 Memory Bandwidth
7.12 Scalability
List of Tables
4.1 Millions of operations for various strategies
7.1 Maximum Cache Size per Unit
7.2 Gate Count Computation
7.3 Complexity of one Ray Tracing Pipeline
7.4 Gate Count and Memory Bits per Unit using 32 Packets
7.5 Gate Count and Memory Bits per Unit using 512 Cache Lines
Chapter 1
Introduction
Ray tracing is one of the most popular rendering techniques for creating
highly realistic images. However, because it is a computationally expensive
recursive algorithm that requires large memory bandwidth, it is a challenging
task to implement it in hardware.
As a consequence, the state of the art in interactive 3D computer graphics is
still rasterization hardware. The rasterization algorithm is efficient
for scenes consisting of few triangles, while ray tracing is not. Today's
computer graphics hardware can handle scenes with several hundred
thousand triangles, which is made possible by high memory bandwidth and
high floating point performance. For instance, Nvidia’s GeForce 3 [1] offers
76 GFlops at a clock rate of 200 MHz and has a 256 bit wide memory
interface running at 230 MHz, delivering a memory bandwidth of 7.2 GB/s.
In recent years the scenes of standard computer games have become
more and more detailed. Indeed, computer games are developed based on
the current graphics card standard, but rasterization hardware will become
a limiting factor in the near future. Because the main concept behind rasterization hardware is to project each triangle of the scene to a frame- and
z-buffer, the rasterization algorithm scales linearly in the number of triangles of the scene. Furthermore it is difficult to parallelize the rasterization
algorithm, as the bandwidth to the frame- and z-buffer becomes critical.
This is because each triangle that is projected onto the image plane involves
many memory accesses to the frame- and z-buffer. If the triangles of the
scene are large the performance consequently drops. For a detailed description of the rasterization algorithm see any standard textbook for computer
graphics, for example that by Shirley [2].
Ray tracing does not suffer from these problems: the tracing of individual
rays can trivially be parallelized, because the rays do not depend on each
other. Moreover, it can be shown that the ray tracing algorithm
scales logarithmically in the number of triangles in the scene [3]. The only
problems might be that the initial hardware cost for ray tracing is high and
that the memory interface to the scene database has to deliver sufficient
bandwidth to the ray tracing units working in parallel. Later in this thesis
it will be shown that it is possible to deliver the required bandwidth using
fairly small caches.
A main advantage of the ray tracing algorithm is that it simulates reality,
by supporting different kinds of lighting effects like reflections, refractions,
shadows and even real-time global illumination [4]. For a human viewer
these effects are very important to understand the three dimensional relation
between the objects of the scene. Here, shadows play an especially important
role.
Indeed rasterization hardware supports some of these effects, but only by
using multi-pass rasterization tricks to fake them. These multi-pass rasterization techniques (to produce shadows for instance) are often non-obvious
and difficult to implement. In contrast, ray tracing offers an extremely simple and intuitive shading model. For instance, it is simple to shoot a ray
from a point in the scene to a light source to check whether it lies in the
shadow of the light source or not.
In particular, the large number of memory accesses (which are more or less
randomly distributed over the scene data) and the expensive computation
made it impossible to create a real-time ray tracing system until recently.
Lately a lot of work has been done to cope with these problems. Taking
advantage of the coherence between neighboring rays to reduce the memory
bandwidth, and using a cluster of processors to provide enough computational
power, made real-time ray tracing possible. Such a software-based real-time
ray tracing system has been developed by the Computer Graphics Lab of
Saarland University [5, 6, 7].
However, these techniques require a lot of costly, albeit standard, hardware.
The SaarCOR project takes a different approach. Instead of using standard PC
hardware for the computation, it is more efficient to create special-purpose
hardware that is optimized for the ray tracing application. Jörg Schmittler
designed such an architecture, which is called SaarCOR (Saarbrücken’s
Coherence Optimized Ray Tracer). This architecture has been fully simulated
with very good results [8].
Up to now SaarCOR has been limited to static scenes and to a standard
k-D tree as acceleration structure. In such a static scene the camera can be
moved around, but the objects themselves cannot be moved. This is a severe
limitation which, for instance, makes it impossible to develop a computer
game for the standard SaarCOR architecture.
In this thesis a ray tracing hardware architecture for dynamic scenes is
presented based on the SaarCOR architecture. As ray tracing heavily relies
on precomputations it seems to be difficult to ray trace dynamic scenes.
Thus a data structure is required that allows as many precomputations as
possible to be done, but also to move objects in the scene around. This
can be achieved by partitioning the scene into movable objects and building a top-level acceleration structure over them. This top-level acceleration
structure needs to be recomputed each time an object has been moved. Each
object itself contains a precomputed bottom-level acceleration structure that
stays static forever. To traverse a ray in the bottom-level acceleration structure of the object, the ray has to be transformed to its local coordinate
system. This requires a transformation unit, which can be used as a kind of
precomputation too, if using a new triangle intersection method, described
in Section 5.
Using the structured scene representation it is possible to share geometry
by placing the same object at several positions. This reduces the size of the
representation of most scenes. A prototype of the hardware architecture has
been implemented on an FPGA, which is the first working special-purpose
real-time ray tracing hardware available today.
Most of the concepts of this thesis can be understood without a detailed
knowledge of FPGAs or ASICs, but a short description will be given here.
An FPGA (field programmable gate array) can be seen as a CLB array (configurable logic block) with some programmable routing resources to
connect the single CLBs. The internal structure of these CLBs differs from
architecture to architecture. In Xilinx FPGAs, the CLBs mainly consist of
some registers and LUTs (look-up tables). LUTs are programmable 4-to-1
function generators that can be used together with the routing resources
to encode any circuit. The circuit in the FPGA can be reconfigured
arbitrarily often. For a detailed description of FPGAs see the book “Field
Programmable Gate Arrays” [9].
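As a rough illustration of what a 4-to-1 function generator means, a LUT can be modelled in software as a 16-entry truth table that is indexed by its four input bits. The sketch below is only a behavioural model; the function name `make_lut4` and the bit ordering of the inputs are assumptions made for this example and do not describe the internals of any particular FPGA.

```python
def make_lut4(truth_table: int):
    """Return a 4-input, 1-output function defined by a 16-bit truth table."""
    def lut(a: int, b: int, c: int, d: int) -> int:
        # The four input bits select one of the 16 table entries.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return (truth_table >> index) & 1
    return lut

# Hypothetical configurations: "programming" the LUT means choosing the table.
and4 = make_lut4(1 << 15)  # only entry 15 (all inputs 1) is set
xor4 = make_lut4(sum(1 << i for i in range(16) if bin(i).count("1") % 2))
```

Reconfiguring the circuit corresponds to loading a different truth table, which is why the same hardware block can realize any Boolean function of four inputs.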
In contrast an ASIC (application specific integrated circuit) consists of
an array of NAND gates. The interconnection between different gates is
done by some extra silicon layers that are added to the chip. Thus a main
difference from FPGAs is that ASICs are in no way reconfigurable. The
advantages of ASICs are their high gate capacity, low price at high
production volumes, and high speed compared to FPGAs. A description of ASIC
design can be found in the book “Application-Specific Integrated Circuits” [10].
At the beginning of this thesis the basics of the ray tracing algorithm
using k-D trees are explained. To achieve dynamics, the standard k-D trees
are extended to 2-level k-D trees, and the transformations needed for the
2-level traversal algorithm are discussed as well as some problems that might
occur. The next Section describes a new triangle intersection method that is
used in the hardware architecture. These Sections form the basis for
understanding the ray tracing hardware architecture for dynamic scenes,
presented in the following Chapter. Then the prototype implementation of the
architecture is described and a detailed analysis of the performance is
given. The last part summarizes this thesis and shows areas of future work.
Chapter 2
Previous Work
Software-based systems are the state of the art in interactive ray tracing.
Several approaches have already been realized on MIMD and SIMD
architectures [11, 12, 13], exploiting the coherence between neighboring rays.
Through parallelization of the algorithm on supercomputers [14, 15, 16, 17, 18, 19]
and, more recently, standard PCs [6, 7], interactive ray tracing has become possible.
Besides these software based ray tracing systems some special purpose
hardware has been developed. As the most costly operation of the ray tracing algorithm is the ray triangle intersection, the first commercially available
ray tracing accelerator performed this operation only [20, 21]. This ray tracing accelerator has no hardware support for the traversal operation, so it
is not able to do ray tracing in real time. In 1999, Pfister et al. published
the VolumePro 500 architecture which is a single-chip real-time volume rendering hardware [22].
A different approach is to map the ray tracing application to a
multiprocessor architecture on a single chip [23], which should be available in
the near future. Purcell has simulated a ray tracer for such an architecture
delivering real-time performance [24]. A kind of multi-processor vector
architecture is present in today's high-end graphics cards too, in the form of
programmable pixel shaders. It has been shown that the ray tracing application
can be mapped to these shaders [25].
Ray tracing of dynamic scenes is a new topic of research. The paper “Distributed Interactive Ray Tracing of Dynamic Scenes” [26] discusses basics
of the 2-level ray tracing algorithm for dynamic scenes used in the hardware
architecture presented in this thesis. Instead of rebuilding the acceleration
structure to achieve dynamics, it is possible to use special algorithms to
update it. For example, using a hierarchical grid as an acceleration
structure, it is possible to update an object’s position in the scene in
constant time [27].
Chapter 3
The Basic Ray Tracing
Algorithm
Ray tracing is a simulation technique for creating realistic images of
3-dimensional scenes. This is done by shooting imaginary rays through a scene
and interpreting the resulting intersections, as described in this Section.
In a real environment light is emitted by some light sources and then
distributed to the scene in a manner consistent with physical laws. If a
camera is positioned in this environment, some light enters it and an image
is projected onto the image plane. The physical theory of this light
distribution is well-known today, but simulating it exactly is difficult,
since the available computational power is strongly limited. Thus in practice
some approximations need to be made.
In contrast to reality, ray tracing goes the opposite way and follows the
light back from the camera to the light sources. This is done by shooting
so called primary rays for each pixel of the image from the camera into the
scene and computing the closest object that is hit by the ray, the hit-object.
This shooting of a ray to determine the hit-object is called ray casting and
the origin of the primary rays is the projection center of the camera. The
point in 3D space where the object is hit is called the hit-point of the ray
(see Figure 3.1).
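The generation of primary rays can be sketched as follows. This is a minimal example under stated assumptions: a pinhole camera whose projection center is the ray origin, looking along the negative z-axis with y pointing up, and a hypothetical `fov_deg` parameter for the vertical field of view. None of these conventions are taken from the thesis.

```python
import math

def primary_ray(px, py, width, height, fov_deg=90.0, origin=(0.0, 0.0, 0.0)):
    """Return (org, dir) for the primary ray through the centre of pixel (px, py).

    Assumed camera: projection centre at `origin`, looking along -z, y up.
    """
    aspect = width / height
    half = math.tan(math.radians(fov_deg) / 2.0)
    # Map the pixel centre to image-plane coordinates at distance 1.
    x = (2.0 * (px + 0.5) / width - 1.0) * half * aspect
    y = (1.0 - 2.0 * (py + 0.5) / height) * half
    length = math.sqrt(x * x + y * y + 1.0)
    return origin, (x / length, y / length, -1.0 / length)

def primary_rays(width, height):
    """One primary ray per pixel, in scanline order."""
    return [primary_ray(px, py, width, height)
            for py in range(height) for px in range(width)]
```

All primary rays share the projection center as their origin; only the direction differs from pixel to pixel.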
After computing the hit-object and hit-point of a primary ray, it is known
which object is visible through the pixel; thus the algorithm performs a kind
of visible surface computation. At this stage a shader corresponding to the
material of the hit-object is called, which has the task of computing the
color of the pixel using the intersection results of the ray with the hit-object.
Figure 3.1: A 2-dimensional example of the ray tracing algorithm. For each
pixel of the image a primary ray is shot into the scene and the closest object
that is hit by the ray is computed.
The shader computes the pixel color based on several material properties of
the hit-object, like the object’s color, surface normal, reflectivity and
transparency, and on scene properties such as the light sources. More
advanced shaders shoot secondary rays to simulate several light effects.
For example, it is possible to shoot light rays from the light sources of the
scene to the hit-point of the ray to compute whether the hit-point lies in
the shadow of a light source or not. Reflections can be computed by using the
surface normal to compute a reflection ray, which determines the geometry that
is seen through the reflective surface.
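How such secondary rays could be set up is sketched below. The helper names are hypothetical, the surface normal is assumed to have unit length, and the reflection direction uses the standard mirror formula r = d - 2 (d · n) n, which is a textbook identity rather than something specified in this Section.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def light_ray(light_pos, hit_point):
    """Ray from the light source towards the hit-point, plus their distance.

    The hit-point lies in shadow if the ray hits some object at a smaller
    distance than the returned one.
    """
    to_hit = sub(hit_point, light_pos)
    dist = math.sqrt(dot(to_hit, to_hit))
    return light_pos, tuple(c / dist for c in to_hit), dist

def reflection_ray(hit_point, incoming_dir, normal):
    """Mirror the incoming direction at a unit normal: r = d - 2 (d . n) n."""
    k = 2.0 * dot(incoming_dir, normal)
    return hit_point, tuple(d - k * n for d, n in zip(incoming_dir, normal))
```

In a practical ray tracer the comparison against the light distance usually needs a small tolerance so that the surface at the hit-point does not shadow itself; that detail is left out here.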
The shading computation in detail is out of the scope of this thesis. For
further information about shading see the book “Fundamentals of Computer
Graphics” [2].
A costly part of the algorithm is the ray casting operation to find the
closest hit-object. To do this efficiently a data structure which subdivides the
space of the scene into subspaces is required. This allows objects to be found
efficiently at a given location. Such a data structure is called an acceleration
structure as it accelerates the ray casting operation. Many acceleration
structures exist; some are recursive and others are flat data structures [28].
In the hardware architecture presented in this thesis only k-D trees are used
as acceleration structure, thus the basics of k-D trees are explained in the
next Section.
3.1 k-D Trees
A k-D tree is an acceleration structure that is typically used for ray tracing
to accelerate the ray casting operation. It subdivides a k-dimensional space
containing some objects recursively and axis aligned into subspaces, and
stores for each of these subspaces the contained geometry. Because ray
tracing is applied to a 3D space, only this case will be discussed here.
The scene subdivision is encoded as a binary tree, the k-D tree. Each
leaf node of this tree specifies one of the subspaces and contains a list of all
objects that lie in the subspace.
Using this recursive data structure it is possible to efficiently find the
closest object hit by a ray. This is done by determining the subspaces
through which the ray traverses. In the order the ray traverses these subspaces, it is intersected with the geometry in each subspace. This walking
through the subspaces is called the traversal operation and it terminates if a
hit-point in the current subspace has been found. This traversal operation
is very efficient, as only the geometry in the subspaces the ray traverses
needs to be used in the intersection calculation. Geometry far away from the
ray will never be touched if the subdivision is fine enough.
Definition 3.1.1. A plane $h$ in $\mathbb{R}^3$ can be defined by an implicit function
$H(x) = n \cdot x - d = 0$, if $n \neq 0$, $n \in \mathbb{R}^3$ and $d \in \mathbb{R}$. We define
$h^+ = \{x \in \mathbb{R}^3 \mid H(x) \geq 0\}$ and $h^- = \{x \in \mathbb{R}^3 \mid H(x) < 0\}$ to be the positive and
negative half-space bounded by $h$, respectively. Let $k \in \{1, 2, 3\}$ be the
so-called splitting axis and $n = e_k$ be the $k$-th unit vector, then we call the
plane $h$ an axis aligned splitting plane and the value $d$ the splitting position.
Definition 3.1.2. A k-D tree $T$ is defined by the following grammar:
$$T = \mathrm{Node}((k, d), T_{left}, T_{right}) \;\mid\; \mathrm{Leaf}(\{Object_1, \dots, Object_n\}), \qquad Object \subset \mathbb{R}^3 \text{ closed}$$
On the one hand, a node of a k-D tree can be a normal Node containing
an axis aligned splitting plane $(k, d)$ and a left and right subtree ($T_{left}$ and
$T_{right}$). On the other hand, it can be a Leaf node containing a set of objects.
This set can be empty if the number of objects $n$ is 0. An object is a closed
subset of $\mathbb{R}^3$. In practice mostly triangles or cubes will be used as objects.
The semantics of the k-D tree defines a subspace $S(T)$ to each node $T$
of a k-D tree. The subspace of the root node is defined as $\mathbb{R}^3$. If $S(T)$ is the
subspace of the node $T = \mathrm{Node}((k, d), T_{left}, T_{right})$ and $h$ the plane defined
by $(k, d)$ then the subspace of the left subtree is $S(T_{left}) = S(T) \cap h^-$ and
the subspace of the right subtree $S(T_{right}) = S(T) \cap h^+$. Figure 3.2 shows
this subdivision scheme of the space.
Figure 3.2: This Figure shows how the space is recursively subdivided by
k-D trees. The large box is the subspace of node T and is split into two
halves by the splitting plane p1 = (1, d). The normal of this splitting plane
is parallel to the x-axis and the plane goes through the point (d, 0, 0).
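The grammar and the subspace semantics can be mirrored in a small data structure. The following is a sketch only: the class and function names are chosen for this example, the splitting axis is 0-based (the text uses 1 to 3), and subspaces are represented by closed boxes, so the distinction between the open half-space h− and the closed half-space h+ is ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]            # (min corner, max corner), possibly infinite

@dataclass
class Leaf:
    objects: List[object]          # may be empty

@dataclass
class Node:
    k: int                         # splitting axis, 0-based in this sketch
    d: float                       # splitting position
    left: "KDTree"                 # subtree for S(T) ∩ h-
    right: "KDTree"                # subtree for S(T) ∩ h+

KDTree = Union[Node, Leaf]

def child_subspaces(box: Box, k: int, d: float) -> Tuple[Box, Box]:
    """Split the box of a node at position d on axis k (assumes d lies inside)."""
    lo, hi = box
    left_hi = tuple(d if i == k else hi[i] for i in range(3))
    right_lo = tuple(d if i == k else lo[i] for i in range(3))
    return (lo, left_hi), (right_lo, hi)
```

Starting from an infinite box for the root node, repeated calls to `child_subspaces` along a path in the tree yield the subspace of any node.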
As the splitting planes in the nodes of the k-D tree are axis aligned, it is
called an axis aligned BSP tree (binary space partitioning tree). It is possible
to use non axis aligned splitting planes too, which leads to general BSP trees
and more complex traversal computations. In the following only
the case of axis aligned splitting planes will be considered.
3.1.1 k-D Tree Creation
The task of the k-D tree creation algorithm is to build a k-D tree for a
scene consisting of several objects. Thus it has to subdivide the space of
the scene recursively into subspaces. It starts with the complete space R3
containing all the geometry of the scene. Then an axis aligned splitting
plane is selected which splits the space into the left and the right subspace
according to the semantics of the k-D tree. For each of the two subspaces the
objects that intersect it are computed. Note that objects can belong to
both subspaces. The subspaces together with the objects intersecting them
are handled recursively by the algorithm. If some termination criterion is
fulfilled, the subdivision of the current subspace is terminated and a leaf
node, containing all objects in it, is created.
This is the main concept for each k-D tree creation algorithm. Different
algorithms mostly differ only in the heuristics that are used to search the
splitting plane and in the termination criteria.
Algorithm 3.1.3 defines the createKDTree function in an abstract
way. It takes a subspace S and a set O of objects and returns a k-D tree. The
subspaces can be represented as simple bounding boxes (that are possibly
infinite) and the set of objects as arrays or lists. For an example of a simple
k-D tree in 2 dimensions see Figure 3.3.
Figure 3.3: Figure (b) shows a k-D tree for the simple 2D scene of Figure
(a). The labels on the inner nodes of the k-D tree tell the splitting plane
and the leaf nodes contain a list of objects.
Algorithm 3.1.3. k-D Tree Creation
function createKDTree (S, O)
begin
    if termination criterion is fulfilled then
        return Leaf(O)
    Select an axis aligned splitting plane h by some criterion.
    S_left = S ∩ h−
    S_right = S ∩ h+
    O_left = {x ∈ O | x ∩ S_left ≠ ∅}
    O_right = {x ∈ O | x ∩ S_right ≠ ∅}
    T_left = createKDTree (S_left, O_left)
    T_right = createKDTree (S_right, O_right)
    return Node(h, T_left, T_right)
end
There are two issues we have not dealt with yet. The first is how to
select the splitting plane and the second is what the termination criterion
looks like. As a simple approach the splitting plane can be selected such that
the largest dimension of the subspace is split exactly in the middle. It can
be shown that this is not very efficient, especially if the objects in the scene
are not equally distributed [3]. As termination criterion a maximal tree depth
in conjunction with a minimal number of objects in the leaves can be used,
for instance.
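The simple strategy described above (split the largest dimension in the middle, stop at a maximal depth or a minimal object count) can be sketched as follows. Several points are assumptions of this example and not part of the thesis: objects are axis-aligned boxes given as (min corner, max corner), the root subspace is a finite bounding box of the scene instead of all of R3, the tree is returned as nested tuples, and the parameter names `max_depth` and `min_objects` are hypothetical.

```python
def overlaps(obj, box):
    """True if the closed boxes obj and box share at least one point."""
    (olo, ohi), (blo, bhi) = obj, box
    return all(olo[i] <= bhi[i] and ohi[i] >= blo[i] for i in range(3))

def create_kd_tree(space, objects, depth=0, max_depth=16, min_objects=2):
    """Return ("leaf", objects) or ("node", (k, d), left, right)."""
    # Termination: maximal depth reached or few enough objects left.
    if depth >= max_depth or len(objects) <= min_objects:
        return ("leaf", objects)
    lo, hi = space
    k = max(range(3), key=lambda i: hi[i] - lo[i])   # largest dimension
    d = 0.5 * (lo[k] + hi[k])                        # split exactly in the middle
    left_space = (lo, tuple(d if i == k else hi[i] for i in range(3)))
    right_space = (tuple(d if i == k else lo[i] for i in range(3)), hi)
    # Objects that straddle the plane end up in both children.
    left_objs = [o for o in objects if overlaps(o, left_space)]
    right_objs = [o for o in objects if overlaps(o, right_space)]
    return ("node", (k, d),
            create_kd_tree(left_space, left_objs, depth + 1, max_depth, min_objects),
            create_kd_tree(right_space, right_objs, depth + 1, max_depth, min_objects))
```

Because the overlap test treats both child boxes as closed, an object that merely touches the splitting plane is conservatively assigned to both sides.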
A different, more advanced approach is to search for the optimal splitting
plane with respect to a cost function. Such a function was proposed by Havran
[3] and can be used as a termination criterion too, by comparing the cost of
a split with the cost of no split.
3.1.2 Recursive k-D Tree Traversal
The reason why we introduced k-D trees was to optimize the ray casting
operation, which means to compute the closest hit-point of a ray with the
scene. The k-D tree subdivides the scene into subspaces. Thus the sequence
of subspaces a ray traverses can be determined to intersect the ray with the
geometry stored in them. The algorithm that performs this enumeration
of the subspaces is called the k-D tree traversal algorithm. In conjunction
with an object intersection algorithm, the closest hit-point of the ray with
the scene can be computed.
Definition 3.1.4. A ray $R$ is represented by a tuple $R = (org, dir) \in (\mathbb{R}^3)^2$.
The first component $org$ of the tuple is a point of $\mathbb{R}^3$ and represents the origin
of the ray. The second component $dir$ is a vector of $\mathbb{R}^3$ and specifies the direction
of the ray. The points on the ray can be computed by $R(x) = org + x \cdot dir$
if $0 \leq x$.
Definition 3.1.5. Such a ray $R$ hits an object $obj$ if there is a $\lambda \in [0, +\infty[$
such that $R(\lambda) \in obj$. A minimal $\lambda$ with this property is called the
hit-distance of the ray to the object and $R(\lambda)$ the hit-point. Because an axis
aligned splitting plane is a closed subset of $\mathbb{R}^3$, we can define the terms
hit-distance and hit-point the same way for rays and splitting planes.
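As a concrete instance of these definitions, the hit-distance of a ray with a closed axis-aligned box can be computed with the common slab test. The box as object type and the function names are assumptions made for this sketch; the thesis itself uses triangles as primitives.

```python
import math

def point_on_ray(org, dir, x):
    """R(x) = org + x * dir, defined for x >= 0."""
    assert x >= 0
    return tuple(o + x * d for o, d in zip(org, dir))

def hit_distance_box(org, dir, box):
    """Smallest lam >= 0 with R(lam) inside the closed box (lo, hi), or None."""
    lo, hi = box
    t_in, t_out = 0.0, math.inf    # parameter interval inside all slabs so far
    for k in range(3):
        if dir[k] == 0.0:
            if not (lo[k] <= org[k] <= hi[k]):
                return None        # parallel to this slab and outside of it
            continue
        t0 = (lo[k] - org[k]) / dir[k]
        t1 = (hi[k] - org[k]) / dir[k]
        t_in = max(t_in, min(t0, t1))
        t_out = min(t_out, max(t0, t1))
    return t_in if t_in <= t_out else None
```

Because the box is a closed set, a ray that starts inside it has hit-distance 0, which matches the requirement that the minimal λ lies in [0, +∞[.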
Figure 3.4: The ray R of Figure (a) is traversed according to Figure (b)
through the k-D tree.
Since a k-D tree is a recursive data structure, the k-D tree traversal
algorithm is a completely recursive algorithm as well. It works recursively
on the nodes of the k-D tree and makes a traversal decision at each node.
The traversal decision determines whether the ray traverses the subspace of
the left and/or right subtree and the order it traverses them. Using this
traversal decision the algorithm follows the ray through the k-D tree data
structure by working on the subtree that is traversed first and putting the
other one onto the stack. If a leaf node is reached the intersection algorithm
is called to intersect the ray with each object stored in the leaf node and the
closest hit-point is determined. If this hit-point lies in the subspace of the
leaf node a valid hit-point is found and the ray is called a terminated ray.
In such a case or if the stack is empty the algorithm terminates. Otherwise,
it continues by obtaining the next node from the stack.
To compute the traversal decision the algorithm needs the near- and
far-value, which are the distances to the entry-point and exit-point of the
ray with the subspace of the current node. Using the near- and far-value
together with the distance $d$ to the splitting plane of the current node, the
traversal decision can be computed.
If $\delta$ is the splitting position and $k$ the splitting axis, then the intersection
distance $d$ of the ray $R = (org, dir)$ to the splitting plane can be computed
according to the formula of Figure 3.5.
$$d = \frac{\delta - org_k}{dir_k}$$
Figure 3.5: Hit-Distance Computation
To compute the traversal order the algorithm determines the half-space
of the splitting plane that is closer to the origin of the ray. If $org_k \leq \delta$ this
is the negative half-space (corresponding to the left subtree) or otherwise
the positive half-space. The closer subspace is traversed first, if the ray
intersects it. The farther one follows later. In the first case the so-called
traversal order is from left to right, otherwise from right to left.
Algorithm 3.1.6. k-D Tree Traversal
function traverseKDTree (R, T)
begin
    λ = ∞
    near = −∞, far = ∞
    while true
    begin
        while T is of Node((k, split), T_left, T_right)
        begin
            d = (split − R.org_k) / R.dir_k
            if R.org_k ≤ split then
                T_near = T_left, T_far = T_right
            else
                T_far = T_left, T_near = T_right
            go_near = d ≥ near ∨ d ≤ 0
            go_far = d ≤ far ∧ d ≥ 0
            if go_near ∧ go_far then
                push far and T_far to the stack
                T = T_near, far = d
            else if go_near ∧ not go_far then
                T = T_near
            else if not go_near ∧ go_far then
                T = T_far
        end
        T is of Leaf({Object_1, ..., Object_n})
        compute closest hit-distance λ for {Object_1, ..., Object_n}
        if λ ≤ far then return λ
        if stack is empty then return λ
        near = far
        pop far and T from stack
    end
end
Figure 3.6: Traversal Decisions
Whether the ray really traverses through the nearer and/or farther side
is computed by the following formulas, which are illustrated in Figure 3.6.
go_near = d ≥ near ∨ d ≤ 0
go_far = d ≤ far ∧ d ≥ 0
One important invariant of the algorithm is that the near and far values
are exactly the distances to the entry and exit point of the ray with
the subspace of the current node. This property is essential and has to be
maintained throughout the complete algorithm. Thus the near and far values
have to be updated at each traversal step of the algorithm. If only one of
the subtrees is traversed by the ray, then the near and far values stay the
same (see Figure 3.6), but if both children have to be traversed, the near
and far values need to be updated. As the algorithm first traverses into the
closer child node, the near value can be kept but the far value has
to be set to the hit distance d. To restore the far value later, it is pushed
onto the far-stack and the farther node onto the node-stack. If later a leaf
node is reached and no hit has been found in it, a node is popped from the
node-stack and the near and far values are updated by setting near = far
and taking the far value from the far-stack as the new far value.
Using the near and far values it is possible to determine whether there
is a valid hit-point, which is necessary to terminate the ray. A valid hit-point
is found if a leaf is encountered and the hit-distance to the current closest
hit-point is smaller than the current far value, since then the found hit-point
lies in (or before) the leaf node's subspace. Alternatively the ray can
be terminated at the next traversal step by testing if the closest hit-distance
is smaller than the current near value.
Figure 3.6 shows the most important situations that occur in the traversal
algorithm. Besides these cases there are some degenerate ones that have
to be handled carefully. These cases occur if the ray does not have a
well-defined single hit-point with the splitting plane. If so, the hit-distance
cannot be computed and the traversal decision formulas cannot be applied.
This can happen if the ray is parallel to the splitting plane or if it lies
completely in it. The later hardware approach solves this problem by using
a normalized floating point representation that cannot represent the value
zero. Thus each ray has a hit-point with each possible splitting plane.
3.1.3 Packet k-D Tree Traversal
A drawback of ray tracing is the large memory bandwidth that is needed for
the computation. Reducing this bandwidth is possible by exploiting the ray
coherence between rays corresponding to neighboring pixels on the screen.
This coherence derives from the fact that rays traversing through a similar
region of the 3D space, traverse similar nodes of the k-D tree and intersect
many of the same objects.
It is possible to take advantage of this ray coherence by traversing a
packet of some neighboring rays in parallel as if they were one single ray.
This strategy reduces the required memory bandwidth, as data is fetched for
a complete packet of rays instead of a single ray. Furthermore, when
implementing such a packet traversal algorithm in software, the SIMD architectures
available in today's standard PCs can be taken advantage of. Because these
SIMD architectures allow 4 computations to be done in parallel, packets of
4 rays can be handled efficiently using these special instructions.
The packet traversal algorithm is closely related to the standard traversal
algorithm, but instead of computing a traversal decision for a single ray it
computes a similar packet traversal decision for a packet of rays. In the
computation of this packet traversal decision, only so called active rays of
the packet are involved. A ray of a packet is active in the current node if it
is not terminated and if it intersects with the subspace of the node. Because
this active value is required for each ray in the packet an active vector for
the packet is needed. Although this active vector needs to be recomputed
at each traversal step this is quite simple since a ray is active in the left
child of a node if it is active in the current node and if it wants to traverse
through the left child. The same holds for the right child.
If one of the active rays of the packet wants to traverse through the left
child, then the packet traverses through the left child as well. The same
holds for the right child. The traversal order for the packet is inherited from
the active rays of the packet that traverse through both children, if it is the
same for each of these rays. The packet is terminated if each of its rays is
terminated.
If a pop operation is done, the active vector has to be updated, and
therefore needs to be pushed onto the stack together with a node. A further
situation that might occur is that a node is reached and ray R1 traverses
through both children and R2 through the farther child only. Here, the
farther node is pushed onto the stack and R1 traverses through the nearer
child. Later a pop operation obtains the farther node from the stack, and
both rays are active in this node. However, ray R1 needs to update its
near and far values, as it traversed the nearer and the farther child, unlike R2.
Thus a kind of both vector needs to be pushed onto the stack as well, indicating
whether a ray wants to traverse through both children, so that the near and far
values can be updated correctly.
Figure 3.7: The packet is traversed from left to right, as the rays R1 and
R2 traverse from left to right. Thus the right node is pushed onto the stack
and the operation continues in the left child. The rays R1 and R2 are active
in both children, but R3 only in the right one.
A problem occurs if the traversal order is not the same for each active
ray of the packet that wants to traverse both children. Such a packet is
called an invalid packet. It is invalid since no valid packet traversal decision
can be computed. No matter which child is handled first there is always
a ray in the packet that wants to handle the other one first. If the packet
terminates in the first traversed child, a possible closer hit-point in the other
child is forgotten (see Figure 3.8).
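The packet traversal decision described above can be sketched for a single node as follows. This is only an illustration under assumptions: the function name `packet_decision`, the list-based ray state and the string labels for the traversal order are invented for this example, and all direction components along the splitting axis are assumed to be non-zero.

```python
def packet_decision(orgs_k, dirs_k, nears, fars, active, split):
    """Compute the packet traversal decision for one k-D tree node.

    orgs_k and dirs_k hold the ray origins and directions along the splitting
    axis. Returns (go_left, go_right, order, active_left, active_right), where
    order is "left-first", "right-first", None (no ray needs both children)
    or "invalid" (rays that need both children disagree on the order).
    """
    go_left = go_right = False
    active_left, active_right = [], []
    orders = set()
    for o, dr, near, far, act in zip(orgs_k, dirs_k, nears, fars, active):
        if not act:
            # Inactive rays take no part in the decision.
            active_left.append(False)
            active_right.append(False)
            continue
        d = (split - o) / dr
        go_near = d >= near or d <= 0
        go_far = d <= far and d >= 0
        left_is_near = o <= split
        ray_left = go_near if left_is_near else go_far
        ray_right = go_far if left_is_near else go_near
        # A ray is active in a child if it is active here and wants to enter it.
        active_left.append(ray_left)
        active_right.append(ray_right)
        go_left = go_left or ray_left
        go_right = go_right or ray_right
        if ray_left and ray_right:
            orders.add("left-first" if left_is_near else "right-first")
    if len(orders) > 1:
        order = "invalid"
    elif orders:
        order = next(iter(orders))
    else:
        order = None
    return go_left, go_right, order, active_left, active_right
```

For two rays that enter the node from opposite sides and both cross the splitting plane, the function reports an invalid packet, which corresponds to the situation of Figure 3.8.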
In practice this case occurs very rarely, and it can be shown that it
does not happen if there are no two rays of the packet that cross in at least
one of the 3 projections onto the xy-, yz- or xz-plane. This never occurs for
primary rays and light rays, since rays with the same origin never cross.
Therefore the algorithm can handle these types of packets correctly.
Figure 3.8: The Figure shows a situation in which no packet traversal decision exists. No matter which child is handled first, either ray R1 or ray R2
is not intersected with triangle tri3 .
If no such packet traversal decision exists this situation can be handled as
a kind of special case. If a node for which no packet traversal decision exists
is reached, the left child is traversed first. The right child is remembered
and traversed later by treating it as a special case.
A different possibility is to split the packet before the traversal into sub
packets, in which the rays do not cross as explained above. To split the
packet this way, only the signs of the three components of the ray directions
must be compared. If there are two rays whose direction sign is different
in one dimension then the rays cross and have to be put in different sub
packets.
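The sign comparison described above amounts to grouping the rays of a packet by the sign pattern of their directions. The following sketch shows one way to do this; the function name is hypothetical, and treating a zero component as non-negative is an assumption of this example.

```python
from collections import defaultdict


def split_packet_by_direction_signs(directions):
    """Group ray indices so that all rays of a sub packet share the same
    direction signs in x, y and z."""
    groups = defaultdict(list)
    for index, (dx, dy, dz) in enumerate(directions):
        key = (dx < 0, dy < 0, dz < 0)   # sign pattern of the direction
        groups[key].append(index)
    return list(groups.values())
```

At most 8 sub packets can result, one per sign pattern.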
One of these solutions needs only to be applied if a shader produces
invalid packets. This for instance can happen if a packet is reflected by a
curved surface. However, if only primary rays and light rays are allowed,
the problem never can occur. In the hardware architecture to be described
later only primary rays are used and the problem of crossing rays can be
safely ignored.
Chapter 4
The Dynamic Ray Tracing Algorithm
In this Chapter a ray tracing algorithm for dynamic scenes is presented that
allows the movement of a huge number of triangles in the scene.
At first sight, efficient ray tracing of dynamic scenes does not
seem to be possible since fast ray tracing relies so much on precomputations.
In particular, the precomputed acceleration structure is a problem since it
has to be rebuilt or updated whenever the geometry of the scene changes.
For a dynamic real-time ray tracing system this update must work even
if the complete scene consists of several million triangles. Here standard
acceleration structures cannot be used, since the construction of a k-D tree,
for instance, takes at least linear time in the number of triangles in the scene (each
triangle has to be visited at least once). It is possible to build acceleration
structures that allow updating the position of triangles in constant time [27],
but several million triangles cannot be moved around this way.
There exists a simple solution to this problem if the scene is restricted to
some kind of structured motion [26]. The case of unstructured motion, in which
triangles are moved around arbitrarily, is not covered in this thesis. In
contrast, structured motion means that groups of triangles are
moved around as if each group were one single object. For instance, in a scene
consisting of a table and a chair, normally all triangles of the chair or of the table
are moved around at once.
For such structured motion, the structure of the motion can be exploited
by packing the triangles into movable objects. These objects internally stay
static, thus a local bottom-level acceleration structure and a local bounding
volume can be precomputed for them. The local bounding volume contains
all the geometry of the object.
The object can be positioned, rotated and scaled in the scene by an
affine transformation. Such a positioned object is called an object instance
and consists of the affine transformation used and a reference to the object.
This concept of having some objects and one or more object instances to
each object leads to a kind of geometry sharing, as an object needs to be
saved only once.
To traverse rays efficiently through the object instances a dynamic top-level
acceleration structure must be built over them. Only this top-level
acceleration structure needs to be updated if the position of an object
instance has changed. This is possible as long as the number of objects in the
scene stays small.
As there is a dynamic top-level acceleration structure over the object
instances and a bottom-level acceleration structure in each object, this is a
kind of 2-level acceleration structure (see Figure 4.1).
In the example of the chair and table, two objects have to be modeled:
one chair and one table. Internally these two objects stay static over time, but
they can be instantiated at several positions in the scene. Thus the top-level
acceleration structure is quite simple (it consists of few objects) but
the objects themselves can be fairly complex.
Figure 4.1: The Figure shows a dynamic top-level acceleration structure
over 4 object instances i1 , . . . , i4 of 3 objects o1 , o2 , o3 . The objects consist
of their static bottom level acceleration structure.
The traversal algorithm for 2-level acceleration structures first traverses
through the top-level acceleration structure until an object instance needs to
be intersected. This is done by transforming the ray to the local coordinate
system of the object and traversing through the local acceleration structure
to find the hit-triangle in the object. The transformation of the ray to the
local coordinate system is necessary as the acceleration structure of the
object is only valid in the coordinate system in which it has been created. Thus
the positioning of the object instance needs to be reversed by transforming
the ray with the inverse of the transformation that was used to position
the object.
An important property of the concept is that the internal geometry of the
object is hidden from the rest of the world. Thus from outside the object’s
geometry is only represented by its local bounding volume, which needs
to be as accurate as possible to avoid unnecessary ray object intersections,
which are normally very costly.
Figure 4.2: Figure (a) shows a simple scene consisting of two instances of
the same object. The drawn ray hits the left chair thus it is transformed to
its local coordinate system, as can be seen in Figure (b). There the splitting
planes are again axis aligned so that the traversal can be continued in the
object.
The concepts of the dynamic ray tracing algorithm do not depend on
a special acceleration structure or kind of local bounding volume, but in the
following only k-D trees and axis aligned bounding volumes will be used.
Furthermore, no update strategy for the top-level k-D tree will be used; it is
simply rebuilt each time the object positions have changed.
The following Sections describe some details of the dynamic ray tracing
algorithm. Some special properties of the top-level k-D tree creation will be
discussed as well as problems that might occur using local bounding volumes.
As affine transformations are used to position objects in the scene, the way
a ray is transformed under an affine transformation needs to be analysed.
Furthermore we show that the hit-distance is maintained under an arbitrary
affine transformation which dramatically simplifies an implementation of the
algorithm. As most shading models need the normal of the geometry in the
world coordinate system, normal transformation is also discussed.
4.1 Top-Level k-D Tree Creation
The basic k-D tree creation algorithm has been described in Section 3.1.1.
This algorithm can be applied in the same way to compute a top-level k-D
tree for a set of object instances by using the transformed local bounding
volume of the object instances as their simplified geometry. Because this
transformed bounding volume is no longer axis aligned, determining whether it
intersects with a subspace is very costly to compute.
As the top-level acceleration structure usually has to be rebuilt
for each frame, some optimization needs to be done to speed up the top-level
k-D tree construction. This is done by computing the smallest axis
aligned bounding box that encloses the transformed bounding volume. This box
is called the instance bounding volume and is used as the geometry of the
object instance in the k-D tree creation algorithm (see Figure 4.3).
Computing the intersection of the axis aligned instance bounding volume and
the subspace (which can also be represented as a possibly infinite axis aligned
bounding box) is trivial.
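The instance bounding volume can be computed by transforming the 8 corners of the local bounding box and taking the component-wise minimum and maximum. The sketch below assumes that the local bounding volume is an axis aligned box given by `box_min` and `box_max` and that the instance transformation is given as a 3×3 matrix `A` and a translation `B`; all names are chosen for this example only.

```python
from itertools import product


def instance_bounding_box(box_min, box_max, A, B):
    """Smallest axis aligned box enclosing the box [box_min, box_max]
    after applying the affine transformation x -> A x + B."""
    # All 8 corners: every combination of min/max per axis.
    corners = product(*zip(box_min, box_max))
    transformed = [
        tuple(sum(A[i][j] * c[j] for j in range(3)) + B[i] for i in range(3))
        for c in corners
    ]
    lo = tuple(min(p[i] for p in transformed) for i in range(3))
    hi = tuple(max(p[i] for p in transformed) for i in range(3))
    return lo, hi
```

Only the 8 corners are touched, so the cost is independent of the number of triangles in the object.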
Figure 4.3: Figure (a) shows an object with its bounding box. In Figure (b)
this object is instantiated using a rotation. The estimated bounding box
for the object instance is drawn dotted. Figure (c) shows the best possible
bounding box estimation for the object instance if the exact geometry of the
object is used in the estimation.
This simplification has some disadvantages since the axis aligned instance
bounding volume is not optimal (see Figure 4.3). Although a best estimation
for the axis aligned instance bounding volume exists, it is not a good idea
to compute it, because then the internal structure of the object would have
to be involved in the computation, which might be too costly.
What can be done is to search for a better representation of the local
bounding volume of an object. Instead of an axis aligned box an ellipsoid
can be used which often is a better approximation. Such an elliptic bounding
volume of an object can be computed in O(n) [29]. A different optimization
would be to rotate the object in such a way that its initial bounding box
fits as well as possible. A situation like in the left most image of Figure 4.3
is in fact the worst case.
4.2 Bounding Box Clipping
Intersections with object instances are usually very expensive, as each one
requires a ray transformation and some traversal steps in the object. One
possibility to avoid and optimize ray object intersections is to perform a kind
of bounding box clipping on the instance bounding volume in the top-level
k-D tree and on the local bounding volume of the object at the beginning
of the bottom-level k-D tree.
Figure 4.4: Figure (a) shows a 2 dimensional rectangle with its clipping
planes. The corresponding clipping tree is shown in Figure (b). Figure (c)
shows the clipping tree for a box in 3 dimensions.
Using traversal steps this bounding box clipping has the task of determining if the ray intersects with an axis aligned bounding box or not. This can
be done by using 6 clipping planes that exactly correspond to the bounding
planes of the axis aligned bounding box (see Figure 4.4).
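The effect of clipping against the 6 bounding planes can be illustrated with a direct computation. Note that this sketch does not use traversal steps over a clipping tree as described above; instead it uses the common slab test, which clips the ray interval against each pair of parallel planes and arrives at the same hit or miss decision. The function name and the assumption that no direction component is zero belong to this example only.

```python
import math


def clip_to_box(org, direction, box_min, box_max):
    """Clip the ray against an axis aligned box.

    Returns the (near, far) interval of the ray inside the box,
    or None if the ray misses the box."""
    near, far = 0.0, math.inf
    for k in range(3):
        # Distances to the two clipping planes of axis k
        # (assumes direction[k] != 0).
        d0 = (box_min[k] - org[k]) / direction[k]
        d1 = (box_max[k] - org[k]) / direction[k]
        near = max(near, min(d0, d1))
        far = min(far, max(d0, d1))
    return (near, far) if near <= far else None
```

The returned interval can serve as the initial near and far values for the traversal inside the box.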
The bounding box clipping to the instance bounding volume in the top-level
k-D tree guarantees that the bounding box of the object's instance is
really hit if a leaf node containing this object is encountered.
The bounding box clipping at the beginning of the bottom-level traversal
is useful too, as the local bounding box available there is much more accurate than the bounding box of the instance. Furthermore this bottom-level
bounding box clipping should be performed since many unnecessary traversal steps can be avoided. This is due to the fact that otherwise the infinitely
large empty space around the object is not handled optimally as the clipping planes at the border of the object reach to infinity. This causes many
traversal steps if the ray does not hit the object and traverses to infinity
(see Figure 4.5).
Figure 4.5: Figure (a) shows a chair without bounding box clipping, whose
clipping planes reach to infinity. Here the drawn ray would traverse through
many subspaces of the acceleration structure. In Figure (b) some bold extra
clipping planes clip against the bounding box of the chair. Here the ray
traverses only 2 subspaces of the acceleration structure.
For the same reason it is better to perform a kind of scene bounding
clipping at the beginning of the top-level acceleration structure; otherwise
ray losses (that is, rays that traverse to infinity and produce no hit) will
be costly. Only if no ray losses can occur in the scene should this scene
bounding clipping be omitted.
4.3 Overlapping Objects
Overlapping objects play a crucial role in 2-level k-D trees, since in the
overlapping area each of the objects needs to be intersected. Consider a
scene consisting of n objects that overlap completely. A ray that intersects
this region in space needs to traverse through each of the n objects. For
such worst case scenes, the dynamic ray tracing algorithm scales linearly in
the number of objects. Thus overlapping of objects should be avoided as
far as possible when modelling a scene.
If two objects overlap only slightly it is usually best to partition the scene
in such a way that the area filled by both objects is separated by the clipping
planes (see Figure 4.6). Thus both objects need to be intersected only in the
overlapping area. The overlapping area cannot be handled more efficiently,
since either of the two objects could generate the closer hit-point, which is not
known in advance.
Figure 4.6: Figure (a) shows two object instances that overlap a bit. By
the clipping planes h1 , . . . , h4 , the overlapping area is separated. The corresponding k-D tree is shown in Figure (b).
Much more critical is the case where many objects lie inside
another object, as in Figure 4.7. If the standard algorithm to create an
acceleration structure is used, then the large object o1 (which contains the
other ones) is in each leaf node of the tree. This is a problem, as during
traversal the algorithm intersects with object o1 each time a leaf node is
encountered, although one intersection with it would be sufficient. This problem is
called the room problem, as it typically occurs if a room is modeled with
some objects inside.
The resulting k-D tree for such a scene can be seen as a degenerate case
of the space subdivision, because after each subdivision the object o1 is in
both subspaces.
Figure 4.7: Figure (a) shows a large object o1 containing 3 other objects.
The corresponding k-D tree in Figure (b) has object o1 in each leaf node.
4.3.1 Hierarchical k-D Trees
Several possible solutions to the room problem exist. One possibility is to
allow objects to be in k-D tree nodes too and not only in the leaves. This
concept is called hierarchical k-D trees as the hierarchy of the objects is
encoded to the k-D tree.
Figure 4.8: Figure (a) shows the same scene as in Figure 4.7. The corresponding hierarchical k-D tree can be seen in Figure (b). The difference to a
normal k-D tree is that object o1 is in the inner object list of the root node
of the hierarchical k-D tree.
Figure 4.8 shows a hierarchical k-D tree. Each of its nodes has a set of
so-called inner objects, which are intersected if this node is handled during
traversal. The structure of the k-D tree in Figure 4.8 forces the traversal
algorithm to intersect object o1 exactly once.
Note that the size of the hierarchical k-D tree is reduced compared to
the previous version, as the leaves are smaller. This is a general property of the
concept: whenever an object is in all or almost all leaf nodes reachable
from a node N, it is better to put the object into the inner object list
of node N, which reduces the size of the tree.
4.3.2 Mailboxing
A different solution to the problem is known as mailboxing, which is a kind
of object intersection cache. A small cache stores the objects the ray has
already been intersected with. If an object needs to be intersected by the
traversal algorithm, the mailbox system looks it up in the cache. If the ray has
already been intersected with this object, no further intersection is done.
Otherwise the object is intersected and added to the cache.
There are several possible strategies to handle the cache. The most
popular is to save the last n objects intersected with. Another would be to
use a hashing function to map the objects to slots.
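The strategy of remembering the last n intersected objects can be sketched as a small per-ray cache. The class name `Mailbox`, the method name and the use of object identifiers are assumptions of this example, not part of the architecture described in this thesis.

```python
from collections import deque


class Mailbox:
    """Per-ray cache of the last `size` objects the ray was intersected with."""

    def __init__(self, size=8):
        # A bounded deque drops the oldest entry automatically when full.
        self.recent = deque(maxlen=size)

    def should_intersect(self, object_id):
        """Return False if the object is cached; otherwise record it and
        return True so that the caller performs the intersection."""
        if object_id in self.recent:
            return False
        self.recent.append(object_id)
        return True
```

Because the cache is small, an object can be evicted and intersected again later. This only costs time and does not affect correctness.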
The mailboxing approach has been shown to be more efficient than using
hierarchical k-D trees. The reason is that hierarchical k-D trees alone only
solve the special room problem. However there are many more situations
where an object is intersected more than once since each object is mostly in
several leaf nodes.
4.3.3 Multiple Scenes
In some cases it is sufficient to use a much simpler solution to the problem.
Imagine a level of a standard shooting game, where there is usually a large main
scene, modelled as a single object, and perhaps some dynamic objects. The
main scene object is an object containing a lot of other objects, which is a
problem, as described earlier. Instead of putting the main scene object into
the root node of a k-D tree (as the hierarchical k-D tree concept would
do), first the main scene is traversed and then the other geometry. If there
is only one large object containing many other ones, this concept is nearly
equivalent to the hierarchical k-D tree concept but simpler.
In the following table some simulation results for the conference scene at
a resolution of 1024x768 are listed. The first row shows a simulation without
any of the optimizations, followed by a simulation with hierarchical k-D trees.
Mailboxing is simulated such that the last 8 objects are saved, and in the last
simulation the main scene object (room) of the conference scene is traversed before
the objects (chairs). The table lists the number of traversal operations
(Trav-Ops), object intersection operations (Obj-Int-Ops) and triangle
intersection operations (Triangle-Int-Ops).
Optimization             Trav-Ops   Obj-Int-Ops   Triangle-Int-Ops
None                        295.4           6.4               57.6
Hierarchical k-D tree        70.3           2.0               10.9
Mailboxing                   63.2           1.6               10.3
Multiple Scenes              71.5           2.1               11.3
Table 4.1: Millions of operations for various strategies
It can be seen that mailboxing is the best of the three optimizations;
thus the later hardware architecture implements this strategy.
4.4 Ray Transformation
If an object instance is hit during traversal, the ray is first transformed into
the local coordinate system of that object. This is done by applying an
affine transformation to the ray. In this Section we show how a ray has to
be transformed using such an affine transformation.
The affine transformation is given by f(v) = A v + B with A ∈ Mat_R(3 × 3)
and B ∈ R^3 and maps points of R^3 to points of R^3. The ray is given
by a tuple R = (org, dir) ∈ (R^3)^2. The origin of the ray can easily be
transformed by plugging it into v, as it is a point. The direction of the ray
represents a vector, not a point, thus it has to be transformed in a different
way. As vectors represent directions, this property has to be maintained by
the transformation. Assume there are two points X and Y given, then there
is a vector V = Y − X connecting X to Y. The transformed vector f(V)
has to fulfill the equation:
\[ f(V) = f(Y) - f(X) = A Y + B - (A X + B) = A (Y - X) = A V \]
Thus the transformation of a complete ray looks like:
\[ f(R) = (f(org), f(dir)) = (A \cdot org + B,\; A \cdot dir) \]
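The difference between transforming the origin (a point) and the direction (a vector) can be made explicit in a few lines. The helper names below are chosen for this illustration; matrices are represented as tuples of rows.

```python
def mat_vec(A, v):
    """Multiply the 3x3 matrix A (tuple of rows) with the vector v."""
    return tuple(sum(A[i][j] * v[j] for j in range(3)) for i in range(3))


def transform_point(A, B, p):
    """Points are affected by the linear part and the translation."""
    return tuple(x + b for x, b in zip(mat_vec(A, p), B))


def transform_vector(A, v):
    """Vectors are affected by the linear part only."""
    return mat_vec(A, v)


def transform_ray(A, B, org, direction):
    """f(R) = (A * org + B, A * dir)."""
    return transform_point(A, B, org), transform_vector(A, direction)
```

For a uniform scaling by 2 followed by a translation by (1, 0, 0), the ray with origin (1, 1, 1) and direction (0, 0, 1) is mapped to the origin (3, 2, 2) and the direction (0, 0, 2).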
4.5 Hit-Distance Transformation
Some hit-point information needs to be computed during the traversal algorithm: the hit-point with the splitting plane and the hit-point with the
scene.
One possibility would be to save the hit-point as a real point of R^3, but
this has the disadvantage that it has to be transformed back to the world
coordinate system if the hit-point lies in an instantiated object. A much
better way is to store a hit-point with a ray R = (org, dir) indirectly as
a λ-value or hit-distance such that the real hit-point H ∈ R^3 fulfills the
following equation:
\[ H = R(\lambda) = org + \lambda \cdot dir \]
On the one hand this hit-distance can be used to compute a traversal
decision (see Section 3.1.2) and on the other hand no back transformation
of the hit-distance is required, as the following equations show. Let f
be an affine transformation f(x) = A · x + B, then the following holds:
\[
\begin{aligned}
f(H) &= f(org + \lambda \cdot dir) \\
     &= A \cdot (org + \lambda \cdot dir) + B \\
     &= (A \cdot org + B) + \lambda \cdot A \cdot dir \\
     &= f(org) + \lambda \cdot f(dir) \\
\Longrightarrow \quad org + \lambda \cdot dir &= f^{-1}(f(org) + \lambda \cdot f(dir))
\end{aligned}
\]
This means that the same λ-value can be used to represent the hit-point
in both coordinate systems. Computing the hit-point in the object
and transforming it back to the world coordinate system is the same as
using the same λ to compute the hit-point in the world coordinate system.
Thus the value λ is in some sense invariant under the application of affine
transformations.
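The invariance of λ can be checked numerically. The following sketch uses an arbitrarily chosen transformation and ray (all values are made up for this example) and compares the transformed hit-point with the hit-point computed from the transformed ray using the same λ.

```python
def affine(A, B, p):
    """Apply x -> A x + B to a point."""
    return tuple(sum(A[i][j] * p[j] for j in range(3)) + B[i] for i in range(3))


def linear(A, v):
    """Apply only the linear part A to a vector."""
    return tuple(sum(A[i][j] * v[j] for j in range(3)) for i in range(3))


def point_on_ray(org, direction, lam):
    """R(lam) = org + lam * dir."""
    return tuple(o + lam * d for o, d in zip(org, direction))


A = ((2, 0, 0), (0, 1, 1), (0, 0, 3))
B = (5, -1, 2)
org, direction, lam = (1, 2, 3), (1, 0, -1), 4

# Transform the hit-point itself ...
via_point = affine(A, B, point_on_ray(org, direction, lam))
# ... or evaluate the transformed ray at the same lam.
via_ray = point_on_ray(affine(A, B, org), linear(A, direction), lam)
```

Both computations yield the point (15, 0, −1), so no back transformation of the hit-distance is needed.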
With this background it can be explained why only affine transformations
are used in the 2-level k-D tree algorithm. The relevant point is that,
when intersecting with an object instance, it is not the object in the instance
that is transformed, but the ray itself. If the object's geometry were transformed
(which is too costly) it would be possible to use an arbitrary transformation.
But transforming the ray has to result in a ray again. Affine transformations
fulfill this property and map rays to rays, as the above equations show.
4.6 Normal Transformation
Most shading models (like Phong shading for example) need the normal of
the geometry at the hit-point to approximate the surface lighting behavior.
However, this normal is needed in the world coordinate system, but normals
are present only in the local coordinate system of the hit-object. Therefore,
the shader has to transform these normals back to the world coordinate
system using the inverse of the transformation that was used to position
the object. Thus we need to analyse how a normal is transformed under
an arbitrary affine transformation f(x) = Ax + B. As for vectors, this
transformation has to be applied in a special way to preserve the normal
property. It is easy to see that affine transformations map tangents to
tangents. This fact will be used to derive a transformation for normals.
Figure 4.9: Figure (a) shows a box and the normal n of the right side. If
the box is transformed under an affine transformation like in Figure (b), the
correctly transformed normal nf is different from An.
The tangent in the source space is called t and the transformed one
in the destination space t_f, which is equal to A t as tangents are vectors.
Analogously the normal in the source space is called n and the searched one
in the destination space n_f. As n is a normal, n_f is in general not the same
as A n, as seen in Figure 4.9. The following shows that a matrix A′ can be found such
that n_f = A′ n. The vectors n and t are perpendicular, which means that
their scalar product is zero: n^T t = 0. Doing some transformations yields:
\[ n^T t = n^T A^{-1} A\, t = (n^T A^{-1})(A\, t) = ((n^T A^{-1})^T)^T t_f = ((A^{-1})^T n)^T t_f = 0 \]
This equation shows that (A^{-1})^T n is a vector that is perpendicular to
t_f, thus it has to be the searched normal. The transformation matrix A′ is
given by A′ = (A^{-1})^T and the complete mapping of a normal looks like:
\[ n_f = (A^{-1})^T n \]
Even if the normal n was normalized, nf is usually not normalized.
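The following sketch computes the inverse transpose of a 3×3 matrix and checks the result on a non-uniform scaling. It relies on the identity that the inverse transpose equals the cofactor matrix divided by the determinant, where the rows of the cofactor matrix are cross products of the rows of A. The function names and the example values are assumptions of this illustration.

```python
def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(A, v):
    return tuple(dot(row, v) for row in A)


def inverse_transpose(A):
    """(A^-1)^T as cofactor matrix / determinant (A must be invertible)."""
    r0, r1, r2 = A
    cof = (cross(r1, r2), cross(r2, r0), cross(r0, r1))
    det = dot(r0, cof[0])
    return tuple(tuple(c / det for c in row) for row in cof)


def transform_normal(A, n):
    """n_f = (A^-1)^T n; the result is generally not normalized."""
    return mat_vec(inverse_transpose(A), n)
```

For A = diag(2, 1, 1), the normal n = (1, 1, 0) and the tangent t = (1, −1, 0), the transformed tangent is A t = (2, −1, 0). The vector A n = (2, 1, 0) is not perpendicular to it, whereas transform_normal yields (0.5, 1, 0), which is.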
Chapter 5
Triangle Intersection
In order to decrease the required floating point resources of the hardware
architecture described later in this thesis, I developed a special triangle
intersection method that is based on affine ray transformations. Because
such affine ray transformations are necessary in the dynamic ray tracing
algorithm anyway, using this intersection method makes it possible to save a lot
of hardware resources by sharing one transformation unit for both purposes.
The so-called unit triangle intersection method consists of two stages.
First the ray is transformed, using a triangle-specific affine triangle
transformation, to a coordinate system in which the triangle looks like the unit
triangle Δ_unit with the edge points (1, 0, 0), (0, 1, 0) and (0, 0, 0). In the
second stage, a simple intersection test of the transformed ray with the unit
triangle is done.
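The second stage can be sketched as follows. The concrete test is not spelled out at this point of the text, so the conditions below are an assumption that follows from the geometry of the unit triangle: it lies in the plane z = 0 and covers the points with x ≥ 0, y ≥ 0 and x + y ≤ 1. The function name is hypothetical.

```python
import math


def intersect_unit_triangle(org, direction):
    """Intersect an already transformed ray with the unit triangle
    (1,0,0), (0,1,0), (0,0,0). Returns the hit-distance or math.inf."""
    if direction[2] == 0:
        return math.inf                  # ray is parallel to the plane z = 0
    lam = -org[2] / direction[2]         # distance to the plane z = 0
    if lam < 0:
        return math.inf                  # plane lies behind the ray origin
    u = org[0] + lam * direction[0]
    v = org[1] + lam * direction[1]
    inside = u >= 0 and v >= 0 and u + v <= 1
    return lam if inside else math.inf
```

Since the hit-distance is invariant under affine transformations (Section 4.5), the returned value can be used directly as the hit-distance of the original ray with the original triangle.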
Figure 5.1: Unit Triangle Intersection
5.1 Affine Triangle Transformation
The affine triangle transformation to a triangle Δ = (a, b, c) is an affine
transformation T_Δ(x) = m · x + n with m ∈ Mat_R(3 × 3) and n ∈ R^3
that maps the triangle Δ to the unit triangle Δ_unit. The inverse
T_Δ^{-1}(x) = m′ · x + n′ of T_Δ can easily be described by the following equations:
\[ T_\Delta^{-1}\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = a, \qquad T_\Delta^{-1}\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = b, \qquad T_\Delta^{-1}\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = c \]
These equations map the edge points of the unit triangle to the edge
points of the triangle. If q ∈ R^3 is an arbitrary vector, then the solution
T_Δ^{-1} of the equations takes the form:
\[ m' = \begin{pmatrix} a_x - c_x & b_x - c_x & q_x \\ a_y - c_y & b_y - c_y & q_y \\ a_z - c_z & b_z - c_z & q_z \end{pmatrix}, \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix} \]
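The form of the inverse triangle transformation can be verified with a small sketch that builds m′ and n′ from the triangle and an arbitrary q, and then maps the edge points of the unit triangle. The names `inverse_triangle_transform` and `apply_affine` as well as the example values are chosen for this illustration only.

```python
def inverse_triangle_transform(a, b, c, q):
    """Build m' (columns a - c, b - c, q) and n' = c."""
    m_inv = tuple((a[i] - c[i], b[i] - c[i], q[i]) for i in range(3))
    n_inv = tuple(c)
    return m_inv, n_inv


def apply_affine(m, n, x):
    """Evaluate m * x + n."""
    return tuple(sum(m[i][j] * x[j] for j in range(3)) + n[i] for i in range(3))


a, b, c = (1, 2, 3), (4, 0, 1), (0, 1, 1)
m_inv, n_inv = inverse_triangle_transform(a, b, c, q=(7, 7, 7))
mapped = [apply_affine(m_inv, n_inv, p) for p in ((1, 0, 0), (0, 1, 0), (0, 0, 0))]
```

The list `mapped` contains a, b and c regardless of the choice of q, since the third column of m′ is only multiplied by the z-coordinate, which is zero for all three edge points.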
The vector q is not determined by these equations, but there are two useful
ways to choose it. The first minimizes the memory needed
to store a triangle matrix and the second allows some dot product
computations to be done for free.
5.1.1 Memory Efficient Triangle Transformation
The representation of the triangle transformation can be minimized by
choosing q in such a way that the triangle transformation matrix m of T_Δ
has the first column equal to (1, 1, 1)^T, which can be achieved by setting
q = −(a − c) − (b − c) + (1, 0, 0)^T. Here it is not necessary to save the first
column of the matrix.
\[ m' = \begin{pmatrix} a_x - c_x & b_x - c_x & -(a_x - c_x) - (b_x - c_x) + 1 \\ a_y - c_y & b_y - c_y & -(a_y - c_y) - (b_y - c_y) + 0 \\ a_z - c_z & b_z - c_z & -(a_z - c_z) - (b_z - c_z) + 0 \end{pmatrix}, \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix} \]
−1
It needs to be shown that the inverse T∆ of T∆
is of the form:
37
5.1. AFFINE TRIANGLE TRANSFORMATION

1 βx γx

m =  1 βy
1 βz


γy 
γz

δx



n =  δy 
δz
Using properties of affine transformations, it can be shown that
$n = -m'^{-1} \cdot n'$. Thus it is equivalent to prove:

$$T_\Delta \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + n = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - m'^{-1} \cdot n'$$
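This can be shown by applying the inverse $T_\Delta^{-1}$ to both sides:

$$\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = T_\Delta^{-1}\left( \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - m'^{-1} \cdot n' \right) = m' \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - n' + n' = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$$

The last equality holds because the three columns of m′ sum to $(a - c) + (b - c) + q = (1, 0, 0)^T$.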
This proof requires the existence of $T_\Delta$, and it turns out that this inverse
does not always exist. Geometrically, the choice of q maps the
normal $N_{unit} = (0, 0, 1)^T$ of the unit triangle to the point
$T_\Delta^{-1}(N_{unit}) = -(a - c) - (b - c) + e_1 + c$.
The part $-(a - c) - (b - c) + c$ of this sum
lies in the triangle plane. Thus the triangle transformation does not exist if
the triangle normal is perpendicular to $e_1$, since then $-(a - c) - (b - c) + e_1 + c$
lies in the triangle plane too. This problem can be solved by choosing q in
such a way that one of the other two columns of m is $(1, 1, 1)^T$. The n-th
column can be set to $(1, 1, 1)^T$ if $q = -(a - c) - (b - c) + e_n$. The proof of this
is analogous to the one above.
To store the minimized representation of the triangle transformation, it
is necessary to save the number of the column that is equal to (1, 1, 1)^T. This
can simply be encoded in 2 bits.
Furthermore, a criterion is required that chooses the column to be set to
(1, 1, 1)^T. This is quite simple, since the column index n is optimal if the normal Nunit
is mapped to a point as far away from the triangle plane as possible. Thus n is
chosen such that the angle between e_n and the normal of the triangle ∆ is
minimal.
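The construction of this section can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation: the helper names (`inv3`, `memory_efficient_transform`, `apply_point`) are hypothetical, and the column selection uses the largest absolute normal component, which assumes the triangle normal is considered only up to its sign.

```python
def sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def matvec(m, v):
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def inv3(m):
    """Invert a 3x3 matrix (list of rows) with the cofactor formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("singular matrix (degenerate triangle?)")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def memory_efficient_transform(a, b, c):
    """Return (col, m, n) such that column `col` of m equals (1, 1, 1)^T."""
    e1, e2 = sub(a, c), sub(b, c)
    normal = cross(e1, e2)
    # Assumption: pick the axis e_n that is closest to the normal line.
    col = max(range(3), key=lambda k: abs(normal[k]))
    q = [-e1[k] - e2[k] + (1.0 if k == col else 0.0) for k in range(3)]
    m_prime = [[e1[r], e2[r], q[r]] for r in range(3)]  # matrix m' of the inverse
    m = inv3(m_prime)
    n = [-x for x in matvec(m, c)]                      # n = -m'^{-1} * n'
    return col, m, n

def apply_point(m, n, p):
    mp = matvec(m, p)
    return [mp[r] + n[r] for r in range(3)]
```

Only the two remaining columns of m, the vector n and the 2-bit column index would have to be stored for each triangle.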
5.1.2 Normal Consistent Triangle Transformation
A different possibility is to choose q in such a way that the normalized
normal $N = (a - c) \times (b - c) / |(a - c) \times (b - c)|$ of the triangle is mapped to
the normal of the unit triangle.

$$T_\Delta^{-1}(N_{unit}) = T_\Delta^{-1}\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = N$$

The solution to $T_\Delta^{-1}$ looks like:

$$m' = \begin{pmatrix} a_x - c_x & b_x - c_x & N_x \\ a_y - c_y & b_y - c_y & N_y \\ a_z - c_z & b_z - c_z & N_z \end{pmatrix}, \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}$$

The transformation $T_\Delta^{-1}$ is completely defined, and the inversion of $T_\Delta^{-1}$
yields again an affine transformation if the triangle is not degenerate. Thus
$T_\Delta$ exists for each non-degenerate triangle ∆.
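A minimal Python sketch of the normal consistent construction follows. The function names are hypothetical, and the degeneracy check via the length of the cross product is an assumption of this sketch.

```python
import math

def sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def matvec(m, v):
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def inv3(m):
    """Invert a 3x3 matrix (list of rows) with the cofactor formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def normal_consistent_transform(a, b, c):
    """Return (m, n) of T(x) = m*x + n mapping triangle (a, b, c) to the unit triangle."""
    e1, e2 = sub(a, c), sub(b, c)
    nrm = cross(e1, e2)
    length = math.sqrt(sum(x * x for x in nrm))
    if length == 0.0:
        raise ValueError("degenerate triangle: no transformation exists")
    big_n = [x / length for x in nrm]                       # normalized normal N
    m_prime = [[e1[r], e2[r], big_n[r]] for r in range(3)]  # columns: a-c, b-c, N
    m = inv3(m_prime)
    n = [-x for x in matvec(m, c)]                          # n = -m'^{-1} * n'
    return m, n

def apply_point(m, n, p):
    mp = matvec(m, p)
    return [mp[r] + n[r] for r in range(3)]
```

In contrast to the memory efficient variant, all twelve coefficients have to be stored, but no column index is needed.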
5.2 Unit Triangle Intersection
To intersect a ray R = (org, dir) with a triangle ∆, the ray R is transformed using T∆ to the unit triangle space. The intersection distance λ and
the barycentric (u,v)-coordinates do not change under an arbitrary bijective
affine transformation. As the triangle transformation is bijective for non-degenerate triangles, it is equivalent to compute the ray-triangle intersection
in the world coordinate system between R and ∆, or in the unit triangle coordinate system between the transformed ray R′ and ∆unit. The advantage
of the second method is that the intersection of a ray with
the unit triangle is quite simple to compute, since the unit triangle lies in the xy-plane.
Let R′ = T∆(R) = T∆(org, dir) = (m · org + n, m · dir) = (org′, dir′) be
the ray transformed to the unit triangle space. Then the intersection can be
computed by:

$$\lambda = -\frac{org'_z}{dir'_z}, \qquad u = \lambda \cdot dir'_x + org'_x, \qquad v = \lambda \cdot dir'_y + org'_y$$
The hit-point lies in the triangle, if the so called in-triangle test u ≥
0 ∧ v ≥ 0 ∧ u + v ≤ 1 is fulfilled and has the barycentric triangle
coordinates (u, v, 1 − u − v).
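The intersection formulas and the in-triangle test translate directly into code. The following Python sketch assumes the ray has already been transformed to the unit triangle space. The function name, the handling of rays parallel to the xy-plane and the optional `eps` parameter are assumptions of this sketch; the sign of λ is left to the caller, as the formulas above do not test it.

```python
def intersect_unit_triangle(org, direction, eps=0.0):
    """Intersect a transformed ray with the unit triangle in the xy-plane.

    Returns (lam, u, v) if the hit-point passes the in-triangle test,
    otherwise None.
    """
    if direction[2] == 0.0:
        return None                      # ray parallel to the triangle plane
    lam = -org[2] / direction[2]         # distance to the plane z = 0
    u = lam * direction[0] + org[0]
    v = lam * direction[1] + org[1]
    if u >= -eps and v >= -eps and u + v <= 1.0 + eps:
        return lam, u, v
    return None
```

For example, `intersect_unit_triangle([0.25, 0.25, 1.0], [0.0, 0.0, -1.0])` hits the triangle at distance 1 with u = v = 0.25.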
If the second triangle transformation, which maps the geometry normal
of the triangle to the normal of the unit triangle, is used, it is possible to
compute the dot product between the ray direction and the triangle normal
in both coordinate systems. In the unit triangle system the computation is
extremely simple:

$$dir' \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = dir'_z$$

Thus the z-component of the transformed ray direction is the dot product between the ray direction and the geometry normal of the triangle.
If the ray direction of R was normalized, then $dir'_z$ is exactly the cosine
of the angle between the ray direction and the normal vector.
It is not obvious that the dot product is maintained under the unit
triangle transformation, but this special transformation has this property as
it can be written as:
T∆ = Txy ◦ TR ◦ TT
The transformation TT is a translation that maps the triangle edge point
c to (0, 0, 0). The rotation TR rotates the triangle to the xy-plane and the
last transformation Txy is a composition of transformations that maps the
triangle in the xy-plane to the correct form. This last transformation does
not change the z-component of its input vector.
The translation and the rotation change neither angles nor lengths, and
the transformation Txy does not change the result of the dot product with
the normal vector (0, 0, 1), as it leaves the z-component, i.e. the component
along the normal, unchanged. Thus the complete triangle transformation does not change the dot
product with the normal. Note that because the last transformation Txy changes the length
of the vectors, the angle between the ray direction and the normal is not
maintained by the triangle transformation, only the dot product.
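The same result can also be checked algebraically. The following short derivation is a sketch that uses only the definitions of Section 5.1.2; the symbol $r_3$ is introduced here as a name for the third row of m. Since $m \cdot m' = I$ and the columns of m′ are a − c, b − c and N, the third row satisfies

$$r_3 \cdot (a - c) = 0, \qquad r_3 \cdot (b - c) = 0, \qquad r_3 \cdot N = 1 .$$

For a non-degenerate triangle the two edge vectors span the triangle plane, so $r_3$ is parallel to N, and $|N| = 1$ then gives $r_3 = N^T$. Hence

$$dir'_z = r_3 \cdot dir = N \cdot dir .$$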
The described method can be used to compute the cosine between the
ray direction and triangle normal only if the direction of the initial ray is
normalized. Thus in conjunction with the dynamic ray tracing algorithm
the only transformations that can be used to instantiate objects are compositions of translation and rotation matrices, as otherwise the length of the
direction is changed.
Of course this concept to transform the ray first and then to intersect
with a unit object can be applied to many other types of objects like ellipses
or rectangles too. An advantage is that only one representation is required
for a wide range of objects, as the transformation to the unit object is
described by an affine transformation in each case. Additionally only the
type of the object has to be stored, to call the correct unit intersection
function.
A drawback of this triangle intersection method is that the triangle matrix depends on each of the edge points of the triangle. Thus, because of
limited computational accuracy, rays can be shot through the gap between two triangles that lie beside
each other and have two vertices in common. This problem can be solved by
using a small epsilon in the comparisons of the in-triangle test. However,
most triangle intersection methods suffer from this problem.
Chapter 6
The Dynamic SaarCOR Architecture
The architecture presented in this chapter is a general approach for a dynamic ray tracing hardware architecture which has many aspects in common
with the standard SaarCOR architecture [8]. The main difference is that the
Dynamic SaarCOR Architecture supports dynamic scenes, whereas the standard
SaarCOR architecture does not.
Dynamics is achieved by partitioning the scene into movable objects as
described in Section 4. The geometry in the objects remains static, but the
objects themselves can be moved around. This requires rebuilding
a top-level acceleration structure over the objects in each frame in which some
objects have been moved. The architecture gives no hardware support for
rebuilding the top-level acceleration structure, as this can be done fast enough
on the host PC if the number of objects is less than 50000.
Hardware support is given for the triangle intersection, traversal through
the dynamic 2-level acceleration structure and the shading computation as
these are the most expensive operations. To support this a costly affine
ray transformation unit to transform rays to the local coordinate system of
an object is required. Because this unit is almost of the same complexity
as a standard triangle intersection unit a naive approach would double the
required chip area. But using the special unit triangle intersection method
as described in Section 5, it is possible to share the transformation unit for
two purposes. Furthermore the shader can use the transformation unit to
perform the primary or secondary ray computation.
The reason for using the special triangle intersection method is to
share the transformation unit, mainly between the object space transformation
and the triangle intersection. In principle, it would be possible to separate
the transformation and the intersection using two independent units. But this
has some disadvantages, because the transformation unit would be used only
20% of the time if it were fully pipelined. A lot of computational power
would be wasted this way. The usage of the transformation unit could
be increased by performing the operation sequentially, such that approximately
5 cycles are required per ray transformation. But then the transformation
could slow down the complete pipeline if it is
used much more frequently in some parts of the scene. This slowing down is a typical behavior if too
many special purpose units are used in the design.
To exploit coherence between neighboring rays the architecture handles
packets of rays as described in Section 3.1.3. By doing so data is always
accessed for a packet of rays reducing the size of the memory interface.
At a given time there are always several independent packets in the ray
tracing system to increase the usage of the units. This is necessary as the
special purpose pipelines needed for the computation are fairly deep. On the
other hand, memory latency can be hidden since during a memory request
of one packet, the other packets can do operations in the chip.
Because each packet can be seen as a single thread running in the system
this concept is a kind of multi-threading. Each packet corresponds to a
complete data-set in the chip, consisting of near and far value, stacks and
other required internal data. In order to guarantee that each packet accesses
only its data-set, a unique packet-id (pid) identifies it and is used to address
the correct data-set. This packet-id is passed from unit to unit, as a kind
of job-passing. If the traversal unit reaches a leaf node for instance, the
packet-id is delivered to a different unit that handles the list of objects.
A very important topic in ray tracing is the shading computation. Due to
the variety of possible shading models, the corresponding shading hardware
should be a fully programmable special purpose CPU. As shading is outside
the scope of this thesis, it is mentioned only marginally.
The Dynamic Ray Tracing Architecture (see Fig. 6.1) consists of one or
more Dynamic Ray Tracing Pipelines (DynRTP), which are subdivided into
a Ray Generation and Shading unit (RGS) and the Dynamic Ray Tracing
Core (DynRTC). The main task of the RGS unit is to do the shading computations,
using the Dynamic Ray Tracing Core (DynRTC) to shoot rays
through the scene, and to compute primary rays.
Figure 6.1: Dynamic Ray Tracing Architecture
The Dynamic Ray Tracing Core consists of four main parts. First there
is the traversal unit that traverses a packet of rays through the acceleration
structure. The lists of objects of the acceleration structure are handled by
the list unit. The transformation unit applies an affine transformation to a
packet of rays and the intersection unit intersects rays with the unit triangle.
The Ray Generation Controller tells the DynRTP units which pixels
to render next. The scene data as well as some other configuration data
(camera position, acceleration structure, etc.) are sent through a PCI or
AGP interface to the chip. Each Dynamic Ray Tracing pipeline has access
to the scene data through a cache interface. This cache interface consists of
four independent caches for each type of data that is used.
6.1 Dynamic Ray Tracing Core
The ray tracing core is the basic ray casting unit of the architecture. It
is responsible for tracing packets of rays through the scene and returning
information about the object that was hit. As the partitioning of the scene into movable
objects is a fundamental concept of the dynamic ray tracing approach,
the dynamic ray tracing core has to traverse the packet through a
top-level acceleration structure to find a possible hit-object and then transform the packet to the local coordinate system of that object. There the traversal
needs to be continued to find a possible hit-triangle.
The Dynamic Ray Tracing Core is used by the shader unit (RGS) to
shoot rays through the scene. To do so the shader first needs to initialize
the dynamic ray tracing core by sending the k-D tree root node for the
next packet and the transformation to apply first. For a primary ray, this
transformation is a simple camera transformation. After that, the shader
sends the packet of rays in sequence to the pipeline. It always passes the
transformation unit first, which applies the stored transformation to it.
Because the transformed ray has to traverse through the scene it is saved
in the traversal and transformation unit for later use. The traversal unit
starts the top-level traversal of the packet until a leaf node is reached. It
sends the list of objects, saved in the leaf node, to the list unit which has
the task to handle the list. Thus it reads the first list entry out of the list
and sends it to the transformation unit. This one fetches the object, stores
the object’s root node into the traversal unit and applies the stored inverse
object transformation to the packet of rays. At this point the inverse of the
object transformation is required, since we do not position the object, but
transform the ray into the object.
The transformed ray is now in the local coordinate system of the object
and is saved in the traversal and transformation unit. The traversal starts
with the bottom-level traversal in the object with the transformed ray until
a leaf node is reached. The list unit handles the list again but the transformation unit now reads unit triangle matrices out of memory and applies
these transformations to the packet.
The packet transformed to the unit triangle space is intersected with the
unit triangle by the intersection unit. The intersection result is stored in
the traversal unit, which in particular needs the hit-distance to compute the
ray termination correctly. If the list of triangles still contains entries, the operation
is continued at the list unit, otherwise at the traversal unit.
6.1.1 Traversal Unit
The traversal unit traverses packets of rays in parallel through the scene.
This is done using a k-D tree and k-D tree traversal algorithm as explained
in the Sections 3.1.2 and 3.1.3.
The traversal unit consists of a memory interface, to load k-D tree nodes,
and a special purpose pipeline. This one is internally subdivided into some
traversal slices to handle the single rays of the packet in parallel. In each
pass through the pipeline a packet traversal step is computed.
Figure 6.2: The Figure shows the traversal unit consisting of the memory
interface, 4 traversal slices, a packet traversal decision unit and the collect
hits unit. For each of the units the necessary internal data is shown.
Figure 6.2 shows the internal structure of the traversal unit. The operation always starts at the memory interface which fetches the next or first k-D
tree node out of memory. If this node was a leaf then the packet together
with the list address is sent to the list unit to compute intersection results.
Otherwise the node is sent to the traversal slices which compute a traversal
decision for each of the rays in the packet. These single traversal decisions
are combined into a packet traversal decision by the packet traversal decision unit. The packet traversal decision is sent to the memory interface and
back to the traversal slices as these have to do stack operations depending
on it. Using the packet traversal decision the memory interface can fetch
the next node to process and do push/pop operations of the nodes.
Because the memory interface is responsible for the computation of the
node addresses, it saves the current node and handles the node stack. In
contrast, because the traversal slices compute the traversal decision for a ray,
they need to store and update the near and far values and the ray, and to handle
the far-stack needed for the computations.
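The interplay of the traversal slices and the packet traversal decision unit can be sketched as follows. The thesis defers the traversal algorithm itself to Sections 3.1.2 and 3.1.3, so this Python sketch uses a generic textbook k-D tree rule and is not necessarily the logic of the hardware. The function names are hypothetical, the near child is assumed to be chosen by the sign of the ray direction along the split axis, and all rays of a packet are assumed to share that sign.

```python
def ray_decision(org_k, dir_k, split, near, far):
    """Traversal decision of one slice: returns (go_near, go_far).

    org_k and dir_k are the ray components along the split axis;
    near and far bound the current ray segment.
    """
    if dir_k == 0.0:
        # Ray parallel to the split plane: it stays on the side of its origin.
        in_left = org_k < split
        return (in_left, not in_left)
    d = (split - org_k) / dir_k          # distance to the split plane
    if d >= far:
        return (True, False)             # segment ends before the plane
    if d <= near:
        return (False, True)             # segment starts behind the plane
    return (True, True)                  # segment crosses the plane

def packet_decision(decisions):
    """Combine the per-ray decisions: a child is visited if any ray needs it."""
    go_near = any(d[0] for d in decisions)
    go_far = any(d[1] for d in decisions)
    return go_near, go_far
```

If the packet decision is `(True, True)`, the far child would be pushed onto the node stack by the memory interface while the traversal continues with the near child.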
The collect hits unit computes the closest intersection for each ray of
the packet. If this unit gets a new intersection result, it determines whether
the new hit-distance is closer than the one saved. If so, the new intersection
result is saved and the one stored is deleted. The intersection result typically
consists of the hit-distance, hit-object and hit-triangle. Because the local
barycentric uv-coordinates of the hit-point are required to support textures
they need to be saved as intersection result too. As the special unit triangle
intersection method is used, the cosine between ray direction and triangle
normal can be computed for free. Therefore it is saved as intersection result
for later usage in the shader.
An important point is that the collect hits unit gives the traversal slices
access to the current hit-distance of their ray of the packet. Using this
information the traversal slices can terminate a ray. A ray is terminated if
there is a hit closer than the far value of the leaf node where the hit occurred.
As the traversal unit terminates the ray at the next traversal step, the hit-distance is compared against the current near value. If it lies before the near
value, each further hit would be farther away than the stored one. If each
ray of the packet is finished or the stack is empty, the traversal operation is
finished.
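The bookkeeping of the collect hits unit can be illustrated by a small Python sketch. The class and field names are hypothetical, and the strict comparison in `terminated` follows the wording "before the near value" above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hit:
    distance: float   # hit-distance along the ray
    obj: int          # hit-object
    tri: int          # hit-triangle
    u: float          # barycentric coordinates of the hit-point
    v: float
    cos_n: float      # dot product of ray direction and triangle normal

class CollectHits:
    """Keeps the closest intersection result for each ray of a packet."""

    def __init__(self, packet_size: int = 4) -> None:
        self.best: List[Optional[Hit]] = [None] * packet_size

    def submit(self, ray: int, hit: Hit) -> bool:
        """Store the hit if it is closer than the saved one."""
        cur = self.best[ray]
        if cur is None or hit.distance < cur.distance:
            self.best[ray] = hit
            return True
        return False

    def terminated(self, ray: int, near: float) -> bool:
        """A ray is finished once its stored hit lies before the near value."""
        cur = self.best[ray]
        return cur is not None and cur.distance < near
```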
6.1.2 Mailboxed List Unit
The mailboxed list unit has the task of handling a list of object or triangle
addresses, filtering the addresses in a kind of intersection cache (mailbox)
and sending the addresses that pass the filter to the transformation unit.
This mailboxing is necessary as most objects are present in several leaf
nodes of the k-D tree. Therefore it can happen that an object is intersected
several times, which greatly reduces the performance (see Section 4.3). Especially the room problem decreases the performance. Therefore it is required
to avoid multiple intersections with objects and triangles. This is the task
of the mailbox unit, which saves already intersected objects in slots and
prevents packets from being intersected twice with an object.
The list unit gets a job from the traversal unit consisting of a single
address of the list to handle. The first entry of the list is read and sent to
the mailbox unit. This is a packet-based mailbox which checks if the
packet has already been intersected with this object. If so, control is returned
to the list unit to read the next list entry, or to continue at the traversal unit
if this was the last list entry. If the list entry was not yet intersected, it is
sent to the transformation unit to be intersected.
Figure 6.3: Mailboxed List Unit
The operation at the list unit is continued if a triangle intersection or
object intersection operation is done and the list was not empty. If the list
was empty the traversal operation is continued.
6.1.3 Transformation Unit
An essential part of the algorithm is the ray transformation which is done
by a specialized transformation unit. This unit performs the transformation
of the rays to the object’s coordinate system and transforms the rays to
the unit triangle system as a kind of precomputation for the intersection
unit. Furthermore the shader can use the transformation unit to apply the
camera transformation to compute a primary ray and to compute secondary
rays like light rays or reflection rays. The transformation of a packet is done
sequentially, which allows for a good balancing between the traversal unit
and the transformation unit (see Section 6.1.5).
Because most ray packets have a single ray origin, this origin needs to
be transformed only once. The transformation unit exploits this property
by a kind of packet compression that transforms a packet of n rays into
n + 1 vectors. The first vector is the common ray origin and the other
ones the direction vectors of the packet. Such a compressed packet can be
transformed by a fairly cheap transformation unit for vectors and decoded
to a normal packet of rays by a decompression unit.
Figure 6.4: Transformation Unit
A transformation job starts at the load matrix unit which reads the matrix of an object or triangle column by column out of memory and stores
them in the transform unit. If the matrix was completely read, the send
packet unit gets the job. This unit has a copy of the rays of the packet to
process and sends these to the compress packet unit. This unit compresses
the packet and sends the vectors and points to be transformed sequentially
to the transform unit. This unit applies the previously stored affine transformation to its input vectors. Finally, the packet is combined into a valid
packet again by the decompress packet unit.
There is an important path of the transformed packet to the send packet
unit. This path is needed if a packet was transformed to the local coordinate
system of an object, because then the transformed ray needs to be saved to
be intersected with the triangles in the object later.
There exist two modes for the transformation, one to transform points
and a different one to transform vectors. This is important as both have to
be transformed differently as explained in Section 4.4. Furthermore, there
exist two compression modes that indicate whether the packet has a common
origin or not. If so the packet is compressed. Otherwise, each origin and
direction of the packet is transformed, resulting in 2n transformations. The
compression mode is set by the shader, as it has the necessary information
about the type of the packet.
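The two transformation modes and the two compression modes can be sketched in Python as follows. The function names are hypothetical. Points are transformed with the full affine map and vectors with the linear part only, as in the ray transformation (m · org + n, m · dir) of Section 5.2. The returned counter is the number of vector transformations, which is n + 1 for a compressed packet and 2n otherwise.

```python
def transform_point(m, n, p):
    """Point mode: full affine transformation m*p + n."""
    return [sum(m[r][k] * p[k] for k in range(3)) + n[r] for r in range(3)]

def transform_vector(m, v):
    """Vector mode: only the linear part m*v is applied."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def transform_packet(m, n, rays, common_origin):
    """Transform a packet given as a list of (origin, direction) pairs.

    Returns the transformed packet and the number of transformations used.
    """
    if common_origin:
        # Compressed mode: the shared origin is transformed only once.
        org = transform_point(m, n, rays[0][0])
        dirs = [transform_vector(m, d) for _, d in rays]
        return [(org, d) for d in dirs], 1 + len(rays)
    # Uncompressed mode: every origin and every direction is transformed.
    out = [(transform_point(m, n, o), transform_vector(m, d)) for o, d in rays]
    return out, 2 * len(rays)
```

For a packet of 4 rays this gives 5 sequential transformations in compressed mode and 8 otherwise.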
Figure 6.5: The Figure shows that primary rays as well as light rays are
types of packets with a single origin. Even reflections at planar surfaces
maintain this property, as the virtual origin can be seen as the common
origin of the packet.
It turns out that most kinds of packets can be compressed (see Figure
6.5). Packets of primary rays are trivially compressible, since their origin
is the projection center of the camera. Light rays that are shot from the light
source to the hit-points have the light source as their common origin. Even reflected
packets of rays retain their single origin if the packet was reflected by exactly
one planar surface. The reflection at curved surfaces yields a compressible
packet only in special cases.
6.1.4 Intersection Unit
The intersection unit is a simple pipeline that intersects rays with the unit
triangle, applying the formulas of Section 5.2. As inputs it gets rays transformed
to the unit triangle space and computes an intersection result consisting of the hit-distance, barycentric coordinates and the dot product between
the ray direction and the triangle normal vector. This intersection result is
combined with the hit-object and hit-triangle and then saved in the collect
hits unit of the traversal.
6.1.5 Balancing
The subdivision of an algorithm into special purpose units may become a
problem if the units are too specialized and used very rarely. Thus the balancing
between the individual units of the dynamic ray tracing core needs to be
analysed.
The most expensive units of the design are the traversal unit and the
transformation unit. Simulations showed that a balancing of 4 to 1 between
the traversal and intersection operations is optimal for the k-D tree algorithm
[8]. The same ratio can be used between the traversal and the
transformation unit, which means that 4 times as many traversal operations
as ray transformations need to be done.
This ratio can approximately be achieved using a packet size of 4 rays per
packet, which are traversed in parallel and transformed sequentially. The
transformation unit then requires five times as many cycles to handle a packet
as the traversal unit if the packet can be compressed (one common origin plus
four directions), and eight times as many otherwise (four origins plus four
directions). This gives a ratio of 5 to 1 for compressed packets, or 8 to 1 otherwise. The
ratio of 5 to 1 has been shown to be optimal for the dynamic architecture,
as can be seen in the usage statistics in Appendix A.
6.2 Shading Unit
The shading unit should consist of several programmable special purpose
shading CPUs, because of the wide range of possible shading models. This
concept of the programmable shading unit will not be discussed, but rather
the interface between the shader unit and the Dynamic RTC.
This interface consists (besides a channel to send the k-D tree root node)
of a channel to store a transformation in the RTC. Because this stored
transformation is always applied to the packet sent to the ray tracing core,
the shader can compute primary rays, light rays or reflection rays, using
the transformation unit. Each of these computations can be performed by
storing a suitable transformation in the RTC and by sending a special ray
to be transformed. If all rays of the packet have been transformed, the RTC
starts with the traversal operation.
6.2.1 Primary Rays
Primary rays are rays from the camera to the scene, which are computed for
each pixel of the image. A camera can be represented by three orthogonal
vectors u, v, w and its position p. The vectors u, v and w define the local
coordinate system of the camera, such that u points to the right, v to the
top and w in the viewing direction of the camera. The primary ray belonging
to a pixel (x, y) on the screen is:

$$x' = \frac{x}{x_{max}} - \frac{1}{2}, \qquad y' = \frac{y}{y_{max}} - \frac{1}{2}$$

$$\mathit{prim\_ray} = (p,\; x' \cdot u + y' \cdot v + w)$$
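This primary ray can also be computed by the following ray transformation:

$$T_{shear} = \begin{pmatrix} \frac{1}{x_{max}} & 0 & -\frac{1}{2} & 0 \\ 0 & \frac{1}{y_{max}} & -\frac{1}{2} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad T_c = \begin{pmatrix} u_x & v_x & w_x & p_x \\ u_y & v_y & w_y & p_y \\ u_z & v_z & w_z & p_z \end{pmatrix}$$

$$\mathit{pre\_prim\_ray} = \left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \right)$$

$$\mathit{prim\_ray} = T_c(T_{shear}(\mathit{pre\_prim\_ray}))$$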
The shown 4x3 matrices represent affine transformations where the left
3x3 minor stands for the linear part and the fourth column for the affine part.
The transformation Tshear is a shearing transformation that performs the
mapping of the pixel coordinates to the x′ and y ′ values. The transformation
Tc performs the affine composition of the u, v, w and p vectors with the x′ , y ′
values. If the special ray pre prim ray is transformed first with Tshear and
then with Tc , the primary ray computation is performed.
Thus the RGS unit stores the camera matrix Tc ◦ Tshear as a transformation to the RTC and sends the pre-primary rays pre prim ray to it.
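The equivalence of the direct formula and the two ray transformations can be checked with a short Python sketch. All function names are hypothetical. Matrices are stored as 3 rows of 4 entries, with the fourth column as the affine part, and ray directions are transformed by the linear part only.

```python
def affine_point(t, p):
    """Apply a 3-row, 4-column affine matrix to a point."""
    return [sum(t[r][k] * p[k] for k in range(3)) + t[r][3] for r in range(3)]

def affine_vector(t, v):
    """Apply only the linear part (left 3x3 block) to a vector."""
    return [sum(t[r][k] * v[k] for k in range(3)) for r in range(3)]

def transform_ray(t, ray):
    org, direction = ray
    return affine_point(t, org), affine_vector(t, direction)

def shear_matrix(xmax, ymax):
    return [[1.0 / xmax, 0.0, -0.5, 0.0],
            [0.0, 1.0 / ymax, -0.5, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

def camera_matrix(u, v, w, p):
    return [[u[r], v[r], w[r], p[r]] for r in range(3)]

def primary_ray_direct(x, y, xmax, ymax, u, v, w, p):
    xs, ys = x / xmax - 0.5, y / ymax - 0.5
    return list(p), [xs * u[r] + ys * v[r] + w[r] for r in range(3)]

def primary_ray_transformed(x, y, xmax, ymax, u, v, w, p):
    pre_prim_ray = ([0.0, 0.0, 0.0], [float(x), float(y), 1.0])
    sheared = transform_ray(shear_matrix(xmax, ymax), pre_prim_ray)
    return transform_ray(camera_matrix(u, v, w, p), sheared)
```

Both functions return the same ray up to rounding, which is why the shader only has to send the pre-primary ray of each pixel.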
6.2.2 Light Rays
Light rays are secondary rays that are computed to determine the amount
of light that illuminates the hit-point of a primary ray for instance. Such a
light ray goes from the light source to the hit-point of the primary ray.
To compute a light ray for a primary ray, the shader has to read back the
primary ray R = (org, dir) and the intersection result from the RGC. The
intersection result consists among other things of the hit-distance λ which
is needed to compute the hit-point R(λ). If L is the position of the light
source, the light ray can be computed by:
Rlight = (L, R(λ) − L) = (L, org + λ · dir − L)
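The same computation can be done by the following ray transformation:

$$T_{light} = \begin{pmatrix} org_x & dir_x & L_x & L_x \\ org_y & dir_y & L_y & L_y \\ org_z & dir_z & L_z & L_z \end{pmatrix}, \qquad R' = \left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ \lambda \\ -1 \end{pmatrix} \right)$$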
The transformation of the ray R′ by Tlight yields the light ray from the
light source to the hit-point of the ray. Note that because this transformation
Tlight depends on the ray R, the shader has to load a special matrix for each
of the rays of the packet. On the other hand, the actual hit-point of the ray does not
need to be computed in the shader.
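As a plausibility check, the following Python sketch applies Tlight to R′ and compares the result with the direct formula for Rlight. The function names are hypothetical, and the matrix layout (3 rows of 4 entries, fourth column as affine part) is the same as for the camera matrices.

```python
def affine_point(t, p):
    return [sum(t[r][k] * p[k] for k in range(3)) + t[r][3] for r in range(3)]

def affine_vector(t, v):
    return [sum(t[r][k] * v[k] for k in range(3)) for r in range(3)]

def light_matrix(org, direction, light):
    """Columns: ray origin, ray direction, light position, light position."""
    return [[org[r], direction[r], light[r], light[r]] for r in range(3)]

def light_ray_transformed(org, direction, lam, light):
    t = light_matrix(org, direction, light)
    pre_org, pre_dir = [0.0, 0.0, 0.0], [1.0, lam, -1.0]   # the special ray R'
    return affine_point(t, pre_org), affine_vector(t, pre_dir)

def light_ray_direct(org, direction, lam, light):
    """R_light = (L, org + lam * dir - L)."""
    return list(light), [org[r] + lam * direction[r] - light[r] for r in range(3)]
```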
6.2.3 Reflection Rays
Reflection rays are computed to simulate reflective surfaces. A ray
that hits a reflective surface is reflected by it and traversed further in the
reflection direction. The geometry hit by the reflected ray is exactly what is
seen in the reflective surface.
The reflection of a ray at a planar surface can be performed by an affine
reflection transformation. Such a reflection transformation can be precomputed for each triangle of the scene using the normal consistent triangle
transformation. The concept is to transform the ray first to the unit triangle space, then to reflect it at the xy-plane and to transform it back again.
This precomputation can be done by the following composition of 3 affine
transformations:

$$T_{reflect} = T_\Delta^{-1} \circ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \end{pmatrix} \circ T_\Delta$$
This reflection transformation depends on the triangle and maps each
ray to the reflected ray. The reflected ray starts at the reflected origin and
has the reflected ray direction, as can be seen in Figure 6.6. To use a ray
reflected this way, the traversal of the reflected ray has to start at the hit-distance of the unreflected ray. This can be done by setting the near value
of the traversal algorithm to this hit-distance and ignoring each hit that
is closer than this distance.
If the triangle lies in an object (which is always the case for the dynamic
ray tracing algorithm) two additional transformations have to be done. First
the ray has to be transformed into the object coordinate system, then to be
reflected, and at last to be transformed back again to the world coordinate
system. These additional transformations increase the cost of a reflection
ray, but can be performed in 3 passes by the transformation unit as well.
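The composition of the three transformations can be sketched in Python. This is an illustrative sketch with hypothetical helper names, and it works in the coordinate system of the triangle only, i.e. without the additional object transformations. It returns the linear and affine part of the reflection transformation.

```python
import math

def sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def matvec(m, v):
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]

def matmul(x, y):
    return [[sum(x[r][k] * y[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def inv3(m):
    """Invert a 3x3 matrix (list of rows) with the cofactor formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def unit_normal(a, b, c):
    nrm = cross(sub(a, c), sub(b, c))
    length = math.sqrt(sum(x * x for x in nrm))
    return [x / length for x in nrm]

def reflection_transform(a, b, c):
    """Return (lin, aff) with T_reflect(x) = lin*x + aff for triangle (a, b, c)."""
    e1, e2, big_n = sub(a, c), sub(b, c), unit_normal(a, b, c)
    m_prime = [[e1[r], e2[r], big_n[r]] for r in range(3)]  # linear part of the inverse
    m = inv3(m_prime)                                       # linear part of T
    n = [-x for x in matvec(m, c)]                          # affine part of T
    flip = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
    lin = matmul(m_prime, matmul(flip, m))
    aff = [s + t for s, t in zip(matvec(m_prime, matvec(flip, n)), c)]
    return lin, aff
```

Applied to a direction d, the linear part yields d − 2 (d · N) N, which is the usual mirror reflection at the triangle plane.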
Figure 6.6: The Figure shows how the reflection matrix reflects a packet of
rays at a surface.
Chapter 7
FPGA Prototype
In this chapter the prototype implementation of the dynamic ray tracing
architecture is described. As development platform, the ADM-XRC-II PCI
board from Alpha Data [30] has been used. This board contains a Xilinx
Virtex-II 6000-4 [31] FPGA, 6 SRAM chips each with 4 MB of memory, a
PCI controller and some IO-adapters.
Figure 7.1: ADMXRC Development Platform
Figure 7.2: ADMXRC Top-Level Flowchart
These IO-adapters are used as a VGA-out interface by generating a digital RGB-signal in the chip, which is translated by an external digital-to-analog converter to an analog video signal.
My work on the prototype was the development of the dynamic ray tracing core, which has been completely developed using JHDL [32] as hardware
description language. JHDL has been used as it has a powerful debugging
infrastructure that allows the complete RTC to be simulated as a whole
and data buses to be logged into files. The system was completely developed under
Linux.
Some limitations had to be accepted when mapping the architecture to the Xilinx
Virtex-II 6000 FPGA. Unfortunately there were only enough resources to
implement one ray tracing pipeline. The main problem was the strongly
limited memory resources (block RAMs) in the chip.
Another limitation was the dedicated multipliers of the Virtex-II platform, which are only 18 bits wide. Thus a floating point representation with
a 16-bit mantissa, a 7-bit exponent and 1 sign bit is used. It turned out
that this accuracy is sufficient to do ray tracing even for complex and highly
detailed standard scenes.
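To get a feeling for the accuracy of such a format, the following Python sketch rounds a double precision value to a 16-bit mantissa. It is only an approximation of the hardware format: the function name is hypothetical, the limited range of the 7-bit exponent is ignored, and whether the prototype uses a hidden leading bit or a particular rounding mode is not stated here, so round-to-nearest on the fraction returned by `math.frexp` is an assumption.

```python
import math

def quantize(x: float, mantissa_bits: int = 16) -> float:
    """Round x to a floating point value with the given number of mantissa bits."""
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)            # x = frac * 2**exp with 0.5 <= |frac| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(frac * scale) / scale, exp)
```

Under these assumptions the relative rounding error of a single value is at most 2^-16, i.e. roughly 4 to 5 significant decimal digits.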
The number of packets in the ray tracing chip can be adjusted from 1 to
64 packets for simulation purposes. Later it is shown that a number of 32
packets in the system is in some sense optimal.
Because the prototype is not capable of rebuilding the top-level k-D tree
on the chip it has to be computed by the host PC the PCI-card is connected
to. After each frame, the updated top-level k-D tree is written to the ray
tracing prototype.
Figure 7.3: The Figure shows the Dynamic SaarCOR Prototype Top-Level
Chart. The numbers at the busses are the used data and address bits.
The traversal unit is subdivided into 4 traversal slices, as packets of 4
rays are handled in parallel. The two traversal levels (top-level
and bottom-level) are distinguished by an internal depth bit. This bit is 0 in
the top-level operation and 1 in the bottom-level operation. Some of the
internal registers need to be duplicated for both traversal levels, since they
are more or less unrelated. Because the stack is duplicated as well, both the
top-level and bottom-level operations support a stack depth of 31 entries. If
one of the stacks is full, the traversal operation cannot be continued correctly.
This problem can be partially solved by performing no further push operations
and continuing the traversal. This strategy works quite well, as
errors occur only in tiny details of the scene.
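The overflow strategy can be sketched in software as a fixed-capacity stack that ignores pushes once it is full. This is a minimal Python sketch, not the hardware implementation; the class name, the `dropped` counter and the list holding one stack per depth bit are illustrative assumptions.

```python
class BoundedStack:
    """Fixed-capacity stack that silently drops pushes when it is full."""

    def __init__(self, capacity: int = 31):
        self.capacity = capacity  # 31 entries per level, as stated in the text
        self.entries = []
        self.dropped = 0          # hypothetical diagnostic counter

    def push(self, node) -> bool:
        """Store node; return False if the push was ignored (stack full)."""
        if len(self.entries) >= self.capacity:
            self.dropped += 1
            return False
        self.entries.append(node)
        return True

    def pop(self):
        """Return the most recently stored node, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None


# One stack per traversal level, selected by the depth bit
# (0 = top-level, 1 = bottom-level).
stacks = [BoundedStack(), BoundedStack()]
```

A dropped push means that the corresponding subtree is never visited, which matches the observation that errors show up only in small details of the scene.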
The traversal unit works on k-D tree nodes of 64 bit width. Thus a 64
bit wide memory interface is required, delivering a bandwidth of 0.68 GB/s
at 85 MHz.
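The bandwidth figure follows directly from the bus width and the clock frequency. The short Python sketch below reproduces the arithmetic, assuming one transfer per clock cycle and 1 GB = 10^9 bytes; the function name is hypothetical.

```python
def bandwidth_gb_per_s(bus_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s for one transfer of bus_bits per clock cycle.

    Assumes 1 GB = 10**9 bytes and no protocol overhead.
    """
    bytes_per_cycle = bus_bits / 8
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


node_interface = bandwidth_gb_per_s(64, 85)      # 64 bit node interface
three_srams = bandwidth_gb_per_s(3 * 32, 85)     # three 32 bit SRAM chips
```

The same arithmetic applied to three 32 bit wide SRAM chips gives the 1.02 GB/s quoted for the complete memory interface in Section 7.2.5.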
The list unit reads 19 bit wide addresses out of a list and is one of the
most trivial units of the design, as it mainly consists of an address counter.
A special bit marks the last list entry.
The mailbox unit is implemented as a mailbox with 8 slots. Each time
an object is handled that is not already present in the mailbox, it is saved
into an empty slot. Because no strategy is implemented to clear the slots
again, a full mailbox stays full. This simple mailbox has been very efficient
in the prototype. It is used at both the top-level and the bottom-level, and
thus works for objects and triangles.
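The behaviour described above can be sketched in a few lines of Python. This is a software illustration only; the class and method names are hypothetical, and the sketch assumes that objects and triangles are identified by integer ids.

```python
class Mailbox:
    """Mailbox with a fixed number of slots and no replacement strategy."""

    def __init__(self, slots: int = 8):
        self.slots = slots  # 8 slots, as stated in the text
        self.ids = []       # ids of objects or triangles already handled

    def seen_before(self, item_id: int) -> bool:
        """Return True if item_id is in the mailbox; otherwise try to store it.

        Once all slots are used, new ids are no longer recorded, so a full
        mailbox stays full and later repeats of those ids are not detected.
        """
        if item_id in self.ids:
            return True
        if len(self.ids) < self.slots:
            self.ids.append(item_id)
        return False
```

A caller would skip the transformation and intersection of an item whenever `seen_before` returns True, and process it otherwise.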
The transformation unit can store an affine transformation for each
packet in the system. This strategy is wasteful, but allows transformations to be read out of memory independently of the transformation, which
simplified the low level design.
The object and triangle transformations are represented by a 4x3 matrix
and only normal consistent triangle matrices that map the triangle normal to
the unit triangle normal are used. Thus the cosine between the ray direction
and triangle normal can be computed in the intersection unit.
The memory interface consists of three caches: one for the k-D tree
nodes, one for the lists and one for the matrices. The FPGA has access to
six 32 bit wide SRAM chips with a 20 bit address space. Three of these
SRAM chips are used by the ray tracing core. The matrix columns are
mapped to all three SRAMs, the 64 bit wide nodes to two of the SRAMs
and the 32 bit wide list entries to one SRAM, as shown in Figure 6.1.
Thus the prototype has the following limitations for the scene size. The
maximum number of k-D tree nodes as well as the number of list entries is
limited to 524288. The number of triangles and/or objects is limited to
131072 in total. Note that it is possible to support scenes with more than
131072 triangles by using objects and instantiating the same object several
times. Thus scenes with several billions of triangles can be visualized.
The small direct mapped caches used (see Table 7.1) proved to be sufficient for a wide range of scenes. The cache size can be adjusted in 10
steps from 2^0 to 2^9 cache lines for simulation purposes. A direct
mapped cache was used (as opposed to a 2-way cache, for instance) because of
the coarse internal granularity of the memory blocks of the FPGA, which come
in 2 kB blocks.
Unit              Cache
Traversal         4 kB
List              2 kB
Transformation    6 kB
Total             12 kB
Table 7.1: Maximum Cache Sizes per Unit (without index structure)
The prototype shader is a simple eye light shading pipeline that uses
a color per triangle and the cosine between the ray direction and triangle
normal which is computed by the RTC. Light rays and reflection rays are
supported in the latest version too. The standard resolution of the prototype
is 512x384.
To increase the cache hit rate, the RGC unit performs no scanline ray
generation, but uses a kind of hardware optimized Hilbert curve. Computing
the image line by line results in bad cache hit rates, as the 2D image space
is not scanned locally. If there is a triangle on the left of the image, it
is very probable that it is no longer in the cache when the complete line is
finished. Therefore it is important to work locally on the image, as the
Hilbert curve does. The Hilbert curve itself, however, is too complicated to
be computed efficiently in hardware.
Figure 7.4: Part (a) shows the recursive pattern that is used to compute
the hardware optimized Hilbert curve in part (b).
The curve used in the prototype can be computed efficiently in hardware
but fulfills the same purpose as the Hilbert curve. The curve is computed by a
simple counter whose bits are interpreted as ... y3 x3 y2 x2 y1 x1 y0 x0.
The coordinates (x[3:0], y[3:0]) generate a curve like the one in Figure 7.4.
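The bit interpretation can be illustrated with a small Python sketch that splits a counter value into its even and odd bits. Read literally, this interleaving corresponds to what is commonly called Z-order or Morton order; whether the prototype applies any additional reordering to obtain the exact pattern of Figure 7.4 is not stated here, so the sketch should be taken as an assumption. The function name is hypothetical.

```python
def counter_to_xy(counter: int, bits: int = 4) -> tuple[int, int]:
    """Split a counter with bit layout ... y1 x1 y0 x0 into (x, y)."""
    x = 0
    y = 0
    for i in range(bits):
        x |= ((counter >> (2 * i)) & 1) << i      # even counter bits form x
        y |= ((counter >> (2 * i + 1)) & 1) << i  # odd counter bits form y
    return x, y


# The first counter values visit a 2x2 block before moving to the next block.
first_pixels = [counter_to_xy(c) for c in range(4)]
```

Because consecutive counter values stay inside small square blocks of the image, rays generated in this order tend to hit the same triangles and tree nodes, which is the locality the cache benefits from. In hardware the mapping needs no logic at all, only wiring.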
By using this curve to generate primary rays the cache hit rate is increased by approximately 10% to 20%, especially for the list and matrix
cache (see Figure 7.5).
[Two plots, both titled "Scene Gael": hit rate (0 to 100) versus number of cache lines (0 to 600) for the Traversal, List and Transformation caches.]
Figure 7.5: Both plots show the cache hit rate depending on the number
of cache lines, with scanline ray generation on the left and the hardware
optimized Hilbert curve on the right.
7.1 Implementation Statistics
This Section gives some statistics on the complexity of the ray tracing prototype. The numbers presented are in each case worst case numbers
derived from statistics reported by the Xilinx routing software.
7.1.1 Gate Count
The complexity of hardware circuits is usually measured in number of gates.
This gate count tells how many NAND gates are necessary to implement the
circuit. In the following Sections gate counts are stated for the prototype,
which are computed using the following mapping.
Unit                             Gate count
full adder                       9
D flip-flop                      6
D flip-flop with clock enable    8
4-input LUT                      1 to 9
3-input LUT                      1 to 6
memory bit                       4
Table 7.2: Gate Count Computation
The source of this data is the Xilinx application note XAPP059 [33].
In addition, dual port memory bits are counted as two single port memory
bits, and the embedded 18-bit multipliers are counted with 7000 gates per
unit. In the computations the worst case gate count for the LUTs is used, and
gates necessary to address the memory bits are ignored.
7.1.2 Complexity
Table 7.3 lists the complexity of one ray tracing pipeline, measured in the
number of floating-point units for addition, multiplication, division and comparison, respectively. The rightmost column additionally lists the amount of
internal memory each unit uses to store ray data, stacks and other
internal data.
Unit                            Add   Mul   Div   Comp   Mem
Traversal                         4     0     4     13   44.5 kB
List                              0     0     0      0    0.8 kB
Transformation                    9     9     0      0    9.3 kB
Intersection                      3     2     1      3    0.0 kB
Cache (with index structure)      0     0     0      0   15.6 kB
Total                            16    11     5     16   70.2 kB
Table 7.3: Complexity of one ray tracing pipeline with 32 packets and 512
cache lines (dual port memory bits counted as 2 bits)
DynamicRTC                  logic gates   bits per packet   memory bits   memory gates
DynamicRTC                       21,338                 0             0              0
Traversal                         8,470                 0             0              0
TraversalMemoryInterface          5,060             1,292        41,344        165,376
TraversalStackPointer             2,568                12           384          1,536
TraversalSlice0                  43,107             2,352        75,264        301,056
TraversalSlice1                  43,107             2,352        75,264        301,056
TraversalSlice2                  43,107             2,352        75,264        301,056
TraversalSlice3                  43,107             2,352        75,264        301,056
PacketTraversalDecision             309                 0             0              0
CollectHits                       4,155               688        22,016         88,064
List                              2,743                76         2,432          9,728
Mailbox                           7,108               136         4,352         17,408
LoadObject                        4,557                19           608          2,432
SendPacket                        2,262             1,152        36,864        147,456
PacketEncoder                     1,316                 0             0              0
Transformation                  148,040             1,152        36,864        147,456
PacketDecoder                       694                72         2,304          9,216
Intersection                    105,972                 0             0              0
Total                           487,020            14,007       448,224      1,792,896
Table 7.4: Gate Count and Memory Bits per Unit using 32 Packets
MemoryInterface     logic gates   bits per cache line   cache memory bits   cache memory gates
MemoryInterface           4,323                     0                   0                    0
NodeCache                 4,152                    83              42,496              169,984
ListCache                 3,704                    51              26,112              104,448
MatrixCache               5,624                   115              58,880              235,520
Total                    17,803                   249             127,488              509,952
Total gates           2,807,671
Table 7.5: Gate Count and Memory Bits per Unit using 512 Cache Lines
Table 7.4 shows the estimated number of gates for each of the units of
the design. It also shows the number of memory bits required per packet
in the system, as well as the memory gates required for the on-chip memory
of a system with 32 packets. Table 7.5 shows the gate count of the memory
interface and caches, as well as the number of bits required per cache line.
A system with 512 cache lines and 32 packets requires at most
2,807,671 gates.
If P is the number of packets in the system and CL the number of cache
lines, then the gate count C_RTC for the complete Dynamic RTC can be
estimated by the following formula:
$C_{RTC} = 487{,}020 + 56{,}028 \cdot P + 996 \cdot CL$
The necessary internal memory bits can be computed by:
$Bits_{RTC} = 14{,}007 \cdot P + 249 \cdot CL$
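The two estimates can be checked against the tables with a few lines of Python. The constants are taken from Tables 7.2, 7.4 and 7.5; the function and constant names are hypothetical. Evaluated literally for 32 packets and 512 cache lines, the gate count formula gives 2,789,868; adding the 17,803 logic gates of the memory interface from Table 7.5, which the printed constant does not seem to include, reproduces the total of 2,807,671 gates.

```python
LOGIC_GATES_RTC = 487_020    # Table 7.4, total logic gates
LOGIC_GATES_MEMIF = 17_803   # Table 7.5, total logic gates
BITS_PER_PACKET = 14_007     # Table 7.4, total bits per packet
BITS_PER_CACHE_LINE = 249    # Table 7.5, total bits per cache line
GATES_PER_MEMORY_BIT = 4     # Table 7.2


def memory_bits(packets: int, cache_lines: int) -> int:
    """Internal memory bits, following the Bits_RTC formula."""
    return BITS_PER_PACKET * packets + BITS_PER_CACHE_LINE * cache_lines


def gate_count(packets: int, cache_lines: int) -> int:
    """Gate count as given by the C_RTC formula in the text."""
    return LOGIC_GATES_RTC + GATES_PER_MEMORY_BIT * memory_bits(packets, cache_lines)


formula_value = gate_count(32, 512)
with_memory_interface = formula_value + LOGIC_GATES_MEMIF
```

The coefficients 56,028 and 996 in the formula are simply 4 gates per memory bit times 14,007 bits per packet and 249 bits per cache line.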
7.2 Performance Statistics
This Section discusses the performance achieved with the ray tracing prototype. It shows the maximal performance as well as some
analysis to estimate the quality of the design. These quality estimates are
based on gate level computations and are thus only of interest for a mapping
to an ASIC, not for an FPGA.
The Section describes several kinds of statistics, which are listed in Appendix A for 4 test scenes.
7.2.1 Hardware Quality Index
It is easy to develop arithmetic units in hardware, but keeping these units
fed with data is very difficult. To feed them, on-chip memory in the form of
registers, stacks and caches is required. This on-chip memory is necessary,
but in contrast to the arithmetic units most of its gates are idle during the
computations.
The following hardware quality index Q_HW therefore describes
the percentage of gates in the chip that are working:
$Q_{HW} = \frac{U_{AU} \cdot C_{AU}}{C_{AU} + C_{IM}} \cdot 100$
The value U_AU is the usage ratio of the arithmetic units and C_AU their
cost in gates. Analogously, C_IM is the cost of the internal memory in
gates.
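The index is easy to evaluate once the three quantities are known. The following Python sketch implements the formula; the function name and the example values are hypothetical and chosen only to show the effect of additional memory, they are not measurements from the prototype.

```python
def hardware_quality_index(usage_au: float, cost_au: float, cost_im: float) -> float:
    """Percentage of gates that are working.

    usage_au: usage ratio of the arithmetic units (0.0 to 1.0)
    cost_au:  cost of the arithmetic units in gates
    cost_im:  cost of the internal memory in gates
    """
    return usage_au * cost_au / (cost_au + cost_im) * 100.0


# Hypothetical numbers: doubling the memory raises the usage only slightly
# here, so the index drops even though the arithmetic units are busier.
small_memory = hardware_quality_index(0.50, 500_000, 2_000_000)
large_memory = hardware_quality_index(0.55, 500_000, 4_000_000)
```

This is the trade-off the index is meant to expose: extra packets or cache lines only pay off if the gain in usage outweighs the gates they cost.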
The hardware quality index can be used to compare two different versions
of the same hardware algorithm. The version with the higher quality index
is to be preferred, as it uses the gates more efficiently. Optimal system
parameters, such as cache size and the number of internal packets, can be
computed using this index.
Figure 7.6 shows the hardware quality index as a function of the number
of packets in the system for two scenes. The best gate usage of about 9.5%
is achieved with 32 packets in the system.
This means that it is more efficient to put several ray tracing pipelines
with 32 packets each onto the chip than a smaller number of pipelines with
more than 32 packets. Because the same holds in the other direction, it is
also better to use pipelines with 32 packets than a larger number of
pipelines with fewer packets each.
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": Hardware Quality Index (0 to 10) versus number of packets (0 to 70).]
Figure 7.6: This Figure shows the Hardware Quality Index of the Dynamic
Ray Tracing Core for the scenes Gael and Conference dependent on the number of packets in the system.
The computed maximum is not optimal for an FPGA architecture, as
there the cost should not be counted in gates. This is because today's FPGAs
contain (besides CLBs) special resources such as blockrams and multiplier
blocks. Thus memory can be much cheaper if these memory blocks can be
used efficiently by the design.
The optimal values for several system parameters depend on each other.
Thus for the ray tracing architecture it is required to take into account
the available memory bandwidth, memory latency and delay, cache size,
packets in the system, pipeline depth of the internal pipelines and the kind
of scene to be handled efficiently. Therefore, in practice it is difficult to build
the perfect system, but using the described index it is possible to compare
different configurations of the hardware.
7.2.2 Graphics Hardware Quality Index
The hardware quality index described in the previous Section has the disadvantage that it makes no statement about the quality of the ray tracing
algorithm used, only about whether the algorithm is computed efficiently.
A different ray tracing algorithm might require fewer traversal
steps to achieve the same result, but many more idle memory resources.
Nevertheless it could be the better choice. The following graphics hardware
quality index Q_GHW can be used to compare different kinds of ray tracing and rasterization hardware algorithms, since it takes into account the
performance in rays shot per cycle achieved by the algorithm.
$Q_{GHW} = \frac{\text{rays per cycle}}{C_{AU} + C_{IM}} \cdot 1{,}000{,}000$
The index Q_GHW describes the number of rays a single gate of the circuit
can shoot through the scene in 1,000,000 clock cycles. For rasterization
hardware, the number of rays shot per cycle has to be replaced by the
number of pixels that are rendered per cycle.
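As a rough plausibility check, the index can be computed from a frame rate. The Python sketch below assumes one primary ray per pixel and no secondary rays, uses the total gate count of 2,807,671 from Section 7.1.2 as C_AU + C_IM, and takes 20 frames per second as an illustrative value inside the reported range; the function names are hypothetical.

```python
def rays_per_cycle(fps: float, width: int, height: int, clock_hz: float,
                   rays_per_pixel: float = 1.0) -> float:
    """Average number of rays shot per clock cycle (primary rays only by default)."""
    return fps * width * height * rays_per_pixel / clock_hz


def graphics_hardware_quality_index(rays_cycle: float, total_gates: float) -> float:
    """Rays a single gate shoots in 1,000,000 clock cycles."""
    return rays_cycle / total_gates * 1_000_000


TOTAL_GATES = 2_807_671  # C_AU + C_IM for 32 packets and 512 cache lines

rpc = rays_per_cycle(fps=20, width=512, height=384, clock_hz=85e6)
q_ghw = graphics_hardware_quality_index(rpc, TOTAL_GATES)
```

Under these assumptions the index comes out at roughly 0.016, which is the order of magnitude shown on the vertical axes of the corresponding plots.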
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": Ray Tracing Quality Index (0 to 0.016) versus number of packets (0 to 70).]
Figure 7.7: This Figure shows the Graphics Hardware Quality Index of the
Dynamic Ray Tracing Core for the scene Gael and Conference dependent
on the number of packets in the system.
Figure 7.7 shows the ray tracing quality of the prototype for two scenes.
The maximal quality is again achieved at a number of 32 packets in the
system. As the rays shot per cycle are proportional to the usage of the
arithmetic units, the hardware quality index and graphics hardware quality
index yield the maximum at the same position.
Unfortunately it is difficult to compute a fair quality index for today's
rasterization hardware, as these chips support many extra features besides
simple rasterization of triangles. In general, however, it can be said that
for scenes consisting of few triangles the quality index for rasterization
hardware will be much higher. In contrast, for scenes with several million
triangles ray tracing will become more efficient at some point.
7.2.3 Usage
The usage of a unit is the percentage of cycles in which it is working. This
usage can be computed for the 4 most important units of the design and it
directly corresponds to the achieved performance.
Therefore it is an important task to adjust the system parameters in such
a way that the usage is fairly high. The usage can be increased by using
more packets in the system to fill the pipeline stages, or by using larger
caches to prevent long wait cycles for memory requests. Both parameters have
to be increased carefully, as too much internal memory is a drawback too,
because the gates it requires compute nothing.
Figure 7.8 shows the usage of the individual units dependent on the
number of packets in the system. The usage increases with the number of
packets in the system as each packet can fill stages of the pipelines.
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": usage (0 to 100) of the Traversal, List, Transformation and Intersection units versus number of packets (0 to 70).]
Figure 7.8: Usage of Units
There are several pipelines in the system that are separated by FIFO
queues (first in, first out queues) and memory interfaces, and each is filled
differently by one packet. A packet fills one pipeline stage in the traversal
unit, since the 4 rays of the packet are traversed in parallel, but normally
5 stages in the transformation pipeline. This is because the rays of the
packet are transformed in sequence: first the ray origin and
then the 4 ray directions.
The usage appears to scale linearly with the number of packets in the
system. But this is only true if there are few packets in the system, as the
total number of pipeline stages limits the linear scaling. It is impossible
to increase the usage any further once the usage of one unit reaches nearly
100%. Even the usage of the other units, which is normally far below 100%,
cannot be increased any further, as there is always a fixed ratio between the
usage values of the units for a given image.
The curves of Figure 7.8 approach the maximal theoretical usage
of each unit in the limit, and there is a fixed factor between any 2 curves
that is independent of the number of packets in the system.
The frame rate as a function of the number of packets in the system
directly corresponds to the usage of the individual units, because the
usage of the units is proportional to the performance achieved.
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": frames per second (0 to 25) versus number of packets (0 to 70).]
Figure 7.9: Frame Rate
7.2.4 Cache Hit Rate
The cache hit rates are an important aspect of ray tracing hardware algorithms, since the required bandwidth behind the caches determines the
number of parallel working units that can be connected to the available
memory interface.
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": cache hit rate (0 to 100) of the Traversal, List and Transformation caches versus number of cache lines (0 to 600).]
Figure 7.10: Cache Hit Rate
Figure 7.10 shows the cache hit rate of the 3 types of caches as a function
of the number of cache lines for the Gael and Conference scenes. The size of
the direct mapped caches required is extremely low, especially for the nodes.
This is because 4 cache lines are required to hold a complete matrix but only
one to hold a k-D tree node, and because the coherence of k-D tree nodes at
the top of the k-D tree is much higher than for nodes at the bottom. The
latter holds because the subspace represented by a node at the top of the
tree is much larger than that of a node near the leaves.
The cache hit rates for the triangle matrices are not satisfactory, but can be
improved using more advanced cache strategies. Thus 2-way or 4-way caches
should achieve much better cache hit rates in an ASIC implementation of
the design.
7.2.5 Memory Bandwidth
One of the most critical points of most types of hardware is the memory
interface as it has to deliver the required bandwidth, otherwise the chip
cannot work to its limit.
One strategy that the ray tracing prototype uses to decrease the required
memory bandwidth is to traverse packets of rays in parallel. Here the k-D
tree nodes, list entries and matrices are fetched only once for 4 rays of a
packet. In spite of this optimization, the required memory bandwidth is
fairly high. Therefore it is necessary to use caches for each of the units in
the pipeline.
[Two plots, "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": frames per second (0 to 25) versus memory bandwidth [MB/s] (0 to 1200).]
Figure 7.11: Achieved performance using 64 packets and 512 cache lines if
the memory bandwidth is scaled by the memory clock ratio factor.
A point of interest is the memory bandwidth needed behind the caches,
which is analysed in Figure 7.11. The maximal memory bandwidth of the
RTC to the 3 SRAM chips is 1.02 GB/s at 85 MHz. The Figure shows
how the performance drops if the memory bandwidth behind the caches is
reduced to the specified value. Note that for most scenes it is possible to use
4 ray tracing pipelines in parallel, as scaling the memory bandwidth by a
factor of 1/4 produces a drop in performance of only about 20%. The
conference scene is an exception, as the performance drops sharply if the
memory bandwidth is limited. This shows that larger or more efficient caches
are required for this scene.
The data of Figure 7.11 can be used to compute a worst case frame
rate for the case that several parallel prototype RTCs, together with their
small caches, are connected to the 1.02 GB/s memory interface. For instance,
the performance of two RTC units at the 1.02 GB/s memory interface is higher
than twice the performance that one unit reaches at a 0.5 GB/s memory
interface. Figure 7.12 shows the performance that would be possible if the
specified number of pipelines worked in parallel at the 1.02 GB/s memory
interface.
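The worst case estimate can be sketched as follows: each of N units is assumed to receive only 1/N of the total bandwidth, its frame rate is read off the measured bandwidth curve, and the result is multiplied by N. The Python sketch below uses linear interpolation between measured points. The curve values are made up for illustration (they only mimic a drop of about 20% at a quarter of the bandwidth) and are not data from Figure 7.11; all names are hypothetical.

```python
def fps_at(bandwidth_mb_s: float, curve: list[tuple[float, float]]) -> float:
    """Linearly interpolate the frame rate in a (bandwidth, fps) curve."""
    pts = sorted(curve)
    if bandwidth_mb_s <= pts[0][0]:
        return pts[0][1]
    for (b0, f0), (b1, f1) in zip(pts, pts[1:]):
        if bandwidth_mb_s <= b1:
            t = (bandwidth_mb_s - b0) / (b1 - b0)
            return f0 + t * (f1 - f0)
    return pts[-1][1]  # beyond the last measured point


def worst_case_fps(units: int, curve: list[tuple[float, float]],
                   total_bandwidth_mb_s: float = 1020.0) -> float:
    """Lower bound on the frame rate of `units` pipelines sharing one interface."""
    return units * fps_at(total_bandwidth_mb_s / units, curve)


# Hypothetical single-pipeline curve: (memory bandwidth in MB/s, frames per second).
EXAMPLE_CURVE = [(0.0, 0.0), (255.0, 16.0), (510.0, 19.0), (1020.0, 20.0)]

estimates = {n: worst_case_fps(n, EXAMPLE_CURVE) for n in (1, 2, 4)}
```

The estimate is pessimistic because a unit sharing the interface can use bandwidth that the other units leave idle, which a fixed 1/N partition does not allow.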
[Two plots, "Scalability, Scene Gael, 512x384, 85 MHz" and "Scalability, Scene Conference, 512x384, 85 MHz": frames per second versus number of RTC units (0 to 16).]
Figure 7.12: Scalability
7.2.6 Performance
The ray tracing prototype achieves a real time performance of 10 to
30 frames per second for a wide range of scenes at a resolution of 512x384.
For a detailed overview of the performance reached for 4 test scenes see
Appendix A.
Depending on the routing achieved by the Xilinx software, maximal frequencies of 85 to 92 MHz are possible. For the statistics in Appendix A and
the following performance values, the lower value of 85 MHz is used.
At a frequency of 85 MHz, the prototype has a floating point performance
of 4.08 billion floating point operations per second, which is a fairly low
value compared to today's rasterization hardware. The frequency of the
prototype cannot be increased much further, because the internal 18-bit wide
multiplier blocks allow scaling to at most 110 MHz.
At most 85 million packet traversal steps per second can be performed,
which is equivalent to 340 million single ray traversal steps. The transformation unit can transform approximately 68 million rays per second (if the
packets are compressible), and consequently the same number of triangle
intersections can be performed.
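These peak figures can be reproduced from numbers given elsewhere in this chapter. The Python sketch below assumes that every floating-point unit of Table 7.3 (including the comparison units) completes one operation per cycle, that one packet traversal step is issued per cycle, and that a compressed packet occupies 5 cycles of the transformation pipeline (one origin plus 4 directions, see Section 7.2.3). The counting convention is an assumption; it is used here because it matches the stated values.

```python
CLOCK_HZ = 85e6
FP_UNITS = 16 + 11 + 5 + 16          # add, mul, div, comp units from Table 7.3
RAYS_PER_PACKET = 4
TRANSFORM_CYCLES_PER_PACKET = 5      # 1 shared origin + 4 directions

# Peak floating point operations per second, one operation per unit and cycle.
peak_flops = FP_UNITS * CLOCK_HZ

# One packet traversal step per cycle, 4 rays per packet.
packet_steps_per_s = CLOCK_HZ
ray_steps_per_s = packet_steps_per_s * RAYS_PER_PACKET

# Rays leaving the transformation unit per second for compressed packets.
transformed_rays_per_s = CLOCK_HZ / TRANSFORM_CYCLES_PER_PACKET * RAYS_PER_PACKET
```

If a packet cannot be compressed, because its rays do not share one origin, more than 5 cycles are needed per packet and the transformation rate drops accordingly.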
Chapter 8
Conclusion
This thesis has shown that creating special purpose real-time hardware for
ray tracing is possible, even on FPGAs with their limited CLB and memory
resources. The FPGA used is not the best available today, as there are
new FPGA chips with about 60% more CLBs and four times more memory
and multiplier blocks. These memory and multiplier resources in particular
have been the most limiting factor in the prototype. Thus, using these new
FPGAs, a ray tracing chip with two or four ray tracing pipelines should be
possible.
By mapping the architecture to an ASIC it would be possible to do ray
tracing at a resolution of 1024x768 in real-time, even if some secondary rays
are shot. This is because the capacity of today's high end ASICs is in the
range of 52 million gates using a 0.095 µm silicon gate CMOS process. Since a
ray tracing pipeline requires 2.8 million gates, at most 18 ray tracing units
could be placed on the chip. But because programmable shaders are required, as
well as some larger caches to feed the units working in parallel, a number of
8 ray tracing pipelines per ASIC would be realistic. In conjunction with an
increase of the frequency to about 266 MHz, a high end ASIC implementation
would achieve about 20 times the performance of the prototype.
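The arithmetic behind this estimate can be written down in a few lines of Python. The sketch assumes ideal linear scaling with both the number of pipelines and the clock frequency, which is an assumption of this example and gives an upper bound of roughly 25; the figure of about 20 quoted above is more conservative. All names are hypothetical.

```python
ASIC_GATES = 52_000_000      # capacity of a high end ASIC, as stated in the text
PIPELINE_GATES = 2_800_000   # gates per ray tracing pipeline
PROTOTYPE_MHZ = 85
ASIC_MHZ = 266
REALISTIC_PIPELINES = 8      # leaves room for shaders and larger caches

# Largest number of complete pipelines that fit into the gate budget.
max_pipelines = ASIC_GATES // PIPELINE_GATES

# Ideal speedup over the single-pipeline prototype (upper bound only).
ideal_speedup = REALISTIC_PIPELINES * ASIC_MHZ / PROTOTYPE_MHZ
```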
Because the described hardware architecture supports structured motion
the scene has to be partitioned into movable objects. A main part of the
traversal algorithm for such partitioned scenes is to transform the ray to the
local coordinate system of the object to continue the traversal in it. This
operation requires an affine ray transformation unit, which is fairly costly.
To reduce the required floating point resources on the chip, this transformation unit is also used to intersect rays with triangles. This is possible using the
described unit triangle intersection method. A further optimization
exploits the fact that most packets of rays share the same ray origin,
which therefore needs to be transformed only once per packet.
The last Sections showed how optimal values for several system parameters, such as the number of packets and cache lines, can be computed. This is important for mapping the architecture to an ASIC, since there it is necessary,
for cost reduction purposes, to use the available gates as efficiently as possible.
In spite of the small caches it would be possible to use 2 or 4 ray tracing
cores in parallel at the described memory interface delivering 1.02 GB per
second. Using larger, more advanced caches and some cache hierarchy it will
be possible to use many more units in parallel.
Chapter 9
Future Work
Of course the development of the ray tracing prototype is not yet finished.
To support larger scenes, cheaper DRAM resources should be used as a
scene database. The Alpha Data development platform used contains 256
MB of DRAM on a 64 bit wide interface, but because of their simpler
protocol only the SRAM resources have been used.
In spite of the fact that the top-level k-D tree was rebuilt fast enough
on the host PC for our test scenes, hardware support for this operation
should be provided, especially if the number of objects gets too large. This
hardware support should also be available for k-D trees consisting of
triangles, because then vertex shaders can be used to modify the positions of
the triangle vertices, followed by a k-D tree reconstruction.
Up to now the ray tracing prototype supports only a simple fixed eye
light shading model. This shader should be replaced by some programmable
special purpose shading CPUs that perform the color and secondary ray
computation. Shading CPUs are necessary because of the wide range of
shading models available for the ray tracing application.
The prototype uses a k-D tree as acceleration structure, but in fact
no analysis has been done of whether this is the best choice for a hardware
ray tracing approach. Indeed the k-D tree algorithm seems to be the best
choice in software based systems [3], but some other acceleration structures
can be implemented using fairly simple traversal units. For the regular grid acceleration structure, for instance, there exist simple traversal algorithms based
on integer arithmetic. This integer arithmetic results in a much flatter
traversal unit, which consequently requires fewer packets in the system.
Furthermore no stack is required in the grid traversal algorithm.
Chapter 10
Appendix A
The following Sections show statistics of the four test scenes used to test
the prototype. The statistics diagrams shown are discussed in detail in
Section 7.2. Unless specified otherwise, the standard configuration for the
statistics is a resolution of 512x384, 64 packets in the system, 512 cache
lines and the hardware optimized Hilbert curve.
The last two diagrams of each scene show a walk through the scene, to
illustrate the typical frame rates that are achieved.
10.1 Office
Objects               1
Total Triangles       34,313
FPGA Scene Size       3.7 MB
Typical Frame Rate    20-30 fps
Resolution            512x384
[Plots for scene Office, 512x384, 85 MHz: frames per second and unit usage (Traversal, List, Transformation, Intersection) versus number of packets; Hardware Quality Index and Ray Tracing Quality Index versus number of packets; scalability (frames per second versus number of RTC units); cache hit rate (Traversal, List, Transformation) versus number of cache lines; frames per second and unit usage versus memory bandwidth [MB/s]; frames per second and unit usage versus frame number.]
10.2 Gael
Objects               1
Total Triangles       68,624
FPGA Scene Size       7.0 MB
Typical Frame Rate    17-25 fps
Resolution            512x384
[Plots for scene Gael, 512x384, 85 MHz: frames per second and unit usage (Traversal, List, Transformation, Intersection) versus number of packets; Hardware Quality Index and Ray Tracing Quality Index versus number of packets; scalability (frames per second versus number of RTC units); cache hit rate (Traversal, List, Transformation) versus number of cache lines; frames per second and unit usage versus memory bandwidth [MB/s]; frames per second and unit usage versus frame number.]
10.3 Conference
Objects               54
Total Triangles       282,801
FPGA Scene Size       5.3 MB
Typical Frame Rate    17-20 fps
Resolution            512x384
[Plots for scene Conference, 512x384, 85 MHz: frames per second and unit usage (Traversal, List, Transformation, Intersection) versus number of packets; Hardware Quality Index and Ray Tracing Quality Index versus number of packets; scalability (frames per second versus number of RTC units); cache hit rate (Traversal, List, Transformation) versus number of cache lines; frames per second and unit usage versus memory bandwidth [MB/s]; frames per second and unit usage versus frame number.]
10.4
Trees4000
Objects              4,000
Total Triangles      20 Million
FPGA Scene Size      3.4 MB
Typical Frame Rate   8-14 fps
Resolution           512x384

[Figure: Scene trees4000, 512x384, 85 MHz. Usage (Traversal, List, Transformation, Intersection) and Frames per Second (fps) versus Packets.]
[Figure: Scene trees4000, 512x384, 85 MHz. Ray Tracing Quality Index and Hardware Quality Index versus Packets.]
[Figure: Scalability, Scene trees4000, 512x384, 85 MHz. Frames per Second (fps) versus RTC Units.]
[Figure: Scene trees4000, 512x384, 85 MHz. Cache Hit Rate (Traversal, List, Transformation) versus Cache Lines.]
[Figure: Scene trees4000, 512x384, 85 MHz. Usage (Traversal, List, Transformation, Intersection) and Frames per Second (fps) versus Memory Bandwidth [MB/s].]
[Figure: Scene trees4000, 512x384, 85 MHz. Usage (Traversal, List, Transformation, Intersection) and Frames per Second (fps) versus Frame Number.]
Bibliography
[1] http://www.nvidia.com. Geforce3 - the world’s most advanced processor, 2001.
[2] Peter Shirley. Fundamentals of Computer Graphics. A K Peters Ltd,
June 2002.
[3] Vlastimil Havran. Heuristic Ray Shooting Algorithms. PhD thesis,
Department of Computer Science and Engineering, Faculty
of Electrical Engineering, Czech Technical University in Prague,
http://www.cgg.cvut.cz/~havran/phdthesis.html, November 2000.
[4] Ingo Wald, Thomas Kollig, Carsten Benthin, Alexander Keller, and
Philipp Slusallek. Interactive Global Illumination using Fast Ray Tracing. Rendering Techniques 2002, pages 15–24, 2002. (Proceedings of
the 13th Eurographics Workshop on Rendering).
[5] Ingo Wald and Philipp Slusallek. State-of-the-Art in Interactive Ray Tracing. In State of the Art Reports, Eurographics 2001, pages 21–42,
2001.
[6] Ingo Wald, Carsten Benthin, Markus Wagner, and Philipp Slusallek.
Interactive Rendering with Coherent Ray Tracing. Computer Graphics
Forum (Proceedings of EUROGRAPHICS 2001), 20(3), 2001.
[7] Ingo Wald, Philipp Slusallek, and Carsten Benthin. Interactive Distributed Ray Tracing of Highly Complex Models. In Proceedings of the
12th EUROGRAPHICS Workshop on Rendering, June 2001. London.
[8] Jörg Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR – A Hardware Architecture for Ray Tracing. In Proceedings of Eurographics
Workshop on Graphics Hardware, pages 27–36, 2002.
[9] John V. Oldfield and Richard C. Dorf. Field Programmable Gate Arrays. Wiley-Interscience, January 1995.
[10] Michael John Sebastian Smith. Application-Specific Integrated Circuits.
Addison-Wesley, June 1997.
[11] Stuart A. Green and Derek J. Paddon. Exploiting coherence for multiprocessor ray tracing. IEEE Computer Graphics and Applications,
9(6):12–26, 1989.
[12] Stuart A. Green and Derek J. Paddon. A highly flexible multiprocessor
solution for ray tracing. The Visual Computer, 6(2):62–73, 1990.
[13] Tony T.Y. Lin and Mel Slater. Stochastic Ray Tracing Using SIMD
Processor Arrays. The Visual Computer, pages 187–199, 1991.
[14] Michael J. Muuss. Towards real-time ray-tracing of combinatorial solid
geometric models. In Proceedings of BRL-CAD Symposium ’95, June
1995.
[15] M. J. Keates and Roger J. Hubbold. Interactive ray tracing on a virtual shared-memory parallel computer. Computer Graphics Forum,
14(4):189–202, 1995.
[16] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike Sloan. Interactive ray tracing. In Interactive 3D Graphics
(I3D), pages 119–126, April 1999.
[17] Steven Parker, Michael Parker, Yarden Livnat, Peter Pike Sloan, Chuck
Hansen, and Peter Shirley. Interactive ray tracing for volume visualization. IEEE Transactions on Visualization and Computer Graphics,
5(3), 1999.
[18] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter Pike Sloan. Interactive ray tracing for isosurface rendering. In IEEE
Visualization ’98, 1998.
[19] Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Rendering
complex scenes with memory-coherent ray tracing. Computer Graphics,
31(Annual Conference Series):101–108, August 1997.
[20] Advanced Rendering Technologies. http://www.art-render.com.
[21] D. Hall. The AR350: Today’s ray trace rendering processor. In Proceedings of the Eurographics/SIGGRAPH workshop on Graphics hardware
- Hot 3D Session 1, 2001.
[22] Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, and
Larry Seiler. The VolumePro real-time ray-casting system. In Computer
Graphics 31, pages 251–260, 1999.
[23] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz.
Smart Memories: A Modular Reconfigurable Architecture. IEEE International Symposium on Computer Architecture, 2000.
[24] Timothy Purcell. The SHARP Ray Tracing Architecture. SIGGRAPH
course on Interactive Ray Tracing, 2001.
[25] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan.
Ray Tracing on Programmable Graphics Hardware. In Proceedings of
SIGGRAPH 2002, 2002.
[26] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed Interactive Ray Tracing of Dynamic Scenes. In Proceedings of the IEEE Symposium on Parallel and Large-Data Visualization and Graphics (PVG),
2003.
[27] Erik Reinhard, Brian Smits, and Chuck Hansen. Dynamic acceleration
structures for interactive ray tracing. In Proceedings of SIGGRAPH,
2002.
[28] Allen Y. Chang. A Survey of Geometric Data Structures for Ray Tracing. Technical report, Polytechnic University, October 2001.
[29] Emo Welzl. Smallest enclosing disks (balls and ellipsoids), chapter New
Results and New Trends in Computer Science (H. Maurer, ed.), pages
359–370. 1991.
[30] Alphadata. www.alpha-data.com.
[31] Xilinx, Virtex2-6000 FPGA. www.xilinx.com/virtex2.
[32] Peter Bellows and Brad Hutchings. JHDL - An HDL for Reconfigurable
Systems. Technical report, Department of Electrical and Computer
Engineering, www.jhdl.org.
[33] Xilinx. www.xilinx.com.