Parallel Tessellation Using Compute Shaders
Group 1:
David Sierra
Matthew Faller
Erwin Holzhauser
April 27th, 2015
Sponsored by:
Table of Contents
Executive Summary
Project Motivation
Uses of Tessellation
About Tessellation
Tessellation Hardware
Specifications and Requirements
Implementation Specification
Metrics
Research
Integrated Development Environment Choice
Code::Blocks (Codeblocks)
Dev-C++
Eclipse
Microsoft Visual Studio
Various GPU Programming Languages
C++ AMP
OpenCL
DirectCompute (DX Compute)
CUDA
Microsoft Reference Rasterizer
OpenGL Specification
Detailed Design
Tessellating Isolines
General Overview
Input Description
Output Description
Processing tessellation factors
Point Generation
PlacePointIn1D
Point Connectivity
Proposed Parallelism Technique Design
Parallelizing Point Generation
Parallelizing Point Connectivity
Attempted Isoline Parallelization Techniques
Run Times of the various implementations
Tessellation of Triangles
Processing Tessellation Factors
Point Generation
Point Connectivity
Parallel Triangle Tessellation Design
General Overview of Quads
Input Description
Output Description
Process Tessellation Factors
Point Generation
Point Connectivity
Parallel Quad Tessellation Design
High Level Design
Detailed Design
Processing Tessellation Factors
Point Generation
Point Connectivity
Attempted Parallel Implementations
Experimental Results
Design Summary
Isolines
Triangles
Quads
Project Administration
Facilities and Equipment
Personal Work
Erwin Holzhauser
Matthew Faller
David Sierra
Lessons Learned
Erwin Holzhauser
Matthew Faller
David Sierra
Project Plan and Milestones
Testing Methodology
Testing Harness
Test Cases
Error Reporting Conventions
Project Summary and Conclusions
Executive Summary
Tessellation is a process in which low detail surfaces are subdivided into higher
detail surfaces. This allows developers to save memory and bandwidth by only
having them supply low detail models over the memory bus. Then when they need
higher detail models, the graphics card can dynamically generate additional detail
in real time.
Figure 1: In this image we're tessellating a low resolution teapot. The face selected in red is a
quad. Source: http://caig.cs.nctu.edu.tw/
AMD and other hardware vendors usually implement this functionality in fixed
function hardware on the GPU die. This hardware is extremely fast and efficient
but it can only be used for tessellation. This means that when the hardware isn’t
tessellating, it is sitting idle and doing nothing. The scope of our project is to explore
a more general purpose software approach using the general purpose shading
units on the GPU.
The main goal of our project is to implement the entirety of the tessellator’s logic as
a compute shader. Our project sponsor, Advanced Micro Devices, Inc. (AMD),
imposed a language choice of Microsoft’s DirectCompute because they believed
it would be the easiest to use overall. DirectCompute is a high level language used
to program the graphics card’s general purpose shader units; it was released
alongside DirectX 11. Our main deliverable would then just be a collection of
DirectCompute source files (.hlsl) capable of handling all input values the
tessellator can possibly expect and generating the correct output for each case.
Our other expected deliverable is a detailed performance report comparing the
performance of our software implementation to the fixed function hardware’s
performance. The performance metric AMD is expecting us to measure is triangles
per second generated by the tessellator. All development will be done using
Microsoft Visual Studio because of its tight integration with DirectCompute making
it easier to develop and profile our code.
Before the algorithm is described, we will take a moment to describe what a patch
is. A patch is a grid of points that maps onto the face of a model and describes how
it will look in the 3D world. In order for two patches to connect, they must have the
same number of points along their shared edge, and those points must be spaced in
exactly the same way. In essence, this means that the outer edges of a patch must
match up with those of adjacent patches. It also means that the interior of a patch
can be subdivided in any way.
Figure 2: These 2 radically different patches can connect because their outer points align
The tessellation algorithm takes in a standard set of inputs. The first is tessellation
shape. This can either be Isoline, Triangle, or Quad. Each value corresponds to a
target shape the algorithm is expected to generate. Isoline is a grid of lines,
Triangle is just a triangle, and Quad is a rectangle composed of triangles. The
second input is a set of outer tessellation factors, one for each edge. So for
example, a quad would need 4 outer tessellation factors while a triangle would only
need 3. The outer tessellation factors are different for each edge because the
output patch (grid of points) needs to be able to connect to other patches as
described above. The third input is a pair of inner tessellation factors. Even though
the user inputs 2 inner tessellation factors, Isolines use none of them and Triangles
only use 1. These inner tessellation factors describe how the inside of the shape
will be divided.
The last input is the tessellation partitioning mode. This can either be Integer,
Pow2, Fractional Odd, or Fractional Even. These values describe how the points
will be spaced out. Integer and Pow2 modes produce evenly spaced points while
the fractional modes allow a much wider range of input factors.
Figure 3: Fractional Even vs Fractional Odd mode on a quad
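The inputs described above can be summarized as a small data structure. The following is only a sketch: every type and function name in it is hypothetical and is not taken from the reference tessellator, and the isoline outer factor count of 2 is an assumption based on the isoline design described later in this report.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// All type and function names here are hypothetical, chosen for illustration.
enum class Domain { Isoline, Triangle, Quad };
enum class Partitioning { Integer, Pow2, FractionalOdd, FractionalEven };

// Outer factors: one per edge for triangles and quads. The isoline count of 2
// is an assumption based on the isoline design section of this report.
inline int outerFactorCount(Domain d) {
    switch (d) {
        case Domain::Isoline:  return 2;
        case Domain::Triangle: return 3;
        case Domain::Quad:     return 4;
    }
    return 0;
}

// Inner factors actually used: isolines use none, triangles 1, quads 2.
inline int innerFactorCount(Domain d) {
    switch (d) {
        case Domain::Isoline:  return 0;
        case Domain::Triangle: return 1;
        case Domain::Quad:     return 2;
    }
    return 0;
}

// One patch worth of tessellator input; unused factor slots are ignored.
struct TessellatorInput {
    Domain domain;
    Partitioning partitioning;
    std::array<float, 4> outer;
    std::array<float, 2> inner;
};

// Read an outer factor, checking that the edge exists for this domain.
inline float outerFactor(const TessellatorInput& in, int edge) {
    assert(edge >= 0 && edge < outerFactorCount(in.domain));
    return in.outer[static_cast<std::size_t>(edge)];
}
```

Storing all six factors in fixed-size arrays keeps the structure the same size for every domain, at the cost of a few unused slots for isolines and triangles.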
The general purpose shaders that we will be programming for contain a very
large number of ALUs. The card given to us by AMD to work with contains
over 2800 of them! Knowing this, we hope to outperform the specialized hardware
by utilizing thousands of ALUs to parallelize our calculations.
Project Motivation
Uses of Tessellation
Tessellation is a powerful feature that can give 3d objects an incredible amount of
detail without loading a large mesh file onto the GPU. Instead a smaller 3d model
that takes up less space is moved to the GPU, freeing up space for other resources
such as textures. Whenever the model is drawn, detail is added via tessellation
before textures and lighting calculations have been applied, giving the same effect
as if the large model had been utilized.
Figure 4: Tessellated Toad. Credit: Crytek, CryEngine 3 tech demo.
Sometimes, when a model is viewed from a distance, a low level of detail is
acceptable or even preferred since it can be drawn faster. This is a common and
important optimization technique when rendering a complicated scene with
hundreds of meshes.
Commonly this is referred to as LOD (level of detail). Traditionally, level of detail
had to be handled by creating multiple versions of the same model, each
with fewer polygons. Not only is this technique tedious for a 3D artist
to implement, but it can also be very costly since creating these additional models
is very time consuming. Implementing LOD using tessellation significantly reduces the
amount of time that the artist needs to spend remaking the same model ad nauseam.
Figure 5: Face Normals pointing out away from the mesh into the environment. Credit:
http://flylib.com/books/en/2.451.1.14/1/
Dynamic Tessellation can be used to perform LOD on a per-triangle basis,
depending on a number of desired factors. Most often each triangle is tessellated
based on its distance from the camera, but the level of detail can also be controlled by the
angle between its face normal and the camera.
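As a rough illustration of distance-based LOD, the sketch below maps a camera distance to a tessellation factor. The function name, its parameters, and the linear falloff are all assumptions made for this example; a real hull shader could use any mapping it likes.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: full detail at or inside nearDist, minimum detail at or
// beyond farDist, and a linear blend in between. The linear falloff is an
// assumption made for illustration only.
inline float tessFactorFromDistance(float distance, float nearDist, float farDist,
                                    float minFactor, float maxFactor) {
    assert(nearDist < farDist && minFactor <= maxFactor);
    // Normalize the distance to [0, 1] across the near/far range.
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::min(1.0f, std::max(0.0f, t));
    // t = 0 gives maxFactor (close to the camera), t = 1 gives minFactor.
    return maxFactor + t * (minFactor - maxFactor);
}
```

For example, with a near distance of 10, a far distance of 100, and factors ranging from 1 to 64, a triangle halfway between the two distances would receive a factor of 32.5.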
About Tessellation
Tessellation is a stage in the DirectX pipeline that allows a mesh object to become
more complex. In brief, tessellation can give a virtual environment an
unprecedented level of high quality visuals. The Direct3D 11 pipeline is split into a
series of stages, three of which pertain directly to tessellation: the Hull Shader,
Tessellator, and Domain Shader.
Hull Shader Stage → Tessellator Stage → Domain Shader Stage
The hull shader calculates on a per-patch basis the level of detail needed for the
particular patch. The desired detail is controlled by the tessellation factors that the
hull shader determines. When the hull shader has finished calculating all of the
factors for a patch, the factors are passed to the tessellator.
The tessellator is responsible for generating primitives of three domain types:
Isolines
o A simple line
Triangles
o A simple triangle shape
Quads
o A quadrilateral composed of triangles
The tessellator subdivides one of these primitive geometries into one that has more
segments. In the case of isolines, it produces a line composed of additional points
and also outputs multiple displaced instances of the line.
Figure 6: The original, undivided line is on the left with the new lines on the right hand side.
Figure 7: Output for quads (left) and Triangles (right)
When the tessellator has run to completion, the next stage of the pipeline, the
domain shader, is called. The hull shader also has the option of passing the
tessellator factors that will cause the entire patch to be culled. In such a case, the
tessellator generates no output for that patch, so there are no new points for the
domain shader to process.
The domain shader takes the barycentric UV coordinates output by the tessellator
and calculates the correct positioning for the new vertex in 3D space for each of
these coordinates. Typically the domain shader computes the new vertex positions
by evaluating a curve or surface representation such as Bézier, B-Spline, or
NURBS.
[Diagram: Hull Shader (tessellation factor 10, i.e. 10 segments) → Tessellator → Domain Shader]
Figure 8: The flow of a single isoline patch through the tessellation pipeline. The less detailed isoline is
subdivided by the tessellator, then moved by the domain shader into a smooth arc.
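The flow in Figure 8 can be sketched on the CPU as follows. This is an illustrative stand-in, not the actual shader code: the function names are hypothetical, the parameter generation covers only evenly spaced (integer) partitioning, and the choice of a cubic Bézier curve with four control points is an assumption for the example.

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Stand-in for the tessellator: a line with `segments` segments has
// segments + 1 evenly spaced parameters in [0, 1] (integer partitioning only).
inline std::vector<float> evenlySpacedParams(int segments) {
    assert(segments >= 1);
    std::vector<float> u;
    for (int i = 0; i <= segments; ++i)
        u.push_back(static_cast<float>(i) / static_cast<float>(segments));
    return u;
}

// Stand-in for the domain shader: evaluate a cubic Bezier curve at parameter u
// using the Bernstein basis, turning a 1D parameter into a 3D position.
inline Vec3 cubicBezier(const Vec3 (&p)[4], float u) {
    const float v  = 1.0f - u;
    const float b0 = v * v * v;
    const float b1 = 3.0f * v * v * u;
    const float b2 = 3.0f * v * u * u;
    const float b3 = u * u * u;
    return { b0 * p[0].x + b1 * p[1].x + b2 * p[2].x + b3 * p[3].x,
             b0 * p[0].y + b1 * p[1].y + b2 * p[2].y + b3 * p[3].y,
             b0 * p[0].z + b1 * p[1].z + b2 * p[2].z + b3 * p[3].z };
}
```

With a tessellation factor of 10, `evenlySpacedParams(10)` yields 11 parameters, and calling `cubicBezier` on each one bends the straight subdivided line into a smooth arc that starts at the first control point and ends at the last.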
Tessellation Hardware
The primitive generation portion of tessellation is implemented on special fixed-function
hardware. This hardware is designed to calculate the primitive point
generation and primitive index connectivity quickly, and does an adequate job.
However, using fixed hardware of this nature has several downsides:
1. Only one use
Time, effort, and money must go into the design, integration, and
testing of complicated hardware that has zero reuse.
2. Limited Bandwidth
The hardware only has a limited amount of throughput and cannot
scale when the GPU demands more tessellation.
3. Takes up space on the GPU die
The hardware throughput could increase by taking up extra space on
the die with more powerful hardware. However, this would mean
more power consumption and less chip space for other more
important components. So there is a practical limit to the resources
that can be dedicated to this hardware.
Shader Performance Gains
The shaders take advantage of the general purpose computing power now
available on modern GPUs via the use of compute languages. These compute
shaders run in a similar fashion to pixel and vertex shaders, applying a single
instruction concurrently across thousands of pieces of data. Not only could an
intelligent implementation be fast, it has the potential to outperform the
hardware. In addition, as the general purpose processing cores on the GPU
increase in performance, so too will a parallel shader implementation.
Figure 9: B-Spline algorithm with six control points interacting
with an isoline. Credit: http://en.wikipedia.org/wiki/B-spline
Specifications and Requirements
Implementation Specification
The implementation needs to process many threads in parallel
o Must use HLSL compute shader
o The exact structure will be discussed in detail in a later section, but
there are several ways we might split it up.
Give each patch its own thread to perform tessellation.
For a given triangle, split up the calculation into many
smaller calculations, i.e. divide and conquer in parallel.
Operate on a per-patch level performing each point
generation / index connection in parallel.
The system will be faster than the Microsoft Reference Rasterizer.
o The reference rasterizer is a very naïve implementation, so hopefully
gaining speed over it will not prove difficult.
The system will be faster than the AMD tessellation hardware.
o It is important that we end up with much higher throughput (maybe
an entire mesh can be tessellated faster with our implementation).
o It is worth noting that we also need to not tie up too many resources
on the GPU. If our implementation is fast, but consumes the entire
GPU, this is also no good.
The system will tessellate three domains: isolines, triangles, and quads.
o Each of these has its own tessellation factors that affect how the
geometry is tessellated.
o There are also 4 different ways to partition the geometry
Fractional odd
Fractional even
Integer
Power of 2
o 6 tessellation factors per patch
Metrics
Our implementation needs to match the output of the Microsoft Reference
Rasterizer bit-for-bit.
o This metric will take our output, the reference output, and run a
simple diff to see if there is a match.
Performance will be measured in triangles per GPU clock cycle
Performance will be measured using AMD proprietary diagnostics tools
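A bit-for-bit comparison differs from an ordinary floating point comparison, so a minimal sketch of one possible check is shown here. The function names are hypothetical and this is not the project's actual test harness; it only illustrates comparing raw bit patterns rather than numeric values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns the index of the first element whose bit
// pattern differs, or -1 if the buffers are identical. A length mismatch
// counts as a difference at the shorter buffer's length.
inline long firstBitMismatch(const std::vector<float>& ours,
                             const std::vector<float>& reference) {
    const std::size_t n =
        ours.size() < reference.size() ? ours.size() : reference.size();
    for (std::size_t i = 0; i < n; ++i) {
        // memcmp compares raw bytes, so 0.0f and -0.0f are reported as
        // different even though they compare equal with operator==.
        if (std::memcmp(&ours[i], &reference[i], sizeof(float)) != 0)
            return static_cast<long>(i);
    }
    return ours.size() == reference.size() ? -1 : static_cast<long>(n);
}

// Convenience wrapper for a test harness: abort if any bit differs.
inline void requireBitExact(const std::vector<float>& ours,
                            const std::vector<float>& reference) {
    assert(firstBitMismatch(ours, reference) == -1);
}
```

Reporting the index of the first mismatch, rather than a plain pass or fail, makes it easier to trace a discrepancy back to the point that produced it.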
Research
Integrated Development Environment Choice
The reference rasterizer given to us was written in C++ so our range of IDEs to
choose from was actually quite large. The IDEs we tried were Code::Blocks, Dev-C++,
Eclipse, and Visual Studio.
Figure 10: Screenshot of Code::Blocks
Code::Blocks (Codeblocks)
The first IDE we tried was Codeblocks. We came to it first because it is used very
frequently in the UCF undergrad course track. We also came in knowing that it was
a simple IDE for simple projects, but decided to give it an honest try anyway as it
would reduce potential time wasted learning the ins and outs of a new IDE.
Although it proved adequate for our simpler projects, once our project scope began
to expand and our lines of code started to balloon, Codeblocks struggled to keep
up.
Dev-C++
Figure 11: Screenshot of Dev-C++
The second IDE we tried was Dev-C++. We decided to try it because one of our
group members had used it before in a programming class. At first it seemed pretty
good, but we soon realized it suffered from the same problem as Codeblocks:
it does not scale well to larger projects. Even worse, Dev-C++ is now sparsely
updated. Not only is this an undesirable trait in general, but GPU programming is a
relatively new and growing field and we would like software that can keep up with it.
Eclipse
Third, we tried Eclipse. Eclipse actually surprised us as a powerful C++ IDE. Our
entire group had only known of it as “the IDE from Java class”, so when we found
out that it supported C++ and actually had tons of features on top of that we were
stoked. After our first meeting with AMD we were told to just have fun exploring OpenCL
and HLSL for a while. We initially chose Eclipse and OpenCL because they were
open source and cross platform. Working with Eclipse and OpenCL was our
group’s first foray into GPU computing and we had very few complaints. Eclipse
and OpenCL both had plenty of tutorials and documentation online. The biggest
problem we had with Eclipse was not even its fault. During our next meeting at
AMD we were told that we would be using Microsoft’s High Level Shading Language
(HLSL). This made it very obvious that we would have to learn to use Visual Studio
in order to get the most out of the language.
Figure 12: Screenshot of Eclipse
Microsoft Visual Studio
Finally we ended up at Visual Studio 2013. The primary reason we ended up here
was that it is tightly integrated with the language that we had to end up
using (HLSL). The reason we stayed is that it ended up being everything
Eclipse was, but better. The UI was smoother, the auto complete functionality was
top notch, and the debugging capabilities blew us away. The killer feature of the
debugger is its watch capability. With the watch feature you can assign any
variable to be watched while stepping through code. Any time after that, when the
variable’s value changes, the watch window will automatically update it. You can
also have the watched variable be pre-processed by a function in your
code before it is displayed. For example, instead of watching variable x you can
watch sqrt(x) and have that value displayed in real time. Visual Studio’s UI
was also one of the most customizable we had ever seen. You can partition
the window in as many ways as you would like. There is no denying how useful it
is to have 5 windows open all editing the same file when you have a 5000 line
source file that you are trying to dissect and understand.
Figure 13: Watching a fixed point number while having its more readable floating point
representation shown
Another great feature Visual Studio had was its peek definition function (Alt + F12).
This feature allows us to open a nested window in the code editor that peeks at
another function’s definition. It is extremely useful when you have source files that
are almost 5000 lines long.
Looking into the future, Visual Studio 2015 is slated to have a new GPU
performance profiler allowing us to analyze frame rates, frame times, and GPU
utilization. It is nearly impossible to profile a graphics card with the current crop of
IDEs unless you have proprietary software from the GPU vendors, so it is very nice
that Visual Studio will have one included.
Figure 14: Visual Studio’s Peek Definition feature
Figure 15: Promotional screenshot of Visual Studio's new GPU profiling tool. Source:
blogs.msdn.com
Various GPU Programming Languages
When we were first tasked with toying around with GPU Programming we were
given a wide range of languages to choose from. We ended up choosing
Microsoft’s DX Compute, but we spent some time dabbling in: C++ AMP, OpenCL,
and even Nvidia’s CUDA.
C++ AMP
AMP is a C++ library developed by Microsoft with the purpose of making it
extremely easy to run GPU code from within a C++ program. We would say that
they have succeeded with this. In order to run any code, all you need to do is include
some headers and call a special function that executes a for loop on the GPU,
which really takes no more than 10 lines. The only problem is that you do not have
much, if any, control over the performance, and it is difficult to get advanced
functionality out of the library.
OpenCL
OpenCL is a multi-device programming framework developed by Khronos, the
same group that develops OpenGL. As such, it is an open standard and cross platform.
This is the real reason we tried it after we ruled out AMP. The best thing about it
was the online tutorials, which cover a wide range of GPU programming
tasks. In addition to tutorials, much of the open source GPU software we found
was written in OpenCL. This was convenient because it gave a
glimpse into the high level design of GPU applications. Despite how awesome
OpenCL was, DX Compute had a killer feature that we were not very willing to give
up.
DirectCompute (DX Compute)
Our group did not even know DX Compute existed, and there was a good reason
for that: it is mainly a DirectX 11 feature, and DirectX 11 is not extremely popular
with developers nowadays. It also is not open source or free to use with enterprise
applications. The reason we chose it is that AMD strongly recommended it.
First of all, there is an adaptive tessellation example written by Microsoft that we
can use as a reference. Also, the reference rasterizer given to us by AMD already
defines an interface for our project. This makes it easy for us to perform a variety
of tests as we develop our solution. The only thing we do not like about DX
Compute is the verbose syntax. It really is a handful for inexperienced
programmers.
CUDA
CUDA is Nvidia’s proprietary GPU computing language. Nvidia may be better known to you
as AMD’s direct competitor in graphics, and that alone is reason enough for us not
to use their language. Nonetheless we decided to try it, and it was
actually quite clever. To perform operations on the GPU you use familiar C
functions prefixed with cuda. For example, to allocate memory on the GPU you
would use cudaMalloc, to free memory on the GPU you would use cudaFree, and
to copy memory to and from the GPU you can use cudaMemcpy. As cool as the
language is, we sadly were not even allowed to consider it.
Microsoft Reference Rasterizer
The Microsoft reference rasterizer (RefRast) is an app given to us by AMD to help
us visualize and test the tessellation algorithm.
The RefRast is split up into 2 pieces, the OpenGL renderer and the C
implementation of the Tessellator. The OpenGL renderer’s only real job is to take
the output from the tessellator and use the contents of the index and vertex buffers
to draw points and lines on the screen. In the background it takes input from the
user to control the tessellation factors and feeds them into the tessellator. The
RefRast is also the source of most of the figures in this document.
Figure 16: Screenshot of AMD's Reference Rasterizer
The second half of the RefRast is the C implementation of the tessellator. This
implementation provides us with a perfectly accurate tessellator that follows the
Microsoft spec 100%. In fact, it is the Microsoft spec: a Microsoft employee wrote
the C tessellator and gave it to AMD so they could use it to develop hardware. That
origin explains the code’s layout. The code is not written to be efficient on
CPUs at all. In fact here is a direct quote from the comments:
//There is lots of headroom to make this code run faster
on CPUs. It was written merely as a reference for what
results hardware should produce, with CPU performance not
a consideration.
Figure 17: Quote from the tessellator source code
The code is literally written in such a way that you can lay down circuits on a board
as you read the code. While this may be fantastic for AMD hardware engineers, it
is quite the nightmare for undergraduate computer science students with little to
no computer engineering experience.
Anyway, the RefRast contains something very useful for testing. It contains an
interface for an HLSL tessellator. This means our code can just implement the
interface and hook right in to the RefRast’s rendering capabilities. This makes it
easy to diff results of our tessellator with the results of the reference tessellator.
We can even render our data as an overlay on top of the reference data to gain
visual insight into bugs in our code.
Overall the RefRast is an invaluable tool both with its hard to read yet highly
detailed code, and its extensible rendering capabilities. It is a shame we only got
our hands on it about 2 months after we were assigned the project.
OpenGL Specification
The “OpenGL Specification” is a document that describes the OpenGL graphics
system. Version 4.5 of the document is freely available from the OpenGL webpage.
The document intends to provide information about the nature and behavior of the
OpenGL system, along with requirements for implementation. The specification
covers tessellation control shaders, primitive generation, and evaluation shaders
under the section for programmable vertex processing. The section on primitive
generation is relevant to what we are trying to implement.
Along with an overview of primitive generation, the specification delves further into:
Subdivision
Tessellation Types: Triangles, Quads, and Isolines
Partitioning Modes: Equal Spacing, Fractional Even, Fractional Odd
For the tessellation types, the specification discusses which tessellation factors
apply to which tessellation types. For the partitioning modes, the range of values
to clamp, rounding of tessellation values, and division of segments along edges
are provided.
Detailed Design
Tessellating Isolines
General Overview
The isoline tessellator takes input values, processes them and creates a grid of
lines that have been subdivided based on the input values. Isolines are the
simplest form of tessellation as they require only 2 tessellation factors and a
tessellation mode. They are also extremely fast compared to triangles and quads.
Figure 18: Sample output grid
Input Description
Isoline tessellation takes only 3 inputs:
Tessellation Factor 1
o A floating point value describing how many segments the horizontal
lines will be made up of
Tessellation Factor 2
o A floating point value describing the number of horizontal lines
o This is always done in integer tessellation mode for isolines
Tessellation Mode
o Describes how the lines generated by the algorithm will be spaced
o Also used as a guideline for processing input values
Figure 19: An example of Fractional Odd and
Integer partitioning given that both tessellation
factors are set to 4.2
Output Description
Isoline tessellation generates 2 output structures:
Vertex Buffer
o Contains a list of points generated in uv coordinates
Index Buffer
o Contains a list of ints
o These ints are stored 2 at a time and correspond to the endpoints of
each line segment
Processing tessellation factors
If Tessellation factor 1 or 2 is less than or equal to 0, then the algorithm short
circuits and returns nothing. Otherwise we must process the tessellation factors
into more useful numbers.
The first step in processing the tessellation factors is to clamp them to their valid
ranges based on the tessellation mode. Below is a table specifying the valid values
the tessellation factors will be clamped to:
Table 1: Table showing valid tessellation factor ranges
Tessellation Mode    Valid Range
Integer              [1, 64]
Pow2                 [1, 64]
Fractional Odd       [1, 63]
Fractional Even      [2, 64]
If the tessellation mode is set to one of the integer modes (Integer or Pow2) then
both the tessellation factors must be rounded up to the nearest whole number.
The tessellation parity is then stored. Tessellation parity can either be even or odd
and is based on whether the tessellation factor is even or odd.
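The clamping, rounding, and parity steps above can be sketched as follows. This is a minimal illustration, not the project's shader code: the mode names and the function `process_tess_factor` are hypothetical, the input is assumed to be positive (non-positive factors short circuit earlier), and the fractional modes are assumed to take their parity from the mode itself, as described later for triangles.

```python
import math

# Valid ranges from Table 1, keyed by hypothetical mode names.
VALID_RANGE = {
    "integer": (1.0, 64.0),
    "pow2": (1.0, 64.0),
    "fractional_odd": (1.0, 63.0),
    "fractional_even": (2.0, 64.0),
}


def process_tess_factor(factor, mode):
    """Clamp a positive tessellation factor and return (factor, parity)."""
    low, high = VALID_RANGE[mode]
    clamped = min(max(factor, low), high)
    if mode in ("integer", "pow2"):
        # Integer modes round up to the nearest whole number,
        # and parity follows the rounded value.
        clamped = float(math.ceil(clamped))
        parity = "even" if int(clamped) % 2 == 0 else "odd"
    else:
        # Assumption: fractional modes take their parity from the mode.
        parity = "odd" if mode == "fractional_odd" else "even"
    return clamped, parity
```

For example, a factor of 4.2 becomes 5 with odd parity in Integer mode, but stays 4.2 in Fractional Odd mode.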
Next, the tessellation factor context is computed for the first tessellation factor. The
tessellation factor context is a struct of numbers that are used repeatedly
throughout the tessellation process. Below is a list describing the values to be
stored in the tessellation factor context.
halfTessFactorFraction
o The fractional part of $\frac{\text{tessellation factor}}{2}$
numHalfTessFactorPoints
o Half of the amount of points we expect to generate
o The ceiling of $\frac{\text{tessellation factor}}{2}$
splitPointOnFloorHalfTessFactor
o This is an integer that tells the tessellator, when in fractional mode,
at what index to insert the small line segment
o Calculation of this number depends on the tessellation parity and is
only used in fractional tessellation modes
Even: $\mathrm{RemoveMSB}\left(\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2\right) + 1$
Odd: If the tessellation factor is less than 3, then this number is simply 0.
Otherwise it is equal to $\mathrm{RemoveMSB}\left(\left(\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor - 1\right) \cdot 2\right) + 1$
Figure 20: Image showing what splitPointOnFloorHalfTessFactor represents
invHalfTessFactorCeil
o The upper bound for the length of a segment
o Calculation of this number depends on the tessellation parity
Even: The inverse of $\left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2$
Odd: The inverse of $\left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2 - 1$
invHalfTessFactorFloor
o The lower bound for the length of a segment
o Calculation of this number depends on the tessellation parity
Even: The inverse of $\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2$
Odd: The inverse of $\left\lfloor \frac{\text{tessellation factor}}{2} \right\rfloor \cdot 2 - 1$
Next, the number of points per line is calculated. Once again this relies on the
tessellator parity and is described below in a table.
Odd Parity     $2 \cdot \left\lceil 0.5 + \frac{\text{tessellation factor}}{2} \right\rceil$
Even Parity    $1 + 2 \cdot \left\lceil \frac{\text{tessellation factor}}{2} \right\rceil$
Figure 21: Table showing number of points calculations
Now we must compute the tessellation factor context for the second tessellation
factor. This is done the same way as the first one except for 3 key things:
We force the tessellation mode to Integer mode.
We must round up the second tessellation factor to the next whole number
(because we are now in integer mode).
We must re-assign the tessellation parity based on whether our new second
tessellation factor is even or odd.
Figure 22: Notice the missing line at the bottom.
We then calculate the number of lines we will be producing. This is calculated in the
same way as the number of points for the first tessellation factor, but we subtract 1
from the final result. This is because we do not want to draw the final line.
Next we calculate the number of points that will be drawn, which is just equal to
the number of points per line multiplied by the number of lines.
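As a rough sketch of the counting rules above, the following computes the points per line, the number of lines, and the total point count. The function names are illustrative only, and both factors are assumed to have been clamped already.

```python
import math


def num_points_for_tess_factor(factor, parity):
    """Point count for one tessellation factor, per the parity table."""
    if parity == "odd":
        return 2 * math.ceil(0.5 + factor / 2.0)
    return 1 + 2 * math.ceil(factor / 2.0)


def isoline_counts(points_factor, points_parity, lines_factor):
    """Return (points per line, number of lines, total points)."""
    points_per_line = num_points_for_tess_factor(points_factor, points_parity)
    # The second factor is always handled in integer mode: round it up
    # and take its parity from the rounded value.
    rounded = math.ceil(lines_factor)
    lines_parity = "even" if rounded % 2 == 0 else "odd"
    # The final line is not drawn, hence the subtraction.
    num_lines = num_points_for_tess_factor(rounded, lines_parity) - 1
    return points_per_line, num_lines, points_per_line * num_lines
```

With both factors set to 4 in integer mode, this gives 5 points per line, 4 lines, and 20 points in total.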
Point Generation
With the tessellation factors processed, we can generate the points. This is done
with a nested for loop that visits every point in every line and runs the
PlacePointIn1D function for each coordinate. The pseudo code for the
body of the nested loop is provided below.
Set tessellator parity to the parity of the first tessellation factor
u = PlacePointIn1D(first tessellation factor context, current point)
Set tessellator parity to the parity of the second tessellation factor
v = PlacePointIn1D(second tessellation factor context, current line)
Add point (u, v) to list of points
Figure 23: Image highlighting loop execution
PlacePointIn1D
The first thing we do when generating points is make sure that the point we’re on
resides on the left side of the line. We do this because the points we generate are
symmetric about the center of the line. If the point is on the right side of the line,
we set 𝑝𝑜𝑖𝑛𝑡 = 𝑡𝑜𝑡𝑎𝑙 𝑝𝑜𝑖𝑛𝑡𝑠 𝑜𝑛 𝑙𝑖𝑛𝑒 − 𝑝𝑜𝑖𝑛𝑡. If the tessellation parity is set to odd
then we must subtract 1 from this value.
Figure 24: Image showing the need to subtract 1 from point when tessellation parity is
odd
Now we make 2 values: indexOnCeilHalfTessFactor and
indexOnFloorHalfTessFactor. Initially these two numbers are set to point (the index
of the current point we are working on). If the point we are on is greater than the
splitPointOnFloorHalfTessFactor calculated in the tessellation factor context then
we reduce indexOnFloorHalfTessFactor by 1. The reason for this will become
apparent very shortly but remember that splitPointOnFloorHalfTessFactor is the
index at which we insert the small line segments in the fractional tessellation
modes.
We now make two new values which again reference our tessellation factor
contexts:
$$\mathit{locationOnFloorTessFactor} = \mathit{indexOnFloorHalfTessFactor} \cdot \mathit{invHalfTessFactorFloor}$$
$$\mathit{locationOnCeilTessFactor} = \mathit{indexOnCeilHalfTessFactor} \cdot \mathit{invHalfTessFactorCeil}$$
Now we are ready to calculate the final location of the point we are placing.
Figure 25: Geometric visualization of linear interpolation. Source: Wikipedia
$$\mathit{finalLocation} = \mathit{locationOnFloorTessFactor} \cdot (1 - \mathit{halfTessFactorFraction}) + \mathit{locationOnCeilTessFactor} \cdot \mathit{halfTessFactorFraction}$$
The location is calculated by taking the point's location on the floor tessellation
factor and its location on the ceiling tessellation factor and linearly interpolating
between them according to halfTessFactorFraction. The result always lies between
locationOnFloorTessFactor and locationOnCeilTessFactor, and it approaches
locationOnCeilTessFactor as halfTessFactorFraction approaches 1. This is
the same as saying that as the tessellation factor approaches a whole odd (or even
depending on parity) number, the size of the small line segments approaches the
size of the other segments.
Figure 26: Image showing the process of tessellating from tessellation factor
3 to tessellation factor 5 in fractional odd mode
Finally, if we flipped our point at the beginning we must set it to 1 – location since
we are mirroring it about the center of the line.
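The steps of PlacePointIn1D described above can be sketched in floating point as follows. This is an illustration under stated assumptions rather than the team's implementation: the `TessFactorContext` class and function name are hypothetical, a point is treated as being on the right side when its index is at least numHalfTessFactorPoints, and "total points on line" is taken to be twice numHalfTessFactorPoints.

```python
from dataclasses import dataclass


@dataclass
class TessFactorContext:
    half_tess_factor_fraction: float
    num_half_tess_factor_points: int
    split_point_on_floor_half_tess_factor: int
    inv_half_tess_factor_ceil: float
    inv_half_tess_factor_floor: float


def place_point_in_1d(ctx, point, odd_parity):
    """Return the location in [0, 1] of the point with the given index."""
    # Points are symmetric about the center, so fold the right half
    # onto the left half and remember to mirror the result.
    flip = point >= ctx.num_half_tess_factor_points
    if flip:
        point = 2 * ctx.num_half_tess_factor_points - point
        if odd_parity:
            point -= 1
    index_on_ceil = point
    index_on_floor = point
    if point > ctx.split_point_on_floor_half_tess_factor:
        index_on_floor -= 1
    location_on_floor = index_on_floor * ctx.inv_half_tess_factor_floor
    location_on_ceil = index_on_ceil * ctx.inv_half_tess_factor_ceil
    # Linear interpolation between the floor and ceiling locations.
    fraction = ctx.half_tess_factor_fraction
    location = location_on_floor * (1.0 - fraction) + location_on_ceil * fraction
    return 1.0 - location if flip else location


# Context values for an integer tessellation factor of 4 (even parity):
# 2 half points, 4 segments of length 0.25, split point never reached.
ctx = TessFactorContext(0.0, 2, 3, 0.25, 0.25)
row = [place_point_in_1d(ctx, i, odd_parity=False) for i in range(5)]
```

For this context, `row` holds the u values 0, 0.25, 0.5, 0.75, and 1, which are the values shown in the sample vertex buffer that follows.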
Now that we have a new generated point, we can store it in the next empty spot in
the vertex buffer.
i          0     1     2     3     4
Point.u    0     .25   .50   .75   1
Point.v    0     0     0     0     0
Table 2: Table showing a sample vertex buffer for all N points in row 0
Point Connectivity
Point connectivity is actually quite simple. We have a global array that is initialized
to have a size that is equal to the number of indices that we will end up with. This
is equal to the number of segments per line, multiplied by the number of lines. We
then multiply this by 2 since each segment is defined by 2 points. It is worth noting
that point connectivity only relies on the same tessellation factor context that point
generation relies on. This means that both can be done in parallel.
The actual process of connectivity generation is done in a nested for loop that goes
row by row and column by column and just inserts pairs of ints into the index buffer.
The integers stored in the index buffer correspond to vertices in the vertex buffer.
Once again, the vertex buffer is populated when the tessellator generates the points.
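A minimal sketch of the connectivity loop is shown below. It assumes the vertex buffer is laid out line by line, so the point at column c of line r sits at index r * pointsPerLine + c; the function name is hypothetical.

```python
def build_isoline_index_buffer(points_per_line, num_lines):
    """Return a flat list of ints, two per line segment."""
    indices = []
    segments_per_line = points_per_line - 1
    for line in range(num_lines):
        row_start = line * points_per_line
        for segment in range(segments_per_line):
            # Each segment joins a point to its right-hand neighbour.
            indices.append(row_start + segment)
            indices.append(row_start + segment + 1)
    return indices
```

The resulting list has 2 * segments per line * number of lines entries, which matches the size of the global array described above.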
Figure 27: Image showing how connectivity is stored in the index buffer
Proposed Parallelism Technique Design
Parallelizing isolines is going to be the simplest of the trio of tessellation types.
First we must compute the tessellation factor context. This is a fairly
linear process, and its result is needed by both the point generator and the
connectivity generator. Because we need to compute it twice (once per tessellation
factor), we can run both computations at the same time to save a little time.
After the tessellation factor context is computed, we are ready to generate the
points and connectivity. As previously stated these can be done independently of
each other.
Parallelizing Point Generation
Before we go into how we intend to parallelize point generation I will outline some
basic facts about the AMD GCN architecture.
For this example I will be referring to the AMD Radeon R9 290x’s hardware.
An AMD GPU core consists of:
44 individual compute units.
Each compute unit consists of 4 SIMD vector processors.
Each SIMD vector processor consists of 16 ALUs.
Figure 28: Image showing an AMD SIMD vector
processor
Each SIMD vector processor executes the same instruction on all 16 of its
ALUs. Also, each ALU can operate on a different piece of data. This means
that each compute unit can have all of its 64 ALUs execute the same instruction
on 64 pieces of data. This leads to ridiculously parallelized code that far
surpasses what a normal CPU can do. Remember that each tessellation factor
for isolines maxes out at 64. This means that we can use one compute unit to
compute an entire row of points at the same time. This is the basis for our point
generation optimizations.
Figure 29: Image showing a vector operation
Figure 30: High level overview of a compute unit. Source: AnandTech.com
Thus, my proposed technique for point generation is to use one compute unit
to compute rows of values at a time. This would reduce the time complexity of
point generation from an O(n²) operation to an O(n) operation.
As a further optimization, imagine a tessellation factor 1 of 2, a tessellation
factor 2 of 16, and fractional even tessellation mode. Each line would have 3
points and there would be 16 lines, so we would end up with 48 points. This means
that we can calculate the coordinates of all 48 points in O(1) time with just 1
compute unit!
Figure 31: All of these points can be generated in O(1) time
In more general terms, the number of iterations required to compute all of our
points is simply the following, rounded up to a whole number:
$$\frac{\text{number of points}}{64}$$
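As a small worked example of this estimate, the sketch below rounds the quotient up with integer arithmetic. The function name and the default of 64 lanes (one compute unit's ALUs) are illustrative assumptions.

```python
def passes_needed(num_points, lanes=64):
    """Number of 64-wide passes needed to cover num_points points."""
    # Integer ceiling division: a partially filled pass still costs a pass.
    return (num_points + lanes - 1) // lanes
```

Any patch with 64 or fewer points needs a single pass, while 65 points already need two.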
Parallelizing Point Connectivity
We can accomplish the parallelization of point connectivity similarly to how we did
point generation. The computations inside the loop that generate connectivity do
not rely on the results of the previous loop. That means if we have 64 ALUs, we
can generate 64 line segments at a time.
Figure 32: 64 of these segments can be generated at a time
One caveat to this approach is best described with a question: how does each
thread know which segment it should be generating? When a thread is
launched, it is assigned a thread id and group id, among other things. If we launch
a number of threads equal to the number of line segments, then each thread knows
which segment it is calculating just by looking at its thread id. From there, some
simple calculations give the 2 points that are the endpoints of that segment. And
remember, since this data will be stored in an index buffer, we
only have to store an int that corresponds to the right point in the vertex buffer.
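The "simple calculations" that turn a thread id into a pair of endpoints could look like the sketch below. It is written as plain Python for clarity instead of shader code, the function name is hypothetical, and it assumes one thread per segment with the vertex buffer laid out line by line.

```python
def segment_endpoints(thread_id, points_per_line):
    """Map a flat thread id to the two vertex indices of its segment."""
    segments_per_line = points_per_line - 1
    # Which line the segment is on, and which segment within that line.
    line, segment = divmod(thread_id, segments_per_line)
    first = line * points_per_line + segment
    return first, first + 1
```

Because each call depends only on its own thread id, all segments can be produced independently, which is what allows 64 of them to be generated at a time.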
Figure 33: Simple flow chart diagramming the proposed parallelization technique
Attempted Isoline Parallelization Techniques
One thread group per point
On the surface this idea seems terrible, and that is because it is. While coding a
“one thread per point” solution we thought we were assigning one point to one
ALU, but we ended up assigning one point to one compute unit.
One thread per grid of points
In DirectCompute, a dispatch group is limited to 1024 threads. Given this limitation
we tried having threads compute all data for grids of points in sizes of 8x8, 4x4,
and 2x2. Obviously 2x2 groups were the fastest, but they were not fast enough.
One thread per 4 collinear points
This implementation was similar to the 2x2 grids, but the points were on the same
line, saving some precious cycles only having to calculate the y-value once.
One group of threads (64) per 8x8 cube of points
In this implementation we only launched as many threads as we required. If the
output patch was less than 8x8 then it could have been handled entirely by a single
compute unit. This implementation proved to be the fastest thus far.
One group of threads per line of points
In this final implementation we launch one group of threads per line of points. This
proves to be the fastest by about half a millisecond.
Why not use both
As a last minute optimization the software actually analyzes the input and decides
whether it should use the cube method or line method to minimize resources used.
Run Times of the various implementations
Figure 34: Graph showing run times of various isoline tessellation implementations (chart title: Isoline Test Results; vertical axis: Time (ms), 0 to 400; horizontal axis: Grid Size, with categories 1x1, 8x8, 4x4, 2x2, 4x1, 8x8 bound, 64x1 bound; series: HD 8490, R9 290X, Intel Integrated)
What this test measures is the time the GPU spends in the compute shader stage
while going through every single possible isoline input in integer mode.
As can be seen, the later implementations were faster on more
classes of hardware.
Tessellation of Triangles
The tessellation of triangles consists of the subdivision of triangles into smaller,
non-overlapping triangles that entirely cover the area of the original triangle.
Subdivision of triangles is dependent on:
Outer Tessellation Factors, t0 through t2
Inner Tessellation Factor, i0
It ignores the Outer Tessellation Factor t3 and Inner Tessellation Factor
i1.
The high-level steps involved in the tessellation of triangles are:
1. Processing Tessellation Factors
2. Point Generation
3. Point Connectivity
Processing Tessellation Factors
Processing of the tessellation factors takes as input:
Outer Tessellation Factors, t0 through t2
Inner Tessellation Factor, i0
It generates the following information utilized in point generation and connectivity:
Per tessellation factor
o Clamped value
o Parity
o Context, defined below
o Number of points per edge
Global
o Total number of points
Set Flags
o Base-case to do minimum tessellation work
o Culled Patch
First, the tessellation factors must be checked for the base case where the patch
is culled—that is, not displayed. This is the case where all of the outer tessellation
factors are non-positive, in which case a culled flag is set to let the tessellator
know, and further processing of the tessellation factors is aborted.
Next, the tessellation factors must be clamped—that is, bumped up or down to
ensure that they fall within a given range. Their appropriate ranges are based off
of the chosen partitioning mode. For integer and power of two partitioning, the
outer tessellation factors are clamped to the range of values 1 through 64.
Likewise, for fractional even and fractional odd partitioning, the outer tessellation
values are clamped to the range of values 2 through 64 and 1 through 63,
respectively. Largely, these same ranges apply to the clamping of the inner
tessellation value, but the clamping of the inner tessellation value for fractional odd
partitioning is a special case: the lower bound for this range is incremented by
$2^{-16}$, the smallest value represented by the fixed point representation utilized in the
tessellation hardware specification, so that the concentric inner triangle later
generated does not overlap with the outermost triangle (see Figure 1). Because
tessellation factors are read as floating point numbers, those factors with fractional
parts must be rounded up to the next integer for the integer and power of two
partitioning modes.
Figure 1 – Inner Triangle Does Not Overlap With Outermost Triangle
Next, the vertex and index buffers are cleared before the bulk of the tessellation
factor processing. In hardware, these buffers have enough memory to support the
four tessellation factors set to the maximum value of 64, which would produce
upwards of 3,000 vertices and 6,000 triangle subdivisions.
For integer and power of two partitioning, the parity of each tessellation factor is
set to the parity of its clamped value. Otherwise, the parities for all tessellation
values are set to the parity corresponding to the chosen partitioning; for fractional
even and fractional odd partitioning, all tessellation factor parities are set to even
and odd, respectively.
There is another base case for integer, power of two, and fractional odd partitioning
modes where all tessellation factors are equal to one; in this case, a single triangle is
output (see Figure 2). Now that the tessellation factors are clamped to their
appropriate ranges, this base case can be checked against. If it is this case, a flag
is set to let the tessellator know that it will be doing the minimum amount of work,
and further processing of the tessellation factors is aborted.
Figure 2 – Special Case; All Tessellation Factors Equal To One
Next, the context—a collection of values useful for point generation and
connectivity—for each tessellation factor is computed as a function of itself, and
its parity.
Tessellation Factor Context Variables:
invNumSegmentsOnFloorTessFactor := $\frac{1}{\mathit{numFloorSegments}}$
invNumSegmentsOnCeilTessFactor := $\frac{1}{\mathit{numCeilSegments}}$
halfTessFactorFraction := $\mathit{halfTessFactor} - \mathit{floorHalfTessFactor}$
numHalfTessFactorPoints := $\mathit{ceilHalfTessFactor}$
splitPointOnFloorHalfTessFactor :=
o If $\mathit{ceilHalfTessFactor} = \mathit{floorHalfTessFactor}$, some value is picked
for the tessellator to ignore; the hardware chooses
$\mathit{numHalfTessFactorPoints} + 1$.
o For odd tessellation factor parity,
If $\mathit{floorHalfTessFactor} = 1$, 0
Otherwise, $(\mathrm{RemoveMSB}(\mathit{floorHalfTessFactor} - 1) \ll 1) + 1$
o Otherwise, $(\mathrm{RemoveMSB}(\mathit{floorHalfTessFactor}) \ll 1) + 1$
Where,
halfTessFactor := $\frac{\text{tessellation factor}}{2}$
o If the parity of the tessellation factor is odd, or halfTessFactor is equal
to 0.5, halfTessFactor is incremented by 0.5.
floorHalfTessFactor := $\lfloor \mathit{halfTessFactor} \rfloor$
ceilHalfTessFactor := $\lceil \mathit{halfTessFactor} \rceil$
numFloorSegments := $\mathit{floorHalfTessFactor} \cdot 2$
o For odd tessellation factor parity, the value is decremented by 1.
numCeilSegments := $\mathit{ceilHalfTessFactor} \cdot 2$
o For odd tessellation factor parity, the value is decremented by 1.
$\mathrm{RemoveMSB}(x: \mathit{float}): \mathit{float}$ is a function that removes the most
significant bit from a float.
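The context computation can be sketched in floating point as follows. This is an illustration, not the fixed-point hardware reference: the class and function names are hypothetical, RemoveMSB is applied to the integer value of its argument, and the two inverse values are assumed to be the reciprocals of numFloorSegments and numCeilSegments, which matches the inverse segment counts given for isolines.

```python
import math
from dataclasses import dataclass


@dataclass
class TessFactorContext:
    inv_num_segments_on_floor_tess_factor: float
    inv_num_segments_on_ceil_tess_factor: float
    half_tess_factor_fraction: float
    num_half_tess_factor_points: int
    split_point_on_floor_half_tess_factor: int


def remove_msb(value):
    """Clear the most significant set bit of a non-negative integer."""
    if value <= 0:
        return 0
    return value & ~(1 << (value.bit_length() - 1))


def compute_tess_factor_context(tess_factor, odd_parity):
    """Build the context for one clamped tessellation factor."""
    half = tess_factor / 2.0
    if odd_parity or half == 0.5:
        half += 0.5
    floor_half = math.floor(half)
    ceil_half = math.ceil(half)
    if ceil_half == floor_half:
        # Whole half factor: pick a split point that is never reached.
        split = ceil_half + 1
    elif odd_parity:
        split = 0 if floor_half == 1 else (remove_msb(floor_half - 1) << 1) + 1
    else:
        split = (remove_msb(floor_half) << 1) + 1
    adjust = 1 if odd_parity else 0
    num_floor_segments = floor_half * 2 - adjust
    num_ceil_segments = ceil_half * 2 - adjust
    return TessFactorContext(
        1.0 / num_floor_segments,
        1.0 / num_ceil_segments,
        half - floor_half,
        ceil_half,
        split,
    )
```

For a factor of 4.2 with odd parity, this yields 3 floor segments, 5 ceiling segments, 3 half points, and a split point of 1.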
Finally, the number of points corresponding to each tessellation factor is
calculated. For the outer tessellation factors, this directly corresponds to the
number of points for each respective edge. For the inner tessellation factor, the
value corresponds to the number of points for the line that runs along the edge of the
inner concentric triangle adjacent to the outer triangle. Given the minimum bound
on the tessellation factors for the different tessellation partitioning modes, the
minimum point count for the inner tessellation factor is 4 for fractional odd
partitioning, and 3 for all others.
Figure 3 – Minimum Point Count for Inner Tessellation Factor (figure label: Inner Tessellation Point Count = 3)
For odd parity tessellation factors, the number of points is given by:
$$\text{number of points}_{odd} = \left\lceil 0.5 + \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2$$
Similarly, the number of points for other tessellation factor parities is given by:
$$\text{number of points}_{other} = \left( \left\lceil \frac{\text{tessellation factor}}{2} \right\rceil \cdot 2 \right) + 1$$
The inside edge point base offset is given by:
$$\left( \sum_{edge=1}^{3} \text{number of points}_{edge} \right) - 3$$
Finally, the total number of points is given by:
$$\left( \sum_{edge=1}^{3} \text{number of points}_{edge} \right) - 3 + \text{number of interior points}$$
Where,
$$\text{number of interior rings} := (\text{number of points for inside tessellation factor} \gg 1) - 1$$
$$\text{number of interior points} := \begin{cases} 3 \cdot (\text{number of interior rings} \cdot (\text{number of interior rings} + 1) - \text{number of interior rings}), & \text{tessellation factor parity} = \text{odd} \\ 3 \cdot (\text{number of interior rings} \cdot (\text{number of interior rings} + 1)) + 1, & \text{otherwise} \end{cases}$$
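The point-count formulas above can be checked with a short sketch. For simplicity it assumes that all four factors share one parity, as in the fractional modes; in the integer modes each factor has its own parity. The function names are illustrative.

```python
import math


def num_points_for_tess_factor(factor, odd_parity):
    """Number of points generated for one tessellation factor."""
    if odd_parity:
        return 2 * math.ceil(0.5 + factor / 2.0)
    return 2 * math.ceil(factor / 2.0) + 1


def triangle_point_counts(outer_factors, inner_factor, odd_parity):
    """Return (inside edge point base offset, total number of points)."""
    edge_points = [num_points_for_tess_factor(f, odd_parity) for f in outer_factors]
    # Each corner is shared by two edges, so three points are subtracted.
    base_offset = sum(edge_points) - 3
    inside_points = num_points_for_tess_factor(inner_factor, odd_parity)
    rings = (inside_points >> 1) - 1
    if odd_parity:
        interior = 3 * (rings * (rings + 1) - rings)
    else:
        # Non-odd parity ends in a single center point, hence the + 1.
        interior = 3 * (rings * (rings + 1)) + 1
    return base_offset, base_offset + interior
```

With every factor at the maximum of 64 and even parity, this gives a base offset of 192 and 3,169 points in total, in line with the "upwards of 3,000 vertices" figure quoted earlier.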
Point Generation
The tessellator generates points along the rings—or concentric triangles—in a
spiraling, clockwise (or counterclockwise) fashion, from the outermost ring towards
the innermost rings. As the outermost and inner rings are generated as a function
of the outer and inner tessellation factors, respectively, the outermost ring points
can be computed separately from the points of the inner rings; this presents a clear
opportunity for parallelization. Non-odd parity of the inner tessellation factor is a
special case that implies the innermost ring be a single point, as opposed to a
triangle (Figure 3); this case is handled separately.
Figure 4 – Spiraling Point Generation & Center Point Special Case
As point generation iteratively generates sequential points along sequential edges,
in the chosen orientation, we need to keep indices for the current edge and point
for each ring, as well as the point offset for purposes of storage in the vertex buffer.
Let us define the clockwise ordering of the edges for a ring as U, V, and W. Points
generated along these edges are defined by a three-tuple of barycentric
coordinates (u, v, w) with respect to U, V, W. Coordinate ‘w’ can be implicitly
defined, however, as a function of ‘u’ and ‘v’.
Figure: Outermost Vertices with Barycentric Coordinate Labeling (vertices labeled (1, 0, 0), (0, 1, 0), and (0, 0, 1); edges labeled Edge 0 (U), Edge 1 (V), and Edge 2 (W))
For the point generation of each outer edge, we begin by calculating
the parity of the edge and the index of the edge’s end point, given by:
$(\text{number of points for edge}) - 1$. The number of points is decremented because
we do not want to include the last point along the edge, as the next edge begins
with it. We need the parity per edge because we need to reverse the orientation in
which points are generated along some edges. This is because ‘u’ and ‘v’ alternate
increasing and decreasing for coordinates along the axes of said edges. For points
along the edge W (edge 2) and edge U (edge 0) axes, we have ‘v’ and ‘u’ coordinate
values decreasing, respectively, so we have to reverse the orientation of the points
along these edges; these correspond to even parity edges. Similarly, edge V
(edge 1), which has ‘u’ increasing and an odd edge parity, does not require a flip
of the orientation along which points are generated.
Figure: Increasing and Decreasing of ‘u’ and ‘v’ Coordinates Along Axes (labels: ‘u’ increasing, edge parity := odd; ‘u’ decreasing, edge parity := even)
Per edge, we start with the initial point and iterate through the end point,
incrementing the point offset with every point for every edge. We calculate the
index of the point’s positioning along the axis of the current edge using the edge
parity, which is given by:
$$q := \begin{cases} \text{index of point}_{\text{edge } i}, & \text{edge parity} = \text{odd} \\ \text{end point}_{\text{edge } i} - \text{index of point}_{\text{edge } i}, & \text{edge parity} = \text{even} \end{cases}$$
Now that the index for point placement is adjusted for the parity of the edge, we
have to define the point in barycentric space. For each point from 0 through the
end point corresponding to edges U, V, and W, the point is given by:
$$\begin{cases} (0, \mathit{location}, w), & \text{edge } U \text{ (edge 0)} \\ (\mathit{location}, 0, w), & \text{edge } V \text{ (edge 1)} \\ (\mathit{location}, 1 - \mathit{location}, w), & \text{edge } W \text{ (edge 2)} \end{cases}$$
Where $w$ and $\mathit{location}$ are defined by:
$$w := 1 - u - v$$
$$\mathit{location} := \begin{cases} 0.5, & p = \mathit{numHalfTessFactorPoints} \\ \mathit{locationOnFloorHalfTessFactor} \cdot (1 - \mathit{halfTessFactorFraction}) + \mathit{locationOnCeilHalfTessFactor} \cdot \mathit{halfTessFactorFraction}, & \text{otherwise} \end{cases}$$
$$\mathit{locationOnCeilHalfTessFactor} := p \cdot \mathit{invNumSegmentsOnCeilTessFactor}$$
$$\mathit{locationOnFloorHalfTessFactor} := \begin{cases} (p - 1) \cdot \mathit{invNumSegmentsOnFloorTessFactor}, & p > \mathit{splitPointOnFloorHalfTessFactor} \\ p \cdot \mathit{invNumSegmentsOnFloorTessFactor}, & \text{otherwise} \end{cases}$$
and
$$p := \begin{cases} (\mathit{numHalfTessFactorPoints} \ll 1) - q - 1, & q \geq \mathit{numHalfTessFactorPoints} \text{ and tessellation parity} = \text{odd} \\ (\mathit{numHalfTessFactorPoints} \ll 1) - q, & q \geq \mathit{numHalfTessFactorPoints} \\ q, & \text{otherwise} \end{cases}$$
It is important to note, for the formula for location, that the complement
$(1 - \mathit{location})$ is taken if $q$ is greater than or equal to numHalfTessFactorPoints.
For location, splitPointOnFloorHalfTessFactor, halfTessFactorFraction,
invNumSegmentsOnCeilTessFactor, invNumSegmentsOnFloorTessFactor, and
numHalfTessFactorPoints are given by the tessellation factor context for the
tessellation factor corresponding to the edge being worked on. Outer tessellation
factors 0, 1, and 2 correspond to edges U, V, and W, respectively. The points are
stored in the vertex buffer at the index of the point offset.
Similarly to the outermost ring, points for the inner rings are calculated iteratively
from the outermost inner ring towards the center ring, along the edges in a
clockwise fashion. The number of inner rings is given by:
$$\mathit{numPointsForInsideTessFactor} \gg 1$$
where numPointsForInsideTessFactor comes from the processed tessellation
factors.
Because the points for all edges of the inner rings are generated from a single
tessellation factor, each edge has the same number of segments (and points) per
ring. Per ring, the start and end points are given as follows: $\text{start point} := \mathit{ring}$
and $\text{end point} := \mathit{numPointsForInsideTessFactor} - 1 - \text{start point}$. This is
because each inner ring has two fewer points per edge than the corresponding edge
of the ring surrounding it, and the property still holds that we exclude the last point
along the edge, since the following edge begins with it.
Figure: All Edges of Innermost Triangles, Per Triangle, Have Equal Points and Segments. (Figure labels: Ring 0; Ring 1; Ring 2 Start Point For Edge 1; Ring 2 End Point For Edge 1.)
Inner Tessellation Factor: 7 (7 Segments for Ring 1)
For Ring 1, each edge has 6 points.
For Ring 2, each edge has 4 points.
Generally, Ring i Points Per Edge := (Ring 1 Points) – 2*(i – 1)
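The start and end point rule for inner rings can be sketched as below. The function name is hypothetical, and the third returned value counts both the start and end points, which is the counting the figure's "6 points" and "4 points" appear to use.

```python
def ring_point_range(num_points_for_inside_tess_factor, ring):
    """Return (start point, end point, points per edge) for an inner ring."""
    start_point = ring
    end_point = num_points_for_inside_tess_factor - 1 - start_point
    return start_point, end_point, end_point - start_point + 1
```

An inner tessellation factor of 7 with odd parity produces 8 points for the inside tessellation factor, so Ring 1 spans points 1 through 6 and Ring 2 spans points 2 through 5.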
For each edge of the inner rings, the parity is still relevant, because the property
still holds that we must switch the orientation of points generated along the axes
of the edges to generate points sequentially.
We calculate the placement of the point along the axis of the current edge using
the edge parity, which is given by:
$$q_{inner} := \begin{cases} \text{index of point}_{\text{edge } i}, & \text{edge parity} = \text{odd} \\ \text{end point}_{\text{edge } i} - (\text{index of point}_{\text{edge } i} - \text{start point}), & \text{edge parity} = \text{even} \end{cases}$$
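Now that the index for point placement is adjusted for the parity of the edge, we
have to define the point in barycentric space. For each point from 0 through the
end point corresponding to edges U, V, and W, the point is given by:
$$\begin{cases} \left(\mathit{perp},\ \mathit{location} - \frac{\mathit{perp}}{2},\ w\right), & \text{edge } U \text{ (edge 0)} \\ \left(\mathit{location} - \frac{\mathit{perp}}{2},\ \mathit{perp},\ w\right), & \text{edge } V \text{ (edge 1)} \\ \left(\mathit{location} - \frac{\mathit{perp}}{2},\ 1 - \left(\mathit{location} - \frac{\mathit{perp}}{2}\right) - \mathit{perp},\ w\right), & \text{edge } W \text{ (edge 2)} \end{cases}$$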
Where $\mathit{perp}$ is given similarly to $\mathit{location}$ by:
$$\mathit{perp} := \begin{cases} 0.5, & p_{perp} = \mathit{numHalfTessFactorPoints} \\ \mathit{locationOnFloorHalfTessFactor} \cdot (1 - \mathit{halfTessFactorFraction}) + \mathit{locationOnCeilHalfTessFactor} \cdot \mathit{halfTessFactorFraction}, & \text{otherwise} \end{cases}$$
$$\mathit{locationOnCeilHalfTessFactor} := p_{perp} \cdot \mathit{invNumSegmentsOnCeilTessFactor}$$
$$\mathit{locationOnFloorHalfTessFactor} := \begin{cases} (p_{perp} - 1) \cdot \mathit{invNumSegmentsOnFloorTessFactor}, & p_{perp} > \mathit{splitPointOnFloorHalfTessFactor} \\ p_{perp} \cdot \mathit{invNumSegmentsOnFloorTessFactor}, & \text{otherwise} \end{cases}$$
and
$$p_{perp} := \begin{cases} (\mathit{numHalfTessFactorPoints} \ll 1) - \text{start point}, & q_{inner} \geq \mathit{numHalfTessFactorPoints} \\ (\mathit{numHalfTessFactorPoints} \ll 1) - q_{inner} - 1, & q_{inner} \geq \mathit{numHalfTessFactorPoints} \text{ and tessellation parity} = \text{odd} \end{cases}$$
In the formula for $\mathit{perp}$, similarly to that for location, the complement
$(1 - \mathit{perp})$ is taken if $q_{inner}$ is greater than or equal to
numHalfTessFactorPoints. After the above calculations for $\mathit{perp}$, the value is
multiplied by two-thirds. $\mathit{location}$ is defined as for the outermost ring, except
that for $\mathit{perp}$ and $\mathit{location}$, splitPointOnFloorHalfTessFactor,
halfTessFactorFraction, invNumSegmentsOnCeilTessFactor,
invNumSegmentsOnFloorTessFactor, and numHalfTessFactorPoints are given by the
tessellation factor context for the inner tessellation factor.
Lastly, the special case is handled where non-odd parity of the inner tessellation
factor produces a single point at the center. It is simply hardcoded as
$\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)$. The
points are stored in the vertex buffer at the index of the point offset.
Special Case; Inner Tess Factor = 6, Pictured Above. Non-Odd Inner Tess Factor
Parity Produces a Degenerate Triangle—Single Point—at the Center.
Point Connectivity
Like point generation, point connectivity occurs in a clockwise (or
counterclockwise), spiraling fashion from the outermost ring towards the innermost
ring. Triangles are formed between all rings, one triangle side at a time per ring.
Point Connectivity occurs in a spiraling, clockwise fashion from the outermost ring
towards the center. Triangles are generated in the order indexed at the center of
the triangle.
Several variables are necessary for the stitching (connecting) of points between
two given edges—i.e. per ring. First, we need to know which ring we are working
on because this is an iterative process—an index is maintained for this. Second,
we need the tessellation factor context for the outer and inner edge of the rings we
are working through. Finally, you need the offset of the point for the beginning of
the inner and outer edges for the side of the ring you are working on.
Point connectivity of the outermost ring is a separate case from the point
connectivity of the inner rings. Per ring, the connectivity of the U (first) and V
(second) edges are a separate case from the connectivity of the W (third) edge.
For the outermost ring, stitching of the edge points is a function of the tessellation
factor of the outer edge in question and the inner tessellation factor. This is
because the points of the outer edges are generated as a function of the outer
tessellation factors, and similarly, the points of the inner edges as a function of the
inner tessellation factor.
The stitching per ring edge is divided into a first and second half, with certain
special cases that can result in a “quad”, or an outward pointing triangle—inner to
outer edge—or similarly defined inward facing triangle at the center. Triangles are
stitched by connection of points from the outer and inner edge, with an edge
normally beginning at the point where the last left off. Points on either edge are
advanced forward in an alternating fashion as the point connections are defined,
until the second-to-last point for each edge is reached, marking the beginning of
the next side of the ring.
Inner-Outer-Inner Stitch
Outer-Outer-Inner Stitch
Example of initial Outer-Outer-Inner stitch, and Inner-Outer-Inner stitch, from below.
For the first half of the ring stitching, the first triangle generated is defined by the
first outer point, the second outer point, and the first inner point; the outer point
offset has been incremented, as no further triangles can be defined with the first
outer point without overlapping the first triangle. For convention, such a stitch will
be referred to as outer-outer-inner—i.e. the triangle is defined by connecting the
currently indexed outer edge point to the following outer edge point, thereby
incrementing it, and then finishing the triangle by a connection to the currently
indexed inner edge point. Through the first half, the ring is advanced by alternating
between inner-outer-inner and outer-outer-inner stitches.
The special middle cases occur when either (or both):
1. The tessellation factor parity for the outer and inner edge differs.
2. The inner tessellation factor is of odd parity.
When the second condition is true and the parities of the inner and outer
tessellation factor are equal, a “quad” is formed in the middle by two triangles of
inner-outer-inner and inner-outer-outer stitches.
Special case Inner Tessellation Factor is odd (3), and Inner and Outer Tessellation Factors are equal, forming
a center “quad” for Edge U of the outermost ring. The quad consists of triangles indexed 2 and 3.
When the first condition holds and the inner tessellation factor is of even parity, an
inward-pointing triangle is formed by an inner-outer-outer stitch.
In the remaining case, where both conditions hold, an outward-pointing triangle is
formed by an inner-outer-inner stitch.
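The three middle cases above can be summarized as a small decision function. The following is a minimal sketch only: the names MiddleCase and classifyMiddle are hypothetical, and it assumes the factors passed in have already been rounded to integers, so that parity is well defined.

```cpp
#include <cassert>

// Shape produced at the middle of an outermost-ring edge (hypothetical names).
enum class MiddleCase { None, Quad, InwardTriangle, OutwardTriangle };

// Assumption: both factors are already clamped and rounded to integers >= 1.
MiddleCase classifyMiddle(int outerFactor, int innerFactor) {
    assert(outerFactor >= 1 && innerFactor >= 1);
    const bool outerIsOdd = (outerFactor % 2) != 0;
    const bool innerIsOdd = (innerFactor % 2) != 0;
    const bool parityDiffers = (outerIsOdd != innerIsOdd);  // condition 1

    if (innerIsOdd && !parityDiffers) {
        return MiddleCase::Quad;             // inner odd, parities equal
    }
    if (parityDiffers && !innerIsOdd) {
        return MiddleCase::InwardTriangle;   // inner-outer-outer stitch
    }
    if (parityDiffers && innerIsOdd) {
        return MiddleCase::OutwardTriangle;  // inner-outer-inner stitch
    }
    return MiddleCase::None;                 // neither condition holds
}
```

With an inner and outer factor of 3, as in the quad example for Edge U, this sketch returns MiddleCase::Quad.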
For the second half of the ring stitching, the ring is advanced by alternating
between outer-outer-inner and inner-outer-inner stitches as in the first half. The
sequential combination of alternating stitch patterns the ring advances with is
determined with a look-up table.
Stitching of the inner tessellation rings is simply a function of the inner tessellation
factor. This is because all points generated along the edges of all of the concentric
inner triangles are determined by the processed inner tessellation factor. Stitching
of the edge of an inner tessellation ring begins with an outer-outer-inner stitch, and
then divides the stitching of the length of the triangle into two mirrored halves. For
the first half, connectivity of the ring edge consists of strictly alternating outer-inner-inner
and outer-outer-inner stitches; each pair of these forms a “diagonal” that
points from the outer edge towards the inner edge. For the second, mirrored half,
connectivity consists of strictly alternating inner-outer-outer and inner-outer-inner
stitches. The end of the ring edge is a base case outer-outer-inner stitch.
If the inner tessellation factor parity is odd, the inner edges of the centermost ring
constitute a single triangle; otherwise, the innermost ring constitutes a single
point.
Parallel Triangle Tessellation Design
The work for triangle tessellation is split into two separate shaders: point
generation and point connectivity, which run in parallel. This was the simplest
approach to implement, made possible by the inward-spiraling pattern in which
points are generated and connected. Because we are guaranteed the order and
quantity of points generated along edges, their two-dimensional coordinates are
not needed for point connectivity.
All that is needed for the above approach is a set of preprocessed data derived
from the tessellation factors; this includes such information as edge parities,
number of points per edge, and additional information used to determine the
spacing between points. This information is stored in a structure, initialized with
the tessellation factors for the primitive patch being tessellated. The CPU passes
this structure to the two shaders through a read-write structured buffer, the
contents of which get written to group shared memory. While a separate shader
could be dedicated to this preprocessing alone, its share of the total
tessellation workload was small enough that the point generation and
connectivity shaders were implemented to compute it for themselves.
The point generation shader was parallelized to allow a single point calculation per
thread, based on its thread ID. For triangle tessellation where all of the
corresponding tessellation factors (three outer and one inner) are set to the
maximum of 64, 4225 points are computed. Still, even in these ideally large
cases for parallelization, an insignificant speedup was achieved from a point-per-thread
approach, presumably due to the overhead of each thread calculating its
own contextual information about where its point is located (e.g. the corresponding
edge and ring).
[Figure: The CPU dispatches threads to the Point Generation Shader and to the Point Connectivity Shader. Each shader processes the tessellation factors into a tessellation factor context; the Point Generation Shader then generates points into the Vertex Buffer, and the Point Connectivity Shader generates connections into the Index Buffer.]
For point generation, 67 thread groups of 64 threads are launched. This approach
has the benefit of aligning groups of work along wavefronts, which consist of 64
threads each. While this approach allows nearly every thread of each
thread group to do work at the higher tessellation factors, many threads are wasted
at the lower factors. One way to mitigate this would be to separate the
preprocessing of tessellation factors from point generation, so that the quantity of
points to be generated can be used to dynamically allocate threads for the point
generation shader. However, a downside to this approach is that compute shaders
cannot dispatch work for themselves; the CPU would have to dispatch
the preprocessing work, read back the relevant information, and feed it back to the
respective shaders. This is not preferable as CPU-GPU communication is very
expensive.
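The group count above follows from a ceiling division of the point count by the group size. The sketch below is illustrative only; the helper names groupsNeeded and idleThreads are hypothetical and do not come from the project code.

```cpp
#include <cassert>

// Threads per group used for point generation (one wavefront).
constexpr unsigned kThreadsPerGroup = 64;

// Smallest number of groups whose threads cover numPoints (ceiling division).
unsigned groupsNeeded(unsigned numPoints,
                      unsigned threadsPerGroup = kThreadsPerGroup) {
    assert(threadsPerGroup > 0);
    return (numPoints + threadsPerGroup - 1) / threadsPerGroup;
}

// Threads left without a point when a fixed number of groups is launched.
unsigned idleThreads(unsigned numPoints, unsigned groups,
                     unsigned threadsPerGroup = kThreadsPerGroup) {
    const unsigned launched = groups * threadsPerGroup;
    return launched > numPoints ? launched - numPoints : 0;
}
```

For 4225 points this gives 67 groups with 63 idle threads, while a patch that generates only a handful of points leaves almost all of the 4288 launched threads idle, which is the waste at low factors described above.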
Potential improvements include reducing the overhead of threads computing their
own contextual information and applying significant parallelization to point
connectivity. As with quadrilaterals, the complex connectivity pattern (including the
patching of incorrectly produced intermediate values) proves hard to parallelize
while maintaining correctness.
General Overview of Quads
When the type of geometric primitive is set to quad, our Tessellator subdivides a
single quadrilateral into more detailed geometry. There are separate levels of detail
for the inside and outside of the new quad. Each outer edge can be split into a
different number of segments. Likewise, the number of inside horizontal and
vertical segments can be set to different levels, as shown below.
Figure 35: A subdivided quad with different outer and inner detail levels. Connected Triangles are numbered
in blue.
In the figure above, the outer-left side of the quad is set to 3 segments and the inside
horizontal and vertical portions of the quad are set to 3 and 4 segments,
respectively.
Input Description
The tessellation factors that control the level of detail are as follows:
Inner1 – (horizontal or U-Axis)
Inner2 – (vertical or V-Axis)
Outer1 – (left)
Outer2 – (top)
Outer3 – (right)
Outer4 – (bottom)
Each of these is a floating point number that represents how many segments
to create.
Output Description
The quad Tessellator outputs both a vertex and index buffer.
Vertex Buffer
o The vertex buffer stores a (U, V) coordinate for each generated point.
Index Buffer
o The index buffer stores the order that vertices are connected to each
other.
Process Tessellation Factors
For each of the tessellation factors, a series of “magic numbers” is calculated that
contains useful information. First, if any of the tessellation factors are equal to or
below zero, a flag is set and the patch is culled later on. Otherwise the factors are
clamped based on what partitioning mode is used. The two fractional modes both
have different ranges from the integer modes, as shown in the following table:
Mode               Factor Range
Integer            [1, 64]
Pow2               [1, 64]
Fractional Even    [2, 64]
Fractional Odd     [1, 63]
If the partitioning mode is set to one of the integer modes (integer, or Pow2), then
the ceiling of the tessellation factor is stored in the magic numbers structure. The
tessellation factor’s even or odd parity is stored as well, unless the Tessellator is
set to a fractional mode, in which case the mode’s parity is stored. At this point,
after clamping and rounding have been completed, if all factors are set to 1, then
we set a special flag to do the minimum amount of work. In this case, since
subdividing a line into 1 segment is rather trivial, no additional information is
required.
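The culling, clamping, and rounding steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the names Partitioning, ClampedFactor, and clampTessFactor are hypothetical, and treating NaN as culled is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

enum class Partitioning { Integer, Pow2, FractionalEven, FractionalOdd };

struct ClampedFactor {
    bool culled;   // true when the raw factor is <= 0
    float value;   // clamped (and, for integer modes, rounded-up) factor
};

ClampedFactor clampTessFactor(float raw, Partitioning mode) {
    // Factors at or below zero flag the patch for culling.
    // Assumption: NaN fails this comparison and is treated as culled too.
    if (!(raw > 0.0f)) {
        return {true, 0.0f};
    }
    // Ranges taken from the table above.
    float lo = 1.0f;
    float hi = 64.0f;
    if (mode == Partitioning::FractionalEven) lo = 2.0f;
    if (mode == Partitioning::FractionalOdd) hi = 63.0f;

    float v = std::min(std::max(raw, lo), hi);
    // Integer modes store the ceiling of the clamped factor.
    if (mode == Partitioning::Integer || mode == Partitioning::Pow2) {
        v = std::ceil(v);
    }
    assert(v >= lo && v <= hi);
    return {false, v};
}
```

For example, a raw factor of 2.3 becomes 3 in integer mode and stays 2.3 in fractional even mode, while a raw factor of 100 is clamped to 63 in fractional odd mode.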
If some of the tessellation factors are greater than 1, additional information is
needed for each factor. This is called the tessellation factor context, or
TessFactorCtx which stores the following:
Variable                      Description
halfTessFactor                tessFactor / 2
                              Note: 0.5 is added to this number on the following
                              conditions: Mode is fractional odd, or halfTessFactor
                              is itself 0.5.
halfTessFactorFloor           Floor(halfTessFactor)
halfTessFactorCeil            Ceil(halfTessFactor)
halfTessFactorFraction        halfTessFactor - halfTessFactorFloor
invHalfTessFactorFloor        1 / (2 * halfTessFactorFloor - 1) if mode is ODD,
                              otherwise 1 / (2 * halfTessFactorFloor)
                              Used in placing points into the correct position
                              using linear interpolation later on.
invHalfTessFactorCeil         1 / (2 * halfTessFactorCeil - 1) if mode is ODD,
                              otherwise 1 / (2 * halfTessFactorCeil)
                              Used in placing points into the correct position
                              using linear interpolation later on.
splitPoint                    A. Remove Most Significant bit of halfTessFactorFloor
                              B. If fractional_odd subtract 1
                              C. Multiply by 2 and add 1.
                              A very important value in determining which point is
                              the “split point” for fractional even and odd
                              partitioning modes. Unused for other modes.
numFloorSegments              halfTessFactorFloor * 2
                              Note: If in fractional Odd Mode subtract 1 from this.
numCeilSegments               halfTessFactorCeil * 2
                              Note: If in fractional Odd Mode subtract 1 from this.
numPointsForOutsideEdge[4]    Stores the number of points that the Tessellation
                              factor will generate per edge.
numPointsForInside[2]         Stores the number of points that the inner
                              tessellation factors will generate for both the
                              inside axis.
tessellationParityInner[2]    Stores if the rounded tessellation factor is even
                              or odd.
tessellationParityOuter[4]    Stores if the rounded outer factors are even or odd.
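The first rows of the table can be followed step by step in code. The sketch below is a floating-point illustration only: the struct and function names are hypothetical, splitPoint and the point counts are left out, and the input factor is assumed to have already been clamped (and rounded, for the integer modes).

```cpp
#include <cassert>
#include <cmath>

// Subset of the tessellation factor context (hypothetical struct name).
struct HalfFactorInfo {
    float halfTessFactor;
    int   halfTessFactorFloor;
    int   halfTessFactorCeil;
    float halfTessFactorFraction;
    int   numFloorSegments;
    int   numCeilSegments;
    float invNumSegmentsOnFloorTessFactor;
    float invNumSegmentsOnCeilTessFactor;
};

// Assumption: tessFactor is already clamped for its partitioning mode.
HalfFactorInfo computeHalfFactorInfo(float tessFactor, bool fractionalOdd) {
    HalfFactorInfo c{};
    c.halfTessFactor = tessFactor / 2.0f;
    // 0.5 is added in fractional odd mode, or when the half factor is 0.5.
    if (fractionalOdd || c.halfTessFactor == 0.5f) {
        c.halfTessFactor += 0.5f;
    }
    c.halfTessFactorFloor = static_cast<int>(std::floor(c.halfTessFactor));
    c.halfTessFactorCeil = static_cast<int>(std::ceil(c.halfTessFactor));
    c.halfTessFactorFraction = c.halfTessFactor - c.halfTessFactorFloor;

    // Fractional odd mode has one segment fewer on each side.
    const int oddAdjust = fractionalOdd ? 1 : 0;
    c.numFloorSegments = 2 * c.halfTessFactorFloor - oddAdjust;
    c.numCeilSegments = 2 * c.halfTessFactorCeil - oddAdjust;
    assert(c.numFloorSegments > 0 && c.numCeilSegments > 0);

    c.invNumSegmentsOnFloorTessFactor = 1.0f / c.numFloorSegments;
    c.invNumSegmentsOnCeilTessFactor = 1.0f / c.numCeilSegments;
    return c;
}
```

As a worked case, a fractional even factor of 6.4 gives a half factor of 3.2, a floor of 3, a ceiling of 4, and 6 and 8 segments respectively; a fractional odd factor of 5 gives a half factor of 3 and 5 segments for both floor and ceiling.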
For each outside edge and inner axis of the quad we store the number of points
that are going to be generated based on the proper tessellation factor. The total
number of outside points is stored in the magic numbers as the base offset for
the inside points. The total number of both outside and inside points is stored here
as well.
Point Generation
At the start of point generation, the tessellator sets an integer named pointOffset
to zero. This variable is used as an index for accessing the vertex buffer. Quad
point generation is split up into two processes: the outside point generation, which
consists of generating the points for each of the four edges of the quad, and the
inner point generation, which consists of all the points interior to the
aforementioned edges. When both are put together, the process can be thought
of as traversing the quad in a spiral pattern. Such a spiral pattern takes advantage
of the fact that a certain set of points always lies along a line perpendicular to one
of the two U-V axes. Therefore, one coordinate of every point on that line remains
constant. This is not unlike the equation of a horizontal or vertical line, the
difference being that here, each point is a UV coordinate as described earlier.
Outside Points
As the edges are traversed, odd edges have a constant V location. Conversely,
even edges have a constant U location. Edge 0 has a U coordinate of 0, while
edge 2 has a U coordinate of 1. Edge 1 has a V coordinate of 0 while edge 3 has
a V coordinate of 1. The portion of the UV coordinate that is not constant
is calculated via the placePointIn1D function. This is done per edge by a for-loop
that loops from 0 to an endpoint calculated ahead of time in the
TessFactorContext. The end point is reduced by 1 since the next edge’s loop
will already calculate that point.
Figure 2: The edges of the quad are labeled 0 - 3, each of the four
points shown could be any arbitrary point along an edge. (Point labels: edge 0 at (0, V), edge 1 at (U, 0), edge 2 at (1, V), edge 3 at (U, 1).)
Beginning with edge 0, points are placed bottom to top along the V axis. On edge
1, points are placed from left to right along the U axis. On edge 2, points are placed
top to bottom along the V axis, and finally on edge 3, points are placed right to left
along the U axis. This allows for simple code reuse since both even edges and
both odd edges use the same point calculations, except in opposite order. On
edges 1 and 2, the order is flipped by subtracting the current point from the end
point. Each time a point is placed into the vertex buffer, the pointOffset variable is
incremented. As an example:
DefinePoint(1, param, pointOffset++)
The first argument is the U-coordinate, the second argument is the V-Coordinate,
and the third is the pointOffset. This example is placing points along edge 2. After
the last point on the last edge is placed, the outer point generation is completed.
Figure 3: The spiral pattern of the outside points. The numbers boxed
outside depict the values passed to placePointIn1D. The interior is labeled “Inside Spiral Portion”.
Inside Points
The inside point generation is a bit more complex, and involves 3 nested for loops:
o for each ring
    for each edge (in a ring)
        for each point (on an edge)
The number of rings is calculated using some useful data from the tessellation
factor context. Each tessellation factor has an associated number of points that it
would generate if it were tessellating one line. For example, a tessellation factor of
3 would generate 4 points, since a line would be divided into 3 segments. This can
be seen in figure 3.
Figure 36: Each ring places points along 4 edges, and each edge contains some number of points based on the inner
factors. (Figure labels: startPoint for ring #1, endPoint 1, and the start/end of rings 2, 3, and 4.)
Variable             Calculation
(int) startPoint     This is just the current ring, renamed for clarity based
                     on how it will be used later.
(int) numRings       Min(numPointsInner1, numPointsInner2) / 2
(int) endpoint[0]    numPointsInner1 - 1 - startPoint
(int) endpoint[1]    numPointsInner2 - 1 - startPoint
To determine the number of inside rings take the smallest of the point counts
associated with the two inner factors. Halving this smallest number will yield the
number of rings. At the beginning, the ring loop initializes the startPoint to 1, and
the two endpoints are calculated based on the point counts associated with the
two inner factors. Each iteration of the ring loop increments the startPoint and
recalculates both end points: endpoint[0], endpoint[1].
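The per-ring values above are simple integer arithmetic. A minimal sketch follows; RingBounds, numRings, and ringBounds are hypothetical names, and the loop that walks the rings is intentionally left out because its exact bounds are not restated here.

```cpp
#include <algorithm>
#include <cassert>

// Start and end indices for one ring (hypothetical struct name).
struct RingBounds {
    int startPoint;
    int endPoint[2];  // [0] for the U axis (Inner1), [1] for the V axis (Inner2)
};

// Half of the smaller of the two inner point counts (integer division).
int numRings(int numPointsInner1, int numPointsInner2) {
    return std::min(numPointsInner1, numPointsInner2) / 2;
}

// Bounds for the given ring; the ring index doubles as startPoint.
RingBounds ringBounds(int numPointsInner1, int numPointsInner2, int ring) {
    assert(ring >= 1);
    RingBounds b{};
    b.startPoint = ring;
    b.endPoint[0] = numPointsInner1 - 1 - b.startPoint;
    b.endPoint[1] = numPointsInner2 - 1 - b.startPoint;
    return b;
}
```

For inner point counts of 5 and 9, numRings returns 2, and the ring with startPoint 1 has end points 3 and 7.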
The number of edges to use for the edge loop is 4, since each ring has 4 edges.
The edge logic is similar to the previous outer edge calculations but with some
additional challenges. The primary difficulty lies in proper placement of the
perpendicular portion of the UV coordinate. When calculating the outer edges, this
value was trivially either zero or one. Now, the value can lie anywhere
in [0, 1] and changes depending on which ring is being calculated. Each
iteration of the edge loop calculates several important values including the
perpendicular portion of the UV coordinate.
Variable                        Calculation
(int) parity[0]                 oddOrEvenParity(edge)
                                Note: the current edge. This governs whether an
                                edge is moving along the U or V axis.
(int) parity[1]                 oddOrEvenParity(edge + 1)
                                Note: the next edge. This governs whether an
                                edge is moving along the U or V axis.
(int) perpendicularAxisPoint    For edges 0 and 1: = startPoint
                                For edges 2 and 3: = endpoint[ parity[0] ]
                                The axis point is passed to the placePointIn1D
                                function and changes depending on edge and the
                                parity of the current edge.
(float) perpParam               The perpendicular portion (either U or V)
                                returned by placePointIn1D
After the perpendicularAxisPoint is calculated it is passed to the placePointIn1D
function which then returns the perpendicular U or V coordinate to use in the next
loop.
The innermost loop is responsible for calculating the second portion of the UV
coordinate. It loops from p = startPoint to p < endpoint[parity[1]]. In other words, its
terminating condition depends on the parity of the next edge (really the inverse
parity of the current edge). When the edge is edge 0 or 3, the order of the points
is reversed. For example, if looping from p = 1 to p < 4, instead of placing points in
the order {1, 2, 3} they are placed as {4, 3, 2}. This reversed point, q, is
calculated as:
q = endpoint[parity[1]] - (p - startPoint)
Figure 37: The inner for loop for the first edge calculates the point values to pass to placePointIn1D: {7, 6, 5, 4, 3, 2}. The V axis tess factor here has been rounded to 8, and numPointsInner2 is calculated as 9. The startPoint is 1, and the endpoint = 9 - 1 - startPoint = 7. Point order is reversed since the edge is even (0).
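The reversed index can be checked with a few lines of code. In this sketch, reversedPointOrder is a hypothetical helper that collects the q values the innermost loop would pass to placePointIn1D for a reversed edge; endPoint stands for endpoint[parity[1]].

```cpp
#include <cassert>
#include <vector>

// Collects q = endPoint - (p - startPoint) for p in [startPoint, endPoint).
std::vector<int> reversedPointOrder(int startPoint, int endPoint) {
    assert(startPoint <= endPoint);
    std::vector<int> order;
    for (int p = startPoint; p < endPoint; ++p) {
        order.push_back(endPoint - (p - startPoint));
    }
    return order;
}
```

Calling reversedPointOrder(1, 7) yields {7, 6, 5, 4, 3, 2}, matching the values in Figure 37, and reversedPointOrder(1, 4) yields {4, 3, 2}.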
Once this reversed point has been determined, the placePointIn1D function is called
to calculate the second portion of the UV coordinate. A point will be defined based
on the parity of the edge. If the edge is odd, then the U coordinate is the
perpendicular coordinate that will be reused. Otherwise, the V coordinate will be
reused.
Odd:
o DefinePoint(perpParam, param, pointOffset++)
Even:
o DefinePoint(param, perpParam, pointOffset++)
After the last ring has been completed, inner point generation is finished, except
for two exceptional cases that occur only when an inner tessellation factor is
rounded to an even number. When this occurs, the middle portion of the ring
becomes degenerate, i.e. it degenerates into a single row or column of points, and
the logic that normally calculates a ring fails. This means that two additional loops
need to be created to handle these two special cases.
Figure 38: The degenerate rings of the two edge cases are shown in red, while the regular rings are shown in blue.
The easiest way to handle this behavior is to run the regular ring loop just as
before, but afterward use the following pseudo-code:
If (tessParityInner[0] == EVEN OR tessParityInner[1] == EVEN)
    o If numPointsInner1 (U-Axis) > numPointsInner2 (V-Axis)
        for each point
            DefinePoint(p, 0.5, pointOffset++)
    o If numPointsInner1 (U-Axis) <= numPointsInner2 (V-Axis)
        for each point
            DefinePoint(0.5, p, pointOffset++)
This ensures that the middle line for both of these cases will be filled in properly
along the center line, where either the U or the V coordinate is 0.5. This technique also works
regardless of the number of rings or total number of points.
Figure 39: No matter the total number of points or the width, there will always be only one degenerate ring, and it will
always lie along the line U = 0.5 or V = 0.5. (The example shown lies along V = 0.5.)
With both the inner and outer point generation completed, the Vertex Buffer now
contains all of the correct UV coordinates.
Point Connectivity
The purpose of point connectivity is to create a series of indices into the vertex
buffer that together define a primitive geometric shape that the graphics card can
draw and rasterize. In the case of a quad, that primitive is the triangle, even for the
simplest non-tessellated quad, which is made up of two triangles and four points.
Each of the two triangles would normally require three vertices, for a total of six.
The index buffer allows two of the vertices to be reused, since at least two vertices
must be shared by these triangles. In longer triangle strips, this savings can be
quite significant.
Figure 8: The triangles are numbered in blue. The four vertices are numbered 0 - 3. The winding
direction for the triangles is counter clockwise (ccw). (Figure labels: Triangle #, Vertex #.)
When a quad is made of two triangles, the vertex buffer contains the following:
Vertex Number    U       V
0                0.0f    0.0f
1                0.0f    1.0f
2                1.0f    1.0f
3                1.0f    0.0f
The index buffer contains the following:
Index    Corresponding Vertex
0        0
1        2
2        1
3        0
4        3
5        1
The direction that the triangles are wound depends on the vertex order that is
specified in the Index Buffer. There are two possible winding directions:
o Counter clockwise (CCW)
    Example: {0, 2, 1}
o Clockwise (CW)
    Example: {0, 1, 2}
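The winding of an indexed triangle can be checked from the sign of its area in UV space. The sketch below uses the four vertices from the vertex buffer table; the names UV, Triangle, signedArea, and flipWinding are hypothetical, and it assumes the V axis points up, so that a positive area means counter clockwise.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

struct UV { float u; float v; };
using Triangle = std::array<std::uint32_t, 3>;

// The four quad corners from the vertex buffer table.
const std::array<UV, 4> kQuadVertices = {{
    {0.0f, 0.0f}, {0.0f, 1.0f}, {1.0f, 1.0f}, {1.0f, 0.0f}
}};

// Signed area of an indexed triangle.
// Assumption: V points up, so positive means CCW and negative means CW.
float signedArea(const Triangle& t, const std::array<UV, 4>& verts) {
    for (std::uint32_t i : t) {
        assert(i < verts.size());
    }
    const UV& a = verts[t[0]];
    const UV& b = verts[t[1]];
    const UV& c = verts[t[2]];
    return 0.5f * ((b.u - a.u) * (c.v - a.v) - (c.u - a.u) * (b.v - a.v));
}

// Swapping the last two indices reverses the winding direction.
Triangle flipWinding(Triangle t) {
    std::swap(t[1], t[2]);
    return t;
}
```

Under that assumption, {0, 2, 1} has a positive area and {0, 1, 2} a negative one, and flipWinding converts one into the other, which is one way a tessellator could support both winding orders from a single connectivity pass.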
A programmer using the DirectX 11 or OpenGL pipelines can switch between
these two winding orders as needed, so both must be supported by the Tessellator.
As stated earlier, the triangles are placed together in much the same way as a strip
of triangles.
Figure 40: A Triangle strip made of 7 vertices, v0-v6. Credit: Khronos.org
The strip inside a quad, however, has been twisted into a spiral pattern due to the
order of the point generation. The first triangle is made from the first point on the
outside edge, the first point on the first inside ring, and the second point on the
outside edge. The next triangle is made from the second point on the outside edge,
the second point on the first inside ring, and the third point on the outside edge.
The pattern continues until that side of the quad has been traversed.
For points after the halfway point, the direction of the diagonals is flipped,
almost as if the triangles have been calculated backwards, from
triangle 6 to triangle 4. In the first example, all of the tessellation factors have been
set to 4, causing the rings to artificially line up in a nice manner. When inside and
outside tessellation factors are set to differing levels of detail, issues begin to crop
up in the proper sequence of the triangles’ connectivity.
Figure 41: The triangles are connected in the order
along the spiral.
This irregular difference in the number of points can be corrected by calculating
the correct triangle connections based on:
o Number of inside points for a given edge.
o Number of outside points for a given edge.
o Type of needed Diagonal connection.
    1. Inside to outside
    2. Inside to outside (except middle)
    3. Diagonals Mirrored.
Figure 42: For the left and right edges, the connectivity between the inside of the quad and the outside of the
quad is no longer 1 - 1.
Furthermore, since the connections made in the index buffer are completely independent
of the actual UV locations of the points, this technique can work for any parallel
pair of inside and outside edges. This portion of the algorithm is placed inside the
stitchRegular() function for reuse. The whole connectivity algorithm for quads
works as follows:
For each ring
    o For each outer/inner edge pair
        Call stitchRegular
        Passing:
            insideEdgePoint
                o The starting point for the inner row or column of points
            outsideEdgePoint
                o The starting point for the outer row or column of points
            numInsideEdgePoints
                o In the previous figure, there are 3 inside edge points per edge.
            baseIndexOffset
                o The base index that will be used for emitting new triangles
                  into the index buffer (an index for the index buffer).
            Diagonal type
                o Based on tessellation factor and side of the Quad.
Figure 43: In this case, there would be two
problem points. Both are highlighted in red.
In most cases, no incorrect triangle will be generated in the index buffer using the
aforementioned method. However, at the end of each ring, the very last triangle
always contains a wrong point. The indexing of the final point is what causes this
error. Recall that each ring of points begins on a certain start index, and that the
very last point of a complete ring would itself be that same start index. The
stitchRegular() function has no context or concept of these rings, so instead of
detecting the end of a ring and indexing the vertex of the triangle as the first point,
it instead believes erroneously that a new point exists. One simple fix to this
problem is to create a small lookup table that contains the indices of these incorrect
points, along with the new points, and “patch” any inconsistencies as they are
created.
A special case also exists when one of the inside tessellation factors is odd. This
causes a degenerate row of triangles that are missed by the ring based triangle
connections. Much like the degenerate row of points during point generation, this
row of triangles can occur either horizontally or vertically along the U-V axes. The
method to handle these two cases is nearly identical to the method used in point
generation. After the normal ring algorithm has run to completion, check if either
of the two inner factors is odd. Next check which of the two factors will generate
the most points. These values have conveniently been pre-calculated in the
TessFactorCtx as numPointsForInner[0] (U-Axis) and numPointsForInner[1] (V-Axis).
If the number of points along the U-Axis is greater, a quad strip is connected
along the U-Axis. Otherwise, if the number of points along the V-Axis is greater, a
strip is connected along the V-Axis. If the number of points for both axes is equal,
a single quad is connected in the center. After the correct two rows of points are
determined, they are processed via stitchRegular(), which connects some number
of primitive quads. That is to say, the connected quads have only 4 points and two
triangles.
Figure 44: The degenerate row of quads is shown in red.
Figure 45: An equal number of points for the inner tessellation factors causes a single quad to be missing from
the center.
After the last triangle is connected and placed into the index buffer, the triangle
connectivity has run to completion. Although our understanding of quad
connectivity as described above is mostly complete, this is an area of the
tessellation project still being researched, so certain details and edge cases still
need to be fleshed out. Regardless, the algorithm works closely enough to our
current model that we do not feel that our final design will be drastically different.
Parallel Quad Tessellation Design
High Level Design
The input to the quad primitive generator will simply be the six floating point
tessellation factors as described in the quad tessellation section. The needed
context for each tessellation factor will be placed into a single read-write buffer on
the GPU, including the unprocessed tessellation factors. Based on the raw
tessellation factors, two buffers will be created on the GPU with sufficient space to
act as the Vertex and Index buffers. After the raw tessellation factors have been
loaded onto the GPU, a compute shader will be dispatched to process the
tessellation factors. When this shader’s execution has completed, four more
compute shaders will be dispatched – each with access to the now processed
tessellation context. The first two of these shaders will handle point generation
while the last two will handle triangle connectivity.
Figure 46: Each of the needed shaders has access to the six tess. factor contexts. (Diagram: the raw factors feed the tessellation factor Process Shader, which produces Context [1 – 6]. The PointGen Outer and PointGen Inner shaders write to the Vertex Buffer; the ConnGen Outer and ConnGen Inner shaders write to the Index Buffer.)
The point generation shaders are dispatched at the same time as the point
connectivity shaders since even though these processes seem interdependent,
they can be completed separately without any data shared between them. One of
the reasons for this is that the points are generated in such a regular pattern that
only the number of points and TessFactorContext is needed for the point
connectivity.
Detailed Design
Processing Tessellation Factors
Each tessellation factor’s context will be stored in the following structure:
TessFactorCtx
(float) invNumSegmentsOnFloorTessFactor
(float) invNumSegmentsOnCeilTessFactor
(float) halfTessFactorFraction
(float) tessFactor
(int) numHalfTessFactorPoints
(int) splitPoint
(Parity) tessFactorParity
(int) numPointsForTessFactor
(bool) isCulled
(bool) isMinimumWork
An array of length 6 will represent the factors.
TessFactorCtx factors[6];
The factors must first be loaded onto the GPU. In DirectCompute, this means
wrapping the data in several layers of buffers, sub-resources, and shader resource
views. The reason that so many layers exist is that there are many different types
of buffers that can be created on the GPU, and each of these buffers can be
configured to interact efficiently with a large number of threads. Shader Resource
Views allow for even more customization by ensuring that a shader interacts in a
very specific manner with a buffer.
Figure 47: The initial data is loaded into a view that the shaders can access. (Diagram: TessFactorCtx (Context), Sub-resource, RWStructured Buffer, Unordered Access View, Shaders on GPU.)
For loading the tessellation factors and their initialized context onto the GPU, the
array will first be placed into a D3D11_SUBRESOURCE_DATA. A structured
buffer will be created to contain this sub-resource. Because the buffer will be
accessed by 4 shaders simultaneously, the shader resource view for the buffer
should be an unordered access view (UAV) to allow for multiple shaders to read
from it concurrently.
Since the array of TessFactorCtx is of length six, the High Level Shader Language
(HLSL) file will declare six threads per thread group, with the attribute line looking
something like:
[numthreads(6, 1, 1)]
This launches six threads in the X dimension when the shader is dispatched from
the C++ code running on the CPU. Each thread is indexed with an X, Y, Z
position, so after dispatch the following threads are running:
Thread (0, 0, 0)
Thread (1, 0, 0)
Thread (2, 0, 0)
Thread (3, 0, 0)
Thread (4, 0, 0)
Thread (5, 0, 0)
Processing each tessellation factor in parallel will then be achieved by each thread
calculating the TessFactorCtx that corresponds to its own Dispatch Thread ID.x.
When all six threads run to completion, the TessFactorCtx will have been filled in,
and execution will resume on the CPU so that the shaders responsible for
point generation and point connectivity can be dispatched.
Point Generation
Because the TessFactorCtx has already been loaded onto the GPU previously,
only one buffer needs to be loaded onto the GPU for point generation. This buffer
will serve as the Vertex Buffer and will be the output for this stage. Loading the
buffer onto the GPU follows a similar pattern to the buffer used for the tessellation
context, with two key differences. The buffer is a simple buffer instead of a
structured buffer, and it does not require a sub-resource since there is no initial
data to load onto the buffer. The signature for the buffer as seen from the point
generation HLSL file uses a built-in data type:
RWBuffer<float2> vertexBuffer;
Float2 is a simple data type consisting of an X and Y float, which for the purposes
of point output will act as the U-V coordinate storage.
Figure 48: Both shaders access the Vertex Buffer at the same time, with provisions that
ensure they only write to separate locations. (Diagram: the Outer Compute Shader and the Inner Compute Shader both reach the RWBuffer (Vertex Buffer) through an Unordered Access View.)
In this proposed parallel implementation, many of the function signatures
described previously for quad tessellation will be identical. The primary difference
lies in how the nested loops will be traversed. Chiefly, in the shader implementation,
they will not be traversed. Instead, the for-loops will be “unwrapped” as much as
possible by utilizing clever thread indexing. Take the outer point generation loops
as an example:
For each edge
o For each point on an edge
placePointIn1D
Four threads will be launched in place of the outer for-loop, instead of iterating
from edge 0 to 3. Unfortunately, in this case, the inner for-loop cannot be
unwrapped due to the potentially irregular nature of the four outer tessellation
factors. However, the four threads will allow all four edges to be calculated simultaneously.
Thread 0
o For each point on an edge
placePointIn1D
Thread 1
o For each point on an edge
placePointIn1D
Thread 2
o For each point on an edge
placePointIn1D
Thread 3
o For each point on an edge
placePointIn1D
To have correct placement of points into the Vertex Buffer, each thread must
calculate the baseOffset of what its own “first point” will be. Thread 0 has a
baseOffset of zero, since it will be placing the first point. Each of the subsequent
threads must add in the number of points that the previous threads will have
calculated. These numbers are already calculated in the TessFactorCtx, so it is
just a matter of accessing the information per thread.
[Figure: Vertex Buffer layout for the four outer threads. Thread (0, 0, 0) writes
indices 0 to 7, thread (1, 0, 0) writes indices 8 to 15, thread (2, 0, 0) writes
indices 16 to 23, and thread (3, 0, 0) writes indices 24 to 31.]
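The baseOffset rule described above amounts to an exclusive prefix sum over the per-edge point counts. The following is a minimal CPU-side sketch in C++, not the project's shader code; the function name computeBaseOffsets and the pointsPerEdge array are hypothetical stand-ins for values that the document says are read from the TessFactorCtx.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: pointsPerEdge[i] is the number of points the thread
// for edge i will write (in the document these counts come from the
// TessFactorCtx). Returns the first Vertex Buffer slot owned by each thread.
std::array<std::size_t, 4> computeBaseOffsets(
    const std::array<std::size_t, 4>& pointsPerEdge) {
    std::array<std::size_t, 4> baseOffset{};
    std::size_t running = 0;
    for (std::size_t edge = 0; edge < 4; ++edge) {
        baseOffset[edge] = running;       // thread 0 starts at slot 0
        running += pointsPerEdge[edge];   // later threads skip earlier points
    }
    return baseOffset;
}
```

With eight points per edge, as in the layout above, the offsets come out as 0, 8, 16 and 24, so no two threads write to the same slot.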
The inner point generation is more regular and both outer for-loops can be unrolled
into thread IDs as before.
For each ring
    For each edge
        For each point on an edge
            placePointIn1D
Becomes:
Thread(ring, edge)
    For each point on an edge
        placePointIn1D
The shader for the inner points will be dispatched with (ring, edge, 1)
threads, where ring is the number of inner rings and edge is the number of edges
per ring. Each ring now has an index based on the Dispatch Thread ID.x
coordinate, and each edge has an index based on the Dispatch Thread ID.y
coordinate. For example, the thread handling the calculations for the 4th ring on edge 3 would
have the dispatch thread ID (3, 3, 0), while the thread handling the 1st ring on
edge 0 would be (0, 0, 0).
Figure 49: An inner point generation with 2 rings and 16 points. A total of 8 threads are dispatched. There are
rings 0 and 1. Edges 0 – 3. Each point has the corresponding thread that is responsible for its calculation
represented as an ordered triple (ring ID, edge ID, 0).
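To make the (ring, edge) indexing concrete, the sketch below enumerates on the CPU the thread IDs that a (rings, 4, 1) dispatch would produce. It is only an illustration of the mapping, not the shader itself; ThreadId and innerDispatchIds are hypothetical names, and ThreadId stands in for the value HLSL exposes through SV_DispatchThreadID.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the dispatch thread ID seen inside a compute shader.
struct ThreadId {
    unsigned x;  // ring index
    unsigned y;  // edge index
    unsigned z;  // unused, always 0
};

// Lists every thread of a (rings, 4, 1) dispatch. Each entry replaces one
// iteration of the former "for each ring / for each edge" loop pair.
std::vector<ThreadId> innerDispatchIds(unsigned rings) {
    const unsigned edgesPerRing = 4;
    std::vector<ThreadId> ids;
    ids.reserve(static_cast<std::size_t>(rings) * edgesPerRing);
    for (unsigned ring = 0; ring < rings; ++ring) {
        for (unsigned edge = 0; edge < edgesPerRing; ++edge) {
            ids.push_back({ring, edge, 0});
        }
    }
    return ids;
}
```

For the two-ring case in Figure 49 this yields 8 thread IDs, from (0, 0, 0) to (1, 3, 0).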
Because the loops have been unfurled, calculations that a parent loop would
normally make once now must be calculated per-child. A good example of this
would be the perpendicular U-V parameter that was previously calculated in the
edge-loop. Now, this parameter must be calculated by each thread. This small
amount of extra work pales in comparison to the amount of parallelism that the
new thread indexing provides. The nested method would require 1024 sequential
steps to calculate the locations of the inner points for inner tessellation factors of
32 U-Axis and 32 V-Axis. The parallelized method would dispatch (16, 4, 1)
number of threads, for a total of 64 simultaneous threads. The threads that perform
the most work are the outermost rings, such as thread (0, 0, 0), which will loop at
most 30 times. Naïvely, this suggests that the parallel version takes only 2.9% of the
time of the sequential version, but this is not a true order analysis, and it is not based
on actual instruction count or execution time. This is an area that will require
testing to determine the actual speedup, if any.
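The 2.9% figure is simply the longest per-thread loop divided by the sequential step count. The small C++ sketch below reproduces that arithmetic; the function names are hypothetical, and the result is the same naïve ratio discussed above, not a measured speedup.

```cpp
// Ratio of the busiest thread's loop count to the sequential step count.
// This mirrors the naive estimate in the text; it ignores instruction
// counts, memory traffic and scheduling.
double naiveStepRatio(unsigned sequentialSteps, unsigned longestThreadLoop) {
    return static_cast<double>(longestThreadLoop) / sequentialSteps;
}

// Total threads in a (x, y, z) dispatch.
unsigned dispatchedThreads(unsigned x, unsigned y, unsigned z) {
    return x * y * z;
}
```

For the example in the text, naiveStepRatio(32 * 32, 30) is 30 / 1024, roughly 0.029, and dispatchedThreads(16, 4, 1) is 64.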
Point Connectivity
The Tessellation Context that the connectivity needs will have already been loaded
onto the GPU by the tessellation processing stage. The point connectivity will also
launch two compute shaders, one for the outside ring of triangle connections, and
one for the inside ring. The shaders will access the index buffer each time a triangle
is emitted. This index buffer will be loaded into a RWBuffer of integers and the
shaders will have access to the RWBuffer via another unordered access view
(UAV).
[Diagram: the Outer Compute Shader and the Inner Compute Shader both access the
RWBuffer (Index Buffer) through an Unordered Access View.]
Figure 50: After the index buffer is loaded onto the GPU, the shaders access it via the
UAV.
The current proposed design for the shaders is nearly identical to that of the inner
point generation. Since there is a regular number of rings, and a fixed number of
sides for each ring, the nested loops can be unwound in the same fashion. The
shaders will be dispatched with ring number of threads for the X IDs and edge
number of threads for the Y IDs.
Attempted Parallel Implementations
The actual implementation ended up being a fair bit different from the proposed
implementation for a number of reasons, the primary one being performance.
During the initial design phase, an overlooked detail was the impact of
dynamic flow control when branching on thread IDs. For the GCN architecture to
process a group of threads efficiently, every thread in the group should ideally be
executing the same instruction. After studying the
architecture in further depth, this makes logical sense, since each group physically
executes on a SIMD core. Normal branch instructions do not necessarily pose the
same risk, because although a branch is taken, all threads that encounter the
branch will take it. Unfortunately, my initial design placed threads in
situations that guaranteed divergence between all threads. A second design,
which I had mid-semester, instead involved completely unrolling all loops, in
the hope that a large sequential access would allow for great cache performance.
The primary issue with this approach is that it too requires all threads to diverge
significantly from each other. A secondary issue, and also a primary cause of the
divergence, is that every thread must now correctly calculate its own
position within the buffer, the current ring that it lies in, and the edge the position
is on, all of which must be different values, arrived at through divergent
calculations. Worse yet, that is only the initial setup stage for each location in the
output buffer. After these values have been calculated, the calculations for either
the vertex or index buffers must be performed, which also creates divergence
between all threads.
The tessellated primitive for quadrilaterals presents a unique challenge since the
patterns necessary to produce correct output can be quite complex. Correctness
was verified through two methods: a side-by-side visualization
of the output pattern, and an iterative test script. The test script increases all
tessellation factors by a uniform step of 0.01. The parallel implementation was
tested over factors in the range [0, 65], with all cases passing. The script ensures that the
contents of each index buffer match exactly, while allowing a difference of 0.0005
for the UV coordinates in the vertex buffer. This accepted difference accounts for
inaccuracies in the 32-bit fixed-point reference when compared with our more
accurate IEEE float implementation.
For performance, our implementation takes
advantage of the quad’s square structure, since even when subdivided, the quad
will still be composed of quads. There are two distinctions, however: each
outermost edge may be subdivided by a different factor, while the remaining
interior subdivides in NxM dimensions. The reference algorithm provided by
Microsoft emits points in an outside-to-inside square spiral in order to ensure there
are never any duplicate points during triangulation. An added benefit of this
ordering is that the triangulation may be performed independently of point
generation. To take advantage of this inherent parallelism, two thread groups
are dispatched to perform these calculations. Each thread group has a group size of 64
to take advantage of AMD’s GCN architecture, which allows for logical
execution of 64-wide thread groups simultaneously. A group of threads calculates
the data for all points on an edge, one edge at a time. When the kernel is first
launched, it calculates the aforementioned magic numbers and stores the results in
groupshared memory, a low-level data store. The number of threads each kernel
uses totals 128. Because of the relatively small kernel size, multiple dispatches
may be launched at once, allowing many patches to be processed without filling
more than a few compute units.
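The verification rule described above (index buffers must match exactly, UV coordinates may differ by up to 0.0005) can be sketched as follows. This is a minimal C++ illustration, not the project's test script: the UV struct, the function names, and the choice of 32-bit unsigned indices are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct UV {
    float u;
    float v;
};

// Index buffers must agree element for element (index width is assumed).
bool indexBuffersMatch(const std::vector<std::uint32_t>& reference,
                       const std::vector<std::uint32_t>& candidate) {
    return reference == candidate;
}

// Vertex buffers may differ per coordinate by at most `tolerance`,
// which absorbs the fixed-point vs. floating-point rounding gap.
bool vertexBuffersMatch(const std::vector<UV>& reference,
                        const std::vector<UV>& candidate,
                        float tolerance = 0.0005f) {
    if (reference.size() != candidate.size()) {
        return false;
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(reference[i].u - candidate[i].u) > tolerance ||
            std::fabs(reference[i].v - candidate[i].v) > tolerance) {
            return false;
        }
    }
    return true;
}
```

A test driver would call both functions once per set of tessellation factors and report the first factor combination for which either check fails.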
Experimental Results:
CPU vs GPU: run time (in ms) of tessellator at various inputs

Tessellation factors      CPU        GPU
16x16x16x16x16x16         0.00718    0.31857
32x32x32x32x32x32         0.02932    0.76839
48x48x48x48x48x48         0.06026    1.47739
64x64x64x64x64x64         0.11265    2.36981
32x16x8x6x4x2             0.00156    0.13351
64x15x23x14x13x46         0.01564    0.46012

[Chart: CPU vs GPU, time (ms) plotted against tessellation factors]
Design Summary
Isolines
Our final Isoline implementation is a shader that calculates the vertex and index
data for a single point. The dispatch grouping of the shaders is based on the
tessellation factors. If the output patch would resemble a cube then the work would
be dispatched into groups of threads that are responsible for 8x8 segments of the
output to reduce resource usage. Otherwise, a group of threads would be
dispatched for each row of points.
Triangles
The two primary tasks of tessellation, point generation and connectivity, are split
into separate shaders that run in parallel. Each shader computes its own copy of
the necessary derivative values of the tessellation factor contexts (processed
tessellation factors). This avoids unnecessary, expensive communication between
the CPU and GPU, at the cost of statically dispatching threads to the shaders:
because no information about point generation or connectivity is known
before the shader dispatch calls, threads are wasted at lower tessellation
values. For point generation, 67-by-64 threads are dispatched to compute a point per thread.
Quads
The quad tessellation begins by processing the input tessellation factors, using
a single compute shader to calculate the tessellation factor contexts for each of the 6
factors.
1. A RWStructuredBuffer is loaded onto the GPU to hold the TessFactorCtx
2. A compute shader with six threads is dispatched to process the factors.
a. [numthreads(6, 1, 1)]
3. After the previous shader has finished executing
a. A RWBuffer is loaded onto the GPU to be used as the Vertex Buffer
b. A RWBuffer is loaded onto the GPU to be used as the Index Buffer.
The number of threads required to finish point generation is calculated based on
information from the tessellation factor context. The outermost point generation
requires no additional calculations, so the compute shader that handles the outside
points is launched at this point with 4 threads, one for each edge of the quad. In
order to calculate the number of threads to dispatch for the inner quad point
generation, the number of inside rings is calculated from the TessFactorCtx. After
this calculation, ring by 4 threads are dispatched to calculate the
inside points. Both of these compute shaders output their points into the vertex
buffer. When they both have completed execution, quad point generation has been
completed.
[Diagram labels: Raw Factors; Tess. F. Process Shader; Context [1 – 6]; PointGen
Kernel; Vertex Buffer; ConnGen Kernel; Outer; Index Buffer]
Figure 51: The high level diagram for quad tessellation
[Diagram: the Outer Compute Shader and the Inner Compute Shader both access the
RWBuffer (Index Buffer) through an Unordered Access View.]
Figure 52: Index Buffer for quad connectivity
At the same time that the two point compute shaders are being prepared for
dispatch, the two quad connectivity shaders are also about to launch. For brevity’s
sake, it is worth noting that the number of threads and the indexing of said threads are
nearly identical to those dispatched for point generation. After the proper
number of threads is calculated for the two connectivity shaders, they will be
dispatched. Due to the regular nature of the point generation’s spiral pattern, the
point connectivity can be stitched completely separately from the point generation.
In practice, all the connectivity algorithm needs to generate correct output is the
Tess Factor Context.
Project Administration
Facilities and Equipment
The facility that we typically use in team collaboration is the EECS senior design
lab. This facility is extremely new and clean, which is one of the things that makes
the lab such a great meeting place. It is important that we keep the room’s relaxing
and clutter free atmosphere intact, especially since there will be more teams in the
coming semesters that will use it.
Personal Work
Erwin Holzhauser
The project provides a satisfying balance of research and implementation and an
opportunity to branch out to technologies and concepts not covered in a standard
computer science curriculum. Research-wise, this project provides the opportunity
to familiarize oneself with a deeper general understanding of computer graphics
and parallelism, the role of tessellation in rendering surfaces, and the potential
performance trade-offs of fixed-function hardware versus equivalent software
implementations. Technology-wise and implementation-wise, this project provides
the opportunity to learn shading language to the level of proficiency of building on
an existing code base and working with cutting-edge graphics cards. Finally, the
implications of improved tessellation methods on CAD and video game
applications make this a very attractive project.
Matthew Faller
My passion for the last few years has been learning about the algorithms that are
used in computer graphics and game development. My focus thus far has been 2D
algorithms and structures such as quadtrees, collision detection and OpenGL’s
fixed function pipeline. This project is an exciting opportunity to learn about
programmable graphics, and in particular, parallel computing. Tessellation is a
subject that also interests me from a 3D modeling standpoint, since I also use 3D
software for design. Naturally, Catmull-Clark is one of my favorite subdivision
algorithms.
David Sierra
I do not really have any real experience with HLSL, GPU programming, or parallel
programming in general. I chose this project because I wanted to get into
massively parallel programming on the GPU. I first learned about parallel
computing when I read an article in Wired. The article was about high speed trading
computers on Wall Street and the industry of having the fastest software and
internet connection in order to gain an edge on the stock exchange. Although not
massively parallel, the apps that I have written where I have to spawn work in
multiple threads have been the most engaging ones to write so far. There is
something about the challenge of synchronizing packets of work that makes those
programs more fun to write. One thing I did have experience with going into this
project was Visual Studio and setting up big C++ projects in it since it is something
I have to work with at my internship. My biggest advantage going into this project
is my ability to go into someone’s existing code base and be able to find my way
around and figure out what they were doing. During the first semester of this
course, I was also enrolled in COP4331 where we had to make an Android
application that had to query multiple web services. For that project I took charge
of the background querying of Google services and had to manage multiple
threads and callback interfaces. It turned out to be pretty fun and got me even
more interested in parallel programming. So next semester our whole group is
enrolled in COP 4520, which is the Concepts in Parallel and Distributed Computing
class. Overall I am sure it will be an “exciting” semester in the spring.
Lessons Learned
Erwin Holzhauser
Prior to this project, I had negligible Java multithreaded programming experience
and absolutely no shader programming or graphics experience. Even then, Java
uses a different threading model than Direct Compute shaders; instead of globally
assigning work to threads, a Direct Compute shader is written as an instance that
each individual thread will run.
Upon completion, I feel more comfortable with Direct Compute, its threading
model, and the DirectX 11 graphics pipeline.
As with the completion of any larger-scale program, I feel more confident in my
abilities to research a new area of computer science, or learn a new technology.
Our experience with Direct Compute was painful as the technology is not heavily
documented online, but that only reinforced self-direction in learning new
technologies.
I also feel more comfortable with debugging code; while I had some experience
with the Java logger, most of my debugging of C/C++ applications has been with
planted print statements at points of interest. Now, I feel comfortable with
debugging tools such as Visual Studio’s which allow you to set breakpoints and
step into code.
While proactive time management was not a new lesson, it was certainly reinforced
during this project out of pure necessity.
Matthew Faller
Although I had a fair amount of experience writing graphics programs for fixed
function pipelines, I had never had the chance to dabble in any type of shader,
which of course is somewhat sad for a computer graphics enthusiast. To me,
shaders were some mystical black box that made incredible things happen behind
the scenes in my favorite game engines. After working on this project I have now
written: vertex, pixel, geometry, and compute shaders. This is in addition to
learning how the hull and domain shaders function. The most amazing thing is that
the project has expanded my breadth of knowledge in more areas than simply
computer graphics and parallel programming. I knew some C++ going into the
project, but had never written much more than smaller academic projects in the
language. Although not as intuitive as some languages, I know that I’m a much
better programmer having worked in C++ for a year while learning a new (and
complex) API.
One lesson that I’ve learned is that DirectCompute is best left to projects where the
results of a kernel must immediately be passed back into the graphics pipeline.
The threading model is nearly identical to that of the widely popular OpenCL, yet it has fewer
online tutorials, requires more configuration, and has a life interleaved with the
D3D11 pipeline. For compute shader projects going forward, I intend to use
mostly OpenCL and OpenGL.
David Sierra
Going into this project I had no idea about writing programs for efficiency outside
of what I learned in cs1 and cs2. I also didn’t know anything about parallel
programming. This project gave me tons of insight into both of these, especially
into how to distribute work into efficient chunks and into memory access techniques.
It also gave me practical experience developing my own large applications using
C++11. Another pretty important lesson I learned was to manage my time better.
We would have gotten so much more work done if we had just spent 1 more month
doing actual work. The most important thing I think I learned in this project is
managing memory in large applications. At some points, my test app was leaking
hundreds of megabytes of memory a second and it sure was fun tracking that
down.
Project Plan and Milestones
Project Phases:
The project will be broken up into a number of phases, listed below.
1. Plan
2. Research
3. Design
4. Prototype*
5. Implement
6. Test
*Current Phase in Bold
The most important constraints on this project are schedule and scope. The project
must be finished by the senior design presentation deadline, otherwise it is a
failure. The project must also adhere at minimum to the scope outlined in the
specifications section.
Phase        Estimated Duration
Plan         Sep. 10 – Oct. 10
Research     Oct. 10 – Oct. 31
Design       Oct. 15 – Nov. 7
Prototype    Nov. 7 – Dec. 4
Implement    Dec. 4 – Mar. 10
Test         Mar. 10 – Apr. 10
Milestones:
Fall Semester:
Milestone:
Date:
Status:
September 26
HLSL Hello, World! Program
Setup Wiki
October 3
Write sample program with single
thread
Setup GitHub and standardize
development environment amongst
group members
Write sample program with multiple threads
October 10
Completed Early
Signed NDA
Completed Late
Survey of contemporary
algorithms and methods
October 20
tessellation October 24
Completed Early
Completed Early
Completed
Began Reference Code Analysis
October 29
Completed Late
Survey of relevant AMD code-base
October 31
Completed Late
Detailed design of software architecture
November 7
Completed Late
Naïve, single-threaded implementation of system
December 3
Late
Spring Semester:
Milestone: Test harness interface complete
Date: December 19
Status: Started
The test harness interfaces with the 2D visualizer provided by AMD. This allows for
easier unit testing and visual debugging, which will be vital for connectivity. This will
also allow for batch tests of every tessellation factor.

Milestone: Naïve implementation for isolines
Date: December 30
Status: Started
This is a shader implementation for isolines that is still as close as possible to the serial
reference.

Milestone: Naïve implementation for triangles
Date: Jan 10
Status: Started
This is a shader implementation for triangles that is still as close as possible to
the serial reference. This includes both inner-outer point generation and inner-outer
connectivity.

Milestone: Naïve implementation for quads
Date: Jan 10
Status: Started
This is a shader implementation for quads that is still as close as possible to the serial
reference. This includes both inner-outer point generation and inner-outer
connectivity.

Milestone: Naïve output matches reference
Date: Jan 11
Status: Incomplete
Using the test harness, run tests to ensure that all output matches the output expected by
the reference rasterizer.

Milestone: HLSL Tessellation of lines
Date: January 23
Status: Incomplete
A highly parallelized version of isoline primitive generation. (Points and Connections)

Milestone: HLSL Tessellation of triangles
Date: February 20
Status: Incomplete
A highly parallelized version of triangle primitive generation. (Points and Connections)

Milestone: HLSL Tessellation of quads
Date: February 20
Status: Incomplete
A highly parallelized version of quad primitive generation. Both Points and
Connections

Milestone: Integration, optimization, until finalized multi-threaded HLSL implementation
Date: March 15
Status: Incomplete

Milestone: Integration Testing
Date: March 28
Status: Incomplete
Testing Methodology
Figure 53: Screenshot of AMD's reference rasterizer
The way in which we were to validate our tessellator output was provided as a
fixed specification by AMD. They wanted the vertex and index buffer of our
tessellator to match the vertex and index buffer of their reference rasterizer bit for
bit.
Once we received their rasterizer we noticed a problem right away. Their rasterizer
uses fixed point arithmetic as opposed to standard floating point numbers. Fixed
point arithmetic represents numbers with fractional parts by storing the
whole and fractional parts in fixed sets of bits whose positions never change. In practice this
number is stored as an unsigned 32-bit integer. For example, AMD’s reference
rasterizer stores its fixed point numbers with 1 bit dedicated to the sign,
15 bits reserved for the integer part, and 16 bits reserved for the fractional part. The
big upside to this is that you can do fractional math using very fast integer
arithmetic hardware. Since AMD used a fixed piece of hardware to accomplish
tessellation, this was obviously in their favor: they wanted to minimize the space
taken up by the hardware, especially because when the hardware is not
tessellating, it is just sitting there doing nothing.
Figure 54: Representing a floating point value in fixed point notation
In our project though, the graphics cards use standard IEEE 754 floating point
calculations and have hardware that accelerates floating point math. Since floating
point arithmetic is generally more accurate than fixed point arithmetic we were
encouraged to use it and take advantage of the GPU’s hardware acceleration
capabilities. A direct consequence of all this is that our floating point output only
matches their reference output to, on average, 3 significant figures.
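To illustrate where the mismatch comes from, here is a minimal C++ sketch of a fixed point format with 16 fractional bits, matching the 1/15/16 bit split described above. It is not taken from the reference rasterizer: the names are hypothetical, the value is held in a signed two's-complement 32-bit integer for convenience, and rounding to nearest is an assumption (the reference may truncate instead).

```cpp
#include <cmath>
#include <cstdint>

constexpr int kFractionBits = 16;                        // fractional bits
constexpr std::int32_t kFixedOne = 1 << kFractionBits;   // 1.0 in fixed point

// Converts a float to 16.16 fixed point, rounding to the nearest step.
std::int32_t floatToFixed(float value) {
    return static_cast<std::int32_t>(std::lround(value * kFixedOne));
}

// Converts a 16.16 fixed point value back to a float.
float fixedToFloat(std::int32_t fixed) {
    return static_cast<float>(fixed) / kFixedOne;
}
```

Values such as 0.5 survive the round trip exactly, while a value such as 1/3 comes back slightly changed, because the format can only represent multiples of 1/65536. Differences of this kind are what the comparison against the fixed point reference has to tolerate.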
Testing Harness
In order to streamline testing we have chosen to integrate the reference rasterizer
given to us by AMD with Google Test. Google Test is a C++ testing
framework that allows us to easily integrate tests into our code and makes it simple
to test a large number of values in a short amount of time. Google Test does this
by providing an extensive set of assertions. An assertion is a procedure that
checks a boolean expression. If the expression does not evaluate to the
expected value, then the test fails and Google Test outputs an error. Google Test also
makes it easy to insert custom print statements into test case outputs for even
more precise diagnostics. This will provide us with relatively instant feedback when
we make code changes to any part of the algorithm. These small changes will be
extremely numerous during the implementation phase as we scour our
implementation for optimizations. For example, we can have a test case where we
instantiate 2 tessellators, the C reference tessellator and our HLSL
implementation. We can then pass them the same input values and run them. After
that we can store the output vertex and index buffers and compare them in a
standard for loop. Inside the for loop is a Google Test assertion that expects the
two values to be the same. When the values differ, we get output on the screen
telling us of the error and the test fails. Google Test can handle hundreds
of test cases at a time, so we can get a massive amount of testing done in a
relatively short amount of time.
Figure 55: Google test output
To make testing even simpler, the rasterizer provided to us defines an interface for
tessellators. This means we can have our tessellator implement their interface and
be able to plug it right into the code they provided. This makes it extremely easy
to diff results and run test cases with Google Test. We can even overlay our
calculations onto the reference calculations on the screen and get visual feedback
on our errors.
Test Cases
Test Objective: Correctness of Degenerate Triangle for Even Inner Tessellation Factor
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 4.0, t1 = 3.2, t2 = 1.7, i0 = 4.0. Check
the output of the vertex and index buffers generated by the
HLSL implementation against the output of the vertex and
index buffers output by the Microsoft reference rasterizer. This
is an edge case, because an even inner tessellation factor
results in a ‘degenerate triangle’ (a single point) as the center ring.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to triangle and the
tessellation mode is set to integer.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Fractional Odd Mode Tessellation of Triangles
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 1.0, t1 = 1.0, t2 = 1.0, i0 = 1.0. Check
the output of the vertex and index buffers generated by the
HLSL implementation against the output of the vertex and
index buffers output by the Microsoft reference rasterizer. This
is an edge case, because fractional odd tessellation of
triangles requires a minimum inner tessellation value of 1 + 2^(-16).
If the inner tessellation value is not clamped correctly, the
outer ring will overlap with the inner ring.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to triangle and the
tessellation mode is set to fractional odd.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Clamping For Fractional Even Mode
Tessellation of Triangles
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 1, t1 = 1, t2 = 65, i0 = 5. Check the
output of the vertex and index buffers generated by the HLSL
implementation against the output of the vertex and index
buffers output by the Microsoft reference rasterizer. For
triangles, fractional even tessellation mode clamps outer and
inner tessellation values to the range of 2 through 64.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to triangle and the
tessellation mode is set to fractional even.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Clamping For Integer Mode Tessellation of
Quads
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 1, t1 = 66, t2 = -1, i0 = 5, i1 = 7. Check
the output of the vertex and index buffers generated by the
HLSL implementation against the output of the vertex and
index buffers output by the Microsoft reference rasterizer. For
quads, integer tessellation mode clamps outer and inner
tessellation values within the range of 1 through 64.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to quad and the
tessellation mode is set to integer.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Degenerate Vertical Quads
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 3.0, t1 = 1.2, t2 = 4.1, i0 = 6.0, i1 =
3.2. Check the output of the vertex and index buffers
generated by the HLSL implementation against the output of
the vertex and index buffers output by the Microsoft reference
rasterizer. This is an edge case, because the occurrence of
either inner tessellation factor being even, coupled with the
first inner tessellation value being greater than the second,
results in a vertical degenerate quad (a vertical row of single
points) at the center.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to quad and the
tessellation mode is set to integer.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Degenerate Horizontal Quads
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 3.0, t1 = 1.2, t2 = 4.1, i0 = 3.2, i1 =
6.0. Check the output of the vertex and index buffers
generated by the HLSL implementation against the output of
the vertex and index buffers output by the Microsoft reference
rasterizer. This is an edge case, because the occurrence of
either inner tessellation factor being even, coupled with the
second inner tessellation value being greater than the first,
results in a horizontal degenerate quad (a horizontal row of
single points) at the center.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to quad and the
tessellation mode is set to integer.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Integer Mode Tessellation of Isolines
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 1.8 and t1 = 6.2. Check the output of
the vertex and index buffers generated by the HLSL
implementation against the output of the vertex and index
buffers output by the Microsoft reference rasterizer.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to isoline and the
tessellation mode is set to integer.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Fractional Odd Mode Tessellation of Isolines
Test Description: Input to the HLSL implementation of the tessellation algorithm
tessellation factors t0 = 4.0 and t1 = 34.0. Check the output of
the vertex and index buffers generated by the HLSL
implementation against the output of the vertex and index
buffers output by the Microsoft reference rasterizer.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to isoline and the
tessellation mode is set to fractional odd.
Expected Results: The HLSL implementation of the tessellation algorithm
matches index and vertex buffer output of the Microsoft
reference rasterizer.
Test Objective: Correctness of Fractional Even Mode Tessellation of Isolines
Test Description: Pass tessellation factors t0 = 9.2 and t1 = 89.0 to the
HLSL implementation of the tessellation algorithm. Check the vertex and
index buffers generated by the HLSL implementation against the vertex
and index buffers output by the Microsoft reference rasterizer.
Test Conditions: For the HLSL implementation and Microsoft reference
rasterizer, the tessellation type is set to isoline and the tessellation
mode is set to fractional even.
Expected Results: The HLSL implementation of the tessellation algorithm
matches the index and vertex buffer output of the Microsoft reference
rasterizer.
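The three tessellation modes exercised by the tests above differ in how a raw tessellation factor is clamped and rounded before an edge is split. The following is a minimal C++ sketch of that step, not the code of our HLSL implementation or of the reference rasterizer. It assumes the clamp ranges commonly documented for Direct3D 11 ([1, 64] for integer, [1, 63] for fractional odd, [2, 64] for fractional even) and rounds up to the next integer, odd integer, or even integer respectively. The names Partitioning and segmentCount are hypothetical, and per-domain rules and patch culling are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical names, for illustration only.
enum class Partitioning { Integer, FractionalOdd, FractionalEven };

// Returns how many segments one edge is split into for a given factor.
// Assumes factor is a positive, non-NaN number; patch culling for
// factors <= 0 or NaN is deliberately left out of this sketch.
int segmentCount(float factor, Partitioning mode) {
    assert(!std::isnan(factor) && factor > 0.0f);
    // Assumed clamp ranges: [1, 64] integer, [1, 63] odd, [2, 64] even.
    const float lo = (mode == Partitioning::FractionalEven) ? 2.0f : 1.0f;
    const float hi = (mode == Partitioning::FractionalOdd) ? 63.0f : 64.0f;
    const float clamped = std::min(std::max(factor, lo), hi);
    // Round up to the next whole number of segments.
    int n = static_cast<int>(std::ceil(clamped));
    // Fractional modes keep the segment count odd or even respectively.
    if (mode == Partitioning::FractionalOdd && n % 2 == 0) ++n;
    if (mode == Partitioning::FractionalEven && n % 2 != 0) ++n;
    return n;
}
```

Under these assumptions, the factor t1 = 6.2 in integer mode gives 7 segments, t1 = 34.0 in fractional odd mode gives 35, and t1 = 89.0 in fractional even mode is clamped to 64.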
Error Reporting Conventions
Here is the template that we use for reporting bugs and fixes. This type of
reporting is important to a large project of this nature, since errors should be
treated as something to learn from. We will report bugs and problems on Google
Docs. Below is just a generalized template to follow; if a bug requires other
entries or does not fit into the template, alterations may be made. Because our
team set up a Bitbucket account, we also have access to its automated issue
tracker. It might be best to use that automated issue tracking.
Problem: <Name>
Error Code:
<my weird error code or stack trace here>
Description:
<A description of the problem/bug that fully discloses what went wrong, to our best
understanding. If there are things that we do not yet fully understand also list
them.>
Fix:
<What is the solution or workaround for the problem/bug? Feel free to also post
relevant links to external websites, e.g. stackoverflow.com>
Reported By: <Your Name!>
Add the error report to the bug on Bitbucket to save time:
Problem: vgt_te11_reorder.hpp and .cpp missing.
Error Code:
error C1083: Cannot open include file: 'vgt_te11_reorder.h': No such file or
directory
Description:
vgt_te11_reorder.hpp and .cpp missing from the reference rasterizer.
Fix:
We met with Todd from AMD, and he gave us access to the needed files.
Reported By:
Matt, David
Problem: DirectX Debug Build
Error Code:
Looked like this:
'Shaders.exe': Loaded 'C:\WINDOWS\system32\user32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\gdi32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\ole32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\advapi32.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\rpcrt4.dll', Cannot find or open the PDB file
'Shaders.exe': Loaded 'C:\WINDOWS\system32\secur32.dll', Cannot find or open the PDB file
Description:
When trying to build and run a DirectX program that compiled a simple compute
shader, the debugger could not find the PDB symbol files for several system DLLs.
Fix:
Go to Tools → Options → Debugging → Symbols and check the box that lets Visual
Studio download any symbols you do not already have.
http://stackoverflow.com/questions/12954821/cannot-find-or-open-the-pdb-file-in-visual-studio-c-2010
Reported By: Matt
Problem: Downloading wrong DX SDK
Error Code: N/A
Description:
The DirectX SDK is no longer distributed as a separate download; it is now
bundled into the Windows SDK.
Fix:
Download the Windows SDK instead.
Reported By: Matt, Erwin
Problem: Shader Won’t Compile
Error Code:
Failed compiling shader:... 80004005
Description:
The compute shader in the .hlsl file did not compile when passed to the Microsoft
function
D3DCompileFromFile(srcFile, defines, D3D_COMPILE_STANDARD_FILE_INCLUDE,
entryPoint, profile, flags, 0, &shaderBlob, &errorBlob);
Fix:
The entry point was incorrect. Make sure that the entry point you pass to the above
function matches the entry point in your .hlsl code.
Reported By: Matt and David
Problem: Visual Studio Crashes
Error Code:
Visual Studio has stopped working.
Description:
When using watch expressions in conjunction with fixed-point math, the debugger
will sometimes crash Visual Studio while stepping through code or adding a new
value.
Fix:
Still open. We are not sure what causes this.
Reported By: Matt, Erwin, and David
Problem: Visual Studio Project Will Not Build
Error Code:
Build option becomes greyed out and will not allow the project to build.
Description:
Other projects in Visual Studio build without fail, but one particular solution will
not allow the user to build or run.
Fix:
The project was failing to build because a folder inside the project was set to
read-only, which prevented Visual Studio from writing to the project. As a result,
on an attempted build the compiler could not link to the PDB file. The solution
built once the read-only option was unchecked for the folder and all of its
subfolders.
Reported By: Erwin
Project Summary and Conclusions
Our project, Parallel Tessellation Using Compute Shaders, is at its core a software
porting job. We are taking a specification designed to run on fixed-function
hardware and porting it to a new language and a new piece of hardware, in the hope
that it can outperform the fixed-function hardware.
The formal request is that we design and implement a software tessellator written
in Microsoft’s High Level Shader Language (HLSL). This software implementation
will attempt to outperform the fixed-function hardware by taking advantage of the
GPU’s highly parallel vector processors (compute units). AMD’s compute units can
perform the same instruction on up to 64 different pieces of data, enabling a
massive amount of parallelization.
The software implementation should also not consume an excessive amount of
resources in order to achieve its throughput numbers. In addition to achieving its
performance goals, its output must match that of the fixed-function hardware to a
reasonable degree of accuracy (around 3 significant figures). Some slack was given
on accuracy because the fixed-function hardware and the software implementation
use different number formats: the fixed-function hardware uses fixed-point numbers,
while the GPU uses standard IEEE 754 floating-point values. This causes some
discrepancy in the results, as the two formats store the fractional part of a
number in different ways.
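The discrepancy between the two number formats can be illustrated with a small C++ sketch. It is not taken from our implementation: it assumes a fixed-point format with 16 fractional bits, and the names kFractionBits, toFixed, fromFixed, and agreesToSigFigs are hypothetical. The sketch rounds a floating-point value into fixed point, converts it back, and checks whether the two values still agree to a given number of significant figures.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumption: 16 fractional bits; the real hardware format may differ.
constexpr int kFractionBits = 16;

// Round a float to the nearest representable fixed-point value.
inline std::int32_t toFixed(float x) {
    return static_cast<std::int32_t>(std::lround(x * (1 << kFractionBits)));
}

// Convert a fixed-point value back to float.
inline float fromFixed(std::int32_t fx) {
    return static_cast<float>(fx) / static_cast<float>(1 << kFractionBits);
}

// True when a and b agree to roughly `digits` significant figures,
// i.e. their relative difference is at most 0.5 * 10^(1 - digits).
inline bool agreesToSigFigs(float a, float b, int digits) {
    assert(digits >= 1);
    if (a == b) return true;
    const double scale = std::max(std::fabs(a), std::fabs(b));
    const double tol = 0.5 * std::pow(10.0, 1 - digits);
    return std::fabs(static_cast<double>(a) - b) <= tol * scale;
}
```

For example, 1.0f / 3.0f does not survive the round trip through toFixed and fromFixed exactly, but the two values still agree to 3 significant figures, which is the kind of slack described above.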
Along the way we encountered many problems. First, the tessellation spec was very
large and took a lot of time to even begin understanding. Second, getting real
performance out of the GPU was much harder than we anticipated; even the CPU
proved very difficult to outperform. Third, Microsoft’s DirectX API is extensive
and took a long time to get a grasp of. Lastly, there was the time frame we were
given: seven months is not enough time to learn tessellation, DirectX, and GPU
programming techniques and still have enough time left to really start optimizing
performance.