Conference Presentation - Princeton Graphics Group

Real-time Mesh Simplification Using the GPU
Christopher DeCoro
Natasha Tatarchuk
3D Application Research Group
Introduction
• Implements mesh decimation in real time
• Utilizes the new Geometry Shader stage of the GPU
• Achieves a 20x speedup over the CPU
Project Motivation
• Massive increases in submitted geometry
  • Geometry rendered per shadow map (6x for cubemap!)
  • Not always needed at highest resolution
• Geometry not always known at build-time
  • Dynamically-skinned objects only finalized at run-time
  • May be customized to the user's machine based on capabilities; would need to be adapted at program load time
  • Could be dynamically generated per level; needs to be adapted at level load time
• Simplification therefore needs to be fast (or even real-time)

Also, just as importantly…
• We want applications that exercise & stress the GS/GPU
  • Evaluate new capabilities of the GPU
  • Learn how to adapt previously CPU-bound algorithms
  • Develop GPU-centric methodologies
  • Identify future feature set for the GS/GPU as a whole
    • Limitations still exist – which should be addressed?
Contributions
• Mapping of decimation to the GPU
  • 20x speedup vs. CPU
  • Enables load-time or real-time usage
• Detail preservation by non-linear warping
  • Also applicable to CPU out-of-core decimation
• General-purpose GPU octree
  • Adaptive decimation w/ constant memory
  • Applications not limited to simplification: collision detection, frustum culling, etc.
Outline
• Project Introduction and Motivation
• Background
  • Decimation with Vertex Clustering
  • Geometry Shaders in Direct3D 10
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
Vertex Clustering
• Reduces mesh resolution
  • High-res mesh as input
  • Low-res mesh as output
• All implemented on the GPU
  • Ideal for processing streamed-out data
  • Useful when rendering multiple times (e.g., shadows)
  • Can handle enormous models from scanned data
• Based on “Out-of-Core Simplification of Large Polygonal Models,” P. Lindstrom, 2000

Figure from [Lindstrom 2000]
Previous Rendering Pipeline
• Vertex Shaders and Pixel Shaders
• Limited to 1 output per 1 input
  • No culling of triangles for decimation
• Fixed destination for each stage
  • Result meshes cannot be (easily) saved and reused
DirectX10 Rendering Pipeline
• Geometry Shader in between VS & PS
  • Called for each primitive (usually a triangle)
• Able to access all vertices of a primitive
  • Can compute per-face quantities
• Breaks 1:1 input-output limitation
  • Allows triangles to be culled from the pipeline
  • Allows stream-out of processed geometry
    • Decimated meshes can easily be saved and reused
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
  • Overview
  • Quadric Generation
  • Optimal Position Computation
  • Final Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
Algorithm Overview
• Start with the input mesh
  • Shown divided into clusters (9x9x9 grid shown)
• Pass 1: Compute the quadric map from the mesh
  • Use GS to compute quadric
  • Accumulate in the cluster map, an RT used as a large array
• Pass 2: For each cluster, compute optimal position
  • Solves a linear system given by the quadrics
• Pass 3: Collapse each vertex to its representative

Model Courtesy of Stanford Graphics Lab
Vertex Clustering Pipeline
• Pass 1: Create Quadric Map
  • Input: Original mesh
  • Computation:
    • Determine the plane equation and face quadric for each triangle
    • Compute the cluster and address of each vertex
    • Pack the quadric into the RT at the appropriate address
  • Output: Render targets representing clusters, with packed quadrics and average positions
Quadric Map Implementation
//Map a point to its location in the cluster map array
float2 writeAddr( float3 vPos )
{
    uint iX = clusterId(vPos) / iClusterMapSize.x;
    uint iY = clusterId(vPos) % iClusterMapSize.y;
    return expand( float2(iX,iY)/float(iClusterMapSize.x) ) + 1.0/iClusterMapSize.x;
}

[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout PointStream<FragmentData> stream )
{
    //For the current triangle, compute the area and normal
    float3 vNormal = cross( input[1].vWorldPos - input[0].vWorldPos,
                            input[2].vWorldPos - input[0].vWorldPos );
    float fArea = length(vNormal)/6;
    vNormal = normalize(vNormal);

    //Then compute the distance of the plane to the origin along the normal
    float fDist = -dot(vNormal, input[0].vWorldPos);

    //Compute the components of the face quadric using the plane coefficients
    float3x3 qA = fArea*outer(vNormal, vNormal);
    float3   qb = fArea*vNormal*fDist;
    float    qc = fArea*fDist*fDist;

    //Loop over each vertex in the input triangle primitive
    for(int i=0; i<3; i++)
    {
        //Assign the output position in the quadric map
        FragmentData output;
        output.vPos = float4(writeAddr(input[i].vPos),0,1);
        //Write the quadric to be accumulated in the quadric map
        packQuadric( qA, qb, qc, output );
        stream.Append( output );
    }
}
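The listing above relies on a clusterId() helper and grid constants that are not shown on the slide. A minimal sketch of one plausible implementation, assuming hypothetical constants vGridMin, vGridSize and iGridDims that describe the clustering grid (not the paper's exact code):

//Quantize a world-space position into the uniform clustering grid and flatten
//the 3D cell coordinate into the linear index consumed by writeAddr() above
uint clusterId( float3 vPos )
{
    float3 vNorm = saturate( (vPos - vGridMin) / vGridSize );               //position in [0,1]^3
    float3 vCell = min( floor(vNorm * (float3)iGridDims), (float3)iGridDims - 1 );
    return ((uint)vCell.z * iGridDims.y + (uint)vCell.y) * iGridDims.x + (uint)vCell.x;
}

The remaining helpers are assumed to do the obvious things: expand() presumably maps the [0,1] address into the [-1,1] clip-space position of the target texel, and packQuadric() packs the unique quadric coefficients (A, b, c) into the output render-target channels.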
Vertex Clustering Pipeline
• Pass 2: Find Optimal Positions
  • Input: Cluster map render targets, full-screen quad
  • Computation:
    • Determine if we can solve for the optimal position
    • If not, fall back to the vertex average
  • Output: Render targets representing clusters, with the optimal position of the representative vertex
Optimal Positions
• For each cell, need a representative
• Naïve solution: Use averages
  • Looks very blocky
  • Does not consider the original faces, only vertices
• Implemented solution: Use quadrics
  • Quadrics are a measure of surface error
  • We can solve for the optimal position

(Figures: Original Mesh; Simplified w/ Averages; Simplified w/ Quadrics)
Optimal Positions Implementation
float3 optimalPosition(float2 vTexcoord)
{
    float3 vPos = float3(0,0,0);
    float4 dataWorld, dataA0, dataA1, dataB;

    //Read the vertex average from the cluster map
    dataWorld = tClusterMap0.SampleLevel( sClusterMap0, vTexcoord, 0 );
    int iCount = dataWorld.w;

    //Only compute optimal position if there are vertices in this cluster
    if( iCount != 0 )
    {
        //Read all the data from the cluster map to reconstruct the quadric
        dataA0 = tClusterMap1.SampleLevel( sClusterMap1, vTexcoord, 0 );
        dataA1 = tClusterMap2.SampleLevel( sClusterMap2, vTexcoord, 0 );
        dataB  = tClusterMap3.SampleLevel( sClusterMap3, vTexcoord, 0 );

        //Then reassemble the quadric
        float3x3 qA = { dataA0.x, dataA0.y, dataA0.z,
                        dataA0.y, dataA0.w, dataA1.x,
                        dataA0.z, dataA1.x, dataA1.y };
        float3 qB = dataB.xyz;
        float  qC = dataA1.z;

        //Determine if inverting A is stable; if so, compute optimal position
        //If not, default to using the average position
        const float SINGULAR_THRESHOLD = 1e-11;
        if( determinant(qA) > SINGULAR_THRESHOLD )
            vPos = -mul( inverse(qA), qB );
        else
            vPos = dataWorld.xyz / dataWorld.w;
    }
    return vPos;
}
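Note that HLSL has no inverse() intrinsic for float3x3, so the listing assumes a user-defined helper. A minimal sketch via the adjugate (one possible implementation, not necessarily the paper's):

//Invert a 3x3 matrix via the adjugate; the caller has already verified that
//the determinant is above SINGULAR_THRESHOLD
float3x3 inverse( float3x3 m )
{
    //Rows of the cofactor matrix
    float3x3 cof = float3x3( cross(m[1], m[2]),
                             cross(m[2], m[0]),
                             cross(m[0], m[1]) );
    //Adjugate (transposed cofactors) divided by the determinant
    float fDet = dot( m[0], cof[0] );
    return transpose(cof) / fDet;
}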
Vertex Clustering Pipeline
• Pass 3: Decimate Mesh
  • Input: Cluster map render targets, input mesh
  • Computation:
    • Find clusters; remap vertices to their representatives
    • Determine if the triangle becomes degenerate
    • If not, stream out the new triangle at the new positions
  • Output: Low-resolution mesh
Final Clustering Implementation
[maxvertexcount(3)]
void main( triangle ClipVertex input[3], inout TriangleStream<StreamoutVertex> stream )
{
    //Only emit a triangle if all three vertices are in diff. clusters
    if( all_different(clusterId(input[0].vPos),
                      clusterId(input[1].vPos),
                      clusterId(input[2].vPos)) )
    {
        for(int i=0; i<3; i++)
        {
            //Lookup optimal position in the RT computed in Pass 2
            StreamoutVertex vOut;
            vOut.vPos = tClusterMap3.SampleLevel( sClusterMap3, readAddr(input[i].vPos), 0 );
            //Output vertex to stream out
            stream.Append( vOut );
        }
    }
    return;
}
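The listing assumes two helpers that are not shown on the slide, all_different() and readAddr(). A minimal sketch of plausible implementations (hypothetical, not the paper's exact code):

//A triangle survives decimation only if its vertices map to three distinct clusters
bool all_different( uint a, uint b, uint c )
{
    return (a != b) && (b != c) && (a != c);
}

//Texture coordinate of a vertex's cluster cell in the cluster map render target;
//must use the same clusterId-to-texel layout as writeAddr() from Pass 1, but returns
//a sampling address (texel center in [0,1]) rather than a clip-space write position
float2 readAddr( float3 vPos )
{
    uint iX = clusterId(vPos) / iClusterMapSize.x;
    uint iY = clusterId(vPos) % iClusterMapSize.y;
    return ( float2(iX, iY) + 0.5 ) / (float2)iClusterMapSize.xy;
}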
Vertex Clustering Pipeline
• Alternate Pass 2: Downsample RTs
  • Input and Output as before
  • Computation:
    • Collapse 8 adjacent cells by adding their cluster quadrics
    • Compute the optimal position for the 2x larger cell
• Creates multiple lower levels of detail without repeatedly incurring the Pass 1 overhead (~75%)
  • Pass 3 can use the previously streamed-out mesh
  • Lower levels of detail are almost free
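A minimal sketch of what such a downsampling pass might look like, assuming hypothetical helpers coarseCellFromTexcoord() and fineCellAddr() and the render-target layout of the Pass 2 listing (a sketch of the idea, not the paper's exact shader). Because quadrics are additive, each texel of the coarse cluster map is simply the sum of its 8 child cells from the fine map; Pass 2's optimal-position solve then runs unchanged on the coarser map:

struct DownsampleOutput
{
    float4 vAvg    : SV_Target0;   //summed vertex positions + count
    float4 vQuadA0 : SV_Target1;   //packed quadric A (part 1)
    float4 vQuadA1 : SV_Target2;   //packed quadric A (part 2) and c
    float4 vQuadB  : SV_Target3;   //packed quadric b
};

//Full-screen pass over the coarse cluster map: accumulate the 8 child cells
DownsampleOutput main( float4 vScreenPos : SV_Position, float2 vTexcoord : TEXCOORD0 )
{
    DownsampleOutput o;
    o.vAvg = o.vQuadA0 = o.vQuadA1 = o.vQuadB = (float4)0;
    uint3 vCell = coarseCellFromTexcoord( vTexcoord );                  //hypothetical helper
    for( uint dz = 0; dz < 2; dz++ )
        for( uint dy = 0; dy < 2; dy++ )
            for( uint dx = 0; dx < 2; dx++ )
            {
                //Address of the child cell in the fine cluster map (hypothetical helper)
                float2 vAddr = fineCellAddr( 2*vCell + uint3(dx,dy,dz) );
                o.vAvg    += tClusterMap0.SampleLevel( sClusterMap0, vAddr, 0 );
                o.vQuadA0 += tClusterMap1.SampleLevel( sClusterMap1, vAddr, 0 );
                o.vQuadA1 += tClusterMap2.SampleLevel( sClusterMap2, vAddr, 0 );
                o.vQuadB  += tClusterMap3.SampleLevel( sClusterMap3, vAddr, 0 );
            }
    return o;
}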
Timing Results
• Recorded time spent in decimation
  • GPU: AMD/ATI XXX
  • CPU: 3 GHz Intel P4
• Significant improvement over the CPU
  • Averages ~20x speedup on large models
  • Scales linearly
More Results
• Models shown at varying resolutions
  • Buddha, 45x130x45 grid
  • Bunny, 90x90x90 grid
  • Dragon, 100x60x20 grid

Models Courtesy of Stanford Graphics Lab
More Results
• Models shown at varying resolutions
  • Buddha, 20x70x20 grid
  • Bunny, 60x60x60 grid
  • Dragon, 50x25x10 grid
More Results
• Models shown at varying resolutions
  • Buddha, 10x40x10 grid
  • Bunny, 20x20x20 grid
  • Dragon, 30x15x6 grid
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
  • View-dependent Simplification
  • Region-of-interest Simplification
• Probabilistic Octrees on the GPU
View-dependent Simplification
• Standard simplification does not consider the view
  • Preserves a uniform amount of detail everywhere
• Simplify in post-projection space to use the view
  • Preserves more detail closer to the viewer (left)

(Figure: uniform vs. view-dependent simplification, with view direction indicated)
Arbitrary Warping Functions
• View transform is a special case of a nonlinear warp
  • Can use an arbitrary warp for adaptive simplification
• Regular grids allow data-independence, parallelism
  • Constant-time mapping from position to grid cell
  • Maps well onto GPU render targets
  • Forces uniform resolution throughout the output mesh
• Irregular geometry grids allow non-uniform output
  • Cells can be larger/smaller in certain regions
  • Corresponds to lower/greater output triangle density
  • We lose the constant-time mapping of position to cell
• Solution: apply the inverse warp to vertices (see the sketch below)
  • Equivalent to applying the forward warp to grid cells
  • Clustering still performed in a uniform grid
  • Flexibility of irregular geometry w/ the speed of regular grids
  • One proposal: Gaussian weighting functions
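A minimal sketch of how the warp plugs into the passes above (hypothetical warpInverse(); not the paper's exact code): everywhere the earlier listings call clusterId(vPos), the vertex is first pushed through the inverse warp, so clustering still happens in a uniform grid while the effective cell size varies over the mesh.

//Cluster id of a vertex under an arbitrary warp; warpInverse() could be the
//post-projection transform (view-dependent case) or a region-of-interest warp
uint warpedClusterId( float3 vPos )
{
    float3 vWarped = warpInverse( vPos );
    return clusterId( vWarped );
}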
Region-of-Interest Specification
• Importance specified w/ a biased Gaussian
  • Highest preservation at the mean
  • Width of region given by sigma
  • Bias prevents falloff to zero
• Integrate to produce the corresponding warp function
  • (Derivation given in the paper; a 1D sketch follows below)
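As a concrete one-dimensional illustration (the full derivation is in the paper; b, \mu and \sigma here are just the bias, mean and width named above), the importance function and its integral, which serves as the warp, can be written as:

w(x) = b + e^{-\frac{(x-\mu)^2}{2\sigma^2}}

W(x) = \int_0^x w(t)\,dt = b\,x + \sigma\sqrt{\tfrac{\pi}{2}}\left[\operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{-\mu}{\sigma\sqrt{2}}\right)\right]

W stretches space fastest near \mu, so after warping, the region of interest spans more grid cells and therefore keeps more triangles; the bias b keeps the warp's slope bounded away from zero far from \mu, matching the “bias prevents falloff to zero” point above.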
Region-of-Interest Specification
• Warping allows non-uniform/adaptive level of detail
  • Head has most semantic importance
  • Detail lost in uniform simplification
  • We can warp first to expand center
  • Equivalent to grid density increasing
  • Adaptive simplification preserves head detail
Outline
• Project Introduction and Motivation
• Background
• Geometry Shader-based Vertex Clustering
• Adaptive Simplification w/ Non-linear Warps
• Probabilistic Octrees on the GPU
  • Motivation
  • Probabilistic Storage
  • Adaptive Simplification
  • Randomized Construction
  • Results
Octrees - Motivation
• Basic grid
  • Regular geometry, regular topology
  • Limitations as we discussed
• Warped grid
  • Irregular geometry, regular topology
  • Much improved; however, we can do better
  • May be difficult to know required detail a priori
• CPU solution: Multi-resolution grid (i.e., octree)
  • Irregular topology (irregular geometry w/ warping)
  • Store grid at many levels of detail
  • Measure error at each level; use as coarse a level as possible
  • Efficiency requires dynamic memory, storage O(L^3)
  • Requires O(L) writes to produce a correct tree
GPU Solution – Probabilistic Octrees
• Proposal
  • Successful storage not guaranteed (probability <= 1)
  • However, storage failure is detected on read
• Assumptions allow much flexibility
  • We can have an unlimited-depth tree (but lim P = 0)
  • Sparse storage of data
• Requires conservative algorithms for the task
  • Vertex clustering (conveniently!) is such an example
  • So are collision detection and frustum culling
• Only studied briefly in this paper; we would like to analyze it more in future work
Implementation Details
• Storage: Spatial hashes (see the sketch after this list)
  • Map (position, level) to a cell; hash the cell to an index
  • Additive blending for quadric accumulation (app-specific)
  • Max blending to store (key, -key) with the data (i.e., min_key, max_key)
• Retrieval:
  • Again map (position, level) to an index
  • Retrieve the key values from the data; collision iff min_key != max_key
  • On collision, use the parent level, which will have a higher storage probability
• Usage for adaptive simplification
  • For each vertex, find the maximum error level below some threshold
  • Use this as the representative vertex
  • Can perform a binary search along the path
  • Conservative, because we can maintain validity even when using the parent of the optimal node (it just adds some error)
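A minimal sketch of the addressing scheme described above, with hypothetical names (cellKey, hashAddr, iHashSize, tOctreeKeys, tOctreeData); this is not the paper's exact code. Construction writes (key, -key) to the key target with MAX blending and the quadric data to the data target with additive blending, so after construction a slot holds (max_key, -min_key); unequal min and max keys mean two different cells collided in that slot:

//Unique key for an octree cell: quantize the position at the given level and combine
//level and cell coordinates (assumes a unit bounding box and keys small enough to be
//represented exactly in the render target)
uint cellKey( float3 vPos, uint iLevel )
{
    uint iRes = 1u << iLevel;
    uint3 vCell = (uint3)min( floor(vPos * iRes), (float)iRes - 1 );
    return ((iLevel * iRes + vCell.z) * iRes + vCell.y) * iRes + vCell.x;
}

//Hash the key into a texel of the iHashSize x iHashSize octree render target
float2 hashAddr( uint iKey )
{
    uint iSlot = (iKey * 2654435761u) % (iHashSize * iHashSize);
    return ( float2(iSlot % iHashSize, iSlot / iHashSize) + 0.5 ) / iHashSize;
}

//Retrieval: true only if this cell's data was stored without a collision;
//on failure the caller retries at the parent level (iLevel - 1)
bool lookupCell( float3 vPos, uint iLevel, out float4 vData )
{
    uint   iKey  = cellKey( vPos, iLevel );
    float2 vAddr = hashAddr( iKey );
    float2 vKeys = tOctreeKeys.SampleLevel( sOctreeKeys, vAddr, 0 ).xy;  //(max_key, -min_key)
    vData = tOctreeData.SampleLevel( sOctreeData, vAddr, 0 );
    return ( vKeys.x == -vKeys.y ) && ( vKeys.x == (float)iKey );
}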
Probabilistic Octree Results
• Adaptive simplification shown on the bunny (~4K tris)
  • Preserves detail around the leg, eyes and ears
  • Simplifies significantly on large, flat regions
  • Using 8% of the storage of the full tree, we have < 10% collisions
  • Only ~20% performance hit vs. standard grids
Conclusions
• GS is a powerful tool for interactive graphics
• Amplification and decimation are important applications of the GS
Geometry Shaders and Other Feature Wish-List
• Bring back the Point fill mode
• Data amplification improvements with indexed stream out
  • Important for scatter in GPGPU applications
  • Avoiding triangle soups is very non-trivial
• Efficient indexable temps
Thanks a lot!
• Various people here…

Questions?