Implementing the Render Cache and the
Edge-and-Point Image on Graphics Hardware
Edgar Velázquez-Armendáriz
Eugene Lee
Bruce Walter
Kavita Bala
GI 2006, Québec, June 9th 2006
Motivation
• High quality shading is still too slow.
– Not ready for interactivity.
– It is slow even on the GPU.
• Potential applications.
– Architecture.
– Modeling.
– Movies.
Overview
• GPU acceleration of the Render Cache and the
Edge-and-Point Image (EPI).
Points: Render Cache
Points and Edges: EPI reconstruction
Render Cache overview
Projection
Depth cull
Interpolation
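The three stages named above can be sketched on the CPU as follows. This is a minimal illustration, not the authors' Cg shaders: the pinhole camera, the grayscale samples, the 3x3 neighborhoods, the cull ratio, and every function name are assumptions made for the example.

```python
import math

# Hypothetical CPU sketch of projection, depth cull, and interpolation.
# Camera model, window size, and cull ratio are assumptions.

def project_points(points, width, height, focal=1.0):
    """Project cached points (x, y, z, gray) into a z-buffered point image."""
    depth = [[None] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, gray in points:
        if z <= 0.0:
            continue  # behind the camera
        u = focal * x / z
        v = focal * y / z
        px = math.floor((u + 1.0) * 0.5 * width)
        py = math.floor((v + 1.0) * 0.5 * height)
        if 0 <= px < width and 0 <= py < height:
            if depth[py][px] is None or z < depth[py][px]:
                depth[py][px] = z  # nearest point wins
                color[py][px] = gray
    return depth, color

def neighborhood(grid, x, y):
    """Non-empty values in the 3x3 window centered on (x, y)."""
    h, w = len(grid), len(grid[0])
    return [grid[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))
            if grid[j][i] is not None]

def depth_cull(depth, color, ratio=1.1):
    """Drop points that lie well behind the nearest point in their window."""
    out_depth = [row[:] for row in depth]
    out_color = [row[:] for row in color]
    for y in range(len(depth)):
        for x in range(len(depth[0])):
            z = depth[y][x]
            if z is not None and z > ratio * min(neighborhood(depth, x, y)):
                out_depth[y][x] = None
                out_color[y][x] = None
    return out_depth, out_color

def interpolate(color):
    """Fill empty pixels with the average of the surviving neighbors."""
    out = [row[:] for row in color]
    for y in range(len(color)):
        for x in range(len(color[0])):
            if color[y][x] is None:
                values = neighborhood(color, x, y)
                if values:
                    out[y][x] = sum(values) / len(values)
    return out
```

Calling `project_points`, then `depth_cull`, then `interpolate` on the result reproduces the order of the stages listed on this slide.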
Edge-and-Point Image overview
Naive
EPI
• Alternative display representation
• Edge-constrained interpolation preserves
sharp features
• Fast anti-aliasing
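To illustrate why constraining interpolation by edges preserves sharp features, here is a minimal one-dimensional sketch. It is not the EPI algorithm itself: the scanline layout, the `edges_after` set (an index `i` means an edge lies between pixel `i` and pixel `i + 1`), and the function names are assumptions for the example.

```python
def reachable_sample(samples, edges_after, start, step):
    """Walk from `start` in direction `step` until a sample or an edge is hit."""
    i = start
    while 0 <= i < len(samples):
        if samples[i] is not None:
            return i
        # Moving from i to i + step crosses the boundary after min(i, i + step).
        boundary = i if step > 0 else i - 1
        if boundary in edges_after:
            return None  # an edge blocks the walk
        i += step
    return None

def interpolate_row(samples, edges_after=frozenset()):
    """Linearly blend the nearest samples that are not separated by an edge."""
    out = []
    for x in range(len(samples)):
        left = reachable_sample(samples, edges_after, x, -1)
        right = reachable_sample(samples, edges_after, x, +1)
        if left is None and right is None:
            out.append(None)
        elif left is None:
            out.append(samples[right])
        elif right is None:
            out.append(samples[left])
        elif left == right:
            out.append(samples[x])
        else:
            t = (x - left) / (right - left)
            out.append((1 - t) * samples[left] + t * samples[right])
    return out
```

With samples `[0.0, None, None, None, None, 1.0]`, the unconstrained call produces a smooth ramp, while passing `edges_after={2}` yields `[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]`: the discontinuity stays sharp at the edge.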
Presented work
• Mapping to the hardware
– The algorithm’s components differ from standard hardware rendering.
– Overcome GPU limitations.
• Results
– GPU strategies.
– Better interactivity.
Related Work
• Interactive.
– Shading cache. [Tole02]
– Corrective texturing. [Stamminger00]
– Tapestry. [Simmons00]
– Adaptive Frameless Rendering. [Dayal05]
– Distance impostors. [Szirmay-Kalos05]
• Non-interactive.
– Irradiance caching. [Smky05]
• Pure Hardware implementations.
– Ray tracing. [Purcell02, Carr06]
– Photon mapping. [Purcell03]
Talk overview
• Algorithm overview.
• Mapping to the hardware: strategies and
challenges.
• Results.
• Discussion.
Overview
[Diagram: Render Cache data flow between CPU and GPU. Blocks: Shader, Point manager, Point projector. Labels: shading samples, 3D points, feedback, asynchronous.]
Overview
[Diagram: the previous data flow extended with edges. Blocks: Shader, Shadow edge finder, Silhouette edge finder, Point manager, Point projector, Edge raster. Labels: shading samples, 3D edges, 3D points, feedback, asynchronous.]
Overview
[Diagram: complete data flow between CPU and GPU. Blocks: Shader, Shadow edge finder, Silhouette edge finder, Point manager, Point projector, Edge raster, Edge Constrained Interpolation. Labels: shading samples, request samples, 3D edges, 3D points, 2D points, 2D edges, feedback, asynchronous, output image.]
Public availability
• The complete Cg source of the shaders is
available online:
http://www.cs.cornell.edu/~kb/projects/epigpu/
Mapping to the hardware
• Sections are grouped on computational similarity:
– Point processing
– Edge finding
– Edge constrained interpolation
• Most of the processing has been moved to the GPU.
[Diagram: stages and data. Blocks: Silhouette edge finder, Point projector, Edge raster, Edge Constrained Interpolation. Labels: 3D edges, 2D points, 2D edges.]
Point processing
• Point Cloud as Vertex Buffer Object (VBO) and
Texture.
• Multiple Render Targets (MRT) used to write all
information in a single pass.
• Simplified predicted projection.
– Not as accurate as the regular projection.
[Figure labels: 1 splat point; 4 one-pixel points using one quarter of the point cloud.]
Point processing: Update
• Render Cache’s structures are complex to map.
• We cannot modify pipelined GPU data.
– Use additional passes.
[Diagram labels: Vertex and Pixel shaders, Point projector, Point Cloud, Point Image.]
Point processing: Bandwidth issues
• Point projection is bandwidth limited.
– Point cloud update.
– New sample requests.
• Write only the new samples to the point cloud.
– We use vertex scatter.
– Faster than replacing the entire point cloud.
• A static VBO is projected three times
faster than a constantly modified one.
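The bandwidth argument above can be made concrete with a toy model that counts how many points each strategy writes. This is a CPU analogy only, not GPU vertex scatter code; the list-based point cloud, the dictionary of new samples, and the function names are assumptions.

```python
def replace_all(cloud, new_cloud):
    """Baseline: rewrite every slot; returns the number of points written."""
    cloud[:] = new_cloud
    return len(new_cloud)

def scatter_update(cloud, new_samples):
    """Write only the new samples; `new_samples` maps slot index -> point."""
    for index, point in new_samples.items():
        cloud[index] = point
    return len(new_samples)
```

For a cloud of 8 points with 2 new samples, `replace_all` writes 8 points while `scatter_update` writes 2. Under this model the gap widens as the cloud grows and the number of new samples per frame stays small.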
Silhouette detection
• The original EPI uses hierarchical trees.
– Does not map well to GPU.
• Brute force method on the GPU.
– Avoids transferring edges every frame.
– Faster than hierarchical structures!
• Shadow edge detection left on the CPU.
[Figure labels: Model edges, Edge texture.]
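A brute-force silhouette test visits every edge independently, which is what makes it a good fit for data-parallel hardware. The sketch below uses the standard criterion (an edge is a silhouette when one adjacent face looks toward the eye and the other looks away). The slides do not state the exact test used, so the edge representation and the function names here are assumptions.

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def find_silhouettes(edges, eye):
    """edges: list of (midpoint, normal_a, normal_b), one entry per model edge.

    Returns the indices of edges whose two adjacent faces disagree on
    whether they face the eye. Every edge is tested; there is no hierarchy.
    """
    result = []
    for i, (midpoint, normal_a, normal_b) in enumerate(edges):
        view = sub(eye, midpoint)
        faces_eye_a = dot(normal_a, view) > 0.0
        faces_eye_b = dot(normal_b, view) > 0.0
        if faces_eye_a != faces_eye_b:
            result.append(i)
    return result
```

Each loop iteration depends only on its own edge, so on a GPU the per-edge test could run as one fragment per edge, with the result written to a texture.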
Silhouette detection: Limitations
• GPU silhouette detection is limited by the
fill rate.
• Texture memory constraints.
– We need to keep all vertices as a VBO.
– Vertices and normals as textures.
– One results texture.
• Normals stored as fp16 to reduce space.
Edge Raster
• Rasterize edges with subpixel precision.
• Cost depends on model complexity.
• Extended lines as described in [Sen03].
• Filtered depth as read-only depth buffer.
– Free occlusion culling!
[Figure: comparison with no depth texture and with depth texture.]
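The idea of testing edge fragments against a read-only depth buffer can be sketched as below. This is a simple point-sampling rasterizer, not the extended-line technique cited above; the sampling density, the depth bias, and the function name are assumptions.

```python
import math

def raster_edge(p0, p1, depth_buffer, bias=1e-3, steps_per_pixel=4):
    """Rasterize a segment given in pixel coordinates as (x, y, z) endpoints.

    Returns {(px, py): (sub_x, sub_y)}: for each visible pixel, the
    fractional position of the first edge sample that landed in it.
    The depth buffer is only read, never written.
    """
    h, w = len(depth_buffer), len(depth_buffer[0])
    length = max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))
    n = max(1, math.ceil(length * steps_per_pixel))
    fragments = {}
    for k in range(n + 1):
        t = k / n
        x = p0[0] + t * (p1[0] - p0[0])
        y = p0[1] + t * (p1[1] - p0[1])
        z = p0[2] + t * (p1[2] - p0[2])
        px, py = math.floor(x), math.floor(y)
        if not (0 <= px < w and 0 <= py < h):
            continue
        if z > depth_buffer[py][px] + bias:
            continue  # occluded: the edge lies behind the stored depth
        fragments.setdefault((px, py), (x - px, y - py))
    return fragments
```

Edge fragments hidden behind nearer geometry are dropped by the depth comparison alone, which is the sense in which occlusion culling comes for free.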
Edge Constrained Interpolation
• Multi-pass pixel shaders.
– Very long.
– A lot of texture accesses.
• Image resolution dependent.
• Use look-up tables encoded as textures.
– Avoid control code in shaders.
– Encode original EPI operations.
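As an illustration of replacing control code with a table lookup, the sketch below precomputes filter weights for every combination of eight "blocked by an edge" flags and indexes the table arithmetically, without bit operations. What the original tables encode is not stated in the slides, so the table contents, the neighbor ordering, and all names are hypothetical.

```python
def build_weight_table():
    """One entry per combination of 8 blocked flags (256 entries in total).

    Each entry holds normalized weights for the 8 neighbors: reachable
    neighbors share the weight equally, blocked neighbors get zero.
    """
    table = []
    for code in range(256):
        reachable = [(code // (2 ** i)) % 2 == 0 for i in range(8)]
        count = sum(reachable)
        weight = 1.0 / count if count else 0.0
        table.append(tuple(weight if r else 0.0 for r in reachable))
    return table

def table_index(blocked_flags):
    """Encode eight 0/1 flags with multiplies and adds only."""
    return sum(flag * (2 ** i) for i, flag in enumerate(blocked_flags))

def filter_pixel(neighbor_values, blocked_flags, table):
    """Branch-free weighted sum: one table fetch, then a dot product."""
    weights = table[table_index(blocked_flags)]
    return sum(w * v for w, v in zip(weights, neighbor_values))
```

In a shader, the table would be stored as a texture and `table_index` would become a texture coordinate, so the per-pixel work is a fetch and a weighted sum with no branches.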
Future trends
• Branching granularity.
– Some filters require fine granularity to take
advantage of dynamic branching.
– Newer cards, beginning with the ATI X1000 series,
are addressing this issue.
• Bit operations not directly supported.
– DirectX 10 will support them.
• Bottom line: GPU implementation will get
better and faster.
Limitations
• Fill rate and texture access.
– These characteristics constantly improve
with newer hardware with more pipelines and
faster clock frequencies.
• Improve by reducing shader length.
– The number of registers used is still important.
– A 180-instruction shader with 25 registers
runs 50% slower than a 215-instruction
shader with 24 registers on our GPU.
Test platform
• Test environment.
– Software written in C++, Cg 1.4rc, and Java through
JNI under Windows XP.
– Pentium 4 EE 3.2 GHz dual core, 2 GB RAM, dual
Nvidia GeForce 7800 GTX (81.85).
• Test scenes.
– Cornell Box
– Chains
– Mackintosh Room
– David Head
– Dragon
Results: FPS
• GPU version is 60–110% faster than the original.
– Speed up increases along with scene complexity.
[Bar chart: FPS (axis 0 to 30), CPU only vs. GPU, for Cornell Box, Chains, Mack Room, David Head, and Dragon.]
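The "60–110% faster" figure can be restated as a speedup factor and as a reduction in frame time. The conversion below assumes that "X% faster" means the GPU frame rate is (1 + X/100) times the CPU frame rate; the function names are illustrative.

```python
def speedup_factor(percent_faster):
    """Frame-rate ratio GPU/CPU implied by 'percent_faster'."""
    return 1.0 + percent_faster / 100.0

def frame_time_reduction(percent_faster):
    """Fraction of the CPU frame time saved at that speedup."""
    return 1.0 - 1.0 / speedup_factor(percent_faster)
```

Under that reading, 60% faster corresponds to a factor of 1.6 (frame time cut by 37.5%), and 110% faster corresponds to a factor of 2.1 (frame time cut by roughly 52%).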
Results: Speed increase from CPU
[Bar chart: speed increase over the CPU (axis 0% to 700%) per stage: point projection, predicted projection, depth cull, silhouette detection, edge raster, image filters. Bar labels in the source: 665.3%, 317.2%, 278.6%, 90.6%, 45.4%, 13.6%; their assignment to stages is not recoverable from the text.]
Results: Rendering times
[Stacked bar chart: rendering time in ms (axis 0 to 140) for CPU Dragon vs. GPU Dragon, broken down into point projection, predicted projection, depth cull, silhouette detection, edge raster, and image filters.]
Discussion
• Point projection, even though it maps
straightforwardly to the GPU, is the
bottleneck.
• Image filters are very fast in spite of their
multiple texture accesses and multiple
passes.
• We originally thought the opposite would
be true!
Discussion
• Projection is not optimal.
• We wanted to use Vertex Texture Fetch
(VTF) for the point cloud update,
but it was slower than Render to Vertex
Array (RTV).
• Dual GPU rendering with Scalable Link
Interface (SLI) showed marginal gains.
Future performance
• Texture accesses are very fast and
efficient.
• Transferring vertex data on the GPU is
too slow to be fully useful.
• Scatter write on pixel shaders and
geometry shaders may allow complete
data management on the GPU.
Conclusions
• We presented a hybrid GPU/CPU system
for the Render Cache and the EPI using
commodity graphics hardware.
• Our implementation is 60–110% faster
than a pure CPU implementation and
frees the CPU for other operations.
• The system’s performance is likely to improve
as GPUs continue to advance.
Questions?