Save the Nanosecond - AMD Developer Central

Save the Nanosecond!
PC Graphics Performance
for the next 3 years
Richard Huddy
European Developer Relations Manager
ATI Technologies, Inc.
A funny thing happened to me…
• ATI is now broadly recognised and highly
recommended amongst high end gamers
March 04
Another DX performance talk?
• Because although this has been my pet subject
for 7 years there’s still complexity to work out…
• Like:
– Choosing sort criteria
– Preferred ways of handling dynamic data
– The best way to express a pixel shader algorithm
nanoseconds - There are lots of them...
• But, once they’re gone...
– They’re gone...
• If a game lasts for around 40 hours of play
that’s roughly 10^14 nanoseconds...
• Each frame says goodbye to roughly 10 million
of these puppies
• Each VPU clock tick is roughly 2 ns
– 500MHz is a fast VPU
• [Each CPU tick is roughly 1/2 ns.]
– 2GHz is a modest CPU
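The figures on this slide can be checked with a little arithmetic. The sketch below is only a back-of-envelope calculation; the function names are illustrative and not part of any API.

```python
# Back-of-envelope nanosecond budget using the numbers quoted on the slide.
NS_PER_SECOND = 1_000_000_000


def game_nanoseconds(hours_of_play):
    """Total nanoseconds in the given number of hours of play."""
    return hours_of_play * 3600 * NS_PER_SECOND


def tick_ns(clock_hz):
    """Length of one clock tick in nanoseconds."""
    return NS_PER_SECOND / clock_hz


def ticks_per_frame(frame_ns, clock_hz):
    """How many clock ticks fit into one frame of the given length."""
    return frame_ns / tick_ns(clock_hz)


frame_ns = 10_000_000  # "roughly 10 million" nanoseconds per frame
print(game_nanoseconds(40))               # 40 hours is about 1.44 * 10^14 ns
print(NS_PER_SECOND / frame_ns)           # frames per second at that budget
print(ticks_per_frame(frame_ns, 500e6))   # VPU ticks available per frame
print(ticks_per_frame(frame_ns, 2e9))     # CPU ticks available per frame
```

So a 10 million nanosecond frame is a 100 frames per second target, and a 500MHz VPU gets about 5 million ticks to draw it.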
Save the nanosecond
• There’s an English saying,
– “Look after the pennies and the pounds will look after
themselves”
• In a sense, pennies are just “small pounds”
• But delivering fast frames requires you to save
millions of nanoseconds
– And you can’t get rich by saving a dollar every now
and then...
The DirectX API
• Since there’s an API between you and the
hardware it makes sense to expect that you
need to know how to use it
• Abuse of the API can be a mighty expensive
option…
– And this is an incredibly common problem
Huge Savings...
• Don’t create resources within the performance
sensitive part of your code…
• Offline:
– Compressing textures
• Install time:
– Optimize vertex sequences (D3DXOptimizeMesh)
• Start-up time:
– Create VBs, IBs, RTs, etc
• Game loop:
– Create nothing at all
Huge Savings...
• SetRenderTarget()
– Let’s not have too many of these please!
– Single digit counts are good…
• Avoid Lock() with zero for flags
– Whether that’s a VB that’s being rendered from
– Or a RenderTarget which was rendered to
• Because there are milliseconds at stake here!
• Also use ‘DONOTWAIT’ appropriately to reclaim
CPU cycles – these are scarce!
Significant savings...
• Every DrawPrim call is a significant cost
– So make sure you get good value from it
• Every time you set any state it costs you
– Whether you set one or ten...
– But aggressive state filtering is no longer needed so
much in DX9
• One pixel is irrelevant, but millions matter...
– Clear() the Z/stencil buffer to make it work fast
– Sort Front to Back
• Sub-Sort by shader
– Set your shader constants in blocks
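One possible way to combine "sort front to back" with "sub-sort by shader" is a two-part sort key: a coarse depth bucket first, then the shader. The sketch below is an illustration only; `DrawItem`, the bucket size of 10 units and the shader ids are all made up for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawItem:
    name: str
    view_depth: float  # distance from the camera, smaller means nearer
    shader_id: int     # hypothetical handle for the shader the item uses


def sort_key(item, bucket_size=10.0):
    """Coarse front-to-back bucket first, shader second."""
    return (int(item.view_depth // bucket_size), item.shader_id)


def order_draws(items, bucket_size=10.0):
    """Return the items in submission order."""
    return sorted(items, key=lambda it: sort_key(it, bucket_size))


def count_shader_switches(ordered):
    """Count how often the shader changes while walking the list."""
    switches = 0
    current = None
    for it in ordered:
        if it.shader_id != current:
            switches += 1
            current = it.shader_id
    return switches


scene = [
    DrawItem("far_rock", 42.0, 2),
    DrawItem("near_wall", 3.0, 1),
    DrawItem("near_crate", 7.5, 2),
    DrawItem("near_door", 5.0, 1),
    DrawItem("mid_tree", 15.0, 2),
]
ordered = order_draws(scene)
print([it.name for it in ordered])
print(count_shader_switches(scene), "->", count_shader_switches(ordered))
```

Bucketing the depth keeps the ordering roughly front to back while still letting items with the same shader sit next to each other; the right bucket size depends on the scene.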
Compilers are smart...
• At ATI we test compilers to make sure that they’re good
and help make them better
• Sample results show (Win, Draw, Lose):
– HLSL vs Cg on ATI*: 5, 7, 2
– HLSL vs Cg on NV: 16, 7, 0
– (*) Cg compiler failed to compile 9 of the 23 Renderman
samples for SM2.0 even though HLSL compiler
succeeded
• So using HLSL seems like the logical choice…
• Not just an industry standard – but the best too
And a PC is complex
• Which is a bit of an understatement…
• A 9800 Pro has a similar number of gates to two
Pentium4 processors all on one die
• But the highly parallel design allows it to do
much more work – of a very specific kind…
• So you’d like to have the CPU and VPU both
doing useful work at the same time
– Luckily the API encourages this…
Which bits are fast?
• System:
– CPU
  • 1 to 1/3 of a nanosecond… (1GHz to 3GHz)
– System memory
  • High latency compared to the CPU
  • 200 - 800MHz (for moving data about)
– Virtual memory
  • Takes all week…
• Graphics card:
– VPU core
  • 200 to 500MHz
– Local video memory
  • 200 to 500MHz (~20GB per second)
• AGP Bus:
– 266MHz, 2GB per second, with latency like molasses…
  • [100MB per second for CPU reads – so don’t!]
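To put those bandwidth figures in frame-time terms, here is a rough calculation of how long it takes to move 1MB over each path. It ignores latency entirely and uses decimal megabytes and gigabytes; the 1MB payload is an arbitrary choice for illustration.

```python
MB = 1_000_000
GB = 1_000_000_000

# Peak bandwidth figures as quoted on the slide.
BANDWIDTH = {
    "local video memory": 20 * GB,
    "AGP bus": 2 * GB,
    "CPU read over AGP": 100 * MB,
}


def transfer_ms(num_bytes, bytes_per_second):
    """Milliseconds needed to move num_bytes at the given peak bandwidth."""
    return num_bytes * 1000.0 / bytes_per_second


payload = 1 * MB  # arbitrary example size
for path, bandwidth in BANDWIDTH.items():
    print(f"{path}: {transfer_ms(payload, bandwidth):.2f} ms per MB")
```

At 100MB per second a single 1MB read-back costs about 10 ms, which is an entire frame at 100 frames per second.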
Which bits are fast?
• System:
– CPU
  • So the CPU is fast, but it still has too much to do…
  • “All games are CPU limited”
• Graphics card:
– VPU core
  • Not a blindingly fast clock, but phenomenal throughput
• AGP Bus:
– Don’t texture from here unless you have to
Inside the VPU
• You have several units at your disposal…
– Vertex fetch (memory cache)
– Vertex shader (xform and lighting)
– Vertex cache (protecting the shader from abuse)
– Clipper (so fast it might as well not be there…)
– Triangle setup
– Fast Z/stencil reject (quad speed rasterizer rejection)
– Rasterizer
  • Pixel cache
  • Texture cache
– Z buffer
– Blend (Yummy! Read-modify-write)
Inside the VPU
• Because the vertex fetch unit is just reading /
caching memory it makes sense to prefer
cache-aligned data formats (like 32 bytes or 64
bytes)
• The vertex cache only works for indexed
primitives…
• So we recommend that all rendering is done
with DrawIndexedPrimitive() and that you
submit data in roughly tri-strip order
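A toy model shows why indexed primitives in tri-strip order suit the vertex cache. The sketch assumes a simple FIFO cache of 16 entries; real cache sizes and replacement policies vary by hardware and are not stated in the talk, so treat both as assumptions.

```python
from collections import deque


def shaded_vertices(indices, cache_size=16):
    """Count vertex shader runs for an indexed draw, assuming a FIFO cache."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)  # the oldest entry drops out when the cache is full
    return misses


def strip_order_indices(num_triangles):
    """Triangle list whose triangles follow tri-strip order: (i, i+1, i+2)."""
    indices = []
    for i in range(num_triangles):
        indices.extend((i, i + 1, i + 2))
    return indices


triangles = 100
indexed = shaded_vertices(strip_order_indices(triangles))
non_indexed = 3 * triangles  # without indices every vertex is shaded again
print(indexed, "shader runs with indices vs", non_indexed, "without")
```

In this model each new triangle costs one vertex shader run instead of three, because the other two vertices are still in the cache.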
Saving nanoseconds…
• Use shorter shaders since they’re faster
– One op per clock is what you should expect
– ATI hardware can parallelise vector + scalar op pairs
• Shaders are cached on chip too
– So switching shader can sometimes be very fast
• Hand written assembly isn’t usually a good bet
• ps.1.4 modifiers can be free in ps.2.0 hardware
Saving nanoseconds…
• Prefer the shortest shader which does what you
want
• Use the lowest shader model which achieves
your target
– That way you can potentially access the ps1.4
modifiers which run in the same clock cycle
• But please do not sacrifice quality for speed!
– That can be the user’s choice later on by selecting
no-AA, low screen resolution etc
Pre Zee
• An early “Z only” pass will save you time if…
(1) Your pixel shaders are ‘long’
(2) You cannot sort front-to-back
• The definition of ‘long’ here depends upon how
well you can usually sort!
• Pre-Z saves you pixels, but costs you vertices
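The trade-off can be made concrete with a deliberately crude cost model. Everything here is an assumption for illustration: costs are in abstract "units", the Z-only pass pays full vertex cost plus one unit per depth-tested fragment, and the colour pass then shades each pixel exactly once. Real hardware behaves less tidily.

```python
def single_pass_cost(vertices, pixels, shaded_overdraw, vs_cost, ps_cost):
    """One pass: every vertex once, every pixel shaded shaded_overdraw times."""
    return vertices * vs_cost + pixels * shaded_overdraw * ps_cost


def pre_z_cost(vertices, pixels, depth_complexity, vs_cost, ps_cost, z_cost=1):
    """Z-only pass followed by a colour pass that shades each pixel once."""
    z_pass = vertices * vs_cost + pixels * depth_complexity * z_cost
    colour_pass = vertices * vs_cost + pixels * ps_cost
    return z_pass + colour_pass


VERTS = 500_000
PIXELS = 1024 * 768

# Unsorted scene (each pixel shaded 3 times) with a 'long' 20 unit shader.
print(single_pass_cost(VERTS, PIXELS, 3, 1, 20), pre_z_cost(VERTS, PIXELS, 3, 1, 20))
# Same scene with a 1 unit shader: the extra vertex work no longer pays off.
print(single_pass_cost(VERTS, PIXELS, 3, 1, 1), pre_z_cost(VERTS, PIXELS, 3, 1, 1))
# Well sorted scene (each pixel shaded once): even the long shader loses.
print(single_pass_cost(VERTS, PIXELS, 1, 1, 20), pre_z_cost(VERTS, PIXELS, 3, 1, 20))
```

In this model the break-even shader length moves with the amount of shaded overdraw, which is the sense in which ‘long’ depends on how well you can sort.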
Optimisation - The Big Picture
• Almost all of the best optimisations come down
to one single principle…
Do the work as early as possible in the pipeline to
avoid doing it later where the cost would be greater
• This applies to things like…
– resource creation (prefer install time costs to runtime
costs)
– culling (cull early is better than late)
– shader tuning (pre-shader opts move from ps to vs
to CPU)
– Z-only pass
What’s this about the future?
• Let’s look at the trends which are changing the
balance…
ATI is at the Center of
The Digital Experience
Market share...
• At the end of 2003 ATI finally took the lead in
market share in game-play graphics from the
competition
– Yeah, but only by 0.2%... So what?
• According to Mercury Research, ATI leads with
a roughly 80:20 split at the high end…
– Which means that if you’re targeting high end
gamers and reviewers then your focus is on ATI
– That’s what the vast majority of your audience is
using…
– And ATI has a 100% market share lead of “New
Xbox technologies”… ☺
Multiple platforms...
• The PC leads the way so that the various
genres of lesser hardware are several years
behind PC architecture...
– Latest PDA hardware is equivalent to cutting edge
PC hardware from just 4 years ago!
– Laptops are less than 2 years behind high end
workstations
– Consoles often define the high end as they arrive...
PC Platform retirement
• Top spec PCs actually have a game-buying life
of just two years!
– PCs older than that are ‘retired’ for Word, email,
web browsing etc.
– New PCs or graphics cards are brought into the
home and it’s these that are used for games
– Gamers with systems which are >2 years old buy
roughly 1 game per year and these are not high end
games
– Hard core gamers average 5 - 10 games per year
– This implies a roughly 2.5:1 CPU scalability issue…
– And roughly 4:1 GPU scalability on both power and
features
All of which means
• You should require DX8 hardware and upwards
for games due Xmas 2004 or later
– We recommend treating low end DX9 hardware to
the DX8 path. Even 1024x768 is often too
demanding for the low end DX9 hardware out there
– So you should be able to cope with just two code
paths on many games for this year
  • DX8 hardware takes one
  • DX9 hardware takes the other
• But note that because this assertion is based on
forecasts and trends it is highly subjective…
DirectX 8 class hardware
• Programmable vertex pipeline is in addition to
the FF pipeline
– That makes it hard to beat the fixed function
hardware
– And this makes it fast to switch between pipelines
• Pixel pipeline is shared between the old
fashioned texture cascade and the new pixel
processor
DirectX 9 class hardware
• Programmable vertex pipeline is shared with
the FF pipeline
– That makes it easy to beat the fixed function
hardware
– That makes it slow to switch between pipelines
• For this reason it makes sense generally to
prefer the programmable pipeline.
So, here is our target:
• DX9 style mainstream graphics (per frame):
– > 500K triangles
– < 500 DrawIndexedPrimitive() calls
– < 500 VertexBuffer switches
– < 200 different textures
– < 200 State change groups
– Few calls to SetRenderTarget - aim for 0 to 4...
– 1 pass per poly is typical, but 2 is sometimes smart
– Runs at monitor refresh rate
– Which gives more than 40 million polys per second
• And everything goes through the programmable pipeline
– No occurrences of Lock(0), DrawPrimitive(),
DPUP(), CreateVB() etc
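These targets are easy to turn into a per-frame budget check. The sketch below is one hypothetical way to do it: the dictionary keys and the `budget_violations` helper are invented for the example, and the 85Hz refresh rate is an assumption (the slide only says "monitor refresh rate").

```python
# Per-frame limits taken from the slide; the key names are illustrative.
UPPER_LIMITS = {
    "draw_calls": 500,
    "vb_switches": 500,
    "textures": 200,
    "state_groups": 200,
}
MAX_RENDER_TARGET_CHANGES = 4
MIN_TRIANGLES = 500_000


def polys_per_second(triangles_per_frame, refresh_hz):
    """Triangle throughput when every refresh gets a new frame."""
    return triangles_per_frame * refresh_hz


def budget_violations(stats):
    """Return the names of the per-frame counters that miss the target."""
    problems = []
    if stats["triangles"] <= MIN_TRIANGLES:
        problems.append("triangles")
    for key, limit in UPPER_LIMITS.items():
        if stats[key] >= limit:
            problems.append(key)
    if stats["render_target_changes"] > MAX_RENDER_TARGET_CHANGES:
        problems.append("render_target_changes")
    return problems


frame = {
    "triangles": 600_000,
    "draw_calls": 2_000,
    "vb_switches": 300,
    "textures": 150,
    "state_groups": 120,
    "render_target_changes": 9,
}
print(budget_violations(frame))
print(polys_per_second(500_000, 85))  # assumed 85Hz refresh
```

With 500K triangles per frame the 40 million polys per second figure needs a refresh rate of at least 80Hz; at 60Hz the same frame gives 30 million.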
Are we there yet?
• Pixel Shader throughput:
– More pixel engines with
  • Higher clock speeds
  • Higher Instruction counts
– More vertex engines too since triangles keep getting
smaller
– The pressure moves away from textures and
towards the ALU operations…
  • Simply because ALU power grows faster than B/W
Are we there yet?
• High quality AA:
– Continue to innovate with...
– Programmable sample points
  • Currently 0, 2, 4 or 6
– Full exposure of ‘centroid’ control
  • DirectX 9.0c API fully exposes this
– Gamma correction of AA in hardware
  • ATI do this already with a 2.2 gamma function
The 3.0 shader model
• Requires 32 bit floats throughout the pipeline
– But that’s not necessarily full IEEE 754...
  • With its -0.0s, NANs and INFINITYs etc
• Although the spec does not require support for
blend and fog into float surfaces you may
expect this to be available on much hardware
• Static flow control in pixel shader
– Has some serious performance implications...
Which constraints are next?
• SM3 Precision
– Consistent 32 bit IEEE throughout
– Which means...
– s1e8m23
  • One sign bit
  • 8 bits of exponent
  • 23 stored bits of mantissa (24 bits of precision with the implicit leading 1)
– But the propagation rules (like “what is –INF * -0.0”)
are not necessarily required until SM 4.0
– Higher (64 bit) precision is not for the near-term...
Stream Processors
• Modern GPUs and VPUs are computing
devices built from stream processors
– Stream Processors are great for some tasks...
[Diagram: fixed maximum input B/W → fixed processing power → fixed maximum output B/W]
Stream Processors?
• Modern GPUs and VPUs are computing
devices built from stream processors
[Diagram: Vertex Fetch → Vertex Shader → Triangle set up → Pixel Shader → FB fog + blend]
• But really, each block is complex...
[Diagram: Sp[0], Sp[1], Sp[...], Sp[n-1], Sp[n]]
A unified shader model
• The plan as of GDC 04
– Is that each of the different 4.0 shaders will use the
same syntax and feature set
– This allows us to get around the major drawback of
hardwired stream processors – fixed resources.
• Then the chip can become a pool of vector processors and
the hardware allocates these resources to match demand
• Which implies that benchmarking the hardware becomes
somewhat more complex where:
– How many vertices per second depends on the pixel
complexity
– How many pixels per second depends on the vertex
complexity
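A toy model makes the benchmarking point visible. It assumes a pool of identical units, each retiring one shader op per clock, shared perfectly between vertex and pixel work; the 16 units and the op counts are invented numbers, not a description of any real or planned chip.

```python
def unified_rates(vertices, vs_ops, pixels, ps_ops, units, clock_hz):
    """Vertices/s and pixels/s for one frame's work on a shared pool of units."""
    total_ops = vertices * vs_ops + pixels * ps_ops
    seconds = total_ops / (units * clock_hz)  # perfect load balancing assumed
    return vertices / seconds, pixels / seconds


UNITS = 16      # hypothetical pool size
CLOCK = 500e6   # 500MHz, the "fast VPU" clock used earlier in the talk

cheap_pixels = unified_rates(500_000, 20, 1_000_000, 10, UNITS, CLOCK)
costly_pixels = unified_rates(500_000, 20, 1_000_000, 30, UNITS, CLOCK)
print(cheap_pixels)
print(costly_pixels)
```

Nothing about the vertex work changed between the two calls, yet the vertex rate halves when the pixel shader grows from 10 to 30 ops, so a single "vertices per second" number no longer describes the chip.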
So isn’t this a CPU?
No, look at the Differences:
Cache Sizes - CPU = huge
Number of Pipeline Stages - VPU = long
Cache Interaction - VPU = none
Clock Speed - CPU = fast
Generality - VPU tends not to read what it writes
Vector oriented - VPU is fundamentally 4D
Number types - CPU is more flexible, supporting
integers and floats easily
Branches - VPUs don’t like branching…
Some of the targets for DX Next
• Geometry generation in the VPU
– A fully specified new Topology Processor unit
• Which means you’ll be able to generate new
vertices with all relevant connectivity information
from within the VPU...
• For example you can extrude shadow volumes
using this new hardware
• [But the geometry shader probably doesn’t get
fed its own output...]
• Note please that “DX Next” is just my placeholder name
Some of the targets for DX Next
• Support for virtual memory
– So texture downloads are much more efficient
– Now only those pages of the relevant mip levels will
be present
  • Contrast that with the current situation where all of every
mip level is required to be present in VPU-accessible
memory before the first texel is filtered...
– And DX Next has the notion of graphics hardware
contexts with maximum context switch times
– VM may also include write capabilities...
• Will reduce the pressure to move beyond
512MB but we’ll still head in that direction...
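A rough size comparison shows what is at stake. The sketch computes the memory for a complete mip chain and compares it with a page-based resident set; the 4KB page size, the 32 bit texel format and the count of 256 touched pages are assumptions chosen for illustration, not figures from the talk.

```python
def mip_chain_bytes(width, height, bytes_per_texel=4):
    """Memory for every mip level from width x height down to 1 x 1."""
    total = 0
    while True:
        total += width * height * bytes_per_texel
        if width == 1 and height == 1:
            break
        width = max(1, width // 2)
        height = max(1, height // 2)
    return total


def resident_bytes(pages_touched, page_size=4096):
    """Memory needed when only the touched pages must be present."""
    return pages_touched * page_size


full_chain = mip_chain_bytes(2048, 2048)  # today: the whole chain is required
paged = resident_bytes(256)               # hypothetical: 256 pages were sampled
print(full_chain, paged, full_chain / paged)
```

A 2048x2048 32 bit texture needs a little over 22MB for its full chain, while the assumed 256 touched pages come to 1MB.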
The 4.0 shader model
• Is still being decided by Microsoft
• Will be for the next OS only
• Expect this circa early 2006
• New geometry shader
• Common capabilities between all shaders
• Faster small batch performance is a very high
priority…
– Which implies a new driver model
• Will last for two or more years
– DX9 lasts from Q4 2002 until the next OS