Save the Nanosecond!
PC Graphics Performance for the next 3 years
Richard Huddy
European Developer Relations Manager
ATI Technologies, Inc.
March 04

A funny thing happened to me…
• ATI is now broadly recognised and highly recommended amongst high end gamers

Another DX performance talk?
• Because although this has been my pet subject for 7 years there’s still complexity to work out…
• Like:
  – Choosing sort criteria
  – Preferred ways of handling dynamic data
  – The best way to express a pixel shader algorithm

Nanoseconds
• There are lots of them...
• But, once they’re gone...
  – They’re gone...
• If a game lasts for around 40 hours of play that’s roughly 10^14 nanoseconds...
• Each frame says goodbye to roughly 10 million of these puppies
• Each VPU clock tick is roughly 2 ns
  – 500MHz is a fast VPU
• [Each CPU tick is roughly 1/2 ns.]
  – 2GHz is a modest CPU

Save the nanosecond
• There’s an English saying,
  – “Look after the pennies and the pounds will look after themselves”
• In a sense, pennies are just “small pounds”
• But delivering fast frames requires you to save millions of nanoseconds
  – And you can’t get rich by saving a dollar every now and then...

The DirectX API
• Since there’s an API between you and the hardware it makes sense to expect that you need to know how to use it
• Abuse of the API can be a mighty expensive option…
  – And this is an incredibly common problem

Huge Savings...
• Don’t create resources within the performance sensitive part of your code…
• Offline:
  – Compressing textures
• Install time:
  – Optimize vertex sequences (D3DXOptimizeMesh)
• Start-up time:
  – Create VBs, IBs, RTs, etc
• Game loop:
  – Create nothing at all

Huge Savings...
• SetRenderTarget()
  – Let’s not have too many of these please!
  – Single-digit counts are good…
• Lock() with zero for flags is one to avoid
  – Whether that’s a VB that’s being rendered from
  – Or a RenderTarget which was rendered to
  – Because there are milliseconds at stake here!
• Also use ‘DONOTWAIT’ appropriately to reclaim CPU cycles – these are scarce!

Significant savings...
• Every DrawPrim call is a significant cost
  – So make sure you get good value from it
• Every time you set any state it costs you
  – Whether you set one or ten...
  – But aggressive state filtering is no longer needed so much in DX9
• One pixel is irrelevant, but millions matter...
  – Clear() the Z/stencil buffer to make it work fast
  – Sort Front to Back
    • Sub-Sort by shader
  – Set your shader constants in blocks

Compilers are smart...
• At ATI we test compilers to make sure that they’re good and help make them better
• Sample results (Win, Draw, Lose):
  – HLSL vs Cg on ATI*: 5, 7, 2
  – HLSL vs Cg on NV: 16, 7, 0
  – (*) The Cg compiler failed to compile 9 of the 23 RenderMan samples for SM2.0 even though the HLSL compiler succeeded
• So using HLSL seems like the logical choice…
• Not just an industry standard – but the best too

And a PC is complex
• Which is a bit of an understatement…
• A 9800 Pro has a similar number of gates to two Pentium4 processors all on one die
• But the highly parallel design allows it to do much more work – of a very specific kind…
• So you’d like to have the CPU and VPU both doing useful work at the same time
  – Luckily the API encourages this…

Which bits are fast?
• System:
  – CPU
    • 1 to 1/3 of a nanosecond… (1GHz to 3GHz)
  – System memory
    • High latency compared to the CPU
    • 200 - 800MHz (for moving data about)
  – Virtual memory
    • Takes all week…
• Graphics card:
  – VPU core
    • 200 to 500MHz
  – Local video memory
    • 200 to 500MHz (~20GB per second)
• AGP Bus:
  – 266MHz, 2GB per second, with latency like molasses…
    • [100MB per second for CPU reads – so don’t!]

Which bits are fast?
• System:
  – CPU
    • So the CPU is fast, but it still has too much to do…
    • “All games are CPU limited”
• Graphics card:
  – VPU core
    • Not a blindingly fast clock, but phenomenal throughput
• AGP Bus:
  – Don’t texture from here unless you have to

Inside the VPU
• You have several units at your disposal…
  – Vertex fetch (memory cache)
  – Vertex shader (xform and lighting)
  – Vertex cache (protecting the shader from abuse)
  – Clipper (so fast it might as well not be there…)
  – Triangle setup
  – Fast Z/stencil reject (quad speed rasterizer rejection)
  – Rasterizer
    • Pixel cache
    • Texture cache
  – Z buffer
  – Blend (Yummy! Read-modify-write)

Inside the VPU
• Because the vertex fetch unit is just reading / caching memory it makes sense to prefer cache-aligned data formats (like 32 bytes or 64 bytes)
• The vertex cache only works for indexed primitives…
• So we recommend that all rendering is done with DrawIndexedPrimitive() and that you submit data in roughly tri-strip order

Saving nanoseconds…
• Use shorter shaders since they’re faster
  – One op per clock is what you should expect
  – ATI hardware can parallelise vector + scalar op pairs
• Shaders are cached on chip too
  – So switching shader can sometimes be very fast
• Hand written assembly isn’t usually a good bet
• ps.1.4 modifiers can be free in ps.2.0 hardware

Saving nanoseconds…
• Prefer the shortest shader which does what you want
• Use the lowest shader model which achieves your target
  – That way you can potentially access the ps.1.4 modifiers which run in the same clock cycle
• But please do not sacrifice quality for speed!
  – That can be the user’s choice later on by selecting no-AA, low screen resolution etc

Pre Zee
• An early “Z only” pass will save you time if…
  (1) Your pixel shaders are ‘long’
  (2) You cannot sort front-to-back
• The definition of ‘long’ here depends upon how well you can usually sort!
• Pre-Z saves you pixels, but costs you vertices

Optimisation - The Big Picture
• Almost all of the best optimisations come down to one single principle…
  – Do the work as early as possible in the pipeline to avoid doing it later where the cost would be greater
• This applies to things like…
  – resource creation (prefer install time costs to runtime costs)
  – culling (culling early is better than culling late)
  – shader tuning (pre-shader opts move from ps to vs to CPU)
  – Z-only pass

What’s this about the future?
• Let’s look at the trends which are changing the balance…

ATI is at the Center of The Digital Experience

Market share...
• At the end of 2003 ATI finally took the lead in market share in game-play graphics from the competition
  – Yeah, but only by 0.2%... So what?
• According to Mercury Research, ATI leads with a roughly 80:20 split at the high end…
  – Which means that if you’re targeting high end gamers and reviewers then your focus is on ATI
  – That’s what the vast majority of your audience is using…
  – And ATI has a 100% market share lead of “New Xbox technologies”… ☺

Multiple platforms...
• The PC leads the way so that the various genres of lesser hardware are several years behind PC architecture...
  – Latest PDA hardware is equivalent to cutting edge PC hardware from just 4 years ago!
  – Laptops are less than 2 years behind high end workstations
  – Consoles often define the high end as they arrive...

PC Platform retirement
• Top spec PCs actually have a game-buying life of just two years!
  – PCs older than that are ‘retired’ for Word, email, web browsing etc.
  – New PCs or graphics cards are brought into the home and it’s these that are used for games
  – Gamers with systems which are >2 years old buy roughly 1 game per year and these are not high end games
  – Hard core gamers average 5 - 10 games per year
  – This implies a roughly 2.5:1 CPU scalability issue…
  – And roughly 4:1 GPU scalability on both power and features

All of which means
• You should require DX8 hardware and upwards for games due Xmas 2004 or later
  – We recommend treating low end DX9 hardware to the DX8 path. Even 1024x768 is often too demanding for the low end DX9 hardware out there
  – So you should be able to cope with just two code paths on many games for this year
    • DX8 hardware takes one
    • DX9 hardware takes the other
• But note that because this assertion is based on forecasts and trends it is highly subjective…

DirectX 8 class hardware
• Programmable vertex pipeline is in addition to the FF pipeline
  – That makes it hard to beat the fixed function hardware
  – And this makes it fast to switch between pipelines
• Pixel pipeline is shared between the old fashioned texture cascade and the new pixel processor

DirectX 9 class hardware
• Programmable vertex pipeline is shared with the FF pipeline
  – That makes it easy to beat the fixed function hardware
  – That makes it slow to switch between pipelines
• For this reason it makes sense generally to prefer the programmable pipeline.

So, here is our target:
• DX9 style mainstream graphics (per frame):
  – > 500K triangles
  – < 500 DrawIndexedPrimitive() calls
  – < 500 VertexBuffer switches
  – < 200 different textures
  – < 200 State change groups
  – Few calls to SetRenderTarget - aim for 0 to 4...
  – 1 pass per poly is typical, but 2 is sometimes smart
  – Runs at monitor refresh rate
  – Which gives more than 40 million polys per second
• And everything goes through the programmable pipeline
  – No occurrences of Lock(0), DrawPrimitive(), DPUP(), CreateVB() etc

Are we there yet?
• Pixel Shader throughput:
  – More pixel engines with
    • Higher clock speeds
    • Higher instruction counts
  – More vertex engines too since triangles keep getting smaller
  – The pressure moves away from textures and towards the ALU operations…
    • Simply because ALU power grows faster than B/W

Are we there yet?
• High quality AA:
  – Continue to innovate with...
  – Programmable sample points
    • Currently 0, 2, 4 or 6
  – Full exposure of ‘centroid’ control
    • DirectX 9.0c API fully exposes this
  – Gamma correction of AA in hardware
    • ATI do this already with a 2.2 gamma function

The 3.0 shader model
• Requires 32 bit floats throughout the pipeline
  – But that’s not necessarily full IEEE 754...
    • With its -0.0s, NANs and INFINITYs etc
• Although the spec does not require support for blend and fog into float surfaces you may expect this to be available on much hardware
• Static flow control in pixel shader
  – Has some serious performance implications...

Which constraints are next?
• SM3 Precision
  – Consistent 32 bit IEEE throughout
  – Which means...
  – se7m24
    • One sign bit
    • 7 bits of exponent
    • 24 bits of mantissa
  – But the propagation rules (like “what is –INF * -0.0”) are not necessarily required until SM 4.0
  – Higher (64 bit) precision is not for the near-term...

Stream Processors
• Modern GPUs and VPUs are computing devices built from stream processors
  – Stream Processors are great for some tasks...
[Diagram: Fixed maximum input B/W → Fixed processing power → Fixed maximum output B/W]

Stream Processors?
• Modern GPUs and VPUs are computing devices built from stream processors
[Diagram: Vertex Fetch → Vertex Shader → Triangle set up → Pixel Shader → FB fog + blend]
• But really, each block is complex...
[Diagram: Sp[0], Sp[1], Sp[...], Sp[n-1], Sp[n]]

A unified shader model
• The plan as of GDC 04
  – Is that each of the different 4.0 shaders will use the same syntax and feature set
  – This allows us to get around the major drawback of hardwired stream processors – fixed resources.
• Then the chip can become a pool of vector processors and the hardware allocates these resources to match demand
• Which implies that benchmarking the hardware becomes somewhat more complex where:
  – How many vertices per second depends on the pixel complexity
  – How many pixels per second depends on the vertex complexity

So isn’t this a CPU?
No, look at the differences:
• Cache Sizes - CPU = huge
• Number of Pipeline Stages - VPU = long
• Cache Interaction - VPU = none
• Clock Speed - CPU = fast
• Generality - VPU tends not to read what it writes
• Vector oriented - VPU is fundamentally 4D
• Number types - CPU is more flexible, supporting integers and floats easily
• Branches - VPUs don’t like branching…

Some of the targets for DX Next
• Geometry generation in the VPU
  – A fully specified new Topology Processor unit
• Which means you’ll be able to generate new vertices with all relevant connectivity information from within the VPU...
• For example you can extrude shadow volumes using this new hardware
• [But the geometry shader probably doesn’t get fed its own output...]
• Note please that “DX Next” is just my placeholder name

Some of the targets for DX Next
• Support for virtual memory
  – So texture downloads are much more efficient
  – Now only those pages of the relevant mip levels will be present
    • Contrast that with the current situation where all of every mip level is required to be present in VPU-accessible memory before the first texel is filtered...
  – And DX Next has the notion of graphics hardware contexts with maximum context switch times
  – VM may also include write capabilities...
• Will reduce the pressure to move beyond 512MB but we’ll still head in that direction...
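
To put rough numbers on the virtual memory point, the sketch below adds up the bytes in a full mip chain and compares that with the bytes in only the coarser levels that a distant surface might actually sample. It is an illustration only: the helper name mipChainBytes, the square uncompressed texture and the 4 bytes per texel are assumptions, and real page granularity is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes held by mip levels [firstLevel .. 1x1] of a
// square, uncompressed texture. No padding or page rounding is modelled.
std::uint64_t mipChainBytes(std::uint32_t size, std::uint32_t bytesPerTexel,
                            std::uint32_t firstLevel = 0) {
    assert(bytesPerTexel > 0);
    if (size == 0) return 0;  // an empty texture has no levels

    std::uint64_t total = 0;
    std::uint32_t level = 0;
    // Each level halves the edge length until the 1x1 level is reached.
    for (std::uint32_t dim = size; ; dim /= 2, ++level) {
        if (level >= firstLevel)
            total += static_cast<std::uint64_t>(dim) * dim * bytesPerTexel;
        if (dim == 1) break;
    }
    return total;
}
```

For an assumed 2048x2048 texture at 4 bytes per texel, mipChainBytes(2048, 4) comes to 22,369,620 bytes (about 21.3MB) that must all be resident today, whereas mipChainBytes(2048, 4, 3), covering only the 256x256 level and below, is 349,524 bytes (about 341KB).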
The 4.0 shader model
• Is still being decided by Microsoft
• Will be for the next OS only
• Expect this circa early 2006
• New geometry shader
• Common capabilities between all shaders
• Faster small batch performance is a very high priority…
  – Which implies a new driver model
• Will last for two or more years
  – DX9 lasts from Q4 2002 until the next OS
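
The priority given to small batch performance is easier to appreciate when the per-frame targets from the “So, here is our target” slide are turned into per-call numbers. The sketch below does that arithmetic. The struct and function names are hypothetical, and the refresh rate is an assumption, since the deck only says “monitor refresh rate”.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the per-frame targets quoted in the deck.
struct FrameBudget {
    std::uint32_t trianglesPerFrame;  // deck target: > 500K triangles
    std::uint32_t drawCallsPerFrame;  // deck target: < 500 DrawIndexedPrimitive() calls
    double refreshHz;                 // assumed monitor refresh rate
};

// Triangles submitted per second if every frame hits the target.
double trianglesPerSecond(const FrameBudget& b) {
    return b.trianglesPerFrame * b.refreshHz;
}

// Average batch size needed to stay inside the draw call budget.
double trianglesPerCall(const FrameBudget& b) {
    assert(b.drawCallsPerFrame > 0);
    return static_cast<double>(b.trianglesPerFrame) / b.drawCallsPerFrame;
}

// Wall-clock nanoseconds available for one frame.
double nanosecondsPerFrame(const FrameBudget& b) {
    assert(b.refreshHz > 0.0);
    return 1e9 / b.refreshHz;
}

// Upper bound on time per draw call, if the frame did nothing else.
double nanosecondsPerCall(const FrameBudget& b) {
    assert(b.drawCallsPerFrame > 0);
    return nanosecondsPerFrame(b) / b.drawCallsPerFrame;
}
```

With an assumed 85Hz refresh, FrameBudget{500000, 500, 85.0} gives 42.5 million triangles per second, in line with the “more than 40 million polys per second” figure, and an average of 1000 triangles per call. At 100Hz, which matches the “roughly 10 million” nanoseconds per frame quoted earlier, each of the 500 calls has at most 20,000 ns, which is why smaller batches need a cheaper path through the driver.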