Threading Games for Performance – Architecture – Case Studies

Threading Issues
Threads are a tool, not a ready-made solution. Most threading tutorials use “embarrassingly parallel” examples. Games are especially challenging for threading, because of architectural requirements and genre expectations.
Many issues have to be considered when implementing a threading strategy:
◦ Is high frame rate the most important performance indicator?
◦ Is input latency a deal-breaker?
◦ Is it fair for clients to run at different speeds?
◦ High frame rate or smooth frame rate?
◦ How well will it scale when Intel ships n-core?

Terminology
A task is a piece of work that is mapped to a thread.
◦ Dedicated threads run the same task repeatedly
◦ Thread pools are assigned tasks dynamically
Work can be broken down into tasks in various ways.
◦ One task for each subsystem is functional decomposition
◦ Multiple tasks for a subsystem is data decomposition
[Diagram labels: Physics, Animation, Physics, Physics]

Let’s hack a game…
Task Cluster: isolate procedures in the game into general tasks.
[Diagram labels: Particles, AI, Physics, Animation, Render, Level Loading]

Get Organized - A simple Render Split?
Render Split: queue up calls, pass them to the render thread.
[Diagram labels: Buffer, Render, Everything Else]

Render Split == wait on data or tasks. Try a Work Crew
Work Crew: like a Task Cluster, but buffer data for each task.
[Diagram labels, each repeated several times: Particles, Render, Animation, Physics, AI]

Work Crew == High Memory Bandwidth. Try an Operation Queue
Operation Queue: data is broken into blocks with a service thread which executes operations put into a queue.
[Diagram labels: Physics, AI, Queue, Render, Service, Animation]

Architecture Model – Synchronous Function Parallel Model
Find parallel tasks from an existing loop.
To reduce the need for communication between parallel tasks, the tasks should preferably be truly independent of each other.

Architecture Model – Synchronous Function Parallel Model
Divide the functionality into small tasks and build a graph of which tasks precede which. Supply this task-dependency graph to a framework. The framework in turn schedules the proper tasks to run, taking the number of available processor cores into account.

Architecture Model – Synchronous Function Parallel Model
There is an upper limit to how many cores this model can support, dictated by how many parallel tasks can be found in the engine. The number of meaningful tasks is further reduced because threading very small tasks yields negligible gains. The parallel tasks should have very few dependencies on each other.

Architecture Model – Asynchronous Function Parallel Model
This model doesn't contain a game loop. The tasks that drive the game forward update at their own pace. The render engine uses the most recent information available.

Architecture Model – Asynchronous Function Parallel Model
The scalability of the asynchronous function parallel model is limited by how many tasks can be found in the engine. Because threads communicate only by reading the latest information available, they have less need to be truly independent. The asynchronous model can therefore support a larger number of tasks, and a larger number of processor cores, than the synchronous model.

Architecture Model – Data Parallel Model
Find some set of similar data for which to perform the same tasks in parallel. These are typically the objects in the game.
◦ Example: In a flying simulation, divide all of the planes between two threads. Each thread handles the simulation of half of the planes.
Optimally the engine would use as many threads as there are logical processor cores.
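The flying-simulation example above can be sketched in a few lines. This is a minimal illustration, not code from any engine: `Plane`, `simulate_chunk`, `split_evenly`, and `update_planes` are hypothetical names, and the one-dimensional physics is a stand-in for a real simulation step. In standard CPython builds the global interpreter lock keeps pure-Python work like this from running in parallel, so the sketch shows the structure of the decomposition rather than a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Plane:
    """Hypothetical game object with a 1-D position and velocity."""
    position: float
    velocity: float


def simulate_chunk(planes, dt):
    """Advance every plane in one chunk; each chunk is handled by one task."""
    for plane in planes:
        plane.position += plane.velocity * dt


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def update_planes(planes, dt, workers=None):
    """Data-parallel update: the same task runs on a different slice per thread."""
    # Assumption: default to one worker per logical core, as the slide suggests.
    workers = workers or os.cpu_count() or 1
    chunks = split_evenly(planes, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all chunks and re-raises any worker exception.
        list(pool.map(simulate_chunk, chunks, [dt] * len(chunks)))


planes = [Plane(position=float(i), velocity=2.0) for i in range(10)]
update_planes(planes, dt=0.5, workers=2)  # two threads, half of the planes each
```

No locking is needed here only because the planes never interact; each object is touched by exactly one thread during the update.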
Architecture Model – Data Parallel Model
How to divide the objects into threads?
◦ Threads should be properly balanced, so that each processor core gets used to full capacity.
What will happen when two objects in different threads need to interact?
◦ Communication using synchronization primitives could potentially reduce the amount of parallelism.
◦ Use message passing, combined with using the latest known updates as in the asynchronous model.
◦ Communication between threads can be reduced by grouping objects that are most likely to interact with each other.
◦ Objects are more likely to come into contact with their neighbors, so one strategy could be to group objects by area.

Architecture Model – Data Parallel Model
The data parallel model has excellent scalability. The number of object threads can be set automatically to the number of cores in the system, and the only non-parallelizable parts of the game loop would be ones that don't directly deal with game objects. Data parallelism is needed to fully utilize future processors with dozens of cores.
The performance of the data parallel model is directly related to how large a part of the game engine can be parallelized by data. As the number of processor cores goes up, the data parallel parts of the engine take less time to run. Fortunately these are usually also the performance-heavy parts of a game engine.

Architecture Model – Data Parallel Model
The biggest drawback of the model is the need to have components that support data parallelism. For example, a physics component would need to be able to run several physics updates in parallel, and be able to correctly calculate collisions with objects that are in these separate threads.

Valve uses a hybrid approach to threading the Source* engine
Uses both functional and data parallelism (coarse and fine grain).
Single mechanism (thread pool with task queue) supports both.
Conventional functional threading: Sound, Rendering back end (D3D calls).
Example parallel tasks:
◦ Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its reflection in water)
◦ Graphics simulation (particles, ropes, sprites)
◦ Character bone transformations for all characters in all scenes in parallel
◦ Shadows for all characters

Valve’s hybrid threading
[Diagram labels: Thread Pool, Main Thread, Render Thread, Task Q, D3D, Game Engine Loop, Re-Order Buffer, Driver, Sound Thread]

The Quake 4* engine takes a different approach to threading
The engine is split up into 3 main components:
◦ The Quake 4 Engine (exe): this is the part that gets threaded.
◦ idlib: common library for all id stuff (math, timing, algorithms, memory management, parsers, …), linked statically and very well optimized with SSE, SSE2, SSE3.
◦ The Game DLL: the basic game DLL implements classes specific to the game, like Weapons, Vehicles, Characters, Script engine, AI, Game physics, … It calls into the Quake Engine for all of the lower-level work, like the skinning of characters during animation.

VTune™ Analyzer shows unthreaded Quake 4* has no big hotspots
Analysis with the VTune™ Performance Analyzer revealed that:
◦ It was single threaded and CPU bound
◦ Roughly equal time is spent in the driver and the engine, 41% and 49% respectively
◦ Each of the major hotspots consumed 2-4% of CPU time
Best performance gains by overlapping engine and renderer

Quake 4* gets the Render Split treatment
◦ Latency is a key issue, so we have to achieve the most performance in a time-constrained scenario: only one frame of latency allowed.
◦ The engine was functionally decomposed into its two largest blocks to maximize overlap and minimize synchronization
◦ All of the time spent in the OpenGL driver is due to the rendering subsystem of the Quake 4 Engine
◦ Split the renderer into front-end and back-end so that all the OpenGL calls are made from the back-end thread
◦ The front-end and back-end communicate through command queues and synchronization events

Quake 4* control flow
[Timeline diagram. Labels: Front End, Back End; Frame n, Frame n+1, Frame n+2; Frame n, Frame n+1]

Though simple in concept, the Render Split requires significant changes
◦ The frame was prepared by the front end, then handed over to the back end while the front end prepared the next frame.
◦ Data specific to a frame was duplicated
◦ Data had to be allocated and freed safely.
◦ All allocations, with the exception of a few, were done in the front end
◦ Data to be freed was kept until the back end was done, and cleared at the front end just before reuse.
◦ Subsystems that were not thread safe had to be rewritten for thread safety: models classes, animation, shadows, texture subsystems, deforms, loaders, writers, vertex caches, effects, …
Minimize synchronization
Have a policy on memory allocation

Debugging the threaded engine is a further challenge
◦ Debugging the threaded code is the hardest problem
◦ Issues could be broadly categorized into 3 major types:
  - Data race conditions
  - Object lifetime issues
  - OpenGL context issues
◦ Added a lock-step mode to the threaded code, where the front end and back end run on separate threads but in lock step
◦ Added lots of initialization and destruction code to deal with lifetime issues
◦ Used synchronization points to slowly & painfully eliminate data races
- Threading is hard. Interaction with the GPU adds more complexity
- Need to design debugging aids while designing engine threading

Multi-threaded drivers enable a further performance gain
After Quake 4* was threaded, NVIDIA and ATI both released multi-threaded drivers. The drivers have matured and now work well with a threaded renderer. With the multi-threaded drivers we see a further gain of about 30-40%.
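The front-end/back-end split described in the Quake 4* case study can be sketched as a producer and a consumer joined by a bounded command queue. This is an illustrative sketch under stated assumptions, not the Quake 4 implementation: `front_end`, `back_end`, `run_render_split`, and the command tuples are hypothetical, and appending to a list stands in for real OpenGL calls. A blocking `queue.Queue` plays the role of both the command queue and the synchronization events.

```python
import queue
import threading

FRAME_COUNT = 3  # hypothetical number of frames to run
STOP = None      # sentinel telling the back end to exit


def front_end(frames, commands):
    """Engine side: builds one command list per frame, never calls the graphics API."""
    for frame in range(frames):
        # Per-frame data is a fresh list, so the back end never reads data
        # that the front end is still modifying for the next frame.
        frame_commands = [("clear", frame), ("draw_world", frame), ("present", frame)]
        commands.put(frame_commands)  # blocks while one frame is already pending
    commands.put(STOP)


def back_end(commands, submitted):
    """Render side: the only thread that would issue graphics API calls."""
    while True:
        frame_commands = commands.get()
        if frame_commands is STOP:
            break
        for command in frame_commands:
            submitted.append(command)  # stand-in for a real draw call


def run_render_split(frames=FRAME_COUNT):
    commands = queue.Queue(maxsize=1)  # at most one prepared frame waiting
    submitted = []
    render_thread = threading.Thread(target=back_end, args=(commands, submitted))
    render_thread.start()
    front_end(frames, commands)  # front end runs on the calling thread
    render_thread.join()
    return submitted


submitted = run_render_split()
```

Bounding the queue to a single pending frame keeps the front end from running arbitrarily far ahead of the renderer, which is one way to respect a tight latency budget. Making the queue unbounded would remove that stall at the cost of extra latency and memory.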