Game engine architectures

Threading Games for Performance
– Architecture
– Case Studies
Threading Issues

Threads are a tool, not a ready-made solution.

Most threading tutorials use “embarrassingly parallel” examples.

Games are especially challenging to thread because of architectural requirements and
genre expectations.

Many issues have to be considered when implementing a threading strategy:
◦ Is high frame rate the most important performance indicator?
◦ Is input latency a deal-breaker?
◦ Is it fair for clients to run at different speeds?
◦ High frame rate or smooth frame rate?
◦ How well will it scale when Intel ships n-core?
Terminology

A task is a piece of work that is mapped to a thread.
◦ Dedicated threads run the same task repeatedly
◦ Thread pools are assigned tasks dynamically

Work can be broken down into tasks in various ways.
◦ One task for each subsystem is functional decomposition
◦ Multiple tasks for a subsystem is data decomposition
[Figure: example task boxes labeled Physics, Animation, Physics, Physics]
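The two kinds of decomposition can be sketched with a thread pool. This is a minimal illustration, not code from any engine: `update_physics`, `update_animation`, and the sample data are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def update_physics(bodies):
    """Hypothetical physics step: move each (position, velocity) body once."""
    return [(pos + vel, vel) for pos, vel in bodies]

def update_animation(frames):
    """Hypothetical animation step: advance each frame counter by one."""
    return [f + 1 for f in frames]

bodies = [(0.0, 1.0), (10.0, -1.0), (5.0, 0.5), (2.0, 2.0)]
frames = [0, 3, 7]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Functional decomposition: one task per subsystem.
    physics_task = pool.submit(update_physics, bodies)
    animation_task = pool.submit(update_animation, frames)
    functional_result = (physics_task.result(), animation_task.result())

    # Data decomposition: several tasks for one subsystem, each on a slice.
    halves = [bodies[:2], bodies[2:]]
    data_result = [b for part in pool.map(update_physics, halves) for b in part]
```

Both routes produce the same physics state; they differ only in how the work is mapped to threads.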
Let’s hack a game…

Task Cluster: isolate procedures in the game into general tasks.
[Figure: tasks in the cluster: Particles, AI, Physics, Animation, Render, Level Loading]
Get Organized - A simple Render Split?
Render Split: queue up calls, pass them to the render thread.
[Figure: Everything Else, Buffer, Render]
Render Split == wait on data or tasks. Try a Work Crew

Work Crew: like a Task Cluster, but buffer data for each task.
[Figure: work crew with several buffered instances each of Particles, Render, Animation, Physics, and AI]
Work Crew == High Memory Bandwidth. Try an Operation Queue

Operation Queue: data is broken into blocks with a service thread
which executes operations put into a queue.
[Figure: Physics, AI, Render, Animation, Queue, Service]
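An operation queue along these lines might look like the sketch below. The block names, the lambdas, and the `None` shutdown sentinel are assumptions made for illustration. Because only the service thread touches the data blocks, the blocks themselves need no locks.

```python
import queue
import threading

# Data broken into named blocks (hypothetical contents).
blocks = {"positions": [0, 1, 2], "health": [100, 100]}
operations = queue.Queue()

def service_thread():
    """Execute queued operations one by one; only this thread touches blocks."""
    while True:
        op = operations.get()
        if op is None:              # sentinel: no more operations
            break
        block_name, func = op
        blocks[block_name] = func(blocks[block_name])

service = threading.Thread(target=service_thread)
service.start()

# Other subsystems enqueue work instead of modifying the data themselves.
operations.put(("positions", lambda xs: [x + 1 for x in xs]))   # e.g. from physics
operations.put(("health", lambda hs: [h - 10 for h in hs]))     # e.g. from AI
operations.put(None)
service.join()
```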
Architecture Model – Synchronous
Function Parallel Model

Find parallel tasks from an existing loop.

To reduce the need for communication between parallel tasks, the
tasks should preferably be truly independent of each other.
Architecture Model – Synchronous
Function Parallel Model

Divide the functionality into small tasks and build a graph of which tasks
must precede which.

Supply this task-dependency graph to a framework.

The framework in turn schedules the tasks that are ready to run,
taking into account the number of available processor cores.
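A small framework of this kind can be sketched with the standard library's `graphlib` (Python 3.9 or later) and a thread pool. The task names and the graph are hypothetical, and `run_task` is a placeholder for real subsystem work.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical frame tasks: each maps to the set of tasks that must precede it.
dependencies = {
    "input": set(),
    "ai": {"input"},
    "physics": {"input"},
    "animation": {"ai", "physics"},
    "render": {"animation"},
}

finished = []   # completion order, appended only by the scheduling thread

def run_task(name):
    """Placeholder for the real work of one subsystem."""
    return name

def run_frame(graph, workers=os.cpu_count() or 1):
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # raises CycleError on a cyclic graph
    pending = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors have all completed.
            for name in sorter.get_ready():
                pending.add(pool.submit(run_task, name))
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                finished.append(future.result())
                sorter.done(future.result())

run_frame(dependencies)
```

Here `ai` and `physics` may run concurrently, while `animation` is not submitted until both have finished.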
Architecture Model – Synchronous
Function Parallel Model

There is an upper limit to how many cores this model can support,
dictated by how many parallel tasks can be found in the engine.

The number of meaningful tasks is further reduced because
threading very small tasks yields negligible gains.

The parallel tasks should have very few dependencies on each
other.
Architecture Model – Asynchronous
Function Parallel Model

This model doesn't contain a game loop.

The tasks that drive the game forward update at their own pace.

The most recent information is used by the render engine.
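The idea can be sketched as follows: each task publishes its newest state, and the renderer copies whatever is current without waiting. The task names and step counts are hypothetical, and a single lock stands in for whatever publishing mechanism a real engine would use.

```python
import threading

latest = {"physics": 0, "ai": 0}    # most recent published state per task
lock = threading.Lock()

def run_task(name, steps):
    """Advance at the task's own pace and publish the newest result each step."""
    for step in range(1, steps + 1):
        with lock:
            latest[name] = step

def render_snapshot():
    """The renderer never waits for a task; it copies whatever is newest."""
    with lock:
        return dict(latest)

tasks = [
    threading.Thread(target=run_task, args=("physics", 120)),
    threading.Thread(target=run_task, args=("ai", 30)),
]
for t in tasks:
    t.start()
mid_frame = render_snapshot()       # may observe any intermediate state
for t in tasks:
    t.join()
final_frame = render_snapshot()
```

The contents of `mid_frame` depend on timing, which is exactly the property of this model: the renderer draws with the latest data it can get.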
Architecture Model – Asynchronous
Function Parallel Model

The scalability of the asynchronous function parallel model is
limited by how many tasks can be found in the engine.

Because threads communicate only through the latest available
information, they do not need to be truly independent.

The asynchronous model can therefore support a larger number of tasks,
and so a larger number of processor cores, than the
synchronous model.
Architecture Model – Data
Parallel Model

Find some set of similar data for which to perform the same tasks
in parallel.

These are typically the objects in the game.
◦ Example: In a flight simulation, divide all of the planes between two threads. Each
thread handles the simulation of half of the planes. Optimally, the engine would
use as many threads as there are logical processor cores.
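The plane example might be sketched like this. The `simulate` function and the plane representation are hypothetical. In a standard CPython build the global interpreter lock keeps pure-Python threads from running in parallel, so the sketch shows the structure rather than a real speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def simulate(planes, dt):
    """Advance a chunk of planes; each plane is (position, velocity)."""
    return [(pos + vel * dt, vel) for pos, vel in planes]

def split(items, parts):
    """Split items into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def step_all(planes, dt, workers=None):
    """Simulate all planes, one chunk per worker thread."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda chunk: simulate(chunk, dt), split(planes, workers))
        return [plane for chunk in results for plane in chunk]

planes = [(float(i), 2.0) for i in range(10)]
stepped = step_all(planes, dt=0.5, workers=2)
```

Leaving `workers` unset uses one chunk per logical core reported by `os.cpu_count()`.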
Architecture Model – Data
Parallel Model

How to divide the objects into threads?
◦ Threads should be properly balanced, so that each processor core gets used to full capacity.

What will happen when two objects in different threads need to interact?
◦ Communication using synchronization primitives could potentially reduce the amount of
parallelism.
◦ Alternatively, use message passing combined with the latest known updates, as in the asynchronous
model.
◦ Communication between threads can be reduced by grouping objects that are most likely to
interact with each other.
◦ Objects are more likely to come into contact with their neighbors, so one strategy could be
to group objects by area.
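Grouping by area while keeping the threads balanced could be sketched as below. The grid cell size, the object tuples, and the greedy assignment are assumptions chosen for illustration; a real engine would likely use its existing spatial structure.

```python
from collections import defaultdict

def group_by_area(objects, cell_size):
    """Bucket (name, x, y) objects into square grid cells keyed by (column, row)."""
    cells = defaultdict(list)
    for name, x, y in objects:
        cells[(int(x // cell_size), int(y // cell_size))].append(name)
    return dict(cells)

def assign_to_threads(cells, thread_count):
    """Give each whole cell to the least-loaded thread, largest cells first."""
    groups = [[] for _ in range(thread_count)]
    for cell in sorted(cells, key=lambda c: (-len(cells[c]), c)):
        min(groups, key=len).extend(cells[cell])
    return groups

objects = [("a", 1, 1), ("b", 2, 3), ("c", 55, 4),
           ("d", 58, 9), ("e", 3, 7), ("f", 90, 90)]
cells = group_by_area(objects, cell_size=50)
groups = assign_to_threads(cells, thread_count=2)
```

Objects in the same cell always land on the same thread, so most neighbor interactions stay thread-local; only interactions across cell borders need cross-thread communication.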
Architecture Model – Data
Parallel Model

The data parallel model has excellent scalability.

The number of object threads can be set automatically to the number of cores in the
system, and the only non-parallelizable parts of the game loop would be
ones that don't directly deal with game objects.

Data parallelism is needed to fully utilize future processors with dozens of cores.

The performance of the data parallel model is directly related to how large a part of
the game engine can be parallelized by data.

As the number of processor cores goes up, the data parallel parts of the engine take
less time to run. Fortunately, these are usually also the performance-heavy parts of a
game engine.
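How strongly the parallelizable share limits the gain can be estimated with Amdahl's law, which the slides do not name but which expresses the same relationship. The 80% parallel fraction below is an illustrative assumption, not a measurement.

```python
def speedup(parallel_fraction, cores):
    """Amdahl's law: overall speedup when only part of the frame scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assumed example: 80% of the frame time is data parallel.
for cores in (2, 4, 8, 16):
    print(cores, round(speedup(0.8, cores), 2))
```

With 80% of the frame parallelized, the speedup can never reach 5x no matter how many cores are added, because the remaining 20% still runs serially.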
Architecture Model – Data
Parallel Model

The biggest drawback of the model is the need to have components that
support data parallelism.

For example, a physics component would need to be able to run several
physics updates in parallel, and be able to correctly calculate collisions with
objects that are in these separate threads.
Valve uses a hybrid approach to
threading the Source* engine

Uses both functional and data parallelism (coarse and fine grain).

Single mechanism (thread pool with task queue) supports both.

Conventional functional threading: Sound, Rendering back end (D3D calls).

Example parallel tasks:
◦ Construct scene rendering lists for multiple scenes in parallel (e.g., the world and its
reflection in water)
◦ Graphics simulation (particles, ropes, sprites)
◦ Character bone transformations for all characters in all scenes in parallel
◦ Shadows for all characters
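The idea of one pool serving both coarse and fine-grained tasks can be sketched as follows. This is not Valve's code: the scene contents, character names, and both task functions are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: scene contents and per-character bone values.
scenes = {"world": ["tree", "crate", "door"], "water_reflection": ["tree"]}
characters = {"guard": [1.0, 2.0], "pilot": [3.0]}

def build_render_list(scene_name):
    """Coarse-grained task: build the sorted draw list for one scene."""
    return scene_name, sorted(scenes[scene_name])

def transform_bones(character_name):
    """Fine-grained task: transform every bone of one character (here: scale by 2)."""
    return character_name, [b * 2.0 for b in characters[character_name]]

with ThreadPoolExecutor(max_workers=4) as pool:    # the single shared task queue
    list_jobs = [pool.submit(build_render_list, s) for s in scenes]
    bone_jobs = [pool.submit(transform_bones, c) for c in characters]
    render_lists = dict(job.result() for job in list_jobs)
    bones = dict(job.result() for job in bone_jobs)
```

Both kinds of work go through the same `submit` call, so the pool needs no knowledge of task granularity.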
Valve’s hybrid threading
[Figure: block diagram with components Main Thread, Game Engine Loop, Task Q, Thread Pool, Re-Order Buffer, Render Thread, D3D, Driver, Sound Thread]
The Quake 4* engine takes a
different approach to threading

The engine is split into three main components:
- The Quake 4 Engine (exe): this is the part that gets
threaded.
- idlib: a common library of shared code (math, timing,
algorithms, memory management, parsers, …), linked
statically and very well optimized with SSE, SSE2, and SSE3.
- The Game DLL: implements
classes specific to the game, such as weapons, vehicles,
characters, script engine, AI, game physics, …, and calls
into the Quake 4 Engine for all of the lower-level work,
such as the skinning of characters during animation.
VTune™ Analyzer shows
unthreaded Quake 4* has no big
hotspots
Analysis with the VTune™ Performance
Analyzer revealed that:
◦ The game was single threaded and CPU bound.
◦ Roughly equal amounts of time are spent in the
driver and the engine: 41% and 49% respectively.
◦ Each of the major hotspots consumed 2-4% of
CPU time.
The best performance gains come from overlapping the engine and the renderer.
Quake 4* gets the Render Split
treatment
◦ Latency is a key issue, so the most performance has to be
achieved in a time-constrained scenario: only one frame of
latency is allowed.
◦ The engine was functionally decomposed into its two largest
blocks to maximize overlap and minimize synchronization.
◦ All of the time spent in the OpenGL driver is due to the
rendering subsystem of the Quake 4 Engine.
◦ The renderer was split into a front end and a back end so that all OpenGL
calls are made from the back-end thread.
◦ The front end and back end communicate through command
queues and synchronization events.
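A front-end/back-end handoff with at most one frame of latency might be sketched like this. It is an assumption-laden illustration, not the Quake 4 implementation: `queue.Queue` plays the command queue, its `join`/`task_done` pair stands in for the synchronization events, and appending to a list stands in for OpenGL calls.

```python
import queue
import threading

FRAMES = 4
handoff = queue.Queue()     # command queue from front end to back end
rendered = []               # frames completed by the back end
backlog = []                # unfinished frames observed at each handoff

def front_end():
    for n in range(FRAMES):
        commands = [f"draw frame {n}"]        # prepare frame n (placeholder work)
        handoff.join()                        # wait until frame n-1 is fully rendered
        backlog.append(n - len(rendered))     # 0 means the back end has caught up
        handoff.put((n, commands))
    handoff.join()                            # let the last frame finish
    handoff.put(None)                         # sentinel: shut the back end down

def back_end():
    while True:
        item = handoff.get()
        if item is None:
            break
        n, commands = item
        rendered.append(n)                    # stands in for issuing the OpenGL calls
        handoff.task_done()                   # signals "frame n done" to the front end

threads = [threading.Thread(target=front_end), threading.Thread(target=back_end)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The front end prepares frame n while the back end renders frame n-1, but it cannot hand over frame n until frame n-1 is done, so it never runs more than one frame ahead.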
Quake 4* control flow
[Figure: control-flow timeline. Front End: Frame n, Frame n+1, Frame n+2. Back End: Frame n, Frame n+1.]
Though simple in concept, the Render
Split requires significant changes
◦ Each frame was prepared by the front end and handed over to the back end while the front end
prepared the next frame.
◦ Data specific to a frame was duplicated.
◦ Data had to be allocated and freed safely.
◦ With a few exceptions, all allocations were done in the front end.
◦ Data to be freed was kept until the back end was done with it and was cleared by the front end just before
reuse.
◦ Subsystems that were not thread safe had to be rewritten for thread safety: model classes,
animation, shadows, texture subsystems, deforms, loaders, writers, vertex caches, effects, …
- Minimize synchronization
- Have a policy on memory allocation
Debugging the threaded engine is a
further challenge
◦ Debugging the threaded code is the hardest problem.
◦ Issues could be broadly categorized into three major types:
  ▪ Data race conditions
  ▪ Object lifetime issues
  ▪ OpenGL context issues
◦ Added a lock-step mode to the threaded code, in which the front end and
back end run on separate threads but in lock step.
◦ Added lots of initialization and destruction code to deal with lifetime
issues.
◦ Used synchronization points to slowly and painfully eliminate data races.
- Threading is hard. Interaction with the GPU adds more complexity.
- Debugging aids need to be designed while designing the engine threading.
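A lock-step debugging switch could be sketched as below. The `run` function, its `lock_step` flag, and the trace format are hypothetical. With the flag set, both threads still exist, but their work never overlaps, which makes the event order reproducible.

```python
import threading

def run(frames, lock_step):
    """Run a front end (this thread) and a back end thread; return the event trace."""
    ready = threading.Semaphore(0)      # frames handed to the back end
    done = threading.Semaphore(0)       # frames finished by the back end
    trace = []
    trace_lock = threading.Lock()

    def record(event):
        with trace_lock:
            trace.append(event)

    def back_end():
        for n in range(frames):
            ready.acquire()             # wait for the front end to hand over frame n
            record(("render", n))
            done.release()

    worker = threading.Thread(target=back_end)
    worker.start()
    for n in range(frames):
        record(("prepare", n))
        ready.release()
        if lock_step:
            done.acquire()              # debug mode: wait, so the threads never overlap
    worker.join()
    return trace

debug_trace = run(3, lock_step=True)
normal_trace = run(3, lock_step=False)
```

If a bug disappears in lock-step mode, it is likely a data race; if it persists, it is more likely a lifetime or context problem tied to running on a separate thread at all.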
Multi-threaded drivers enable a
further performance gain
After Quake 4* was threaded, NVIDIA and ATI
both released multi-threaded drivers.
◦ The drivers have matured and now work well
with a threaded renderer.
◦ With the multi-threaded drivers, we see a
further gain of about 30-40%.
