Shogun 2: Total War* Case Study

Contents
Abstract
Getting Accurate Results
Intel® HD 3000 Graphics Investigation
Multi-core Investigation
Caution: Death can inhibit your frame rate!
Conclusion
About Intel GPA
About the Author
Abstract
Intel worked with Creative Assembly* throughout the development of Shogun 2: Total War* to support
them in getting the best possible performance from the Intel® Core™ i7-2820QM CPU and Intel® HD
3000 Graphics. Through a series of analyses and experiments using Intel® VTune™ Amplifier XE and
Intel® Parallel Studio, we identified and removed numerous locks that were causing bottlenecks, and
produced a final frame rate that was up to 1.28X higher on 4 core Intel® Core™ i7 systems than on 2 core
systems. The team also used Intel® Graphics Performance Analyzers throughout development to
identify bottlenecks in the terrain shaders that were impacting performance on Intel® HD 3000 Graphics,
and achieved a very playable frame rate of 28 frames per second. This case study looks at some of the
efforts made by both Intel and Creative Assembly during development to make Shogun 2: Total
War a top-class product on Intel® HD 3000 Graphics.
Getting Accurate Results
Intel engaged with Creative Assembly quite early in the development of Shogun 2. At that point, little of
the game was functional, which made performance testing and monitoring difficult. What we needed
was a repeatable workload, typical of the game, that we could use as a yardstick for future tests.
We knew that later in the project we were going to rebuild and enable the replay system used on previous
incarnations of the engine, but this was not possible when we started, and we needed numbers early.
Eventually we settled on Lua* scripted scenarios with managed cameras (similar to what you see at the
start of the campaign battles). These worked quite well as a repeatable workload and sufficed for
early testing, as the AI and everything else ran normally during the script. We used this system to help
profile the game on a number of graphics devices, including the Intel® HD 3000 Graphics processor
family (codenamed Sandy Bridge).
Intel worked with Creative Assembly to debug and refine the in-game replay system, which had been
designed during previous incarnations of the Total War engine. In Total War, the replays are actually
real game action, not a movie. During the replay all game code elements are fully active and are playing
the scenario as surely as if you were pushing the buttons yourself. This made the replay system ideal for
producing a repeatable and representative workload. However, there was a problem. As diehard Total
War fans know, when you play back a replay you are normally free to move the camera wherever you
want. This level of control makes for a great experience, but it also means the frame content is
unpredictable: if you point the camera at the sky there is almost no content and consequently the
frame rate is really high, but if you point it at a charging army the content is very busy and hence the
frame rate is lower. If we had measured the improvement in game performance with this free camera
movement, it would have been impossible to repeat the camera movement for all tests, and so it would
have been impossible to make an accurate comparison.
Changing the replay system was not part of the planned work, but Creative Assembly agreed that recording
the camera was essential in order to provide accurate performance data on all tested platforms. Intel
worked closely with Creative Assembly on the replay system. An onsite Intel engineer contributed
design ideas and helped out by testing and debugging the replay system. Intel also developed a few
battle scenarios to help debug the camera in the early stages of game development.
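To make the idea of a recorded camera concrete, the following is a minimal Python sketch of a camera track that is sampled during recording and reproduced identically on every playback. It is an illustration only, not the Total War replay code: the names CameraKey and CameraTrack, the choice of fields, and the use of linear interpolation are all assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraKey:
    """One recorded camera sample (hypothetical layout, not the game's format)."""
    time: float       # seconds since the start of the replay
    position: tuple   # world-space (x, y, z)
    yaw: float        # degrees
    pitch: float      # degrees


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


class CameraTrack:
    """Stores camera samples while recording and reproduces them on playback."""

    def __init__(self):
        self.keys = []

    def record(self, key):
        # Samples must arrive in time order so playback can binary-search them.
        if self.keys and key.time <= self.keys[-1].time:
            raise ValueError("camera samples must be strictly increasing in time")
        self.keys.append(key)

    def sample(self, time):
        """Return the camera state at `time`, interpolating between samples.

        Times outside the recorded range clamp to the first or last sample.
        Yaw wrap-around at 360 degrees is ignored to keep the sketch short.
        """
        if not self.keys:
            raise ValueError("empty camera track")
        times = [k.time for k in self.keys]
        i = bisect_right(times, time)
        if i == 0:
            return self.keys[0]
        if i == len(self.keys):
            return self.keys[-1]
        a, b = self.keys[i - 1], self.keys[i]
        t = (time - a.time) / (b.time - a.time)
        return CameraKey(
            time,
            tuple(lerp(pa, pb, t) for pa, pb in zip(a.position, b.position)),
            lerp(a.yaw, b.yaw, t),
            lerp(a.pitch, b.pitch, t),
        )
```

Because playback depends only on the recorded samples and the replay time, every test run on every platform sees the same view at the same moment, which is the property a fair frame rate comparison needs.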
Intel® HD 3000 Graphics Investigation
To make Shogun 2 a success on Intel® HD 3000 Graphics, we wanted to have at least ‘medium’ settings
for all the options, so any option that showed a significant slowdown at or below ‘medium’ resulted in a
deep dive with Intel® Graphics Performance Analyzers (GPA). Early in the enabling process we tried to
focus on the graphics pipeline to determine what effects and settings levels we could apply for all the
possible graphics options. Intel® HD 3000 Graphics was capable of dealing with settings greater
than ‘low’ for almost all the options, but we wanted to go beyond mere settings selection and actually
tweak some higher settings to make them perform better on Intel® HD 3000 Graphics.
We investigated a number of areas where there was heavy graphics processing and discussed various
ways to reduce the workload so that higher settings could be used on Intel® HD 3000 Graphics. In the
following example we looked into the landscape renderer, since being able to default to one higher level
of landscape detail made a significant visual difference. Because of the camera in Shogun 2, the landscape
is rendered over most of the screen, which meant that any improvement in the landscape renderer would
pay dividends.
Figure 1: Intel® GPA shows that the draw calls which render the terrain take up a lot of time. Each
horizontal bar is a single draw call. Those picked out in yellow are calls made by the terrain renderer.
The vertical axis shows the duration of the draw call.
Figure 2: Pixels drawn by the terrain renderer. All pixels drawn by the terrain renderer are highlighted in
pink. The renderer takes up a fair portion of the screen, and so probably should command a large
portion of the frame's processing time. However, the stats suggested there may be things we could do.
We investigated the game with Intel® GPA. The main workhorse, as always, was Frame Analyzer. Using
Frame Analyzer you can see the time taken to carry out each draw call, then select individual
draw calls and investigate their textures, shaders and so on. Pretty quickly, we identified the landscape
as the most expensive component of the scene. The first thing we noticed about the shader for the
landscape was that it was 350 instructions long. That count included 21 texture reads from 14 different
textures to arrive at the final image. The textures in question were quite large, so our initial thought
was that we could be losing a lot of time accessing textures.
We were surprised to see that our initial assessment was wrong. In fact, using Intel® GPA experiments
we found that substituting 2x2 textures on the land only reduced its share from 34% to 33% of the scene,
almost no difference at all. Looking further, we saw that the execution unit utilization on Intel® HD 3000
Graphics was almost 80%. The clinching blow was that replacing the shader with a simple one reduced the
processing share of the landscape to 3%.
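The percentages above are shares of the whole scene, so they understate how different the two experiments were. As a rough worked example, the Python sketch below converts a before and after share into a relative cost for the landscape alone. It assumes the rest of the frame costs the same in both runs, which the case study does not state, and the function names are hypothetical.

```python
def relative_cost(share_before, share_after):
    """Component cost after a change divided by its cost before.

    Shares are fractions of total frame time. Assumes everything else in the
    frame costs the same before and after the change.
    """
    rest = 1.0  # cost of everything else, in arbitrary units
    cost_before = rest * share_before / (1.0 - share_before)
    cost_after = rest * share_after / (1.0 - share_after)
    return cost_after / cost_before


def frame_time_ratio(share_before, share_after):
    """New frame time divided by old frame time, under the same assumption."""
    return (1.0 - share_before) / (1.0 - share_after)


tiny_textures = relative_cost(0.34, 0.33)  # 2x2 texture experiment
simple_shader = relative_cost(0.34, 0.03)  # simple shader experiment

print(f"2x2 textures:  landscape cost x{tiny_textures:.2f}")
print(f"simple shader: landscape cost x{simple_shader:.2f}")
print(f"simple shader: frame time x{frame_time_ratio(0.34, 0.03):.2f}")
```

Under that assumption, removing almost all texture bandwidth cuts the landscape cost by only about 4%, while removing the shader arithmetic cuts it by about 94%, which is consistent with the conclusion that the shader was bound by execution unit work and not by texture access.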
A partial rewrite of the landscape shader gave us a reduction in size, which in turn resulted in a
performance improvement. The main change was the vectorization of a set of sequential operations in the
shader, which greatly improved the execution time. Other minor changes were made to the number of detail
textures and the way they were handled. As a result we were able to set the landscape detail option a
full level higher and keep more or less the same frame rate, with a significant
improvement to the visual quality.
Multi-core Investigation
We really wanted Shogun 2 to take advantage of systems with multiple cores. We started work on
multi-core optimization very early in the development of Shogun 2, long before the replay camera
mentioned earlier had been implemented. At this point we were using static scenes of armies and comparing
the frame rates on the 2 core and 4 core target systems with camera positions as near as we could get to
identical. Our testing was so early in the development that many of the graphics techniques used to
achieve the visual richness of Shogun 2 had yet to be implemented.
It’s important to get in early with multi-core optimizations, but how do you do that on an unfinished
engine? What we did have was the core of the scenery and weather systems, and the AI and animation
engine for the soldiers. After deliberation we felt that this was enough to give us meaningful results, if
not 100% accurate ones, and that our findings could be verified later once the game had been completed.
As a result of earlier work with Intel on the Total War* titles Napoleon* and Empire*, Creative Assembly
already used Intel® Threading Building Blocks (TBB) to thread systems involving repetitive tasks such as
the animation of combatants and ships. What we found was that in designing the next generation Total
War engine Creative Assembly had needed to rewrite some of these systems, and while the new design was
innovative, the multi-core scaling we expected to see was not there.
We set aside an afternoon to investigate this using Intel® VTune™ Amplifier XE, a tool from Intel that
lets you see exactly where execution time is being spent, right down to individual instructions. We took
samples using Intel® VTune™ Amplifier XE on a 2 core and a 4 core system to compare threading
performance. What we found was puzzling at first. It seemed that there was still a good percentage of
threaded code in the engine. On a 4 core Hyper-Threaded system we were getting about 275% code
execution (equivalent to 2.75 cores running flat out) and on the 2 core HT system we were getting about
175% (equivalent to 1.75 cores running flat out), so there should have been some scaling, but the frame
rates were doggedly identical, to within less than 1%. Drilling down with VTune we found the problem
pretty quickly. The Windows* function WaitForSingleObject(), which blocks a thread until a
synchronization object is signaled and was used here to prevent multiple threads from accessing the
same code at the same time, was taking an oddly significant portion of the execution time.
A deeper examination of the code showed that the threading optimizations in trees, weather and
animation used linked lists to store items for processing by the graphics thread. While it appeared that
each of the threads used its own list, an optimization in the list management code, completely separate
from the systems we were examining, meant that ‘under the hood’ all the systems used the same global
list. Consequently, all the threaded optimizations were being cancelled out by a single lock they all
shared in the engine core. Once we removed the offending lock and provided truly separate lists for
each thread, the scaling returned and we began to see up to a 1.28X performance increase between the
two systems.
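The difference between the two designs can be sketched as follows. This is an illustrative Python sketch only: the engine is not written in Python, the function names are hypothetical, and Python threads will not show the frame rate scaling discussed here. It shows only the structure of the problem (every worker appending to one list behind one lock) and of the fix (each worker filling a private list that is merged once after the workers finish).

```python
import threading


def produce_shared(items_per_worker, workers):
    """All workers append to one list behind one lock (the contended pattern)."""
    shared = []
    lock = threading.Lock()

    def work(worker_id):
        for i in range(items_per_worker):
            with lock:  # every append from every worker serializes here
                shared.append((worker_id, i))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


def produce_per_thread(items_per_worker, workers):
    """Each worker fills its own list; the lists are merged once at the end."""
    per_worker = [[] for _ in range(workers)]

    def work(worker_id):
        out = per_worker[worker_id]  # private to this worker, so no lock needed
        for i in range(items_per_worker):
            out.append((worker_id, i))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Single-threaded merge after all workers are done (stands in for the
    # consumer that processes the queued items).
    merged = []
    for out in per_worker:
        merged.extend(out)
    return merged
```

Both functions produce the same set of items. The second one simply moves all synchronization to the join, so workers never wait on each other while producing.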
Once we had scaling, it was a fairly simple matter to periodically check it as the project progressed to
watch for sudden losses. If we saw a sudden drop, then we could look at the code added over the last
period and track down the problem fairly quickly.
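A periodic check of this kind can be as simple as comparing the 4 core to 2 core frame rate ratio of each build with that of the previous one. The sketch below is a hypothetical example of such a check in Python; the function name, the 0.05 threshold and the build data in the usage note are assumptions, not figures from the project.

```python
def scaling_regressions(builds, max_drop=0.05):
    """Return labels of builds whose 2 core to 4 core scaling fell sharply.

    `builds` is a sequence of (label, fps_2core, fps_4core) tuples in
    chronological order. A build is flagged when its scaling ratio is more
    than `max_drop` below that of the build before it.
    """
    flagged = []
    previous = None
    for label, fps_2core, fps_4core in builds:
        current = fps_4core / fps_2core
        if previous is not None and previous - current > max_drop:
            flagged.append(label)
        previous = current
    return flagged
```

For example, given made-up measurements `[("a", 25.0, 30.0), ("b", 25.0, 30.5), ("c", 25.0, 26.0)]`, only build "c" is flagged, which narrows the search to the code added between "b" and "c".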
One amusing event which occurred during the development deserves mention here. Although we had
scaling as a result of the animation threading code, we would regularly see a drop-off as battles
progressed, as shown in Figure 3.
Figure 3: 2 core versus 4 core frame rate over time. The graph shows the recorded frame rate from
Shogun 2: Total War after the global lock fix (vertical axis is frame rate, horizontal axis is time in
seconds). We were typically seeing 1.2X scaling through most of the battle, but as can be seen from the
end of the trace, the scaling always seemed to tail off.
Caution: Death can inhibit your frame rate!
This puzzle was eventually traced to an issue with corpses. In Shogun 2 the corpses of the fallen stay in
view on the battleground. Once fallen, the corpses would still animate so that they could be blown about
by explosions, trampled by horses and so on. A tiny error in the code meant that once a model was marked
as dead, the animation code swapped to an old path which did not have the threading optimizations in it.
The net effect was that as more men died and moved to the unthreaded animation system, the
performance improvement dropped off. The fix to the corpse code was fairly minor, and once it was complete
we were seeing continuous scaling right through the thick of battle, at about a 1.2X improvement from 2
core to 4 core.
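The tail-off can be reproduced with a very simple model. The Python sketch below is a toy model only: it assumes threaded animation work divides evenly across cores, ignores Hyper-Threading and scheduling overhead, and uses hypothetical workload numbers (1000 models and 1000 units of other serial work) chosen so that the scaling starts at 1.2X.

```python
def animation_time(alive, dead, cores, dead_threaded=False):
    """Time to animate all models for one frame, in arbitrary units.

    Threaded work is divided evenly across cores (an idealization); work on
    the unthreaded path is not divided at all.
    """
    threaded = alive + (dead if dead_threaded else 0)
    serial = 0 if dead_threaded else dead
    return threaded / cores + serial


def scaling_2_to_4(alive, dead, other_serial, dead_threaded=False):
    """Frame time on 2 cores divided by frame time on 4 cores."""
    t2 = other_serial + animation_time(alive, dead, 2, dead_threaded)
    t4 = other_serial + animation_time(alive, dead, 4, dead_threaded)
    return t2 / t4


# Hypothetical battle: 1000 models, with more of them dead at each step.
for dead in (0, 250, 500, 750, 1000):
    buggy = scaling_2_to_4(1000 - dead, dead, other_serial=1000)
    fixed = scaling_2_to_4(1000 - dead, dead, other_serial=1000, dead_threaded=True)
    print(f"dead={dead:4d}  buggy scaling={buggy:.2f}  fixed scaling={fixed:.2f}")
```

In this model the buggy path drifts from 1.2X towards 1.0X as the dead accumulate, while the fixed path holds 1.2X throughout, which matches the shape of the behaviour described above.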
It wasn’t all good news for multi-core on Shogun 2. During the development we added threading to tree
animation, and specifically to the level of detail (LoD) calculations for the trees. With the number of
trees in a typical landscape this added up to quite a lot of parallel code and boosted the scaling to over
1.3X for a time. Creative Assembly then added an innovation in the form of a system to batch the
trees together and share LoD calculations across groups of trees. The result was a new piece of code which
was significantly faster on a single core than the threaded execution had been. It is practically
impossible to see any difference between the two tree processing systems, so there was no benefit in
keeping the more complicated parallel code. With the trees batched there were so few LoD calculations
that threading them had little or no effect. This was a classic example of the old adage that ‘the
fastest piece of code is the one which does not execute at all…’. Hats off to Creative Assembly!
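The effect of batching on the amount of work can be illustrated with a short Python sketch. The case study does not say how the trees were grouped, so the grid cells, the distance thresholds and all names below are assumptions made purely for illustration.

```python
import math


def lod_for_distance(distance, lod_ranges=(50.0, 150.0, 400.0)):
    """Return a LoD index: 0 is the most detailed, len(lod_ranges) the least."""
    for level, limit in enumerate(lod_ranges):
        if distance < limit:
            return level
    return len(lod_ranges)


def per_tree_lods(trees, camera):
    """One LoD calculation per tree."""
    return [lod_for_distance(math.dist(tree, camera)) for tree in trees]


def batched_lods(trees, camera, cell_size=32.0):
    """One LoD calculation per grid cell, shared by every tree in that cell.

    Returns the per-tree LoD list and the number of LoD calculations made.
    """
    cell_lod = {}
    result = []
    for x, z in trees:
        cell = (math.floor(x / cell_size), math.floor(z / cell_size))
        if cell not in cell_lod:
            centre = ((cell[0] + 0.5) * cell_size, (cell[1] + 0.5) * cell_size)
            cell_lod[cell] = lod_for_distance(math.dist(centre, camera))
        result.append(cell_lod[cell])
    return result, len(cell_lod)


# Hypothetical forest: 6400 trees on a regular 4-unit grid.
trees = [(float(x), float(z)) for x in range(0, 320, 4) for z in range(0, 320, 4)]
lods, calculations = batched_lods(trees, camera=(0.0, 0.0))
print(f"{len(trees)} trees, {calculations} LoD calculations")
```

With these made-up numbers the batched version performs 100 distance calculations instead of 6400, which shows why so little work was left for threading to speed up.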
Conclusion
Intel worked with Creative Assembly for about 9 months on Shogun 2. We worked on the game together
to add multi-core optimizations which resulted in the game being 1.2X faster on a 4 core HT system
compared to a 2 core one, and we added graphics optimizations which gave us more than reasonable
performance on Intel® HD Graphics, proving that integrated graphics systems could hit the mark in the
games environment. We all concluded from the development that an early focus on CPU and graphics
optimization is vital to making a successful game on modern hardware.
But the main thing Intel achieved was to help Creative Assembly produce a game which was worthy of
Creative Assembly’s Total War lineage and at the same time demonstrate what Intel already knew: that
Intel hardware has a great deal to offer game developers looking to excel in performance through
multi-core optimizations and to increase their potential market by embracing Intel® HD Graphics.
About Intel GPA
Intel® Graphics Performance Analyzers (Intel® GPA) is a powerful, agile developer tool suite for analyzing
and optimizing games, media, and other graphics-intensive applications. Intel® GPA Frame Analyzer is a
powerful, intuitive, best-in-class single frame analysis and optimization tool. Intel® GPA System Analyzer
Heads-up Display (HUD) and Standalone provide straightforward initial analysis and interactive
Microsoft Direct3D* pipeline state overrides. Intel® GPA Platform Analyzer provides a timeline view for
analysis of tasks, threads, Microsoft DirectX*, OpenCL™ and GPU-accelerated media applications in
context. Intel® GPA Media Analyzer shows how efficiently your code utilizes hardware acceleration on
Intel® Core™ processor-based PCs with Intel® HD Graphics, and provides in-depth, real-time media
performance analysis of encode and decode metrics.
About the Author
Steve Hughes spent over 12 years developing games for PC and various consoles with, he boasts, “at
least 10 released games - hard to be sure…” before joining Intel as a Senior Application Engineer in 2008.
Since joining Intel he has worked with many companies to try to synergize the relationship between
their games and Intel hardware. When not gazing at code, he plays guitar, tries to polish telescope
mirrors, and occasionally builds sheds.
* Other names and brands may be claimed as the property of others.
Appendix A: System Information
Hardware
System Item         Value
CPU                 Core™ i7-2820QM @ 2.3GHz
Graphics            Intel® HD Graphics 3000 @ 1300 MHz
Memory              4 GB
Max DVMT            1696 MB

Software
System Item         Value
OS                  Windows 7 x64
Graphics Driver     15.10.2291
Video BIOS          2077.0

Game Configuration
System Item         Value
Graphics Settings   Medium
Resolution          1280x800
Appendix B: Tools
Intel® Graphics Performance Analyzer 3.0 & 4.0
Intel® VTune™ Amplifier XE + Parallel Studio
Intel® Threading Building Blocks
Fraps
Creative Assembly internal tools