http://www.xna.com
© 2009 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Out of Order
Making In-order Processors Play Nicely
Allan Murphy
XNA Developer Connection, Microsoft
Optimization Example
class BaseParticle
{
public:
…
virtual Vector& Position() { return mPosition; }
virtual Vector& PreviousPosition() { return mPreviousPosition; }
float& Intensity() { return mIntensity; }
bool& Active() { return mActive; }
float& Lifetime() { return mLifetime; }
…
private:
…
float mIntensity;
float mLifetime;
bool mActive;
Vector mPosition;
Vector mPreviousPosition;
…
};
Optimization Example
// Boring old vector class
class Vector
{
…
public:
float x,y,z,w;
};
// Boring old generic linked list class
template <class T> class ListNode
{
public:
ListNode(T* contents) : mNext(NULL), mContents(contents){}
void SetNext(ListNode* node)
{ mNext = node; }
ListNode* NextNode()
{ return mNext; }
T* Contents()
{ return mContents; }
private:
ListNode<T>* mNext;
T* mContents;
};
Optimization Example
// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
if (node->Contents()->Active())
{
Vector vel;
vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
if (length > cLimitLength)
{
float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
if (newIntensity < 0.0f)
newIntensity = 0.0f;
node->Contents()->Intensity() = newIntensity;
}
else
node->Contents()->Intensity() = 0.0f;
}
Optimization Example
// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position,prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity,velocity);
// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
• Job done, right?
Thank you for listening
Optimization Example
• Hold on.
• If we time it…
  • It's actually slower than the straight C version
• And if we check the results…
  • It's also wrong!
  • Incorrect is a special case of optimization
  • Unfortunately, this does happen in practice
Important Caveat
• Today we're talking about optimization
• But the techniques discussed are orthogonal to…
  • …good algorithm choice
  • …good multithreading system implementation
• It's like Mr Knuth said.
• They typically build code which is…
  • …very non-general
  • …very difficult to maintain or understand
  • …possibly completely platform specific
But My Code Is Really Quick On PC…?
• A common assumption:
  • It's quick on PC
  • 360 & PS3 have 3.2GHz clock speed
  • Should be good on console! Right?
• Alas, the 360 core and PS3 PPU have…
  • No instruction reordering hardware
  • No store forwarding hardware
  • Smaller caches and slower memory
  • No L3 cache
The 4 Horsemen of In-Order Apocalypse
• What goes wrong?
  • LHS
  • L2 miss
  • Expensive, non-pipelined instructions
  • Branch mispredict penalty
Load-Hit-Store (LHS)
• What is it?
  • Storing to a memory location…
  • …then loading from it very shortly after
• What causes LHS?
  • Casts, changing register set, aliasing
• Why is it a problem?
  • On PC, the bullet is usually dodged by…
    • Instruction re-ordering
    • Store forwarding hardware
  • In-order consoles have neither, so the load stalls until the store retires
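The cast-through-memory pattern can be sketched in portable C++ (a hypothetical illustration, not code from the deck): each memcpy below is a store immediately followed by a dependent load, which is exactly the LHS condition on an in-order PPU.

```cpp
#include <cmath>
#include <cstring>

// LHS-prone: move a float between register sets via memory to clear the
// sign bit. Each memcpy is a store immediately followed by a dependent load.
float AbsViaIntBits(float f)
{
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits); // float register -> memory -> int register
    bits &= 0x7FFFFFFFu;                 // clear sign bit in the integer domain
    std::memcpy(&f, &bits, sizeof f);    // int register -> memory -> float register
    return f;
}

// LHS-free: stay in the float register set the whole time.
float AbsInRegisters(float f)
{
    return std::fabs(f); // single floating-point instruction, no memory traffic
}
```

Both functions compute the same result; only the second keeps the value in one register set and avoids the round trip through memory.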
L2 Miss
• What is it?
  • Loading from a location not already in cache
• Why is it a problem?
  • Costs ~610 cycles to load a cache line
  • You can do a lot of work in 610 cycles
• What can we do about it?
  • Hot/cold split
  • Reduce in-memory data size
  • Use cache-coherent structures
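A hot/cold split can be sketched like this (struct names and fields are hypothetical, not from the deck): the fields the per-frame loop actually touches stay packed together, so each 128-byte cache line carries more useful particles.

```cpp
#include <cstddef>

// Before: hot per-frame fields share cache lines with cold tool-only data.
struct ParticleFat
{
    bool  active;
    float position[3];
    float previousPosition[3];
    float intensity;
    char  editorName[64];   // cold: only read by the editor/tools
};

// After: the update loop streams only hot data; cold data lives in a
// parallel array and is reached by index when (rarely) needed.
struct ParticleHot
{
    bool  active;
    float position[3];
    float previousPosition[3];
    float intensity;
};

struct ParticleCold
{
    char editorName[64];
};
```

With a 128-byte line, roughly four ParticleHot records fit per line versus one ParticleFat, so the update loop pulls about a quarter of the memory through L2.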
Expensive Instructions
• What is it?
  • Certain instructions are not pipelined
  • No other instructions issued 'til they complete
  • Stalls both hardware threads
  • High latency and low throughput
• What can we do about it?
  • Know when those instructions are generated
  • Avoid or code round those situations
  • But only in critical places
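One common way to code round such an instruction, sketched in plain C++: when a length is only compared against a limit, compare the squares instead and the non-pipelined square root disappears entirely (this is the same trick the original example is improved with later in the deck).

```cpp
// Branch decision without sqrtf: both sides are non-negative, so
// comparing squared values preserves the ordering.
bool ExceedsLimit(float x, float y, float z, float limit)
{
    const float lengthSq = x*x + y*y + z*z;
    return lengthSq > limit * limit;  // no fsqrt issued at all
}
```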
Branch Mispredicts
• What is it?
  • Mispredicting a branch causes…
    • …the CPU to discard instructions it predicted it needed
    • …a 23-24 cycle delay as correct instructions are fetched
• Why is this a problem?
  • Misprediction penalty can…
    • …dominate total time in tight loops
    • …waste time fetching unneeded instructions
Branch Mispredicts
• What can we do about it?
  • Know how the compiler implements branches
    • for, do, while, if
    • Function pointers, switches, virtual calls
  • Reduce total branch counts for the task
    • Use test-and-set style instructions
    • Refactor calculations to remove branches
    • Unroll
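A branch-removal refactor can be sketched like this (hypothetical example; on the 360 a float ternary of this shape is the pattern the compiler turns into fsel):

```cpp
// fsel semantics: pick b when a >= 0, else c. Written as a simple ternary
// on floats, PPC compilers emit fsel and no branch at all.
inline float SelectGE(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}

// Clamp an intensity to [0, maxV] with two selects instead of two ifs.
float ClampIntensity(float v, float maxV)
{
    const float high = SelectGE(v - maxV, maxV, v); // v > maxV ? maxV : v
    return SelectGE(high, high, 0.0f);              // v < 0 ? 0 : v
}
```

The data path is the same on every iteration, so there is nothing for the branch predictor to get wrong.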
Who Are Our Friends?
• Profiling, profiling, profiling
• 360 tools
  • PIX CPU instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture
• Other platforms
  • SN Tuner, vTune
• Thinking laterally
General Improvements
• inline
  • Make sure your function fits the profile
• Pass and return in register
  • __declspec(passinreg)
• __restrict
  • Compiler released from being ultra-careful
• const
  • Doesn't affect code gen
  • But does affect your brain
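The __restrict point can be sketched with a hypothetical function: the qualifier promises dst and src never alias, so the compiler can keep src values in registers instead of reloading them after every store through dst.

```cpp
// Without __restrict the compiler must assume a store through dst could
// change *src, and conservatively reloads src[i] on each iteration.
void Scale(float* __restrict dst, const float* __restrict src, float s, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}
```

The promise is the programmer's to keep: calling Scale with overlapping arrays is undefined behaviour.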
General Improvements
• Compiler options
  • Inline all possible
  • Prefer speed over size
• Platform specifics (360)
  • /Ou – removes integer div-by-zero trap
  • /Oc – runs a second code scheduling pass
  • Don't write inline asm
General Improvements
• Reduce parameter count
  • Reduces function epilogue and prologue
  • Reduces stack access
  • Reduces LHS
• Prefer 32, 64 and 128 bit variables
• Isolate constants – or constant sets
• Look to specialise, not generalise
• Avoid virtual if feasible
  • Unnecessary virtual means an indirected branch
Know Your Cache Architecture
• Cache size
  • 360: 1Mb L2, 32Kb L1
• Cache line size
  • 360: 128 bytes; x86: typically 64 bytes
• Pre-fetch mechanism
  • 360: dcbt, dcbz128
• Cross-core sharing policy
  • 360: L2 shared, L1 per core
Know Pipeline & LHS Conditions
• LHS caused by:
  • Pointer aliasing
  • Register set swap / casting
• Be aware of non-pipelined instructions
  • fsqrt, fdiv, int mul, int div, sraw
• Be aware of pipeline flush issues
  • Especially fcmp
Knowing Your Instruction Set
• 360 specifics:
  • VMX
  • Slow instructions
  • Regularly useful instructions
    • fsel, vsel, vcmp*, vrlimi
• PS3
  • Altivec & the world of SPE
• PC
  • SSE, SSE2, SSE3, SSE4, SSE4.1 and friends
What Went Wrong With The Example?
• Correctness
  • Always cross-compare during development
• Guessed at 1 performance issue
  • SIMD vs straight float
• Giving SIMD 'some road'
  • Branch behaviour exactly the same
  • Adding SIMD adds an LHS
  • Memory access and L2 usage unchanged
Image Analysis
Image Analysis Example
• Classification via Gaussian Mixture Model
• For each pixel in a 320x240 array…
  • Evaluate 'cost' via up to 20 Gaussian models
  • Return lowest cost found for pixel
  • Submit cost to graph structure for min-cut
• Profiling shows:
  • 86% of time in pixel cost function
  • No surprises there
  • 1,536,000 Gaussian model applies
Image Analysis Example
float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r,g,b);
    x -= component.Mean();
    FloatVector fx((float)x[0],(float)x[1],(float)x[2]);
    return component.EofLog() + 0.5f * fx.Dot( component.CovInv().Multiply(fx) );
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r,g,b,0);
    for(size_t k=1; k<nK; k++)
    {
        float cost = Cost(r,g,b,k);
        if( cost < bestCost )
            bestCost = cost;
    }
    return bestCost;
}
Image Analysis Example
• What things look suspect?
  • L2 miss on component load
  • Passing individual r,g,b elements
  • Building two separate vectors
  • Casting int to float
  • Vector maths
• Branching may be an issue in BestCost()
  • Loop
  • Conditional inside loop
• Confirm with PIX on 360
Image Analysis Example
• Pass 1
  • Don't even touch platform specifics
  • Pass a single int, not 3 unsigned chars
  • Mark up all consts
  • Build the sample value once in the caller
  • Add __forceinline
  • Check correctness
  • Doesn't help a lot – gives about 1.1x
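The "pass a single int" change might look like this (hypothetical helper names, not the deck's actual code): three byte parameters collapse into one register-sized word, trimming the call's prologue/epilogue and stack traffic.

```cpp
// Pack r,g,b into one 32-bit word so the cost function takes one register
// parameter instead of three.
inline unsigned int PackRGB(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned int)r | ((unsigned int)g << 8) | ((unsigned int)b << 16);
}

inline unsigned char UnpackR(unsigned int c) { return (unsigned char)(c & 0xFFu); }
inline unsigned char UnpackG(unsigned int c) { return (unsigned char)((c >> 8) & 0xFFu); }
inline unsigned char UnpackB(unsigned int c) { return (unsigned char)((c >> 16) & 0xFFu); }
```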
Image Analysis Example
• Pass 2
  • Turn Cost function innards to VMX
  • Return cost as __vector4 to avoid LHS
  • Remove the if from the loop in BestCost by…
    • Keeping bestCost as a __vector4
    • Using vcmpgefp to make a comparison mask
    • Using vsel to pick the lowest value
  • Speedup of 1.7x
  • Constructing the __vector4s on the fly is expensive
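The vcmpgefp/vsel idiom, sketched one lane at a time in scalar C++ (illustrative only): the compare produces an all-ones or all-zeros mask, and the select routes one of the two inputs through, so no data-dependent branch survives in the loop.

```cpp
// One lane of __vcmpgefp: all-ones when a >= b, else all-zeros.
inline unsigned int CompareGEMask(float a, float b)
{
    return (a >= b) ? 0xFFFFFFFFu : 0u;
}

// One lane of __vsel: take b where the mask bits are set, else a.
inline float Select(float a, float b, unsigned int mask)
{
    return mask ? b : a;
}

// Running minimum with no branch on the data, mirroring BestCost's loop body.
inline float RunningMin(float best, float cost)
{
    const unsigned int mask = CompareGEMask(best, cost); // best >= cost?
    return Select(best, cost, mask);                     // ...then keep cost
}
```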
Image Analysis Example
• Pass 3
  • Build the colour as a __vector4 in the calling function
  • Build a static __vector4 containing {0.5f,0.5f,0.5f,0.5f}
  • Load once in calling function
  • Mark all __vector4 as __declspec(passinreg)
  • Build __vector4 version of Component
  • All calculations done as __vector4
  • More like it – speedup of 5.2x
Image Analysis Example
• Pass 4
  • Go all the way out to the per-pixel calling code
  • Load a __vector4 at a time from the source array
  • Do 4 pixel costs at once
  • __vcmpgefp/__vsel works exactly the same
  • Return __vector4 with 4 costs
  • Write to results array as a single __vector4
  • Gives a speedup of 19.54x
Image Analysis Example
__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colours) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);
    __vector4 bestCost = Cost(colours, half, 0);
    for(size_t k=1; k<nK; k++)
    {
        const __vector4 cost = Cost(colours, half, k);
        const __vector4 mask = __vcmpgefp(bestCost,cost);
        bestCost = __vsel(bestCost,cost,mask);
    }
    return bestCost;
}
Image Analysis Example
const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0,comp.GetVMean());
…
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);
x = __vspltw(colour0,0);
y = __vspltw(colour0,1);
z = __vspltw(colour0,2);
mulResult = __vmulfp(row0,x);
mulResult = __vmaddfp(row1,y,mulResult);
mulResult = __vmaddfp(row2,z,mulResult);
vdp2 = __vmsum3fp(mulResult,colour0);
vdp2 = __vmaddfp(vdp2,half,vEofLog);
result = vdp2;
…
// half is __vector4 parameter
Image Analysis Example
• Hold on, this is image analysis.
• Shouldn't it be on the GPU?
• Maybe, maybe not:
  • Per pixel we manipulate a dynamic tree structure
  • Excluding the tree structure…
    • CPU can run close to GPU speed
  • But syncing and memory throughput overhead not worth it
Movie Compression
Movie Compression Optimization
• Timing results
  • Freeware movie compressor on 360
  • 76.3% of instructions spent in InterError()
• Calculating error between macroblocks
  • Majority of time in 8x8 macro block functions
    • Picking up source and target intensity macro block
    • For each pixel, calculating abs difference
    • Summing differences along rows
    • Returning sum of diffs
    • Or early out when sum exceeds a threshold
Movie Compression Optimization
int ThresholdSum(unsigned char *ptr1, unsigned char *ptr2, int stride2, int stride1, int thres)
{
    int sad = 0;
    for (int i=8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);
        if (sad > thres)
            return sad;
        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}
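DSP_OP_ABS_DIFF isn't shown in the deck; a typical definition (an assumption here) is the absolute difference of the two bytes, which makes the function above a standard 8x8 sum-of-absolute-differences with a per-row early-out.

```cpp
#include <cstdlib>

// Assumed definition of the macro used by ThresholdSum: widen the bytes
// to int, subtract, and take the absolute value.
#define DSP_OP_ABS_DIFF(a, b) (std::abs((int)(a) - (int)(b)))
```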
Movie Compression Optimization
• Look at our worst enemies
• L2
  • 8x8 byte blocks, seems tight
• LHS
  • It's all integer, so we should be LHS-free
• Expensive instructions?
  • No, just byte maths
• Branching
  • Should get prediction right 7 out of 8 times
Movie Compression Optimization
• Maths
  • Element-by-element abs and average ops on bytes
  • Done row by row, exit when the sum goes over threshold
  • Perfect for VMX!
  • Awesome speedup of… 0%
• Huh? Why?
  • Summing a row doesn't suit VMX
  • Branch penalty still there
  • We have to do unaligned loads to VMX registers
Movie Compression Optimization
• Let's think again
• Look at the higher-level picture
  • Error calculated for 4 blocks at a time by caller
  • Rows in blocks (0,1) and (2,3) are contiguous
  • Pick up two blocks at a time in VMX registers
• Thresholding is by row
  • But there is no reason not to do it by column
  • Means we can sum columns in 7 instructions
• Use __restrict on block pointers
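The "7 instructions" claim is a pairwise reduction tree over the 8 row vectors; one lane of it in scalar C++ (illustrative only):

```cpp
// Adding 8 per-row values with 7 adds arranged as a tree. In the VMX
// version each add is a __vaddshs on a whole vector of column sums, and
// the tree shape keeps the adds independent so the pipeline stays busy.
int SumEightRows(const int rows[8])
{
    const int a = rows[0] + rows[1];
    const int b = rows[2] + rows[3];
    const int c = rows[4] + rows[5];
    const int d = rows[6] + rows[7];
    return (a + b) + (c + d);   // 7 adds total, dependency depth 3 instead of 7
}
```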
Movie Compression Optimization
[Diagram: the four 8x8 blocks laid out as a 2x2 grid (0 and 1 on top, 2 and 3 below); each 16-byte row spanning a pair of blocks is loaded into one of VMX registers 0–7]
Movie Compression Optimization
• Data layout & alignment
  • Rows in 2 blocks are contiguous in memory
  • Source block always 16-byte aligned
  • Dest block only guaranteed to be byte aligned
• Unrolling
  • We can unroll the 8-iteration loop
  • We have plenty of VMX registers available
• Return value
  • Return a __vector4 to avoid the LHS of writing to an int
Movie Compression Optimization
• Miscellaneous
  • Prebuild threshold word once
  • Remove stride word parameters
    • Constant values in this application only
    • Proved with empirical research (and assert)
  • Vector parameters and return in registers
  • Pushed vector error results out to caller
    • All the caller's calculations in VMX – drops the LHS
Movie Compression Optimization
__vector4 __declspec(passinreg) twoblock_sad8x8__xbox(const unsigned char* __restrict ptr1, const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();
    // Load 8 rows of the source block pair (16-byte aligned)
    __vector4 row1_0 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4 *)ptr1; ptr1 += cStride1;
    // Load 8 rows of the destination block pair
    __vector4 row2_0 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4 *)ptr2; ptr2 += cStride2;
    // Per-byte absolute difference: max - min
    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0),__vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1),__vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2),__vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3),__vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4),__vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5),__vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6),__vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7),__vminub(row1_7,row2_7));
    // Widen the bytes to shorts by merging with zero
    row2_0 = __vmrglb(zero,row1_0);
    row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1);
    row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2);
    row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3);
    row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4);
    row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5);
    row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6);
    row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7);
    row1_7 = __vmrghb(zero,row1_7);
    // Sum the 8 'high' rows with 7 saturating adds
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    // Sum the 8 'low' rows the same way
    row2_0 = __vaddshs(row2_0,row2_1);
    row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5);
    row2_6 = __vaddshs(row2_6,row2_7);
    row2_0 = __vaddshs(row2_0,row2_2);
    row2_4 = __vaddshs(row2_4,row2_6);
    row2_0 = __vaddshs(row2_0,row2_4);
    // Build shifted copies so the 16 column sums can be added horizontally
    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);
    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    // add 4 rows to the next row
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    // Extract the two per-block sums into the final result
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));
    return row1_0;
}
Movie Compression Optimization
• Results
  • Un-thresholded macro block compare
    • 2.86 times quicker than existing C
    • Not bad, but our code is doing 2 blocks at once, too
    • So actually, 5.72 times quicker
  • Thresholded macro block compare
    • 4.12 times quicker
• Optimizations to just the block compares…
  • …reduced movie compression time by 22%
  • …in the worst case, saved 40 seconds from compress time
Do We Get Improvements In Reverse?
• Do we see improvements on PC?
  • Image analysis
  • Movie compression
Summary Interlude
• Profiling, profiling, profiling
• Know your enemy
• Explore data alignment and layout
• Give SIMD plenty of room to work
• Don't ignore simple code structure changes
• Specialise, not generalise
Original Example
Improving Original Example
PIX Summary
• 704k instructions executed
• 40% L2 usage
• Top penalties
  • L2 cache miss @ 3m cycles
  • bctr mispredicts @ 1.14m cycles
  • __fsqrt @ 696k cycles
  • 2x fcmp @ 490k cycles
• Some 20.9m cycles of penalty overall
• Takes 7.528ms
Improving Original Example
1) Avoid branch mispredict #1
  • Ditch the zealous use of virtual
  • Call functions just once
  • Gives 1.13x speedup
2) Improve L2 use #1
  • Refactor list to contiguous array
  • Hot/cold split
  • Use bitfield for active flag
  • Gives 3.59x speedup
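The "bitfield for active flag" idea can be sketched like this (hypothetical layout): 32 flags share one word, so scanning for active particles reads a fraction of the memory a bool-per-particle array would.

```cpp
// Pack one 'active' bit per particle: 32 particles per 32-bit word.
struct ActiveBits
{
    unsigned int words[(1024 + 31) / 32]; // capacity: 1024 particles

    void Set(int i, bool on)
    {
        if (on) words[i >> 5] |=  (1u << (i & 31));
        else    words[i >> 5] &= ~(1u << (i & 31));
    }

    bool Get(int i) const
    {
        return ((words[i >> 5] >> (i & 31)) & 1u) != 0;
    }
};
```

A whole 128-byte cache line now holds the flags for 4096 particles.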
Improving Original Example
4) Remove expensive instructions
  • Ditch __fsqrts and compare with squares
  • Gives 4.05x speedup
5) Avoid branch mispredict #1
  • Insert __fsel() to select tail length
  • Gives 4.44x speedup
  • Insert a 2nd __fsel
  • Now only the loop and active-flag branches remain
  • Gives 5.0x speedup
Improving Original Example
7) Use VMX
  • Use __vsubfp and __vmsum3fp for vector maths
  • Gives 5.28x speedup
8) Avoid branch mispredict #2
  • Unroll the loop 4x
  • Sticks at 5.28x speedup
Improving Original Example
9) Avoid branch mispredict #3
  • Build a __vector4 mask from active flags
  • __vsel tail lengths from existing and new
  • Write a single __vector4 result
  • Now only the loop branch remaining
  • Gives 6.01x speedup
10) Improve L2 access #2
  • Add __dcbt on position array
  • Gives 16.01x speedup
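Step 10's prefetch can be sketched portably (illustrative only; the 360 intrinsic is __dcbt, while GCC/Clang expose __builtin_prefetch): issue the touch far enough ahead that the line arrives in cache before the loop needs it.

```cpp
// Walk an array while prefetching a fixed distance ahead. kAhead is a
// tuning assumption; the deck tweaks the equivalent __dcbt offsets later.
float SumWithPrefetch(const float* data, int n)
{
    const int kAhead = 32; // elements ahead (128 bytes = one 360 cache line)
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        if (i + kAhead < n)
            __builtin_prefetch(&data[i + kAhead]); // hint only; no side effects
        sum += data[i];
    }
    return sum;
}
```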
Improving Original Example
11) Improve L2 use #3
  • Move to short coordinates
  • Now loading ¼ of the data for positions
  • Gives 21.23x speedup
12) Avoid branch mispredict #4
  • We are now writing tail lengths for every particle
  • Wait, we don't care about inactive particles
  • Epiphany – don't check the active flag at all
  • Gives 23.21x speedup
Improving Original Example
13) Improve L2 use #4
  • Remaining L2 misses are on the output array
  • __dcbt that too
  • Tweak __dcbt offsets and pre-load
  • 39.01x speedup
Improving Original Example
PIX Summary
• 259k instructions executed
• 99.4% L2 usage
• Top penalties
  • ERAT data miss @ 14k cycles
  • 1 LHS via 4kb aliasing
  • No mispredict penalties
• 71k cycles of penalty overall
• Takes 0.193ms
Improving Original Example
• Caveat
  • Slightly trivial code example
  • Not all techniques possible in 'real life'
  • But the principles always apply
• dcbz128 mystery?
  • We write the entire array
  • Should be able to save L2 loads by pre-zeroing
  • But results showed a slowdown
Thanks For Listening
• Any questions?