http://www.xna.com
© 2009 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Out of Order
Making In-order Processors Play Nicely
Allan Murphy
XNA Developer Connection, Microsoft
Optimization Example
class BaseParticle
{
public:
…
virtual Vector& Position() { return mPosition; }
virtual Vector& PreviousPosition() { return mPreviousPosition; }
float& Intensity() { return mIntensity; }
bool& Active() { return mActive; }
float& Lifetime() { return mLifetime; }
…
private:
…
float mIntensity;
float mLifetime;
bool mActive;
Vector mPosition;
Vector mPreviousPosition;
…
};
Optimization Example
// Boring old vector class
class Vector
{
…
public:
float x,y,z,w;
};
// Boring old generic linked list class
template <class T> class ListNode
{
public:
ListNode(T* contents) : mNext(NULL), mContents(contents){}
void SetNext(ListNode* node)
{ mNext = node; }
ListNode* NextNode()
{ return mNext; }
T* Contents()
{ return mContents; }
private:
ListNode<T>* mNext;
T* mContents;
};
Optimization Example
// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
if (node->Contents()->Active())
{
Vector vel;
vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
if (length > cLimitLength)
{
float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
if (newIntensity < 0.0f)
newIntensity = 0.0f;
node->Contents()->Intensity() = newIntensity;
}
else
node->Contents()->Intensity() = 0.0f;
}
Optimization Example
// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position,prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity,velocity);
// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
• Job done, right?
Thank you for listening
Optimization Example
• Hold on.
• If we time it…
  • It's actually slower than the straight C version
• And if we check the results…
  • It's also wrong!
  • Incorrect is a special case of optimization
  • Unfortunately, this does happen in practice
Important Caveat
• Today we're talking about optimization
• But the techniques discussed are orthogonal to…
  • …good algorithm choice
  • …good multithreading system implementation
• It's like Mr Knuth said.
• They typically build code which is…
  • …very non-general
  • …very difficult to maintain or understand
  • …possibly completely platform specific
But My Code Is Really Quick On PC…?
• A common assumption:
  • It's quick on PC
  • 360 & PS3 have 3.2GHz clock speed
  • Should be good on console! Right?
• Alas, the 360 core and PS3 PPU have…
  • No instruction reordering hardware
  • No store forwarding hardware
  • Smaller caches and slower memory
  • No L3 cache
The 4 Horsemen of In-Order Apocalypse
• What goes wrong?
  • LHS
  • L2 miss
  • Expensive, non-pipelined instructions
  • Branch mispredict penalty
Load-Hit-Store (LHS)
• What is it?
  • Storing to a memory location…
  • …then loading from it very shortly after
• What causes LHS?
  • Casts, changing register set, aliasing
• Why is it a problem?
  • On PC, the bullet is usually dodged by…
    • Instruction re-ordering
    • Store forwarding hardware
  • In-order consoles have neither, so the load stalls until the store retires
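The cast-through-memory pattern can be sketched in portable C++ (a hypothetical illustration, not code from the deck): each memcpy below is a store immediately followed by a dependent load, which is exactly the LHS condition on an in-order PPU.

```cpp
#include <cmath>
#include <cstring>

// LHS-prone: move a float between register sets via memory to clear the
// sign bit. Each memcpy is a store immediately followed by a dependent load.
float AbsViaIntBits(float f)
{
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits); // float register -> memory -> int register
    bits &= 0x7FFFFFFFu;                 // clear sign bit in the integer domain
    std::memcpy(&f, &bits, sizeof f);    // int register -> memory -> float register
    return f;
}

// LHS-free: stay in the float register set the whole time.
float AbsInRegisters(float f)
{
    return std::fabs(f); // single floating-point instruction, no memory traffic
}
```

Both functions compute the same result; only the second keeps the value in one register set and avoids the round trip through memory.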
L2 Miss
• What is it?
  • Loading from a location not already in cache
• Why is it a problem?
  • Costs ~610 cycles to load a cache line
  • You can do a lot of work in 610 cycles
• What can we do about it?
  • Hot/cold split
  • Reduce in-memory data size
  • Use cache-coherent structures
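A hot/cold split can be sketched like this (struct names and fields are hypothetical, not from the deck): the fields the per-frame loop actually touches stay packed together, so each 128-byte cache line carries more useful particles.

```cpp
#include <cstddef>

// Before: hot per-frame fields share cache lines with cold tool-only data.
struct ParticleFat
{
    bool  active;
    float position[3];
    float previousPosition[3];
    float intensity;
    char  editorName[64];   // cold: only read by the editor/tools
};

// After: the update loop streams only hot data; cold data lives in a
// parallel array and is reached by index when (rarely) needed.
struct ParticleHot
{
    bool  active;
    float position[3];
    float previousPosition[3];
    float intensity;
};

struct ParticleCold
{
    char editorName[64];
};
```

With a 128-byte line, roughly four ParticleHot records fit per line versus one ParticleFat, so the update loop pulls about a quarter of the memory through L2.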
Expensive Instructions
• What is it?
  • Certain instructions are not pipelined
  • No other instructions issued 'til they complete
  • Stalls both hardware threads
  • High latency and low throughput
• What can we do about it?
  • Know when those instructions are generated
  • Avoid or code round those situations
  • But only in critical places
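One common way to code round such an instruction, sketched in plain C++: when a length is only compared against a limit, compare the squares instead and the non-pipelined square root disappears entirely (this is the same trick the original example is improved with later in the deck).

```cpp
// Branch decision without sqrtf: both sides are non-negative, so
// comparing squared values preserves the ordering.
bool ExceedsLimit(float x, float y, float z, float limit)
{
    const float lengthSq = x*x + y*y + z*z;
    return lengthSq > limit * limit;  // no fsqrt issued at all
}
```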
Branch Mispredicts
• What is it?
  • Mispredicting a branch causes…
    • …the CPU to discard instructions it predicted it needed
    • …a 23-24 cycle delay as correct instructions are fetched
• Why is this a problem?
  • Misprediction penalty can…
    • …dominate total time in tight loops
    • …waste time fetching unneeded instructions
Branch Mispredicts
• What can we do about it?
  • Know how the compiler implements branches
    • for, do, while, if
    • Function pointers, switches, virtual calls
  • Reduce total branch counts for the task
    • Use test-and-set style instructions
    • Refactor calculations to remove branches
    • Unroll
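A branch-removal refactor can be sketched like this (hypothetical example; on the 360 a float ternary of this shape is the pattern the compiler turns into fsel):

```cpp
// fsel semantics: pick b when a >= 0, else c. Written as a simple ternary
// on floats, PPC compilers emit fsel and no branch at all.
inline float SelectGE(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}

// Clamp an intensity to [0, maxV] with two selects instead of two ifs.
float ClampIntensity(float v, float maxV)
{
    const float high = SelectGE(v - maxV, maxV, v); // v > maxV ? maxV : v
    return SelectGE(high, high, 0.0f);              // v < 0 ? 0 : v
}
```

The data path is the same on every iteration, so there is nothing for the branch predictor to get wrong.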
Who Are Our Friends?
• Profiling, profiling, profiling
• 360 tools
  • PIX CPU instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture
• Other platforms
  • SN Tuner, vTune
• Thinking laterally
General Improvements
• inline
  • Make sure your function fits the profile
• Pass and return in register
  • __declspec(passinreg)
• __restrict
  • Compiler released from being ultra-careful
• const
  • Doesn't affect code gen
  • But does affect your brain
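The __restrict point can be sketched with a hypothetical function: the qualifier promises dst and src never alias, so the compiler can keep src values in registers instead of reloading them after every store through dst.

```cpp
// Without __restrict the compiler must assume a store through dst could
// change *src, and conservatively reloads src[i] on each iteration.
void Scale(float* __restrict dst, const float* __restrict src, float s, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}
```

The promise is the programmer's to keep: calling Scale with overlapping arrays is undefined behaviour.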
General Improvements
• Compiler options
  • Inline all possible
  • Prefer speed over size
• Platform specifics (360)
  • /Ou – removes integer div-by-zero trap
  • /Oc – runs a second code scheduling pass
  • Don't write inline asm
General Improvements
• Reduce parameter count
  • Reduces function epilogue and prologue
  • Reduces stack access
  • Reduces LHS
• Prefer 32, 64 and 128 bit variables
• Isolate constants – or constant sets
• Look to specialise, not generalise
• Avoid virtual if feasible
  • Unnecessary virtual means an indirected branch
Know Your Cache Architecture
• Cache size
  • 360: 1Mb L2, 32Kb L1
• Cache line size
  • 360: 128 bytes; x86: typically 64 bytes
• Pre-fetch mechanism
  • 360: dcbt, dcbz128
• Cross-core sharing policy
  • 360: L2 shared, L1 per core
Know Pipeline & LHS Conditions
• LHS caused by:
  • Pointer aliasing
  • Register set swap / casting
• Be aware of non-pipelined instructions
  • fsqrt, fdiv, int mul, int div, sraw
• Be aware of pipeline flush issues
  • Especially fcmp
Knowing Your Instruction Set
• 360 specifics:
  • VMX
  • Slow instructions
  • Regularly useful instructions
    • fsel, vsel, vcmp*, vrlimi
• PS3
  • Altivec & the world of SPE
• PC
  • SSE, SSE2, SSE3, SSE4, SSE4.1 and friends
What Went Wrong With The Example?
• Correctness
  • Always cross-compare during development
• Guessed at 1 performance issue
  • SIMD vs straight float
• Giving SIMD 'some road'
  • Branch behaviour exactly the same
  • Adding SIMD adds an LHS
  • Memory access and L2 usage unchanged
Image Analysis
Image Analysis Example
• Classification via Gaussian Mixture Model
• For each pixel in a 320x240 array…
  • Evaluate 'cost' via up to 20 Gaussian models
  • Return lowest cost found for pixel
  • Submit cost to graph structure for min-cut
• Profiling shows:
  • 86% of time in pixel cost function
  • No surprises there
  • 1,536,000 Gaussian model applies
Image Analysis Example
float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r,g,b);
    x -= component.Mean();
    FloatVector fx((float)x[0],(float)x[1],(float)x[2]);
    return component.EofLog() + 0.5f * fx.Dot( component.CovInv().Multiply(fx) );
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r,g,b,0);
    for(size_t k=1; k<nK; k++)
    {
        float cost = Cost(r,g,b,k);
        if( cost < bestCost )
            bestCost = cost;
    }
    return bestCost;
}
Image Analysis Example
• What things look suspect?
  • L2 miss on component load
  • Passing individual r,g,b elements
  • Building two separate vectors
  • Casting int to float
  • Vector maths
• Branching may be an issue in BestCost()
  • Loop
  • Conditional inside loop
• Confirm with PIX on 360
Image Analysis Example
• Pass 1
  • Don't even touch platform specifics
  • Pass a single int, not 3 unsigned chars
  • Mark up all consts
  • Build the sample value once in the caller
  • Add __forceinline
  • Check correctness
  • Doesn't help a lot – gives about 1.1x
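The "pass a single int" change might look like this (hypothetical helper names, not the deck's actual code): three byte parameters collapse into one register-sized word, trimming the call's prologue/epilogue and stack traffic.

```cpp
// Pack r,g,b into one 32-bit word so the cost function takes one register
// parameter instead of three.
inline unsigned int PackRGB(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned int)r | ((unsigned int)g << 8) | ((unsigned int)b << 16);
}

inline unsigned char UnpackR(unsigned int c) { return (unsigned char)(c & 0xFFu); }
inline unsigned char UnpackG(unsigned int c) { return (unsigned char)((c >> 8) & 0xFFu); }
inline unsigned char UnpackB(unsigned int c) { return (unsigned char)((c >> 16) & 0xFFu); }
```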
Image Analysis Example
• Pass 2
  • Turn Cost function innards to VMX
  • Return cost as __vector4 to avoid LHS
  • Remove the if from the loop in BestCost by…
    • Keeping bestCost as a __vector4
    • Using vcmpgefp to make a comparison mask
    • Using vsel to pick the lowest value
  • Speedup of 1.7x
  • Constructing the __vector4s on the fly is expensive
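The vcmpgefp/vsel idiom, sketched one lane at a time in scalar C++ (illustrative only): the compare produces an all-ones or all-zeros mask, and the select routes one of the two inputs through, so no data-dependent branch survives in the loop.

```cpp
// One lane of __vcmpgefp: all-ones when a >= b, else all-zeros.
inline unsigned int CompareGEMask(float a, float b)
{
    return (a >= b) ? 0xFFFFFFFFu : 0u;
}

// One lane of __vsel: take b where the mask bits are set, else a.
inline float Select(float a, float b, unsigned int mask)
{
    return mask ? b : a;
}

// Running minimum with no branch on the data, mirroring BestCost's loop body.
inline float RunningMin(float best, float cost)
{
    const unsigned int mask = CompareGEMask(best, cost); // best >= cost?
    return Select(best, cost, mask);                     // ...then keep cost
}
```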
Image Analysis Example
• Pass 3
  • Build the colour as a __vector4 in the calling function
  • Build a static __vector4 containing {0.5f,0.5f,0.5f,0.5f}
  • Load once in calling function
  • Mark all __vector4 as __declspec(passinreg)
  • Build __vector4 version of Component
  • All calculations done as __vector4
  • More like it – speedup of 5.2x
Image Analysis Example
• Pass 4
  • Go all the way out to the per-pixel calling code
  • Load a __vector4 at a time from the source array
  • Do 4 pixel costs at once
  • __vcmpgefp/__vsel works exactly the same
  • Return __vector4 with 4 costs
  • Write to results array as a single __vector4
  • Gives a speedup of 19.54x
Image Analysis Example
__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colours) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);
    __vector4 bestCost = Cost(colours, half, 0);
    for(size_t k=1; k<nK; k++)
    {
        const __vector4 cost = Cost(colours, half, k);
        const __vector4 mask = __vcmpgefp(bestCost,cost);
        bestCost = __vsel(bestCost,cost,mask);
    }
    return bestCost;
}
Image Analysis Example
const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0,comp.GetVMean());
…
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);
x = __vspltw(colour0,0);
y = __vspltw(colour0,1);
z = __vspltw(colour0,2);
mulResult = __vmulfp(row0,x);
mulResult = __vmaddfp(row1,y,mulResult);
mulResult = __vmaddfp(row2,z,mulResult);
vdp2 = __vmsum3fp(mulResult,colour0);
vdp2 = __vmaddfp(vdp2,half,vEofLog);
result = vdp2;
…
// half is __vector4 parameter
Image Analysis Example
• Hold on, this is image analysis.
• Shouldn't it be on the GPU?
• Maybe, maybe not:
  • Per pixel we manipulate a dynamic tree structure
  • Excluding the tree structure…
    • CPU can run close to GPU speed
  • But syncing and memory throughput overhead not worth it
Movie Compression
Movie Compression Optimization
• Timing results
  • Freeware movie compressor on 360
  • 76.3% of instructions spent in InterError()
• Calculating error between macroblocks
  • Majority of time in 8x8 macro block functions
    • Picking up source and target intensity macro block
    • For each pixel, calculating abs difference
    • Summing differences along rows
    • Returning sum of diffs
    • Or early out when sum exceeds a threshold
Movie Compression Optimization
int ThresholdSum(unsigned char *ptr1, unsigned char *ptr2, int stride2, int stride1, int thres)
{
    int sad = 0;
    for (int i=8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);
        if (sad > thres)
            return sad;
        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}
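DSP_OP_ABS_DIFF isn't shown in the deck; a typical definition (an assumption here) is the absolute difference of the two bytes, which makes the function above a standard 8x8 sum-of-absolute-differences with a per-row early-out.

```cpp
#include <cstdlib>

// Assumed definition of the macro used by ThresholdSum: widen the bytes
// to int, subtract, and take the absolute value.
#define DSP_OP_ABS_DIFF(a, b) (std::abs((int)(a) - (int)(b)))
```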
Movie Compression Optimization
• Look at our worst enemies
• L2
  • 8x8 byte blocks, seems tight
• LHS
  • It's all integer, so we should be LHS-free
• Expensive instructions?
  • No, just byte maths
• Branching
  • Should get prediction right 7 out of 8 times
Movie Compression Optimization
• Maths
  • Element-by-element abs and average ops on bytes
  • Done row by row, exit when the sum goes over threshold
  • Perfect for VMX!
  • Awesome speedup of… 0%
• Huh? Why?
  • Summing a row doesn't suit VMX
  • Branch penalty still there
  • We have to do unaligned loads to VMX registers
Movie Compression Optimization
• Let's think again
• Look at the higher-level picture
  • Error calculated for 4 blocks at a time by caller
  • Rows in blocks (0,1) and (2,3) are contiguous
  • Pick up two blocks at a time in VMX registers
• Thresholding is by row
  • But there is no reason not to do it by column
  • Means we can sum columns in 7 instructions
• Use __restrict on block pointers
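The "7 instructions" claim is a pairwise reduction tree over the 8 row vectors; one lane of it in scalar C++ (illustrative only):

```cpp
// Adding 8 per-row values with 7 adds arranged as a tree. In the VMX
// version each add is a __vaddshs on a whole vector of column sums, and
// the tree shape keeps the adds independent so the pipeline stays busy.
int SumEightRows(const int rows[8])
{
    const int a = rows[0] + rows[1];
    const int b = rows[2] + rows[3];
    const int c = rows[4] + rows[5];
    const int d = rows[6] + rows[7];
    return (a + b) + (c + d);   // 7 adds total, dependency depth 3 instead of 7
}
```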
Movie Compression Optimization
[Diagram: the four 8x8 blocks laid out as a 2x2 grid (0 and 1 on top, 2 and 3 below); each 16-byte row spanning a pair of blocks is loaded into one of VMX registers 0–7]
Movie Compression Optimization
• Data layout & alignment
  • Rows in 2 blocks are contiguous in memory
  • Source block always 16-byte aligned
  • Dest block only guaranteed to be byte aligned
• Unrolling
  • We can unroll the 8-iteration loop
  • We have plenty of VMX registers available
• Return value
  • Return a __vector4 to avoid the LHS of writing to an int
Movie Compression Optimization
• Miscellaneous
  • Prebuild threshold word once
  • Remove stride word parameters
    • Constant values in this application only
    • Proved with empirical research (and assert)
  • Vector parameters and return in registers
  • Pushed vector error results out to caller
    • All the caller's calculations in VMX – drops the LHS
Movie Compression Optimization
__vector4 __declspec(passinreg) twoblock_sad8x8__xbox(const unsigned char* __restrict ptr1, const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();
    // Load 8 rows of the source block pair (16-byte aligned)
    __vector4 row1_0 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4 *)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4 *)ptr1; ptr1 += cStride1;
    // Load 8 rows of the destination block pair
    __vector4 row2_0 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4 *)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4 *)ptr2; ptr2 += cStride2;
    // Per-byte absolute difference: max - min
    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0),__vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1),__vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2),__vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3),__vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4),__vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5),__vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6),__vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7),__vminub(row1_7,row2_7));
    // Widen the bytes to shorts by merging with zero
    row2_0 = __vmrglb(zero,row1_0);
    row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1);
    row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2);
    row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3);
    row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4);
    row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5);
    row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6);
    row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7);
    row1_7 = __vmrghb(zero,row1_7);
    // Sum the 8 'high' rows with 7 saturating adds
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    // Sum the 8 'low' rows the same way
    row2_0 = __vaddshs(row2_0,row2_1);
    row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5);
    row2_6 = __vaddshs(row2_6,row2_7);
    row2_0 = __vaddshs(row2_0,row2_2);
    row2_4 = __vaddshs(row2_4,row2_6);
    row2_0 = __vaddshs(row2_0,row2_4);
    // Build shifted copies so the 16 column sums can be added horizontally
    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);
    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    // add 4 rows to the next row
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    // Extract the two per-block sums into the final result
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));
    return row1_0;
}
Movie Compression Optimization
• Results
  • Un-thresholded macro block compare
    • 2.86 times quicker than existing C
    • Not bad, but our code is doing 2 blocks at once, too
    • So actually, 5.72 times quicker
  • Thresholded macro block compare
    • 4.12 times quicker
• Optimizations to just the block compares…
  • …reduced movie compression time by 22%
  • …in the worst case, saved 40 seconds from compress time
Do We Get Improvements In Reverse?
• Do we see improvements on PC?
  • Image analysis
  • Movie compression
Summary Interlude
• Profiling, profiling, profiling
• Know your enemy
• Explore data alignment and layout
• Give SIMD plenty of room to work
• Don't ignore simple code structure changes
• Specialise, not generalise
Original Example
Improving Original Example
PIX Summary
• 704k instructions executed
• 40% L2 usage
• Top penalties
  • L2 cache miss @ 3m cycles
  • bctr mispredicts @ 1.14m cycles
  • __fsqrt @ 696k cycles
  • 2x fcmp @ 490k cycles
• Some 20.9m cycles of penalty overall
• Takes 7.528ms
Improving Original Example
1) Avoid branch mispredict #1
  • Ditch the zealous use of virtual
  • Call functions just once
  • Gives 1.13x speedup
2) Improve L2 use #1
  • Refactor list to contiguous array
  • Hot/cold split
  • Use bitfield for active flag
  • Gives 3.59x speedup
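The "bitfield for active flag" idea can be sketched like this (hypothetical layout): 32 flags share one word, so scanning for active particles reads a fraction of the memory a bool-per-particle array would.

```cpp
// Pack one 'active' bit per particle: 32 particles per 32-bit word.
struct ActiveBits
{
    unsigned int words[(1024 + 31) / 32]; // capacity: 1024 particles

    void Set(int i, bool on)
    {
        if (on) words[i >> 5] |=  (1u << (i & 31));
        else    words[i >> 5] &= ~(1u << (i & 31));
    }

    bool Get(int i) const
    {
        return ((words[i >> 5] >> (i & 31)) & 1u) != 0;
    }
};
```

A whole 128-byte cache line now holds the flags for 4096 particles.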
Improving Original Example
4) Remove expensive instructions
  • Ditch __fsqrts and compare with squares
  • Gives 4.05x speedup
5) Avoid branch mispredict #1
  • Insert __fsel() to select tail length
  • Gives 4.44x speedup
  • Insert a 2nd __fsel
  • Now only the loop and active-flag branches remain
  • Gives 5.0x speedup
Improving Original Example
7) Use VMX
  • Use __vsubfp and __vmsum3fp for vector maths
  • Gives 5.28x speedup
8) Avoid branch mispredict #2
  • Unroll the loop 4x
  • Sticks at 5.28x speedup
Improving Original Example
9) Avoid branch mispredict #3
  • Build a __vector4 mask from active flags
  • __vsel tail lengths from existing and new
  • Write a single __vector4 result
  • Now only the loop branch remaining
  • Gives 6.01x speedup
10) Improve L2 access #2
  • Add __dcbt on position array
  • Gives 16.01x speedup
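Step 10's prefetch can be sketched portably (illustrative only; the 360 intrinsic is __dcbt, while GCC/Clang expose __builtin_prefetch): issue the touch far enough ahead that the line arrives in cache before the loop needs it.

```cpp
// Walk an array while prefetching a fixed distance ahead. kAhead is a
// tuning assumption; the deck tweaks the equivalent __dcbt offsets later.
float SumWithPrefetch(const float* data, int n)
{
    const int kAhead = 32; // elements ahead (128 bytes = one 360 cache line)
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
    {
        if (i + kAhead < n)
            __builtin_prefetch(&data[i + kAhead]); // hint only; no side effects
        sum += data[i];
    }
    return sum;
}
```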
Improving Original Example
11) Improve L2 use #3
  • Move to short coordinates
  • Now loading ¼ of the data for positions
  • Gives 21.23x speedup
12) Avoid branch mispredict #4
  • We are now writing tail lengths for every particle
  • Wait, we don't care about inactive particles
  • Epiphany – don't check the active flag at all
  • Gives 23.21x speedup
Improving Original Example
13) Improve L2 use #4
  • Remaining L2 misses are on the output array
  • __dcbt that too
  • Tweak __dcbt offsets and pre-load
  • 39.01x speedup
Improving Original Example
PIX Summary
• 259k instructions executed
• 99.4% L2 usage
• Top penalties
  • ERAT data miss @ 14k cycles
  • 1 LHS via 4kb aliasing
  • No mispredict penalties
• 71k cycles of penalty overall
• Takes 0.193ms
Improving Original Example
• Caveat
  • Slightly trivial code example
  • Not all techniques possible in 'real life'
  • But the principles always apply
• dcbz128 mystery?
  • We write the entire array
  • Should be able to save L2 loads by pre-zeroing
  • But results showed a slowdown
Thanks For Listening
• Any questions?