Code Optimization II: Machine-Dependent Optimizations
Oct. 1, 2002
15-213, Fall 2002 (class11.ppt)  "The course that gives CMU its Zip!"

Topics
- Machine-Dependent Optimizations
  - Pointer code
  - Unrolling
  - Enabling instruction-level parallelism
- Understanding Processor Operation
  - Translation of instructions into operations
  - Out-of-order execution of operations
- Branches and Branch Prediction
- Advice

Previous Best Combining Code

    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = 0;
        for (i = 0; i < length; i++)
            sum += data[i];
        *dest = sum;
    }

Task
- Compute sum of all elements in vector
- Vector represented by C-style abstract data type
- Achieved CPE (cycles per element) of 2.00

General Forms of Combining

    void abstract_combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t t = IDENT;
        for (i = 0; i < length; i++)
            t = t OP data[i];
        *dest = t;
    }

Data Types
- Use different declarations for data_t: int, float, double

Operations
- Use different definitions of OP and IDENT: + / 0 for sum, * / 1 for product
  (a concrete instantiation is sketched below, after the Pentium III details)

Machine-Independent Optimization Results

    Method            Integer +   Integer *    FP +      FP *
    Abstract -g          42.06       41.86     41.44    160.00
    Abstract -O2         31.25       33.25     31.25    143.00
    Move vec_length      20.66       21.25     21.15    135.00
    data access           6.00        9.00      8.00    117.00
    Accum. in temp        2.00        4.00      3.00      5.00

Optimizations
- Reduce function calls and memory references within the loop

Performance Anomaly
- Computing the FP product of all elements is exceptionally slow
- Very large speedup when accumulating in a temporary
- Caused by a quirk of IA32 floating point
  - Memory uses a 64-bit format, registers use 80 bits
  - Benchmark data caused overflow of 64 bits, but not 80

Pointer Code

    void combine4p(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int *dend = data + length;
        int sum = 0;
        while (data < dend) {
            sum += *data;
            data++;
        }
        *dest = sum;
    }

Optimization
- Use pointers rather than array references
- CPE: 3.00 (compiled -O2)
  - Oops! We're not making progress here!
  - Warning: some compilers do a better job optimizing array code

Pointer vs. Array Code Inner Loops

Array code:

    .L24:                              # Loop:
        addl (%eax,%edx,4),%ecx        #   sum += data[i]
        incl %edx                      #   i++
        cmpl %esi,%edx                 #   i:length
        jl .L24                        #   if < goto Loop

Pointer code:

    .L30:                              # Loop:
        addl (%eax),%ecx               #   sum += *data
        addl $4,%eax                   #   data++
        cmpl %edx,%eax                 #   data:dend
        jb .L30                        #   if < goto Loop

Performance
- Array code: 4 instructions in 2 clock cycles
- Pointer code: almost the same 4 instructions, in 3 clock cycles

Modern CPU Design
[Block diagram: an Instruction Control unit (fetch control, instruction cache, instruction decoder, register file, retirement unit) issues operations to an Execution unit containing multiple functional units (integer/branch, general integer, FP add, FP mult/div, load, store) and a data cache; operation results, register updates, and "prediction OK?" signals flow back to instruction control.]

CPU Capabilities of Pentium III

Multiple instructions can execute in parallel:
- 1 load
- 1 store
- 2 integer (one may be a branch)
- 1 FP addition
- 1 FP multiplication or division

Some instructions take more than 1 cycle, but can be pipelined:

    Instruction                 Latency   Cycles/Issue
    Load / Store                   3           1
    Integer Multiply               4           1
    Integer Divide                36          36
    Double/Single FP Multiply      5           2
    Double/Single FP Add           3           1
    Double/Single FP Divide       38          38
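Here is the instantiation promised under General Forms of Combining. The typedef and macro names (data_t, OP, IDENT) come from the slides; treating them as a three-line header that selects the benchmark variant is my reading of the setup, so consider this a sketch rather than the course's exact file.

    /* Integer sum variant: identity 0, operator +.
     * For the integer product use IDENT 1 and OP *;
     * for the FP variants change the typedef to float or double. */
    typedef int data_t;
    #define IDENT 0
    #define OP    +

With these definitions, abstract_combine4 is essentially combine4 above.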
Instruction Control
[Block diagram: fetch control and the instruction cache feed the instruction decoder, which produces operations for the execution unit; the register file and retirement unit track architectural state.]

Grabs instruction bytes from memory
- Based on current PC + predicted targets for predicted branches
- Hardware dynamically guesses whether branches are taken/not taken and (possibly) the branch target

Translates instructions into operations
- Primitive steps required to perform the instruction
- A typical instruction requires 1-3 operations

Converts register references into tags
- Abstract identifier linking the destination of one operation with the sources of later operations

Translation Example

Version of combine4: integer data, multiply operation.

    .L24:                              # Loop:
        imull (%eax,%edx,4),%ecx       #   t *= data[i]
        incl %edx                      #   i++
        cmpl %esi,%edx                 #   i:length
        jl .L24                        #   if < goto Loop

Translation of the first iteration:

    load (%eax,%edx.0,4)      # t.1
    imull t.1, %ecx.0         # %ecx.1
    incl %edx.0               # %edx.1
    cmpl %esi, %edx.1         # cc.1
    jl-taken cc.1

Translation Example #1

    imull (%eax,%edx,4),%ecx   becomes   load (%eax,%edx.0,4)   # t.1
                                         imull t.1, %ecx.0      # %ecx.1

- Split into two operations
  - load reads from memory to generate temporary result t.1
  - The multiply operation just operates on registers
- Operands
  - Register %eax does not change in the loop. Values will be retrieved from the register file during decoding
  - Register %ecx changes on every iteration. Uniquely identify the different versions as %ecx.0, %ecx.1, %ecx.2, ...
    - Register renaming
    - Values passed directly from producer to consumers

Translation Example #2

    incl %edx   becomes   incl %edx.0   # %edx.1

- Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, ...

Translation Example #3

    cmpl %esi,%edx   becomes   cmpl %esi, %edx.1   # cc.1

- Condition codes are treated similarly to registers
- Assign a tag to define the connection between producer and consumer

Translation Example #4

    jl .L24   becomes   jl-taken cc.1

- Instruction control unit determines the destination of the jump
- Predicts whether it will be taken and the target
- Starts fetching instructions at the predicted destination
- Execution unit simply checks whether or not the prediction was OK
- If not, it signals instruction control
  - Instruction control then "invalidates" any operations generated from misfetched instructions
  - Begins fetching and decoding instructions at the correct target

Visualizing Operations

    load (%eax,%edx,4)        # t.1
    imull t.1, %ecx.0         # %ecx.1
    incl %edx.0               # %edx.1
    cmpl %esi, %edx.1         # cc.1
    jl-taken cc.1

[Dataflow diagram: the load, imull, incl, cmpl, and jl operations placed on a time axis, with arcs for the operands passed between them.]

- Operations
  - Vertical position denotes the time at which it executes
    - Cannot begin an operation until its operands are available
  - Height denotes latency
- Operands
  - Arcs shown only for operands that are passed within the execution unit

Visualizing Operations (cont.)

    load (%eax,%edx,4)        # t.1
    iaddl t.1, %ecx.0         # %ecx.1
    incl %edx.0               # %edx.1
    cmpl %esi, %edx.1         # cc.1
    jl-taken cc.1

[Dataflow diagram: same as before, with addl in place of imull.]

- Same as before, except that the add has a latency of 1
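The next slides trace iterations of the product loop assuming unlimited resources. The conclusion (CPE pinned at 4 by the imull latency) can also be seen with a tiny back-of-envelope model, which I add here as an illustration; the latencies come from the Pentium III table above, while the assumption that one load issues per cycle is mine.

    #include <stdio.h>

    /* Rough model: each iteration's imull cannot start until the previous
     * product is ready, so the 4-cycle multiply latency sets the pace even
     * with unlimited functional units. */
    int main(void)
    {
        int mul_latency = 4;        /* Pentium III integer multiply       */
        int load_latency = 3;
        int prod_ready = 0;         /* cycle when %ecx.i becomes available */
        int i;
        for (i = 0; i < 4; i++) {
            int load_done = (i + 1) + load_latency;   /* one load per cycle */
            int start = load_done > prod_ready ? load_done : prod_ready;
            prod_ready = start + mul_latency;
            printf("iteration %d: product ready at cycle %d\n", i, prod_ready);
        }
        return 0;                   /* successive products finish 4 cycles apart */
    }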
3 Iterations of Combining Product: Unlimited Resource Analysis

[Dataflow diagram: operations for iterations 1-3 of the product loop. The incl/cmpl/jl/load operations of successive iterations overlap freely, but each imull must wait for the previous iteration's result (%ecx.1, %ecx.2, %ecx.3), so the multiplies form a chain spaced 4 cycles apart.]

- Assume an operation can start as soon as its operands are available
- Operations for multiple iterations overlap in time

Performance
- Limiting factor becomes the latency of the integer multiplier
- Gives CPE of 4.0

4 Iterations of Combining Sum

[Dataflow diagram: iterations 1-4 of the sum loop, 4 integer operations per iteration; with a 1-cycle add, the addl chain advances one cycle per iteration.]

Unlimited Resource Analysis
- Can begin a new iteration on each clock cycle
- Should give CPE of 1.0
- Would require executing 4 integer operations in parallel

Combining Sum: Resource Constraints

[Dataflow diagram: iterations 4-8 of the sum loop with only two integer units; some operations slip to later cycles even though their operands are ready.]

- Only have two integer functional units
- Some operations delayed even though operands are available
  - Set priority based on program order

Performance
- Sustain CPE of 2.0

Loop Unrolling

    void combine5(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length-2;
        int *data = get_vec_start(v);
        int sum = 0;
        int i;
        /* Combine 3 elements at a time */
        for (i = 0; i < limit; i+=3) {
            sum += data[i] + data[i+2] + data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            sum += data[i];
        }
        *dest = sum;
    }

Optimization
- Combine multiple iterations into a single loop body
- Amortizes loop overhead across multiple iterations
- Finish extras at the end
- Measured CPE = 1.33

Visualizing Unrolled Loop
- Loads can pipeline, since they don't have dependencies
- Only one set of loop control operations

    load (%eax,%edx.0,4)      # t.1a
    iaddl t.1a, %ecx.0c       # %ecx.1a
    load 4(%eax,%edx.0,4)     # t.1b
    iaddl t.1b, %ecx.1a       # %ecx.1b
    load 8(%eax,%edx.0,4)     # t.1c
    iaddl t.1c, %ecx.1b       # %ecx.1c
    iaddl $3,%edx.0           # %edx.1
    cmpl %esi, %edx.1         # cc.1
    jl-taken cc.1

[Dataflow diagram: the three loads of an iteration pipeline, feeding a chain of three adds into the accumulator.]

Executing with Loop Unrolling

[Dataflow diagram: iterations 3 and 4 of the unrolled loop overlapping in time.]

Predicted performance
- Can complete an iteration in 3 cycles
- Should give CPE of 1.0

Measured performance
- CPE of 1.33
- One iteration every 4 cycles

Effect of Unrolling

    Unrolling degree     1      2      3      4      8      16
    Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
    Integer Product    4.00   (all degrees)
    FP Sum             3.00   (all degrees)
    FP Product         5.00   (all degrees)

- Only helps the integer sum for our examples
  - Other cases constrained by functional unit latencies
- Effect is nonlinear with degree of unrolling
  - Many subtle effects determine the exact scheduling of operations
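For concreteness, here is roughly what the degree-4 version measured above (the "Unroll 4" rows in the later result tables) looks like. This is my sketch following the pattern of combine5; the function name is made up.

    /* Sketch (assumed, following combine5): sum with the loop unrolled by 4.
     * The cleanup loop handles lengths that are not a multiple of 4. */
    void combine5x4(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length - 3;
        int *data = get_vec_start(v);
        int sum = 0;
        int i;
        /* Combine 4 elements at a time */
        for (i = 0; i < limit; i += 4)
            sum += data[i] + data[i+1] + data[i+2] + data[i+3];
        /* Finish any remaining elements */
        for (; i < length; i++)
            sum += data[i];
        *dest = sum;
    }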
Serial Computation

Computation:

    ((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11)

[Figure: the multiplications form a single serial chain through x0 ... x11.]

Performance
- N elements, D cycles/operation: N*D cycles

Parallel Loop Unrolling

    void combine6(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        int *data = get_vec_start(v);
        int x0 = 1;
        int x1 = 1;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x0 *= data[i];
            x1 *= data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 *= data[i];
        }
        *dest = x0 * x1;
    }

Code version
- Integer product

Optimization
- Accumulate in two different products
  - Can be performed simultaneously
- Combine at the end

Performance
- CPE = 2.0
- 2X performance

Dual Product Computation

Computation:

    ((((((1 * x0) * x2) * x4) * x6) * x8) * x10)  *  ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Figure: two independent multiplication chains, one over the even-indexed elements and one over the odd-indexed elements, joined by a final multiply.]

Performance
- N elements, D cycles/operation: (N/2+1)*D cycles
- ~2X performance improvement

Requirements for Parallel Computation

Mathematical
- Combining operation must be associative & commutative
  - OK for integer multiplication
  - Not strictly true for floating point (a small demonstration appears later on this page)
    - OK for most applications

Hardware
- Pipelined functional units
- Ability to dynamically extract parallelism from code

Visualizing Parallel Loop
- The two multiplies within the loop no longer have a data dependency
- Allows them to pipeline

    load (%eax,%edx.0,4)      # t.1a
    imull t.1a, %ecx.0        # %ecx.1
    load 4(%eax,%edx.0,4)     # t.1b
    imull t.1b, %ebx.0        # %ebx.1
    iaddl $2,%edx.0           # %edx.1
    cmpl %esi, %edx.1         # cc.1
    jl-taken cc.1

[Dataflow diagram: the two loads and two multiplies of one iteration, with the multiplies targeting the separate accumulators %ecx and %ebx.]

Executing with Parallel Loop

[Dataflow diagram: iterations 1-3 of the two-accumulator loop; the two multiplies of each iteration overlap in the pipelined multiplier.]

Predicted performance
- Can keep the 4-cycle multiplier busy performing two simultaneous multiplications
- Gives CPE of 2.0

Optimization Results for Combining

    Method             Integer +   Integer *    FP +      FP *
    Abstract -g          42.06       41.86     41.44    160.00
    Abstract -O2         31.25       33.25     31.25    143.00
    Move vec_length      20.66       21.25     21.15    135.00
    data access           6.00        9.00      8.00    117.00
    Accum. in temp        2.00        4.00      3.00      5.00
    Pointer               3.00        4.00      3.00      5.00
    Unroll 4              1.50        4.00      3.00      5.00
    Unroll 16             1.06        4.00      3.00      5.00
    2X2                   1.50        2.00      2.00      2.50
    4X4                   1.50        2.00      1.50      2.50
    8X4                   1.25        1.25      1.50      2.00
    Theoretical Opt.      1.00        1.00      1.00      2.00
    Worst : Best          39.7        33.5      27.6      80.0

Parallel Unrolling: Method #2

    void combine6aa(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        int *data = get_vec_start(v);
        int x = 1;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x *= (data[i] * data[i+1]);
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x *= data[i];
        }
        *dest = x;
    }

Code version
- Integer product

Optimization
- Multiply pairs of elements together, and then update the product
- "Tree height reduction"

Performance
- N elements, D cycles/operation
- Should be (N/2+1)*D cycles (CPE = 2.0)
- Measured CPE is worse: CPE = 2.5

Method #2 Computation

Computation:

    ((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5)) * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

[Figure: elements are first multiplied in pairs; the pair products then feed a serial chain of updates to the running product.]

    Unrolling    CPE (measured)   CPE (theoretical)
        2             2.50              2.00
        3             1.67              1.33
        4             1.50              1.00
        6             1.78              1.00
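As noted under Requirements for Parallel Computation, the reassociation used by combine6 and combine6aa is not strictly valid for floating point. Here is a tiny demonstration; the values are my own illustrative choice, not from the slides.

    #include <stdio.h>

    /* Floating-point addition is not associative: regrouping changes the
     * rounded result, so a compiler will not apply this transformation to
     * FP code on its own; the programmer must decide it is acceptable. */
    int main(void)
    {
        float big = 1e20f, small = 1.0f;
        printf("(big + -big) + small = %g\n", (big + -big) + small);  /* 1 */
        printf("big + (-big + small) = %g\n", big + (-big + small));  /* 0 */
        return 0;
    }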
Understanding Parallelism

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x * data[i]) * data[i+1];
    }

- CPE = 4.00
- All multiplies performed in sequence

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x * (data[i] * data[i+1]);
    }

- CPE = 2.50
- Multiplies overlap

[Figure: the first grouping produces a single serial chain of multiplies through x0 ... x11; the second multiplies elements in pairs first, shortening the chain through x.]

Limitations of Parallel Execution

Need lots of registers
- To hold sums/products
- Only 6 usable integer registers
  - Also needed for pointers, loop conditions
- 8 FP registers
- When there are not enough registers, must spill temporaries onto the stack
  - Wipes out any performance gains

Not helped by renaming
- Cannot reference more operands than the instruction set allows
- Major drawback of the IA32 instruction set

Register Spilling Example

Example
- 8 X 8 integer product
- 7 local variables share 1 register
- See that locals are being stored on the stack, e.g. at -8(%ebp)

    .L165:
        imull (%eax),%ecx
        movl -4(%ebp),%edi
        imull 4(%eax),%edi
        movl %edi,-4(%ebp)
        movl -8(%ebp),%edi
        imull 8(%eax),%edi
        movl %edi,-8(%ebp)
        movl -12(%ebp),%edi
        imull 12(%eax),%edi
        movl %edi,-12(%ebp)
        movl -16(%ebp),%edi
        imull 16(%eax),%edi
        movl %edi,-16(%ebp)
        ...
        addl $32,%eax
        addl $8,%edx
        cmpl -32(%ebp),%edx
        jl .L165

Summary: Results for Pentium III

    Method             Integer +   Integer *    FP +      FP *
    Abstract -g          42.06       41.86     41.44    160.00
    Abstract -O2         31.25       33.25     31.25    143.00
    Move vec_length      20.66       21.25     21.15    135.00
    data access           6.00        9.00      8.00    117.00
    Accum. in temp        2.00        4.00      3.00      5.00
    Unroll 4              1.50        4.00      3.00      5.00
    Unroll 16             1.06        4.00      3.00      5.00
    4X2                   1.50        2.00      1.50      2.50
    8X4                   1.25        1.25      1.50      2.00
    8X8                   1.88        1.88      1.75      2.00
    Worst : Best          39.7        33.5      27.6      80.0

- Biggest gain comes from doing the basic optimizations
- But the last little bit helps

Results for Alpha Processor

    Method             Integer +   Integer *    FP +      FP *
    Abstract -g          40.14       47.14     52.07     53.71
    Abstract -O2         25.08       36.05     37.37     32.02
    Move vec_length      19.19       32.18     28.73     32.73
    data access           6.26       12.52     13.26     13.01
    Accum. in temp        1.76        9.01      8.08      8.01
    Unroll 4              1.51        9.01      6.32      6.32
    Unroll 16             1.25        9.01      6.33      6.22
    4X2                   1.19        4.69      4.44      4.45
    8X4                   1.15        4.12      2.34      2.01
    8X8                   1.11        4.24      2.36      2.08
    Worst : Best          36.2        11.4      22.3      26.7

- Overall trends very similar to those for the Pentium III, even though it is a very different architecture and compiler

Results for Pentium 4

    Method             Integer +   Integer *    FP +      FP *
    Abstract -g          35.25       35.34     35.85     38.00
    Abstract -O2         26.52       30.26     31.55     32.00
    Move vec_length      18.00       25.71     23.36     24.25
    data access           3.39       31.56     27.50     28.35
    Accum. in temp        2.00       14.00      5.00      7.00
    Unroll 4              1.01       14.00      5.00      7.00
    Unroll 16             1.00       14.00      5.00      7.00
    4X2                   1.02        7.00      2.63      3.50
    8X4                   1.01        3.98      1.82      2.00
    8X8                   1.63        4.50      2.42      2.31
    Worst : Best          35.2         8.9      19.7      19.0

- Higher latencies (integer * = 14, FP + = 5.0, FP * = 7.0)
  - Clock runs at 2.0 GHz
  - Not an improvement over the 1.0 GHz Pentium III for integer *
- Avoids the FP multiplication anomaly
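The 2X2 / 4X2 / 8X4 / 8X8 entries in these tables unroll further and use more parallel accumulators. For reference, here is roughly what the 8X8 source pattern looks like; this is my sketch following combine6 (the name is made up), and on IA32 it is exactly the case where the compiler runs out of registers and spills, as in the Register Spilling Example above.

    /* Sketch (assumed): unroll by 8 with 8 parallel accumulators.
     * IA32 cannot keep all eight accumulators plus the loop state in
     * registers, so several of them end up spilled to the stack. */
    void combine8x8(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length - 7;
        int *data = get_vec_start(v);
        int x0 = 1, x1 = 1, x2 = 1, x3 = 1;
        int x4 = 1, x5 = 1, x6 = 1, x7 = 1;
        int i;
        /* Combine 8 elements at a time */
        for (i = 0; i < limit; i += 8) {
            x0 *= data[i];   x1 *= data[i+1];
            x2 *= data[i+2]; x3 *= data[i+3];
            x4 *= data[i+4]; x5 *= data[i+5];
            x6 *= data[i+6]; x7 *= data[i+7];
        }
        /* Finish any remaining elements */
        for (; i < length; i++)
            x0 *= data[i];
        *dest = x0 * x1 * x2 * x3 * x4 * x5 * x6 * x7;
    }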
What About Branches?

Challenge
- Instruction Control Unit must work well ahead of the Execution Unit, to generate enough operations to keep the EU busy
- When it encounters a conditional branch, it cannot reliably determine where to continue fetching

    80489f3:  movl  $0x1,%ecx
    80489f8:  xorl  %edx,%edx
    80489fa:  cmpl  %esi,%edx
    80489fc:  jnl   8048a25
    80489fe:  movl  %esi,%esi
    8048a00:  imull (%eax,%edx,4),%ecx

(The earlier instructions are still executing while fetching and decoding have already reached the jnl and must decide where to continue.)

Branch Outcomes
- When encountering a conditional branch, cannot determine where to continue fetching
  - Branch Taken: transfer control to the branch target
  - Branch Not-Taken: continue with the next instruction in sequence
- Cannot resolve until the outcome is determined by the branch/integer unit

    80489fc:  jnl   8048a25
    80489fe:  movl  %esi,%esi              # branch not-taken: continue here
    8048a00:  imull (%eax,%edx,4),%ecx
        ...
    8048a25:  cmpl  %edi,%edx              # branch taken: continue here
    8048a27:  jl    8048a20
    8048a29:  movl  0xc(%ebp),%eax
    8048a2c:  leal  0xffffffe8(%ebp),%esp
    8048a2f:  movl  %ecx,(%eax)

Branch Prediction

Idea
- Guess which way the branch will go
- Begin executing instructions at the predicted position
  - But don't actually modify register or memory data

    80489f3:  movl  $0x1,%ecx
    80489f8:  xorl  %edx,%edx
    80489fa:  cmpl  %esi,%edx
    80489fc:  jnl   8048a25                # predict taken
        ...
    8048a25:  cmpl  %edi,%edx              # begin executing here, speculatively
    8048a27:  jl    8048a20
    8048a29:  movl  0xc(%ebp),%eax
    8048a2c:  leal  0xffffffe8(%ebp),%esp
    8048a2f:  movl  %ecx,(%eax)

Branch Prediction Through Loop

Assume vector length = 100.

    80488b1:  movl  (%ecx,%edx,4),%eax
    80488b4:  addl  %eax,(%edi)
    80488b6:  incl  %edx
    80488b7:  cmpl  %esi,%edx
    80488b9:  jl    80488b1                # i = 98: predict taken (OK)

    80488b1:  movl  (%ecx,%edx,4),%eax
    80488b4:  addl  %eax,(%edi)
    80488b6:  incl  %edx
    80488b7:  cmpl  %esi,%edx
    80488b9:  jl    80488b1                # i = 99: predict taken (oops)

    80488b1:  movl  (%ecx,%edx,4),%eax     # i = 100: executed anyway,
    80488b4:  addl  %eax,(%edi)            #          reads an invalid location
    80488b6:  incl  %edx
    80488b7:  cmpl  %esi,%edx
    80488b9:  jl    80488b1

    80488b1:  movl  (%ecx,%edx,4),%eax     # i = 101: fetched
    80488b4:  addl  %eax,(%edi)
    80488b6:  incl  %edx

Branch Misprediction Invalidation
- Assume vector length = 100
- The branch at the end of the i = 99 iteration was predicted taken but is not taken
- Once the misprediction is detected, the operations speculatively generated for the i = 100 and i = 101 blocks are invalidated

Branch Misprediction Recovery
- Assume vector length = 100
- After the i = 99 iteration, the branch is definitely not taken; fetching and decoding resume at the fall-through code

    80488b1:  movl  (%ecx,%edx,4),%eax
    80488b4:  addl  %eax,(%edi)
    80488b6:  incl  %edx
    80488b7:  cmpl  %esi,%edx              # i = 99
    80488b9:  jl    80488b1                # definitely not taken
    80488bb:  leal  0xffffffe8(%ebp),%esp
    80488be:  popl  %ebx
    80488bf:  popl  %esi
    80488c0:  popl  %edi

Performance cost
- A misprediction on the Pentium III wastes ~14 clock cycles
- That's a lot of time on a high-performance processor
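A rough calculation to put that penalty in perspective (my note, not on the slides): for the length-100 loop above, the backward branch is predicted correctly on every iteration except the last, so the single misprediction amortizes to about 1 misprediction / 100 elements * ~14 cycles, roughly 0.14 cycles per element. The penalty matters far more for branches inside the loop body whose outcome depends on the data, which is the case the next slides address.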
Avoiding Branches

On a modern processor, branches are very expensive
- Unless the prediction can be reliable
- When possible, best to avoid them altogether

Example
- Compute the maximum of two values
  - 14 cycles when the prediction is correct
  - 29 cycles when it is incorrect

    int max(int x, int y)
    {
        return (x < y) ? y : x;
    }

        movl 12(%ebp),%edx    # Get y
        movl 8(%ebp),%eax     # rval=x
        cmpl %edx,%eax        # rval:y
        jge L11               # skip when >=
        movl %edx,%eax        # rval=y
    L11:

Avoiding Branches with Bit Tricks
- In the style of Lab #1
- Use masking rather than conditionals

    int bmax(int x, int y)
    {
        int mask = -(x>y);
        return (mask & x) | (~mask & y);
    }

- The compiler still uses a conditional

        xorl %edx,%edx        # mask = 0
        movl 8(%ebp),%eax
        movl 12(%ebp),%ecx
        cmpl %ecx,%eax
        jle L13               # skip if x<=y
        movl $-1,%edx         # mask = -1
    L13:

- Performance
  - 16 cycles when it predicts correctly
  - 32 cycles when it mispredicts

Avoiding Branches with Bit Tricks (cont.)
- Force the compiler to generate the desired code

    int bvmax(int x, int y)
    {
        volatile int t = (x>y);
        int mask = -t;
        return (mask & x) | (~mask & y);
    }

- The volatile declaration forces the value to be written to memory
  - The compiler must therefore generate code to compute t
  - Simplest way is the setg/movzbl combination

        movl 8(%ebp),%ecx     # Get x
        movl 12(%ebp),%edx    # Get y
        cmpl %edx,%ecx        # x:y
        setg %al              # (x>y)
        movzbl %al,%eax       # Zero extend
        movl %eax,-4(%ebp)    # Save as t
        movl -4(%ebp),%eax    # Retrieve t

- Not very elegant!
  - A hack to get control over the compiler
- Performance
  - 22 clock cycles on all data
  - Better than misprediction

Conditional Move
- Added with the P6 microarchitecture (PentiumPro onward)
- cmovXXl %edx, %eax
  - If condition XX holds, copy %edx to %eax
  - Doesn't involve any branching
  - Handled as an operation within the Execution Unit

        movl 8(%ebp),%edx     # Get x
        movl 12(%ebp),%eax    # rval=y
        cmpl %edx, %eax       # rval:x
        cmovll %edx,%eax      # If <, rval=x

- Current version of GCC won't use this instruction
  - Thinks it's compiling for a 386
- Performance
  - 14 cycles on all data

Machine-Dependent Optimization Summary

Pointer code
- Look carefully at the generated code to see whether it is helpful

Loop unrolling
- Some compilers do this automatically
- Generally not as clever as what can be achieved by hand

Exposing instruction-level parallelism
- Very machine dependent
- Best if performed by the compiler
  - But GCC on IA32/Linux is not very good at it

Warning
- Benefits depend heavily on the particular machine
- Hard to read, maintain, & assure correctness
- Do only for performance-critical parts of code

Role of Programmer

How should I write my programs, given that I have a good, optimizing compiler?

Don't: smash code into oblivion

Do:
- Select the best algorithm
- Write code that's readable & maintainable
  - Procedures, recursion, without built-in constant limits
  - Even though these factors can slow down code
- Eliminate optimization blockers
  - Allows the compiler to do its job

Focus on inner loops
- Do detailed optimizations where code will be executed repeatedly
- Will get the most performance gain here
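To make "eliminate optimization blockers" concrete, here is the transformation behind the "Accum. in temp" rows of the tables above: accumulating through *dest forces a load and a store on every iteration, because the compiler cannot assume *dest does not alias the vector data. The "before" version below is my reconstruction; the "after" version is combine4 from the start of this lecture.

    /* Before (sketch): the running sum lives in memory at *dest, so each
     * iteration reads and writes it. This is an optimization blocker. */
    void combine_before(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = 0;
        for (i = 0; i < length; i++)
            *dest += data[i];
    }

    /* After (this is combine4): accumulate in a local temporary, which the
     * compiler can keep in a register, and store once at the end. */
    void combine_after(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int sum = 0;
        for (i = 0; i < length; i++)
            sum += data[i];
        *dest = sum;
    }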