15-213
"The course that gives CMU its Zip!"

Code Optimization II: Machine-Dependent Optimizations
Oct. 1, 2002

Topics
- Machine-Dependent Optimizations
  - Pointer code
  - Unrolling
  - Enabling instruction-level parallelism
- Understanding Processor Operation
  - Translation of instructions into operations
  - Out-of-order execution of operations
- Branches and Branch Prediction
- Advice

class11.ppt
Previous Best Combining Code

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Task
- Compute sum of all elements in vector
- Vector represented by C-style abstract data type
- Achieved CPE of 2.00 (cycles per element)
15-213, F’02
General Forms of Combining

void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}

Data Types
- Use different declarations for data_t: int, float, double

Operations
- Use different definitions of OP and IDENT:
  - + / 0
  - * / 1
Machine Independent Opt. Results

Optimizations
- Reduce function calls and memory references within loop

Method             Integer          Floating Point
                   +        *       +        *
Abstract -g        42.06    41.86   41.44    160.00
Abstract -O2       31.25    33.25   31.25    143.00
Move vec_length    20.66    21.25   21.15    135.00
data access         6.00     9.00    8.00    117.00
Accum. in temp      2.00     4.00    3.00      5.00
Performance Anomaly
- Computing the FP product of all elements is exceptionally slow
- Very large speedup when accumulating in a temporary
- Caused by a quirk of IA32 floating point
  - Memory uses a 64-bit format, registers use 80 bits
  - Benchmark data caused overflow of 64 bits, but not 80
Pointer Code

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}

Optimization
- Use pointers rather than array references
- CPE: 3.00 (compiled -O2)
  - Oops! We're not making progress here!

Warning: some compilers do a better job optimizing array code
Pointer vs. Array Code Inner Loops

Array Code
.L24:                          # Loop:
  addl (%eax,%edx,4),%ecx      #   sum += data[i]
  incl %edx                    #   i++
  cmpl %esi,%edx               #   i:length
  jl .L24                      #   if < goto Loop

Pointer Code
.L30:                          # Loop:
  addl (%eax),%ecx             #   sum += *data
  addl $4,%eax                 #   data++
  cmpl %edx,%eax               #   data:dend
  jb .L30                      #   if < goto Loop

Performance
- Array code: 4 instructions in 2 clock cycles
- Pointer code: almost the same 4 instructions in 3 clock cycles
Modern CPU Design

[Block diagram: an Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) turns addresses and instruction bytes into operations for an Execution unit containing the functional units: Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, and Store, backed by a Data Cache. Operation results flow back as register updates, and the branch unit reports whether each prediction was OK.]
CPU Capabilities of Pentium III

Multiple Instructions Can Execute in Parallel
- 1 load
- 1 store
- 2 integer (one may be branch)
- 1 FP addition
- 1 FP multiplication or division

Some Instructions Take > 1 Cycle, but Can Be Pipelined

Instruction                 Latency   Cycles/Issue
Load / Store                3         1
Integer Multiply            4         1
Integer Divide              36        36
Double/Single FP Multiply   5         2
Double/Single FP Add        3         1
Double/Single FP Divide     38        38
Instruction Control

[Diagram: the Retirement Unit and Register File sit alongside Fetch Control, the Instruction Cache, and Instruction Decode, which convert addresses and instruction bytes into operations.]

Grabs Instruction Bytes From Memory
- Based on current PC + predicted targets for predicted branches
- Hardware dynamically guesses whether branches are taken/not taken and (possibly) the branch target

Translates Instructions Into Operations
- Primitive steps required to perform the instruction
- A typical instruction requires 1–3 operations

Converts Register References Into Tags
- Abstract identifier linking the destination of one operation with the sources of later operations
Translation Example

Version of Combine4
- Integer data, multiply operation

.L24:                          # Loop:
  imull (%eax,%edx,4),%ecx     #   t *= data[i]
  incl %edx                    #   i++
  cmpl %esi,%edx               #   i:length
  jl .L24                      #   if < goto Loop

Translation of First Iteration

  imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)  ->  t.1
                                imull t.1, %ecx.0     ->  %ecx.1
  incl %edx                 ->  incl %edx.0           ->  %edx.1
  cmpl %esi,%edx            ->  cmpl %esi, %edx.1     ->  cc.1
  jl .L24                   ->  jl-taken cc.1
Translation Example #1

  imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)  ->  t.1
                                imull t.1, %ecx.0     ->  %ecx.1

Split into two operations
- load reads from memory to generate temporary result t.1
- Multiply operation just operates on registers

Operands
- Register %eax does not change in the loop. Its value will be retrieved from the register file during decoding
- Register %ecx changes on every iteration. Uniquely identify the different versions as %ecx.0, %ecx.1, %ecx.2, …
  - Register renaming
  - Values passed directly from producer to consumers
Translation Example #2

  incl %edx  ->  incl %edx.0  ->  %edx.1

- Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, …
Translation Example #3

  cmpl %esi,%edx  ->  cmpl %esi, %edx.1  ->  cc.1

- Condition codes are treated similarly to registers
- Assign a tag to define the connection between producer and consumer
Translation Example #4

  jl .L24  ->  jl-taken cc.1

- The instruction control unit determines the destination of the jump
  - Predicts whether it will be taken and its target
- Starts fetching instructions at the predicted destination
- The execution unit simply checks whether or not the prediction was OK
  - If not, it signals instruction control
    - Instruction control then "invalidates" any operations generated from misfetched instructions
    - Begins fetching and decoding instructions at the correct target
Visualizing Operations

load (%eax,%edx,4)   -> t.1
imull t.1, %ecx.0    -> %ecx.1
incl %edx.0          -> %edx.1
cmpl %esi, %edx.1    -> cc.1
jl-taken cc.1

[Dataflow diagram: load feeds imull; incl feeds cmpl, which feeds jl; time runs downward.]

Operations
- Vertical position denotes time at which executed
  - Cannot begin an operation until its operands are available
- Height denotes latency

Operands
- Arcs shown only for operands that are passed within the execution unit
Visualizing Operations (cont.)

load (%eax,%edx,4)   -> t.1
iaddl t.1, %ecx.0    -> %ecx.1
incl %edx.0          -> %edx.1
cmpl %esi, %edx.1    -> cc.1
jl-taken cc.1

[Dataflow diagram as before, with the add completing in a single cycle.]

Operations
- Same as before, except that the add has a latency of 1
3 Iterations of Combining Product

[Pipeline diagram: iterations i=0, 1, 2 overlap. The load/incl/cmpl/jl operations of successive iterations issue one cycle apart, but the imull operations serialize on %ecx — %ecx.1 completes at cycle 7, %ecx.2 at cycle 11, %ecx.3 at cycle 15.]

Unlimited Resource Analysis
- Assume an operation can start as soon as its operands are available
- Operations for multiple iterations overlap in time

Performance
- Limiting factor becomes the latency of the integer multiplier
- Gives CPE of 4.0
4 Iterations of Combining Sum

[Pipeline diagram: iterations i=0..3 overlap. With a 1-cycle addl, each iteration's chain (load, addl, incl, cmpl, jl) completes one cycle after the previous iteration's, producing %ecx.1 through %ecx.4 on successive cycles — 4 integer ops per iteration.]

Unlimited Resource Analysis
- Can begin a new iteration on each clock cycle
- Should give CPE of 1.0
- Would require executing 4 integer operations in parallel

Performance
- See the resource-constrained analysis on the next slide
Combining Sum: Resource Constraints

[Pipeline diagram: iterations 4 through 8 (i=3..7). With only two integer units, the load/incl/cmpl/jl/addl operations of successive iterations compete for issue slots, and a new accumulator value %ecx.i completes only every other cycle.]

- Only have two integer functional units
- Some operations delayed even though their operands are available
- Priority set based on program order

Performance
- Sustains CPE of 2.0
Loop Unrolling

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
        sum += data[i] + data[i+2]
             + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
- Combine multiple iterations into a single loop body
- Amortizes loop overhead across multiple iterations
- Finish extras at end
- Measured CPE = 1.33
Visualizing Unrolled Loop

load (%eax,%edx.0,4)   -> t.1a
iaddl t.1a, %ecx.0c    -> %ecx.1a
load 4(%eax,%edx.0,4)  -> t.1b
iaddl t.1b, %ecx.1a    -> %ecx.1b
load 8(%eax,%edx.0,4)  -> t.1c
iaddl t.1c, %ecx.1b    -> %ecx.1c
iaddl $3,%edx.0        -> %edx.1
cmpl %esi, %edx.1      -> cc.1
jl-taken cc.1

[Dataflow diagram: the three loads issue back to back and feed a serial chain of three adds.]

- Loads can pipeline, since they don't have dependencies
- Only one set of loop control operations
Executing with Loop Unrolling

[Pipeline diagram: iterations 3 (i=6) and 4 (i=9) overlap; each iteration's three loads and three adds fit into 3 cycles in the idealized schedule.]

Predicted Performance
- Can complete an iteration in 3 cycles
- Should give CPE of 1.0

Measured Performance
- CPE of 1.33
- One iteration every 4 cycles
Effect of Unrolling

Unrolling Degree   1      2      3      4      8      16
Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
Integer Product    4.00 (all degrees)
FP Sum             3.00 (all degrees)
FP Product         5.00 (all degrees)

- Only helps integer sum for our examples
  - Other cases constrained by functional unit latencies
- Effect is nonlinear with degree of unrolling
  - Many subtle effects determine the exact scheduling of operations
Serial Computation

Computation
  ((((((((((((1 * x0) * x1) * x2) * x3)
      * x4) * x5) * x6) * x7) * x8) * x9)
      * x10) * x11)

[Diagram: a single serial chain of multiplies, one per element x0 … x11.]

Performance
- N elements, D cycles/operation
- N*D cycles
Parallel Loop Unrolling

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
- Integer product

Optimization
- Accumulate in two different products
  - Can be performed simultaneously
- Combine at end

Performance
- CPE = 2.0
- 2X performance
Dual Product Computation

Computation
  ((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
  ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Diagram: two independent multiply chains, one over the even-indexed elements and one over the odd-indexed elements, combined by a final multiply.]

Performance
- N elements, D cycles/operation
- (N/2+1)*D cycles
- ~2X performance improvement
Requirements for Parallel Computation

Mathematical
- Combining operation must be associative & commutative
  - OK for integer multiplication
  - Not strictly true for floating point
    - OK for most applications
Hardware
- Pipelined functional units
- Ability to dynamically extract parallelism from code
Visualizing Parallel Loop

load (%eax,%edx.0,4)   -> t.1a
imull t.1a, %ecx.0     -> %ecx.1
load 4(%eax,%edx.0,4)  -> t.1b
imull t.1b, %ebx.0     -> %ebx.1
iaddl $2,%edx.0        -> %edx.1
cmpl %esi, %edx.1      -> cc.1
jl-taken cc.1

[Dataflow diagram: the two imull chains proceed independently from %ecx.0 and %ebx.0.]

- The two multiplies within the loop no longer have a data dependency
- Allows them to pipeline
Executing with Parallel Loop

[Pipeline diagram: iterations 1–3 (i=0, 2, 4); the two multiply chains interleave so that a multiplication starts every 2 cycles, producing a %ecx.i / %ebx.i pair every 4 cycles.]

Predicted Performance
- Can keep the 4-cycle-latency multiplier busy performing two simultaneous multiplications
- Gives CPE of 2.0
Optimization Results for Combining

Method             Integer          Floating Point
                   +        *       +        *
Abstract -g        42.06    41.86   41.44    160.00
Abstract -O2       31.25    33.25   31.25    143.00
Move vec_length    20.66    21.25   21.15    135.00
data access         6.00     9.00    8.00    117.00
Accum. in temp      2.00     4.00    3.00      5.00
Pointer             3.00     4.00    3.00      5.00
Unroll 4            1.50     4.00    3.00      5.00
Unroll 16           1.06     4.00    3.00      5.00
2X2                 1.50     2.00    2.00      2.50
4X4                 1.50     2.00    1.50      2.50
8X4                 1.25     1.25    1.50      2.00
Theoretical Opt.    1.00     1.00    1.00      2.00
Worst : Best       39.7     33.5    27.6      80.0
Parallel Unrolling: Method #2

void combine6aa(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x *= (data[i] * data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x *= data[i];
    }
    *dest = x;
}

Code Version
- Integer product

Optimization
- Multiply pairs of elements together
- And then update the product
- "Tree height reduction"

Performance
- CPE = 2.5
Method #2 Computation

Computation
  ((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5))
      * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

[Diagram: pairs of elements multiplied first, then folded into a single chain.]

Performance
- N elements, D cycles/operation
- Should be (N/2+1)*D cycles, CPE = 2.0
- Measured CPE worse

Unrolling   CPE (measured)   CPE (theoretical)
2           2.50             2.00
3           1.67             1.33
4           1.50             1.00
6           1.78             1.00
Understanding Parallelism

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = (x * data[i]) * data[i+1];
}

- CPE = 4.00
- All multiplies performed in sequence

[Diagram: a single serial chain of multiplies over x0 … x11.]

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = x * (data[i] * data[i+1]);
}

- CPE = 2.50
- Multiplies overlap

[Diagram: the pairwise products x0*x1, x2*x3, … are computed off the critical path, then folded into the running product.]
Limitations of Parallel Execution

Need Lots of Registers
- To hold sums/products
- Only 6 usable integer registers
  - Also needed for pointers, loop conditions
- 8 FP registers
- When there are not enough registers, must spill temporaries onto the stack
  - Wipes out any performance gains
- Not helped by renaming
  - Cannot reference more operands than the instruction set allows
  - Major drawback of the IA32 instruction set
Register Spilling Example

Example
- 8 X 8 integer product
- 7 local variables share 1 register
- See that locals are being stored on the stack
  - E.g., at -8(%ebp)

.L165:
  imull (%eax),%ecx
  movl -4(%ebp),%edi
  imull 4(%eax),%edi
  movl %edi,-4(%ebp)
  movl -8(%ebp),%edi
  imull 8(%eax),%edi
  movl %edi,-8(%ebp)
  movl -12(%ebp),%edi
  imull 12(%eax),%edi
  movl %edi,-12(%ebp)
  movl -16(%ebp),%edi
  imull 16(%eax),%edi
  movl %edi,-16(%ebp)
  …
  addl $32,%eax
  addl $8,%edx
  cmpl -32(%ebp),%edx
  jl .L165
Summary: Results for Pentium III

Method             Integer          Floating Point
                   +        *       +        *
Abstract -g        42.06    41.86   41.44    160.00
Abstract -O2       31.25    33.25   31.25    143.00
Move vec_length    20.66    21.25   21.15    135.00
data access         6.00     9.00    8.00    117.00
Accum. in temp      2.00     4.00    3.00      5.00
Unroll 4            1.50     4.00    3.00      5.00
Unroll 16           1.06     4.00    3.00      5.00
4X2                 1.50     2.00    1.50      2.50
8X4                 1.25     1.25    1.50      2.00
8X8                 1.88     1.88    1.75      2.00
Worst : Best       39.7     33.5    27.6      80.0

- Biggest gain comes from the basic optimizations
- But the last little bit helps
Results for Alpha Processor

Method             Integer          Floating Point
                   +        *       +        *
Abstract -g        40.14    47.14   52.07    53.71
Abstract -O2       25.08    36.05   37.37    32.02
Move vec_length    19.19    32.18   28.73    32.73
data access         6.26    12.52   13.26    13.01
Accum. in temp      1.76     9.01    8.08     8.01
Unroll 4            1.51     9.01    6.32     6.32
Unroll 16           1.25     9.01    6.33     6.22
4X2                 1.19     4.69    4.44     4.45
8X4                 1.15     4.12    2.34     2.01
8X8                 1.11     4.24    2.36     2.08
Worst : Best       36.2     11.4    22.3     26.7

- Overall trends very similar to those for Pentium III
- Even though it is a very different architecture and compiler
Results for Pentium 4

Method             Integer          Floating Point
                   +        *       +        *
Abstract -g        35.25    35.34   35.85    38.00
Abstract -O2       26.52    30.26   31.55    32.00
Move vec_length    18.00    25.71   23.36    24.25
data access         3.39    31.56   27.50    28.35
Accum. in temp      2.00    14.00    5.00     7.00
Unroll 4            1.01    14.00    5.00     7.00
Unroll 16           1.00    14.00    5.00     7.00
4X2                 1.02     7.00    2.63     3.50
8X4                 1.01     3.98    1.82     2.00
8X8                 1.63     4.50    2.42     2.31
Worst : Best       35.2      8.9    19.7     19.0

- Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
  - Clock runs at 2.0 GHz
  - Not an improvement over the 1.0 GHz P3 for integer *
- Avoids the FP multiplication anomaly
What About Branches?

Challenge
- The Instruction Control Unit must work well ahead of the Execution Unit to generate enough operations to keep the EU busy

  80489f3: movl  $0x1,%ecx              <- Executing
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25
  80489fe: movl  %esi,%esi              <- Fetching & Decoding
  8048a00: imull (%eax,%edx,4),%ecx

- When it encounters a conditional branch, it cannot reliably determine where to continue fetching
Branch Outcomes

- When the fetcher encounters a conditional branch, it cannot determine where to continue fetching
  - Branch Taken: transfer control to the branch target
  - Branch Not-Taken: continue with the next instruction in sequence
- Cannot resolve until the outcome is determined by the branch/integer unit

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25
  80489fe: movl  %esi,%esi              <- Branch Not-Taken
  8048a00: imull (%eax,%edx,4),%ecx

  8048a25: cmpl  %edi,%edx              <- Branch Taken
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)
Branch Prediction

Idea
- Guess which way the branch will go
- Begin executing instructions at the predicted position
  - But don't actually modify register or memory data

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25                <- Predict Taken
  . . .

  8048a25: cmpl  %edi,%edx              <- Execute
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)
Branch Prediction Through Loop

Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax      i = 98
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Predict Taken (OK)

  80488b1: movl (%ecx,%edx,4),%eax      i = 99
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Predict Taken (Oops)

  80488b1: movl (%ecx,%edx,4),%eax      i = 100   <- Executed: reads invalid location
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1

  80488b1: movl (%ecx,%edx,4),%eax      i = 101   <- Fetched
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1
Branch Misprediction Invalidation

Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax      i = 98
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Predict Taken (OK)

  80488b1: movl (%ecx,%edx,4),%eax      i = 99
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Predict Taken (Oops)

  80488b1: movl (%ecx,%edx,4),%eax      i = 100   <- Invalidate
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1

  80488b1: movl (%ecx,%edx,4),%eax      i = 101   <- Invalidate
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
Branch Misprediction Recovery

Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax      i = 98
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Predict Taken (OK)

  80488b1: movl (%ecx,%edx,4),%eax      i = 99
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1                 <- Definitely not taken
  80488bb: leal 0xffffffe8(%ebp),%esp
  80488be: popl %ebx
  80488bf: popl %esi
  80488c0: popl %edi

Performance Cost
- A misprediction on Pentium III wastes ~14 clock cycles
- That's a lot of time on a high performance processor
Avoiding Branches

On Modern Processors, Branches Are Very Expensive
- Unless prediction can be reliable
- When possible, best to avoid them altogether

Example
- Compute maximum of two values
  - 14 cycles when prediction correct
  - 29 cycles when incorrect

int max(int x, int y)
{
    return (x < y) ? y : x;
}

  movl 12(%ebp),%edx   # Get y
  movl 8(%ebp),%eax    # rval=x
  cmpl %edx,%eax       # rval:y
  jge L11              # skip when >=
  movl %edx,%eax       # rval=y
L11:
Avoiding Branches with Bit Tricks

- In the style of Lab #1
- Use masking rather than conditionals

int bmax(int x, int y)
{
    int mask = -(x > y);
    return (mask & x) | (~mask & y);
}

- Compiler still uses a conditional
  - 16 cycles when it predicts correctly
  - 32 cycles when it mispredicts

  xorl %edx,%edx       # mask = 0
  movl 8(%ebp),%eax
  movl 12(%ebp),%ecx
  cmpl %ecx,%eax
  jle L13              # skip if x<=y
  movl $-1,%edx        # mask = -1
L13:
Avoiding Branches with Bit Tricks

- Force compiler to generate the desired code

int bvmax(int x, int y)
{
    volatile int t = (x > y);
    int mask = -t;
    return (mask & x) |
           (~mask & y);
}

  movl 8(%ebp),%ecx    # Get x
  movl 12(%ebp),%edx   # Get y
  cmpl %edx,%ecx       # x:y
  setg %al             # (x>y)
  movzbl %al,%eax      # Zero extend
  movl %eax,-4(%ebp)   # Save as t
  movl -4(%ebp),%eax   # Retrieve t

- The volatile declaration forces the value to be written to memory
  - Compiler must therefore generate code to compute t
  - Simplest way is the setg/movzbl combination
- Not very elegant!
  - A hack to get control over the compiler
- 22 clock cycles on all data
  - Better than misprediction
Conditional Move

- Added with the P6 microarchitecture (PentiumPro onward)
- cmovXXl %edx, %eax
  - If condition XX holds, copy %edx to %eax
  - Doesn't involve any branching
  - Handled as an operation within the Execution Unit

  movl 8(%ebp),%edx    # Get x
  movl 12(%ebp),%eax   # rval=y
  cmpl %edx, %eax      # rval:x
  cmovll %edx,%eax     # If <, rval=x

- Current version of GCC won't use this instruction
  - Thinks it's compiling for a 386

Performance
- 14 cycles on all data
Machine-Dependent Opt. Summary

Pointer Code
- Look carefully at the generated code to see whether it is helpful

Loop Unrolling
- Some compilers do this automatically
- Generally not as clever as what can be achieved by hand

Exposing Instruction-Level Parallelism
- Very machine dependent
- Best if performed by the compiler
  - But GCC on IA32/Linux is not very good at it

Warning:
- Benefits depend heavily on the particular machine
- Do this only for performance-critical parts of code
Role of Programmer

How should I write my programs, given that I have a good, optimizing compiler?

Don't: Smash Code into Oblivion
- Hard to read, maintain, & assure correctness

Do:
- Select the best algorithm
- Write code that's readable & maintainable
  - Procedures, recursion, without built-in constant limits
  - Even though these factors can slow down code
- Eliminate optimization blockers
  - Allows the compiler to do its job

Focus on Inner Loops
- Do detailed optimizations where code will be executed repeatedly
- Will get the most performance gain here