
15-213
“The course that gives CMU its Zip!”
Code Optimization II:
Machine-Dependent Optimizations
Oct. 1, 2002
Topics
- Machine-Dependent Optimizations
  - Pointer code
  - Unrolling
  - Enabling instruction-level parallelism
- Understanding Processor Operation
  - Translation of instructions into operations
  - Out-of-order execution of operations
- Branches and Branch Prediction
- Advice
Previous Best Combining Code
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;

    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Task
- Compute the sum of all elements in a vector
- Vector represented by a C-style abstract data type (a sketch of the ADT follows below)
- Achieved CPE of 2.00
  - CPE = cycles per element; a CPE of 2.00 means an n-element vector takes roughly 2n cycles plus loop overhead
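For reference, a minimal sketch of the vector ADT that combine4 assumes. The field and helper names follow the usual CS:APP convention; treat the exact definitions as an assumption rather than the benchmark's actual source.

/* Sketch of the vector abstract data type used by combine4 */
typedef struct {
    int len;        /* number of elements */
    int *data;      /* underlying array of elements */
} vec_rec, *vec_ptr;

/* Return the number of elements in the vector */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Return a pointer to the start of the underlying array */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}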
General Forms of Combining
void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;

    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}
Data Types
- Use different declarations for data_t: int, float, double

Operations
- Use different definitions of OP and IDENT (an illustrative instantiation follows below):
  - + / 0
  - * / 1
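As an illustrative instantiation (the names data_t, OP, and IDENT come from the slide; the exact build setup is an assumption), the integer-sum variant could be configured like this:

/* Illustrative configuration for the integer "sum" variant of
   abstract_combine4 (a sketch; the benchmark's actual build setup
   may differ).  Only these three definitions change between the
   int/float/double and add vs. multiply variants. */
typedef int data_t;
#define IDENT 0     /* identity element for addition       */
#define OP    +     /* combining operator, used as: t OP x */

/* With these definitions the loop body
       t = t OP data[i];
   expands to
       t = t + data[i];
   For the product variants one would instead use IDENT 1 and OP *. */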
Machine-Independent Opt. Results
Optimizations
- Reduce function calls and memory references within the loop

Method             Integer           Floating Point
                    +        *         +         *
Abstract -g       42.06    41.86     41.44    160.00
Abstract -O2      31.25    33.25     31.25    143.00
Move vec_length   20.66    21.25     21.15    135.00
data access        6.00     9.00      8.00    117.00
Accum. in temp     2.00     4.00      3.00      5.00
Performance Anomaly
- Computing the FP product of all elements is exceptionally slow
- Very large speedup when accumulating in a temporary
- Caused by a quirk of IA32 floating point
  - Memory uses a 64-bit format; registers use 80 bits
  - Benchmark data caused overflow of 64 bits, but not 80 (a small demonstration follows below)
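A rough way to see the 64-bit vs. 80-bit effect. This is a sketch only; it assumes an IA32-style target where long double is the x87 80-bit extended format.

#include <stdio.h>

int main(void)
{
    /* Build a running product that exceeds the 64-bit double range
       (~1.8e308) but stays well inside the 80-bit extended range
       (~1.2e4932): 400 factors of 1e2 give 1e800. */
    double      p64 = 1.0;
    long double p80 = 1.0L;
    int i;

    for (i = 0; i < 400; i++) {
        p64 *= 1e2;
        p80 *= 1e2L;
    }
    /* Typically prints: 64-bit: inf   80-bit: 1e+800 */
    printf("64-bit: %g   80-bit: %Lg\n", p64, p80);
    return 0;
}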
Pointer Code
void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;

    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}
Optimization
- Use pointers rather than array references
- CPE: 3.00 (compiled with -O2)
  - Oops! We're not making progress here!

Warning: Some compilers do a better job optimizing array code
Pointer vs. Array Code Inner Loops
Array Code
.L24:                            # Loop:
    addl (%eax,%edx,4),%ecx      #   sum += data[i]
    incl %edx                    #   i++
    cmpl %esi,%edx               #   i:length
    jl .L24                      #   if < goto Loop

Pointer Code
.L30:                            # Loop:
    addl (%eax),%ecx             #   sum += *data
    addl $4,%eax                 #   data++
    cmpl %edx,%eax               #   data:dend
    jb .L30                      #   if < goto Loop

Performance
- Array code: 4 instructions in 2 clock cycles
- Pointer code: almost the same 4 instructions in 3 clock cycles
Modern CPU Design
[Block diagram: the Instruction Control Unit (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File) issues operations to the Execution Unit, whose functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store, Data Cache) return operation results, register updates, and prediction-OK feedback.]
CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
- 1 load
- 1 store
- 2 integer (one may be a branch)
- 1 FP addition
- 1 FP multiplication or division

Some Instructions Take > 1 Cycle, but Can Be Pipelined

Instruction                  Latency   Cycles/Issue
Load / Store                    3           1
Integer Multiply                4           1
Integer Divide                 36          36
Double/Single FP Multiply       5           2
Double/Single FP Add            3           1
Double/Single FP Divide        38          38
Instruction Control
[Diagram: within the Instruction Control Unit, fetch control and the instruction cache feed the instruction decoder, which emits operations; the retirement unit and register file track results.]

Grabs Instruction Bytes From Memory
- Based on current PC + predicted targets for predicted branches
- Hardware dynamically guesses whether branches are taken/not taken and (possibly) the branch target

Translates Instructions Into Operations
- Primitive steps required to perform the instruction
- A typical instruction requires 1–3 operations

Converts Register References Into Tags
- An abstract identifier linking the destination of one operation with the sources of later operations
Translation Example
Version of Combine4
- Integer data, multiply operation

.L24:                            # Loop:
    imull (%eax,%edx,4),%ecx     #   t *= data[i]
    incl %edx                    #   i++
    cmpl %esi,%edx               #   i:length
    jl .L24                      #   if < goto Loop

Translation of First Iteration
    imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)   -> t.1
                                  imull t.1, %ecx.0      -> %ecx.1
    incl %edx                 ->  incl %edx.0            -> %edx.1
    cmpl %esi,%edx            ->  cmpl %esi, %edx.1      -> cc.1
    jl .L24                   ->  jl-taken cc.1
Translation Example #1
    imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)   -> t.1
                                  imull t.1, %ecx.0      -> %ecx.1

- Split into two operations
  - load reads from memory to generate temporary result t.1
  - The multiply operation just operates on registers
- Operands
  - Register %eax does not change in the loop; its value is retrieved from the register file during decoding
  - Register %ecx changes on every iteration; uniquely identify the different versions as %ecx.0, %ecx.1, %ecx.2, ...
    - Register renaming (a rough software sketch of the versioning idea follows below)
    - Values passed directly from producer to consumers
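As a rough illustration of the versioning idea only (a software sketch under assumed names; real hardware implements renaming with physical registers and a map table), each architectural register can be modeled as carrying a current version tag that is bumped on every write:

#include <stdio.h>

/* Hypothetical model: one version counter per architectural register. */
static int version[8];                 /* IA32 integer registers, ids 0..7 */

static int read_tag(int reg)  { return version[reg];   }  /* source uses current tag    */
static int write_tag(int reg) { return ++version[reg]; }  /* destination gets a new tag */

int main(void)
{
    enum { ECX = 1, EDX = 2 };

    /* "incl %edx" in iteration 1: reads %edx.0, produces %edx.1 */
    int src = read_tag(EDX);
    int dst = write_tag(EDX);
    printf("incl %%edx: reads %%edx.%d, writes %%edx.%d\n", src, dst);

    /* "imull t.1,%ecx" then reads %ecx.0 and produces %ecx.1 */
    src = read_tag(ECX);
    dst = write_tag(ECX);
    printf("imull: reads %%ecx.%d, writes %%ecx.%d\n", src, dst);
    return 0;
}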
Translation Example #2
    incl %edx  ->  incl %edx.0  -> %edx.1

- Register %edx changes on each iteration; rename as %edx.0, %edx.1, %edx.2, ...
Translation Example #3
    cmpl %esi,%edx  ->  cmpl %esi, %edx.1  -> cc.1

- Condition codes are treated similarly to registers
- Assign a tag to define the connection between producer and consumer
Translation Example #4
    jl .L24  ->  jl-taken cc.1

- The instruction control unit determines the destination of the jump
- Predicts whether it will be taken and the target
- Starts fetching instructions at the predicted destination
- The execution unit simply checks whether or not the prediction was OK
- If not, it signals instruction control
  - Instruction control then "invalidates" any operations generated from misfetched instructions
  - Begins fetching and decoding instructions at the correct target
Visualizing Operations
    load (%eax,%edx.0,4)   -> t.1
    imull t.1, %ecx.0      -> %ecx.1
    incl %edx.0            -> %edx.1
    cmpl %esi, %edx.1      -> cc.1
    jl-taken cc.1

[Dataflow diagram: %edx.0 feeds load and incl; load produces t.1 for imull; incl produces %edx.1 for cmpl, which produces cc.1 for jl; imull consumes %ecx.0 and produces %ecx.1. Time flows downward.]

Operations
- Vertical position denotes the time at which an operation executes
  - Cannot begin an operation until its operands are available
- Height denotes latency

Operands
- Arcs shown only for operands that are passed within the execution unit
Visualizing Operations (cont.)
    load (%eax,%edx.0,4)   -> t.1
    iaddl t.1, %ecx.0      -> %ecx.1
    incl %edx.0            -> %edx.1
    cmpl %esi, %edx.1      -> cc.1
    jl-taken cc.1

[Dataflow diagram: same structure as before, but with addl in place of imull.]

Operations
- Same as before, except that add has a latency of 1
3 Iterations of Combining Product
Unlimited Resource Analysis
- Assume an operation can start as soon as its operands are available
- Operations for multiple iterations overlap in time

[Dataflow diagram, cycles 1-15: for iterations 1-3 (i = 0, 1, 2) the load/incl/cmpl/jl operations of successive iterations overlap, but the imull operations form a serial chain %ecx.0 -> %ecx.1 -> %ecx.2 -> %ecx.3, one multiply every 4 cycles.]

Performance
- The limiting factor becomes the latency of the integer multiplier
- Gives CPE of 4.0
4 Iterations of Combining Sum
[Dataflow diagram: iterations 1-4 (i = 0..3) of the sum loop; with addl latency 1, each iteration's 4 integer operations (load, addl, incl, cmpl/jl) could start one cycle after the previous iteration's.]

Unlimited Resource Analysis
- Can begin a new iteration on each clock cycle
- Should give CPE of 1.0
- Would require executing 4 integer operations in parallel

Performance
- See the resource-constrained analysis on the next slide
Combining Sum: Resource Constraints
[Dataflow diagram: iterations 4-8 (i = 3..7) of the sum loop under a two-integer-unit constraint.]

- Only have two integer functional units
- Some operations are delayed even though their operands are available
- Priority is set based on program order

Performance
- Sustain CPE of 2.0
Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;

    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
        sum += data[i] + data[i+1] + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
- Combine multiple iterations into a single loop body
- Amortizes loop overhead across multiple iterations
- Finish extras at the end
- Measured CPE = 1.33
Visualizing Unrolled Loop
- Loads can pipeline, since they don't have dependencies
- Only one set of loop control operations

    load (%eax,%edx.0,4)    -> t.1a
    iaddl t.1a, %ecx.0c     -> %ecx.1a
    load 4(%eax,%edx.0,4)   -> t.1b
    iaddl t.1b, %ecx.1a     -> %ecx.1b
    load 8(%eax,%edx.0,4)   -> t.1c
    iaddl t.1c, %ecx.1b     -> %ecx.1c
    iaddl $3,%edx.0         -> %edx.1
    cmpl %esi, %edx.1       -> cc.1
    jl-taken cc.1

[Dataflow diagram: the three loads issue back-to-back; the three addls form a short serial chain %ecx.0c -> %ecx.1a -> %ecx.1b -> %ecx.1c. Time flows downward.]
Executing with Loop Unrolling
[Dataflow diagram, cycles 5-15: iterations 3 and 4 of the unrolled loop (i = 6 and i = 9); each iteration's three loads and three adds overlap with the next iteration's.]

Predicted Performance
- Can complete an iteration in 3 cycles
- Should give CPE of 1.0

Measured Performance
- CPE of 1.33
- One iteration every 4 cycles
Effect of Unrolling
Unrolling Degree     1      2      3      4      8      16
Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
Integer Product    4.00 (all degrees)
FP Sum             3.00 (all degrees)
FP Product         5.00 (all degrees)

- Only helps the integer sum for our examples
  - Other cases are constrained by functional unit latencies
- Effect is nonlinear with the degree of unrolling
  - Many subtle effects determine the exact scheduling of operations (see the degree-4 sketch below)
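For concreteness, a degree-4 version follows the same pattern as combine5. This is a sketch of the idea (the function name is made up for illustration), not the measured benchmark source:

void combine5x4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 3;          /* leave room for 4 elements per step */
    int *data = get_vec_start(v);
    int sum = 0;
    int i;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += 4)
        sum += data[i] + data[i+1] + data[i+2] + data[i+3];
    /* Finish any remaining elements */
    for (; i < length; i++)
        sum += data[i];
    *dest = sum;
}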
Serial Computation

Computation
    ((((((((((((1 * x0) * x1) * x2) * x3)
        * x4) * x5) * x6) * x7) * x8) * x9)
        * x10) * x11)

[Diagram: a single chain of multiplications, each result feeding the next.]

Performance
- N elements, D cycles/operation
- N*D cycles (e.g., 12 elements at 4 cycles per integer multiply is about 48 cycles)
Parallel Loop Unrolling
void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
- Integer product

Optimization
- Accumulate in two different products
  - Can be performed simultaneously
- Combine at the end

Performance
- CPE = 2.0
- 2X performance
Dual Product Computation
Computation
    ((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
    ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Diagram: two independent multiplication chains, one over the even-indexed elements and one over the odd-indexed elements, combined by a final multiply.]

Performance
- N elements, D cycles/operation
- (N/2 + 1)*D cycles
- ~2X performance improvement
Requirements for Parallel Computation
Mathematical
- Combining operation must be associative & commutative
  - OK for integer multiplication
  - Not strictly true for floating point (see the example below)
    - OK for most applications

Hardware
- Pipelined functional units
- Ability to dynamically extract parallelism from code
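A minimal example of why reassociation is not strictly valid for floating point; the values are chosen only to make the rounding effect visible:

#include <stdio.h>

int main(void)
{
    float big = 1e20f, small = 1.0f;

    /* Left-to-right order: big + small rounds back to big,
       so the small value is lost entirely. */
    float serial  = (big + small) - big;
    /* Reassociated order keeps the small value. */
    float reassoc = big + (small - big);

    /* Typically prints: serial = 0.000000, reassociated = 1.000000 */
    printf("serial = %f, reassociated = %f\n", serial, reassoc);
    return 0;
}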
Visualizing Parallel Loop
- The two multiplies within the loop no longer have a data dependency
- This allows them to pipeline

    load (%eax,%edx.0,4)    -> t.1a
    imull t.1a, %ecx.0      -> %ecx.1
    load 4(%eax,%edx.0,4)   -> t.1b
    imull t.1b, %ebx.0      -> %ebx.1
    iaddl $2,%edx.0         -> %edx.1
    cmpl %esi, %edx.1       -> cc.1
    jl-taken cc.1

[Dataflow diagram: the two loads and the two imulls proceed in parallel; %ecx and %ebx carry independent product chains. Time flows downward.]
Executing with Parallel Loop
[Dataflow diagram, cycles 1-16: iterations 1-3 (i = 0, 2, 4); the multiplier stays busy with two overlapping multiplications per iteration.]

Predicted Performance
- Can keep the 4-cycle multiplier busy performing two simultaneous multiplications
- Gives CPE of 2.0
Optimization Results for Combining
Method             Integer           Floating Point
                    +        *         +         *
Abstract -g       42.06    41.86     41.44    160.00
Abstract -O2      31.25    33.25     31.25    143.00
Move vec_length   20.66    21.25     21.15    135.00
data access        6.00     9.00      8.00    117.00
Accum. in temp     2.00     4.00      3.00      5.00
Pointer            3.00     4.00      3.00      5.00
Unroll 4           1.50     4.00      3.00      5.00
Unroll 16          1.06     4.00      3.00      5.00
2X2                1.50     2.00      2.00      2.50
4X4                1.50     2.00      1.50      2.50
8X4                1.25     1.25      1.50      2.00
Theoretical Opt.   1.00     1.00      1.00      2.00
Worst : Best       39.7     33.5      27.6      80.0
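In the table, aXb denotes unrolling by a with b parallel accumulators. A sketch of what a 4X4 integer-sum variant might look like (illustrative only; the function name is made up and the actual benchmark source may differ):

void combine_4x4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 3;
    int *data = get_vec_start(v);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* four independent accumulators */
    int i;

    /* Unroll by 4, one element per accumulator */
    for (i = 0; i < limit; i += 4) {
        s0 += data[i];
        s1 += data[i+1];
        s2 += data[i+2];
        s3 += data[i+3];
    }
    /* Finish any remaining elements */
    for (; i < length; i++)
        s0 += data[i];
    *dest = s0 + s1 + s2 + s3;
}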
Parallel Unrolling: Method #2
void combine6aa(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x *= (data[i] * data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x *= data[i];
    }
    *dest = x;
}

Code Version
- Integer product

Optimization
- Multiply pairs of elements together
- And then update the product
- "Tree height reduction"

Performance
- CPE = 2.5
Method #2 Computation
Computation
    ((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5))
        * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

[Diagram: adjacent elements are multiplied pairwise; the pair products feed a single accumulation chain.]

Performance
- N elements, D cycles/operation
- Should be (N/2 + 1)*D cycles
  - CPE = 2.0
- Measured CPE is worse:

Unrolling    CPE (measured)    CPE (theoretical)
    2             2.50               2.00
    3             1.67               1.33
    4             1.50               1.00
    6             1.78               1.00
Understanding Parallelism

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = (x * data[i]) * data[i+1];
    }

- CPE = 4.00
- All multiplies performed in sequence

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x * (data[i] * data[i+1]);
    }

- CPE = 2.50
- Multiplies overlap

[Diagram: the first form produces one long serial chain of multiplies; in the second form the pairwise multiplies x0*x1, x2*x3, ... proceed in parallel with the accumulation chain.]
Limitations of Parallel Execution
Need Lots of Registers
- To hold sums/products
- Only 6 usable integer registers
  - Also needed for pointers, loop conditions
- 8 FP registers
- When there are not enough registers, temporaries must be spilled onto the stack
  - Wipes out any performance gains
- Not helped by renaming
  - Cannot reference more operands than the instruction set allows
  - Major drawback of the IA32 instruction set
Register Spilling Example
Example
- 8 X 8 integer product (a C-level sketch follows the assembly below)
- 7 local variables share 1 register
- Note that locals are being stored on the stack
  - E.g., at -8(%ebp)

.L165:
    imull (%eax),%ecx
    movl -4(%ebp),%edi
    imull 4(%eax),%edi
    movl %edi,-4(%ebp)
    movl -8(%ebp),%edi
    imull 8(%eax),%edi
    movl %edi,-8(%ebp)
    movl -12(%ebp),%edi
    imull 12(%eax),%edi
    movl %edi,-12(%ebp)
    movl -16(%ebp),%edi
    imull 16(%eax),%edi
    movl %edi,-16(%ebp)
    ...
    addl $32,%eax
    addl $8,%edx
    cmpl -32(%ebp),%edx
    jl .L165
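The assembly above is consistent with an 8-way unrolled product using 8 accumulators, roughly like the sketch below (a reconstruction for illustration, not the actual benchmark source; with only 6 usable integer registers, most of p0..p7 end up spilled to the stack as shown above):

void combine_8x8(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 7;
    int *data = get_vec_start(v);
    int p0 = 1, p1 = 1, p2 = 1, p3 = 1;   /* eight independent products:        */
    int p4 = 1, p5 = 1, p6 = 1, p7 = 1;   /* too many to keep in IA32 registers */
    int i;

    /* Unroll by 8, one element per accumulator */
    for (i = 0; i < limit; i += 8) {
        p0 *= data[i];     p1 *= data[i+1];
        p2 *= data[i+2];   p3 *= data[i+3];
        p4 *= data[i+4];   p5 *= data[i+5];
        p6 *= data[i+6];   p7 *= data[i+7];
    }
    /* Finish any remaining elements */
    for (; i < length; i++)
        p0 *= data[i];
    *dest = p0 * p1 * p2 * p3 * p4 * p5 * p6 * p7;
}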
Summary: Results for Pentium III
Method             Integer           Floating Point
                    +        *         +         *
Abstract -g       42.06    41.86     41.44    160.00
Abstract -O2      31.25    33.25     31.25    143.00
Move vec_length   20.66    21.25     21.15    135.00
data access        6.00     9.00      8.00    117.00
Accum. in temp     2.00     4.00      3.00      5.00
Unroll 4           1.50     4.00      3.00      5.00
Unroll 16          1.06     4.00      3.00      5.00
4X2                1.50     2.00      1.50      2.50
8X4                1.25     1.25      1.50      2.00
8X8                1.88     1.88      1.75      2.00
Worst : Best       39.7     33.5      27.6      80.0

- Biggest gain comes from the basic optimizations
- But the last little bit helps
Results for Alpha Processor
Method             Integer           Floating Point
                    +        *         +         *
Abstract -g       40.14    47.14     52.07     53.71
Abstract -O2      25.08    36.05     37.37     32.02
Move vec_length   19.19    32.18     28.73     32.73
data access        6.26    12.52     13.26     13.01
Accum. in temp     1.76     9.01      8.08      8.01
Unroll 4           1.51     9.01      6.32      6.32
Unroll 16          1.25     9.01      6.33      6.22
4X2                1.19     4.69      4.44      4.45
8X4                1.15     4.12      2.34      2.01
8X8                1.11     4.24      2.36      2.08
Worst : Best       36.2     11.4      22.3      26.7

- Overall trends very similar to those for the Pentium III,
  even though the architecture and compiler are very different
Results for Pentium 4
Method             Integer           Floating Point
                    +        *         +         *
Abstract -g       35.25    35.34     35.85     38.00
Abstract -O2      26.52    30.26     31.55     32.00
Move vec_length   18.00    25.71     23.36     24.25
data access        3.39    31.56     27.50     28.35
Accum. in temp     2.00    14.00      5.00      7.00
Unroll 4           1.01    14.00      5.00      7.00
Unroll 16          1.00    14.00      5.00      7.00
4X2                1.02     7.00      2.63      3.50
8X4                1.01     3.98      1.82      2.00
8X8                1.63     4.50      2.42      2.31
Worst : Best       35.2      8.9      19.7      19.0

- Higher latencies (integer * = 14, FP + = 5.0, FP * = 7.0)
  - Clock runs at 2.0 GHz
  - Not an improvement over the 1.0 GHz P3 for integer *
- Avoids the FP multiplication anomaly
What About Branches?
Challenge
- The Instruction Control Unit must work well ahead of the Execution Unit
  - To generate enough operations to keep the EU busy

    80489f3:  movl   $0x1,%ecx
    80489f8:  xorl   %edx,%edx
    80489fa:  cmpl   %esi,%edx
    80489fc:  jnl    8048a25
    80489fe:  movl   %esi,%esi
    8048a00:  imull  (%eax,%edx,4),%ecx

  (here the earliest instructions are already executing while the later
   ones are still being fetched & decoded)

- When it encounters a conditional branch, it cannot reliably determine where to continue fetching
Branch Outcomes
- When the fetcher encounters a conditional branch, it cannot determine where to continue fetching
  - Branch Taken: transfer control to the branch target
  - Branch Not-Taken: continue with the next instruction in sequence
- Cannot resolve until the outcome is determined by the branch/integer unit

    80489f3:  movl   $0x1,%ecx
    80489f8:  xorl   %edx,%edx
    80489fa:  cmpl   %esi,%edx
    80489fc:  jnl    8048a25
    80489fe:  movl   %esi,%esi               <- Branch Not-Taken falls through here
    8048a00:  imull  (%eax,%edx,4),%ecx
       ...
    8048a25:  cmpl   %edi,%edx               <- Branch Taken target
    8048a27:  jl     8048a20
    8048a29:  movl   0xc(%ebp),%eax
    8048a2c:  leal   0xffffffe8(%ebp),%esp
    8048a2f:  movl   %ecx,(%eax)
Branch Prediction
Idea
- Guess which way the branch will go
- Begin executing instructions at the predicted position
  - But don't actually modify register or memory data

    80489f3:  movl   $0x1,%ecx
    80489f8:  xorl   %edx,%edx
    80489fa:  cmpl   %esi,%edx
    80489fc:  jnl    8048a25                 # Predict Taken
    . . .
    8048a25:  cmpl   %edi,%edx               # Execute speculatively
    8048a27:  jl     8048a20                 #   from the predicted
    8048a29:  movl   0xc(%ebp),%eax          #   target
    8048a2c:  leal   0xffffffe8(%ebp),%esp
    8048a2f:  movl   %ecx,(%eax)
Branch Prediction Through Loop
Assume vector length = 100

i = 98  (Predict Taken: OK)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 99  (Predict Taken: Oops, the loop actually exits here)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 100 (executed on the wrong path; the movl reads an invalid location)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 101 (fetched, not yet executed)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1
Branch Misprediction Invalidation
Assume vector length = 100

i = 98  (Predict Taken: OK)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 99  (Predict Taken: Oops)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 100 (Invalidate)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 101 (Invalidate; only partially fetched)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
Branch Misprediction Recovery
Assume vector length = 100

i = 98  (Predict Taken: OK)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1

i = 99  (branch definitely not taken; fetching resumes after the loop)
    80488b1:  movl   (%ecx,%edx,4),%eax
    80488b4:  addl   %eax,(%edi)
    80488b6:  incl   %edx
    80488b7:  cmpl   %esi,%edx
    80488b9:  jl     80488b1
    80488bb:  leal   0xffffffe8(%ebp),%esp
    80488be:  popl   %ebx
    80488bf:  popl   %esi
    80488c0:  popl   %edi

Performance Cost
- A misprediction on the Pentium III wastes ~14 clock cycles
- That's a lot of time on a high-performance processor
Avoiding Branches
On a Modern Processor, Branches Are Very Expensive
- Unless prediction can be reliable
- When possible, best to avoid them altogether

Example
- Compute the maximum of two values
  - 14 cycles when the prediction is correct
  - 29 cycles when incorrect

int max(int x, int y)
{
    return (x < y) ? y : x;
}

    movl 12(%ebp),%edx      # Get y
    movl 8(%ebp),%eax       # rval=x
    cmpl %edx,%eax          # rval:y
    jge L11                 # skip when >=
    movl %edx,%eax          # rval=y
L11:
Avoiding Branches with Bit Tricks
- In the style of Lab #1
- Use masking rather than conditionals

int bmax(int x, int y)
{
    int mask = -(x > y);
    return (mask & x) | (~mask & y);
}

- The compiler still uses a conditional
  - 16 cycles when it predicts correctly
  - 32 cycles when it mispredicts

    xorl %edx,%edx          # mask = 0
    movl 8(%ebp),%eax
    movl 12(%ebp),%ecx
    cmpl %ecx,%eax
    jle L13                 # skip if x<=y
    movl $-1,%edx           # mask = -1
L13:
Avoiding Branches with Bit Tricks
- Force the compiler to generate the desired code

int bvmax(int x, int y)
{
    volatile int t = (x > y);
    int mask = -t;
    return (mask & x) |
           (~mask & y);
}

    movl 8(%ebp),%ecx       # Get x
    movl 12(%ebp),%edx      # Get y
    cmpl %edx,%ecx          # x:y
    setg %al                # (x>y)
    movzbl %al,%eax         # Zero extend
    movl %eax,-4(%ebp)      # Save as t
    movl -4(%ebp),%eax      # Retrieve t

- The volatile declaration forces the value to be written to memory
  - The compiler must therefore generate code to compute t
  - The simplest way is the setg/movzbl combination
- Not very elegant!
  - A hack to get control over the compiler
- 22 clock cycles on all data
  - Better than a misprediction
Conditional Move
- Added with the P6 microarchitecture (PentiumPro onward)
- cmovXXl %edx, %eax
  - If condition XX holds, copy %edx to %eax
  - Doesn't involve any branching
  - Handled as an operation within the Execution Unit

    movl 8(%ebp),%edx       # Get x
    movl 12(%ebp),%eax      # rval=y
    cmpl %edx, %eax         # rval:x
    cmovll %edx,%eax        # If <, rval=x

- The current version of GCC won't use this instruction
  - Thinks it's compiling for a 386
- Performance
  - 14 cycles on all data
Machine-Dependent Opt. Summary
Pointer Code
- Look carefully at the generated code to see whether it actually helps

Loop Unrolling
- Some compilers do this automatically
- Generally not as clever as what can be achieved by hand

Exposing Instruction-Level Parallelism
- Very machine dependent

Warning:
- Benefits depend heavily on the particular machine
- Best if performed by the compiler
  - But GCC on IA32/Linux is not very good
- Do only for performance-critical parts of the code
Role of Programmer
How should I write my programs, given that I have a good, optimizing compiler?

Don't: Smash Code into Oblivion
- Hard to read, maintain, & assure correctness

Do:
- Select the best algorithm
- Write code that's readable & maintainable
  - Procedures, recursion, without built-in constant limits
  - Even though these factors can slow down code
- Eliminate optimization blockers
  - Allows the compiler to do its job

Focus on Inner Loops
- Do detailed optimizations where code will be executed repeatedly
- Will get the most performance gain here