Code Optimization II: Machine-Dependent Optimizations
Oct. 1, 2002
15-213 F'02: "The course that gives CMU its Zip!"  (class11.ppt)

Topics
" Machine-Dependent Optimizations
  » Pointer code
  » Unrolling
  » Enabling instruction-level parallelism
" Understanding Processor Operation
  » Translation of instructions into operations
  » Out-of-order execution of operations
" Branches and Branch Prediction
" Advice

Previous Best Combining Code

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Task
" Compute sum of all elements in vector
" Vector represented by C-style abstract data type
" Achieved CPE of 2.00
  » CPE = cycles per element

General Forms of Combining

void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}

Data Types
" Use different declarations for data_t: int, float, double
Operations
" Use different definitions of OP and IDENT: + / 0, * / 1

Machine Independent Opt. Results

Method             Integer +   Integer *   FP +     FP *
Abstract -g          42.06       41.86      41.44   160.00
Abstract -O2         31.25       33.25      31.25   143.00
Move vec_length      20.66       21.25      21.15   135.00
data access           6.00        9.00       8.00   117.00
Accum. in temp        2.00        4.00       3.00     5.00

Optimizations
" Reduce function calls and memory references within loop

Performance Anomaly
" Computing FP product of all elements exceptionally slow
" Very large speedup when accumulating in a temporary
" Caused by quirk of IA32 floating point
  » Memory uses 64-bit format, registers use 80 bits
  » Benchmark data caused overflow of 64 bits, but not 80
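The 64- versus 80-bit quirk can be seen directly. Below is a minimal sketch (not from the slides): a running product that overflows the 64-bit double range but still fits in the 80-bit x87 extended range, exposed here through long double. The exact behavior depends on compiler and platform (long double is the 80-bit format on typical IA32/x86-64 toolchains).

#include <stdio.h>

int main(void)
{
    double d = 1.0;          /* 64-bit format: max about 1.8e308 */
    long double x = 1.0L;    /* 80-bit x87 extended: max about 1.2e4932 */
    int i;

    for (i = 0; i < 20; i++) {
        d *= 1e30;           /* final value 1e600 overflows double ... */
        x *= 1e30L;          /* ... but fits comfortably in extended precision */
    }
    printf("double: %g   long double: %Lg\n", d, x);
    return 0;
}

On such a platform the expected output is "double: inf   long double: 1e+600" -- the same computation overflows or not depending on which format holds the accumulator.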
Pointer Code

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data+length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}

Optimization
" Use pointers rather than array references
" CPE: 3.00 (Compiled -O2)
  » Oops! We're not making progress here!
Warning: Some compilers do better job optimizing array code

Pointer vs. Array Code Inner Loops

Array Code
.L24:                              # Loop:
    addl (%eax,%edx,4),%ecx        #   sum += data[i]
    incl %edx                      #   i++
    cmpl %esi,%edx                 #   i:length
    jl .L24                        #   if < goto Loop

Pointer Code
.L30:                              # Loop:
    addl (%eax),%ecx               #   sum += *data
    addl $4,%eax                   #   data++
    cmpl %edx,%eax                 #   data:dend
    jb .L30                        #   if < goto Loop

Performance
" Array Code: 4 instructions in 2 clock cycles
" Pointer Code: Almost same 4 instructions in 3 clock cycles

Modern CPU Design
[Block diagram: Instruction Control (Fetch Control, Instruction Cache, Instruction Decode, Register File, Retirement Unit) sends Operations to the Execution unit; functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) exchange Addr./Data with the Data Cache; Operation Results, Register Updates, and "Prediction OK?" signals flow back to Instruction Control.]

CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
" 1 load
" 1 store
" 2 integer (one may be branch)
" 1 FP Addition
" 1 FP Multiplication or Division

Some Instructions Take > 1 Cycle, but Can Be Pipelined
Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
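The latency/issue distinction above is easy to observe. Here is a minimal microbenchmark sketch (not from the slides): the first loop is one serial chain of multiplies, so it runs at roughly the multiply latency per element; the second spreads the same number of multiplies across four independent chains, so the pipelined multiplier can start a new one almost every cycle. The constants and the clock()-based timing are illustrative assumptions; an optimizing compiler may unroll these loops but cannot remove the dependences.

#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    int a = 1, b = 1, c = 1, d = 1;
    long i;
    clock_t t0;
    volatile int sink;

    t0 = clock();
    for (i = 0; i < N; i++)
        a *= 3;                          /* every multiply waits for the previous one */
    printf("dependent chain:    %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (i = 0; i < N / 4; i++) {
        a *= 3; b *= 3; c *= 3; d *= 3;  /* four independent chains, same total work */
    }
    printf("independent chains: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    sink = a + b + c + d;                /* keep the products live */
    (void)sink;
    return 0;
}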
Instruction Control
[Block diagram: Fetch Control and Instruction Decode read instruction bytes from the Instruction Cache at the fetched Address; decoded Instrs. become Operations; the Register File and Retirement Unit manage architectural state.]

Grabs Instruction Bytes From Memory
" Based on current PC + predicted targets for predicted branches
" Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target
Translates Instructions Into Operations
" Primitive steps required to perform instruction
" Typical instruction requires 1–3 operations
Converts Register References Into Tags
" Abstract identifier linking destination of one operation with sources of later operations

Translation Example
Version of Combine4
" Integer data, multiply operation

.L24:                              # Loop:
    imull (%eax,%edx,4),%ecx       #   t *= data[i]
    incl %edx                      #   i++
    cmpl %esi,%edx                 #   i:length
    jl .L24                        #   if < goto Loop

Translation of First Iteration
    load (%eax,%edx.0,4)           # t.1
    imull t.1, %ecx.0              # %ecx.1
    incl %edx.0                    # %edx.1
    cmpl %esi, %edx.1              # cc.1
    jl-taken cc.1

Translation Example #1
    imull (%eax,%edx,4),%ecx
becomes
    load (%eax,%edx.0,4)           # t.1
    imull t.1, %ecx.0              # %ecx.1

Split into two operations
" load reads from memory to generate temporary result t.1
" Multiply operation just operates on registers
Operands
" Register %eax does not change in loop. Values will be retrieved from register file during decoding
" Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
  » Register renaming
  » Values passed directly from producer to consumers

Translation Example #2
    incl %edx
becomes
    incl %edx.0                    # %edx.1

" Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, …

Translation Example #3
    cmpl %esi,%edx
becomes
    cmpl %esi, %edx.1              # cc.1

" Condition codes are treated similar to registers
" Assign tag to define connection between producer and consumer

Translation Example #4
    jl .L24
becomes
    jl-taken cc.1

" Instruction control unit determines destination of jump
" Predicts whether branch will be taken and its target
" Starts fetching instructions at predicted destination
" Execution unit simply checks whether or not prediction was OK
" If not, it signals instruction control
  » Instruction control then "invalidates" any operations generated from misfetched instructions
  » Begins fetching and decoding instructions at correct target
Visualizing Operations

    load (%eax,%edx,4)             # t.1
    imull t.1, %ecx.0              # %ecx.1
    incl %edx.0                    # %edx.1
    cmpl %esi, %edx.1              # cc.1
    jl-taken cc.1

[Data-flow figure: load feeds imull (producing %ecx.1 from %ecx.0 and t.1); incl produces %edx.1, which feeds cmpl; cmpl produces cc.1, which feeds jl.]

Operations
" Vertical position denotes time at which executed
  » Cannot begin operation until operands available
" Height denotes latency
Operands
" Arcs shown only for operands that are passed within execution unit

Visualizing Operations (cont.)

    load (%eax,%edx,4)             # t.1
    iaddl t.1, %ecx.0              # %ecx.1
    incl %edx.0                    # %edx.1
    cmpl %esi, %edx.1              # cc.1
    jl-taken cc.1

[Data-flow figure: same structure, with the one-cycle addl in place of the multiply.]
" Same as before, except that add has latency of 1
3 Iterations of Combining Product

[Figure: cycle-by-cycle data-flow for iterations 1–3 (i = 0, 1, 2). The imull operations form a serial chain through %ecx.1, %ecx.2, %ecx.3, while each iteration's load, incl, cmpl, and jl overlap with the next.]

Unlimited Resource Analysis
" Assume operation can start as soon as operands available
" Operations for multiple iterations overlap in time
Performance
" Limiting factor becomes latency of integer multiplier
" Gives CPE of 4.0

4 Iterations of Combining Sum

[Figure: same analysis with the one-cycle addl in place of imull; iterations 1–4 (i = 0..3) overlap, requiring 4 integer ops per cycle.]

Unlimited Resource Analysis
" Can begin a new iteration on each clock cycle
" Should give CPE of 1.0
" Would require executing 4 integer operations in parallel

Combining Sum: Resource Constraints

[Figure: iterations 4–8 (i = 3..7) on a cycle timeline; with only two integer units, some operations slip to later cycles.]

" Only have two integer functional units
" Some operations delayed even though operands available
" Set priority based on program order
Performance
" Sustain CPE of 2.0

Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+2]
             + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization
" Combine multiple iterations into single loop body
" Amortizes loop overhead across multiple iterations
" Finish extras at end
" Measured CPE = 1.33
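For comparison, here is a sketch (not from the slides; the name combine5x4 is made up) of the same transformation with an unrolling degree of 4. The structure mirrors combine5, with the limit adjusted so the unrolled loop never reads past the end of the vector.

void combine5x4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-3;              /* highest i with 4 elements remaining */
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i+=4) {
        sum += data[i] + data[i+1]
             + data[i+2] + data[i+3];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

As the "Effect of Unrolling" table later in the lecture shows, going from degree 3 to degree 4 does not necessarily help: the measured CPE actually rises from 1.33 to 1.50.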
Visualizing Unrolled Loop
" Loads can pipeline, since they don't have dependencies
" Only one set of loop control operations

    load (%eax,%edx.0,4)           # t.1a
    iaddl t.1a, %ecx.0c            # %ecx.1a
    load 4(%eax,%edx.0,4)          # t.1b
    iaddl t.1b, %ecx.1a            # %ecx.1b
    load 8(%eax,%edx.0,4)          # t.1c
    iaddl t.1c, %ecx.1b            # %ecx.1c
    iaddl $3,%edx.0                # %edx.1
    cmpl %esi, %edx.1              # cc.1
    jl-taken cc.1

[Figure: data-flow graph of one unrolled iteration; the three loads are independent, but the three adds form a single chain through %ecx.1a, %ecx.1b, %ecx.1c, with one addl/cmpl/jl for loop control.]

Executing with Loop Unrolling

[Figure: iterations i = 6 and i = 9 overlapped on a cycle timeline (cycles 5–15); each iteration's three adds remain serialized on the %ecx accumulator.]

Predicted Performance
" Can complete iteration in 3 cycles
" Should give CPE of 1.0
Measured Performance
" CPE of 1.33
" One iteration every 4 cycles
Effect of Unrolling

Unrolling Degree     1      2      3      4      8      16
Integer Sum          2.00   1.50   1.33   1.50   1.25   1.06
Integer Product      4.00 at every degree
FP Sum               3.00 at every degree
FP Product           5.00 at every degree

" Only helps integer sum for our examples
  » Other cases constrained by functional unit latencies
" Effect is nonlinear with degree of unrolling
  » Many subtle effects determine exact scheduling of operations

Serial Computation

Computation
  ((((((((((((1 * x0) * x1) * x2) * x3)
    * x4) * x5) * x6) * x7) * x8) * x9)
    * x10) * x11)

[Figure: a single chain of multiplications, x0 through x11, each step depending on the previous partial product.]

Performance
" N elements, D cycles/operation
" N*D cycles
Parallel Loop Unrolling

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Code Version
" Integer product
Optimization
" Accumulate in two different products
  » Can be performed simultaneously
" Combine at end
Performance
" CPE = 2.0
" 2X performance

Dual Product Computation

Computation
  ((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
  ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Figure: two independent multiplication chains, one over the even-indexed elements and one over the odd-indexed elements, combined by a final multiply.]

Performance
" N elements, D cycles/operation
" (N/2+1)*D cycles
" ~2X performance improvement
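The same two-accumulator idea applies to a floating-point sum, but the programmer has to do it by hand: because FP addition is not associative (see the next slide), the compiler will not perform this reassociation itself. A minimal sketch, assuming the vector interface from combine4 with data_t defined as double; the function name is made up:

void combine6_fp(vec_ptr v, double *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    double *data = get_vec_start(v);   /* assumes the vector holds doubles */
    double s0 = 0.0;
    double s1 = 0.0;
    int i;
    /* Two independent accumulation chains */
    for (i = 0; i < limit; i+=2) {
        s0 += data[i];
        s1 += data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++) {
        s0 += data[i];
    }
    *dest = s0 + s1;
}

Note that the result can differ slightly from the serial sum, since the additions are reassociated.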
Requirements for Parallel Computation

Mathematical
" Combining operation must be associative & commutative
  » OK for integer multiplication
  » Not strictly true for floating point
    (OK for most applications)
Hardware
" Pipelined functional units
" Ability to dynamically extract parallelism from code

Visualizing Parallel Loop
" Two multiplies within loop no longer have data dependency
" Allows them to pipeline

    load (%eax,%edx.0,4)           # t.1a
    imull t.1a, %ecx.0             # %ecx.1
    load 4(%eax,%edx.0,4)          # t.1b
    imull t.1b, %ebx.0             # %ebx.1
    iaddl $2,%edx.0                # %edx.1
    cmpl %esi, %edx.1              # cc.1
    jl-taken cc.1

[Figure: data-flow graph of one iteration; the two load/imull pairs feed separate accumulators %ecx and %ebx, with addl/cmpl/jl for loop control.]
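Returning to the associativity caveat above, here is a minimal example (not from the slides) of why reassociating floating-point operations can change the answer -- which is why a compiler will not split an FP combining loop into parallel accumulators on its own:

#include <stdio.h>

int main(void)
{
    /* Values chosen so the two groupings give different results */
    double a = 1e20, b = -1e20, c = 1.0;

    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0: c is lost when rounded into b */
    return 0;
}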
Executing with Parallel Loop

[Figure: cycle timeline (cycles 1–16) for iterations i = 0, 2, 4; the two imull chains through %ecx and %ebx overlap inside the pipelined multiplier.]

Predicted Performance
" Can keep 4-cycle multiplier busy performing two simultaneous multiplications
" Gives CPE of 2.0

Optimization Results for Combining

Method             Integer +   Integer *   FP +     FP *
Abstract -g          42.06       41.86      41.44   160.00
Abstract -O2         31.25       33.25      31.25   143.00
Move vec_length      20.66       21.25      21.15   135.00
data access           6.00        9.00       8.00   117.00
Accum. in temp        2.00        4.00       3.00     5.00
Pointer               3.00        4.00       3.00     5.00
Unroll 4              1.50        4.00       3.00     5.00
Unroll 16             1.06        4.00       3.00     5.00
2X2                   1.50        2.00       2.00     2.50
4X4                   1.50        2.00       1.50     2.50
8X4                   1.25        1.25       1.50     2.00
Theoretical Opt.      1.00        1.00       1.00     2.00
Worst : Best         39.7        33.5       27.6     80.0

Parallel Unrolling: Method #2

void combine6aa(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x *= (data[i] * data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x *= data[i];
    }
    *dest = x;
}
Code Version
" Integer product
Optimization
" Multiply pairs of elements together
" And then update product
" "Tree height reduction"
Performance
" CPE = 2.5

Method #2 Computation

Computation
  ((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5))
    * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

[Figure: adjacent pairs (x0*x1), (x2*x3), … are multiplied first, then folded into the running product.]

Performance
" N elements, D cycles/operation
" Should be (N/2+1)*D cycles, CPE = 2.0
" Measured CPE worse

Unrolling   CPE (measured)   CPE (theoretical)
2           2.50             2.00
3           1.67             1.33
4           1.50             1.00
6           1.78             1.00
Understanding Parallelism

/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
    x = (x * data[i]) * data[i+1];
}
" CPE = 4.00
" All multiplies performed in sequence

/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
    x = x * (data[i] * data[i+1]);
}
" CPE = 2.50
" Multiplies overlap

[Figure: the first form is a single serial chain of multiplies; in the second, each data[i] * data[i+1] is independent of the running product, so those multiplies can overlap with it.]

Limitations of Parallel Execution

Need Lots of Registers
" To hold sums/products
" Only 6 usable integer registers
  » Also needed for pointers, loop conditions
" 8 FP registers
" When not enough registers, must spill temporaries onto stack
  » Wipes out any performance gains
Not helped by renaming
" Cannot reference more operands than instruction set allows
" Major drawback of IA32 instruction set
Register Spilling Example

Example
" 8 X 8 integer product
" 7 local variables share 1 register
" See that locals are being stored on the stack
  » E.g., at -8(%ebp)

.L165:
    imull (%eax),%ecx
    movl -4(%ebp),%edi
    imull 4(%eax),%edi
    movl %edi,-4(%ebp)
    movl -8(%ebp),%edi
    imull 8(%eax),%edi
    movl %edi,-8(%ebp)
    movl -12(%ebp),%edi
    imull 12(%eax),%edi
    movl %edi,-12(%ebp)
    movl -16(%ebp),%edi
    imull 16(%eax),%edi
    movl %edi,-16(%ebp)
    …
    addl $32,%eax
    addl $8,%edx
    cmpl -32(%ebp),%edx
    jl .L165

Summary: Results for Pentium III

Method             Integer +   Integer *   FP +     FP *
Abstract -g          42.06       41.86      41.44   160.00
Abstract -O2         31.25       33.25      31.25   143.00
Move vec_length      20.66       21.25      21.15   135.00
data access           6.00        9.00       8.00   117.00
Accum. in temp        2.00        4.00       3.00     5.00
Unroll 4              1.50        4.00       3.00     5.00
Unroll 16             1.06        4.00       3.00     5.00
4X2                   1.50        2.00       1.50     2.50
8X4                   1.25        1.25       1.50     2.00
8X8                   1.88        1.88       1.75     2.00
Worst : Best         39.7        33.5       27.6     80.0

" Biggest gain doing basic optimizations
" But, last little bit helps
Results for Alpha Processor

Method             Integer +   Integer *   FP +     FP *
Abstract -g          40.14       47.14      52.07    53.71
Abstract -O2         25.08       36.05      37.37    32.02
Move vec_length      19.19       32.18      28.73    32.73
data access           6.26       12.52      13.26    13.01
Accum. in temp        1.76        9.01       8.08     8.01
Unroll 4              1.51        9.01       6.32     6.32
Unroll 16             1.25        9.01       6.33     6.22
4X2                   1.19        4.69       4.44     4.45
8X4                   1.15        4.12       2.34     2.01
8X8                   1.11        4.24       2.36     2.08
Worst : Best         36.2        11.4       22.3     26.7

" Overall trends very similar to those for Pentium III, even though very different architecture and compiler

Results for Pentium 4

Method             Integer +   Integer *   FP +     FP *
Abstract -g          35.25       35.34      35.85    38.00
Abstract -O2         26.52       30.26      31.55    32.00
Move vec_length      18.00       25.71      23.36    24.25
data access           3.39       31.56      27.50    28.35
Accum. in temp        2.00       14.00       5.00     7.00
Unroll 4              1.01       14.00       5.00     7.00
Unroll 16             1.00       14.00       5.00     7.00
4X2                   1.02        7.00       2.63     3.50
8X4                   1.01        3.98       1.82     2.00
8X8                   1.63        4.50       2.42     2.31
Worst : Best         35.2         8.9       19.7     19.0

" Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
  » Clock runs at 2.0 GHz
  » Not an improvement over 1.0 GHz P3 for integer *
" Avoids FP multiplication anomaly

What About Branches?

Challenge
" Instruction Control Unit must work well ahead of Exec. Unit
  » To generate enough operations to keep EU busy

80489f3:  movl  $0x1,%ecx               # Executing
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx
80489fc:  jnl   8048a25
80489fe:  movl  %esi,%esi               # Fetching & Decoding
8048a00:  imull (%eax,%edx,4),%ecx

" When encounters conditional branch, cannot reliably determine where to continue fetching

Branch Outcomes
" When encounter conditional branch, cannot determine where to continue fetching
  » Branch Taken: Transfer control to branch target
  » Branch Not-Taken: Continue with next instruction in sequence
" Cannot resolve until outcome determined by branch/integer unit

80489f3:  movl  $0x1,%ecx
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx
80489fc:  jnl   8048a25                 # Taken: continue at 8048a25
80489fe:  movl  %esi,%esi               # Not-Taken: fall through here
8048a00:  imull (%eax,%edx,4),%ecx

8048a25:  cmpl  %edi,%edx               # Branch Taken target
8048a27:  jl    8048a20
8048a29:  movl  0xc(%ebp),%eax
8048a2c:  leal  0xffffffe8(%ebp),%esp
8048a2f:  movl  %ecx,(%eax)
Branch Prediction

Idea
" Guess which way branch will go
" Begin executing instructions at predicted position
  » But don't actually modify register or memory data

80489f3:  movl  $0x1,%ecx
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx
80489fc:  jnl   8048a25                 # Predict Taken
   . . .
8048a25:  cmpl  %edi,%edx               # Execute
8048a27:  jl    8048a20
8048a29:  movl  0xc(%ebp),%eax
8048a2c:  leal  0xffffffe8(%ebp),%esp
8048a2f:  movl  %ecx,(%eax)
Branch Prediction Through Loop

Assume vector length = 100

80488b1:  movl  (%ecx,%edx,4),%eax
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 98
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                 # Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 99
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                 # Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 100
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                 # Predict Taken (Oops)

80488b1:  movl  (%ecx,%edx,4),%eax      # Read invalid location
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 101

[The earlier iterations have been executed; the instructions after the mispredicted branch (i = 100, i = 101) have only been fetched.]

Branch Misprediction Invalidation

Assume vector length = 100
[Same sequence as above: iterations i = 98 and i = 99 are predicted taken and complete correctly; the prediction for i = 100 is wrong, so the operations fetched for i = 100 and i = 101 are invalidated.]

Branch Misprediction Recovery

Assume vector length = 100

80488b1:  movl  (%ecx,%edx,4),%eax
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 98
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                 # Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx                    # i = 99
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                 # Definitely not taken
80488bb:  leal  0xffffffe8(%ebp),%esp
80488be:  popl  %ebx
80488bf:  popl  %esi
80488c0:  popl  %edi

Performance Cost
" Misprediction on Pentium III wastes ~14 clock cycles
" That's a lot of time on a high-performance processor
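The misprediction penalty is easy to provoke. Below is a minimal sketch (not from the slides): the same loop runs once over randomly ordered data, where the branch is essentially a coin flip, and once over sorted data, where the branch outcome comes in long predictable runs. Array size, threshold, and the clock()-based timing are illustrative assumptions; some compilers convert this particular branch into a conditional move, in which case the gap shrinks -- which is itself the point of the "Avoiding Branches" slides that follow.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000
#define REPS 100

static int cmp_int(const void *p, const void *q)
{
    return *(const int *)p - *(const int *)q;
}

static void run(const char *label, const int *a)
{
    long sum = 0;
    clock_t t0;
    int i, r;

    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            if (a[i] >= 128)           /* data-dependent branch */
                sum += a[i];
    printf("%s: %.2f s (checksum %ld)\n", label,
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
}

int main(void)
{
    static int a[N];
    int i;

    srand(1);
    for (i = 0; i < N; i++)
        a[i] = rand() % 256;
    run("random order (hard to predict)", a);

    qsort(a, N, sizeof(int), cmp_int);
    run("sorted order (easy to predict)", a);
    return 0;
}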
Avoiding Branches

On Modern Processor, Branches Very Expensive
" Unless prediction can be reliable
" When possible, best to avoid altogether
Example
" Compute maximum of two values
  » 14 cycles when prediction correct
  » 29 cycles when incorrect

int max(int x, int y)
{
    return (x < y) ? y : x;
}

    movl 12(%ebp),%edx     # Get y
    movl 8(%ebp),%eax      # rval=x
    cmpl %edx,%eax         # rval:y
    jge L11                # skip when >=
    movl %edx,%eax         # rval=y
L11:

Avoiding Branches with Bit Tricks
" In style of Lab #1
" Use masking rather than conditionals

int bmax(int x, int y)
{
    int mask = -(x>y);
    return (mask & x) | (~mask & y);
}

" Compiler still uses conditional
  » 16 cycles when predict correctly
  » 32 cycles when mispredict

    xorl %edx,%edx         # mask = 0
    movl 8(%ebp),%eax
    movl 12(%ebp),%ecx
    cmpl %ecx,%eax
    jle L13                # skip if x<=y
    movl $-1,%edx          # mask = -1
L13:
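A quick self-contained check (not from the slides) that the masking trick really computes the maximum, including for negative operands; the two functions are repeated here so the test compiles on its own:

#include <stdio.h>
#include <assert.h>

static int max(int x, int y)  { return (x < y) ? y : x; }

static int bmax(int x, int y)
{
    int mask = -(x > y);
    return (mask & x) | (~mask & y);
}

int main(void)
{
    int xs[] = { 0, 1, -1, 42, -42, 2147483647, -2147483647 };
    int n = sizeof(xs) / sizeof(xs[0]);
    int i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            assert(bmax(xs[i], xs[j]) == max(xs[i], xs[j]));
    printf("bmax agrees with max on all %d*%d test pairs\n", n, n);
    return 0;
}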
Avoiding Branches with Bit Tricks (cont.)

Force compiler to generate desired code
" volatile declaration forces value to be written to memory
  » Compiler must therefore generate code to compute t
  » Simplest way is setg/movzbl combination

int bvmax(int x, int y)
{
    volatile int t = (x>y);
    int mask = -t;
    return (mask & x) |
           (~mask & y);
}

    movl 8(%ebp),%ecx      # Get x
    movl 12(%ebp),%edx     # Get y
    cmpl %edx,%ecx         # x:y
    setg %al               # (x>y)
    movzbl %al,%eax        # Zero extend
    movl %eax,-4(%ebp)     # Save as t
    movl -4(%ebp),%eax     # Retrieve t

" Not very elegant!
  » A hack to get control over compiler
Performance
" 22 clock cycles on all data
" Better than misprediction

Conditional Move
" Added with P6 microarchitecture (PentiumPro onward)
" cmovXXl %edx, %eax
  » If condition XX holds, copy %edx to %eax
  » Doesn't involve any branching
  » Handled as operation within Execution Unit

    movl 8(%ebp),%edx      # Get x
    movl 12(%ebp),%eax     # rval=y
    cmpl %edx, %eax        # rval:x
    cmovll %edx,%eax       # If <, rval=x

" Current version of GCC won't use this instruction
  » Thinks it's compiling for a 386
Performance
" 14 cycles on all data
Machine-Dependent Opt. Summary

Pointer Code
" Look carefully at generated code to see whether it is helpful
Loop Unrolling
" Some compilers do this automatically
" Generally not as clever as what can be achieved by hand
Exposing Instruction-Level Parallelism
" Very machine dependent
" Best if performed by compiler
  » But GCC on IA32/Linux is not very good
Warning:
" Benefits depend heavily on particular machine
" Do only for performance-critical parts of code

Role of Programmer

How should I write my programs, given that I have a good, optimizing compiler?

Don't: Smash Code into Oblivion
" Hard to read, maintain, & assure correctness
Do:
" Select best algorithm
" Write code that's readable & maintainable
  » Procedures, recursion, without built-in constant limits
  » Even though these factors can slow down code
" Eliminate optimization blockers
  » Allows compiler to do its job
Focus on Inner Loops
" Do detailed optimizations where code will be executed repeatedly
" Will get most performance gain here