CMPUT680 - Winter 2006
Topic I: Superblock and
Hyperblock Formation
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680
CMPUT 329 - Computer
Organization and Architecture II
1
Instruction Level
Parallelism Optimizations
The objective of an optimizer is to reduce the
number and complexity of the instructions
executed by the processor.
Superscalar or Very Long Instruction Word (VLIW)
processors can reduce the execution time even when
the number of instructions executed moderately
increases, as long as the dependence height is reduced.
Speculative and
Predicated Execution
Speculative Execution: execution of an instruction
before knowing that its execution is required.
Superblock: structure used to implement
compiler-controlled speculative execution.
Predicated Execution: architecture-supported
conditional execution of an instruction based on
the value of a Boolean source operand, referred
to as the predicate of the instruction.
If-conversion: compiler algorithm that converts
conditional branches into predicate-defining
instructions to allow the use of predication.
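The definitions above can be illustrated with a minimal sketch, assuming a toy register machine in which every operation may carry a guarding predicate. The names `PredicatedOp`, `run`, and `IF_CONVERTED` are hypothetical and only serve this illustration. The branch `if (a > 0) x = a - 1; else x = a + 1;` is replaced by two predicate-defining operations and two guarded operations, so the code runs as a straight line.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Regs = Dict[str, int]


@dataclass
class PredicatedOp:
    dest: str                        # register written by the operation
    compute: Callable[[Regs], int]   # value computed from the current registers
    pred: Optional[str] = None       # guarding predicate; None means "always execute"


def run(ops: List[PredicatedOp], regs: Regs) -> Regs:
    """Execute ops in order; an op whose predicate is false has no effect."""
    regs = dict(regs)
    for op in ops:
        if op.pred is None or regs[op.pred]:
            regs[op.dest] = op.compute(regs)
    return regs


# if (a > 0) x = a - 1; else x = a + 1;   after if-conversion:
IF_CONVERTED = [
    PredicatedOp("p", lambda r: int(r["a"] > 0)),        # predicate define
    PredicatedOp("q", lambda r: int(r["a"] <= 0)),       # complementary predicate
    PredicatedOp("x", lambda r: r["a"] - 1, pred="p"),   # former "then" side
    PredicatedOp("x", lambda r: r["a"] + 1, pred="q"),   # former "else" side
]
```

Both guarded operations are always fetched; only the one whose predicate is true updates `x`.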
Trace Scheduling
(Fisher, 1981)
Some optimization and scheduling decisions
may decrease the execution time for one
control path while increasing the execution
time for another path.
Thus decisions should favor more frequently
executed paths to improve overall performance.
Trace scheduling divides a procedure into a set
of frequently executed traces (paths).
Trace Scheduling
There may be conditional branches from the
middle of the trace (side exits) and transitions
from other traces into the middle of the trace
(side entrances).
These control-flow transitions are ignored during
trace scheduling.
After scheduling, bookkeeping is required to ensure
the correct execution of off-trace code.
Bookkeeping for Trace
Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
Scheduled trace: Instr 2, Instr 3, Instr 4, Instr 1, Instr 5
What bookkeeping is required when Instr 1
is moved below the side entrance in the trace?
[Figure: the scheduled trace with a compensation block holding copies of Instr 3 and Instr 4.]
Bookkeeping for Trace
Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
Scheduled trace: Instr 1, Instr 5, Instr 2, Instr 3, Instr 4
What bookkeeping is required when Instr 5
moves above the side entrance in the trace?
[Figure: the scheduled trace with a compensation block holding a copy of Instr 5.]
Superblocks
A superblock is a trace without side entrances, i.e.,
control can only enter from the top, but it can leave
at one or more exit points.
The formation of superblocks creates additional
optimization opportunities because constraints
associated with infrequently executed paths of
control are ignored (thus these constraints do
not inhibit optimizations that favor frequently
executed paths).
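The definition can be checked mechanically. The following is a minimal sketch, assuming the control-flow graph is given as a successor map and the trace as a list of distinct block names; `side_entrances` and the example graph are hypothetical and not taken from the slides. A trace is a superblock exactly when the function returns an empty list.

```python
from typing import Dict, List, Tuple

Cfg = Dict[str, List[str]]


def side_entrances(trace: List[str], succs: Cfg) -> List[Tuple[str, str]]:
    """Return the edges that enter the trace anywhere but at its first block.

    An edge (src, dst) is a side entrance when dst is on the trace, dst is not
    the head of the trace, and src is not dst's predecessor on the trace.
    """
    position = {block: i for i, block in enumerate(trace)}
    entrances = []
    for src, targets in succs.items():
        for dst in targets:
            i = position.get(dst)
            if i is None or i == 0:
                continue              # off-trace target, or entry at the top
            if trace[i - 1] == src:
                continue              # the on-trace edge itself
            entrances.append((src, dst))
    return entrances


# Hypothetical graph: A branches to B or C, and both paths rejoin at D.
EXAMPLE_CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
```

For the trace A, B, D, E in this graph, the edge from C into D is the only side entrance, so the trace is not yet a superblock.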
Superblock Formation
(Example)
[Figure: weighted control-flow graph with blocks Y, D, B, C, E, F, and Z; blocks and edges are annotated with execution counts.]
Superblock Formation
(Example)
[Figure: the weighted control-flow graph with a candidate set of blocks selected.]
Is this a superblock?
No, a superblock cannot have side entrances, and
this set of nodes has two side entrances into
node F. How do we convert it into a superblock?
Superblock Formation
(Example)
[Figure: the control-flow graph after tail duplication, with block F duplicated as F’.]
Tail duplication is the duplication of basic blocks
that appear after a side entrance, in order to eliminate
side entrances and transform a trace into a superblock.
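Tail duplication as described above can be sketched on a successor-map CFG. This is an illustration under stated assumptions, not the algorithm of any particular compiler: blocks on the trace are distinct, copies are named by appending `'`, and edge weights are ignored. The helper names `first_side_entrance` and `tail_duplicate` are hypothetical.

```python
from typing import Dict, List

Cfg = Dict[str, List[str]]


def first_side_entrance(trace: List[str], succs: Cfg) -> int:
    """Index of the first trace block (after the head) that is entered by an
    edge other than the one from its trace predecessor; -1 if there is none."""
    for i in range(1, len(trace)):
        for src, targets in succs.items():
            if trace[i] in targets and src != trace[i - 1]:
                return i
    return -1


def tail_duplicate(trace: List[str], succs: Cfg) -> Cfg:
    """Return a new CFG in which the trace has no side entrances."""
    first = first_side_entrance(trace, succs)
    new = {block: list(targets) for block, targets in succs.items()}
    if first < 0:
        return new                       # already a superblock
    tail = trace[first:]
    copy = {block: block + "'" for block in tail}
    # Each copy keeps the original's successors, but stays inside the
    # duplicated tail whenever the successor was itself duplicated.
    for block in tail:
        new[copy[block]] = [copy.get(t, t) for t in succs.get(block, [])]
    # Every edge into the tail that is not the on-trace edge now targets the copy.
    for src, targets in succs.items():
        redirected = []
        for t in targets:
            if t in copy and trace[trace.index(t) - 1] != src:
                redirected.append(copy[t])
            else:
                redirected.append(t)
        new[src] = redirected
    return new
```

On a graph where A branches to B or C and both rejoin at D, duplicating the tail of the trace A, B, D, E sends C to the copy D' and leaves the trace entered only from the top.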
Common Subexpression
Elimination in Superblocks
Original Code
opA: mul r1,r2,3
opB: add r2,r2,1 (on the path with weight 1)
opC: mul r3,r2,3 (reached from opA with weight 99, and from opB)
Code After Superblock Formation
opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,1 (off the superblock, weight 1)
opC’: mul r3,r2,3 (tail-duplicated copy after opB)
Code After Common Subexpression Elimination
opA: mul r1,r2,3
opC: mov r3,r1
opB: add r2,r2,1
opC’: mul r3,r2,3
Operation Migration in
Superblocks
[Figure: Original Code and After Operation Migration. Instructions shown: mov r0,r1; mov r0,r2; mov r0,r3; add r1,r1,4; add r2,r2,4; add r3,r3,4; with exit targets X, Y, and Z.]
Global Variable Migration
in Superblock Loops
OpA: ld_I r4, x, r0
OpB: add r4, r4, r1
OpC: st_I x, r0, r4
OpE: add r0, r0, 1
OpD: add r1, r1, 1
Original Program Segment
[Figure: flow graph of the segment with edge weights 100 and 0. Successive frames animate the register values and the memory cells reached through MEM[r0+x], as listed below.]
Frame  r0  r1  r4  Memory cells
1      1   1   -   10 20 30
2      1   1   10  10 20 30
3      1   1   11  10 20 30
4      1   1   11  11 20 30
5      1   2   11  11 20 30
6      1   2   11  11 20 30
7      1   2   12  11 20 30
8      1   2   12  12 20 30
9      2   2   12  12 20 30
10     2   2   20  12 20 30
11     2   2   21  12 20 30
12     2   2   21  12 21 30
After Variable Migration
OpA: ld_I r4, x, r0
OpB: add r4, r4, r1
OpD: add r1, r1, 1
OpC’: st_i x, r0, r4
OpE: add r0, r0, 1
OpC: st_i x, r0, r4
[Figure: flow graph after variable migration, with edge weights 100 and 0.]
Superblock Enlarging
Optimizations
By enlarging a superblock, we can provide the
scheduler with more independent instructions
to choose from for each cycle
Superblock enlarging optimizations:
Branch target expansion
Loop unrolling
Loop peeling
Branch Target Expansion
Idea: To expand the superblock with the target
of a likely taken branch.
[Figure: code before and after branch target expansion. Block L1 ends in blt r1, r2, L3; block L2 ends in jump L4; block L3 ends in beq r3, r4, L5. Edge weights shown: 100 and 20.]
Superblock Loops
A superblock loop is a superblock that has a
frequently taken backedge from its last node to
its first node.
We will study the extension of some common
loop optimizations to superblocks.
Dependence Removing
Optimizations
The goal is to eliminate data dependences between
instructions within frequently executed superblocks.
Dependence removing optimizations include:
Register renaming
Accumulator variable expansion
Induction variable expansion
Search variable expansion
Operation combining
Strength reduction
Tree height reduction
Instruction Latencies for
Examples
Function        Latency (cycles)
Int ALU         1
Int multiply    3
Int divide      10
branch          1
Memory load     2
Memory store    1
FP ALU          3
FP conversion   3
FP multiply     3
FP divide       10
Register Renaming
Example
For (j=0; j<n; j++)
{
    C(j) = A(j)+B(j)
}
Original Loop
L1: ld_f  f2, A, r1    (a)
    ld_f  f3, B, r1    (b)
    add_f f4, f2, f3   (c)
    st_f  C, r1, f4    (d)
    add   r1, r1, 4    (e)
    blt   r1, r5, L1   (f)
Assembly Code
For all the examples we assume a superscalar processor with infinite
resources and no register renaming hardware.
Thus for the code above, we obtain the following schedule.
Code Schedule: a and b occupy 2 cycles each, c occupies 3 cycles, and d, e, and f occupy 1 cycle each.
7 cycles / 1 iteration
Loop Unrolling
L1: ld_f  f2, A, r1    (a)
    ld_f  f3, B, r1    (b)
    add_f f4, f2, f3   (c)
    st_f  C, r1, f4    (d)
    add   r1, r1, 4    (e)
    ld_f  f2, A, r1    (f)
    ld_f  f3, B, r1    (g)
    add_f f4, f2, f3   (h)
    st_f  C, r1, f4    (i)
    add   r1, r1, 4    (j)
    ld_f  f2, A, r1    (k)
    ld_f  f3, B, r1    (l)
    add_f f4, f2, f3   (m)
    st_f  C, r1, f4    (n)
    add   r1, r1, 4    (o)
    blt   r1, r5, L1   (p)
After Loop Unrolling
Code Schedule: 19 cycles / 3 iterations = 6.3 cycles / iteration
Loop Unrolling and
Register Renaming
L1: ld_f  f21, A, r11    (a)
    ld_f  f31, B, r11    (b)
    add_f f41, f21, f31  (c)
    st_f  C, r11, f41    (d)
    add   r12, r11, 4    (e)
    ld_f  f22, A, r12    (f)
    ld_f  f32, B, r12    (g)
    add_f f42, f22, f32  (h)
    st_f  C, r12, f42    (i)
    add   r13, r12, 4    (j)
    ld_f  f23, A, r13    (k)
    ld_f  f33, B, r13    (l)
    add_f f43, f23, f33  (m)
    st_f  C, r13, f43    (n)
    add   r11, r13, 4    (o)
    blt   r11, r5, L1    (p)
After Register Renaming
Code Schedule: 8 cycles / 3 iterations = 2.7 cycles / iteration
Accumulator Variable
Expansion
An accumulator variable accumulates a sum or product
in each iteration of a loop.
Accumulator variable expansion eliminates redefinitions
of an accumulator variable within an unrolled loop by
creating k temporary accumulators (k is the number of
accumulation instructions). The values of all temporary
accumulators must be summed at the exit
points of the loop where the accumulator is live.
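The effect of accumulator variable expansion can be sketched at source level. This is a minimal illustration, assuming a dot product unrolled three times; `dot_serial` and `dot_expanded` are hypothetical names. In the expanded version each of the k accumulators is updated by exactly one accumulation per unrolled iteration, so the updates no longer form a single dependence chain, and the temporaries are summed at the loop exit.

```python
from typing import Sequence


def dot_serial(a: Sequence[float], b: Sequence[float]) -> float:
    """Reference loop: every addition depends on the previous one."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_expanded(a: Sequence[float], b: Sequence[float], k: int = 3) -> float:
    """Unrolled by k with k temporary accumulators (accumulator expansion)."""
    n = len(a)
    main = n - n % k                 # iterations covered by the unrolled loop
    acc = [0.0] * k                  # temporary accumulators, one per copy
    for i in range(0, main, k):
        for u in range(k):           # stands for the k unrolled copies
            acc[u] += a[i + u] * b[i + u]
    total = sum(acc)                 # combine the temporaries at loop exit
    for i in range(main, n):         # leftover iterations
        total += a[i] * b[i]
    return total
```

With floating-point data the expanded loop adds the terms in a different order, so its rounding can differ slightly from the serial loop.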
Accumulator Expansion
Example
For (k=0; k<n; k++)
{
    C(i,j) = C(i,j) + A(i,k) * B(k,j)
}
Original Loop
    ld_f  f1, C, r2    (-)
L1: ld_f  f3, A, r4    (a)
    ld_f  f5, B, r6    (b)
    mul_f f7, f3, f5   (c)
    add_f f1, f1, f7   (d)
    add   r4, r4, 4    (e)
    add   r6, r6, r8   (f)
    blt   r4, r9, L1   (g)
    st_f  C, r2, f1    (-)
Assembly Code
For all examples we assume a superscalar processor with infinite
resources and no register renaming hardware.
Thus for the code above, we obtain the following schedule.
Code Schedule: a and b occupy 2 cycles each, c and d occupy 3 cycles each, and e, f, and g occupy 1 cycle each.
8 cycles / 1 iteration
Loop Unrolling and
Register Renaming
    ld_f  f1, C, r2      (-)
L1: ld_f  f31, A, r41    (a)
    ld_f  f51, B, r61    (b)
    mul_f f71, f31, f51  (c)
    add_f f1, f1, f71    (d)
    add   r42, r41, 4    (e)
    add   r62, r61, r8   (f)
    ld_f  f32, A, r42    (g)
    ld_f  f52, B, r62    (h)
    mul_f f72, f32, f52  (i)
    add_f f1, f1, f72    (j)
    add   r43, r42, 4    (k)
    add   r63, r62, r8   (l)
    ld_f  f33, A, r43    (m)
    ld_f  f53, B, r63    (n)
    mul_f f73, f33, f53  (o)
    add_f f1, f1, f73    (p)
    add   r41, r43, 4    (q)
    add   r61, r63, r8   (r)
    blt   r41, r9, L1    (s)
    st_f  C, r2, f1      (-)
After Unrolling and Renaming
Code Schedule: 14 cycles / 3 iterations = 4.7 cycles / iteration
Accumulator
Expansion
    ld_f  f11, C, r2     (-)
    mov_f f12, 0         (-)
    mov_f f13, 0         (-)
L1: ld_f  f31, A, r41    (a)
    ld_f  f51, B, r61    (b)
    mul_f f71, f31, f51  (c)
    add_f f11, f11, f71  (d)
    add   r42, r41, 4    (e)
    add   r62, r61, r8   (f)
    ld_f  f32, A, r42    (g)
    ld_f  f52, B, r62    (h)
    mul_f f72, f32, f52  (i)
    add_f f12, f12, f72  (j)
    add   r43, r42, 4    (k)
    add   r63, r62, r8   (l)
    ld_f  f33, A, r43    (m)
    ld_f  f53, B, r63    (n)
    mul_f f73, f33, f53  (o)
    add_f f13, f13, f73  (p)
    add   r41, r43, 4    (q)
    add   r61, r63, r8   (r)
    blt   r41, r9, L1    (s)
    add_f f11, f11, f12  (-)
    add_f f11, f11, f13  (-)
    st_f  C, r2, f11     (-)
Code Schedule: 10 cycles / 3 iterations = 3.3 cycles / iteration
Induction Variable
Expansion
An induction variable is used to index through loop
iterations and through regular data structures, such as
arrays.
Induction variable expansion eliminates dependences
between definitions of induction variables and their uses
in unrolled loops.
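A source-level sketch of the idea, assuming a loop unrolled k times that walks an array with a fixed stride; `addresses_chained` and `addresses_expanded` are hypothetical names. In the chained version each index is computed from the previous one. In the expanded version each unrolled copy owns its own induction variable, initialized before the loop and advanced by k times the stride, so no copy waits for another copy's update.

```python
from typing import List


def addresses_chained(j0: int, stride: int, n: int) -> List[int]:
    """One induction variable: each index depends on the previous update."""
    out = []
    j = j0
    for _ in range(n):
        out.append(j)
        j += stride
    return out


def addresses_expanded(j0: int, stride: int, n: int, k: int = 3) -> List[int]:
    """k independent induction variables, each stepped by k * stride."""
    ivs = [j0 + u * stride for u in range(k)]   # set up before the loop
    step = k * stride                            # computed once, outside the loop
    out = []
    for _ in range(n // k):
        out.extend(ivs)                          # the k copies use their own variable
        ivs = [v + step for v in ivs]            # k independent updates
    out.extend(ivs[: n % k])                     # leftover iterations
    return out
```

Both functions visit the same indices in the same order; only the dependence structure of the index computation changes.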
Induction Variable
Expansion Example
For (i=0; i<n; i++)
{
    C(j) = A(j) * B(j)
    j=j+K
}
Original Loop
L1: ld_f  f3, A, r2    (a)
    ld_f  f4, B, r2    (b)
    mul_f f5, f3, f4   (c)
    st_f  C, r2, f5    (d)
    add   r2, r2, r7   (e)
    add   r1, r1, 1    (f)
    blt   r1, r6, L1   (g)
Assembly Code
Code Schedule: a and b occupy 2 cycles each, c occupies 3 cycles, and d, e, f, and g occupy 1 cycle each.
6 cycles / 1 iteration
Loop Unrolling and
Register Renaming
L1: ld_f  f31, A, r21    (a)
    ld_f  f41, B, r21    (b)
    mul_f f51, f31, f41  (c)
    st_f  C, r21, f51    (d)
    add   r22, r21, r7   (e)
    ld_f  f32, A, r22    (f)
    ld_f  f42, B, r22    (g)
    mul_f f52, f32, f42  (h)
    st_f  C, r22, f52    (i)
    add   r23, r22, r7   (j)
    ld_f  f33, A, r23    (k)
    ld_f  f43, B, r23    (l)
    mul_f f53, f33, f43  (m)
    st_f  C, r23, f53    (n)
    add   r21, r23, r7   (o)
    add   r1, r1, 3      (p)
    blt   r1, r6, L1     (q)
After Unrolling and Renaming
Code Schedule: 8 cycles / 3 iterations = 2.7 cycles / iteration
Induction Variable
Expansion
    mov   r21, r2        (-)
    add   r22, r21, r7   (-)
    add   r23, r22, r7   (-)
    mul   r71, r7, 3     (-)
L1: ld_f  f31, A, r21    (a)
    ld_f  f41, B, r21    (b)
    mul_f f51, f31, f41  (c)
    st_f  C, r21, f51    (d)
    ld_f  f32, A, r22    (f)
    ld_f  f42, B, r22    (g)
    mul_f f52, f32, f42  (h)
    st_f  C, r22, f52    (i)
    ld_f  f33, A, r23    (k)
    ld_f  f43, B, r23    (l)
    mul_f f53, f33, f43  (m)
    st_f  C, r23, f53    (n)
    add   r21, r21, r71  (e)
    add   r22, r22, r71  (j)
    add   r23, r23, r71  (o)
    add   r1, r1, 3      (p)
    blt   r1, r6, L1     (q)
After Unrolling and Renaming
Code Schedule: 6 cycles / 3 iterations = 2 cycles / iteration
Search Variable Expansion
A search variable is a single value (e.g., a minimum or a
maximum) computed for a collection of data.
Search variable expansion eliminates dependences
between definitions of search variables and their uses
in unrolled loops.
Each search variable is expanded into k temporary
independent variables. At the exit of the loop the value
of the original search variable is obtained by comparing
the values of the temporary search variables.
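As a minimal sketch of search variable expansion, assume a minimum search unrolled three times; `min_expanded` is a hypothetical name. Each unrolled copy compares against its own temporary minimum, and the temporaries are compared with each other once, at the loop exit.

```python
from typing import Sequence


def min_expanded(values: Sequence[float], k: int = 3) -> float:
    """Minimum of a non-empty sequence using k temporary search variables."""
    if not values:
        raise ValueError("min_expanded() needs at least one value")
    n = len(values)
    main = n - n % k                     # iterations covered by the unrolled loop
    temps = [float("inf")] * k           # one temporary search variable per copy
    for i in range(0, main, k):
        for u in range(k):               # stands for the k unrolled copies
            if values[i + u] < temps[u]:
                temps[u] = values[i + u]
    best = min(temps)                    # compare the temporaries at loop exit
    for v in values[main:]:              # leftover iterations
        if v < best:
            best = v
    return best
```

Inside the loop, the comparison in copy `u` depends only on `temps[u]`, so the k comparisons of one unrolled iteration are independent of each other.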
Superblock Scheduling
Superblock scheduling is a two step process:
Step 1: Build dependence graph
Step 2: List scheduling using the dependence
graph, instruction latencies, and resource
constraints of the processor
List Scheduling
List scheduling employs heuristics to choose, among
all ready nodes, the combination of nodes
that should be scheduled in the current cycle.
A node is ready if:
(i) all its parents in the dependence graph
have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
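The three readiness conditions can be sketched as a cycle-by-cycle list scheduler. This is an illustration under stated assumptions: the dependence graph is acyclic and contains flow dependences only, the only resource constraint is an issue width, and the heuristic (longest latency-weighted path to a leaf first) is one common choice rather than the one prescribed by the slides. The function name `list_schedule` is hypothetical.

```python
from typing import Dict, List


def list_schedule(latency: Dict[str, int],
                  parents: Dict[str, List[str]],
                  issue_width: int) -> Dict[str, int]:
    """Return the issue cycle of every node of an acyclic dependence graph."""
    children: Dict[str, List[str]] = {n: [] for n in latency}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    height: Dict[str, int] = {}

    def critical_path(n: str) -> int:
        # Latency-weighted length of the longest path from n to a leaf.
        if n not in height:
            height[n] = latency[n] + max(
                (critical_path(c) for c in children[n]), default=0)
        return height[n]

    start: Dict[str, int] = {}
    cycle = 0
    while len(start) < len(latency):
        ready = [
            n for n in latency
            if n not in start
            # (i) all parents scheduled and (ii) their results available
            and all(p in start and start[p] + latency[p] <= cycle
                    for p in parents.get(n, []))
        ]
        ready.sort(key=lambda n: (-critical_path(n), n))
        for n in ready[:issue_width]:    # (iii) resource limit per cycle
            start[n] = cycle
        cycle += 1
    return start
```

For a body with two loads (latency 2) feeding an FP add (latency 3) that feeds a store, plus an independent integer add feeding a branch, a two-wide machine issues the loads first because they head the longest path. Anti-dependences and the requirement that the branch ends the block are not modeled here, so the result is not comparable to the cycle counts quoted elsewhere in these notes.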
Speculative Execution in
Superblocks
To produce an efficient schedule, the compiler
must be able to move instructions above and
below branches.
LIVE-OUT(BR) is the set of variables that may be used
before being redefined when the branch BR is taken.
In the example, LIVE-OUT(S) is the set of variables
that is live at point P.
[Figure: superblock SB1 contains R: x ← y+z, followed later by the branch S: bnz r1; the taken path of S passes through point P into block B2.]
Speculative Execution in
Superblocks
If we want to move instruction R below the branch
instruction S, two situations might occur:
1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)
What is the code that the compiler should
produce for each situation?
1) x ∈ LIVE-OUT(S): insert a copy of instruction R,
R’: x ← y+z, in the branch target (basic block B2).
2) x ∉ LIVE-OUT(S): no compensation code
is required.
Speculative Execution in
Superblocks
Upward code motion is more commonly used to reduce
the critical path of a superblock (e.g., moving a
load instruction upward to hide the load latency).
There are two major restrictions to move an
instruction J from below to above a branch BR:
Restriction 1: The destination of J is not in
LIVE-OUT(BR).
Restriction 2: J will never cause an exception that
may terminate program execution
when BR is taken.
Speculative Execution in
Superblocks
Restriction 1 is usually removed by register renaming.
By renaming the destination register of instruction J,
we ensure that it is not in LIVE-OUT(BR).
There are two extreme interpretations of restriction 2.
Restricted Speculation Model: fully enforce restriction 2.
Therefore only instructions that cannot cause
exceptions are candidates for speculative execution
(e.g., memory load, memory store, integer divide, and
all floating point instructions cannot be speculated).
Speculative Execution in
Superblocks
General Speculation Model: completely ignore restriction 2.
Requires that the processor provide non-excepting or silent
versions of all potentially excepting instructions in the instruction
set architecture. If an exception occurs for a silent instruction, it
is simply ignored, and garbage is written in the destination.
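The two restrictions and the two speculation models can be combined into a small legality check. This is a sketch under assumptions that are not part of the slides: the opcode names in `MAY_EXCEPT` and `STORES` are hypothetical, and stores are kept non-speculative in both models because a store changes memory rather than a register.

```python
from typing import Set

# Hypothetical opcode classes for this illustration only.
MAY_EXCEPT = {"ld", "div", "fadd", "fmul", "fdiv"}
STORES = {"st"}


def can_move_above_branch(opcode: str, dest: str, live_out: Set[str],
                          model: str = "restricted") -> bool:
    """Decide whether an instruction may be hoisted above a branch BR.

    live_out is LIVE-OUT(BR); model is "restricted" or "general".
    """
    if model not in ("restricted", "general"):
        raise ValueError("model must be 'restricted' or 'general'")
    if opcode in STORES:
        return False          # assumption of this sketch, see the lead-in
    if dest in live_out:
        return False          # Restriction 1: destination is live when BR is taken
    if model == "restricted" and opcode in MAY_EXCEPT:
        return False          # Restriction 2, fully enforced
    return True               # general model: a silent version is assumed to exist
```

A caller that applies register renaming first would pass the renamed destination, which by construction is not in LIVE-OUT(BR).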
Example for Speculative
Execution
avg = 0;
weight = 0;
count = 0;
while(prt != NULL)
{
    count = count + 1;
    if(prt->wt > 0)
        weight = weight - prt->wt;
    else
        weight = weight + prt->wt;
    prt = prt -> next;
}
if(count != 0)
    avg = weight/count;
C code segment
(i1)       ld_i r1, prt, 0
(i2)       mov r7, 0        // avg
(i3)       mov r2, 0        // count
(i4)       mov r3, 0        // weight
(i5)       beq r1, 0, L3
(i6)  L0:  add r2, r2, 1
(i7)       ld_i r4, r1, 0   // prt->wt
(i8)       bge r4, 0, L1
(i9)       sub r3, r3, r4
(i10)      jmp L2
(i11) L1:  add r3, r3, r4
(i12) L2:  ld_i r1, r1, 4
(i13)      bne r1, 0, L0
(i14) L3:  beq r2, 0, L4
(i15)      div r7, r3, r2
(i16)      st_i avg, 0, r7
(i17) L4:
Assembly code segment
Example for Speculative
Execution
[Figure: Trace Selection for the Loop. Basic blocks BB2 (i6, i7, i8), BB3 (i9, i10), BB4 (i11), and BB5 (i12, i13), annotated with edge weights.]
[Figure: After superblock formation and branch target expansion. Superblock SB1 holds BB2 (i6, i7, i8), BB4 (i11), and BB5 (i12, i13); superblock SB2 holds BB3’ (i9, i12’, i13’).]
Example for Speculative
Execution
            ld_i r1, prt, 0
            mov r7, 0        // avg
            mov r2, 0        // count
            mov r3, 0        // weight
            beq r1, 0, L3
(i6)   L0:  add r2, r2, 1
(i7)        ld_i r4, r1, 0   // prt->wt
(i8)        bge r4, 0, LA
(i11)       add r3, r3, r4
(i12)       ld_i r1, r1, 4   // prt->next
(i13)       bne r1, 0, L0
(i9)   LA:  sub r3, r3, r4
(i12’)      ld_i r1, r1, 4   // prt->next
(i13’)      bne r1, 0, L0
(i14)  L3:  beq r2, 0, L4
(i15)       div r7, r3, r2
(i16)       st_i avg, 0, r7
(i17)  L4:
Assembly code segment
[Figure: After superblock formation and branch target expansion. SB1 holds BB2 (i6, i7, i8), BB4 (i11), and BB5 (i12, i13); SB2 holds BB3’ (i9, i12’, i13’).]
Example for Speculative
Execution
            ld_i r1, prt, 0
            mov r7, 0        // avg
            mov r2, 0        // count
            mov r3, 0        // weight
            beq r1, 0, L3
(I1)   L0:  add r2, r2, 1
(I2)        ld_i r4, r1, 0   // prt->wt
(I3)        blt r4, 0, L1
(I4)        add r3, r3, r4
(I5)        ld_i r5, r1, 4   // prt->next
(I6)        beq r5, 0, L3
(I7)        add r2, r2, 1
(I8)        ld_i r6, r5, 0   // prt->wt
(I9)        blt r6, 0, L1’
(I10)       add r3, r3, r6
(I11)       ld_i r1, r5, 4   // prt->next
(I12)       bne r1, 0, L0
       L3:  beq r2, 0, L4
            div r7, r3, r2
            st_I avg, 0, r7
       L4:
       L1’: mov r1, r5
            mov r4, r6
       L1:  sub r32, r3, r4
            ld_i r1, r1, 4
            bne r1, 0, L0
Hyperblocks
Suggested Reading
Scott A. Mahlke’s Ph.D. Thesis, chap. 7.
Hyperblock
A hyperblock is a collection of connected basic
blocks in which control may only enter through
the first block (entry block).
Control flow may leave from any number of blocks
in the hyperblock.
Before scheduling, all control flow between basic
blocks within a hyperblock is removed via if-conversion.
Hyperblock Formation
A five-step procedure is used to form hyperblocks:
1. region identification
2. loop backedge coalescing
3. block selection
4. tail duplication
5. if-conversion
Running Example: wc
Mahlke uses the inner loop of wc, the Linux program that counts
the number of characters, words, and lines in a file,
as a running example.
The source code
    linect = wordct = charct = token = 0;
    for ( ; ; ) {
A:      if (--(fp)->cnt < 0)
C:          c = filbuf(fp);
        else
B:          c = *(fp)->ptr++;
D:      if (c == EOF) break;
E:      charct++;
        if ((' ' < c) &&
F:          (c < 0177))
        {
H:          if (!token)
            {
K:              wordct++;
                token++;
            }
            continue;
        }
G:      if (c == '\n')
I:          linect++;
J:      else if ((c != ' ') &&
L:               (c != '\t')) continue;
M:      token = 0;
    }
The Assembly Code
LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_I r33, r73, 0
    add r32, r33, 1
    st_I r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_I r36, r72, 0
    add r35, r36, 1
    st_I r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_I r39, r71, 0
    add r38, r39, 1
    st_I r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD
Control Flow Graph
[Figure: control flow graph of the wc inner loop with blocks A through M and EXIT, annotated with edge execution counts (for example 105K, 77K, 61K, and 28K).]
Statistics of the Example
wc is formed by small basic blocks with a large
percentage of branches
It contains 13 basic blocks and 34 instructions:
14 branches:
8 conditional
5 unconditional
1 subroutine call
Step 1: Region
Identification
A region is a group of basic blocks with a single
entry block that dominates all the blocks in the
region.
A basic block can only reside in a single region.
Regions are used because they provide easy to
compute outer boundaries for hyperblocks.
A second constraint imposed on region formation
is that regions may not contain internal cycles
(this constraint is relaxed later).
In wc, the entire control flow graph forms a region.
Step 2: Backedge
Coalescing
If-conversion can remove only non-loop branches.
Thus we need to coalesce all backedges into a
single backedge. This allows the control logic
that chooses which backedge is taken to be
eliminated by if-conversion.
To coalesce the backedges, we introduce a new
node that will be the origin of the new single backedge.
Then we retarget all existing backedges to this
new node.
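The retargeting step can be sketched on a successor-map CFG. This is a minimal illustration, assuming the backedges of one loop have already been identified and are passed in as (source, header) pairs; `coalesce_backedges` and the default block name `N` are hypothetical.

```python
from typing import Dict, List, Tuple

Cfg = Dict[str, List[str]]


def coalesce_backedges(succs: Cfg, backedges: List[Tuple[str, str]],
                       new_block: str = "N") -> Cfg:
    """Route all backedges of one loop through a single new block."""
    if new_block in succs:
        raise ValueError("new block name already in use")
    headers = {dst for _, dst in backedges}
    if len(headers) != 1:
        raise ValueError("expected the backedges of exactly one loop header")
    header = next(iter(headers))
    sources = {src for src, _ in backedges}
    new: Cfg = {}
    for src, targets in succs.items():
        if src in sources:
            # Retarget this block's backedge to the new block.
            new[src] = [new_block if t == header else t for t in targets]
        else:
            new[src] = list(targets)     # entry edges into the header stay put
    new[new_block] = [header]            # the single remaining backedge
    return new
```

After the transformation every former backedge source branches forward to the new block, so those branches become ordinary non-loop branches.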
CFG Before Backedge
Coalescing
[Figure: control flow graph of the wc inner loop with blocks A through M and EXIT, annotated with edge execution counts; several blocks have backedges to A.]
CFG After Backedge
Coalescing
[Figure: the same graph with a new block N; the former backedges are retargeted to N, and N is the origin of the single backedge.]
Step 3: Block Selection
Two conflicting goals:
(1) More blocks can potentially improve performance
by eliminating branches among the blocks included.
(2) Too many blocks may result in performance loss
due to over-saturation of processor resources or
increased dependence height.
Enumerating Execution
Paths
An execution path is a path of control flow from
the entry block to an exit block in the region.
Mahlke assigns a priority to each execution path.
This priority indicates the path's relative importance.
Mahlke also estimates the available resources
and the resource use of each path.
Paths are included in the hyperblock from the
highest to the lowest priority based on the
available resources.
Path Priority Function
The path priority function combines four elements:
(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path;
Intuition: include paths with fewer instructions,
with lower dependence height, that
have few hazard conditions, and that
are executed very often.
Hazard conditions include procedure calls and
unresolvable memory
stores.
Path Priority Function
\[
\mathit{dep\_ratio}_i = 1.0 - \frac{\mathit{dep\_height}_i}{\max_{1 \le j \le N} \mathit{dep\_height}_j}
\]
\[
\mathit{op\_ratio}_i = 1.0 - \frac{\mathit{num\_ops}_i}{\max_{1 \le j \le N} \mathit{num\_ops}_j}
\]
\[
\mathit{priority}_i = \mathit{probability}_i \times \mathit{hazard}_i \times \left( \mathit{dep\_ratio}_i + \mathit{op\_ratio}_i + K \right)
\]
Mahlke uses a hazard multiplier of 0.25 for all paths
containing a subroutine call or an unresolvable
memory reference, and 1.0 for all other paths.
The constant K makes the path with the largest
dependence height and the most operations have
a non-zero priority. Mahlke used K = 0.1.
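As a worked illustration of the priority function, the sketch below evaluates it for a list of paths. The dictionary keys (`probability`, `dep_height`, `num_ops`, `has_hazard`) and the sample numbers are assumptions made for the example; only the formula, the hazard multiplier of 0.25 and K = 0.1 come from the slides.

```python
def path_priorities(paths, k=0.1, hazard_multiplier=0.25):
    """Return the priority of each path, in the same order as the input."""
    max_dep = max(p["dep_height"] for p in paths)
    max_ops = max(p["num_ops"] for p in paths)
    priorities = []
    for p in paths:
        dep_ratio = 1.0 - p["dep_height"] / max_dep
        op_ratio = 1.0 - p["num_ops"] / max_ops
        # Paths with a call or an unresolvable memory reference are penalized.
        hazard = hazard_multiplier if p["has_hazard"] else 1.0
        priorities.append(p["probability"] * hazard * (dep_ratio + op_ratio + k))
    return priorities


# Hypothetical paths: a short frequent one and a long, less frequent one.
sample = [
    {"probability": 0.6, "dep_height": 4, "num_ops": 10, "has_hazard": False},
    {"probability": 0.4, "dep_height": 8, "num_ops": 20, "has_hazard": False},
]
prio = path_priorities(sample)
```

The second path has the largest dependence height and the most operations, so both of its ratios are zero; its priority is kept non-zero only by K (0.4 x 0.1 = 0.04).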
Block Selection Algorithm
ISSUE_WIDTH = 1 to 8 /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10
block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority
    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH * dep_height_1 * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅
    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) {
            continue
        }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 * MAX_DEP_GROWTH)) {
            continue
        }
        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority * MIN_PATH_PRIORITY_RATIO)) {
            continue
        }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ path_i
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
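The pseudocode above translates almost line for line into Python. The sketch below assumes that path enumeration and priority computation have already been done, and that each path is a dictionary with the hypothetical keys `blocks`, `priority`, `num_ops` and `dep_height`; the sample paths are invented for the example.

```python
def block_selection(paths, issue_width, res_multiplier=2,
                    max_dep_growth=3, min_path_priority_ratio=0.10):
    """Select paths in priority order and return the set of their blocks."""
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    if not ordered:
        return set()
    top_dep = ordered[0]["dep_height"]       # dep_height of the best path
    avail_resources = issue_width * top_dep * res_multiplier
    used_resources = 0
    last_priority = 0.0
    selected = []
    for p in ordered:
        if p["num_ops"] + used_resources > avail_resources:
            continue                          # not enough resources left
        if p["dep_height"] > top_dep * max_dep_growth:
            continue                          # dependence height grows too much
        if p["priority"] < last_priority * min_path_priority_ratio:
            continue                          # priority drops too sharply
        selected.append(p)
        used_resources += p["num_ops"]
        last_priority = p["priority"]
    blocks = set()
    for p in selected:
        blocks.update(p["blocks"])
    return blocks


# Hypothetical region with four enumerated paths.
paths = [
    {"blocks": ["A", "B", "D"], "priority": 0.66, "num_ops": 10, "dep_height": 4},
    {"blocks": ["A", "C", "D"], "priority": 0.30, "num_ops": 5, "dep_height": 5},
    {"blocks": ["A", "B", "E"], "priority": 0.20, "num_ops": 4, "dep_height": 4},
    {"blocks": ["A", "F"], "priority": 0.01, "num_ops": 1, "dep_height": 2},
]
chosen = block_selection(paths, issue_width=2)
```

With an issue width of 2 the budget is 2 x 4 x 2 = 16 operations: the first two paths fit, the third would exceed the budget, and the fourth is rejected because its priority is less than 10% of the last included path.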
Block Selection
[Figure: CFG after backedge coalescing (blocks A through N and EXIT, edges annotated with profile counts), with the enumerated execution paths listed alongside.]
1. A-B-D-E-F-H-N
2. A-B-D-E-F-H-K-N
3. A-B-D-E-G-J-M-N
4. A-B-D-E-G-J-L-M-N
5. A-B-D-E-G-I-M-N
6. A-B-D-E-G-J-L-N
7. A-B-D
8. A-C-D-E-F-H-N
9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N
Path Selection
Some paths that are not selected by the block
selection algorithms are also included in the
hyperblocks because all their blocks belong
to selected paths.
An alternative procedure could have eliminated
these paths from the path set before the selection.
But the cost of such elimination would be higher
than maintaining these extra paths in the set.
Step 4: Tail Duplication
To convert the set of selected blocks into a
hyperblock (with a single entry block), control
flow from non-selected blocks (side entry points)
must be eliminated.
The tail duplication algorithm first marks all
blocks that have side entry points.
Then the algorithm marks all blocks that can
be reached from marked blocks.
All marked blocks form the tails that must be
duplicated.
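The two marking passes can be sketched as a small worklist algorithm. This is an illustration under stated assumptions: the CFG is an edge list, `selected` is the output of block selection, and the entry block is never marked (the single backedge into it is not treated as a side entry). The function name and the toy CFG are hypothetical.

```python
def blocks_to_duplicate(edges, selected, entry):
    """Return the selected blocks that form the tails to be duplicated."""
    selected = set(selected)
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    # Pass 1: mark selected blocks entered from a non-selected block.
    marked = {dst for src, dst in edges
              if dst in selected and dst != entry and src not in selected}

    # Pass 2: mark every selected block reachable from a marked block.
    worklist = list(marked)
    while worklist:
        block = worklist.pop()
        for nxt in successors.get(block, []):
            if nxt in selected and nxt != entry and nxt not in marked:
                marked.add(nxt)
                worklist.append(nxt)
    return marked


# Hypothetical CFG: C is not selected and enters the hyperblock at D.
cfg = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("E", "A")]
tail = blocks_to_duplicate(cfg, selected={"A", "B", "D", "E"}, entry="A")
```

Here D has a side entry from C, so D and everything reachable from it inside the selected set (E) must be duplicated; C is then redirected to the copies.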
Tail Duplication
[Figure: CFG before tail duplication (blocks A through N and EXIT, edges annotated with profile counts).]
Tail Duplication
[Figure: CFG after tail duplication. Blocks D through N are duplicated as D' through N'; the side entry from block C is redirected to the duplicates. Edges are annotated with profile counts.]
Anatomy of a Predicate
Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
This instruction assigns values to Pout1 and Pout2.
The value assigned depends on:
The result of the comparison
The value of Pin
The type of Pout1 and Pout2
Anatomy of a Predicate
Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<cmp> = eq | ne | gt
<type> = U | /U | OR | /OR | AND | /AND
Example:
pge p4(OR), p2(/U), r4, 127 (p1)
cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127
Anatomy of a Predicate
Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
U or /U
Always write into the destination register:
if type = U
    then if Pin = 0
        then Pout = 0
    elseif src1 <cmp> src2
        then Pout = 1
    else Pout = 0
if type = /U
    then if Pin = 0
        then Pout = 0
    elseif src1 <cmp> src2
        then Pout = 0
    else Pout = 1
Anatomy of a Predicate
Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
OR or /OR
OR writes into the destination register only if Pin = 1 and <cmp> is true;
/OR writes only if Pin = 1 and <cmp> is false:
if type = OR and Pin = 1 and
    src1 <cmp> src2
    then Pout = 1
if type = /OR and Pin = 1 and
    src1 !<cmp> src2
    then Pout = 1
Used when the execution of a block is enabled by
one of multiple conditions.
OR-type predicates must be initialized to 0 before their use.
Anatomy of a Predicate
Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
AND or /AND
AND writes into the destination register only if Pin = 1 and <cmp> is false;
/AND writes only if Pin = 1 and <cmp> is true:
if type = AND and Pin = 1 and
    src1 !<cmp> src2
    then Pout = 0
if type = /AND and Pin = 1 and
    src1 <cmp> src2
    then Pout = 0
Used when the execution of a block requires
several conditions to be true.
AND-type predicates are often initialized to 1.
Predicate Comparison
Truth Table
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
• Pin predicates the entire predicate computation
instruction.
• Notice that for an unconditional type, the value 0 is
written in Pout even when Pin is 0.
                  Pout
Pin  Comparison |  U   /U   OR   /OR   AND   /AND
 0       0      |  0    0    -    -     -     -
 0       1      |  0    0    -    -     -     -
 1       0      |  0    1    -    1     0     -
 1       1      |  1    0    1    -     -     0
(- : the destination predicate is left unchanged)
Predicate Comparison
Truth Table
Example:
pge p4(OR), p2(/U), r4, 127 (p1)
p1  Comparison | p4(OR)  p2(/U)
 0       0     |   -       0
 0       1     |   -       0
 1       0     |   -       1
 1       1     |   1       0
(- : the destination predicate is left unchanged)
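The semantics of the six predicate types can be captured in one small function. The sketch below is a behavioral model written for illustration only: the function name `pdefine`, the string encoding of the types, and the explicit `old` argument (the previous value of the destination predicate) are assumptions, not part of any instruction set definition.

```python
def pdefine(ptype, pin, cmp_result, old):
    """Return the new value of one destination predicate.

    ptype      -- "U", "/U", "OR", "/OR", "AND" or "/AND"
    pin        -- value of the input (guarding) predicate, 0 or 1
    cmp_result -- True if src1 <cmp> src2 holds
    old        -- previous value of the destination predicate
    """
    complemented = ptype.startswith("/")
    base = ptype.lstrip("/")
    cond = (not cmp_result) if complemented else cmp_result
    if base == "U":                     # always writes
        return 1 if (pin and cond) else 0
    if base == "OR":                    # writes 1 or leaves the value alone
        return 1 if (pin and cond) else old
    if base == "AND":                   # writes 0 or leaves the value alone
        return 0 if (pin and not cond) else old
    raise ValueError("unknown predicate type: " + ptype)


def example(p1, r4, p4_old=0):
    """Model of: pge p4(OR), p2(/U), r4, 127 (p1)."""
    cmp_result = r4 >= 127
    return pdefine("OR", p1, cmp_result, p4_old), pdefine("/U", p1, cmp_result, 0)
```

For instance, `example(1, 200)` yields `(1, 0)` and `example(1, 50)` yields `(0, 1)`, matching the last two rows of the example table when p4 was cleared beforehand.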
Predicate Types
Unconditional predicates are used for control
dependence sets that have a single edge.
OR-type predicates are used for predicates with
multiple edges in their control dependence sets.
(OR-type predicates must be cleared before
entering the hyperblock).
Step 5: If-conversion
For graph drawing, Mahlke uses the convention that the
left edge out of a basic block is the true condition
and the right one is the false.
[Figure: block G with left successor I and right successor J.]
In this control flow graph the control dependences
on blocks I and J are:
I: brG
J: /brG
Step 5: If-conversion
[Figure: CFG after tail duplication, with block C and the duplicated blocks D'-N' shown outside the selected blocks; edges annotated with profile counts.]
Control Dependences
A : none
B : none
D : none
E : none
F : brE
G : /brE, /brF
H : brF
I : brG
J : /brG
K : brH
L : /brJ
M : brI, brJ, brL
N : none
Predicate Assignment
A : null
B : null
D : null
E : null
F : p1 (U)
G : p4 (OR)
H : p2 (U)
I : p7 (U)
J : p5 (U)
K : p3 (U)
L : p8 (U)
M : p6 (OR)
N : null
Step 5: If-conversion
(example)
[Figure: CFG after tail duplication, shown next to the original code of the loop.]
LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_I r33, r73, 0
    add r32, r33, 1
    st_I r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_I r36, r72, 0
    add r35, r36, 1
    st_I r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_I r39, r71, 0
    add r38, r39, 1
    st_I r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD
Inner Loop After If-conversion
pclr p4, p6
ld_I r98, r3, 0
add r27, r98, -1
st_I r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_I r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_I r33, r73, 0
add r32, r33, 1
st_I r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
peq p6(OR), p8(/U), r4, 32 (p5)
ld_I r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_I r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_I r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_I r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8)
mov r2, 0 (p6)
jmp loop
Predicate Hierarchy Graph
The Predicate Hierarchy Graph (PHG) is a directed
acyclic graph representing the Boolean equations
used to compute all the predicates in a hyperblock.
The PHG is used to derive relationships
among predicates.
There are two types of nodes in the PHG:
predicate nodes and condition nodes.
Two PHG nodes x and y are connected if the
value specified by x is used to directly compute
the value of y.
Example of PHG
Construction
[Figure: the predicated inner loop, with each predicate define annotated with the condition(s) it tests, next to the PHG root node T.]
pge p4(OR), p1(/U), 32, r4             [c1, /c1]
pge p4(OR), p2(/U), r4, 127 (p1)       [c2, /c2]
peq p3(U), -, 0, r2 (p2)               [c3]
peq p6(OR), p5(/U), r4, r10 (p4)       [c4, /c4]
peq p7(U), -, r4, r10 (p4)             [c4]
peq p6(OR), p8(/U), r4, 32 (p5)        [c5, /c5]
peq p6(OR), -, r4, 9 (p8)              [c6]
Example of PHG
Construction
[Figure: completed PHG for the predicated inner loop, built by adding the nodes and edges for one predicate define at a time.]
PHG edges (predicate -> condition -> predicate):
T  -> c1  -> p4
T  -> /c1 -> p1
p1 -> c2  -> p4
p1 -> /c2 -> p2
p2 -> c3  -> p3
p4 -> c4  -> p6
p4 -> /c4 -> p5
p4 -> c4  -> p7
p5 -> c5  -> p6
p5 -> /c5 -> p8
p8 -> c6  -> p6
Purpose of PHG
The PHG allows the compiler to derive
relations among the predicates. Mahlke identifies three
predicate relations:
Ancestor: pi is an ancestor of pj if all conditions
used to compute pj are derived from pi.
The compiler can be sure that pj may be true only when pi is also true.
Control Path: There is a control path between
pi and pj if there is at least one set of
conditions under which both pj and pi are true.
The compiler knows that pi and pj may be true at the same time.
Implies: pi implies pj if the conditions that make pi true
guarantee that pj will also be true.
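The implies and control-path relations can be checked on the example by brute force. The sketch below does not traverse a PHG; it swaps in a simpler technique, evaluating the predicate defines of the inner loop for every assignment of the conditions c1 to c6, which are treated as independent Boolean variables (an assumption: in the real program some of them are correlated). The `DEFINES` encoding and the function names are hypothetical.

```python
from itertools import product

# (destination, type, condition index, guarding predicate), in program order.
# "T" stands for the always-true root of the PHG.
DEFINES = [
    ("p4", "OR", 1, "T"), ("p1", "/U", 1, "T"),
    ("p4", "OR", 2, "p1"), ("p2", "/U", 2, "p1"),
    ("p3", "U", 3, "p2"),
    ("p6", "OR", 4, "p4"), ("p5", "/U", 4, "p4"),
    ("p7", "U", 4, "p4"),
    ("p6", "OR", 5, "p5"), ("p8", "/U", 5, "p5"),
    ("p6", "OR", 6, "p8"),
]
PREDICATES = ["T"] + ["p%d" % i for i in range(1, 9)]


def evaluate(conds):
    """Run the predicate defines for one assignment of c1..c6."""
    preds = {"T": True, "p4": False, "p6": False}   # pclr p4, p6
    for dest, ptype, c, guard in DEFINES:
        cond = (not conds[c]) if ptype.startswith("/") else conds[c]
        fire = preds[guard] and cond
        if ptype.lstrip("/") == "OR":
            preds[dest] = preds[dest] or fire        # OR-type only ever sets
        else:
            preds[dest] = fire                       # U-type always writes
    return preds


ALL = [evaluate(dict(zip(range(1, 7), bits)))
       for bits in product([False, True], repeat=6)]


def implies(a, b):
    return all(state[b] for state in ALL if state[a])


def may_coexist(a, b):
    return any(state[a] and state[b] for state in ALL)


same_path_as_p5 = {p for p in PREDICATES if may_coexist("p5", p)}
```

Under these assumptions `implies("p7", "p6")` holds, and `same_path_as_p5` contains T, p1, p4, p5, p6 and p8, which agrees with the answers given on the relationship slides.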
Imply Relationship
[Figure: the predicated inner loop and its PHG.]
p7 implies p6: both are defined under p4 with condition c4, so whenever p7 is set, p6 is set as well.
Ancestor Relationship
[Figure: the predicated inner loop and its PHG.]
Which predicate nodes are ancestors of p5?
T, p4, and p5
Ancestor Relationship
[Figure: the predicated inner loop and its PHG.]
Which predicate nodes are in the same control path as p5?
T, p1, p4, p5, p6, p8
Classical/ILP Optimizations in
Predicated Code
Example: Copy Propagation
Before:
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
After:
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r2, 0 (p3)
Is the copy propagation from
instruction A to instruction C legal?
Depends on what we know about the
relationship between p1, p2, and p3.
If it is possible that p1 is false and p3
is true, the propagation would be wrong!
Classical/ILP Optimizations in
Predicated Code
Example: Copy Propagation
A: mov  r1, r2 (p1)
B: add  r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
[Figure: PHG fragment with nodes p1, pk, p2, and p3, where pk reaches p2 through cm and p3 through /cm.]
For instance, if we know that:
(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive
Then we can do the copy propagation safely.
Classical/ILP Optimizations in
Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
What are the data dependences in the
code above?
Depends on what we know about the
relationship between p2 and p3.
Classical/ILP Optimizations in
Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
For instance, if we know that
p2 and p3 are mutually exclusive,
we have this DDG:
[Figure: PHG fragment in which pk reaches p2 through cm and p3 through /cm; DDG with only the edges A -> B and C -> D.]
Classical/ILP Optimizations in
Predicated Code
Example: Instruction Scheduling
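A: ld_i r1, r2, r3 (p2)
B: add  r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul  r6, r1, r7 (p3)
But if p2 implies p3,
then we have this DDG:
[Figure: PHG fragment in which pk reaches both p2 and p3 through cm; DDG in which A, B, C, and D form a chain, since C redefines r1 and must follow A (output dependence) and B (anti dependence).]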
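The effect of predicate relations on the dependence graph can be sketched with a pairwise dependence test that skips any pair of instructions whose predicates are known to be mutually exclusive. This is a deliberately conservative illustration (it does not model kills, so it may report more flow edges than a real scheduler would keep); the tuple layout and function name are assumptions.

```python
def dependences(instrs, mutually_exclusive=()):
    """Return (earlier, later, kind) dependence edges between instructions.

    instrs             -- list of (name, dest, sources, predicate) in program order
    mutually_exclusive -- pairs of predicates that can never both be true
    """
    exclusive = {frozenset(pair) for pair in mutually_exclusive}
    edges = []
    for i, (n1, d1, s1, p1) in enumerate(instrs):
        for n2, d2, s2, p2 in instrs[i + 1:]:
            if frozenset((p1, p2)) in exclusive:
                continue                       # never executed together
            if d1 in s2:
                edges.append((n1, n2, "flow"))
            if d2 in s1:
                edges.append((n1, n2, "anti"))
            if d1 == d2:
                edges.append((n1, n2, "output"))
    return edges


# The four instructions from the scheduling example.
code = [
    ("A", "r1", {"r2", "r3"}, "p2"),
    ("B", "r4", {"r1"}, "p2"),
    ("C", "r1", {"r5"}, "p3"),
    ("D", "r6", {"r1", "r7"}, "p3"),
]
exclusive_ddg = dependences(code, mutually_exclusive=[("p2", "p3")])
general_ddg = dependences(code)
```

With p2 and p3 mutually exclusive only A -> B and C -> D remain, so the two pairs can be scheduled in parallel; without that knowledge the output dependence A -> C and the anti dependence B -> C serialize the code.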
Predicate-Specific
Optimizations
- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling
Predicate Promotion
The idea is to speculate the execution of instructions
by replacing their predicate with a less constrained
predecessor predicate.
Because the ancestor predicate is computed with
fewer conditions, the execution of the promoted
instruction is speculative.
The advantage of predicate promotion is the reduction
of the dependence chain in a hyperblock.
Conditions for Simple
Predicate Promotion
The predicate of an instruction op(x) can
be promoted to its predecessor predicate
if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before
   op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path
   op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)
Example of Predicate
Promotion (qsort)
Before promotion:
1  LA: ld_i r20, r24, r101
2      ld_i r23, r2, r102
3      pge  p126(U), p127(/U), r20, r23
4  LB: ld_i r6, r123, 0        (p126)
5      add  r123, r123, 8      (p126)
6      add  r9, r9, 1          (p126)
7      add  r101, r101, 8      (p126)
8  LC: ld_i r6, r124, 8        (p127)
9      add  r124, r124, 8      (p127)
10     add  r124, r124, 8      (p127)
11     add  r102, r102, 8      (p127)
12 LD: st_i r114, 0, r23
13     st_i r114, 4, r6
14     add  r7, r7, 1
15     add  r114, r114, 8
16     bge  r9, r3, EXIT
17 LE: blt  r8, r1, LA
After promotion:
1  LA: ld_i r20, r24, r101
2      ld_i r23, r2, r102
3      pge  p126(U), p127(/U), r20, r23
4  LB: ld_i r6, r123, 0
5      add  r123, r123, 8      (p126)
6      add  r9, r9, 1          (p126)
7      add  r101, r101, 8      (p126)
8  LC: ld_i r60, r124, 8
8a     mov  r6, r60            (p127)
9      add  r124, r124, 8      (p127)
10     add  r124, r124, 8      (p127)
11     add  r102, r102, 8      (p127)
12 LD: st_i r114, 0, r23
13     st_i r114, 4, r6
14     add  r7, r7, 1
15     add  r114, r114, 8
16     bge  r9, r3, EXIT
17 LE: blt  r8, r1, LA
Branch Combining
Problem: too many infrequently executed branches
in a hyperblock
Example: a loop in grep
1 A: bge  r1, r5, EXIT1
2    ld_c r3, r1, 0
3    beq  r3, 10, EXIT2
4    beq  r3, 0, EXIT3
5    bge  r2, r6, EXIT4
6    st_c r2, 0, r3
7    add  r1, r1, 1
8    add  r2, r2, 1
9    jmp  A
[Figure annotations: branch profile counts 14, 4035, 0, 0.]
Branch Combining
Solution: replace a group of exit branches
by a corresponding group of predicate
define instructions.
All predicate definitions write into the same predicate
register using the OR-type semantics.
The resultant predicate will be set to 1 if any
of the exit branches were to be taken.
Because not exiting the hyperblock is the most
common case, the predicate will usually be false.
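The OR-type accumulation can be modeled directly. The sketch below mirrors the four exit tests of the grep loop shown earlier; register values are passed as plain integers, and the function names and the `decode` helper (which stands in for the decode block that re-executes the original branches) are hypothetical.

```python
def combined_exit(r1, r2, r3, r5, r6):
    """Model of pclr p1 followed by four OR-type predicate defines."""
    p1 = False                  # pclr p1
    p1 = p1 or (r1 >= r5)       # pge p1(OR), r1, r5
    p1 = p1 or (r3 == 10)       # peq p1(OR), r3, 10
    p1 = p1 or (r3 == 0)        # peq p1(OR), r3, 0
    p1 = p1 or (r2 >= r6)       # pge p1(OR), r2, r6
    return p1


def decode(r1, r2, r3, r5, r6):
    """Re-run the original branches in order to find the exit actually taken."""
    if r1 >= r5:
        return "EXIT1"
    if r3 == 10:
        return "EXIT2"
    if r3 == 0:
        return "EXIT3"
    if r2 >= r6:
        return "EXIT4"
    return None


def iteration(r1, r2, r3, r5, r6):
    """Single combined branch in the hyperblock; decode only when it fires."""
    if combined_exit(r1, r2, r3, r5, r6):
        return decode(r1, r2, r3, r5, r6)
    return None
```

In the common case `combined_exit` is false and the loop body pays for one branch instead of four; the cost of identifying the exact exit is moved to the rarely executed decode path.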
Branch Combining
Original loop (unrolled twice):
1  A: bge  r1, r5, EXIT1
2     ld_c r3, r1, -1
3     beq  r3, 10, EXIT2
4     beq  r3, 0, EXIT3
5     bge  r2, r6, EXIT4
6     st_c r2, -1, r3
7     bge  r1, r7, EXIT5
8     ld_c r4, r1, 0
9     beq  r4, 10, EXIT6
10    beq  r4, 0, EXIT7
11    bge  r2, r8, EXIT8
12    st_c r2, 0, r4
13    add  r1, r1, 2
14    add  r2, r2, 2
15    jmp  A
After branch combining:
0  A: pclr p1
1     pge  p1(OR), r1, r5
2     ld_c r3, r1, -1
3     peq  p1(OR), r3, 10
4     peq  p1(OR), r3, 0
5     pge  p1(OR), r2, r6
7     pge  p1(OR), r1, r7
8     ld_c r4, r1, 0
9     peq  p1(OR), r4, 10
10    peq  p1(OR), r4, 0
11    pge  p1(OR), r2, r8
16    jmp  Decode (p1)
6'    st_c r2, -1, r3
12    st_c r2, 0, r4
13    add  r1, r1, 2
14    add  r2, r2, 2
15    jmp  A
Decode:
1     bge  r1, r5, EXIT1
3     beq  r3, 10, EXIT2
4     beq  r3, 0, EXIT3
5     bge  r2, r6, EXIT4
6     st_c r2, -1, r3
7     bge  r1, r7, EXIT5
9     beq  r4, 10, EXIT6
10    beq  r4, 0, EXIT7
11    jmp  EXIT8
Instruction Between
Combined Branches
Instructions between combined branches are
speculated.
For instructions that are between combined branches
but cannot be speculated, the following must be done:
(1) move the instructions below the combined
exit branch in the hyperblock.
(2) replicate these instructions in their original position
with respect to the exit branches in the decode block.
Backend Compilation with
Hyperblocks
[Figure: backend compilation flow. Boxes: Lcode generation; Classical Optim.; Hyperblock/Superblock Formation; ILP/Predicate-Specific Optimizations; Classical Optim.; Instruction Scheduling; Register Allocation. Predicate-aware support: PHG (predicate relations); CFG Generator and Equation Solver (dataflow information).]
