Code Generation
 Use registers during execution
 Whenever possible, perform computation in registers
 Memory load/store are much more expensive
 Need to determine the best register allocation
 For a given number of registers, minimize the number of spills
 Spill: when we run out of registers, store the contents of some registers to memory
 Need to determine the best order of instruction execution
 To satisfy the optimal register allocation decision
 To reduce the number of instructions
 Instruction selection
 Map the intermediate code to the set of machine instructions that
minimizes the cost of execution
 Peephole optimization
Code Generation
 Various methods for register allocation and instruction
scheduling
 Tree
 Achieve optimal register allocation and instruction scheduling
 DAG (directed acyclic graph)
 Achieve local subexpression elimination (optimal)
 Optimal register allocation and instruction scheduling is NP-hard
 Heuristic algorithms
 Global
 Global register allocation
 Do not have corresponding scheduling algorithm, just follow the
original instruction order
Tree Based Approach for a Basic Block
Basic block:
t1 := a + b
t2 := c * d
t3 := e + f
t4 := t2 + t3
y := t1 * t4
Assumptions:
The system has two registers, r0, r1
only y is alive at the exit of the block
“op reg reg/mem reg” -- first reg is the result
“– a b c” means a := b – c
Naive code (every temporary is stored to memory):
load r0, a
add r0, r0, b
store t1, r0
load r0, c
mul r0, r0, d
store t2, r0
load r0, e
add r0, r0, f
store t3, r0
load r0, t2
add r0, r0, t3
store t4, r0
load r0, t1
mul r0, r0, t4
store y, r0
15 instructions
10 loads (5 load instructions plus 5 memory operands), 5 stores
Can we use the registers more effectively?
Tree Based Approach for a Basic Block
Assumptions:
The system has two registers, r0, r1
only y is alive at the exit of the block
Basic block:
t1 := a + b
t2 := c * d
t3 := e + f
t4 := t2 + t3
y := t1 * t4
[Figure: expression tree of the block: y = t1 * t4, t1 = a + b, t4 = t2 + t3, t2 = c * d, t3 = e + f]
t1 (R0) and t2 (R1) are still needed
But no more registers to compute t3
Has to spill (choose R0)
Need to load t1 back into R1
load r0, a
add r0, r0, b
load r1, c
mul r1, r1, d
store t1, r0
load r0, e
add r0, r0, f
add r0, r1, r0
load r1, t1
mul r0, r1, r0
store y, r0
11 instructions
7 load, 2 store (1 spill)
Tree Based Approach for a Basic Block
Assumptions:
The system has two registers, r0, r1
only y is alive at the exit of the block
Basic block:
t1 := a + b
t2 := c * d
t3 := e + f
t4 := t2 + t3
y := t1 * t4
[Figure: the same expression tree, evaluated in the order t2, t3, t4, t1, y]
load r0, c
mul r0, r0, d
load r1, e
add r1, r1, f
add r1, r0, r1
load r0, a
add r0, r0, b
mul r0, r0, r1
store y, r0
9 instructions
6 load, 1 store (0 spill)
Optimal!
Can we always achieve optimal execution?
Tree based Register Allocation and Scheduling
 Construct the execution tree for a basic block
 Label the tree to obtain the register requirements
 Depth first labeling
 L(leaf) = 1 if it is an identifier
 L(leaf) = 0 if it is a constant
 L(nonleaf node) =
 If L(left child) = L(right child) then
o L(current node) := L(left child) + 1
 Otherwise
o L(current node) := max (L(left child), L(right child))
Assumption: from here onwards, 3-address code needs to have the form “op reg reg reg”
 Assign registers and generate code
 Register allocation
 Instruction scheduling follows the register allocation algo
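The labeling rules above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the `Node` class, the `is_const` flag, and the rule for unary operators (same label as the child) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str                       # operator symbol, identifier, or constant text
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    is_const: bool = False          # assumption: constants are flagged explicitly


def label(node: Node) -> int:
    """Return L(node), the register requirement of the subtree rooted at node."""
    if node.left is None and node.right is None:
        return 0 if node.is_const else 1       # constant leaf: 0, identifier leaf: 1
    if node.right is None:                     # unary operator (assumed rule)
        return label(node.left)
    l, r = label(node.left), label(node.right)
    return l + 1 if l == r else max(l, r)


# y := (a + b) * ((c * d) + (e + f)), the basic block used earlier
t1 = Node("+", Node("a"), Node("b"))
t2 = Node("*", Node("c"), Node("d"))
t3 = Node("+", Node("e"), Node("f"))
t4 = Node("+", t2, t3)
y = Node("*", t1, t4)

labels = {"t1": label(t1), "t2": label(t2), "t3": label(t3),
          "t4": label(t4), "y": label(y)}
print(labels)
```

With every operand loaded into a register first, the root label of 3 is the number of registers the whole block needs.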
Tree based Register Allocation and Scheduling
 Register allocation and instruction scheduling
 The process starts from the root, recursively going to the leaf nodes
 Each non-leaf node N with mark lr(N), do
 Assume
o The registers allocated to N are Rb+1 to Rb+k (k registers)
o The node operation is op and op is binary
o Notation example: the tree with root “–” (t1) and children a, b stands for t1 := a – b, written “– t1 a b”
 If L(left) = L(right)
o Go to left, pass lr = left, use registers Rb+1 to Rb+k–1, store result in Rb+1
o Go to right, pass lr = right, use registers Rb+2 to Rb+k, store result in Rb+k
o If lr(N) = left gen “op Rb+1 Rb+1 Rb+k”; else gen “op Rb+k Rb+1 Rb+k”
 If L(left) < L(right)
 (assume: left needs m registers, right needs k registers, k > m)
o Go to right, pass lr = right, use registers Rb+1 to Rb+k, store result in Rb+k
o Go to left, pass lr = left, use registers Rb+1 to Rb+m, store result in Rb+1
o If lr(N) = left gen “op Rb+1 Rb+1 Rb+k”; else gen “op Rb+k Rb+1 Rb+k”
Tree based Register Allocation and Scheduling
 Register allocation and instruction scheduling
 Each non-leaf node N, with mark lr(N), do
 Assume
o The allocated registers are Rb+1 to Rb+k (k registers)
o The node operation is op and op is unary
 If lr(N)=left then Go to the child
o Pass lr(N), and pass registers Rb+1 to Rb+k to the child
o Generate code: “op Rb+1 Rb+1”
 If lr(N)=right then Go to the child
o Pass lr(N), and pass registers Rb+1 to Rb+k to the child
o Generate code: “op Rb+k Rb+k”
 Leaf node x, x is an identifier
 Assume: allocated register is Rb+1
 Generate code: “load Rb+1 x”
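The recursive allocation and code generation can be sketched as follows. This is a hedged sketch under stated assumptions: the `Node` class, the mnemonic table, and the handling of the case L(left) > L(right) (treated symmetrically, which the slides do not spell out) are illustrative choices, and all leaves are taken to be identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    op: str                          # operator, or identifier name for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def label(n: Node) -> int:
    if n.left is None:
        return 1                     # every leaf is an identifier in this sketch
    if n.right is None:
        return label(n.left)
    l, r = label(n.left), label(n.right)
    return l + 1 if l == r else max(l, r)


MNEMONIC = {"+": "add", "-": "sub", "*": "mul"}   # assumed mnemonics


def gen(n: Node, regs: List[str], lr: str, out: List[str]) -> str:
    """Emit code for n; the result lands in regs[0] if lr == 'left', else regs[-1]."""
    dest = regs[0] if lr == "left" else regs[-1]
    if n.left is None:                             # leaf: load the identifier
        out.append(f"load {dest}, {n.op}")
        return dest
    name = MNEMONIC.get(n.op, n.op)
    if n.right is None:                            # unary: same registers, same mark
        gen(n.left, regs, lr, out)
        out.append(f"{name} {dest}, {dest}")
        return dest
    l, r = label(n.left), label(n.right)
    if l == r:
        a = gen(n.left, regs[:-1], "left", out)    # Rb+1 .. Rb+k-1
        b = gen(n.right, regs[1:], "right", out)   # Rb+2 .. Rb+k
    elif l < r:
        b = gen(n.right, regs, "right", out)       # bigger subtree first
        a = gen(n.left, regs[:l], "left", out)
    else:                                          # assumed symmetric case
        a = gen(n.left, regs, "left", out)
        b = gen(n.right, regs[-r:], "right", out)
    out.append(f"{name} {dest}, {a}, {b}")
    return dest


t1 = Node("+", Node("a"), Node("b"))
t4 = Node("+", Node("*", Node("c"), Node("d")), Node("+", Node("e"), Node("f")))
y = Node("*", t1, t4)

code: List[str] = []
gen(y, [f"r{i}" for i in range(1, label(y) + 1)], "left", code)
print("\n".join(code))
```

The root is called with exactly L(root) registers and mark "left", so the final result ends up in r1.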
Tree based Register Allocation and Scheduling
 Compute register requirement
 Assign registers
 Generate code
[Figure: labeled tree, each node with its label L and its assigned registers]
* L=3 (r1,r2,r3)
left child + L=2 (r1,r2), with leaves a L=1 (r1) and b L=1 (r2)
right child + L=3 (r1,r2,r3)
its left child * L=2 (r1,r2), with leaves c L=1 (r1) and d L=1 (r2)
its right child + L=2 (r2,r3), with leaves e L=1 (r2) and f L=1 (r3)
Generated code:
load r1, c
load r2, d
mul r1, r1, r2
load r2, e
load r3, f
add r3, r2, r3
add r3, r1, r3
(now r1, r2 are available)
load r1, a
load r2, b
add r1, r1, r2
mul r1, r1, r3
Global Register Allocation
 Basic approach
 Global liveness analysis
 Build the interference graph
 Graph coloring
 N colors
 N is the number of available registers
 If N-coloring is not possible
 Insert spill code to the program
Global Register Allocation
 Block level liveness analysis
[Figure: control-flow graph of the example, with the live set at each program point]
Statements: a := b+c; d := –a; e := d+f; f := 2*e; b := d+e; e := e–1; print(b); b := f+c
Live sets shown: {b,c,f}, {a,c,f}, {c,d,f}, {c,d,e,f}, {c,e}, {b,c,e,f}, {c,f}, {b}
Assumption: {b} is the Live set of the next block
Global Register Allocation
 Build the interference graph
 Show which variables interfere with each other
 Principle:
 Two variables that are alive simultaneously interfere
 They cannot be allocated to the same register
 A variable x is alive from its definition (x = ?) to its last use (? = x); every live set in between contains x: {x,…}
 Register interference graph:
 One vertex for each variable in the graph
 At each point “p” in the CFG
 L is the Live set at p
 If two variables x and y are in L together, i.e., L = {x,y,…}
 x should not get the same register as y
  add an edge (x,y)
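Building the graph from the live sets is mechanical. The sketch below assumes the live sets have already been computed; the function name and the adjacency-set representation are illustrative choices. It uses the live sets of the running example.

```python
from itertools import combinations


def build_interference_graph(live_sets):
    """Map each variable to the set of variables it interferes with."""
    graph = {}
    for live in live_sets:
        for v in live:
            graph.setdefault(v, set())
        # every pair of variables alive at the same point gets an edge
        for x, y in combinations(sorted(live), 2):
            graph[x].add(y)
            graph[y].add(x)
    return graph


# live sets of the example CFG
live_sets = [
    {"b", "c", "f"}, {"a", "c", "f"}, {"c", "d", "f"}, {"c", "d", "e", "f"},
    {"c", "e"}, {"b", "c", "e", "f"}, {"c", "f"}, {"b"},
]
graph = build_interference_graph(live_sets)
degrees = {v: len(n) for v, n in sorted(graph.items())}
print(degrees)
```

Here a, b and d end up with fewer than 4 edges, which is what the coloring example relies on.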
Global Register Allocation
 Build the interference graph -- example
[Figure: the example control-flow graph with its live sets, next to the interference graph over a, b, c, d, e, f]
Edges implied by the live sets: a–c, a–f, b–c, b–e, b–f, c–d, c–e, c–f, d–e, d–f, e–f
Global Register Allocation
 Graph coloring to decide register allocation
 Color the interference graph so that no two adjacent nodes are of
the same color
 Graph is k-colorable:
 Implies we can use k registers without needing to spill
 Deciding whether a graph is k-colorable is NP-complete
 If there are k registers available
 We do not care whether the graph is k-colorable, we have to only
use k registers anyway
 When it is not possible, spill
Global Register Allocation
 Coloring the graph with k colors
 Reasoning:
 If there exists a node x with fewer than k neighbors
 no matter how the neighbors are colored, there is a different color that
x can use
 Heuristic approach (this step is also called simplification)
 Pick a node x with fewer than k edges
 Put x on a stack (to keep track of the coloring order)
 Remove x and its edges from the interference graph
 If the resulting graph is k-colorable, then so is the original graph
 Repeat until only one node is left
 When there is no node with < k edges
 Algorithm fails
Global Register Allocation
 Coloring the example graph with 4 colors
 Color selection
 Start from the node that was added to the stack last
 Nodes removed later tend to have more edges, so their colors
should be decided first
 For each node, pick a color that is different from its neighbors
 It is always possible to get a color
 When the node was removed, it had fewer than k neighbors left
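The simplification and selection steps can be sketched together. This is a minimal sketch, assuming the interference graph is given as an edge list; the function name, the alphabetical tie-breaking, and returning `None` on failure (instead of spilling) are choices made here for illustration.

```python
def color_graph(edges, k):
    """Simplify/select coloring; returns {node: color} or None if simplification fails."""
    graph = {}
    for x, y in edges:
        graph.setdefault(x, set()).add(y)
        graph.setdefault(y, set()).add(x)

    # Simplification: repeatedly remove a node with fewer than k edges
    work = {n: set(nbrs) for n, nbrs in graph.items()}
    stack = []
    while work:
        candidates = sorted(n for n in work if len(work[n]) < k)
        if not candidates:
            return None                      # no node with < k edges: algorithm fails
        x = candidates[0]
        stack.append(x)
        for n in work.pop(x):
            work[n].discard(x)

    # Selection: pop nodes and give each a color unused by its colored neighbors
    colors = {}
    while stack:
        x = stack.pop()
        used = {colors[n] for n in graph[x] if n in colors}
        colors[x] = min(c for c in range(k) if c not in used)
    return colors


# interference graph of the example
EDGES = [("a", "c"), ("a", "f"), ("b", "c"), ("b", "e"), ("b", "f"), ("c", "d"),
         ("c", "e"), ("c", "f"), ("d", "e"), ("d", "f"), ("e", "f")]

four = color_graph(EDGES, 4)
three = color_graph(EDGES, 3)
print(four, three)
```

With 4 colors every node gets a register; with 3 colors the simplification gets stuck right after removing a.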
Global Register Allocation
 Coloring the example graph with 4 colors
 Simplification step
 a, b, d have < 4 edges. Choose a
 After removing a: b, d have < 4 edges. Choose d
 Now all nodes have < 4 edges, remove them in arbitrary order
[Figure: the interference graph after each removal, and the stack of removed nodes]
Global Register Allocation
 Coloring the example graph with 4 colors
 Selection step
[Figure: nodes are popped from the stack one by one and colored in the interference graph]
Global Register Allocation
 Coloring the example graph with 3 colors
 After removing a, no node has < 3 edges
 Algorithm fails!
[Figure: interference graph over a, b, c, d, e, f]
Global Register Allocation
 Coloring algorithm failure (for k colors)
 Does not imply it is not possible to color with K colors
 Always try to color anyway
 Example: color the graph with 3 colors
 After removing a, no node has < 3 edges: the algorithm fails
 Color the node with the highest degree first
 The remaining nodes have the same degree; choose any to color
 For the nodes marked in the figure, a color can still be found
 a had degree 2, so there is no problem coloring it
[Figure: interference graph over a, b, c, d, e, f]
Spill
 When no way is found to color with k colors
 Choose one node to spill
 Continue to spill if necessary, till a node can be removed
 For each spilled node
 For each definition, store the value
 For each use, load the value
 The loaded value needs a register anyway
 Naive approach
 Always keep extra registers for shuffling data in and out
 This wastes registers
 Rewrite code
 Use a new temporary variable for each load; it has a very short
life and is likely to have very few edges in the interference graph
 Redo liveness analysis and register allocation
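The rewrite-code approach can be sketched on three-address code. This sketch assumes instructions are tuples `(dest, op, sources)`; the tuple layout, the `load`/`store` pseudo-ops and the temporary naming `t1, t2, …` are illustrative assumptions (the names must not clash with existing variables).

```python
def rewrite_for_spill(code, var):
    """Insert a load before each use of var and a store after each definition."""
    out, counter = [], 0
    for dst, op, srcs in code:
        if var in srcs:
            counter += 1
            tmp = f"t{counter}"                    # fresh short-lived temporary
            out.append((tmp, "load", [var]))
            srcs = [tmp if s == var else s for s in srcs]
        if dst == var:
            counter += 1
            tmp = f"t{counter}"
            out.append((tmp, op, srcs))            # compute into a temporary
            out.append((var, "store", [tmp]))      # then store it to var's memory slot
        else:
            out.append((dst, op, srcs))
    return out


code = [
    ("a", "+", ["b", "c"]),
    ("d", "neg", ["a"]),
    ("e", "+", ["d", "f"]),
    ("b", "+", ["f", "c"]),
]
spilled = rewrite_for_spill(code, "c")
for instr in spilled:
    print(instr)
```

After the rewrite, c no longer appears as a register operand, so liveness analysis and coloring are redone on the new code.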
Spill
 Consider the example we gave earlier
 Cannot find a way to color with k = 3, spill
 After removing a, no node has < 3 edges
 Choose to spill c: remove c from the graph
 Once c is spilled, coloring can be done
 Nothing else to spill
[Figure: interference graph over a, b, c, d, e, f, before and after removing c]
Spill
 After spill, rewrite the code, redo the allocation
[Figure: the original code next to the rewritten code, with the new live sets and the new interference graph]
Rewritten code (blocks of the CFG):
store c
t1 := load c
a := b+t1
d := –a
e := d+f
f := 2*e
b := d+e
e := e–1
print(b)
t2 := load c
b := f+t2
Live sets shown: {b,f}, {b,f,t1}, {a,f}, {d,f}, {d,e,f}, {e}, {f}, {b,e,f}, {f,t2}, {b}
New interference graph nodes: a, b, d, e, f, t1, t2
Spill
 Redo the coloring for the new interference graph
 Consider k=3, the graph can be colored!!!
 When generating code
 Load c of the top block to t1’s register
 Load c of the bottom block to t2’s register
[Figure: the new interference graph over a, b, d, e, f, t1, t2, colored with 3 colors]
Pre-Color
 Some variables are pre-assigned to registers
 E.g., in C, it is possible to define register variables
 register int i
 Handle pre-assigned registers
 If the system has k registers available, and x variables are pre-assigned, then only use k–x registers for other variables
 But this is wasteful, these registers can be reused
 Perform coloring on the interference graph (IG)
 Still using k colors
 Pre-color the variables that are pre-assigned to registers
 In the simplification phase, these nodes cannot be removed
 The simplification phase terminates when only pre-colored nodes are left
 In the selection phase, do not change the colors of pre-colored nodes
Pre-Color
 Assume that a is pre-assigned to a register
 Consider K=4
 Pre-color a
 Terminate when only a is left
 Nodes removed (stack): b d e c f
[Figure: interference graph over a, b, c, d, e, f, with a pre-colored]
Coalescing
 When no way is found to color with k colors
 Try coalescing before trying spill!!!
 When there are copy statements, x := y, coalesce x and y
 Assign x and y to the same register
 Advantage
 Reduce the unneeded copying
 Save a register
 Requirement
 Assume no dead code
 x and y are not interfering, i.e., not connected in IG
Coalescing
 Example
{in: j, k}
g := M[j+12]
h := k – 1
f := g + h
e := M[j+8]
m := j + f
b := M[j]
c := e + 8
d := c
k := m + 4
j := b
{out: d,j,k}
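Live sets, from the entry and then after each statement up to k := m + 4:
{j, k}; {g, j, k}; {g, h, j}; {f, j}; {e, f, j}; {e, j, m}; {b, e, m}; {b, c, m}; {b, d, m}; {b, d, k}
[Figure: interference graph over b, c, d, e, f, g, h, j, k, m; copy links connect the operands of the copy statements d := c and j := b]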
Coalescing
 Coloring with 3 colors
 Removal order: h g f c
 Then no node has < 3 edges: need to spill
[Figure: interference graph over b, c, d, e, f, g, h, j, k, m]
Coalescing
 Coloring with 4 colors
 Removal order with 3 colors: h g f c, then stuck
 Cannot color with 3 colors
 Need to use 4 colors (4 registers)
[Figure: the remaining interference graph over b, d, e, j, k, m]
Coalescing
 Coalescing
{in: j, k}
g := M[j+12]
h := k – 1
f := g + h
e := M[j+8]
m := j + f
b := M[j]
c := e + 8
d := c
k := m + 4
j := b
{out: d,j,k}
[Figure: interference graph over b, c, d, e, f, g, h, j, k, m before coalescing]
Coalescing
 Coalescing c,d (non-interfering)
[Figure: interference graph with c and d merged into one node c,d]
Coalescing
 Coalescing b,j (non-interfering)
[Figure: interference graph with b and j also merged into one node b,j]
Coalescing
 Coloring with 3 colors
 Simplification
 Removal order: h g f k c,d b,j e m
[Figure: coalesced interference graph over g, h, f, e, k, m, c,d and b,j]
Coalescing
 Coloring with 3 colors
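 Selection
 Nodes are popped in the reverse of the removal order h g f k c,d b,j e m and assigned r1, r2 or r3
[Figure: coalesced interference graph colored with r1, r2, r3]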
Coalescing
 Coloring with 3 colors
{in: j, k}
g := M[j+12]
h := k – 1
f := g + h
e := M[j+8]
m := j + f
b := M[j]
c := e + 8
d := c
k := m + 4
j := b
{out: d,j,k}
Code after register assignment:
r3 := load j
r1 := load k
r2 := M[r3+12]
r1 := r1 – 1
r1 := r2 + r1
r2 := M[r3+8]
r1 := r3 + r1
r3 := M[r3]
r2 := r2 + 8
r2 := r2 (from d := c, can be dropped)
r1 := r1 + 4
r3 := r3 (from j := b, can be dropped)
store r2, d
store r3, j
store r1, k
Coalescing saved two copy statements and 1 register
[Figure: coalesced interference graph colored with r1, r2, r3]
Coalescing
 Another example
{in: a, c}
v := a + c
t := v * c
v := t * a
b := v
t := M[b]
u := b + c
w := t * u
{out: w}
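Live sets, from the entry and then after each statement:
{a, c}; {a, c, v}; {a, c, t}; {c, v}; {b, c}; {b, c, t}; {t, u}; {w}
[Figure: interference graph over a, b, c, t, u, v]
 Try to color with 3 colors
 Removal order: u v b a c t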
Coalescing
 Coalescing b,v (non-interfering)
[Figure: interference graph over a, b, c, t, u, v; b and v are not connected]
Coalescing
 Coalescing b,v
 Coloring with 3 colors
 After removing u, no node has degree < 3: need to spill!
 Coalescing increases the degree of the coalesced node and
makes the graph irreducible
 When to coalesce
 Only coalesce when the degrees of the nodes are not increased
 But sometimes coalescing increases the degree of some nodes
and still ends up saving registers
[Figure: interference graph over a, c, t, u and the coalesced node b,v]
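The effect described above can be checked with a small sketch. The graph is the interference graph implied by the live sets of this example (w is isolated and omitted); the function names and the naming of the merged node are assumptions for illustration.

```python
def simplifiable(graph, k):
    """True if repeatedly removing nodes with < k edges empties the graph."""
    work = {n: set(nbrs) for n, nbrs in graph.items()}
    while work:
        x = next((n for n in sorted(work) if len(work[n]) < k), None)
        if x is None:
            return False
        for n in work.pop(x):
            work[n].discard(x)
    return True


def coalesce(graph, x, y):
    """Merge two non-interfering nodes; the merged node inherits both neighbor sets."""
    if y in graph[x]:
        raise ValueError("interfering nodes cannot be coalesced")
    merged = f"{x},{y}"
    result = {merged: graph[x] | graph[y]}
    for n, nbrs in graph.items():
        if n not in (x, y):
            result[n] = {merged if m in (x, y) else m for m in nbrs}
    return result


edges = [("a", "c"), ("a", "v"), ("c", "v"), ("a", "t"),
         ("c", "t"), ("b", "c"), ("b", "t"), ("t", "u")]
graph = {}
for p, q in edges:
    graph.setdefault(p, set()).add(q)
    graph.setdefault(q, set()).add(p)

merged_graph = coalesce(graph, "b", "v")
print(simplifiable(graph, 3), simplifiable(merged_graph, 3))
```

Before coalescing, b and v each have 2 edges; the merged node has 3 (a, c, t), and after u is removed the four remaining nodes are all pairwise connected.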
Global Register Allocation
 Code generation
 For each statement
 Replace variables by registers
 If a variable comes from outside the CFG, it should be loaded into its register
first
 For the spilled variables
 Load to reserved registers if the rewrite code approach is not used
 Store the live variables
 No need to store temporary variables
 Variables that are alive after the CFG should be stored to memory
DAG Based Instruction Scheduling
 Dag
 Used for subexpression elimination
 Used to eliminate duplicate variables
a := b – c
b := a + d
d := b – c
a := a * d
b := b – c
e := a * b
[Figure: dag of the block: e1 = a2 * d1; a2 = a1 * d1; d1 (also b2) = b1 – c0; b1 = a1 + d0; a1 = b0 – c0]
DAG Based Instruction Scheduling
 How to generate code for dags
 Need to determine the schedule for executing the instructions
 Need to determine the register allocation
 Global register allocation
 Minimum register instruction sequence
 Based on the results, generate code
DAG Based Instruction Scheduling
 Algorithm for ordering nodes in a dag
 Start from the root
 Assign a node x an order number if
 All of x’s parents already have a number
 After x has obtained a number
 Try to assign numbers to x’s children
 If x’s child y cannot be assigned an order number: no problem
o y has at least one parent without an assigned number; when that parent gets
its number, y will be examined again for number assignment
 Since the dag is acyclic, all nodes will obtain a number and the order
is correct
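The ordering rule can be sketched as a recursive walk. This is a minimal sketch on the dag of the running example; the dictionary representation and the choice to visit the right child before the left one are assumptions made here, not something the rule above prescribes.

```python
def order_dag(children, root):
    """Number nodes from the root; a node gets a number once all its parents have one."""
    parents = {}
    for p, kids in children.items():
        for c in kids:
            parents.setdefault(c, set()).add(p)
    number = {}

    def visit(x):
        if x in number or any(p not in number for p in parents.get(x, ())):
            return                      # already numbered, or some parent is still waiting
        number[x] = len(number) + 1
        for c in reversed(children.get(x, ())):   # assumption: right child first
            visit(c)

    visit(root)
    return number


# dag of the example: node -> [left child, right child]
children = {
    "e1": ["a2", "d1"],
    "a2": ["a1", "d1"],
    "d1": ["b1", "c0"],
    "b1": ["a1", "d0"],
    "a1": ["b0", "c0"],
}
number = order_dag(children, "e1")
schedule = sorted(number, key=number.get, reverse=True)   # evaluate highest number first
print(number)
print(schedule)
```

Code is then generated in decreasing order of the numbers, so every operand is computed before the node that uses it.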
DAG Based Instruction Scheduling
 Ordering nodes in a dag – Example
[Figure: the dag with order numbers: e1 = 1, a2 = 2, d1 (b2) = 3, b1 = 4, d0 = 5, a1 = 6, c0 = 7, b0 = 8]
 Rewrite the code based on the dag
load b
load c
a1 := b – c
load d
b1 := a1 + d
d1 := b1 – c
a2 := a1 * d1
e1 := a2 * d1
DAG Based Instruction Scheduling
 Register allocation
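Code with the live set after each instruction:
load b, c {b, c}
a1 := b – c {a1, c}
load d {a1, c, d}
b1 := a1 + d {a1, b1, c}
d1 := b1 – c {a1, d1}
a2 := a1 * d1 {a2, d1}
e1 := a2 * d1 {e1}
…
use e
 Need 3 registers
[Figure: interference graph with register assignment: a1, a2, e1 in r1; c in r2; b, d, b1, d1 in r3]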
DAG Based Instruction Scheduling
 Generate code based on
 The instruction sequence
 8, 7, 6, 5, 4, 3, 2, 1 (decreasing order numbers)
 The register allocation
(8) load r3 b
(7) load r2 c
(6) sub r1 r3 r2
(5) load r3 d
(4) add r3 r1 r3
(3) sub r3 r3 r2
(2) mul r1 r1 r3
(1) mul r1 r1 r3
store e, r1
[Figure: the numbered dag with the register assignment: a1, a2, e1 in r1; c in r2; b, d, b1, d1 in r3]
Tree Based Register Allocation
 Original code
a := b – c
b := a + d
d := b – c
a := a * d
b := b – c
e := a * b
[Figure: the dag expanded into a tree by duplicating the shared nodes a1, b1, d1 and their leaves b0, c0, d0, labeled with register requirements; the root e0 and a2 have label 3]
 Does not really work
Simply Global Register Allocation
 Original code
a := b – c
b := a + d
d := b – c
a := a * d
b := b – c
e := a * b
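Live sets, from the entry and then after each statement:
{b, c, d}; {a, c, d}; {a, b, c}; {a, b, c, d}; {a, b, c}; {a, b}; {e}
[Figure: interference graph over a, b, c, d, e with its register assignment]
 Need to use 4 registers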
Minimum Register Instruction Sequence
 Derive an instruction sequence so that its register
requirement is minimum
 Instructions with no data dependency can be rearranged
 But the MRIS problem is NP-hard, so heuristic algorithms are needed
Original schedule (the figure shows the live range of each of t1 … t8):
(a) t1 := load x
(b) t2 := t1 + 4
(c) t3 := t1 * 8
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(f) t6 := t2 * t3
(g) t7 := t4 - t5
(h) t8 := t6 * t7
(i) store t8, y
Need 4 registers
Properly rescheduled:
(a) t1 := load x
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(g) t7 := t4 - t5
(c) t3 := t1 * 8
(b) t2 := t1 + 4
(f) t6 := t2 * t3
(h) t8 := t6 * t7
(i) store t8, y
Only need 3 registers
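The register counts of the two schedules can be reproduced by tracking live ranges. In this sketch an instruction is a pair `(dest, sources)` listing only temporaries; that representation, and letting a result reuse the register of an operand that dies in the same instruction, are assumptions of the sketch.

```python
def registers_needed(schedule):
    """Peak number of simultaneously live temporaries for a given instruction order."""
    last_use = {}
    for i, (dst, srcs) in enumerate(schedule):
        for s in srcs:
            last_use[s] = i
    live, peak = set(), 0
    for i, (dst, srcs) in enumerate(schedule):
        # operands used here for the last time free their register
        live -= {s for s in srcs if last_use[s] == i}
        if dst is not None:
            live.add(dst)
        peak = max(peak, len(live))
    return peak


INSTR = {
    "a": ("t1", []),            # t1 := load x
    "b": ("t2", ["t1"]),        # t2 := t1 + 4
    "c": ("t3", ["t1"]),        # t3 := t1 * 8
    "d": ("t4", ["t1"]),        # t4 := t1 - 4
    "e": ("t5", ["t1"]),        # t5 := t1 / 2
    "f": ("t6", ["t2", "t3"]),  # t6 := t2 * t3
    "g": ("t7", ["t4", "t5"]),  # t7 := t4 - t5
    "h": ("t8", ["t6", "t7"]),  # t8 := t6 * t7
    "i": (None, ["t8"]),        # store t8, y
}
original = registers_needed([INSTR[c] for c in "abcdefghi"])
rescheduled = registers_needed([INSTR[c] for c in "adegcbfhi"])
print(original, rescheduled)
```

The rescheduled order finishes t4, t5 and t7 early, so fewer values are alive at the same time.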
Minimum Register Instruction Sequence
 Consider dag based MRIS
 Construct dag for the example code
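(a) t1 := load x
(b) t2 := t1 + 4
(c) t3 := t1 * 8
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(f) t6 := t2 * t3
(g) t7 := t4 - t5
(h) t8 := t6 * t7
(i) store t8, y
[Figure: dag of the example: t8 (stored to y) has children t6 and t7; t6 has children t2 and t3; t7 has children t4 and t5; t2, t3, t4, t5 each combine t1 with a constant (4, 8, 4, 2); t1 is loaded from x]
 How to determine
 Register allocation
 Execution sequencing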
Minimum Register Instruction Sequence
 Principles
 Nodes on a single path can share the same register
 E.g., t1, t2, t3 can share one register
o r0 := r0 op x, r0 := r0 op y
[Figure: a path t1, t2, t3 with leaf operands x, y, z; s1 is another parent of t1]
 New scheduling constraints
 Sharing a register may introduce new dependencies (execution orders)
in the dag
 E.g., if t2 reuses t1’s register, then all other operations that use t1
should be done first
o E.g., s1 uses t1, so s1 should be evaluated before t2 uses t1’s register
o Otherwise, t1’s register value is overwritten by t2 and t1 is lost
 This only happens for nodes with multiple parents (not for trees)
 A constant does not need any register
Minimum Register Instruction Sequence
 Path formation
 Number the nodes
 Number root nodes as 0
 The child node number = max (parent node number) + 1
 Find all paths till all nodes are covered
 Start from a largest node that is not covered
o If the node has multiple parents, choose the smallest parent node
o Need to also create scheduling constraints
 Keep going till reaching the end or a covered node
o For an already covered node, still include, but use ( ) to mark it
- Because the register can only get released after ( ) is evaluated
- Need it for interference analysis
 A node that is already in a path is marked as covered
Minimum Register Instruction Sequence
 MRIS Algorithm
 Number the nodes
 Find all paths
Largest uncovered node: t1
t1 has multiple parents of the same number, choose t2
P1 = [t1, t2, t6, t8, ()]
Largest uncovered node: t3, t4, t5, choose t3
P2 = [t3, (t6)]
Largest uncovered node: t4, t5, choose t4
P3 = [t4, t7, (t8)]
Largest uncovered node: t5
P4 = [t5, (t7)]
Largest uncovered node: none
t1 has multiple parents and t1, t2 share registers
t3, t4, t5 have to be evaluated before t2
 add scheduling constraints
 Redo numbering
[Figure: the example dag with node numbers, before and after renumbering]
Minimum Register Instruction Sequence
 Interferences
 Live range of a variable x
o The duration in which x is alive
o Can be derived from liveness analysis
 Live range
 When a path shares a register r, the live range for r spans the life of
the entire path
o r0 := r0 op x, r0 := r0 op y  r0 is alive from the evaluation of t1 to t3
 If two variables have overlapping live ranges, they cannot share a register
 Before instructions are scheduled, the live ranges of variables are
not determined
 But dependencies impose constraints on live ranges
  Find out live range constraints
[Figure: a path t1, t2, t3 with leaf operands x, y, z; s1 is another parent of t1]
Minimum Register Instruction Sequence
 Interferences
 Theorem 1
 Two paths: P = [u1, u2, …, um], Q = [v1, v2, …, vn]
 If u1 can reach vn
 Need to evaluate u1 before vn
 And v1 can reach um
 Need to evaluate v1 before um
  Live ranges of P and Q have to overlap
  Cannot use the same registers
 Use this relation to construct IG
 Not for the nodes but for the paths
 Essentially, for the registers (path = register)
[Figure: two paths, u1 … u5 and v1 … v3, where the start of each can reach the end of the other]
 Edges due to scheduling constraints
 They enforce execution order, their impact on live range is the
same as the other edges
Minimum Register Instruction Sequence
P1 = [t1, t2, t6, t8, ()]
P2 = [t3, (t6)]
P3 = [t4, t7, (t8)]
P4 = [t5, (t7)]
 Dag based MRIS
 Interferences based on Theorem 1
 P1 interferes with all other paths
o t1 can reach all nodes
o All nodes can reach t8
 P2 interferes with P3
o t3 can reach t8, t4 can reach t6
 P2 does not interfere with P4
o t5 can reach t6, but t3 does not reach t7
 P3 interferes with P4
o t4 can reach t7, t5 can reach t8
[Figure: the example dag, and the interference graph over the paths P1, P2, P3, P4]
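The path interference test of Theorem 1 reduces to two reachability checks. In this sketch, edges point from a node to its parents, the scheduling constraints "t3, t4, t5 before t2" are added as extra edges, and the last element of each path is the node written in ( ); these representation choices and the function names are assumptions for illustration.

```python
def reaches(edges, src, dst):
    """True if dst can be reached from src by following edges upward."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(edges.get(n, ()))
    return False


def interfere(edges, p, q):
    """Theorem 1: both path starts must reach the end of the other path."""
    return reaches(edges, p[0], q[-1]) and reaches(edges, q[0], p[-1])


# node -> parents; t3, t4, t5 -> t2 are the added scheduling constraints
edges = {
    "t1": ["t2", "t3", "t4", "t5"],
    "t2": ["t6"],
    "t3": ["t6", "t2"],
    "t4": ["t7", "t2"],
    "t5": ["t7", "t2"],
    "t6": ["t8"],
    "t7": ["t8"],
}
paths = {
    "P1": ["t1", "t2", "t6", "t8"],
    "P2": ["t3", "t6"],          # [t3, (t6)]
    "P3": ["t4", "t7", "t8"],    # [t4, t7, (t8)]
    "P4": ["t5", "t7"],          # [t5, (t7)]
}
names = sorted(paths)
pairs = {(a, b): interfere(edges, paths[a], paths[b])
         for i, a in enumerate(names) for b in names[i + 1:]}
print(pairs)
```

Only the pair (P2, P4) comes out as non-interfering, which makes it the only candidate for sharing a register.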
Minimum Register Instruction Sequence
 How to schedule the paths
 The approach so far only found the potential number of registers
 Still do not know how to schedule
 Path fusing
 A register can only be released when an entire path is done
 Done means the last node in () is evaluated
 () is in another path
 The result from the path can be stored in the register of another path
 Check path pairs (x,y) to “fuse”
 If x can execute till completion before y starts, we say (x,y) can fuse
 Then x can release the register to y after it completes
 Which pairs to consider?
 Eliminate impossible pairs
o All the interfering pairs (due to any of the reasons) cannot fuse
Minimum Register Instruction Sequence
 Dag based MRIS
P1 = [t1, t2, t6, t8, ()]
P2 = [t3, (t6)]
P3 = [t4, t7, (t8)]
P4 = [t5, (t7)]
 Find (x, y) to fuse
 Only P2 and P4 do not interfere
 Try (P2, P4)
o t6 > t2 and t2 > t4
o Need to start P4 before evaluating t2 and t6
o Not possible
 Try (P4, P2)
o P4 can complete without starting P2
- Completing P4 only needs P3 to be executed partially
o Succeeded
o Let P2 and P4 share a register
o Add the scheduling constraint t7 < t3
[Figure: the example dag, and the interference graph over the paths P1, P2, P3, P4]
Minimum Register Instruction Sequence
 Dag based MRIS
 Assign register
 Assign according to the path
 P1, P3 each gets one register
 P4 and P2 share one register
 Do not color the node in ()
 Find execution order for the dag
 Dag node ordering
 Code generation
r1 := load x
r3 := r1 / 2
r2 := r1 – 4
r2 := r2 – r3
r3 := r1 * 8
r1 := r1 + 4
r1 := r1 * r3
r1 := r1 * r2
store y r1
[Figure: the dag with order numbers and register assignment: P1 in r1, P3 in r2, P2 and P4 in r3]
Instruction Selection
 Some hardware provides a rich set of instructions
 Typically not RISC processors
 There are multiple ways to translate a set of instructions
 Example instruction set
 load r1, r2: load M[r2] to r1
 store r1, r2: store r2 to M[r1]
 add r1, r2: r1 := r1 + r2
 addc r1, c: r1 := r1 + c, where c is a constant
 mul r1, r2: r1 := r1 * r2
 mulc r1, c: r1 := r1 * c, where c is a constant
 movem r1, r2: M[r1] := M[r2]
 movex r1, r2, r3: M[r1] := M[r2+r3]
Instruction Selection
 Example program
 A[i+1] := B[j]
 Intermediate code
t1 := j * 4
t2 := B + t1
t3 := M[t2]
t4 := i + 1
t5 := t4 * 4
t6 := A + t5
M[t6] := t3
 Assume: register ra holds address of A
 Assume: register rb holds address of B
 Assume: register ri holds value of i
 Assume: register rj holds value of j
 Translation using load and store
mulc rj, 4
add rb, rj
load r1, rb
addc ri, 1
mulc ri, 4
add ra, ri
store ra, r1
 Translation using movem
mulc rj, 4
add rb, rj
addc ri, 1
mulc ri, 4
add ra, ri
movem ra, rb
 Translation using movex
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj
Instruction Selection
 Each instruction may have a different cost
 Time cost: how fast the instruction can execute
 Space cost: how much space the instruction takes
 For example
 load r1, r2: cost = 3
 store r1, r2: cost = 3
 add r1, r2: cost = 1
 addc r1, c: cost = 1
 mul r1, r2: cost = 10
 mulc r1, c: cost = 10
 movem r1, r2: cost = 4
 movex r1, r2, r3: cost = 5
 Translation using load and store: cost = 29
mulc rj, 4
add rb, rj
load r1, rb
addc ri, 1
mulc ri, 4
add ra, ri
store ra, r1
 Translation using movem: cost = 27
mulc rj, 4
add rb, rj
addc ri, 1
mulc ri, 4
add ra, ri
movem ra, rb
 Translation using movex: cost = 27
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj
 Goal: find the translations with the minimal cost
Tree Representation
 Problem
 Some translations may have to take non-consecutive instructions
t1 := j * 4
t2 := b + t1
t3 := M[t2]
t4 := i + 1
t5 := t4 * 4
t6 := a + t5
M[t6] := t3
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj
 Solution
 Use tree-like representation for instruction match
o In tree, these instructions are consecutive
 Convert instructions to tiles
  Easier to detect the matching instructions
Instruction Selection
 Goal
 Determine parts of the tree that can match the instruction tiles
[Figure: expression tree for A[i+1] := B[j]: store(a + (i + 1) * 4, load(b + j * 4))]
[Figure: instruction tiles, where R? stands for a register and c? for a constant: load, store, add, addc, mul, mulc, movem, movex, …]
Instruction Selection
 Desirable to achieve optimal tiling
 Get the instruction set with least cost
 Not easy
 The maximal munch algorithm (a greedy algorithm)
 Start from the tree root and find all matching tiles
 Select the one with the maximum number of nodes
o Can consider other criteria that include the cost of the instruction
 Go to the children and apply the algorithm recursively
 Until the tree is fully covered
Instruction Selection
 Dynamic programming for tiling
 Start from the tree root
 For each node
 If the best cost for the node has already been computed, then return it
 Otherwise: Cost of a node = cost of T + total costs of all children
o T is a matching tile; for different T, the children may be different
o Select the minimum cost among all possible matching T’s
 The tiling decision is top down, the cost is computed bottom up
 Complexity: O(N*NT)
 N: the number of nodes
 NT: the maximum number of tiles for a node
 The cost of each node only needs to be computed once
o Once computed, just return
 After one round, the costs of the internal nodes of some tiles may not
have been computed; if such a node is selected in another round, its cost will be computed then
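The dynamic programming scheme can be sketched with tuple-encoded trees and tile patterns. This is a sketch under stated assumptions: the tuple encoding, the wildcard names `R` and `C`, and the zero cost for values already in registers are choices made here; the tiles and their costs are the example instruction set.

```python
from functools import lru_cache

R, C = "R", "C"   # wildcards: any subtree computed into a register / any constant

TILES = [         # (instruction, cost, pattern)
    ("load", 3, ("load", R)),
    ("store", 3, ("store", R, R)),
    ("add", 1, ("+", R, R)),
    ("addc", 1, ("+", R, C)),
    ("mul", 10, ("*", R, R)),
    ("mulc", 10, ("*", R, C)),
    ("movem", 4, ("store", R, ("load", R))),
    ("movex", 5, ("store", R, ("load", ("+", R, R)))),
]


def match(pattern, tree):
    """Return the subtrees bound to R wildcards, or None if the tile does not fit."""
    if pattern == R:
        return [tree]
    if pattern == C:
        return [] if tree[0] == "const" else None
    if tree[0] != pattern[0] or len(tree) != len(pattern):
        return None
    bound = []
    for p, t in zip(pattern[1:], tree[1:]):
        m = match(p, t)
        if m is None:
            return None
        bound += m
    return bound


@lru_cache(maxsize=None)          # each node's best cost is computed only once
def best(tree):
    """Return (minimum cost, instruction sequence) for the tree."""
    if tree[0] == "reg":
        return 0, ()              # value already sits in a register
    result = (float("inf"), ())
    for name, cost, pattern in TILES:
        kids = match(pattern, tree)
        if kids is None:
            continue
        total, code = cost, ()
        for k in kids:
            sub_cost, sub_code = best(k)
            total += sub_cost
            code += sub_code
        if total < result[0]:
            result = (total, code + (name,))
    return result


# A[i+1] := B[j]
tree = ("store",
        ("+", ("reg", "ra"), ("*", ("+", ("reg", "ri"), ("const", 1)), ("const", 4))),
        ("load", ("+", ("reg", "rb"), ("*", ("reg", "rj"), ("const", 4)))))
cost, code = best(tree)
print(cost, code)
```

A constant can only be consumed inside an addc or mulc tile here, because the example instruction set has no instruction that loads a constant into a register.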
Instruction Selection
 Cost criteria should consider modern architecture
 Best instruction set or best instruction schedule may not be useful
for some modern architectures
 E.g., best scheduling may reduce register usage, but may not allow
best pipelining or best parallel execution
o Pipelining is commonly used in modern processors
o Multiple core will become common architecture
 For both instruction selection and instruction scheduling
 Should consider the cost to facilitate pipelining and/or parallel
execution
Peephole Optimization
 Performed at the end of code generation
 Performed directly at the generated machine code
 Only look at a few instructions
 Generally no more than 5
 Using a sliding window
 Eliminating redundant instructions
 Code produced by a code generator may contain redundancies; even after other
optimization steps, some easy-to-catch redundancies may remain
 Algebraic transformation
 Strength reductions
Eliminate Redundancies
 Unnecessary load-store
Before:
load r0, a
store a, r0
After:
load r0, a
 Eliminate jump after jump
Case 1, before:
if (a<b) goto L1
...
L1: goto L2
Case 1, after:
if (a<b) goto L2
...
L1: goto L2
Case 2, before:
goto L1
...
L1: if a < b goto L2
L3:
Case 2, after:
if (a<b) goto L2
goto L3
...
Case 3, before:
goto L1
...
L1: return
L3:
Case 3, after:
return
...
L1:
Eliminate Redundancies
 Eliminate jump after jump
Source code:
debug = 0
...
if(debug)
{print debugging information}
Generated intermediate code:
debug = 0
...
if debug = 1 goto L1
goto L2
L1: print debugging information
L2:

After optimization:
debug = 0
...
if debug ≠ 1 goto L2
print debugging information
L2:
Strength Reduction
 Replace multiplication and division by shift
 A := A * 4  A := A << 2
 Need to take care of overflow problem (may result in negative number)
 A := A / 4  A := A >> 2
 Need to shift by replacing msb with sign bit (arithmetic shift)
 But right-shift has a famous problem in two’s complement representation
o –5 = 111111...1111111011, >> –5 = 111111...1111111101 (–5 / 2 = –2, but the result here is –3)
o –6 = 111111...1111111010, >> –6 = 111111...1111111101 (no problem, correct answer)
o Fix 1: shift bit by bit, add lsb back to the number after shift
o Fix 2: Convert to a positive number for the shift, then convert back to negative
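Both fixes can be tried out directly, since Python's `>>` on negative integers behaves like an arithmetic shift. This is a minimal sketch; the function names are illustrative, and division is taken to truncate toward zero as in the –5 / 2 = –2 example above.

```python
def shift_naive(x, k):
    """Plain arithmetic right shift: rounds toward minus infinity."""
    return x >> k


def div2_fix1(x):
    """Fix 1 for a one-bit shift: add the dropped lsb back for negative numbers."""
    return (x >> 1) + ((x & 1) if x < 0 else 0)


def div_pow2_fix2(x, k):
    """Fix 2: shift the positive magnitude, then restore the sign."""
    return x >> k if x >= 0 else -((-x) >> k)


results = {
    "naive(-5)": shift_naive(-5, 1),
    "naive(-6)": shift_naive(-6, 1),
    "fix1(-5)": div2_fix1(-5),
    "fix2(-5)": div_pow2_fix2(-5, 1),
}
print(results)
```

Only odd negative numbers are affected, because the bit shifted out is 1 and the arithmetic shift rounds down instead of toward zero.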
Code Generation Issues -- Summary
 Read Chapter 8
 Sections 8.5, 8.7, 8.8, 8.10
 Run time storage allocation
 Register allocation and instruction scheduling
 For basic blocks
 Tree based
 Dag based
o R. Govindarajan, H. Yang, J. N. Amaral, C. Zhang, G. R. Gao,
“Minimum register instruction sequence problem: Revisiting optimal
code generation for DAGs,” IEEE International Parallel & Distributed
Processing Symposium, 2001
 Global register allocation using graph coloring
 Peephole optimization
