Code Generation
Use registers during execution
Whenever possible, perform computation in registers
Memory loads and stores are much more expensive than register operations
Need to determine the best register allocation
For a given number of registers, minimize the number of spills
Spill: when we run out of registers, store the contents of some registers to memory
Need to determine the best order of instruction execution
To satisfy the suboptimal register allocation decision
To reduce the number of instructions
Instruction selection
Map the intermediate code to the set of machine instructions that
minimizes the cost of execution
Peephole optimization
Code Generation
Various methods for register allocation and instruction
scheduling
Tree
Achieve optimal register allocation and instruction scheduling
DAG (directed acyclic graph)
Achieve local subexpression elimination (optimal)
Optimal register allocation and instruction scheduling is NP-hard
Heuristic algorithms
Global
Global register allocation
There is no corresponding scheduling algorithm; just follow the
original instruction order
Tree Based Approach for a Basic Block
Basic block:
t1 := a + b
t2 := c * d
t3 := e + f
t4 := t2 + t3
y := t1 * t4
Assumptions:
The system has two registers, r0, r1
Only y is live at the exit of the block
Instruction format: “op reg reg/mem reg”, where the first reg holds the result
E.g., “– a b c” means a := b – c
Naive code (every temporary is stored to memory):
load r0, a
add r0, r0, b
store t1, r0
load r0, c
mul r0, r0, d
store t2, r0
load r0, e
add r0, r0, f
store t3, r0
load r0, t2
add r0, r0, t3
store t4, r0
load r0, t1
mul r0, r0, t4
store y, r0
15 instructions
10 load, 5 store
Can we use the registers more effectively?
Tree Based Approach for a Basic Block
Expression tree of the block:
* y
  + t1
    a
    b
  + t4
    * t2
      c
      d
    + t3
      e
      f
Evaluating t1 first:
load r0, a
add r0, r0, b
load r1, c
mul r1, r1, d
(t1 in r0 and t2 in r1 are still needed, but no register is left to compute t3; one has to be spilled, choose r0)
store t1, r0
load r0, e
add r0, r0, f
add r0, r1, r0
(need to load t1 back into r1)
load r1, t1
mul r0, r1, r0
store y, r0
11 instructions
7 load, 2 store (1 spill)
Tree Based Approach for a Basic Block
Evaluating the t4 subtree first (same expression tree as before):
load r0, c
mul r0, r0, d
load r1, e
add r1, r1, f
add r1, r0, r1
load r0, a
add r0, r0, b
mul r0, r0, r1
store y, r0
9 instructions
6 load, 1 store (0 spill)
Optimal!
Can we always achieve optimal execution?
Tree based Register Allocation and Scheduling
Construct the execution tree for a basic block
Label the tree to obtain the register requirements
Depth first labeling
L(leaf) = 1 if it is an identifier
L(leaf) = 0 if it is a constant
L(nonleaf node) =
If L(left child) = L(right child) then
o L(current node) := L(left child) + 1
Otherwise
o L(current node) := max (L(left child), L(right child))
Assumption: from here onwards, 3-address code needs to have the form
op reg reg reg
Assign registers and generate code
Register allocation
Instruction scheduling follows the register allocation algo
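The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of trees (op, left, right), with strings for identifiers and integers for constants, is an assumption made for the example and not part of the slides.

```python
def label(node):
    """Return the register requirement L(node) of an expression tree.

    Assumed encoding: a str is an identifier leaf, an int is a constant
    leaf, and a tuple (op, left, right) is a binary operation.
    """
    if isinstance(node, int):
        return 0                      # L(leaf) = 0 for a constant
    if isinstance(node, str):
        return 1                      # L(leaf) = 1 for an identifier
    _, left, right = node
    l, r = label(left), label(right)
    return l + 1 if l == r else max(l, r)


# y := (a + b) * ((c * d) + (e + f)), the basic block used earlier
tree = ("*", ("+", "a", "b"), ("+", ("*", "c", "d"), ("+", "e", "f")))
print(label(tree))  # 3
```

The root label 3 agrees with the three registers used in the worked example of this method.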
Tree based Register Allocation and Scheduling
Register allocation and instruction scheduling
The process starts from the root and recursively goes down to the leaf nodes
Each non-leaf node N, with mark lr(N), does the following
Assume
o The registers allocated to N are Rb+1 to Rb+k (k registers)
o The node operation is op and op is binary
o Example: a node “– t1” with children a and b stands for t1 := a – b, i.e., “– t1 a b”
If L(left) = L(right)
o Go to left, pass lr = left, use registers Rb+1 to Rb+k–1, store result in Rb+1
o Go to right, pass lr = right, use registers Rb+2 to Rb+k, store result in Rb+k
o If lr(N) = left gen “op Rb+1 Rb+1 Rb+k”; else gen “op Rb+k Rb+1 Rb+k”
If L(left) < L(right)
(assume: left needs m registers, right needs k registers, k > m)
o Go to right, pass lr = right, use registers Rb+1 to Rb+k, store result in Rb+k
o Go to left, pass lr = left, use registers Rb+1 to Rb+m, store result in Rb+1
o If lr(N) = left gen “op Rb+1 Rb+1 Rb+k”; else gen “op Rb+k Rb+1 Rb+k”
Tree based Register Allocation and Scheduling
Register allocation and instruction scheduling
Each non-leaf node N, with mark lr(N), does the following
Assume
o The allocated registers are Rb+1 to Rb+k (k registers)
o The node operation is op and op is unary
If lr(N)=left then Go to the child
o Pass lr(N), and pass registers Rb+1 to Rb+k to the child
o Generate code: “op Rb+1 Rb+1”
If lr(N)=right then Go to the child
o Pass lr(N), and pass registers Rb+1 to Rb+k to the child
o Generate code: “op Rb+k Rb+k”
Leaf node x, where x is an identifier
Assume: allocated register is Rb+1
Generate code: “load Rb+1 x”
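A minimal sketch of the binary-operator rules, assuming trees are encoded as tuples (op, left, right) with string identifiers as leaves. Constants and unary operators are left out, and the case L(left) > L(right), which the slides do not spell out, is handled here as the mirror image of L(left) < L(right); that mirroring is an assumption.

```python
def label(node):
    if isinstance(node, str):
        return 1                      # identifier leaf
    _, left, right = node
    l, r = label(left), label(right)
    return l + 1 if l == r else max(l, r)


def gen(node, regs, lr, code):
    """Emit code for node using exactly the registers in regs.

    The result ends up in regs[0] if lr == "left", else in regs[-1].
    Returns the register that holds the result.
    """
    if isinstance(node, str):
        reg = regs[0] if lr == "left" else regs[-1]
        code.append(f"load {reg}, {node}")
        return reg
    op, left, right = node
    l, r, k = label(left), label(right), len(regs)
    if l == r:
        a = gen(left, regs[:k - 1], "left", code)
        b = gen(right, regs[1:], "right", code)
    elif l < r:
        b = gen(right, regs, "right", code)      # bigger subtree first
        a = gen(left, regs[:l], "left", code)
    else:                                        # assumed mirror case
        a = gen(left, regs, "left", code)
        b = gen(right, regs[k - r:], "right", code)
    dst = a if lr == "left" else b
    code.append(f"{op} {dst}, {a}, {b}")
    return dst


tree = ("mul", ("add", "a", "b"),
        ("add", ("mul", "c", "d"), ("add", "e", "f")))
code = []
gen(tree, ["r1", "r2", "r3"], "left", code)
print("\n".join(code))
```

For this tree the sketch emits the same eleven instructions as the worked example of the method, starting with load r1, c and ending with mul r1, r1, r3.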
Tree based Register Allocation and Scheduling
Example: y := (a + b) * ((c * d) + (e + f))
Step 1: compute the register requirement (label of each node)
Step 2: assign registers
Step 3: generate code
Node            Label   Registers    Code generated at the node
* (root)        3       (r1,r2,r3)   mul r1, r1, r3
  + (left)      2       (r1,r2)      add r1, r1, r2
    a           1       (r1)         load r1, a
    b           1       (r2)         load r2, b
  + (right)     3       (r1,r2,r3)   add r3, r1, r3
    *           2       (r1,r2)      mul r1, r1, r2
      c         1       (r1)         load r1, c
      d         1       (r2)         load r2, d
    +           2       (r2,r3)      add r3, r2, r3
      e         1       (r2)         load r2, e
      f         1       (r3)         load r3, f
Generated code (the right subtree of the root has the larger label, so it is evaluated first):
load r1, c
load r2, d
mul r1, r1, r2
load r2, e
load r3, f
add r3, r2, r3
add r3, r1, r3
(now r1, r2 are available)
load r1, a
load r2, b
add r1, r1, r2
mul r1, r1, r3
Global Register Allocation
Basic approach
Global liveness analysis
Build the interference graph
Graph coloring
N colors
N is the number of available registers
If N-coloring is not possible
Insert spill code to the program
Global Register Allocation
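Block level liveness analysis
[Figure: control-flow graph of the example program, annotated with the live set at each program point]
Top block, with the live set before and after each statement:
{b,c,f}
a := b+c
{a,c,f}
d := –a
{c,d,f}
e := d+f
{c,d,e,f}
Statements of the other blocks, with the live sets before and after:
{c,e} f := 2*e {c,f}
{c,f} b := f+c
{c,d,e,f} b := d+e {b,c,e,f}
{b,c,e,f} e := e–1 {b,c,f}
print(b)
The live sets {b,c,f} and {b} appear at the block exits
Assumption: {b} is the Live set of the next block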
Global Register Allocation
Build the interference graph
Show which variables interfere with each other
Principle:
Two variables that are live simultaneously interfere
They cannot be allocated to the same register
[Figure: x is live from its definition (x = ?) to its use (? = x); every live set {x, …} in between contains x]
Register interference graph:
One vertex for each variable in the graph
At each point “p” in the CFG
L is the Live set at p
If two variables x and y are in L together,
x should not get the same register as y
add an edge (x,y)
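The construction rule can be sketched as follows. The representation (a dict from each variable to the set of its neighbors) is an assumption made for illustration; the live sets are the ones shown on the block level liveness analysis slide.

```python
from itertools import combinations


def interference_graph(live_sets):
    """One vertex per variable; an edge for every pair that is live together."""
    graph = {}
    for live in live_sets:
        for v in live:
            graph.setdefault(v, set())
        for x, y in combinations(live, 2):
            graph[x].add(y)
            graph[y].add(x)
    return graph


live_sets = [
    {"b", "c", "f"}, {"a", "c", "f"}, {"c", "d", "f"}, {"c", "d", "e", "f"},
    {"c", "e"}, {"c", "f"}, {"b", "c", "e", "f"}, {"b"},
]
g = interference_graph(live_sets)
for v in sorted(g):
    print(v, sorted(g[v]))
```

With these live sets, a ends up with 2 neighbors and b and d with 3 each, which is consistent with the degrees quoted in the coloring example.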
Global Register Allocation
Build the interference graph -- example
[Figure: the CFG with live sets from the liveness analysis, next to the interference graph over a, b, c, d, e, f]
Edges obtained from the live sets (neighbors of each node):
a: c, f
b: c, e, f
c: a, b, d, e, f
d: c, e, f
e: b, c, d, f
f: a, b, c, d, e
Global Register Allocation
Graph coloring to decide register allocation
Color the interference graph so that no two adjacent nodes are of
the same color
Graph is k-colorable:
Implies we can use k registers without needing to spill
Deciding whether a graph is k-colorable is NP-complete (for k ≥ 3)
If there are k registers available
We do not care whether the graph is k-colorable, we have to only
use k registers anyway
When it is not possible, spill
Global Register Allocation
Coloring the graph with k colors
Reasoning:
If there exists a node x with fewer than k neighbors
no matter how the neighbors are colored, there is a different color that
x can use
Heuristic approach (this step is also called simplification)
Pick a node x with fewer than k edges
Put x on a stack (to keep track of the coloring order)
Remove x and its edges from the interference graph
If the resulting graph is k-colorable, then so is the original graph
Repeat until only one node is left
When there is no node with < k edges
The algorithm fails
Global Register Allocation
Coloring the example graph with 4 colors
Color selection
Starting from the last nodes added to the stack
The nodes removed later tend to have more edges, so their colors
should be decided first
For each node, pick a color that is different from its neighbors
Always possible to get a color
This is obvious from how the node was removed
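A sketch of the simplification and selection steps together, assuming the interference graph is a dict from node to neighbor set. The tie-breaking (alphabetical) and the rule for picking a free color (lowest number) are arbitrary choices for the example. The edge list is the one implied by the live sets of the running example.

```python
def color(graph, k):
    """Return {node: color in 0..k-1}, or None if simplification gets stuck."""
    g = {v: set(ns) for v, ns in graph.items()}
    stack = []
    while g:                                   # simplification
        low = sorted(v for v in g if len(g[v]) < k)
        if not low:
            return None                        # no node with < k edges
        v = low[0]
        stack.append(v)
        for n in g.pop(v):
            g[n].discard(v)
    colors = {}
    while stack:                               # selection, last removed first
        v = stack.pop()
        used = {colors[n] for n in graph[v] if n in colors}
        colors[v] = min(c for c in range(k) if c not in used)
    return colors


edges = ["ac", "af", "bc", "be", "bf", "cd", "ce", "cf", "de", "df", "ef"]
graph = {v: set() for v in "abcdef"}
for x, y in edges:
    graph[x].add(y)
    graph[y].add(x)

print(color(graph, 4))   # a valid assignment with 4 colors
print(color(graph, 3))   # None: stuck after removing a
```

Returning None only means that this heuristic failed; as the later slides point out, that alone does not prove that no k-coloring exists.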
Global Register Allocation
Coloring the example graph with 4 colors
Simplification step
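a, b, d have < 4 edges. Choose a
b, d have < 4 edges. Choose d
Now all nodes have < 4 edges, remove them in arbitrary order
[Figure: the interference graph after each removal, and the stack with a at the bottom, then d, then the remaining nodes b, c, e, f]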
Global Register Allocation
Coloring the example graph with 4 colors
Selection step
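[Figure: nodes are popped from the stack one at a time; each popped node is put back into the graph and receives a color that none of its already colored neighbors uses]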
Global Register Allocation
Coloring the example graph with 3 colors
After removing a, no node has < 3 edges
Algorithm fails!
[Figure: interference graph over a, b, c, d, e, f]
Global Register Allocation
Coloring algorithm failure (for k colors)
Does not imply it is not possible to color with K colors
Always try to color anyway
Example: color the graph with 3 colors
After removing a, no node has < 3 edges. The algorithm fails!
Color the node with the highest degree first
The remaining nodes have the same degree; choose any to color
a had degree 2, so there is no problem coloring it
[Figure: interference graph over a, b, c, d, e, f; callouts on two of the nodes read “Still can find a color for this node!”]
Spill
When no way is found to color with k colors
Choose one node to spill
Continue to spill if necessary, till a node can be removed
For each spilled node
For each definition, store the value
For each use, load the value
A register is still needed to hold the loaded value
Naive approach
Always keep extra registers for shuffling data in and out
What a waste!!!
Rewrite code
Use a new temporary variable for each load; it has a very short
live range and is likely to have very few edges in the interference graph
Redo liveness analysis and register allocation
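The rewrite step can be sketched as below. The statement encoding (a destination plus a token list) and the temporary names t1, t2, ... are assumptions for the example; a real implementation would have to make sure the new names do not clash with existing ones.

```python
def rewrite_spill(code, var):
    """Rewrite code so that the spilled variable var lives in memory.

    code is a list of (dest, tokens). Every use of var is preceded by a
    load into a fresh temporary; every definition of var goes through a
    fresh temporary that is stored right away.
    """
    out, n = [], 0
    for dest, tokens in code:
        if var in tokens:
            n += 1
            tmp = f"t{n}"
            out.append((tmp, ["load", var]))
            tokens = [tmp if t == var else t for t in tokens]
        if dest == var:
            n += 1
            tmp = f"t{n}"
            out.append((tmp, tokens))
            out.append((None, ["store", var, tmp]))
        else:
            out.append((dest, tokens))
    return out


block = [
    ("a", ["b", "+", "c"]),
    ("d", ["-", "a"]),
    ("e", ["d", "+", "f"]),
    ("b", ["f", "+", "c"]),
]
for dest, tokens in rewrite_spill(block, "c"):
    print(dest, ":=", " ".join(tokens))
```

After such a rewrite, liveness analysis and coloring are run again on the new code, since the temporaries change the interference graph.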
Spill
Consider the example we gave earlier
Cannot find a way to color with k = 3, spill
After removing a, no node has < 3 edges
Choose to spill c
Remove c
Once c is spilled, coloring can be done
Nothing else to spill
[Figure: interference graph over a, b, c, d, e, f with c removed]
Spill
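After spill, rewrite the code, redo the allocation
store c
Each use of c is replaced by a load into a new temporary:
a := b+c becomes t1 := load c; a := b+t1
b := f+c becomes t2 := load c; b := f+t2
Rewritten top block, with the live set before and after each statement:
{b,f}
t1 := load c
{b,f,t1}
a := b+t1
{a,f}
d := –a
{d,f}
e := d+f
{d,e,f}
Other statements, with the live sets before and after:
{e} f := 2*e {f}
{f} t2 := load c {f,t2}
{f,t2} b := f+t2
{d,e,f} b := d+e {b,e,f}
{b,e,f} e := e–1 {b,f}
print(b)
[Figure: the rewritten CFG and the new interference graph over a, b, d, e, f, t1, t2]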
Spill
Redo the coloring for the new interference graph
Consider k=3, the graph can be colored!!!
When generating code
Load c of the top block to t1’s register
Load c of the bottom block to t2’s register
[Figure: the new interference graph over a, b, d, e, f, t1, t2, colored with 3 colors]
Pre-Color
Some variables are pre-assigned to registers
E.g., in C, it is possible to define register variables
register int i
Handle pre-assigned registers
If the system has k registers available, and x variables are preassigned, then only use k–x registers for other variables
But this is wasteful, these registers can be reused
Perform coloring on IG
Still using k colors
Pre-color the variables that are pre-assigned to registers
In the simplification phase, these nodes cannot be removed
The simplification phase terminates when only pre-colored nodes are left
In the selection phase, do not change the colors of pre-colored nodes
Pre-Color
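Assume that a is pre-assigned to a register
Consider K=4
Pre-color a
Terminate when only a is left
Order in which the other nodes are removed: b d e c f
[Figure: interference graph over a, b, c, d, e, f with a pre-colored]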
Coalescing
When no way is found to color with k colors
Try coalescing before trying spill!!!
When there are copy statements, x := y, coalesce x and y
Assign x and y to the same register
Advantage
Reduce the unneeded copying
Save a register
Requirement
Assume no dead code
x and y are not interfering, i.e., not connected in IG
Coalescing
Example
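Statement          Live variables after the statement
(entry)            j, k
g := M[j+12]       g, j, k
h := k – 1         g, h, j
f := g + h         f, j
e := M[j+8]        e, f, j
m := j + f         e, j, m
b := M[j]          b, e, m
c := e + 8         b, c, m
d := c             b, d, m
k := m + 4         b, d, k
j := b             d, j, k (out)
[Figure: interference graph over g, h, f, e, m, b, c, d, k, j; the copy statements d := c and j := b are drawn as copy links c-d and b-j]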
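The effect of coalescing on this example can be checked with a small sketch. The graph is built from the live sets listed for the example; the helper names and the naming of merged nodes ("c,d", "b,j") are assumptions for illustration.

```python
from itertools import combinations


def build_graph(live_sets):
    g = {}
    for live in live_sets:
        for v in live:
            g.setdefault(v, set())
        for x, y in combinations(live, 2):
            g[x].add(y)
            g[y].add(x)
    return g


def coalesce(g, x, y):
    """Merge the non-interfering nodes x and y into one node 'x,y'."""
    assert y not in g[x], "interfering nodes cannot be coalesced"
    merged = f"{x},{y}"
    new = {v: {merged if n in (x, y) else n for n in ns}
           for v, ns in g.items() if v not in (x, y)}
    new[merged] = g[x] | g[y]
    return new


def simplify(g, k):
    """Return (removal stack, nodes left when no node has < k edges)."""
    g = {v: set(ns) for v, ns in g.items()}
    stack = []
    while g:
        low = sorted(v for v in g if len(g[v]) < k)
        if not low:
            break
        v = low[0]
        stack.append(v)
        for n in g.pop(v):
            g[n].discard(v)
    return stack, sorted(g)


live = ["jk", "gjk", "ghj", "fj", "efj", "ejm",
        "bem", "bcm", "bdm", "bdk", "djk"]
g = build_graph([set(s) for s in live])
print(simplify(g, 3))          # gets stuck: some nodes are left over
g2 = coalesce(coalesce(g, "c", "d"), "b", "j")
print(simplify(g2, 3))         # every node can be removed
```

With 3 colors the original graph gets stuck, while the graph with c,d and b,j merged simplifies completely, in line with the slides that follow.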
Coalescing
Coloring with 3 colors
Simplification removes h, g, f, c and then gets stuck
Need to spill
[Figure: interference graph over g, f, j, h, e, b, k, m, d, c]
Coalescing
Coloring with 4 colors
Cannot color with 3 colors
Need to use 4 colors (4 registers)
[Figure: interference graph and the simplification stack for 4 colors, which starts with h g f c]
Coalescing
[Figure: the example code and its interference graph over g, f, j, h, e, b, k, m, d, c before coalescing]
Coalescing
Coalescing c,d (non-interfering)
[Figure: interference graph with c and d merged into one node c,d]
Coalescing
Coalescing b,j (non-interfering)
[Figure: interference graph with b and j also merged into one node b,j]
Coalescing
Coloring with 3 colors
Simplification
Order in which the nodes are removed: h g f k c,d b,j e m
[Figure: interference graph over g, h, f, e, k, m, c,d and b,j]
Coalescing
Coloring with 3 colors
Selection
Nodes removed in the order h g f k c,d b,j e m are colored in reverse order with r1, r2, r3
[Figure: the colored interference graph]
Coalescing
Coloring with 3 colors
Original code        Code after register allocation
{in: j, k}           r3 := load j
                     r1 := load k
g := M[j+12]         r2 := M[r3+12]
h := k – 1           r1 := r1 – 1
f := g + h           r1 := r2 + r1
e := M[j+8]          r2 := M[r3+8]
m := j + f           r1 := r3 + r1
b := M[j]            r3 := M[r3]
c := e + 8           r2 := r2 + 8
d := c               r2 := r2
k := m + 4           r1 := r1 + 4
j := b               r3 := r3
{out: d,j,k}         store r2, d
                     store r3, j
                     store r1, k
Register assignment: r1 holds h, f, m, k; r2 holds g, e, c,d; r3 holds b,j
Coalescing saved two copy statements (r2 := r2 and r3 := r3 can be dropped) and 1 register
Coalescing
Another example
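Statement          Live variables after the statement
(entry)            a, c
v := a + c         a, c, v
t := v * c         a, c, t
v := t * a         c, v
b := v             b, c
t := M[b]          b, c, t
u := b + c         t, u
w := t * u         w (out)
[Figure: interference graph over a, b, c, t, u, v, w]
Try to color with 3 colors
Order in which the nodes are removed: u v b a c t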
Coalescing
Coalescing b,v (non-interfering)
[Figure: interference graph over b, t, c, u, a, v before b and v are merged]
Coalescing
Coalescing b,v
Coloring with 3 colors
After removing u, no node has degree < 3
Need to spill!
Coalescing increases the degree of the coalesced node and
makes the graph irreducible!
Coalesce when
Only coalesce when the degrees of the nodes are not increased
But sometimes, coalescing may increase the degree of some nodes
and still end up saving registers!
[Figure: interference graph over the merged node b,v and t, c, u, a]
Global Register Allocation
Code generation
For each statement
Replace variables by registers
If a variable comes from outside (it is live on entry), it should be loaded into its register
first
For the spilled variables
Load them into reserved registers if the rewrite code approach is not used
Store the live variables
No need to store temporary variables
Variables that are live after the CFG should be stored to memory
DAG Based Instruction Scheduling
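Dag
Used for subexpression elimination
Used to eliminate duplicate variables
Example code:
a := b – c
b := a + d
d := b – c
a := a * d
b := b – c
e := a * b
[Figure: dag of the example code]
Dag nodes (a subscript counts the definitions of a variable, 0 is the initial value):
a1 = b0 – c0
b1 = a1 + d0
d1, b2 = b1 – c0
a2 = a1 * d1
e1 = a2 * d1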
DAG Based Instruction Scheduling
How to generate code for dags
Need to determine the schedule for executing the instructions
Need to determine the register allocation
Global register allocation
Minimum register instruction sequence
Based on the results, generate code
DAG Based Instruction Scheduling
Algorithm for ordering nodes in a dag
Start from the root
Assign a node x an order number, if
All x’s parents already has a number
After x obtained a number
Try to assign numbers for x’s children
If x’s child y cannot be assigned an order number: no problem
o y has at least one parent without an assigned number, when its parent has
the number, y will be examined for number assignment
Since the dag is acyclic, all nodes will obtain a number and the order
is correct
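The ordering algorithm can be sketched as a recursive walk. The dag encoding (a dict from node to its operand tuple) is an assumption for the example, and so is the choice to visit children from right to left; the slides do not fix a child order, but this choice reproduces the numbers shown in the example that follows.

```python
def order_dag(children, root):
    """Number the nodes; a node gets a number once all its parents have one."""
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)
    number = {}

    def visit(node):
        if node in number:
            return
        if any(p not in number for p in parents.get(node, ())):
            return        # will be retried when the last parent is numbered
        number[node] = len(number) + 1
        for kid in reversed(children.get(node, ())):
            visit(kid)

    visit(root)
    return number


# Dag of: a1 = b0 - c0, b1 = a1 + d0, d1 = b1 - c0, a2 = a1 * d1, e1 = a2 * d1
dag = {
    "e1": ("a2", "d1"),
    "a2": ("a1", "d1"),
    "d1": ("b1", "c0"),
    "b1": ("a1", "d0"),
    "a1": ("b0", "c0"),
}
num = order_dag(dag, "e1")
print(sorted(num, key=num.get))
# ['e1', 'a2', 'd1', 'b1', 'd0', 'a1', 'c0', 'b0']
```

Instructions are then executed in the reverse of this numbering, so operands are always computed before the nodes that use them.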
DAG Based Instruction Scheduling
Ordering nodes in a dag – Example
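Order numbers assigned to the dag nodes:
e1 = 1, a2 = 2, d1 (b2) = 3, b1 = 4, d0 = 5, a1 = 6, c0 = 7, b0 = 8
[Figure: the dag with the order number next to each node]
Rewrite the code based on the dag
load b
load c
a1 := b – c
load d
b1 := a1 + d
d1 := b1 – c
a2 := a1 * d1
e1 := a2 * d1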
DAG Based Instruction Scheduling
Register allocation
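Statement          Live variables after the statement
load b, c          b, c
a1 := b – c        a1, c
load d             a1, c, d
b1 := a1 + d       a1, b1, c
d1 := b1 – c       a1, d1
a2 := a1 * d1      a2, d1
e1 := a2 * d1      e1
…
use e
Need 3 registers
Register assignment from coloring: r1 holds a1, a2, e1; r2 holds c; r3 holds b, d, b1, d1
[Figure: interference graph of the variables with the assigned registers]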
DAG Based Instruction Scheduling
Generate code based on
The instruction sequence (order numbers 8, 7, 6, 5, 4, 3, 2, 1)
The register allocation
(8) load r3 b
(7) load r2 c
(6) sub r1 r3 r2
(5) load r3 d
(4) add r3 r1 r3
(3) sub r3 r3 r2
(2) mul r1 r1 r3
(1) mul r1 r1 r3
store e, r1
[Figure: the numbered dag next to the colored interference graph]
Tree Based Register Allocation
Original code
a := b – c
b := a + d
d := b – c
a := a * d
b := b – c
e := a * b
[Figure: the dag expanded into a tree by duplicating the shared nodes a1, b1, d1 and the leaves b0, c0, d0; the labels give a register requirement of 3 at the root]
Does not really work
Simply Global Register Allocation
Original code
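Statement          Live variables after the statement
(entry)            b, c, d
a := b – c         a, c, d
b := a + d         a, b, c
d := b – c         a, b, c, d
a := a * d         a, b, c
b := b – c         a, b
e := a * b         e
[Figure: interference graph over a, b, c, d, e with the assigned registers]
Need to use 4 registers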
Minimum Register Instruction Sequence
Derive an instruction sequence so that its register
requirement is minimum
Instructions with no data dependency can be rearranged
But the MRIS problem is NP-hard, so heuristic algorithms are needed
Original schedule
(a) t1 := load x
(b) t2 := t1 + 4
(c) t3 := t1 * 8
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(f) t6 := t2 * t3
(g) t7 := t4 - t5
(h) t8 := t6 * t7
(i) store t8, y
Need 4 registers
Properly rescheduled
(a) t1 := load x
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(g) t7 := t4 - t5
(c) t3 := t1 * 8
(b) t2 := t1 + 4
(f) t6 := t2 * t3
(h) t8 := t6 * t7
(i) store t8, y
Only need 3 registers
[Figure: the live range of each of t1 to t8 drawn next to both schedules]
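The register counts of the two schedules can be reproduced with a short sketch. It assumes that each instruction is given as (destination, list of temporaries it reads), that a destination may reuse the register of an operand that dies in the same instruction, and that the register need is the largest number of values live after any instruction.

```python
def max_live(schedule):
    """Largest number of simultaneously live temporaries in a schedule."""
    last_use = {}
    for i, (_, srcs) in enumerate(schedule):
        for s in srcs:
            last_use[s] = i
    live, peak = set(), 0
    for i, (dest, srcs) in enumerate(schedule):
        live -= {s for s in srcs if last_use[s] == i}   # operands that die here
        if dest is not None and dest in last_use:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


instr = {
    "a": ("t1", []),                 # t1 := load x
    "b": ("t2", ["t1"]),             # t2 := t1 + 4
    "c": ("t3", ["t1"]),             # t3 := t1 * 8
    "d": ("t4", ["t1"]),             # t4 := t1 - 4
    "e": ("t5", ["t1"]),             # t5 := t1 / 2
    "f": ("t6", ["t2", "t3"]),       # t6 := t2 * t3
    "g": ("t7", ["t4", "t5"]),       # t7 := t4 - t5
    "h": ("t8", ["t6", "t7"]),       # t8 := t6 * t7
    "i": (None, ["t8"]),             # store t8, y
}
original = [instr[k] for k in "abcdefghi"]
rescheduled = [instr[k] for k in "adegcbfhi"]
print(max_live(original), max_live(rescheduled))  # 4 3
```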
Minimum Register Instruction Sequence
Consider dag based MRIS
Construct dag for the example code
(a) t1 := load x
(b) t2 := t1 + 4
(c) t3 := t1 * 8
(d) t4 := t1 - 4
(e) t5 := t1 / 2
(f) t6 := t2 * t3
(g) t7 := t4 - t5
(h) t8 := t6 * t7
(i) store t8, y
[Figure: dag of the example code with root t8 (y), inner nodes t7, t6, t5, t4, t3, t2, the shared node t1 (x), and the constants 2, 4, 8]
How to determine
Register allocation
Execution sequencing
Minimum Register Instruction Sequence
Principles
Nodes on a single path can share the same register
E.g., t1, t2, t3 can share one register
o r0 := r0 op x, r0 := r0 op y
[Figure: a dag in which t1, t2, t3 lie on one path with operands x, y, z, and s1 is another parent of t1]
New scheduling constraints
Sharing a register may introduce new dependencies (execution orders)
in the dag
E.g., if t2 reuses t1’s register, then all other operations that use t1
should be done first
o E.g., s1 uses t1, so s1 should be evaluated before t2 overwrites t1’s register
o Otherwise, the value in t1’s register is changed by t2 and t1 is lost
This only happens for nodes with multiple parents (not for trees)
A constant does not need a register
Minimum Register Instruction Sequence
Path formation
Number the nodes
Number root nodes as 0
The child node number = max (parent node number) + 1
Find all paths till all nodes are covered
Start from a largest node that is not covered
o Keep going till reaching the end or a covered node
o A node that is already in a path is marked as covered
For an already covered node, still include, but use ( ) to mark it
- Because the register can only get released after ( ) is evaluated
- Need it for interference analysis
If the node has multiple parents, choose the smallest parent node
o Need to also create scheduling constraints
Minimum Register Instruction Sequence
MRIS Algorithm
Number the nodes
Find all paths
Largest uncovered node: t1
t1 has multiple parents of the same number, choose t2
P1 = [t1, t2, t6, t8, ()]
t1 has multiple parents and t1, t2 share registers
t3, t4, t5 have to be evaluated before t2
Add scheduling constraints
Redo numbering
Largest uncovered node: t3, t4, t5, choose t3
P2 = [t3, (t6)]
Largest uncovered node: t4, t5, choose t4
P3 = [t4, t7, (t8)]
Largest uncovered node: t5
P4 = [t5, (t7)]
Largest uncovered node: none
[Figure: the dag with the node numbers before and after renumbering]
Minimum Register Instruction Sequence
Interferences
Live range of a variable x
o The duration for which x is live
o Can be derived from liveness analysis
[Figure: the path t1, t2, t3 with operands x, y, z and the extra parent s1, with the live range marked]
When a path shares a register r, the live range for r spans the life of
the entire path
o r0 := r0 op x, r0 := r0 op y: r0 is live from the evaluation of t1 to t3
If two variables have overlapping live ranges, they cannot share a register
Before instructions are scheduled, the live ranges of variables are
not determined
But dependencies put constraints on live ranges
Find out live range constraints
Minimum Register Instruction Sequence
Interferences
Theorem 1
Two paths: P = [u1, u2, …, um], Q = [v1, v2, …, vn]
If u1 can reach vn
o Need to evaluate u1 before vn
and v1 can reach um
o Need to evaluate v1 before um
then the live ranges of P and Q have to overlap
o They cannot use the same register
[Figure: two paths u1, …, u5 and v1, …, v3 with cross edges between them]
Use this relation to construct IG
Not for the nodes but for the paths
Essentially, for the registers (path = register)
Edges due to scheduling constraints
They enforce execution order, so their impact on live ranges is the
same as that of the other edges
Minimum Register Instruction Sequence
Dag based MRIS
P1 = [t1, t2, t6, t8, ()]
P2 = [t3, (t6)]
P3 = [t4, t7, (t8)]
P4 = [t5, (t7)]
Interferences based on Theorem 1
P1 interferes with all other paths
o t1 can reach all nodes
o All nodes can reach t8
P2 interferes with P3
o t3 can reach t8, t4 can reach t6
P2 does not interfere with P4
o t5 can reach t6, but t3 does not reach t7
P3 interferes with P4
o t4 can reach t7, t5 can reach t8
[Figure: the dag and the path interference graph over P1, P2, P3, P4; every pair is connected except P2 and P4]
Minimum Register Instruction Sequence
How to schedule the paths
The approach so far only found the potential number of registers
Still do not know how to schedule
Path fusing
A register can only be released when an entire path is done
Done means the last node in () is evaluated
() is in another path
The result from the path can be stored in the register of another path
Check path pairs (x,y) to “fuse”
If x can execute till completion before y starts, we say (x,y) can fuse
Then x can release the register to y after it completes
Which pairs to consider?
Eliminate impossible pairs
o All the interfering pairs (due to any of the reasons) cannot fuse
Minimum Register Instruction Sequence
Dag based MRIS
P1 = [t1, t2, t6, t8, ()]
P2 = [t3, (t6)]
P3 = [t4, t7, (t8)]
P4 = [t5, (t7)]
Find (x, y) to fuse
Only P2 and P4 do not interfere
Try (P2, P4)
o t6 > t2 and t2 > t4
o Need to start P4 before evaluating t2 and t6
o Not possible
Try (P4, P2)
o P4 can complete without starting P2
- Completing P4 just needs P3 to be executed partially
o Succeeded
o Let P2 and P4 share a register
o Add the scheduling constraint t7 < t3
[Figure: the dag and the path interference graph over P1, P2, P3, P4]
Minimum Register Instruction Sequence
Dag based MRIS
Assign register
Assign according to the path
P1, P3 each gets one register
P4 and P2 share one register
Do not color the node in ()
Find execution order for the dag
Dag node ordering
Code generation
r1 := load x
r3 := r1 / 2
r2 := r1 – 4
r2 := r2 * r3
r3 := r1 * 8
r1 := r1 + 4
r1 := r1 * r3
r1 := r1 * r2
store y r1
In this code P1 uses r1, P3 uses r2, and P4 and P2 share r3
[Figure: the dag annotated with node order numbers and the registers r1, r2, r3]
Instruction Selection
Some hardware provides a rich set of instructions
Usually not RISC processors!
There are multiple ways to translate a set of instructions
Example instruction set
load r1, r2          load M[r2] to r1
store r1, r2         store r2 to M[r1]
add r1, r2           r1 := r1 + r2
addc r1, c           r1 := r1 + c, where c is a constant
mul r1, r2           r1 := r1 * r2
mulc r1, c           r1 := r1 * c, where c is a constant
movem r1, r2         M[r1] := M[r2]
movex r1, r2, r3     M[r1] := M[r2+r3]
Instruction Selection
Example program
A[i+1] := B[j]
Assume: register ra holds address of A
Assume: register rb holds address of B
Assume: register ri holds value of i
Assume: register rj holds value of j
Intermediate code
t1 := j * 4
t2 := B + t1
t3 := M[t2]
t4 := i + 1
t5 := t4 * 4
t6 := A + t5
M[t6] := t3
Translation 1 (load and store)
mulc rj, 4
add rb, rj
load r1, rb
addc ri, 1
mulc ri, 4
add ra, ri
store ra, r1
Translation 2 (movem)
mulc rj, 4
add rb, rj
addc ri, 1
mulc ri, 4
add ra, ri
movem ra, rb
Translation 3 (movex)
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj
Instruction Selection
Each instruction may have different cost
Time cost: how fast can the instruction execute
Space cost: how much space the instruction takes
For example
load r1, r2          cost = 3
store r1, r2         cost = 3
add r1, r2           cost = 1
addc r1, c           cost = 1
mul r1, r2           cost = 10
mulc r1, c           cost = 10
movem r1, r2         cost = 4
movex r1, r2, r3     cost = 5
Translation with load and store:
mulc rj, 4
add rb, rj
load r1, rb
addc ri, 1
mulc ri, 4
add ra, ri
store ra, r1
cost = 29
Translation with movem:
mulc rj, 4
add rb, rj
addc ri, 1
mulc ri, 4
add ra, ri
movem ra, rb
cost = 27
Translation with movex:
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj
cost = 27
Goal: find the translations with the minimal cost
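The three totals can be checked with a few lines of Python. The cost table is the one given above; representing a translation as a list of instruction strings is an assumption for the example.

```python
COST = {"load": 3, "store": 3, "add": 1, "addc": 1,
        "mul": 10, "mulc": 10, "movem": 4, "movex": 5}


def cost(seq):
    """Sum the per-instruction costs; the mnemonic is the first word."""
    return sum(COST[ins.split()[0]] for ins in seq)


with_load_store = ["mulc rj, 4", "add rb, rj", "load r1, rb", "addc ri, 1",
                   "mulc ri, 4", "add ra, ri", "store ra, r1"]
with_movem = ["mulc rj, 4", "add rb, rj", "addc ri, 1",
              "mulc ri, 4", "add ra, ri", "movem ra, rb"]
with_movex = ["mulc rj, 4", "addc ri, 1", "mulc ri, 4",
              "add ra, ri", "movex ra, rb, rj"]
print(cost(with_load_store), cost(with_movem), cost(with_movex))  # 29 27 27
```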
Tree Representation
Problem
Some translations may have to take non-consecutive instructions
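Intermediate code:
t1 := j * 4
t2 := b + t1
t3 := M[t2]
t4 := i + 1
t5 := t4 * 4
t6 := a + t5
M[t6] := t3
Translation using movex (movex covers t2, t3 and the final store, which are not adjacent in the intermediate code):
mulc rj, 4
addc ri, 1
mulc ri, 4
add ra, ri
movex ra, rb, rj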
Solution
Use tree-like representation for instruction match
o In tree, these instructions are consecutive
Convert instructions to tiles
Easier to detect the matching instructions
Instruction Selection
Goal
Determine parts of the tree that can match the instruction tiles
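[Figure: the expression tree of A[i+1] := B[j] (a store whose address is a + (i + 1) * 4 and whose value is a load from b + j * 4), next to the instruction tiles load, store, add, addc, mul, mulc, movem, movex drawn as tree patterns with register leaves (R?) and constant leaves (c?)]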
Instruction Selection
Desirable to achieve optimal tiling
Get the instruction set with least cost
Not easy
The maximal munch algorithm (a greedy algorithm)
Start from the tree root and find all matching tiles
Select the one with the maximum number of nodes
o Can consider other criteria that include the cost of the instruction
Go to the children and apply the algorithm recursively
Until the tree is fully covered
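A minimal sketch of maximal munch on the running example. The tree encoding (tuples tagged MOVE, MEM, +, *, CONST, REG), the pattern wildcards "R" (any subtree, computed into a register) and "C" (a constant), and the tile list are assumptions made for illustration; only the tile names come from the example instruction set. Register naming is omitted, so the output is just the sequence of chosen tiles.

```python
TILES = [  # (instruction, tree pattern)
    ("store", ("MOVE", ("MEM", "R"), "R")),
    ("movem", ("MOVE", ("MEM", "R"), ("MEM", "R"))),
    ("movex", ("MOVE", ("MEM", "R"), ("MEM", ("+", "R", "R")))),
    ("load", ("MEM", "R")),
    ("add", ("+", "R", "R")),
    ("addc", ("+", "R", "C")),
    ("mul", ("*", "R", "R")),
    ("mulc", ("*", "R", "C")),
]


def match(pat, tree):
    """Return (nodes covered, subtrees left to tile) or None if no match."""
    if pat == "R":
        return 0, [tree]
    if pat == "C":
        return (1, []) if tree[0] == "CONST" else None
    if pat[0] != tree[0] or len(pat) != len(tree):
        return None
    size, rest = 1, []
    for p, t in zip(pat[1:], tree[1:]):
        m = match(p, t)
        if m is None:
            return None
        size += m[0]
        rest += m[1]
    return size, rest


def munch(tree, out):
    if tree[0] == "REG":
        return                          # already in a register
    best = None
    for name, pat in TILES:
        m = match(pat, tree)
        if m and (best is None or m[0] > best[1]):
            best = (name, m[0], m[1])
    if best is None:
        raise ValueError(f"no tile matches {tree!r}")
    for sub in best[2]:
        munch(sub, out)                 # operands first
    out.append(best[0])


def reg(name):
    return ("REG", name)


a_addr = ("+", reg("ra"), ("*", ("+", reg("ri"), ("CONST", 1)), ("CONST", 4)))
b_addr = ("+", reg("rb"), ("*", reg("rj"), ("CONST", 4)))
out = []
munch(("MOVE", ("MEM", a_addr), ("MEM", b_addr)), out)
print(out)  # ['addc', 'mulc', 'add', 'mulc', 'movex']
```

The largest tile at the root is movex, so the sketch arrives at the same instruction mix as the movex translation shown earlier, only in a different order.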
Instruction Selection
Dynamic programming for tiling
Start from the tree root
For each node
If the best cost for the node has already been computed, then return it
Otherwise: Cost of a node = cost of T + total costs of all children
o T is a matching tile, for different T, the children may be different
o Select the minimum cost among all possible matching T’s
The tiling decision is top down, the cost is computed bottom up
Complexity: O(N*NT)
N: the number of nodes
NT: the maximum number of tiles for a node
The cost of each node only needs to be computed once
o Once computed, just return it
After one round, the costs of the internal nodes of some tiles may not
have been computed; if such a node is selected in another round, its cost is computed then
Instruction Selection
Cost criteria should consider modern architecture
The best instruction selection or the best instruction schedule may not be useful
for some modern architectures
E.g., the best scheduling may reduce register usage, but may not allow
the best pipelining or the best parallel execution
o Pipelining is commonly used in modern processors
o Multi-core processors are becoming a common architecture
For both instruction selection and instruction scheduling
The cost model should take pipelining and/or parallel
execution into account
Peephole Optimization
Performed at the end of code generation
Performed directly on the generated machine code
Only look at a few instructions
Generally no more than 5
Using a sliding window
Eliminating redundant instructions
Code produced by some code generators contains redundancies; even after other
optimization steps, there may still be easy-to-catch redundancies
Algebraic transformation
Strength reductions
Eliminate Redundancies
Unnecessary load-store
Before:
load r0, a
store a, r0
After:
load r0, a
Eliminate jump after jump
Case 1, before:
if (a<b) goto L1
...
L1: goto L2
Case 1, after:
if (a<b) goto L2
...
L1: goto L2
Case 2, before:
goto L1
...
L1: if a < b goto L2
L3:
Case 2, after:
if (a<b) goto L2
goto L3
...
L3:
Case 3, before:
goto L1
...
L1: return
Case 3, after:
return
...
L1: return
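The unnecessary load-store rule can be sketched as a two-instruction sliding window. The instruction encoding (tuples such as ("load", "r0", "a") and ("store", "a", "r0")) is an assumption for the example, and the sketch also assumes that the store is not the target of a jump, since then the preceding load might not have executed.

```python
def remove_redundant_stores(instrs):
    """Drop 'store x, r' when it directly follows 'load r, x'."""
    out = []
    for ins in instrs:
        prev = out[-1] if out else None
        if (prev is not None
                and ins[0] == "store" and prev[0] == "load"
                and ins[1] == prev[2]      # same memory location
                and ins[2] == prev[1]):    # same register
            continue                       # memory already holds this value
        out.append(ins)
    return out


code = [
    ("load", "r0", "a"),
    ("store", "a", "r0"),
    ("add", "r0", "r0", "b"),
    ("store", "a", "r0"),
]
print(remove_redundant_stores(code))
```

Only the first store is removed; the second one follows an add, so the value in r0 has changed and the store is still needed.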
Eliminate Redundancies
Eliminate jump after jump
Source code:
debug = 0
...
if(debug)
{print debugging information}
Generated intermediate code:
debug = 0
...
if debug = 1 goto L1
goto L2
L1: print debugging information
L2:
After optimization:
debug = 0
...
if debug != 1 goto L2
print debugging information
L2:
Strength Reduction
Replace multiplication and division by shift
A := A * 4 becomes A := A << 2
Need to take care of overflow problem (may result in negative number)
A := A / 4 becomes A := A >> 2
Need to shift by replacing msb with sign bit (arithmetic shift)
But right-shift has a famous problem in two’s complement representation (shown here for a shift by 1, i.e., division by 2)
o –5 = 111111...1111111011
o (–5) >> 1 = 111111...1111111101 = –3 (but –5 / 2 = –2 when rounding toward zero)
o –6 = 111111...1111111010
o (–6) >> 1 = 111111...1111111101 = –3 (no problem, correct answer)
o Fix 1: shift bit by bit, add lsb back to the number after shift
o Fix 2: Convert to positive number for shift, then convert to negative
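The problem and Fix 2 can be reproduced in Python, whose >> on negative integers is an arithmetic shift that rounds toward negative infinity. The function name is hypothetical, and Python integers are unbounded, so the overflow issue of fixed-width registers does not show up here.

```python
def div_pow2_trunc(x, k):
    """x / 2**k rounded toward zero, using shifts only (Fix 2)."""
    if x >= 0:
        return x >> k
    return -((-x) >> k)     # shift the positive value, then negate


print(-5 >> 1)                  # -3: plain arithmetic shift rounds down
print(div_pow2_trunc(-5, 1))    # -2: matches -5 / 2 rounded toward zero
print(div_pow2_trunc(-6, 1))    # -3
```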
Code Generation Issues -- Summary
Read Chapter 8
Sections 8.5, 8.7, 8.8, 8.10
Run time storage allocation
Register allocation and instruction scheduling
For basic blocks
Tree based
Dag based
o R. Govindarajan, H. Yang, J. N. Amaral, C. Zhang, G. R. Gao,
“Minimum register instruction sequence problem: Revisiting optimal
code generation for DAGs,” IEEE International Parallel & Distributed
Processing Symposium, 2001
Global register allocation using graph coloring
Peephole optimization