Run Time Optimization
15-745: Optimizing Compilers
Pedro Artigas
Motivation
A good reason
Compiling a language that contains runtime constructs
Java dynamic class loading
Perl or Matlab eval(“statement”)
Faster than interpreting
A better reason
May use program information available only at run time
Example of run-time information
The processor that will be used to run the program
inc ax is faster on a Pentium III
add ax,1 is faster on a Pentium 4
No need to recompile if generating code at run time
The actual program input/run-time behavior
Is my profile information accurate for the current program input? YES!
The life cycle of a program
Compile: one object file, global analysis
Link: one binary, whole-program analysis
Load/Run: one process. Analysis? No, observation!
Larger scope, better information about program behavior
New strategies are possible
Pessimistic versus optimistic approaches
Ex: Does int *a point to the same location as int *b?
Compile time/Pessimistic: Prove that in ANY execution those pointers point to different addresses
Run time/Optimistic: Up to now in the current execution, a and b point to different locations
Assume this holds
If the assumption breaks, invalidate the generated code and generate new code
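A minimal sketch of the optimistic approach, assuming Python closures stand in for generated machine code. The names `generate`, `OptimisticRegion` and `recompilations` are hypothetical and chosen only for this illustration. The region starts with code that hoists a load (legal only when the two arguments do not alias) and regenerates a conservative version the first time the assumption breaks.

```python
def generate(assume_distinct):
    """Stand-in for a run-time code generator: returns code for one assumption."""
    if assume_distinct:
        def code(dst, src):
            first = src[0]                 # hoisted load: legal only if dst is not src
            for i in range(len(dst)):
                dst[i] = first + src[i]
    else:
        def code(dst, src):
            for i in range(len(dst)):
                dst[i] = src[0] + src[i]   # reload src[0]: dst may alias src
    return code


class OptimisticRegion:
    def __init__(self):
        self.assume_distinct = True        # has held "up to now" in this execution
        self.code = generate(assume_distinct=True)
        self.recompilations = 0

    def run(self, dst, src):
        if self.assume_distinct and dst is src:
            # The assumption broke: invalidate the old code and generate new code.
            self.assume_distinct = False
            self.code = generate(assume_distinct=False)
            self.recompilations += 1
        self.code(dst, src)


region = OptimisticRegion()
a, b = [1, 2, 3], [0, 0, 0]
region.run(b, a)   # runs the optimistic version
region.run(a, a)   # aliasing observed: triggers one regeneration
```

The check in `run` is itself a run-time test, so this sketch pays for the guard on every call; a real system would try to make that guard as cheap as possible.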
A sanity check
Using run time information does not require run time code generation
Example: Versioning
if (a!=b) {
<code compiled assuming a!=b>
} else {
<code compiled assuming a==b>
}
ISA may allow cheaper tests
IA-64
Transmeta
Drawbacks
Code generation has to be FAST
Rule of thumb: almost linear in program size
Code quality: Compromise on quality to achieve fast code generation
Shoot for good, not great
Also this usually means:
No time for classical Iterative Data Flow Analysis (IDFA) at run time
No classical IDFA: Solutions
Quasi-Static and/or Staged Compilation
Perform IDFA at compile time
Specialize the dynamic code generator for the obtained information
That is, encode the obtained data flow information in the “binary”
Do not rely on classical IDFA
Use algorithms that do not require it
Ex: Dominator based value numbering (coming up!)
Generate code in a style that does not require it
Ex: One-entry, multiple-exit traces, as in Deco and Dynamo
Code generation Strategies
Compiling a language that requires run-time code generation:
Compile adaptively:
Use a very simple and fast code generation scheme
Re-compile frequently used regions using more advanced techniques
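The adaptive scheme can be sketched as a region that counts its own executions. This is an illustration under stated assumptions: `AdaptiveRegion`, `fast_compile`, `optimizing_compile` and the threshold value are hypothetical names and numbers, and Python callables stand in for generated code.

```python
class AdaptiveRegion:
    """Runs cheaply generated code first, then recompiles once the region is hot."""

    def __init__(self, fast_compile, optimizing_compile, threshold):
        self.code = fast_compile()              # very simple, fast code generation
        self.optimizing_compile = optimizing_compile
        self.threshold = threshold
        self.count = 0
        self.optimized = False

    def __call__(self, *args):
        self.count += 1                          # measure execution frequency
        if not self.optimized and self.count >= self.threshold:
            self.code = self.optimizing_compile()  # re-compile the hot region
            self.optimized = True
        return self.code(*args)


compile_log = []

def fast_compile():
    compile_log.append("fast")
    return lambda n: sum(range(n + 1))           # straightforward but slow code

def optimizing_compile():
    compile_log.append("optimizing")
    return lambda n: n * (n + 1) // 2            # better code, costlier to produce

region = AdaptiveRegion(fast_compile, optimizing_compile, threshold=3)
results = [region(10) for _ in range(5)]
```

Both versions must compute the same result; only the cost of producing and running them differs.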
Adaptive Compilation: Motivation
[Figure: total cost versus execution count. The fast compiler (very simple code generation) has the higher execution cost; the optimizing compiler (elaborate code generation) has the higher compilation cost. Curves: Fast Only, Optimizing Only, 2 level recompilation, Optimal (Oracle). Other labels: Cost, threshold, re-compilation.]
Problem: We may not know in advance how frequently a region will execute
Measure frequencies and re-compile dynamically
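One way to read the trade-off between the fast and the optimizing compiler is a linear cost model. This is an assumption made for illustration, not something the slide states: $C_f, C_o$ are the compilation costs, $e_f, e_o$ the per-execution costs, $n$ the execution count, and $k$ the count at which a two-level scheme recompiles.

```latex
\begin{align*}
T_{\text{fast}}(n) &= C_f + n\,e_f, \\
T_{\text{opt}}(n)  &= C_o + n\,e_o, \qquad C_o > C_f,\; e_o < e_f, \\
T_{\text{opt}}(n) < T_{\text{fast}}(n) &\iff n > n^{*} = \frac{C_o - C_f}{e_f - e_o}, \\
T_{\text{2-level}}(n) &= C_f + k\,e_f + C_o + (n-k)\,e_o \qquad (n \ge k).
\end{align*}
```

Under this model an oracle pays $\min(T_{\text{fast}}(n), T_{\text{opt}}(n))$, while the two-level scheme pays for both compilations once a region turns out to be hot.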
Code generation Strategies
Compiling selected regions that benefit from run-time code generation:
Pick only the regions that should benefit the most
Which regions?
Select them statically
Use profile information
Re-compile (that is, select them dynamically)
Usually all of the above
Code Optimization Unit
What is the run-time unit of optimization?
Option: Procedures/static code regions
Similar to static compilers
Option: Traces
Start at the target of a backward branch
Include all the instructions in a path
May include procedure calls and returns
Branches
Fall through = remain in the trace
Target = exit the trace
[Figure: a control-flow graph with blocks 1 to 4 next to a trace built from those blocks]
Current strategies
                               Static region            Trace
JIT compilers                  Java JITs, Matlab JITs   ?
Run-time performance engines   DyC, Fabius              Dynamo, Deco
Run-Time code generation: Case studies
Two examples of algorithms that are suitable for run-time code generation
Run time CSE/PRE replacement: Dominator based value numbering
Run time Register Allocation: Linear scan register allocation
Sidebar
With traces CSE/PRE become almost trivial
No need for register allocation if optimizing a binary (ex: Dynamo)
[Figure: repeated computations of A+B, labeled PRE and CSE]
Review: Local value numbering
Store expressions already computed (in a hash table)
Store variable name → VN mapping in the VN array
Store VN → variable name mapping in the Name array
Same value number ⇒ same value
for each basic block
  Table.empty()
  for each computed expression (“x=y op z”)
    if V=Table.lookup(“y op z”) //expression was computed in the past, check if result is available
      VN[“x”]=V
      if VN[Name[V]]==V //expression is still there
        replace “x = y op z” with “x = Name[V]”
      else
        Name[V]=“x”
    else //new expression, add to the table
      VN[“x”]=new_value_number()
      Table.insert(“y op z”,VN[“x”])
      Name[VN[“x”]]=“x”
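A runnable sketch of the local algorithm for a single basic block, under a few assumptions that go beyond the slide: instructions are tuples `(dest, op, left, right)`, the hash table is keyed on the value numbers of the operands rather than on the text "y op z" (so a redefined operand cannot cause a false match), and the holder check happens before `VN[dest]` is updated. The function name `local_value_numbering` is hypothetical.

```python
import itertools


def local_value_numbering(block):
    """block: list of (dest, op, left, right). Returns the rewritten block."""
    counter = itertools.count()
    table = {}   # (op, VN[left], VN[right]) -> value number
    vn = {}      # variable name -> value number (the VN array)
    name = {}    # value number -> variable assumed to hold it (the Name array)

    def number_of(var):
        if var not in vn:                 # operand defined outside the block
            vn[var] = next(counter)
            name[vn[var]] = var
        return vn[var]

    out = []
    for dest, op, left, right in block:
        key = (op, number_of(left), number_of(right))
        v = table.get(key)
        if v is None:                     # new expression: add it to the table
            v = next(counter)
            table[key] = v
            vn[dest] = v
            name[v] = dest
            out.append((dest, op, left, right))
        elif vn.get(name[v]) == v:        # computed before and still available
            out.append((dest, "copy", name[v], None))
            vn[dest] = v
        else:                             # holder was overwritten: recompute
            vn[dest] = v
            name[v] = dest
            out.append((dest, op, left, right))
    return out


block = [
    ("t1", "+", "a", "b"),
    ("t2", "+", "a", "b"),    # redundant: becomes a copy of t1
    ("t1", "*", "t2", "c"),   # overwrites t1, the recorded holder of a+b
    ("t3", "+", "a", "b"),    # recomputed: Name tracks one holder only
    ("t4", "+", "a", "b"),    # redundant again: becomes a copy of t3
]
rewritten = local_value_numbering(block)
```

The fourth instruction is recomputed even though `t2` still holds the value, because the Name array records a single holder per value number, as on the slide.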
Local value numbering
Works in linear time in program size
Assuming accesses to the array and the hash table occur in constant time
Can we make it work in a scope larger than a basic block? (Hint: Yes)
What are the potential problems?
Problems
How to propagate the hash table contents across basic blocks?
How to make sure that it is safe to access the location containing the expression in other basic blocks?
How do we make sure that the location containing the expression is fresh?
Remember: no IDFA
Control flow issues
At split points things are simple
Just keep the content of the hash table from the predecessor
What about merge points?
We do not know if the same expression was computed on all incoming paths
We do not want to check that fact anyway (why?)
Reset the hash table to a safe state it had in the past
Which program point in the past?
The immediate dominator of the merge block
Data flow issues
Making sure the def of an expression is fresh and reaches the blocks of interest
How?
By construction! SSA
All names are fresh (Single Assignment)
Every def dominates its uses (regular uses, not φ-function uses)
This holds because, by construction, we introduce new defs using φ-functions at every point where it would not hold
Dominator/SSA based value numbering
DVN(Block B)
  Table.PushScope()
  //First process the φ expressions
  for each φ exp “n=φ(…)”
    if (exp is redundant or meaningless) //meaningless: φ(x0,x0)
      VN[“n”]= Table.lookup(“φ(…)” or “x0”)
      remove(“n=φ(…)”)
    else
      VN[“n”]=“n”
      Table.insert(“φ(…)”,VN[“n”])
  //Then the regular ones
  for each exp “x=y op z”
    if (“v”=Table.lookup(“y op z”))
      VN[“x”]=“v”
      remove(“x=y op z”)
    else
      VN[“x”]=“x”
      Table.insert(“y op z”,VN[“x”])
  //Propagate info about φ inputs
  for each successor s of B
    Adjust the φ inputs
  //and call DVN recursively
  for each dominator tree child c in CFG reverse post-order
    DVN(c)
  Table.PopScope()
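Example
Block 1:
u0=a0+b0
v0=c0+d0
w0=e0+f0
Block 2:
x0=c0+d0
y0=c0+d0
Block 3:
u1=a0+b0
x1=e0+f0
y1=e0+f0
Block 4:
u2=φ(u0,u1)
x2=φ(x0,x1)
y2=φ(y0,y1)
u3=a0+b0
[Figure: the four-block CFG next to a Name/VN table with one row per name: u0, v0, w0, x0, y0, u1, x1, y1, u2, x2, y2, u3]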
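A sketch of DVN run on the four-block example, under stated assumptions: the CFG edges (1→2, 1→3, 2→4, 3→4) and the dominator tree (block 1 is the parent of 2, 3 and 4) are inferred from the φ-functions, since the figure's edges did not survive; the dictionary layout, `ScopedTable` and `dvn` are hypothetical names; φ keys include the block number so that only φ-functions of the same block are compared.

```python
class ScopedTable:
    """Hash table with nested scopes; popping a scope forgets its entries."""

    def __init__(self):
        self.scopes = []

    def push_scope(self):
        self.scopes.append({})

    def pop_scope(self):
        self.scopes.pop()

    def lookup(self, key):
        for scope in reversed(self.scopes):      # innermost scope first
            if key in scope:
                return scope[key]
        return None

    def insert(self, key, value):
        self.scopes[-1][key] = value


def dvn(b, cfg, vn, table):
    block = cfg[b]
    table.push_scope()
    kept = []
    for dest, args in block["phis"]:             # phi expressions first
        key = ("phi", b) + tuple(args)
        if len(set(args)) == 1:                  # meaningless: phi(x0, x0)
            vn[dest] = args[0]
        elif table.lookup(key) is not None:      # redundant phi in this block
            vn[dest] = table.lookup(key)
        else:
            vn[dest] = dest
            table.insert(key, dest)
            kept.append((dest, args))
    block["phis"] = kept
    kept = []
    for dest, op, y, z in block["code"]:         # then the regular expressions
        y, z = vn.get(y, y), vn.get(z, z)        # operands by value number
        found = table.lookup((op, y, z))
        if found is not None:
            vn[dest] = found                     # redundant: instruction removed
        else:
            vn[dest] = dest
            table.insert((op, y, z), dest)
            kept.append((dest, op, y, z))
    block["code"] = kept
    for s in block["succs"]:                     # adjust phi inputs in successors
        slot = cfg[s]["preds"].index(b)
        for _, args in cfg[s]["phis"]:
            args[slot] = vn.get(args[slot], args[slot])
    for c in block["children"]:                  # dominator-tree children, in RPO
        dvn(c, cfg, vn, table)
    table.pop_scope()


cfg = {
    1: {"phis": [], "preds": [], "succs": [2, 3], "children": [2, 3, 4],
        "code": [("u0", "+", "a0", "b0"), ("v0", "+", "c0", "d0"), ("w0", "+", "e0", "f0")]},
    2: {"phis": [], "preds": [1], "succs": [4], "children": [],
        "code": [("x0", "+", "c0", "d0"), ("y0", "+", "c0", "d0")]},
    3: {"phis": [], "preds": [1], "succs": [4], "children": [],
        "code": [("u1", "+", "a0", "b0"), ("x1", "+", "e0", "f0"), ("y1", "+", "e0", "f0")]},
    4: {"phis": [("u2", ["u0", "u1"]), ("x2", ["x0", "x1"]), ("y2", ["y0", "y1"])],
        "preds": [2, 3], "succs": [], "children": [],
        "code": [("u3", "+", "a0", "b0")]},
}
vn = {}
dvn(1, cfg, vn, ScopedTable())
```

With these assumptions, every expression in blocks 2 and 3 is found in block 1's scope, and in block 4 only `x2=φ(v0,w0)` survives: `u2` becomes meaningless, `y2` duplicates `x2`, and `u3` is found in block 1's scope.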
Problems
Does not catch
[Figure: two CFG fragments. First: x0=a0+b0 and x1=a0+b0, merged by x2=φ(x0,x1). Second: x0=a0+b0, x1=φ(x0,x2), x2=a0+b0]
But it performs almost as well as CSE
And runs much faster
Linear time? (YES? NO?)
Homework #4
The DVN algorithm scans the CFG in a similar way as the second phase of SSA translation
SSA translation phase #1: placing φ-functions
SSA translation phase #2: assigning unique numbers to variables
Combine both (DVN and phase #2) and save one pass
Gives us a smaller constant
But, at run time, it pays off!
Run time register allocation
Graph Coloring? Not an option
Even the simple stack based heuristic shown in class is O(n^2)
Not even counting:
Building the graph
Move coalescing optimization
But register allocation is VERY important in terms of performance
Remember, memory is REALLY slow
We need a simple but effective (almost) linear time algorithm
Let’s start simple
Start with a local (basic block) linear time algorithm
Assuming only one def and one use per variable (more constrained than SSA)
Assuming that if a variable is spilled it must remain spilled (Why?)
Can we find an optimum linear time algorithm? (Hint: Yes)
Ideas?
Think about liveness first …
Simple Algorithm: Computing Liveness
One def and one use per variable, only one block
A live range is merely the interval between the def and the use
Live Interval: Interval between the first def and the last use
OBS: Live Range = Live Interval if there is no control flow, only one def and use
We could compute live intervals using a linear scan if we store the def instructions (beginning of the interval) in a hash table
Example
S1: A=1
S2: B=2
S3: C=3
S4: D=A
S5: E=B
S6: use(E)
S7: use(D)
S8: use(C)
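The single-pass interval computation can be sketched on this example. The statement encoding `(defined variable or None, [used variables])` and the function name `live_intervals` are assumptions made for the illustration; statements are numbered from 1 to match S1 to S8.

```python
def live_intervals(block):
    """One linear scan over a basic block; returns {var: (start, end)}."""
    start = {}   # hash table: variable -> position of its def
    end = {}     # variable -> position of its last use seen so far
    for pos, (dest, uses) in enumerate(block, start=1):
        for var in uses:
            end[var] = pos               # a later use extends the interval
        if dest is not None:
            start[dest] = pos            # the def opens the interval
            end.setdefault(dest, pos)    # never used: interval is a single point
    # Only variables defined in this block get an interval.
    return {var: (start[var], end[var]) for var in start}


example = [
    ("A", []), ("B", []), ("C", []),     # S1-S3: A=1, B=2, C=3
    ("D", ["A"]), ("E", ["B"]),          # S4-S5: D=A, E=B
    (None, ["E"]), (None, ["D"]), (None, ["C"]),   # S6-S8: uses
]
intervals = live_intervals(example)
```

For this block the scan gives A the interval (1, 4), B (2, 5), C (3, 8), D (4, 7) and E (5, 6).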
Now Register Allocation
Another linear scan
Keep the active intervals in a list (active)
Assumption: an interval, when spilled, will remain spilled
Two scenarios
#1: |active| < R: No problem
#2: |active| = R: Must spill
Which interval?
Spilling heuristic
Since there is no second chance:
That is, a spilled variable will always remain spilled
Spill the interval that ends last
Intuition: As one spill must occur …
Pick the one that makes the remaining allocation least constrained
That is, the interval that ends last
This is the provably optimum solution (given all the constraints)
Linear Scan Register Allocation
active = {}
freeregs = {all_registers}
for each interval I (in order of increasing start point)
  //Expire old intervals
  for each interval J in active
    if J.end>I.start
      continue
    active.remove(J)
    freeregs.insert(J.register)
  end for each interval J
  if active.length()==R
    //Must spill, pick either the last interval in active or the new interval
    spill_candidate=active.last()
    if (spill_candidate.end>I.end)
      I.register = spill_candidate.register
      spill(spill_candidate)
      active.remove(spill_candidate)
      active.insert_sorted(I) //sorted by end point
    else
      spill(I)
  else
    //No constraints
    I.register = freeregs.pop() //get any register from the free list
    active.insert_sorted(I) //sorted by end point
end for each interval I
Example (R=2)
S1: A=1
S2: B=2
S3: C=3
S4: D=A
S5: E=B
S6: use(E)
S7: use(D)
S8: use(C)
[Figure: live intervals of A, B, C, D and E drawn next to statements S1 to S8]
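A sketch of the linear scan allocator applied to this example with R=2. The `Interval` class, the register names `r0` and `r1`, and the interval end points (taken as the statement numbers of each def and last use) are assumptions made for the illustration; at least one register is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)                     # compare intervals by identity
class Interval:
    name: str
    start: int
    end: int
    register: Optional[str] = None
    spilled: bool = False


def linear_scan(intervals, registers):
    free = list(registers)
    active = []                          # kept sorted by increasing end point
    for cur in sorted(intervals, key=lambda i: i.start):
        for old in list(active):         # expire old intervals
            if old.end > cur.start:
                continue
            active.remove(old)
            free.append(old.register)
        if len(active) == len(registers):
            candidate = active[-1]       # the active interval that ends last
            if candidate.end > cur.end:  # spill it and take over its register
                cur.register = candidate.register
                candidate.register = None
                candidate.spilled = True
                active.remove(candidate)
                active.append(cur)
                active.sort(key=lambda i: i.end)
            else:
                cur.spilled = True       # the new interval ends last: spill it
        else:                            # no constraints: any free register
            cur.register = free.pop()
            active.append(cur)
            active.sort(key=lambda i: i.end)
    return intervals


example = [
    Interval("A", 1, 4), Interval("B", 2, 5), Interval("C", 3, 8),
    Interval("D", 4, 7), Interval("E", 5, 6),
]
linear_scan(example, ["r0", "r1"])
by_name = {i.name: i for i in example}
```

In this run C is the only spilled interval: when C starts, A and B are active and C ends later than both. D then reuses A's register and E reuses B's.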
Is the second pass really linear?
Invariant: active.length()<=R
Complexity O(R*n)
R is usually a small constant (128 at most)
Therefore: O(n)
And we are done! Right?
YES and NO
Use the same algorithm as before for register assignment
Program representation: Linear list of instructions
Live intervals are not precise anymore given control flow and multiple def/uses
Not optimum, but still FAST
Code quality: within 10% of graph coloring for SPEC95 benchmarks (One problem with this claim)
The Worst problem: Obtaining precise live intervals
How to obtain precise live interval information FAST?
Claim of 10% relies on live interval information obtained using liveness analysis (IDFA)
IDFA is SLOW, O(n^3)
Most recent solutions:
Use the local interval algorithm for variables that live only inside one basic block
Use liveness analysis for more global variables
Alleviates the problem, does not fully solve it
More problems: Live intervals may not be precise
OBS: The idea of lifetime holes leads to allocators that also try to use these holes to assign the same register to other live ranges (bin-packing)
Such an allocator is used in the Alpha family of compilers (GEM compilers)
Other problems: Linearization order
Register allocation quality depends on the chosen block linearization order
Choose a good order in practice
Layout order
Depth first traversal of the CFG
Both only 10% slower than graph coloring
Graph coloring versus Linear scan
[Figure: compilation cost scaling]
Conclusion
Run time code generation provides new optimization opportunities
Challenges
Identify new optimization opportunities
Design new compilation strategies
example: optimistic versus conservative
Design algorithms and implementations that:
minimize run time overhead
do not compromise much on code quality
Recent examples indicate:
extending fast local methods is a promising way to obtain fast run-time code generation