Lengthening Traces to Improve
Opportunities for Dynamic Optimization
Chuck Zhao, Cristiana Amza, Greg Steffan,
University of Toronto
Youfeng Wu
Intel Research
Feb. 16, 2007
Interact-12, HPCA
Intel’s StarDBT Project
StarDBT
A Dynamic Binary Translation framework
Operates on traces, optimizes hot traces
Long-term goal: use StarDBT to allow legacy apps to exploit transactional memory (TM) support
(NOT by automatically parallelizing legacy apps)
Allow speculative sequential optimizations
Use hardware TM’s checkpoint/restore
Problem: default traces are too small
TM overheads would overwhelm benefits
Challenge: lengthening traces can be tricky
Trace Formation
[Figure: basic-block profile vs. trace profile over blocks A to G; the trace profile marks the on-trace blocks and an off-trace stub.]
Control flow that goes off-trace can be costly
Trade-offs when Lengthening Traces
[Figure: example traces built from blocks A, B, D, F, G, with a side-exit ratio of 5% at each side exit. Shorter trace: completion ratio 100% - 10% = 90%; longer trace: 100% - 25% = 75%.]
Completion ratio:
likelihood of execution staying on trace
percentage of executions reaching the trace tail
Tradeoffs:
longer traces have more optimization opportunities
longer traces have more side-exit branches, which lowers the completion ratio
A sweet spot exists in between; can we find it?
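The completion-ratio arithmetic on this slide can be reproduced with a short sketch. It assumes that each side-exit ratio is the share of all trace entries that leave through that exit, which makes the exits disjoint so the completion ratio is 1 minus their sum; the helper names are hypothetical. If instead each ratio were conditional on reaching that branch, the per-branch stay probabilities would multiply, as in the second helper.

```python
import math


def completion_ratio(exit_shares):
    """Share of trace entries that reach the tail.

    Assumes exit_shares[i] is the fraction of *all* trace entries that leave
    through side exit i, so the exits are disjoint and simply subtract.
    """
    total = sum(exit_shares)
    if not 0.0 <= total <= 1.0:
        raise ValueError("side-exit shares must sum to a value in [0, 1]")
    return 1.0 - total


def completion_ratio_conditional(exit_probs):
    """Variant: exit_probs[i] is P(leave at exit i | execution reached exit i)."""
    return math.prod(1.0 - p for p in exit_probs)


print(f"{completion_ratio([0.05] * 2):.0%}")              # 90%  (2 side exits)
print(f"{completion_ratio([0.05] * 5):.0%}")              # 75%  (5 side exits)
print(f"{completion_ratio_conditional([0.05] * 5):.1%}")  # 77.4%
```

Either way, every added side exit can only lower the ratio, which is the tension behind the sweet spot.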
Our Work So Far (i.e., this talk)
1. Lengthening traces while maintaining completion ratios
Through unrolling and straightening
A characterization of the impact on traces
length, completion ratio, unroll factor, …
2. Improving optimization opportunities on longer traces
Improve Local Value Numbering (LVN) hits
Measurement of impact on performance is pending
3. Performing on-the-fly actions by the DBT system
Decisions made by instrumenting/sampling code online
Related Work
Binary Translation Systems
Dynamo
DynamoRIO
PIN
StarDBT
transparent translation
x86 legacy code
Trace Collection and Optimizations
Java JIT
Dynamo, DynamoRIO, Mojo
StarDBT
x86 binary level
MRET2 to improve trace formation
aggressive trace optimizations
First full analysis of trace-lengthening issues for DBT systems
StarDBT Trace Types
[Figure: StarDBT trace types (self type, other trace type, and elsewhere type, which leads to the dispatcher), illustrated with blocks a, b, c, d.]
Lengthening Traces Through Unrolling
[Figure: self-loop trace a with a completion ratio of 90%; unrolled into two copies it completes 81% of the time, into three copies 72.9%.]
Unrolling increases trace’s length, but reduces completion ratio
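Under the simplifying assumption that each unrolled copy of a self-loop trace stays on trace independently with the original completion ratio p, k copies complete with probability p^k. A minimal sketch (the helper name is hypothetical) reproduces the 90%, 81%, 72.9% sequence from the figure:

```python
def unrolled_completion(p: float, k: int) -> float:
    """Completion ratio of a self-loop trace unrolled into k copies.

    Assumes each copy independently stays on trace with probability p,
    so all k copies are completed with probability p**k.
    """
    if k < 1:
        raise ValueError("unroll factor must be at least 1")
    return p ** k


for k in (1, 2, 3):
    print(k, f"{unrolled_completion(0.90, k):.1%}")
# 1 90.0%
# 2 81.0%
# 3 72.9%
```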
Finding the Sweet-Spot Unroll Factor
Unroll factor    Completion ratio
1                p (0.99)
2                p^2 (0.98)
3                p^3 (0.97)
…                …
N (10)           p^10 (0.904)
N + 1 (11)       p^11 (0.895)
...              ...
Given p_orig = 99% and p_target = 90% (p_target chosen by the system designer)
Traces with 100% completion ratio: set N = 10
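The table amounts to picking the largest unroll factor N with p_orig^N >= p_target, i.e. N = floor(ln p_target / ln p_orig). The sketch below searches for that N directly, which sidesteps floating-point rounding in the logarithms; the function name and its `cap` parameter are hypothetical, and the cap is applied only to traces with a 100% completion ratio, as the slide states.

```python
def choose_unroll_factor(p_orig: float, p_target: float, cap: int = 10) -> int:
    """Largest N such that p_orig**N >= p_target (N = 1 means not unrolled).

    Equivalent to floor(ln p_target / ln p_orig) for 0 < p_orig < 1.
    A trace that never side-exits (p_orig == 1.0) would allow unbounded
    unrolling, so it gets the fixed factor `cap` instead.
    """
    if not (0.0 < p_orig <= 1.0 and 0.0 < p_target <= 1.0):
        raise ValueError("completion ratios must lie in (0, 1]")
    if p_orig == 1.0:
        return cap
    n = 1
    while p_orig ** (n + 1) >= p_target:   # one more copy still meets the target
        n += 1
    return n


print(choose_unroll_factor(0.99, 0.90))  # 10  (0.99**10 ~ 0.904, 0.99**11 ~ 0.895)
print(choose_unroll_factor(1.00, 0.90))  # 10  (capped)
print(choose_unroll_factor(0.98, 0.90))  # 5
```

Because the search starts at N = 1, a trace whose original ratio is already below the target is simply left un-unrolled.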
Lengthening Traces Through Straightening
[Figure: straightening example over blocks b, c, d.]
We don’t yet implement/evaluate straightening
Evaluation
Distribution of Original Completion Ratios
[Figure: distribution of original completion ratios of hot traces, in bins from 0-9% to 90-100%, for bzip2, gzip, crafty, parser, vpr, mcf, and the average.]
The majority of hot traces have completion ratios in the 90-100% range
Impact of Unrolling on Hot Trace Size
[Figure: average number of instructions per hot trace, original vs. lengthened at target completion ratios of 98%, 94%, and 90%, for bzip2, gzip, mcf, parser, vpr, crafty, and the average; annotated "36% longer".]
Selected SPEC CPU2000 integer benchmarks with MinneSPEC inputs
Lengthening increases hot trace size by more than 36%
How Much are Traces Unrolled?
[Figure: average unroll factor of hot traces at target completion ratios of 98%, 94%, and 90%, for bzip2, gzip, mcf, parser, vpr, crafty, and the average; averages range from 1.38x to 1.58x, with one case annotated "Not unrolled".]
Hot traces are unrolled on average by 1.38x or more
Average Completion Ratio After Lengthening
[Figure: average completion ratio of hot traces, original vs. lengthened at target completion ratios of 98%, 94%, and 90%, for bzip2, gzip, mcf, parser, vpr, crafty, and the average; the drop is annotated "<0.5%".]
Lengthening traces reduces completion ratio by < 0.5%
Impact of Lengthening on Optimizations
Local Value Numbering (LVN)
No need to build Control Flow Graph (CFG)
Partial info
No need to perform Data Flow Analysis (DFA)
Expensive; relies on a CFG
Can be arranged into a single-pass scan
Ease of implementation
Relatively lightweight algorithm
Performs three optimizations:
Common Subexpression Elimination (CSE)
Copy Propagation (CP)
Dead-Code Elimination (DCE)
LVN is common in JIT optimizers
Ex: LVN On a Lengthened Trace
Original Traces
…
c=a+b
d=a
e=b
f=d+e
d=x
…
Lengthened Trace (value numbers shown as suffixes)
…
c3 = a1 + b2
d1 = a1              DCE hit: dead once f is rewritten, because d is redefined by d4 = x4
e2 = b2
f3 = d1 + e2  →  f3 = c3     CSE hit
d4 = x4
…
Optimized Trace
…
c=a+b
e=b
f=c
d=x
…
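A compact LVN sketch reproduces this example. It is an illustration under stated assumptions, not StarDBT's implementation: instructions are hypothetical side-effect-free three-address tuples (`Instr`), a forward pass numbers values to perform CSE and copy propagation, and a separate backward pass drops assignments overwritten before any use (DCE), given a caller-supplied set of variables assumed live at the trace exit. The slide describes LVN as a single scan; DCE is split out here only for clarity.

```python
from dataclasses import dataclass

COMMUTATIVE = {"+", "*"}


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str        # "copy" means dest = args[0]; otherwise a binary operator such as "+"
    args: tuple

    def __str__(self) -> str:
        if self.op == "copy":
            return f"{self.dest}={self.args[0]}"
        return f"{self.dest}={self.args[0]}{self.op}{self.args[1]}"


def lvn(trace, live_out):
    """Forward LVN pass (CSE + copy propagation), then a backward DCE pass."""
    var_vn = {}    # variable -> value number it currently holds
    expr_vn = {}   # (op, vn, vn) -> value number of that expression
    holders = {}   # value number -> variables currently holding it, oldest first
    stats = {"cse": 0, "dce": 0}
    next_vn = 0

    def assign(var, vn):
        old = var_vn.get(var)
        if old is not None:
            holders[old].remove(var)           # var no longer holds its old value
        var_vn[var] = vn
        holders.setdefault(vn, []).append(var)

    def vn_of(var):
        nonlocal next_vn
        if var not in var_vn:                  # first use: value is live into the trace
            next_vn += 1
            assign(var, next_vn)
        return var_vn[var]

    forward = []
    for ins in trace:
        vns = [vn_of(a) for a in ins.args]
        args = tuple(holders[v][0] for v in vns)   # copy propagation
        if ins.op == "copy":
            forward.append(Instr(ins.dest, "copy", args))
            assign(ins.dest, vns[0])
            continue
        key = (ins.op, *(sorted(vns) if ins.op in COMMUTATIVE else vns))
        vn = expr_vn.get(key)
        if vn is not None and holders.get(vn):     # CSE: value is still available
            stats["cse"] += 1
            forward.append(Instr(ins.dest, "copy", (holders[vn][0],)))
        else:
            if vn is None:
                next_vn += 1
                vn = expr_vn[key] = next_vn
            forward.append(Instr(ins.dest, ins.op, args))
        assign(ins.dest, vn)

    live, kept = set(live_out), []             # DCE: walk backwards tracking liveness
    for ins in reversed(forward):
        if ins.dest not in live:               # overwritten before any later use
            stats["dce"] += 1
            continue
        live.discard(ins.dest)
        live.update(ins.args)
        kept.append(ins)
    return kept[::-1], stats


trace = [
    Instr("c", "+", ("a", "b")),
    Instr("d", "copy", ("a",)),
    Instr("e", "copy", ("b",)),
    Instr("f", "+", ("d", "e")),
    Instr("d", "copy", ("x",)),
]
optimized, stats = lvn(trace, live_out={"c", "d", "e", "f"})
print(" ".join(str(i) for i in optimized))  # c=a+b e=b f=c d=x
print(stats)                                # {'cse': 1, 'dce': 1}
```

If a trace boundary fell just before d=x (an assumption about where the original traces were split), `lvn(trace[:4], ...)` with d live at the exit keeps d=a, so the DCE hit only appears on the lengthened trace.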
LVN Hits Improvement (%)
[Figure: % increase in LVN hits at target completion ratios of 98%, 96%, 94%, and 90%, for bzip2, gzip, parser, vpr, crafty, mcf, and the average.]
10+% more LVN hits are available through lengthening
Ongoing Work
Complete DBT Optimization Framework
Evaluate speculative optimizations on long
hot traces with high completion ratios
Automatically determine optimal
transaction granularity
Use HTM to support trace-based
speculative optimizations
Control Speculation
A Compiler Framework for Speculative Analysis and Optimizations:
Lin et al., PLDI '03
Original code (branch outcomes: 90+% / 10-%):
  cmp
  if (c) {
    ld x = [y]
  }
  …
Control-speculative code:
  ld.s x = [y]
  chk.s x, recovery
next: …
recovery:
  ld x = [y]
  jmp next
Use HTM to Support Trace-based Speculative Optimizations
Original code (branch outcomes: 90+% / 10-%):
  cmp
  if (c) {
    ld x = [y]
    …
  }
Trace as a transaction:
  start_tx
  ld x = [y]
  chk x, abort_tx
  …
  commit_tx
Use longer traces with high completion ratio as tx granularity
HTM hardware support simplifies speculative optimization
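To make the commit/abort semantics concrete, the following Python simulation snapshots the state at start_tx, runs the trace with the load hoisted above the branch, and on a failed check restores the snapshot and re-executes the original code, so a genuine fault surfaces at its original program point. This is not a real HTM interface: `run_as_tx`, `TxAbort`, the "FAULT" marker, and the state layout are all hypothetical, and x is assumed dead when c is false.

```python
import copy


class TxAbort(Exception):
    """Raised when a speculation check fails inside the simulated transaction."""


def run_as_tx(state: dict, speculative, original) -> str:
    """Simulate HTM checkpoint/restore around one trace (not a real HTM API)."""
    checkpoint = copy.deepcopy(state)      # start_tx: checkpoint registers and memory
    try:
        speculative(state)                 # execute the optimized trace
        return "commit"                    # commit_tx: speculative writes become final
    except TxAbort:
        state.clear()
        state.update(checkpoint)           # abort_tx: discard every speculative write
        original(state)                    # re-execute the unoptimized code
        return "abort"


def speculative(state: dict) -> None:
    # ld x = [y] hoisted above the branch; a bad address yields a marker, not a fault
    state["x"] = state["mem"].get(state["y"], "FAULT")
    if state["c"]:
        if state["x"] == "FAULT":          # chk x, abort_tx
            raise TxAbort
        state["z"] = state["x"] + 1


def original(state: dict) -> None:
    if state["c"]:
        state["x"] = state["mem"][state["y"]]  # non-speculative load: may really fault
        state["z"] = state["x"] + 1


ok = {"mem": {100: 41}, "y": 100, "c": True}
print(run_as_tx(ok, speculative, original), ok["z"])   # commit 42

not_taken = {"mem": {}, "y": 7, "c": False}
print(run_as_tx(not_taken, speculative, original))     # commit (bad load never needed)

bad = {"mem": {}, "y": 7, "c": True}
try:
    run_as_tx(bad, speculative, original)
except KeyError:
    print("fault raised by original code; speculative x rolled back:", "x" not in bad)
```

The point of the design is that the speculative trace needs no hand-written recovery block: the hardware checkpoint plays the role that `recovery:` plays in the control-speculation version.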
Conclusion
Traces can be effectively lengthened
trace size increases by 36+%
completion ratio decreases by less than 0.5%
Longer traces provide better opportunities for optimization
LVN hits increase by 10%+
Q+A
Complete StarDBT Optimization Framework
x86 CISC ISA
code patching won’t work
Really need a code generator and IR
Design + implement a low-level Runtime IR
close to hardware
capture + represent all necessary low-level info
easy to convert from/to machine code
easy to implement analysis and optimizations
Starting point
Dynamo IR
LLVM IR
GCC RTL
…
StarDBT Overall Structure
[Figure: StarDBT components: program binary code, DBT run time, front end, code cache, back end, and OS; arrows distinguish control flow from data flow.]
Trace Formation Heuristics
MRET: Most Recent Execution Tail
originally proposed by Dynamo
Trace head
loop head (backward branch target)
sampling counter reaches a certain threshold
Trace tail
satisfies certain trace-tail conditions
MRET2: 2-pass MRET
performs 2 independent MRET trace formations
intersects traces with a common head
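A toy version of these heuristics, run over a recorded sequence of basic-block IDs, is sketched below. Several details are assumptions rather than Dynamo's or StarDBT's actual rules: a branch counts as backward when the next block ID is not greater than the current one, the only tail conditions are a taken backward branch or a length limit, and MRET2's intersection is modeled as the longest common prefix of the first two traces recorded for the same head. Function names are hypothetical.

```python
from collections import defaultdict


def record_mret_traces(stream, threshold, max_len=16):
    """Toy MRET over a sequence of basic-block IDs (hypothetical helper).

    A block is a trace-head candidate when it is the target of a backward
    branch (next ID <= current ID). When a head's counter reaches `threshold`,
    the blocks executed next are recorded as a trace until a backward branch
    is taken or `max_len` blocks are collected; the counter then restarts.
    Returns {head: [trace, trace, ...]} in recording order.
    """
    counters = defaultdict(int)
    traces = defaultdict(list)
    recording = None                       # (head, blocks) while recording
    for prev, cur in zip(stream, stream[1:]):
        backward = cur <= prev
        if recording is not None:
            head, blocks = recording
            if backward or len(blocks) >= max_len:
                traces[head].append(blocks)    # trace tail reached
                recording = None
            else:
                blocks.append(cur)             # follow the most recently executed path
                continue
        if backward:
            counters[cur] += 1
            if counters[cur] >= threshold:
                counters[cur] = 0
                recording = (cur, [cur])
    return dict(traces)


def mret2(stream, threshold, max_len=16):
    """Toy MRET2: keep the common prefix of the first two traces per head."""
    result = {}
    for head, recorded in record_mret_traces(stream, threshold, max_len).items():
        if len(recorded) < 2:
            continue                       # needs two independent recordings
        common = []
        for a, b in zip(recorded[0], recorded[1]):
            if a != b:
                break
            common.append(a)
        result[head] = common
    return result


# Loop 1-2-{3 or 4}-5 that usually goes through block 3.
stream = [1, 2, 3, 5] * 2 + [1, 2, 4, 5] + [1, 2, 3, 5] * 3 + [1]
print(record_mret_traces(stream, threshold=2))  # {1: [[1, 2, 4, 5], [1, 2, 3, 5]]}
print(mret2(stream, threshold=2))               # {1: [1, 2]}
```

In this run, plain MRET would commit to the rare path through block 4 because that happened to execute right after the head became hot; intersecting two recordings keeps only the part both agree on.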
Traces and Hot Traces
Trace
MRET2 recognizes trace heads
Trace tails satisfy certain conditions
Blocks in between become a trace
Hot Trace
Based on recognized Traces
Insert additional software counters
head: head counter
each early-exit branch: off-trace counters
sampling: hot-trace’s completion ratio
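The counter scheme can be sketched as a small per-trace record: the head counter counts entries, one off-trace counter per early-exit branch counts side exits, and the completion ratio is estimated as 1 - (sum of off-trace counts) / (head count). The class and its methods are hypothetical, and for simplicity the sketch counts every execution rather than sampling.

```python
from dataclasses import dataclass, field


@dataclass
class HotTraceCounters:
    """Hypothetical software counters attached to one hot trace."""
    head: int = 0                                   # incremented on every trace entry
    off_trace: dict = field(default_factory=dict)   # early-exit branch -> times taken

    def on_enter(self) -> None:
        self.head += 1

    def on_side_exit(self, branch: str) -> None:
        self.off_trace[branch] = self.off_trace.get(branch, 0) + 1

    def completion_ratio(self) -> float:
        if self.head == 0:
            raise ValueError("trace has not been entered yet")
        return 1.0 - sum(self.off_trace.values()) / self.head


counters = HotTraceCounters()
for i in range(1000):                 # synthetic run: 3% exit at B, 2% exit at D
    counters.on_enter()
    if i % 100 < 3:
        counters.on_side_exit("B")
    elif i % 100 < 5:
        counters.on_side_exit("D")
print(f"{counters.completion_ratio():.2f}")  # 0.95
```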