Adaptive Optimization with
On-Stack Replacement
Stephen J. Fink
IBM T.J. Watson Research Center
Feng Qian (presenter)
Sable Research Group, McGill University
http://www.sable.mcgill.ca
Motivation
Modern VMs use adaptive recompilation strategies
The VM replaces an entry in the dispatching table with newly compiled code
Switching to the new code can only happen at the next invocation
On-stack replacement (OSR) allows the transformation to happen in the middle of a method's execution
What is On-stack Replacement?
Transfer execution from compiled code m1 to
compiled code m2 even while m1 runs on some
thread’s stack
[Figure: a thread's stack before and after OSR; the m1 frame, with its PC in m1, is replaced by an m2 frame with its PC in m2]
Why On-Stack Replacement (OSR)?
Debugging optimized code via dynamic deoptimization [SELF-93]
Deferred compilation of cold paths in a method
[SELF-91, HotSpot, Whaley 2001]
Promotion of long-running activations [SELF-93]
Safe invalidation for speculative optimization
[HotSpot, SELF-91]
Related Work
Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94]
HotSpot server compiler [JVM’01]
Partial method compilation [OOPSLA’01]
OSR Challenges
Engineering Complexity
How to minimize disruption to VM code base?
How to constrain optimizations?
Policies for applying OSR
How to make rational decisions for applying OSR?
Effectiveness
How does OSR improve/constrain dataflow
optimizations?
How effective are online OSR-based optimizations?
Outline
Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion
OSR Mechanism Overview
1) Extract compiler-independent state from a suspended activation of m1
2) Generate specialized code m2 for the suspended activation
3) Compile m2 and transfer execution to the new code
[Figure: (1) the m1 frame and PC are extracted into compiler-independent state, (2) that state drives generation of specialized code m2, (3) the m1 frame is replaced by an m2 frame and execution resumes in m2]
JVM Scope Descriptor
Compiler-independent state of a running activation
Based on Java Virtual Machine Architecture
Five components:
1) Thread running the activation
2) Reference to the activation's stack frame
3) Program Counter (as a bytecode index)
4) Value of each local variable
5) Value of each stack location
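A minimal sketch of such a descriptor in Java; the class and field names are illustrative, not the actual Jikes RVM types:

    // Illustrative only: one compiler-independent snapshot of an activation.
    class JvmScopeDescriptor {
        Thread thread;        // 1) thread running the activation
        long framePointer;    // 2) reference to the activation's stack frame
        int bytecodeIndex;    // 3) program counter, as a bytecode index
        Object[] locals;      // 4) value of each local variable
        Object[] stack;       // 5) value of each operand stack location
    }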
JVM Scope Descriptor Example
class C {
  static int sum(int c) {
    int y = 0;
    for (int i = 0; i < c; i++) {
      y += i;
    }
    return y;
  }
}

Suspend sum(100) after 50 loop iterations (i = 50)

Bytecode:
 0 iconst_0
 1 istore_1
 2 iconst_0
 3 istore_2
 4 goto 14
 7 iload_1
 8 iload_2
 9 iadd
10 istore_1
11 iinc 2 1
14 iload_2
15 iload_0
16 if_icmplt 7
19 iload_1
20 ireturn

JVM Scope Descriptor:
  Running thread: MainThread
  Frame Pointer: 0xSomeAddress
  Program Counter: 16
  Local variables:
    L0 (c) = 100
    L1 (y) = 1225
    L2 (i) = 50
  Stack Expressions:
    S0 = 50
    S1 = 100
Extracting JVM Scope Descriptor
Trivial from the interpreter
Optimizing compiler:
  Insert OSR point (safe-point) instructions in the initial IR
  An OSR point uses the stack and local state needed to recover the scope descriptor
  An OSR point is treated as a call; it transfers control to the exit block
  Aggregate OSR points into an OSR map when generating machine instructions
[Figure: step 1, the m1 frame and PC are extracted into compiler-independent state]
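A hedged sketch of an OSR point as an IR instruction; these classes are illustrative, not the actual Jikes RVM optimizing-compiler IR:

    // Illustrative IR classes, not the real Jikes RVM IR.
    abstract class IrInstruction { }

    class IrOperand {
        final String name;
        IrOperand(String name) { this.name = name; }
    }

    class OsrPointInstruction extends IrInstruction {
        final int bytecodeIndex;       // where execution would resume
        final IrOperand[] liveLocals;  // locals needed for the scope descriptor
        final IrOperand[] liveStack;   // operand-stack values needed likewise

        OsrPointInstruction(int bci, IrOperand[] locals, IrOperand[] stack) {
            bytecodeIndex = bci;
            liveLocals = locals;
            liveStack = stack;
        }
        // Modeled as a call: it "uses" every listed operand, so the
        // optimizer keeps those values live, and control is treated as
        // transferring to the method's exit block.
    }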
Specialized Code Generation
Prepend a specialized prologue to the original bytecode
The prologue will:
• Save JVM Scope Descriptor values into local variables
• Push JVM Scope Descriptor values onto the stack
• Jump to the desired program counter
[Figure: step 2, the compiler-independent state drives generation of specialized code m2]
Transition Example
JVM Scope Descriptor:
  Running thread: MainThread
  Frame Pointer: 0xSomeAddress
  Program Counter: 16
  Local variables:
    L0 (c) = 100; L1 (y) = 1225; L2 (i) = 50
  Stack Expressions:
    S0 = 50; S1 = 100

Original bytecode:
 0 iconst_0
 1 istore_1
 2 iconst_0
 3 istore_2
 4 goto 14
 7 iload_1
 8 iload_2
 9 iadd
10 istore_1
11 iinc 2 1
14 iload_2
15 iload_0
16 if_icmplt 7
19 iload_1
20 ireturn

Specialized bytecode:
   ldc 100        // restore L0 (c)
   istore_0
   ldc 1225       // restore L1 (y)
   istore_1
   ldc 50         // restore L2 (i)
   istore_2
   ldc 50         // push S0
   ldc 100        // push S1
   goto 16        // jump to the saved program counter
 0 iconst_0
   ...
16 if_icmplt 7
   ...
20 ireturn
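A hedged sketch of how a VM might emit such a prologue from a scope descriptor, using the JvmScopeDescriptor sketch from earlier; BytecodeBuilder and its methods are hypothetical helpers, not a real bytecode-engineering API:

    // Hypothetical helper interface; not a real library API.
    interface BytecodeBuilder {
        void ldc(Object constant);        // load a literal
        void istore(int localIndex);      // store into a local (ints shown)
        void gotoBytecodeIndex(int bci);  // branch to a bytecode index
    }

    class PrologueEmitter {
        // Emit the specialized prologue for one activation (illustrative).
        static void emitPrologue(BytecodeBuilder b, JvmScopeDescriptor d) {
            // Restore each local variable from the captured state.
            for (int i = 0; i < d.locals.length; i++) {
                b.ldc(d.locals[i]);
                b.istore(i);
            }
            // Re-push the captured operand-stack values, bottom-most first.
            for (Object s : d.stack) {
                b.ldc(s);
            }
            // Resume at the saved program counter.
            b.gotoBytecodeIndex(d.bytecodeIndex);
        }
    }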
Transfer Execution to the New Code
Compile m2 as a normal method
The system unwinds the stack frame of m1
Reschedule the thread to execute m2
By construction, executing the specialized m2 sets up the target stack frame and continues execution
[Figure: step 3, the stack now holds an m2 frame, with the PC in m2]
Recovering from Inlining
Suppose optimizer inlines A -> B -> C:
[Figure: the single frame for A (with B and C inlined) yields three JVM scope descriptors, one per inlined scope (A, B, C); specialized methods A', B', and C' are generated, and the one frame is replaced by separate A', B', and C' frames, with the PC in C']
Inlining Example
void foo() {
    bar();     // suspended at B:, in the inlined body (A -> B)
A:  ...
}

void bar() {
    ...
B:  ...
}

foo_prime() {
    <specialized foo prologue>
    call bar_prime()
    goto A;
    ...
    bar();
A:  ...
}

bar_prime() {
    <specialized bar prologue>
    goto B;
    ...
B:  ...
}

Wipe the stack back to caller C, then call foo_prime: its prologue rebuilds foo's frame and calls bar_prime, whose prologue rebuilds bar's frame and resumes at B.
[Figure: stack with C's frame, then a foo' frame, then a bar' frame on top, PC in bar']
Implementation Details
Target compiler unmodified, except for ...
New pseudo-bytecodes:
  Load literals (to avoid inserting new constants into the constant pool)
  Load an address/bytecode index (e.g., a JSR return address on the stack)
Fix bytecode indices for GC maps, exception tables, and line-number tables
Pros and Cons
Advantages
  Mostly compiler-independent
  Avoids multiple entry points in compiled code
  Target compiler can exploit run-time constants
Disadvantage
  Must compile the target method twice (once for the transition, once for the next invocation)
Outline
Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion
Two OSR Applications
Promotion (see the paper for details)
  Recompile a long-running activation
Deferred Compilation
  Don't compile uncommon paths
  Saves compile time

  if (foo is currently final)
      x = 1;
  else
      trap/OSR;   // deferred path; originally x = foo()
  return x;
Deferred Compilation
What's "infrequent"?
static heuristics
profile data
The adaptive recompilation decision is modified to consider OSR factors
Outline
Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion
Online Experiments
Eager: (default) no deferred compilation
OSR/static: deferred compilation for CHA-based inlining only
OSR/edge counts: deferred compilation with online profile data and CHA-based inlining
Adaptive System Performance, First Run
[Chart: performance relative to Eager (higher is better) for OSR/static and OSR/edge counts on compress, jess, db, javac, mpegaudio, mtrt, jack, and the geometric mean; y-axis 0.8 to 1.2]

Adaptive System Performance, Best Run of 10
[Chart: same layout; performance relative to Eager for OSR/static and OSR/edge counts]
OSR Activities
SPECjvm98, size 100, First Run

Benchmark    Promotions  Invalidations
compress          3            6
jess              0            0
db                0            1
javac             0           10
mpegaudio         0            1
mtrt              0            5
jack              0            1
total             3           24
Outline
Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion
Summary
A new on-stack replacement mechanism
Online profile-directed deferred compilation
Evaluation of OSR applications in JikesRVM
Conclusion
Should a VM implement OSR?
+ Can be done with minimal intrusion to the code base
- Modest gains from deferred compilation
- No benefit for class-hierarchy-based inlining
+ Debugging with dynamic de-optimization is valuable
TODO: more advanced speculative optimizations
The implementation is publicly available in Jikes RVM under the CPL:
  Linux/x86, Linux/PPC, and AIX/PPC
  http://www-124.ibm.com/developerworks/oss/jikesrvm/
Backup Slides
Jikes RVM Analytic Recompilation Model
Define:
  cur, the current optimization level for method m
  T_j, expected future execution time at level j
  C_j, compilation cost at opt level j
Choose the j > cur that minimizes T_j + C_j
If T_j + C_j < T_cur, recompile at level j
Assumptions:
  Method will execute for twice its current duration
  Compilation cost and speedup based on offline averages
  Sample data determines how long a method has executed
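A minimal sketch of this decision in Java, assuming the T and C arrays (indexed by optimization level) have already been estimated; the names are illustrative, not the actual Jikes RVM controller code:

    class RecompilationModel {
        // T[j]: expected future execution time of m at level j
        // C[j]: compilation cost of m at level j
        // Returns the level to recompile at, or -1 to stay at cur.
        static int choose(int cur, double[] T, double[] C) {
            int best = -1;
            double bestCost = T[cur];            // cost of doing nothing
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + C[j] < bestCost) {    // recompile only if it pays off
                    bestCost = T[j] + C[j];
                    best = j;
                }
            }
            return best;
        }
    }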
Jikes RVM OSR Promotion Model
Given: an outdated activation A of method m
Define:
  L, the last optimization level for any compiled version of m
  cur, the current optimization level for activation A
  T_cur, expected future execution time of A at level cur
  C_L, compilation cost for method m at opt level L
  T_L, expected future execution time of A at level L
If T_L + C_L < T_cur, specialize A at level L
Assumption:
  Outdated activation will execute for twice its current duration
Jikes RVM Recompilation Model,
with Profile-Driven Deferred Compilation
Define:
  cur, the current optimization level for method m
  T_j, expected future execution time at level j
  C_j, compilation cost at opt level j
  P, percentage of code in m that profile data indicates was reached
Choose the j > cur that minimizes T_j + P*C_j
If T_j + P*C_j < T_cur, recompile at level j
Assumptions:
  Method will execute for twice its current duration
  Compilation cost and speedup based on offline averages
  Sample data determines how long a method has executed
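The same sketch with the profile factor applied; the only change from the model above is that the compile cost is scaled by P, since deferred compilation skips the unreached fraction of the method (again illustrative, not the actual Jikes RVM code):

    class ProfileDirectedModel {
        // P: fraction of m's code that profile data indicates was reached.
        static int choose(int cur, double[] T, double[] C, double P) {
            int best = -1;
            double bestCost = T[cur];
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + P * C[j] < bestCost) {  // scaled compile cost
                    bestCost = T[j] + P * C[j];
                    best = j;
                }
            }
            return best;
        }
    }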
Offline Profile experiments
Collect "perfect" profile data offline
Mark any block never reached as "uncommon"
Defer compilation of "uncommon" blocks
Four configurations
Ideal: deferred compilation trap keeps no state live
Ideal-OSR: deferred compilation trap is valid OSR point
Static-OSR: no profile data; defer compilation for CHA-based
inlining; trap is valid OSR point
Eager: (default) no deferred compilation
[Chart: Compile Rate, Offline Profile]
[Chart: Machine Code Size, Offline Profile]
[Chart: Code Quality, Offline Profile (higher is better)]
OSR Challenges
Engineering Complexity
How to minimize disruption to VM code base?
How to constrain optimizations?
Policies for applying OSR
How to make rational decisions for applying OSR?
Effectiveness
How does OSR improve/constrain dataflow optimizations?
How effective are online OSR-based optimizations?
Recompilation Activities
First Run

                  With OSR                Without OSR
Benchmark     O0   O1   O2  total     O0   O1   O2  total
compress      17    7    2    26      13    9    6    28
jess          49   20    1    70      39   17    4    60
db             8    4    2    14       8    4    5    17
javac        171   19    2   192     168   16    3   187
mpegaudio     68   32    7   107      66   29    6   101
mtrt          57   14    3    74      61   11    3    75
jack          59   25    8    92      54   26    5    85
total        429  121   25   575     409  112   32   553
Summary of Study (1)
Engineering Complexity
  How to minimize disruption to the VM code base?
    Compiler-independent specialized source code manages the transition transparently
  How to constrain optimizations?
    Model OSR points like calls in standard transformations
Policies for applying OSR
  How to make rational decisions for applying OSR?
    Simple modifications to the cost-benefit analytic model
Summary of Study (2)
Effectiveness (for an implementation of online profile-directed deferred compilation)
  How does OSR improve/constrain dataflow optimizations?
    Small ideal benefit from dataflow merges (0.5 - 2.2%)
    Negligible benefit when constraining optimization for potential invalidation
    Negligible benefit for just CHA-based inlining
      (patch points + splitting + pre-existence are good enough)
  How effective are online OSR-based optimizations?
    Average performance improvement of 2.6% on first run of SPECjvm98 (s=100)
    Individual benchmarks range from +8% to -4%
    Negligible impact on steady-state performance (best of 10 iterations)
    Adaptive recompilation model relatively insensitive; compiles 4% more methods
Experimental Details
SPECjvm98, size 100
Jikes RVM 2.1.1
FastAdaptiveSemispace configuration
one virtual processor
500MB heap
separate VM instance for each benchmark
IBM RS/6000 Model F80
six 500 MHz PowerPC 630s
AIX 4.3.3
4 GB memory
Specialized Code Generation
Generate specialized code m2 that sets up the new stack frame and continues execution, preserving semantics
Express the transition to the new stack frame in source code (bytecode)
[Figure: step 2, the compiler-independent state drives generation of m2]
Deferred Compilation
Don't compile "infrequent" blocks

Before:
  if (foo is currently final)
      x = 1;
  else
      x = foo();
  return x;

After:
  if (foo is currently final)
      x = 1;
  else
      trap/OSR;
  return x;
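A hedged Java rendering of the transformed code; the guard check and the OSR trap below are hypothetical stand-ins for VM-internal mechanisms, not real APIs:

    class DeferredCompilationExample {
        // Stand-in for the class-hierarchy-based guard (illustrative).
        static boolean fooIsCurrentlyFinal() { return true; }

        static int guardedInlinedFoo() {
            int x;
            if (fooIsCurrentlyFinal()) {
                x = 1;   // inlined body of foo()
            } else {
                // Uncommon path, never compiled: reaching it triggers the
                // trap, which performs OSR into freshly compiled code.
                throw new IllegalStateException("OSR trap (illustrative)");
            }
            return x;
        }
    }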
Experimental Results
Online profile-directed deferred compilation
Evaluation
How much do OSR points improve optimization by eliminating
merges?
How much do OSR points constrain optimization?
How effective is online profile-directed deferred
compilation?
Online Experiments
Before optimizing, collect intraprocedural edge counters
Defer compilation of blocks that profile data says were not reached
If a deferred block is reached:
  Trigger OSR and deoptimize
  Invalidate the compiled code
Modify the analytic recompilation model:
  Promotion from baseline to optimized code
  Compile-time cost estimate scaled according to profile data
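A minimal sketch of the deferral test, assuming per-block edge counts were collected before optimizing; the names are illustrative:

    class DeferralPolicy {
        // A block is deferred (replaced by an OSR trap) when profile
        // data shows no edge into it was ever taken.
        static boolean shouldDefer(int[] edgeCountsIntoBlock) {
            for (int count : edgeCountsIntoBlock) {
                if (count > 0) return false;  // reached during profiling
            }
            return true;                      // never reached: defer
        }
    }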