
Adaptive Optimization with
On-Stack Replacement
Stephen J. Fink
IBM T.J. Watson Research Center
Feng Qian (presenter)
Sable Research Group, McGill University
http://www.sable.mcgill.ca
Motivation
• Modern VMs use adaptive recompilation strategies
• The VM replaces an entry in the dispatching table with newly compiled code
• Switching to the new code can only happen at the next invocation
• On-stack replacement (OSR) allows the transformation to happen in the middle of a method's execution
What is On-stack Replacement?
• Transfer execution from compiled code m1 to compiled code m2, even while m1 runs on some thread's stack

[Figure: a thread's stack before and after OSR; the m1 frame and PC are replaced by an m2 frame and PC]
Why On-Stack Replacement (OSR)?
• Debugging optimized code via dynamic deoptimization [SELF-93]
• Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001]
• Promotion of long-running activations [SELF-93]
• Safe invalidation for speculative optimization [HotSpot, SELF-91]
Related Work
• Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94]
• HotSpot server compiler [JVM'01]
• Partial method compilation [OOPSLA'01]
OSR Challenges
• Engineering complexity
  • How to minimize disruption to the VM code base?
  • How to constrain optimizations?
• Policies for applying OSR
  • How to make rational decisions for applying OSR?
• Effectiveness
  • How does OSR improve/constrain dataflow optimizations?
  • How effective are online OSR-based optimizations?
Outline
  Motivation
» OSR Mechanism
  Applications
  Experimental Results
  Conclusion
OSR Mechanism Overview
• Extract compiler-independent state from a suspended activation of m1
• Generate specialized code m2 for the suspended activation
• Compile and transfer execution to the new code m2 (the three steps are sketched in code below)

[Figure: (1) the suspended m1 frame yields compiler-independent state; (2) that state drives generation of specialized code m2; (3) execution transfers to a new m2 frame]
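The three steps can be written as a small driver routine. A minimal sketch, assuming hypothetical type and method names (not the actual Jikes RVM API):

// Minimal sketch of the three OSR steps; all names are illustrative.
final class OsrDriver {
    /** Compiler-independent state of one suspended activation. */
    static final class ScopeDescriptor { /* thread, frame, pc, locals, stack */ }
    /** Machine code for the specialized method m2. */
    static final class CompiledCode { }

    void performOsr(Thread thread, long framePointer) {
        // 1. Extract compiler-independent state from the suspended activation of m1.
        ScopeDescriptor state = extractScopeDescriptor(thread, framePointer);
        // 2. Generate specialized bytecode (m2) tailored to this one activation.
        byte[] specializedBytecode = generateSpecializedCode(state);
        // 3. Compile m2 as a normal method and transfer execution to it.
        CompiledCode m2 = compile(specializedBytecode);
        transferExecution(thread, framePointer, m2);
    }

    // Stubs standing in for the VM services each step needs.
    ScopeDescriptor extractScopeDescriptor(Thread t, long fp) { return new ScopeDescriptor(); }
    byte[] generateSpecializedCode(ScopeDescriptor s) { return new byte[0]; }
    CompiledCode compile(byte[] bytecode) { return new CompiledCode(); }
    void transferExecution(Thread t, long fp, CompiledCode m2) { }
}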
JVM Scope Descriptor
• Compiler-independent state of a running activation
• Based on the Java Virtual Machine architecture
• Five components (captured by the sketch below):
  1) Thread running the activation
  2) Reference to the activation's stack frame
  3) Program counter (as a bytecode index)
  4) Value of each local variable
  5) Value of each operand stack location
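In code, the descriptor is just a record of these five components. A minimal sketch; the field names are illustrative rather than the actual Jikes RVM classes, and real locals and stack slots may hold references and wide values, not only ints:

// Plain data class holding the five components of a JVM scope descriptor.
final class JvmScopeDescriptor {
    final Thread thread;        // 1) thread running the activation
    final long framePointer;    // 2) reference to the activation's stack frame
    final int bytecodeIndex;    // 3) program counter, as a bytecode index
    final int[] locals;         // 4) value of each local variable (simplified to ints)
    final int[] stack;          // 5) value of each operand stack location (simplified)

    JvmScopeDescriptor(Thread thread, long framePointer, int bytecodeIndex,
                       int[] locals, int[] stack) {
        this.thread = thread;
        this.framePointer = framePointer;
        this.bytecodeIndex = bytecodeIndex;
        this.locals = locals;
        this.stack = stack;
    }
}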
JVM Scope Descriptor Example
class C {
    static int sum(int c) {
        int y = 0;
        for (int i = 0; i < c; i++) {
            y += i;
        }
        return y;
    }
}

Suspend after 50 loop iterations (i = 50):

Bytecode:
 0 iconst_0
 1 istore_1
 2 iconst_0
 3 istore_2
 4 goto 14
 7 iload_1
 8 iload_2
 9 iadd
10 istore_1
11 iinc 2 1
14 iload_2
15 iload_0
16 if_icmplt 7
19 iload_1
20 ireturn

JVM Scope Descriptor:
Running thread: MainThread
Frame pointer: 0xSomeAddress
Program counter: 16
Local variables:
  L0 (c) = 100
  L1 (y) = 1225
  L2 (i) = 50
Stack expressions:
  S0 = 50
  S1 = 100
Extracting JVM Scope Descriptor
• Trivial from the interpreter
• Optimizing compiler:
  • Insert OSR point (safe-point) instructions in the initial IR
  • An OSR point uses the stack and local state needed to recover the scope descriptor
  • An OSR point is treated as a call; it transfers control to the exit block
  • Aggregate OSR points into an OSR map when generating machine instructions (sketched below)

[Figure: extracting compiler-independent state from the suspended m1 frame]
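A minimal sketch of such an OSR map, assuming illustrative names (this is not the Jikes RVM API); each entry records, for one OSR point, the bytecode index plus where each local and stack value lives in the optimized code:

import java.util.HashMap;
import java.util.Map;

// Maps a machine-code offset to the information needed to recover the
// JVM scope descriptor at that OSR point.
final class OsrMap {
    static final class Entry {
        final int bytecodeIndex;      // program counter to resume at
        final int[] localLocations;   // physical location (register/spill slot) of each local
        final int[] stackLocations;   // physical location of each operand stack slot
        Entry(int bci, int[] locals, int[] stack) {
            this.bytecodeIndex = bci;
            this.localLocations = locals;
            this.stackLocations = stack;
        }
    }

    private final Map<Integer, Entry> entries = new HashMap<>();

    void record(int machineCodeOffset, Entry e) { entries.put(machineCodeOffset, e); }
    Entry lookup(int machineCodeOffset) { return entries.get(machineCodeOffset); }
}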
Specialized Code Generation
• Prepend a specialized prologue to the original bytecode
• The prologue will:
  • Save JVM scope descriptor values into local variables
  • Push JVM scope descriptor values onto the stack
  • Jump to the desired program counter
(A sketch of the prologue generation follows.)

[Figure: the compiler-independent state drives generation of specialized code m2]
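As a sketch, for a descriptor whose values are all ints, the prologue could be emitted like this (hypothetical helper, emitting readable pseudo-bytecode; real code must also handle references, longs/doubles, and locals beyond index 3, which use the two-byte istore form):

import java.util.ArrayList;
import java.util.List;

// Emits a specialized prologue for an all-int scope descriptor.
final class PrologueEmitter {
    static List<String> emitPrologue(int[] locals, int[] stack, int targetBytecodeIndex) {
        List<String> code = new ArrayList<>();
        // Save the descriptor's local-variable values into the locals.
        for (int i = 0; i < locals.length; i++) {
            code.add("ldc " + locals[i]);
            code.add("istore_" + i);   // simplification: assumes locals 0..3
        }
        // Push the descriptor's operand-stack values, bottom first.
        for (int value : stack) {
            code.add("ldc " + value);
        }
        // Resume at the suspended program counter.
        code.add("goto " + targetBytecodeIndex);
        return code;
    }

    public static void main(String[] args) {
        // Reproduces the transition example below: c=100, y=1225, i=50,
        // operand stack {50, 100}, resuming at bytecode index 16.
        emitPrologue(new int[] {100, 1225, 50}, new int[] {50, 100}, 16)
            .forEach(System.out::println);
    }
}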
Transition Example
JVM Scope Descriptor:
Running thread: MainThread
Frame pointer: 0xSomeAddress
Program counter: 16
Local variables:
  L0 (c) = 100
  L1 (y) = 1225
  L2 (i) = 50
Stack expressions:
  S0 = 50
  S1 = 100

Original bytecode:
 0 iconst_0
 1 istore_1
 2 iconst_0
 3 istore_2
 4 goto 14
 7 iload_1
 8 iload_2
 9 iadd
10 istore_1
11 iinc 2 1
14 iload_2
15 iload_0
16 if_icmplt 7
19 iload_1
20 ireturn

Specialized bytecode:
   ldc 100
   istore_0
   ldc 1225
   istore_1
   ldc 50
   istore_2
   ldc 50
   ldc 100
   goto 16
 0 iconst_0
   ...
16 if_icmplt 7
   ...
20 ireturn
Transfer Execution to the New Code
• Compile m2 as a normal method
• The system unwinds the stack frame of m1
• Reschedule the thread to execute m2
• By construction, executing the specialized m2 sets up the target stack frame and continues execution

[Figure: execution transfers to the new m2 frame]
Recovering from Inlining
• Suppose the optimizer inlines A -> B -> C

[Figure: the single A' frame (with B and C inlined) yields three JVM scope descriptors, one each for A, B, and C; these drive generation of specialized methods A', B', and C', and the thread resumes with three separate frames, with the PC in C']
Inlining Example
Suspend at label B: in the inlined A -> B.

void foo() {
    bar();
A:  ...
}

void bar() {
    ...
B:  ...
}

foo_prime() {
    <specialized foo prologue>
    call bar_prime();
    goto A;
    ...
    bar();
A:  ...
}

bar_prime() {
    <specialized bar prologue>
    goto B;
    ...
B:  ...
}

Wipe the stack down to caller C, then call foo_prime.

[Figure: the thread's stack after the transition: caller C's frame, then foo' and bar' frames, with the PC in bar']
Implementation Details
Target compiler unmodified, except for:
• New pseudo-bytecodes
  • Load literals (to avoid inserting new constants into the constant pool)
  • Load an address/bytecode index: a JSR return address on the stack
• Fixing bytecode indices for GC maps, exception tables, and line number tables
Pros and Cons
Advantages:
• Mostly compiler-independent
• Avoids multiple entry points in compiled code
• The target compiler can exploit run-time constants
Disadvantage:
• Must compile the target method twice (once for the transition, once for the next invocation)
Outline
  Motivation
  OSR Mechanism
» Applications
  Experimental Results
  Conclusion
Two OSR Applications
• Promotion (see the paper for details)
  • Recompile a long-running activation
• Deferred compilation
  • Don't compile uncommon paths
  • Saves compile time

Guarded inlining of foo():
    if (foo is currently final)
        x = 1;
    else
        x = foo();
    return x;

With the uncommon path deferred:
    if (foo is currently final)
        x = 1;
    else
        trap/OSR;
    return x;
Deferred Compilation
• What's "infrequent"?
  • Static heuristics
  • Profile data
• The adaptive recompilation decision is modified to consider OSR factors
Outline
  Motivation
  OSR Mechanism
  Applications
» Experimental Results
  Conclusion
Online Experiments
• Eager (default): no deferred compilation
• OSR/static: deferred compilation for CHA-based inlining only
• OSR/edge counts: deferred compilation with online profile data and CHA-based inlining
[Chart: Adaptive System Performance, First Run. Performance relative to Eager (higher is better) for OSR/static and OSR/edge counts on compress, jess, db, javac, mpegaudio, mtrt, jack, and the geometric mean; y-axis from 0.8 to 1.2]
[Chart: Adaptive System Performance, Best Run of 10. Performance relative to Eager (higher is better) for OSR/static and OSR/edge counts on the same benchmarks; y-axis from 0.8 to 1.2]
OSR Activities
SPECjvm98 size 100, First Run

Benchmark    Promotions  Invalidations
compress          3            6
jess              0            0
db                0            1
javac             0           10
mpegaudio         0            1
mtrt              0            5
jack              0            1
total             3           24
Outline
  Motivation
  OSR Mechanism
  Applications
  Experimental Results
» Conclusion
Summary
• A new on-stack replacement mechanism
• Online profile-directed deferred compilation
• Evaluation of OSR applications in Jikes RVM
Conclusion
• Should a VM implement OSR?
  + Can be done with minimal intrusion to the code base
  - Modest gains from deferred compilation
  - No benefit for class-hierarchy-based inlining
  + Debugging with dynamic de-optimization is valuable
• TODO: more advanced speculative optimizations
• The implementation is publicly available in Jikes RVM under the CPL (Linux/x86, Linux/PPC, and AIX/PPC):
  http://www-124.ibm.com/developerworks/oss/jikesrvm/
Backup Slides
[Backup charts: Compile Rate, Machine Code Size, and Code Quality for the Offline Profile experiments (higher is better for Code Quality)]
Jikes RVM Analytic Recompilation Model
Define:
  cur = current optimization level for method m
  Tj  = expected future execution time at level j
  Cj  = compilation cost at opt level j
Choose the j > cur that minimizes Tj + Cj.
If Tj + Cj < Tcur, recompile at level j.
Assumptions:
  • Method will execute for twice its current duration
  • Compilation cost and speedup are based on offline averages
  • Sample data determines how long a method has executed
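The decision reduces to a small cost-benefit loop. A minimal sketch, assuming Tj and Cj are supplied as arrays indexed by optimization level (the real model estimates them from runtime samples and offline averages; all names are illustrative):

// Pick the level j > cur that minimizes Tj + Cj; stay at cur if nothing beats Tcur.
final class RecompilationModel {
    static int chooseLevel(int cur, double[] futureTime, double[] compileCost) {
        int best = cur;
        double bestCost = futureTime[cur];                 // Tcur: cost of doing nothing
        for (int j = cur + 1; j < futureTime.length; j++) {
            double cost = futureTime[j] + compileCost[j];  // Tj + Cj
            if (cost < bestCost) {
                bestCost = cost;
                best = j;
            }
        }
        return best;                                       // recompile if best > cur
    }
}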
Jikes RVM OSR Promotion Model
Given: an outdated activation A of method m.
Define:
  L    = last optimization level for any compiled version of m
  cur  = current optimization level for activation A
  Tcur = expected future execution time of A at level cur
  CL   = compilation cost for method m at opt level L
  TL   = expected future execution time of A at level L
If TL + CL < Tcur, specialize A at level L.
Assumption:
  • The outdated activation will execute for twice its current duration
Jikes RVM Recompilation Model, with Profile-Driven Deferred Compilation
Define:
  cur = current optimization level for method m
  Tj  = expected future execution time at level j
  Cj  = compilation cost at opt level j
  P   = percentage of code in m that profile data indicates was reached
Choose the j > cur that minimizes Tj + P*Cj.
If Tj + P*Cj < Tcur, recompile at level j.
Assumptions:
  • Method will execute for twice its current duration
  • Compilation cost and speedup are based on offline averages
  • Sample data determines how long a method has executed
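The only change from the chooseLevel sketch above is scaling the compile cost by P. A hypothetical variant:

// As chooseLevel above, but the compile cost is weighted by P, the fraction
// of m's code that profile data indicates was actually reached.
final class DeferredRecompilationModel {
    static int chooseLevel(int cur, double[] futureTime, double[] compileCost,
                           double reachedFraction /* P, in [0,1] */) {
        int best = cur;
        double bestCost = futureTime[cur];                                   // Tcur
        for (int j = cur + 1; j < futureTime.length; j++) {
            double cost = futureTime[j] + reachedFraction * compileCost[j];  // Tj + P*Cj
            if (cost < bestCost) {
                bestCost = cost;
                best = j;
            }
        }
        return best;
    }
}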
Offline Profile Experiments
• Collect "perfect" profile data offline
• Mark any block never reached as "uncommon"
• Defer compilation of "uncommon" blocks
• Four configurations:
  • Ideal: deferred compilation trap keeps no state live
  • Ideal-OSR: deferred compilation trap is a valid OSR point
  • Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is a valid OSR point
  • Eager (default): no deferred compilation
OSR Challenges
• Engineering complexity
  • How to minimize disruption to the VM code base?
  • How to constrain optimizations?
• Policies for applying OSR
  • How to make rational decisions for applying OSR?
• Effectiveness
  • How does OSR improve/constrain dataflow optimizations?
  • How effective are online OSR-based optimizations?
Recompilation Activities
First Run

              With OSR               Without OSR
Benchmark    O0   O1  O2  total    O0   O1  O2  total
compress     17    7   2     26    13    9   6     28
jess         49   20   1     70    39   17   4     60
db            8    4   2     14     8    4   5     17
javac       171   19   2    192   168   16   3    187
mpegaudio    68   32   7    107    66   29   6    101
mtrt         57   14   3     74    61   11   3     75
jack         59   25   8     92    54   26   5     85
total       429  121  25    575   409  112  32    553
Summary of Study (1)
Engineering complexity
  How to minimize disruption to the VM code base?
  ° Compiler-independent specialized source code manages the transition transparently
  How to constrain optimizations?
  ° Model OSR points like CALLs in standard transformations
Policies for applying OSR
  How to make rational decisions for applying OSR?
  ° Simple modifications to the cost-benefit analytic model
Summary of Study (2)
Effectiveness (for an implementation of online profile-directed deferred compilation)
  How does OSR improve/constrain dataflow optimizations?
  ° Small ideal benefit from dataflow merges (0.5 - 2.2%)
  ° Negligible benefit when constraining optimization for potential invalidation
  ° Negligible benefit for just CHA-based inlining; patch points + splitting + pre-existence are good enough
  How effective are online OSR-based optimizations?
  ° Average performance improvement of 2.6% on first-run SPECjvm98 (s=100)
  ° Individual benchmarks range from +8% to -4%
  ° Negligible impact on steady-state performance (best of 10 iterations)
  ° The adaptive recompilation model is relatively insensitive; it compiles 4% more methods
Experimental Details
• SPECjvm98, size 100
• Jikes RVM 2.1.1
  • FastAdaptiveSemispace configuration
  • One virtual processor
  • 500 MB heap
  • Separate VM instance for each benchmark
• IBM RS/6000 Model F80
  • Six 500 MHz PowerPC 630s
  • AIX 4.3.3
  • 4 GB memory
Specialized Code Generation
Generate specialized code m2 that sets up the new stack frame and continues execution, preserving semantics. Express the transition to the new stack frame in source code (bytecode).

[Figure: the compiler-independent state drives generation of specialized code m2]
Deferred Compilation
Don't compile "infrequent" blocks.

Guarded inlining of foo():
    if (foo is currently final)
        x = 1;
    else
        x = foo();
    return x;

With the uncommon path deferred:
    if (foo is currently final)
        x = 1;
    else
        trap/OSR;
    return x;
Experimental Results
Online profile-directed deferred compilation

Evaluation questions:
• How much do OSR points improve optimization by eliminating merges?
• How much do OSR points constrain optimization?
• How effective is online profile-directed deferred compilation?
Online Experiments
• Before optimizing, collect intraprocedural edge counters
• Defer compilation at blocks that profile data says were not reached
• If a deferred block is reached (sketched below):
  • Trigger OSR and deoptimize
  • Invalidate the compiled code
• Modify the analytic recompilation model:
  • Promotion from baseline to optimized
  • Compile-time cost estimate modified according to profile data
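A minimal sketch of the runtime flow when a deferred block is actually reached; the handler names are hypothetical, not the actual Jikes RVM API:

// When optimized code reaches a block that was deferred based on profile
// data, the trap deoptimizes the running activation and invalidates the code.
final class DeferredBlockTrap {
    void onTrap(Thread thread, long framePointer) {
        // The profile said this block was unreachable; that assumption just failed.
        triggerOsr(thread, framePointer);        // extract state, transfer execution
        invalidateCompiledCode(framePointer);    // future calls recompile without the deferral
    }
    void triggerOsr(Thread t, long fp) { /* OSR transition as described earlier */ }
    void invalidateCompiledCode(long fp) { /* unlink stale code; request recompilation */ }
}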