Adaptive Optimization with On-Stack Replacement

Stephen J. Fink IBM T.J. Watson Research Center

Feng Qian (presenter), Sable Research Group, McGill University

http://www.sable.mcgill.ca

Motivation

Modern VMs use adaptive recompilation strategies

The VM replaces the method's entry in the dispatch table with the newly compiled code

Switching to the new code can only happen at the next invocation

On-stack replacement (OSR) allows the transformation to happen in the middle of a method's execution

What is On-stack Replacement?

Transfer execution from compiled code m1 to compiled code m2 even while m1 runs on some thread’s stack

[Figure: a thread's stack before and after OSR; the frame and PC for m1 are replaced by a frame and PC for m2]

Why On-Stack Replacement (OSR)?

Debugging optimized code via dynamic de-optimization [SELF-93]

Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001]

Promotion of long-running activations [SELF-93]

Safe invalidation for speculative optimization [HotSpot, SELF-91]

Related Work

Holzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA’91, PLDI’92, OOPSLA’94]

HotSpot server compiler [JVM’01]

Partial method compilation [OOPSLA’01]

OSR Challenges

Engineering Complexity
  How to minimize disruption to VM code base?
  How to constrain optimizations?

Policies for applying OSR
  How to make rational decisions for applying OSR?

Effectiveness
  How does OSR improve/constrain dataflow optimizations?
  How effective are online OSR-based optimizations?

Outline

Motivation
OSR Mechanism
Applications
Experimental Results
Conclusion

OSR Mechanism Overview

Extract compiler-independent state from a suspended activation for m1

Generate specialized code m2 for the suspended activation

Compile and transfer execution to the new code m2

[Figure: three numbered steps. (1) Extract compiler-independent state from the m1 frame on the stack; (2) generate m2 from that state; (3) the stack holds a frame and PC for m2]

JVM Scope Descriptor

Compiler-independent state of a running activation

Based on Java Virtual Machine Architecture

Five components:

1) Thread running the activation
2) Reference to the activation's stack frame
3) Program Counter (as a bytecode index)
4) Value of each local variable
5) Value of each stack location

JVM Scope Descriptor Example

  class C {
    static int sum(int c) {
      int y = 0;
      for (int i=0; i<c; i++) {
        y += i;
      }
      return y;
    }
  }

Bytecode

   0 iconst_0
   1 istore_1
   2 iconst_0
   3 istore_2
   4 goto 14
   7 iload_1
   8 iload_2
   9 iadd
  10 istore_1
  11 iinc 2 1
  14 iload_2
  15 iload_0
  16 if_icmplt 7
  19 iload_1
  20 ireturn

Suspend after 50 loop iterations (i = 50)

JVM Scope Descriptor

  Running thread: MainThread
  Frame Pointer: 0xSomeAddress
  Program Counter: 16
  Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50;
  Stack Expressions: S0 = 50; S1 = 100;
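The five components of the scope descriptor can be pictured as a plain data holder. The sketch below is only an illustration: the class name `ScopeDescriptorSketch`, its field names, and the helper `suspendSum` are hypothetical and are not taken from Jikes RVM. It models only int values, uses `0L` as a stand-in for the unspecified frame address, and assumes the requested iteration count does not exceed `c`.

```java
import java.util.Arrays;

/** Hypothetical container for the five scope-descriptor components (int values only). */
final class ScopeDescriptorSketch {
    final String thread;      // 1) thread running the activation (by name, for illustration)
    final long framePointer;  // 2) reference to the activation's stack frame (placeholder value)
    final int bytecodeIndex;  // 3) program counter as a bytecode index
    final int[] locals;       // 4) value of each local variable
    final int[] stack;        // 5) value of each stack location, bottom first

    ScopeDescriptorSketch(String thread, long framePointer, int bytecodeIndex,
                          int[] locals, int[] stack) {
        this.thread = thread;
        this.framePointer = framePointer;
        this.bytecodeIndex = bytecodeIndex;
        this.locals = locals.clone();
        this.stack = stack.clone();
    }

    /** State of C.sum(c) suspended at bytecode index 16 after the given number of iterations. */
    static ScopeDescriptorSketch suspendSum(int c, int iterations) {
        int y = 0;
        int i = 0;
        for (; i < iterations; i++) {
            y += i;
        }
        // At index 16 (if_icmplt) the operands i and c have already been pushed.
        return new ScopeDescriptorSketch("MainThread", 0L, 16,
                new int[] {c, y, i}, new int[] {i, c});
    }

    @Override
    public String toString() {
        return "thread=" + thread + " pc=" + bytecodeIndex
                + " locals=" + Arrays.toString(locals)
                + " stack=" + Arrays.toString(stack);
    }

    public static void main(String[] args) {
        // Reproduces the slide's example: c = 100, suspended when i = 50.
        System.out.println(suspendSum(100, 50));
    }
}
```

Running `main` recomputes the values on the slide: y = 0 + 1 + ... + 49 = 1225, with i and c sitting on the operand stack.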

Extracting JVM Scope Descriptor

Trivial from interpreter

Optimizing Compiler
  Insert OSR Point (safe-point) instructions in initial IR
  OSR Point uses stack, local state needed to recover scope descriptor
  OSR Point is treated as a call, transfers control to exit block
  Aggregate OSR points to an OSR map when generating machine instructions

[Figure: step 1, extracting compiler-independent state from the m1 frame on the stack]

Specialized Code Generation

Prepend a specialized prologue to original bytecode

Prologue will
• Save JVM Scope Descriptor values into local variables
• Push JVM Scope Descriptor values onto the stack
• Jump to the desired program counter

[Figure: step 2, generating m2 from the compiler-independent state]

Transition Example

JVM Scope Descriptor

  Running thread: MainThread
  Frame Pointer: 0xSomeAddress
  Program Counter: 16
  Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50;
  Stack Expressions: S0 = 50; S1 = 100;

Specialized Bytecode

     ldc 100
     istore_0
     ldc 1225
     istore_1
     ldc 50
     istore_2
     ldc 50
     ldc 100
     goto 16
   0 iconst_0
     ...
  16 if_icmplt 7
     ...
  20 ireturn

Original Bytecode

   0 iconst_0
   1 istore_1
   2 iconst_0
   3 istore_2
   4 goto 14
   7 iload_1
   8 iload_2
   9 iadd
  10 istore_1
  11 iinc 2 1
  14 iload_2
  15 iload_0
  16 if_icmplt 7
  19 iload_1
  20 ireturn
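The prologue in this example follows mechanically from the scope descriptor. As a rough sketch (the class `PrologueSketch` and its method are hypothetical names, and the output is just text in the slide's notation rather than real class-file bytes), the three prologue duties could be generated like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that prints a specialized prologue in the slide's notation. */
final class PrologueSketch {
    static List<String> prologue(int[] locals, int[] stack, int targetPc) {
        List<String> out = new ArrayList<>();
        // Save descriptor values into local variables.
        for (int k = 0; k < locals.length; k++) {
            out.add("ldc " + locals[k]);
            // Short forms istore_0..istore_3 exist; higher slots take an operand.
            out.add(k <= 3 ? "istore_" + k : "istore " + k);
        }
        // Push descriptor stack values, bottom of stack first.
        for (int v : stack) {
            out.add("ldc " + v);
        }
        // Jump to the bytecode index where the activation was suspended.
        out.add("goto " + targetPc);
        return out;
    }

    public static void main(String[] args) {
        List<String> p = prologue(new int[] {100, 1225, 50}, new int[] {50, 100}, 16);
        System.out.println(String.join(" ", p));
    }
}
```

For the descriptor above this prints the same nine-instruction prologue shown in the specialized bytecode, ending in `goto 16`.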


Transfer Execution to the New Code

Compile m2 as a normal method
The system unwinds the stack frame of m1
Reschedule the thread to execute m2
By construction, executing the specialized m2 sets up the target stack frame and continues execution
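The claim that the specialized method rebuilds the frame "by construction" can be checked on a toy model. The sketch below is an illustration under stated assumptions, not the Jikes RVM implementation: `OsrTransitionDemo`, `Insn`, and the tiny interpreter are invented for this example, `push` stands in for both `iconst` and the load-literal pseudo-bytecode, and branch targets are list positions rather than byte offsets (so the suspended `if_icmplt` sits at position 12 instead of index 16).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OsrTransitionDemo {
    /** One toy instruction; unused operands are 0. */
    static final class Insn {
        final String op; final int a; final int b;
        Insn(String op, int a, int b) { this.op = op; this.a = a; this.b = b; }
    }
    static Insn in(String op, int a) { return new Insn(op, a, 0); }
    static Insn in(String op) { return new Insn(op, 0, 0); }

    /** sum(c) from the slides, with branch targets as list positions. */
    static List<Insn> original() {
        return Arrays.asList(
            in("push", 0), in("istore", 1),                 // 0-1: y = 0
            in("push", 0), in("istore", 2),                 // 2-3: i = 0
            in("goto", 10),                                 // 4
            in("iload", 1), in("iload", 2), in("iadd"),     // 5-7: y + i
            in("istore", 1),                                // 8: y = y + i
            new Insn("iinc", 2, 1),                         // 9: i++
            in("iload", 2), in("iload", 0),                 // 10-11: push i, c
            in("if_icmplt", 5),                             // 12: loop while i < c
            in("iload", 1), in("ireturn"));                 // 13-14: return y
    }

    /** Interprets code from position 0 with the given locals and an empty operand stack. */
    static int run(List<Insn> code, int[] locals) {
        int[] stack = new int[8];
        int sp = 0, pc = 0;
        while (true) {
            Insn i = code.get(pc++);
            switch (i.op) {
                case "push":      stack[sp++] = i.a; break;
                case "istore":    locals[i.a] = stack[--sp]; break;
                case "iload":     stack[sp++] = locals[i.a]; break;
                case "iadd":      sp--; stack[sp - 1] += stack[sp]; break;
                case "iinc":      locals[i.a] += i.b; break;
                case "goto":      pc = i.a; break;
                case "if_icmplt": sp -= 2; if (stack[sp] < stack[sp + 1]) pc = i.a; break;
                case "ireturn":   return stack[sp - 1];
                default: throw new IllegalStateException(i.op);
            }
        }
    }

    /** Prepends a prologue that rebuilds locals and stack, then jumps to the suspended position. */
    static List<Insn> specialize(List<Insn> orig, int[] locals, int[] stack, int pc) {
        List<Insn> out = new ArrayList<>();
        for (int k = 0; k < locals.length; k++) {
            out.add(in("push", locals[k]));
            out.add(in("istore", k));
        }
        for (int v : stack) out.add(in("push", v));
        int shift = out.size() + 1;            // prologue length, counting the goto below
        out.add(in("goto", pc + shift));
        for (Insn i : orig) {                  // original body with branch targets shifted
            boolean branch = i.op.equals("goto") || i.op.equals("if_icmplt");
            out.add(branch ? new Insn(i.op, i.a + shift, i.b) : i);
        }
        return out;
    }

    public static void main(String[] args) {
        int full = run(original(), new int[] {100, 0, 0});
        List<Insn> m2 = specialize(original(), new int[] {100, 1225, 50}, new int[] {50, 100}, 12);
        int resumed = run(m2, new int[3]);     // fresh frame: the prologue supplies all state
        System.out.println(full + " " + resumed);
    }
}
```

Both runs return the same sum, which is the point of the design: the new frame needs no special entry point because the ordinary method entry of m2 reconstructs the suspended state. Shifting the branch targets by the prologue length mirrors, in miniature, the need to fix up bytecode indices in the real system.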

[Figure: step 3, the stack holds a frame and PC for m2]

Recovering from Inlining

Suppose optimizer inlines A -> B -> C:

[Figure: three numbered steps. (1) From the single frame of A on the stack, extract JVM Scope Descriptors for A, B, and C; (2) generate specialized methods A', B', and C'; (3) the stack holds separate frames for A', B', and C']

Inlining Example

  void foo() {
    bar();
    A: ...
  }
  void bar() {
    ...
    B: ...
  }

Suspend at B: in A -> B

  foo_prime() {
    <specialized foo prologue>
    call bar_prime()
    goto A;
    ...
    bar();
    A: ...
  }
  bar_prime() {
    <specialized bar prologue>
    goto B;
    ...
    B: ...
  }

Wipe stack to caller C and call foo_prime

[Figure: stack after the transition, with frames for foo' and bar' above the frame of caller C]

Implementation Details

Target compiler unmodified, except for ...

New pseudo-bytecodes
  Load literals (to avoid inserting new constants in constant pool)
  Load an address/bytecode index: JSR return address on stack

Fix bytecode indices for GC maps, exception tables, line number tables

Pros and Cons

Advantages
  mostly compiler-independent
  avoid multi-entry points of compiled code
  target compiler can exploit run-time constants

Disadvantage
  must compile target method twice (once for transition, once for next invocation)


Two OSR Applications

Promotion (see the paper for details)
  recompile a long-running activation

Deferred Compilation
  don't compile uncommon paths
  saves compile-time

[Figure: control-flow fragment with the blocks "if (foo is currently final)", "x = 1;", "x = foo();", "trap/OSR;", and "return x;"]

Deferred Compilation

What's "infrequent"?
  static heuristics
  profile data

Adaptive recompilation decision is modified to consider OSR factors

Feng Qian: Class initialization is called by a class loader; when do we need OSR for it?



Online Experiments

Eager: (by default) no deferred compilation
OSR/static: deferred compilation for CHA-based inlining only
OSR/edge counts: deferred compilation w/online profile data & CHA-based inlining

Adaptive System Performance

[Figure: two bar charts, "First Run" and "Best Run of 10". Y-axis: Performance Relative to Eager, ticks from 0.8 to 1.2, with a "better" direction marker. X-axis: compress, jess, db, javac, mpegaudio, mtrt, jack, g. mean. Series: OSR/edge counts, OSR/static]

OSR Activities (SPECjvm98 size 100, First Run)

  Benchmark    Promotions   Invalidations
  compress              3               6
  jess                  0               0
  db                    0               1
  javac                 0              10
  mpegaudio             0               1
  mtrt                  0               5
  jack                  0               1
  total                 3              24


Summary

A new on-stack replacement mechanism
Online profile-directed deferred compilation
Evaluation of OSR applications in JikesRVM

Conclusion

Should a VM implement OSR?
  + Can be done with minimal intrusion to code base
  Modest gains from deferred compilation
  No benefit for class-hierarchy-based inlining
  + Debugging with dynamic de-optimization valuable
  TODO: More advanced speculative optimizations

Implementation is available to the public in JikesRVM under CPL:
  Linux/x86, Linux/PPC, and AIX/PPC
  http://www-124.ibm.com/developerworks/oss/jikesrvm/

Backup Slides

[Figure: Compile Rate, Offline Profile]

[Figure: Machine Code Size, Offline Profile]

[Figure: Code Quality, Offline Profile, with a "better" direction marker]

Jikes RVM Analytic Recompilation Model

Define
  cur, current optimization level for method m
  T_j, expected future execution time at level j
  C_j, compilation cost at opt level j

Choose j > cur that minimizes T_j + C_j
If T_j + C_j < T_cur, recompile at level j

Assumptions
  Method will execute for twice its current duration
  Compilation cost and speedup based on offline average
  Sample data determines how long a method has executed
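The selection rule can be written out in a few lines. This is a minimal sketch, assuming the per-level estimates are already available as arrays indexed by optimization level; the class and method names are hypothetical, and how Jikes RVM actually derives the time and cost estimates is outside what the slide states.

```java
/** Hypothetical sketch of the cost-benefit rule: pick j > cur minimizing T_j + C_j. */
final class RecompilationModelSketch {
    /**
     * @param cur         current optimization level of the method
     * @param futureTime  futureTime[j] is T_j, expected future execution time at level j
     * @param compileCost compileCost[j] is C_j, compilation cost at level j
     * @return the level to recompile at, or cur if no higher level pays off
     */
    static int chooseLevel(int cur, double[] futureTime, double[] compileCost) {
        int best = cur;
        double bestCost = futureTime[cur];     // staying at cur costs T_cur and no compilation
        for (int j = cur + 1; j < futureTime.length; j++) {
            double cost = futureTime[j] + compileCost[j];
            if (cost < bestCost) {             // only switch when strictly cheaper
                best = j;
                bestCost = cost;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up numbers: level 1 costs 60 + 10 = 70, level 2 costs 40 + 50 = 90, T_cur = 100.
        double[] t = {100, 60, 40};
        double[] c = {0, 10, 50};
        System.out.println(chooseLevel(0, t, c));
    }
}
```

With these invented numbers the model picks level 1: level 2 produces faster code, but its compilation cost outweighs the extra speedup over the expected remaining run time.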

Jikes RVM OSR Promotion Model

Given: Outdated activation A of method m

Define
  L, last optimization level for any compiled version of m
  cur, current optimization level for activation A
  T_cur, expected future execution time of A at level cur
  C_L, compilation cost for method m at opt level L
  T_L, expected future execution time of A at level L

If T_L + C_L < T_cur, specialize A at level L

Assumption
  Outdated activation will execute for twice its current duration
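A small worked example may make the promotion test concrete. All numbers are invented for illustration, and the assumption is read here as "the expected remaining time at the current level equals the time the activation has already run". The 4x speedup of level L over level cur is likewise an assumed figure.

```latex
% Case 1: the activation has already run for 4.0 s at level cur.
T_{cur} = 4.0, \quad T_L = T_{cur}/4 = 1.0, \quad C_L = 0.5
\;\Rightarrow\; T_L + C_L = 1.5 < 4.0 = T_{cur}
\quad \text{(specialize $A$ at level $L$)}

% Case 2: the activation has run for only 0.6 s.
T_{cur} = 0.6, \quad T_L = T_{cur}/4 = 0.15, \quad C_L = 0.5
\;\Rightarrow\; T_L + C_L = 0.65 > 0.6 = T_{cur}
\quad \text{(leave $A$ running at level cur)}
```

The compilation cost is the same in both cases; only the time already spent in the activation changes the outcome, which is why promotion targets long-running activations.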

Jikes RVM Recompilation Model, with Profile-Driven Deferred Compilation

Define
  cur, current optimization level for method m
  T_j, expected future execution time at level j
  C_j, compilation cost at opt level j
  P, percentage of code in m that profile data indicates was reached

Choose j > cur that minimizes T_j + P*C_j
If T_j + P*C_j < T_cur, recompile at level j

Assumptions
  Method will execute for twice its current duration
  Compilation cost and speedup based on offline average
  Sample data determines how long a method has executed
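The effect of the P factor can be seen on a small example. The sketch below is hypothetical throughout: it assumes P is expressed as a fraction between 0 and 1, that "code" is measured as a size per basic block, and that a block counts as reached when its profile count is nonzero. None of these details are spelled out on the slide.

```java
/** Hypothetical sketch of the profile-scaled recompilation test T_j + P*C_j < T_cur. */
final class DeferredCompilationModelSketch {
    /** Fraction of the method's code that lies in blocks with a nonzero profile count. */
    static double reachedFraction(int[] blockSizes, long[] blockCounts) {
        long reached = 0;
        long total = 0;
        for (int b = 0; b < blockSizes.length; b++) {
            total += blockSizes[b];
            if (blockCounts[b] > 0) {
                reached += blockSizes[b];
            }
        }
        return total == 0 ? 1.0 : (double) reached / total;
    }

    /** Compilation cost is scaled by p because unreached blocks are not compiled. */
    static boolean shouldRecompile(double tCur, double tJ, double cJ, double p) {
        return tJ + p * cJ < tCur;
    }

    public static void main(String[] args) {
        // Made-up method: three blocks of sizes 10, 30, 10; the middle block was never reached.
        double p = reachedFraction(new int[] {10, 30, 10}, new long[] {5, 0, 5});
        System.out.println(p);
        System.out.println(shouldRecompile(100, 60, 50, 1.0)); // eager: 60 + 50 is not below 100
        System.out.println(shouldRecompile(100, 60, 50, p));   // deferred: 60 + 0.4 * 50 is below 100
    }
}
```

In this invented case the same method is not worth recompiling eagerly but becomes worth recompiling once the never-reached block is deferred, which is how deferred compilation feeds back into the recompilation decision.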

Offline Profile experiments

Collect "perfect" profile data offline
Mark any block never reached as "uncommon"
Defer compilation of "uncommon" blocks
Four configurations
  Ideal: deferred compilation trap keeps no state live
  Ideal-OSR: deferred compilation trap is valid OSR point
  Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is valid OSR point
  Eager: (default) no deferred compilation





Recompilation Activities (First Run)

               With OSR                  Without OSR
  Benchmark    O0    O1    O2   total    O0    O1    O2   total
  compress     17     7     2      26    13     9     6      28
  jess         49    20     1      70    39    17     4      60
  db            8     4     2      14     8     4     5      17
  javac       171    19     2     192   168    16     3     187
  mpegaudio    68    32     7     107    66    29     6     101
  mtrt         57    14     3      74    61    11     3      75
  jack         59    25     8      92    54    26     5      85
  total       429   121    25     575   409   112    32     553

Summary of Study (1)

Engineering Complexity
  How to minimize disruption to VM code base?
    • Compiler-independent specialized source code to manage transition transparently
  How to constrain optimizations?
    • Model OSR Points like CALLS in standard transformations

Policies for applying OSR
  How to make rational decisions for applying OSR?
    • Simple modifications to cost-benefit analytic model

Summary of Study (2)

Effectiveness (for an implementation of online profile-directed deferred compilation)

  How does OSR improve/constrain dataflow optimizations?
    • small ideal benefit from dataflow merges (0.5 - 2.2%)
    • negligible benefit when constraining optimization for potential invalidation
    • negligible benefit for just CHA-based inlining
        patch points + splitting + pre-existence good enough

  How effective are online OSR-based optimizations?
    • average performance improvement of 2.6% on first run SPECjvm98 s=100
    • individual benchmarks range from +8% to -4%
    • negligible impact on steady state performance (best of 10 iterations)
    • adaptive recompilation model relatively insensitive, compiles 4% more methods

Experimental Details

SPECjvm98, size 100

Jikes RVM 2.1.1
  FastAdaptiveSemispace configuration
  one virtual processor
  500MB heap
  separate VM instance for each benchmark

IBM RS/6000 Model F80
  six 500 MHz PowerPC 630's
  AIX 4.3.3
  4 GB memory

Specialized Code Generation

Generate specialized m2 that sets up new stack frame and continues execution, preserving semantics.

Express the transition to new stack frame in source code (bytecode)

[Figure: step 2, generating m2 from the compiler-independent state]

Deferred Compilation

Don't compile "infrequent" blocks

[Figure: two versions of the same code fragment. Eager: "x = 1; x = foo(); return x;". Deferred: "x = 1; trap/OSR; return x;"]


Experimental Results

Online profile-directed deferred compilation

Evaluation
  How much do OSR points improve optimization by eliminating merges?
  How much do OSR points constrain optimization?
  How effective is online profile-directed deferred compilation?



Online Experiments

Before optimizing, collect intraprocedural edge counters
Defer compilation at blocks that profile data says not reached
If deferred block reached
  Trigger OSR and deoptimize
  Invalidate compiled code
Modify analytic recompilation model
  Promotion from baseline to optimized
  Compile-time cost estimate modified according to profile data
