Auto-Tuning the Java Virtual Machine
Milinda Fernando, Tharindu Rusira, Chalitha Perera, Chamara Phillips
Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka
Supervisors: Prof. Sanath Jayasena (UoM), Prof. Saman Amarasinghe (MIT)

Good morning, everyone.
We present our final year project today. During the presentation we will discuss the problem we are addressing, our approach, our experimental results, their implications, and our contributions.

JIT Compilation in HotSpot
[Diagram: *.java source files are compiled to *.class bytecode; the JVM translates the bytecode into machine code.]
A minimal HotSpot workflow. The actual HotSpot architecture is much more complicated and should be studied thoroughly.
Other JVMs might implement this in different ways.
The JVM is a complex software system that implements the components necessary for code execution: the class loader, garbage collector, JIT compiler, native method interface, and native method libraries.
The Java Conundrum
What's happening to Gosling's paradigm-shifting invention? Is it losing the battle?
-2.13% from 2013 to 2014; why?

Java is popular, there's no doubt about that. But there are many emerging technologies and languages. This is a very general comparison, but we can still see Java losing popularity in recent years. Why? There can be multiple reasons. The one that interests us is execution speed: how fast the JVM can compile bytecode into machine code. This is a critical factor because, for many scientific and industrial applications, resource utilization and speed are major concerns.
What if...
How many JVMs are there? Statistics as of May 7, 2010: despite the lack of comprehensive information on the actual number of Java virtual machines (JVMs) currently available, estimates count at least 55 JVMs, 23 of them proprietary implementations and 32 free/open-source implementations. Sun Microsystems, developer of the Java platform, reports on their website that JVM implementations have reached 4.5 billion devices.
Can we make Java run faster?

Program runtime is one major metric we use to measure system quality. But it is not the only one; there are other measures too, e.g. power consumption is a major concern on embedded devices and mobile platforms.

You can see the impact we can make. Some stats show the popularity of Java and the impact optimizing the JVM could have: more than 9 million developers and more than 4.5 billion devices running Java. We would like to make Java run faster, but how can we achieve that?
Does the JVM Care About Performance?
Yes, it does, but the approach matters. Ergonomics decides each option depending ONLY on the underlying architecture:
Selection of the JIT compiler
Initial heap sizes
Default garbage collector, selected based on the platform configuration

Currently the JVM does optimization using the concept of ergonomics. Oracle's HotSpot JVM is an ergonomic JVM. It decides the compiler, server or client, in a very simple way: if the machine is 32-bit it uses the client compiler; if 64-bit, the server compiler. It includes dynamic compilers that adaptively compile Java bytecode into optimized machine instructions, and it efficiently manages the Java heap using garbage collectors optimized for both low pause times and high throughput.

Based on these different approaches to compilers, heap configuration, and garbage collectors, we identified conflicts between combinations of options, which are considered illegal JVM settings. Filtering these out reduced the time needed for tuning and gave better results, which will be presented shortly.

Our Solution: Auto-Tuning
java program_name -arg0 -arg1 -arg2 ...
These JVM flags/parameters change the runtime behavior of the system. Can we decide which value fits best for each parameter? This idea formulates an optimization problem over a configuration space.

There are two main approaches to optimization: code-invariant optimization and parameter optimization. We adopt the latter. This is how we usually initialize a JVM instance, but do you know which argument values fit best for a given arbitrary program? In this study, we focus on this issue. These are runtime options for controlling system behavior in areas such as JIT compilation, garbage collection, heap management, thread management, and memory management.
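Turning a candidate flag configuration into a concrete java command line is the first step of any tuning run. A minimal sketch (build_command is our own helper, not part of any tuner; UseParallelGC and MaxHeapSize are real HotSpot flags, while the program name is a placeholder):

```python
def build_command(program, config):
    # Boolean flags become -XX:+Flag / -XX:-Flag; numeric flags become -XX:Flag=value.
    cmd = ["java"]
    for flag, value in sorted(config.items()):
        if isinstance(value, bool):
            cmd.append("-XX:{}{}".format("+" if value else "-", flag))
        else:
            cmd.append("-XX:{}={}".format(flag, value))
    cmd.append(program)
    return cmd

print(build_command("MyProgram", {"UseParallelGC": True, "MaxHeapSize": 2147483648}))
# -> ['java', '-XX:MaxHeapSize=2147483648', '-XX:+UseParallelGC', 'MyProgram']
```

The tuner then only has to search over the dictionary of flag values; the command-line encoding stays fixed.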
Why Auto-Tuning?
More than 600 configurable flags/parameters.
If we assume these are all boolean flags, the cardinality of the configuration space is at least 2^600.
To make things worse, most of the parameters take values from a finite domain whose size is far larger than 2.

We said we use auto-tuning; here is why. Given a program, can you pick the parameter values that would yield the best performance of that program on a given architecture?

Challenges in Auto-Tuning
Size of the valid configuration space
Landscape of the configuration space
Configuration representations are very domain-specific and can vary over a wide range.

Auto-tuning poses a few questions: the representation of the configuration space and the shape of the configuration space. These search spaces can be of arbitrary size and shape.
Prior Work on Auto-Tuning
PetaBricks: uses algorithmic choices at program compilation (a C language extension).
Java GC tuning: optimizes overall garbage collection time.
Java JIT tuning: considers only a selected subset of JIT flags.

OpenTuner [1]: an extensible framework for auto-tuning.
[1] J. Ansel, S. Kamil, K. Veeramachaneni, U.-M. O'Reilly and S. Amarasinghe, "OpenTuner: An Extensible Framework for Program Autotuning," MIT CSAIL Technical Report MIT-CSAIL-TR-2013-026, November 1, 2013.
Multiple search techniques: evolutionary algorithms allow it to reach optima aggressively; it works best with massive search spaces and manages computational complexity well.

Remember to mention that OpenTuner is an extension (a decoupling) of the PetaBricks autotuner from its language specification.
OpenTuner
The problems we identified by analyzing existing program auto-tuners:
Choosing the right configuration representation
Size of the valid configuration space - computational efficiency
Landscape of the configuration space - the performance of particular algorithms depends heavily on this factor

OpenTuner provides a framework for building auto-tuners. Why we decided to use OpenTuner:
Extensible across different domains.
If a particular algorithm is found to perform well in a particular region of the configuration space, more jobs from that region are given to that algorithm.
Has been successfully tested, with promising results, in many different domains:
Halide (http://en.wikipedia.org/wiki/Halide_%28programming_language%29)
PetaBricks
GNU GCC/G++ compiler (http://en.wikipedia.org/wiki/GNU_Compiler_Collection)
High-performance LINPACK benchmark (http://en.wikipedia.org/wiki/LINPACK_benchmarks), which measures the floating-point computing power of a machine
Memory-bound stencil computations (http://en.wikipedia.org/wiki/Stencil_code), an application from the numerical analysis domain

To write our own auto-tuner, we implement a configuration manipulator and a measurement function. Wind up and invite Chamara to present the next part.
Methodology
We use the HotSpot Virtual Machine, a widely used implementation of the JVM, and OpenTuner as the framework to build the HotSpot auto-tuner.

HotSpot is the most widely used implementation of the JVM and gets its name from its approach to JIT compilation (compile only the hot spots). There are two maintained versions of HotSpot, one by Oracle and the other by the OpenJDK group. We have used the OpenJDK HotSpot JVM for our experiments.
Methodology - JVM Flags
There are more than 600 flags to select from, covering memory options, garbage collection options, compiling, interpreting and other options. They come in two types: boolean flags and parameter flags (we limit the domain of each parameter flag by defining min and max values based on its default).

The next slide shows the hierarchy of the flags that we have built, which is very important to the whole Java community. These options let us run the bytecode on any architecture with different compiling and interpreting options. As the Java bytecode execution engine, the JVM provides Java runtime facilities, such as thread and object synchronization, on a variety of operating systems and architectures. The best values for these options depend on the system, the machine architecture and the program.

We can get the default values of these flags by setting the JVM flag -XX:+PrintFlagsFinal. For parameter flags we used the range [default - default/2, default + default/2] for tuning.
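Both steps, reading the defaults and deriving the tuning range, can be sketched as follows. This is a simplified sketch: the sample lines imitate typical -XX:+PrintFlagsFinal output, but the exact column layout varies across JVM versions, and parse_flags_final and tuning_range are our own illustrative helpers.

```python
def parse_flags_final(text):
    # Each line of `java -XX:+PrintFlagsFinal -version` looks roughly like:
    #   uintx MaxHeapSize := 1073741824 {product}
    # (":=" marks a value changed from the default, "=" the default itself).
    flags = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] in ("=", ":="):
            flag_type, name, value = parts[0], parts[1], parts[3]
            flags[name] = (flag_type, value)
    return flags

def tuning_range(default):
    # The range used for parameter flags: [default - default/2, default + default/2].
    return (default - default // 2, default + default // 2)

sample = (
    " uintx MaxHeapSize := 1073741824 {product}\n"
    "  bool UseParallelGC = true {product}\n"
    "  intx CompileThreshold = 10000 {pd product}\n"
)
flags = parse_flags_final(sample)
```

For example, a CompileThreshold default of 10000 yields the tuning range (5000, 15000).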
Methodology - Configuration Manipulator
Used to define the configuration space:

def manipulator(self):
    m = manipulator.ConfigurationManipulator()
    for flag_set in self.bool_flags:
        for flag in flag_set:
            m.add_parameter(manipulator.EnumParameter(flag, ['on', 'off']))
    for flag_set in self.param_flags:
        for flag in flag_set:
            value = flag_set[flag]
            if value['min'] >= value['max']:
                # Bounds arrived reversed: pass them in (low, high) order.
                m.add_parameter(manipulator.IntegerParameter(value['flagname'], value['max'], value['min']))
            else:
                m.add_parameter(manipulator.IntegerParameter(value['flagname'], value['min'], value['max']))
    return m

The two important components we need to define to implement the auto-tuner are the configuration manipulator and the run function. The configuration manipulator defines the configuration search space for auto-tuning.
Methodology - Run Function
Measures the quality (fitness) of a given configuration. In our experiments we used two benchmark suites:
SPECjvm2008: operations per minute (ops/m)
DaCapo: execution time in ms

The run function is where we specify how to run the programs using a selected configuration, and its output should be a measure of the quality of that configuration. In SPECjvm2008 the quality measure is ops/min (higher values mean better quality); in DaCapo it is execution time in ms (lower values are better).

Challenge - Tuning Time
When we feed all the flags to OpenTuner, the tuning times of programs are very high. The problem? The tuner gets stuck searching invalid regions of the search space. These invalid regions are due to JVM flag dependencies.
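The run function described above can be sketched as follows; returning an infinite fitness when the JVM fails to start also gives the tuner a crude way to skip invalid configurations. This is a simplification: OpenTuner's MeasurementInterface.run actually receives a desired-result object and returns a Result, but the core idea is the same.

```python
import subprocess
import sys
import time

def run(cmd):
    # Fitness of one configuration: wall-clock time of one benchmark run.
    # For DaCapo lower is better; for SPECjvm2008 we would instead parse the
    # reported ops/m from the output and negate it so lower is always better.
    start = time.time()
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start
    if proc.returncode != 0:
        # The JVM refused the flag combination (or the benchmark crashed):
        # report the worst possible fitness so the tuner avoids this region.
        return float("inf")
    return elapsed
```

In practice cmd would be the java command line built from the candidate configuration.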
Methodology - Can We Reduce Tuning Time?
JVM flags have a huge number of inter-dependencies, so the configuration search space contains areas of illegal flag configurations. How do we omit these illegal regions during the tuning process?

One of the main issues with treating all flags as a flat structure is that there are many dependencies between JVM flags, and many illegal configurations that won't even initialize the JVM. For example, once you have selected the parallel GC collector you cannot also use the concurrent or serial collectors.
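The dependency above can be expressed as a simple legality check that a tuner can apply before evaluating a configuration. A sketch, assuming HotSpot's rule that at most one collector-selection flag may be enabled at a time (the flag names below are the real selector flags; the full flag hierarchy described next encodes many more such rules):

```python
GC_SELECTORS = ("UseSerialGC", "UseParallelGC", "UseConcMarkSweepGC", "UseG1GC")

def is_legal(config):
    # A configuration enabling two collectors at once fails to start the JVM,
    # so we reject it before ever launching a benchmark run.
    enabled = [flag for flag in GC_SELECTORS if config.get(flag)]
    return len(enabled) <= 1
```

Filtering candidates this way keeps the tuner out of regions where every sample would be wasted on a JVM that refuses to start.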
Methodology - JVM Flag Hierarchy
[Diagram: JVM flags grouped into ByteCode, CodeCache, Compilation, DeOptimization, GC, Interpreter, Memory, Priorities, Temporary; if TieredCompilation is true, the C1 compiler is added on top of C2; GC splits into Common, CMS, G1 and the throughput collectors (ParallelOld, Parallel, Serial).]

Reasons for defining a flag hierarchy:
To reduce the possibility of OpenTuner searching invalid flag configurations
To reduce the computational time and resources spent on unsuccessful/useless flag combinations
Importance of garbage collectors: garbage collectors make assumptions about the way applications use objects, and these are reflected in tunable parameters that can be adjusted for improved performance without sacrificing the power of the abstraction.

When does the choice of a garbage collector matter? For some applications, the answer is never. That is, the application can perform well in the presence of garbage collection with pauses of modest frequency and duration. For them we can use the default serial garbage collector. However, this is not the case for a large class of applications, particularly those with large amounts of data (multiple gigabytes), many threads and high transaction rates. Ergonomics selects the parallel GC when running on such server-class machines.

The G1 collector is a server-style garbage collector, targeted at multi-processor machines with large memories. It meets garbage collection (GC) pause-time goals with high probability while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with the application threads. This prevents interruptions proportional to heap or live-data size.
To reduce these illegal configurations we developed a hierarchical structure for JVM flags, concentrated mostly on GC since it is the main source of illegal configurations. We initially asked the OpenJDK group whether a flag dependency structure for the JVM exists (the reply was no). This is a categorization of flags we derived by examining the JVM source code. It also helps us evaluate how different areas of the JVM affect the performance of a program.
Methodology - Effect of the JVM Flag Hierarchy (DaCapo fop benchmark)

This shows the effect of the flag hierarchy on the tuning process. By reducing the invalid configurations we can significantly improve the tuning time and also gain a higher performance improvement.

Methodology - Effect of the JVM Flag Hierarchy (DaCapo eclipse benchmark)
Methodology - Which Areas of the JVM Are Most Critical for Performance?
To answer this, we tuned every benchmark program in four categories:
GC - using garbage collection flags only
JIT - using just-in-time compilation flags only
GC & JIT - using GC and JIT flags together
ALL - using all the flags of the JVM

In our experiments, we examine how different areas of the JVM contribute to performance improvements, considering improvements using only GC flags, only JIT flags, GC+JIT flags, and all JVM flags.
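Each category simply restricts which flags the tuner may touch. A sketch with hypothetical, heavily abbreviated flag lists (the names are real HotSpot flags, but the real experiments used the full flag hierarchy):

```python
GC_FLAGS = ["UseParallelGC", "ParallelGCThreads", "MaxGCPauseMillis"]
JIT_FLAGS = ["TieredCompilation", "CompileThreshold", "InlineSmallCode"]
OTHER_FLAGS = ["MaxHeapSize", "ThreadStackSize"]

# The four experimental categories: each exposes a different flag subset.
CATEGORIES = {
    "GC": GC_FLAGS,
    "JIT": JIT_FLAGS,
    "GC & JIT": GC_FLAGS + JIT_FLAGS,
    "ALL": GC_FLAGS + JIT_FLAGS + OTHER_FLAGS,
}

def flags_for(category):
    # Select which subset of flags is exposed to the tuner in one experiment.
    return CATEGORIES[category]
```

The search space for "ALL" strictly contains the other three, which is why it can never do worse in principle, only take longer to explore.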
Experimental Setup
We tested the HotSpot tuner on the following architectures.
Architecture 1 (main architecture): Intel Core i7 CPU @ 3.40 GHz (4 cores), RAM: 16 GB, OS: Ubuntu 12.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Architecture 2: Intel Core i5-2400 CPU @ 3.10 GHz (4 cores), RAM: 4 GB, OS: Ubuntu 14.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Architecture 3: Intel Xeon E7-4820 v2 CPU @ 2.00 GHz (32 cores), RAM: 64 GB DDR3, OS: Ubuntu 14.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Results - SPECjvm2008 startup.* (Startup Performance Improvement % of the JVM)

We selected two benchmark suites to test the JVM tuner: 1) the SPECjvm2008 startup benchmarks, and 2) the DaCapo benchmark suite (which contains many commercial applications such as Tomcat and Eclipse).

Why startup benchmarks? To measure the startup performance of the JVM. The results show that we have tuned the JVM to increase its startup performance. But startup performance is not enough to describe the steady-state behaviour of the JVM; that is the main reason we also tried the DaCapo benchmarks (an idea from one of the reviewers of the CGO paper).

The SPEC startup programs measure the startup performance of the JVM. Looking at the performance improvements under the four categories presented above, we can conclude that using all the flags together achieves much higher performance improvements.
Results - DaCapo Benchmark Suite (Steady-State Performance Improvement % of the JVM)

Explain the negative performance improvements. Moving on to the steady-state performance improvement of the JVM, we can observe the same pattern: tuning JIT only, and GC and JIT together, gives good performance improvements, but they do not surpass the overall gain from tuning the JVM as a whole.

Tuning Results on a Different Architecture (DaCapo suite performance improvement on Architecture 2)
Discussion - eclipse DaCapo Benchmark Tuning Behaviour

With all the flags, tuning tends to converge faster than with partial tuning. This might be because all the flags give the tuning program a much higher degree of freedom. We wanted to see how the tuning procedure behaves for each flag category that we tuned. Tuning GC only gives very low performance improvements; using JIT flags gives very good ones. Generally we can conclude that using all the flags together gives a much higher performance improvement in less time (a high convergence rate due to the higher degree of freedom given to the tuning procedure).
How Does the JVM React to Auto-Tuning? (JVM Internals)
What happens inside the JVM as a result of auto-tuning? To find out we used the jstat profiler. Using jstat data we analyze the memory usage of the JVM, its compilation behaviour, and its class loading behaviour.
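Sampling jstat periodically and differencing counters such as the number of compiled methods or loaded classes gives the compilation and class-loading rates discussed on the next slides. A sketch of reading one jstat sample (the sample text below imitates a typical `jstat -compiler <pid>` output; column names can differ between JDK versions, and parse_jstat is our own helper):

```python
def parse_jstat(output):
    # jstat prints a header row of column names followed by a row of values.
    lines = output.strip().splitlines()
    header = lines[0].split()
    values = lines[1].split()
    # zip pairs columns with values; trailing tokens (e.g. a failed method
    # name containing spaces) beyond the header width are ignored.
    return dict(zip(header, values))

sample = (
    "Compiled Failed Invalid   Time   FailedType FailedMethod\n"
    "    5487      1       0    12.41          1 some/Class method\n"
)
stats = parse_jstat(sample)
```

Repeating this once per second and plotting the difference in the Compiled column over time yields a compilation-rate curve like the ones shown for pmd and h2.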
JVM Internals - pmd Benchmark (Compiler Rate)
What changes inside the JVM when we tune it? We selected pmd (a benchmark that showed a large performance improvement) and h2 (a benchmark that showed a lower performance improvement) and compared what changed inside the JVM as a result of tuning.

pmd: the compilation rate almost doubled (eager compilation achieved by tuning). The class loading rate decreased, but this decrement is negligible compared to the large improvement in the compilation rate.

JVM Internals - pmd Benchmark (Class Loading Rate)

JVM Internals - h2 Benchmark (Compiler Rate)
h2: the compilation rate improved, but not by much, and the class loading rate improved only slightly. These are the main reasons this benchmark shows a smaller performance improvement.

JVM Internals - h2 Benchmark (Class Loading Rate)
The Best Configuration
There is no single best configuration or optimum performance for all of these. Our interest is a better configuration: one that gives better performance than the current best. "Betterness" depends on the system, program features, and local optima.

There is no best configuration for a given program, nor a universal setup for every program. The performance of the tuned configuration depends on the system, the program's features, and the local optima in the subspace the tuner was exploring. "System" means the hardware architecture the program runs on, the operating system used to execute it, and the other programs running on the machine. We made sure that no other programs were running during testing except the tuner.

Program features are the hot spots in the program, such as methods and loops. As an example, there is a threshold for each code segment to become a hot spot: if a method runs more than that threshold within a unit of time, it is considered a hot spot. Depending on the nature of the program we can change this threshold to find the optimal value.
There is no common magic configuration that tunes all the benchmarks; describe what the best configuration depends on.

Project Achievements
GOLD prize in the undergraduate category of the ACM Student Research Competition at CGO 2015, San Francisco, USA.
Invited talk on the Java Auto-Tuning Tool (JATT) at the OpenTuner workshop at CGO 2015, San Francisco, USA.
Paper accepted at the International Workshop on Automatic Performance Tuning (iWAPT 2015), to be held in Hyderabad, India in May 2015 (co-hosted with IPDPS).

Good reviews and feedback from representatives of Google, Facebook, UC Berkeley and Huawei at CGO. We were given the chance to publish our work on the OpenTuner site.
Ongoing & Future Work
Finalizing the Java Auto-Tuning Tool (JATT)
Tuning widely used programs whose performance is critical (e.g., Siddhi CEP, the X10 compiler, ...)
Online tuning

What are the limitations of offline program tuning? The tuning process can take a long time, and we have to rerun it whenever the program, microarchitecture or execution environment changes.

Online Tuning
Two basic approaches:
Using a model - garbage collector algorithm selection
Sibling rivalry - discussed in the next slide
Acknowledgements
Prof. Sanath Jayasena and Prof. Saman Amarasinghe, who supervised the project
The OpenTuner group and Dr. Jason Ansel for their reviews of our project
The CGO 2015 reviewing committee
CSE staff

Q & A
Your questions are welcome.
Thank you!