Auto-Tuning the Java Virtual Machine
Milinda Fernando, Tharindu Rusira, Chalitha Perera, Chamara Phillips
Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka
Supervisors: Prof. Sanath Jayasena (UoM), Prof. Saman Amarasinghe (MIT)

Good morning, everyone.
We present our final year project today. During the presentation we will discuss the problem we are addressing, our approach, our experimental results, their implications, and our contributions.

JIT Compilation in HotSpot
[Diagram: *.java source files are compiled to *.class bytecode; the JVM translates the bytecode into machine code.]
A minimal HotSpot workflow. The actual HotSpot architecture is much more complicated and should be studied thoroughly.
Other JVMs might implement this in different ways.
The JVM is a complex software system that implements the components necessary for code execution: the class loader, garbage collector, JIT compiler, native method interface, and native method libraries.
The Java Conundrum
What's happening to Gosling's paradigm-shifting invention? Is it losing the battle?
-2.13% from 2013 to 2014; why?

Java is popular, there's no doubt about that. But there are many emerging technologies and languages. This is a very general comparison, but we can still see Java losing popularity in recent years. Why? There can be multiple reasons. The one that interests us is execution speed: how fast the JVM can compile bytecode into machine code. This is a critical factor because, for many scientific and industrial applications, resource utilization and speed are major concerns.
What if...
How many JVMs are there? Statistics as of May 7, 2010: despite the lack of comprehensive information on the actual number of Java virtual machines (JVMs) currently available, estimates count at least 55 JVMs, 23 of them proprietary implementations and 32 free/open-source implementations. Sun Microsystems, developer of the Java platform, reports on their website that JVM implementations have reached 4.5 billion devices.
Can we make Java run faster?

Program runtime is one major metric we use to measure system quality. But it is not the only one; there are other measures too, e.g. power consumption is a major concern on embedded devices and mobile platforms.

You can see the impact we can make. Some stats show the popularity of Java and the impact optimizing the JVM could have: more than 9 million developers and more than 4.5 billion devices running Java. We would like to make Java run faster, but how can we achieve that?
Does the JVM Care About Performance?
Yes, it does, but the approach matters. Ergonomics decides each option depending ONLY on the underlying architecture:
Selection of the JIT compiler
Initial heap sizes
Default garbage collector, selected based on the platform configuration

Currently the JVM does optimization using the concept of ergonomics. Oracle's HotSpot JVM is an ergonomic JVM. It decides the compiler, server or client, in a very simple way: if the machine is 32-bit it uses the client compiler; if 64-bit, the server compiler. It includes dynamic compilers that adaptively compile Java bytecode into optimized machine instructions, and it efficiently manages the Java heap using garbage collectors optimized for both low pause times and high throughput.

Based on these different approaches to compilers, heap configuration, and garbage collectors, we identified conflicts between combinations of options, which are considered illegal JVM settings. Filtering these out reduced the time needed for tuning and gave better results, which will be presented shortly.

Our Solution: Auto-Tuning
java program_name -arg0 -arg1 -arg2 ...
These JVM flags/parameters change the runtime behavior of the system. Can we decide which value fits best for each parameter? This idea formulates an optimization problem over a configuration space.

There are two main approaches to optimization: code-invariant optimization and parameter optimization. We adopt the latter. This is how we usually initialize a JVM instance, but do you know which argument values fit best for a given arbitrary program? In this study, we focus on this issue. These are runtime options for controlling system behavior in areas such as JIT compilation, garbage collection, heap management, thread management, and memory management.
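Turning a candidate flag configuration into a concrete java command line is the first step of any tuning run. A minimal sketch (build_command is our own helper, not part of any tuner; UseParallelGC and MaxHeapSize are real HotSpot flags, while the program name is a placeholder):

```python
def build_command(program, config):
    # Boolean flags become -XX:+Flag / -XX:-Flag; numeric flags become -XX:Flag=value.
    cmd = ["java"]
    for flag, value in sorted(config.items()):
        if isinstance(value, bool):
            cmd.append("-XX:{}{}".format("+" if value else "-", flag))
        else:
            cmd.append("-XX:{}={}".format(flag, value))
    cmd.append(program)
    return cmd

print(build_command("MyProgram", {"UseParallelGC": True, "MaxHeapSize": 2147483648}))
# -> ['java', '-XX:MaxHeapSize=2147483648', '-XX:+UseParallelGC', 'MyProgram']
```

The tuner then only has to search over the dictionary of flag values; the command-line encoding stays fixed.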
Why Auto-Tuning?
More than 600 configurable flags/parameters.
If we assume these are all boolean flags, the cardinality of the configuration space is at least 2^600.
To make things worse, most of the parameters take values from a finite domain whose size is far larger than 2.

We said we use auto-tuning; here is why. Given a program, can you pick the parameter values that would yield the best performance of that program on a given architecture?

Challenges in Auto-Tuning
Size of the valid configuration space
Landscape of the configuration space
Configuration representations are very domain-specific and can vary over a wide range.

Auto-tuning poses a few questions: the representation of the configuration space and the shape of the configuration space. These search spaces can be of arbitrary size and shape.
Prior Work on Auto-Tuning
PetaBricks: uses algorithmic choices at program compilation (a C language extension).
Java GC tuning: optimizes overall garbage collection time.
Java JIT tuning: considers only a selected subset of JIT flags.

OpenTuner [1]: an extensible framework for auto-tuning.
[1] J. Ansel, S. Kamil, K. Veeramachaneni, U.-M. O'Reilly and S. Amarasinghe, "OpenTuner: An Extensible Framework for Program Autotuning," MIT CSAIL Technical Report MIT-CSAIL-TR-2013-026, November 1, 2013.
Multiple search techniques: evolutionary algorithms allow it to reach optima aggressively; it works best with massive search spaces and manages computational complexity well.

Remember to mention that OpenTuner is an extension (a decoupling) of the PetaBricks autotuner from its language specification.
OpenTuner
The problems we identified by analyzing existing program auto-tuners:
Choosing the right configuration representation
Size of the valid configuration space - computational efficiency
Landscape of the configuration space - the performance of particular algorithms depends heavily on this factor

OpenTuner provides a framework for building auto-tuners. Why we decided to use OpenTuner:
Extensible across different domains.
If a particular algorithm is found to perform well in a particular region of the configuration space, more jobs from that region are given to that algorithm.
Has been successfully tested, with promising results, in many different domains:
Halide (http://en.wikipedia.org/wiki/Halide_%28programming_language%29)
PetaBricks
GNU GCC/G++ compiler (http://en.wikipedia.org/wiki/GNU_Compiler_Collection)
High-performance LINPACK benchmark (http://en.wikipedia.org/wiki/LINPACK_benchmarks), which measures the floating-point computing power of a machine
Memory-bound stencil computations (http://en.wikipedia.org/wiki/Stencil_code), an application from the numerical analysis domain

To write our own auto-tuner, we implement a configuration manipulator and a measurement function. Wind up and invite Chamara to present the next part.
Methodology
We use the HotSpot Virtual Machine, a widely used implementation of the JVM, and OpenTuner as the framework to build the HotSpot auto-tuner.

HotSpot is the most widely used implementation of the JVM and gets its name from its approach to JIT compilation (compile only the hot spots). There are two maintained versions of HotSpot, one by Oracle and the other by the OpenJDK group. We have used the OpenJDK HotSpot JVM for our experiments.
Methodology - JVM Flags
There are more than 600 flags to select from, covering memory options, garbage collection options, compiling, interpreting and other options. They come in two types: boolean flags and parameter flags (we limit the domain of each parameter flag by defining min and max values based on its default).

The next slide shows the hierarchy of the flags that we have built, which is very important to the whole Java community. These options let us run the bytecode on any architecture with different compiling and interpreting options. As the Java bytecode execution engine, the JVM provides Java runtime facilities, such as thread and object synchronization, on a variety of operating systems and architectures. The best values for these options depend on the system, the machine architecture and the program.

We can get the default values of these flags by setting the JVM flag -XX:+PrintFlagsFinal. For parameter flags we used the range [default - default/2, default + default/2] for tuning.
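Both steps, reading the defaults and deriving the tuning range, can be sketched as follows. This is a simplified sketch: the sample lines imitate typical -XX:+PrintFlagsFinal output, but the exact column layout varies across JVM versions, and parse_flags_final and tuning_range are our own illustrative helpers.

```python
def parse_flags_final(text):
    # Each line of `java -XX:+PrintFlagsFinal -version` looks roughly like:
    #   uintx MaxHeapSize := 1073741824 {product}
    # (":=" marks a value changed from the default, "=" the default itself).
    flags = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] in ("=", ":="):
            flag_type, name, value = parts[0], parts[1], parts[3]
            flags[name] = (flag_type, value)
    return flags

def tuning_range(default):
    # The range used for parameter flags: [default - default/2, default + default/2].
    return (default - default // 2, default + default // 2)

sample = (
    " uintx MaxHeapSize := 1073741824 {product}\n"
    "  bool UseParallelGC = true {product}\n"
    "  intx CompileThreshold = 10000 {pd product}\n"
)
flags = parse_flags_final(sample)
```

For example, a CompileThreshold default of 10000 yields the tuning range (5000, 15000).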
Methodology - Configuration Manipulator
Used to define the configuration space:

def manipulator(self):
    m = manipulator.ConfigurationManipulator()
    for flag_set in self.bool_flags:
        for flag in flag_set:
            m.add_parameter(manipulator.EnumParameter(flag, ['on', 'off']))
    for flag_set in self.param_flags:
        for flag in flag_set:
            value = flag_set[flag]
            if value['min'] >= value['max']:
                # Bounds arrived reversed: pass them in (low, high) order.
                m.add_parameter(manipulator.IntegerParameter(value['flagname'], value['max'], value['min']))
            else:
                m.add_parameter(manipulator.IntegerParameter(value['flagname'], value['min'], value['max']))
    return m

The two important components we need to define to implement the auto-tuner are the configuration manipulator and the run function. The configuration manipulator defines the configuration search space for auto-tuning.
Methodology - Run Function
Measures the quality (fitness) of a given configuration. In our experiments we used two benchmark suites:
SPECjvm2008: operations per minute (ops/m)
DaCapo: execution time in ms

The run function is where we specify how to run the programs using a selected configuration, and its output should be a measure of the quality of that configuration. In SPECjvm2008 the quality measure is ops/min (higher values mean better quality); in DaCapo it is execution time in ms (lower values are better).

Challenge - Tuning Time
When we feed all the flags to OpenTuner, the tuning times of programs are very high. The problem? The tuner gets stuck searching invalid regions of the search space. These invalid regions are due to JVM flag dependencies.
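The run function described above can be sketched as follows; returning an infinite fitness when the JVM fails to start also gives the tuner a crude way to skip invalid configurations. This is a simplification: OpenTuner's MeasurementInterface.run actually receives a desired-result object and returns a Result, but the core idea is the same.

```python
import subprocess
import sys
import time

def run(cmd):
    # Fitness of one configuration: wall-clock time of one benchmark run.
    # For DaCapo lower is better; for SPECjvm2008 we would instead parse the
    # reported ops/m from the output and negate it so lower is always better.
    start = time.time()
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start
    if proc.returncode != 0:
        # The JVM refused the flag combination (or the benchmark crashed):
        # report the worst possible fitness so the tuner avoids this region.
        return float("inf")
    return elapsed
```

In practice cmd would be the java command line built from the candidate configuration.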
Methodology - Can We Reduce Tuning Time?
JVM flags have a huge number of inter-dependencies, so the configuration search space contains areas of illegal flag configurations. How do we omit these illegal regions during the tuning process?

One of the main issues with treating all flags as a flat structure is that there are many dependencies between JVM flags, and many illegal configurations that won't even initialize the JVM. For example, once you have selected the parallel GC collector you cannot also use the concurrent or serial collectors.
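The dependency above can be expressed as a simple legality check that a tuner can apply before evaluating a configuration. A sketch, assuming HotSpot's rule that at most one collector-selection flag may be enabled at a time (the flag names below are the real selector flags; the full flag hierarchy described next encodes many more such rules):

```python
GC_SELECTORS = ("UseSerialGC", "UseParallelGC", "UseConcMarkSweepGC", "UseG1GC")

def is_legal(config):
    # A configuration enabling two collectors at once fails to start the JVM,
    # so we reject it before ever launching a benchmark run.
    enabled = [flag for flag in GC_SELECTORS if config.get(flag)]
    return len(enabled) <= 1
```

Filtering candidates this way keeps the tuner out of regions where every sample would be wasted on a JVM that refuses to start.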
Methodology - JVM Flag Hierarchy
[Diagram: JVM flags grouped into ByteCode, CodeCache, Compilation, DeOptimization, GC, Interpreter, Memory, Priorities, Temporary; if TieredCompilation is true, the C1 compiler is added on top of C2; GC splits into Common, CMS, G1 and the throughput collectors (ParallelOld, Parallel, Serial).]

Reasons for defining a flag hierarchy:
To reduce the possibility of OpenTuner searching invalid flag configurations
To reduce the computational time and resources spent on unsuccessful/useless flag combinations
Importance of garbage collectors: garbage collectors make assumptions about the way applications use objects, and these are reflected in tunable parameters that can be adjusted for improved performance without sacrificing the power of the abstraction.

When does the choice of a garbage collector matter? For some applications, the answer is never. That is, the application can perform well in the presence of garbage collection with pauses of modest frequency and duration. For them we can use the default serial garbage collector. However, this is not the case for a large class of applications, particularly those with large amounts of data (multiple gigabytes), many threads and high transaction rates. Ergonomics selects the parallel GC when running on such server-class machines.

The G1 collector is a server-style garbage collector, targeted at multi-processor machines with large memories. It meets garbage collection (GC) pause-time goals with high probability while achieving high throughput. Whole-heap operations, such as global marking, are performed concurrently with the application threads. This prevents interruptions proportional to heap or live-data size.
To reduce these illegal configurations we developed a hierarchical structure for JVM flags, concentrated mostly on GC since it is the main source of illegal configurations. We initially asked the OpenJDK group whether a flag dependency structure for the JVM exists (the reply was no). This is a categorization of flags we derived by examining the JVM source code. It also helps us evaluate how different areas of the JVM affect the performance of a program.
Methodology - Effect of the JVM Flag Hierarchy (DaCapo fop benchmark)

This shows the effect of the flag hierarchy on the tuning process. By reducing the invalid configurations we can significantly improve the tuning time and also gain a higher performance improvement.

Methodology - Effect of the JVM Flag Hierarchy (DaCapo eclipse benchmark)
Methodology - Which Areas of the JVM Are Most Critical for Performance?
To answer this, we tuned every benchmark program in four categories:
GC - using garbage collection flags only
JIT - using just-in-time compilation flags only
GC & JIT - using GC and JIT flags together
ALL - using all the flags of the JVM

In our experiments, we examine how different areas of the JVM contribute to performance improvements, considering improvements using only GC flags, only JIT flags, GC+JIT flags, and all JVM flags.
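Each category simply restricts which flags the tuner may touch. A sketch with hypothetical, heavily abbreviated flag lists (the names are real HotSpot flags, but the real experiments used the full flag hierarchy):

```python
GC_FLAGS = ["UseParallelGC", "ParallelGCThreads", "MaxGCPauseMillis"]
JIT_FLAGS = ["TieredCompilation", "CompileThreshold", "InlineSmallCode"]
OTHER_FLAGS = ["MaxHeapSize", "ThreadStackSize"]

# The four experimental categories: each exposes a different flag subset.
CATEGORIES = {
    "GC": GC_FLAGS,
    "JIT": JIT_FLAGS,
    "GC & JIT": GC_FLAGS + JIT_FLAGS,
    "ALL": GC_FLAGS + JIT_FLAGS + OTHER_FLAGS,
}

def flags_for(category):
    # Select which subset of flags is exposed to the tuner in one experiment.
    return CATEGORIES[category]
```

The search space for "ALL" strictly contains the other three, which is why it can never do worse in principle, only take longer to explore.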
Experimental Setup
We tested the HotSpot tuner on the following architectures.
Architecture 1 (main architecture): Intel Core i7 CPU @ 3.40 GHz (4 cores), RAM: 16 GB, OS: Ubuntu 12.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Architecture 2: Intel Core i5-2400 CPU @ 3.10 GHz (4 cores), RAM: 4 GB, OS: Ubuntu 14.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Architecture 3: Intel Xeon E7-4820 v2 CPU @ 2.00 GHz (32 cores), RAM: 64 GB DDR3, OS: Ubuntu 14.04 LTS, JVM: OpenJDK 7 (HotSpot JVM) update 55
Results - SPECjvm2008 startup.* (Startup Performance Improvement % of the JVM)

We selected two benchmark suites to test the JVM tuner: 1) the SPECjvm2008 startup benchmarks, and 2) the DaCapo benchmark suite (which contains many commercial applications such as Tomcat and Eclipse).

Why startup benchmarks? To measure the startup performance of the JVM. The results show that we have tuned the JVM to increase its startup performance. But startup performance is not enough to describe the steady-state behaviour of the JVM; that is the main reason we also tried the DaCapo benchmarks (an idea from one of the reviewers of the CGO paper).

The SPEC startup programs measure the startup performance of the JVM. Looking at the performance improvements under the four categories presented above, we can conclude that using all the flags together achieves much higher performance improvements.
Results - DaCapo Benchmark Suite (Steady-State Performance Improvement % of the JVM)

Explain the negative performance improvements. Moving on to the steady-state performance improvement of the JVM, we can observe the same pattern: tuning JIT only, and GC and JIT together, gives good performance improvements, but they do not surpass the overall gain from tuning the JVM as a whole.

Tuning Results on a Different Architecture (DaCapo suite performance improvement on Architecture 2)
Discussion - eclipse DaCapo Benchmark Tuning Behaviour

With all the flags, tuning tends to converge faster than with partial tuning. This might be because all the flags give the tuning program a much higher degree of freedom. We wanted to see how the tuning procedure behaves for each flag category that we tuned. Tuning GC only gives very low performance improvements; using JIT flags gives very good ones. Generally we can conclude that using all the flags together gives a much higher performance improvement in less time (a high convergence rate due to the higher degree of freedom given to the tuning procedure).
How Does the JVM React to Auto-Tuning? (JVM Internals)
What happens inside the JVM as a result of auto-tuning? To find out we used the jstat profiler. Using jstat data we analyze the memory usage of the JVM, its compilation behaviour, and its class loading behaviour.
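Sampling jstat periodically and differencing counters such as the number of compiled methods or loaded classes gives the compilation and class-loading rates discussed on the next slides. A sketch of reading one jstat sample (the sample text below imitates a typical `jstat -compiler <pid>` output; column names can differ between JDK versions, and parse_jstat is our own helper):

```python
def parse_jstat(output):
    # jstat prints a header row of column names followed by a row of values.
    lines = output.strip().splitlines()
    header = lines[0].split()
    values = lines[1].split()
    # zip pairs columns with values; trailing tokens (e.g. a failed method
    # name containing spaces) beyond the header width are ignored.
    return dict(zip(header, values))

sample = (
    "Compiled Failed Invalid   Time   FailedType FailedMethod\n"
    "    5487      1       0    12.41          1 some/Class method\n"
)
stats = parse_jstat(sample)
```

Repeating this once per second and plotting the difference in the Compiled column over time yields a compilation-rate curve like the ones shown for pmd and h2.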
JVM Internals - pmd Benchmark (Compiler Rate)
What changes inside the JVM when we tune it? We selected pmd (a benchmark that showed a large performance improvement) and h2 (a benchmark that showed a lower performance improvement) and compared what changed inside the JVM as a result of tuning.

pmd: the compilation rate almost doubled (eager compilation achieved by tuning). The class loading rate decreased, but this decrement is negligible compared to the large improvement in the compilation rate.

JVM Internals - pmd Benchmark (Class Loading Rate)

JVM Internals - h2 Benchmark (Compiler Rate)
h2: the compilation rate improved, but not by much, and the class loading rate improved only slightly. These are the main reasons this benchmark shows a smaller performance improvement.

JVM Internals - h2 Benchmark (Class Loading Rate)
The Best Configuration
There is no single best configuration or optimum performance for all of these. Our interest is a better configuration: one that gives better performance than the current best. "Betterness" depends on the system, program features, and local optima.

There is no best configuration for a given program, nor a universal setup for every program. The performance of the tuned configuration depends on the system, the program's features, and the local optima in the subspace the tuner was exploring. "System" means the hardware architecture the program runs on, the operating system used to execute it, and the other programs running on the machine. We made sure that no other programs were running during testing except the tuner.

Program features are the hot spots in the program, such as methods and loops. As an example, there is a threshold for each code segment to become a hot spot: if a method runs more than that threshold within a unit of time, it is considered a hot spot. Depending on the nature of the program we can change this threshold to find the optimal value.
There is no common magic configuration that tunes all the benchmarks; describe what the best configuration depends on.

Project Achievements
GOLD prize in the undergraduate category of the ACM Student Research Competition at CGO 2015, San Francisco, USA.
Invited talk on the Java Auto-Tuning Tool (JATT) at the OpenTuner workshop at CGO 2015, San Francisco, USA.
Paper accepted at the International Workshop on Automatic Performance Tuning (iWAPT 2015), to be held in Hyderabad, India in May 2015 (co-hosted with IPDPS).

Good reviews and feedback from representatives of Google, Facebook, UC Berkeley and Huawei at CGO. We were given the chance to publish our work on the OpenTuner site.
Ongoing & Future Work
Finalizing the Java Auto-Tuning Tool (JATT)
Tuning widely used programs whose performance is critical (e.g., Siddhi CEP, the X10 compiler, ...)
Online tuning

What are the limitations of offline program tuning? The tuning process can take a long time, and we have to rerun it whenever the program, microarchitecture or execution environment changes.

Online Tuning
Two basic approaches:
Using a model - garbage collector algorithm selection
Sibling rivalry - discussed in the next slide
Acknowledgements
Prof. Sanath Jayasena and Prof. Saman Amarasinghe, who supervised the project
The OpenTuner group and Dr. Jason Ansel for their reviews of our project
The CGO 2015 reviewing committee
CSE staff

Q & A
Your questions are welcome.
Thank you!