PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the OpenJDK Graal infrastructure, by Vasanth Venkatachalam

WHOLLY GRAAL: ENABLING GPU ACCELERATION OF JAVA USING THE OPENJDK GRAAL COMPILER.

VASANTH VENKATACHALAM

November 20, 2013

AGENDA

• Why should you be interested in GPU offload?

• Java execution model

• Requirements for Java GPU enablement

• Sumatra OpenJDK Project for Java GPU enablement

• Heterogeneous System Architecture (HSA)

• Introduction to Graal

• JDK8 based Graal prototype for Java GPU offload

• Future work

• Summary


WHY SHOULD YOU BE INTERESTED IN GPU OFFLOAD?

• In many instances, offloading the data-parallel parts of a program to a GPU will improve performance compared to running the entire program on the CPU
‒ A typical GPU offers more cores for the same density than a CPU
‒ AMD Radeon™ HD 7750 features 512 Stream Processors!
‒ In a data-parallel computation, in which the same computation is repeated over different data (and the results are not dependent on each other), the individual computations can be executed in parallel on multiple cores

• Example: Squaring array elements

for(int i = 0; i < in.length; i++) { out[i] = in[i] * in[i]; }

in[0]*in[0] on core0, in[1]*in[1] on core1, in[2]*in[2] on core2, …
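Because each iteration of the loop above is independent, it can be written against the Java 8 Stream API in a form that leaves the scheduling to the runtime. The following is a minimal sketch under stated assumptions: the class name `SquareDemo` is hypothetical, and `parallel()` here only spreads the work over CPU threads; it illustrates the data-parallel shape, not GPU offload itself.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class SquareDemo {
    // Squares each element. Every index is written exactly once and no
    // iteration reads another iteration's result, so order does not matter.
    public static int[] square(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length).parallel().forEach(i -> out[i] = in[i] * in[i]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {1, 2, 3, 4})));
    }
}
```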


JAVA EXECUTION MODEL

• Java is a managed runtime language: a language that runs on top of a virtual machine (VM)
‒ Other managed runtime languages include Ruby, JavaScript, Python, Scala

• Java source is compiled into an intermediate format called bytecode

• The Java virtual machine (JVM) executes the bytecodes using interpretation or just-in-time compilation
‒ Interpretation executes the bytecodes one instruction at a time
‒ Just-in-time (JIT) compilation compiles bytecodes into machine code at runtime and then executes that machine code

• Examples of JVMs include:
‒ Oracle HotSpot™ JVM
‒ IBM J9 VM

Java Source Code (Hello World)

    public static void main(String[] args) { System.out.println("Hello"); }

Java Source Compiler

Java Bytecodes (Hello World)

    0: getstatic #13
    3: ldc #19
    5: invokevirtual #21
    8: return

Java Virtual Machine

Machine code


REQUIREMENTS FOR JAVA GPU ENABLEMENT

• Java needs a programming model to express data-parallel workloads
‒ Java 8, for example, has the Stream API with support for lambda constructs

• The Java Virtual Machine (JVM) needs to generate code for the GPU as well as the CPU
‒ The JVM has to target multiple Instruction Set Architectures (ISAs)

• Ideally, the JVM can generate code in a standard intermediate language that can be translated into the native machine instructions of each GPU target
‒ This allows for portability
‒ Any update to the GPU ISA affects only the translation of this intermediate language into the GPU machine instructions


SUMATRA OPENJDK PROJECT FOR JAVA GPU ENABLEMENT

• Open source project intending to enable Java applications to take advantage of the GPU
‒ More or less transparently to the application

• Project started by Oracle and AMD shortly before JavaOne 2012

• We are developing a prototype of Sumatra using the Heterogeneous System Architecture (HSA) and the Graal OpenJDK project
‒ Backend for the Graal Just-In-Time (JIT) compiler, which compiles Java into HSAIL for GPU execution
‒ Project home page: http://openjdk.java.net/projects/graal/


EXAMPLE OF THE KINDS OF CODE WE’D LIKE TO RUN ON THE GPU

    class NameInfo {
        private String name;
        private boolean exists;
        public void checkExistsIn(String text) {
            exists = text.contains(name);
        }
    }

    NameInfo allNames[];
    String longText;
    IntStream istr = IntStream.range(0, allNames.length);
    istr.forEach(i -> { allNames[i].checkExistsIn(longText); });

Our prototype can handle this today!
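The fragment above leaves `allNames` and `longText` uninitialized. A self-contained variant that can be compiled and run on a stock JDK 8 might look like the sketch below. The constructor, the `exists()` accessor, the wrapper class `NameInfoDemo` and the sample strings are all assumptions added for illustration; the stream runs on CPU threads here.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class NameInfoDemo {
    static class NameInfo {
        private final String name;
        private boolean exists;

        NameInfo(String name) { this.name = name; }          // added for the demo

        public void checkExistsIn(String text) { exists = text.contains(name); }

        public boolean exists() { return exists; }           // added for the demo
    }

    // Builds one NameInfo per name, checks all of them against longText
    // in a data-parallel forEach, and returns the per-name results.
    public static boolean[] check(String[] names, String longText) {
        NameInfo[] allNames = new NameInfo[names.length];
        for (int i = 0; i < names.length; i++) {
            allNames[i] = new NameInfo(names[i]);
        }
        IntStream.range(0, allNames.length).parallel()
                 .forEach(i -> allNames[i].checkExistsIn(longText));
        boolean[] result = new boolean[names.length];
        for (int i = 0; i < names.length; i++) {
            result[i] = allNames[i].exists();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"Graal", "Fortran"};
        System.out.println(Arrays.toString(check(names, "Graal compiles Java")));
    }
}
```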


HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

• Heterogeneous System Architecture standardizes CPU/GPU functionality via a common intermediate language (HSAIL) and runtime (the HSA stack)
‒ ISA-agnostic for both CPUs and accelerators
‒ Supports high-level programming languages

• HSA makes a great platform for GPU offload
‒ Shared Virtual Memory
  ‒ Direct access to heap objects in main memory from GPU cores
  ‒ In other words, “a pointer is a pointer”
  ‒ Eliminates the overhead of copying data from CPU to GPU
  ‒ Eliminates the overhead of bookkeeping pointers

• Specifications and simulator available from the HSA Foundation
‒ http://hsafoundation.com/
‒ http://hsafoundation.com/standards/


HSA PARALLEL EXECUTION MODEL

• Grid-based execution model

• Programmer supplies a “kernel” that is run on each work-item

• Kernel is written as a single thread of execution and represents the main body of work each work-item will execute

• Each work-item has a unique id

• Programmer specifies the number of work-items (the scope of the problem)

for(int i = 0; i < in.length; i++) { out[i] = in[i] * in[i]; }

Work-item: one multiplication (in[i]*in[i])

Grid size: number of multiplications to be done
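The relationship between the loop and the grid can be mimicked in plain Java. In the sketch below, the hypothetical `dispatch` method stands in for the runtime: it calls the kernel once per work-item id, and the kernel is simply the loop body with the work-item id in place of the loop index. The names `GridDemo` and `dispatch` are assumptions for illustration and are not part of HSA or Graal; execution happens on CPU threads.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class GridDemo {
    // Hypothetical dispatcher: runs the kernel once for every
    // work-item id in [0, gridSize), in no particular order.
    public static void dispatch(int gridSize, IntConsumer kernel) {
        IntStream.range(0, gridSize).parallel().forEach(kernel);
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        int[] out = new int[in.length];
        // Grid size = number of multiplications; the kernel handles one of them.
        dispatch(in.length, id -> out[id] = in[id] * in[id]);
        System.out.println(Arrays.toString(out));
    }
}
```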


HSAIL PRIMER

• HSAIL is the code that the Graal backend will emit

• Gets translated to the ISA of the GPU device by a runtime layer known as the “finalizer”

• Generated code is in ASCII text form, which aids in debugging

• Example: signed 32-bit multiplication

    mul_s32 $s3, $s0, $s1

‒ mul: mnemonic (mul, add, sub, div, etc.)
‒ s: type modifier (s, u, b, f)
‒ 32: length modifier (1, 8, 16, 32, 64, etc.)
‒ $s3: destination
‒ $s0: source 1
‒ $s1: source 2
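To make the naming scheme concrete, the sketch below splits an opcode such as `mul_s32` into the three parts labelled above. It is only an illustration of the convention: `HsailOpDemo` and `decode` are hypothetical names, and the code handles just the simple `mnemonic_<type><length>` shape from this slide, not the full HSAIL grammar.

```java
public class HsailOpDemo {
    // Splits an opcode of the form mnemonic_<type><length>, e.g. "mul_s32",
    // into { mnemonic, type modifier, length modifier }.
    // Anything before the last underscore is treated as the mnemonic.
    public static String[] decode(String opcode) {
        int sep = opcode.lastIndexOf('_');
        String mnemonic = opcode.substring(0, sep);
        String type = opcode.substring(sep + 1, sep + 2);   // s, u, b or f
        String length = opcode.substring(sep + 2);          // 1, 8, 16, 32, 64
        return new String[] {mnemonic, type, length};
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", decode("mul_s32"))); // prints "mul s 32"
    }
}
```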


INTRODUCTION TO GRAAL

• Graal is a highly extensible, open-source, just-in-time compiler for Java
‒ Project home page: http://openjdk.java.net/projects/graal/

• Graal is written in Java
‒ Graal can be developed using existing Java IDEs (e.g., Eclipse, NetBeans), making it straightforward to debug
‒ Because Graal is written in Java, it can run on any platform and be treated as a cross-compiler
‒ In particular, Graal can compile Java for the GPU while running on the CPU

• Graal is being used to develop a centralized framework (Truffle) for executing JVM languages
‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
‒ Adding a GPU backend to Graal potentially opens the door to GPU enablement of other JVM languages using a centralized framework, but this is an area for future work


JDK8 BASED GRAAL PROTOTYPE FOR JAVA GPU ENABLEMENT

• Graal has been extended with a prototype backend that generates HSAIL code for GPU execution

• This allows portions of Java 8 programs using the Stream API to be compiled into HSAIL

• This prototype has been tested using a simulator as well as real hardware
‒ On Mandelbrot we get a speedup of 10x running on the GPU versus running using Java threads on the x86 CPU

• The HSAIL backend has been checked into the Graal OpenJDK repository


HOW GRAAL WORKS WITH THE HSA RUNTIME STACK FOR A JAVA PROGRAM

[Figure: software stack. Java Application; Java JDK Stream + Lambda API; JVM with the Java Graal JIT backend (IR generation/optimization, Graal HSAIL code generation); HSAIL code; HSAIL finalizer and runtime; GPU ISA and CPU ISA; GPU and CPU.]


EXAMPLE HSAIL CODE GENERATED FOR A SAMPLE JAVA PROGRAM

IntStream forEach (i -> { out[i] = in[i] * in[i]; });

    kernel &run (
        kernarg_u64 %_arg0,
        kernarg_u64 %_arg1
    ) {
        ld_kernarg_u64 $d6, [%_arg0];     // Parameter passing
        ld_kernarg_u64 $d2, [%_arg1];
        workitemabsid_u32 $s1, 0;         // Load id of current workitem
        cvt_s64_s32 $d0, $s1;
        mul_s64 $d0, $d0, 4;
        add_u64 $d2, $d2, $d0;
        ld_global_s32 $s0, [$d2 + 24];    // Load in[i]
        mul_s32 $s3, $s0, $s0;            // in[i] * in[i]
        cvt_s64_s32 $d1, $s1;
        mul_s64 $d1, $d1, 4;
        add_u64 $d6, $d6, $d1;
        st_global_s32 $s3, [$d6 + 24];    // Store to out[i]
        ret;
    };

• Data-parallel execution model
• Each workitem has a unique id
• workitemabsid instruction returns the id of the current workitem

private static void lambda$67(int[] out, int[] in, int i) { out[i] = in[i] * in[i]; }

What the compiler sees!

Parameter passed to lambda
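The address arithmetic in the kernel (convert the work-item id to 64 bits, multiply by 4, add the array reference, then access at offset 24) can be restated in Java as a worked example. This is a sketch under assumptions: the slides do not say what the constant 24 represents, so treating it as a fixed offset from the array reference to element 0 is an interpretation, and `KernelAddressDemo`, `elementAddress` and the base address 1000 are hypothetical.

```java
public class KernelAddressDemo {
    static final long ELEMENT_SIZE = 4;  // the 4 in "mul_s64 $d0, $d0, 4" (size of an int)
    static final long DATA_OFFSET = 24;  // the "+ 24" in ld_global/st_global (assumed offset to element 0)

    // Mirrors the kernel's address computation for element workItemId.
    public static long elementAddress(long arrayRef, int workItemId) {
        long index = (long) workItemId;       // cvt_s64_s32
        long scaled = index * ELEMENT_SIZE;   // mul_s64
        long base = arrayRef + scaled;        // add_u64
        return base + DATA_OFFSET;            // [$d2 + 24]
    }

    public static void main(String[] args) {
        // 1000 + 3 * 4 + 24 = 1036
        System.out.println(elementAddress(1000L, 3));
    }
}
```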


EXAMPLES OF FUNCTIONALITY WE SUPPORT

• Some Math intrinsics:

    IntStream.range(0, in.length).forEach(i -> {
        out[i] = Math.sqrt(in[i]) * in[i];
    });

• Arrays, string manipulation routines, calls to some JDK methods:

    IntStream.range(0, boolArray.length).forEach(i -> {
        boolArray[i] = (inArray[i]).contains("hello");
    });

• instanceof operator:

    Shape shapeArray[];
    return shapeArray[i] instanceof Circle;



• IntStream use case:

    public class Point {
        public double x;
        public double y;
    }

    Point[] pointArray;
    IntStream.range(0, pointArray.length).forEach(i -> {
        pointArray[i].x++;
    });

• ObjectStream from Array:

    Arrays.stream(pointArray).forEach(p -> { p.x++; });

• ObjectStream from ArrayList:

    ArrayList<Point> pointList;
    pointList.stream().forEach(p -> { p.x++; });

• Atomic operations (patch forthcoming):

    AtomicInteger atomicInt;
    i -> { outArray[i] = atomicInt.incrementAndGet(); }


FUTURE WORK

• GPU enablement for managed runtime languages other than Java is an area for future work

• One path is to develop a mechanism that allows other languages to call the Java 8 Stream API
‒ This could leverage existing work to make other JVM languages interoperable with Java
‒ http://agiledeveloper.com/presentations/integrating_jvm_languages_javaone.zip

• Another path is to develop a centralized framework that allows JVM languages to be compiled into a format that Graal can take as input

• Truffle is a prototype language implementation framework written in Java that uses the Graal JIT compiler
‒ https://wiki.openjdk.java.net/display/Graal/Truffle+FAQ+and+Guidelines
‒ The OpenJDK community has developed prototype implementations of JavaScript, Ruby and R on Truffle
‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
‒ http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
‒ Using Graal as the JIT compiler potentially allows other JVM languages to take advantage of Graal’s HSAIL backend


SUMMARY

• Offloading data-parallel code to the GPU can improve performance (10x on Mandelbrot with our prototype, versus Java threads on the x86 CPU)

• We have extended the Graal Just-In-Time compiler with a prototype backend that generates HSAIL code

• This work opens the door to GPU acceleration for Java

• GPU acceleration for other managed runtime languages is an area for future work
‒ Truffle may make this possible by providing a centralized framework for language implementation using the Graal JIT compiler

• We encourage OpenJDK community feedback and contributions


REFERENCES

• AMD DevCentral blog on HSAIL-based GPU Offload
‒ http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/

• Sumatra OpenJDK GPU/APU offload project
‒ Project home page: http://openjdk.java.net/projects/sumatra/
‒ Wiki: https://wiki.openjdk.java.net/display/Sumatra/Main

• Graal JIT compiler and runtime project
‒ Project home page: http://openjdk.java.net/projects/graal/

• HSA Foundation:
‒ http://hsafoundation.com/
‒ http://hsafoundation.com/standards/


REFERENCES

• JVM Language Summit 2013 (JVMLS 2013)
‒ Wimmer and Seaton, “One VM to Rule them all”:
  http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
‒ Wuerthinger and Venkatachalam, “Graal and GPU offload”:
  http://www.oracle.com/technetwork/java/jvmls2913wuerth-2013918.pdf
‒ Vitek, “R in Java”:
  http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf

• JavaOne 2013
‒ Thalinger, Wimmer, and Venkatachalam, “Wholly Graal: Accelerating GPU offload for Java”:
  https://oracleus.activeevents.com/2013/connect/fileDownload/session/C2A34A60DEDE1B2D9FE9D87733345017/CON6419_Wimmer.pdf


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
