WHOLLY GRAAL: ENABLING GPU ACCELERATION OF JAVA USING THE OPENJDK GRAAL COMPILER. VASANTH VENKATACHALAM
May 06, 2015
2 | PRESENTATION TITLE | November 20, 2013 | CONFIDENTIAL
AGENDA
• Why should you be interested in GPU offload?
• Java execution model
• Requirements for Java GPU enablement
• Sumatra OpenJDK Project for Java GPU enablement
• Heterogeneous System Architecture (HSA)
• Introduction to Graal
• JDK8 based Graal prototype for Java GPU offload
• Future work
• Summary
WHY SHOULD YOU BE INTERESTED IN GPU OFFLOAD?
• In many instances, offloading the data-parallel parts of a program to a GPU improves performance compared to running the entire program on the CPU
  ‒ A typical GPU offers more cores for the same density than a CPU
    ‒ AMD Radeon™ HD 7750 features 512 Stream Processors!
  ‒ In a data-parallel computation, the same computation is repeated over different data and the results do not depend on each other, so the individual computations can be executed in parallel on multiple cores
• Example: Squaring array elements
    for (int i = 0; i < in.length; i++) {
        out[i] = in[i] * in[i];
    }
  core0: in[0]*in[0]   core1: in[1]*in[1]   core2: in[2]*in[2]   …
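The squaring example can be written both ways in plain Java 8. The sketch below is only an illustration: it runs on CPU threads through a parallel stream, not on a GPU, and the class and method names (`SquareExample`, `squareSequential`, `squareParallel`) are made up for this example. It shows that each index is an independent unit of work, so the loop body can be handed to whichever core is free.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration class; not part of Graal or Sumatra.
class SquareExample {
    // Sequential version: a single thread visits every index in order.
    static int[] squareSequential(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * in[i];
        }
        return out;
    }

    // Data-parallel version: each index is independent of the others,
    // so the runtime is free to spread the indices across cores.
    static int[] squareParallel(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length)
                 .parallel()
                 .forEach(i -> out[i] = in[i] * in[i]);
        return out;
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(squareParallel(in))); // prints [1, 4, 9, 16]
    }
}
```

Both methods return the same array because no iteration reads a value that another iteration writes.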
JAVA EXECUTION MODEL
• Java is a managed runtime language, a language that runs on top of a virtual machine (VM)
  ‒ Other managed runtime languages include Ruby, JavaScript, Python, Scala
• Java source is compiled into an intermediate format called bytecode
• The Java virtual machine (JVM) executes the bytecodes using interpretation or just-in-time compilation.
  ‒ Interpretation executes the bytecodes directly, instruction by instruction
  ‒ Just-in-Time Compilation (JIT) involves compiling bytecodes into machine code at runtime and executing the machine code.
• Examples of JVMs include:
  ‒ Oracle HotSpot™ JVM
  ‒ IBM J9 VM
Figure: Java Source Code → Java Source Compiler → Java Bytecodes → Java Virtual Machine → Machine code

Java Source Code (Hello World)
    public static void main(String[] args) {
        System.out.println("Hello");
    }

Java Bytecodes (Hello World)
    0: getstatic #13
    3: ldc #19
    5: invokevirtual #21
    8: return
REQUIREMENTS FOR JAVA GPU ENABLEMENT
• Java needs a programming model to express data-parallel workloads
  ‒ Java 8, for example, has the Stream API with support for Lambda constructs
• The Java Virtual Machine (JVM) needs to generate code for the GPU as well as the CPU
  ‒ The JVM has to target multiple Instruction Set Architectures (ISAs)
• Ideally, the JVM can generate code in a standard intermediate language that can be translated into the native machine instructions of each GPU target
  ‒ This allows for portability
  ‒ Any update to the GPU ISA affects only the translation of this intermediate language into the GPU machine instructions
SUMATRA OPENJDK PROJECT FOR JAVA GPU ENABLEMENT
• Open source project intending to enable Java applications to take advantage of the GPU
  ‒ More or less transparently to the application
• Project started by Oracle and AMD shortly before JavaOne 2012
• We are developing a prototype of Sumatra using the Heterogeneous System Architecture (HSA) and the Graal OpenJDK project
  ‒ Backend for Graal Just-In-Time (JIT) Compiler which compiles Java into HSAIL for GPU execution
  ‒ Project home page: http://openjdk.java.net/projects/graal/
EXAMPLE OF THE KINDS OF CODE WE’D LIKE TO RUN ON THE GPU
    class NameInfo {
        private String name;
        private boolean exists;

        public void checkExistsIn(String text) {
            exists = text.contains(name);
        }
    }

    NameInfo[] allNames;
    String longText;

    IntStream istr = IntStream.range(0, allNames.length);
    istr.forEach(i -> {
        allNames[i].checkExistsIn(longText);
    });
Our prototype can handle this today!
HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
• Heterogeneous System Architecture standardizes CPU/GPU functionality via a common intermediate language (HSAIL) and runtime (the HSA stack)
  ‒ ISA-agnostic for both CPUs and accelerators
  ‒ Supports high-level programming languages
• HSA makes a great platform for GPU offload
  ‒ Shared Virtual Memory
    ‒ Direct access to heap objects in main memory from GPU cores
    ‒ In other words, “a pointer is a pointer”
    ‒ Eliminates the overhead of copying data from CPU to GPU
    ‒ Eliminates the overhead of bookkeeping pointers
• Specifications and simulator available from HSA Foundation
  ‒ http://hsafoundation.com/
  ‒ http://hsafoundation.com/standards/
HSA PARALLEL EXECUTION MODEL
• Grid based execution model
• Programmer supplies a “kernel” that is run on each work-item
• Kernel is written as a single thread of execution and represents the main body of work each work-item will execute
• Each work-item has a unique id
• Programmer specifies the number of work-items (for scope of problem)

Example:
    for (int i = 0; i < in.length; i++) {
        out[i] = in[i] * in[i];
    }
  Work-item: one multiplication (in[i] * in[i])
  Grid size: number of multiplications to be done
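The grid model can be mimicked in plain Java to make the roles concrete. This is a sketch under stated assumptions, not the HSA runtime API: `dispatch` is a hypothetical helper invented here, and it uses a parallel stream on the CPU as a stand-in for launching work-items on a GPU. The kernel is the former loop body, and the work-item id takes the place of the loop index.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Hypothetical illustration class; it imitates the grid model on CPU threads.
class GridSketch {
    // Runs `kernel` once for every work-item id in [0, gridSize).
    // The order in which ids are processed is deliberately unspecified.
    static void dispatch(int gridSize, IntConsumer kernel) {
        IntStream.range(0, gridSize).parallel().forEach(kernel);
    }

    static int[] square(int[] in) {
        int[] out = new int[in.length];
        // Grid size = number of multiplications; the kernel handles one of them.
        dispatch(in.length, id -> out[id] = in[id] * in[id]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {2, 3, 4}))); // prints [4, 9, 16]
    }
}
```

Note that the kernel never loops: the caller chooses the grid size, and each invocation only looks at its own id.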
HSAIL PRIMER
• HSAIL is the code that the Graal backend will emit
• Gets translated to the ISA of the GPU device by a runtime layer known as the “finalizer”
• Generated code is in ASCII text form, which aids in debugging
• Example: signed 32-bit multiplication
    mul_s32 $s3, $s0, $s1
  ‒ mul: mnemonic (mul, add, sub, div, etc.)
  ‒ s: type modifier (s, u, b, f)
  ‒ 32: length modifier (1, 8, 16, 32, 64, etc.)
  ‒ $s3: destination; $s0: source 1; $s1: source 2
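Because HSAIL is plain text, the fields of the example instruction can be pulled apart with a regular expression. The sketch below is purely illustrative: `HsailFields` and its pattern are assumptions made for this example, and the pattern covers only the simple "mnemonic_typeLength dest, src1, src2" shape shown on the slide, not the full HSAIL grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any HSA tool.
class HsailFields {
    // Handles only the three-register shape from the slide,
    // e.g. "mul_s32 $s3, $s0, $s1" (an optional trailing ';' is accepted).
    private static final Pattern THREE_OPERAND = Pattern.compile(
            "([a-z]+)_([subf])(\\d+)\\s+(\\$\\w+),\\s*(\\$\\w+),\\s*(\\$\\w+);?");

    // Returns {mnemonic, type, length, destination, source1, source2}.
    static String[] split(String instruction) {
        Matcher m = THREE_OPERAND.matcher(instruction.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported instruction shape: " + instruction);
        }
        String[] fields = new String[6];
        for (int g = 1; g <= 6; g++) {
            fields[g - 1] = m.group(g);
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints: mul | s | 32 | $s3 | $s0 | $s1
        System.out.println(String.join(" | ", split("mul_s32 $s3, $s0, $s1")));
    }
}
```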
INTRODUCTION TO GRAAL
• Graal is a highly extensible, open-source, just-in-time compiler for Java
  ‒ Project home page: http://openjdk.java.net/projects/graal/
• Graal is written in Java
  ‒ Graal can be developed using existing Java IDEs (e.g., Eclipse, NetBeans), making it straightforward to debug
  ‒ Because Graal is written in Java, it can run on any platform and be treated as a cross-compiler
  ‒ In particular, Graal can compile Java for the GPU while running on the CPU
• Graal is being used to develop a centralized framework (Truffle) for executing JVM languages
  ‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
  ‒ Adding a GPU backend to Graal potentially opens the door to GPU enablement of other JVM languages using a centralized framework, but this is an area for future work
JDK8 BASED GRAAL PROTOTYPE FOR JAVA GPU ENABLEMENT
• Graal has been extended with a prototype backend that generates HSAIL code for GPU execution
• This allows portions of Java 8 programs using the Stream API to be compiled into HSAIL
• This prototype has been tested using a simulator as well as real hardware
  ‒ On Mandelbrot we get a speedup of 10x running on the GPU versus running using Java threads on the x86 CPU
• The HSAIL backend has been checked into the Graal OpenJDK repository
HOW GRAAL WORKS WITH THE HSA RUNTIME STACK FOR A JAVA PROGRAM
Figure (component labels recovered from the diagram):
  Java Application
  Java JDK Stream + Lambda API
  JVM, containing the Java Graal JIT backend: IR Generation/Optimization, Graal HSAIL code generation
  HSAIL code → HSAIL finalizer and runtime → GPU ISA (GPU)
  CPU ISA (CPU)
EXAMPLE HSAIL CODE GENERATED FOR A SAMPLE JAVA PROGRAM
Java source:
    IntStream.range(0, in.length).forEach(i -> { out[i] = in[i] * in[i]; });

What the compiler sees (i is the parameter passed to the lambda):
    private static void lambda$67(int[] out, int[] in, int i) {
        out[i] = in[i] * in[i];
    }

Generated HSAIL:
    kernel &run (
        kernarg_u64 %_arg0,
        kernarg_u64 %_arg1
    ) {
        ld_kernarg_u64 $d6, [%_arg0];     // Parameter passing
        ld_kernarg_u64 $d2, [%_arg1];
        workitemabsid_u32 $s1, 0;         // Load id of current workitem
        cvt_s64_s32 $d0, $s1;             // Load in[i]
        mul_s64 $d0, $d0, 4;
        add_u64 $d2, $d2, $d0;
        ld_global_s32 $s0, [$d2 + 24];
        mul_s32 $s3, $s0, $s0;            // in[i] * in[i]
        cvt_s64_s32 $d1, $s1;             // Store to out[i]
        mul_s64 $d1, $d1, 4;
        add_u64 $d6, $d6, $d1;
        st_global_s32 $s3, [$d6 + 24];
        ret;
    };

• Data-parallel execution model
• Each workitem has a unique id
• The workitemabsid instruction returns the id of the current workitem
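The address arithmetic in the HSAIL listing can be replayed in Java to see what each work-item does. This is a rough model under loud assumptions: the `+ 24` in the listing is taken here to be the byte offset of element 0 inside an `int[]` object, the arrays are imitated with `ByteBuffer`s rather than real heap objects, and every name in `KernelAddressing` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical model of the kernel's loads and stores; not generated by Graal.
class KernelAddressing {
    // Assumption: offset of element 0 in an int[] object, as "+ 24" in the listing suggests.
    static final int ARRAY_BASE_OFFSET = 24;
    static final int INT_SIZE = 4; // the multiplier in "mul_s64 $d0, $d0, 4"

    // Builds a fake "int[] object": 24 header bytes followed by the elements.
    static ByteBuffer fakeIntArray(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(ARRAY_BASE_OFFSET + values.length * INT_SIZE)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            buf.putInt(ARRAY_BASE_OFFSET + i * INT_SIZE, values[i]);
        }
        return buf;
    }

    // One work-item, mirroring the listing step by step.
    static void runWorkItem(ByteBuffer out, ByteBuffer in, int workItemId) {
        long byteIndex = (long) workItemId * INT_SIZE;              // cvt_s64_s32, mul_s64
        int s0 = in.getInt((int) (byteIndex + ARRAY_BASE_OFFSET));  // ld_global_s32 [$d2 + 24]
        int s3 = s0 * s0;                                           // mul_s32
        out.putInt((int) (byteIndex + ARRAY_BASE_OFFSET), s3);      // st_global_s32 [$d6 + 24]
    }

    static int[] square(int[] values) {
        ByteBuffer in = fakeIntArray(values);
        ByteBuffer out = fakeIntArray(new int[values.length]);
        for (int id = 0; id < values.length; id++) {
            runWorkItem(out, in, id); // on a GPU these would run as parallel work-items
        }
        int[] result = new int[values.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.getInt(ARRAY_BASE_OFFSET + i * INT_SIZE);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {1, 2, 3}))); // prints [1, 4, 9]
    }
}
```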
EXAMPLES OF FUNCTIONALITY WE SUPPORT
• Some Math intrinsics:
    IntStream.range(0, in.length).forEach(i -> {
        out[i] = Math.sqrt(in[i]) * in[i];
    });
• Arrays, string manipulation routines, calls to some JDK methods:
    IntStream.range(0, boolArray.length).forEach(i -> {
        boolArray[i] = inArray[i].contains("hello");
    });
• instanceof operator:
    Shape shapeArray[];
    return shapeArray[i] instanceof Circle;
EXAMPLES OF FUNCTIONALITY WE SUPPORT
• IntStream use case
    public class Point {
        public double x;
        public double y;
    }

    Point[] pointArray;
    IntStream.range(0, pointArray.length).forEach(i -> {
        pointArray[i].x++;
    });
• ObjectStream from Array
    Arrays.stream(pointArray).forEach(p -> { p.x++; });
EXAMPLES OF FUNCTIONALITY WE SUPPORT
• ObjectStream from ArrayList
    ArrayList<Point> pointList;
    pointList.stream().forEach(p -> { p.x++; });
• Atomic operations (patch forthcoming)
    AtomicInteger atomicInt;
    i -> { outArray[i] = atomicInt.incrementAndGet(); }
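The atomic lambda can be tried on the CPU to see what it guarantees. The sketch below is only an illustration: it uses a parallel stream in place of GPU work-items, and `AtomicTickets` and `assignTickets` are hypothetical names. Every element receives a distinct value from 1 to n, but which index gets which value depends on scheduling.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

// Hypothetical illustration class; runs on CPU threads only.
class AtomicTickets {
    static int[] assignTickets(int n) {
        AtomicInteger atomicInt = new AtomicInteger();
        int[] outArray = new int[n];
        // incrementAndGet is atomic, so no two work-items see the same value.
        IntStream.range(0, n)
                 .parallel()
                 .forEach(i -> outArray[i] = atomicInt.incrementAndGet());
        return outArray;
    }

    public static void main(String[] args) {
        int[] tickets = assignTickets(8);
        Arrays.sort(tickets); // sorting hides the nondeterministic assignment order
        System.out.println(Arrays.toString(tickets)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```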
FUTURE WORK
• GPU enablement for managed runtime languages other than Java is an area for future work
• One path is to develop a mechanism that allows other languages to call the Java 8 Stream API
  ‒ This could leverage existing work to make other JVM languages interoperable with Java
    ‒ http://agiledeveloper.com/presentations/integrating_jvm_languages_javaone.zip
• Another path is to develop a centralized framework that allows JVM languages to be compiled into a format that Graal can take as input
• Truffle is a prototype language implementation framework written in Java that uses the Graal JIT compiler
  ‒ https://wiki.openjdk.java.net/display/Graal/Truffle+FAQ+and+Guidelines
  ‒ The OpenJDK community has developed prototype implementations of JavaScript, Ruby and R on Truffle
    ‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
    ‒ http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
  ‒ Using Graal as the JIT compiler potentially allows other JVM languages to take advantage of Graal’s HSAIL backend
SUMMARY
• GPU offload is beneficial for improved performance
• We have extended the Graal Just-In-Time compiler with a prototype backend that generates HSAIL code
• This work opens the door to GPU acceleration for Java
• GPU acceleration for other managed runtime languages is an area for future work
  ‒ Truffle may make this possible by providing a centralized framework for language implementation using the Graal JIT compiler
• We encourage OpenJDK community feedback and contributions
REFERENCES
• AMD DevCentral blog on HSAIL-based GPU Offload
  ‒ http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/
• Sumatra OpenJDK GPU/APU offload project
  ‒ Project home page: http://openjdk.java.net/projects/sumatra/
  ‒ Wiki: https://wiki.openjdk.java.net/display/Sumatra/Main
• Graal JIT compiler and runtime project
  ‒ Project home page: http://openjdk.java.net/projects/graal/
• HSA Foundation:
  ‒ http://hsafoundation.com/
  ‒ http://hsafoundation.com/standards/
REFERENCES
• JVM Language Summit 2013 (JVMLS 2013)
  ‒ Wimmer and Seaton, “One VM to Rule them all”:
    ‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
  ‒ Wuerthinger and Venkatachalam, “Graal and GPU offload”:
    ‒ http://www.oracle.com/technetwork/java/jvmls2913wuerth-2013918.pdf
  ‒ Vitek, “R in Java”:
    ‒ http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
• JavaOne 2013
  ‒ Thalinger, Wimmer, and Venkatachalam, “Wholly Graal: Accelerating GPU offload for Java”:
    ‒ https://oracleus.activeevents.com/2013/connect/fileDownload/session/C2A34A60DEDE1B2D9FE9D87733345017/CON6419_Wimmer.pdf
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.