Enabling Heterogeneous Computing in Java with Graal
Juan Fumero, Michel Steuwer, Christophe Dubach
The University of Edinburgh
7 July 2015, Truffle Workshop
1 Introduction
2 API
3 Runtime Code Generation
4 Data Management
5 Results
6 Conclusion
Heterogeneous Computing
• NBody App (NVIDIA SDK): ~105x speedup over sequential
• LU Decomposition (Rodinia Benchmark): ~10x over 32 OpenMP threads
Cool, but how to program?
Example in OpenCL

// create host buffers
int *A, ....
// Initialization
...
// platform
cl_uint numPlatforms = 0;
cl_platform_id *platforms;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
platforms = (cl_platform_id *) malloc(numPlatforms * sizeof(cl_platform_id));
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
cl_uint numDevices = 0;
cl_device_id *devices;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
// Allocate space for each device
devices = (cl_device_id *) malloc(numDevices * sizeof(cl_device_id));
// Fill in devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
cl_context context;
context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
cl_command_queue cmdQ;
cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
cl_mem d_A, d_B, d_C;
d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
...
// Check errors
...
Example in OpenCL
const char *sourceFile = "kernel.cl";
source = read_source(sourceFile);
program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &status);
cl_int buildErr;
buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
// Create a kernel
kernel = clCreateKernel(program, "vecadd", &status);

status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

size_t globalWorkSize[1] = { ELEMENTS };
size_t local_item_size[1] = { 5 };

clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

// Free memory
OpenCL example
kernel void vecadd(
    global int *a,
    global int *b,
    global int *c) {

  int idx = get_global_id(0);
  c[idx] = a[idx] * b[idx];
}
• Hello world app: ~250 lines of code (including error checking)
• Low-level and specific code
• Knowledge about the target architecture is required
• If the GPU/accelerator changes, tuning is required
OpenCL programming is hard and error-prone!!
Higher levels of abstraction
Similar works
• Sumatra API (discontinued): Stream API for HSAIL
• AMD Aparapi: Java API for OpenCL
• NVIDIA Nova: functional programming language for CPU/GPU
• Copperhead: subset of Python that can be executed on heterogeneous platforms
Our Approach
Three levels of abstraction:
• Parallel skeletons: API based on a functional programming style (map/reduce)
• High-level optimising library which rewrites operations to target specific hardware
• OpenCL code generation and runtime with data management for heterogeneous architectures
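To make the first level concrete, here is a minimal sketch of what a skeleton-style API could look like in plain Java. It is not the authors' library: the class SkeletonSketch, the then method, and the newArray parameter are assumptions made for illustration, and everything runs sequentially on the CPU. It expresses a dot product as a map (pairwise multiply) followed by a reduce (sum).

```java
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.IntFunction;

class SkeletonSketch {
    /** Hypothetical skeleton type: a function from an input array to an output array. */
    interface ArrayFunc<T, R> {
        R[] apply(T[] input);

        /** Chains two skeletons: the output array of this one feeds the next. */
        default <V> ArrayFunc<T, V> then(ArrayFunc<R, V> next) {
            return input -> next.apply(this.apply(input));
        }
    }

    /** map: applies f to every element independently. */
    static <T, R> ArrayFunc<T, R> map(Function<T, R> f, IntFunction<R[]> newArray) {
        return input -> {
            R[] out = newArray.apply(input.length);
            for (int i = 0; i < input.length; ++i) {
                out[i] = f.apply(input[i]);
            }
            return out;
        };
    }

    /** reduce: folds all elements into a one-element array using op. */
    static <T> ArrayFunc<T, T> reduce(BinaryOperator<T> op, T init, IntFunction<T[]> newArray) {
        return input -> {
            T acc = init;
            for (T x : input) {
                acc = op.apply(acc, x);
            }
            T[] out = newArray.apply(1);
            out[0] = acc;
            return out;
        };
    }

    /** Dot product as map (pairwise multiply) followed by reduce (sum). */
    static float dot(float[] a, float[] b) {
        ArrayFunc<float[], Float> multiply = map((float[] p) -> p[0] * p[1], Float[]::new);
        ArrayFunc<Float, Float> sum = reduce((x, y) -> x + y, 0.0f, Float[]::new);
        ArrayFunc<float[], Float> dotP = multiply.then(sum);

        // Each element of the input is one (a[i], b[i]) pair.
        float[][] pairs = new float[a.length][];
        for (int i = 0; i < a.length; ++i) {
            pairs[i] = new float[] {a[i], b[i]};
        }
        return dotP.apply(pairs)[0];
    }

    public static void main(String[] args) {
        System.out.println(dot(new float[] {1f, 2f, 3f}, new float[] {4f, 5f, 6f}));
    }
}
```

Because the program only states what to compute (map, then reduce), a library is free to choose how each skeleton is executed, which is what the second and third levels build on.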
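Our Approach: Overview
[Figure: pipeline overview. An application written against the ArrayFunction API is compiled from Java source to Java bytecode. During Java execution, a call such as dotP.apply(input) triggers OpenCL kernel generation from the bytecode (using the Graal API). The OpenCL kernel is executed on the accelerator through JOCL, which produces the output.]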
Example: Saxpy
// Computation function
ArrayFunc<Tuple2<Float, Float>, Float> mult = new MapFunction<>(t -> 2.5f * t._1() + t._2());

// Prepare the input
Tuple2<Float, Float>[] input = new Tuple2[size];
for (int i = 0; i < input.length; ++i) {
    // Each element must be allocated before its fields are set
    // (assumes Tuple2 has a no-argument constructor)
    input[i] = new Tuple2<>();
    input[i]._1 = (float) (i * 0.323);
    input[i]._2 = (float) (i + 2.0);
}

// Computation
Float[] output = mult.apply(input);
If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:
map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
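The rewrite can be pictured as a three-stage pipeline. The sketch below is an illustration in plain Java, not the library's implementation: the Stage interface and the method names are hypothetical, a float[] stands in for a device buffer, and an ordinary loop stands in for the generated OpenCL kernel.

```java
import java.util.function.Function;

class MapRewriteSketch {
    /** One step of the lowered pipeline (hypothetical interface). */
    interface Stage<A, B> {
        B run(A input);

        default <C> Stage<A, C> then(Stage<B, C> next) {
            return input -> next.run(this.run(input));
        }
    }

    /** CopyIn: flatten (x, y) pairs into one primitive buffer, as a device transfer would need. */
    static Stage<float[][], float[]> copyIn() {
        return pairs -> {
            float[] buffer = new float[2 * pairs.length];
            for (int i = 0; i < pairs.length; ++i) {
                buffer[2 * i] = pairs[i][0];
                buffer[2 * i + 1] = pairs[i][1];
            }
            return buffer;
        };
    }

    /** Compute: a plain Java loop standing in for the generated OpenCL kernel. */
    static Stage<float[], float[]> compute(Function<float[], Float> lambda) {
        return buffer -> {
            float[] result = new float[buffer.length / 2];
            for (int i = 0; i < result.length; ++i) {
                result[i] = lambda.apply(new float[] {buffer[2 * i], buffer[2 * i + 1]});
            }
            return result;
        };
    }

    /** CopyOut: box the primitive results back into the Float[] the caller expects. */
    static Stage<float[], Float[]> copyOut() {
        return buffer -> {
            Float[] out = new Float[buffer.length];
            for (int i = 0; i < buffer.length; ++i) {
                out[i] = buffer[i];
            }
            return out;
        };
    }

    /** map(lambda) rewritten as copyIn, then compute, then copyOut. */
    static Stage<float[][], Float[]> mapAccelerator(Function<float[], Float> lambda) {
        return copyIn().then(compute(lambda)).then(copyOut());
    }

    public static void main(String[] args) {
        // Saxpy with a = 2.5 on two (x, y) pairs.
        Float[] out = mapAccelerator(t -> 2.5f * t[0] + t[1])
                .run(new float[][] {{1f, 2f}, {2f, 0.5f}});
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Keeping the copies as explicit stages means the user-facing map stays unchanged while the data movement becomes something a runtime can reason about separately from the computation.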
Our Approach: Overview
[Figure: class diagram with ArrayFunc, Map, MapThreads, MapOpenCL, Reduce, ...]

Map:
apply() { for (i = 0; i < size; ++i) out[i] = f.apply(in[i]); }

MapThreads:
apply() { for (thread : threads) thread.performMapSeq(); }

MapOpenCL produces an OpenCL kernel in four steps:
1. Type inference
2. IR generation
3. Optimizations
4. Kernel generation

OpenCL Kernel:
void kernel(global float* input, global float* output) { ...; ...; }
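As an illustration of how one interface can hide several execution strategies, here is a small self-contained Java sketch of a sequential map and a threaded map. It assumes double arrays and a simple inheritance relation between the two classes. Those choices, the class MapStrategiesSketch, and the chunking scheme are assumptions for the example, not the library's actual design.

```java
import java.util.function.DoubleUnaryOperator;

class MapStrategiesSketch {
    /** Common interface: every skeleton maps an input array to an output array. */
    interface ArrayFunc {
        double[] apply(double[] in);
    }

    /** Sequential map: one loop over all elements. */
    static class Map implements ArrayFunc {
        final DoubleUnaryOperator f;

        Map(DoubleUnaryOperator f) {
            this.f = f;
        }

        @Override
        public double[] apply(double[] in) {
            double[] out = new double[in.length];
            performMapSeq(in, out, 0, in.length);
            return out;
        }

        /** Applies f to the elements in [from, to). */
        void performMapSeq(double[] in, double[] out, int from, int to) {
            for (int i = from; i < to; ++i) {
                out[i] = f.applyAsDouble(in[i]);
            }
        }
    }

    /** Threaded map: each thread runs the sequential map on its own chunk. */
    static class MapThreads extends Map {
        final int numThreads;

        MapThreads(DoubleUnaryOperator f, int numThreads) {
            super(f);
            if (numThreads < 1) {
                throw new IllegalArgumentException("numThreads must be at least 1");
            }
            this.numThreads = numThreads;
        }

        @Override
        public double[] apply(double[] in) {
            double[] out = new double[in.length];
            Thread[] threads = new Thread[numThreads];
            int chunk = (in.length + numThreads - 1) / numThreads; // ceiling division
            for (int t = 0; t < numThreads; ++t) {
                int from = Math.min(in.length, t * chunk);
                int to = Math.min(in.length, from + chunk);
                threads[t] = new Thread(() -> performMapSeq(in, out, from, to));
                threads[t].start();
            }
            for (Thread thread : threads) {
                try {
                    thread.join(); // join also makes the thread's writes to out visible
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        double[] in = {1, 2, 3, 4, 5, 6, 7};
        ArrayFunc twice = new MapThreads(x -> x * 2.0, 3);
        System.out.println(java.util.Arrays.toString(twice.apply(in)));
    }
}
```

Callers only see apply, so swapping Map for MapThreads (or, in the real system, an OpenCL-backed variant) does not change application code.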
OpenCL code generated

double lambda0(float p0) {
  double cast_1 = (double) p0;
  double result_2 = cast_1 * 2.0;
  return result_2;
}
kernel void lambdaComputationKernel(
    global float *p0,
    global int *p0_index_data,
    global double *p1,
    global int *p1_index_data) {
  int p0_dim_1 = 0; int p1_dim_1 = 0;
  int gs = get_global_size(0);
  int loop_1 = get_global_id(0);
  for ( ; ; loop_1 += gs) {
    int p0_len_dim_1 = p0_index_data[p0_dim_1];
    bool cond_2 = loop_1 < p0_len_dim_1;
    if (cond_2) {
      float auxVar0 = p0[loop_1];
      double res = lambda0(auxVar0);
      p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
    } else { break; }
  }
}
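The loop in the generated kernel starts at the work-item's global id and advances by the global size, so the whole array is covered whatever the number of work-items. The following Java sketch simulates that access pattern sequentially on the CPU. It is only an illustration: the class GridStrideSketch and its method names are hypothetical, and the output offset that the kernel reads from p1_index_data is assumed to be zero.

```java
class GridStrideSketch {
    /** Same computation as lambda0 in the generated kernel: widen to double, then double the value. */
    static double lambda0(float p0) {
        return (double) p0 * 2.0;
    }

    /** Body executed by one simulated work-item, mirroring the kernel's loop structure. */
    static void workItem(int globalId, int globalSize, float[] p0, double[] p1) {
        for (int loop = globalId; ; loop += globalSize) {
            if (loop < p0.length) {
                p1[loop] = lambda0(p0[loop]); // output offset assumed to be 0
            } else {
                break;
            }
        }
    }

    /** Runs every work-item in turn; on a device they would run in parallel. */
    static double[] run(float[] input, int globalSize) {
        double[] output = new double[input.length];
        for (int id = 0; id < globalSize; ++id) {
            workItem(id, globalSize, input, output);
        }
        return output;
    }

    public static void main(String[] args) {
        double[] out = run(new float[] {1f, 2f, 3f, 4f, 5f}, 2);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

With a global size of 2 and five elements, work-item 0 handles indices 0, 2, 4 and work-item 1 handles indices 1, 3. With more work-items than elements, the extra ones exit on their first bounds check.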