Gpus graal

Enabling Heterogeneous Computing in Java with Graal

Juan Fumero, Michel Steuwer, Christophe Dubach

The University of Edinburgh

7 July 2015, Truffle Workshop

1 Introduction

2 API

3 Runtime Code Generation

4 Data Management

5 Results

6 Conclusion

Heterogeneous Computing

• NBody App (NVIDIA SDK): ~105x speedup over sequential

• LU Decomposition (Rodinia Benchmark): ~10x over 32 OpenMP threads

Cool, but how to program?

Example in OpenCL

    // create host buffers
    int *A, ....
    // Initialization
    ...
    // platform
    cl_uint numPlatforms = 0;
    cl_platform_id *platforms;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    platforms = (cl_platform_id*) malloc(numPlatforms * sizeof(cl_platform_id));
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    cl_uint numDevices = 0;
    cl_device_id *devices;
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
    // Allocate space for each device
    devices = (cl_device_id*) malloc(numDevices * sizeof(cl_device_id));
    // Fill in devices
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
    cl_context context;
    context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
    cl_command_queue cmdQ;
    cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);
    d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, B, &status);
    d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    ...
    // Check errors
    ...

Example in OpenCL

    const char *sourceFile = "kernel.cl";
    source = read_source(sourceFile);
    program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, &status);
    cl_int buildErr;
    buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
    // Create a kernel
    kernel = clCreateKernel(program, "vecadd", &status);

    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
    status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
    status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

    size_t globalWorkSize[1] = { ELEMENTS };
    size_t local_item_size[1] = { 5 };

    clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

    clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

    // Free memory

OpenCL example

    kernel void vecadd(
        global int *a,
        global int *b,
        global int *c) {

        int idx = get_global_id(0);
        c[idx] = a[idx] * b[idx];
    }

• Hello world App: ~250 lines of code (including error checking)

• Low-level and specific code

• Knowledge about the target architecture

• If the GPU/accelerator changes, tuning is required

OpenCL programming is hard and error-prone!!

Higher levels of abstraction

Similar works

• Sumatra API (discontinued): Stream API for HSAIL

• AMD Aparapi: Java API for OpenCL

• NVIDIA Nova: functional programming language for CPU/GPU

• Copperhead: subset of Python that can be executed on heterogeneous platforms

Our Approach

Three levels of abstraction:

• Parallel Skeletons: API based on functional programming style (map/reduce)

• High-level optimising library which rewrites operations to target specific hardware

• OpenCL code generation and runtime with data management for heterogeneous architectures

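Our Approach: Overview

[Figure: compilation and execution pipeline. The application, written against the ArrayFunction API, goes through Java source compilation to Java bytecode. During Java execution (dotP.apply(input)), OpenCL kernel generation (using the Graal API) produces an OpenCL kernel; OpenCL execution, via JOCL, runs it on the accelerator and returns the output.]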

Example: Saxpy

    // Computation function
    ArrayFunc<Tuple2<Float, Float>, Float> mult =
        new MapFunction<>(t -> 2.5f * t._1() + t._2());

    // Prepare the input
    Tuple2<Float, Float>[] input = new Tuple2[size];
    for (int i = 0; i < input.length; ++i) {
        // Allocate each tuple before use; the array initially holds only nulls
        input[i] = new Tuple2<>((float) (i * 0.323), (float) (i + 2.0));
    }

    // Computation
    Float[] output = mult.apply(input);

If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:

    map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
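The rewrite of map into copy-in, compute and copy-out stages can be pictured as plain function composition. The following is a minimal sketch under stated assumptions, not the library's implementation: `MapRewriteSketch`, `DeviceBuffer`, `copyIn`, `compute`, `copyOut` and `mapAccelerator` are hypothetical names, and the "device" is simulated with ordinary Java arrays instead of OpenCL buffers.

```java
import java.util.function.Function;

/** Hypothetical sketch: a map skeleton rewritten into copy-in, compute, copy-out stages. */
final class MapRewriteSketch {

    /** Stands in for device memory; a real runtime would hold an OpenCL buffer here. */
    static final class DeviceBuffer {
        final float[] data;
        DeviceBuffer(float[] data) { this.data = data; }
    }

    /** Stage 1: copy host data to the (simulated) device. */
    static Function<float[], DeviceBuffer> copyIn() {
        return host -> new DeviceBuffer(host.clone());
    }

    /** Stage 2: apply the user function to every element on the (simulated) device. */
    static Function<DeviceBuffer, DeviceBuffer> compute(Function<Float, Float> lambda) {
        return in -> {
            float[] out = new float[in.data.length];
            for (int i = 0; i < out.length; i++) {
                out[i] = lambda.apply(in.data[i]);
            }
            return new DeviceBuffer(out);
        };
    }

    /** Stage 3: copy the result back to the host. */
    static Function<DeviceBuffer, float[]> copyOut() {
        return device -> device.data.clone();
    }

    /** map(lambda) expressed as copyIn, then compute(lambda), then copyOut. */
    static Function<float[], float[]> mapAccelerator(Function<Float, Float> lambda) {
        return copyIn().andThen(compute(lambda)).andThen(copyOut());
    }

    public static void main(String[] args) {
        float[] result = mapAccelerator(x -> 2.5f * x + 1.0f).apply(new float[] {0f, 1f, 2f});
        System.out.println(java.util.Arrays.toString(result)); // prints [1.0, 3.5, 6.0]
    }
}
```

Keeping the three stages as separate functions is what would let a runtime drop or fuse copies when several skeletons are chained; that optimisation is not shown here.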

Our Approach: Overview

[Figure: class diagram with Function, ArrayFunc, Map, MapThreads, MapOpenCL, Reduce, ...]

Map:

    apply() {
        for (i = 0; i < size; ++i)
            out[i] = f.apply(in[i]);
    }

MapThreads:

    apply() {
        for (thread : threads)
            thread.performMapSeq();
    }

MapOpenCL:

    apply() {
        copyToDevice();
        execute();
        copyToHost();
    }
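The apply() variants in the diagram can be sketched as a small class hierarchy. This is an illustration under assumptions, not the authors' code: the class names follow the slide, but `SkeletonSketch`, `FloatFunction`, `mapRange`, the specialisation to `float[]` and the chunking scheme are all hypothetical. MapOpenCL is left out because it would need an OpenCL binding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the skeleton hierarchy; names follow the slide, bodies are assumptions. */
final class SkeletonSketch {

    /** Element-wise user function (hypothetical helper type). */
    interface FloatFunction {
        float apply(float x);
    }

    /** Base of all skeletons: turns an input array into an output array. */
    abstract static class ArrayFunc {
        abstract float[] apply(float[] in);
    }

    /** Sequential map: a single loop over all elements. */
    static class Map extends ArrayFunc {
        final FloatFunction f;

        Map(FloatFunction f) { this.f = f; }

        @Override
        float[] apply(float[] in) {
            float[] out = new float[in.length];
            mapRange(in, out, 0, in.length);
            return out;
        }

        /** Applies f to the elements in [from, to). */
        void mapRange(float[] in, float[] out, int from, int to) {
            for (int i = from; i < to; ++i) {
                out[i] = f.apply(in[i]);
            }
        }
    }

    /** Multi-threaded map: each thread runs the sequential loop on its own chunk. */
    static class MapThreads extends Map {
        final int numThreads;

        MapThreads(FloatFunction f, int numThreads) {
            super(f);
            this.numThreads = numThreads;
        }

        @Override
        float[] apply(float[] in) {
            float[] out = new float[in.length];
            int chunk = (in.length + numThreads - 1) / numThreads; // ceiling division
            List<Thread> threads = new ArrayList<>();
            for (int t = 0; t < numThreads; ++t) {
                int from = Math.min(in.length, t * chunk);
                int to = Math.min(in.length, from + chunk);
                Thread thread = new Thread(() -> mapRange(in, out, from, to));
                threads.add(thread);
                thread.start();
            }
            for (Thread thread : threads) {
                try {
                    thread.join(); // join also makes the writes to out visible here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for map threads", e);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f, 4f, 5f};
        float[] seq = new Map(x -> 2.5f * x).apply(in);
        float[] par = new MapThreads(x -> 2.5f * x, 3).apply(in);
        System.out.println(java.util.Arrays.equals(seq, par)); // prints true
    }
}
```

Because every variant exposes the same apply() entry point, user code stays unchanged when the runtime selects a different backend.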

Runtime Code Generation: Workflow

[Figure: Java source (Map.apply(f)) -> Java bytecode -> Graal VM -> CFG + Dataflow (Graal IR) -> OpenCL Kernel]

Steps labelled in the figure:

1. Type inference
2. IR generation
3. Optimizations
4. Kernel generation

Java bytecode shown in the figure:

    ...
    10: aload_2
    11: iload_3
    12: aload_0
    13: getfield
    16: aaload
    18: invokeinterface #apply
    23: aastore
    24: iinc
    27: iload_3
    ...

OpenCL kernel shown in the figure:

    void kernel(global float* input, global float* output) {
        ...;
        ...;
    }
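To give a feel for the last step only (kernel generation), here is a toy sketch that emits OpenCL C source text for an element-wise function. It is an assumption-laden illustration: it does not use Graal or its IR, and `KernelGenSketch`, `generateMapKernel`, and the `kernelName` and `expr` parameters are hypothetical. The system described in the slides derives the kernel from the Graal IR of the lambda's bytecode instead of from a string.

```java
/** Toy illustration of emitting OpenCL C source text; it does not use Graal. */
final class KernelGenSketch {

    /**
     * Emits OpenCL C source for an element-wise float kernel.
     * kernelName and expr are hypothetical inputs; expr is an OpenCL C
     * expression over the variable x, for example "2.5f * x".
     */
    static String generateMapKernel(String kernelName, String expr) {
        StringBuilder src = new StringBuilder();
        src.append("kernel void ").append(kernelName).append("(\n");
        src.append("    global const float* input,\n");
        src.append("    global float* output,\n");
        src.append("    const int n) {\n");
        src.append("    int i = get_global_id(0);\n");
        src.append("    if (i < n) {\n"); // guard against surplus work-items
        src.append("        float x = input[i];\n");
        src.append("        output[i] = ").append(expr).append(";\n");
        src.append("    }\n");
        src.append("}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        System.out.print(generateMapKernel("mapKernel", "2.5f * x"));
    }
}
```

The generated string could then be handed to an OpenCL binding for compilation at run time, which is the role JOCL plays in the overview figure.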

OpenCL code generated

    double lambda0(float p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }

    kernel void lambdaComputationKernel(
        global float *p0,
        global int *p0_index_data,
        global double *p1,
        global int *p1_index_data) {
        int p0_dim_1 = 0; int p1_dim_1 = 0;
        int gs = get_global_size(0);
        int loop_1 = get_global_id(0);
        for ( ; ; loop_1 += gs) {
            int p0_len_dim_1 = p0_index_data[p0_dim_1];
            bool cond_2 = loop_1 < p0_len_dim_1;
            if (cond_2) {
                float auxVar0 = p0[loop_1];
                double res = lambda0(auxVar0);
                p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
            } else { break; }
        }
    }
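The loop in the generated kernel starts at the work-item's global id and advances by the global size, so a fixed number of work-items can cover an input of any length. The sketch below emulates that loop structure on the CPU. It is an illustration only: `StridedLoopSketch`, `workItem` and `run` are hypothetical names, the work-items run one after another instead of in parallel, and the output offset read from `p1_index_data` is assumed to be zero.

```java
/** Emulates, on the CPU, the strided loop structure of the generated kernel. */
final class StridedLoopSketch {

    /** The user function from the kernel: widen to double and multiply by 2.0. */
    static double lambda0(float p0) {
        return (double) p0 * 2.0;
    }

    /** Runs the loop one work-item would execute, given its global id and the global size. */
    static void workItem(int globalId, int globalSize, float[] p0, double[] p1) {
        for (int i = globalId; i < p0.length; i += globalSize) {
            p1[i] = lambda0(p0[i]);
        }
    }

    /** Runs all work-items sequentially; a GPU would run them in parallel. */
    static double[] run(float[] input, int globalSize) {
        double[] output = new double[input.length];
        for (int id = 0; id < globalSize; ++id) {
            workItem(id, globalSize, input, output);
        }
        return output;
    }

    public static void main(String[] args) {
        // 5 elements, 2 work-items: item 0 handles indices 0, 2, 4 and item 1 handles 1, 3.
        double[] out = run(new float[] {1f, 2f, 3f, 4f, 5f}, 2);
        System.out.println(java.util.Arrays.toString(out)); // prints [2.0, 4.0, 6.0, 8.0, 10.0]
    }
}
```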

Investigation of runtime for BS

Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]

[Figure: stacked bar showing the share of total runtime (y-axis from 0.0 to 1.0, labelled "Amount of total runtime in %") taken by each phase: Java overhead, Marshaling, CopyToGPU, GPU Execution, CopyToCPU, Unmarshaling]

• Un/marshaling data takes up to 90% of the time

• The computation step should be dominant
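The cost comes from converting between boxed Java objects and the flat primitive arrays a device can consume. The following is a minimal sketch of such a marshal/unmarshal pair, assuming a Float[] input and a Tuple2<Float, Float> result per element as in the benchmark signature. `MarshalSketch`, its tiny `Tuple2` stand-in and the interleaved result layout are assumptions made for illustration, not the layout used by the authors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the marshal and unmarshal steps around a device call. */
final class MarshalSketch {

    /** Minimal stand-in for the Tuple2 type used on the slides. */
    static final class Tuple2<A, B> {
        final A first;
        final B second;

        Tuple2(A first, B second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Marshal: unbox every Float into a flat primitive array that can be copied to a device. */
    static float[] marshal(Float[] boxed) {
        float[] flat = new float[boxed.length];
        for (int i = 0; i < boxed.length; ++i) {
            flat[i] = boxed[i]; // one unboxing per element
        }
        return flat;
    }

    /** Unmarshal: rebuild one Tuple2 object per element from two interleaved floats. */
    static List<Tuple2<Float, Float>> unmarshal(float[] interleaved) {
        List<Tuple2<Float, Float>> result = new ArrayList<>(interleaved.length / 2);
        for (int i = 0; i + 1 < interleaved.length; i += 2) {
            // One tuple allocation and two boxings per element.
            result.add(new Tuple2<Float, Float>(interleaved[i], interleaved[i + 1]));
        }
        return result;
    }

    public static void main(String[] args) {
        float[] flat = marshal(new Float[] {1f, 2f, 3f});
        List<Tuple2<Float, Float>> tuples = unmarshal(new float[] {1f, 2f, 3f, 4f});
        System.out.println(flat.length + " floats, " + tuples.size() + " tuples");
    }
}
```

Both loops touch every element and allocate or unbox objects, which is work proportional to the input size that happens entirely on the host.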

This is not acceptable. Can we do better?

Custom Array Type

[Figure: two views of PArray<Tuple2<Float,Double>>. Programmer's view: an array of Tuple2 elements, each holding a float and a double, at indices 0, 1, 2, ..., n-1. Graal-OCL VM view: one FloatBuffer holding all float fields and one DoubleBuffer holding all double fields, each indexed 0, 1, 2, ..., n-1.]

With this layout, un/marshal operations are not necessary
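A minimal sketch of this layout, assuming one primitive buffer per tuple field, is shown below. It is not the PArray implementation: `TupleArraySketch`, `FloatDoublePair` and the accessor names are hypothetical, and heap buffers from `FloatBuffer.allocate` are used for simplicity where a real runtime might prefer buffers that the OpenCL binding can read directly.

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

/** Hypothetical sketch of a tuple array stored as one primitive buffer per field. */
final class TupleArraySketch {

    /** Minimal stand-in for Tuple2<Float, Double>. */
    static final class FloatDoublePair {
        final float first;
        final double second;

        FloatDoublePair(float first, double second) {
            this.first = first;
            this.second = second;
        }
    }

    private final FloatBuffer firsts;   // all first fields, contiguous
    private final DoubleBuffer seconds; // all second fields, contiguous

    TupleArraySketch(int size) {
        this.firsts = FloatBuffer.allocate(size);
        this.seconds = DoubleBuffer.allocate(size);
    }

    /** Stores the tuple's fields straight into the primitive buffers. */
    void put(int index, FloatDoublePair value) {
        firsts.put(index, value.first);
        seconds.put(index, value.second);
    }

    /** Rebuilds a tuple object on demand; the buffers remain the source of truth. */
    FloatDoublePair get(int index) {
        return new FloatDoublePair(firsts.get(index), seconds.get(index));
    }

    int size() {
        return firsts.capacity();
    }

    /** The buffers can be handed to a device copy as they are, with no marshaling pass. */
    FloatBuffer firstField() { return firsts; }

    DoubleBuffer secondField() { return seconds; }

    public static void main(String[] args) {
        TupleArraySketch array = new TupleArraySketch(3);
        for (int i = 0; i < array.size(); ++i) {
            array.put(i, new FloatDoublePair((float) i, i + 2.0));
        }
        System.out.println(array.get(1).first + ", " + array.get(1).second); // prints 1.0, 3.0
    }
}
```

The programmer still reads and writes tuples, while the data already sits in the flat form that is copied to the device.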

Example of JPAI

    ArrayFunc<Tuple2<Float, Double>, Double> f =
        new MapFunction<>(t -> 2.5f * t._1() + t._2());

    PArray<Tuple2<Float, Double>> input = new PArray<>(size);

    for (int i = 0; i < size; ++i) {
        input.put(i, new Tuple2<>((float) i, (double) i + 2));
    }

    PArray<Double> output = f.apply(input);

Setup

• 5 Applications

• Comparison with:
    • Java Sequential - Graal compiled code
    • AMD and Nvidia GPUs
    • Java Array vs. Custom PArray
    • Java threads

Java Threads Execution

[Figure: bar chart of speedup vs. Java sequential (y-axis from 0 to 6) for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large inputs; one bar per number of Java threads: 1, 2, 4, 8, 16]

CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

OpenCL GPU Execution

[Figure: bar chart of speedup vs. Java sequential (y-axis ticks 0.1, 1, 10, 100, 1000) for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large inputs; series: Nvidia Marshalling, Nvidia Optimized, AMD Marshalling, AMD Optimized; two bars are labelled 0.004]

GPUs: AMD Radeon R9 295; NVIDIA GeForce GTX Titan Black

OpenCL GPU Execution

[Figure: bar chart of speedup vs. Java sequential (y-axis ticks 0.1, 1, 10, 100, 1000) for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo, each with small and large inputs; series: Nvidia Marshalling, Nvidia Optimized, AMD Marshalling, AMD Optimized; two bars are labelled 0.004; annotations highlight speedups of 10x, 12x and 70x]

GPUs: AMD Radeon R9 295; NVIDIA GeForce GTX Titan Black

.zip(Conclusions).map(Future)

Present

• We have presented an API to enable heterogeneous computing in Java

• Custom array type to reduce overheads when transferring the data

• Runtime system to run heterogeneous applications within Java

Future

• Runtime data type specialization

• Code generation for multiple devices

• Runtime scheduling (Where is the best place to run the code?)

Thanks so much for your attention

This work was supported by a grant from:

Juan Jose [email protected]
