OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Xavier Martorell
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya

NVIDIA Global Technology Conference (GTC'13), San José, California, March 18-21, 2013

OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators

Motivation

• Variety of accelerators
– GPUs
– APUs
– MIC
• Productivity is low, due to the different programming languages
– Programmers need effective solutions
• OpenACC, OpenMP, OmpSs

(Figure: NVIDIA K20, AMD Fusion, Intel Xeon Phi)

Cholesky

• Decomposes an N×N symmetric positive definite matrix A as A = LU
– L is lower triangular
– U is upper triangular, and L = U^T

    void Cholesky( float **A, int nb, int bs )
    {
      int i, j, k;
      for (k=0; k<nb; k++) {
        potrf_tile (A[k*nb+k], bs);
        for (i=k+1; i<nb; i++)
          trsm_tile (A[k*nb+k], A[k*nb+i], bs);

        for (i=k+1; i<nb; i++) {
          for (j=k+1; j<i; j++)
            gemm_tile (A[k*nb+i], A[k*nb+j], A[j*nb+i], bs);
          syrk_tile (A[k*nb+i], A[i*nb+i], bs);
        }
      }
    }

(Figure: matrix A split into NB x NB blocks of BS x BS elements)
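As a point of reference for what the tiled routine computes, here is a minimal unblocked sketch in C. It is not code from the slides: the names `cholesky_ref` and `sqrt_newton`, the row-major layout, and the in-place convention are assumptions made for illustration. It overwrites the lower triangle of the input with L such that A = L·L^T, which matches the slide's A = LU with U = L^T.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: Newton iteration for the square root, used only to
 * keep this sketch free of a libm dependency. Expects d > 0. */
static double sqrt_newton(double d)
{
    double x = d > 1.0 ? d : 1.0;          /* start at or above sqrt(d) */
    for (int it = 0; it < 100; it++) {     /* enough for float-range inputs */
        double next = 0.5 * (x + d / x);
        if (next == x)
            break;
        x = next;
    }
    return x;
}

/* Hypothetical reference routine: unblocked Cholesky of a row-major
 * n x n symmetric positive definite matrix. On return the lower triangle
 * of a holds L with A = L * L^T; the strict upper triangle is untouched.
 * Returns 0 on success, -1 if a non-positive pivot is met. */
int cholesky_ref(float *a, int n)
{
    assert(a != NULL && n > 0);
    for (int j = 0; j < n; j++) {
        double d = a[j*n + j];
        for (int k = 0; k < j; k++)
            d -= (double)a[j*n + k] * a[j*n + k];
        if (d <= 0.0)
            return -1;                     /* not positive definite */
        a[j*n + j] = (float)sqrt_newton(d);

        for (int i = j + 1; i < n; i++) {
            double s = a[i*n + j];
            for (int k = 0; k < j; k++)
                s -= (double)a[i*n + k] * a[j*n + k];
            a[i*n + j] = (float)(s / a[j*n + j]);
        }
    }
    return 0;
}
```

For the 2 x 2 matrix [[4, 2], [2, 5]] this yields L = [[2, 0], [1, 2]]. The tiled version on the slide applies the same steps to whole BS x BS blocks instead of single elements.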

OpenMP tasks @SMPs

    for (k = 0; k < nb; k++) {
      potrf_tile (Ah[k*nb + k], bs);
      for (i = k + 1; i < nb; i++) {
        #pragma omp task shared(Ah)
        trsm_tile(Ah[k*nb + k], Ah[k*nb + i], bs);
      }
      #pragma omp taskwait
      for (i = k + 1; i < nb; i++) {
        for (j = k + 1; j < i; j++) {
          #pragma omp task shared(Ah)
          gemm_tile(Ah[k*nb + i], Ah[k*nb + j], Ah[j*nb + i], bs);
        }
        #pragma omp task shared(Ah)
        syrk_tile(Ah[k*nb + i], Ah[i*nb + i], bs);
      }
      #pragma omp taskwait
    }

• OpenMP tasks need taskwaits, which limit parallelism

OpenMP tasks @SMPs

- OpenMP taskwait
- Imbalance
- Idle time (light blue in trace)
- How can we also exploit heterogeneity?

(Figure: Paraver visualization trace)

Outline

• Motivation
• Cholesky @OmpSs
– Evaluation on SMP and GPUs
• OmpSs @OpenCL&CUDA
– Characteristics
– Evaluation
– Coding applications
• Conclusions & future work

Cholesky @OmpSs

    #pragma omp target device (smp) copy_deps
    #pragma omp task inout([bs*bs]A)
    void potrf_tile(REAL *A, int bs);

    #pragma omp target device (cuda) implements (potrf_tile) copy_deps
    #pragma omp task inout([bs*bs]A)
    void potrf_tile_gpu(REAL *A, int bs);

• OmpSs
– User functions can also be specified as tasks
– Data directionality hints
  • Compiler generates information for the runtime
  • Runtime performs required data transfers
– Provide implementations for different targets

Cholesky @OmpSs

    #pragma omp target device (smp) copy_deps
    #pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
    void gemm_tile (REAL *A, REAL *B, REAL *C, int bs);

    #pragma omp target device (cuda) implements (gemm_tile) copy_deps
    #pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
    void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs);

    void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs)
    {
      unsigned char TR = 'T', NT = 'N';
      REAL DONE = 1.0, DMONE = -1.0;

      cudaStream_t stream = nanos_get_kernel_execution_stream();
      cublasSetKernelStream(stream);
      cublasSgemm (NT, TR, bs, bs, bs, DMONE, A, bs, B, bs, DONE, C, bs);
    }

✔ Leveraging the use of existing kernels
✔ CUDA, CUBLAS, OpenCL
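The `cublasSgemm` call above, with `'N'`, `'T'`, alpha = -1 and beta = 1, computes C = C - A·B^T on column-major tiles with leading dimension bs. A minimal CPU sketch of the same update follows; the name `gemm_tile_ref` is hypothetical, and the slides do not show how the SMP version of `gemm_tile` is actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for the tile update performed by the
 * cublasSgemm call on the slide: C = C - A * B^T.
 * All tiles are bs x bs, column-major, leading dimension bs. */
void gemm_tile_ref(const float *A, const float *B, float *C, int bs)
{
    assert(A != NULL && B != NULL && C != NULL && bs > 0);
    for (int j = 0; j < bs; j++) {
        for (int i = 0; i < bs; i++) {
            float acc = 0.0f;
            for (int k = 0; k < bs; k++)
                acc += A[i + k*bs] * B[j + k*bs];   /* A(i,k) * B(j,k) */
            C[i + j*bs] -= acc;
        }
    }
}
```

A routine like this can serve to check a GPU tile result against the CPU on small block sizes.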

Cholesky @OmpSs

    void cholesky(REAL** A, int nb, int bs)
    {
      int i, j, k;
      for (k = 0; k < nb; k++) {
        // Diagonal block factorization
        potrf_tile(A[k*nb + k], bs);                       // spawn

        // Triangular systems
        for (i = k + 1; i < nb; i++) {
          trsm_tile(A[k*nb + k], A[k*nb + i], bs);         // spawn
        }

        // Update trailing matrix
        for (i = k + 1; i < nb; i++) {
          for (j = k + 1; j < i; j++) {
            gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);  // spawn
          }
          syrk_tile(A[k*nb + i], A[i*nb + i], bs);         // spawn
        }
      }
    }

(Figure: matrix A split into NB x NB blocks of BS x BS elements)
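To get a feel for how many tasks the loop nest above spawns, the following sketch walks the same iteration space and counts the calls per kernel. The struct and function names are hypothetical; only the loop bounds are taken from the slide.

```c
#include <assert.h>

/* Hypothetical container for per-kernel task counts. */
typedef struct {
    long potrf;
    long trsm;
    long gemm;
    long syrk;
} task_counts;

/* Walks the same loop nest as cholesky() on the slide and counts how many
 * times each tile kernel would be spawned for an nb x nb block matrix. */
task_counts count_cholesky_tasks(int nb)
{
    assert(nb >= 0);
    task_counts c = {0, 0, 0, 0};
    for (int k = 0; k < nb; k++) {
        c.potrf++;
        for (int i = k + 1; i < nb; i++)
            c.trsm++;
        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j < i; j++)
                c.gemm++;
            c.syrk++;
        }
    }
    return c;
}
```

For nb = 4 this gives 4 potrf, 6 trsm, 4 gemm and 6 syrk tasks. In general the counts are nb, nb(nb-1)/2, nb(nb-1)(nb-2)/6 and nb(nb-1)/2, so gemm tasks dominate as nb grows.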

Cholesky @SMPs

• Task dependences allow starting iterations earlier

✔ Exploitation of critical path
✔ OmpSs 14% faster

(Figure: execution traces, OpenMP tasks versus OmpSs tasks with deps)

Cholesky @OpenMP 4.0

    void cholesky(REAL** A, int nb, int bs)
    {
      int i, j, k;
      for (k = 0; k < nb; k++) {
        #pragma omp task depend (inout: A[k*nb + k])
        potrf_tile(A[k*nb + k], bs);

        for (i = k + 1; i < nb; i++) {
          #pragma omp task depend (in: A[k*nb + k]) \
                           depend (inout: A[k*nb + i])
          trsm_tile(A[k*nb + k], A[k*nb + i], bs);
        }
        for (i = k + 1; i < nb; i++) {
          for (j = k + 1; j < i; j++) {
            #pragma omp task depend (in: A[k*nb + i], A[k*nb + j]) \
                             depend (inout: A[j*nb + i])
            gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);
          }
          #pragma omp task depend (in: A[k*nb + i]) \
                           depend (inout: A[i*nb + i])
          syrk_tile(A[k*nb + i], A[i*nb + i], bs);
        }
      }
    }

(Figure: matrix A split into NB x NB blocks of BS x BS elements)
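The `depend` clauses can be tried in isolation with a much smaller program. The sketch below is not from the slides; the function name and values are made up for illustration. It chains three tasks through `in`/`out` dependences, so the runtime orders them without any taskwait between them. Built without OpenMP support, the pragmas are ignored and the statements simply run in program order, which gives the same result.

```c
#include <assert.h>

/* Hypothetical example: three tasks ordered only by their depend clauses.
 * The dependence chain x -> y -> z forces the same order as serial code. */
int run_dependent_tasks(void)
{
    int x = 0, y = 0, z = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x) shared(x)
        x = 20;

        #pragma omp task depend(in: x) depend(out: y) shared(x, y)
        y = x + 1;

        #pragma omp task depend(in: y) depend(out: z) shared(y, z)
        z = y * 2;

        /* Wait for the whole chain before leaving the single region. */
        #pragma omp taskwait
    }

    assert(x == 20 && y == 21);
    return z;
}
```

The single taskwait at the end plays the role of the final synchronization; the taskwaits between phases in the OpenMP 3.x version are replaced by the dependences.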

Execution environment

• Languages: C/C++/Fortran
• Mercurium compiler 2.0
• Nanos++ 1.0
• gcc 4.6
• Nvidia CUDA 4.1

Execution environment

• Single node on MinoTauro@BSC
– 2x six-core Intel Xeon E5649 (12 cores)
  • 2.53 GHz, 12 MB L3
– 2x Nvidia Tesla M2090 GPUs
  • 6 Gbytes global memory each
– 24 Gbytes RAM

Cholesky @GPUs

(Figure: two plots of Gflop/s versus number of cores (1, 2, 4, 8, 12); series: OpenMP tasks, OmpSs tasks, OmpSs+1 GPU, OmpSs+2 GPUs)

• 4096 x 4096 matrix; block size CPU: 256 x 256, GPU: 512 x 512
• 16384 x 16384 matrix; block size CPU: 256 x 256, GPU: 4096 x 4096
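The plots report Gflop/s, but the slides do not say which operation count was used. A minimal sketch, assuming the conventional n³/3 floating-point operations for a Cholesky factorization of an n x n matrix, is given below; the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: converts a measured run time into Gflop/s,
 * ASSUMING the usual n^3 / 3 operation count for Cholesky. The slides
 * do not state which count the authors used. */
double cholesky_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = (double)n * (double)n * (double)n / 3.0;
    return flops / seconds / 1e9;
}
```

Under this assumption, a 3000 x 3000 factorization that takes 9 seconds runs at 1 Gflop/s.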

Cholesky @large SMPs

(Figure: Gflop/s versus number of cores (16, 24, 32, 40, 48); series: OpenMP tasks, OmpSs tasks, Peak)

• 40960 x 40960 matrix; block size CPU: 640 x 640
• On AMD Opteron 6172, 2.1 GHz

✔ Exploitation of critical path
✔ OmpSs 65% faster at 48 cores


OmpSs @OpenCL&CUDA

• Thread-pool model
– SMP threads
– Device representative thread
• Tasks labeled with "target"
– smp
– opencl
– cuda
– combinations of the previous

OmpSs @OpenCL applications

Application   Characteristics                  Lines of host code         API calls/directives
                                               (OpenCL / CUDA / OmpSs)    (OpenCL / CUDA / OmpSs)
Matmul        8192x8192 (1024x1024)            292 / 240 / 133            31 / 14 / 3
Julia Set     512x512, 50 frames               943 / 825 / 770            30 / 11 / 5
Krist         1000 atoms, 10000 reflections    446 / 342 / 280            30 / 15 / 3
NBody         MPI+OmpSs, 65536 particles       922 / 800 / 798            26 / 7 / 3

✔ Shorter to write than OpenCL
✔ Fewer lines of code
✔ Fewer API calls / directives

OmpSs @CUDA applications

✔ Speedup compared to hand-coded CUDA
✔ Competitive performance

(Figure: speedup, on a 0 to 4.5 axis, for Matmul, Julia, Krist, NBody and NBody MPI; series: CUDA 2 GPUs, OmpSs 2 GPUs, CUDA 4 GPUs, OmpSs 4 GPUs)

OmpSs @OpenCL applications

✔ Speedup compared to hand-coded OpenCL
✔ OmpSs also competitive

(Figure: speedup, on a 0 to 4.5 axis, for Matmul, Julia, Krist, NBody and NBody MPI; series: OpenCL 2 GPUs, OmpSs 2 GPUs, OpenCL 4 GPUs, OmpSs 4 GPUs)

Matmul

• matmul_block function provides OpenCL kernel
– Configured as 2-dimensions on blocks of NBxNB
  • Local work group size as BL_SIZE x BL_SIZE
• Data transfers scheduled automatically by Nanos++

    #pragma omp target device(opencl) ndrange(2, BS, BS, BL_SIZE, BL_SIZE) copy_deps
    #pragma omp task depend (in: A[0:BS*BS], B[0:BS*BS]) depend (inout: C[0:BS*BS])
    __kernel void matmul_block (__global REAL * A, __global REAL * B,
                                __global REAL * C, int BS);

✔ 3 additional directives to original benchmark!
✔ Including taskwait at the end

(Figure: DIM x DIM matrix split into blocks of BS x BS elements)
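The `ndrange(2, BS, BS, BL_SIZE, BL_SIZE)` clause describes a two-dimensional launch with BS x BS work-items grouped into BL_SIZE x BL_SIZE work-groups. The slides only show the kernel's declaration, so the sketch below emulates such a launch on the host under two assumptions: each work-item (i, j) computes one element of the block, and the block update is C += A·B (suggested by C being `inout`), with row-major storage. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED kernel body for one work-item (i, j): the slides show only the
 * declaration of matmul_block, not its implementation. */
static void matmul_block_item(const float *A, const float *B, float *C,
                              int BS, int i, int j)
{
    float acc = 0.0f;
    for (int k = 0; k < BS; k++)
        acc += A[i*BS + k] * B[k*BS + j];
    C[i*BS + j] += acc;
}

/* Host-side emulation of a 2-D launch with global size BS x BS and
 * work-groups of BL x BL items: outer loops pick the work-group,
 * inner loops pick the work-item inside it. */
void matmul_block_emulated(const float *A, const float *B, float *C,
                           int BS, int BL)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(BS > 0 && BL > 0 && BS % BL == 0);
    int groups = BS / BL;
    for (int gi = 0; gi < groups; gi++)
        for (int gj = 0; gj < groups; gj++)
            for (int li = 0; li < BL; li++)
                for (int lj = 0; lj < BL; lj++)
                    matmul_block_item(A, B, C, BS,
                                      gi*BL + li, gj*BL + lj);
}
```

On a device, the work-items of one group run concurrently and the loop order carries no meaning; the emulation only shows how global indices are derived from group and local indices.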

Nbody

• calculate_force_func is the kernel

    #pragma omp target device(opencl) ndrange(1,size,MAX_NUM_THREADS) \
                copy_in(d_particles[0;number_of_particles]) \
                copy_out([size] out)
    #pragma omp task out([size] out) \
                in(d_particles[0*size;size], \
                   d_particles[1*size;size], \
                   d_particles[2*size;size], \
                   d_particles[3*size;size], \
                   d_particles[4*size;size], \
                   d_particles[5*size;size], \
                   d_particles[6*size;size], \
                   d_particles[7*size;size] )
    __kernel void calculate_force_func(int size, float time_interval,
                                       int number_of_particles,
                                       __global Particle* d_particles,
                                       __global Particle *out,
                                       int first_local, int last_local);

(Figure: d_particles and out arrays)

OmpSs Syntax

Extending OpenMP semantics

• Creating, managing tasks & deps

    #pragma omp target device ({ smp | cuda | opencl }) [ file (filename.cl) ] \
                ndrange ( ndim, global_vals, local_vals ) \
                [ implements ( function_name )] \
                { copy_deps | [ copy_in ( array_spec ,...)] [ copy_out (...)] [ copy_inout (...)] }

    #pragma omp task [ input (...)] [ output (...)] [ inout (...)] \
                [ concurrent (…)] [commutative (…)] \
                [ priority (p) ]
    {code block or function}

• Task synchronization

    #pragma omp taskwait [on (...)] [noflush]

Implements feature

• One may have several implementations of
– same functionality, on
– heterogeneous devices

    SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
      integer :: i, n
      integer :: a(n), b(n), res(n)

      do i = 1, N
        res(i) = a(i) + b(i)
      end do
    END SUBROUTINE

    __kernel void vec_sum(int n, __global int* a,
                          __global int* b, __global int* res)
    {
      const int idx = get_global_id(0);

      if (idx < n) {
        res[idx] = a[idx] + b[idx];
      }
    }

Implements feature

• Annotate both interfaces, with the second implementing the first
– Let the runtime system schedule both on the available resources

    INTERFACE
      !$OMP TARGET DEVICE(SMP)
      !$OMP TASK IN(A, B) OUT(RES)
      SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
      …

      !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(vec_sum.cl) &
      IMPLEMENTS(VEC_SUM_SMP)
      !$OMP TASK IN(A, B) OUT(RES)
      SUBROUTINE VEC_SUM(N, A, B, RES)
      …
    END INTERFACE
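One way to picture what `implements` enables is a table of interchangeable versions of the same function, from which a scheduler picks one whose device is free. The sketch below is only a conceptual model in plain C, not how Nanos++ is implemented; every name in it (`impl_entry`, `pick_impl`, `vec_sum_accel`) is hypothetical, and the "accelerator" version is an ordinary CPU loop standing in for the OpenCL kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*vec_sum_fn)(int n, const int *a, const int *b, int *res);

/* Hypothetical record: one implementation of a task and the state of the
 * device it targets. */
typedef struct {
    const char *device;
    vec_sum_fn  fn;
    int         busy;
} impl_entry;

/* SMP version, equivalent to VEC_SUM_SMP on the slide. */
void vec_sum_smp(int n, const int *a, const int *b, int *res)
{
    for (int i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}

/* Stand-in for the OpenCL version; here just another CPU loop. */
void vec_sum_accel(int n, const int *a, const int *b, int *res)
{
    for (int i = n - 1; i >= 0; i--)
        res[i] = a[i] + b[i];
}

/* Returns the index of the first implementation whose device is idle,
 * or -1 if every device is busy. A real scheduler would use richer
 * criteria; this is the simplest possible policy. */
int pick_impl(const impl_entry *impls, int count)
{
    assert(impls != NULL && count >= 0);
    for (int i = 0; i < count; i++)
        if (!impls[i].busy)
            return i;
    return -1;
}
```

With a table holding a busy "smp" entry and an idle "opencl" entry, `pick_impl` returns the second entry and the caller invokes its function pointer; both versions produce the same result, which is the contract `implements` relies on.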

Conclusions

• OmpSs programming model
– SMPs, GPUs (CUDA & OpenCL)
– C/C++/Fortran
• Shown OmpSs ease of use
• Performance comparable to hand-tuned code
– Still, scalability on GPUs needs improvement
– Support for constant and texture hardware features needs to be included in OmpSs

Future work

• Keep pushing for these extensions to be included in the OpenMP standard
• Include support for GPU memory types
– Constant, texture
• Building OmpSs @FPGAs
– Collaboration with Xilinx Dublin Research Lab
• Interoperate with graphics rendering in OpenGL
• Work with Mont-Blanc and DEEP applications

Acknowledgments

• Parallel programming group @BSC
• Encore/Mont-Blanc/DEEP projects
– European Commission
• Spanish Ministry of Education
• Catalan Government

OmpSs available at Barcelona Supercomputing Center: http://pm.bsc.es
