Xavier Martorell Barcelona Supercomputing Center and Universitat Politècnica de Catalunya NVIDIA Global Technology Conference (GTC'13) San José, California March 18-21, 2013 OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators

Jul 03, 2022

Page 1: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Xavier Martorell Barcelona Supercomputing Center

and Universitat Politècnica de Catalunya

NVIDIA Global Technology Conference (GTC'13)
San José, California
March 18-21, 2013

OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators

Page 2: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Motivation

• Variety of accelerators
  – GPUs (NVIDIA K20)
  – APUs (AMD Fusion)
  – MIC (Intel Xeon Phi)

• Productivity is low, due to the different programming languages
  – Programmers need effective solutions

• OpenACC, OpenMP, OmpSs

Page 3: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky

• Decomposes an N×N symmetric positive definite matrix A as: A = LU
  – L is lower triangular
  – U is upper triangular, and L = U^T
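As a small worked instance of the definition above (the 2×2 matrix is chosen here only for illustration and does not come from the slides), the factors can be checked by hand: multiplying L by its transpose gives back A.

```latex
\[
A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}}_{L}
    \underbrace{\begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}}_{U = L^{T}}
\]
```

The tiled code on this slide applies the same idea block by block: potrf_tile factors a diagonal block, and trsm_tile, gemm_tile, and syrk_tile update the remaining blocks.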

void Cholesky( float **A, int nb, int bs )
{
    int i, j, k;
    for (k=0; k<nb; k++) {
        potrf_tile (A[k*nb+k], bs);
        for (i=k+1; i<nb; i++)
            trsm_tile (A[k*nb+k], A[k*nb+i], bs);

        for (i=k+1; i<nb; i++) {
            for (j=k+1; j<i; j++)
                gemm_tile (A[k*nb+i], A[k*nb+j], A[j*nb+i], bs);
            syrk_tile (A[k*nb+i], A[i*nb+i], bs);
        }
    }
}

[Figure: matrix A divided into NB × NB blocks of BS × BS elements each]

Page 4: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OpenMP tasks @SMPs

for (k = 0; k < nb; k++) {
    potrf_tile (Ah[k*nb + k], bs);
    for (i = k + 1; i < nb; i++) {
#pragma omp task shared(Ah)
        trsm_tile(Ah[k*nb + k], Ah[k*nb + i], bs);
    }
#pragma omp taskwait
    for (i = k + 1; i < nb; i++) {
        for (j = k + 1; j < i; j++) {
#pragma omp task shared(Ah)
            gemm_tile(Ah[k*nb + i], Ah[k*nb + j], Ah[j*nb + i], bs);
        }
#pragma omp task shared(Ah)
        syrk_tile(Ah[k*nb + i], Ah[i*nb + i], bs);
    }
#pragma omp taskwait
}

• OpenMP tasks require taskwaits, which limit parallelism

Page 5: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OpenMP tasks @SMPs

- OpenMP taskwait
- Imbalance
- Idle time (light blue in trace)

- How can we also exploit heterogeneity?

[Figure: Paraver visualization trace]

Page 6: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Outline

• Motivation

• Cholesky @OmpSs
  – Evaluation on SMP and GPUs

• OmpSs @OpenCL&CUDA
  – Characteristics
  – Evaluation
  – Coding applications

• Conclusions & future work

Page 7: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @OmpSs

#pragma omp target device (smp) copy_deps
#pragma omp task inout([bs*bs]A)
void potrf_tile(REAL *A, int bs);

#pragma omp target device (cuda) implements (potrf_tile) copy_deps
#pragma omp task inout([bs*bs]A)
void potrf_tile_gpu(REAL *A, int bs);

• OmpSs
  – User functions can also be specified as tasks
  – Data directionality hints
    • Compiler generates information for the runtime
    • Runtime performs required data transfers
  – Provide implementations for different targets

Page 8: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @OmpSs

#pragma omp target device (smp) copy_deps
#pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
void gemm_tile (REAL *A, REAL *B, REAL *C, int bs);

#pragma omp target device (cuda) implements (gemm_tile) copy_deps
#pragma omp task input([bs*bs]A, [bs*bs]B) inout([bs*bs]C)
void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs);

void gemm_tile_gpu(REAL *A, REAL *B, REAL *C, int bs)
{
    unsigned char TR = 'T', NT = 'N';
    REAL DONE = 1.0, DMONE = -1.0;

    cudaStream_t stream = nanos_get_kernel_execution_stream();
    cublasSetKernelStream(stream);
    cublasSgemm (NT, TR, bs, bs, bs, DMONE, A, bs, B, bs, DONE, C, bs);
}

✔ Leveraging the use of existing kernels
✔ CUDA, CUBLAS, OpenCL
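To make the cublasSgemm call on this slide easier to read, here is a minimal host-side sketch of what it computes for one tile: with alpha = -1, beta = 1, no transpose on A and a transpose on B, the update is C = C - A·Bᵀ on column-major data. The function name gemm_tile_ref is hypothetical and REAL is assumed to be float; this is a plain reference loop, not the code used in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference for one tile update (not from the slides).
 * Tiles are bs x bs, stored column-major (the cuBLAS convention) with
 * leading dimension bs. Computes C <- C - A * B^T. */
void gemm_tile_ref(const float *A, const float *B, float *C, int bs)
{
    assert(A != NULL && B != NULL && C != NULL && bs > 0);
    for (int j = 0; j < bs; j++) {
        for (int i = 0; i < bs; i++) {
            float acc = 0.0f;
            for (int k = 0; k < bs; k++) {
                /* A(i,k) * B^T(k,j) = A(i,k) * B(j,k) */
                acc += A[i + k * bs] * B[j + k * bs];
            }
            C[i + j * bs] -= acc;
        }
    }
}
```

A loop like this can serve as a correctness check for the SMP and GPU versions of the task, since both are expected to produce the same tile.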

Page 9: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @OmpSs

[Figure: matrix A divided into NB × NB blocks of BS × BS elements each]

void cholesky(REAL** A, int nb, int bs)
{
    int i, j, k;
    for (k = 0; k < nb; k++) {
        // Diagonal block factorization
        potrf_tile(A[k*nb + k], bs);                                 // spawn

        // Triangular systems
        for (i = k + 1; i < nb; i++) {
            trsm_tile(A[k*nb + k], A[k*nb + i], bs);                 // spawn
        }

        // Update trailing matrix
        for (i = k + 1; i < nb; i++) {
            for (j = k + 1; j < i; j++) {
                gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs); // spawn
            }
            syrk_tile(A[k*nb + i], A[i*nb + i], bs);                 // spawn
        }
    }
}

Page 10: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @SMPs

• Task dependences allow starting iterations earlier

✔ Exploitation of Critical Path
✔ OmpSs 14% faster

[Figure: execution traces comparing OpenMP tasks with OmpSs tasks with deps]
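The effect of replacing taskwaits with dependences can be illustrated with a toy model. The sketch below is not Nanos++ and makes strong assumptions: every task takes one time step, tasks are considered in creation order, and only the last writer of each tile is tracked. All names (makespan_barrier, makespan_deps, and the helpers) are hypothetical. It builds the task list of the tiled Cholesky loop nest and counts the steps needed on a given number of cores, once with a barrier between phases and once with tasks released as soon as their producers finish.

```c
#include <assert.h>

#define MAX_NB    8
#define MAX_TASKS 128   /* nb = 8 creates 8 + 28 + 28 + 56 = 120 tasks */

static int n_tasks;
static int pred[MAX_TASKS][3];          /* producer task ids, -1 if none */
static int phase_of[MAX_TASKS];         /* taskwait-delimited phase      */
static int last_writer[MAX_NB][MAX_NB]; /* last task writing each tile   */

/* Producer of tile (r, c), or -1 when r < 0 (unused operand slot). */
static int producer(int r, int c) { return r < 0 ? -1 : last_writer[r][c]; }

/* Record a task reading tiles (r1,c1), (r2,c2) and updating tile (ro,co). */
static void add_task(int phase, int r1, int c1, int r2, int c2, int ro, int co)
{
    assert(n_tasks < MAX_TASKS);
    pred[n_tasks][0] = producer(r1, c1);
    pred[n_tasks][1] = producer(r2, c2);
    pred[n_tasks][2] = producer(ro, co);  /* inout: wait for previous writer */
    phase_of[n_tasks] = phase;
    last_writer[ro][co] = n_tasks++;
}

/* Create the tasks of the tiled Cholesky loop nest in program order. */
static void build(int nb)
{
    assert(nb >= 1 && nb <= MAX_NB);
    int phase = 0;
    n_tasks = 0;
    for (int r = 0; r < MAX_NB; r++)
        for (int c = 0; c < MAX_NB; c++)
            last_writer[r][c] = -1;
    for (int k = 0; k < nb; k++) {
        add_task(phase++, -1, -1, -1, -1, k, k);       /* potrf */
        if (k + 1 == nb) break;
        for (int i = k + 1; i < nb; i++)
            add_task(phase, k, k, -1, -1, k, i);       /* trsm  */
        phase++;
        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j < i; j++)
                add_task(phase, k, i, k, j, j, i);     /* gemm  */
            add_task(phase, k, i, -1, -1, i, i);       /* syrk  */
        }
        phase++;
    }
}

/* Steps needed when a taskwait separates consecutive phases. */
int makespan_barrier(int nb, int cores)
{
    assert(cores >= 1);
    build(nb);
    int steps = 0;
    for (int t = 0; t < n_tasks; ) {
        int count = 0, ph = phase_of[t];
        while (t < n_tasks && phase_of[t] == ph) { t++; count++; }
        steps += (count + cores - 1) / cores;
    }
    return steps;
}

/* Steps needed when a task starts as soon as its producers are done. */
int makespan_deps(int nb, int cores)
{
    assert(cores >= 1);
    build(nb);
    int finish[MAX_TASKS] = {0};   /* 0 = not run yet */
    int done = 0, step = 0;
    while (done < n_tasks) {
        step++;
        int used = 0;
        for (int t = 0; t < n_tasks && used < cores; t++) {
            if (finish[t] != 0) continue;
            int ready = 1;
            for (int p = 0; p < 3; p++) {
                int q = pred[t][p];
                if (q >= 0 && (finish[q] == 0 || finish[q] >= step)) ready = 0;
            }
            if (ready) { finish[t] = step; used++; done++; }
        }
    }
    return step;
}
```

In this model, a 3×3-block matrix on 2 cores needs 8 steps with barriers and 7 with dependences, because the second potrf can start while the last update of the first iteration is still running. With enough cores both variants need the same number of steps, so the gain comes from filling otherwise idle cores, not from a shorter critical path.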

Page 11: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @OpenMP 4.0

[Figure: matrix A divided into NB × NB blocks of BS × BS elements each]

void cholesky(REAL** A, int nb, int bs)
{
    int i, j, k;
    for (k = 0; k < nb; k++) {
#pragma omp task depend (inout: A[k*nb + k])
        potrf_tile(A[k*nb + k], bs);

        for (i = k + 1; i < nb; i++) {
#pragma omp task depend (in: A[k*nb + k]) \
                 depend (inout: A[k*nb + i])
            trsm_tile(A[k*nb + k], A[k*nb + i], bs);
        }
        for (i = k + 1; i < nb; i++) {
            for (j = k + 1; j < i; j++) {
#pragma omp task depend (in: A[k*nb + i], A[k*nb + j]) \
                 depend (inout: A[j*nb + i])
                gemm_tile(A[k*nb + i], A[k*nb + j], A[j*nb + i], bs);
            }
#pragma omp task depend (in: A[k*nb + i]) \
                 depend (inout: A[i*nb + i])
            syrk_tile(A[k*nb + i], A[i*nb + i], bs);
        }
    }
}
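For readers new to the depend clause, here is a minimal sketch in standard OpenMP 4.0, independent of the Cholesky code; the function name depend_demo and the values are made up for illustration. One task produces x, and two tasks consume it; the runtime orders them from the depend clauses alone. If the file is compiled without OpenMP support, the pragmas are ignored and the statements simply run in program order with the same result.

```c
#include <assert.h>

/* Returns y + z after three tasks that are ordered only by depend clauses. */
int depend_demo(void)
{
    int x = 0, y = 0, z = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: must finish before either consumer starts. */
        #pragma omp task shared(x) depend(out: x)
        x = 21;

        /* Two consumers: both read x and write different variables,
         * so they may run concurrently with each other. */
        #pragma omp task shared(x, y) depend(in: x) depend(out: y)
        y = 2 * x;

        #pragma omp task shared(x, z) depend(in: x) depend(out: z)
        z = x + 1;

        /* Wait for the child tasks before reading the results. */
        #pragma omp taskwait
        assert(y == 42 && z == 22);
    }
    return y + z;
}
```

The only synchronization point is the final taskwait; no barrier is needed between the producer and the consumers, which is the same pattern the Cholesky code uses per tile.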

Page 12: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...


C/C++/Fortran

Execution environment

• Mercurium compiler 2.0

• Nanos++ 1.0

• gcc 4.6

• Nvidia CUDA 4.1

Page 13: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Execution environment

• Single node on MinoTauro@BSC
  – 2x six-core Intel Xeon E5649 (12 cores)
    • 2.53 GHz, 12 MB L3
  – 2x Nvidia Tesla M2090 GPUs
    • 6 Gbytes Global Memory, each
  – 24 Gbytes RAM

Page 14: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @GPUs

[Figure: two plots of Gflop/s versus number of cores (1, 2, 4, 8, 12), each comparing OpenMP tasks, OmpSs tasks, OmpSs+1 GPU, and OmpSs+2 GPUs]

• 4096 x 4096 matrix. Block size CPU: 256 x 256, GPU: 512 x 512
• 16384 x 16384 matrix. Block size CPU: 256 x 256, GPU: 4096 x 4096

Page 15: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Cholesky @large SMPs

[Figure: Gflop/s versus number of cores (16, 24, 32, 40, 48) for OpenMP tasks, OmpSs tasks, and Peak]

• 40960 x 40960 matrix. Block size CPU: 640 x 640
• On AMD Opteron 6172 2.1 GHz

✔ Exploitation of Critical Path
✔ OmpSs 65% faster at 48 cores

Page 16: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Outline

• Motivation

• Cholesky @OmpSs
  – Evaluation on SMP and GPUs

• OmpSs @OpenCL&CUDA
  – Characteristics
  – Evaluation
  – Coding applications

• Conclusions & future work

Page 17: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OmpSs @OpenCL&CUDA

• Thread-pool model
  – SMP threads
  – Device representative thread

• Tasks labeled with "target"
  – smp
  – opencl
  – cuda
  – combinations of the previous

Page 18: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OmpSs @OpenCL applications

Application   Characteristics                  Lines of host code          API calls / directives
                                               (OpenCL / CUDA / OmpSs)     (OpenCL / CUDA / OmpSs)
Matmul        8192x8192 (1024x1024)            292 / 240 / 133             31 / 14 / 3
Julia Set     512x512, 50 frames               943 / 825 / 770             30 / 11 / 5
Krist         1000 atoms, 10000 reflections    446 / 342 / 280             30 / 15 / 3
NBody         MPI+OmpSs, 65536 particles       922 / 800 / 798             26 / 7 / 3

✔ Shorter to write compared to OpenCL
✔ Fewer lines of code
✔ Fewer API calls / directives

Page 19: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OmpSs @CUDA applications

✔ Speedup compared to hand-coded CUDA
✔ Competitive performance

[Figure: speedup for Matmul, Julia, Krist, NBody, and NBody MPI with CUDA 2 GPUs, OmpSs 2 GPUs, CUDA 4 GPUs, and OmpSs 4 GPUs]

Page 20: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

OmpSs @OpenCL applications

✔ Speedup compared to hand-coded OpenCL
✔ OmpSs also competitive

[Figure: speedup for Matmul, Julia, Krist, NBody, and NBody MPI with OpenCL 2 GPUs, OmpSs 2 GPUs, OpenCL 4 GPUs, and OmpSs 4 GPUs]

Page 21: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Matmul

• matmul_block function provides OpenCL kernel
  – Configured as 2-dimensions on blocks of NBxNB
    • Local work group size as BL_SIZE x BL_SIZE

• Data transfers scheduled automatically by Nanos++

#pragma omp target device(opencl) ndrange(2, BS, BS, BL_SIZE, BL_SIZE) copy_deps
#pragma omp task depend (in: A[0:BS*BS], B[0:BS*BS]) depend (inout: C[0:BS*BS])
__kernel void matmul_block (__global REAL * A, __global REAL * B, __global REAL * C, int BS);

✔ 3 additional directives to original benchmark!
✔ Including taskwait at the end

[Figure: DIM × DIM matrix divided into blocks of BS × BS elements]
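The slide only declares the kernel, so the following is a hedged host-side sketch of how a 2-D index space of BS × BS work-items could map onto one block product. It assumes row-major blocks, REAL as float, and that the kernel accumulates C += A·B (suggested by the inout on C, but not confirmed by the slides). The names matmul_block_item and matmul_block_host are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical body of one work-item: computes element (row, col)
 * of C += A * B on row-major bs x bs blocks. */
static void matmul_block_item(const float *A, const float *B, float *C,
                              int bs, int row, int col)
{
    float acc = 0.0f;
    for (int k = 0; k < bs; k++)
        acc += A[row * bs + k] * B[k * bs + col];
    C[row * bs + col] += acc;
}

/* Host emulation of a 2-D index space of bs x bs work-items: the two
 * loops stand in for the two global ids a kernel would query. */
void matmul_block_host(const float *A, const float *B, float *C, int bs)
{
    assert(A != NULL && B != NULL && C != NULL && bs > 0);
    for (int row = 0; row < bs; row++)
        for (int col = 0; col < bs; col++)
            matmul_block_item(A, B, C, bs, row, col);
}
```

Each work-item writes a distinct element of C, so the items are independent of each other; the dependences between whole blocks are what the task directive expresses.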

Page 22: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Nbody

• calculate_force_func is the kernel

#pragma omp target device(opencl) ndrange(1,size,MAX_NUM_THREADS) \
        copy_in(d_particles[0;number_of_particles]) \
        copy_out([size] out)
#pragma omp task out([size] out) \
        in(d_particles[0*size;size], \
           d_particles[1*size;size], \
           d_particles[2*size;size], \
           d_particles[3*size;size], \
           d_particles[4*size;size], \
           d_particles[5*size;size], \
           d_particles[6*size;size], \
           d_particles[7*size;size])
__kernel void calculate_force_func(int size, float time_interval, int number_of_particles,
                                   __global Particle* d_particles, __global Particle *out,
                                   int first_local, int last_local);

[Figure: arrays d_particles and out]

Page 23: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...


OmpSs Syntax

#pragma omp target device ({ smp | cuda | opencl }) [ file (filename.cl) ] \
        ndrange ( ndim, global_vals, local_vals ) \
        [ implements ( function_name )] \
        { copy_deps | [ copy_in ( array_spec ,...)] [ copy_out (...)] [ copy_inout (...)] }

#pragma omp task [ input (...)] [ output (...)] [ inout (...)] \
        [ concurrent (…)] [commutative (…)] \
        [ priority (p) ]
        {code block or function}

#pragma omp taskwait [on (...)] [noflush]

• Creating, managing tasks & deps

• Task synchronization

Extending OpenMP semantics

Page 24: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Implements feature

• One may have several implementations of
  – same functionality, on
  – heterogeneous devices

SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
  integer :: i, n
  integer :: a(n), b(n), res(n)

  do i = 1, N
    res(i) = a(i) + b(i)
  end do
END SUBROUTINE

__kernel void vec_sum(int n, __global int* a, __global int* b, __global int* res)
{
    const int idx = get_global_id(0);

    if (idx < n) {
        res[idx] = a[idx] + b[idx];
    }
}

Page 25: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Implements feature

• Annotate both interfaces, marking the second as implementing the first
  – Let the runtime system schedule both on the available resources

INTERFACE
  !$OMP TARGET DEVICE(SMP)
  !$OMP TASK IN(A, B) OUT(RES)
  SUBROUTINE VEC_SUM_SMP(N, A, B, RES)
  …

  !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(vec_sum.cl) &
        IMPLEMENTS(VEC_SUM_SMP)
  !$OMP TASK IN(A, B) OUT(RES)
  SUBROUTINE VEC_SUM(N, A, B, RES)
  …
END INTERFACE
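The idea behind implements can be sketched in plain C as a table of alternative versions of one task plus a policy that picks among them. Everything below is a toy under stated assumptions: the names (vec_sum_smp, vec_sum_accel, run_vec_sum, device_kind) are hypothetical, the accelerator version is emulated on the host, the index space is assumed to be padded to a multiple of the work-group size 128, and the "first idle device" policy is made up. The slides do not describe the policy Nanos++ actually uses.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*vec_sum_fn)(int n, const int *a, const int *b, int *res);
typedef enum { DEV_SMP = 0, DEV_ACCEL = 1, DEV_COUNT = 2 } device_kind;

/* SMP version: a plain loop, like VEC_SUM_SMP on the slide. */
void vec_sum_smp(int n, const int *a, const int *b, int *res)
{
    for (int i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}

/* Host stand-in for the accelerator version. Each loop iteration plays
 * one work-item; the index space is padded to a multiple of 128 (an
 * assumption), which is why the kernel guards with idx < n. */
void vec_sum_accel(int n, const int *a, const int *b, int *res)
{
    int global_size = (n + 127) / 128 * 128;
    for (int idx = 0; idx < global_size; idx++)
        if (idx < n)
            res[idx] = a[idx] + b[idx];
}

/* All versions registered for the same task. */
static const struct { device_kind dev; vec_sum_fn fn; } vec_sum_impls[] = {
    { DEV_SMP,   vec_sum_smp   },
    { DEV_ACCEL, vec_sum_accel },
};

/* Toy policy: run the first registered version whose device is idle.
 * Returns the device used, or DEV_COUNT if every device is busy. */
device_kind run_vec_sum(const int idle[DEV_COUNT], int n,
                        const int *a, const int *b, int *res)
{
    assert(idle != NULL && n >= 0);
    size_t count = sizeof vec_sum_impls / sizeof vec_sum_impls[0];
    for (size_t v = 0; v < count; v++) {
        if (idle[vec_sum_impls[v].dev]) {
            vec_sum_impls[v].fn(n, a, b, res);
            return vec_sum_impls[v].dev;
        }
    }
    return DEV_COUNT;
}
```

The caller never names a device: it asks for the task to run, and the choice of version is made at run time from whatever resources are free, which is the property the annotations on this slide give to the runtime.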

Page 26: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Conclusions

• OmpSs programming model
  – SMPs, GPUs (CUDA & OpenCL)
  – C/C++/Fortran

• Shown OmpSs ease of use

• Performance comparable to hand-tuned code
  – Still, scalability on GPUs needs improvement
  – Support for hardware features such as constant and texture memory needs to be included in OmpSs

Page 27: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Future work

• Keep pushing for these extensions to be included in the OpenMP standard

• Include support for GPU memory types
  – Constant, texture

• Building OmpSs @FPGAs
  – Collaboration with Xilinx Dublin Research Lab

• Interoperate with graphics rendering in OpenGL

• Work with Mont-Blanc and DEEP applications

Page 28: OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous ...

Acknowledgments

• Parallel programming group @BSC

• Encore/Mont-Blanc/DEEP projects
  – European Commission

• Spanish Ministry of Education

• Catalan Government

OmpSs available at Barcelona Supercomputing Center http://pm.bsc.es