accULL: An User-directed Approach to Heterogeneous Programming

Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande

Dept. E.I.O. y Computación, Univ. de La Laguna, 38271–La Laguna, Spain

International Workshop on Heterogeneous Architectures and Computing

Leganés, July 13, 2012
Page 2: accULL (HAC Leganés)


Outline

1 Heterogeneous Architectures

2 accULL: An Early OpenACC Implementation

3 Results

4 Conclusions and Future Work

Page 4: accULL (HAC Leganés)

Introduction

The rise of GPUs: impressive results

Page 5: accULL (HAC Leganés)

GPUs

Successfully used for general purpose computing (GPGPU)

Page 6: accULL (HAC Leganés)

Heterogeneous Architectures

But ...

It is not Easy!

Page 7: accULL (HAC Leganés)

Heterogeneous Architectures

A GPU is not a CPU

GPUs are inherently SIMD processors

CPUs and GPUs tackle the processing of tasks differently

CPUs excel at serial processing

GPUs are better suited to applications that need high floating-point throughput at lower power consumption

Page 8: accULL (HAC Leganés)

Parallel Languages: MPI (distributed memory, DM) and OpenMP (shared memory, SM)

Neither can be used directly to program GPUs

New programming models are required...

Page 9: accULL (HAC Leganés)

GPGPU Programming

Today's software stack:

Page 10: accULL (HAC Leganés)

CUDA from NVIDIA

Pros: Performance, Easier than OpenCL

Con: Only for NVIDIA hardware

CUDA Code Example

    /* One thread computes one element of the product a = b * c. */
    __global__ void mmkernel(float *a, float *b, float *c, int n,
                             int m, int p) {
        /* The factor 32 assumes 32 threads per block along x. */
        int i = blockIdx.x * 32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < p; ++k) sum += b[i + n * k] * c[k + p * j];
        a[i + n * j] = sum;
    }
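
To make the indexing of the kernel easier to follow, here is a serial reference in plain C. It is only a sketch: the function name mm_reference is hypothetical, and the column-major layout (a is n x m, b is n x p, c is p x m) is inferred from the index expressions, not stated on the slide.

```c
/* Serial reference for the CUDA kernel: each (i, j) pair plays the role
 * of one GPU thread. Assumed layout (column-major): a is n x m,
 * b is n x p, c is p x m. The name mm_reference is hypothetical. */
void mm_reference(float *a, const float *b, const float *c,
                  int n, int m, int p)
{
    for (int j = 0; j < m; ++j) {        /* blockIdx.y in the kernel */
        for (int i = 0; i < n; ++i) {    /* blockIdx.x * 32 + threadIdx.x */
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }
    }
}
```

The two outer loops are what the GPU grid replaces: the kernel keeps only the inner k loop and lets the hardware supply i and j.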

Page 11: accULL (HAC Leganés)

GPGPU Programming

OpenCL: Open Computing Language

A framework developed by the Khronos Group

A standard

OpenCL programs execute across heterogeneous platforms: CPUs + GPUs + other processors

Pros: can be used with any device, it is a standard

Cons: more complex than CUDA, immature

Page 13: accULL (HAC Leganés)

GPGPU Programming

Common Problems

1 The programmer needs to know low-level details of the architecture

2 Source code needs to be rewritten:

One version for the CPU

A different version for the GPU

3 Good performance requires great effort in parameter tuning

4 CUDA and OpenCL are new and complex for non-experts

Page 14: accULL (HAC Leganés)

GPGPU Programming

Our Claim: New models and tools are needed if we want to spread the use of GPUs in HPC

Is there anything new on the horizon?

hiCUDA

PGI accelerator model

CAPS HMPP

OpenACC

Page 15: accULL (HAC Leganés)

GPGPU Programming

hiCUDA

Translates each directive into a CUDA call

It is able to use the GPU Shared Memory

Only works with NVIDIA devices

The programmer still needs to know hardware details

hiCUDA Code Example:

    ...
    #pragma hicuda global alloc c[*][*] copyin

    #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
    #pragma hicuda loop_partition over_tblock over_thread
    for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            ...

Page 16: accULL (HAC Leganés)

GPGPU Programming

PGI accelerator model

It is a higher level (directive-based) approach

Fortran and C are supported

Precursor to OpenACC

PGI Accelerator Model Code Example:

    #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
    {
    #pragma acc region
        {
    #pragma acc loop independent
            for (j = 0; j < n; j++)
            {
    #pragma acc loop independent
                for (i = 0; i < l; i++) {
                    double sum = 0.0;
                    for (k = 0; k < m; k++) {
                        sum += b[i + k * l] * c[k + j * m];
                    }
                    a[i + j * l] = sum;
                }
            }
        }
    }

Page 17: accULL (HAC Leganés)

GPGPU Programming

OpenACC: introduced in November 2011 at SuperComputing 2011

A directive-based language

Aims to become a standard

Supported by: Cray, NVIDIA, PGI and CAPS

A single source code for CPU/GPU

Platform independent

Easier for beginners

Page 18: accULL (HAC Leganés)

GPGPU Programming

OpenACC Code Example:
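
The example shown on this slide was an image and is not part of the transcript. As a stand-in, and explicitly not the authors' code, the following minimal sketch shows what a small OpenACC-annotated C function can look like. The function name saxpy and its parameters are illustrative assumptions; the directives use the kernels, loop, copyin and copy forms of OpenACC.

```c
/* SAXPY (y = alpha * x + y) annotated with OpenACC directives.
 * Illustrative stand-in, not the example from the talk.
 * A compiler without OpenACC support ignores the pragmas and runs
 * the loop serially on the CPU, so the same source serves both. */
void saxpy(int n, float alpha, const float *x, float *y)
{
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
#pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}
```

The data clauses state what has to travel to the accelerator (x) and what has to travel both ways (y); everything else is ordinary C.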

Page 20: accULL (HAC Leganés)

accULL: Our OpenACC implementation

accULL is a framework developed to support OpenACC programs

Page 21: accULL (HAC Leganés)

accULL: Our OpenACC implementation

accULL = YaCF + Frangollo

It is a two-layer implementation: compiler + runtime library

Page 22: accULL (HAC Leganés)

YaCF: the compiler

YaCF (Yet Another Compiler Framework) is the compiler framework we have developed

Some features:

It is a source-to-source (StS) compiler

Written in Python from scratch with an OO approach

Receives C99 as input

It is able to generate CUDA/OpenCL kernels from annotated code

A driver for compiling OpenACC directives has been added

YaCF translates the directives into Frangollo calls

A public-domain development

Page 23: accULL (HAC Leganés)

Frangollo: the RunTime

Frangollo

It is a runtime that supports execution on heterogeneous platforms

1 Encapsulates the hardware issues

2 Is able to run on NVIDIA devices using CUDA

3 Is able to manage a wider range of devices using OpenCL

Page 24: accULL (HAC Leganés)

Frangollo: the RunTime

Compilation flow

Page 25: accULL (HAC Leganés)

Frangollo: the RunTime

Its Responsibilities

1 Manages the memory

2 Initializes the devices

3 Launches the kernels

Makes the programmer's life easier!

Page 27: accULL (HAC Leganés)

Frangollo: Memory Management

A program workflow

Page 28: accULL (HAC Leganés)

Frangollo: Structure

Interface layer: A door to Frangollo

Some functions in the C interface:

registerVar

launchKernel

getNumDevices
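
The slide names three interface functions but not their signatures. The sketch below is a host-only mock that illustrates how such an interface could cover the runtime's three responsibilities (memory, devices, kernel launch). Only the names registerVar, launchKernel and getNumDevices come from the slide; every signature, the var_entry table, the kernel_fn type and the scale2 kernel are assumptions made for illustration and should not be read as Frangollo's real API.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_VARS 16

/* Host-only mock. Only the three function names come from the slide;
 * all signatures and behaviour below are assumed for illustration. */
typedef struct { void *host; void *dev; size_t bytes; } var_entry;
static var_entry vars[MAX_VARS];
static int n_vars = 0;

int getNumDevices(void) { return 1; }  /* the mock pretends one device */

/* Record a host buffer and create its "device" copy (plain malloc here;
 * the mock never frees it). Returns NULL on failure. */
void *registerVar(void *host, size_t bytes)
{
    if (n_vars == MAX_VARS) return NULL;
    void *dev = malloc(bytes);
    if (dev == NULL) return NULL;
    memcpy(dev, host, bytes);
    vars[n_vars++] = (var_entry){ host, dev, bytes };
    return dev;
}

typedef void (*kernel_fn)(void *dev, size_t n);

/* Run the kernel on the device copy, then copy the result back. */
int launchKernel(kernel_fn k, void *host, size_t n)
{
    for (int i = 0; i < n_vars; ++i) {
        if (vars[i].host == host) {
            k(vars[i].dev, n);
            memcpy(vars[i].host, vars[i].dev, vars[i].bytes);
            return 0;
        }
    }
    return -1;  /* the variable was never registered */
}

/* Example kernel: double every float in the buffer. */
void scale2(void *dev, size_t n)
{
    float *v = dev;
    for (size_t i = 0; i < n; ++i) v[i] *= 2.0f;
}
```

A caller would register a buffer once and then launch kernels against it, for example registerVar(d, sizeof d) followed by launchKernel(scale2, d, 3); the point of the design is that the calling code never touches CUDA or OpenCL directly.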

Page 29: accULL (HAC Leganés)

Frangollo: Structure

Abstract layer

Frangollo uses a class hierarchy

All classes in this layer are abstract

Page 30: accULL (HAC Leganés)

Frangollo: Structure

Device layer

Encapsulates all target-language-related functions

New platforms could be added in the future
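
The split between an abstract layer and a device layer can be sketched in C with a table of function pointers. This is an analogy under stated assumptions, not Frangollo's design: the slides describe a class hierarchy, and the names device_ops, find_backend and the stub functions are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Abstract layer: the operations every backend must provide
 * (hypothetical interface, chosen for illustration). */
typedef struct device_ops {
    const char *name;
    int (*init)(void);
    int (*launch)(const char *kernel_name);
} device_ops;

/* Device layer: stub backends. A real backend would call the CUDA or
 * OpenCL API inside these functions. */
static int cuda_init(void) { return 0; }
static int cuda_launch(const char *k) { return k != NULL ? 0 : -1; }
static int ocl_init(void) { return 0; }
static int ocl_launch(const char *k) { return k != NULL ? 0 : -1; }

static const device_ops backends[] = {
    { "cuda",   cuda_init, cuda_launch },
    { "opencl", ocl_init,  ocl_launch  },
};

/* Look a backend up by name; NULL means the platform is not supported. */
const device_ops *find_backend(const char *name)
{
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; ++i)
        if (strcmp(backends[i].name, name) == 0)
            return &backends[i];
    return NULL;
}
```

With this shape, supporting a new platform means adding one row to the table, and code above the device layer stays unchanged.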

Page 32: accULL (HAC Leganés)

Platforms

M1: A Desktop computer

Intel Core i7 930 processor (2.80 GHz)

1MB of L2 cache, 8MB of L3 cache, shared by the four cores

4 GB RAM

2 GPU devices attached:

Tesla C1060 with 3 GB memory (M1a)

Tesla C2050 (Fermi) with 4 GB memory (M1b)

Accelerator platform is CUDA 4.0

M1a and M1b mimic the setup of an average OpenACC developer

She can purchase a GPU card and plug it into her desktop computer

It is a relatively cheap platform

Page 33: accULL (HAC Leganés)

Platforms

M2: A cluster node

2 quad-core Intel Xeon E5410 (2.25GHz) processors

24 GB memory

A Fermi C2050 card with 448 CUDA cores and 4GB memory attached

Accelerator platform: CUDA 4.0

M2 is a node of a common multinode cluster

Today's clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC

This kind of compute node has higher acquisition and maintenance costs than M1

Page 34: accULL (HAC Leganés)

Platforms

M3: A second cluster

M3 is a shared memory system

4 Intel Xeon E7 4850 CPUs

2.50MB L2 cache and 24MB L3 cache (for all its 10 cores)

6GB of memory per core

Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

M3 showcases an alternative use of OpenCL

There are implementations of OpenCL targeting shared memory systems

Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming

Page 35: accULL (HAC Leganés)

Some of our Experiments

Blocked Matrix Multiplication (M×M)

Rodinia Benchmark

The Rodinia Benchmark suite comprises compute-heavy applications

It covers a wide range of applications

OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite

From them, we have selected:

Needleman-Wunsch (NW)

HotSpot (HS)

Speckle Reducing Anisotropic Diffusion (SRAD)

Page 36: accULL (HAC Leganés)

Matrix Multiplication

Sketch of M×M in OpenACC

    /* Index arithmetic is kept as on the slide; it is consistent only
       when L == M == N. */
    #pragma acc kernels name("mxm") copy(a[L*N]) \
                        copyin(b[L*M], c[M*N] ...)
    {
    #pragma acc loop private(i, j) collapse(2)
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
                a[i * L + j] = 0.0;
        /* Iterate over blocks */
        for (ii = 0; ii < L; ii += tile_size)
            for (jj = 0; jj < N; jj += tile_size)
                for (kk = 0; kk < M; kk += tile_size) {
                    /* Iterate inside a block */
    #pragma acc loop collapse(2) private(i, j, k)
                    for (j = jj; j < min(N, jj + tile_size); j++)
                        for (i = ii; i < min(L, ii + tile_size); i++)
                            for (k = kk; k < min(M, kk + tile_size); k++)
                                a[i * L + j] += (b[i * L + k] * c[k * M + j]);
                }
    }
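
For readers who want to run the blocked algorithm, here is a minimal serial sketch in C. It is not the authors' benchmark code: the name mxm_tiled and the helper min_int are hypothetical, the OpenACC directives are left out, and the indexing assumes row-major storage with a of size L x N, b of size L x M and c of size M x N, so that rectangular matrices also work.

```c
/* Hypothetical helper: smaller of two ints. */
static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled a = b * c for row-major a (L x N), b (L x M), c (M x N).
 * Serial sketch; the name mxm_tiled is an assumption. */
void mxm_tiled(double *a, const double *b, const double *c,
               int L, int M, int N, int tile_size)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = 0.0;

    /* Iterate over blocks */
    for (int ii = 0; ii < L; ii += tile_size)
        for (int jj = 0; jj < N; jj += tile_size)
            for (int kk = 0; kk < M; kk += tile_size)
                /* Iterate inside a block */
                for (int i = ii; i < min_int(L, ii + tile_size); i++)
                    for (int j = jj; j < min_int(N, jj + tile_size); j++)
                        for (int k = kk; k < min_int(M, kk + tile_size); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}
```

The min_int bounds let tile_size be any positive value, including one that does not divide the matrix dimensions.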

Page 37: accULL (HAC Leganés)

Matrix Multiplication

Floating point performance for M×M in M2

Page 38: accULL (HAC Leganés)

Matrix Multiplication

Floating point performance comparison between OpenMP, accULL, PGI and hiCUDA in M1

accULL achieves the second-best performance

Page 39: accULL (HAC Leganés)

Matrix Multiplication

Comparison between OpenMP-gcc implementation and Frangollo+OpenCL in M3 (SM system, 40 cores)

Page 40: accULL (HAC Leganés)

Needleman-Wunsch

Performance comparisons of NW in M1b

accULL performs worse than native versions

Page 41: accULL (HAC Leganés)


Needleman-Wunsch

Performance comparisons of NW in M3 (SM, 40 cores)

The OpenMP versions outperform their OpenCL counterparts

Page 42: accULL (HAC Leganés)


HotSpot

Performance comparison of different implementations showing efficiency over native CUDA code in M1

In this case, accULL performs similarly to hiCUDA

Page 43: accULL (HAC Leganés)


HotSpot

Speed-Up comparison with native CUDA code in M1b (Fermi)
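
The slides report speed-up and efficiency without defining them. Assuming the usual definitions (an assumption, not something stated in the talk), with T_ref the run time of the baseline, T_native the run time of the hand-written CUDA code and T_impl the run time of the implementation under test:

```latex
S = \frac{T_{\mathrm{ref}}}{T_{\mathrm{impl}}},
\qquad
E = \frac{T_{\mathrm{native}}}{T_{\mathrm{impl}}} \times 100\%
```

Under these definitions, a directive-based version that takes 2.5 s where the native code takes 2 s has an efficiency of 80%.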

Page 44: accULL (HAC Leganés)

HotSpot

Efficiency w.r.t. Intel-OpenMP in M3 (SM, 40 cores)

Page 45: accULL (HAC Leganés)

SRAD

Speedup over the OpenMP implementation in M1b

Page 46: accULL (HAC Leganés)

SRAD

Speedup over the OpenMP implementation in M3

Page 48: accULL (HAC Leganés)

Conclusions I

accULL

First OpenACC implementation with support for both CUDA and OpenCL

It supports most of the standard

We validate accULL using codes from widely available benchmarks using GPUs and CPUs

It meets the requirements of a non-expert developer

Page 52: accULL (HAC Leganés)

Conclusions II

accULL

YaCF can be used as a fast-prototyping tool to explore optimizations

Frangollo can be detached from YaCF and combined with a production-ready compiler

Some issues that can be tackled within Frangollo independently from the compiler

Memory allocation

Kernel scheduling

Data splitting

Overlapping of computation and communications

Parallel reduction implementation
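
One of the runtime-level issues listed above is the parallel reduction. The slides do not say how Frangollo implements it; a common scheme is the pairwise (tree) reduction, sketched here serially in C. The function name tree_reduce_sum is hypothetical, and the sketch overwrites its input.

```c
#include <stddef.h>

/* Pairwise (tree) sum done in place; the input array is overwritten.
 * After the loops v[0] holds the total. On a device, the additions of
 * one pass of the outer loop are independent and could run in parallel.
 * The name tree_reduce_sum is an assumption. */
float tree_reduce_sum(float *v, size_t n)
{
    if (n == 0) return 0.0f;
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            v[i] += v[i + stride];
    return v[0];
}
```

The number of passes grows with the logarithm of n, which is why this shape is preferred over a single running sum when many threads are available.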

Page 60: accULL (HAC Leganés)

Future work

There are plenty of opportunities to improve performance

To implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access

To complete the implementation of the asynchronous calls for better performance

Multi-GPU support

To explore different possibilities of integration with MPI

Integration of Frangollo with a production-ready compiler

New backend for FPGAs

Page 66: accULL (HAC Leganés)

Thank you for your attention!

accULL: An User-directed Approach to Heterogeneous Programming

http://accull.wordpress.com/

This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI

F. de Sande
