accULL (HAC Leganés)

Heterogeneous Architectures

accULL: An Early OpenACC Implementation

Results

Conclusions and Future Work

accULL: An User-directed Approach to Heterogeneous Programming

Ruyman Reyes, Ivan Lopez-Rodríguez, Juan J. Fumero, Francisco de Sande

Dept. E.I.O. y Computacion, Univ. de La Laguna, 38271 La Laguna, Spain

International Workshop on Heterogeneous Architectures and Computing

Leganés, July 13, 2012

Outline

1 Heterogeneous Architectures

2 accULL: An Early OpenACC Implementation

3 Results

4 Conclusions and Future Work

Introduction

The rise of GPUs: impressive results

GPUs

Successfully used for general purpose computing (GPGPU)

Heterogeneous Architectures

But ...

It is not Easy!

Heterogeneous Architectures

A GPU is not a CPU

GPUs are inherently SIMD processors

CPUs and GPUs tackle the processing of tasks differently

CPUs excel at serial processing

GPUs are better suited to applications that demand high floating-point throughput at lower power consumption

Parallel Languages: MPI (DM) and OpenMP (SM)

They are not valid for programming GPUs

New programming models are required...

GPGPU Programming

Today's software stack:

CUDA from NVIDIA

Pros: performance, easier than OpenCL

Con: only for NVIDIA hardware

CUDA Code Example

    __global__ void mmkernel(float *a, float *b, float *c, int n,
                             int m, int p) {
        int i = blockIdx.x * 32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < p; ++k) sum += b[i + n * k] * c[k + p * j];
        a[i + n * j] = sum;
    }
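
To make the index arithmetic of the kernel concrete, the following host-only C sketch replays it with ordinary loops. It assumes a launch with 32 threads per block, a grid of n/32 by m blocks, n a multiple of 32, and column-major storage; none of these launch details appear on the slide, and `mmkernel_host` is a hypothetical name used only for illustration.

```c
/* Host-side replay of the mmkernel index mapping (illustrative sketch).
 * Assumed layout: a is n x m, b is n x p, c is p x m, all column-major.
 * Assumed launch: grid (n/32, m), 32 threads per block, n % 32 == 0. */
void mmkernel_host(float *a, const float *b, const float *c,
                   int n, int m, int p)
{
    for (int by = 0; by < m; by++) {              /* plays blockIdx.y  */
        for (int bx = 0; bx < n / 32; bx++) {     /* plays blockIdx.x  */
            for (int tx = 0; tx < 32; tx++) {     /* plays threadIdx.x */
                int i = bx * 32 + tx;             /* row of a          */
                int j = by;                       /* column of a       */
                float sum = 0.0f;
                for (int k = 0; k < p; ++k)
                    sum += b[i + n * k] * c[k + p * j];
                a[i + n * j] = sum;
            }
        }
    }
}
```

Each iteration of the three outer loops corresponds to one GPU thread, so on the device all of them can run concurrently.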

GPGPU Programming

OpenCL: Open Computing Language

A framework developed by the Khronos Group

A standard

OpenCL programs execute across heterogeneous platforms: CPUs + GPUs + other processors

Pros: can be used with any device, it is a standard

Cons: more complex than CUDA, immature

GPGPU Programming

Common Problems

1 The programmer needs to know low-level details of the architecture

2 Source code needs to be rewritten:

One version for the CPU

A different version for the GPU

3 Good performance requires great effort in parameter tuning

4 CUDA and OpenCL are new and complex for non-experts

GPGPU Programming

Our claim: new models and tools are needed if we want GPUs to be widely used in HPC

Is there anything new on the horizon?

hiCUDA

PGI accelerator model

CAPS HMPP

OpenACC

GPGPU Programming

hiCUDA

Translates each directive into a CUDA call

It is able to use the GPU Shared Memory

Only works with NVIDIA devices

The programmer still needs to know hardware details

hiCUDA Code Example:

    ...
    #pragma hicuda global alloc c[*][*] copyin

    #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
    #pragma hicuda loop_partition over_tblock over_thread
    for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            ...

GPGPU Programming

PGI accelerator model

It is a higher level (directive-based) approach

Fortran and C are supported

Precursor to OpenACC

PGI Accelerator Model Code Example:

    #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
    {
    #pragma acc region
      {
    #pragma acc loop independent
        for (j = 0; j < n; j++)
        {
    #pragma acc loop independent
          for (i = 0; i < l; i++) {
            double sum = 0.0;
            for (k = 0; k < m; k++) {
              sum += b[i + k * l] * c[k + j * m];
            }
            a[i + j * l] = sum;
          }
        }
      }
    }

GPGPU Programming

OpenACC: introduced in November 2011 at SuperComputing 2011

A directive-based language

Aims to become a standard

Supported by: Cray, NVIDIA, PGI and CAPS

A single source code for CPU/GPU

Platform independent

Easier for beginners

GPGPU Programming

OpenACC Code Example:
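
The example shown on this slide did not survive as text. As a stand-in, here is a minimal sketch of what an OpenACC-annotated matrix product can look like in C, using the `kernels`, `copyin`, `copyout` and `loop independent` constructs of the OpenACC specification. The function name `mxm_openacc`, the square row-major layout and the choice of clauses are assumptions made for illustration, not the authors' code.

```c
/* Illustrative sketch only: a = b * c for square n x n row-major matrices.
 * A compiler without OpenACC support ignores the pragmas and runs the
 * loops on the CPU, which is the "single source" property of the model. */
void mxm_openacc(float *restrict a, const float *restrict b,
                 const float *restrict c, int n)
{
    /* b and c are copied to the accelerator; a is copied back at the end. */
    #pragma acc kernels copyin(b[0:n*n], c[0:n*n]) copyout(a[0:n*n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++) {
            #pragma acc loop independent
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += b[i * n + k] * c[k * n + j];
                a[i * n + j] = sum;
            }
        }
    }
}
```

The same file can be built with a plain C99 compiler or with an OpenACC-capable one; only the second produces accelerator code.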

accULL: Our OpenACC implementation

accULL is a framework developed to support OpenACC programs

accULL = YaCF + Frangollo

It is a two-layer implementation: Compiler + RunTime Library

YaCF: the compiler

YaCF (Yet Another Compiler Framework) is the compiler framework we have developed

Some features:

It is a source-to-source (StS) compiler

Written in Python from scratch with an OO approach

Receives C99 as input

It is able to generate CUDA/OpenCL kernels from annotated code

A driver for compiling OpenACC directives has been added

YaCF translates the directives into Frangollo calls

A public-domain development

Frangollo: the RunTime

Frangollo

It is a RunTime that supports execution on heterogeneous platforms

1 Encapsulates the hardware issues

2 Is able to run on NVIDIA devices using CUDA

3 Is able to manage a wider range of devices using OpenCL

Compilation flow

Frangollo's responsibilities

1 Manages the memory

2 Initializes the devices

3 Launches the kernels

Makes the programmer's life easier!

Frangollo: Memory Management

A program workflow

Frangollo: Structure

Interface layer: A door to Frangollo

Some functions in the C interface:

registerVar

launchKernel

getNumDevices
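
The slide names these entry points but not their signatures. The toy, host-only mock below suggests the kind of calls a source-to-source compiler could emit for one data region and one kernel. Every signature, the `mock_` prefix, the bookkeeping structure and the `run_scale` driver are assumptions made for illustration; they are not Frangollo's actual interface.

```c
#include <stdio.h>
#include <stddef.h>

#define MOCK_MAX_VARS 8

/* Hypothetical record of a host variable known to the runtime. */
typedef struct {
    void  *host_ptr;
    size_t bytes;
    int    copy_in;   /* transfer host -> device before the kernel */
    int    copy_out;  /* transfer device -> host after the kernel  */
} mock_var;

static mock_var mock_vars[MOCK_MAX_VARS];
static int mock_nvars = 0;
static int mock_launches = 0;

/* Stand-in for getNumDevices: pretend the host is the only device. */
int mock_getNumDevices(void) { return 1; }

/* Stand-in for registerVar: remember a variable, return its slot or -1. */
int mock_registerVar(void *host_ptr, size_t bytes, int copy_in, int copy_out)
{
    if (mock_nvars >= MOCK_MAX_VARS) return -1;
    mock_vars[mock_nvars] = (mock_var){host_ptr, bytes, copy_in, copy_out};
    return mock_nvars++;
}

typedef void (*mock_kernel_fn)(void **args);

/* Stand-in for launchKernel: run the outlined kernel on the host.
 * Returns the number of launches so far, or -1 on error. */
int mock_launchKernel(const char *name, mock_kernel_fn fn, void **args)
{
    if (fn == NULL || mock_getNumDevices() < 1) return -1;
    printf("launching %s (%d registered variables)\n", name, mock_nvars);
    fn(args);
    return ++mock_launches;
}

/* Kernel body a compiler could outline from an annotated loop. */
static void scale_kernel(void **args)
{
    float *v = args[0];
    int n = *(int *)args[1];
    for (int i = 0; i < n; i++) v[i] *= 2.0f;
}

/* Host code that could replace the annotated region after translation. */
int run_scale(float *v, int n)
{
    void *args[] = { v, &n };
    if (mock_registerVar(v, (size_t)n * sizeof *v, 1, 1) < 0) return -1;
    return mock_launchKernel("scale", scale_kernel, args);
}
```

The point of the sketch is the division of labour: the generated code only registers data and requests launches, while everything device-specific stays behind the interface.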

Abstract layer

Frangollo uses a class hierarchy

All classes in this layer are abstract

Device layer

Encapsulates all target-language-related functions

New platforms could be added in the future
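
The split between an abstract layer and a device layer can be sketched in C with a table of function pointers standing in for an abstract class. The slides do not show the real classes, so the type `device_ops`, its members, and the host-memory backend below are hypothetical; a real backend would wrap CUDA or OpenCL calls instead of `malloc` and `memcpy`.

```c
#include <stdlib.h>
#include <string.h>

/* Abstract layer: the operations every backend must provide. */
typedef struct device_ops {
    const char *name;
    void *(*alloc)(size_t bytes);
    void  (*release)(void *dev_ptr);
    void  (*to_device)(void *dev_ptr, const void *host_ptr, size_t bytes);
    void  (*to_host)(void *host_ptr, const void *dev_ptr, size_t bytes);
} device_ops;

/* Device layer: a host-memory backend used here as a placeholder. */
static void *host_alloc(size_t bytes) { return malloc(bytes); }
static void host_release(void *p) { free(p); }
static void host_copy_in(void *d, const void *h, size_t n) { memcpy(d, h, n); }
static void host_copy_back(void *h, const void *d, size_t n) { memcpy(h, d, n); }

static const device_ops host_backend = {
    "host", host_alloc, host_release, host_copy_in, host_copy_back
};

/* Code above the device layer only sees device_ops, so supporting a new
 * platform means supplying one more table like host_backend. */
int round_trip(const device_ops *dev, const int *in, int *out, size_t count)
{
    size_t bytes = count * sizeof *in;
    void *buf = dev->alloc(bytes);
    if (buf == NULL) return -1;
    dev->to_device(buf, in, bytes);
    dev->to_host(out, buf, bytes);
    dev->release(buf);
    return 0;
}
```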

Platforms

M1: A Desktop computer

Intel Core i7 930 processor (2.80 GHz)

1MB of L2 cache, 8MB of L3 cache, shared by the four cores

4 GB RAM

2 GPU devices attached:

Tesla C1060 with 3 GB memory (M1a)

Tesla C2050 (Fermi) with 4 GB memory (M1b)

Accelerator platform is CUDA 4.0

M1a/M1b mimic the setup of an average OpenACC developer

She can purchase a GPU card and plug it into her desktop computer

It is a relatively cheap platform

Platforms

M2: A cluster node

2 quad-core Intel Xeon E5410 (2.25 GHz) processors

24 GB memory

A Fermi C2050 card attached, with 448 CUDA cores and 4 GB memory

Accelerator platform: CUDA 4.0

M2 is a node of a common multinode cluster

Today's clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC

This kind of compute node has higher acquisition and maintenance costs than M1

Platforms

M3: A second cluster

M3 is a shared memory system

4 Intel Xeon E7 4850 CPUs

2.50 MB L2 cache and 24 MB L3 cache (for all its 10 cores)

6 GB of memory per core

Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

M3 showcases an alternative use of OpenCL

There are implementations of OpenCL targeting shared memory systems

Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming

Some of our Experiments

Blocked Matrix Multiplication (M×M)

Rodinia Benchmark

The Rodinia Benchmark suite comprises compute-heavy applications

It covers a wide range of applications

OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite

From them, we have selected:

Needleman-Wunsch (NW)

HotSpot (HS)

Speckle Reducing Anisotropic Diffusion (SRAD)

Matrix Multiplication

Sketch of M×M in OpenACC

    /* a is L x N, b is L x M, c is M x N, all stored row-major */
    #pragma acc kernels name("mxm") copy(a[L*N]) \
                        copyin(b[L*M], c[M*N] ...)
    {
    #pragma acc loop private(i, j) collapse(2)
      for (i = 0; i < L; i++)
        for (j = 0; j < N; j++)
          a[i * N + j] = 0.0;
      /* Iterate over blocks */
      for (ii = 0; ii < L; ii += tile_size)
        for (jj = 0; jj < N; jj += tile_size)
          for (kk = 0; kk < M; kk += tile_size) {
            /* Iterate inside a block */
    #pragma acc loop collapse(2) private(i, j, k)
            for (j = jj; j < min(N, jj + tile_size); j++)
              for (i = ii; i < min(L, ii + tile_size); i++)
                for (k = kk; k < min(M, kk + tile_size); k++)
                  a[i * N + j] += (b[i * M + k] * c[k * N + j]);
          }
    }
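
For readers who want to experiment with the tiling itself, here is a directive-free C sketch of a blocked product for rectangular row-major matrices, together with a plain triple loop to check it against. The function names, the explicit `tile_size` argument and the row-major layout are assumptions of this sketch rather than details taken from the slides.

```c
static int min_int(int x, int y) { return x < y ? x : y; }

/* a (L x N) = b (L x M) * c (M x N), all row-major.
 * tile_size is a tuning parameter chosen by the caller (must be > 0). */
void mxm_blocked(double *a, const double *b, const double *c,
                 int L, int M, int N, int tile_size)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = 0.0;

    /* Iterate over blocks */
    for (int ii = 0; ii < L; ii += tile_size)
        for (int jj = 0; jj < N; jj += tile_size)
            for (int kk = 0; kk < M; kk += tile_size)
                /* Iterate inside a block; min_int clips partial tiles */
                for (int i = ii; i < min_int(L, ii + tile_size); i++)
                    for (int j = jj; j < min_int(N, jj + tile_size); j++)
                        for (int k = kk; k < min_int(M, kk + tile_size); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}

/* Reference implementation without tiling, used only for checking. */
void mxm_naive(double *a, const double *b, const double *c,
               int L, int M, int N)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < M; k++)
                sum += b[i * M + k] * c[k * N + j];
            a[i * N + j] = sum;
        }
}
```

Tiling does not change the result, only the order in which the partial products are accumulated, which is what improves data reuse in caches or in GPU shared memory.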

Floating point performance for M×M in M2

Floating point performance comparison between OpenMP, accULL, PGI and hiCUDA in M1

accULL achieves the second-best performance

Comparison between the OpenMP-gcc implementation and Frangollo+OpenCL in M3 (SM system, 40 cores)

Needleman-Wunsch

Performance comparisons of NW in M1b

accULL performs worse than native versions

Needleman-Wunsch

Performance comparisons of NW in M3 (SM, 40 cores)

The OpenMP versions outperform their OpenCL counterparts

HotSpot

Performance comparison of different implementations showing efficiency over native CUDA code in M1

In this case, accULL performs similarly to hiCUDA

HotSpot

Speed-Up comparison with native CUDA code in M1b (Fermi)

HotSpot

Efficiency w.r.t. Intel-OpenMP in M3 (SM, 40 cores)

SRAD

Speedup over the OpenMP implementation in M1b

SRAD

Speedup over the OpenMP implementation in M3

Conclusions I

accULL

First OpenACC implementation with support for both CUDA and OpenCL

It supports most of the standard

We validate accULL using codes from widely available benchmarks using GPUs and CPUs

It meets the requirements of a non-expert developer

Conclusions II

accULL

YaCF can be used as a fast-prototyping tool to explore optimizations

Frangollo can be detached from YaCF and combined with a production-ready compiler

Some issues can be tackled within Frangollo, independently of the compiler:

Memory allocation

Kernel scheduling

Data splitting

Overlapping of computation and communications

Parallel reduction implementation
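
As an illustration of the last item, one common way to implement a parallel reduction on a GPU is a tree-shaped pairwise sum. The slides do not say which scheme Frangollo uses, so the sequential C replay below is only a sketch of the general technique, and `tree_reduce_sum` is a hypothetical name.

```c
#include <stddef.h>

/* Sums buf[0..n) in place with a tree pattern; the total ends up in buf[0].
 * Within one pass (one value of stride) the additions touch disjoint
 * elements, so a GPU could assign one thread to each of them.
 * Returns 0.0 when n == 0. The contents of buf are overwritten. */
double tree_reduce_sum(double *buf, size_t n)
{
    if (n == 0) return 0.0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            buf[i] += buf[i + stride];
    }
    return buf[0];
}
```

The number of passes grows with the logarithm of n, whereas a sequential sum needs n - 1 dependent additions.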

Future work

There are plenty of opportunities to improve performance

To implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access

To complete the implementation of the asynchronous calls for better performance

Multi-GPU support

To explore different possibilities of integration with MPI

Integration of Frangollo with a production-ready compiler

New backend for FPGAs

Thank you for your attention!

accULL: An User-directed Approach to Heterogeneous Programming

http://accull.wordpress.com/

This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI

F. de Sande
