Tim Warburton, David Medina, Axel Modave
Virginia Tech, USA
The OCCA abstract threading model
Implementation and performance for high-order finite-element computations
US Department of Energy - NSF - Air Force Office of Scientific Research - Office of Naval Research Shell - Halliburton - TOTAL E&P - Argonne National Laboratory - MD Anderson Cancer Center - Intel - Nvidia - AMD
2
Towards efficient HPC applications for industrial simulations

Multidisciplinary research
[Diagram: overlapping research areas — Numerical Analysis, Accelerated Computing, High Performance Scalability, Application & Basic Science, Approximation Theory, Numerical & Physical PDE Modeling, Industrial Scale]

Cluster NewRiver at Virginia Tech (*)
(*) Advanced Research Computing at Virginia Tech: https://secure.hosting.vt.edu/www.arc.vt.edu/computing/newriver/
3
Programming approach for HPC applications with many-core devices
MPI + X = 😎
Which X for multi-threading?
Towards efficient HPC applications for industrial simulations
4
Portability
Code portability
• CUDA, OpenCL, OpenMP, OpenACC, Intel TBB… are not code compatible.
• Not all APIs are installed on any given system.

Performance portability
• Logically similar kernels differ in performance (GCC & ICPC, OpenCL & CUDA).
• Naively porting OpenMP to CUDA or OpenCL will likely yield low performance.
Uncertainty
• Code life cycles are measured in decades.
• Architecture & API life cycles are measured in Moore doubling periods.
• Examples: IBM Cell processor, IBM Blue Gene/Q.
Need an efficient, durable, portable, open-source, vendor-independent approach for many-core programming
Challenges for efficient HPC applications
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
5
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
6
7
Directive approach
• Use of optional #pragma's to give the compiler transformation hints
• Aims for portability, performance and programmability
• OpenACC and OpenMP begin to resemble an API rather than code decorations

  #pragma omp target teams distribute parallel for
  for(int i = 0; i < N; ++i){
    y[i] = a*x[i] + y[i];
  }

OpenACC
• Introduced for accelerator support through directives (2012).
• There are compilers which support the 1.0 specification.
• OpenACC 2.0 introduces support for inlined functions.

OpenMP
• OpenMP has been around for a while (1997).
• The OpenMP 4.0 specification (2013) includes accelerator support.
• Few compilers (e.g. ROSE) support parts of the 4.0 specification.

Code taken from: WHAT'S NEW IN OPENACC 2.0 AND OPENMP 4.0, GTC '14
Portable approaches for many-core programming (1/3)
8
Wrapper approach
• Create a tailored library with optimized functions
• Restricted to a set of operations, with flexibility from functors/lambdas

SkePU
• C++ template library
• Uses code skeletons for map, reduce, scan, mapreduce, …
• Uses OpenMP, OpenCL and CUDA as backends

Kokkos
• Kokkos is from Sandia National Laboratories
• C++ vector library with linear algebra routines
• Uses OpenMP and CUDA for x86 and NVIDIA GPU support

Thrust
• Vector library, similar to the standard template library (STL)
• C++ library masking OpenMP, Intel's TBB and CUDA for x86 processors and NVIDIA GPUs

All C++ libraries with tailored functionalities. HPC has a large C and Fortran community!

Portable approaches for many-core programming (2/3)
9
Source-to-source approach
• CU2CL & SWAN have limited CUDA support (3.2 and 2.0 respectively). Update?
• GPU Ocelot supports PTX from CUDA 4.2 (5.0 partially)
• PGI: CUDA-x86 appears to have been on hiatus since 2011

[Diagram: translation paths — CU2CL and SWAN translate NVIDIA CUDA to OpenCL (targeting x86, Xeon Phi, AMD GPU, NVIDIA GPU); Ocelot translates PTX assembly to x86; PGI CUDA-x86 compiles CUDA for x86.]

Portable approaches for many-core programming (3/3)
What does OCCA not do?
• Some programmer intervention is required to identify parallel for loops.

Auto-optimize:
• Programmer knowledge of the architecture is still invaluable.

Auto-layout:
• The programmer needs to decide how data is arranged in memory.

Auto-distribute:
• You can use MPI+OCCA, but you have to write the MPI code.
• We considered M-OCCA, but it devolves quickly into a PGAS.

Low-level code:
• We do not circumvent the vendor compilers.
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
12
13

[Diagram: GPU architecture — global memory, L2 cache, and work-groups (Group 0, 1, 2), each with shared/L1 memory. CPU architecture — global memory, L3 cache, and cores (Core 0, 1, 2), each with an L1 cache and registers.]

Computational hierarchies are similar
  void cpuFunction(){
    #pragma omp parallel for
    for(int i = 0; i < work; ++i){
      // Do [hopefully thread-independent] work
    }
  }

  __kernel void gpuFunction(){
    // for each work-group {
    //   for each work-item in group {
           // Do [group-independent] work
    //   }
    // }
  }

(A SIMT lane on the GPU corresponds to a SIMD lane on the CPU.)

Parallelization Paradigm
14
Description
• Minimal extensions to C, familiar to regular programmers
• Explicit loops expose parallelism for modern multi-core CPUs and accelerators
• Parallel loops are made explicit through a fourth for-loop statement: the inner and outer labels
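A minimal OKL sketch of such labeled loops (illustrative only: the kernel name, signature, and tile size of 16 are assumptions, following the convention that the fourth for-statement carries the outer/inner label):

```
kernel void saxpy(const int N, const float a,
                  const float * x, float * y){
  for(int g = 0; g < N; g += 16; outer0){      // mapped to work-groups / OpenMP threads
    for(int i = g; i < g + 16; ++i; inner0){   // mapped to work-items / SIMD lanes
      if(i < N)
        y[i] = a*x[i] + y[i];
    }
  }
}
```

The outer/inner nest mirrors the work-group / work-item hierarchy shown on the previous slides, which is what lets OCCA map one kernel source onto both CPU threading and GPU kernel launches.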