Tim Warburton, David Medina, Axel Modave
Virginia Tech, USA
The OCCA abstract threading model
Implementation and performance for high-order finite-element computations
US Department of Energy - NSF - Air Force Office of Scientific Research - Office of Naval Research Shell - Halliburton - TOTAL E&P - Argonne National Laboratory - MD Anderson Cancer Center - Intel - Nvidia - AMD
2
Towards efficient HPC applications for industrial simulations

Multidisciplinary research
[Diagram: overlapping research areas — Numerical Analysis, Accelerated Computing, High Performance Scalability, Application & Basic Science, Approximation Theory, Numerical & Physical PDE Modeling, Industrial Scale]

Cluster NewRiver at Virginia Tech (*)
(*) Advanced Research Computing at Virginia Tech: https://secure.hosting.vt.edu/www.arc.vt.edu/computing/newriver/
3
Programming approach for HPC applications with many-core devices
MPI + X = 😎
Which X for multi-threading?
Towards efficient HPC applications for industrial simulations
4
Portability
Code portability
• CUDA, OpenCL, OpenMP, OpenACC, Intel TBB… are not code compatible.
• Not all APIs are installed on any given system.

Performance portability
• Logically similar kernels differ in performance (GCC & ICPC, OpenCL & CUDA).
• Naively porting OpenMP to CUDA or OpenCL will likely yield low performance.
Uncertainty
• Code life cycles are measured in decades.
• Architecture & API life cycles are measured in Moore doubling periods.
• Examples: IBM Cell processor, IBM Blue Gene/Q.
Need an efficient, durable, portable, open-source, vendor-independent approach for many-core programming
Challenges for efficient HPC applications
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
5
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
6
7
Directive approach
• Use of optional #pragma's to give the compiler transformation hints
• Aims for portability, performance and programmability
• OpenACC and OpenMP begin to resemble an API rather than code decorations

  #pragma omp target teams distribute parallel for
  for(int i = 0; i < N; ++i){
    y[i] = a*x[i] + y[i];
  }

OpenACC
• Introduced for accelerator support through directives (2012).
• There are compilers which support the 1.0 specification.
• OpenACC 2.0 introduces support for inlined functions.

OpenMP
• OpenMP has been around for a while (1997).
• The OpenMP 4.0 specification (2013) includes accelerator support.
• Few compilers (e.g. ROSE) support parts of the 4.0 specification.

Code taken from: WHAT'S NEW IN OPENACC 2.0 AND OPENMP 4.0, GTC '14
Portable approaches for many-core programming (1/3)
8
Wrapper approach
• Create a tailored library with optimized functions
• Restricted to a set of operations, with flexibility from functors/lambdas

SkePU
• C++ template library
• Uses code skeletons for map, reduce, scan, mapreduce, …
• Uses OpenMP, OpenCL and CUDA as backends

Kokkos
• Kokkos is from Sandia National Laboratories
• C++ vector library with linear algebra routines
• Uses OpenMP and CUDA for x86 and NVIDIA GPU support

Thrust
• Vector library, similar to the standard template library (STL)
• C++ library masking OpenMP, Intel's TBB and CUDA for x86 processors and NVIDIA GPUs

All C++ libraries with tailored functionalities. HPC has a large C and Fortran community!

Portable approaches for many-core programming (2/3)
9
Source-to-source approach
• CU2CL & SWAN have limited CUDA support (3.2 and 2.0 respectively). Update?
• GPU Ocelot supports PTX from CUDA 4.2 (5.0 partially)
• PGI: CUDA-x86 appears to have been on hiatus since 2011

[Diagram: translation paths — CU2CL and SWAN translate NVIDIA CUDA to OpenCL (targeting x86, Xeon Phi, AMD GPU, NVIDIA GPU); Ocelot translates PTX assembly to x86; PGI CUDA-x86 compiles CUDA for x86.]

Portable approaches for many-core programming (3/3)
What does OCCA not do?
• Some programmer intervention is required to identify parallel for loops.

Auto-optimize:
• Programmer knowledge of the architecture is still invaluable.

Auto-layout:
• The programmer needs to decide how data is arranged in memory.

Auto-distribute:
• You can use MPI+OCCA, but you have to write the MPI code.
• We considered M-OCCA, but it devolves quickly into a PGAS.

Low-level code:
• We do not circumvent the vendor compilers.
Portable programming framework - OCCA
Kernel Language (OKL) - API
Applications - Performance
12
13

[Diagram: GPU architecture — global memory, L2 cache, and work-groups (Group 0, 1, 2), each with shared/L1 memory. CPU architecture — global memory, L3 cache, and cores (Core 0, 1, 2), each with an L1 cache and registers.]

Computational hierarchies are similar
  void cpuFunction(){
    #pragma omp parallel for
    for(int i = 0; i < work; ++i){
      // Do [hopefully thread-independent] work
    }
  }

  __kernel void gpuFunction(){
    // for each work-group {
    //   for each work-item in group {
           // Do [group-independent] work
    //   }
    // }
  }

(A SIMT lane on the GPU corresponds to a SIMD lane on the CPU.)

Parallelization Paradigm
14
Description
• Minimal extensions to C, familiar to regular programmers
• Explicit loops expose parallelism for modern multi-core CPUs and accelerators
• Parallel loops are made explicit through a fourth for-loop statement: the inner and outer labels
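A minimal OKL sketch of such labeled loops (illustrative only: the kernel name, signature, and tile size of 16 are assumptions, following the convention that the fourth for-statement carries the outer/inner label):

```
kernel void saxpy(const int N, const float a,
                  const float * x, float * y){
  for(int g = 0; g < N; g += 16; outer0){      // mapped to work-groups / OpenMP threads
    for(int i = g; i < g + 16; ++i; inner0){   // mapped to work-items / SIMD lanes
      if(i < N)
        y[i] = a*x[i] + y[i];
    }
  }
}
```

The outer/inner nest mirrors the work-group / work-item hierarchy shown on the previous slides, which is what lets OCCA map one kernel source onto both CPU threading and GPU kernel launches.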