Workshop on Tools for Exascale, TGCC at CEA, Bruyères-le-Châtel, 2 October 2012
The H4H project
Hybrid programming models for heterogeneous architectures
Jean-Marc Morel - Bull
Optimize HPC Applications on Heterogeneous Architectures
Jean-Marc Morel - Bull - H4H
Outline
• H4H objectives, partnership, and organisation
• The H4H Application Development Process
• Programming Methods & Tools
• The optimized software stack
• Application domains and use cases
• Mid-term achievements
• PerfCloud: a complementary set of activities
What Is H4H Set For?
Provide developers of compute-intensive applications with a highly efficient hybrid programming environment for heterogeneous computing clusters composed of a mix of classical processors and hardware accelerators
• Facilitate the development process of HPC applications
• Maximize the overall performance of these applications
=> Empower technical and scientific computing
=> Accelerate research and innovation in many domains
=> Improve the competitiveness and independence of Europe
The H4H Partnership
• HPC Platform Provider: Bull
• Industrial HPC users: REPSOL, BMAT, DataLab, ATEME, Dassault-Aviation
• Simulation Software Editors: Efield, GNS, INTES, MAGMA, RECOM
• Software Tools Editors: Rogue Wave, GWT, CAPS
• Supercomputing Centres: JSC
• Research Labs & Software Institutes: SCAI, CEA LIST, Scilab Enterprises
• Academic partners: UAB, HLRS, ZIH, Telecom-SudParis, UVSQ
• Countries: Sweden, Spain, Germany, France
How Our Work Is Organized
WP1: Project Management & Dissemination
WP2: Programming Models, Methods, and Tools → Design a robust programming environment that allows programmers to develop efficient parallel programs for heterogeneous architectures.
WP3: Platforms → Develop, integrate, set up, and optimise the appropriate heterogeneous HPC platforms together with optimized software packages such as Scilab and SAMG.
H4H Technology
WP4: Applications → Evaluate the H4H technology using industrial test cases.
Feedback to technology providers (WP2 & WP3)
The H4H Application Development Process
[Diagram: the development cycle]
• Existing program → restructure; add or extend hybrid programming pragmas → high-level hybrid source code for heterogeneous architecture
• High-level hybrid source code → generate low-level source code and binary → hybrid binary code for heterogeneous architecture
• Hybrid binary code → execute on the optimized software stack for the heterogeneous execution platform (numerical libraries and solvers, OpenMP and accelerator runtimes, Open MPI library, job & resource management, OS)
• Execution → analyze correctness and performance → fix, restructure, optimize the high-level hybrid source code
The H4H development process and tools
[Diagram: the development cycle annotated with its tools]
• From an OpenMP program to high-level hybrid source code for heterogeneous architecture:
– Add data distribution directives, then Transform (STEP) → MPI + HMPP program
– Transform (OMP2HMPP) → HMPP program
– Add directives for hybrid regions → OpenMP + HMPP program
• Code generation: Accelerator (HMPP), MPI / OpenMP → hybrid binary code for heterogeneous architecture
• Execute on the optimized software stack for the heterogeneous execution platform (numerical libraries and solvers, OpenMP and accelerator runtimes, Open MPI library, job & resource management, OS)
• Execution analysis: correctness checkers and debuggers; performance prediction; memory / threading performance; parallel performance; monitoring
• Fix, restructure, optimize
Execution Analysis in Detail
• Correctness checkers and debuggers: Ayudame / Temanejo, Valgrind
• Memory / Threading Performance: ThreadSpotter, MAQAO
• Parallel Performance: Vampir, Scalasca, Score-P
• Monitoring: LWM2
• Performance Prediction: PAS2P
WP2 Partners: Programming Methods & Tools
CAPS entreprise
• HMPP
Jülich Supercomputing Centre
• Scalasca
• Score-P (“SILC measurement system” in FPP)
Rogue Wave AB
• ThreadSpotter
TU Dresden / GWT
• Vampir
• VampirTrace
• Score-P (“SILC measurement system” in FPP)
UAB / CAOS
• PAS2P
University of Stuttgart / HLRS
• Open MPI + Valgrind
• Ayudame / Temanejo
UVSQ
• MAQAO
WP3: Optimized software stack and libraries
Bull
• HPC software stack (bullx supercomputing suite)
Scilab Enterprises
• Scilab (open source numerical package)
CEA-LIST
• JIT compilation for Scilab on GPU
UAB / CAOS
• RADIC (Fault tolerance architecture)
WP4: Ten Application Partners
ATEME
• Video compression/processing (e.g. Motion estimator)
BMAT
• Music recognition & identification (Vericast)
Dassault Aviation
• CFD for aerodynamic design (AETHER solver)
Efield
• Electromagnetic fields modeling and simulation
GNS
• Metal forming simulation
GWT
• Open source simulation codes (e.g. molecular modeling)
INTES
• General purpose implicit finite element analysis system
MAGMA
• Casting process simulation
RECOM
• 3D combustion simulation
REPSOL
• Seismic imaging and reservoir simulation
Main achievements so far (1 / 4)
• Hybrid Programming Model:
– HMPP directives have been extended:
• HMPPAlt (HMPP Alternative) to replace calls to libraries executed on CPUs by calls to their equivalents on GPUs.
• Multi-device programming extension that enables the use of multiple accelerators, distributing data and computation efficiently between them.
– Contribution to the creation of the new open standard OpenACC (Members: PGI, Cray, NVIDIA, CAPS)
• Directives that specify loops and regions of code to be offloaded
• Portability across operating systems, host CPUs and accelerators
– Two prototypes of source-to-source translators:
• OpenMP → HMPP+MPI (to distribute data processing)
• OpenMP → HMPP (trade-off between performance & energy consumption)
– Investigation of PGAS approach and porting of OpenSHMEM on bullx
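The multi-device idea above (distributing data and computation between several accelerators) can be illustrated with a small partitioning sketch. This is not HMPP syntax or any H4H tool's API: the function name `partition` and the per-device `weights` (relative throughputs) are hypothetical, chosen only to show how a one-dimensional index range might be split into contiguous blocks proportional to device speed.

```python
from typing import List, Tuple


def partition(n_items: int, weights: List[float]) -> List[Tuple[int, int]]:
    """Split range(n_items) into contiguous [start, end) blocks, one per device.

    Block sizes are proportional to `weights`, a hypothetical measure of each
    device's relative throughput (e.g. 1.0 for a CPU socket, 3.0 for a GPU).
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")

    total = float(sum(weights))
    blocks = []
    start = 0
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        # The last block always ends at n_items so that no item is lost to rounding.
        if i == len(weights) - 1:
            end = n_items
        else:
            end = round(n_items * cumulative / total)
        blocks.append((start, end))
        start = end
    return blocks


if __name__ == "__main__":
    # One CPU-like device and one accelerator assumed to be three times faster.
    for device, (lo, hi) in enumerate(partition(100, [1.0, 3.0])):
        print(f"device {device}: items [{lo}, {hi})")
```

Equal weights give an even split; unequal weights give the faster device a proportionally larger block, which is one simple way to approach load balance between CPUs and GPUs when the relative speeds are known in advance.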
Main achievements so far (2 / 4)
• Performance measurement and analysis tools:
– Definition & Implementation of the new Open Trace Format (OTF2)
– Contribution to the development of the Score-P measurement infrastructure
– Enhancement & extension of tools (Scalasca, Vampir), e.g. support of HMPP, CUDA, OpenMP 3.0 tasks, …; better filtering, more scalability, …
– Enhancement of the performance prediction framework of MAQAO
– Enhancement of ThreadSpotter to optimize cache sharing
Main achievements so far (3 / 4)
• Software stack
– Contribution to the enhancement of the bullx supercomputing suite Advanced Edition (e.g. bullxMPI, power management framework, bi-rail IB Interconnect, cluster installation & management, etc.)
– Development of sciGPGPU (Scilab on GPU)
• Taking advantage of CUDA and OpenCL features and of important functions of the CuBLAS and CuFFT libraries
• Adding GPU-based functions required by applications (gpuFFT for fftw, gpuInterp & gpuInter2d for cubic spline evaluation, svd for singular value decomposition, and spec for eigenvalues of matrices and pencils)
– Development of Scilab MPI to provide distributed features based on MPI
– Extension & enhancement of the SAMG solver and LAMA (Library for Accelerated Math Applications) combining the power of CPUs and GPUs
Main achievements so far (4 / 4)
• Applications: 36 test cases in 10 domains
– Performance analysis, code restructuring / porting on GPU
– Significant performance improvements, e.g.:
• ATEME: Hierarchical motion estimation on GPU 33% faster
• RECOM: Control of placement on NUMA enabled a 3.8x speedup
• MAGMA:
– Solving a large system of equations on GPU can be 2x faster;
– Combined GPU-MPI yielded a 3x speedup;
– ThreadSpotter → 1.6x speedup by solving memory hierarchy problems
• Efield: HMPP enables rapid porting of critical code on GPU: 5x speedup of the FDTD code (finite-difference time-domain method)
• …
– Still many challenges
• Data transfer often impedes the overall performance improvement
• Difficult to achieve a good load balance between CPUs and GPUs
• How to overlap communication and computation
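The first and last challenges above can be made concrete with a back-of-the-envelope timing model. The sketch below is an illustration under stated assumptions, not a measurement from any H4H test case: the function names and all numbers are hypothetical, and the overlap model assumes that host-to-device copy, kernel execution, and device-to-host copy can each proceed concurrently on different equal-sized chunks (an idealised three-stage pipeline).

```python
def offload_time(t_in: float, t_kernel: float, t_out: float, n_chunks: int = 1) -> float:
    """Estimated time of an offloaded region split into n_chunks equal chunks.

    With n_chunks == 1 nothing overlaps: copy in, compute, copy out.
    With more chunks, the three stages are assumed to run as an ideal pipeline,
    so the total is one pass through all stages plus (n_chunks - 1) times the
    slowest per-chunk stage.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    stages = (t_in / n_chunks, t_kernel / n_chunks, t_out / n_chunks)
    return sum(stages) + (n_chunks - 1) * max(stages)


def overall_speedup(t_total_cpu: float, t_region_cpu: float, t_region_accel: float) -> float:
    """Whole-application speedup when only one region is moved to the accelerator."""
    return t_total_cpu / (t_total_cpu - t_region_cpu + t_region_accel)


if __name__ == "__main__":
    # Hypothetical application: 100 s on CPU, of which 80 s is the offloaded region.
    # Hypothetical accelerator: kernel takes 8 s, each transfer direction takes 6 s.
    for chunks in (1, 4, 16):
        region = offload_time(6.0, 8.0, 6.0, n_chunks=chunks)
        print(chunks, "chunk(s):", region, "s in region, overall speedup",
              overall_speedup(100.0, 80.0, region))
```

In this model a kernel that is ten times faster than the CPU region yields only a 2.5x overall speedup once unoverlapped transfers are counted, and chunking recovers part of the loss. Real behaviour depends on the hardware's copy engines, transfer latency, and kernel launch overhead, none of which are modelled here.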
PerfCloud: A complementary set of activities
• Objectives
– Develop advanced technologies to build a new generation of HPC systems that can be used in cloud infrastructures
• A new HPC architecture based on MIC (Intel Many Integrated Core)
• A new cooling technology to reduce energy consumption
• An advanced software development environment for MIC
• A new software infrastructure to manage such a heterogeneous cluster and, in particular, to help monitor energy consumption
– Demonstrate the usefulness of these technologies on large applications:
• Atmospheric dispersion and weather simulation
• Image retrieval in large databases of images & videos
– Identify respective advantages & constraints of GPU and MIC