This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777. http://www.montblanc-project.eu Jean-François Méhaut

Jean-François Méhaut · 2017. 5. 10. · Frédéric Desprez(DR1 Inria: Graal, Avalon) parallel algorithmic, numerical libraries Ylies FalconeMdC UJF (PhD Grenoble 2009, 2Y Rennes,

Mar 12, 2021




• The New Killer Processors
• Overview of the Mont-Blanc projects
• BOAST DSL for computing kernels


Corse: Compiler Optimization and Run-time SystEms ∗

Fabrice Rastello

∗Inria Joint Project Team (proposal)

June 9, 2015

Fabrice Rastello (Inria) Corse June 9, 2015 1 / 26


Project-team composition / Institutional context

Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec

Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut

8 PhD, 3 Post-doc, 1 Engineer


Permanent member curriculum vitae

Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end

François Broquedis, MdC INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management

Frédéric Desprez (DR1 Inria: Graal, Avalon): parallel algorithmic, numerical libraries

Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime

Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications

Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization


Overall Objectives

Domain: Compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)

Issues: Scalability and heterogeneity/complexity ≡ trade-off between specific optimizations and programmability/portability

Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneity

Applications: dynamic-systems / loop-nests / graph-algorithmic / signal-processing

Approach: combine static/dynamic & compiler/run-time


First, vector processors dominated HPC

• 1st Top500 list (June 1993) dominated by DLP architectures
  • Cray vector, 41%
  • MasPar SIMD, 11%
  • Convex/HP vector, 5%

• Fujitsu Wind Tunnel is #1 1993-1996, with 170 GFLOPS


Then, commodity took over special purpose

• ASCI Red, Sandia
  • 1997, 1 TFLOPS
  • 9,298 cores @ 200 MHz
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS

• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8,192 proc. @ 375 MHz
  • IBM Power 3

Transition from Vector parallelism to Message-Passing Programming Models


Commodity components drive HPC

• RISC processors replaced vectors
• x86 processors replaced RISC

• Vector processors survive as (widening) SIMD extensions


The killer microprocessors

• Microprocessors killed the Vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener

• Need 10 microprocessors to achieve the performance of 1 Vector CPU
  • SIMD vs. MIMD programming paradigms

[Figure: peak MFLOPS (log scale, 10 to 10,000) versus year (1974 to 1999). Vector processors: Cray-1, Cray-C90, NEC SX4, SX5. Microprocessors: Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200.]


The killer mobile processors™

• Microprocessors killed the Vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener

• History may be about to repeat itself …
  • Mobile processors are not faster …
  • … but they are significantly cheaper

[Figure: peak MFLOPS (log scale, 100 to 1,000,000) versus year (1990 to 2015). Processors shown: Alpha, Intel, AMD, NVIDIA Tegra, Samsung Exynos, 4-core ARMv8 @ 1.5 GHz.]


Mobile SoC vs Server processor

                 Mobile SoC       Server processor    Ratio
Performance      5.2 GFLOPS       153 GFLOPS          x30
Cost             21$ (1)          1500$ (2)           x70

1. Leaked Tegra3 price from the Nexus 7 Bill of Materials
2. Non-discounted list price for the 8-core Intel E5 Sandy Bridge

Second mobile data point: 15.2 GFLOPS, 21$ (?), giving x10 in performance and x70 in cost
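The x30, x70 and x10 factors on this slide are plain ratios of the quoted figures. A minimal C sketch of that arithmetic, assuming the 5.2 GFLOPS and 21$ figures describe the mobile SoC and the 153 GFLOPS and 1500$ figures the server processor; the helper names are hypothetical:

```c
#include <assert.h>

/* Ratio between a server-side figure and the matching mobile-side figure
   (hypothetical helper, used for both the performance and the cost gap). */
double gap(double server, double mobile) {
    assert(mobile > 0.0);
    return server / mobile;
}

/* Peak GFLOPS obtained per dollar of chip price (hypothetical helper). */
double gflops_per_dollar(double gflops, double price_usd) {
    assert(price_usd > 0.0);
    return gflops / price_usd;
}
```

With the slide's numbers, `gap(153.0, 5.2)` is about 29.4 (rounded to x30), `gap(1500.0, 21.0)` is about 71.4 (x70), and `gap(153.0, 15.2)` is about 10.1 (x10). `gflops_per_dollar` gives roughly 0.25 for the mobile SoC against roughly 0.10 for the server part.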


SoC under study: CPU and Memory

NVIDIA Tegra 2
2 x ARM Cortex-A9 @ 1GHz
1 x 32-bit DDR2-333 channel
32KB L1 + 1MB L2

NVIDIA Tegra 3
4 x ARM Cortex-A9 @ 1.3GHz
2 x 32-bit DDR23-750 channels
32KB L1 + 1MB L2

Samsung Exynos 5 Dual
2 x ARM Cortex-A15 @ 1.7GHz
2 x 32-bit DDR3-800 channels
32KB L1 + 1MB L2

Intel Core i7-2760QM
4 x Intel Sandy Bridge @ 2.4GHz
2 x 64-bit DDR3-800 channels
32KB L1 + 1MB L2 + 6MB L3


Evaluated kernels

Tag    Full name                        Properties
vecop  Vector operation                 Common operation in numerical codes
dmmm   Dense matrix-matrix multiply     Data reuse and compute performance
3dstc  3D volume stencil                Strided memory accesses (7-point 3D stencil)
2dcon  2D convolution                   Spatial locality
fft    1D FFT transform                 Peak floating-point, variable stride accesses
red    Reduction operation              Varying levels of parallelism
hist   Histogram calculation            Local privatization and reduction stage
msort  Generic merge sort               Barrier synchronization
nbody  N-body calculation               Irregular memory accesses
amcd   Markov chain Monte-Carlo method  Embarrassingly parallel
spvm   Sparse matrix-vector multiply    Load imbalance

Further columns on the slide: pthreads, OpenMP, OmpSs, CUDA, OpenCL


Single core performance and energy

• Tegra3 is 1.4x faster than Tegra2
  • Higher clock frequency

• Exynos 5 is 1.7x faster than Tegra3
  • Better frequency, memory bandwidth, and core microarchitecture

• Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency
• ARM platforms more energy-efficient than Intel platform


Multicore performance and energy

• Tegra3 is as fast as Exynos 5, a bit more energy efficient
  • 4-core vs. 2-core

• ARM multicores as efficient as Intel at the same frequency
• Intel still more energy efficient at highest performance

• ARM CPU is not the major power sink in the platform


Memory bandwidth (STREAM)

• Exynos 5 improves dramatically over Tegra (4.5x)
  • Dual-channel DDR3
  • ARM Cortex-A15 sustains more in-flight cache misses


Tibidabo: The first ARM HPC multicore cluster

• Proof of concept
  • It is possible to deploy a cluster of smartphone processors

• Enable software stack development

Q7 carrier board: 2 x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 Watts, 0.3 GFLOPS / W

Q7 Tegra 2: 2 x Cortex-A9 @ 1GHz, 2 GFLOPS, 5 Watts (?), 0.4 GFLOPS / W

1U Rackable blade: 8 nodes, 16 GFLOPS, 65 Watts, 0.25 GFLOPS / W

2 Racks: 32 blade containers, 256 nodes, 512 cores, 9x 48-port 1GbE switch, 512 GFLOPS, 3.4 kW, 0.15 GFLOPS / W
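The GFLOPS / W figures at each level of the Tibidabo hierarchy are peak performance divided by power draw. A minimal C sketch of that calculation, using only numbers quoted on this slide; the `level_t` type and the function name are hypothetical:

```c
#include <assert.h>

/* One level of the cluster hierarchy: peak performance and power draw. */
typedef struct {
    const char *name;
    double gflops;
    double watts;
} level_t;

/* Energy efficiency of one level, in GFLOPS per watt. */
double gflops_per_watt(level_t level) {
    assert(level.watts > 0.0);
    return level.gflops / level.watts;
}
```

Evaluated on the slide's values this gives 2 / 5 = 0.4 for the Tegra 2 module, 2 / 7 ≈ 0.29 for the carrier board, 16 / 65 ≈ 0.25 for a blade and 512 / 3400 ≈ 0.15 for the two racks, so the efficiency drops as more non-CPU components are added around the SoC.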


HPC System software stack on ARM

[Figure: software stack diagram. Source files (C, C++, FORTRAN, …) go through the compilers (gcc, gfortran, OmpSs, …) to produce executables. Other layers shown: scientific libraries (ATLAS, FFTW, HDF5, …), developer tools (Paraver, Scalasca, …), cluster management (Slurm), OmpSs runtime library (NANOS++), CUDA, OpenCL, MPI, GASNet, Linux on each node, CPU and GPU hardware.]

• Open source system software stack
  • Ubuntu Linux OS
  • GNU compilers
    • gcc, g++, gfortran
  • Scientific libraries
    • ATLAS, FFTW, HDF5, ...
  • Slurm cluster management

• Runtime libraries
  • MPICH2, OpenMPI
  • OmpSs toolchain

• Performance analysis tools
  • Paraver, Scalasca

• Allinea DDT 3.1 debugger
  • Ported to ARM


Parallel scalability

• HPC applications scale well on Tegra2 cluster
  • Capable of exploiting enough nodes to compensate for lower node performance


SoC under study: Interconnection

NVIDIA Tegra 2
1 GbE (PCIe)
100 Mbit (USB 2.0)

NVIDIA Tegra 3
1 GbE (PCIe)
100 Mbit (USB 2.0)

Samsung Exynos 5 Dual
1 GbE (USB 3.0)
100 Mbit (USB 2.0)

Intel Core i7-2760QM
1 GbE (PCIe)
QDR InfiniBand (PCIe)


Interconnection network: Latency

• TCP/IP adds a lot of CPU overhead
• OpenMX driver interfaces directly to the Ethernet NIC
• USB stack adds extra latency on top of network stack

Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results


Interconnection network: Bandwidth

• TCP/IP overhead prevents Cortex-A9 CPU from achieving full bandwidth

• USB stack overheads prevent Exynos 5 from achieving full bandwidth, even on OpenMX

Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results


Interconnect vs. Performance ratio

• Mobile SoCs have low-bandwidth interconnect …
  • 1 GbE or USB 3.0 (6Gb/s)

• … but ratio to performance is similar to high-end
  • 40 Gb/s InfiniBand

Peak IN bytes / FLOPS

              1 Gb/s   6 Gb/s   40 Gb/s
Tegra2        0.06     0.38     2.50
Tegra3        0.02     0.14     0.96
Exynos 5250   0.02     0.11     0.74
Intel i7      0.00     0.01     0.07
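Each entry of the bytes / FLOPS table appears to be the link bandwidth in bytes per second divided by the node's peak floating-point rate. A minimal C sketch of that calculation, assuming the 2 GFLOPS (Tegra2) and 5.2 GFLOPS (Tegra3) peak figures quoted on earlier slides; the function name is hypothetical:

```c
#include <assert.h>

/* Peak interconnect bytes per floating-point operation.
   link_gbps:   link speed in gigabits per second
   peak_gflops: node peak performance in GFLOPS */
double bytes_per_flop(double link_gbps, double peak_gflops) {
    assert(peak_gflops > 0.0);
    return (link_gbps / 8.0) / peak_gflops;  /* 8 bits per byte */
}
```

For Tegra2 this reproduces the first row: `bytes_per_flop(1.0, 2.0)` is 0.0625, `bytes_per_flop(6.0, 2.0)` is 0.375 and `bytes_per_flop(40.0, 2.0)` is 2.5. For Tegra3, `bytes_per_flop(40.0, 5.2)` is about 0.96. The Exynos 5250 and Intel i7 rows need peak figures that are not stated in these slides.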


Limitations of current mobile processors for HPC

• 32-bit memory controller
  • Even if ARM Cortex-A15 offers 40-bit address space

• No ECC protection in memory
  • Limited scalability, errors will appear beyond a certain number of nodes

• No standard server I/O interfaces
  • Do NOT provide native Ethernet or PCI Express
  • Provide USB 3.0 and SATA (required for tablets)

• No network protocol off-load engine
  • TCP/IP, OpenMX, USB protocol stacks run on the CPU

• Thermal package not designed for sustained full-power operation

• All these are implementation decisions, not unsolvable problems
  • Only need a business case to justify the cost of including the new features … such as the HPC and server markets


Server chips vs. mobile chips

Server chips: Intel SandyBridge (E5-2670), AppliedMicro X-Gene, Calxeda EnergyCore (“Midway”), TI Keystone II
Mobile chips: Nvidia Tegra4, Samsung Exynos 5 Octa

Per-node figure

#cores 8 16-32 4 4 4 4+4

CPU Sandy Bridge, Custom ARMv8, Cortex-A15, Cortex-A15, Cortex-A15, Cortex-A15 + Cortex-A7

Technology 32nm 40nm 28nm 28nm 28nm

Clock speed 2.6GHz 3GHz 2GHz 1.9GHz 1.8GHz

Memory size 750GB ? 4GB 4GB 4GB 4GB

Memory bandwidth 51.2GB/s 80 GB/s 12.8 GB/s 12.8 GB/s 12.8 GB/s

ECC in DRAM Yes Yes Yes Yes No No

I/O bandwidth 80GB/s ? 4 x 10 Gb/s 10 Gb/s 6 Gb/s * 6 Gb/s *

I/O interface PCIe Integrated Integrated Integrated USB 3.0 USB 3.0

Protocol offload (in the NIC) Yes Yes Yes No No


Conclusions

• Mobile processors have qualities that make them interesting for HPC
  • FP64 capability
  • Performance increasing rapidly
  • Large market, many providers, competition, low cost
  • Embedded GPU accelerator

• Current limitations due to target market conditions
  • Not real technical challenges

• A whole set of ARM server chips is coming
  • Solving most of the limitations identified

• Get ready for the change, before it happens …


Low-Power High Performance Computing

● Industrial collaboration

● Kalray (http://www.kalray.eu)
  ● French fabless semiconductor and software company founded in 2008
  ● France (Grenoble, Orsay), USA (California), Japan (Tokyo)
  ● Company developing and selling a new generation of manycore processors

● MPPA-256
  ● Multi-Purpose Processor Array (MPPA)
  ● Manycore processor : 256 cores on a single chip
  ● Low Power Consumption [5W - 11W]


Kalray MPPA-256 architecture

● 256 cores (PEs) @ 400 MHz : 16 clusters, 16 PEs per cluster

● PEs share 2MB of memory

● Absence of cache coherence protocol inside the cluster

● Network-on-Chip (NoC) : communication between clusters

● 4 I/O subsystems : 2 connected to external memory


Seismic Wave Propagation (Ondes3D, BRGM)

● Simulation composed of time steps

● In each time step (3D simulation)
  ● The first triple nested loop computes the velocity components
  ● The second loop reuses the velocity result of the previous time step to update the stress field

4th-order stencil
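To illustrate what a 4th-order stencil looks like, here is a minimal 1D sketch in C using the standard centered fourth-order finite-difference formula for a first derivative. It is not the Ondes3D kernel, whose coefficients and grid layout are not given in these slides; the function name and the 1D setting are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fourth-order centered first derivative of f on a uniform 1D grid of
   spacing h. out[i] is written for 2 <= i < n-2; the two points next to
   each boundary are left untouched. */
void ddx_4th_order(const double *f, double *out, size_t n, double h) {
    assert(h > 0.0);
    for (size_t i = 2; i + 2 < n; i++) {
        out[i] = (-f[i + 2] + 8.0 * f[i + 1] - 8.0 * f[i - 1] + f[i - 2])
                 / (12.0 * h);
    }
}
```

Each output point reads four neighbours along the axis. In a 3D code the same pattern is applied along x, y and z inside the triple nested loop, which is why memory access patterns and tiling matter for this kind of kernel.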


Overview of Parallel Execution on MPPA-256

● Two-level tiling scheme to exploit the memory hierarchy of MPPA-256
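A minimal sketch of a two-level decomposition, assuming a first split of the domain across the 16 clusters and a second split of each cluster tile across its 16 PEs. The 1D domain, the block distribution and the names `range_t`, `split` and `pe_tile` are illustrative assumptions; the actual tiling scheme used on MPPA-256 is not detailed in these slides.

```c
#include <assert.h>

/* Half-open index range [lo, hi). */
typedef struct { int lo; int hi; } range_t;

/* Split r into `parts` nearly equal contiguous blocks and return block idx.
   The first (length % parts) blocks receive one extra element. */
range_t split(range_t r, int parts, int idx) {
    assert(parts > 0 && idx >= 0 && idx < parts);
    int len = r.hi - r.lo;
    int base = len / parts;
    int extra = len % parts;
    range_t out;
    out.lo = r.lo + idx * base + (idx < extra ? idx : extra);
    out.hi = out.lo + base + (idx < extra ? 1 : 0);
    return out;
}

/* Two-level tiling: pick the cluster tile, then the PE sub-tile inside it. */
range_t pe_tile(range_t domain, int cluster, int pe) {
    range_t cluster_tile = split(domain, 16, cluster);  /* 16 clusters */
    return split(cluster_tile, 16, pe);                 /* 16 PEs per cluster */
}
```

The first level would be sized so that a cluster tile fits in the 2MB of cluster memory, and the second level spreads that tile over the PEs that share it.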


Introduction Case Studies A Parametrized Generator Evaluation Conclusions and Future Work

The Mont-Blanc European Projects

Mont-Blanc 1 (2011-2015), Mont-Blanc 2 (2013-2016) :

Develop prototypes of HPC clusters using low power commercially available embedded technology (ARM CPUs, low power GPUs...).

Design the next generation in HPC systems based on embedded technologies and experiments on the prototypes.

Develop a portfolio of existing applications to test these systems and optimize their efficiency, using BSC’s OmpSs programming model (11 existing applications were selected for this portfolio).

Build Software Stack (OS, runtime, performance tools,...)

Prototype : based on Exynos 5250 : ARM dual core Cortex A15 with T604 Mali GPU (OpenCL)


BigDFT: a Tool for Nanotechnologies

Ab initio simulation :

Simulates the properties of crystals and molecules,

Computes the electronic density, based on Daubechies wavelets.

This formalism was chosen because it is fit for HPC computations :

Each orbital can be treated independently most of the time,

Operators on orbitals are simple and straightforward.

Mainly developed in Europe :

CEA-DSM/INAC (Grenoble)

Basel, Louvain la Neuve,...

[Figure: electronic density around a methane molecule.]


BigDFT as an HPC application

Implementation details :

200,000 lines of Fortran 90 and C

Supports MPI, OpenMP, CUDA and OpenCL

Uses BLAS

Scalability up to 16000 cores of Curie and 288 GPUs

Operators can be expressed as 3D convolutions :

Wavelet Transform

Potential Energy

Kinetic Energy

These convolutions are separable and the filters are short (16 elements). They can take up to 90% of the computation time on some systems.


SPECFEM3D: a tool for wave propagation research

Wave propagation simulation :

Used for geophysics and material research,

Accurately simulate earthquakes,

Based on spectral finite element.

Developed all around the world :

France (CNRS Marseille),

Switzerland (ETH Zurich) CUDA,

United States (Princeton) Networking,

Grenoble (LIG/CNRS) OpenCL.

[Figure: Sichuan earthquake.]


SPECFEM3D as an HPC application

Implementation details :

80,000 lines of Fortran 90

Supports MPI, CUDA, OpenCL and an OmpSs + MPI miniapp

Scalability up to 693,600 cores on IBM BlueWaters


Case Study 1 : BigDFT’s MagicFilter

The simplest convolution found in BigDFT, corresponding to the potential operator.

Characteristics

Separable,

Filter length 16,

Transposition,

Periodic,

Only 32 operations per element.

Pseudo code

double filt[16] = {F0, F1, ..., F15};
void magicfilter(int n, int ndat,
                 double *in, double *out) {
  double temp;
  for (int j = 0; j < ndat; j++) {
    for (int i = 0; i < n; i++) {
      temp = 0;
      for (int k = 0; k < 16; k++) {
        /* "+ n" keeps the periodic index non-negative: in C, % can return a negative value */
        temp += in[((i - 7 + k + n) % n) + j * n]
                * filt[k];
      }
      out[j + i * ndat] = temp;
    }
  }
}


Case Study 2 : SPECFEM3D port to OpenCL

Existing CUDA code :

42 kernels and 15000 lines of code

kernels with 80+ parameters

∼ 7500 lines of CUDA code

∼ 7500 lines of wrapper code

Objectives :

Factorize the existing code,

Single OpenCL and CUDA description for the kernels,

Validate without unit tests, comparing native CUDA to generated CUDA executions,

Keep similar performance.

A Parametrized Generator


Classical Software Development Loop

[Figure: classical software development loop. Elements: developer, source code, binary, performance data. Steps: development, compilation, performance analysis, optimization.]

Kernel optimization workflow

Usually performed by a knowledgeable developer


Classical Software Development Loop

[Figure: classical software development loop, with the compilation step annotated: Gcc, Mercurium, OpenCL.]

Compilers perform optimizations

Architecture specific or generic optimizations


Classical Software Development Loop

[Figure: classical software development loop, with the performance analysis step annotated: MAQAO, HW counters, proprietary tools.]

Performance data hint at source transformations

Architecture specific or generic hints


Classical Software Development Loop

[Figure: classical software development loop. Elements: developer, source code, binary, performance data. Steps: development, compilation, performance analysis, optimization.]

Multiplication of kernel versions or loss of versions

Difficulty benchmarking versions against each other


BOAST Development Loop

[Figure: BOAST development loop. Elements: developer, generative source code, source code, binary, performance data. Steps: development, transformation, compilation, performance analysis, optimization.]

Meta-programming of optimizations in BOAST

High level object oriented language


BOAST Development Loop

[Figure: BOAST development loop, with the transformation step from generative source code to source code annotated: BOAST.]

Generate combination of optimizations

C, OpenCL, FORTRAN and CUDA are supported


BOAST Development Loop

[Figure: BOAST development loop, with the compilation step annotated (Gcc, Mercurium, OpenCL) and the performance analysis step annotated (MAQAO, HW counters, proprietary tools).]

Compilation and analysis are automated

Selection of best version can also be automated
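Automated selection can be as simple as timing every generated variant on the same input and keeping the fastest one. A minimal C sketch under that assumption; it is not BOAST's own runtime, and the `kernel_fn` signature, `time_kernel` and `select_best` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <time.h>

/* Assumed common signature shared by all candidate kernel versions. */
typedef void (*kernel_fn)(int n, const double *in, double *out);

/* CPU time in seconds spent running one kernel version `reps` times. */
double time_kernel(kernel_fn kernel, int n, const double *in, double *out,
                   int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        kernel(n, in, out);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Index of the smallest measured time, i.e. the best performing version.
   Ties are resolved in favour of the lowest index. */
int select_best(const double *times, int count) {
    assert(count > 0);
    int best = 0;
    for (int v = 1; v < count; v++) {
        if (times[v] < times[best]) best = v;
    }
    return best;
}
```

A driver would fill one entry of `times` per variant with `time_kernel` and pass the array to `select_best`. Other metrics, such as hardware counter values, could replace the timing without changing the selection step.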


BOAST

[Figure: BOAST workflow, in five numbered steps. An application kernel (SPECFEM3D, BigDFT, ...) is written in the BOAST DSL. After selecting optimizations and a target language, BOAST code generation produces a C, Fortran, OpenCL, CUDA, or C with vector intrinsics kernel. After selecting a compiler and options, gcc or opencl builds a binary kernel. After selecting input data and performance metrics, the BOAST runtime produces performance measurements. A binary analysis tool like MAQAO and an optimization space pruner (ASK, Collective Mind) complete the loop, which yields the best performing version.]


Use Case Driven

Parameters arising in a convolution :

Filter : length, values, center.

Direction : forward or inverse convolution.

Boundary conditions : free or periodic.

Unroll factor : arbitrary.

How are those parameters constraining our tool ?


Features required

Unroll factor :

Create and manipulate an unknown number of variables,

Create loops with variable steps.

Boundary conditions :

Manage arrays with parametrized size.

Filter and convolution direction :

Transform arrays.

And of course be able to describe convolutions and output them in different languages.
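To make the unroll-factor requirement concrete, here is what a generator has to emit for an unroll factor of 4: one accumulator variable per unrolled iteration, a main loop whose step equals the unroll factor, and a remainder loop. This is a hand-written C sketch on a dot product with a hypothetical function name, not BOAST output.

```c
#include <assert.h>

/* Dot product unrolled by 4: four independent accumulators, a main loop
   with step 4, and a remainder loop for the last n % 4 elements. */
double dot_unroll4(const double *x, const double *y, int n) {
    assert(n >= 0);
    double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
    int j = 0;
    for (; j <= n - 4; j += 4) {
        t0 += x[j]     * y[j];
        t1 += x[j + 1] * y[j + 1];
        t2 += x[j + 2] * y[j + 2];
        t3 += x[j + 3] * y[j + 3];
    }
    for (; j < n; j++) {  /* remainder iterations */
        t0 += x[j] * y[j];
    }
    return (t0 + t1) + (t2 + t3);
}
```

Changing the unroll factor changes the number of `t` variables and the loop step, which is why the generator must handle a variable count that is only known at generation time.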


Proposed Generator

Idea : use a high level language with support for operator overloading to describe the structure of the code, rather than trying to transform a decorated tree.

Define several abstractions :

Variables : type (array, float, integer), size...

Operators : assign, multiply...

Procedures and functions : parameters, variables...

Constructs : for, while...


Sample Code : Variables and Parameters

# simple variable
i = Int "i"
# simple constant
lowfil = Int( "lowfil", :const => 1-center )
# simple constant array
fil = Real("fil", :const => arr, :dim => [ Dim(lowfil,upfil) ])
# simple parameter
ndat = Int("ndat", :dir => :in)
# multidimensional array, an output parameter
y = Real("y", :dir => :out, :dim => [ Dim(ndat), Dim(dim_out_min, dim_out_max) ] )

Variables and Parameters are objects with a name, a type, and a set of named properties.


Sample Code : Procedure Declaration

The following declaration :

p = Procedure("magic_filter", [n,ndat,x,y], [lowfil,upfil])
open p

Outputs Fortran :

subroutine magicfilter(n, ndat, x, y)
  integer(kind=4), parameter :: lowfil = -8
  integer(kind=4), parameter :: upfil = 7
  integer(kind=4), intent(in) :: n
  integer(kind=4), intent(in) :: ndat
  real(kind=8), intent(in), dimension(0:n-1, ndat) :: x
  real(kind=8), intent(out), dimension(ndat, 0:n-1) :: y

Or C :

void magicfilter(const int32_t n, const int32_t ndat, const double * x, double * y){
  const int32_t lowfil = -8;
  const int32_t upfil = 7;


Sample Code: Constructs and Arrays

The following declaration:

    unroll = 5
    pr For(j,1,ndat-(unroll-1), unroll) {
      #.....
      pr tt2 === tt2 + x[k,j+1]*fil[l]
      #.....
    }

Outputs Fortran:

    do j=1, ndat-4, 5
      !......
      tt2=tt2+x(k,j+1)*fil(l)
      !......
    enddo

Or C:

    for(j=1; j<=ndat-4; j+=5){
      /* ........... */
      tt2=tt2+x[k-0+(j+1-1)*(n-1-0+1)]*fil[l-lowfil];
      /* ........... */
    }
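
The C output replaces the multidimensional access `x[k,j+1]` with a flat offset that accounts for each dimension's lower bound. The sketch below reproduces that arithmetic in plain Ruby, assuming column-major (Fortran) ordering, which is what the generated index expression implies. The helper `flat_index` and the sample sizes (`n = 4`, `ndat = 3`, a trip limit of 12) are made up for illustration.

```ruby
# Flatten a multi-index into a zero-based offset, column-major order.
# bounds holds one [lower, upper] pair per dimension.
def flat_index(indices, bounds)
  offset = 0
  stride = 1
  indices.zip(bounds).each do |idx, (lo, hi)|
    offset += (idx - lo) * stride   # shift by the lower bound
    stride *= hi - lo + 1           # extent of this dimension
  end
  offset
end

n = 4
ndat = 3
x_bounds   = [[0, n - 1], [1, ndat]]  # dimension(0:n-1, ndat)
fil_bounds = [[-8, 7]]                # Dim(lowfil, upfil) with lowfil = -8, upfil = 7

k = 2
j = 1
x_offset   = flat_index([k, j + 1], x_bounds)  # (k-0) + (j+1-1)*(n-1-0+1) = 6
fil_offset = flat_index([-8], fil_bounds)      # l - lowfil = 0

# Loop starts produced by For(j, 1, ndat-(unroll-1), unroll) for a length of 12.
unroll = 5
starts = 1.step(12 - (unroll - 1), unroll).to_a  # [1, 6]

puts x_offset, fil_offset, starts.inspect
```

With a length of 12 and an unroll factor of 5 the loop starts at 1 and 6, so elements 11 and 12 fall outside the unrolled body; the elided parts of the slide do not show how such remainders are handled.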


Generator Evaluation

Back to the test cases:

The generator was used to unroll the MagicFilter and evaluate its performance on an ARM processor and an Intel processor.

The generator was used to describe a SPECFEM3D kernel.


Performance Results

[Figure: performance results on Tegra2 and Intel T7500]


BigDFT Synthesis Kernel


Improvement for BigDFT

Most of the convolutions have been ported to BOAST.

Results are encouraging: on the hardware BigDFT was hand-optimized for, convolutions gained on average between 30 and 40% in performance.

MagicFilter OpenCL versions tailored to the problem size by BOAST gain 10 to 20% in performance.


SPECFEM3D OpenCL port

Fully ported to OpenCL with comparable performance (using the global_s362ani_small test case):

On a 2*6-core (E5-2630) machine with 2 K40, using 12 MPI processes:

OpenCL: 4m15s
CUDA: 3m10s

On a 2*4-core (E5620) machine with a K20, using 6 MPI processes:

OpenCL: 12m47s
CUDA: 11m23s

The difference comes from CUDA's ability to specify the minimum number of blocks to launch on a multiprocessor. Fewer than 4000 lines of BOAST code (7500 lines of CUDA originally).
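
To put the timings above on a common scale, the short Ruby sketch below converts them to seconds and computes how much slower the OpenCL runs are relative to CUDA. The helpers `seconds` and `slowdown_percent` are hypothetical names for this example; the input figures are the ones quoted on the slide.

```ruby
# Convert a "<minutes>m<seconds>s" string into seconds.
def seconds(str)
  m = str.match(/\A(\d+)m(\d+)s\z/)
  raise ArgumentError, "unexpected time format: #{str}" unless m
  m[1].to_i * 60 + m[2].to_i
end

# Relative slowdown of `slow` versus `fast`, in percent, one decimal.
def slowdown_percent(slow, fast)
  ((seconds(slow) - seconds(fast)) / seconds(fast).to_f * 100).round(1)
end

k40 = slowdown_percent("4m15s", "3m10s")    # 255 s vs 190 s
k20 = slowdown_percent("12m47s", "11m23s")  # 767 s vs 683 s

puts "2 x K40: OpenCL is #{k40}% slower than CUDA"  # 34.2
puts "K20:     OpenCL is #{k20}% slower than CUDA"  # 12.3
```

On these numbers the gap is about 34% on the dual-K40 machine and about 12% on the K20 machine.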


Conclusions and Future Work


Conclusions

The generator has been used to test several loop unrolling strategies in BigDFT.

Highlights:

Several output languages.

All constraints have been met.

An automatic benchmarking framework allows us to test several optimization levels and compilers.

Automatic non-regression testing.

Several algorithmically different versions can be generated (changing the filter, boundary conditions...).


Future Work and Considerations

Future work:

Produce an autotuning convolution library.

Implement a parametric space explorer or use an existing one (ASK: Adaptive Sampling Kit, Collective Mind...).

Vector code is supported, but needs improvements.

Test the OpenCL version of SPECFEM3D on the Mont-Blanc prototype.
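
As an illustration of what the simplest form of a parametric space explorer could look like, here is a brute-force sketch in plain Ruby. It is an assumption-laden toy, not part of BOAST, ASK, or Collective Mind: the function `explore`, the parameter names `unroll` and `vector`, and the cost formula are all invented. In a real setting the block would generate, compile, and time a kernel instead of evaluating a formula.

```ruby
# Exhaustively visit every combination of parameter values and keep the
# cheapest one. `space` maps parameter names to non-empty candidate lists;
# the block returns a cost (for example a measured run time) for one point.
def explore(space)
  names = space.keys
  first, *rest = space.values
  best = nil
  first.product(*rest).each do |values|
    point = names.zip(values).to_h
    cost = yield(point)
    best = [point, cost] if best.nil? || cost < best[1]
  end
  best
end

space = { unroll: [1, 2, 4, 8], vector: [false, true] }

# Made-up cost model standing in for a benchmark run.
best_point, best_cost = explore(space) do |p|
  10.0 / p[:unroll] + 0.5 * p[:unroll] - (p[:vector] ? 0.25 : 0.0)
end

puts best_point.inspect
puts best_cost  # 4.25
```

Exhaustive search is only practical for small spaces, which is one reason to consider a sampling-based tool instead.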

Questions raised:

Is this approach extensible enough?

Can we improve the language used further?
