Jean-François Méhaut · 2017. 5. 10. · Frédéric Desprez(DR1 Inria: Graal, Avalon) parallel algorithmic, numerical libraries Ylies FalconeMdC UJF (PhD Grenoble 2009, 2Y Rennes,

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

http://www.montblanc-project.eu

Jean-François Méhaut


The New Killer Processors
Overview of the Mont-Blanc projects
BOAST DSL for computing kernels

Corse: Compiler Optimization and Run-time SystEms ∗

Fabrice Rastello

∗Inria Joint Project Team (proposal)

June 9, 2015


Project-team composition / Institutional context

Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec

Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut

8 PhD, 3 Post-doc, 1 Engineer


Permanent member curriculum vitae

Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end

François Broquedis, MdC INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management

Frédéric Desprez (DR1 Inria: Graal, Avalon): parallel algorithmic, numerical libraries

Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime

Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications

Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization


Overall Objectives

Domain: Compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)

Issues: Scalability and heterogeneity/complexity ≡ trade-off between specific optimizations and programmability/portability

Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneity

Applications: dynamic-systems / loop-nests / graph-algorithmic / signal-processing

Approach: combine static/dynamic & compiler/run-time


First, vector processors dominated HPC

• 1st Top500 list (June 1993) dominated by DLP architectures
  • Cray vector, 41%
  • MasPar SIMD, 11%
  • Convex/HP vector, 5%
• Fujitsu Wind Tunnel is #1 1993-1996, with 170 GFLOPS

http://upload.wikimedia.org/wikipedia/commons/f/f7/Cray-1-deutsches-museum.jpg

Then, commodity took over special purpose

• ASCI Red, Sandia
  • 1997, 1 TFLOPS
  • 9,298 cores @ 200 MHz
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS
• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8,192 proc. @ 375 MHz
  • IBM Power 3

Transition from Vector parallelism to Message-Passing Programming Models

Commodity components drive HPC

• RISC processors replaced vectors
• x86 processors replaced RISC
• Vector processors survive as (widening) SIMD extensions

The killer microprocessors

• Microprocessors killed the Vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 Vector CPU
  • SIMD vs. MIMD programming paradigms

[Figure: MFLOPS (log scale, 10 to 10,000) versus year (1974 to 1999). Vector CPUs: Cray-1, Cray-C90, NEC SX4, SX5. Microprocessors: Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200.]

The killer mobile processors™

• Microprocessors killed the Vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• History may be about to repeat itself …
  • Mobile processors are not faster …
  • … but they are significantly cheaper

[Figure: MFLOPS (log scale, 100 to 1,000,000) versus year (1990 to 2015). Processors shown: Alpha, Intel, AMD, NVIDIA Tegra, Samsung Exynos, 4-core ARMv8 1.5 GHz.]

Mobile SoC vs Server processor

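Performance: 5.2 GFLOPS (mobile SoC) vs. 153 GFLOPS (server processor), a x30 gap

Cost: $21 [1] (mobile SoC) vs. $1500 [2] (server processor), a x70 gap

With a 15.2 GFLOPS mobile SoC at $21 (?): x10 performance gap, x70 cost gap

[1] Leaked Tegra3 price from the Nexus 7 Bill of Materials
[2] Non-discounted list price for the 8-core Intel E5 Sandy Bridge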
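The gaps quoted on this slide are plain ratios of the listed figures. A minimal sketch that recomputes them is given below; the variable names are illustrative, and the 15.2 GFLOPS entry is the slide's own tentative value, marked "(?)" there. The results round to roughly x30, x70 and x10.

```python
# Figures quoted on the slide; names are illustrative only.
mobile_gflops = 5.2              # mobile SoC peak
server_gflops = 153.0            # 8-core Sandy Bridge E5 peak
mobile_cost = 21.0               # USD, leaked bill-of-materials price
server_cost = 1500.0             # USD, non-discounted list price
projected_mobile_gflops = 15.2   # tentative figure, marked "(?)" on the slide

perf_gap = server_gflops / mobile_gflops
cost_gap = server_cost / mobile_cost
projected_perf_gap = server_gflops / projected_mobile_gflops

# GFLOPS per dollar, the quantity the comparison is really about
mobile_value = mobile_gflops / mobile_cost
server_value = server_gflops / server_cost

print(f"performance gap: x{perf_gap:.1f}")
print(f"cost gap: x{cost_gap:.1f}")
print(f"projected performance gap: x{projected_perf_gap:.1f}")
print(f"GFLOPS per dollar: mobile {mobile_value:.3f}, server {server_value:.3f}")
```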

SoC under study: CPU and Memory

NVIDIA Tegra 2: 2 x ARM Cortex-A9 @ 1GHz; 1 x 32-bit DDR2-333 channel; 32KB L1 + 1MB L2

NVIDIA Tegra 3: 4 x ARM Cortex-A9 @ 1.3GHz; 2 x 32-bit DDR23-750 channels; 32KB L1 + 1MB L2

Samsung Exynos 5 Dual: 2 x ARM Cortex-A15 @ 1.7GHz; 2 x 32-bit DDR3-800 channels; 32KB L1 + 1MB L2

Intel Core i7-2760QM: 4 x Intel Sandy Bridge @ 2.4GHz; 2 x 64-bit DDR3-800 channels; 32KB L1 + 1MB L2 + 6MB L3

Evaluated kernels

Implementations considered: pthreads, OpenMP, OmpSs, CUDA, OpenCL

Tag | Full name | Properties
vecop | Vector operation | Common operation in numerical codes
dmmm | Dense matrix-matrix multiply | Data reuse and compute performance
3dstc | 3D volume stencil | Strided memory accesses (7-point 3D stencil)
2dcon | 2D convolution | Spatial locality
fft | 1D FFT transform | Peak floating-point, variable stride accesses
red | Reduction operation | Varying levels of parallelism
hist | Histogram calculation | Local privatization and reduction stage
msort | Generic merge sort | Barrier synchronization
nbody | N-body calculation | Irregular memory accesses
amcd | Markov chain Monte-Carlo method | Embarrassingly parallel
spvm | Sparse matrix-vector multiply | Load imbalance

Single core performance and energy

• Tegra3 is 1.4x faster than Tegra2
  • Higher clock frequency
• Exynos 5 is 1.7x faster than Tegra3
  • Better frequency, memory bandwidth, and core microarchitecture
• Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency
• ARM platforms more energy-efficient than Intel platform

Multicore performance and energy

• Tegra3 is as fast as Exynos 5, a bit more energy efficient
  • 4-core vs. 2-core
• ARM multicores as efficient as Intel at the same frequency
• Intel still more energy efficient at highest performance
• ARM CPU is not the major power sink in the platform

Memory bandwidth (STREAM)

• Exynos 5 improves dramatically over Tegra (4.5x)
  • Dual-channel DDR3
  • ARM Cortex-A15 sustains more in-flight cache misses

Tibidabo: The first ARM HPC multicore cluster

• Proof of concept
  • It is possible to deploy a cluster of smartphone processors
• Enable software stack development

Q7 carrier board: 2 x Cortex-A9; 2 GFLOPS; 1 GbE + 100 MbE; 7 Watts; 0.3 GFLOPS / W

Q7 Tegra 2: 2 x Cortex-A9 @ 1GHz; 2 GFLOPS; 5 Watts (?); 0.4 GFLOPS / W

1U Rackable blade: 8 nodes; 16 GFLOPS; 65 Watts; 0.25 GFLOPS / W

2 Racks: 32 blade containers; 256 nodes; 512 cores; 9x 48-port 1GbE switch; 512 GFLOPS; 3.4 kW; 0.15 GFLOPS / W
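The energy-efficiency figures at each packaging level are peak GFLOPS divided by power draw. The sketch below recomputes them from the numbers quoted above (the level names are only labels); it shows how efficiency drops from 0.4 GFLOPS/W at the module to about 0.15 GFLOPS/W for the full system once boards, blades and switches are included.

```python
# (level, peak GFLOPS, power in watts) as quoted on the slide
levels = [
    ("Q7 Tegra 2 module", 2.0, 5.0),
    ("Q7 carrier board", 2.0, 7.0),
    ("1U blade (8 nodes)", 16.0, 65.0),
    ("2 racks (256 nodes)", 512.0, 3400.0),
]


def gflops_per_watt(gflops, watts):
    """Peak energy efficiency of one packaging level."""
    return gflops / watts


efficiency = {name: gflops_per_watt(g, w) for name, g, w in levels}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} GFLOPS/W")
```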

HPC System software stack on ARM

[Figure: software stack layers. Source files (C, C++, FORTRAN, …); compilers (gcc, gfortran, OmpSs, …); executables; scientific libraries (ATLAS, FFTW, HDF5, …); developer tools (Paraver, Scalasca, …); cluster management (Slurm); OmpSs runtime library (NANOS++) on top of CUDA, OpenCL, MPI and GASNet; Linux on each CPU/GPU node.]

• Open source system software stack
  • Ubuntu Linux OS
  • GNU compilers: gcc, g++, gfortran
  • Scientific libraries: ATLAS, FFTW, HDF5, ...
  • Slurm cluster management
• Runtime libraries
  • MPICH2, OpenMPI
  • OmpSs toolchain
• Performance analysis tools
  • Paraver, Scalasca
• Allinea DDT 3.1 debugger
  • Ported to ARM

Parallel scalability

• HPC applications scale well on Tegra2 cluster
  • Capable of exploiting enough nodes to compensate for lower node performance

SoC under study: Interconnection

NVIDIA Tegra 2: 1 GbE (PCIe); 100 Mbit (USB 2.0)

NVIDIA Tegra 3: 1 GbE (PCIe); 100 Mbit (USB 2.0)

Samsung Exynos 5 Dual: 1 GbE (USB 3.0); 100 Mbit (USB 2.0)

Intel Core i7-2760QM: 1 GbE (PCIe); QDR InfiniBand (PCIe)

Interconnection network: Latency

• TCP/IP adds a lot of CPU overhead
• OpenMX driver interfaces directly to the Ethernet NIC
• USB stack adds extra latency on top of network stack

Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results

Interconnection network: Bandwidth

• TCP/IP overhead prevents Cortex-A9 CPU from achieving full bandwidth

• USB stack overheads prevent Exynos 5 from achieving full bandwidth, even on OpenMX


Interconnect vs. Performance ratio

• Mobile SoCs have low-bandwidth interconnect …
  • 1 GbE or USB 3.0 (6 Gb/s)
• … but ratio to performance is similar to high-end
  • 40 Gb/s InfiniBand

Peak interconnect (IN) bytes / FLOPS:

SoC | 1 Gb/s | 6 Gb/s | 40 Gb/s
Tegra2 | 0.06 | 0.38 | 2.50
Tegra3 | 0.02 | 0.14 | 0.96
Exynos 5250 | 0.02 | 0.11 | 0.74
Intel i7 | 0.00 | 0.01 | 0.07
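The bytes-per-FLOP entries can be reproduced by dividing link bandwidth (converted to bytes) by peak floating-point rate. The sketch below does this under stated assumptions: the 2 GFLOPS Tegra 2 figure comes from the Tibidabo slide, while the other peak rates (5.2, 6.8 and 76.8 GFLOPS) are assumptions chosen because they are consistent with the table, not values the slides state for these rows.

```python
# Peak GFLOPS per SoC. Only the Tegra2 value is quoted directly in the deck;
# the remaining values are assumptions consistent with the table.
peak_gflops = {
    "Tegra2": 2.0,
    "Tegra3": 5.2,        # assumption
    "Exynos 5250": 6.8,   # assumption
    "Intel i7": 76.8,     # assumption
}
link_gbits = [1, 6, 40]   # link speeds in Gb/s, as in the table header


def bytes_per_flop(link_gbit_per_s, gflops):
    """Interconnect bytes available per floating-point operation at peak."""
    # 8 bits per byte; both rates are "giga per second", so the prefix cancels.
    return (link_gbit_per_s / 8.0) / gflops


table = {
    soc: [bytes_per_flop(link, gflops) for link in link_gbits]
    for soc, gflops in peak_gflops.items()
}
for soc, row in table.items():
    print(soc, " ".join(f"{value:.2f}" for value in row))
```

Read this way, a mobile SoC on 1 GbE (0.02 to 0.06 bytes/FLOP) is in the same range as the Core i7 on 40 Gb/s InfiniBand (0.07 bytes/FLOP), which is the point the slide makes.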

Limitations of current mobile processors for HPC

• 32-bit memory controller
  • Even if ARM Cortex-A15 offers 40-bit address space
• No ECC protection in memory
  • Limited scalability, errors will appear beyond a certain number of nodes
• No standard server I/O interfaces
  • Do NOT provide native Ethernet or PCI Express
  • Provide USB 3.0 and SATA (required for tablets)
• No network protocol off-load engine
  • TCP/IP, OpenMX, USB protocol stacks run on the CPU
• Thermal package not designed for sustained full-power operation
• All these are implementation decisions, not unsolvable problems
  • Only need a business case to justify the cost of including the new features … such as the HPC and server markets

Server chips vs. mobile chips

Per-node figures. Server chips: Intel SandyBridge (E5-2670), AppliedMicro X-Gene, Calxeda EnergyCore (“Midway”), TI Keystone II. Mobile chips: Nvidia Tegra4, Samsung Exynos 5 Octa.

#cores: 8 | 16-32 | 4 | 4 | 4 | 4+4
CPU: Sandy Bridge | Custom ARMv8 | Cortex-A15 | Cortex-A15 | Cortex-A15 | Cortex-A15 + Cortex-A7
Technology: 32nm | 40nm | 28nm | 28nm | 28nm
Clock speed: 2.6GHz | 3GHz | 2GHz | 1.9GHz | 1.8GHz
Memory size: 750GB | ? | 4GB | 4GB | 4GB | 4GB
Memory bandwidth: 51.2 GB/s | 80 GB/s | 12.8 GB/s | 12.8 GB/s | 12.8 GB/s
ECC in DRAM: Yes | Yes | Yes | Yes | No | No
I/O bandwidth: 80 GB/s | ? | 4 x 10 Gb/s | 10 Gb/s | 6 Gb/s * | 6 Gb/s *
I/O interface: PCIe | Integrated | Integrated | Integrated | USB 3.0 | USB 3.0
Protocol offload (in the NIC): Yes | Yes | Yes | No | No

Conclusions

• Mobile processors have qualities that make them interesting for HPC
  • FP64 capability
  • Performance increasing rapidly
  • Large market, many providers, competition, low cost
  • Embedded GPU accelerator
• Current limitations due to target market conditions
  • Not real technical challenges
• A whole set of ARM server chips is coming
  • Solving most of the limitations identified
• Get ready for the change, before it happens …

Low-Power High Performance Computing

● Industrial collaboration
● Kalray (http://www.kalray.eu)
  ● French fabless semiconductor and software company founded in 2008
  ● France (Grenoble, Orsay), USA (California), Japan (Tokyo)
  ● Company developing and selling a new generation of manycore processors
● MPPA-256
  ● Multi-Purpose Processor Array (MPPA)
  ● Manycore processor: 256 cores on a single chip
  ● Low power consumption [5W - 11W]

Kalray MPPA-256 architecture

● 256 cores (PEs) @ 400 MHz : 16 clusters, 16 PEs per cluster

● PEs share 2MB of memory

● Absence of cache coherence protocol inside the cluster

● Network-on-Chip (NoC) : communication between clusters

● 4 I/O subsystems : 2 connected to external memory

Seismic Wave Propagation (Ondes3D, BRGM)

● Simulation composed of time steps
● In each time step (3D simulation)
  ● The first triple nested loop computes the velocity components
  ● The second loop reuses the velocity result of the previous time step to update the stress field

4th-order stencil
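The two-loop structure of a time step can be sketched in one dimension. The code below is not Ondes3D: it is a hypothetical 1D velocity-stress scheme on a periodic staggered grid, using the standard 4th-order staggered coefficients 9/8 and -1/24, with made-up material parameters. It only illustrates that each step is a velocity update driven by the stress field followed by a stress update driven by the velocity field.

```python
import math


def step(v, s, dt, dx, rho, mu):
    """Advance a 1D periodic velocity-stress system by one time step, in place."""
    n = len(v)
    c1, c2 = 9.0 / 8.0, -1.0 / 24.0   # 4th-order staggered-grid coefficients
    # First loop: update velocity from the spatial derivative of stress.
    for i in range(n):
        ds = c1 * (s[i] - s[i - 1]) + c2 * (s[(i + 1) % n] - s[i - 2])
        v[i] += dt / (rho * dx) * ds
    # Second loop: update stress from the spatial derivative of velocity.
    for i in range(n):
        dv = c1 * (v[(i + 1) % n] - v[i]) + c2 * (v[(i + 2) % n] - v[i - 1])
        s[i] += dt * mu / dx * dv
    return v, s


# Hypothetical setup: unit material parameters, Gaussian stress pulse, fluid at rest.
n = 100
velocity = [0.0] * n
stress = [math.exp(-(((i - 50) / 5.0) ** 2)) for i in range(n)]
initial_stress_sum = sum(stress)

for _ in range(100):
    step(velocity, stress, dt=0.5, dx=1.0, rho=1.0, mu=1.0)
```

Negative list indices wrap around in Python, which is what makes the domain periodic here; a 3D code with free or absorbing boundaries would need explicit boundary handling.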

Overview of Parallel Execution on MPPA-256

● Two-level tiling scheme to exploit the memory hierarchy of MPPA-256
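A two-level tiling scheme can be sketched as follows. This is a generic illustration, not the scheme used in the actual port: the first level assigns one contiguous region of the 3D domain to each of the 16 clusters, and the second level cuts that region into tiles small enough to be staged in the cluster's local memory. The domain size, cluster grid and tile shape are hypothetical, and halo cells needed by the stencil are ignored.

```python
from itertools import product


def split(length, parts):
    """Split range(length) into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def blocks(lo, hi, step):
    """Cut [lo, hi) into blocks of at most `step` elements."""
    return [(s, min(s + step, hi)) for s in range(lo, hi, step)]


def two_level_tiles(shape, cluster_grid, tile_shape):
    """Yield (cluster_id, tile) pairs; a tile is three (lo, hi) ranges."""
    per_axis = [split(n, g) for n, g in zip(shape, cluster_grid)]
    for cluster_id, region in enumerate(product(*per_axis)):
        # Second level: tiles sized to fit the cluster's local memory.
        sub = [blocks(lo, hi, t) for (lo, hi), t in zip(region, tile_shape)]
        for tile in product(*sub):
            yield cluster_id, tile


# Hypothetical sizes: 64 x 64 x 32 domain, 4 x 4 x 1 cluster grid, 8 x 8 x 32 tiles.
tiles = list(two_level_tiles((64, 64, 32), (4, 4, 1), (8, 8, 32)))
print(len(tiles), "tiles over", len({c for c, _ in tiles}), "clusters")
```

With these sizes a tile holds 8 x 8 x 32 double-precision values, 16 KB, so several arrays and their buffers would fit in a 2MB cluster memory.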

Outline: Introduction; Case Studies; A Parametrized Generator; Evaluation; Conclusions and Future Work

The Mont-Blanc European Projects

Mont-Blanc 1 (2011-2015), Mont-Blanc 2 (2013-2016):

Develop prototypes of HPC clusters using low power commercially available embedded technology (ARM CPUs, low power GPUs...).

Design the next generation in HPC systems based on embedded technologies and experiments on the prototypes.

Develop a portfolio of existing applications to test these systems and optimize their efficiency, using BSC’s OmpSs programming model (11 existing applications were selected for this portfolio).

Build Software Stack (OS, runtime, performance tools,...)

Prototype: based on Exynos 5250: ARM dual core Cortex A15 with T604 Mali GPU (OpenCL)


BigDFT a Tool for Nanotechnologies

Ab initio simulation:

Simulates the properties of crystals and molecules,

Computes the electronic density, based on Daubechies wavelets.

This formalism was chosen because it is fit for HPC computations:

Each orbital can be treated independently most of the time,

Operators on orbitals are simple and straightforward.

Mainly developed in Europe:

CEA-DSM/INAC (Grenoble)

Basel, Louvain la Neuve,...

[Figure: electronic density around a methane molecule.]


BigDFT as an HPC application

Implementation details :

200,000 lines of Fortran 90 and C

Supports MPI, OpenMP, CUDA and OpenCL

Uses BLAS

Scalability up to 16000 cores of Curie and 288 GPUs

Operators can be expressed as 3D convolutions :

Wavelet Transform

Potential Energy

Kinetic Energy

These convolutions are separable and the filters are short (16 elements). They can take up to 90% of the computation time on some systems.


SPECFEM3D a tool for wave propagation research

Wave propagation simulation:

Used for geophysics and material research,

Accurately simulates earthquakes,

Based on spectral finite elements.

Developed all around the world:

France (CNRS Marseille),

Switzerland (ETH Zurich): CUDA,

United States (Princeton): networking,

Grenoble (LIG/CNRS): OpenCL.

[Figure: Sichuan earthquake.]


SPECFEM3D as an HPC application

Implementation details :

80,000 lines of Fortran 90

Supports MPI, CUDA, OpenCL and an OmpSs + MPI miniapp

Scalability up to 693,600 cores on Blue Waters (Cray)


Case Study 1 : BigDFT’s MagicFilter

The simplest convolution found in BigDFT; it corresponds to the potential operator.

Characteristics

Separable,

Filter length 16,

Transposition,

Periodic,

Only 32 operations per element.

Pseudo code

double filt[16] = {F0, F1, ..., F15};  /* filter coefficients, elided here */
void magicfilter(int n, int ndat, const double *in, double *out) {
  int i, j, k;
  double temp;
  for (j = 0; j < ndat; j++) {
    for (i = 0; i < n; i++) {
      temp = 0;
      for (k = 0; k < 16; k++) {
        /* adding n keeps the periodic index non-negative (assumes n >= 7) */
        temp += in[((i - 7 + k + n) % n) + j * n] * filt[k];
      }
      out[j + i * ndat] = temp;  /* transposed store */
    }
  }
}
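A reference version of the same loop nest is useful for checking generated kernels. The sketch below mirrors the pseudo code in Python; since the real filter coefficients are not given in the slides, it is exercised with two hypothetical filters whose results are easy to predict: a unit impulse at tap 7, which reduces the operation to a pure transposition, and an all-ones filter.

```python
def magicfilter(n, ndat, data_in, filt):
    """Periodic convolution along the first dimension, with transposition.

    data_in is a flat list laid out as in[i + j*n];
    the result is laid out as out[j + i*ndat].
    """
    center = 7  # offset used by the pseudo code for its 16-tap filter
    out = [0.0] * (n * ndat)
    for j in range(ndat):
        for i in range(n):
            acc = 0.0
            for k, coeff in enumerate(filt):
                # Python's % always returns a non-negative index here.
                acc += data_in[(i - center + k) % n + j * n] * coeff
            out[j + i * ndat] = acc
    return out


n, ndat = 4, 3
x = [float(value) for value in range(n * ndat)]

# Hypothetical filter 1: unit impulse at the centre tap, a pure transposition.
impulse = [0.0] * 16
impulse[7] = 1.0
y_transposed = magicfilter(n, ndat, x, impulse)

# Hypothetical filter 2: all ones; with n = 4 each input is visited 4 times.
y_summed = magicfilter(n, ndat, x, [1.0] * 16)
```

Each output element costs 16 multiplications and 16 additions, which is the "32 operations per element" quoted in the characteristics.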



Case study 2: SPECFEM3D port to OpenCL

Existing CUDA code :

42 kernels and 15000 lines of code

kernels with 80+ parameters

∼ 7500 lines of CUDA code

∼ 7500 lines of wrapper code

Objectives :

Factorize the existing code,

Single OpenCL and CUDA description for the kernels,

Validate without unit tests, comparing native CUDA to generated CUDA executions

Keep similar performance.


A Parametrized Generator



Classical Software Development Loop

[Figure: development loop linking Developer, Source Code, Binary and Performance data through the Development, Compilation, Performance Analysis and Optimization steps.]

Kernel optimization workflow

Usually performed by a knowledgeable developer

[Figure: same loop, compilation step highlighted: Gcc, Mercurium, OpenCL.]

Compilers perform optimizations

Architecture specific or generic optimizations

[Figure: same loop, performance analysis step highlighted: MAQAO, HW counters, proprietary tools.]

Performance data hint at source transformations

Architecture specific or generic hints


Multiplication of kernel versions or loss of versions

Difficult to benchmark versions against each other


BOAST Development Loop

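[Figure: BOAST development loop. The developer writes a generative source code; a transformation step produces the Source Code, which is compiled to a Binary, yielding Performance data for Performance Analysis and Optimization.]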

Meta-programming of optimizations in BOAST

High level object oriented language

[Figure: same loop, with BOAST as the transformation step.]

Generate combination of optimizations

C, OpenCL, FORTRAN and CUDA are supported

[Figure: same loop, with compilers (Gcc, Mercurium, OpenCL) and analysis tools (MAQAO, HW counters, proprietary tools) attached.]

Compilation and analysis are automated

Selection of best version can also be automated
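The automated loop of generating variants, building them, checking them and keeping the fastest can be sketched in a few lines. This is not BOAST's API: the kernel here is a toy summation generated as Python source, `exec` stands in for the compilation step, and the function names are hypothetical. The structure (generate, build, verify against a reference, time, select) is what the slide describes.

```python
import timeit


def generate_sum_kernel(unroll):
    """Return Python source for a summation kernel unrolled `unroll` times."""
    body = " + ".join(f"data[i + {u}]" for u in range(unroll))
    return (
        "def kernel(data):\n"
        "    n = len(data)\n"
        "    total = 0.0\n"
        f"    main = n - n % {unroll}\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        total += {body}\n"
        "    for i in range(main, n):\n"
        "        total += data[i]\n"
        "    return total\n"
    )


def build(source):
    """Turn generated source into a callable (stands in for compilation)."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def select_best(unroll_factors, data, repeat=3):
    """Time every variant and return the fastest unroll factor with all timings."""
    reference = sum(data)
    timings = {}
    for unroll in unroll_factors:
        kernel = build(generate_sum_kernel(unroll))
        # Non-regression check before any timing is trusted.
        assert abs(kernel(data) - reference) < 1e-6
        timings[unroll] = min(
            timeit.repeat(lambda: kernel(data), number=20, repeat=repeat)
        )
    best = min(timings, key=timings.get)
    return best, timings


data = [float(i % 7) for i in range(1003)]
best, timings = select_best([1, 2, 4, 8], data)
print("best unroll factor:", best)
```

Which factor wins depends on the machine and interpreter, which is exactly why the selection is left to measurement rather than fixed in the source.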



BOAST

[Figure: BOAST workflow. An application kernel (SPECFEM3D, BigDFT, ...) is written in the BOAST DSL. BOAST code generation, after selecting the target language and the optimizations, emits a C, Fortran, OpenCL, CUDA, or C with vector intrinsics kernel. A compiler (gcc, OpenCL) with selected options produces a binary kernel. The BOAST runtime runs it on selected input data and collects performance measurements for the selected metrics. A binary analysis tool like MAQAO and an optimization space pruner (ASK, Collective Mind) feed back into the selection, which yields the best performing version.]


Use Case Driven

Parameters arising in a convolution :

Filter : length, values, center.

Direction : forward or inverse convolution.

Boundary conditions : free or periodic.

Unroll factor : arbitrary.

How do these parameters constrain our tool?


Features required

Unroll factor :

Create and manipulate an unknown number of variables,

Create loops with variable steps.

Boundary conditions :

Manage arrays with parametrized size.

Filter and convolution direction :

Transform arrays.

And of course be able to describe convolutions and output them in different languages.
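The first requirement, an unroll factor that creates an unknown number of variables and a loop with a variable step, is easy to see in a small generator. The sketch below emits C text for a loop over `j` unrolled a chosen number of times. It is a hypothetical illustration, not BOAST output: the array names and index expressions are simplified, boundary handling is left out, and the remainder iterations (when `ndat` is not a multiple of the unroll factor) are not emitted.

```python
def unrolled_loop(unroll, filter_len=16):
    """Emit C text for an outer loop over j unrolled `unroll` times."""
    temps = [f"tt{u}" for u in range(unroll)]   # one accumulator per unrolled iteration
    lines = [f"double {', '.join(temps)};"]
    lines.append(f"for (j = 0; j <= ndat - {unroll}; j += {unroll}) {{")
    lines += [f"  {t} = 0.0;" for t in temps]
    lines.append(f"  for (l = 0; l < {filter_len}; l++) {{")
    for u, t in enumerate(temps):
        lines.append(f"    {t} += x[k + l + (j + {u}) * n] * fil[l];")
    lines.append("  }")
    for u, t in enumerate(temps):
        lines.append(f"  y[j + {u} + i * ndat] = {t};")
    lines.append("}")
    return "\n".join(lines)


code = unrolled_loop(3)
print(code)
```

Changing the argument changes both the number of temporaries and the loop step, which is the kind of variation a hand-written kernel makes tedious.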



Proposed Generator

Idea: use a high level language with support for operator overloading to describe the structure of the code, rather than trying to transform a decorated tree.

Define several abstractions:

Variables: type (array, float, integer), size...

Operators: assign, multiply...

Procedures and functions: parameters, variables...

Constructs : for, while...
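The idea of describing code structure through operator overloading can be illustrated with a few small classes. The sketch below is not BOAST (which is embedded in Ruby) and none of its names come from BOAST: `Var`, `Const`, `BinOp`, `Index` and `assign` are hypothetical. Overloaded `+`, `*` and `[]` build an expression tree, and `emit` prints it in Fortran or C syntax. Unlike the real tool, multi-dimensional C accesses are printed as `x[k][j]` rather than being linearized.

```python
class Expr:
    """Base class: arithmetic on expressions builds a tree instead of computing."""

    def __add__(self, other):
        return BinOp("+", self, wrap(other))

    def __mul__(self, other):
        return BinOp("*", self, wrap(other))


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def emit(self, lang):
        return str(self.value)


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def __getitem__(self, index):
        indices = index if isinstance(index, tuple) else (index,)
        return Index(self, [wrap(i) for i in indices])

    def emit(self, lang):
        return self.name


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def emit(self, lang):
        return f"({self.left.emit(lang)} {self.op} {self.right.emit(lang)})"


class Index(Expr):
    def __init__(self, array, indices):
        self.array, self.indices = array, indices

    def emit(self, lang):
        args = [i.emit(lang) for i in self.indices]
        if lang == "fortran":
            return f"{self.array.name}({', '.join(args)})"
        return self.array.name + "".join(f"[{a}]" for a in args)


def wrap(value):
    """Promote plain Python numbers to constants."""
    return value if isinstance(value, Expr) else Const(value)


def assign(target, value, lang):
    statement = f"{target.emit(lang)} = {wrap(value).emit(lang)}"
    return statement + (";" if lang == "c" else "")


tt, x, fil, k, j, l = (Var(name) for name in ("tt", "x", "fil", "k", "j", "l"))
expression = tt + x[k, j + 1] * fil[l]
fortran_line = assign(tt, expression, "fortran")
c_line = assign(tt, expression, "c")
print(fortran_line)
print(c_line)
```

The same Python expression produces both statements, so a kernel only has to be described once to be emitted in several target languages.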



Sample Code : Variables and Parameters

# simple variable
i = Int "i"
# simple constant
lowfil = Int( "lowfil", :const => 1-center )
# simple constant array
fil = Real("fil", :const => arr, :dim => [ Dim(lowfil,upfil) ])
# simple parameter
ndat = Int("ndat", :dir => :in)
# multidimensional array, an output parameter
y = Real("y", :dir => :out, :dim => [ Dim(ndat), Dim(dim_out_min, dim_out_max) ] )

Variables and Parameters are objects with a name, a type, and a set of named properties.


Sample Code : Procedure Declaration

The following declaration:

p = Procedure("magic_filter", [n,ndat,x,y], [lowfil,upfil])
open p

Outputs Fortran:

subroutine magicfilter(n, ndat, x, y)
  integer(kind=4), parameter :: lowfil = -8
  integer(kind=4), parameter :: upfil = 7
  integer(kind=4), intent(in) :: n
  integer(kind=4), intent(in) :: ndat
  real(kind=8), intent(in), dimension(0:n-1, ndat) :: x
  real(kind=8), intent(out), dimension(ndat, 0:n-1) :: y

Or C:

void magicfilter(const int32_t n, const int32_t ndat, const double * x, double * y){
  const int32_t lowfil = -8;
  const int32_t upfil = 7;


Sample Code : Constructs and Arrays

The following declaration:

unroll = 5
pr For(j,1,ndat-(unroll-1), unroll) {
  #.....
  pr tt2 === tt2 + x[k,j+1]*fil[l]
  #.....
}

Outputs Fortran:

do j=1, ndat-4, 5
  !......
  tt2=tt2+x(k,j+1)*fil(l)
  !......
enddo

Or C:

for(j=1; j<=ndat-4; j+=5){
  /* ........... */
  tt2=tt2+x[k-0+(j+1-1)*(n-1-0+1)]*fil[l-lowfil];
  /* ........... */
}
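The index expression in the C output comes from flattening a Fortran-style array, with its own lower bounds and column-major layout, into a 0-based C offset. The sketch below shows that computation under the dimensions declared earlier (`x` as `(0:n-1, ndat)` and `fil` as `(lowfil:upfil)`); the helper name `linear_index` and the concrete sizes are hypothetical.

```python
def linear_index(indices, dims):
    """Flatten column-major indices into a 0-based offset.

    dims holds one (lower, upper) pair of inclusive bounds per dimension.
    """
    offset, stride = 0, 1
    for index, (lower, upper) in zip(indices, dims):
        offset += (index - lower) * stride   # shift by the lower bound
        stride *= upper - lower + 1          # extent of this dimension
    return offset


# Hypothetical sizes, with the bounds used in the sample declarations.
n, ndat, lowfil, upfil = 10, 6, -8, 7
x_dims = [(0, n - 1), (1, ndat)]
fil_dims = [(lowfil, upfil)]

k, j = 3, 2
x_offset = linear_index((k, j + 1), x_dims)
fil_offset = linear_index((lowfil,), fil_dims)
print(x_offset, fil_offset)
```

For `x[k, j+1]` this evaluates the same formula as the generated C, `k-0+(j+1-1)*(n-1-0+1)`, and for `fil[l]` it reduces to `l-lowfil`.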



Generator Evaluation

Back to the test cases :

The generator was used to unroll the MagicFilter and evaluate its performance on an ARM processor and an Intel processor.

The generator was used to describe the SPECFEM3D kernels.


Performance Results

[Figure: performance results on Tegra2 and Intel T7500.]


BigDFT Synthesis Kernel



Improvement for BigDFT

Most of the convolutions have been ported to BOAST.

Results are encouraging: on the hardware BigDFT was hand-optimized for, convolutions gained on average between 30 and 40% in performance.

MagicFilter OpenCL versions tailored for problem size by BOAST gain 10 to 20% in performance.


SPECFEM3D OpenCL port

Fully ported to OpenCL with comparable performance (using the global_s362ani_small test case):

On a 2*6 cores (E5-2630) machine with 2 K40, using 12 MPI processes:

OpenCL: 4m15s
CUDA: 3m10s

On a 2*4 cores (E5620) machine with a K20, using 6 MPI processes:

OpenCL: 12m47s
CUDA: 11m23s

The difference comes from the ability of CUDA to specify the minimum number of blocks to launch on a multiprocessor.

Less than 4000 lines of BOAST code (7500 lines of CUDA originally).
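To put numbers on "comparable", the run times above can be turned into relative overheads. The sketch below only does the arithmetic on the quoted timings (the dictionary keys are descriptive labels); it gives roughly 34% on the K40 machine and 12% on the K20 machine.

```python
def seconds(minutes, secs):
    """Convert a 'XmYs' timing to seconds."""
    return 60 * minutes + secs


# (OpenCL time, CUDA time) in seconds, from the timings quoted above
runs = {
    "2 x E5-2630 + 2 K40, 12 MPI processes": (seconds(4, 15), seconds(3, 10)),
    "2 x E5620 + K20, 6 MPI processes": (seconds(12, 47), seconds(11, 23)),
}

overhead = {name: opencl / cuda - 1.0 for name, (opencl, cuda) in runs.items()}
for name, value in overhead.items():
    print(f"{name}: OpenCL is {value:.0%} slower than CUDA")
```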



Conclusions and Future Work



Conclusions

The generator has been used to test several loop unrolling strategies in BigDFT.

Highlights:

Several output languages.

All constraints have been met.

Automatic benchmarking framework allows us to test several optimization levels and compilers.

Automatic non-regression testing.

Several algorithmically different versions can be generated (changing the filter, boundary conditions...).


Future Work and Considerations

Future work:

Produce an autotuning convolution library.

Implement a parametric space explorer or use an existing one (ASK: Adaptive Sampling Kit, Collective Mind...).

Vector code is supported, but needs improvements.

Test the OpenCL version of SPECFEM3D on the Mont-Blanc prototype.

Questions raised:

Is this approach extensible enough?

Can we improve the language used further?
