High energy particle physics simulations with GRID
Guido Cossu, Higgs Centre
Alan Turing Institute Visiting Scientist
Intel Parallel Computing Centre
Exascale Computing Project
New Horizons of Computational Science with Heterogeneous Many-Core Processors 2018 – RIKEN, Wako, Tokyo, 27-28 February
Current support
• SSE4.2 (128 bit)
• AVX, AVX2 (256 bit) (e.g. Intel Haswell, Broadwell, AMD Ryzen/EPYC)
• AVX512F (512 bit, Intel KNL, Intel Skylake)
• QPX (BlueGene/Q), experimental
• NEON ARMv8 (thanks to Nils Meyer from Regensburg University)
• Generic vector width support
Work in progress for
• CUDA threads (Nvidia GPUs) (GRID team & ECP collaboration)
• ARM SVE (Scalable Vector Extensions – Fujitsu post-K), from 128 up to 2048 bits!
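The vector width can be fixed at compile time from the target ISA; a minimal sketch of the idea (illustrative only, not Grid's actual configuration code):

#if defined(__AVX512F__)
constexpr int simd_bytes = 64;   // AVX512: 512-bit registers
#elif defined(__AVX2__) || defined(__AVX__)
constexpr int simd_bytes = 32;   // AVX/AVX2: 256-bit registers
#elif defined(__SSE4_2__)
constexpr int simd_bytes = 16;   // SSE4.2: 128-bit registers
#else
constexpr int simd_bytes = 8;    // generic fallback width
#endif

// Number of single-precision lanes per vector register
constexpr int nsimd_float = simd_bytes / sizeof(float);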
Exploiting all levels of parallelism: vector units, threading, MPI
Work on optimising communications triggered 3 major updates for
• Intel MPI stack (library and PSM2 driver)
• HPE-SGI Message Passing Toolkit (MPT)
• Mellanox HPC-X
• Actions
  • Gauge: Wilson, Symanzik, Iwasaki, RBC, DBW2, generic Plaquette + Rectangle
  • Fermion: Two Flavours, One Flavour (RHMC), Two Flavours Ratio, One Flavour Ratio, Exact one-flavour. All with the EO variant.
  • Kernels: Wilson, Wilson TM, Wilson Clover + anisotropy, generalised DWF (Shamir, Scaled Shamir, Mobius, Z-Mobius, …)
• Integrators: Leapfrog, 2nd order minimum-norm (Omelyan), force gradient, + implicit versions
• Fermion representations
  • Fundamental, Adjoint, Two-index symmetric, Two-index antisymmetric, and all possible mixings of these. Any number of colours. All fermionic actions are compatible.
• Stout smeared evolution with APE kernel (for SU(3) fields). Any action can be smeared.
• Serialisation: XML, JSON
• Algorithms: GHMC, RMHMC, LAHMC, density of states LLR (not public) easily implemented
• File formats: Binary, NERSC, ILDG, SciDAC (for confs). MPI-IO for efficient parallel IO
• Measurements: Hadrons (2, 3 point functions), several sources, QED, Implicitly Restarted Lanczos, and many more…
• Split grids: 3× speedup + deflation for extreme scalability. FP16 in comms for CG.
Some HMC features inherited from IroIro++
GRID current physics
• RBC-UKQCD
• HMC algorithms improvement
• Kaon decay with G-parity
• QED corrections to Hadron Vacuum Polarization
• Non-Perturbative Renormalization
• Holographic cosmology (FFT accelerated)
• BSM, composite Higgs with mixed representations
• Numerical Stochastic PT (Wilson Fermions), G. Filaci (UoE)
• Axial symmetry at finite temperature, semi-leptonic B-decays (GC with JLQCD)
• Density of states (GC with A. Rago, Plymouth U.)
• …
GRID/Intel paper, and GRID in the tech news!
On the optimization of comms and how to drive the Intel Omni-Path Architecture
arXiv:1711.04883
An HPC tech news site reported GRID benchmarks in a "Battle of the InfiniBands" article (Nov 29), from the HPC Advisory Council slides.
• High-level data parallel code gets 65% of peak on AVX2
• A single data parallelism model targets BOTH SIMD and threads efficiently.
template<typename vtype> using iLorentzColourMatrix =
iVector<iScalar<iMatrix<vtype, Nc> >, Nd > ;
General linear algebra with vector types
QCD types example: vRealF, vRealD, vComplexF, vComplexD

template<class vtype> class iScalar {
  vtype _internal;
};
template<class vtype, int N> class iVector {
  vtype _internal[N];
};
template<class vtype, int N> class iMatrix {
  vtype _internal[N][N];
};
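For illustration, a hypothetical usage of the templates above (assuming Grid-style constants Nc = 3, Nd = 4):

// A gauge-field element: a Lorentz vector of Nc x Nc colour matrices,
// each entry a SIMD vector of single-precision complex numbers
iLorentzColourMatrix<vComplexF> U;
// The same linear algebra works over any vector type, e.g. double precision
iLorentzColourMatrix<vComplexD> Ud;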
Grid parallel library
Stencil support
Single/multi node
L⁴ local volume; 8/16 point stencil
• Multi-RHS and DWF both take Ls = Nrhs, suppressing gauge field overhead
• Cache reuse × Nstencil on the fermion is possible (with a large enough cache)
Per 4d site of result:
• Fermion: Nstencil × (Ns ∈ {1,4}) × (Nc = 3) × (Nrhs ∈ {1,16}) complex
• Gauge: 2Nd × Nc²
∼ 1/(2L) of data references come from off node: scaling the fine operator requires interconnect bandwidth.
Balanced network in the memory-limited case: R is the reuse factor for the stencil in caches (e.g. a small tile on KNL gives R = 2 ≪ 8). One can measure R · B_M from single-node performance.
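As a rough worked example (our arithmetic, assuming the single-precision Wilson case: Nstencil = 8, Ns = 4, Nc = 3, Nrhs = 1): fermion traffic is 8 × 4 × 3 = 96 complex = 768 B per site, gauge traffic is 2 × 4 × 3² = 72 complex = 576 B, i.e. about 1.3 KB of references against roughly 1.3 Kflops of Dslash work per site, so the operator is firmly bandwidth-limited.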
Dirac matrix bandwidth analysis
Architecture                   Cores   Gflop/s (Ls × Dw)   Peak
Intel Knights Landing 7250     68      960                 6100
Intel Knights Corner           60      270                 2400
Intel Skylake Platinum ×2      36      1200                7900
Intel Broadwell ×2             36      800                 2700
Intel Haswell ×2               32      640                 2400
Intel Ivybridge ×2             24      270                 920
Dirac operator performance: single node, single precision
• 10% to 30% of peak performance on a range of modern nodes
• Hand-coded ASM was needed on KNL for best performance
• On Skylake, AVX2/AVX512 intrinsics and AVX512 assembly all perform about the same
• Differences among architectures come mainly from cache sizes
High level code performance
Take careful control of mapping ranks to cartesian coordinates
• Ensure ranks on the same node (e.g. consecutive) are assigned cartesian coordinates in cubes (see the sketch below)
• Maximises fast interior communications with multiple ranks per node
• Perform comms by direct copy into SHM, with no interior MPI calls. ShmBarrier is used to enforce synchronisation and consistency
Do not trust your MPI stack's automated positioning
Cartesian optimal rank mapping
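A minimal sketch of the sub-cube idea (hypothetical helper, not Grid's actual API): consecutive ranks fill a small cartesian cube inside each node, so most stencil neighbours sit in shared memory.

#include <array>

constexpr int Nd = 4;

// Global cartesian coordinate of an MPI rank when consecutive ranks are
// packed into a per-node sub-cube of the process grid.
std::array<int, Nd> coordOfRank(int rank,
                                const std::array<int, Nd>& nodeGrid,  // nodes per dimension
                                const std::array<int, Nd>& shmGrid)   // ranks per node, per dimension
{
  int shmRanks = 1;
  for (int d = 0; d < Nd; ++d) shmRanks *= shmGrid[d];
  int shmRank = rank % shmRanks;  // position inside the node's cube
  int node    = rank / shmRanks;  // which node
  std::array<int, Nd> coord{};
  for (int d = 0; d < Nd; ++d) {
    int sc = shmRank % shmGrid[d]; shmRank /= shmGrid[d];
    int nc = node % nodeGrid[d];   node    /= nodeGrid[d];
    coord[d] = nc * shmGrid[d] + sc;  // node cube offset + intra-node offset
  }
  return coord;
}

int main() {
  // e.g. 16 nodes in a 2x2x2x2 grid, 8 ranks per node in a 2x2x2x1 cube
  std::array<int, Nd> nodes{2, 2, 2, 2}, shm{2, 2, 2, 1};
  auto c = coordOfRank(13, nodes, shm);
  (void)c;
}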
VM page sizes in HPC and OPA
A couple of words on page sizes.
30 years ago, on an 80386:
• 640 KB of memory, with 32 B in registers
• Page size 4 KB, 160 pages. Access time 64 µs
Latest Skylake:
• Max 64 GB of memory, with 128 B + 2048 B in registers
• Page size 4 KB, 16M pages. Access time 16 ns
Such small pages can be problematic on many architectures:
• Overhead from TLB misses, in hardware (small) and in software via page faults and Linux COW (large)
• Small-page fragmentation can be a serious problem in HPC; see the Omni-Path PSM2 driver
Intel Omni-Path can be as much as 10× slower with small pages (next slides). Problem source: kernel page faults; see the PSM2 source code on TID caches.
One example: Intel Omni-Path single rail, peak 25 GB/s bidirectional
• Pattern: 16 = 2⁴ node cartesian 4d halo exchange
• Sequential: one face at a time; single threaded
• Concurrent: all 8 faces concurrently (Isend/Irecv); single threaded
• Threaded: all 8 faces concurrently; threaded communicators (Intel MPI 2019 multi-EP beta)
• Page allocation: either explicit huge pages (mmap) or reliance on Linux Transparent Huge Pages (a minimal mmap sketch follows)
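A minimal sketch (our illustration) of explicitly backing a comms buffer with huge pages via mmap(MAP_HUGETLB) instead of relying on THP. Linux-only; requires huge pages reserved via vm.nr_hugepages, and falls back to normal pages otherwise.

#include <sys/mman.h>
#include <cstddef>

void* allocCommsBuffer(std::size_t bytes) {
  // Try an anonymous mapping backed by explicit huge pages first
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {  // no huge pages available: fall back to 4 KB pages
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return (p == MAP_FAILED) ? nullptr : p;
}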
New PSM2 driver bandwidth (MB/s):

Bytes        Sequential/Huge   Sequential/THP   Concurrent/THP   Concurrent/Huge   Threaded/Huge
49152        1296.8            390.8            1620.9           482.9             756.3
3145728      8636.9            830.8            12901.3          8811.9            21776.1
10616832     9574.6            474.5            12580.4          9642.5            21927.8
16859136     9632.2            547.0            8956.7           9620.5            21880.2
25165824     10137.6           716.0            1433.0           9971.9            22281.0
Erratic performance (not reproducible) is fixed by forced, explicit huge pages.
89% of wire speed is obtained from a single MPI rank per node.
• Under 350 lines of code (harnessing C++11 type inference)
• Loop fusion
LatticeFermion r(Grid), mmp(Grid), x(Grid), p(Grid);
// CG vector linalg
r = r - a * mmp;
x = x + a * p;
// Expression template: avoid full vector operations for each operator
p = p * b + r;
parallel_for(auto s = r.begin(); s != r.end(); s++) {
  // fused, site-local evaluation of the whole right-hand side
}
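To make the fusion mechanism concrete, here is a minimal, self-contained expression-template sketch (our illustration, not Grid's actual classes): assignment walks the lazily-built expression once per site, so p = p*b + r becomes a single loop with no temporaries.

#include <cstddef>
#include <vector>

// Lazy node representing lhs*a + rhs, evaluated element-wise on demand
template <class L, class R>
struct AxpyExpr {
  const L& lhs; double a; const R& rhs;
  double eval(std::size_t s) const { return lhs.eval(s) * a + rhs.eval(s); }
};

struct Lattice {
  std::vector<double> data;
  explicit Lattice(std::size_t n) : data(n) {}
  double eval(std::size_t s) const { return data[s]; }

  // Assignment evaluates the whole expression tree once per site:
  // one fused loop (parallel_for in Grid), no intermediate vectors
  template <class Expr>
  Lattice& operator=(const Expr& e) {
    for (std::size_t s = 0; s < data.size(); ++s) data[s] = e.eval(s);
    return *this;
  }
};

// p*b + r builds an expression node; no lattice data is touched yet
inline AxpyExpr<Lattice, Lattice> axpy(const Lattice& p, double b,
                                       const Lattice& r) {
  return {p, b, r};
}

int main() {
  Lattice p(1024), r(1024);
  p = axpy(p, 0.5, r);  // single pass over the lattice
}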
• Avoid non-standard and vector-specific code
• Target unified memory space architectures (Kepler, Pascal, Volta, …)
• Write only in a high-level language (try to avoid PTX)
• Eliminate the STL from the Lattice class
  • Make it a custom managed pointer with an STL-container-style interface
  • All relevant methods marked with accelerator attributes
Accelerator loops are captured as (device) lambda expressions:

accelerator_for(s, r, {
  // body compiled as a device lambda on GPU, a plain host loop otherwise
});
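A hedged sketch of the dispatch idea (illustrative names, not Grid's actual macro): the same loop body becomes a CUDA kernel launch when compiled with nvcc (--expt-extended-lambda) and a plain host loop otherwise.

#include <cstddef>

#ifdef __CUDACC__
template <class Lambda>
__global__ void lambdaKernel(std::size_t n, Lambda f) {
  std::size_t s = blockIdx.x * blockDim.x + threadIdx.x;
  if (s < n) f(s);  // each thread handles one site
}
#define accelerator_for(s, n, ...)                                   \
  {                                                                  \
    auto lam = [=] __device__ (std::size_t s) __VA_ARGS__;           \
    lambdaKernel<<<((n) + 127) / 128, 128>>>((n), lam);              \
    cudaDeviceSynchronize();                                         \
  }
#else
#define accelerator_for(s, n, ...)                                   \
  for (std::size_t s = 0; s < (n); ++s) __VA_ARGS__
#endif

// Usage (assuming a unified-memory pointer a, e.g. from cudaMallocManaged):
//   accelerator_for(s, n, { a[s] = 2.0 * a[s]; });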
Heterogeneity shows up in many ways:
• Architecture parallelism (memory hierarchy, vector units, threading, NUMA)
• Accelerators (multicore, GPUs, FPGAs, custom ASICs, neuromorphic computing, (?))
• A variety of fabric interconnects
Efficient design of these is hard; ask the big chip vendors. Efficient use is probably even harder.
Software challenges
• Exploit all levels; identify the weak elements (see Amdahl's law)
• Design with performance portability in mind. Let the compiler do the hard work.
GRID experience on several architectures and the GPU port
• Target vector units in the decomposition of regular grids (see Connection Machines)
• Huge pages with libhugetlbfs
• Accelerators: assume a unified memory address space
• Accelerators: abstract loops via lambda capture
Expect the GPU port to be ready in summer 2018 (Summit).