High energy particle physics simulations with GRID
Guido Cossu, Higgs Centre
Alan Turing Institute Visiting Scientist
Intel Parallel Computing Centre
Exascale Computing Project
New Horizons of Computational Science with Heterogeneous Many-Core Processors 2018 – RIKEN, Wako, Tokyo, 27-28 February
Current support
• SSE4.2 (128 bit)
• AVX, AVX2 (256 bit) (e.g. Intel Haswell, Broadwell, AMD Ryzen/EPYC)
• AVX512F (512 bit, Intel KNL, Intel Skylake)
• QPX (BlueGene/Q), experimental
• NEON ARMv8 (thanks to Nils Meyer from Regensburg University)
• Generic vector width support
Work in progress for
• CUDA threads (Nvidia GPUs) (GRID team & ECP collaboration)
• ARM SVE (Scalable Vector Extensions – Fujitsu post-K), from 128 up to 2048 bits!
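The vector width can be fixed at compile time from the target ISA; a minimal sketch of the idea (illustrative only, not Grid's actual configuration code):

#if defined(__AVX512F__)
constexpr int simd_bytes = 64;   // AVX512: 512-bit registers
#elif defined(__AVX2__) || defined(__AVX__)
constexpr int simd_bytes = 32;   // AVX/AVX2: 256-bit registers
#elif defined(__SSE4_2__)
constexpr int simd_bytes = 16;   // SSE4.2: 128-bit registers
#else
constexpr int simd_bytes = 8;    // generic fallback width
#endif

// Number of single-precision lanes per vector register
constexpr int nsimd_float = simd_bytes / sizeof(float);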
Exploiting all levels of parallelism: vector units, threading, MPI
Work on optimising communications triggered 3 major updates for
• Intel MPI stack (library and PSM2 driver)
• HPE-SGI Message Passing Toolkit (MPT)
• Mellanox HPC-X
• Actions
  • Gauge: Wilson, Symanzik, Iwasaki, RBC, DBW2, generic Plaquette + Rectangle
  • Fermion: Two Flavours, One Flavour (RHMC), Two Flavours Ratio, One Flavour Ratio, Exact one-flavour. All with the EO variant.
  • Kernels: Wilson, Wilson TM, Wilson Clover + anisotropy, generalised DWF (Shamir, Scaled Shamir, Mobius, Z-Mobius, …)
• Integrators: Leapfrog, 2nd order minimum-norm (Omelyan), force gradient, + implicit versions
• Fermion representations
  • Fundamental, Adjoint, Two-index symmetric, Two-index antisymmetric, and all possible mixings of these. Any number of colours. All fermionic actions are compatible.
• Stout smeared evolution with APE kernel (for SU(3) fields). Any action can be smeared.
• Serialisation: XML, JSON
• Algorithms: GHMC, RMHMC, LAHMC, density of states LLR (not public) easily implemented
• File formats: Binary, NERSC, ILDG, SciDAC (for confs). MPI-IO for efficient parallel IO
• Measurements: Hadrons (2, 3 point functions), several sources, QED, Implicitly Restarted Lanczos, and many more…
• Split grids: 3× speedup + deflation for extreme scalability. FP16 in comms for CG.
Some HMC features inherited from IroIro++
GRID current physics
• RBC-UKQCD
• HMC algorithms improvement
• Kaon decay with G-parity
• QED corrections to Hadron Vacuum Polarization
• Non-Perturbative Renormalization
• Holographic cosmology (FFT accelerated)
• BSM, composite Higgs with mixed representations
• Numerical Stochastic PT (Wilson Fermions), G. Filaci (UoE)
• Axial symmetry at finite temperature, semi-leptonic B-decays (GC with JLQCD)
• Density of states (GC with A. Rago, Plymouth U.)
• …
GRID/Intel paper, and GRID in the tech news!
On the optimization of comms and how to drive the Intel Omni-Path Architecture
arXiv:1711.04883
An HPC tech news site reported GRID benchmarks in a "Battle of the InfiniBands" article (Nov 29), from the HPC Advisory Council slides.
• High-level data parallel code gets 65% of peak on AVX2
• A single data parallelism model targets BOTH SIMD and threads efficiently.
template<typename vtype> using iLorentzColourMatrix =
iVector<iScalar<iMatrix<vtype, Nc> >, Nd > ;
General linear algebra with vector types
QCD types example: vRealF, vRealD, vComplexF, vComplexD

template<class vtype> class iScalar {
  vtype _internal;
};
template<class vtype, int N> class iVector {
  vtype _internal[N];
};
template<class vtype, int N> class iMatrix {
  vtype _internal[N][N];
};
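For illustration, a hypothetical usage of the templates above (assuming Grid-style constants Nc = 3, Nd = 4):

// A gauge-field element: a Lorentz vector of Nc x Nc colour matrices,
// each entry a SIMD vector of single-precision complex numbers
iLorentzColourMatrix<vComplexF> U;
// The same linear algebra works over any vector type, e.g. double precision
iLorentzColourMatrix<vComplexD> Ud;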
Grid parallel library
Stencil support
Single/multi node
L⁴ local volume; 8/16 point stencil
• Multi-RHS and DWF both take Ls = Nrhs, suppressing gauge field overhead
• Cache reuse × Nstencil on the fermion is possible (with a large enough cache)
Per 4d site of result:
• Fermion: Nstencil × (Ns ∈ {1,4}) × (Nc = 3) × (Nrhs ∈ {1,16}) complex
• Gauge: 2Nd × Nc²
∼ 1/(2L) of data references come from off node: scaling the fine operator requires interconnect bandwidth.
Balanced network in the memory-limited case: R is the reuse factor for the stencil in caches (e.g. a small tile on KNL gives R = 2 ≪ 8). One can measure R · B_M from single-node performance.
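As a rough worked example (our arithmetic, assuming the single-precision Wilson case: Nstencil = 8, Ns = 4, Nc = 3, Nrhs = 1): fermion traffic is 8 × 4 × 3 = 96 complex = 768 B per site, gauge traffic is 2 × 4 × 3² = 72 complex = 576 B, i.e. about 1.3 KB of references against roughly 1.3 Kflops of Dslash work per site, so the operator is firmly bandwidth-limited.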
Dirac matrix bandwidth analysis
Architecture                   Cores   Gflop/s (Ls × Dw)   Peak
Intel Knights Landing 7250     68      960                 6100
Intel Knights Corner           60      270                 2400
Intel Skylake Platinum ×2      36      1200                7900
Intel Broadwell ×2             36      800                 2700
Intel Haswell ×2               32      640                 2400
Intel Ivybridge ×2             24      270                 920
Dirac operator performance: single node, single precision
• 10% to 30% of peak performance on a range of modern nodes
• Hand-coded ASM was needed on KNL for best performance
• On Skylake, AVX2/AVX512 intrinsics and AVX512 assembly all perform about the same
• Differences among architectures come mainly from cache sizes
High level code performance
Take careful control of mapping ranks to cartesian coordinates
• Ensure ranks on the same node (e.g. consecutive) are assigned cartesian coordinates in cubes (see the sketch below)
• Maximises fast interior communications with multiple ranks per node
• Perform comms by direct copy into SHM, with no interior MPI calls. ShmBarrier is used to enforce synchronisation and consistency
Do not trust your MPI stack's automated positioning
Cartesian optimal rank mapping
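A minimal sketch of the sub-cube idea (hypothetical helper, not Grid's actual API): consecutive ranks fill a small cartesian cube inside each node, so most stencil neighbours sit in shared memory.

#include <array>

constexpr int Nd = 4;

// Global cartesian coordinate of an MPI rank when consecutive ranks are
// packed into a per-node sub-cube of the process grid.
std::array<int, Nd> coordOfRank(int rank,
                                const std::array<int, Nd>& nodeGrid,  // nodes per dimension
                                const std::array<int, Nd>& shmGrid)   // ranks per node, per dimension
{
  int shmRanks = 1;
  for (int d = 0; d < Nd; ++d) shmRanks *= shmGrid[d];
  int shmRank = rank % shmRanks;  // position inside the node's cube
  int node    = rank / shmRanks;  // which node
  std::array<int, Nd> coord{};
  for (int d = 0; d < Nd; ++d) {
    int sc = shmRank % shmGrid[d]; shmRank /= shmGrid[d];
    int nc = node % nodeGrid[d];   node    /= nodeGrid[d];
    coord[d] = nc * shmGrid[d] + sc;  // node cube offset + intra-node offset
  }
  return coord;
}

int main() {
  // e.g. 16 nodes in a 2x2x2x2 grid, 8 ranks per node in a 2x2x2x1 cube
  std::array<int, Nd> nodes{2, 2, 2, 2}, shm{2, 2, 2, 1};
  auto c = coordOfRank(13, nodes, shm);
  (void)c;
}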
VM page sizes in HPC and OPA
A couple of words on page sizes.
30 years ago, on an 80386:
• 640 KB of memory, with 32 B in registers
• Page size 4 KB, 160 pages. Access time 64 µs
Latest Skylake:
• Max 64 GB of memory, with 128 B + 2048 B in registers
• Page size 4 KB, 16M pages. Access time 16 ns
Such small pages can be problematic on many architectures:
• Overhead from TLB misses, in hardware (small) and in software via page faults and Linux COW (large)
• Small-page fragmentation can be a serious problem in HPC; see the Omni-Path PSM2 driver
Intel Omni-Path can be as much as 10× slower with small pages (next slides). Problem source: kernel page faults; see the PSM2 source code on TID caches.
One example: Intel Omni-Path single rail, peak 25 GB/s bidirectional
• Pattern: 16 = 2⁴ node cartesian 4d halo exchange
• Sequential: one face at a time; single threaded
• Concurrent: all 8 faces concurrently (Isend/Irecv); single threaded
• Threaded: all 8 faces concurrently; threaded communicators (Intel MPI 2019 multi-EP beta)
• Page allocation: either explicit huge pages (mmap) or reliance on Linux Transparent Huge Pages (a minimal mmap sketch follows)
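A minimal sketch (our illustration) of explicitly backing a comms buffer with huge pages via mmap(MAP_HUGETLB) instead of relying on THP. Linux-only; requires huge pages reserved via vm.nr_hugepages, and falls back to normal pages otherwise.

#include <sys/mman.h>
#include <cstddef>

void* allocCommsBuffer(std::size_t bytes) {
  // Try an anonymous mapping backed by explicit huge pages first
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {  // no huge pages available: fall back to 4 KB pages
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return (p == MAP_FAILED) ? nullptr : p;
}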
New PSM2 driver bandwidth (MB/s):

Bytes        Sequential/Huge   Sequential/THP   Concurrent/THP   Concurrent/Huge   Threaded/Huge
49152        1296.8            390.8            1620.9           482.9             756.3
3145728      8636.9            830.8            12901.3          8811.9            21776.1
10616832     9574.6            474.5            12580.4          9642.5            21927.8
16859136     9632.2            547.0            8956.7           9620.5            21880.2
25165824     10137.6           716.0            1433.0           9971.9            22281.0
Erratic performance (not reproducible) is fixed by forced, explicit huge pages.
89% of wire speed is obtained from a single MPI rank per node.
• Under 350 lines of code (harnessing C++11 type inference)
• Loop fusion
LatticeFermion r(Grid), mmp(Grid), x(Grid), p(Grid);
// CG vector linalg
r = r - a * mmp;
x = x + a * p;
// Expression template: avoid full vector operations for each operator
p = p * b + r;
parallel_for(auto s = r.begin(); s != r.end(); s++) {
  // fused, site-local evaluation of the whole right-hand side
}
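To make the fusion mechanism concrete, here is a minimal, self-contained expression-template sketch (our illustration, not Grid's actual classes): assignment walks the lazily-built expression once per site, so p = p*b + r becomes a single loop with no temporaries.

#include <cstddef>
#include <vector>

// Lazy node representing lhs*a + rhs, evaluated element-wise on demand
template <class L, class R>
struct AxpyExpr {
  const L& lhs; double a; const R& rhs;
  double eval(std::size_t s) const { return lhs.eval(s) * a + rhs.eval(s); }
};

struct Lattice {
  std::vector<double> data;
  explicit Lattice(std::size_t n) : data(n) {}
  double eval(std::size_t s) const { return data[s]; }

  // Assignment evaluates the whole expression tree once per site:
  // one fused loop (parallel_for in Grid), no intermediate vectors
  template <class Expr>
  Lattice& operator=(const Expr& e) {
    for (std::size_t s = 0; s < data.size(); ++s) data[s] = e.eval(s);
    return *this;
  }
};

// p*b + r builds an expression node; no lattice data is touched yet
inline AxpyExpr<Lattice, Lattice> axpy(const Lattice& p, double b,
                                       const Lattice& r) {
  return {p, b, r};
}

int main() {
  Lattice p(1024), r(1024);
  p = axpy(p, 0.5, r);  // single pass over the lattice
}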
• Avoid non-standard and vector-specific code
• Target unified memory space architectures (Kepler, Pascal, Volta, …)
• Write only in a high-level language (try to avoid PTX)
• Eliminate the STL from the Lattice class
  • Make it a custom managed pointer with an STL-container-style interface
  • All relevant methods marked with accelerator attributes
Accelerator loops are captured as (device) lambda expressions:

accelerator_for(s, r, {
  // body compiled as a device lambda on GPU, a plain host loop otherwise
});
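A hedged sketch of the dispatch idea (illustrative names, not Grid's actual macro): the same loop body becomes a CUDA kernel launch when compiled with nvcc (--expt-extended-lambda) and a plain host loop otherwise.

#include <cstddef>

#ifdef __CUDACC__
template <class Lambda>
__global__ void lambdaKernel(std::size_t n, Lambda f) {
  std::size_t s = blockIdx.x * blockDim.x + threadIdx.x;
  if (s < n) f(s);  // each thread handles one site
}
#define accelerator_for(s, n, ...)                                   \
  {                                                                  \
    auto lam = [=] __device__ (std::size_t s) __VA_ARGS__;           \
    lambdaKernel<<<((n) + 127) / 128, 128>>>((n), lam);              \
    cudaDeviceSynchronize();                                         \
  }
#else
#define accelerator_for(s, n, ...)                                   \
  for (std::size_t s = 0; s < (n); ++s) __VA_ARGS__
#endif

// Usage (assuming a unified-memory pointer a, e.g. from cudaMallocManaged):
//   accelerator_for(s, n, { a[s] = 2.0 * a[s]; });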
Heterogeneity shows up in many ways:
• Architecture parallelism (memory hierarchy, vector units, threading, NUMA)
• Accelerators (multicore, GPUs, FPGAs, custom ASICs, neuromorphic computing, (?))
• A variety of fabric interconnects
Efficient design of these is hard; ask the big chip vendors. Efficient use is probably even harder.
Software challenges
• Exploit all levels; identify the weak elements (see Amdahl's law)
• Design with performance portability in mind. Let the compiler do the hard work.
GRID experience on several architectures and the GPU port
• Target vector units in the decomposition of regular grids (see Connection Machines)
• Huge pages with libhugetlbfs
• Accelerators: assume a unified memory address space
• Accelerators: abstract loops via lambda capture
Expect the GPU port to be ready in summer 2018 (Summit).