From Knights ENDGame to Gung Ho beginnings:
Porting old, and developing new, dynamical cores in the age of accelerators
Dr Chris Maynard [email protected]
© Crown copyright Met Office
Introduction
End of Moore's Law¹ – rise of multi-core processors
"The free lunch is over" – Herb Sutter
¹ i.e. processor speed
If you were ploughing a field, which would you rather use: Two strong oxen or 1024 chickens?
- Seymour Cray
Parallelism
Data parallel: domain decomposition, bind MPI task to CPU
Task parallel: processors perform different work, e.g. an I/O server
ILP/SIMD: FMA, SSE/AVX on the CPU; 512-bit wide SIMD on MIC; coalesced memory access (per warp) on the GPU
Hybrid programming: MPI + OpenMP – affinity? Hierarchical memory spaces (NUMA); accelerators are host + device – asynchronous use? (a minimal sketch follows below)
Oversubscribed concurrency: CPU hyper-threading/SMT gives little benefit due to out-of-order execution; MIC interleaves instructions; the GPU hides memory latency
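As an illustration of the hybrid MPI + OpenMP model above, here is a minimal sketch (illustrative names, not UM code); it assumes an MPI library built with thread support:

program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000          ! illustrative local sub-domain size
  integer :: ierr, rank, provided, i
  real(kind=8) :: field(nlocal), local_sum, global_sum

  ! Ask for thread support so OpenMP threads can coexist with MPI
  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Data parallelism within a rank: threads share the loop over the sub-domain
  local_sum = 0.0d0
!$omp parallel do reduction(+:local_sum)
  do i = 1, nlocal
    field(i) = real(rank*nlocal + i, kind=8)
    local_sum = local_sum + field(i)
  end do
!$omp end parallel do

  ! Data parallelism across ranks: MPI combines the per-rank results
  call mpi_allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum = ', global_sum
  call mpi_finalize(ierr)
end program hybrid_sketch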
The Unified Model and ENDGame
Unified Model – NWP and Climate – 25 years old
ENDGame is the 3rd-generation dynamical core: half a million lines of F90 (F77); OpenMP threading is low-level, with incomplete coverage
Lat-long finite difference, semi-implicit semi-Lagrangian
Much improved stability and scalability c.f. New Dynamics
Machines
STFC Daresbury: 1 dual-socket 8-core Sandy Bridge host + KNC; no queues, instant access
Stampede, TACC – 6,400 nodes! Batch system, multi-MIC
Met Office IBM Power7: 2 machines, ~500 32-core nodes each
Cray XC30, UK national academic HPC service – dual-socket 12-core Ivy Bridge
96x72x70 single Phi – UM
A: 1 OMP thread
B: 2 OMP, -O2, -O3 and compile thread=2
C: 2 OMP, -mkl, seg-size 20-120
D: 4 OMP, seg-size
E: 2x Phi, 6x10 each – 2 OMP
6x10 MPI tasks + OpenMP threads – domain decomposition; KMP_AFFINITY="verbose,granularity=thread,compact"
Compiled for native mode -mmic
Interpreting the results
• Threading performance is poor
– Segmenting the radiation routines doesn't help
– Is there enough work per thread?
• MKL really helps the radiation routines – lots of trig functions; the library boosts performance by 30%
• Two KNC cards are faster – even with MPI over PCIe there is enough data parallelism
– OMP at low level in UM – incomplete coverage
• Bigger problem size? – 192x144 runs out of global memory on 1 and 2 cards
– Same local volume size as N48 on 4!
Focus on the solver - threading
Solver performance is dominated by the preconditioner – tri_sor
Red-black checkerboard for threading: N48 – 96 x 72 / 6 x 10 (MPI) = 16 x 7(8) points per task
How should this be split over threads?
Stopping criterion for BiCGStab is 10^-3
This bound can be satisfied in single precision, halving the data transferred from memory (see the sketch below)
Compare N96 (192 x 144 grid):
4 x KNC – 12x20 MPI tasks
2 x Sandy Bridge nodes (4 sockets) – 4x8 MPI tasks
1 Power7 node (4 sockets) – 4x8 MPI tasks
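A hedged sketch of these two ideas together – a red-black relaxation sweep that can be threaded colour by colour, with the working data held in single precision. The stencil, bounds handling and array names are illustrative only, not the actual UM tri_sor routine:

subroutine rb_sor_sweep(nx, ny, nz, omega, diag, rhs, x)
  implicit none
  ! Single precision is enough to satisfy a 1e-3 stopping criterion and
  ! halves the data moved from memory.
  integer, parameter :: sp = kind(1.0e0)
  integer, intent(in) :: nx, ny, nz
  real(sp), intent(in)    :: omega
  real(sp), intent(in)    :: diag(nx,ny,nz), rhs(nx,ny,nz)
  real(sp), intent(inout) :: x(nx,ny,nz)
  integer :: i, j, k, colour

  ! Red points then black points: points of one colour do not neighbour each
  ! other, so each colour can be threaded without races.
  do colour = 0, 1
!$omp parallel do private(i,j,k) collapse(2)
    do k = 1, nz
      do j = 2, ny-1
        do i = 2, nx-1
          if (mod(i+j, 2) /= colour) cycle
          x(i,j,k) = (1.0_sp - omega)*x(i,j,k)             &
                   + omega*( rhs(i,j,k)                    &
                           - x(i-1,j,k) - x(i+1,j,k)       &
                           - x(i,j-1,k) - x(i,j+1,k) ) / diag(i,j,k)
        end do
      end do
    end do
!$omp end parallel do
  end do
end subroutine rb_sor_sweep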
Tri-diagonal SOR preconditioner (strong scaling)
Power7 is faster (no off-node comms) but shows no SMT scaling; MIC is slowest but shows some thread scaling
The single-precision solver is much faster: it reduces memory traffic and communication, on all architectures
Data layout critical
Data layout in the UM is lexicographic (i,j,k), but the loop order in tri_sor is red-black – bad for cache use and SIMD
Change to a linear red and black data layout (sketched below)
The lexicographic (LCG) layout is universal in the UM, including the comms, so changing it means a data transformation. Try it out in a serial, stand-alone solver code – not the full UM
Code length and complexity increase
Matrix-vector routine: LCG ~ 100 lines, RB ~ 700 lines – just for the case of i and j both even!
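A sketch of the kind of transformation implied (illustrative, not UM code): gather the red and black points of a lexicographic (i,j,k) field into contiguous arrays, so that the red-black loops walk memory linearly, at the price of an explicit transform (and its inverse) around the solver and the comms:

subroutine to_red_black(nx, ny, nz, x, xred, xblack)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real(kind=8), intent(in)  :: x(nx, ny, nz)
  ! Packed layouts: red points and black points each stored contiguously
  real(kind=8), intent(out) :: xred((nx*ny+1)/2, nz), xblack(nx*ny/2, nz)
  integer :: i, j, k, nr, nb

  do k = 1, nz
    nr = 0
    nb = 0
    do j = 1, ny
      do i = 1, nx
        if (mod(i+j, 2) == 0) then
          nr = nr + 1
          xred(nr, k) = x(i, j, k)
        else
          nb = nb + 1
          xblack(nb, k) = x(i, j, k)
        end if
      end do
    end do
  end do
end subroutine to_red_black

Reindexing every stencil, and the halo exchanges, for the packed layout is where most of the extra code length and complexity quoted above would come from.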
Architecture comparison
Serial code only – to mimic the memory access of a fully occupied node, run multiple copies per node to stress memory performance. The global problem size per socket is the same on all architectures, so the amount of data transformation per memory system is the same, but the amount of work per core is different
Performance comparison
Summary
• ENDGame solver data layout is lexicographic but the loop order is red-black
• Improve it using single precision and a change of data layout
• This will require lots of coding in the UM
• These changes help on all architectures (least on MIC)
• Vector/SIMD – the compiler reports successful vectorisation, but no performance gain
• Peering at the assembler code suggests all iterations are in the peel loop – vector length is 1?
• Try OpenMP SIMD with ifort 14? (see the sketch below)
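A minimal sketch of the "OpenMP SIMD" idea (illustrative loop, not a UM kernel): the OpenMP 4.0 SIMD directive, supported from ifort 14, lets the programmer assert that the inner loop should be vectorised instead of relying on the auto-vectoriser:

subroutine axpb_simd(n, a, b, x, y)
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in)  :: a(n), b(n), x(n)
  real(kind=8), intent(out) :: y(n)
  integer :: i

  ! Assert that iterations are independent and should be vectorised
!$omp simd
  do i = 1, n
    y(i) = a(i)*x(i) + b(i)
  end do
!$omp end simd
end subroutine axpb_simd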
Stop press …
ifort -align array64byte
16x128x64, 4 threads bound to a core: KNC 2x faster, IB 1.3x faster. Thanks to Tom Henderson for the tip!
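For context, a hedged sketch of what the alignment buys: -align array64byte starts every array on a 64-byte boundary, matching the 512-bit MIC vector registers, and the same assumption can be asserted at a loop with Intel's ASSUME_ALIGNED directive (illustrative code; the directive is Intel-specific):

subroutine aligned_axpy(n, alpha, x, y)
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in)    :: alpha, x(n)
  real(kind=8), intent(inout) :: y(n)
  integer :: i

  ! Tell the compiler the dummy arguments start on 64-byte boundaries,
  ! so it can skip the scalar peel loop and issue aligned vector loads.
!dir$ assume_aligned x:64, y:64
  do i = 1, n
    y(i) = y(i) + alpha*x(i)
  end do
end subroutine aligned_axpy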
First look at Gung Ho
Met Office, Bath, Exeter, Imperial, Leeds, Reading, Warwick and STFC Daresbury
Design and build a new scalable dynamical core
Numerics still under R&D
High-level software architecture design is complete
First implementation developments started this year: Dynamo
Finite Element Method (FEM) – possibly higher order
Unstructured mesh (semi-uniform horizontal, vertically graded)
Element topology (i.e. triangular, quadrilateral, hexagonal) not yet fixed
Design Principle
Scientific programming
Find a numerical solution (and an estimate of its uncertainty) to a (set of) mathematical equations which describe the behaviour of a physical system
Parallel programming and optimisation are the methods by which large problems can be solved faster than real-time.
Roles: Natural and Computational scientists
These are the tasks that are performed, not job descriptions
Layered architecture
Separate layers: parallel code and science code (see the sketch below)
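A hedged sketch of the layered separation (illustrative routine names, not the actual Dynamo/PSy API): the science kernel operates on the data for one cell/column and knows nothing about parallelism, while the parallel layer owns the mesh loop and the OpenMP directives, so the two concerns can evolve independently:

module kernel_mod
contains
  ! Science layer: operates on the data for one cell/column only
  subroutine kernel_code(ndf, field, rhs)
    implicit none
    integer, intent(in) :: ndf
    real(kind=8), intent(inout) :: field(ndf)
    real(kind=8), intent(in)    :: rhs(ndf)
    field = field + rhs
  end subroutine kernel_code
end module kernel_mod

module psy_mod
  use kernel_mod, only: kernel_code
contains
  ! Parallel layer: owns the loop over mesh cells and the threading
  subroutine invoke_kernel(ncells, ndf, field, rhs)
    implicit none
    integer, intent(in) :: ncells, ndf
    real(kind=8), intent(inout) :: field(ndf, ncells)
    real(kind=8), intent(in)    :: rhs(ndf, ncells)
    integer :: cell
!$omp parallel do
    do cell = 1, ncells
      call kernel_code(ndf, field(:, cell), rhs(:, cell))
    end do
!$omp end parallel do
  end subroutine invoke_kernel
end module psy_mod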
Comparing ENDGame and Gung Ho
ENDGame: 32x32x64 domain, RB BiCGStab solver with tri_sor preconditioner, run 25 times
Gung Ho: 32x32x64, Galerkin projection on lowest-order quads for the W1, W2 and W3 spaces; 4 colours, no preconditioner, BiCGStab, run 10 times
Single-socket Ivy Bridge processor – OpenMP scaling w.r.t. a single thread
ENDGame versus Gung Ho on Intel Xeon Phi
ENDGame – problem too small
EG artificial problem size
GH – native mode
Gung Ho Offload
Summary
7 OpenMP loops with colouring for Gung Ho – no code restructuring (see the sketch below)
4 lines of directives to offload the kernel
Gung Ho absolute performance: KNC native is twice as slow as Ivy Bridge; offload is twice as slow as native
Data movement is not optimised (kernels not fused)
No SIMD performance – kernel structure not yet finalised for cache and SIMD performance
Parallel porting only requires changes to the PSy layer; the science code is unchanged
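A hedged sketch of one of the coloured OpenMP loops (illustrative names and data structures, not the Dynamo code): cells of the same colour share no degrees of freedom, so each colour can be threaded without locks or atomics, and the colours are processed one after another:

subroutine invoke_coloured(ncolours, ncells_colour, cmap, ndf, dofmap, cell_incr, field)
  implicit none
  integer, intent(in) :: ncolours, ndf
  ! Cells per colour, and the map from (cell-in-colour, colour) to global cell
  integer, intent(in) :: ncells_colour(:), cmap(:, :)
  ! Map from (dof-in-cell, global cell) to the global dof index
  integer, intent(in) :: dofmap(:, :)
  real(kind=8), intent(in)    :: cell_incr(:, :)
  real(kind=8), intent(inout) :: field(:)
  integer :: colour, c, cell, df

  do colour = 1, ncolours
    ! Cells of one colour touch disjoint sets of dofs: safe to thread
!$omp parallel do private(c, cell, df)
    do c = 1, ncells_colour(colour)
      cell = cmap(c, colour)
      do df = 1, ndf
        field(dofmap(df, cell)) = field(dofmap(df, cell)) + cell_incr(df, cell)
      end do
    end do
!$omp end parallel do
  end do
end subroutine invoke_coloured

The "4 lines of directives" above would be Intel offload (!dir$ offload) directives wrapping a loop like this so that it runs on the coprocessor.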
Conclusions
Demise of Moore’s Law means applications must exploit increased parallelism and hierarchy
UM ported in native mode to Xeon Phi
Initial performance is disappointing
Code changes – single-precision solver, checkerboard data layout – improve performance, but more so on other architectures (they need full implementation in the UM)
Gung Ho portable and scalable (hopefully) by design
Design for ILP/SIMD at the low level is still evolving