From Knights ENDGame to Gung Ho beginnings:
Porting old, and developing new, dynamical cores in the age of accelerators
Dr Chris Maynard [email protected]
© Crown copyright Met Office
Introduction
End of Moore's Law¹ – rise of multi-core processors
"The free lunch is over" – Herb Sutter
¹ i.e. processor speed
If you were ploughing a field, which would you rather use: Two strong oxen or 1024 chickens?
- Seymour Cray
Parallelism
Data parallel: domain decomposition, bind MPI task to CPU
Task parallel: processors perform different work, e.g. an I/O server
ILP/SIMD: FMA, SSE/AVX on the CPU; 512-bit wide SIMD on MIC; coalesced memory access (per warp) on the GPU
Hybrid programming: MPI + OpenMP – affinity? Hierarchical memory spaces (NUMA); accelerators are host + device – asynchronous use? (a minimal sketch follows below)
Oversubscribed concurrency: CPU hyper-threading/SMT gives little benefit due to out-of-order execution; MIC interleaves instructions; the GPU hides memory latency
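As an illustration of the hybrid MPI + OpenMP model above, here is a minimal sketch (illustrative names, not UM code); it assumes an MPI library built with thread support:

program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000          ! illustrative local sub-domain size
  integer :: ierr, rank, provided, i
  real(kind=8) :: field(nlocal), local_sum, global_sum

  ! Ask for thread support so OpenMP threads can coexist with MPI
  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Data parallelism within a rank: threads share the loop over the sub-domain
  local_sum = 0.0d0
!$omp parallel do reduction(+:local_sum)
  do i = 1, nlocal
    field(i) = real(rank*nlocal + i, kind=8)
    local_sum = local_sum + field(i)
  end do
!$omp end parallel do

  ! Data parallelism across ranks: MPI combines the per-rank results
  call mpi_allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum = ', global_sum
  call mpi_finalize(ierr)
end program hybrid_sketch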
The Unified Model and ENDGame
Unified Model – NWP and Climate – 25 years old
ENDGame is the 3rd-generation dynamical core: half a million lines of F90 (F77); OpenMP threading is low-level, with incomplete coverage
Lat-long finite difference, semi-implicit semi-Lagrangian
Much improved stability and scalability c.f. New Dynamics
Machines
STFC Daresbury: 1 dual-socket 8-core Sandy Bridge host + KNC; no queues, instant access
Stampede, TACC – 6,400 nodes! Batch system, multi-MIC
Met Office IBM Power7: 2 machines, ~500 32-core nodes each
Cray XC30, UK national academic HPC service – dual-socket 12-core Ivy Bridge
96x72x70 single Phi – UM
A: 1 OMP thread
B: 2 OMP, -O2, -O3 and compile thread=2
C: 2 OMP, -mkl, seg-size 20-120
D: 4 OMP, seg-size
E: 2x Phi, 6x10 each – 2 OMP
6x10 MPI tasks + OpenMP threads – domain decomposition; KMP_AFFINITY="verbose,granularity=thread,compact"
Compiled for native mode -mmic
Interpreting the results
• Threading performance is poor
– Segmenting the radiation routines doesn't help
– Is there enough work per thread?
• MKL really helps the radiation routines – lots of trig functions; the library boosts performance by 30%
• Two KNC cards are faster – even with MPI over PCIe there is enough data parallelism
– OMP at low level in UM – incomplete coverage
• Bigger problem size? – 192x144 runs out of global memory on 1 and 2 cards
– Same local volume size as N48 on 4!
Focus on the solver - threading
Solver performance is dominated by the preconditioner – tri_sor
Red-black checkerboard for threading: N48 – 96 x 72 / 6 x 10 (MPI) = 16 x 7(8) points per task
How should this be split over threads?
Stopping criterion for BiCGStab is 10^-3
This bound can be satisfied in single precision, halving the data transferred from memory (see the sketch below)
Compare N96 (192 x 144 grid):
4 x KNC – 12x20 MPI tasks
2 x Sandy Bridge nodes (4 sockets) – 4x8 MPI tasks
1 Power7 node (4 sockets) – 4x8 MPI tasks
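A hedged sketch of these two ideas together – a red-black relaxation sweep that can be threaded colour by colour, with the working data held in single precision. The stencil, bounds handling and array names are illustrative only, not the actual UM tri_sor routine:

subroutine rb_sor_sweep(nx, ny, nz, omega, diag, rhs, x)
  implicit none
  ! Single precision is enough to satisfy a 1e-3 stopping criterion and
  ! halves the data moved from memory.
  integer, parameter :: sp = kind(1.0e0)
  integer, intent(in) :: nx, ny, nz
  real(sp), intent(in)    :: omega
  real(sp), intent(in)    :: diag(nx,ny,nz), rhs(nx,ny,nz)
  real(sp), intent(inout) :: x(nx,ny,nz)
  integer :: i, j, k, colour

  ! Red points then black points: points of one colour do not neighbour each
  ! other, so each colour can be threaded without races.
  do colour = 0, 1
!$omp parallel do private(i,j,k) collapse(2)
    do k = 1, nz
      do j = 2, ny-1
        do i = 2, nx-1
          if (mod(i+j, 2) /= colour) cycle
          x(i,j,k) = (1.0_sp - omega)*x(i,j,k)             &
                   + omega*( rhs(i,j,k)                    &
                           - x(i-1,j,k) - x(i+1,j,k)       &
                           - x(i,j-1,k) - x(i,j+1,k) ) / diag(i,j,k)
        end do
      end do
    end do
!$omp end parallel do
  end do
end subroutine rb_sor_sweep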
Tri-diagonal SOR preconditioner (strong scaling)
Power7 is faster (no off-node comms) but shows no SMT scaling; MIC is slowest but shows some thread scaling
The single-precision solver is much faster: it reduces memory traffic and communication, on all architectures
Data layout critical
Data layout in the UM is lexicographic (i,j,k), but the loop order in tri_sor is red-black – bad for cache use and SIMD
Change to a linear red and black data layout (sketched below)
The lexicographic (LCG) layout is universal in the UM, including the comms, so changing it means a data transformation. Try it out in a serial, stand-alone solver code – not the full UM
Code length and complexity increase
Matrix-vector routine: LCG ~ 100 lines, RB ~ 700 lines – just for the case of i and j both even!
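A sketch of the kind of transformation implied (illustrative, not UM code): gather the red and black points of a lexicographic (i,j,k) field into contiguous arrays, so that the red-black loops walk memory linearly, at the price of an explicit transform (and its inverse) around the solver and the comms:

subroutine to_red_black(nx, ny, nz, x, xred, xblack)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real(kind=8), intent(in)  :: x(nx, ny, nz)
  ! Packed layouts: red points and black points each stored contiguously
  real(kind=8), intent(out) :: xred((nx*ny+1)/2, nz), xblack(nx*ny/2, nz)
  integer :: i, j, k, nr, nb

  do k = 1, nz
    nr = 0
    nb = 0
    do j = 1, ny
      do i = 1, nx
        if (mod(i+j, 2) == 0) then
          nr = nr + 1
          xred(nr, k) = x(i, j, k)
        else
          nb = nb + 1
          xblack(nb, k) = x(i, j, k)
        end if
      end do
    end do
  end do
end subroutine to_red_black

Reindexing every stencil, and the halo exchanges, for the packed layout is where most of the extra code length and complexity quoted above would come from.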
Architecture comparison
Serial code only – to mimic the memory access of a fully occupied node, run multiple copies per node to stress memory performance. The global problem size per socket is the same on all architectures, so the amount of data transformation per memory system is the same, but the amount of work per core is different
Performance comparison
Summary
• ENDGame solver data layout is lexicographic but the loop order is red-black
• Improve it using single precision and a change of data layout
• This will require lots of coding in the UM
• These changes help on all architectures (least on MIC)
• Vector/SIMD – the compiler reports successful vectorisation, but no performance gain
• Peering at the assembler code suggests all iterations are in the peel loop – vector length is 1?
• Try OpenMP SIMD with ifort 14? (see the sketch below)
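A minimal sketch of the "OpenMP SIMD" idea (illustrative loop, not a UM kernel): the OpenMP 4.0 SIMD directive, supported from ifort 14, lets the programmer assert that the inner loop should be vectorised instead of relying on the auto-vectoriser:

subroutine axpb_simd(n, a, b, x, y)
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in)  :: a(n), b(n), x(n)
  real(kind=8), intent(out) :: y(n)
  integer :: i

  ! Assert that iterations are independent and should be vectorised
!$omp simd
  do i = 1, n
    y(i) = a(i)*x(i) + b(i)
  end do
!$omp end simd
end subroutine axpb_simd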
Stop press …
ifort -align array64byte
16x128x64, 4 threads bound to a core: KNC 2x faster, IB 1.3x faster. Thanks to Tom Henderson for the tip!
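For context, a hedged sketch of what the alignment buys: -align array64byte starts every array on a 64-byte boundary, matching the 512-bit MIC vector registers, and the same assumption can be asserted at a loop with Intel's ASSUME_ALIGNED directive (illustrative code; the directive is Intel-specific):

subroutine aligned_axpy(n, alpha, x, y)
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in)    :: alpha, x(n)
  real(kind=8), intent(inout) :: y(n)
  integer :: i

  ! Tell the compiler the dummy arguments start on 64-byte boundaries,
  ! so it can skip the scalar peel loop and issue aligned vector loads.
!dir$ assume_aligned x:64, y:64
  do i = 1, n
    y(i) = y(i) + alpha*x(i)
  end do
end subroutine aligned_axpy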
First look at Gung Ho
Met Office, Bath, Exeter, Imperial, Leeds, Reading, Warwick and STFC Daresbury
Design and build a new scalable dynamical core
Numerics still under R&D
High-level software architecture design is complete
First implementation developments started this year: Dynamo
Finite Element Method (FEM) – possibly higher order
Unstructured mesh (semi-uniform horizontal, vertically graded)
Element topology (i.e. triangular, quadrilateral, hexagonal) not yet fixed
Design Principle
Scientific programming
Find a numerical solution (and an estimate of its uncertainty) to a (set of) mathematical equations which describe the behaviour of a physical system
Parallel programming and optimisation are the methods by which large problems can be solved faster than real-time.
Roles: Natural and Computational scientists
These are the tasks that are performed, not job descriptions
Layered architecture
Separate layers: parallel code and science code (see the sketch below)
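A hedged sketch of the layered separation (illustrative routine names, not the actual Dynamo/PSy API): the science kernel operates on the data for one cell/column and knows nothing about parallelism, while the parallel layer owns the mesh loop and the OpenMP directives, so the two concerns can evolve independently:

module kernel_mod
contains
  ! Science layer: operates on the data for one cell/column only
  subroutine kernel_code(ndf, field, rhs)
    implicit none
    integer, intent(in) :: ndf
    real(kind=8), intent(inout) :: field(ndf)
    real(kind=8), intent(in)    :: rhs(ndf)
    field = field + rhs
  end subroutine kernel_code
end module kernel_mod

module psy_mod
  use kernel_mod, only: kernel_code
contains
  ! Parallel layer: owns the loop over mesh cells and the threading
  subroutine invoke_kernel(ncells, ndf, field, rhs)
    implicit none
    integer, intent(in) :: ncells, ndf
    real(kind=8), intent(inout) :: field(ndf, ncells)
    real(kind=8), intent(in)    :: rhs(ndf, ncells)
    integer :: cell
!$omp parallel do
    do cell = 1, ncells
      call kernel_code(ndf, field(:, cell), rhs(:, cell))
    end do
!$omp end parallel do
  end subroutine invoke_kernel
end module psy_mod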
Comparing ENDGame and Gung Ho
ENDGame: 32x32x64 domain, RB BiCGStab solver with tri_sor preconditioner, run 25 times
Gung Ho: 32x32x64, Galerkin projection on lowest-order quads for the W1, W2 and W3 spaces; 4 colours, no preconditioner, BiCGStab, run 10 times
Single-socket Ivy Bridge processor – OpenMP scaling w.r.t. a single thread
ENDGame versus Gung Ho on Intel Xeon Phi
ENDGame – problem too small
EG artificial problem size
GH – native mode
Gung Ho Offload
Summary
7 OpenMP loops with colouring for Gung Ho – no code restructuring (see the sketch below)
4 lines of directives to offload the kernel
Gung Ho absolute performance: KNC native is twice as slow as Ivy Bridge; offload is twice as slow as native
Data movement is not optimised (kernels not fused)
No SIMD performance – kernel structure not yet finalised for cache and SIMD performance
Parallel porting only requires changes to the PSy layer; the science code is unchanged
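A hedged sketch of one of the coloured OpenMP loops (illustrative names and data structures, not the Dynamo code): cells of the same colour share no degrees of freedom, so each colour can be threaded without locks or atomics, and the colours are processed one after another:

subroutine invoke_coloured(ncolours, ncells_colour, cmap, ndf, dofmap, cell_incr, field)
  implicit none
  integer, intent(in) :: ncolours, ndf
  ! Cells per colour, and the map from (cell-in-colour, colour) to global cell
  integer, intent(in) :: ncells_colour(:), cmap(:, :)
  ! Map from (dof-in-cell, global cell) to the global dof index
  integer, intent(in) :: dofmap(:, :)
  real(kind=8), intent(in)    :: cell_incr(:, :)
  real(kind=8), intent(inout) :: field(:)
  integer :: colour, c, cell, df

  do colour = 1, ncolours
    ! Cells of one colour touch disjoint sets of dofs: safe to thread
!$omp parallel do private(c, cell, df)
    do c = 1, ncells_colour(colour)
      cell = cmap(c, colour)
      do df = 1, ndf
        field(dofmap(df, cell)) = field(dofmap(df, cell)) + cell_incr(df, cell)
      end do
    end do
!$omp end parallel do
  end do
end subroutine invoke_coloured

The "4 lines of directives" above would be Intel offload (!dir$ offload) directives wrapping a loop like this so that it runs on the coprocessor.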
Conclusions
Demise of Moore’s Law means applications must exploit increased parallelism and hierarchy
UM ported in native mode to Xeon Phi
Initial performance is disappointing
Code changes – single-precision solver, checkerboard data layout – improve performance, but more so on other architectures (they need full implementation in the UM)
Gung Ho portable and scalable (hopefully) by design
Design for ILP/SIMD at the low level is still evolving