Efficient numerical simulation on multicore processors (MuCoSim), 15.10.2013
Prof. Gerhard Wellein, Dr. G. Hager
HPC Services, Regionales Rechenzentrum Erlangen (RRZE)
Department für Informatik
http://moodle.rrze.uni-erlangen.de/moodle/course/view.php?id=278
Efficient numerical simulation on multi-core processors
We do performance optimization, performance modeling, and parallelization for:
Multi-core CPUs: core, socket, node, and large scale (10,000+ cores)
GPGPUs: single devices and clusters
We collaborate with many users doing numerical simulation:
Prof. Rüde: waLBerla / efficient C++
Prof. Clark (Chemistry)
Physics / engineering: Prof. Schwieger, PD Dr. S. Becker
Medical image reconstruction: Prof. Hornegger
…
We operate the compute resources at FAU
Our group: 5 senior scientists (incl. RRZE) (GW/GH/TZ/MM/JT) 4 PhD students (MW/FS/MK/HS) 2 Master students (JH/JB)
Gordon Moore, Electronics Magazine, April 1965: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year… Certainly over the short term this rate can be expected to continue, if not to increase."
1. Carefully analyze the minimum computational requirements (data volume, floating-point operations) of the algorithm.
2. Carefully analyze the computational requirements of the implementation (data access in cache/main memory, flops, instruction mix, …). Optimize if they do not match the data from step 1.
3. Analyze the available computational resources of the target hardware: cache/memory bandwidth, SIMD capabilities, …
4. Predict runtime / performance based on steps 2 and 3.
5. Measure runtime / performance and compare with the prediction from step 4. Go back to steps 2 / 3 if the numbers differ substantially.
Evaluation of the OpenACC directives on CRAY XE6
OpenACC tries to standardize the use of compiler directives to program accelerator devices such as GPGPUs. It is available, e.g., on recent CRAY supercomputers such as the HERMIT system at HLRS Stuttgart.
Benchmarks: STREAM, Jacobi solver, spMVM
Iterative methods for sparse matrix problems, e.g., Conjugate Gradient: solve A x = b (linear system of equations). Implement the full kernel incl. spMVM on CPU and GPGPU nodes; performance analysis (and modeling) of the full kernel.
Parallel programming:
ghost library developed by Moritz Kreutzer (within a DFG exascale project); OpenMP / OpenCL / CUDA or Coarray Fortran
Target machines: nodes with CPU / GPGPU and/or Xeon Phi
Autovectorization
Application: sparse matrix-vector multiplication and appropriate data formats. Survey and test existing tools.
Stone's Strongly Implicit Procedure (SIP): an old but still frequently used solver in finite volume codes; performs an incomplete LU factorization; solves through iterative LU steps; carries a data dependency.
Establish a benchmark framework using OpenMP from scratch.
Asynchronous MPI communication: using explicit threading ("task mode") to implement explicit overlap between communication and computation in different solvers.
Non-blocking communication calls in principle allow asynchronous communication, but no MPI library fully supports asynchronous progress.
Test cases: MPI-parallel lattice Boltzmann 3D solver on CPUs; MPI-parallel Jacobi 3D solver on GPUs.
Stepanov test: development of a modern test for the optimization capabilities of compilers, including auto-parallelization: a) C++, b) Fortran 95.
Evaluation of optimization strategies for matrix-matrix multiply on modern processors. Set up an automatic framework which generates unrolling and blocking strategies. Evaluate the efficiency of those strategies and the impact of / interaction with …
do i = 1, N
  do j = 1, N
    do k = 1, N
      a(i,j) = a(i,j) + b(i,k) * c(k,j)
    enddo
  enddo
enddo
Potential Topics (IBM CAS collaboration)
Evaluation of short vector sums on modern architectures. Benchmark and evaluate the vector sum on multicore CPUs, GPGPUs, and Xeon Phi. This involves an analysis of the overhead introduced by the necessary reduction and synchronization.
Evaluation of sorting of a float array. Benchmark, evaluate, and/or implement fast sorting on modern multicore and accelerator architectures. Instead of a full sort, this can also be done for the nth-select operation, which is very common in business analytics.