Transcript
Page 1: Picador

Evgeny Efimenko, Sergey Bastrakov, Arkady Gonoskov, Iosif Meyerov, Michail Shiryaev, Igor Surmin

ICNSP 2013

Particle-in-Cell Plasma Simulation on the Intel Xeon Phi in PICADOR

Page 2: Picador

Agenda

Introduction to Xeon Phi

PICADOR Overview

Optimization Techniques

Performance Analysis

Scaling Efficiency


Page 3: Picador

HPC Hardware Environment

CPU: peak 100-200 GFLOPS; 4-16 x86 cores; SIMD, branch prediction, out-of-order execution; programming with C++, Fortran, MPI, OpenMP, TBB.

GPU: peak 1-2 TFLOPS; 500-1000 scalar, simple, low-frequency cores; programming with CUDA C, CUDA Fortran, OpenCL, OpenACC.

Xeon Phi: peak ~1 TFLOPS; 61 x86 cores with wide SIMD, simple, low-frequency; programming with C++, Fortran, MPI, OpenMP, TBB, OpenCL.

Page 4: Picador

Intel Xeon Phi Overview

Product family of the Intel Many Integrated Core Architecture (MIC), launched in 2012.

61 cores on shared memory with Linux on board.

Pentium-like x86 cores with 512-bit vector units.

~1 TFLOPS peak in DP, ~2 TFLOPS in SP.

8 to 16 GB GDDR5 memory, 240 to 350 GB/sec bandwidth.

Standard programming tools: C++, Fortran, MPI, OpenMP.

Image source: http://software.intel.com/sites/default/files/xeonPhiHardware.jpg

Page 5: Picador


Xeon Phi Architecture Overview

Image source: http://www.extremetech.com/wp-content/uploads/2012/07/Intel-Mic-640x353.jpg

Page 6: Picador

Levels of Parallelism on Xeon Phi

61 cores, 4 hardware threads per core.
– Launch one or several MPI processes with several threads per process (e.g. OpenMP).
– Run at least 2 threads per core for efficiency.

512-bit vector units: capable of 8 simultaneous arithmetic operations in DP, 16 in SP.
– Use SIMD instructions to benefit from the vector units.
– The compiler may generate vector instructions to execute several loop iterations in parallel.
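
To make the two levels concrete, here is a minimal sketch (not PICADOR code; array and variable names are invented) of an OpenMP-threaded outer loop over cells with a simple inner loop the compiler can map onto the vector units:

    #include <vector>
    #include <cstddef>

    // Minimal sketch: thread-level parallelism over cells (OpenMP) plus
    // data-level parallelism inside each cell that the compiler can map
    // onto the 512-bit vector units (8 doubles per instruction in DP).
    void pushAllCells(std::vector<double>& px, const std::vector<double>& fx,
                      const std::vector<std::size_t>& cellStart, double dt)
    {
        const std::ptrdiff_t numCells = cellStart.size() - 1;
        #pragma omp parallel for          // level 1: 120-240 threads over cells
        for (std::ptrdiff_t c = 0; c < numCells; ++c) {
            // level 2: simple, dependence-free inner loop over the cell's particles
            for (std::size_t i = cellStart[c]; i < cellStart[c + 1]; ++i)
                px[i] += fx[i] * dt;      // update of one momentum component
        }
    }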

Page 7: Picador

512-bit Vector Units

FMA(a, b, c) = a × b + c with one round-off.

For peak performance do vector FMA in all cores. For the Intel Xeon Phi 7110X:

1064.8 GFLOPS in DP = 61 cores × 1.091 GHz × 1 FMA per cycle × 8-wide vector FMA × 2 FLOP per FMA.

[Diagram: one 512-bit vector operation adds eight packed DP elements at once: z7..z0 = x7..x0 + y7..y0.]
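
As an illustration not taken from the slides, the same idea written with 512-bit intrinsics (AVX-512-style names from immintrin.h; the first-generation Xeon Phi exposes an analogous intrinsic set, and the array names are invented):

    #include <immintrin.h>

    // Illustrative only: z = x * y + z over 8 packed doubles per iteration,
    // i.e. one vector FMA (16 FLOP) per instruction.
    // Assumes x, y, z are 64-byte aligned.
    void fusedMultiplyAdd(double* z, const double* x, const double* y, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m512d vx = _mm512_load_pd(x + i);
            __m512d vy = _mm512_load_pd(y + i);
            __m512d vz = _mm512_load_pd(z + i);
            _mm512_store_pd(z + i, _mm512_fmadd_pd(vx, vy, vz));
        }
        for (int i = n - n % 8; i < n; ++i)   // scalar remainder
            z[i] = x[i] * y[i] + z[i];
    }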

Page 8: Picador

How To Benefit from Xeon Phi?

Used levels of parallelism     Peak, GFLOPS   Equivalent peak
Multi-threaded + vector        1064           5 CPUs, GPU
Multi-threaded + scalar        133            CPU
Single-threaded + vector       17             CPU core
Single-threaded + scalar       2              1/10 CPU core

Essential: good scaling on shared memory up to at least 120 threads.

Strongly recommended: vectorization.

Page 9: Picador

A Note on Parallelism

Overall peak of Xeon Phi is ~5x the peak of a CPU, but single-threaded scalar performance is ~0.1x of a CPU core.

A fast sequential operation on the CPU can become a bottleneck on Xeon Phi: e.g. PML, boundary pulse generator, particle migration between cells, periodic boundary conditions (on a single node).

Need to make everything parallel on Xeon Phi, or use offload mode and perform sequential operations on the host CPU. Similar to GPU usage.
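
A hedged back-of-the-envelope way to see this (my illustration, not on the slide): with a sequential fraction $s$ of the work and Xeon Phi threads running at roughly $0.1\times$ a CPU core, the time relative to one CPU core is roughly

$$ T_{\text{Phi}} \approx \frac{s}{0.1}\,T_{\text{core}} + \frac{1-s}{0.1 \cdot N_{\text{threads}}}\,T_{\text{core}}, $$

so even $s = 1\%$ of sequential work already costs about $0.1\,T_{\text{core}}$ and dominates the parallel part at $N_{\text{threads}} = 240$.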

Page 10: Picador

Programming for Xeon Phi

Need the Intel C++ / Fortran compiler.

Native mode: use as a 61-core CPU under Linux: C++ / Fortran + OpenMP / TBB / Cilk Plus.

Symmetric mode: use as a node of a cluster; can host several MPI processes: C++ / Fortran + MPI + OpenMP / TBB / Cilk Plus.

Offload mode: use as a coprocessor; run kernels launched from the host CPU, similar to GPGPU: C++ / Fortran + offload compiler directives + OpenMP / TBB / Cilk Plus.
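
For offload mode, a minimal sketch of the kind of code the slide refers to, using the Intel compiler's legacy offload directives (the function, array name, and sizes are invented for the example):

    // Illustrative offload-mode sketch for the Intel compiler.
    // The host ships the array to the coprocessor, runs an OpenMP kernel
    // there, and copies the result back.
    void scaleOnPhi(double* data, int n, double factor)
    {
        #pragma offload target(mic) inout(data : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                data[i] *= factor;
        }
    }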

Page 11: Picador

Optimization for Xeon Phi

Optimization recommendations for CPUs also apply to Xeon Phi: multi-threading, vectorization, cache-friendliness, etc.

But there are nuances on Xeon Phi vs. CPU that make it more challenging:
– Many more threads: 120 to 240 vs. 4 to 16.
– Wider vector units: 512-bit vs. 128- to 256-bit.

Page 12: Picador

Example

Consider a multi-threaded PIC code with 99% scaling efficiency on 8 cores, achieving 30% of CPU peak.

In the ideal case, 30% of Xeon Phi peak would be ~320 GFLOPS.

In reality:
– only 25% of the arithmetic is vectorized (not all loops, not the full 512-bit width): −240 GFLOPS;
– 70% scaling efficiency on 61 cores: −25 GFLOPS.

Overall: ~5% of peak on Xeon Phi, 55 GFLOPS.

Reality vs. ideal case = 1/6.
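
One consistent reading of these numbers (my arithmetic, not spelled out on the slide), treating the figures above as losses subtracted from the ideal estimate:

$$ 0.30 \times 1064 \approx 320, \qquad 320 - 240 = 80, \qquad 0.7 \times 80 \approx 55 \ \text{GFLOPS} \approx 5\%\ \text{of peak} \approx \tfrac{1}{6} \times 320. $$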

Page 13: Picador

Particle-in-Cell

Lorentz force computation (1st order field interpolation):

$$\mathbf{F}_i = q_i \left( \mathbf{E}(\mathbf{r}_i) + \frac{1}{c}\,\mathbf{v}_i \times \mathbf{B}(\mathbf{r}_i) \right)$$

Particle push (Boris):

$$\frac{\partial \mathbf{p}_i}{\partial t} = \mathbf{F}_i, \qquad \frac{\partial \mathbf{r}_i}{\partial t} = \mathbf{v}_i, \qquad \mathbf{v}_i = \frac{\mathbf{p}_i}{m_i} \left( 1 + \left( \frac{\mathbf{p}_i}{m_i c} \right)^2 \right)^{-1/2}$$

Field solver (FDTD, NDF):

$$\nabla \times \mathbf{B} = \frac{4\pi}{c}\,\mathbf{j} + \frac{1}{c}\,\frac{\partial \mathbf{E}}{\partial t}, \qquad \nabla \times \mathbf{E} = -\frac{1}{c}\,\frac{\partial \mathbf{B}}{\partial t}$$

Current deposition (1st order):

$$\mathbf{j}(\mathbf{r}) = \sum_i q_i \mathbf{v}_i\, \delta(\mathbf{r} - \mathbf{r}_i)$$

The PIC loop couples these stages through the particle data $(\mathbf{r}_i, \mathbf{v}_i)$, the fields $\mathbf{E}, \mathbf{B}$, and the current density $\mathbf{j}$.
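
To make the particle push concrete, here is a minimal, self-contained sketch of a relativistic Boris push in Gaussian units, matching the 1/c factors above. It is an illustration only, not PICADOR's implementation; the Vec3 helper and all names are invented for the example.

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
    static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
    static Vec3 cross(Vec3 a, Vec3 b) {
        return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
    }
    static double dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // One Boris step for momentum p and position r of a particle with charge q,
    // mass m, in interpolated fields E, B (Gaussian units), time step dt.
    void borisPush(Vec3& p, Vec3& r, Vec3 E, Vec3 B, double q, double m,
                   double c, double dt)
    {
        const Vec3 eImpulse = scale(E, q * dt / 2);            // half electric kick
        Vec3 pMinus = add(p, eImpulse);
        const double gamma = std::sqrt(1 + dot(pMinus, pMinus) / (m * m * c * c));
        const Vec3 t = scale(B, q * dt / (2 * m * c * gamma)); // magnetic rotation
        const Vec3 s = scale(t, 2 / (1 + dot(t, t)));
        const Vec3 pPrime = add(pMinus, cross(pMinus, t));
        const Vec3 pPlus = add(pMinus, cross(pPrime, s));
        p = add(pPlus, eImpulse);                              // second half kick
        const double gammaNew = std::sqrt(1 + dot(p, p) / (m * m * c * c));
        const Vec3 v = scale(p, 1 / (gammaNew * m));           // v = p / (gamma m)
        r = add(r, scale(v, dt));                              // position update
    }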

Page 14: Picador

PICADOR Overview

PICADOR is a tool for large-scale 3D Particle-in-Cell plasma simulation.

Under development since 2010.

Aimed at heterogeneous cluster systems with CPUs, GPUs, and Xeon Phi accelerators.
– C++, MPI, OpenMP, CUDA.

Design idea:
– Flexible and extendable high-level architecture.
– Heavily tuned low-level performance-critical routines.

* S. Bastrakov, R. Donchenko, A. Gonoskov, E. Efimenko, A. Malyshev, I. Meyerov, I. Surmin. Particle-in-cell plasma simulation on heterogeneous cluster systems. Journal of Computational Science, 3 (2012), 474-479.

Page 15: Picador

PICADOR Time Distribution

MPI exchanges and single-node performance can be optimized independently, so we consider only the single-node case here.

[Chart: PICADOR time distribution on CPU, GPU, and Xeon Phi.]

Page 16: Picador

Baseline Implementation

Global array for field and current density values.

Particles in each cell are stored in a separate array and processed by cells in parallel (see the layout sketch below).

Particle push pseudo-code:

    #pragma omp parallel for
    for each cell
        for each particle in the cell
            interpolate field (1st order)
            push (Boris)
            check if particle leaves cell
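
A minimal sketch of the data layout this pseudo-code implies, under my reading of the slide (the type and member names are invented): global field and current arrays plus a separate particle array per cell.

    #include <vector>

    // Illustrative data layout only; PICADOR's actual classes differ.
    struct Particle {
        double x, y, z;     // position
        double px, py, pz;  // momentum
    };

    struct Cell {
        std::vector<Particle> particles;   // particles currently in this cell
    };

    struct Grid {
        int nx, ny, nz;
        std::vector<double> ex, ey, ez;    // global field arrays (E)
        std::vector<double> bx, by, bz;    // global field arrays (B)
        std::vector<double> jx, jy, jz;    // global current density arrays
        std::vector<Cell> cells;           // one particle array per cell
    };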

Page 17: Picador

"No Effort" Port To Xeon Phi

The code already uses OpenMP, so we can use native mode on Xeon Phi with no modifications.

Rebuild with the Intel C++ Compiler, -mmic option.
– Some libraries are not tested on Xeon Phi and may require effort (even HDF5 needed a hotfix).

Time to port PICADOR to Xeon Phi: 1 day.

Performance on Xeon Phi is close to 16 CPU cores.

Time to develop our GPU-based implementation with similar performance: ~2 months.
– Need to maintain C++ and CUDA versions of all computationally intensive functions.

Page 18: Picador

Benchmark

Measure performance on a benchmark: 40x40x40 grid, 50 particles per cell, DP.

Hardware:
– 2x 4-core Intel Xeon 5150, peak 68 GFLOPS *.
– 2x 8-core Intel Xeon E2690, peak 371 GFLOPS **.
– 61-core Intel Xeon Phi 7110X, peak 1076 GFLOPS **.
* UNN, ** MVS-10P at JSC RAS

Measure two metrics:
– Time per iteration per particle (ns per particle update).
– % of peak performance.

Page 19: Picador

Baseline Performance

[Bar chart: time per particle update, ns: 853, 108, 457, 59, 62; % of peak: 8%, 7.9%, 5.5%, 2.6%, 0.9%.]

Page 20: Picador

Employing Locality

Particle-grid interactions are spatially local. When a cell is processed, preload field values / accumulate current locally (see the sketch below).

    #pragma omp parallel for
    for each cell
        preload 27 local field values
        for each particle in the cell
            locally interpolate field
            push (Boris)
            check if particle leaves cell
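
A rough sketch of what "preload 27 local field values" could look like (my illustration, not PICADOR code): copy the 3x3x3 neighborhood of the cell into a small local buffer so every particle in the cell reads cache-resident data.

    // Illustrative only: gather the 3x3x3 neighborhood of field values for
    // cell (ci, cj, ck) into a 27-element local buffer before the particle loop.
    // 'field' is a hypothetical flattened nx*ny*nz array; assumes an interior cell.
    void preloadLocalField(const double* field, int nx, int ny, int nz,
                           int ci, int cj, int ck, double local[27])
    {
        int idx = 0;
        for (int dk = -1; dk <= 1; ++dk)
            for (int dj = -1; dj <= 1; ++dj)
                for (int di = -1; di <= 1; ++di) {
                    const int i = ci + di, j = cj + dj, k = ck + dk;
                    local[idx++] = field[(k * ny + j) * nx + i];
                }
    }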

Page 21: Picador

Impact of Employing Locality

[Bar chart: time per particle update, ns: 347, 45, 172, 15, 17; % of peak: 19.7%, 19.1%, 14.6%, 10.7%, 3.1%; speedup over baseline: 2.5x, 2.4x, 2.7x, 4x, 3.6x.]

Page 22: Picador

Employing Loop Vectorization

Loop vectorization: simultaneous execution of several iterations on the vector units via SIMD instructions.

The compiler may perform vectorization of the innermost loop.

Main requirements (for most cases, illustrated below):
– No data dependencies between iterations.
– All internal function calls are inlined.
– No internal conditional statements.

[Diagram: loop iterations 0-5 grouped into vector operations.]
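
As a hedged illustration of these requirements (not from the slides; function names are invented), the first loop below violates them, while the second satisfies all three:

    // Violates the requirements: iteration i depends on the result of
    // iteration i-1, so the iterations cannot run in SIMD lanes independently.
    void prefixSum(double* a, int n)
    {
        for (int i = 1; i < n; ++i)
            a[i] = a[i - 1] + a[i];
    }

    // Satisfies them: independent iterations, no calls, no branches,
    // so the compiler can map it onto the 512-bit vector units.
    void axpy(double* y, const double* x, double alpha, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }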

Page 23: Picador

Employing Loop Vectorization

Ensure inlining of the interpolation and push functions.

Subdivide the particle loop into three loops, for interpolation, push, and check, to avoid conditionals and data dependencies between iterations.

Use compiler directives to guarantee that pointers do not alias; otherwise the compiler suspects a dependence.

Employ loop tiling for interpolation and push to save the interpolated fields.

After these modifications the push loop is vectorized; the interpolation loop can be vectorized but turned out to be inefficient.

Page 24: Picador

Employing Vectorization

    #pragma omp parallel for
    for each cell
        preload 27 local field values
        for each particle tile
            #pragma novector
            for each particle in the tile
                locally interpolate field
            #pragma ivdep
            for each particle in the tile
                push (Boris)
        for each particle in the cell
            check if particle leaves cell
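
A hedged C++ sketch of this structure for a single cell (my illustration; the particle layout, tile size, and field handling are invented, and the pragmas are Intel-compiler specific):

    #include <algorithm>

    // Illustrative tile-based loop fission for one cell: a scalar interpolation
    // loop followed by a clean, dependence-free push loop over the same tile.
    struct ParticleT { double x, y, z, px, py, pz; };

    void pushCellTiled(ParticleT* __restrict__ particles, int numParticles,
                       const double* __restrict__ localField, double qdt)
    {
        const int TILE = 64;                    // tile size: an assumption
        double fx[TILE], fy[TILE], fz[TILE];    // interpolated force per particle
        for (int start = 0; start < numParticles; start += TILE) {
            const int len = std::min(TILE, numParticles - start);
            #pragma novector                    // interpolation loop: kept scalar
            for (int i = 0; i < len; ++i) {
                // stand-in for 1st-order interpolation from the 27 local values
                fx[i] = localField[0];
                fy[i] = localField[1];
                fz[i] = localField[2];
            }
            #pragma ivdep                       // push loop: no aliasing, vectorized
            for (int i = 0; i < len; ++i) {
                particles[start + i].px += qdt * fx[i];
                particles[start + i].py += qdt * fy[i];
                particles[start + i].pz += qdt * fz[i];
            }
        }
        // the "check if particle leaves cell" loop runs separately over the whole cell
    }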

Page 25: Picador

Impact of Employing Vectorization

[Bar chart: time per particle update, ns: 199, 25, 95, 8, 12; % of peak: 34.3%, 34.1%, 26.4%, 20.9%, 4.5%; speedup over the previous version: 1.7x, 1.8x, 1.8x, 2x, 1.4x.]

Page 26: Picador

More Threads on Xeon Phi

1 thread per core is a priori inefficient because of the hardware: every clock an instruction is issued for a different thread.

[Chart: performance for increasing numbers of threads per core; labeled values: 20.3, 13.2, 11.2, 10.6.]

Page 27: Picador

Scaling on CPUs

2x 8-core CPUs on shared memory, 1 MPI process.

Hyper-threading significantly improves scaling.

Nearly ideal scaling to 8 cores (1 CPU) with HT; much worse scaling from 1 CPU to 2 CPUs.

[Chart: parallel scaling; labeled efficiencies: 83% and 73%.]

Page 28: Picador

Scaling on CPUs

For several CPUs on shared memory the best option is to launch one MPI process per CPU.

With this option, scaling from 1 to 16 cores is nearly ideal.

[Chart: parallel scaling; labeled efficiencies: 99% (2 MPI processes), 83% and 73% (1 MPI process).]

Page 29: Picador

Scaling on Xeon Phi

Keep 4 threads per core. Use 1 MPI process.

[Chart: scaling on Xeon Phi; 72% efficiency.]

Page 30: Picador

MPI + OpenMP Configuration

Compare different process/thread configurations.

[Chart: run time in seconds for different MPI process / OpenMP thread configurations.]

Page 31: Picador

Optimum # Particles per Cell on Xeon Phi

We vectorize the loop over particles in a cell; it is efficient only if there are enough iterations.

Performance nearly stabilizes with > 100 particles per cell.

Page 32: Picador

Supercells

A widely used idea in GPU-based implementations: store and handle particles from neighboring cells together.

Select the supercell size to have ~100 particles per supercell.

Result: 1.5x speedup.

Page 33: Picador

Summary

We ported a parallel 3D PIC code to Xeon Phi.

Optimizations were beneficial for both CPU and Xeon Phi:
– 2.5x to 4x speedup due to memory access locality.
– 1.4x to 2x speedup due to vectorization.

Final performance metrics:
– 34% of peak, 50 ns / particle update on a 4-core Xeon.
– 21% of peak, 15 ns / particle update on an 8-core Xeon.
– 6% of peak, 12 ns / particle update on Xeon Phi.

Reasons for lower efficiency on Xeon Phi vs. CPU:
– Worse scaling: 72% vs. 99%.
– Subpar vectorization efficiency: 1.4x (ideal 8x).

Page 34: Picador

Xeon Phi Specific Optimizations To Try

Improve data layout and alignment for better utilization of vector loads and vectorization.
– Array of full particles, or separate arrays of positions and velocities, or separate arrays of everything? (See the sketch after this list.)

Another interpolation scheme: first interpolate from the Yee grid to a cell-aligned grid, then use it for the particles.
– In progress; currently a 15% benefit, more expected.

Apply vectorization not to loops over particles but to internal operations on 3D vectors.
– Can process two pairs of 3D vectors in one operation.

Use #pragma simd for full vectorization control.
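
For the data-layout question, a hedged sketch of the alternatives being weighed (types invented for illustration): an array of full particles (AoS) versus separate arrays per component (SoA), the latter giving unit-stride, alignable vector loads.

    #include <vector>

    // Option 1: array of full particles (AoS) - convenient, but vector loads
    // of a single component are strided.
    struct ParticleAoS { double x, y, z, px, py, pz; };
    using ParticlesAoS = std::vector<ParticleAoS>;

    // Option 2: separate arrays of everything (SoA) - each component is a
    // contiguous, alignable array, which suits 512-bit vector loads.
    struct ParticlesSoA {
        std::vector<double> x, y, z;
        std::vector<double> px, py, pz;
    };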

Page 35: Picador

THANK YOU FOR YOUR ATTENTION!

[email protected]