Transcript
Page 1: Picador

Evgeny Efimenko, Sergey Bastrakov, Arkady Gonoskov, Iosif Meyerov, Michail Shiryaev, Igor Surmin

ICNSP 2013

Particle-in-Cell Plasma Simulation on the Intel Xeon Phi in PICADOR

Page 2: Picador

Agenda

Introduction to Xeon Phi

PICADOR Overview

Optimization Techniques

Performance Analysis

Scaling Efficiency


Page 3: Picador

HPC Hardware Environment

CPU: peak 100-200 GFLOPS; 4-16 x86 cores; SIMD, branch prediction, out-of-order execution; programming with C++, Fortran, MPI, OpenMP, TBB.

GPU: peak 1-2 TFLOPS; 500-1000 scalar, simple, low-frequency cores; programming with CUDA C, CUDA Fortran, OpenCL, OpenACC.

Xeon Phi: peak ~1 TFLOPS; 61 x86 cores with wide SIMD, simple, low-frequency; programming with C++, Fortran, MPI, OpenMP, TBB, OpenCL.

Page 4: Picador

Intel Xeon Phi Overview

Product family of the Intel Many Integrated Core Architecture (MIC), launched in 2012.

61 cores on shared memory with Linux on board.

Pentium-like x86 cores with 512-bit vector units.

~1 TFLOPS peak in DP, ~2 TFLOPS in SP.

8 to 16 GB GDDR5 memory, 240 to 350 GB/sec bandwidth.

Standard programming tools: C++, Fortran, MPI, OpenMP.

Image source: http://software.intel.com/sites/default/files/xeonPhiHardware.jpg

Page 5: Picador


Xeon Phi Architecture Overview

Image source: http://www.extremetech.com/wp-content/uploads/2012/07/Intel-Mic-640x353.jpg

Page 6: Picador

Levels of Parallelism on Xeon Phi

61 cores, 4 hardware threads per core.
– Launch one or several MPI processes with several threads per process (e.g. OpenMP).
– Run at least 2 threads per core for efficiency.

512-bit vector units: capable of 8 simultaneous arithmetic operations in DP, 16 in SP.
– Use SIMD instructions to benefit from the vector units.
– The compiler may generate vector instructions to execute several loop iterations in parallel.
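
To make the two levels concrete, here is a minimal sketch (not PICADOR code; array and variable names are invented) of an OpenMP-threaded outer loop over cells with a simple inner loop the compiler can map onto the vector units:

    #include <vector>
    #include <cstddef>

    // Minimal sketch: thread-level parallelism over cells (OpenMP) plus
    // data-level parallelism inside each cell that the compiler can map
    // onto the 512-bit vector units (8 doubles per instruction in DP).
    void pushAllCells(std::vector<double>& px, const std::vector<double>& fx,
                      const std::vector<std::size_t>& cellStart, double dt)
    {
        const std::ptrdiff_t numCells = cellStart.size() - 1;
        #pragma omp parallel for          // level 1: 120-240 threads over cells
        for (std::ptrdiff_t c = 0; c < numCells; ++c) {
            // level 2: simple, dependence-free inner loop over the cell's particles
            for (std::size_t i = cellStart[c]; i < cellStart[c + 1]; ++i)
                px[i] += fx[i] * dt;      // update of one momentum component
        }
    }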

Page 7: Picador

512-bit Vector Units

FMA(a, b, c) = a × b + c with one round-off.

For peak performance do vector FMA in all cores. For the Intel Xeon Phi 7110X:

1064.8 GFLOPS in DP = 61 cores × 1.091 GHz × 1 FMA per cycle × 8-wide vector FMA × 2 FLOP per FMA.

[Diagram: one 512-bit vector operation adds eight packed DP elements at once: z7..z0 = x7..x0 + y7..y0.]
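
As an illustration not taken from the slides, the same idea written with 512-bit intrinsics (AVX-512-style names from immintrin.h; the first-generation Xeon Phi exposes an analogous intrinsic set, and the array names are invented):

    #include <immintrin.h>

    // Illustrative only: z = x * y + z over 8 packed doubles per iteration,
    // i.e. one vector FMA (16 FLOP) per instruction.
    // Assumes x, y, z are 64-byte aligned.
    void fusedMultiplyAdd(double* z, const double* x, const double* y, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m512d vx = _mm512_load_pd(x + i);
            __m512d vy = _mm512_load_pd(y + i);
            __m512d vz = _mm512_load_pd(z + i);
            _mm512_store_pd(z + i, _mm512_fmadd_pd(vx, vy, vz));
        }
        for (int i = n - n % 8; i < n; ++i)   // scalar remainder
            z[i] = x[i] * y[i] + z[i];
    }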

Page 8: Picador

How To Benefit from Xeon Phi?

Used levels of parallelism     Peak, GFLOPS   Equivalent peak
Multi-threaded + vector        1064           5 CPUs, GPU
Multi-threaded + scalar        133            CPU
Single-threaded + vector       17             CPU core
Single-threaded + scalar       2              1/10 CPU core

Essential: good scaling on shared memory up to at least 120 threads.

Strongly recommended: vectorization.

Page 9: Picador

A Note on Parallelism

Overall peak of Xeon Phi is ~5x the peak of a CPU, but single-threaded scalar performance is ~0.1x of a CPU core.

A fast sequential operation on the CPU can become a bottleneck on Xeon Phi: e.g. PML, boundary pulse generator, particle migration between cells, periodic boundary conditions (on a single node).

Need to make everything parallel on Xeon Phi, or use offload mode and perform sequential operations on the host CPU. Similar to GPU usage.
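
A hedged back-of-the-envelope way to see this (my illustration, not on the slide): with a sequential fraction $s$ of the work and Xeon Phi threads running at roughly $0.1\times$ a CPU core, the time relative to one CPU core is roughly

$$ T_{\text{Phi}} \approx \frac{s}{0.1}\,T_{\text{core}} + \frac{1-s}{0.1 \cdot N_{\text{threads}}}\,T_{\text{core}}, $$

so even $s = 1\%$ of sequential work already costs about $0.1\,T_{\text{core}}$ and dominates the parallel part at $N_{\text{threads}} = 240$.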

Page 10: Picador

Programming for Xeon Phi

Need the Intel C++ / Fortran compiler.

Native mode: use as a 61-core CPU under Linux: C++ / Fortran + OpenMP / TBB / Cilk Plus.

Symmetric mode: use as a node of a cluster; can host several MPI processes: C++ / Fortran + MPI + OpenMP / TBB / Cilk Plus.

Offload mode: use as a coprocessor; run kernels launched from the host CPU, similar to GPGPU: C++ / Fortran + offload compiler directives + OpenMP / TBB / Cilk Plus.
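
For offload mode, a minimal sketch of the kind of code the slide refers to, using the Intel compiler's legacy offload directives (the function, array name, and sizes are invented for the example):

    // Illustrative offload-mode sketch for the Intel compiler.
    // The host ships the array to the coprocessor, runs an OpenMP kernel
    // there, and copies the result back.
    void scaleOnPhi(double* data, int n, double factor)
    {
        #pragma offload target(mic) inout(data : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                data[i] *= factor;
        }
    }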

Page 11: Picador

Optimization for Xeon Phi

Optimization recommendations for CPUs also apply to Xeon Phi: multi-threading, vectorization, cache-friendliness, etc.

But there are nuances on Xeon Phi vs. CPU that make it more challenging:
– Many more threads: 120 to 240 vs. 4 to 16.
– Wider vector units: 512-bit vs. 128- to 256-bit.

Page 12: Picador

Example

Consider a multi-threaded PIC code with 99% scaling efficiency on 8 cores, achieving 30% of CPU peak.

In the ideal case, 30% of Xeon Phi peak would be ~320 GFLOPS.

In reality:
– only 25% of the arithmetic is vectorized (not all loops, not the full 512-bit width): −240 GFLOPS;
– 70% scaling efficiency on 61 cores: −25 GFLOPS.

Overall: ~5% of peak on Xeon Phi, 55 GFLOPS.

Reality vs. ideal case = 1/6.
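
One consistent reading of these numbers (my arithmetic, not spelled out on the slide), treating the figures above as losses subtracted from the ideal estimate:

$$ 0.30 \times 1064 \approx 320, \qquad 320 - 240 = 80, \qquad 0.7 \times 80 \approx 55 \ \text{GFLOPS} \approx 5\%\ \text{of peak} \approx \tfrac{1}{6} \times 320. $$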

Page 13: Picador

Particle-in-Cell

Lorentz force computation (1st order field interpolation):

$$\mathbf{F}_i = q_i \left( \mathbf{E}(\mathbf{r}_i) + \frac{1}{c}\,\mathbf{v}_i \times \mathbf{B}(\mathbf{r}_i) \right)$$

Particle push (Boris):

$$\frac{\partial \mathbf{p}_i}{\partial t} = \mathbf{F}_i, \qquad \frac{\partial \mathbf{r}_i}{\partial t} = \mathbf{v}_i, \qquad \mathbf{v}_i = \frac{\mathbf{p}_i}{m_i} \left( 1 + \left( \frac{\mathbf{p}_i}{m_i c} \right)^2 \right)^{-1/2}$$

Field solver (FDTD, NDF):

$$\nabla \times \mathbf{B} = \frac{4\pi}{c}\,\mathbf{j} + \frac{1}{c}\,\frac{\partial \mathbf{E}}{\partial t}, \qquad \nabla \times \mathbf{E} = -\frac{1}{c}\,\frac{\partial \mathbf{B}}{\partial t}$$

Current deposition (1st order):

$$\mathbf{j}(\mathbf{r}) = \sum_i q_i \mathbf{v}_i\, \delta(\mathbf{r} - \mathbf{r}_i)$$

The PIC loop couples these stages through the particle data $(\mathbf{r}_i, \mathbf{v}_i)$, the fields $\mathbf{E}, \mathbf{B}$, and the current density $\mathbf{j}$.
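
To make the particle push concrete, here is a minimal, self-contained sketch of a relativistic Boris push in Gaussian units, matching the 1/c factors above. It is an illustration only, not PICADOR's implementation; the Vec3 helper and all names are invented for the example.

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
    static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
    static Vec3 cross(Vec3 a, Vec3 b) {
        return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
    }
    static double dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // One Boris step for momentum p and position r of a particle with charge q,
    // mass m, in interpolated fields E, B (Gaussian units), time step dt.
    void borisPush(Vec3& p, Vec3& r, Vec3 E, Vec3 B, double q, double m,
                   double c, double dt)
    {
        const Vec3 eImpulse = scale(E, q * dt / 2);            // half electric kick
        Vec3 pMinus = add(p, eImpulse);
        const double gamma = std::sqrt(1 + dot(pMinus, pMinus) / (m * m * c * c));
        const Vec3 t = scale(B, q * dt / (2 * m * c * gamma)); // magnetic rotation
        const Vec3 s = scale(t, 2 / (1 + dot(t, t)));
        const Vec3 pPrime = add(pMinus, cross(pMinus, t));
        const Vec3 pPlus = add(pMinus, cross(pPrime, s));
        p = add(pPlus, eImpulse);                              // second half kick
        const double gammaNew = std::sqrt(1 + dot(p, p) / (m * m * c * c));
        const Vec3 v = scale(p, 1 / (gammaNew * m));           // v = p / (gamma m)
        r = add(r, scale(v, dt));                              // position update
    }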

Page 14: Picador

PICADOR Overview

PICADOR is a tool for large-scale 3D Particle-in-Cell plasma simulation.

Under development since 2010.

Aimed at heterogeneous cluster systems with CPUs, GPUs, and Xeon Phi accelerators.
– C++, MPI, OpenMP, CUDA.

Design idea:
– Flexible and extendable high-level architecture.
– Heavily tuned low-level performance-critical routines.

* S. Bastrakov, R. Donchenko, A. Gonoskov, E. Efimenko, A. Malyshev, I. Meyerov, I. Surmin. Particle-in-cell plasma simulation on heterogeneous cluster systems. Journal of Computational Science, 3 (2012), 474-479.

Page 15: Picador

PICADOR Time Distribution

MPI exchanges and single-node performance can be optimized independently, so we consider only the single-node case here.

[Chart: PICADOR time distribution on CPU, GPU, and Xeon Phi.]

Page 16: Picador

Baseline Implementation

Global array for field and current density values.

Particles in each cell are stored in a separate array and processed by cells in parallel (see the layout sketch below).

Particle push pseudo-code:

    #pragma omp parallel for
    for each cell
        for each particle in the cell
            interpolate field (1st order)
            push (Boris)
            check if particle leaves cell
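
A minimal sketch of the data layout this pseudo-code implies, under my reading of the slide (the type and member names are invented): global field and current arrays plus a separate particle array per cell.

    #include <vector>

    // Illustrative data layout only; PICADOR's actual classes differ.
    struct Particle {
        double x, y, z;     // position
        double px, py, pz;  // momentum
    };

    struct Cell {
        std::vector<Particle> particles;   // particles currently in this cell
    };

    struct Grid {
        int nx, ny, nz;
        std::vector<double> ex, ey, ez;    // global field arrays (E)
        std::vector<double> bx, by, bz;    // global field arrays (B)
        std::vector<double> jx, jy, jz;    // global current density arrays
        std::vector<Cell> cells;           // one particle array per cell
    };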

Page 17: Picador

"No Effort" Port To Xeon Phi

The code already uses OpenMP, so we can use native mode on Xeon Phi with no modifications.

Rebuild with the Intel C++ Compiler, -mmic option.
– Some libraries are not tested on Xeon Phi and may require effort (even HDF5 needed a hotfix).

Time to port PICADOR to Xeon Phi: 1 day.

Performance on Xeon Phi is close to 16 CPU cores.

Time to develop our GPU-based implementation with similar performance: ~2 months.
– Need to maintain C++ and CUDA versions of all computationally intensive functions.

Page 18: Picador

Benchmark

Measure performance on a benchmark: 40x40x40 grid, 50 particles per cell, DP.

Hardware:
– 2x 4-core Intel Xeon 5150, peak 68 GFLOPS *.
– 2x 8-core Intel Xeon E2690, peak 371 GFLOPS **.
– 61-core Intel Xeon Phi 7110X, peak 1076 GFLOPS **.
* UNN, ** MVS-10P at JSC RAS

Measure two metrics:
– Time per iteration per particle (ns per particle update).
– % of peak performance.

Page 19: Picador

Baseline Performance

[Bar chart: time per particle update, ns: 853, 108, 457, 59, 62; % of peak: 8%, 7.9%, 5.5%, 2.6%, 0.9%.]

Page 20: Picador

Employing Locality

Particle-grid interactions are spatially local. When a cell is processed, preload field values / accumulate current locally (see the sketch below).

    #pragma omp parallel for
    for each cell
        preload 27 local field values
        for each particle in the cell
            locally interpolate field
            push (Boris)
            check if particle leaves cell
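
A rough sketch of what "preload 27 local field values" could look like (my illustration, not PICADOR code): copy the 3x3x3 neighborhood of the cell into a small local buffer so every particle in the cell reads cache-resident data.

    // Illustrative only: gather the 3x3x3 neighborhood of field values for
    // cell (ci, cj, ck) into a 27-element local buffer before the particle loop.
    // 'field' is a hypothetical flattened nx*ny*nz array; assumes an interior cell.
    void preloadLocalField(const double* field, int nx, int ny, int nz,
                           int ci, int cj, int ck, double local[27])
    {
        int idx = 0;
        for (int dk = -1; dk <= 1; ++dk)
            for (int dj = -1; dj <= 1; ++dj)
                for (int di = -1; di <= 1; ++di) {
                    const int i = ci + di, j = cj + dj, k = ck + dk;
                    local[idx++] = field[(k * ny + j) * nx + i];
                }
    }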

Page 21: Picador

Impact of Employing Locality

[Bar chart: time per particle update, ns: 347, 45, 172, 15, 17; % of peak: 19.7%, 19.1%, 14.6%, 10.7%, 3.1%; speedup over baseline: 2.5x, 2.4x, 2.7x, 4x, 3.6x.]

Page 22: Picador

Employing Loop Vectorization

Loop vectorization: simultaneous execution of several iterations on the vector units via SIMD instructions.

The compiler may perform vectorization of the innermost loop.

Main requirements (for most cases, illustrated below):
– No data dependencies between iterations.
– All internal function calls are inlined.
– No internal conditional statements.

[Diagram: loop iterations 0-5 grouped into vector operations.]
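
As a hedged illustration of these requirements (not from the slides; function names are invented), the first loop below violates them, while the second satisfies all three:

    // Violates the requirements: iteration i depends on the result of
    // iteration i-1, so the iterations cannot run in SIMD lanes independently.
    void prefixSum(double* a, int n)
    {
        for (int i = 1; i < n; ++i)
            a[i] = a[i - 1] + a[i];
    }

    // Satisfies them: independent iterations, no calls, no branches,
    // so the compiler can map it onto the 512-bit vector units.
    void axpy(double* y, const double* x, double alpha, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }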

Page 23: Picador

Employing Loop Vectorization

Ensure inlining of the interpolation and push functions.

Subdivide the particle loop into three loops, for interpolation, push, and check, to avoid conditionals and data dependencies between iterations.

Use compiler directives to guarantee that pointers do not alias; otherwise the compiler suspects a dependence.

Employ loop tiling for interpolation and push to save the interpolated fields.

After these modifications the push loop is vectorized; the interpolation loop can be vectorized but turned out to be inefficient.

Page 24: Picador

Employing Vectorization

    #pragma omp parallel for
    for each cell
        preload 27 local field values
        for each particle tile
            #pragma novector
            for each particle in the tile
                locally interpolate field
            #pragma ivdep
            for each particle in the tile
                push (Boris)
        for each particle in the cell
            check if particle leaves cell
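
A hedged C++ sketch of this structure for a single cell (my illustration; the particle layout, tile size, and field handling are invented, and the pragmas are Intel-compiler specific):

    #include <algorithm>

    // Illustrative tile-based loop fission for one cell: a scalar interpolation
    // loop followed by a clean, dependence-free push loop over the same tile.
    struct ParticleT { double x, y, z, px, py, pz; };

    void pushCellTiled(ParticleT* __restrict__ particles, int numParticles,
                       const double* __restrict__ localField, double qdt)
    {
        const int TILE = 64;                    // tile size: an assumption
        double fx[TILE], fy[TILE], fz[TILE];    // interpolated force per particle
        for (int start = 0; start < numParticles; start += TILE) {
            const int len = std::min(TILE, numParticles - start);
            #pragma novector                    // interpolation loop: kept scalar
            for (int i = 0; i < len; ++i) {
                // stand-in for 1st-order interpolation from the 27 local values
                fx[i] = localField[0];
                fy[i] = localField[1];
                fz[i] = localField[2];
            }
            #pragma ivdep                       // push loop: no aliasing, vectorized
            for (int i = 0; i < len; ++i) {
                particles[start + i].px += qdt * fx[i];
                particles[start + i].py += qdt * fy[i];
                particles[start + i].pz += qdt * fz[i];
            }
        }
        // the "check if particle leaves cell" loop runs separately over the whole cell
    }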

Page 25: Picador

Impact of Employing Vectorization

[Bar chart: time per particle update, ns: 199, 25, 95, 8, 12; % of peak: 34.3%, 34.1%, 26.4%, 20.9%, 4.5%; speedup over the previous version: 1.7x, 1.8x, 1.8x, 2x, 1.4x.]

Page 26: Picador

More Threads on Xeon Phi

1 thread per core is a priori inefficient because of the hardware: every clock an instruction is issued for a different thread.

[Chart: performance for increasing numbers of threads per core; labeled values: 20.3, 13.2, 11.2, 10.6.]

Page 27: Picador

Scaling on CPUs

2x 8-core CPUs on shared memory, 1 MPI process.

Hyper-threading significantly improves scaling.

Nearly ideal scaling to 8 cores (1 CPU) with HT; much worse scaling from 1 CPU to 2 CPUs.

[Chart: parallel scaling; labeled efficiencies: 83% and 73%.]

Page 28: Picador

Scaling on CPUs

For several CPUs on shared memory the best option is to launch one MPI process per CPU.

With this option, scaling from 1 to 16 cores is nearly ideal.

[Chart: parallel scaling; labeled efficiencies: 99% (2 MPI processes), 83% and 73% (1 MPI process).]

Page 29: Picador

Scaling on Xeon Phi

Keep 4 threads per core. Use 1 MPI process.

[Chart: scaling on Xeon Phi; 72% efficiency.]

Page 30: Picador

MPI + OpenMP Configuration

Compare different process/thread configurations.

[Chart: run time in seconds for different MPI process / OpenMP thread configurations.]

Page 31: Picador

Optimum # Particles per Cell on Xeon Phi

We vectorize the loop over particles in a cell; it is efficient only if there are enough iterations.

Performance nearly stabilizes with > 100 particles per cell.

Page 32: Picador

Supercells

A widely used idea in GPU-based implementations: store and handle particles from neighboring cells together.

Select the supercell size to have ~100 particles per supercell.

Result: 1.5x speedup.

Page 33: Picador

Summary

We ported a parallel 3D PIC code to Xeon Phi.

Optimizations were beneficial for both CPU and Xeon Phi:
– 2.5x to 4x speedup due to memory access locality.
– 1.4x to 2x speedup due to vectorization.

Final performance metrics:
– 34% of peak, 50 ns / particle update on a 4-core Xeon.
– 21% of peak, 15 ns / particle update on an 8-core Xeon.
– 6% of peak, 12 ns / particle update on Xeon Phi.

Reasons for lower efficiency on Xeon Phi vs. CPU:
– Worse scaling: 72% vs. 99%.
– Subpar vectorization efficiency: 1.4x (ideal 8x).

Page 34: Picador

Xeon Phi Specific Optimizations To Try

Improve data layout and alignment for better utilization of vector loads and vectorization.
– Array of full particles, or separate arrays of positions and velocities, or separate arrays of everything? (See the sketch after this list.)

Another interpolation scheme: first interpolate from the Yee grid to a cell-aligned grid, then use it for the particles.
– In progress; currently a 15% benefit, more expected.

Apply vectorization not to loops over particles but to internal operations on 3D vectors.
– Can process two pairs of 3D vectors in one operation.

Use #pragma simd for full vectorization control.
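
For the data-layout question, a hedged sketch of the alternatives being weighed (types invented for illustration): an array of full particles (AoS) versus separate arrays per component (SoA), the latter giving unit-stride, alignable vector loads.

    #include <vector>

    // Option 1: array of full particles (AoS) - convenient, but vector loads
    // of a single component are strided.
    struct ParticleAoS { double x, y, z, px, py, pz; };
    using ParticlesAoS = std::vector<ParticleAoS>;

    // Option 2: separate arrays of everything (SoA) - each component is a
    // contiguous, alignable array, which suits 512-bit vector loads.
    struct ParticlesSoA {
        std::vector<double> x, y, z;
        std::vector<double> px, py, pz;
    };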

Page 35: Picador

THANK YOU FOR YOUR ATTENTION!

[email protected]