
Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores


Presentation at EuroMPI, Santorini, September 2011
Transcript
Page 1: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Juan A. Sillero, Guillem Borrell, Javier Jiménez (Universidad Politécnica de Madrid)

and Robert D. Moser (U. Texas Austin)

(Title figure: the turbulent/non-turbulent (T/NT) interface; axes x/δ99, y/δ99, z/δ99.)

Hybrid OpenMP-MPI Turbulent Boundary Layer Code over 32k Cores

FUNDED BY: CICYT, ERC, INCITE, & UPM

Page 2: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Outline

★ Motivations

★ Numerical approach

★ Computational setup & domain decomposition

★ Node topology

★ Code scaling

★ I/O performance

★ Conclusions

Page 3: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Motivations

• Differences between internal and external flows:
  • Internal: pipes and channels
  • External: boundary layers

• Effect of large-scale intermittency on the turbulent structures

• Energy consumption optimization:
  • Skin friction is generated at the vehicle/boundary-layer interface

• Separation of scales:
  • Three-layer structure: inner, logarithmic and outer
  • Achieved only at high Reynolds number

• Important advantages of simulations over experiments

Page 4: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Motivations: Some underlying physics

INTERNAL FLOWS vs. EXTERNAL FLOWS

(Figure: sections of internal flows (pipe and duct), which are turbulent everywhere, and of an external boundary layer, which shows a turbulent/non-turbulent interface.)

Page 5: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Motivations: Underlying physics

INTERNAL FLOWS vs. EXTERNAL FLOWS

(Figure: the same sections, now highlighting skin friction (drag), responsible for about 5% of world energy consumption.)

Page 6: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Numerical Approach

• Incompressible Navier-Stokes equations + boundary conditions

(Figure: staggered-grid arrangement of the velocity components u, v, w and the pressure p.)
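For reference (the slide shows the equations only as part of the figure), a standard form of the incompressible Navier-Stokes equations, with p the pressure divided by the density and ν the kinematic viscosity; the code's exact non-dimensionalization is not stated on the slide:

\[
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\nabla p + \nu\,\nabla^{2}\mathbf{u},
\qquad
\nabla\cdot\mathbf{u} = 0,
\]

solved for the staggered velocity components u, v, w and the pressure p.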

Page 7: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Numerical Approach

• Incompressible Navier-Stokes equations, split into:
  • Non-linear terms
  • Linear viscous terms
  • Linear pressure-gradient terms
• Time integration: semi-implicit RK-3

(Figure: staggered-grid arrangement of u, v, w and p.)
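Only as an illustration of this splitting (the actual coefficients are not given in the slides), a generic semi-implicit Runge-Kutta substep treats the non-linear terms N(u) explicitly and the linear viscous terms L(u) implicitly:

\[
\frac{\mathbf{u}^{k} - \mathbf{u}^{k-1}}{\Delta t}
  = \gamma_{k}\,N(\mathbf{u}^{k-1}) + \zeta_{k}\,N(\mathbf{u}^{k-2})
  + \alpha_{k}\,L(\mathbf{u}^{k-1}) + \beta_{k}\,L(\mathbf{u}^{k})
  - \nabla p^{k},
\qquad k = 1, 2, 3,
\]

with scheme-dependent coefficients α_k, β_k, γ_k, ζ_k; only linear systems for the viscous operator have to be solved implicitly at each substep.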

Page 8: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Numerical Approach

• Incompressible Navier-Stokes equations

SPATIAL DISCRETIZATION:
  • Compact finite differences in X & Y
  • Pseudo-spectral in Z

(Figure: staggered-grid arrangement of u, v, w and p.)

Page 9: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Numerical Approach

Simens et al., JCP 228, 4218 (2009); Jiménez et al., JFM 657, 335 (2010)

✦ Fractional-step method

✦ Inlet conditions using the recycling scheme of [Lund et al.]

✦ Linear systems solved using LU decomposition

✦ Poisson equation for the pressure solved with a direct method

✦ 2nd-order time accuracy and 4th-order compact finite differences
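Schematically, the fractional-step method advances an intermediate velocity u* without the continuity constraint, solves a Poisson equation for a pressure-like variable φ, and projects the result onto the divergence-free field (a generic projection form; the code's exact formulation is the one of Simens et al. 2009):

\[
\mathbf{u}^{*} = \mathbf{u}^{n} + \Delta t\,\mathrm{RHS}(\mathbf{u}^{n}),
\qquad
\nabla^{2}\phi = \frac{\nabla\cdot\mathbf{u}^{*}}{\Delta t},
\qquad
\mathbf{u}^{n+1} = \mathbf{u}^{*} - \Delta t\,\nabla\phi ,
\]

where the Poisson equation is the step solved with the direct method mentioned above.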

Page 10: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Computational setup & domain decomposition

Plane-to-plane decomposition (16 R*8 buffers):
  • Π_XY planes ≈ 63 MB
  • Π_ZY planes ≈ 11 MB

Blue Gene/P node:
  • 4 × PowerPC 450 cores
  • 2 GB RAM (DDR2)

INCITE project (ANL) and PRACE Tier-0 project (Jugene)

Page 11: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Computational setup & domain decomposition

New parallelization strategy:

Hybrid OpenMP-MPI

Page 12: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Computational setup & domain decomposition

• Global transposes:
  • Change the memory layout
  • Collective communications: MPI_ALLTOALLV
  • Messages are single precision (R*4)

• About 40% of the total time (when using the torus network)
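A minimal sketch of such a collective transpose, assuming the send/receive counts and displacements have already been computed from each rank's local slab sizes (all names are illustrative; this is not the code's actual routine):

    #include <mpi.h>

    /* Global transpose sketch: redistribute single-precision data among the
     * ranks of a communicator so that the memory layout changes from one
     * family of planes to the other.  Counts and displacements are given in
     * elements and must be precomputed from the local slab sizes.           */
    void global_transpose(float *sendbuf, float *recvbuf,
                          int *sendcounts, int *sdispls,
                          int *recvcounts, int *rdispls,
                          MPI_Comm comm)
    {
        /* MPI_FLOAT matches the single-precision (R*4) messages on the slide. */
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_FLOAT,
                      recvbuf, recvcounts, rdispls, MPI_FLOAT, comm);
    }

After the exchange, the receive buffer is simply reinterpreted with the new plane-oriented layout; no further copies are needed.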

Page 13: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Computational setup & domain decomposition

• 4 OpenMP threads
• Static scheduling:
  • Through private indexes
  • Maximise data locality
  • Good load balance
• Loop blocking in Y
• Tridiagonal LU solver for the linear systems
• Tuned for Blue Gene/P
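A minimal sketch of this threading pattern, assuming an illustrative slab of nx × ny × nz points and a Thomas (LU) sweep for the tridiagonal systems produced by the compact finite differences; names, sizes and the block size are hypothetical, not taken from the actual code:

    #include <omp.h>
    #include <stddef.h>

    #define YBLOCK 32   /* illustrative cache-blocking size in y */

    /* Solve independent tridiagonal systems along x for every (y,z) line of
     * an nx*ny*nz slab.  Static scheduling gives each of the 4 threads a
     * fixed, private range of z planes; the y loop is blocked for cache.    */
    void solve_tridiagonal_lines(int nx, int ny, int nz,
                                 const double *a, const double *b, const double *c,
                                 double *rhs,   /* nz*ny*nx values, overwritten */
                                 double *work)  /* 4*nx scratch values          */
    {
    #pragma omp parallel num_threads(4)
        {
            double *cp = work + (size_t)omp_get_thread_num() * nx;  /* private scratch */

    #pragma omp for schedule(static)
            for (int k = 0; k < nz; k++) {
                for (int j0 = 0; j0 < ny; j0 += YBLOCK) {            /* blocking in y */
                    int j1 = (j0 + YBLOCK < ny) ? j0 + YBLOCK : ny;
                    for (int j = j0; j < j1; j++) {
                        double *f = rhs + ((size_t)k * ny + j) * nx;
                        /* forward elimination (Thomas/LU sweep) */
                        cp[0] = c[0] / b[0];
                        f[0] /= b[0];
                        for (int i = 1; i < nx; i++) {
                            double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
                            cp[i] = c[i] * m;
                            f[i] = (f[i] - a[i] * f[i - 1]) * m;
                        }
                        /* back substitution */
                        for (int i = nx - 2; i >= 0; i--)
                            f[i] -= cp[i] * f[i + 1];
                    }
                }
            }
        }
    }

With schedule(static) and 4 threads, each thread always touches the same z planes, which is what provides the data locality and load balance listed above.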

Page 14: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Computational setup & domain decomposition

• Create 2 MPI groups (MPI_GROUP_INCL)
• Groups created from 2 lists of ranks
• Split the global communicator into 2 local ones

• Each group performs independently
• Some global operations remain:
  • Time step: MPI_ALLREDUCE
  • Inlet conditions: SEND/RECEIVE
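A minimal sketch of that communicator split, assuming the rank lists are provided by the caller (illustrative names only):

    #include <mpi.h>

    /* Build a sub-communicator of MPI_COMM_WORLD from an explicit list of
     * ranks, as on the slide (MPI_GROUP_INCL + communicator creation).      */
    MPI_Comm make_group_comm(int *ranks, int nranks)
    {
        MPI_Group world_group, local_group;
        MPI_Comm  local_comm;

        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_incl(world_group, nranks, ranks, &local_group);

        /* Collective over MPI_COMM_WORLD; ranks not listed get MPI_COMM_NULL. */
        MPI_Comm_create(MPI_COMM_WORLD, local_group, &local_comm);

        MPI_Group_free(&local_group);
        MPI_Group_free(&world_group);
        return local_comm;
    }

Calling this once with the BL1 rank list and once with the BL2 rank list gives the two local communicators; the global MPI_ALLREDUCE for the time step and the inlet SEND/RECEIVE still use MPI_COMM_WORLD.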

Page 15: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Node topology

How to map virtual processes onto physical processors?

Predefined vs. custom mapping: the custom mapping is twice as fast.

8192 nodes: BL1 = 512, BL2 = 7680

The 3D torus network is lost: Comm_BL1 ∪ Comm_BL2 = MPI_COMM_WORLD

Page 16: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Node topology

(Figure: balance between communication and computation.)

Page 17: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

CODE SCALING

(Figures: time per message vs. message size [bytes] for MPI communication across nodes, and time [s] vs. millions of points per node for OpenMP within a node; annotations mark message sizes of about 2 kB and about 7 MB (node occupation). Time breakdown: 40% communication, 8% transposes, 52% computation. Weak scaling is linear.)

Page 18: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

I/O Performance

• Checkpoint of the simulation: 0.5 TB (R*4)
  • Every 3 hours (12-hour runs)
  • Velocity {u,v,w} and pressure {p} fields (4 × 84 GB + 4 × 7.2 GB)
  • Correlation files {u}

• Different strategies for I/O:
  • Serial I/O: discarded
  • Parallel collective I/O:
    • POSIX calls
    • SIONlib library (Juelich)
    • HDF5 (GPFS & PVFS2)

• HDF5 tuning for Blue Gene/P:
  • GPFS & PVFS2 (cache OFF and ON, respectively)
    • Cache OFF, write: 2 GB/s (5-15 minutes)
    • Cache ON, write: 16 GB/s (25-60 seconds)
  • Forcing the file-system block size in GPFS: 16 GB/s
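A minimal sketch of a parallel collective HDF5 write of one single-precision field, using the MPI-IO file driver and a collective dataset transfer; the 1-D layout and all names are illustrative, and the GPFS/PVFS2 cache and block-size tuning mentioned above would be applied through file-system hints not shown here:

    #include <hdf5.h>
    #include <mpi.h>

    /* Collectively write one single-precision field: each rank contributes
     * a contiguous slab of 'local_n' values at 'offset' in a dataset of
     * 'global_n' values (1-D for brevity; the real fields are 3-D).        */
    void write_field(const char *fname, const char *dset,
                     const float *data, hsize_t local_n,
                     hsize_t offset, hsize_t global_n)
    {
        /* File access property list: use the MPI-IO (parallel) driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        hid_t filespace = H5Screate_simple(1, &global_n, NULL);
        hid_t dataset = H5Dcreate2(file, dset, H5T_NATIVE_FLOAT, filespace,
                                   H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Each rank selects its own hyperslab of the global dataset. */
        hid_t memspace = H5Screate_simple(1, &local_n, NULL);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                            &local_n, NULL);

        /* Collective data transfer. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dataset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, data);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dataset); H5Fclose(file); H5Pclose(fapl);
    }

With the collective transfer property set, all ranks participate in each H5Dwrite, so the MPI-IO layer can aggregate the checkpoint into large, well-aligned requests.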

Page 19: Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Conclusions

✴ Turbulent boundary layer code ported to hybrid OpenMP-MPI

✴ Memory optimized for Blue Gene/P: 0.5 GB/core

✴ Excellent linear weak scaling up to 8k nodes

✴ Large performance impact from custom node topologies

✴ Parallel collective I/O (HDF5): read 22 GB/s, write 16 GB/s

(Figure: low-pressure isosurfaces at high Reynolds number.)