Presentation at EuroMPI, Santorini, September 2011
Transcript
Juan A. Sillero, Guillem Borrell, Javier Jiménez (Universidad Politécnica de Madrid)
and Robert D. Moser (U. Texas Austin)
[Figure: turbulent/non-turbulent (T/NT) interface; axes x/δ99, y/δ99, z/δ99]
Hybrid OpenMP-MPI turbulent boundary layer code over 32k cores
FUNDED BY: CICYT, ERC, INCITE, & UPM
Outline
★ Motivations
★ Numerical approach
★ Computational setup & domain decomposition
★ Node topology
★ Code scaling
★ I/O performance
★ Conclusions
Motivations
• Differences between internal and external flows:
  • Internal: pipes and channels
  • External: boundary layers
• Effect of large-scale intermittency on the turbulent structures
• Energy-consumption optimization:
  • Skin friction is generated at the interface between the vehicle and the boundary layer
• Separation of scales:
  • Three-layer structure: inner, logarithmic and outer
  • Achieved only at high Reynolds numbers
• Important advantages of simulations over experiments
Motivations: Underlying physics
[Figure: internal flows (duct and pipe sections, fully turbulent) vs. external flows (boundary layer with turbulent and non-turbulent regions)]
• Skin friction (drag): about 5% of world energy consumption
Numerical Approach
• Incompressible Navier–Stokes equations + boundary conditions (written out below)
• Staggered grid
[Figure: staggered-grid arrangement of u, v, w and p]
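For reference, the incompressible Navier–Stokes equations can be written in the standard non-dimensional form (Re is the Reynolds number; the boundary conditions are those indicated on the slide and detailed in the references cited later):

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \frac{1}{Re}\nabla^{2}\mathbf{u}, \qquad \nabla\cdot\mathbf{u} = 0$$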
Numerical Approach
• Incompressible Navier–Stokes equations: non-linear terms, linear viscous terms and linear pressure-gradient terms
• Time integration: semi-implicit RK-3
Numerical Approach
• Incompressible Navier–Stokes equations
• Spatial discretization: compact finite differences (X & Y) and pseudo-spectral (Z); an example scheme is given below
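As an illustration of the compact finite differences mentioned above, the classical fourth-order tridiagonal (Padé) scheme for the first derivative on a uniform grid of spacing h is the textbook example; it is shown here only for orientation and is not necessarily the exact stencil implemented in the code:

$$\frac{1}{4}f'_{i-1} + f'_{i} + \frac{1}{4}f'_{i+1} = \frac{3}{2}\,\frac{f_{i+1}-f_{i-1}}{2h}$$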
Numerical Approach
✦ Fractional-step method
✦ Inlet conditions generated with the recycling-scheme approach of Lund et al.
✦ Linear systems solved using LU decomposition
✦ Poisson equation for the pressure solved using a direct method (schematic step below)
✦ Second-order time accuracy and fourth-order compact finite differences
References: Simens et al., JCP 228, 4218 (2009); Jiménez et al., JFM 657, 335 (2010)
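In a fractional-step method of this type, the pressure (or a pseudo-pressure) is obtained from a Poisson equation driven by the divergence of an intermediate velocity field. Schematically, with u* the intermediate velocity and Δt the time step (the exact splitting within the RK-3 substeps follows Simens et al. 2009):

$$\nabla^{2}\phi = \frac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{*}, \qquad \mathbf{u}^{n+1} = \mathbf{u}^{*} - \Delta t\,\nabla\phi$$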
Computational setup & domain decomposition
• Plane-to-plane decomposition: Π_XY ≈ 63 MB, Π_ZY ≈ 11 MB (16 R*8 buffers)
• Blue Gene/P node: 4× PowerPC 450 cores, 2 GB RAM (DDR2)
• Allocations: INCITE project (ANL) and PRACE Tier-0 project (Jugene)
Computational setup & domain decomposition
• New parallelization strategy: hybrid OpenMP-MPI
Computational setup & domain decomposition
• Global transposes (a minimal sketch follows this list):
  • Change the memory layout
  • Collective communications: MPI_ALLTOALLV
  • Messages are single precision (R*4)
  • About 40% of the total time (when using the torus network)
• 4 OpenMP threads, static scheduling:
  • Through private indexes
  • Maximise data locality
  • Good load balance
• Loop blocking in Y; tridiagonal linear systems solved with an LU solver; tuned for Blue Gene/P
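A minimal sketch of the pattern described above, assuming precomputed count/displacement arrays and a generic per-plane kernel (all names are illustrative, not the code's actual routines):

```c
#include <mpi.h>

/* Global transpose: every rank exchanges single-precision (R*4) blocks with
 * every other rank of the communicator to change the memory layout
 * (e.g. XY-planes -> ZY-planes).  The count/displacement arrays are assumed
 * to be precomputed from the local plane sizes. */
void global_transpose(float *sendbuf, float *recvbuf,
                      int *scount, int *sdispl, int *rcount, int *rdispl,
                      MPI_Comm comm)
{
    MPI_Alltoallv(sendbuf, scount, sdispl, MPI_FLOAT,
                  recvbuf, rcount, rdispl, MPI_FLOAT, comm);
}

/* Within the node, compute loops are threaded with OpenMP using static
 * scheduling, so each of the 4 threads owns a contiguous range of planes
 * (data locality and load balance). */
void compute_planes(float *data, int nplanes, int plane_size)
{
#pragma omp parallel for schedule(static)
    for (int j = 0; j < nplanes; ++j) {
        float *plane = data + (long)j * plane_size;
        for (int i = 0; i < plane_size; ++i)
            plane[i] *= 2.0f;   /* placeholder for the real per-plane kernel */
    }
}
```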
Computational setup & domain decomposition
• Create 2 MPI groups (MPI_GROUP_INCL), as sketched after this list:
  • Groups created from two lists of ranks
  • The global communicator is split into two local ones
• Each group performs independently; a few global operations remain:
  • Time step: MPI_ALLREDUCE
  • Inlet conditions: SEND/RECEIVE
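A minimal sketch of this group/communicator setup, assuming the two rank lists (one per boundary layer) come from the domain decomposition; function and variable names are illustrative:

```c
#include <mpi.h>

/* Build an MPI group from a list of world ranks and derive a local
 * communicator for it.  Called once with the BL1 rank list and once with
 * the BL2 rank list; ranks not present in the list receive MPI_COMM_NULL.
 * MPI_COMM_WORLD remains available for the few global operations
 * (time-step MPI_Allreduce, inlet-condition send/receive). */
MPI_Comm make_local_comm(int *ranks, int nranks)
{
    MPI_Group world_group, local_group;
    MPI_Comm  local_comm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, nranks, ranks, &local_group);
    MPI_Comm_create(MPI_COMM_WORLD, local_group, &local_comm);  /* collective */

    MPI_Group_free(&local_group);
    MPI_Group_free(&world_group);
    return local_comm;
}
```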
Node topology
• How to map virtual processes onto physical processors?
• Predefined vs. custom mapping: the custom mapping is twice as fast
• 8192 nodes: BL1 = 512, BL2 = 7680
• The 3D torus network is lost: CommBL1 ∪ CommBL2 = MPI_COMM_WORLD
Node topology
• Balance between communication and computation
Code Scaling
[Figures: scaling plots across nodes (MPI) and within the node (OpenMP); axes include millions of points per node, message size [bytes], time [s] and time per message [s]; message sizes range from 2 kB to ≈ 7 MB depending on node occupation]
• Time breakdown: 40% communication, 8% transposes, 52% computation
• Linear weak scaling
I/O Performance
• Checkpoint of the simulation: 0.5 TB (R*4)
  • Every 3 hours (12-hour runs)
  • Velocity {u,v,w} and pressure {p} fields (4×84 GB + 4×7.2 GB)
  • Correlation files {u}
• Different strategies for I/O:
  • Serial I/O: discarded
  • Parallel collective I/O: