OpenFOAM: algorithmic overview and Pstream
● OpenFOAM is first and foremost a C++ library used to solve systems of Partial Differential Equations (PDEs) in discretized form. OpenFOAM stands for Open Field Operation And Manipulation.
● The engine of OpenFOAM is the numerical method, with the following features:
○ segregated, iterative solution (PCG/GAMG), unstructured finite volume method, co-located variables, equation coupling.
● The parallel computing method used by OpenFOAM is based on the standard MPI, using the strategy of domain decomposition (zero-layer domain decomposition).
● A convenient interface, Pstream, is used to plug any MPI library into OpenFOAM. It is a light wrapper around the selected MPI interface.
OpenFOAM: HPC performance
● OpenFOAM scales reasonably well, with an upper limit on the order of thousands of cores [1]. We are aiming at radical scalability for cases of hundreds of millions to billions of cells on 10K-100K cores.
● A custom version by Shimizu Corp., Fujitsu Limited and RIKEN on the K computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect) achieved high performance on 100 thousand MPI tasks for a large-scale transient CFD simulation with up to a 100-billion-cell mesh [2].
OpenFOAM HPC bottlenecks towards the exascale: open issues
Up to now, the well-known bottlenecks preventing OpenFOAM from performing on massively parallel clusters are:
● Scalability of the linear solvers and their limits in the parallelism paradigm.
○ In most cases the memory bandwidth is a limiting factor for the solver.
○ Additionally, global reductions are frequently required in the solvers.
○ The linear algebra core libraries are the main communication bottleneck for scalability.
● Sparse Matrix storage format: The LDU sparse matrix storage format used internally does not enable any cache-blocking mechanism (SIMD, vectorization).
● The I/O data storage system: when running in parallel, the data for decomposed fields and mesh(es) has historically been stored in multiple files within separate directories for each processor, which is a bottleneck for big simulations.
● The roofline model: performance bound (y-axis) ordered according to computational intensity
● Computational intensity: ratio of total floating-point operations to total data movement (bytes), i.e. flops/byte
● What is the OpenFOAM (CFD/FEM) arithmetic intensity? About 0.1, maybe less…
● TOP500 List - June 2019
• The Technical Committees cover all the key focus areas for OpenFOAM development; they assess the state-of-the-art, need and status for validation, documentation and further development.
• Remits of the HPC TC
• OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
• Work together with the Community to overcome the actual HPC bottlenecks of OpenFOAM:
  • Scalability of linear solvers
  • Adapt/modify the data structures of the sparse linear system to enable vectorization / hybridization
  • Improve memory access on new architectures
  • Improve memory bandwidth
  • Parallel pre- and post-processing, parallel I/O
• Strong co-design approach
  • Identify algorithm improvements to enhance HPC scalability
  • Interaction with the other Technical Committees (Numerics, Documentation)
• Priorities of the HPC TC:
  • HPC benchmark
  • GPU enabling of OpenFOAM (see the following presentation by Nvidia's Stan Posey)
  • Parallel I/O: the ADIOS 2 I/O library as a function object is now a git submodule in the OpenFOAM develop branch
Members of the Technical Committee
• Chair: Dr. Ivan Spisso, CINECA; S. Bnà, HPC Developer, CINECA
• Deputy Chair: Dr. Mark Olesen, Principal Eng., ESI-OpenCFD (Release and Maintenance)
• Dr. Henrik Rusche, Wikki Ltd. (Release Authority)
• Dr. M. Klemm, Principal Eng., and Dr. G. Rossi, App. Software Eng., Intel (Hardware OEM)
• Dr. Olly Perks / Dr. F. Spiga, Research Eng., ARM (Hardware OEM)
• F. Magugliani, Strategic Planning, E4 (HPC System Integrator)
• Dr. William F. Godoy, Scientific Data Group, Oak Ridge National Lab (HPC Computer Sc.)
• Dr. Pham Van Phuc, Institute of Technology, Shimizu Corporation (HPC/OEM)
• Dr. Stefano Zampini, PETSc Team, KAUST (HPC center / PETSc dev.)
• Mr. Axel Koehler, Sr. Solutions Architect, Nvidia (Hardware OEM)
HPC Technical Committee: First Indications of Tasks
• Commitment
  • Work together with the Community to overcome the actual HPC bottlenecks of OpenFOAM.
  • Demonstrate improvements in performance and scalability to move from the actual near-petascale class to pre-exascale and exascale class performance
• Remits: as listed above
• Tasks
  • Test and benchmark third-party linear solver algebra packages (ESI, Intel, PETSc Team)
  • Test parallel I/O (ADIOS2) (ESI, ORNL)
  • Review HPC benchmarks and establish common HPC benchmarks among different architectures (Intel, ARM, EPCC, E4)
  • Review documentation in the HPC & OpenFOAM technical area
Benchmark test-case matrix:
Task: 3-D Lid Driven Cavity flow - S (100^3), M (200^3), XL (400^3); incompressible laminar flow, regular and uniform grid
CPU intensive: yes | Memory intensive: yes | GPU intensive: no | I/O intensive: no
Tasks per node: Full (36), Half (18), …
KPIs: Time to solution, Energy to solution (?), Memory Bandwidth Bound
Bottleneck(s): linear solver algebra, data structure
● Derived from the 2-D lid-driven cavity flow of the OpenFOAM tutorial guide, https://www.openfoam.com/documentation/tutorial-guide/tutorialse2.php
● Simple geometry. B.C.: inflow, outflow, no-slip walls
● Stress test for the linear solver algebra, mainly the pressure equation
● KPIs to be monitored: wall time, Memory Bandwidth Bound
● Bound = the metric represents the percentage of elapsed time spent heavily utilizing system bandwidth
Initial Benchmark test-case: 3-D Lid Driven Cavity
Geometrical and physical properties

Test-case            S           M           XL
delta X (m)          0.001       0.0005      0.00025
N of cells           1,000,000   8,000,000   64,000,000
n of cells (linear)  100         200         400
nu (m^2/s)           0.01        0.01        0.01
d (m)                0.1         0.1         0.1
Co                   1           0.5         0.25
T final (s)          0.5         0.5         0.5 (*) (0.07275)
delta T (s)          0.001       0.00025     0.0000625
Reynolds             10          10          10
U (m/s)              1           1           1
num. of iterations   500         2000        8000
● Final physical time T = 0.5 s, to reach steady state in laminar flow (* if feasible)
● Delta X is halved when moving to the next bigger case.
● Courant number kept under the stability limit, reduced for bigger cases
● Delta T is reduced by a factor of 4 when moving to the next bigger case
Initial Benchmark test-case: 3-D Lid Driven Cavity
Residuals and number-of-iterations comparison
● Comparison of pressure (left) and Ux (right) residuals
● Solver tolerance vs solver tolerance + maxIter
● Left vertical axis: residual trend with and without fixed # iter.
● Right vertical axis: # iter.
Initial Benchmark test-case: 3-D Lid Driven Cavity
● Full computational density: using all cores/tasks available per node
● Half computational density: using 1/2 of the cores/tasks available per node
● Quarter density: using 1/4 of the cores/tasks available per node
● For the Broadwell processor, respectively:
○ --tasks-per-node=36
○ --tasks-per-node=18
○ --tasks-per-node=9
● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Left figure: inter-node scalability
● Too few cells per core to get good scalability
HPC comparison: Armidia and Galileo clusters
Results: XL test-case

number of cores             128       256       288       512       576       1152
Intel Broadwell (Galileo)   -         -         103.505   -         43.664    15.867
Arm ThunderX2 (Armidia)     189.930   90.914    -         47.276    -         -

● Comparison at comparable numbers of cores
● Runs using the full number of tasks per node: 36 for Broadwell, 64 for ARM
● Continuous line: trend line
Conclusions / Further work / Acknowledgments
● Conclusions
○ Presented the OpenFOAM HPC benchmark project
○ Created an open and shared repository with relevant data-sets and information
○ Presented preliminary results on HPC architectures
○ Started to provide the community with a homogeneous term of reference to compare different HW architectures, configurations and different SW environments
● Further work
○ Run the M test-case
○ Populate the repo web site with instructions on how to run and post-process the benchmark (with the Documentation TC)
○ Add half number of cores per node
○ Add several architectures
○ Add weak scalability tests at the optimal number of cells per core
○ Add energy to solution?
○ Add different (optimal?) decomposition methods
○ Add further test-cases
● Acknowledgments
○ C. Latini and A. Memmolo (CINECA) for post-processing, scripting and figure templates
○ Andy Heather and Joseph Nagzy for the initial set-up of the wiki repo and web page (work in progress)
○ Useful discussions with S. Zampini (PETSc Team)
○ People / companies / OEMs which give (or will give) availability of different architectures (E4, Intel, ARM, AMD, etc.)