Experiences with Multigrid on 300K Processor Cores
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
15th Copper Mountain Multigrid Conference, March 31, 2011
B. Gmeiner, T. Gradl, U. Rüde (LSS Erlangen, [email protected])
1
Experiences with Multigrid on 300K Processor Cores
Lehrstuhl für Informatik 10 (Systemsimulation)Universität Erlangen-Nürnberg
Motivation: How fast are computers today (and tomorrow)?
Performance of Solvers: Theory vs. Practice
Scalable Parallel Multigrid Algorithms for PDE
Matrix-Free FE Solver: Hierarchical Hybrid Grids
Conclusions
Overview
2
Motivation
3
Example Peta-Scale System:Jugene @ Jülich
PetaFlops = 10^15 operations/second
IBM Blue Gene
Theoretical peak performance: 1.0027 Petaflop/s
294,912 cores
144 TBytes = 1.44 × 10^14 Bytes
#9 on the TOP500 list in Nov. 2010
4
For comparison: a current fast desktop PC is ~20,000 times slower
> 1,000,000 cores expected in 2011
Exa-scale systems expected by 2018/19
Extreme Scaling Workshop 2010 at Jülich Supercomputing Center
5
What are the consequences?
For application developers, "the free lunch is over"
Need explicitly parallel algorithms to exploit the performance potential of all future architectures
For HPC:
CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores, maybe sooner than we are ready for it
We will have to deal with systems with many millions of cores
The memory wall grows higher:
An exa-scale system with 10^18 Flops/sec and 10 MWatt power intake will have only 10 PicoJoule available for each Flop, not enough to touch (main) memory
What's the problem? Would you want to propel a super jumbo
with four strong jet engines (not those of Rolls-Royce, of course)
or with 300,000 blow-dryer fans?
6
How fast are our algorithms (multigrid) on current CPUs?
Assumptions:
Multigrid requires 27.5 Ops/unknown to solve an elliptic PDE (Griebel '89 for 2-D Poisson)
A modern laptop CPU delivers > 10 GFlops peak
Consequence:
We should solve one million unknowns in 0.00275 seconds (364 MUpS), i.e. ~3 ns per unknown
7
Revised Assumptions:
Multigrid takes 500 Ops/unknown to solve your favorite PDE
You can get 5% of 10 GFlops performance
Consequence: on your laptop you should solve one million unknowns in 1.0 second (1 MUpS), i.e. ~1 microsecond per unknown
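These two estimates are simple arithmetic; a minimal sketch using the slide's numbers (the helper `solve_time` is hypothetical, not part of any solver code):

```python
# Back-of-envelope solver throughput, using the numbers from the slides.
# solve_time is a hypothetical helper for illustration only.

def solve_time(unknowns, ops_per_unknown, peak_flops, efficiency=1.0):
    """Seconds to solve, given work per unknown and sustained Flop rate."""
    return unknowns * ops_per_unknown / (peak_flops * efficiency)

# Ideal case: 27.5 Ops/unknown at 10 GFlops peak
ideal = solve_time(1e6, 27.5, 10e9)            # 0.00275 s, ~364 MUpS

# Revised case: 500 Ops/unknown at 5% of 10 GFlops
revised = solve_time(1e6, 500.0, 10e9, 0.05)   # 1.0 s, ~1 MUpS

print(ideal, revised)
```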
Consider banded Gaussian elimination on the PlayStation (Cell processor), single precision, 250 GFlops, for 1000 × 1000 grid unknowns:
~2 Tera-operations for the factorization: will need about 10 seconds to factor the system; requires 8 GB of memory
Forward-backward substitution should run in about 0.01 seconds, except for bandwidth limitations
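The operation and memory counts behind this estimate follow from the standard band-LU cost model; a quick sketch (the 8 GB figure on the slide corresponds to storing the band in double precision):

```python
# Cost model for banded Gaussian elimination on an n x n grid.
# Standard band-LU estimates, not tied to any particular implementation.

def banded_lu_cost(n, bytes_per_entry=4):
    N = n * n                      # unknowns on an n x n grid
    b = n                          # matrix bandwidth for a 2-D grid
    ops = 2 * N * b * b            # ~flops to factor the band
    mem = N * b * bytes_per_entry  # bytes to store the band
    return ops, mem

ops, mem = banded_lu_cost(1000)
print(ops / 1e12)                       # ~2 Tera-operations
print(ops / 250e9)                      # ~8 s at 250 GFlops
print(banded_lu_cost(1000, 8)[1] / 1e9) # 8 GB if the band is stored in double
```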
Ideal vs. Real Performance
Maybe better to consider memory bandwidth rather than Flops as the performance metric:
At 10 GByte/sec we can read 1250 × 10^6 unknowns/sec
Even when assuming that the complete solution must be read 30 times, we should solve 10^6 unknowns in 0.025 sec
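The bandwidth-bound estimate in two lines (the 8 bytes per unknown assume double precision, which the slide implies but does not state):

```python
# Memory-bandwidth-bound estimate of solver speed.
BANDWIDTH = 10e9          # 10 GByte/s main memory bandwidth (slide value)
BYTES_PER_UNKNOWN = 8     # double precision (assumption)
SWEEPS = 30               # assume the full solution is streamed 30 times

reads_per_sec = BANDWIDTH / BYTES_PER_UNKNOWN           # 1250 x 10^6 unknowns/s
time_1m = SWEEPS * 1e6 * BYTES_PER_UNKNOWN / BANDWIDTH  # ~0.024 s (slide rounds to 0.025)
print(reads_per_sec, time_1m)
```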
State of the art (on peta-scale): ~10^12 unknowns
Heidelberg group (M. Blatt's talk, groundwater flow)
HHG (this talk, tetrahedral FE, scalar elliptic PDE with piecewise constant coefficients)
waLBerla (H. Köstler's talk, Laplace on structured grids)
Livermore group (pers. communication)
8
What can we do with 10^12 unknowns?
9
10^12 unknowns means:
10,000 points in each of 3 dimensions
discretizing the volume of the earth uniformly with less than 1 km mesh width
discretizing the volume of the atmosphere uniformly (w.r.t. mass) with less than 200 m grid spacing
With exa-scale, even if we want to simulate
a billion objects (particles), we can do a billion operations for each of them in each second
a trillion finite elements (finite volumes) to resolve a PDE, we can do a million operations for each of them in each second
Fluidized Bed(movie: thanks to K.E. Wirth)
10
Towards Scalable FE Software
Scalable Algorithmsand Data Structures
11
Multigrid: V-Cycle
Goal: solve A_h u_h = f_h using a hierarchy of grids
Relax on the fine grid
Compute the residual
Restrict the residual to the coarser grid
Solve the coarse-grid problem (by recursion)
Interpolate the coarse-grid correction
Correct the fine-grid approximation
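The V-cycle steps above can be sketched for a 1-D model problem. This is an illustrative toy (damped Jacobi smoothing, full-weighting restriction, linear interpolation for -u'' = f with zero Dirichlet boundaries), not the HHG implementation:

```python
import numpy as np

def smooth(u, f, h, nu, omega=2.0 / 3.0):
    """Damped Jacobi relaxation for -u'' = f on a uniform 1-D grid."""
    for _ in range(nu):
        u[1:-1] = ((1 - omega) * u[1:-1]
                   + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]))
    return u

def residual(u, f, h):
    """r = f - A u for the standard 3-point Laplacian."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full weighting from 2^k + 1 points to 2^(k-1) + 1 points."""
    rc = np.zeros((len(r) - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def interpolate(e):
    """Linear interpolation from the coarse to the fine grid."""
    ef = np.zeros(2 * (len(e) - 1) + 1)
    ef[::2] = e
    ef[1::2] = 0.5 * (e[:-1] + e[1:])
    return ef

def v_cycle(u, f, h, nu=2):
    """One V(nu,nu)-cycle: relax, restrict residual, recurse, correct, relax."""
    n = len(u) - 1
    if n == 2:
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])  # coarsest grid: direct solve
        return u
    u = smooth(u, f, h, nu)                                     # pre-smoothing
    r = residual(u, f, h)                                       # residual
    e = v_cycle(np.zeros(n // 2 + 1), restrict(r), 2 * h, nu)   # coarse solve by recursion
    u += interpolate(e)                                         # coarse-grid correction
    return smooth(u, f, h, nu)                                  # post-smoothing
```

A few cycles reduce the algebraic error to far below the discretization error, illustrating the h-independent convergence discussed later in the talk.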
12
Parallel High Performance FE Multigrid
Parallelize "plain vanilla" multigrid:
partition the domain
parallelize all operations on all grids
use clever data structures
Do not worry (so much) about coarse grids:
idle processors?
short messages?
sequential dependency in the grid hierarchy?
Elliptic problems always require global communication. This cannot be accomplished by
local relaxation or
Krylov space acceleration or
domain decomposition without a coarse grid
Bey‘s Tetrahedral Refinement
13
Hierarchical Hybrid Grids (HHG)
Joint work with Frank Hülsemann (now EDF, Paris), Ben Bergen (now Los Alamos), T. Gradl (Erlangen), B. Gmeiner (Erlangen)
HHG Goal: Ultimate Parallel FE Performance!
Unstructured adaptive refinement grids with
regular substructures for efficiency
superconvergence effects
tau-extrapolation for higher-order FE
matrix-free implementation
14
HHG refinement example
Input Grid
15
HHG Refinement example
Refinement Level one
16
HHG Refinement example
Refinement Level Two
17
HHG Refinement example
Structured Interior
18
HHG Refinement example
Structured Interior
19
HHG Refinement example
Edge Interior
20
HHG Refinement example
Edge Interior
Special smoothers for tetrahedra with large or small angles
21
Joint work withF. Gaspar and C. Rodrigo, Zaragoza
22
Adaptivity in HHG (with conforming meshes)
Refinement with Hanging Nodes in HHG
23
Coarse grid
Fine Grid
Adaptive Grid with Hanging Nodes
Smoothing operation
Residual computation
Restriction Computation
Treating hanging nodes as in FAC, see e.g.:
McCormick: Multilevel Adaptive Methods for PDE, SIAM, 1989
U. Rüde: Mathematical and Computational Techniques for Multilevel Adaptive Methods, SIAM, 1993
See also Daniel Ritter's talk
24
HHG for Parallelization
Use regular HHG patches for partitioning the domain
25
HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for
update vertex primary dependencies
update edge primary dependencies
update secondary dependencies
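The three-phase ordering and its halo copies can be sketched in a few lines. All class and field names here are hypothetical stand-ins; the real HHG stores structured arrays per primitive, but the copy-in / compute / copy-out pattern that resolves the dependencies is the same:

```python
# Illustrative sketch of the HHG update ordering: vertices, then edges,
# then elements, with explicit halo copies between primitive types.
# Names (Primitive, interior, halo) are placeholders, not HHG identifiers.

class Primitive:
    _next_id = 0
    def __init__(self):
        self.id = Primitive._next_id
        Primitive._next_id += 1
        self.interior = 0.0    # stand-in for the unknowns this primitive owns
        self.halo = {}         # ghost copies of neighboring interiors

def hhg_update(vertices, edges, elements, op):
    for v in vertices:                  # phase 1: vertex primitives
        v.interior = op(v)
    for e in edges:                     # phase 2: edge primitives
        for v in e.vertices:
            e.halo[v.id] = v.interior   # copy from vertex interiors
        e.interior = op(e)
        for v in e.vertices:
            v.halo[e.id] = e.interior   # copy to vertex halos
    for el in elements:                 # phase 3: element primitives
        for p in el.boundary:
            el.halo[p.id] = p.interior  # copy from edge/vertex interiors
        el.interior = op(el)
        for p in el.boundary:
            p.halo[el.id] = el.interior # copy to edge/vertex halos
```

The point of the ordering is that each phase only reads interiors that earlier phases have already updated, so the primary and secondary dependencies listed above are satisfied without extra synchronization within a phase.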
26
The HHG Framework BlueGene/P Single-Core Performance Strong Scalability Weak Scalability
Scalability of HHG on Blue Gene/P (Jugene)
Figure: Strong scaling behavior of HHG on PowerPC 450 cores. This test case was performed starting from 512 cores, solving 2.14 · 10^9 DoF.
Department for Computer Science 10 (System Simulation)
Strong Scaling
27
Towards Realtime
Is it possible to add implicit time stepping for parabolic or hyperbolic equations and achieve real-time performance with one time step per frame?
Example:
• V(1,1) cycle
• 40 CG steps on the coarsest mesh
• 2.14 · 10^9 unknowns
• 5 Multigrid levels
• Utilized number of cores: 49 152
Time per cycle: 0.08 seconds
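A quick check of the throughput implied by these figures (the derived rates are my arithmetic, not from the slides):

```python
# Aggregate and per-core throughput implied by the slide's figures.
unknowns = 2.14e9     # unknowns per V-cycle
t_cycle = 0.08        # seconds per V(1,1) cycle
cores = 49152

aggregate_mups = unknowns / t_cycle / 1e6   # ~26,750 million unknowns/s total
per_core = unknowns / t_cycle / cores       # ~5.4e5 unknowns/s per core
print(aggregate_mups, per_core)
```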
28
#Proc | #unknowns × 10^6 | Phase 1: sec | Phase 2: sec | Time to solution
Parallel scalability of a scalar elliptic problem in 3-D, discretized by tetrahedral finite elements.
Times to solution on SGI Altix: Itanium-2, 1.6 GHz.
Largest problem solved to date: 3.07 × 10^11 DoFs (1.8 trillion tetrahedra) on 9170 processors in roughly 90 seconds.
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.
Exa-Scale will be Easy!
If parallel efficiency is bad, choose a slower serial algorithm:
it is probably easier to parallelize
and will make your speedups look much more impressive
Introduce the "CrunchMe" variable for getting high Flops rates
advanced method: disguise CrunchMe by using an inefficient (but compute-intensive!) algorithm from the start
Introduce the "HitMe" variable to get good cache hit rates
advanced version: implement HitMe in the "Hash-Brown Lookaside Table for the Multi-Threatened Cash-Filling Cloud-Tree" data structure ... impressing yourself and others
Never cite "time-to-solution"
who cares whether you solve a real-life problem anyway
it is the MachoFlops that interest the people who pay for your research
Never waste your time trying to use a complicated algorithm in parallel
use a primitive algorithm => easy to maximize your MachoFlops
a few million CPU hours can easily save you days of reading boring math books
Is Math Research Addressing the Real Problems?
Performance?
h-independent convergence is not the holy grail
parallel efficiency is not the holy grail
We need theory that predicts the constants: O(N) algorithms vs. "time to solution" or "accuracy per MWh"
We will be drowned by the tsunami of parallelism
Validation?
rigorous theory vs. experimental validation
"abstract" numerical algorithms vs. high-quality software
Where is the math?
32
SIAM J. Scientific Computing
editor-in-chief: Hans Petter Langtangen, Oslo
(Uli Ruede until 12-2010)
SISC papers are classified into three categories:
Methods and Algorithms for Scientific Computing. Papers in this category may include theoretical analysis, provided that the relevance to applications in science and engineering is demonstrated. They should contain meaningful computational results and theoretical results or strong heuristics supporting the performance of new algorithms. (section editor: Jan Hesthaven)
Computational Methods in Science and Engineering. Papers in this section will typically describe novel methodologies for solving a specific problem in computational science or engineering. They should contain enough information about the application to orient other computational scientists but should omit details of interest mainly to the applications specialist. (section editor: Irad Yavneh)
Software and High-Performance Computing. Papers in this category should concern the development of high-quality computational software, high-performance computing issues, novel architectures, data analysis, or visualization. The primary focus should be on computational methods that have potentially large impact for an important class of scientific or engineering problems. (section editor: Tamara Kolda)
33
Acknowledgements
Dissertation Projects
N. Thürey, T. Pohl, S. Donath, S. Bogner (LBM, free surfaces, 2-phase flows)
M. Mohr, B. Bergen, U. Fabricius, H. Köstler, C. Freundl, T. Gradl, B. Gmeiner (massively parallel PDE solvers)
M. Kowarschik, J. Treibig, M. Stürmer, J. Habich (architecture-aware algorithms)
K. Iglberger, T. Preclik, K. Pickel (rigid body dynamics)
J. Götz, C. Feichtinger, F. Schornbaum (massively parallel LBM software, suspensions)
C. Mihoubi, D. Bartuschat (complex geometries, parallel LBM)