Optimization and Highly Parallel Implementation of Domain Decomposition Based Algorithms

[Figure: two-level domain decomposition of Ω – decomposition into clusters (size Hc) and decomposition into subdomains (size Hs), with gluing multipliers λg, Dirichlet multipliers λd, corner unknowns c and element size h.]

industry.it4i.cz | www.it4i.cz | am.vsb.cz

FETI and HFETI Solvers
• C++ implementation based on Intel MKL sparse and dense BLAS routines and the MKL version of the PARDISO sparse direct solver
• parallelization tools and strategies
  - hybrid parallelization for multi-socket, multi-core compute nodes – enables over-subscription of cores with multiple subdomains
  - distributed-memory parallelization – uses MPI 3.0 non-blocking collective operations for global reductions (Intel MPI 5.0)
  - shared-memory parallelization using Intel Cilk Plus – enables parallel reductions with custom reduce operations on C++ classes

Reduction of global communication for (H)FETI
[Figure: gluing matrices B1–B4 and the corresponding multipliers λ1–λ4 distributed over four CPUs, one per subdomain Ω1–Ω4; each CPU stores only the portion of λ touching its own subdomain.]

Hiding latencies in Krylov Solvers

Preconditioned pipelined Conjugate Gradient (CG) algorithm
 1: r_0 := b − A x_0;  u_0 := M^{-1} r_0;  w_0 := A u_0
 2: for i = 0, 1, 2, ... do
 3:   γ_i := (r_i, u_i)
 4:   δ := (w_i, u_i)
 5:   m_i := M^{-1} w_i
 6:   n_i := A m_i
 7:   if i > 0 then
 8:     β_i := γ_i / γ_{i-1};  α_i := γ_i / (δ − β_i γ_i / α_{i-1})
 9:   else
10:     β_i := 0;  α_i := γ_i / δ
11:   end if
12:   z_i := n_i + β_i z_{i-1}
13:   q_i := m_i + β_i q_{i-1}
14:   s_i := w_i + β_i s_{i-1}
15:   p_i := u_i + β_i p_{i-1}
16:   x_{i+1} := x_i + α_i p_i
17:   r_{i+1} := r_i − α_i s_i
18:   u_{i+1} := u_i − α_i q_i
19:   w_{i+1} := w_i − α_i z_i
20: end for

• sparse matrix–vector product – nearest-neighbor communication, good scaling
• dot product – global communication, scales as log(P)
• scalar–vector multiplication, vector–vector addition – no communication

P. Ghysels, T. Ashby, K. Meerbergen and W. Vanroose: Hiding global communication latency in the GMRES algorithm on massively parallel machines.
http://www.prace-ri.eu/  https://projects.imec.be/exa2ct/
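The pipelined CG recurrence in the listing above can be sketched in a few dozen lines of serial C++. This is a minimal illustration, not the solver's actual code: it assumes an identity preconditioner (M = I, matching the "no preconditioner" setting of the benchmarks) and a small dense matrix standing in for the FETI operator; in the real solver the two dot products per iteration are fused into a single non-blocking global reduction that overlaps with the matrix–vector product.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[i].size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Pipelined CG with M = I; returns the approximate solution of A x = b
// after a fixed number of iterations (x_0 = 0).
Vec pipelined_cg(const Mat& A, const Vec& b, int iters) {
    const std::size_t n = b.size();
    Vec x(n, 0.0), r = b;              // r_0 = b - A*0 = b
    Vec u = r;                         // u_0 = M^{-1} r_0 = r_0
    Vec w = matvec(A, u);              // w_0 = A u_0
    Vec z(n, 0.0), q(n, 0.0), s(n, 0.0), p(n, 0.0);
    double gamma_old = 0.0, alpha_old = 0.0;
    for (int i = 0; i < iters; ++i) {
        double gamma = dot(r, u);      // gamma_i = (r_i, u_i)   (one reduction
        double delta = dot(w, u);      // delta   = (w_i, u_i)    in practice)
        Vec m = w;                     // m_i = M^{-1} w_i  (M = I here)
        Vec nv = matvec(A, m);         // n_i = A m_i, overlapped with the reduction
        double beta, alpha;
        if (i > 0) {
            beta  = gamma / gamma_old;
            alpha = gamma / (delta - beta * gamma / alpha_old);
        } else {
            beta  = 0.0;
            alpha = gamma / delta;
        }
        for (std::size_t k = 0; k < n; ++k) {
            z[k] = nv[k] + beta * z[k];
            q[k] = m[k]  + beta * q[k];
            s[k] = w[k]  + beta * s[k];
            p[k] = u[k]  + beta * p[k];
            x[k] += alpha * p[k];      // all eight updates are reduction-free
            r[k] -= alpha * s[k];
            u[k] -= alpha * q[k];
            w[k] -= alpha * z[k];
        }
        gamma_old = gamma;
        alpha_old = alpha;
    }
    return x;
}
```

The point of the restructuring is visible even serially: lines 12–19 of the listing are pure vector updates with no communication, so the single fused reduction (γ, δ) is the only synchronization point per iteration.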
3D CUBE – HFETI elasticity benchmark
3D CUBE – Large Scale FETI Benchmark
• FETI – decomposition into subdomains (Hc)
• HFETI – decomposition into clusters (Hc) and subdomains (Hs)
• highly parallel generator scales up to thousands of subdomains
• large subdomains, from 120,000 to 160,000 DOFs
• the sparse direct solver uses most of the memory and CPU time (99%)
• no preconditioner

Generator output for the 100×100×100-element cube:
----------------------------------------------------------------------
#subdomains          125      (Nx Ny Nz) = (5 5 5)
#elements per subd.  8000     (nx ny nz) = (20 20 20)
#all elements:       1000000  (nx ny nz) = (100 100 100)
#coordinates:        1030301
#DOFs undecomp.:     3090903
#DOFs decomp.:       3472875
----------------------------------------------------------------------
mesh: done

• each MPI process iterates only over the lambdas associated with its own subdomain
  - FETI – required for the multiplication with the restriction of the B matrix to the given subdomain
  - HFETI – required for the multiplication with the restriction of the B1 matrix to the given subdomain
• the global update of the vector λ becomes a nearest-neighbor type of communication with good scalability

Superlinear Strong Scaling of FETI

Communication layer performance evaluation on FETI
- CG algorithm with 2 reductions – the general version of the CG algorithm
- CG algorithm with 1 reduction – based on the preconditioned pipelined CG algorithm, with the projector used in place of the preconditioner; this algorithm is ready to use non-blocking global reductions (coming in Intel MPI 5.0) for further performance improvements
- GGTINV – parallelizes the coarse-problem solve and merges the two Gather and Scatter global operations into a single AllGather
- domain size 5³ elements, i.e. 3·(5+1)³ = 648 DOFs – a small subdomain size is chosen deliberately to expose all communication bottlenecks of the solver
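The generator statistics above can be cross-checked with a few lines of arithmetic. The sketch below (names are illustrative, not the generator's API) recomputes them for N×N×N subdomains of n×n×n hexahedral elements with 3 displacement DOFs per node: shared interface nodes are counted once in the undecomposed mesh and duplicated per subdomain in the decomposed one.

```cpp
#include <cassert>
#include <cstdint>

// Bookkeeping for the 3D cube benchmark mesh: N*N*N subdomains,
// each with n*n*n elements, 3 DOFs per node (linear elasticity).
struct CubeStats {
    std::int64_t subdomains;
    std::int64_t elements;
    std::int64_t coordinates;        // global nodes, interfaces counted once
    std::int64_t dofs_undecomposed;  // 3 * coordinates
    std::int64_t dofs_decomposed;    // interface nodes duplicated per subdomain
};

CubeStats cube_stats(std::int64_t N, std::int64_t n) {
    CubeStats s{};
    s.subdomains        = N * N * N;
    s.elements          = s.subdomains * n * n * n;
    std::int64_t g      = N * n + 1;                   // global nodes per edge
    s.coordinates       = g * g * g;
    s.dofs_undecomposed = 3 * s.coordinates;
    std::int64_t l      = n + 1;                       // local nodes per edge
    s.dofs_decomposed   = s.subdomains * 3 * l * l * l;
    return s;
}
```

The same formula gives the 648-DOF subdomain used in the communication benchmarks: a single subdomain with 5 elements per edge has 3·(5+1)³ = 648 DOFs.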
[Figure: "Engine 2.5 millions – FETI – iteration time" – log-scale plot of iteration time [s] (0.001–1) vs. number of subdomains (32–1024) for REGCG-LUMPED-GGTINV, PIPECG-NOPREC-GGTINV, REGCG-LUMPED-NOGGTINV and PIPECG-NOPREC-NOGGTINV, with a linear strong-scaling reference (based on 32); the two labeled series read 0.127351, 0.05973, 0.025983, 0.00902, 0.004424, 0.002782 s and 0.112963, 0.048849, 0.01942, 0.005946, 0.002672, 0.001466 s.]

[Figure: "FETI iteration time (measured on Anselm)" – iteration time [s] vs. number of subdomains (up to 2000) for FETI CG with 2 reductions, CG with 1 reduction, and CG with 1 reduction + GGTINV.]

• 2.5 million DOFs solved with the FETI solver
• decomposed into 32, 64, 128, 256, 512 and 1024 subdomains
• measured on the Anselm supercomputer – 128 nodes with 16 CPU cores (2×8) each, 8 subdomains per node

Benchmark Systems
SurfSara.nl – Cartesius – up to ~8,600 cores
  - non-blocking island of 360 nodes, each with:
    • 2× 12-core Intel Xeon E5-2695 v2 (Ivy Bridge), 2.4 GHz and 64 GB RAM
  - InfiniBand FDR network – 56 Gbit/s inter-node bandwidth
IT4Innovations – www.it4i.cz – Anselm – up to ~3,300 cores
  - non-blocking cluster of 209 nodes, each with:
    • 2× 8-core Intel Sandy Bridge E5-2665, 2.4 GHz and 64 GB RAM
  - InfiniBand QDR network – 40 Gbit/s inter-node bandwidth

Intercluster Processing

Cluster size: ~120,000 DOFs; optimal decomposition into 27 subdomains
domain size  subdomains  size [DOFs]  avg       sum       prec       iter
 4           343         128625       0.123941  5.949182  75.503219  48
 5           216         139968       0.072914  3.718598  37.732576  51
 6           125         128625       0.040325  2.217873  16.381135  55
 8            64         139968       0.030154  1.718762  10.04497   57
11            27         139968       0.034425  1.893399   7.490033  55
17             8         139968       0.064372  3.347334   7.876874  52

Cluster size: ~330,000 DOFs; optimal decomposition into 27 subdomains
domain size  subdomains  size [DOFs]  avg       sum        prec        iter
 5           512         331776       0.309891  17.973677  338.035839  58
 6           343         352947       0.198848  12.328554  180.041968  62
 7           216         331776       0.118965   7.613787   90.402162  64
11            64         331776       0.080177   5.13133    27.456335  64
15            27         331776       0.095755   6.03256    19.761467  63
23             8         331776       0.156146   8.431877   29.455252  54

Note: domain size N is the number of elements per edge of the subdomain cube – the real subdomain size equals 3·(N+1)³ DOFs. avg – average iteration time [s]; sum – total time of all iterations = solution time [s]; prec – cluster preprocessing time [s]; iter – number of iterations.
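The superlinear behavior claimed for the FETI strong-scaling measurements above can be quantified from the labeled per-iteration times (0.127351 s at 32 subdomains down to 0.002782 s at 1024, assuming the six values per curve correspond to the 32–1024 subdomain counts in order). A short sketch of the standard strong-scaling metrics:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strong-scaling metrics relative to the smallest run:
//   speedup    S(P) = t(P0) / t(P)
//   efficiency E(P) = S(P) / (P / P0)
// E(P) > 1 indicates superlinear scaling; for FETI this typically comes
// from the sparse direct solves on shrinking subdomains getting cheaper
// faster than the problem splits.
double speedup(const std::vector<double>& t, std::size_t i) {
    return t[0] / t[i];
}

double efficiency(const std::vector<double>& t,
                  const std::vector<int>& procs, std::size_t i) {
    return speedup(t, i) / (static_cast<double>(procs[i]) /
                            static_cast<double>(procs[0]));
}
```

With the quoted series, going from 32 to 1024 subdomains (a 32× split) yields a speedup above 32, i.e. parallel efficiency above 100%.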
FETI and HFETI per iteration time

[Figure: "HFETI vs FETI – one iteration time (measured on Anselm)" – iteration time [s] vs. number of subdomains (up to 2000): FETI – CG with 1 reduction; HFETI – CG with 1 reduction.]

[Figure: "HFETI vs FETI – one iteration time (measured on Cartesius)" – iteration time [s] vs. number of subdomains (up to 5000): FETI; HFETI.]

[Figure: "HFETI vs FETI – one iteration time (measured on Anselm)" – iteration time [s] vs. number of subdomains (up to 2000): FETI – CG with 1 reduction – GGTINV; HFETI – CG with 1 reduction; HFETI – CG with 2 reductions; HFETI – CG with 1 reduction – GGTINV.]

Weak scaling – domain size and cluster size are fixed; the number of domains grows from 1 to 2000
- domain size 5³ elements, i.e. 3·(5+1)³ = 648 DOFs; cluster size 16 (1 cluster per node; 1 subdomain per core)
- efficient communication algorithms help both FETI and HFETI
- CG with 1 reduction + coarse problem solved using the distributed inverse matrix (GGTINV)

Weak scaling – domain size and cluster size are fixed; the number of domains grows from 1 to 2000
- both FETI and HFETI use CG with 1 reduction; the coarse problem is solved using the distributed inverse matrix of GGT
- Cartesius: domain size 5³ elements, i.e. 3·(5+1)³ = 648 DOFs; cluster size 8 domains (3 clusters per node; 1 subdomain per core)
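The GGTINV variant referred to above replaces the usual gather–solve–scatter treatment of the coarse problem with precomputed row blocks of inv(GGᵀ) held on each rank, so that one AllGather of the right-hand side suffices. The toy below simulates this serially (row blocks stand in for MPI ranks, concatenation stands in for AllGather, and the 2×2 matrix with its hardcoded inverse is an illustrative stand-in for GGᵀ, not the solver's data):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Serial stand-in for the GGTINV coarse solve: every "rank" owns a block of
// rows of inv(G*G^T). Per iteration the local pieces of the right-hand side
// d are AllGather-ed (here: the caller passes the full d), and each rank
// multiplies its row block by the full vector -- one collective instead of
// Gather + root solve + Scatter.
Vec coarse_solve_rows(const Mat& inv_rows, const Vec& d_full) {
    Vec x(inv_rows.size(), 0.0);
    for (std::size_t i = 0; i < inv_rows.size(); ++i)
        for (std::size_t j = 0; j < d_full.size(); ++j)
            x[i] += inv_rows[i][j] * d_full[j];
    return x;
}
```

Each rank obtains exactly its local slice of the coarse solution, so no Scatter is needed afterwards; the cost is storing the dense inverse rows, which is the preprocessing time reported as "GGT time" in the large-scale benchmarks.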
- Anselm: domain size 5³ elements, i.e. 3·(5+1)³ = 648 DOFs; cluster size 16 domains (1 cluster per node; 1 subdomain per core)

Total FETI – Large scale benchmarks

[Figure: "Preprocessing, number of iterations and solution time" – solution and preprocessing time [s] (0–160) vs. problem size [DOFs] (up to 1,500,000,000); series: GGT time and solution time for subdomain sizes of 164,616 / 177,957 / 192,000 / 206,763 DOFs; the number of iterations grows from 24 on the smallest problem to 51 on the largest.]

[Figure: "Coarse problem preprocessing time, K factorization time, solution time and number of iterations – subdomain size 192,000 DOFs (39³ elements)" – the K factorization time stays nearly constant (53.19–58.92 s), the GGT (coarse problem) time grows from 0.08 s to 144.36 s with problem size, the solution time grows from 15.54 s to 49.75 s, and the number of iterations grows from 24 to 50.]

• ready to use MIC accelerators