MLD2P4: a package of parallel algebraic multilevel Preconditioners


Pasqua D’Ambra, Institute for High-Performance Computing and Networking (ICAR-CNR), Naples Branch, Italy

Bologna, March 2008

Joint work with Daniela di Serafino (Second University of Naples) and Salvatore Filippone (University of Rome “Tor Vergata”)


Overview

- Motivations
    - Background
    - Objectives
- MLD2P4: Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
    - Algorithms and computational kernels
    - Software architecture
- Some Results & Applications



Background

Large-scale applications have to solve linear systems

    $Ax = b$

The linear system matrix is:

- real or complex, and square
- large and sparse
- distributed among parallel processors
- variable over the course of a simulation: matrix dimensions and entries, conditioning, sparsity pattern and coupling among variables all change



Background (cont’d)

What is the best method/preconditioner?
- No absolute winner: experimentation is needed
- Reliable preconditioners require access to the complete matrix
- Parallel implementation is not trivial

Interfacing with application software is required
- Custom-made interfaces to parallel legacy codes
- Different interfaces for different preconditioners/solvers



Objectives

Designing and implementing a suite of algebraic preconditioners based on Linear Algebra kernels for parallel sparse matrix computations

- Flexibility: different preconditioners through a single API
- Portability & efficiency: standard base software for serial kernels and data communications
- Simplicity of usage: modern (OO) Fortran 95 features and auxiliary routines for smooth legacy code integration



MLD2P4

Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS (Parallel Sparse Basic Linear Algebra Subprograms)

Available preconditioners:
- Diagonal
- Block-Jacobi
- Additive Schwarz with arbitrary overlap
- Algebraic multi-level Schwarz

mld_prec_build(A,M,…)
- A: distributed sparse matrix (input)
- M: distributed sparse preconditioner (output)

mld_prec_apply(M,x,y,…)
- M: distributed sparse preconditioner (input)
- x, y: distributed vectors (input/output)
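The build/apply split above can be illustrated with a minimal Python sketch. This is not the MLD2P4 interface: `prec_build` and `prec_apply` are hypothetical names chosen only to mirror the two routines, the matrix is a small dense list of lists rather than a distributed sparse matrix, and the only preconditioner implemented is the diagonal one.

```python
# Hypothetical Python analogue of the build/apply split; NOT the MLD2P4 API.

def prec_build(A):
    """Build phase: extract from A what the preconditioner needs (here 1/a_ii)."""
    inv_diag = []
    for i, row in enumerate(A):
        if row[i] == 0:
            raise ValueError(f"zero diagonal entry in row {i}")
        inv_diag.append(1.0 / row[i])
    return {"kind": "diagonal", "inv_diag": inv_diag}

def prec_apply(M, x):
    """Apply phase: return y = M^{-1} x using only the data stored in M."""
    return [d * xi for d, xi in zip(M["inv_diag"], x)]

A = [[4.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 5.0]]
M = prec_build(A)                      # done once per matrix
y = prec_apply(M, [8.0, 4.0, 10.0])    # done at every solver iteration
print(y)
```

The point of the split is that the expensive setup happens once, while the apply step, called at every iteration of the Krylov solver, touches only the stored preconditioner data.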



PSBLAS (Filippone et al., http://www.ce.uniroma2.it/psblas/)

Basic Linear Algebra Operations with Sparse Matrices on MIMD Architectures

Software layers, from top to bottom:
- Applications: Iterative Sparse Linear Solvers (CG, BiCG, CGS, BiCGSTAB, RGMRES, …)
- Kernels: Parallel Sparse Matrix Operations (matrix-matrix products, matrix-vector products, …); Parallel Sparse Matrix Management (allocate, build, update, …)
- Base software: MPI; BLACS (Basic Linear Algebra Communication Subprograms); SBLAS (Duff et al.)

Implementation languages shown in the diagram: F95, F77



MLD2P4 Design: Algorithms

Algebraic multi-level Schwarz preconditioners based on smoothed aggregation

- good trade-off between parallelism and convergence
- optimal scalability for symmetric positive-definite matrices
- algebraic framework allows general-purpose application



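(1-lev) Schwarz: basic ingredients

$A$: $n \times n$ matrix with symmetric sparsity pattern

Adjacency graph of A:

    $G = (W, E), \quad W = \{1, 2, 3, \ldots, n\}, \quad E = \{(i,j) : a_{ij} \neq 0\}$

0-overlap partition of W:

    $\{W_i^0\}, \ i = 1, \ldots, m$, partition of $W$

δ-overlap partition of W:

    $W_i^\delta \supseteq W_i^{\delta-1}, \quad W_i^\delta = W_i^{\delta-1} \cup \{\, j \in W : \exists\, k \in W_i^{\delta-1}, \ (k,j) \in E \,\}$

[Figure: matrix with rows and columns numbered 1 to 9, showing the subsets $W_1^0$, $W_2^0$, $W_1^1$ and $W_2^1$]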
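The recursive construction of a δ-overlap partition (each subset grows by one layer of adjacent vertices per unit of overlap) can be sketched in a few lines of Python. The 9-vertex path graph and the starting partition below are assumptions made for illustration, not the example drawn on the slide, and `expand_overlap` is a hypothetical helper name.

```python
def expand_overlap(adj, subset, delta):
    """Return W_i^delta: grow `subset` by `delta` layers of adjacent vertices."""
    current = set(subset)
    for _ in range(delta):
        # Every vertex adjacent to some vertex already in the subset joins it.
        neighbours = {j for k in current for j in adj[k]}
        current |= neighbours
    return current

# Assumed example: path graph on 9 vertices (the graph of a tridiagonal matrix),
# where vertex v is adjacent to v-1 and v+1.
n = 9
adj = {v: [u for u in (v - 1, v + 1) if 1 <= u <= n] for v in range(1, n + 1)}

W0 = [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]               # assumed 0-overlap partition
W1 = [expand_overlap(adj, Wi, 1) for Wi in W0]     # 1-overlap partition
print([sorted(Wi) for Wi in W1])
# prints [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

With overlap 1, vertices 4 and 5 belong to both subsets, which is exactly the data each process must exchange with its neighbour in a parallel setting.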



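AS: basic ingredients (cont’d)

Restriction/prolongation operators:

    $R_i^\delta = [e_{j_1}, e_{j_2}, \ldots, e_{j_{n_i^\delta}}]^T, \ j_k \in W_i^\delta, \qquad P_i^\delta = (R_i^\delta)^T$

where $e_j$ is the $j$-th unit vector and $n_i^\delta$ is the number of vertices in $W_i^\delta$.

Restriction of A:

    $A_i^\delta = R_i^\delta A (R_i^\delta)^T$

[Figure: matrix with rows and columns numbered 1 to 9, showing the overlapping diagonal blocks $A_1^1$ and $A_2^1$]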
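As a small check of the restriction idea, the sketch below builds a restriction operator as selected rows of the identity and verifies that the triple product R A R^T equals the submatrix of A obtained by picking the same rows and columns. The 4×4 matrix and the index set are assumptions for illustration, and dense lists of lists stand in for distributed sparse storage.

```python
def restriction(indices, n):
    """Rows of the n x n identity selected by `indices` (1-based vertex numbers)."""
    return [[1.0 if col == j - 1 else 0.0 for col in range(n)] for j in indices]

def matmul(X, Y):
    """Dense matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

n = 4
A = [[ 2.0, -1.0,  0.0,  0.0],
     [-1.0,  2.0, -1.0,  0.0],
     [ 0.0, -1.0,  2.0, -1.0],
     [ 0.0,  0.0, -1.0,  2.0]]
idx = [2, 3, 4]                                   # assumed subset of vertices

R = restriction(idx, n)
A_sub = matmul(matmul(R, A), transpose(R))        # R A R^T
A_pick = [[A[i - 1][j - 1] for j in idx] for i in idx]   # direct row/column selection
print(A_sub == A_pick)
```

In practice the restricted matrix is formed by index selection, as in `A_pick`; the explicit product is shown only to connect with the operator notation.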



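Coarse level correction: basic ingredients

Algebraic coarsening: uncoupled aggregation

    $\tilde{P} : W_C \to W$, where $\tilde{P}_{ij} = 1$ if vertex $i$ belongs to aggregate $j$, and $0$ otherwise

Smoothed prolongation/restriction operators:

    $P_C = (I - \omega D^{-1} A)\,\tilde{P}, \qquad R_C = P_C^T$

Coarse-level matrix:

    $A_C = R_C A R_C^T = P_C^T A P_C$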
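The coarse-level ingredients can be tried on a toy problem. The sketch below assumes a 6×6 one-dimensional Laplacian, three aggregates of two consecutive vertices each, and a damping value of 2/3 for the smoothing step; all three are illustrative choices, not values taken from the slides, and the helper names are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def tentative_prolongator(aggregate_of, n_coarse):
    """Entry (i, j) is 1 if fine vertex i belongs to aggregate j, else 0."""
    return [[1.0 if aggregate_of[i] == j else 0.0 for j in range(n_coarse)]
            for i in range(len(aggregate_of))]

def smoothed_prolongator(A, P_tilde, omega):
    """(I - omega * D^{-1} A) P_tilde: one damped-Jacobi smoothing step."""
    n = len(A)
    S = [[(1.0 if i == j else 0.0) - omega * A[i][j] / A[i][i] for j in range(n)]
         for i in range(n)]
    return matmul(S, P_tilde)

n = 6
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
aggregate_of = [0, 0, 1, 1, 2, 2]           # assumed aggregates of two vertices each

P_tilde = tentative_prolongator(aggregate_of, 3)
P_C = smoothed_prolongator(A, P_tilde, omega=2.0 / 3.0)   # omega is an assumed value
R_C = transpose(P_C)
A_C = matmul(matmul(R_C, A), P_C)           # Galerkin coarse matrix R_C A R_C^T

# Without smoothing, the coarse matrix simply sums the entries of A over aggregates.
A_C_unsmoothed = matmul(matmul(transpose(P_tilde), A), P_tilde)
print(A_C_unsmoothed)
```

Smoothing widens the support of each prolongator column beyond its own aggregate, which is what makes the coarse matrix denser than the unsmoothed one.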



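Multilevel-Schwarz preconditioners & computational kernels

Example: 2-lev hybrid-post

    $M_{2LH}^{-1} = M_{1L}^{-1} + (I - M_{1L}^{-1} A)\, M_C^{-1}$

build
- build $A_i^\delta$ (for $M_{1L}$)
- build $A_C$:
    - aggregate: $\tilde{P} : W_C \to W$
    - mat x mat: $P_C = (I - \omega D^{-1} A)\,\tilde{P}, \quad R_C = P_C^T$
    - mat x mat: $A_C = R_C A R_C^T$

apply (to a vector $v$)
- restrict: $z = R_C v$
- solve: $A_C y = z$
- prol: $w_C = R_C^T y$
- mat x vec: $x = v - A w_C$
- prec AS: $w_{2L} = w_C + M_{1L}^{-1} x$

P. D’Ambra, D. di Serafino, S. Filippone, On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners, Applied Numerical Mathematics, 57, 2007.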
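The apply phase of a two-level hybrid preconditioner with post-smoothing (coarse correction first, then the one-level preconditioner on the updated residual) can be sketched as follows. This is a serial toy version under stated assumptions: dense lists of lists, an unsmoothed aggregation-based restriction, a dense Gaussian elimination as the coarse solver, and point Jacobi standing in for the additive Schwarz one-level preconditioner. All function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (stand-in coarse solver)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def apply_hybrid_post(A, R_C, A_C, apply_one_level, v):
    z = matvec(R_C, v)                                   # restrict
    y = solve_dense(A_C, z)                              # coarse solve
    w_C = matvec(transpose(R_C), y)                      # prolong
    x = [vi - ti for vi, ti in zip(v, matvec(A, w_C))]   # residual after coarse step
    s = apply_one_level(x)                               # one-level preconditioner
    return [wc + si for wc, si in zip(w_C, s)]

n = 4
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
R_C = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]      # assumed: two aggregates
A_C = matmul(matmul(R_C, A), transpose(R_C))
jacobi = lambda x: [xi / A[i][i] for i, xi in enumerate(x)]   # stand-in for AS

w = apply_hybrid_post(A, R_C, A_C, jacobi, [1.0, 1.0, 1.0, 1.0])
print(w)
```

A useful sanity check on the structure: if the one-level step is replaced by an exact solve with A, the composite operator returns the exact solution, because the second term then removes whatever error the coarse correction left behind.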



MLD2P4 Design: Software Architecture

Software layers, from top to bottom:
- Applications: Parallel Preconditioners (BJA, ASM, RAS, ASH, ml-additive, ml-hybridpre, ml-hybridpost, ml-symmhybrid)
- Kernels: Preconditioner Build (prolongation, restriction, coarse matrix, local sparse ILU and LU); Preconditioner Application (distributed & serial coarse matrix solvers)
- Base software: PSBLAS 2.0 (extended version of PSBLAS 1.0)



Performance Results & Comparisons

Different test matrices from various sources

thm matrices: thermal diffusion in solids

kivap matrices: automotive engine design

shipsec matrices: from UF sparse matrix collection

Experiments carried out on different Linux clusters

64 Intel Itanium dual-processor nodes connected by Quadrics QSNetII Elan 4

32 AMD Opteron dual-processor nodes connected by Myrinet

8 AMD Opteron dual-processor nodes connected by InfiniBand

8 Intel Itanium dual-processor nodes connected by Myrinet

16 Intel Pentium IV nodes connected by Fast Ethernet

Comparison with up-to-date related work

Trilinos-ML

A. Buttari, P. D’Ambra, D. di Serafino, S. Filippone, 2LEV-D2P4: a package of high-performance preconditioners for scientific and engineering applications, Applicable Algebra in Engineering, Communication and Computing, Vol. 18, 2007.



Experimental Setting

MLD2P4: right-preconditioned BiCGSTAB
- 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
- 2-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI) or with UMFPACK (2LDU) on diagonal blocks
- 3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (3LDI) or with UMFPACK (3LDU) on diagonal blocks

Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-6}$ or maxit reached
Unit right-hand side and null starting guess
Row-block distribution of matrices: # submatrices = # procs
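Reading the stopping criterion as a relative residual test combined with an iteration cap, a minimal Python sketch looks like this. The Euclidean norm, the default `maxit` value and the function names are assumptions; the slides do not state which norm or which iteration limit was used.

```python
import math

def norm2(r):
    """Euclidean norm of a vector stored as a list (assumed norm)."""
    return math.sqrt(sum(ri * ri for ri in r))

def should_stop(r_k, r_0, k, tol=1e-6, maxit=1000):
    """True when ||r_k|| / ||r_0|| <= tol or the iteration count reaches maxit.

    maxit=1000 is a placeholder, not a value from the experiments.
    """
    return norm2(r_k) <= tol * norm2(r_0) or k >= maxit

print(should_stop([1e-7, 0.0], [1.0, 0.0], k=3))   # residual reduced enough
print(should_stop([0.5, 0.0], [1.0, 0.0], k=3))    # keep iterating
```

Writing the test as `norm2(r_k) <= tol * norm2(r_0)` avoids a division and stays well defined when the initial residual is zero.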



thm matrices: number of iterations

Matrix thm1: n = 600000, nnz = 2996800
Cluster: 64 Intel Itanium dual-processor nodes connected by QSNetII

OV=0

    np    RAS   2LDI   2LDU   3LDI   3LDU
     1    613    190      -     70      -
     2    705    184      -     72      -
     4    761    206      -     74      -
     8    688    202     44     67     28
    16    748    211     61     70     36
    32    766    186     81     69     51
    64    809    196    113     86     68

OV=1

    np    RAS   2LDI   2LDU   3LDI   3LDU
     1    613    190      -     70      -
     2    923    183      -     76      -
     4    684    178      -     63      -
     8    937    191     34     62     27
    16    688    172     57     68     33
    32    714    181     74     65     45
    64    720    180    107     77     62



thm matrices: execution times and speed-ups (OV=1; best execution times: 3LDU)




Application test case

- large eddy simulation of incompressible turbulent flows in a bi-periodical channel
- main computational kernel: nonsymmetric and singular linear systems arising from elliptic PDE with Neumann b.c.

A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, S. Filippone, Application of Parallel Algebraic Multilevel Domain Decomposition Preconditioners in Large-Eddy Simulations of Wall-bounded Turbulent Flows: First Experiments, RT-ICAR-NA-2007-02, July 2007.



Experimental Setting

MLD2P4: right-preconditioned RGMRES(30)
- 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
- 2-lev/3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarse matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI/3LDI) on diagonal blocks

Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-7}$ or maxit reached
General row-block distribution

Pressure linear system: n = 201600, nnz = 1398600

Reynolds number: 180
Computational grid: 140x32x45, non-uniform in the y direction; time step $10^{-4}$



LES of incompressible wall-bounded flow


SOR on 1 proc. = 9 sec.
SOR on 1 proc. = 8580 sec.



Work in progress

- Package available on the web very soon
- More sophisticated aggregation algorithms
- Integration of preconditioners and solvers in large-scale applications
