DELFT UNIVERSITY OF TECHNOLOGY
REPORT 00-02
COMPARING GMRES AND P-GMRES IN DOMAIN
DECOMPOSITION WITH APPROXIMATE SUBDOMAIN
SOLUTION
K. DEKKER
ISSN 1389-6520
Reports of the Department of Applied Mathematical Analysis
Delft 2000
Copyright 2000 by Department of Applied Mathematical Analysis, Delft, The Netherlands.
No part of the Journal may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from Department of Applied Mathematical Analysis, Delft University of Technology, The Netherlands.
Comparing GMRES and P-GMRES in Domain Decomposition with approximate subdomain solution
K. Dekker
Abstract
Solution of large linear systems encountered in computational fluid dynamics often leads to some form of domain decomposition, especially when it is desired to use parallel machines. In this paper P-GMRES, a partitioned modification of GMRES, is applied to such problems. It is shown that P-GMRES converges faster than GMRES if the subdomains are solved exactly, and that P-GMRES requires less communication in the computation of the inner products. Also, approximate solutions for the subdomains by an inner preconditioned GMRES iteration are considered, in combination with restarted versions of GMRES and P-GMRES. We investigate the effect of the tolerance in the subdomain problems on the convergence of the outer iteration, and on the total amount of work in numerical experiments. It turns out that rather crude tolerances are allowed, and that a good strategy is to vary the tolerance for the subdomains in the course of the outer iteration.
Keywords: Domain decomposition; Parallel GMRES methods; Approximate subdomain solution; Orthogonalisation methods
1 Introduction.
Domain decomposition arises naturally in computational fluid dynamics applications on structured grids: complicated geometries are broken down into (topologically) rectangular regions and discretised, see e.g. [23, 30], and by solving subproblems on these regions one arrives at the solution on the global domain. This approach provides easy exploitation of parallel computing resources, and additionally offers a solution to memory limitation problems.
Frank and Vuik [11] address the parallel implementation of a domain decomposition method for the DeFT Navier-Stokes solver described in [23]. Their paper is a continuation of work by Brakkee, summarised in [5] and presented in [6], where a serial implementation of a nonoverlapping, one-level additive Schwarz method with approximate subdomain solution gave promising results. In [11] the GCR method in combination with inaccurate subdomain solution is tested for a Poisson problem on a square domain, which is representative for the system to be solved for the pressure correction method used in DeFT. Our present goal is to evaluate the partitioned method, P-GMRES, described in [9], in combination with accurate and inaccurate subdomain solution on such a problem.
Faculty of Mathematics and Informatics, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands; E-mail: [email protected]
Theoretical results on approximate solution of subproblems for Schur complement domain decomposition methods are given by Borgers [4], Haase et al. [12, 13, 14, 20]. Tan [25] and Brakkee [5] give theoretical results for nonoverlapping Schwarz iterations with approximate subdomain solvers.
In this paper we demonstrate for a nonoverlapping, additive Schwarz method that P-GMRES converges faster than GMRES, if the subdomains are solved exactly, and that restarted versions of both methods can be applied if the subproblems are solved with moderate accuracy. We show that the computational work of the methods is about the same per iteration, and that P-GMRES requires less communication, so it can be more efficiently parallelised. On the other hand, the applicability of P-GMRES is restricted to the class of problems for which a red-black colouring of subdomains exists such that no adjacent subdomains have the same colour.
In Section 2 we briefly review the relevant mathematics and present the GMRES and P-GMRES algorithms. Much effort has focussed on the efficient parallelisation of Krylov subspace methods. The computation and communication of the many inner products often limit the attainable speedup on many processors. Therefore, authors have tried to overlap inner product communication with computation [8], or to increase the number of inner products that can be computed with a single communication [2, 18, 8]. Frank and Vuik [11] suggest increasing the amount of computation to reduce the number of communications in GCR. We show that P-GMRES requires less communication than GCR or GMRES, and that the communications can be easily overlapped with computation.
In Section 2.4 we address the solution of the subdomain problems. We give evidence that the restarted versions of GMRES and P-GMRES are applicable in combination with an approximate subdomain solution. Moreover, we show that a variable precision in the solution of these subproblems is likely to be most efficient, which is confirmed by experiments in Section 4.
A performance model for the orthogonalisations in (P-)GMRES, derived from [11], is presented in Section 3. Theoretical speedup ratios for P-GMRES on a workstation cluster and a Cray T3E, based on this model, are given. We also develop a model for the costs of various subdomain solvers which is used for the evaluation of the results in Section 4.
In Section 4 we compare the convergence rate of GMRES and P-GMRES with various multi-block preconditioners on the test problem from [11]. We also report results for P-GMRES with approximate subdomain solvers, in combination with several strategies for the tolerance in the inner iterations. Our results suggest that a variable precision, decreasing during the course of the outer iterations, is most efficient. According to the performance model, however, exact solution will be cheapest for the relatively small subdomains (less than 22500 grid points) considered in the tests.
2 Mathematical background
2.1 Nonoverlapping domain decomposition
We consider an (elliptic) partial differential equation discretised using a finite difference or finite volume method on a computational domain $\Omega$. By a computational domain we mean a set of unknown values to be approximated, together with their locations in space. We suppose that $\Omega$ is the union of $M$ nonoverlapping subdomains $\Omega_m$, $m = 1, \ldots, M$.
Discretisation of the PDE results in a sparse linear system
$$Ax = b \qquad (1)$$
with $x, b \in \mathbb{R}^N$. The structure of the matrix $A$ is determined by the stencil of the discretisation. Even if there is no overlap between the subdomains, there is an inter-subdomain coupling due to the stencil. Grouping together into blocks those unknowns which share a common subdomain will permute the system (1) to produce a block system:
$$\begin{pmatrix} A_{11} & \cdots & A_{1M} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MM} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_M \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_M \end{pmatrix} \qquad (2)$$
Here, the diagonal blocks $A_{mm}$ express the coupling among unknowns defined on a common subdomain $\Omega_m$, whereas the off-diagonal blocks $A_{mn}$, $m \neq n$, represent coupling across subdomain boundaries. The only nonzero off-diagonal blocks are those corresponding to neighbouring subdomains. Moreover, we will assume in the sequel that a red-black colouring of the subdomains exists, such that adjacent subdomains have a different colour, i.e. there holds
$$A_{mn} = 0, \quad m \neq n, \quad \text{if } \Omega_m \text{ and } \Omega_n \text{ have the same colour.}$$
This restriction, which often can be satisfied in practice, is essential for the solver P-GMRES.
The additive Schwarz iteration introduces the block Jacobi preconditioner
$$K = \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{MM} \end{pmatrix}$$
which, together with the residual $b - Ax^{(i)}$, defines a system whose solution provides an approximation of the error $x - x^{(i)}$. Because this system decouples into $M$ independent systems, it can be solved efficiently on parallel computers.
This form of domain decomposition has also been considered by Frank and Vuik [11]. For a thorough discussion of domain decomposition see the book [24] and the review article [7], and the extensive bibliography therein. Roughly speaking, the convergence rate suffers proportionally to the number of subdomains in each direction. The convergence rate may be made independent of the grid size by using constant overlaps or by application of coarse subspace correction. We will accelerate the convergence by a Krylov subspace method, as in [11].
Algorithm: GMRES
1. Start: Let initial guess $x_0$ be given.
   $q_0 = b - Ax_0$. Solve $Kw = q_0$, $\beta = \|w\|_2$, $v_1 = w/\beta$.
2. Arnoldi process:
   for $k = 1, 2, \ldots$ until convergence:
   • $q_k = Av_k$
   • Solve $Kw = q_k$
   • $[v_{k+1}, h_k]$ = orthonorm($w$, $v_j$, $j \le k$)
   end
3. Form approximate solution:
   • Define $V_k = [v_1, v_2, \ldots, v_k]$
   • Define $H_k = [h_1, h_2, \ldots, h_k]$
   • Compute $x_k = x_0 + V_k y_k$, where $y_k = \arg\min_y \|\beta e_1 - H_k y\|_2$ and $e_1 = [1, 0, \ldots, 0]^T$.
4. Restart: Compute $Ax_k$ and check termination criterion. If satisfied stop, else set $x_0 = x_k$ and go to 1.
Figure 1: The GMRES algorithm
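The steps of Fig. 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this report: the function name `gmres_left_prec`, the dense test matrix and the use of `numpy.linalg.lstsq` for the small least-squares problem (instead of Givens rotations) are assumptions, and a single cycle without restart or breakdown handling is performed.

```python
import numpy as np

def gmres_left_prec(A, solve_K, b, x0, m):
    """One cycle of m steps of GMRES with preconditioner solves K w = q (Fig. 1)."""
    w = solve_K(b - A @ x0)                # step 1: q_0 = b - A x_0, solve K w = q_0
    beta = np.linalg.norm(w)
    V = [w / beta]                         # v_1
    H = np.zeros((m + 1, m))
    for k in range(m):                     # step 2: Arnoldi process
        w = solve_K(A @ V[k])              # q_k = A v_k, solve K w = q_k
        for j in range(k + 1):             # orthonorm() by modified Gram-Schmidt
            H[j, k] = w @ V[j]
            w = w - H[j, k] * V[j]
        H[k + 1, k] = np.linalg.norm(w)
        V.append(w / H[k + 1, k])          # no breakdown handling in this sketch
    rhs = np.zeros(m + 1)
    rhs[0] = beta                          # beta * e_1
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)   # step 3: min || beta e_1 - H y ||_2
    x = x0 + np.column_stack(V[:m]) @ y
    return x, np.linalg.norm(rhs - H @ y)  # estimate of the preconditioned residual

# Hypothetical test problem: 1D Laplacian with three subdomains of four unknowns.
N, nb = 12, 4
A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
K = np.zeros((N, N))
for s in range(0, N, nb):
    K[s:s + nb, s:s + nb] = A[s:s + nb, s:s + nb]   # block Jacobi preconditioner
b = np.arange(1.0, N + 1)
x, est = gmres_left_prec(A, lambda q: np.linalg.solve(K, q), b, np.zeros(N), m=3)
```

The returned estimate is the norm of $\beta e_1 - H_m y_m$, which GMRES uses in place of the true preconditioned residual; step 4 of Fig. 1 would wrap this function in a restart loop.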
2.2 GMRES acceleration
In practice (2) is solved iteratively, using $K$ as a preconditioner for a Krylov subspace method, such as the conjugate gradient method for symmetric problems or the GMRES method [22] for nonsymmetric problems. In contrast with [11], where GCR is used, we consider GMRES, shown in Fig. 1, in order to facilitate the comparison with P-GMRES [9].
The function orthonorm() takes the input vector $w$, orthonormalises $w$ with respect to the $v_i$, $i \le k$, and returns the modified vector $v_{k+1}$. In serial computations, the modified Gram-Schmidt method (MGS), shown in Fig. 2, is usually employed for the orthonorm() function.
In parallel computations MGS has serious disadvantages, because the inner products require global communications, and therefore do not scale. Moreover, these inner products must be computed successively, and their number increases by one in every iteration step. Various alternatives have been proposed for MGS, e.g. orthogonalising a number of vectors simultaneously [8, 17, 18], Householder transformations [29, 11] or two-fold application of the classical Gram-Schmidt method (CGS) [16, 11]. However, these alternatives have some drawbacks, varying from loss of stability with respect to rounding errors [3] to an increase of the number of floating point operations. Also, most alternatives are not applicable when a preconditioner is used that varies in each iteration. Frank and Vuik [11] conclude from a comparison with MGS and Householder that re-orthogonalised CGS (see Fig. 3) is the most attractive method.

Algorithm: Modified Gram-Schmidt
$[v_{k+1}, h_k]$ = orthonorm($w$, $v_j$, $j \le k$)
for $j = 1, 2, \ldots, k$
   $h_{k,j} = \langle w, v_j \rangle$
   $w = w - h_{k,j} v_j$
end
$h_{k,k+1} = \|w\|_2$, $v_{k+1} = w / h_{k,k+1}$
Figure 2: The modified Gram-Schmidt algorithm
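The difference between the two orthogonalisation variants is easiest to see side by side. The sketch below is illustrative only: the names `orthonorm_mgs` and `orthonorm_cgs2` are assumptions, and the final norm is computed directly rather than through the Pythagorean formula of Fig. 3. In MGS each inner product depends on the previous vector update, whereas in the re-orthogonalised classical variant all inner products of a pass are available from one matrix-vector product, i.e. from one global communication.

```python
import numpy as np

def orthonorm_mgs(w, V):
    """Modified Gram-Schmidt (Fig. 2): k successive inner products."""
    w = np.array(w, dtype=float)           # work on a copy
    h = np.zeros(len(V) + 1)
    for j, v in enumerate(V):
        h[j] = w @ v                       # needs the updated w: cannot be batched
        w -= h[j] * v
    h[-1] = np.linalg.norm(w)
    return w / h[-1], h

def orthonorm_cgs2(w, V):
    """Classical Gram-Schmidt applied twice (re-orthogonalised CGS)."""
    Vm = np.column_stack(V)
    h1 = Vm.T @ w                          # first pass: all inner products at once
    w = w - Vm @ h1
    h2 = Vm.T @ w                          # second pass removes the remaining error
    w = w - Vm @ h2
    h = np.append(h1 + h2, np.linalg.norm(w))
    return w / h[-1], h
```

In exact arithmetic both functions return the same vector $v_{k+1}$ and the same coefficients; they differ in rounding behaviour and in the number of synchronisation points.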
2.3 P-GMRES acceleration
Dekker [9] has proposed a modification of GMRES, called Partitioned GMRES (P-GMRES), which is applicable to (2) if the subdomains can be partitioned into two groups, such that for each pair of different subdomains from the same group the corresponding blocks $A_{mn}$ and $A_{nm}$ are zero.
In [9] the trivial case $M = 2$ is considered, but this situation also occurs if a red-black colouring of the subdomains is possible where only adjacent subdomains lead to nonzero blocks. In the sequel we assume that such a colouring exists. Let the restriction to the red subdomains be denoted by $R_r$, and to the black ones by $R_b$. Then the following equations, which are essential for P-GMRES, hold:
$$R_r + R_b = I, \qquad R_r K^{-1} A R_r x = R_r x, \qquad R_b K^{-1} A R_b x = R_b x.$$
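These identities can be checked numerically on a small example. The following sketch assumes a hypothetical 1D Laplacian split into four subdomains coloured alternately red and black, so that equally coloured subdomains are never coupled; all sizes and names are illustrative.

```python
import numpy as np

n_sub, size = 4, 3                          # 4 subdomains of 3 unknowns (assumed sizes)
N = n_sub * size
A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)   # only neighbouring blocks couple
K = np.zeros((N, N))
for m in range(n_sub):
    s = slice(m * size, (m + 1) * size)
    K[s, s] = A[s, s]                       # block Jacobi preconditioner
colour = np.repeat(np.arange(n_sub) % 2, size)          # 0 = red, 1 = black
Rr = np.diag((colour == 0).astype(float))   # restriction to red subdomains
Rb = np.diag((colour == 1).astype(float))   # restriction to black subdomains
B = np.linalg.solve(K, A)                   # K^{-1} A
x = np.random.default_rng(0).standard_normal(N)

ok_sum = np.allclose(Rr + Rb, np.eye(N))
ok_red = np.allclose(Rr @ B @ Rr @ x, Rr @ x)
ok_black = np.allclose(Rb @ B @ Rb @ x, Rb @ x)
print(ok_sum, ok_red, ok_black)
```

The last two identities hold because $A$ applied to a red vector differs from $K$ applied to it only on black unknowns, and $K^{-1}$ does not mix colours.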
Algorithm: Classical Gram-Schmidt
$[v_{k+1}, h_k]$ = orthonorm($w$, $v_j$, $j \le k$)
for $j = 1, 2, \ldots, k$
   $h_{k,j} = \langle w, v_j \rangle$
end
$h_{k,k+1} = \sqrt{\|w\|_2^2 - \sum_{j=1}^{k} (h_{k,j})^2}$
$v_{k+1} = \left( w - \sum_{j=1}^{k} h_{k,j} v_j \right) / h_{k,k+1}$
Figure 3: The classical Gram-Schmidt algorithm

P-GMRES, described in Fig. 4, offers several advantages when compared to GMRES, whereas the computational costs in an iteration are about the same. First, P-GMRES yields an optimal approximation in the affine space
$$x_0 + \mathrm{Span}\{V_k^r, V_k^b\}$$
which has a higher dimension than
$$x_0 + \mathrm{Span}\{V_k\}$$
in which subspace GMRES searches for an approximation. Also, there holds
$$\mathrm{Span}\{V_k\} \subset \mathrm{Span}\{V_k^r, V_k^b\}$$
so P-GMRES converges at least as fast as GMRES.
Secondly, we have two independent orthogonalisation processes, one for the variables from the red subdomains and one for the black subdomains. This property allows several possibilities for parallelisation. On a 2-processor machine we could perform two modified Gram-Schmidt algorithms with only a communication step after termination of MGS. When many processors are available, we could divide the processors into two groups, each taking care of one orthogonalisation process, thereby slightly reducing the amount of communication, as not all processors are involved in the computation of an inner product. In this case it might even be more attractive to overlap computation and communication [8] in a natural way: computation of a vector update and a local inner product for the red subdomains can be done simultaneously with the accumulation of the inner products for the black subdomains, and vice versa, when each processor is assigned to both a red and a black subdomain. Then, the costs of the global communication would be negligible, provided the subdomains are not too small.
Finally we note that the minimisation problem in P-GMRES can be cheaply solved using Givens rotations, just as in GMRES [22], because $H_k^r$ and $H_k^b$ are both Hessenberg matrices. Due to the larger size $2k \times 2k$ of the coefficient matrix each iteration step requires $8k$ Givens rotations, compared to only $k$ rotations in a GMRES iteration. This additional amount of operations, however, is usually very small compared to the costs of the inner products.
Further, the computational work in the Arnoldi process of P-GMRES is just the same as in GMRES. The vectors $v_k^r$ and $v_k^b$ are both restricted to part of the subdomains, so the two matrix multiplications with $A$ amount to just one multiplication of $A$ with a full vector. In the preconditioning we need to solve
$$Kw^b = q_k^b, \qquad Kw^r = q_k^r$$
only for the black, viz. red subdomains.
Algorithm: P-GMRES
1. Start: Let initial guess $x_0$ be given.
   $q_0^r = R_r(b - Ax_0)$, solve $Kw^r = q_0^r$, $\beta_r = \|w^r\|_2$, $v_1^r = w^r/\beta_r$,
   $q_0^b = R_b(b - Ax_0)$, solve $Kw^b = q_0^b$, $\beta_b = \|w^b\|_2$, $v_1^b = w^b/\beta_b$.
2. Arnoldi process:
   for $k = 1, 2, \ldots$ until convergence:
   • $q_k^b = R_b A v_k^r$, $q_k^r = R_r A v_k^b$
   • Solve $Kw^b = q_k^b$, $Kw^r = q_k^r$
   • $[v_{k+1}^r, h_k^r]$ = orthonorm($w^r$, $v_j^r$, $j \le k$), $[v_{k+1}^b, h_k^b]$ = orthonorm($w^b$, $v_j^b$, $j \le k$)
   end
3. Form approximate solution:
   • Define $V_k^r = [v_1^r, v_2^r, \ldots, v_k^r]$, $V_k^b = [v_1^b, v_2^b, \ldots, v_k^b]$
   • Define $H_k^r = [h_1^r, h_2^r, \ldots, h_k^r]$, $H_k^b = [h_1^b, h_2^b, \ldots, h_k^b]$
   • Compute $x_k = x_0 + V_k^r y_k^r + V_k^b y_k^b$, where $(y_k^r, y_k^b)$ minimises
     $$\left\| \begin{pmatrix} \beta_r e_1 - I_k y^r - H_k^r y^b \\ \beta_b e_1 - I_k y^b - H_k^b y^r \end{pmatrix} \right\|_2 ,$$
     $e_1 = [1, 0, \ldots, 0]^T \in \mathbb{R}^{k+1}$ and $I_k$ is the identity matrix, extended with a row of zeros.
4. Restart: Compute $Ax_k$ and check termination criterion. If satisfied stop, else set $x_0 = x_k$ and go to 1.
Figure 4: The P-GMRES algorithm
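A compact NumPy sketch of one P-GMRES cycle, following the steps of Fig. 4, is given below. It is an illustration under stated assumptions, not the code used for the experiments: the name `pgmres`, the boolean mask `red`, the dense 1D test problem and the use of `numpy.linalg.lstsq` for the coupled least-squares problem (instead of Givens rotations) are all choices made here, and breakdown is not handled.

```python
import numpy as np

def pgmres(A, solve_K, b, x0, red, m):
    """One cycle of m steps of P-GMRES (Fig. 4); `red` marks the red unknowns."""
    Rr = lambda u: np.where(red, u, 0.0)       # restriction to red subdomains
    Rb = lambda u: np.where(red, 0.0, u)       # restriction to black subdomains
    r0 = b - A @ x0
    wr, wb = solve_K(Rr(r0)), solve_K(Rb(r0))
    beta_r, beta_b = np.linalg.norm(wr), np.linalg.norm(wb)
    Vr, Vb = [wr / beta_r], [wb / beta_b]
    Hr, Hb = np.zeros((m + 1, m)), np.zeros((m + 1, m))
    for k in range(m):
        wb = solve_K(Rb(A @ Vr[k]))            # q_k^b = R_b A v_k^r
        wr = solve_K(Rr(A @ Vb[k]))            # q_k^r = R_r A v_k^b
        for j in range(k + 1):                 # two independent MGS sweeps
            Hr[j, k] = wr @ Vr[j]
            wr = wr - Hr[j, k] * Vr[j]
            Hb[j, k] = wb @ Vb[j]
            wb = wb - Hb[j, k] * Vb[j]
        Hr[k + 1, k], Hb[k + 1, k] = np.linalg.norm(wr), np.linalg.norm(wb)
        Vr.append(wr / Hr[k + 1, k])
        Vb.append(wb / Hb[k + 1, k])
    Ik = np.eye(m + 1, m)                      # identity extended with a row of zeros
    G = np.block([[Ik, Hr], [Hb, Ik]])         # acts on the stacked vector (y^r, y^b)
    rhs = np.zeros(2 * (m + 1))
    rhs[0], rhs[m + 1] = beta_r, beta_b
    y, *_ = np.linalg.lstsq(G, rhs, rcond=None)
    x = x0 + np.column_stack(Vr[:m]) @ y[:m] + np.column_stack(Vb[:m]) @ y[m:]
    return x, np.linalg.norm(rhs - G @ y)

# Hypothetical test problem: 1D Laplacian, four subdomains coloured red/black/red/black.
N, nb = 20, 5
A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
K = np.zeros((N, N))
for s in range(0, N, nb):
    K[s:s + nb, s:s + nb] = A[s:s + nb, s:s + nb]
red = (np.arange(N) // nb) % 2 == 0
b = np.arange(1.0, N + 1)
x, est = pgmres(A, lambda q: np.linalg.solve(K, q), b, np.zeros(N), red, m=2)
```

Because the red and black basis vectors have disjoint support, the stacked basis is orthonormal and the norm of the small residual vector equals the norm of the preconditioned residual.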
2.4 Subdomain solution
Solution for $w$ from the preconditioning equation $Kw = q$ in the GMRES algorithm requires the solution of $M$ independent subdomain systems $A_{mm} w_m = q_m$, $m = 1, \ldots, M$, and similarly we have to solve $M$ subdomain systems to obtain the solutions $w^r$ and $w^b$ from the equations $Kw^r = q^r$, $Kw^b = q^b$ in the P-GMRES algorithm. In the formulations of the algorithms we have assumed that these subdomains are solved exactly. However, in practical problems this might be too expensive, especially when the subsystems are large, so that solution by an iterative method might be a better alternative. It is generally thought that the solution obtained should be very accurate [6, 4], otherwise GMRES acceleration may no longer be applied. In case of inaccurate subdomain solutions the methods GCR [26] and FGMRES [21], which also allow variable preconditioners, are more appropriate. However, these methods require more storage and, in case of GCR, additional computations. Moreover, P-GMRES is not suitable for very inaccurate subdomain solutions. Therefore, we require that the subdomain problems are solved with a moderate accuracy, and then restarted versions of GMRES are also applicable, as the following analysis shows. For results obtained with GCR and an inaccurate subdomain solution we refer to [11].
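Suppose that, instead of solving $Kw_k = q_k$ exactly, we obtain an approximate solution $\tilde{w}_k$, satisfying
$$K\tilde{w}_k = q_k + K\varepsilon_k, \qquad q_k = Av_k,$$
where
$$\varepsilon_k = \tilde{w}_k - w_k.$$
Define
$$E_k = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k]$$
and let $\tilde{w}_k$, $k = 0, 1, \ldots$, be used to generate the Krylov subspace. Then, there holds
$$K^{-1}AV_k = V_{k+1}H_k - E_k.$$
Consequently, after $m$ outer iterations, using the inexact subdomain solutions, we obtain for the preconditioned residual
$$\begin{aligned}
\|K^{-1}(Ax_m - b)\| &= \|K^{-1}(Ax_0 - b) + K^{-1}AV_m y_m\| \\
&= \|V_{m+1}H_m y_m - E_m y_m - K^{-1}q_0\| \\
&= \|V_{m+1}H_m y_m - E_m y_m - \beta V_{m+1}e_1 + \varepsilon_0\| \\
&\le \|V_{m+1}H_m y_m - \beta V_{m+1}e_1\| + \|\varepsilon_0 - E_m y_m\| \\
&\le \|H_m y_m - \beta e_1\| + \|\varepsilon_0\| + \sum_{k=1}^{m} \|\varepsilon_k\| \, |y_{m,k}|. \qquad (3)
\end{aligned}$$
As $\|y_m\| = \|x_m - x_0\|$ will be bounded, we observe that the inexact subdomain solutions do not affect the preconditioned residual dramatically, as long as the errors $\varepsilon_k$, $k = 0, 1, \ldots, m$, are small compared to the estimation of the preconditioned residual $\|H_m y_m - \beta e_1\|$, which is calculated in GMRES. In the experiments we will investigate the influence of the accuracy in the solution of the subdomain problems on the convergence of the outer iterations of GMRES and P-GMRES.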
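The perturbed Arnoldi relation on which this bound rests can be verified numerically. The sketch below is a hypothetical experiment: the 1D test matrix, the error level `eps` and the random perturbations standing in for the errors of an inner iteration are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, nb, m, eps = 12, 4, 3, 1e-3              # sizes and error level are assumed
A = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
K = np.zeros((N, N))
for s in range(0, N, nb):
    K[s:s + nb, s:s + nb] = A[s:s + nb, s:s + nb]
Kinv = np.linalg.inv(K)

w = Kinv @ rng.standard_normal(N)
V = [w / np.linalg.norm(w)]
H = np.zeros((m + 1, m))
E = np.zeros((N, m))
for k in range(m):
    E[:, k] = eps * rng.standard_normal(N)  # error of the inexact subdomain solve
    w = Kinv @ (A @ V[k]) + E[:, k]         # perturbed solution used in the Arnoldi process
    for j in range(k + 1):
        H[j, k] = w @ V[j]
        w = w - H[j, k] * V[j]
    H[k + 1, k] = np.linalg.norm(w)
    V.append(w / H[k + 1, k])

lhs = Kinv @ A @ np.column_stack(V[:m])     # K^{-1} A V_m
rhs = np.column_stack(V) @ H - E            # V_{m+1} H_m - E_m
print(np.max(np.abs(lhs - rhs)))            # difference at rounding level
```

The relation holds exactly for any size of the perturbations; what the perturbations change is how well $\|H_m y_m - \beta e_1\|$ represents the true preconditioned residual, as quantified by (3).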
Method            Inner products      Vector updates
GMRES (MGS)       (k+1) SIP(1)        k axpy
GMRES (CGS2)      2 SIP(k+1)          2k axpy
P-GMRES (MGS)     (k+1) HSIP(2)       k haxpy(2)
Table 1: Number of operations in the k-th iteration of GMRES and P-GMRES
A second question which arises addresses the value of the tolerance to which the subdomain iterations should converge. There seems to be some theoretical evidence [25] that a fixed relative tolerance is optimal. However, in such a model it is usually assumed that convergence is linear and independent of previous iterations. This assumption is obviously not valid for (P-)GMRES, where the convergence is superlinear and the iterations are intimately related. Moreover, even if a fixed tolerance would be optimal, this value is not known beforehand. If the outer iteration converges slowly, too strict a tolerance will not be necessary in a restarted method. Also, when the outer iteration has almost converged, the accuracy can be relaxed, as the components $y_{m,k}$ will get small for increasing $k$, according to (extending $y_{k-1} \in \mathbb{R}^{k-1}$ with additional zeros)
$$|y_{m,k}| = |y_{m,k} - y_{k-1,k}| \le \|y_m - y_{k-1}\| \le \|x_m - x_{k-1}\|.$$

3 Performance models
To give insight into the costs of the orthogonalisation procedure in GMRES and P-GMRES, and of the subdomain solution we consider simple performance models. In the first subsection we discuss the orthogonalisation. The emphasis will then be on the communication costs for parallel platforms, as inner products distributed over the processors are to be calculated. In the second subsection we consider the sequential costs of subdomain solvers, as it is assumed that each processor takes care of the solution for one (or more) subdomains.
3.1 Orthogonalisation
The cost of orthogonalisation is mainly determined by the inner products and the vector updates which occur both in the MGS and the CGS algorithm. Here, we distinguish between inner products that can be computed simultaneously (i.e., with a single communication), those that cannot and inner products for half the vector length, as occur in P-GMRES. Following [11], we denote $k$ simultaneous inner products by SIP($k$). Two inner products and vector updates of half the vector length are denoted by HSIP(2) and haxpy(2). Then, the modified and re-orthogonalised Gram-Schmidt can be broken down into components as given in Table 1.
Let the time for communication of a message of $n$ floating point numbers be given by
$$t_{comm} = t_0 + \beta n$$
where $t_0$ is the communication startup time, and $\beta$ is the time per floating point number, depending on the bandwidth. Let the time for $n$ floating point operations be given by
$$t_{comp} = \phi n.$$
Such a model is used in e.g. [15, 8, 11].
Let $p$ denote the number of processes, and define a function $f(p)$ which gives the maximum number of non-simultaneous sends necessary for a broadcast to $p - 1$ processes. The function $f(p)$ is machine dependent and also depends on the distribution of the processes on the machine. Common values are $f(p) = \log_2 p$ for a hypercube structure, and $f(p) = p - 1$ for an Ethernet broadcast. Assuming that each processor is responsible for an $n \times n$ subdomain with $n^2$ unknowns, we arrive at the times for the basic operations as in Table 2.

Operation    Communication          Computation    Definition
send(k)      t0 + βk                               Send a message of length k
flop(n)                             nφ             n floating point operations
B(p, k)      f(p)(t0 + βk)                         Broadcast k elements
SIP(k)       2 f(p)(t0 + βk)        2k n²φ         k simultaneous inner products
HSIP(2)      2 f(p/2)(t0 + β)       2 n²φ          2 simultaneous half IPs
axpy                                2 n²φ          vector update
haxpy(2)                            2 n²φ          2 half vector updates
Table 2: Communication and computation times

Operation    Communication          Computation    Definition
HLIP                                n²φ            Half local inner product
ALIP         2 f(p)(t0 + β)                        Accumulation of HLIPs
haxpy                               n²φ            Half vector update
Table 3: Communication and computation times

Based on the communication model outlined in Tables 1 and 2 the orthogonalisation time required for $s$ iterations of GMRES (without restart) using modified Gram-Schmidt (MGS), re-orthogonalised classical Gram-Schmidt (CGS2) and P-GMRES using modified Gram-Schmidt (P-MGS), is given by
$$t_{MGS} = s(s+3)\left[2n^2\phi + f(p)(t_0 + \beta)\right] - 2sn^2\phi \qquad (4)$$
$$t_{CGS2} = s(s+3)\left[4n^2\phi + 2f(p)\beta\right] + 4(s+1)f(p)t_0 - 4sn^2\phi \qquad (5)$$
$$t_{P\text{-}MGS} = s(s+3)\left[2n^2\phi + f(p/2)(t_0 + \beta)\right] - 2sn^2\phi \qquad (6)$$
Comparing these expressions, we see that the orthogonalisation in P-GMRES is slightly cheaper than MGS in GMRES, unless 2 processors are used, in which case P-GMRES requires no communication. On many processors, with high communication startup time, CGS2 seems to be favourable. However, if it is possible to overlap computation and communication, an implementation of P-GMRES might be considered where each processor is assigned both a red and a black subdomain, each containing $n^2/2$ unknowns. Then, the MGS algorithm for the red and black subdomains can be performed as in Fig. 5 (cf. [8]).
Algorithm: Modified implementation of MGS in P-GMRES
for $i = 1, 2, \ldots, k+1$
   • Accumulate local inner products $i - 1$ for black subdomains, if $i > 1$
     Update vectors on red subdomains, if $i > 1$
     Compute local inner products $i$ for red subdomains
   • Accumulate local inner products $i$ for red subdomains
     Update vectors on black subdomains, if $i > 1$
     Compute local inner products $i$ for black subdomains
end
• Accumulate local inner products $k + 1$ for black subdomains
Figure 5: Modified implementation of MGS
Using the costs of each basic operation as given in Table 3, we arrive at the orthogonalisation time for this modification (P-MGSM)
$$t_{P\text{-}MGSM} = s(s+1)\max\left(2n^2\phi,\, 2f(p)(t_0+\beta)\right) + s\left[n^2\phi + 2f(p)(t_0+\beta) + \max\left(n^2\phi,\, 2f(p)(t_0+\beta)\right)\right] \qquad (7)$$
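The models (4)-(7) are simple enough to evaluate directly. The sketch below transcribes them into Python; the function names are illustrative, the machine parameters in the usage lines are the HP-cluster values quoted in this section, and evaluating $f$ at the real number $p/2$ for odd $p$ is an assumption of this sketch rather than something the model prescribes.

```python
import math

def t_mgs(s, n, p, t0, beta, phi, f):
    """Eq. (4): modified Gram-Schmidt in GMRES."""
    return s * (s + 3) * (2 * n * n * phi + f(p) * (t0 + beta)) - 2 * s * n * n * phi

def t_cgs2(s, n, p, t0, beta, phi, f):
    """Eq. (5): re-orthogonalised classical Gram-Schmidt in GMRES."""
    return (s * (s + 3) * (4 * n * n * phi + 2 * f(p) * beta)
            + 4 * (s + 1) * f(p) * t0 - 4 * s * n * n * phi)

def t_pmgs(s, n, p, t0, beta, phi, f):
    """Eq. (6): modified Gram-Schmidt in P-GMRES (f evaluated at p/2)."""
    return s * (s + 3) * (2 * n * n * phi + f(p / 2) * (t0 + beta)) - 2 * s * n * n * phi

def t_pmgsm(s, n, p, t0, beta, phi, f):
    """Eq. (7): overlapped implementation of Fig. 5."""
    comm = 2 * f(p) * (t0 + beta)
    work = n * n * phi
    return s * (s + 1) * max(2 * work, comm) + s * (work + comm + max(work, comm))

def speedups(s, n, p, t0, beta, phi, f):
    """Predicted speedups with respect to MGS in GMRES."""
    base = t_mgs(s, n, p, t0, beta, phi, f)
    return {name: base / t(s, n, p, t0, beta, phi, f)
            for name, t in (("CGS2", t_cgs2), ("PMGS", t_pmgs), ("PMGSM", t_pmgsm))}

ethernet = lambda p: p - 1                       # f(p) for an Ethernet broadcast
hypercube = lambda p: math.ceil(math.log2(p))    # f(p) for a hypercube-like network
HP_CLUSTER = dict(t0=4.7e-4, beta=7.5e-6, phi=4.9e-8)
S = speedups(s=60, n=20, p=9, f=ethernet, **HP_CLUSTER)
```

Scanning `n` for fixed `s` and `p` gives curves of predicted speedup against subdomain size; keeping the network function `f` as an argument makes it easy to compare the two machine models.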
From [11] we derive representative values for the parameters $t_0$, $\beta$ and $\phi$:
$$t_0 \approx 4.7 \times 10^{-4}, \qquad \beta \approx 7.5 \times 10^{-6}, \qquad \phi \approx 4.9 \times 10^{-8}$$
for a cluster of HP workstations, and for Cray T3E using MPI communications
$$t_0 \approx 2.4 \times 10^{-5}, \qquad \beta \approx 5.4 \times 10^{-8}, \qquad \phi \approx 5.8 \times 10^{-8}.$$
Assuming the models (4)-(7), and $f(p) = p - 1$ for the HP-cluster, $f(p) = \lceil \log_2 p \rceil$ for the Cray T3E, we compute the quantities
$$S_{CGS2} = t_{MGS}/t_{CGS2}, \qquad S_{PMGS} = t_{MGS}/t_{P\text{-}MGS}, \qquad S_{PMGSM} = t_{MGS}/t_{P\text{-}MGSM}$$
denoting the predicted speedup with respect to modified Gram-Schmidt in GMRES.
In Fig. 6 the results are plotted as function of $n$ for $s = 60$ and $p = 4, 9$ (HP-cluster), resp. $p = 4, 25$ (Cray T3E). We observe that the model predicts that CGS2 is advantageous for small subdomain sizes, when communication is relatively expensive, as on the HP-cluster. This observation has been made before in [11], where CGS2 and MGS are compared for the GCR method and the predictions verified in actual experiments. More importantly, the model predicts that modified Gram-Schmidt is more efficient in P-GMRES than in GMRES, due to the reduced communication costs, both on the HP-cluster and on the Cray T3E. The speedup varies between 2.8 for small subdomains (400 unknowns per subdomain) with expensive communication (HP-cluster with $p = 9$) and 1.0 for large subdomains (10000 unknowns) on the Cray T3E. Moreover, an additional speedup can be obtained by the modified version of P-GMRES, when communication and computation are balanced, i.e.
$$2n^2\phi \approx 2f(p)(t_0 + \beta).$$

[Figure 6: two panels, "HP cluster" (curves for p = 4 and p = 9) and "Cray T3E" (curves for p = 4 and p = 25); horizontal axis: subdomain grid size n (20 to 100); vertical axis: predicted speedup (labelled f_CGS2, f_PMGS, f_PMGSM in the plot); solid line CGS2, dashed line PMGS, dotted line PMGSM.]
Figure 6: Predicted speedup for P-GMRES compared to GMRES

3.2 Costs of subdomain solution
We assume that each subdomain is solved on one processor, so no communication between processors is necessary and we can restrict ourselves to the sequential costs, determined by the number of floating point operations.
First, we consider the exact solution of the subdomain problem using a forward-backward substitution, after an LU-decomposition has been made. Let $n^2$ be the number of unknowns in the subdomain, and $n$ half the bandwidth of the coefficient matrix. Then the average cost of a subdomain solution is approximately
$$C_{exact} = \left(4n^3 + 2n^4/s\right)\phi \qquad (8)$$
where $s$ denotes the number of outer iterations in (P-)GMRES necessary for convergence of the global problem, and $\phi$ the costs of one floating point operation.
Secondly, suppose that the subproblem is solved inexactly by $m$ iterations of GMRES using an ILUD-preconditioner [19] or an RILUD-preconditioner [1]. For the two-dimensional problem considered in this paper the number of nonzeros in a row of the coefficient matrix will be 5, and hence 3 for the incomplete L and U factors. Then, the costs for the matrix-vector multiplication and the preconditioner are
$$C_{matvec} = 9n^2\phi, \qquad C_{prec} = 10n^2\phi.$$
Note that these costs might be slightly reduced using Eisenstat's trick [10], but that is not essential here. In Table 4 we list the costs and the number of basic operations in $m$ inner iterations.

Operation    Number            Costs      Definition
matvec       m                 9 n²φ      Matrix-vector multiply
prec         m + 1             10 n²φ     ILU-preconditioning
IP           (m+1)(m+2)/2      2 n²φ      Inner product
axpy         m(m+1)/2          2 n²φ      Vector update
scal         m + 1             n²φ        Vector scaling
sol          1                 2m n²φ     Solution update
Table 4: Number of operations and costs for m inner iterations

Neglecting the costs of the construction of the preconditioner (approx. $10n^2\phi$) and the solution of the Hessenberg system (about $m^2\phi$), we obtain for the costs of the inexact solution
$$C_{gmres}(m) = \left(2m^2 + 26m + 13\right)n^2\phi \qquad (9)$$
Comparison of (8) and (9) shows that for this type of problem solution by GMRES is only competitive if $2\sqrt{n}$ iterations are sufficient to obtain a reasonably accurate subdomain solution. Hence, the number of unknowns in the subdomain, $n^2$, should be quite large, e.g. $n > 100$, but in that case the inner iteration will probably converge slowly and 20 iterations might not be sufficient. Alternatively, the subdomain problem might be solved very inaccurately (cf. [11]), leading to a small value of $m$. However, the number of outer iterations will increase then, and the exact solution might be the most efficient one after all. We will pursue this issue further in the numerical experiments. Here, we conclude that, for the 2D problem considered, exact subdomain solution will probably be computationally the most efficient. Only when memory limitations exclude the use of a full LU-decomposition, subdomain solution by an iterative method will be of value.
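The trade-off between (8) and (9) can be explored with a few lines of Python. The helper names are illustrative, and the break-even count is simply the largest $m$ for which the GMRES cost does not exceed the amortised cost of the exact solve; note that it depends on the assumed number of outer iterations $s$.

```python
def cost_exact(n, s, phi=1.0):
    """Eq. (8): banded LU solve, factorisation amortised over s outer iterations."""
    return (4 * n**3 + 2 * n**4 / s) * phi

def cost_gmres(m, n, phi=1.0):
    """Eq. (9): m inner iterations of preconditioned GMRES."""
    return (2 * m * m + 26 * m + 13) * n * n * phi

def table4_total(m, n, phi=1.0):
    """Sum of the rows of Table 4; should reproduce eq. (9)."""
    unit = n * n * phi
    return (m * 9 * unit + (m + 1) * 10 * unit
            + (m + 1) * (m + 2) // 2 * 2 * unit + m * (m + 1) // 2 * 2 * unit
            + (m + 1) * unit + 2 * m * unit)

def break_even_iterations(n, s):
    """Largest m for which m inner iterations are not more expensive than (8)."""
    m = 0
    while cost_gmres(m + 1, n) <= cost_exact(n, s):
        m += 1
    return m

m_max = break_even_iterations(n=100, s=50)   # hypothetical subdomain size and outer count
```

Varying `s` in `break_even_iterations` shows how strongly the conclusion depends on the number of outer iterations over which the factorisation is amortised.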
4 Numerical experiments
In this section we give numerical results which provide insight into the convergence behaviour of P-GMRES in comparison with GMRES. We also assess the performance of several subdomain solvers and evaluate the merits of various stopping criteria in the inner iteration and restart strategies in the outer iteration. All experiments were performed on an HP-755 workstation, and we will report on the number of floating point operations as a measure of the efficiency of the sequential subdomain solvers. The parallel performance of the algorithms is predicted in Section 3.1, and will be the subject of evaluation in a forthcoming study.

             β = 0                β = 1
M       GMRES   P-GMRES      GMRES   P-GMRES
2        12       12          20       16
3        19       14          28       23
4        24       24          36       29
5        25       21          39       34
6        29       29          45       40
Table 5: Number of iterations for the 60 × 60 grid with M × M subdomains
As a test example, we consider a Poisson problem, discretised with a finite difference method on a square domain. Such a problem is similar to the pressure correction matrix, which is solved in each time step of an incompressible Navier-Stokes simulation to enforce the divergence-free constraint [27], apart from some asymmetry in the pressure correction matrix. As the test example is meant to model such a pressure correction matrix, we do not exploit the symmetry in the experiments. The domain is decomposed into $M \times M$ subdomains, each containing $n \times n$ grid points. With $h = \Delta x = \Delta y = 1/(Mn + 1)$ the discretisation is
$$4u_{i,j} - u_{i,j-1} - u_{i,j+1} - u_{i-1,j} - u_{i+1,j} = h^2 f_{i,j} \qquad (10)$$
The right-hand side function is $f_{i,j} = f(ih, jh)$, where
f x y x 1 x 3 2βx 1 3y 324 , y 1 y 3 2βy 1 3x 324 (11)
Homogeneous Dirichlet boundary conditions $u = 0$ are defined on $\partial\Omega$. Note that this example is almost identical to the one in [11], apart from a different discretisation at the boundaries. Moreover, we introduced an additional term in the right-hand side, as the formulation in [11], with $\beta = 0$, suffers from an 8-fold symmetry which usually speeds up the convergence considerably.
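For small grids the matrix of the discretisation (10), together with the subdomain partition and a checkerboard colouring, can be assembled as follows. This is a sketch under assumptions: the function name `poisson_blocks` is hypothetical, a dense matrix is used for brevity, the unknowns are kept in natural (row-wise) ordering rather than permuted into the block form (2), and the right-hand side is not assembled.

```python
import numpy as np

def poisson_blocks(M, n):
    """5-point matrix of (10) on an (M n) x (M n) grid with M x M subdomains.

    Returns the matrix A, the subdomain number of every unknown and a boolean
    mask marking the unknowns that belong to red subdomains.
    """
    N = M * n
    T = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    A = np.kron(np.eye(N), T) + np.kron(T, np.eye(N))   # diagonal 4, neighbours -1
    i, j = np.divmod(np.arange(N * N), N)                # 0-based grid indices
    sub = (i // n) * M + (j // n)                        # subdomain of each unknown
    red = ((i // n) + (j // n)) % 2 == 0                 # checkerboard colouring
    return A, sub, red

A, sub, red = poisson_blocks(M=3, n=2)
# Block Jacobi preconditioner: keep only couplings inside a subdomain.
K = np.where(sub[:, None] == sub[None, :], A, 0.0)
```

With this colouring every coupling that crosses a subdomain boundary connects a red and a black subdomain, which is the property P-GMRES requires.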
4.1 Convergence behaviour with exact subdomain solution
In this section we compare the speed of convergence of GMRES and P-GMRES. For all tests a fixed restart value of $s = 30$ was used, and the solution was computed after the initial (preconditioned) residual had been reduced by a factor of $10^6$. In all cases the subdomain problems were solved exactly.
In the first experiment we compare results for a fixed problem size on the 60 × 60 grid with $M \times M$ subdomains, $M = 2, \ldots, 6$, and two different values for $\beta$ in (11). In Table 5 we list the required number of iterations. It is interesting to observe that the symmetry in the problem ($\beta = 0$) influences the convergence substantially. Therefore, we think that it is more appropriate to consider the case with $\beta = 1$ as a model for practical problems. Also note that P-GMRES for $M$ even does not profit as much from the symmetry as GMRES does. This can be explained by the fact that the solutions on the black and red subdomains are identical in the symmetric case; consequently, P-GMRES and GMRES converge in exactly the same way. In all other cases P-GMRES requires about 20% fewer iterations than GMRES.
We considered grids of dimension 60, 120, 180, 240 and 300 in a second experiment. The number of required iterations for various $M \times M$ subdomain partitionings is plotted in Fig. 7. In all cases we took the problem with $\beta = 1$. One sees that P-GMRES converges faster than GMRES in all cases but one ($n = 300$, $M = 4$). A marked improvement is obtained for an odd number of subdomains and a relatively small grid. Here the difference between solutions on the red and the black subdomains is most pronounced, and P-GMRES profits from the discrimination between these two. For $M$ even, there will be 2 black and 2 red subdomains in the corners of $\Omega$, and the solutions on these two sets of subdomains will not behave very differently.

[Figure 7: horizontal axis: domain size (50 to 300); vertical axis: number of iterations (0 to 140); solid lines GMRES, dashed lines P-GMRES; markers o M=2, x M=3, . M=4, + M=5, * M=6.]
Figure 7: # Iterations for various subdomains
In a last experiment we chose different starting values for the problem with $\beta = 1$ on the 60 × 60 grid, viz.
$$x^{(0)}_{i,j} = \begin{cases} \dfrac{\max(i,j) - n - 1}{n}, & i \le n,\ j \le n, \\[1ex] u_{i,j}, & \text{otherwise,} \end{cases} \qquad (12)$$
where $u_{i,j}$ denotes the exact solution of (10). Consequently, we start the iteration with the exact solution in all subdomains but the first one. Table 6 shows that GMRES does not profit from such a good starting value at all, but P-GMRES has converged after 1 iteration, as could be expected. This suggests that P-GMRES might be very efficient for problems whose solutions behave differently on various parts of the domain, such as layered problems [28].

          x^(0) = 0            x^(0) from (12)
M       GMRES   P-GMRES      GMRES   P-GMRES
2        20       16          25        1
3        28       23          35        1
4        36       29          48        1
5        39       34          51        1
6        45       40          48        1
Table 6: Number of iterations for the 60 × 60 grid with M × M subdomains
4.2 Evaluation of approximate subdomain solvers
In this section we compare the performance of a number of approximate subdomain solvers to get an impression of which solvers might be an alternative for exact subdomain solution in (P-)GMRES. Again, we used a fixed restart value of $s = 30$ in the outer iteration, unless noted otherwise, and a relative tolerance of $10^{-6}$.
The subdomain approximations will be denoted as follows:
• EX = exact subdomain solution,
• GMRk = (restarted) GMRES with a tolerance of $10^{-k}$, preconditioned with RILUD,
• GMRkF = (restarted) GMRES with a tolerance of $10^{-k}F$, preconditioned with RILUD.
The last method needs some explanation. The factor $F$ is given by
$$F = \min\left\{10^{k-1},\ \max\left\{10^{k-7}\,\frac{\mathrm{res}(0)}{\mathrm{res}(j-1)},\ \frac{\mathrm{res}(j_0)}{\mathrm{res}(j-1)}\right\}\right\} \qquad (13)$$
where $\mathrm{res}(0)$, $\mathrm{res}(j_0)$, $\mathrm{res}(j-1)$ denote the norms of the initial residual, the residual at the beginning of the last restart and the residual after the previous outer iteration, resp. This means that we aim to reduce the residual in a cycle of restarted (P-)GMRES by a factor of $10^{-k}$, unless the residual at the last restart is already small. Moreover, the tolerance for the subdomain approximation is bounded above by 0.1. Because the inaccuracies from the subdomain approximations will persist during a cycle of outer iterations, we do not continue the outer iteration until $s = 30$, but restart as soon as the condition
res j / 2 [ j j0 102 kres j0 is satisfied.
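The flexible inner tolerance of the GMRkF solvers follows directly from (13). The sketch below covers only the factor $F$ and the resulting tolerance $10^{-k}F$; the function and argument names are illustrative, and the restart criterion of the outer iteration is not modelled.

```python
def flexible_factor(k, res0, res_j0, res_prev):
    """Factor F of eq. (13).

    res0     -- norm of the initial residual
    res_j0   -- residual norm at the beginning of the last restart
    res_prev -- residual norm after the previous outer iteration
    """
    return min(10.0 ** (k - 1),
               max(10.0 ** (k - 7) * res0 / res_prev, res_j0 / res_prev))

def inner_tolerance(k, res0, res_j0, res_prev):
    """Relative tolerance 10^{-k} F for the subdomain iteration."""
    return 10.0 ** (-k) * flexible_factor(k, res0, res_j0, res_prev)

# At the start of a cycle the tolerance is 10^{-k}; it is relaxed as the outer
# residual drops within the cycle, and it never exceeds 0.1.
tol_start = inner_tolerance(4, res0=1.0, res_j0=1.0, res_prev=1.0)
tol_later = inner_tolerance(4, res0=1.0, res_j0=1.0, res_prev=1e-2)
tol_cap = inner_tolerance(4, res0=1.0, res_j0=1.0, res_prev=1e-5)
```

The three sample calls illustrate the behaviour described in the text: a strict tolerance directly after a restart, progressively cruder solves as the cycle proceeds, and the upper bound of 0.1.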
Solver     2 × 2        3 × 3        4 × 4        5 × 5        6 × 6
EX         16           23           29           34           40
GMR8       16 (19.9)    23 (17.4)    29 (15.3)    34 (13.8)    40 (12.8)
GMR6       17 (16.5)    26 (13.6)    31 (11.8)    36 (10.8)    40 (9.9)
GMR4       25 (11.8)    42 (9.7)     47 (8.1)     55 (7.3)     59 (6.9)
GMR7F      17 (11.1)    25 (8.9)     31 (8.0)     35 (7.4)     39 (7.0)
GMR4F      23 (7.1)     38 (5.9)     40 (5.4)     53 (4.7)     49 (4.5)
Table 7: Outer and inner iterations for various subdomain solvers, 60 × 60 grid
4.2.1 Convergence on a small problem
We applied P-GMRES for the fixed problem on the 60 × 60 grid with $M \times M$ subdomains. The required number of outer iterations and the averaged number of inner iterations (in parentheses) are listed in Table 7. For the sake of comparison we also include the iteration count for the exact subdomain solution.
It is seen that the outer iteration does not suffer much from an approximate subdomain solution, if a sufficiently small tolerance is imposed for the subdomain problems (GMR8, GMR6). However, the number of inner iterations and thus the amount of work is rather high in these cases, even for the small subdomains considered here. Relaxing the tolerance for the inner loop (GMR4) reduces the number of inner iterations, but leads to a significant increase in the number of outer iterations. The methods using a flexible inner loop tolerance perform much better. GMR7F requires the same amount of inner iterations as the inaccurate solver GMR4, without much loss of accuracy in the outer iteration, whereas GMR4F is about twice as cheap as GMR4, see (9).
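The last comparison can be made concrete by combining the iteration counts of Table 7 with the cost model (9). The estimate below is rough and rests on an assumption made here: the averaged number of inner iterations is inserted into the quadratic formula (9), which slightly underestimates the true average cost. Work is expressed in units of $n^2\phi$ per subdomain, so only ratios within one column are meaningful.

```python
# (outer iterations, averaged inner iterations) for M = 2..6, taken from Table 7.
TABLE7 = {
    "GMR8":  [(16, 19.9), (23, 17.4), (29, 15.3), (34, 13.8), (40, 12.8)],
    "GMR6":  [(17, 16.5), (26, 13.6), (31, 11.8), (36, 10.8), (40, 9.9)],
    "GMR4":  [(25, 11.8), (42, 9.7), (47, 8.1), (55, 7.3), (59, 6.9)],
    "GMR7F": [(17, 11.1), (25, 8.9), (31, 8.0), (35, 7.4), (39, 7.0)],
    "GMR4F": [(23, 7.1), (38, 5.9), (40, 5.4), (53, 4.7), (49, 4.5)],
}

def cost_gmres(m):
    """Eq. (9) in units of n^2 * phi."""
    return 2 * m * m + 26 * m + 13

def work(solver):
    """Estimated work per subdomain for each partitioning, in units of n^2 * phi."""
    return [outer * cost_gmres(inner) for outer, inner in TABLE7[solver]]

ratio_gmr4_to_gmr4f = [a / b for a, b in zip(work("GMR4"), work("GMR4F"))]
print([round(r, 2) for r in ratio_gmr4_to_gmr4f])
```

The same helper can be applied to the other rows to rank the solvers by estimated work for a given partitioning.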
4.2.2 Convergence on a larger problem
In this subsection we consider the fixed problem on the 300 × 300 grid, which has also been used in [11] to assess the performance of subdomain solvers for GCR (although [11] has β = 0 in the right-hand side function (11)). We applied P-GMRES with the flexible subdomain solver GMRkF for various M × M partitionings of Ω, and list the number of outer and averaged inner iterations in Table 8. For the sake of comparison we also quote those numbers for GCR with GMR6 from [11].
Note that the flexible tolerance subdomain solvers perform quite satisfactorily, even for the rather crude tolerances in GMR4F. The difference in inner iterations between the fixed strategy in GCR and GMRkF is striking, whereas the convergence of the outer iteration is comparable. This can be explained by the fact that inaccuracies introduced by the subdomain solves are soon detected by the restarts in the outer loop, so they do not have a chance to spoil too many subsequent iterations. Fig. 8 illustrates the convergence of P-GMRES with the GMR4F subdomain solver for the 6 × 6 subdomain case.
Solver   2 × 2        3 × 3        4 × 4        5 × 5        6 × 6
EX       32           59           87           94           123
GCR      78 (68.4)    83 (38.7)    145 (31.4)   168 (26.4)
GMR6F    84 (14.4)    105 (13.0)   116 (12.9)   114 (12.8)   142 (12.0)
GMR5F    62 (17.3)    102 (12.2)   123 (11.8)   115 (12.2)   170 (10.6)
GMR4F    110 (12.1)   144 (10.8)   161 (10.0)   163 (10.1)   175 (10.7)

Table 8: Outer and inner iterations for various subdomain solvers, 300 × 300 grid

Figure 8: Convergence of P-GMRES (GMR4F), 300 × 300 grid, 6 × 6 subdomains. [Plot: calculated residual (logarithmic axis) versus iteration; solid line EX, dashed line GMR4F.]

The peaks in the graph for GMR4F occur at the restarts, indicating that the calculated residual is contaminated by inaccuracies from the subdomain solve. However, subsequent iterations quickly diminish the value of these residuals. The number of inner iterations used in this example is plotted in Fig. 9. It is seen that the required number of inner iterations rapidly decreases in the course of an outer cycle, using the flexible tolerance. Consequently, the averaged number of inner iterations is substantially reduced.
In some experiments we observed that the behaviour of the residual was quite erratic in the final stage of convergence (see Fig. 8). In that case the calculated residual in the outer iteration is small enough, hence the cycle is terminated, but the actual residual, calculated after the update of the solution, does not yet satisfy the stopping criterion, so a new cycle of outer iterations is started. This phenomenon is probably due to the crude tolerance (tol ≤ 0.1) with which the subproblems are solved. Requiring tol ≤ 0.01 gives a much more regular behaviour of the calculated residual, however at the cost of more inner iterations.
Finally we remark that in [11] GCR is also applied with the subdomain solvers GMR2, GMR1 and RILUD, the latter standing for just one forward-backward substitution with the incomplete factors from the RILUD decomposition. We quote their results for the 5 × 5 subdomain case in Table 9.
Figure 9: Inner iterations in P-GMRES (GMR4F), 300 × 300 grid, 6 × 6 subdomains. [Plot: average number of inner iterations versus outer iteration.]

GCR          GCR          GCR         GCR        P-GMRES
(GMR6)       (GMR2)       (GMR1)      (RILUD)    (GMR4F)
168 (26.4)   192 (10.9)   303 (5.9)   437 (1)    163 (10.1)

Table 9: Outer and inner iterations for various subdomain solvers, 300 × 300 grid

They note that the inner loop tolerance of 0.1 is insufficient for fast global convergence, although it is still an expensive subdomain approximation. We did not apply these crude tolerances, because the flexible strategy GMR4F already invokes large inner loop tolerances during part of the outer iterations, see (13), and moreover (P-)GMRES is more sensitive to inaccurate subdomain solutions than GCR. We also did not apply RILUD as a preconditioner. Although GMRES can be combined with RILUD very well, P-GMRES will fail, as some accuracy in the solution of the subdomain problem is required for this method.
4.2.3 Performance for the larger problem
Fig. 10 shows the performance of the P-GMRES method with various subdomain solvers for the 300 × 300 grid. The figure shows the number of floating point operations per subdomain required for the convergence of the outer iteration. These numbers are computed from the costs of the outer iteration and (8), (9). Increasing the number of subdomains reduces the computational costs substantially, showing the potential of the domain decomposition method for parallelisation. Note that the exact subdomain solver is the most efficient one for the relatively small grids considered here.
Figure 10: Number of operations per subdomain, 300 × 300 grid. [Plot: number of flops per subdomain (in units of 10^9) versus number of subdomains for EX (solid line), GMR4F (o), GMR5F (x) and GMR6F (+).]
5 Conclusions
For applications which require domain decomposition the partitioned iterative method P-GMRES offers advantages compared to GMRES, both with respect to speed of convergence and communication costs. These advantages are expected to be more pronounced for problems with large variations in the solution on the various subdomains, e.g. in layered problems ([28]). However, P-GMRES can only be applied if a red-black colouring of the subdomains is possible.
Although monotone convergence of (P-)GMRES is only guaranteed in case of exact subdomain solution, it is also possible to solve the subdomain problems approximately, in combination with restarts. Our experiments indicate that a large reduction in computation time is obtained by a flexible tolerance strategy in the subdomain problems, which contradicts theoretical suggestions in the literature ([25]). A rather crude tolerance (tol ≤ 0.1) is allowed in the subdomain solutions, but a stricter tolerance leads to smoother convergence. However, for the size of the problems we considered, up to 22500 unknowns per subdomain, an exact subdomain solver still turned out to be the most efficient one. Nevertheless, it seems worthwhile to modify P-GMRES according to the ideas in [21] to recover the monotone convergence. This would only cost additional storage, which is usually not a problem on a parallel system.
Considering the computational work only, the domain decomposition method profits from a large number of subdomains. Notwithstanding the slower convergence, the work per subdomain decreases. This offers good opportunities for parallelisation if the communication costs are small. A performance model indicates that the orthogonalisation in P-GMRES is about twice as cheap as in GMRES for relatively small problems, using 9 processors and up to 5000 unknowns per processor for the HP-cluster. The advantage is about 20% on large problems, e.g. 10000 unknowns per subdomain on the Cray T3E. The model also indicates that a speedup of 1.5 can be obtained if computation and communication are balanced and can be overlapped. For very small problems it might be advantageous to use the re-orthogonalised classical Gram-Schmidt orthogonalisation, as suggested in [11].
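The re-orthogonalised classical Gram-Schmidt process mentioned above can be sketched as follows. This is a generic, sequential pure-Python illustration (classical Gram-Schmidt applied twice), not necessarily the variant used in [11]; the names `dot` and `cgs2` are illustrative. In each pass all inner products are taken with the same vector, so in a parallel code they could be combined in one global reduction, whereas modified Gram-Schmidt needs one reduction per basis vector.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def cgs2(basis, w):
    """Orthogonalise w against the orthonormal vectors in `basis`.

    Classical Gram-Schmidt is applied twice. Returns the accumulated
    projection coefficients, the norm of the remainder and the new
    normalised basis vector. A zero remainder (breakdown) is not handled.
    """
    coeffs_total = [0.0] * len(basis)
    for _ in range(2):
        # All inner products of one pass use the same w (one batched reduction).
        coeffs = [dot(q, w) for q in basis]
        w = [wi - sum(c * q[i] for c, q in zip(coeffs, basis))
             for i, wi in enumerate(w)]
        coeffs_total = [t + c for t, c in zip(coeffs_total, coeffs)]
    norm = math.sqrt(dot(w, w))
    return coeffs_total, norm, [wi / norm for wi in w]


if __name__ == "__main__":
    eps = 1e-8
    vectors = [[1.0, eps, 0.0, 0.0], [1.0, 0.0, eps, 0.0], [1.0, 0.0, 0.0, eps]]
    basis = []
    for v in vectors:
        basis.append(cgs2(basis, v)[2])
    print([dot(basis[i], basis[j]) for i in range(3) for j in range(i)])
```

The price of the second pass is twice the number of inner products and vector updates, which is why this variant is mainly of interest when communication latency dominates, as for very small problems.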
References
[1] O. Axelsson and G. Lindskog. On the eigenvalue distribution of a class of preconditioning methods. Numer. Math., 48:479–498, 1986.
[2] Z. Bai, D. Hu, and L. Reichel. A Newton-basis GMRES implementation. IMA J. Numer. Anal., 14:563–581, 1994.
[3] A. Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT, 7:1–21, 1967.
[4] C. Börgers. The Neumann-Dirichlet domain decomposition method with inexact solvers on the subdomain. Numer. Math., 55:123–136, 1989.
[5] E. Brakkee. Domain decomposition for the incompressible Navier-Stokes equations. PhD thesis, Delft University of Technology, Delft, The Netherlands, April 1996.
[6] E. Brakkee, C. Vuik, and P. Wesseling. Domain decomposition for the incompressible Navier-Stokes equations: Solving subdomain problems accurately and inaccurately. Int. J. Num. Meth. Fluids, 26:1217–1237, 1998.
[7] T.F. Chan and T.P. Mathew. Domain decomposition algorithms. In A. Iserles, editor, Acta Numerica, pages 61–143, Cambridge, 1994. Cambridge University Press.
[8] E. de Sturler and H.A. van der Vorst. Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers. Appl. Numer. Math., 18:441–459, 1995.
[9] K. Dekker. Parallel GMRES and domain decomposition. Report, Delft University of Technology, Delft, 2000.
[10] S.C. Eisenstat. Efficient implementation of a class of preconditioned conjugate gradient methods. SIAM J. Sci. Stat. Comput., 2:1–4, 1981.
[11] J. Frank and C. Vuik. Parallel implementation of a multiblock method with approximate subdomain solution. Appl. Numer. Math., 30:403–423, 1999.
[12] G. Haase, U. Langer, and A. Meyer. The approximate Dirichlet domain decomposition method, Part I: An algebraic approach. Computing, 47:137–151, 1991.
[13] G. Haase, U. Langer, and A. Meyer. The approximate Dirichlet domain decomposition method, Part II: Applications to 2nd-order elliptic BVPs. Computing, 47:153–167, 1991.
[14] G. Haase, U. Langer, and A. Meyer. Domain decomposition preconditioners with inexact subdomain solvers. J. Numer. Linear Algebra Appl., 1:27–41, 1991.
[15] R.W. Hockney and C.R. Jesshope. Parallel Computers 2: Architecture, Programming and Algorithms. Adam Hilger, Bristol, 1988.
[16] W. Hoffmann. Iterative algorithms for Gram-Schmidt orthogonalization. Computing, 41:335–348, 1989.
[17] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5):1058–1073, 1991.
[18] G. Li. A block variant of the GMRES method on massively parallel processors. Parallel Computing, 23:1005–1019, 1997.
[19] J.A. Meijerink and H.A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148–162, 1977.
[20] A. Meyer. A parallel preconditioned conjugate gradient method using domain decomposition and inexact solvers on each subdomain. Computing, 45:217–234, 1990.
[21] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Stat. Comput., 14:461–469, 1993.
[22] Y. Saad and M.H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856–869, 1986.
[23] A. Segal, P. Wesseling, J. van Kan, C.W. Oosterlee, and C. Kassels. Invariant discretization of the incompressible Navier-Stokes equations in boundary fitted co-ordinates. Int. J. Num. Meth. Fluids, 15:411–426, 1992.
[24] B.F. Smith, P.E. Bjørstad, and W.D. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge, UK, 1996.
[25] K.H. Tan. Local coupling in domain decomposition. PhD thesis, Utrecht University, Utrecht, The Netherlands, April 1996.
[26] H.A. van der Vorst and C. Vuik. GMRESR: a family of nested GMRES methods. Num. Lin. Alg. Appl., 1:369–386, 1994.
[27] J.J.I.M. van Kan. A second-order accurate pressure correction method for viscous incompressible flow. SIAM J. Sci. Stat. Comput., 7:870–891, 1986.
[28] C. Vuik, A. Segal, and J.A. Meijerink. An efficient preconditioned CG method for the solution of a class of layered problems with extreme contrasts in the coefficients. J. Comp. Phys., 152:385–403, 1999.
[29] H.F. Walker. Implementation of the GMRES method using Householder transformations. SIAM J. Sci. Stat. Comput., 9:152–163, 1988.
[30] P. Wesseling, A. Segal, C.G.M. Kassels, and H. Bijl. Computing flows on general two-dimensional nonsmooth staggered grids. J. Eng. Math., 34:21–44, 1998.