Liviu Octavian Mafteiu-Scai / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp. 1898-1907

Average Bandwidth Relevance in Parallel Solving Systems of Linear Equations

Liviu Octavian Mafteiu-Scai
Computer Science Department, West University of Timisoara, Timisoara, Romania

ABSTRACT: This paper presents experimental results obtained on an IBM Blue Gene/P parallel computer that show the relevance of the average bandwidth reduction [11] in the serial and parallel cases of Gaussian elimination and conjugate gradient. New measures of the effectiveness of parallelization have been introduced in order to measure the effects of the average bandwidth reduction. The main conclusion is that reducing the average bandwidth in sparse systems of linear equations improves the performance of these methods, a fact that recommends using this indicator in preconditioning processes, especially when the solving is done using a parallel computer.

KEYWORDS: average bandwidth, linear system of equations, parallel methods, Gaussian elimination, conjugate gradient, sparse matrices

1. THEORETICAL CONSIDERATIONS

Systems of linear equations appear in almost every branch of science and engineering. The engineering areas in which sparse and large linear systems of equations arise frequently include chemical engineering processes, design and computer analysis of circuits, power system networks, and many others. The search for efficient solutions is driven by the need to solve huge systems (millions of unknowns) on parallel computers. The interest in parallel solving of systems of equations, especially very large and sparse ones, has been very high; hundreds of papers deal with this subject. The solving methods divide into direct methods and iterative methods; Gaussian elimination and conjugate gradient are two popular examples, respectively.
In Gaussian elimination the solution is exact and is obtained in finitely many operations. The conjugate gradient method generates sequences of approximations that converge in the limit to the solution. For each of them many variants have been developed, and the literature is very rich in describing these methods, especially in the case of serial implementations. Below are succinctly listed some particularities of the parallel implementations of these two methods, for the case where the system matrix is partitioned by rows. An n x n system is considered, in the case when the size of the system n is divisible by the number of partitions p and the partitions are equal in terms of number of rows, i.e. k = n/p rows in each partition; thus a partition p_i of size k includes the consecutive rows/equations between (i-1)*k + 1 and i*k.

Parallel Gaussian elimination (GE)
The main operations performed are: local pivot determination for each partition, global pivot determination, pivot row exchange, pivot row distribution, computing the elimination factors, and computing the matrix elements. Because the values of the unknowns depend on each other and are computed one after another, the computation of the solutions in the backward substitution is inherently serial. In Gaussian elimination there is an issue with load balancing, because some processors become idle once all their work is done.

Parallel conjugate gradient (CG)
The CG method is very good for large and sparse linear systems because it has the property of uniform convergence, but only if the associated matrix is symmetric and positive definite. The parallelism in the CG algorithm derives from the parallel matrix-vector product. Other operations can be performed in parallel as long as there is no dependency between them, such as, for example, updating the residual vector and the solution vector. But these latter operations cannot be performed before the matrix-vector product, and the matrix-vector product in a new iteration cannot be performed until the residual vector is updated.
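The row-block partitioning described above can be sketched as follows (a minimal illustration, not the paper's code; the function name is ours):

```python
def partition_rows(n, p):
    """Return, for each partition i (1-based), the 1-based index range
    [(i-1)*k + 1, i*k] of the k = n/p consecutive rows/equations it holds."""
    assert n % p == 0, "the system size n must be divisible by p"
    k = n // p
    return [((i - 1) * k + 1, i * k) for i in range(1, p + 1)]

# Example: a 12-equation system split among 3 partitions.
print(partition_rows(12, 3))  # [(1, 4), (5, 8), (9, 12)]
```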
So there are two moments at which the processors must synchronize before they can continue the work. It is desirable that between these two synchronization points the processors have no periods of inactivity, which is the ideal case. In practice, the efficiency of the computation follows from minimizing this waiting/idle time for synchronization.

It has been observed that a particular preparation of the system, before applying a numerical method for solving it, leads to an improvement of the process and of the solution. This preparation is called preconditioning. Over time, many preconditioning methods have been proposed, designed to improve the process of solving a system of equations. There are many studies on
the influence of preconditioning on the parallel solving of systems of linear equations [1, 2, 3]. Reducing the bandwidth of the associated matrix is one of these preconditioning methods, and for this there are a lot of methods, the most popular being presented in works such as [4, 5, 6, 7, 8]. Paper [9] presents a study of the parallel iterative methods Newton, conjugate gradient and Chebyshev, including the influence of bandwidth reduction on the convergence of these methods. Paper [10] proposed a new measure for sparse matrices called average bandwidth (mbw). In [11], algorithms and comparative studies related to this new indicator were made. Paper [12] proposes methods that allow, for a pair of matrix lines/columns, a qualitative and quantitative measurement of the opportunity for an interchange, in terms of bandwidth and average bandwidth reduction, without actually performing the interchange. According to [11], the average bandwidth is defined by the relation:
mbw = (1/m) * SUM |i - j|,  over all i, j = 1, ..., n with a_ij != 0    (1)
where m is the number of non-zero elements and n is the size of the matrix A. The reasons, specified in [11], for using the average bandwidth (mbw) instead of the bandwidth (bw) in preconditioning before the parallel solving of a system of equations are:
- mbw reduction leads to a more uniform distribution of the non-zero elements around and along the main diagonal;
- mbw is more sensitive than bw to the presence around the main diagonal of so-called "holes", that is, compact regions of zero values;
- mbw is less sensitive than bw to the presence of isolated non-zero elements far from the main diagonal, whereas bw is determined entirely by the most distant non-zero element. This is an advantage according to paper [13], as can be seen in Figure 1c). For the same matrix 1a), two algorithms were used: Cuthill-McKee for 1b) and the one proposed in [10] for 1c), the first to reduce the bandwidth bw and the
second to reduce the average bandwidth mbw. Paper [15] describes a set of indicators to measure the effectiveness of parallel processes. From that work, two simple but relevant indicators were chosen: Relative Speedup (Sp) and Relative Efficiency (Ep), described by relations (2) and (3).
Sp = T1 / Tp    (2)

Ep = Sp / p = T1 / (p * Tp)    (3)
where p is the number of processors, T1 is the execution time achieved by a sequential algorithm, and Tp is the execution time obtained with a parallel algorithm with p processors. When Sp = p we have a linear speedup. Sometimes in practice there is an interesting situation, known as superlinear speedup, when Sp > p. One possible cause is the cache effect, resulting from the memory hierarchy of parallel computers [16]. In our experiments such situations have been encountered, some of which are contained in Tables 1 and 3.

Note: In our experiments, because we were especially interested in the effects of mbw reduction, in relations (2) and (3) we consider T1 as the execution time before mbw reduction, in the serial case.
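Relations (2) and (3) can be sketched as follows (a minimal illustration, not from the paper; the numeric values are invented):

```python
def speedup(t1, tp):
    """Relative speedup, relation (2): Sp = T1 / Tp."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Relative efficiency, relation (3): Ep = Sp / p = T1 / (p * Tp)."""
    return t1 / (p * tp)

# Illustrative timings: 100 s serially, 20 s on 8 processors.
sp = speedup(100.0, 20.0)        # 5.0
ep = efficiency(100.0, 20.0, 8)  # 0.625: sublinear, since Sp < p
```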
Figure 1. A matrix after bw reduction and after mbw reduction
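As an illustration of the bw and mbw notions compared in Figure 1 and defined in relation (1), here is a minimal sketch (not from the paper; dense list-of-lists storage is used only for brevity):

```python
def bw_and_mbw(A):
    """bw is the maximum |i - j| over non-zero entries a_ij;
    mbw is the mean of |i - j| over the m non-zero entries."""
    dists = [abs(i - j)
             for i, row in enumerate(A)
             for j, a in enumerate(row) if a != 0]
    return max(dists), sum(dists) / len(dists)

# A 4x4 example: one far off-diagonal element dominates bw
# but barely moves mbw, as discussed above.
A = [[1, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
bw, mbw = bw_and_mbw(A)   # bw = 3, mbw = 5/7
```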
The Gain Time (GT) measure is introduced, which is, in percent, the difference between the execution time before mbw reducing (Tb) and the execution time after mbw reducing (Ta), related to the execution time before mbw reducing:

GT = (Tb - Ta) / Tb * 100    (4)
with positive values showing a better efficiency in terms of execution time after mbw reducing. The measure Increase of Efficiency (IE) is also introduced, which is, in percent, the difference between the relative efficiency after mbw reducing (Epa) and the relative efficiency before mbw reducing (Epb), reported to the relative efficiency before mbw reduction (Epb), for the same partitioning:

IE = (Epa - Epb) / Epb * 100    (5)
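The Gain Time and Increase of Efficiency measures described above can be sketched as follows (a minimal illustration, not from the paper; the numeric values are invented):

```python
def gain_time(t_before, t_after):
    """Gain Time: percentage change of execution time, relative to the
    time before mbw reduction; positive means faster after reduction."""
    return (t_before - t_after) / t_before * 100.0

def increase_of_efficiency(ep_before, ep_after):
    """Increase of Efficiency: percentage change of relative efficiency,
    relative to the efficiency before mbw reduction."""
    return (ep_after - ep_before) / ep_before * 100.0

gt = gain_time(40.0, 30.0)               # 25.0 (% faster after reduction)
ie = increase_of_efficiency(0.50, 0.60)  # ~20.0 (% relative efficiency gain)
```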
The average number of iterations required for convergence was denoted by A, and the average number of iterations required for convergence per processor by Ap. In the experiments performed, A was the average value obtained for 50 systems with the same size but with different non-zero element distributions and different sparsity. So Ap = A/p, where p is the number of processors/partitions. It will be seen from the experiments that between Ap and the execution time there is a directly proportional relation, so that Ap can be a measure of the parallelization efficiency. Using only the value of the average number of iterations per processor Ap allows an estimation of the efficiency that neglects the hardware factors (communication time between processors, cache effect, etc.), whose influence is difficult to calculate. Based on these considerations, a new measure of efficiency based on the average number of iterations per processor is proposed, called Efficiency in Convergence (EC). It shows, for two situations Sa (after) and Sb (before), how much the average number of iterations per processor has decreased/increased, in percent, from situation Sb to situation Sa. In this study (the relevance of average bandwidth reduction in the parallel case), situation Sa is represented by Apa and situation Sb by Apb, where the indices "a" and "b" refer to "after mbw reduction" and "before mbw reduction". So the computing relation is:

EC = (Apb - Apa) / Apb * 100    (6)
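Relation (6), together with Ap = A/p, can be sketched as follows (a minimal illustration, not from the paper; the iteration counts are invented):

```python
def efficiency_in_convergence(ap_before, ap_after):
    """EC, relation (6): percentage decrease of the average number of
    iterations per processor, relative to the value before mbw reduction."""
    return (ap_before - ap_after) / ap_before * 100.0

# Ap = A / p, where A is the average iteration count over the test systems.
p = 8
ap_b = 400 / p    # 50 iterations per processor before mbw reduction
ap_a = 320 / p    # 40 iterations per processor after mbw reduction
ec = efficiency_in_convergence(ap_b, ap_a)   # 20.0 (% fewer iterations)
```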
The next section will show that mbw reducing leads to a more efficient parallel computing process.
2. THE EXPERIMENTS

The experiments were performed on an IBM Blue Gene/P supercomputer. In the experiments, Gaussian elimination without pivoting and the preconditioned conjugate gradient with block Jacobi preconditioning were implemented. To implement these serial and parallel numerical methods, the IBM XL C compiler version 9.0 under Linux was used. For the parallel implementation of the GE and CG methods, MPI (Message Passing Interface) was used, a standard in parallel computing which enables overlapping communications and computations among processors and avoids unnecessary messages in point-to-point synchronization. For the parallel partitioning, the system of equations was divided equally between processors, using the divisors of the system size.
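For reference, a minimal serial sketch of Gaussian elimination without pivoting, the GE variant named above, is shown here (an illustration, not the paper's MPI implementation; the small test system is invented):

```python
def gauss_no_pivot(A, b):
    """Solve A x = b by Gaussian elimination without pivoting."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n - 1):                  # forward elimination
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]           # elimination factor
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n                           # inherently serial back substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

x = gauss_no_pivot([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])  # [0.8, 1.4]
```

Without pivoting, a zero or very small value on the main diagonal makes the method fail, which is exactly one of the failure causes reported later in this section.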
The matrices/systems chosen for the experiments were randomly generated, positive definite, symmetric and sparse, with a sparsity degree between 5 and 20%. The sizes of these matrices were between 10 and 1000, with the greatest weight on the size 200. In the experiments, matrices with both uniform and nonuniform distributions of the non-zero elements were generated and used. In terms of reducing the average bandwidth mbw, the values obtained were between 10 and 70%. For each instance/partitioning, 50 systems with the same size but different sparsity and non-zero element distributions were used. Sizes were varied with ratio 10, from 10 to 1000. In the case of CG, for the arithmetic precision required by the convergence, the epsilon values chosen were 10^-10, 10^-30, 10^-100 and 10^-200, and the initial vector used was X0 = {0, 0, ..., 0}.
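A minimal serial conjugate gradient sketch consistent with this setup (epsilon-based stopping on the residual norm, initial vector X0 = 0) is shown below. It is an illustration only: the small test system is invented, and the block Jacobi preconditioning of the actual experiments is omitted.

```python
def cg(A, b, eps=1e-10, max_iter=1000):
    """Unpreconditioned conjugate gradient for symmetric positive
    definite A, starting from X0 = {0, 0, ..., 0}."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                               # residual r0 = b - A x0 = b
    d = r[:]                               # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(d[i] * Ad[i] for i in range(n))
        x = [x[i] + alpha * d[i] for i in range(n)]   # update solution
        r = [r[i] - alpha * Ad[i] for i in range(n)]  # update residual
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < eps:            # convergence test on ||r||
            break
        d = [r[i] + (rs_new / rs) * d[i] for i in range(n)]
        rs = rs_new
    return x

x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])   # approx [1/11, 7/11]
```

The matrix-vector product `Ad` is the step that is parallelized in the experiments, with the residual and solution updates performed between the two synchronization points discussed in Section 1.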
2.1 Serial case

Gaussian elimination. It has been experimentally observed that in general, in GE, the average bandwidth reduction and/or the bandwidth reduction did not significantly affect the computation in terms of its efficiency. Only in the cases where mbw << n or bw << n was there a noticeable effect. From the experiments it resulted that in approximately 60% of cases the mbw reduction led to an increase in efficiency, but without major differences in terms of execution time (order of magnitude). It was observed that increasing the number of processors involved in the computation first leads to a decrease in execution time, reaching a minimum value, followed by an increase in execution time as the number of processors grows further, as can be seen in Figure 3. An example is shown below.

Example 1: size of system: 500x500, 1486 non-zero values, uniform distribution; before mbw reduction: mbw0=110.33, bw0=499; after mbw reduction: mbw=12.44, bw=245.
Figure 3 Gain Time: Example 1
Figure 4 Relative speedup: Example 1
Figure 5 Relative Efficiency: Example 1
Figure 6 Increase of Efficiency: Example 1
In the experiments there were encountered situations (some partitionings) when the Gaussian elimination failed. Possible causes include: rounding errors, numerical instability, or a main diagonal of the associated matrix containing zeros or very small values.

Table 1 Example 1: experimental results (average values)
2.2.2 Conjugate gradient

Figure 7 shows the global effect of reducing mbw in the case of parallel conjugate gradient, which grows especially with the increasing size of the systems and with increased computing accuracy. We mention that in Figure 7 the average values obtained for the different systems of equations (sparsity, distribution, etc.) before and after mbw reduction are represented. Below, some examples (2, 3, 4, 5 and 6) of different situations are presented, which show the effects of mbw reducing in the parallel solving of systems of linear equations using the conjugate gradient method.

Example 2: size of system: 1000x1000, 36846 non-zero values, uniform distribution; before mbw reduction: mbw0=413, bw0=999; after mbw reduction: mbw=211, bw=911.

In Table 2 and Figures 8 and 9, the correlation between the execution time and the average number of iterations per processor Ap is shown, which justifies the use of the latter as an indicator for performance measurement.
Figure 7 Conjugate gradient, parallel case
Figure 8 The correlation between Ap and Runtime, from Example 2 (e-10)
Figure 10: IE from Example 2, experimental results

Example 3: size of system: 100x100, 1182 non-zero values, sparsity=11.82%, uniform distribution; before mbw reduction: mbw0=33.61, bw0=99; after mbw reduction: mbw=9.97, bw=60.
Figure 11: Matrix form from Example 3 before and after mbw reduction

In Figure 12, according to relation (6), we see that reducing the average bandwidth mbw has the best effect in the case of 10 partitions, in terms of the number of iterations per processor necessary for convergence.

Figure 12: Efficiency in convergence, Example 3
Example 4: size of system: 100x100, 924 non-zero values, sparsity=9.24%, nonuniform distribution; before mbw reduction: mbw0=38.73, bw0=99; after mbw reduction: mbw=20.05, bw=80.

Figure 13: Matrix form from Example 4 before and after mbw reduction (epsilon=10^-30)

This is an example in which all partitionings are favorable to the average bandwidth reduction, as can be seen in Figure 14.

Figure 14: Efficiency in convergence, Example 4
Example 5: size of system: 100x100, 1898 non-zero values, sparsity=18.98%, nonuniform distribution; before mbw reduction: mbw0=36.38, bw0=97; after mbw reduction: mbw=21.63, bw=97.

Figure 15: Matrix form from Example 5 before and after mbw reduction (epsilon=10^-30)

In some cases, some partitionings lead to a drastic decrease in efficiency after mbw reducing, but in general there are other favorable situations in terms of convergence, as can be seen in Figures 16 and 12.
FUTURE WORK

The indicator proposed in [11], the average bandwidth mbw, was validated by the experimental results presented in this paper, which recommend its use in the preconditioning of large systems of linear equations, especially when the solving is done using a parallel computer. Using the proposed indicator average number of iterations per processor, Ap, is a good choice because it allows an estimation of the efficiency that neglects the hardware factors, in the case of parallel iterative methods. Also, the proposed indicator Efficiency in Convergence (EC), based on Ap, shows in comparative studies for parallel iterative methods, in an intuitive way, the obtained progress/regression. The proposed indicators Gain Time (GT) and Increase of Efficiency (IE) likewise show the obtained progress/regression clearly and intuitively in comparative studies.

Extending the study to the case of nonlinear systems of equations is required; encouraging results are also expected there, since, as shown in [9], under certain conditions the nonlinear parallel case can be reduced to a linear one. The influence of mbw reducing in the case of unequal and overlapping partitions (equal & overlapping, unequal & non-overlapping, and unequal & overlapping) will be a future study.
REFERENCES

[1] M. Benzi, "Preconditioning Techniques for Large Linear Systems: A Survey", Journal of Computational Physics, Vol. 182, Issue 2, pp. 418-477, Elsevier, November 2002.
[2] Y. Saad, "Iterative Methods for Sparse Linear Systems", second edition, ISBN-13: 978-0-898715-34-7, SIAM, 2003.
[3] O. Axelsson, "A survey of preconditioned iterative methods for linear systems of algebraic equations", BIT Numerical Mathematics, Volume 25, Issue 1, pp. 165-187, Springer, 1985.
[4] E. Cuthill and J. McKee, "Reducing the bandwidth of sparse symmetric matrices", in Proc. of the ACM, pp. 157-172, New York, NY, USA, ACM, 1969.
[5] N.E. Gibbs, W.G. Poole, and P.K. Stockmeyer, "An algorithm for reducing the bandwidth and profile of a sparse matrix", SIAM Journal on Numerical Analysis, 13:236-250, 1976.
[6] R. Marti, M. Laguna, F. Glover, and V. Campos, "Reducing the bandwidth of a sparse matrix with tabu search", European Journal of Operational Research, 135:450-459, 2001.
[7] R. Marti, V. Campos, and E. Pinana, "A branch and bound algorithm for the matrix bandwidth minimization", European Journal of Operational Research, 186:513-528, 2008.
[8] A. Caprara and J. Salazar-Gonzalez, "Laying out sparse graphs with provably minimum bandwidth", INFORMS Journal on Computing, 17:356-373, July 2005.
[9] S. Maruster, V. Negru, L.O. Mafteiu-Scai, "Experimental study on parallel methods for solving systems of equations", Proc. of the 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2011 (will appear in
[12] "Interchange Opportunity In Average Bandwidth Reduction In Sparse Matrix", West University of Timisoara Annals, Mathematics and Computer Science series, volume L, fasc. 2, Timisoara, Romania, pp. 01-14, ISSN: 1841-3293, Versita Publishing, DOI: 10.2478/v10324-012-0016-1, 2012.
[13] P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland, "Parallel numerical linear algebra, chapter A comparison of parallel solvers for diagonally dominant and general narrow banded linear systems", pp. 35-56, Nova Science Publishers, Inc., Commack, NY, USA, 2001.
[14] S. S. Skiena, "The algorithm design manual", second edition, Springer, ISBN 978-1-84800-069-8, 2012.
[15] S. Sahni, V. Thanvantri, "Parallel computing: performance metrics and models", Computer & Information Sciences Dept., Univ. of Florida, Report to US Army Research Office, Grant DAA H04-95-1-0111, 1995.
[16] D. P. Helmbold, C. E. McDowell, "Modeling speedup(n) greater than n", IEEE Transactions on