A Cross-Language Comparison of Support for Parallel Multigrid Applications

Bradford L. Chamberlain    Steven Deitz    Lawrence Snyder
University of Washington, Seattle, WA 98195-2350 USA
{brad,deitz,snyder}@cs.washington.edu

Abstract

In this study we perform the first cross-language comparison of support for parallel multigrid applications. The candidate languages include High Performance Fortran, Co-Array Fortran, Single Assignment C, and ZPL. The NAS MG benchmark is used as the basis for comparison. Each language is evaluated not only in terms of its parallel performance, but also for its ability to express the computation clearly and concisely. Our findings demonstrate that the decision of whether to support a local per-processor view of computation or a global view affects a language's expressiveness and performance: the local-view approaches tend to achieve the best performance, while those with a global view have the best expressiveness. We find that ZPL represents a compelling tradeoff between these two extremes, achieving performance that is competitive with a locally-based approach, yet using syntax that is far more concise and expressive.

1 Introduction

Scientific programmers often make use of hierarchical arrays in order to accelerate large-scale multigrid computations [4, 3, 11]. These arrays typically have a number of levels, each of which contains roughly half as many elements per dimension as the previous level. Computations performed at the coarsest level provide a rough approximation to the overall solution and can be computed quickly due to the relatively small number of array elements. The coarse solution can then be refined at each of the finer levels in order to produce a more precise solution. Hierarchical computations generally take significantly less time than directly computing a solution at the finest level, yet can often be designed so that the accuracy of the solution is only minimally affected. Scientific computations that use hierarchical arrays include multigrid-style applications such as adaptive mesh refinement (AMR) [19] and the fast multipole method (FMM) [12], as well as algorithms such as the Barnes-Hut algorithm for n-body simulations [2] and wavelet-based compression.

Although hierarchical algorithms are fast compared to direct methods on the finest grid, parallel computation is often used to further accelerate hierarchical computations. Until recent years, parallel language support for hierarchical computation was lacking. In previous work we have described desirable properties for parallel languages that hope to support hierarchical computation [8]. In this paper we continue that work
Language   Programmer view   Data distribution   Communication   Platforms
F90+MPI    local             manual              manual          most
HPF        global            undefined           invisible       most
CAF        local             manual              manual++        Cray T3E^1
SAC        global            defined             invisible       Linux/Solaris
ZPL        global            defined             visible         most

Figure 1: A summary of the parallel characteristics of the languages in this study. Programmer view indicates whether the programmer codes at the local per-processor level or at a global level. Data distribution indicates whether the data distribution is done manually by the programmer, is defined by the language, or is undefined and left up to the compiler. Communication indicates whether communication and synchronization are done manually by the programmer ("++" indicates that the language aids the programmer significantly), or are done by the compiler in a manner visible or invisible to the programmer at the source level.
by evaluating the hierarchical support of four modern parallel programming languages: High Performance Fortran, Co-Array Fortran, Single Assignment C, and ZPL. The languages are compared both in terms of
and uniform norms for the finest grid. In a parallel implementation, this requires a reduction over the processor set. In addition, an implementation of MG must maintain periodic boundary conditions, which requires additional point-to-point communications.
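To make these communication requirements concrete, the following minimal C+MPI sketch shows the two operations: an MPI_Allreduce that combines locally computed norm contributions across the processor set, and an MPI_Sendrecv exchange of boundary faces with wrap-around neighbors to maintain periodic boundary conditions. The slab decomposition, function names, and packed-face buffers are simplifying assumptions of ours and are not taken from any of the implementations studied.

    #include <mpi.h>

    /* Combine a locally computed maximum residual into the uniform
       (infinity) norm over the whole grid. */
    double global_uniform_norm(double local_max, MPI_Comm comm) {
        double global_max = 0.0;
        MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);
        return global_max;
    }

    /* Exchange the two boundary faces of a 1-D slab decomposition with the
       neighboring processors, wrapping around at the ends so that periodic
       boundary conditions are maintained.  Each face has been packed into a
       contiguous buffer of 'count' doubles. */
    void exchange_periodic_faces(double *send_lo, double *send_hi,
                                 double *recv_lo, double *recv_hi,
                                 int count, MPI_Comm comm) {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);
        int below = (rank - 1 + nprocs) % nprocs;  /* wrap-around neighbors */
        int above = (rank + 1) % nprocs;

        /* Send the low face down while receiving the high ghost face from above. */
        MPI_Sendrecv(send_lo, count, MPI_DOUBLE, below, 0,
                     recv_hi, count, MPI_DOUBLE, above, 0,
                     comm, MPI_STATUS_IGNORE);
        /* Send the high face up while receiving the low ghost face from below. */
        MPI_Sendrecv(send_hi, count, MPI_DOUBLE, above, 1,
                     recv_lo, count, MPI_DOUBLE, below, 1,
                     comm, MPI_STATUS_IGNORE);
    }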
There are five classes of MG, each of which is characterized by the number of elements in the finest grid and the number of iterations to be performed. Three of the classes (A, B, and C) are production-grade problem sizes, and their defining parameters are summarized in Figure 2. The other two classes, S
Language   Author      # Processors known?   Problem size known?   Data distribution
F90+MPI    NAS         yes^2                 yes                   3D blocked
HPF        NAS         no                    yes                   1D blocked
CAF        CAF group   yes^2                 yes                   3D blocked
SAC        SAC group   no^3                  yes^4                 1D blocked
ZPL        ZPL group   no                    no                    2D blocked

Figure 3: Summary of the implementations of MG used in this study. Author indicates the origin of the code. The next two columns indicate whether the number of processors and problem size are statically known to the compiler. Data distribution indicates the way in which arrays are distributed across processors.
was written by members of NAS who expended considerable effort to make it as efficient as possible [15].

F90+MPI: The F90+MPI code that we used was the NAS implementation, version 2.3. It served as the baseline for this study.

HPF: The HPF implementation was obtained from NASA Ames [15] and stems from a project to implement the NAS benchmarks in HPF. This implementation was identified by PGI as the best known publicly-available implementation of MG for their compiler (which serves as our experimental HPF compiler) [27]. It follows the F90+MPI implementation of MG very closely, making a few changes that allow the code to be more amenable to HPF's semantics and analysis. Unfortunately, the authors had to make some rather drastic concessions in order to implement it in a manner that was concise and independent of problem size. Chief among these was the fact that they had to allocate arrays of the finest grid at every level in the hierarchy, a tremendous waste of memory that will also have consequences on spatial locality. This required the use of the HOME directive in order to align the arrays in a manner that would obtain the de-

Footnotes for Figure 3:
^2 Additionally, the number of processors must be a power of two.
^3 Although the maximum number of threads must be specified on the compile-time command line.
^4 Though the problem size may be specified dynamically, the code is written such that only a few problem sizes are possible.
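A rough estimate makes the memory cost of HPF's allocation strategy (finest-grid arrays at every level) concrete; the figures below assume one 3-D array per level and ignore ghost regions, so the constants are illustrative only. With per-dimension halving, a hierarchy built on an n^3 finest grid needs roughly

    n^3 * (1 + 1/8 + 1/64 + ...) < (8/7) * n^3 ~ 1.14 * n^3

elements, whereas allocating the finest-grid size at every one of L levels needs L * n^3 elements. For example, with a 256^3 finest grid and 8 levels, that is a factor of about 7 more memory.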
Figure 5: The rprj3 operations from each benchmark, excluding communication for the F90+MPI/CAF version. Note that the HPF implementation is edited down for size and blocked out for the time being due to an outstanding nondisclosure agreement.
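For readers unfamiliar with the operation, the following scalar C sketch shows the kind of full-weighting restriction that rprj3 performs, computing each coarse-grid point as a weighted average of the 3x3x3 fine-grid neighborhood around the corresponding fine point. The array layout, loop bounds, and coefficients here are illustrative assumptions on our part and are not copied from any of the benchmark versions in the figure.

    #include <stdlib.h>

    /* Index into a flat n x n x n array of doubles. */
    #define IDX(a, n, i, j, k) ((a)[((size_t)(i) * (n) + (j)) * (n) + (k)])

    /* Restrict the fine grid r (nf points per dimension) onto the coarse
       grid s (nc = nf/2 points per dimension) by full weighting.  Boundary
       and ghost-cell handling is omitted for brevity. */
    void restrict_full_weighting(const double *r, int nf, double *s, int nc) {
        for (int ci = 1; ci < nc - 1; ci++)
            for (int cj = 1; cj < nc - 1; cj++)
                for (int ck = 1; ck < nc - 1; ck++) {
                    int fi = 2 * ci, fj = 2 * cj, fk = 2 * ck;
                    double sum = 0.0;
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++)
                            for (int dk = -1; dk <= 1; dk++) {
                                /* Full-weighting coefficients: 1/8 for the
                                   center, 1/16 for faces, 1/32 for edges,
                                   1/64 for corners. */
                                int off = abs(di) + abs(dj) + abs(dk);
                                double w = 0.125 / (double)(1 << off);
                                sum += w * IDX(r, nf, fi + di, fj + dj, fk + dk);
                            }
                    IDX(s, nc, ci, cj, ck) = sum;
                }
    }

Written this way, a single stencil spills into dozens of lines of explicit indexing, which is exactly the kind of verbosity that the line-count comparison below attributes to hand-written stencil code.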
Figure 6: A summaryof the machinesusedin theseexperiments. Location indicatesthe institution that donatedthecomputertime. Processors indicatesthe total numberof usableprocessorsto a singleuser. Speedtells theclockspeedof theprocessors.Memorygivesthetotalamountof memoryavailableacrossall processorsandMemoryModelindicatesjust that.
and declaration counts.

The next thing to note is that the declarations in all of the languages are reasonably similar, with the exception of SAC, which has the fewest. This is due to SAC's implicit variable declarations, described in the previous section. Thus, SAC's declaration count is limited to simply those lines required to declare functions or the program as a whole.

In terms of computation, one observes that the Fortran-based languages have 2 to 3 times the number of lines as the SAC and ZPL benchmarks. This can be largely attributed to the fact that much of the stencil-
Language   Origin    Compiler      Flags            Communication    Native compiler   Native flags
SAC        U. Kiel   sac2c 0.8.0   -mtdynamic 14    shared memory    GNU gcc 2.95.1    -O1
ZPL        U. Wash   zc 1.15a                       MPI              SUNW cc 5.0       -fast

Figure 7: A summary of the compilers used in these experiments. The compiler, version number, and command-line arguments used are given for each. In addition, the communication mechanism used by the compiler is noted.
4 Performance Evaluation
4.1 Methodology
To evaluate performance, we ran experiments on two hardware platforms, the Cray T3E and the Sun Enterprise
^5 This has apparently been fixed for the next release, and we hope to obtain a copy for the final version of this paper for fairer comparison.
Figure 8: Speedup curves for the T3E: (a) MG Class B, plotted as speedup over 4 processors; (b) MG Class C, plotted as speedup over 16 processors; both on up to 256 processors, with curves for linear speedup, CAF, ZPL (SHMEM), F90+MPI, ZPL (MPI), and HPF. Note that there was not enough memory to run the smaller processor set sizes for these classes. The fastest running time on the smallest number of processors is used as the basis for computing speedup for all benchmarks. Note that the Class C HPF code was unable to run due to its memory requirements.
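Stated as a formula (the notation is ours, with T_L(p) denoting benchmark L's running time on p processors and p_min the smallest processor count plotted in a panel), the curves use a common baseline:

    speedup_L(p) = min over all benchmarks of T(p_min)  /  T_L(p)

so every curve in a panel is measured against the single fastest time observed at p_min.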
ularly since the F90+MPI implementation can only run on processor sets that are powers of two. Memory limitations only allow the class A and B problem sizes to be run. On this platform, F90+MPI significantly beats the MPI version of ZPL with or without the hand-coded optimization. This is most likely due to the relative quality of code produced by the Fortran and C compilers on this platform, as well as the differing surface-to-volume characteristics induced by the benchmarks' allocation schemes. In coming weeks we will be performing more analysis of these numbers.
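As a rough, idealized illustration of the surface-to-volume effect mentioned above (assuming a cubic n^3 grid divided evenly among p processors and ignoring ghost-cell widths): a 1-D slab decomposition leaves each processor an n x n x (n/p) slab whose two communicated faces total about

    2 * n^2

elements regardless of p, whereas a 3-D block decomposition gives each processor an (n / p^(1/3))^3 cube whose six faces total about

    6 * (n / p^(1/3))^2 = 6 * n^2 / p^(2/3)

elements, so the per-processor communication volume shrinks as processors are added only for the higher-dimensional distributions.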
Note that SAC lags slightly behind F90+MPI and ZPL on the class A problem size. This is certainly due in part to the bug that forced it to be compiled using -O1; we saw nearly a 50% improvement in the single-processor running times when SAC was compiled with -O3. In addition, a memory leak was discovered in the class B problem size which prevented it from completing at all. No doubt this memory leak also impacted the performance of the class A runs. We anticipate a fixed version of the compiler in the near future to update these numbers.
Figure 10: Speedup curves for benchmarks on the Sun Enterprise 5500: (a) MG Class A and (b) MG Class B, each plotting speedup on up to 14 processors with curves for linear speedup, F90+MPI, ZPL (hand-optimized), ZPL (MPI), and SAC. Note that the F90+MPI version is unable to run on more than 8 processors due to its requirement that a power-of-two number of processors be used.
5 Related Work
For each of the languages in this paper, work has been done on expressing and optimizing hierarchical

OpenMP [10] is a standard API that supports development of portable shared-memory parallel programs and is rapidly gaining acceptance in the parallel computing community. In recent work, OpenMP has been used to implement irregular codes [17], and future studies should evaluate its suitability for hierarchical multigrid-style problems.
Figure 11: An interpretation of this paper's experiments, plotting performance (vertical axis) against expressiveness (horizontal axis) for ZPL (Sun), ZPL (T3E), SAC, HPF, F90+MPI, and CAF in a chart titled "Language Comparison". Performance is summarized for each benchmark using the running time of the largest problem size on the largest number of processors, since parallel performance for large applications is the goal. Expressiveness is measured in total lines of productive code. Both numbers are normalized to the F90+MPI implementation of MG such that higher numbers imply better performance and expressiveness.
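One plausible reading of this normalization (an assumption on our part, since the exact formula is not spelled out above) is that both quantities are inverted ratios against the F90+MPI baseline, so that F90+MPI itself scores 1.0 on both axes and higher is better:

    performance(L)    = time(F90+MPI)  / time(L)
    expressiveness(L) = lines(F90+MPI) / lines(L)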
6 Conclusions
We have undertaken a study of parallel language support for hierarchical programming using the NAS MG benchmark. We have made an effort to study languages that are available and in use on current parallel

In future work, we plan to continue our study of language support for hierarchical applications, moving towards larger and more realistic applications such as AMR and FMM that are not dense at every level of the hierarchy [19, 12]. Our approach will be based on extending ZPL's region concept to allow for sparse regions [9]. We are also in the process of implementing the stencil-based optimization described in Section 3.2 in order to give the user the expressive power of writing stencils naturally while achieving the same performance as a hand-coded scalar implementation.
Acknowledgments. The authors would like to thank Sven-Bodo Scholz, Clemens Grelck, and Alan Wallcraft for their help in writing the versions of MG used in this paper and for answering questions about their respective languages. We'd also like to thank the technical reporting service at PGI and the NAS researchers who helped us obtain the HPF implementation. This work was done using a grant of supercomputer time from the Alaska Region Supercomputer Center and the University of Texas at Austin, for which we are extremely grateful. Finally, Sung-Eun Choi, E Christopher Lewis, and Ton A. Ngo must be thanked for their invaluable help in inspiring and designing the ZPL language features which led to this work.
References

[1] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS parallel benchmarks.

[9] Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder. A region-based approach for sparse parallel computing. Technical Report UW-CSE-98-11-01, University of Washington, November 1998.

[10] Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1), January/March 1998.

[12] M. A. Epton and B. Dembart. Multipole translation theory for the three-dimensional Laplace and Helmholtz equations. SIAM Journal on Scientific Computing, 16(4):865–897, July 1995.

[13] S. J. Fink, S. R. Kohn, and S. B. Baden. Efficient run-time support for irregular block-structured applications. Journal of Parallel and Distributed Computing, 50:61–82, May 1998.

[14] Message Passing Interface Forum. MPI: A message passing interface standard. International Journal of Supercomputing Applications, 8(3/4):169–416, 1994.

[15] Michael Frumkin, Haoqiang Jin, and Jerry Yan. Implementation of NAS parallel benchmarks in High Performance Fortran. Technical Report NAS-98-009, NASA Ames Research Center, Moffett Field, CA, September 1998.

[16] High Performance Fortran Forum. High Performance Fortran Specification Version 1.1. November 1994.

[18] Steve Karmesin, James Crotinger, Julian Cummings, Scott Haney, William Humphrey, John Reynders, Stephen Smith, and Timothy J. Williams. Array design and expression evaluation in POOMA II. In D. Caromel, R. R. Oldehoeft, and M. Tholburn, editors, Computing in Object-Oriented Parallel Environments, volume 1505 of Lecture Notes in Computer Science, pages 231–238. Springer-Verlag, 1998.

[20] Ton A. Ngo, Lawrence Snyder, and Bradford L. Chamberlain. Portable performance of data parallel languages. In SC97: High Performance Networking and Computing, November 1997.

[21] R. W. Numrich and J. K. Reid. Co-Array Fortran for parallel programming. Technical Report RAL-TR-1998-060, Rutherford Appleton Laboratory, Oxon, UK, August 1998.