
LBNL‐1014E

NERSC-6 Workload Analysis and Benchmark Selection Process

Katie Antypas, John Shalf, and Harvey Wasserman

National Energy Research Scientific Computing Center Division
Ernest Orlando Lawrence Berkeley National Laboratory

Berkeley, CA 94720

August 13, 2008

This work was supported by the U.S. Department of Energy's Office of Science, Office of Advanced Scientific Computing Research, Mathematical, Information and Computational Sciences Division under Contract No. DE-AC02-05CH11231.


Abstract

This report describes efforts carried out during early 2008 to determine some of the science drivers for the "NERSC-6" next-generation high-performance computing system acquisition. Although the starting point was existing Greenbooks from DOE and the NERSC User Group, the main contribution of this work is an analysis of the current NERSC computational workload combined with requirements information elicited from key users and other scientists about expected needs in the 2009-2011 timeframe. The NERSC workload is described in terms of science areas, computer codes supporting research within those areas, and a description of the key algorithms that comprise the codes. This work was carried out in large part to help select a small set of benchmark programs that accurately capture the science and algorithmic characteristics of the workload. The report concludes with a description of the codes selected and some preliminary performance data for them on several important systems.


Introduction

NERSC's science-driven strategy for increasing research productivity focuses on the balanced and timely introduction of the best new technologies to benefit the broadest subset of the NERSC workload. The goal in procuring new systems is to enable new science discoveries via computations of a scale that greatly exceeds what is possible on current systems. The key to understanding the requirements for a new system is to translate scientific needs expressed by simulation scientists into a set of computational demands, and from these into a set of hardware and software attributes required to support them. This combination of requirements and demands is abstracted into a set of representative application benchmarks that by necessity also capture characteristic styles of coding and thus assure a workload-driven evaluation. The importance of workload-based performance analysis was best expressed by Ingrid Bucher and Joanne Martin in a 1982 LANL technical report, where they stated, "Benchmarks are only useful insofar as they model the intended computational workload."

At NERSC these carefully chosen benchmarks go beyond representing a set of individual scientific and computational needs. The benchmarks also comprise the Sustained System Performance (SSP) metric, an aggregate measure of the real, delivered performance of a computing system. The SSP is evaluated at discrete points in time and also as an integrated value to give the system's potency, meaning an estimate of how well the system will perform the expected work over some time period. Taken together, the NERSC SSP benchmarks, their derived aggregate measures, and the entire NERSC workload-driven evaluation methodology create a strong connection between science requirements, how the machines are used, and the tests we use.

This document begins with an examination of the key science drivers for DOE high-end computation that were identified in the 2005 Greenbook report. We identify any changes in the science drivers and commensurate requirements that might affect the selection of next-generation systems. We then provide an introduction to NERSC's workload analysis and benchmark selection process. Finally, we present an analysis of the performance of the selected benchmark applications. A brief snapshot of benchmark performance as of May 2008 is presented. These are preliminary data, measured, in general, with no attempt at optimization; in other words, representing what a user might observe upon initial porting of the codes. In some cases the data contain estimates, either because the full run is too long or because it requires more cores than we have access to on the given machine.


Overview of the NERSC Workload

NERSC serves a diverse user community of over 3,000 users and 400 distinct projects. The workload comprises some 600 codes serving the diverse science needs of the DOE Office of Science research community. Despite the diversity of codes, the median job size at NERSC is a substantial fraction of the overall size of NERSC's flagship XT4 computing system. NERSC's workload covers the needs of 15 science areas across six DOE Office of Science divisions, in addition to users from other research and academic institutions outside of the DOE. The pie charts in Figure 1 show the percentages of cycles at NERSC awarded to the various DOE offices and science areas.

Figure 1: Awards by DOE office (left) and awards by science area (right).

In this study we focused on the largest components of the NERSC workload, which include Fusion, Chemistry, Materials Science, Climate Research, Lattice Gauge Theory, Accelerator Physics, Astrophysics, Life Sciences, and Nuclear Physics, as shown in Figure 2.


Figure 2: The NERSC workload study focused on dominant contributors to the NERSC workload.

Climate Science

Climate science computing at NERSC includes a variety of prediction and simulation activities, including critical support of the U.S. submission to the Intergovernmental Panel on Climate Change (IPCC). Figure 3 shows that climate science allocations are dominated by CAM, the Community Climate System Model (CCSM3), and weather modeling codes. There are INCITE allocations that are also large players, but because INCITE is not a steady-state allocation we do not consider these projects further. With the INCITE allocations removed, the dominance of CAM and CCSM3 is clear (Table 1). Furthermore, CAM is the dominant computational component of the fully coupled CCSM3 climate model. Therefore, CAM is a good proxy for the most demanding components of the climate workload.


Figure 3: Distribution of the climate allocation with and without the contribution of the INCITE allocations. The INCITE awards for "forecast" and "filter" codes are much larger than they are in the steady-state workload.

Table 1: Breakdown of the NERSC workload in Climate Sciences. CCSM and CAM account for >74% of the workload. Since CAM is the dominant component of CCSM, CAM is clearly a good proxy for the climate requirements.

In order to be useful for climate research, CAM must achieve a target performance of 1000x faster than real time. This places a stringent requirement on the achieved performance of the climate application on the target system architecture. The current mesh resolution ultimately limits the exposable parallelism of the application. However, if the resolution is increased, the size of the simulation time steps must be reduced to satisfy the Courant stability condition. Increasing the resolution by 2x requires 4x more time due to the shorter time steps. Therefore, the climate application pushes towards higher per-processor sustained performance, which runs counter to current microprocessor architectural trends.
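As a reminder of why finer meshes force shorter steps: for an explicit scheme the Courant (CFL) stability condition can be written schematically as

    \Delta t \;\le\; C\,\frac{\Delta x}{|u|},

where C is a scheme-dependent constant and u a characteristic advection speed (both symbols are our illustrative stand-ins, not notation from this report). Halving the mesh spacing \Delta x halves the allowable time step \Delta t, so a fixed simulated interval requires proportionally more steps on top of the growth in the number of grid points.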


Although spectral CAM dominates the current workload, the IPCC is rapidly migrating to the finite-volume formulation of CAM (fvCAM). Finite volume CAM is more scalable than spectral CAM and is better able to exploit loop-level parallelism using a hybrid OpenMP+MPI programming model, and can therefore take advantage of SMP nodes. The hybrid model may mitigate the need for improved per-processor serial performance but would require wider SMP nodes. In anticipation of the larger mesh resolution targets, we selected the D mesh (~0.5 degree), which is likely to be the production resolution for some time.

Materials Science

Simulations in materials science play an indispensable role in nanoscience research, an area identified as one of the seven highest priorities for the Office of Science. Of particular interest recently are studies of quantum dots, due to their emerging importance in the miniaturization of electronic and optoelectronic devices, and computational needs seem virtually unlimited. For example, two specific applications that scientists in this area foresee include the following:

• Electronic structures and magnetic materials. The current state of the art is a 500-atom structure, which requires 1.0 Tflop/s sustained performance (typically for several hours) and 100 GB main memory. Future requirements (a hard disk drive simulation, for instance) are for a 5000-atom structure, which will require roughly 30 Tflop/s and 4 TB main memory.

• Molecular dynamics calculations. The current state of the art is for a 10^9-atom structure, which requires 1 Tflop/s sustained and 50 GB memory. Future requirements (an alloy microstructure analysis) will require 20 Tflop/s sustained and 5 TB main memory.

The materials workload presents a dizzying array of codes, as shown in Figure 4. There are 65 accounts under the BES/materials science category, which account for 20% of the 324 total NERSC accounts. The total HPC allocation for these accounts was 4.1M hours before Bassi was online and 8.8M hours after Bassi was online. Although this allocation as a fraction of the total is smaller than it was a few years ago, it still represents 13% of the total NERSC allocation (66.7M hours after Bassi was online). The materials science percentage has decreased partly because of the creation of special programs like SciDAC and INCITE. In this science area there is an average of about one code per account (or group of users). However, since the same code can be used by different accounts, one code is used by 2.15 user groups on average. Except for the VASP code, which is used by 23 groups, the majority of the codes are used by fewer than 5 groups, and many of the codes are used by only one group. This is a very diverse community, with many groups using their own codes.


Figure 4: The materials science area presents a dizzying array of codes, none of which dominates the workload.

The different codes can be classified into six categories based on their physics and the corresponding algorithms. These categories, and their corresponding allocations, are: density functional theory (DFT, 74%); beyond DFT (GW+BSE, 6.9%); quantum Monte Carlo (QMC, 6.7%); classical molecular dynamics (CMD, 6.4%); classical Monte Carlo (CMC, 3.1%); and other partial differential equations (PDE, 2.9%). There is an overwhelming preference for the DFT method, owing to the current success of that method in ab initio materials science simulations. However, the DFT method itself has different numerical approaches, such as plane wave DFT, Green's function DFT, localized basis and orbital DFT, muffin-tin sphere type DFT, and real space grid DFT. The most popular (both in terms of number of codes and HPC hours) is plane wave DFT, for which there are 12 codes accounting for 1.6M hours.

Density Functional Theory (DFT)

In the DFT method, one needs to solve the single-particle Schrödinger equation (a second-order partial differential equation). Typically, 5-10% of the lowest eigenvectors are needed from this solution, and the number of eigenvectors is proportional to the number of electrons in the system. For example, for a thousand-atom system, a few thousand eigenvectors are needed. There is a key difference between the complexity of this approach and that of most engineering problems (e.g., fluid dynamics, climate simulation, combustion, electromagnetics), where only a small, fixed number of time-evolving or eigenvector fields are solved. In DFT, the Schrödinger equation itself depends on the eigenvectors through the density function (which gives rise to the name density functional theory). Thus, it is a nonlinear problem, but one that can be solved by self-consistent iteration of the linearized eigenstate represented by Schrödinger's equation, or by direct nonlinear minimization. Currently, most large-scale calculations are done using self-consistent iterations.


Numerically, what distinguishes the different DFT methods and codes are the different basis sets used to describe the wavefunctions (the eigenvectors).
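For concreteness, the nonlinear eigenproblem described above has the standard Kohn-Sham form (written here in atomic units as textbook background, not as notation taken from this report):

    \Big[-\tfrac{1}{2}\nabla^2 + V_{\mathrm{eff}}[\rho](\mathbf{r})\Big]\,\psi_i(\mathbf{r}) = \varepsilon_i\,\psi_i(\mathbf{r}),
    \qquad
    \rho(\mathbf{r}) = \sum_{i \in \mathrm{occ}} |\psi_i(\mathbf{r})|^2.

The effective potential V_eff depends on the density \rho, which is in turn built from the occupied eigenvectors \psi_i; self-consistent iteration alternates between solving the linearized eigenproblem and updating \rho.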

Plane Wave DFT

Plane wave DFT uses plane waves to describe the wavefunctions, while real space DFT uses a regular real-space grid. Due to the sharp peak of the potential near atomic nuclei, special care is needed to choose different basis sets. Besides the plane wave and real-space grid, other conventional basis sets include: the atomic orbital basis set, where the eigenvectors of the atomic Schrödinger equation are used to describe the wavefunctions in a solid or molecule; Gaussian basis sets, which are more often used in quantum chemistry due to their analytical properties; the muffin-tin basis, where a spherical hole is cast out near each nucleus and spherical harmonics and Bessel functions are used to describe the wavefunction inside the hole; augmented plane waves, where spherical Bessel functions near the nuclei are connected with the plane waves in the interstitial regions and used as the basis set; and wavelet basis sets.

In terms of the methods used to solve the eigenstate problem, both iterative schemes and direct eigensolvers have been used in different codes. In plane wave DFT, an iterative method (e.g., conjugate gradient) is often used; in the atomic orbital, Gaussian, and augmented plane-wave (FLAPW) methods, direct dense solvers (ScaLAPACK) are often used; and in real-space grid methods, sparse matrix solvers are used. There are three key components of the iterative plane wave method: a 3D global FFT for converting between real and reciprocal space, orthogonalization of basis functions using dense linear algebra, and pseudopotential calculations using dense linear algebra. The most time-consuming steps are the matrix-vector multiplication and vector-vector multiplication for orthogonalization. For highly parallel computation, the FFT is the primary bottleneck. However, if the problem size is increased, the computational requirements of the dense linear algebra grow as O(N^3) whereas the FFT requirements grow as O(N^2). Therefore, weak-scaling a DFT problem can produce dramatic improvements in the delivered flop rates given the locality of the linear-algebra calculation; but the benefits of weak scaling are highly problem dependent. For example, time-domain experiments do not benefit at all from this effect.
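To make the dense-linear-algebra component concrete, the orthogonalization step reduces largely to forming an overlap matrix among the bands. The following Fortran sketch shows that kernel with a single BLAS call; it is our illustration rather than code from any of the programs named here, the sizes and array names are invented, production plane-wave codes use complex coefficients (and ZGEMM), and a BLAS library must be linked:

    program overlap_sketch
      implicit none
      ! Illustrative sizes: ng plane-wave coefficients per band, nb bands.
      integer, parameter :: ng = 4096, nb = 128
      real(8) :: psi(ng, nb)   ! band coefficients (real here for brevity)
      real(8) :: s(nb, nb)     ! overlap matrix S = Psi^T * Psi

      call random_number(psi)
      ! One call does O(nb*nb*ng) work; since both nb and ng grow with the
      ! number of atoms N, this kernel grows roughly as O(N^3) while the
      ! FFT work grows only as O(N^2), as discussed above.
      call dgemm('T', 'N', nb, nb, ng, 1.0d0, psi, ng, psi, ng, 0.0d0, s, nb)
      print *, 's(1,1) =', s(1,1)
    end program overlap_sketch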

GW+BSE

GW+BSE is one approach to calculating excited states and optical spectra. It operates primarily on large, dense matrices. The most time-consuming parts are generating and diagonalizing these matrices. The diagonalization is often done using a dense eigensolver (with libraries such as ScaLAPACK). The dimension of the matrix is proportional to the square of the number of electrons in the system.

Quantum Monte Carlo (QMC)

Quantum Monte Carlo (QMC) uses stochastic random walk methods to carry out the multidimensional integration of the many-body wavefunctions. Since it needs an ensemble sum over different independent walkers, it is embarrassingly parallel. QMC is a very accurate method, but it suffers from statistical noise; thus, it is difficult to use for atomic dynamics (where the forces on the atoms are needed).



Classical Molecular Dynamics and Classical Monte Carlo

In classical molecular dynamics (MD), the force calculation step is done in parallel. Since the classical force field formalism is local in nature (except for the electrostatic force), efficient parallelization is possible, as in the NAMD code. Classical Monte Carlo is sometimes used instead of classical MD; it is concerned with the time-evolving process rather than an ensemble sum (as in quantum MC), and as a result its parallelization is not so trivial. There are many recent developments in parallel schemes for classical MC (besides the possible approach of parallel evaluation of the total energy, as in classical MD). Other PDEs include the Maxwell equations (e.g., in photonics studies) and time-evolving differential equations for grain boundary and defect dynamics, etc.

Once the materials science workload is reorganized using this taxonomy for the computational methods, the distribution of algorithms becomes clearer (Figure 5). The density functional theory (DFT) codes clearly dominate allocations with 72% of the overall workload. Of that subcategory, plane wave DFT codes (which include VASP, PARATEC, QBox, and PETot) are the dominant component. The DFT codes (particularly plane wave DFT) have very similar computational requirements, as previously discussed, with a common set of dominant components in their execution profile:

• the 3D FFT, which requires global communication operations that dominate at high concurrencies and low atom counts;

• dense linear algebra for orthogonalization of wavefunctions, which dominates for large numbers of electronic states in large atomic systems with many electrons but has very localized communication; and

• dense linear algebra for calculating the pseudopotential, which also dominates for large numbers of atoms.

Since VASP was not available for public distribution and QBox is export-controlled, we chose PARATEC as the best available proxy for the requirements of the materials science workload. We also get to share effort with the NSF Track-1 and Track-2 procurement benchmarks, which have also adopted PARATEC as a proxy for the materials science component of their workloads.


Figure 5: Materials science codes categorized by algorithm.

Chemistry

Despite chemistry being one of the most mature simulation fields, the role of computation as a fundamental tool to aid in the understanding of chemical processes continues to grow steadily; one estimate places the number of electronic structure calculations published yearly at around 8,000. The dominant single code in the chemistry workload is S3D, due to its substantial INCITE allocation. However, if we set aside INCITE applications because they are not steady-state components of the workload, we see that the chemistry workload is dominated by VASP, which is also dominant in the materials science workload. As a result, we chose to focus on other components of the chemistry workload to emphasize a different set of criteria in the algorithm space. S3D bears more similarity to the computational requirements of codes in the astrophysics workload, where there are many CFD simulations. We decided to evaluate S3D as a component of that workload simply due to the similarities in the computational methods.

Table 2: The top 50% of the chemistry workload shows a diversity of codes.


Table 2 shows that the top 50% of the chemistry workload has a diversity of codes. The leader, ZORI, is a QMC code and, as such, emphasizes per-processor performance but is not very demanding of the interconnect. The remaining codes are either DFT (very similar to materials science) or quantum chemistry codes. For quantum chemistry the most scalable codes are NWChem and GAMESS, and since these are very demanding of the interconnect they can be useful in differentiating machines based on support for efficient one-sided messaging. GAMESS was selected for NERSC benchmarking. It is also employed by the DOD-MOD TI-0x procurements, which allows us to collaborate with DOD and allows vendors to leverage their benchmarking efforts for multiple system bids.

Fusion

Fusion has long been the dominant component of the NERSC workload, accounting for more than 27% of the overall workload in 2007. The workload is currently dominated by five codes, which account for more than 50% of the overall fusion workload (Table 3). Those codes, OSIRIS, GEM, NIMROD, M3D, and GTC, fall into two basic categories. Three of these (OSIRIS, GTC, and GEM) are particle-in-cell (PIC) codes, whereas M3D and NIMROD are both continuum codes (PDE solvers on semi-structured grids). The percentage of per-processor peak achieved by particle-in-cell codes is generally low because the gather/scatter operations required to deposit the charge of each particle on the grid that will be used to calculate the field involve a large number of random accesses to memory, making the codes sensitive to memory access latency. GTC is certainly representative in this regard. However, unlike other PIC codes, GTC does not require a global 3-D FFT, so performance and scalability issues associated with this aspect of PIC codes must be covered by benchmarks from other components of the NERSC workload.


Table 3: The top 50% of the fusion time usage is dominated by five codes.

GTC now utilizes two parallel decompositions: a one-dimensional domain decomposition in the toroidal direction and a particle distribution within each domain. Processors assigned to the same domain have a copy of the local grid but only a fraction of the particles, and are linked with each other by a domain-specific communicator used when updating grid quantities. An Allreduce operation is required within each domain to sum up the contribution of each processor, which can lead to lower performance in certain cases. Although it is not clear that GTC adequately captures the parallel scaling characteristics of all other PIC codes, it emerges as a clear winning candidate for benchmarking by virtue of its excellent scalability, ease of porting, lack of intellectual property encumbrances, and general simplicity. On the other hand, we found the fusion continuum codes to be exceedingly complex as benchmarks, one example of which is the required use of external libraries not guaranteed to be present on typical vendor in-house benchmarking machines. However, we also found that the semi-implicit solvers employed by the continuum codes, presumed to be the most time-consuming portion of a run, had similar resource requirements to astrophysics codes that model stellar processes (MHD codes used for modeling core collapse supernovae and astrophysical jets). Therefore, we chose to focus on the PIC codes as proxies for the fusion workload requirements.

Accelerator Modeling

Simulation is playing an increasingly prominent and valuable role in the theory, design, and development of particle accelerators, detectors, and X-ray free electron lasers, which are among the largest, most complex scientific instruments in the world. The accelerator workload is dominated by PIC codes, but sparse matrix codes also have strong representation through Omega3P (see Table 4).

Table 4. Top Codes in Accelerator Modeling. The PIC codes VORPAL and OSIRIS dominate the accelerator modeling workload. Even as we look further down the list to 70% of the workload, the allocations are dominated by PIC codes. The exception is Omega3P, which is bottlenecked by its eigensolver, which depends on a sparse-matrix direct solver.

The sparse matrix direct solvers have well-known scalability issues that will ultimately force such codes to adopt new algorithmic approaches (likely sparse iterative solvers).


There was concern that selection of current sparse matrix code examples would not be reflective of the state-of-the-art computational methods that will emerge for these problems in the 2009 timeframe. Therefore, we concentrated our attention on the PIC component of the workload, which is clearly dominant for this science area. The PIC codes in accelerator modeling have unique load-balancing problems and potential inter-processor communication issues that distinguish them from the plasma simulation PIC codes represented by GTC. Whereas plasma microturbulence codes are generally well load balanced due to the largely uniform distribution of the plasma throughout the computational domain, many accelerator problems depend on bunching up the particles because of the narrow focus of the accelerator beam. This can lead to serious issues for load balancing. Additionally, some of the accelerator codes, notably those comprising the IMPACT suite from LBNL, use an FFT algorithm to solve Poisson's equation, with its associated global, all-to-all communication.

IMPACT-T was chosen as the benchmark representing beam dynamics simulation. Part of a suite of codes used for this purpose, IMPACT-T is used specifically for photoinjectors and linear accelerators. It uses a 2-D domain decomposition of the 3-D Cartesian grid and is one of the few codes used in the photoinjector community that has a parallel implementation. It has an object-oriented Fortran 90 coding style that is critical for providing some of its key features. For example, it has a flexible data structure that allows particles to be stored in "containers" with common characteristics such as particle species or photoinjector slices; this allows binning in the space-charge calculation for beams with large energy spreads.


Astrophysics

In the past, NERSC resources have been brought to bear on some of the most significant problems in computational astrophysics, including supernova explosions, thermonuclear supernovae, and analysis of cosmic microwave background data. Work done by various researchers at LBNL has shown that I/O benchmarking is most effectively done using I/O kernel benchmarks such as IOR and stripped-down applications such as MADCAP, rather than using full application codes. This is in part because of the flexibility IOR offers in measuring a range of file and block sizes, I/O methods, and file sharing approaches. Since I/O benchmarking has essentially been factored out from the main workload, we concentrate here on the other solvers in the astrophysics workload, which include a variety of radiation hydrodynamics, MHD, and other astrophysical fluids calculations.

PDE solvers on block-structured grids are a dominant component of this workload. Although many of these codes currently employ an explicit time-stepping scheme, many of the scientists in the field indicate they are moving as quickly as possible toward implicit schemes (such as Newton-Krylov schemes) that have more demanding communication requirements but offer a much better time-to-solution. In particular, many implicit schemes are limited by fast global reductions. Virtually all modern Krylov subspace algorithms (i.e., anything in the family of conjugate gradient, biconjugate gradient, biconjugate gradient-stabilized, generalized minimum residual, etc.) for sparse linear systems require inner products of vectors in order to function. In Fortran, for two vectors X and Y, this would look like

      inner_prod = 0.0
      do i = 1, n
        inner_prod = inner_prod + X(i)*Y(i)
      end do

When the vectors are distributed across a parallel architecture, the do-loop sum is only over the elements of the vector that exist on each processor. The global summation is then done over the entire process topology, with a local sum on each processor followed by a global reduction operation to get the total sum over the process topology. Thus it would look roughly like:

      local_sum = 0.0
      do i = 1, nlocal
        local_sum = local_sum + X(i)*Y(i)
      end do
      call MPI_Allreduce(local_sum, inner_prod, 1, &
                         MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)


Virtually all Krylov subspace algorithms require these kinds of inner product summations. Most classes of implicit algorithms in turn rely heavily on these Krylov subspace algorithms: sparse linear systems, sparse nonlinear systems, optimization, sensitivity analysis, etc. Therefore, we felt it necessary to select a benchmark that emphasizes the demands of implicit time-stepping schemes so that we do not select a machine that is only good for explicit fluid dynamics algorithms while ignoring the many new applications that utilize implicit approaches that are just starting to become viable. Unfortunately, at this time very few codes have been developed that fully implement implicit time-stepping schemes. There were no clearly dominant codes that used implicit time stepping. Such codes are still in development and are not very portable.

We selected the MAESTRO code in part because of close interactions with the developers. MAESTRO is a low Mach number astrophysics code used to simulate the long-time evolution leading up to a Type Ia supernova explosion. The NERSC-6 MAESTRO simulation examines convection in a white dwarf and includes both explicit and implicit solvers. MAESTRO is capable of decomposing a uniform grid of constant resolution into a specified number of non-overlapping grid patches, but in the NERSC-6 benchmark it does not refine the grid.

AMR

Although adaptive mesh refinement (AMR) is a powerful technique with the potential to reduce the resources necessary to solve otherwise intractable problems in a variety of computational science fields, it brings with it a variety of additional performance challenges that are worthy of testing in a workload-driven evaluation environment. In particular, AMR methods have the potential to add non-uniform memory access and irregular inter-processor communication to what might otherwise have been relatively regular and uniform access patterns. We represent AMR in the NERSC-6 benchmark suite with a stripped-down application that exercises an elliptic solver in the Chombo AMR framework developed at LBNL. Chombo provides a distributed infrastructure for implementing finite difference methods for PDEs on block-structured, adaptively refined rectangular grids. The benchmark problem uses a solver with a fixed number of iterations and weak-scales the problem from 256 to 4096 cores. For more information see http://www.nersc.gov/projects/SDSA/software/?benchmark=AMR.

Nuclear Physics

It is no surprise that MILC and other variants of lattice QCD codes dominate the workload.


MILC was also adopted by the NSF procurements as a benchmark, so we are able to leverage the investment in packaging this benchmark for the procurement.

Biology and Life Sciences

Codes in this group seem to fall into three basic categories. One consists of codes for sequence searching, comparison, identification, and/or prediction. These are basically informatics-type codes; they differ from typical HPC workloads and are not represented to any extent in NERSC benchmark suites. These codes had a moderate allocation in 2007, but we concluded that explicitly representing them in the benchmark suite was unnecessary because the codes are basically constrained by clock speed (integer processing rates, actually) and thus do not present a sufficiently challenging or unique problem to the architecture.

The second category consists of a variety of molecular dynamics codes, the most important of which are AMBER, NAMD, and CHARMM. The capability these codes offer is merged with requirements for protein folding analysis to create what is referred to as molecular dynameomics, one of the most challenging areas of computational science in general. The characteristics of these codes are, in a macro sense, similar to, if not the same as, the molecular dynamics codes used in materials science; however, individual life-science-related molecular dynamics simulations are likely to be less scalable and more inter-processor-communication limited than those of materials science, since the inter-atomic potentials are likely to be more subtle and longer range.

The third category consists of codes that also are, or look like, materials science or chemistry codes, either for quantum Monte Carlo, classical quantum chemistry, or DFT methods. The computational characteristics of these have already been discussed and likely do not differ for life science applications. The total life science NERSC allocation in recent years has remained small, and so no code has been chosen to uniquely represent this area.

Benchmark Selection

Benchmark selection is an extraordinarily difficult process. As per Ingrid Bucher's 1982 observation, "benchmarks are only useful insofar as they model the workload that will run on a given system." To that end, we use information from our workload analysis to aid in the selection of benchmarks that are a proxy for the anticipated NERSC system requirements. Ideally, the benchmarks must cover the requirements of the target HPC workload, but there are a number of practical considerations that must be taken into account during the selection process.

Coverage: We assess how well the codes cover the science areas that dominate the NERSC workload as well as the coverage of the algorithm space. This creates a two-dimensional matrix of computational methods and science areas that must be covered.


Portability: The selected codes must be able to run on a wide variety of systems to allow valid comparisons across architectures. The implementation of the codes must not be too architecture-specific, and the "build" systems for these codes must be robust so that they can be used across many system architectures.

Scalability: It is important to emphasize applications that justify scalable HPC resources. Over time, codes that are unable to scale with the resources tend to have a diminishing role in the HPC workload. In selecting scalable codes, we are not selecting those that scale easily; rather, we select those that present some of the most demanding requirements but are otherwise engineered to expose sufficient parallelism to run on highly parallel systems.

Open Source: Since the benchmarks must be distributed without restriction to vendors for benchmarking purposes, no proprietary or export-controlled code can be used. Only open-source applications that allow us to share snapshots with vendors can be considered.

Availability: Packaging appropriate benchmark problems requires close collaboration with the developer for assistance and support. Without the assistance of the user community, we cannot select appropriate problems for evaluating systems, and hence will be unable to gauge the value of a given system.

Table 5: Matrix of codes for the first cut of the selection process. The codes were ranked based on their scores as candidate benchmarks. Codes that could not be widely distributed or were difficult to port were slowly eliminated until the final six benchmarks remained.

With these considerations in mind, the benchmark team selected codes based on their ability to fulfill these attributes and ranked them based on their scores in each of these areas as candidate benchmark codes. After gaining experience with the codes by building them, or through detailed discussions with the developers, we slowly reduced the list down to six target benchmark codes. The final problem sizes and benchmark configurations were determined after the final codes were selected.


These benchmarks are used individually to understand and compare system performance. Although our analysis of individual application performance has found specific bottlenecks on each system, distilling these down to a set of specific hardware features is unlikely to capture the complex interactions between full applications and correspondingly complex system architectures. Therefore, the performance of full applications, with their associated complex behavior, is used to assess the value of systems. We feel that using a carefully selected set of full applications as a proxy for full workload performance best encodes the DOE Office of Science workload requirements for the purpose of comparing HPC systems.

Table 6: Final Benchmark Selection Matrix together with problem sizes and concurrencies.

Benchmark   Science Area                 Algorithm Space                  Base Case Concurrency           Problem Description                           Lang        Libraries
CAM         Climate (BER)                Navier Stokes CFD                56, 240; strong scaling         D grid (~0.5 deg resolution); 240 timesteps   F90         netCDF
GAMESS      Quantum Chem (BES)           Dense linear algebra, DFT        256, 1024 (same as TI-09)       rms gradient, MP2 gradient                    F77         DDI, BLAS
GTC         Fusion (FES)                 PIC, finite difference           512, 2048; weak scaling         100 particles per cell                        F90
IMPACT-T    Accelerator Physics (HEP)    Largely PIC, FFT component       256, 1024; strong scaling       50 particles per cell (currently)             F90
MAESTRO     Astrophysics (HEP)           Low Mach hydro; block            512, 2048; weak scaling         64x64x128 grid pts per proc; 10 timesteps     F90         Boxlib
                                         structured-grid multiphysics
MILC        Lattice Gauge Physics (NP)   Conjugate gradient, sparse       256, 1024, 8192; weak scaling   8x8x8x9 local grid; ~70,000 iters             C, assem.
                                         matrix; FFT
PARATEC     Material Science (BES)       DFT; FFT, BLAS3                  256, 1024; strong scaling       686 atoms, 1372 bands, 20 iters               F90         ScaLAPACK


Science Area          Dense Linear    Sparse Linear   Spectral         Particle        Structured      Unstructured or
                      Algebra         Algebra         Methods (FFT)    Methods         Grids           AMR Grids

Accelerator Science                                   X (IMPACT-T)     X (IMPACT-T)    X (IMPACT-T)
Astrophysics                          X (MAESTRO)                                      X (MAESTRO)     X (MAESTRO)
Chemistry             X (GAMESS)
Climate                                               X (CAM)                          X (CAM)
Combustion                                                                             X (MAESTRO)     X (AMR Elliptic)
Fusion                                                                 X (GTC)         X (GTC)
Lattice Gauge                         X (MILC)        X (MILC)         X (MILC)        X (MILC)
Material Science      X (PARATEC)                     X (PARATEC)                      X (PARATEC)

Table 7: Algorithm coverage by the selected SSP applications.

Evolution from NERSC-5 Application Benchmarks

We point out that the selected application benchmarks have changed from those used for the NERSC-5 procurement. The following considerations led to these changes:

Workload Evolution: The workload has changed modestly since the NERSC-5 procurement. Two applications from the last benchmark suite (NAMD and MadBENCH) were removed because of their reduced contribution to the workload and were replaced by MAESTRO and IMPACT-T, respectively.

Multicore: Power constraints have ended Dennard scaling, so future performance improvements of HPC systems hinge on doubling the number of processor cores every 18 months rather than on the historical clock-frequency scaling. We project the next-generation NERSC system will be 3-5x more capable than NERSC-5. Consequently, the concurrency of the benchmarks has increased by 4x over the NERSC-5 suite.

Strong Scaling: Over the past 15 years, parallel computing has thrived on weak-scaling of problems. However, with the stall in CPU clock frequencies, there is increased pressure to improve the strong-scaling performance of algorithms.


This is because, for time-domain methods, increases in problem resolution must often be accompanied by corresponding decreases in time-step size to satisfy the Courant stability condition. Whereas the time-step decreases could be accommodated by performance improvements of individual cores, next-generation algorithms will need to improve their step rates using increased parallelism (strong scaling). Consequently, our problem input decks for the NERSC-6 application benchmarks now emphasize strong-scaling requirements.

Implicit Timestepping: Many PDE solvers on block-structured grids employ explicit time-stepping methods, which perform well when weak-scaled on massively concurrent systems. However, it is very difficult to improve strong-scaling performance in light of the stall in CPU clock frequencies. This will provide increased pressure to move toward implicit time-stepping schemes, which are far more challenging to scale to high concurrencies. Evidence of this trend has been documented in the 2005 Greenbook and subsequent reports from the Fusion Simulation Project (FSP) and other DOE reports. We have coded this requirement into our benchmark suite using MAESTRO, which implements both implicit and explicit time-stepping schemes.

AMR/Multiscale Physics: Although it is not a major component of our workload yet, the presence of scalable AMR codes has increased substantially in our workloads. Given the increased emphasis on multi-physics, multiscale problems in SciDAC and the increased need for strong scaling, we expect these methods to become increasingly important components of future workloads.

Support for Lightweight/One-Sided Messaging: As systems scale to ever larger concurrencies, the performance of the interconnect for very small messages becomes increasingly important. Already, our current workload includes many chemistry applications that depend on Global Arrays (GA) and other libraries that emulate a global shared address space. These codes benefit greatly from efficient support of lightweight messaging, including one-sided message constructs. These requirements are represented by the choice of GAMESS for the application suite.

NERSC Composite Benchmarks (Sustained System Performance)

The highest-concurrency runs for these application benchmarks are also used together in the Sustained System Performance (SSP) composite benchmark, which is discussed in a companion document describing both the SSP and ESP tests. The geometric mean of the performance of these benchmarks is used to estimate the likely delivered performance of the evaluated systems for the target NERSC workload. Therefore, this approach fulfills the observation by Ingrid Bucher that "benchmarks are only useful insofar as they model the intended workload." These benchmarks encode our workload requirements, and consequently the SSP provides a good metric for estimating the sustained performance on the anticipated workload.
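Schematically, if benchmark i of the M composite benchmarks sustains a per-core rate p_i on a system with N cores, a geometric-mean composite of the kind described above takes the form

    \mathrm{SSP} \;=\; N \cdot \Big(\prod_{i=1}^{M} p_i\Big)^{1/M}.

This is only a sketch of the composition rule; the precise definition, including how per-core rates are derived from reference operation counts and how the time-integrated (potency) value is formed, is given in the companion SSP/ESP document.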


Analysis of NERSC-6 Benchmark Codes

This section provides more detailed descriptions of the NERSC-6 application benchmark codes and their problem sets, along with some observed characteristics of the codes and some preliminary performance data from various systems.

The benchmarks serve three purposes in the procurement project. First, each NERSC-6 benchmark and its respective test problem has been carefully chosen and developed to represent a particular subset and/or specific characteristic of the expected DOE Office of Science workload on the NERSC-6 system. As it has been stated, "For better or for worse, benchmarks shape a field,"1 and thus a key component of the NERSC workload-driven evaluation strategy is making available real application codes that capture the performance-critical aspects of the workload. Second, the benchmarks provide vendors the opportunity to provide NERSC with concrete data (in an RFP response) associated with the performance and scalability of the proposed systems on programmatically important applications. We expect that vendor proposal responses will include the results of running NERSC-6 benchmarks on existing hardware as well as projections to future, currently unavailable systems. In this role, the benchmarks play an essential role in the proposal evaluation process. The NERSC staff are well qualified to evaluate any projections supplied by vendors as a result of having carried out extensive technology assessments of prospective vendor systems for the timeframe of the NERSC-6 project. Third, the benchmarks will be used as an integral part of the system acceptance test and as part of a continuous measurement of hardware and software performance throughout the operational lifetime of the NERSC-6 system.

The NERSC evaluation strategy uses benchmark programs of varying complexity, ranging from kernels and microbenchmarks through stripped-down applications and full applications. Each has its place: we use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance. The remainder of this document will concentrate on the NERSC-6 full application codes. For each of the codes a few key characteristics are presented, most of which have been obtained using the NERSC Integrated Performance Monitoring (IPM) tool, an application profiling layer that uses a unique hashing approach to allow a fixed memory footprint and minimal CPU usage. Key characteristics measured include the type and frequency of application-issued MPI calls, buffer sizes utilized for point-to-point and collective communications, the communication topology of each application, and in some cases a time profile of the interprocessor communications. Another key parameter of a completely different type is the computational intensity, the ratio of the number of floating-point operations executed to the number of memory operations.

1 Patterson, D., CS252 Lecture Notes, University of California, Berkeley, Spring 1998.


CAM: CCSM Community Atmosphere Model

The Community Atmosphere Model (CAM) is the atmospheric component of the Community Climate System Model (CCSM) developed at NCAR and elsewhere for the weather and climate research communities. Although generally used in production as part of a coupled system in CCSM, CAM can be run as a standalone (uncoupled) model, as it is at NERSC. The NERSC-6 benchmark runs CAM version 3.1 at D resolution (about 0.5 degree) using a finite volume (FV) dynamical core. Two problems are run using strong scaling, at 56 and 240 cores.

Atmospheric models consist of two principal components, the dynamics and the physics. The dynamics, for which the solver is referred to as the "dynamical core," are the large-scale part of a model: the atmospheric equations of motion affecting wind, pressure, and temperature that are resolved on the underlying grid. The physics are characterized by subgrid-scale processes such as radiation, moisture convection, friction, and boundary layer interactions that are taken into consideration implicitly (via parameterizations). In CAM3 as it is run at NERSC, the dynamics are solved using an explicit time integration and a finite-volume discretization that is local and entirely in physical space. Hydrostatic equilibrium is assumed and a Lagrangian vertical coordinate is used, which together effectively reduce the dimensionality from three to two.

Relationship to NERSC Workload

CAM is used for both short-term weather prediction and as part of a large climate modeling effort to accurately detect and attribute climate change, predict future climate, and engineer mitigation strategies. An example of a "breakthrough" computation involving CAM might be its use in a fully coupled ocean/land/atmosphere/ice climate simulation with 0.125-degree (approximately 12 kilometer) resolution involving an ensemble of eight to ten independent runs. Doubling horizontal resolution in CAM increases computational cost eightfold. The computational cost of CAM in the CCSM, holding resolution constant, has increased 4x since 1996. More computational complexity is coming in the form of, for example, super-parameterizations of moist processes and, ultimately, the elimination of parameterization and use of grid-based methods.

Parallelization

The solution procedure for the FV dynamical core involves two parts: (1) dynamics and tracer transport within each control volume, and (2) remapping of prognostic data from the Lagrangian frame back to the original vertical coordinate. The remapping occurs on a time scale several times longer than that of the transport. The version of CAM used by NERSC uses a formalism effectively containing two different two-dimensional domain decompositions. In the first, the dynamics is decomposed over latitude and vertical level.


Because this decomposition is inappropriate for the remap phase, and for the vertical integration step for calculation of pressure, for which there exist vertical dependences, a longitude-latitude decomposition is used for these phases of the computation. Optimized transposes move data from the program structures required by one phase to the other. One key to this yz-to-xy transpose scheme is having the number of processors in the latitudinal direction be the same for both of the two-dimensional decompositions.

Figure 6 presents the topological connectivity of communication for fvCAM: each point in the graph indicates processors exchanging messages, and the color indicates the volume of data exchanged. Figures 7 and 8 give an indication of the extent to which different communication primitives are used in the code and their respective times on one system. Figures 9 and 10 give an indication of the sizes of the MPI messages in an actual run on the NERSC Cray XT4 system Franklin.

Figure 6. Communication topology and intensity for fvCAM.

Figure 7. MPI call counts for fvCAM.

Figure 8. Percent of time spent in various MPI routines for fvCAM on 240 processors on the NERSC Bassi system.


Figure 9. MPI buffer size for point-to-point messages in fvCAM.

Figure 10. MPI buffer size for collective messages in fvCAM.

Importance for NERSC-6

For NERSC-6 benchmarking, CAM offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization); relatively low computational intensity that stresses on-node/processor data movement; relatively long MPI messages that stress interconnect point-to-point bandwidth; strong scaling and a decomposition method that limit scalability; and moderate sensitivity to TLB performance and VM page size.

Performance

Table 8 presents measured performance for CAM on several systems. The "GFLOPs" column represents the aggregate computing rate, based on the reference floating-point operation count from Franklin and the time on the specific machine. The efficiency column ("Effic.") represents the percentage of peak performance for the number of cores used to run the benchmark.


Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
56      30       7%         6        3%        32       7%                  33.5     12%
240     100      6%         49       3%        143      7%                  130.6    11%

Table 8. Measured Aggregate Performance and Percent of Peak for CAM.
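To make the "Effic." column concrete, the percentage is the measured aggregate rate divided by the aggregate peak of the cores used:

    \mathrm{Effic.} = \frac{R_{\mathrm{measured}}}{N_{\mathrm{cores}} \times R_{\mathrm{peak/core}}}
    = \frac{100\ \mathrm{GFLOPs}}{240 \times 7.6\ \mathrm{GFLOPs}} \approx 5.5\%

for CAM on 240 cores of Bassi, which rounds to the 6% shown in the table. The 7.6 GFlop/s per-core peak for Bassi's 1.9 GHz POWER5 processors is our assumption here, not a figure stated in this report.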

GAMESS: General Atomic and Molecular Electronic Structure System

The GAMESS (General Atomic and Molecular Electronic Structure System) code from the Gordon research group at the Department of Energy Ames Lab at Iowa State University is one of the most important tools available for various ab initio quantum chemistry simulations. Several kinds of SCF wavefunctions can be computed with configuration interaction, second-order perturbation theory, and coupled-cluster approaches, as well as the density functional theory approximation. A variety of molecular properties, ranging from simple dipole moments to frequency-dependent hyperpolarizabilities, may be computed. Many basis sets are stored internally, together with effective core potentials, so all elements up to radon may be included in molecules. The two benchmarks used here test DFT energy and gradient, RHF energy, MP2 energy, and MP2 gradient.

Relationship to NERSC Workload

GAMESS is available and used on all NERSC systems. It is most commonly used by allocations from the BES area. As a benchmark, GAMESS represents a variety of different conventional quantum chemistry codes that use Gaussian basis sets to solve the Hartree-Fock, MP2, and coupled cluster equations. These codes also now include the capability to perform density functional computations. Codes such as GAMESS can be used to study relatively large molecules, in which case large processor configurations can be used, but frequently they are also used to calculate many atomic configurations (e.g., to map out the potential manifold in a chemical reaction) for small molecules.

Parallelization

GAMESS uses an SPMD approach but includes its own underlying communication library, called the Distributed Data Interface (DDI), to present the abstraction of a global shared memory with one-sided data transfers even on systems with physically distributed memory. In part, the DDI philosophy is to add more processors not just for their compute performance but also to aggregate more total memory. In fact, on systems where the hardware does not contain support for efficient one-sided messages, an MPI implementation of DDI is used in which only one-half of the processors allocated compute while the other half are essentially data movers.
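DDI's own API is not reproduced here; the following Fortran sketch shows the one-sided abstraction it provides, expressed with standard MPI-2 windows (our analogy, with invented names, rather than GAMESS's actual implementation). Each rank exposes its block of a distributed array, and any rank can read a remote block without the target's explicit participation in the transfer:

    program ddi_onesided_sketch
      use mpi
      implicit none
      integer, parameter :: n = 1024
      integer :: ierr, rank, win
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      real(8) :: local(n), remote(n)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      local   = real(rank, 8)   ! this rank's block of the distributed array
      winsize = 8 * n           ! window size in bytes (real(8) elements)

      ! Expose the local block so any rank can access it one-sidedly.
      call MPI_Win_create(local, winsize, 8, MPI_INFO_NULL, &
                          MPI_COMM_WORLD, win, ierr)

      call MPI_Win_fence(0, win, ierr)
      disp = 0
      ! Read rank 0's block; rank 0 takes no explicit part in the transfer.
      call MPI_Get(remote, n, MPI_DOUBLE_PRECISION, 0, disp, n, &
                   MPI_DOUBLE_PRECISION, win, ierr)
      call MPI_Win_fence(0, win, ierr)

      call MPI_Win_free(win, ierr)
      call MPI_Finalize(ierr)
    end program ddi_onesided_sketch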

Benchmark Problems

Rather than using GAMESS for scaling studies, two separate runs are done, testing different parts of the program.


The NERSC "medium" case, run on 64 cores for NERSC-5 and expected to run on 256 cores for NERSC-6, is a B3LYP(5)/6-311G(d,p) calculation with an RHF SCF gradient. The "large" case, 384 cores for NERSC-5 and expected to run on 1024 cores for NERSC-6, is a 6-311++G(d,p) MP2 computation on a five-atom system with T symmetry.

Importance for NERSC-6

For NERSC-6 benchmarking, GAMESS offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization, but the built-in communication layer affords an opportunity for key system-dependent optimization); considerable stride-1 memory access; and a built-in communication layer that yields performance characteristics not visible from MPI (typical of Global Array codes, for example).

Performance

Since the problem sets for NERSC-6 GAMESS have not been finalized, the data presented in Table 9 are for the NERSC-5 problems. Performance is not expected to differ significantly.

Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
64      26       5%                            18       3%                  23       7%
384     121      4%         39       2%        177      6%                  214      11%

Table 9. Measured Aggregate Performance and Percent of Peak for GAMESS.

GTC: 3D Gyrokinetic Toroidal Code

GTC is a three-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the particle-in-cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the flux, f(t, x, v), which is a function of time t, position x, and velocity v, and represents the distribution function of particles (electrons, ions, etc.) in phase space. This model assumes a collisionless plasma; i.e., the particles interact with one another only through a self-consistent field, and thus the algorithm scales as N instead of N^2, where N is the number of particles.

A "classical" PIC algorithm involves using the notion of discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are involved iteratively (a minimal sketch of the deposition step follows the list):

• "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.


• Solve the Poisson equation.
• "GATHER" forces on each particle from the resultant potential function.
• "PUSH" (move) the particles to new locations on the grid.

In GTC, point-charge particles are replaced by charged rings using a four-point gyro-averaging scheme, as shown in Figure 11.
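Below is the promised minimal sketch of the SCATTER step, in one dimension with linear (cloud-in-cell) weighting. The names, sizes, and uniform grid are illustrative assumptions; GTC itself deposits onto charged rings via the four-point gyro-averaging scheme rather than onto single points:

    program deposit_sketch
      implicit none
      integer, parameter :: np = 100000       ! particles
      integer, parameter :: ng = 512          ! grid cells
      real(8), parameter :: dx = 1.0d0        ! grid spacing
      real(8) :: xp(np), q(np), rho(0:ng)
      real(8) :: s
      integer :: i, j

      call random_number(xp)
      xp  = xp * ng * dx    ! positions uniformly in [0, L)
      q   = 1.0d0           ! unit charges
      rho = 0.0d0

      do i = 1, np
        j = int(xp(i)/dx)   ! grid point to the particle's left
        s = xp(i)/dx - j    ! fractional offset within the cell
        ! Indirect addressing: the two updates below hit effectively
        ! random grid locations, the latency-bound access pattern the
        ! text describes for PIC charge deposition.
        rho(j)   = rho(j)   + q(i)*(1.0d0 - s)
        rho(j+1) = rho(j+1) + q(i)*s
      end do

      print *, 'total deposited charge =', sum(rho)
    end program deposit_sketch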

Relation to NERSC Workload

GTC is used for fusion energy research via the DOE SciDAC program and for the international fusion collaboration. Support for it comes from the DOE Office of Fusion Energy Science. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators, as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes.

Figure 11. Representation of the charge accumulation to the grid in GTC.

Parallelization

In older versions of GTC, the primary parallel programming model was a 1-D domain decomposition with each MPI process assigned a toroidal section, and MPI_SENDRECV used to shift particles moving between domains. Newer versions of GTC, including the current NERSC-6 benchmark, use a fixed 1-D domain decomposition with 64 domains plus a particle-based decomposition. At greater than 64 processors, each domain in the 1-D domain decomposition has more than one processor associated with it, and each processor holds a fraction of the total number of particles in that domain. This requires an Allreduce operation to collect all of the charges from all particles in a given domain. The communication topology of the NERSC-6 version of GTC, using the combination toroidal domain/particle decomposition method, is shown in Figure 12. Figures 13, 14, and 15 show other communication characteristics.
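A minimal MPI sketch of this two-level decomposition follows. The communicator construction and the per-domain reduction mirror the pattern just described, while the rank-to-domain mapping, names, and sizes are illustrative assumptions rather than GTC's actual code:

    program gtc_domain_sketch
      use mpi
      implicit none
      integer, parameter :: ndomain = 64     ! fixed toroidal domain count
      integer, parameter :: nglocal = 4096   ! grid points per domain (illustrative)
      integer :: ierr, rank, domain_id, domain_comm
      real(8) :: grid_local(nglocal), grid_sum(nglocal)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      ! Ranks with the same color share one toroidal section; each holds
      ! a copy of that section's grid but only a fraction of its particles.
      domain_id = mod(rank, ndomain)
      call MPI_Comm_split(MPI_COMM_WORLD, domain_id, rank, domain_comm, ierr)

      grid_local = 0.0d0   ! each rank deposits its own particles here

      ! Per-domain Allreduce sums the partial charge depositions, the
      ! step the text identifies as a potential performance limiter.
      call MPI_Allreduce(grid_local, grid_sum, nglocal, &
                         MPI_DOUBLE_PRECISION, MPI_SUM, domain_comm, ierr)

      call MPI_Comm_free(domain_comm, ierr)
      call MPI_Finalize(ierr)
    end program gtc_domain_sketch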


Figure 12. MPI message topology for the NERSC-6 version of GTC.

Figure 13. MPI call counts for the NERSC-6 version of GTC.

Figure 14. MPI message buffer size distribution based on calls for the NERSC-6 version of GTC.

Figure 15. Time spent in MPI functions in the NERSC-6 version of GTC on Franklin.

Importance for NERSC-6

For NERSC-6 benchmarking, GTC offers the following characteristics: charge deposition in PIC codes utilizes indirect addressing and therefore stresses random access to memory; the particle decomposition stresses collective communications with small messages; point-to-point communications are strongly bandwidth bound; it runs adequately on relatively simple-topology communication networks; and important loop constructs with a large number of arrays lead to highly sensitive TLB behavior.

Performance

Table 10, which lists some measured performance data, suggests that GTC is capable of achieving relatively high percentages of peak performance on some systems, primarily due to a relatively high computational intensity and a low communication contribution. It also seems that Opteron processors have a performance advantage on this code.


Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
512     428      11%        98       6%        511      12%                 565      21%
2048    n/a                 391      6%        2023     12%                 2438     23%

Table 10. Measured Aggregate Performance and Percent of Peak for GTC.

IMPACT-T: Parallel 3-D Beam Dynamics

IMPACT-T (Integrated Map and Particle Accelerator Tracking - Time) is one member of a suite of computational tools for the prediction and performance enhancement of accelerators. These tools are intended to predict the dynamics of beams in the presence of optical elements and space-charge forces, the calculation of electromagnetic modes and wake fields of cavities, the cooling induced by co-moving beams, and the acceleration of beams by intense fields in plasmas generated by beams or lasers.

The IMPACT-T code uses a parallel, relativistic PIC method for simulating 3-D beam dynamics using time as the independent variable. (The design of RF accelerators is normally performed using position, either the arc length or the z-direction, rather than time, as the independent variable.) IMPACT-T includes the arbitrary overlap of fields from a comprehensive set of beamline elements, can model both standing and traveling wave structures, and can include a wide variety of external magnetic focusing elements such as solenoids, dipoles, quadrupoles, etc.

IMPACT-T is unique in the accelerator modeling field in several ways. It uses space-charge solvers based on an integrated Green function to efficiently and accurately treat beams with large aspect ratios, and a shifted Green function to efficiently treat image charge effects of a cathode. It uses energy binning in the space-charge calculation to model beams with large energy spread. It also has a flexible data structure (implemented in object-oriented Fortran) that allows particles to be stored in containers with common characteristics; for photoinjector simulations the containers represent multiple slices, but in other applications they could correspond, e.g., to particles of different species. While the core PIC method used in IMPACT-T is essentially the same as that used for plasma simulation (see GTC, above), many details of the implementation differ; in particular, the spectral/Green's function potential solver (vs. the finite difference method used in GTC) results in a significant global inter-processor communication pattern.

Relationship to NERSC Workload

Past applications of the IMPACT code suite include SNS linac modeling, J-PARC linac commissioning, the RIA driver linac design, the CERN superconducting linac design, the LEDA halo experiment, the BNL photoinjector, and the FNAL NICADD photoinjector. Recent code enhancements have expanded IMPACT's applicability, and it is now being used (along with other codes) on the LCLS project and the FERMI@Elettra project. A recent DOE INCITE grant allowed IMPACT to be used in the design of free-electron lasers.

Parallelization

A two-dimensional domain decomposition in the y-z directions is used, along with a dynamic load balancing scheme based on domain. The implementation uses MPI-1 message passing. For some problems, most of the communication, which is due to particles crossing processor boundaries, is local; however, in the simulation of colliding beams, particles can move a long distance, and more than one pair of exchanges might be required for a single particle update. IMPACT-T does not use the particle-field parallel decomposition used by other PIC codes. Hockney's FFT algorithm is used to solve Poisson's equation with open boundary conditions. In this algorithm, the number of grid points is doubled in each dimension, with the charge density on the original grid kept the same and the charge density elsewhere set to zero (sketched below). Some communication characteristics for IMPACT-T are presented in Tables 11 and 12 and Figures 16–18.
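As a concrete illustration of the grid-doubling step just described, the following sketch (not IMPACT-T source; names and array layout are assumed) copies the charge density into one octant of a grid doubled in each dimension and zeroes the rest. An FFT-based convolution with the free-space Green function on the doubled grid would then yield the open-boundary potential; the FFT itself is omitted here.

/* Sketch of the grid-doubling step in Hockney's open-boundary Poisson
   solve.  rho is nx*ny*nz, row-major; rho2 is (2nx)*(2ny)*(2nz). */
#include <string.h>

void hockney_pad(const double *rho, double *rho2, int nx, int ny, int nz)
{
    int ny2 = 2 * ny, nz2 = 2 * nz;

    /* Zero the doubled grid... */
    memset(rho2, 0, (size_t)(2 * nx) * ny2 * nz2 * sizeof(double));

    /* ...then copy the original charge density into the first octant. */
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            memcpy(&rho2[((size_t)i * ny2 + j) * nz2],
                   &rho[((size_t)i * ny + j) * nz],
                   nz * sizeof(double));
}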

MPI Event       Buffer Size (Bytes)   Cumulative Percent of Total Wall Clock Time
MPI_Alltoallv   131K–164K             15%
MPI_Send        8K                    3%
MPI_Allreduce   4–8                   4%

Table 11. Important MPI Message Buffer Sizes for IMPACT-T.

Routine Name                             Percent of Total Time   Purpose
beambunchclass_kick2t_beambunch          32%                      Interpolate the space-charge fields E and B in the lab frame to individual particles, plus external fields
beamlineelemclass_getfldt_beamlineelem   29%                      Get external field
ptclmgerclass_ptsmv2_ptclmger            10%                      Move particles from one processor to four neighboring processors

Table 12. Profile of Top Subroutines in IMPACT-T on Cray XT4.

Importance for NERSC-6

For NERSC-6 benchmarking, IMPACT-T offers the following characteristics: the object-oriented Fortran 90 coding style presents compiler optimization challenges; the FFT Poisson solver stresses collective communications with moderate message sizes; there is greater complexity than in other PIC codes due to external fields and open boundary conditions; computational intensity is relatively moderate; and the fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies.


Figure 16. MPI profile on 1024 cores of Franklin.

Figure 17. MPI call counts.

Figure 18. Message topology for IMPACT-T.

Performance

Although they are both PIC codes, performance for IMPACT-T (Table 13) is quite different from that of GTC, with IMPACT-T generally achieving a lower rate due to the increased importance of global communications and a considerably lower computational intensity.

Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
256     143      7%         34       4%        111      5%                  130      10%
1024    n/a                 174      5%        513      6%                  638      12%

Table 13. Measured Aggregate Performance and Percent of Peak for IMPACT-T.


MAESTRO: Low Mach Number Stellar Hydrodynamics Code

MAESTRO is used for simulating astrophysical flows such as those leading up to ignition in Type Ia supernovae. It contains a new algorithm specifically designed to neglect acoustic waves in the modeling of low Mach number convection, while retaining compressibility effects due to nuclear reactions and background stratification, for the convection that takes place in the centuries-long period prior to explosion. MAESTRO allows for large time steps, finite-amplitude density and temperature perturbations, and the hydrostatic evolution of the base state of the star.

The basic discretization in MAESTRO combines a symmetric operator-split treatment of chemistry and transport with a density-weighted approximate projection method to impose the evolution constraint. The resulting integration proceeds on the time scale of the relatively slow advective transport. Faster diffusion and chemistry processes are treated implicitly in time. This integration scheme is embedded in an adaptive mesh refinement algorithm based on a hierarchical system of rectangular, non-overlapping grid patches at multiple levels with different resolution; however, in the NERSC-6 benchmark the grid does not adapt. A multigrid solver is used.

Parallelization

Parallelization in MAESTRO is via a 3-D domain decomposition in which data and work are apportioned using a coarse-grained distribution strategy to balance the load and minimize communication costs. Options for data distribution include a knapsack algorithm, with an additional step to balance communications, and Morton-ordering space-filling curves (sketched below). Studies have shown that the time-to-solution for the low Mach number algorithm is roughly two orders of magnitude faster than that of a more traditional compressible reacting flow solver.

MAESTRO uses BoxLib, a foundation library of Fortran 90 modules from LBNL that facilitates development of block-structured finite difference algorithms, with rich data structures for describing operations that take place on data defined in regions of index space that are unions of non-intersecting rectangles. It is particularly useful in unsteady problems where the regions of interest may change in response to an evolving solution. The MAESTRO communication topology pattern (Figure 19) is quite unusual. Other communication characteristics are shown in Figures 20–22.
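The Morton-ordering option can be sketched as follows. This is an illustrative stand-in, not BoxLib code: it interleaves the bits of each grid patch's 3-D index to form a space-filling-curve key, sorts the patches by key, and deals out contiguous runs, so patches that are adjacent on the curve, and therefore usually close in space, tend to land on the same processor.

/* Illustrative Morton-order distribution of grid patches (not BoxLib). */
#include <stdint.h>
#include <stdlib.h>

/* Spread the low 10 bits of x with two zero bits between each. */
static uint32_t spread3(uint32_t x)
{
    x &= 0x3ff;
    x = (x | (x << 16)) & 0x030000ff;
    x = (x | (x <<  8)) & 0x0300f00f;
    x = (x | (x <<  4)) & 0x030c30c3;
    x = (x | (x <<  2)) & 0x09249249;
    return x;
}

static uint32_t morton3(uint32_t i, uint32_t j, uint32_t k)
{
    return spread3(i) | (spread3(j) << 1) | (spread3(k) << 2);
}

struct box { uint32_t i, j, k, key; };

static int by_key(const void *a, const void *b)
{
    uint32_t ka = ((const struct box *)a)->key;
    uint32_t kb = ((const struct box *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort nbox boxes along the curve; owner[b] is the processor for the
   box that ends up in sorted position b (contiguous runs per processor). */
void distribute(struct box *boxes, int nbox, int nproc, int *owner)
{
    for (int b = 0; b < nbox; b++)
        boxes[b].key = morton3(boxes[b].i, boxes[b].j, boxes[b].k);
    qsort(boxes, nbox, sizeof(struct box), by_key);
    for (int b = 0; b < nbox; b++)
        owner[b] = (int)(((long)b * nproc) / nbox);
}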

Relationship to NERSC Workload

MAESTRO and the algorithms it represents are used in at least two areas at NERSC: supernova ignition and turbulent flame combustion studies, which hold SciDAC, INCITE, and base allocations.


Figure 19. Communication topology for MAESTRO from IPM.

Figure 20. MPI call counts for MAESTRO.

Figure 21. Distribution of MPI times on 512 cores of Franklin from IPM.

Figure 22. Distribution of MPI buffer sizes by time from IPM on Franklin.

Importance for NERSC-6

For NERSC-6 benchmarking, MAESTRO offers the following characteristics: its unusual communication topology should stress simple-topology interconnects and represents characteristics associated with irregular or refined grids; very low computational intensity stresses memory performance, especially latency; the implicit solver technology stresses global communications; and message sizes span a wide range from short to relatively moderate.

Performance

MAESTRO achieves a relatively low overall rate on the systems shown in Table 14. The computational intensity is very low, most likely as a result of the non-floating-point overhead required by the adaptive mesh framework. The data also suggest that MPI communications performance may be a bottleneck, since the efficiency decreases for the larger problem of this weak-scaling set.

Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
512     178      5%         52*      3%        230      5%                  245      9%
2048    n/a                                    406      2%                  437      4%

Table 14. Measured Aggregate Performance and Percent of Peak for MAESTRO.

MILC: MIMD Lattice Computation

The benchmark code MILC represents part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

The MILC collaboration has produced application codes to study several different QCD research areas, only one of which, ks_dynamical (simulation with conventional dynamical Kogut-Susskind quarks), is used here.

QCD discretizes space and evaluates field variables on the sites and links of a regular hypercube lattice in four-dimensional space-time. Each link between nearest neighbors in this lattice is associated with a three-dimensional SU(3) complex matrix for a given field.

QCD involves integrating an equation of motion for hundreds or thousands of time steps, which requires inverting a large, sparse matrix at each step of the integration. The sparse matrix problem is solved using a conjugate gradient (CG) method, but because the linear system is nearly singular, many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The matrix in the linear system being solved contains sets of complex three-dimensional "link" matrices, one per 4-D lattice link, but only links between odd sites and even sites are non-zero. The inversion by CG requires repeated three-dimensional complex matrix-vector multiplications, which reduce to dot products of three pairs of three-dimensional complex vectors. The code separates the real and imaginary parts, producing six dot-product pairs of six-dimensional real vectors. Each such dot product consists of five multiply-add operations and one multiply.
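Written out, the kernel looks like the following sketch (illustrative, not MILC source; conjugation conventions are ignored): each of the real and imaginary result components is a six-element real dot product, which a compiler can schedule as one multiply plus five multiply-adds, exactly the operation count cited above.

/* Dot product of two three-dimensional complex vectors with real and
   imaginary parts stored separately (illustrative, not MILC source). */
typedef struct { double re[3], im[3]; } cvec3;

static void cdot3(const cvec3 *a, const cvec3 *b, double *cre, double *cim)
{
    /* Real part: ar.br - ai.bi — six multiplies, five adds/subtracts. */
    double re = a->re[0] * b->re[0];
    re += a->re[1] * b->re[1];
    re += a->re[2] * b->re[2];
    re -= a->im[0] * b->im[0];
    re -= a->im[1] * b->im[1];
    re -= a->im[2] * b->im[2];

    /* Imaginary part: ar.bi + ai.br — again one multiply plus five FMAs. */
    double im = a->re[0] * b->im[0];
    im += a->re[1] * b->im[1];
    im += a->re[2] * b->im[2];
    im += a->im[0] * b->re[0];
    im += a->im[1] * b->re[1];
    im += a->im[2] * b->re[2];

    *cre = re;
    *cim = im;
}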

Relationship to NERSC Workload

MILC has widespread physics community use and a large allocation of resources on NERSC systems. Funded through High Energy Physics Theory, it supports research that addresses fundamental questions in high energy and nuclear physics and is directly related to major experimental programs in these fields.

Parallelization

The primary parallel programming model for MILC is a 4-D domain decomposition with each MPI process assigned an equal number of sub-lattices of contiguous sites. In a four-dimensional problem, each site has eight nearest neighbors. The code is organized so that message-passing routines are compartmentalized away from what is built as a library of single-processor linear algebra routines. Many recent production MILC runs have used the RHMC algorithm, an improvement in the molecular dynamics evolution process for QCD that reduces computational cost dramatically.

All three NERSC-6 MILC tests are set up to actually do two runs each, to more accurately represent the work that the CG solve must do in actual QCD simulations. The CG solver will take insufficient iterations to converge if one starts with an ordered system, so we first do a short run with a few steps, a larger step size, and a loose convergence criterion. This lets the lattice evolve from totally ordered. In the NERSC-6 version of the code, this portion of the run is timed as "INIT_TIME," and it takes about two minutes on the NERSC Cray XT4. Then, starting with this "primed" lattice, we increase the accuracy for the CG solve, and the iteration count per solve goes up to a more representative value. This is the portion of the code that we time. Between 65,000 and 75,000 CG iterations are done (depending on problem size).

The three NERSC-6 problems use weak scaling, and the only difference between the three is the size of the lattice; i.e., there are no differences in steps per trajectory or number of trajectories. In Table 15, the local lattice size is based on the target concurrency. Note that these sizes were chosen as a compromise between perceived science needs and a variety of practical considerations. Scaling studies with larger concurrencies can easily be done using MILC. Estimates of QCD runs in the time frame of the NERSC-6 system include some of the lattices in the NERSC-6 benchmark but also lattices up to about 96x96x96x192. Communication characteristics are presented in Figures 23–26.

Target Concurrency   Lattice Size (Global)   Lattice Size (Local)
256 cores            32x32x32x36             8x8x8x9
1024 cores           64x64x32x72             8x8x8x9
8192 cores           64x64x64x144            8x8x8x9

Table 15. NERSC-6 MILC lattice sizes.
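The small fixed local lattice is what makes this problem so communication-sensitive. A quick geometric check (illustrative, not MILC source; even-odd preconditioning and field sizes are ignored) shows that in a 4-D nearest-neighbor decomposition the face sites of an 8x8x8x9 sub-lattice are nearly as numerous as the sites themselves:

/* Surface-to-volume check for the local lattice in Table 15. */
#include <stdio.h>

int main(void)
{
    int n[4] = {8, 8, 8, 9};            /* local lattice from Table 15 */
    long vol = (long)n[0] * n[1] * n[2] * n[3];

    long halo = 0;                      /* face sites, both directions */
    for (int d = 0; d < 4; d++)
        halo += 2 * (vol / n[d]);

    printf("local sites: %ld, face sites: %ld (%.0f%%)\n",
           vol, halo, 100.0 * halo / vol);   /* 4608 vs 4480: ~97% */
    return 0;
}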


Figure 23. MPI call counts for MILC.

Figure 24. Communication topology for MILC from IPM on Franklin.

Figure 25. Distribution of communication times for MILC from IPM on 1024 cores of Franklin.

Figure 26. MPI message buffer size distribution based on time for MILC on Franklin from IPM.


Importance for NERSC-6

For NERSC-6 benchmarking, MILC offers the following characteristics: the CG solver, with the small subgrid sizes used, stresses the interconnect with small messages for both point-to-point and collective operations; performance is extremely dependent on memory bandwidth and prefetching; and computational intensity is high.

Performance

Table 16 shows that MILC, too, achieves a relatively high percentage of peak on some systems. The dual-core Opteron results include some code tuned for that system; tuned quad-core Opteron code is not yet available. The increasing effect of communications at larger concurrencies is obvious from the Franklin and Jaguar results on this weak-scaled problem.

Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
256     488      25%        113      13%       203      9%                  291      22%
1024    n/a                 456      13%       513      6%                  1101     21%
8192    n/a                 n/a                3179     5%                  5783     14%

Table 16. Measured Aggregate Performance and Percent of Peak for MILC.

PARATEC: Parallel Total Energy Code

The benchmark code PARATEC (PARAllel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of the electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into their equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of density functional theory (DFT) and obtain the ground-state electron wavefunctions.

In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space, and a key optimization in PARATEC is to transform only the non-zero grid elements. The code can use optimized libraries for both basic linear algebra and fast Fourier transforms, but due to its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies on PARATEC. Nevertheless, due to favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems.


In the benchmark problems supplied here, 20 conjugate gradient iterations are performed; however, a real run might typically do up to 60.

Relationship to NERSC Workload

A recent survey of NERSC ERCAP requests for materials science applications showed that DFT codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the Materials Science community. Supported by DOE BES, PARATEC is an excellent proxy for the application requirements of the entire Materials Science community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.

Parallelization

In a plane-wave simulation, a grid of points represents each electron's wavefunction. Parallel decomposition can be over n(g), the number of grid points per electron (typically O(100,000) per electron); n(i), the number of electrons (typically O(800) per system simulated); or n(k), a number of sampling points (O(1–10)). PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grained level of parallelism.

In Fourier space, each electron's wavefunction grid forms a sphere. Figure 27 depicts a visualization of the parallel data layout on three processors. Each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are first sorted in descending length order and then assigned, in turn, to the processor containing the fewest points (see the sketch below). The real-space data layout of the wavefunctions is on a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns, as shown in Figure 27b. Custom three-dimensional FFTs transform between these two data layouts. Data are arranged in columns as the three-dimensional FFT is performed, by taking one-dimensional FFTs along the z, y, and x directions with parallel data transposes between each set of one-dimensional FFTs. These transposes require global inter-processor communication and represent the most important impediment to high scalability. The communication topology for PARATEC is shown below.

The FFT portion of the code scales approximately as n²log(n), and the dense linear algebra portion scales approximately as n³; therefore, the overall computation-to-communication ratio scales as n, where n is the number of atoms in the simulation. In general, the speed at which the FFT proceeds dominates the run time. Table 19, below, compares the speed of the FFT portion and the BLAS-dominated Projector portion of the code for various machines. The data show that PARATEC is an extremely useful benchmark for discriminating between machines based on both computation and communication speed. Communication characteristics are shown in Figures 28–30 and in Table 17.
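The column-assignment rule described above amounts to a greedy bin-packing pass. The following sketch (illustrative, not PARATEC source) captures it: sort the Fourier-space columns by length, longest first, then hand each to the processor currently holding the fewest points.

/* Greedy assignment of FFT-grid columns to processors (illustrative). */
#include <stdlib.h>

static const int *g_len;                      /* lengths, for the comparator */

static int by_len_desc(const void *a, const void *b)
{
    return g_len[*(const int *)b] - g_len[*(const int *)a];
}

/* len[c] = grid points in column c; owner[c] receives its processor.
   The inner scan is O(ncol * nproc) for clarity; a heap would do better. */
void assign_columns(const int *len, int ncol, int nproc, int *owner)
{
    int  *order = malloc(ncol * sizeof(int));
    long *load  = calloc(nproc, sizeof(long));

    for (int c = 0; c < ncol; c++) order[c] = c;
    g_len = len;
    qsort(order, ncol, sizeof(int), by_len_desc);  /* longest columns first */

    for (int i = 0; i < ncol; i++) {
        int c = order[i], best = 0;
        for (int p = 1; p < nproc; p++)            /* least-loaded processor */
            if (load[p] < load[best]) best = p;
        owner[c] = best;
        load[best] += len[c];
    }
    free(order); free(load);
}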

Figure 27. A three-processor example of PARATEC's parallel data layout for the wavefunctions of each electron in (a) Fourier space and (b) real space.

Figure 28. MPI call counts for PARATEC.

Figure 29. Distribution of MPI times for PARATEC on 256 cores of Franklin from IPM.

Figure 30. Communication topology for PARATEC from IPM.


Message Buffer Size   Count on 256 Cores   Count on 1024 Cores
Total Message Count   428,318              1,940,665
16 ≤ MsgSz < 256                           114,432
256 ≤ MsgSz < 4KB     20,337               1,799,211
4KB ≤ MsgSz < 64KB    403,917              4,611
64KB ≤ MsgSz < 1MB    1,256                22,412
1MB ≤ MsgSz < 16MB    2,808

Table 17. Important MPI Message Buffer Sizes for PARATEC (from IPM on Franklin).

Importance for NERSC-6

For NERSC-6 benchmarking, PARATEC offers the following characteristics: the hand-coded spectral solver with all-to-all data transpositions stresses global communications bandwidth; messages are mostly point-to-point and, since in a single 3-D FFT the size of the data packets scales as the inverse of the square of the number of processors, relatively short; the code can be very memory-capacity limited; a tuning parameter offers the opportunity to trade memory usage for inter-processor communication intensity; quality implementations of system libraries (FFTW/ScaLAPACK/BLACS/BLAS) are required; use of BLAS3 and 1-D FFT libraries results in high cache reuse and a high percentage of per-processor peak performance; the fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies; and the code is moderately sensitive to TLB performance.
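The packet-size scaling cited above follows from a simple count, under the simplifying assumption of an n³ FFT grid spread evenly over P processors: each processor's share of the grid is repartitioned among all P processors during a transpose, so

\[
\underbrace{\frac{n^3}{P}}_{\text{grid data per processor}}
\quad\Longrightarrow\quad
\underbrace{\frac{n^3}{P^2}}_{\text{packet sent to each transpose partner}} ,
\]

which is why the per-message size falls as 1/P² at fixed global problem size and MPI latency grows in importance at higher concurrencies.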

Performance

Tables 18 and 19 show performance data for PARATEC. The code calculates its own floating-point operation rate for two key components: the "Projector" portion of the code, which is dominated by BLAS3 routines, and the FFTs, which use tuned system libraries but depend on global communications. The comparison among systems, especially between the dual-core and quad-core Cray XT4 systems, is quite interesting.

Cores   IBM Power5 Bassi    IBM BG/P           QC Opteron Cray XT4 Jaguar   DC Opteron Cray XT4 Franklin
        GFLOPs   Effic.     GFLOPs   Effic.    GFLOPs   Effic.              GFLOPs   Effic.
256     30       7%         517                665      31%                 729      55%
1024    n/a                 710      20%       2044     24%                 2234     42%

Table 18. Measured Aggregate Performance and Percent of Peak for PARATEC.


Machine    FFT Rate (MFLOPS/core)   Projector Rate (MFLOPS/core)   Overall Rate (Agg. GFLOPS)
Franklin   213                      1045                           729
Jaguar     129                      1581                           665
BG/P       377                      561                            517
HLRB-II    694                      993                            890
Bassi      1028                     1287                           1213

Table 19. Performance Profile for PARATEC (256 cores).

Summary

Table 20 presents a summary of a few of the important characteristics of the codes. In addition to doing a good job of representing key science areas within the NERSC Office of Science workload, the codes provide a good range of architectural "stress points" such as computational intensity and MPI message intensity, type, and size. As such, the codes provide an effective way of differentiating between systems and will be useful for all three goals related to the NERSC-6 project. Finally, the codes provide a strong suite of scalable benchmarks, capable of being run at a wide range of concurrencies.

Code                          CAM     GAMESS   GTC     IMPACT-T   MAESTRO   MILC    PARATEC
CI*                           0.67    0.61     1.15    0.77       0.24      1.39    1.50
Cray XT4 % peak per core
  (largest case)              13%     12%      24%     14%        5%        14%     44%
Cray XT4 % MPI medium         29%     4%       9%      20%        120%      27%
Cray XT4 % MPI large          35%     6%       40%     23%        64%
Cray XT4 % MPI extra large    n/a     n/a      n/a     n/a        n/a       30%     n/a
Cray XT4 avg msg size med     113K    n/a      1MB     35KB       2K        16KB    34KB

Table 20. Comparison of Computational Characteristics for NERSC-6 Benchmarks. *CI is the computational intensity, the ratio of the number of floating-point operations to the number of memory operations.


Bibliography (partial)

Wong, Adrian T., Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "Evaluating System Effectiveness in High Performance Computing Systems," Proceedings of SC2000, November 2000.

Wong, Adrian T., Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "System Utilization Benchmark on the Cray T3E and IBM SP," 5th Workshop on Job Scheduling Strategies for Parallel Processing, Cancun, Mexico, May 2000.

Kramer, William and Clint Ryan, "Performance Variability on Highly Parallel Architectures," International Conference on Computational Science 2003, Melbourne, Australia and St. Petersburg, Russia, June 2–4, 2003.

Bucher, Ingrid Y., "The Computational Speed of Supercomputers," Proceedings of the 1983 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.

Bucher, I. Y. and J. L. Martin, "Methodology for Characterizing a Scientific Workload," CPEUG '82, pages 121–126, 1982.

Simon, Horst D., William T. C. Kramer, David H. Bailey, Michael J. Banda, E. Wes Bethel, Jonathon T. Carter, James M. Craw, William J. Fortney, John A. Hules, Nancy L. Meyer, Juan C. Meza, Esmond G. Ng, Lynn E. Rippe, William C. Saphir, Francesca Verdier, Howard A. Walter, and Katherine A. Yelick, "Science-Driven Computing: NERSC's Plan for 2006–2010," Lawrence Berkeley National Laboratory Report LBNL-57582, May 2005.

Simon, Horst, William Kramer, William Saphir, John Shalf, David Bailey, Leonid Oliker, Michael Banda, C. William McCurdy, John Hules, Andrew Canning, Marc Day, Philip Colella, David Serafini, Michael Wehner, and Peter Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing," J. Earth Sim., Vol. 2, January 2005.

Oliker, Leonid, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski, Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier, Stephane Ethier, and Tom Goodale, "Scientific Application Performance on Candidate PetaScale Platforms," International Parallel & Distributed Processing Symposium (IPDPS), Long Beach, CA, March 24–30, 2007.

Oliker, Leonid, Andrew Canning, Jonathan Carter, John Shalf, David Skinner, Stephane Ethier, Rupak Biswas, Jahed Djomehri, and Rob Van der Wijngaart, "Performance Evaluation of the SX6 Vector Architecture for Scientific Computations," http://crd.lbl.gov/~oliker/drafts/SX6_eval.pdf.

Almgren, A. S., J. B. Bell, and M. Zingale, "MAESTRO: A Low Mach Number Stellar Hydrodynamics Code," Journal of Physics: Conference Series 78 (2007) 012085.

Bell, J. B., A. J. Aspden, M. S. Day, and M. J. Lijewski, "Numerical simulation of low Mach number reacting flows," Journal of Physics: Conference Series 78 (2007) 012004.

Ding, Y., Z. Huang, C. Limborg-Deprey, and J. Qiang, "LCLS Beam Dynamics Studies with the 3-D Parallel Impact-T Code," Stanford Linear Accelerator Center Publication SLAC-PUB-12673, presented at the 22nd Particle Accelerator Conference, Albuquerque, New Mexico, U.S.A., June 25–29, 2007.

Qiang, Ji, Robert D. Ryne, Salman Habib, and Viktor Decyk, "An Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators," Journal of Computational Physics 163, 434–451 (2000).

Jardin, S. C., Editor, "DOE Greenbook: Needs and Directions in High Performance Computing for the Office of Science," Princeton Plasma Physics Laboratory Report PPPL-4090, June 2005. Available from http://www.nersc.gov/news/greenbook/.

Kramer, William, John Shalf, and Erich Strohmaier, "The NERSC Sustained System Performance (SSP) Metric," Lawrence Berkeley National Laboratory Report LBNL-58868, http://repositories.cdlib.org/lbnl/LBNL-58868, 2005.

Oliker, L., J. Shalf, J. Carter, A. Canning, S. Kamil, M. Lijewski, and S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications," in Petascale Computing: Algorithms and Applications (David Bader, editor), Chapman & Hall/CRC Press, 2007.

Wang, Lin-Wang, "A Survey of Codes and Algorithms used in NERSC Materials Science Allocations," Lawrence Berkeley National Laboratory Report LBNL-61051, 2005.

Mirin, A. and P. Worley, "Extending Scalability of the Community Atmosphere Model," Journal of Physics: Conference Series 78 (2007) 012082.

Kritz, Arnold, David Keyes, et al., "Fusion Simulation Project Workshop Report," available from http://www.ofes.fusion.doe.gov/programdocuments.shtml.

Yelick, K., "Programming in UPC (Unified Parallel C)," CS267 lecture notes, Spring 2006, available from http://upc.lbl.gov/publications/.


For more information on the benchmarks see http://www.nersc.gov/projects/SDSA/software/?benchmark=NERSC5.