nci.org.au @NCInews Scaling Weather Climate and Environmental Science Applications, and experiences with Intel Knights Landing Ben Evans, Dale Roberts 17th Workshop on High Performance Computing in Meteorology
Page 1: Scaling Weather Climate and Environmental Science Applications, and experiences with Intel Knights Landing

nci.org.au @NCInews

Ben Evans, Dale Roberts

17th Workshop on High Performance Computing in Meteorology

Page 2: NCI program to accelerate HPC Scaling and Optimisation

© National Computational Infrastructure 2016

• Modelling Extreme & High Impact events - BoM
• NWP, Climate Coupled Systems & Data Assimilation - BoM, CSIRO, Research Collaboration
• Hazards - Geoscience Australia, BoM, States
• Geophysics, Seismic - Geoscience Australia, Universities
• Monitoring the Environment & Ocean - ANU, BoM, CSIRO, GA, Research, Fed/State
• International research - International agencies and Collaborative Programs

Tropical Cyclones: Cyclone Winston, 20-21 Feb 2016
Volcanic Ash: Manam Eruption, 31 July 2015
Bush Fires: Wye Valley and Lorne Fires, 25-31 Dec 2015
Flooding: St George, QLD, February 2011

Ben Evans, ECMWF, Oct 2016

Page 3: ACCESS Model: NWP, Climate, Ocean, ESM, Seasonal and Multi-year

[Diagram components: Coupler, Carbon, Terrestrial, Ocean and sea-ice, Atmospheric chemistry, Atmosphere]

Core Model
• Atmosphere - UM 10.5+
• Ocean - MOM 5.1 (for most models)
• NEMO 3.6 (for GC3 seasonal-only)
• Sea-Ice - CICE5
• Coupler - OASIS-MCT

Carbon cycle (ACCESS-ESM)
• Terrestrial - CABLE
• Bio-geochemical
• Couple to modified ACCESS1.3

Aerosols and Chemistry
• UKCA

Wave
• WW3

Page 4: Additional priority codes - Australian storm surge model using ROMS

Page 5: NCI Research Data Collections: Model, data processing, analysis

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections

http://geonetwork.nci.org.au

Data Collections                                                            Approx. Capacity
CMIP5, CORDEX, ACCESS Models                                                5 PBytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinels, plus MODIS, INSAR, …   2 PBytes
Digital Elevation, Bathymetry, Onshore/Offshore Geophysics                  1 PBytes
Seasonal Climate                                                            700 TBytes
Bureau of Meteorology Observations                                          350 TBytes
Bureau of Meteorology Ocean-Marine                                          350 TBytes
Terrestrial Ecosystem                                                       290 TBytes
Reanalysis products                                                         100 TBytes

Page 6: Himawari-8 Observations, Data Assimilation and Analysis

Captured at JMA, processed after acquisition at BoM, made available at NCI.

Data Products still to be generated, but first stage was to make the image data available.

10 minute capture and process. Then also need to make it available for broad analysis.

Page 7: Earth Observation Time Series Analysis

• Over 300,000 Landsat scenes (spatial/temporal) allowing flexible, efficient, large-scale in-situ analysis
• Spatially-regular, time-stamped, band-aggregated tiles presented as temporal stacks.

[Figure panels: Spatially partitioned tiles; Temporal Analysis]

Continental-Scale Water Observations from Space

WOFS water detection
• 27 Years of data from LS5 & LS7 (1987-2014)
• 25m Nominal Pixel Resolution
• Approx. 300,000 individual source ARG-25 scenes in approx. 20,000 passes
• Entire 27 years of 1,312,087 ARG25 tiles => 93 x 10^12 pixels visited
• 0.75 PB of data
• 3 hrs at NCI (elapsed time) to compute.
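The figures quoted for the Water Observations from Space run imply an average throughput that is easy to check. The sketch below is only back-of-envelope arithmetic on the numbers on the slide; it assumes decimal petabytes (10^15 bytes) and treats "3 hrs" as exactly three hours, neither of which the slide states.

```python
# Figures quoted on the slide.
DATA_BYTES = 0.75e15        # 0.75 PB, assuming decimal petabytes
PIXELS_VISITED = 93e12      # 93 x 10^12 pixels
TILES = 1_312_087           # ARG25 tiles
ELAPSED_S = 3 * 3600        # "3 hrs" taken as exactly three hours (assumption)


def per_second(total, seconds=ELAPSED_S):
    """Average rate over the whole run."""
    return total / seconds


data_rate_gb_s = per_second(DATA_BYTES) / 1e9
pixel_rate_g_s = per_second(PIXELS_VISITED) / 1e9
pixels_per_tile_m = PIXELS_VISITED / TILES / 1e6

print(f"{data_rate_gb_s:.1f} GB/s, {pixel_rate_g_s:.2f} Gpixel/s, "
      f"{pixels_per_tile_m:.1f} Mpixel per tile")
```

Under those assumptions the run averages a little under 70 GB/s of data and several billion pixels per second across the whole machine allocation.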

Page 8: Enable global and continental scale … and to scale-down to local/catchment/plot

• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with point-clouds and local or other measurements
• Statistical techniques on key variables

Preparing for:
• Better programmatic access
• Machine/Deep Learning
• Better Integration through Semantic/Linked data technologies

Page 9: Emerging Petascale Geophysics codes

- Assess priority Geophysics areas
  - 3D/4D Geophysics: Magnetotellurics, AEM
  - Hydrology, Groundwater, Carbon Sequestration
  - Forward and Inverse Seismic models and analysis (onshore and offshore)
  - Natural Hazard and Risk models: Tsunami, Ash-cloud

- Issues
  - Data across domains, data resolution (points, lines, grids), data coverage
  - Model maturity for running at scale
  - Ensemble, Uncertainty analysis and Inferencing

Page 10: NCI High Performance Optimisation and Scaling activities 2014-17

• Objectives:
  - Upscale & increase performance of high-priority national codes - particularly Weather and Climate

• Year 1
  - Characterise, optimise and tune critical applications for higher resolution
  - Best practice configuration for improved throughput
  - Establish analysis toolsets and methodology

• Year 2
  - Characterise, optimise and tune next generation high priority applications
  - Select high priority geophysics codes and exemplar HPC codes for scalability
  - Parallel Algorithm Review and I/O optimisation methods to enable better scaling
  - Established TIHP Optimisation work package for UM codes (Selwood, Evans)

• Year 3
  - Assess broader set of community codes for scalability
  - Updated hardware (many-core), memory/data latency/bandwidths, energy efficiency
  - Communication libraries, math libraries

Page 11: Common Methodology and approach for analysis

• Analyse code to establish strengths and weaknesses.
• Full code analysis including hotspot and algorithm choices
• Expose model to more extreme scaling - e.g., realistic higher resolution
• Analyse and compare different software stacks
• Decomposition strategies for nodes and node tuning
• Parallel Algorithms, MPI communication patterns, e.g. Halo analysis, grid exchanges
• I/O techniques: Evaluate serial and parallel techniques
• Future hardware technologies

Page 12: Scaling/Optimisation: BoM Research-to-Operations High Profile cases

Columns: Domain, Yr1 - 2014/5, Yr2 - 2015/6, Yr3 - 2016/7

Atmosphere

APS1 (UM 8.2-4)
• Global N320L70 (40km) and pre-APS2 N512L70 (25km)
• Regional N768L70 (~17km)
• City 4.5k

UM 10.x (PS36)
• Global N768L70 (~17km)
• Regional, City

APS3 prep
• UM 10.x latest
• ACCESS-G (Global) N1024L70/L85 (12km) or N1280L70/L85 (10km)
• ACCESS-GE Global Ensemble (N216L70) (~60km)
• ACCESS-TC 4km
• ACCESS-R (Regional) 12km
• ACCESS-C (City) 1.5km

Page 13: NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations High Profile cases

Columns: Domain, Yr1 - 2014/5, Yr2 - 2015/6, Yr3 - 2016/7

Data assimilation

4D-VAR v30
• N216L70, N320L70

• 4D-VAR Latest for Global at N320L70

• enKF-C

Ocean

MOM 5.1
• OFAM3
• 0.25°, L50

MOM 5.1
• 0.1°, L50
• OceanMAPS 3.1 (MOM5) with enKF-C

• MOM5/6 0.1° and 0.03°
• ROMS (Regional)
  - StormSurge (2D)
  - eReefs (3D)

Wave

WaveWatch3 v4.18 (v5.08 beta)
• AusWave-G 0.4°
• AusWave-R 0.1°

Page 14: NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations High Profile cases

Columns: Domain, Yr1 - 2014/5, Yr2 - 2015/6, Yr3 - 2016/7

Coupled Systems

Climate: ACCESS-CM
• GA6 (UM 8.5) + MOM5 with OASIS-MCT
• Global N96L38 (135km), 1° and 0.25° ocean.

Climate: ACCESS-CM cont.

Climate: ACCESS-CM2 (AOGCM)
• GA7 (UM 10.4+ and GLOMAP aerosol),
• Global N96L85, N216L85 (~60km)
• MOM 5.1 0.25°
• CABLE2
• UKCA aerosols

Earth System: ACCESS-ESM2
• ACCESS-CM2 + Terrestrial Biochemistry - CASA-CNP
• Oceanic biogeochemistry - WOMBAT
• Atmospheric chemistry - UKCA

Seasonal Climate:
• ACCESS-S1 - UK GC2 with OASIS3
• N216L85 (~60km), NEMO 0.25°

GC2: NCI profiling methodology applied for MPMD

• Multi-week and Seasonal Climate:
• ACCESS-S2 / UK GC3
• Atmos: N216L85 (60km) and NEMO 3.6 0.25° L75

Page 15: HPC General Scaling and Optimisation Approach

Columns: Domain, Yr2 - 2015/6, Yr3 - 2016/7

Profiling Methodology

• Create Methodology for profiling codes

• Updates to Methodology based on application across more codes

I/O profiling

• Baseline profiling for comparison of NetCDF3, NetCDF4, HDF5 and GeoTIFF and API options (e.g. GDAL).
• Profiling comparison of IO performance of Lustre, NFS
• Compare MPI-IO vs POSIX vs HDF5 on Lustre

• Advanced Profiling HDF5 and NetCDF4 for compression algorithms, multithreading, cache management
• Profiling analysis of other data formats, e.g., GRIB, Astronomy FITS, SEG-Y, BAM

Accelerator Technology Investigation

• Intel Phi (Knights Landing)
• AMD GPU

Profiling tools suite

• Review Major open source Profiling Tools

• Investigation of profilers for Accelerators

Page 16: HPC General Scaling and Optimisation Approach

Columns: Domain, Yr2, Yr3

Compute Node Performance Analysis

• Partially committing nodes
• Hardware Hyper-threading
• Memory Bandwidth
• Interconnect bandwidth

• Evaluating Energy Efficiency vs performance of next generation processors
• Broadwell improvements
• Memory speed
• Vectorisation
• OpenMP coverage

Software Stacks

• OpenMPI vs Intel MPI analysis
• Intel compiler versions

• OpenMPI vs Intel MPI analysis
• Intel compiler versions
• Math Libraries

Analysis of other Earth Systems & Geophysics priority codes and algorithms

• Initial Analysis of MPI communications
• Commence analysis of high priority/profile HPC codes in

• Detailed Analysis of MPI communication dependent algorithms
• Survey of Codes and Algorithms used.

Page 17: NCI Contributions to UM collaboration so far

• UM 10.4+ IO Server now using MPI-IO
  - Immediately valuable for NWP (UK Met, Aus, …)
  - Critical for next generation processors (i.e., KnL)

• UM 10.5+ OpenMP coverage
  - Increased performance
  - Critical for both current and next architectures, especially with increasing memory bandwidth issues

Page 18: Knights Landing nodes at NCI

KnLs
• 32 Intel Xeon Phi 7230 processors
  - 64 cores / 256 threads per socket, 1 socket per node
  - 16GB MCDRAM on package (380+ GB/s bandwidth)
  - 192GB DDR4-2400MHz (115.2 GB/s)
• EDR InfiniBand interconnect between KnLs (100 Gb/s)
• FDR InfiniBand links to main Lustre storage (56 Gb/s)

Kepler K80 GPUs also available

Raijin is a Fujitsu Primergy cluster
• 57,472 cores (Intel Xeon Sandy Bridge technology, 2.6GHz) in 3592 compute nodes
• InfiniBand FDR interconnect
• 10 PBytes Lustre for short-term scratch space
• 30 PBytes for data collections storage
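The hardware figures on this slide can be cross-checked with simple arithmetic. The sketch below recomputes the quoted DDR4 bandwidth and the per-node core counts. The channel count (6) and the 8 bytes per transfer are assumptions about the Xeon Phi 7230 memory system that do not appear on the slide; everything else is taken from it.

```python
# Values quoted on the slide.
KNL_NODES = 32
CORES_PER_KNL = 64
THREADS_PER_KNL = 256
RAIJIN_CORES = 57_472
RAIJIN_NODES = 3592

# Assumptions, not from the slide: 6 DDR4 channels, 8 bytes per transfer.
DDR4_CHANNELS = 6
BYTES_PER_TRANSFER = 8
DDR4_TRANSFERS_PER_S = 2400e6  # DDR4-2400


def ddr_bandwidth_gb_s(channels, bytes_per_transfer, transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * bytes_per_transfer * transfers_per_s / 1e9


ddr4_peak = ddr_bandwidth_gb_s(DDR4_CHANNELS, BYTES_PER_TRANSFER,
                               DDR4_TRANSFERS_PER_S)
threads_per_core = THREADS_PER_KNL // CORES_PER_KNL
knl_total_cores = KNL_NODES * CORES_PER_KNL
raijin_cores_per_node = RAIJIN_CORES // RAIJIN_NODES

print(ddr4_peak, threads_per_core, knl_total_cores, raijin_cores_per_node)
```

With those assumptions the DDR4 figure matches the 115.2 GB/s on the slide, each KnL core carries four hardware threads, and a Raijin Sandy Bridge node has 16 cores, which is the node size used in the later node-for-node comparisons.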

Page 19: Basic KNL first impression

• KnL Pros
  - Full x86 compatibility - applications ‘just work’ without need for major code changes
  - AVX 512-bit instruction set gives performance boost for well vectorised applications
  - Potential to process 2 vector operations per cycle

• KnL Cons
  - Cores are significantly slower than typical Xeon processors
    • 1.3GHz KnL vs. 2.5+GHz for typical Haswell/Broadwell Xeons
    • Simpler architecture means fewer instructions processed per cycle
  - Profiling difficult and hardware not fully exposing what is needed

• Need to understand much more about our applications and their multi-phasic nature
• Deep work on IO, memory pressure and interprocessor comms
• Relearn how to project for the value of the processors
• Use experience to look at other emerging technologies in parallel

Page 20: Experimenting with KnL characteristics

• Australian Geoscience Data Cube LANDSAT processing pipeline
  - Process a series of observations from LANDSAT 8 satellite.

• NOAA Method of Splitting Tsunami (MOST) model
  - Wave propagation due to 7.5 magnitude earthquake in Sunda subduction zone

• UKMO Unified Model v10.5
  - N96 AMIP global model
  - N512 NWP global model

• These are not chosen as the best codes for KnL, but ones that were both important and that we could “quickly” explore.

Page 21: Landsat NBAR data processing pipeline: KnL vs. Sandy Bridge and 1 thread

• Same executable, run on both architectures (i.e. no AVX-512 instructions)
• Separately recompiled with -xMIC-AVX512

[Chart: Landsat Processing Pipeline tasks. Relative runtime (axis 0 to 7) per pipeline task (ModTrans, CalcGrids, TCBand, Packager, InterpBand, Extract) for Sandy Bridge, KnL, and KnL w/ AVX-512]

• Most tasks took longer to complete
  - LANDSAT pipeline tasks are mostly point-wise kernels or IO bound
  - Little opportunity for the compiler to vectorise
  - AVX operations run at lower clock speed on the KnL

• ‘ModTrans’ and ‘TCBand’ tasks are exceptions
  - ModTrans was relatively well vectorised
  - TCBand (Terrain Correction) was converted from point-wise kernels to vector-wise kernels
  - Noted they are faster than SnB (normalised for clock speed)

Page 22: NOAA MOST Tsunami Code - single threaded performance

Time spent on vectorisation is the important first step to CPU performance on KnL

[Chart: MOST Average Timestep. Time (s), axis 0 to 2.5, for KnL - Vectorised, KnL - Original, Sandy Bridge]

[Chart: MOST Average Timestep. Time scaled by CPU clock speed, axis 0 to 2, for KnL - Vectorised, KnL - Original, Sandy Bridge]

• The original MOST code is not vectorised, but does run on KnL
• Replace key routines with vectorised versions
• Compare both raw performance and performance normalised by clock speed
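Normalising by clock speed, as in the second chart, amounts to multiplying each wall-clock time by the processor frequency so that the comparison is in cycles rather than seconds. The sketch below illustrates the idea using the clock speeds quoted elsewhere in this talk (2.6 GHz Sandy Bridge, 1.3 GHz KnL). The example timings are hypothetical and are not read from the charts, and whether the authors used exactly these frequencies is an assumption.

```python
# Clock speeds quoted elsewhere in the talk (GHz).
CLOCK_GHZ = {"Sandy Bridge": 2.6, "KnL": 1.3}


def clock_scaled(time_s, clock_ghz):
    """Scale a wall-clock time by clock speed; the result is in giga-cycles."""
    return time_s * clock_ghz


# Hypothetical timestep times in seconds, NOT taken from the charts.
example_times_s = {"Sandy Bridge": 1.0, "KnL": 1.6}

scaled = {cpu: clock_scaled(t, CLOCK_GHZ[cpu])
          for cpu, t in example_times_s.items()}

for cpu, cycles in scaled.items():
    print(f"{cpu}: {example_times_s[cpu]:.2f} s -> {cycles:.2f} Gcycles")
```

In this made-up case the KnL run is slower in seconds but uses fewer cycles per timestep, which is the kind of distinction the clock-scaled chart is meant to expose.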

Page 23: LANDSAT processing pipeline - comparing Node-for-Node performance

• Parallelism in AGDC LANDSAT is obtained through ‘Luigi’ python scheduler.
  - Task dependencies are tracked within scenes, embarrassingly parallel
  - For 20 scenes, 2620 tasks in total

• ‘ideal’ combination of tasks built (with and without AVX-512 instructions)

[Chart: AGDC LANDSAT Processing - 20 scenes. Time (minutes), axis 0 to 60, for KnL - 128 workers, KnL - 64 workers, Sandy Bridge - 16 workers. Data labels as extracted: 45:39, 52:29, 43:43]

• Knights Landing is slower than Sandy Bridge in this case
  - Node-for-node has competitive performance.
  - Vectorisation can yet improve
  - noted 128 tasks outperforms 64 tasks by over 20%
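The parallel structure described here (ordered tasks inside a scene, no dependencies between scenes) can be illustrated without Luigi. The sketch below is a stand-in built on Python's standard `concurrent.futures`, not the AGDC pipeline or the Luigi API; the task names and their order are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-scene task chain; the real pipeline stages and order differ.
SCENE_TASKS = ["calc_grids", "mod_trans", "tc_band", "packager"]


def process_scene(scene_id):
    """Run one scene's tasks in dependency order; scenes share no state."""
    completed = []
    for task in SCENE_TASKS:
        # A real task would read its inputs and write its outputs here.
        completed.append(f"{scene_id}:{task}")
    return completed


def process_all(scene_ids, workers):
    """Process independent scenes concurrently with a fixed worker count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_scene, scene_ids))


results = process_all([f"scene{i:02d}" for i in range(20)], workers=16)
print(len(results), "scenes,", sum(len(r) for r in results), "tasks")
```

The `workers` argument plays the role of the worker counts compared on the slide (16, 64, 128); for CPU-bound work a process pool would be the more natural choice than threads.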

Page 24: NOAA MOST - OpenMP performance on KnL

• Parallelism in NOAA MOST is obtained through OpenMP

• Good scaling over already good single threaded performance
  - Over 90% efficiency going from 1 thread to fully occupying a node

• Does not benefit from oversubscription
  - Likely due to the subdomains becoming quite small at high thread counts

[Chart: NOAA MOST - Scaling Factor. Scaling factor (axis 0 to 1.2) against threads (1, 2, 4, 8, 16, 32, 64, 128)]
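The "over 90% efficiency" figure is presumably the usual parallel efficiency: speedup divided by thread count. A minimal sketch of that calculation follows; the timings are hypothetical and chosen only to land near the quoted efficiency, and the slide does not confirm that its "scaling factor" is defined exactly this way.

```python
def speedup(t_serial, t_parallel):
    """Speedup of the parallel run relative to the single-threaded run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_threads):
    """Speedup per thread; 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / n_threads


# Hypothetical average timestep times in seconds, NOT taken from the chart.
T_1_THREAD = 6.4
T_64_THREADS = 0.11

eff = parallel_efficiency(T_1_THREAD, T_64_THREADS, 64)
print(f"efficiency on 64 threads: {eff:.1%}")
```

An efficiency that drops when going from 64 to 128 threads on a 64-core node would be consistent with the observation that MOST does not benefit from oversubscription.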

Page 25: NOAA MOST: KnL vs Sandy Bridge node-for-node

• 3x faster node-for-node after vectorisation.
• Note that our experiment shows MOST may be very performant on GPUs

[Chart: NOAA MOST - Average Timestep, full node. Time (s), axis 0 to 0.08, for Sandy Bridge - 16 threads, KnL - 64 threads, KnL vectorised - 64 threads]

Page 26: Unified Model

• UM (10.5) is parallelised using both MPI and OpenMP
• Initially chose AMIP N96 global model
  - useful for performance evaluation as it runs on a single node and has no complex IO traffic
  - Find best decomposition on a single KnL node (will compare with best decomposition on a single Sandy Bridge node)

Outcomes:
• Overcommitting KnL proves beneficial to performance with the UM
• All 64 thread jobs are outperformed by 128 and 256 thread jobs

[Chart: N96 AMIP Runtime. Time (s), axis 0 to 350, by decomposition (4x8, 4x16, 8x8) for 64 Threads, 128 Threads, 256 Threads]
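The decompositions and thread counts in the chart combine as MPI tasks times OpenMP threads per task. The sketch below tabulates that relationship for the three decompositions shown, assuming the "N Threads" series means total threads on the node; the transcript does not state this explicitly, although a later slide does pair 4x16 with 2 OpenMP threads per task.

```python
KNL_CORES = 64  # cores per KnL node, from the hardware slide


def threads_per_task(decomposition, total_threads):
    """OpenMP threads per MPI task for an 'AxB' decomposition.

    Returns None when the total does not divide evenly among the tasks.
    """
    a, b = (int(x) for x in decomposition.split("x"))
    tasks = a * b
    if total_threads % tasks:
        return None
    return total_threads // tasks


layout = {
    (d, t): threads_per_task(d, t)
    for d in ("4x8", "4x16", "8x8")
    for t in (64, 128, 256)
}

for (d, t), omp in layout.items():
    print(f"{d}: {t} threads -> {omp} OpenMP threads per task, "
          f"{t // KNL_CORES} hardware threads per core")
```

Under that reading, the 128 and 256 thread jobs place two and four hardware threads on each core, which is what "overcommitting" refers to on a 64-core node.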

Page 27: Software Stacks for the UM

• Intel MPI consistently outperforms OpenMPI for the UM on KnL

• But Intel MPI lacks some of the fine-grained control we need
  - The ability to specify individual cores in a rankfile
  - Seemingly unable to bind to ‘none’ - important for explicit binding with numactl
  - Can’t report binding with the same detail as OpenMPI

• We used versions 15 or 16 of the Intel Fortran/C/C++ compilers
  - ‘-x’ compiler options to enable or disable AVX-512 in order to test the effectiveness of the longer vector registers or issues
    • LANDSAT processing slows with AVX-512 enabled
    • Some instability in the UM when AVX-512 enabled

[Chart: N96 AMIP Runtime - 128 Threads. Time (s), axis 0 to 300, by decomposition (4x8, 4x16, 8x8) for OpenMPI and Intel MPI]

Page 28: UM decomposition comparison

• Best performing decompositions:
  - KnL is 4x16, with 2 OpenMP threads per MPI task
  - SnB is 2x8

• KnL is about 20% faster than the best decomposition on Sandy Bridge
  - Despite model input I/O stage taking 5x longer on KnL
  - Larger MPI decomposition limits multinode scalability for UM on KnL

• Hybrid parallelism can help here
  - More threads per MPI task means smaller decompositions
  - Many threading improvements to come in UM 10.6+

[Chart: N96 AMIP Runtime Comparison. Runtime (s), axis 0 to 300, for KnL - 4x16, 2 OMP threads and Sandy Bridge, 2x8]

Page 29: UM N512 global NWP on KnL vs SnB

• Use same decomposition: 16x64, 2 threads per MPI task, total of 2048 threads

But 16 KnL nodes vs 64 SnB nodes means the model uses 33% fewer node-hours on KnL

• MPI
  - Task layout is important on KnL
  - N512 job uses the UM’s IO server feature where all IO tasks can run on a separate node
  - When the IO tasks are interleaved with the model, runtime increases -> Need to separate IO

[Chart: Relative Runtime - N512 global NWP. Axis 0 to 3, for Sandy Bridge and KnL]

[Chart: Relative Runtime - N512 global NWP. Axis 0 to 3.5, for Sandy Bridge, KnL, and KnL - interleaved IO]
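The node-hours claim follows from the node counts and the relative runtime. The chart values are not present in this transcript, so the sketch below works backwards: it computes the KnL relative runtime implied by a 33% node-hour saving on 16 nodes versus 64. The function names are illustrative only.

```python
def node_hours_ratio(nodes_a, runtime_a, nodes_b, runtime_b):
    """Node-hours used by system A relative to system B."""
    return (nodes_a * runtime_a) / (nodes_b * runtime_b)


def implied_relative_runtime(nodes_a, nodes_b, saving):
    """Runtime of A relative to B that yields the given node-hour saving."""
    return (1.0 - saving) * nodes_b / nodes_a


KNL_NODES, SNB_NODES = 16, 64
TOTAL_THREADS = 16 * 64 * 2  # 16x64 MPI tasks, 2 threads per task

knl_rel_runtime = implied_relative_runtime(KNL_NODES, SNB_NODES, 0.33)
ratio = node_hours_ratio(KNL_NODES, knl_rel_runtime, SNB_NODES, 1.0)

print(f"implied KnL runtime: {knl_rel_runtime:.2f}x Sandy Bridge")
print(f"node-hours on KnL: {ratio:.0%} of Sandy Bridge")
print(f"threads per node: KnL {TOTAL_THREADS // KNL_NODES}, "
      f"SnB {TOTAL_THREADS // SNB_NODES}")
```

So the same 2048 threads are packed four times more densely on KnL, and the job can run somewhat more than two and a half times longer there while still consuming fewer node-hours.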

Page 30: UM N96

• KnL has a 2 stage memory hierarchy
  - 16GB MCDRAM on-package (Cache or Flat mode)
  - 192GB DDR4

• All our UM tests shown so far have been in ‘cache’ mode.
• N96 AMIP global occupies just over 16GB RSS when run in a 4x16 decomposition
• Can additional performance be extracted in ‘flat’ mode?

[Chart: Relative Runtime of N96 AMIP with different memory binding settings. Relative runtime (axis 0 to 1.6) for Default Bind, MCDRAM Bind, Cache mode]

• No MPI distribution performs the binding correctly, so launch MPI processes using numactl
• Both default binding (DDR4) and MCDRAM binding are slower than cache mode.
• Loss of performance when run on DDR4 implies that the UM is still memory bound, even on slow KnL cores.

Page 31: Profiling with Score-P

• Score-P is an instrumenting profiler
• Issues - instrumenting on KnL is very costly
  - Entering and exiting instrumented areas seems to cost a fixed number of cycles
  - Cycles take much longer on a KnL

• Compare with limited instrumentation to key ‘control’ subroutines
  - Allows identification of key code areas (e.g. convection, radiation etc.), but nothing within those areas

• Partial instrumentation is better, but
  - if an OpenMP parallel section is not instrumented, time spent in threads other than the main thread is lost
  - Can’t analyse thread efficiency this way

[Chart: Profiling N96 AMIP job with Score-P. Runtime (s), axis 0 to 500, for No Profiling, Score-P enabled, Score-P partial]
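The observation that entering and exiting instrumented regions costs a roughly fixed number of cycles suggests a simple cost model: overhead in seconds is visits times cycles per visit, divided by clock frequency. The sketch below applies that model at the two clock speeds quoted in this talk. The visit count and cycles per visit are hypothetical and are not Score-P measurements.

```python
def instrumentation_overhead_s(region_visits, cycles_per_visit, clock_ghz):
    """Seconds added by instrumentation under a fixed-cycle cost model."""
    return region_visits * cycles_per_visit / (clock_ghz * 1e9)


# Hypothetical workload: neither number comes from the slides.
VISITS = 2_000_000_000      # instrumented region enter/exit pairs
CYCLES_PER_VISIT = 100      # fixed cost per pair, in cycles

knl_overhead = instrumentation_overhead_s(VISITS, CYCLES_PER_VISIT, 1.3)
snb_overhead = instrumentation_overhead_s(VISITS, CYCLES_PER_VISIT, 2.6)

print(f"KnL: {knl_overhead:.0f} s, Sandy Bridge: {snb_overhead:.0f} s")
```

Under this model the same instrumentation costs twice as long in wall-clock terms on a 1.3 GHz KnL core as on a 2.6 GHz Sandy Bridge core, before accounting for the simpler KnL core executing fewer instructions per cycle.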

Page 32: Profiling through sampling - experience so far

• Sampling profiling can be used instead
  - OpenSpeedShop
  - HPCToolkit
  - Intel VTune (can’t be used with OpenMPI)

OpenSpeedShop
• Profiling UM with OpenSpeedShop produces negligible overhead
• Potential issue with sampling rate, but in practice good agreement

Intel VTune
• Around 10% overhead in MOST with Intel VTune
• Some features are not available on KnL (e.g. Advanced Hotspots)

Page 33: Summary of Knights Landing experience so far

• KnLs look like promising technology and worth more investigation
  - Well vectorised workloads are essential to performance on KnL
    • Unvectorised workloads see KnL outperformed node-for-node by Sandy Bridge
    • Well vectorised workloads run significantly faster
  - Nodes are more energy efficient.
  - Code changes are more generally useful, so not specifically targeted for KnL.
  - Hybrid Parallelism and reducing MPI task management is needed for large-scale jobs

• Data-intensive IO needs more attention for performance - especially parallel I/O
  - Parallel I/O available through NetCDF and HDF5

• Profiling applications is still difficult
  - Instrumented profilers can’t be used until the overhead can be reduced
  - Sampling profilers may be missing events
  - Some missing functionality

• Helpful for understanding more details of the behaviour of codes
• How does it compare to GPU and other emerging chip technologies?