Accelerating Deep Learning with MVAPICH
Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
OSU Booth Talk (SC '17)
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
DL Frameworks and Trends
• Caffe, TensorFlow, CNTK, and many more...
• Most frameworks are exploiting GPUs to accelerate training
• Diverse applications – Image Recognition, Cancer Detection, Self-Driving Cars, Speech Processing, etc.
https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  – The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – A natural fit for DL due to their throughput-oriented nature
  – GPUs are also growing in the HPC arena!
* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/statistics/list/
And CPUs are catching up fast
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
  – Many-core Xeon Phis are increasing
  – Xeon Phi 1st generation was a co-processor
  – Unlike Xeon Phi 2nd generation, which is a self-hosted processor!
• Usually, we hear CPUs are 10x – 100x slower than GPUs [1-3]
  – But can we do better?
(Figure: System Count for Xeon Phi, https://www.top500.org/statistics/list/)
1 - https://dl.acm.org/citation.cfm?id=1993516
2 - http://ieeexplore.ieee.org/abstract/document/5762730/
3 - https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
What to use for scale-out? (Distributed training of Neural Nets)
• What is the Message Passing Interface (MPI)?
  – A de-facto standard for expressing distributed-memory parallel programming
  – Used for communication between processes in multi-process applications
• MVAPICH2 is a high-performance implementation of the MPI standard
• What can MPI do for Deep Learning?
  – MPI has been used for large-scale scientific applications
  – Deep Learning can also exploit MPI to perform high-performance communication
• Why do I need communication in Deep Learning?
  – If you use one GPU or one CPU, you do not need communication
  – But one GPU or CPU is not enough!
  – DL wants as many compute elements as it can get!
  – MPI is a great fit – Broadcast, Reduce, and Allreduce are what most DL workloads require (see the sketch below)
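To make the last point concrete, here is a minimal, hedged sketch (not OSU-Caffe code) of how a data-parallel trainer could average its gradients with MPI_Allreduce. The buffer name, its length, and the assumption that gradients are already packed into one contiguous float array are illustrative.

```c
/* Minimal sketch: average gradients across data-parallel workers with
 * MPI_Allreduce. Assumes gradients are already packed into one contiguous
 * float buffer (grad, n elements); names are illustrative, not a real API. */
#include <mpi.h>

void average_gradients(float *grad, int n, MPI_Comm comm)
{
    int nworkers;
    MPI_Comm_size(comm, &nworkers);

    /* Sum the gradient buffers of all workers, in place on every rank */
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, comm);

    /* Turn the sum into an average before the solver applies it */
    for (int i = 0; i < n; i++)
        grad[i] /= (float)nworkers;
}
```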
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks
  – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Almost every framework has been optimized for NVIDIA GPUs
  – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope, and actually a lot of great progress here!
  – MKL-DNN, just like cuDNN, has definitely rekindled this!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising.
The Key Question!
How do we efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources like GPUs and Xeon Phi(s)?
Research Challenges
Let us bring HPC and DL "together"!
• Computation and communication characteristics of DL workloads?
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based DNN training
• Performance behavior for hardware features
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Caffe Architecture
(Figure: data-parallel training loop across GPUs 0-3. Each iteration: 1. Data Propagation – Bcast of the packed_comm_buff of parameters from GPU 0; 2. Forward/Backward Pass over layers L1..Ln on each GPU; 3. Gradient Aggregation – Reduce of each GPU's packed_reduce_buff onto GPU 0, followed by Apply Updates; then Loop{}.)
http://hidl.cse.ohio-state.edu
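To make the loop in the figure concrete, here is a hedged sketch of the same three steps expressed with plain MPI, assuming one rank per GPU; forward_backward() and apply_updates() stand in for the framework's compute hooks and are hypothetical, as are the buffer names.

```c
/* Sketch of the per-iteration loop from the figure above (one MPI rank per GPU).
 * forward_backward() and apply_updates() are hypothetical framework hooks. */
#include <mpi.h>

void forward_backward(float *params, float *grads, int n);        /* hypothetical */
void apply_updates(float *params, const float *grads, int n);     /* hypothetical */

void training_loop(float *packed_comm_buff, float *packed_reduce_buff,
                   int n, int iterations, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int it = 0; it < iterations; it++) {
        /* 1. Data propagation: broadcast packed parameters from GPU/rank 0 */
        MPI_Bcast(packed_comm_buff, n, MPI_FLOAT, 0, comm);

        /* 2. Forward/backward pass over layers L1..Ln on the local batch */
        forward_backward(packed_comm_buff, packed_reduce_buff, n);

        /* 3. Gradient aggregation: reduce packed gradients onto rank 0 */
        if (rank == 0)
            MPI_Reduce(MPI_IN_PLACE, packed_reduce_buff, n,
                       MPI_FLOAT, MPI_SUM, 0, comm);
        else
            MPI_Reduce(packed_reduce_buff, NULL, n,
                       MPI_FLOAT, MPI_SUM, 0, comm);

        /* Rank 0 applies the solver update before the next broadcast */
        if (rank == 0)
            apply_updates(packed_comm_buff, packed_reduce_buff, n);
    }
}
```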
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
(Figure: positioning of cuDNN, cuBLAS, NCCL, MPI, gRPC, and Hadoop along scale-up vs. scale-out performance axes; the proposed co-designs target high performance on both.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
MVAPICH2-GDR: Scale-out for GPU-based Distributed Training
(Figure: OSU micro-benchmarks comparing MV2 (NO-GDR) and MV2-GDR 2.3a – GPU-GPU inter-node latency down to 1.88 us (up to 11X better), plus GPU-GPU inter-node bandwidth and bi-directional bandwidth improvements of roughly 9x and 10x, all plotted against message size.)
Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA.
MVAPICH2-GDR: performance that meets Deep Learning requirements!
OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
(Figure: GoogLeNet (ImageNet) training time in seconds vs. number of GPUs (8-128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); an "invalid use case" region is marked.)
OSU-Caffe 0.9 is available from the HiDL site
Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL
• NCCL has some limitations
  – Works only within a single node, thus no scale-out to multiple nodes
  – Degradation across the IOH (socket) for scale-up (within a node)
• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits (see the sketch below):
  – CUDA-Aware MPI_Bcast in MV2-GDR
  – the NCCL broadcast primitive
(Figure: performance benefits – OSU micro-benchmark latency (log scale) vs. message size for MV2-GDR vs. MV2-GDR-Opt, up to 100x improvement; Microsoft CNTK DL framework training time vs. number of GPUs (2-64), 25% average improvement.)
"Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning," A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
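As a hedged illustration of the hierarchical scheme (not the MVAPICH2-GDR internals), the sketch below broadcasts a GPU buffer across node leaders with CUDA-Aware MPI_Bcast and then fans it out within each node via NCCL's broadcast. It assumes one rank per GPU, that the global root is a node leader, that each leader is NCCL rank 0 inside its node communicator, and that a per-node ncclComm_t and CUDA stream already exist.

```c
/* Hedged sketch of a two-level broadcast: CUDA-Aware MPI_Bcast across node
 * leaders, NCCL broadcast inside each node. Not the MVAPICH2-GDR design itself. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hierarchical_bcast(float *d_buf, size_t count, MPI_Comm world,
                        ncclComm_t node_nccl, cudaStream_t stream)
{
    int rank, node_rank;
    MPI_Comm node, leaders;

    MPI_Comm_rank(world, &rank);

    /* Ranks sharing a node; local rank 0 acts as the node leader */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &node_rank);

    /* Communicator containing only the node leaders */
    MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Stage 1: inter-node broadcast among leaders; the GPU pointer is passed
     * directly, relying on a CUDA-Aware MPI such as MVAPICH2-GDR */
    if (node_rank == 0)
        MPI_Bcast(d_buf, (int)count, MPI_FLOAT, 0, leaders);

    /* Stage 2: intra-node fan-out from the leader (NCCL rank 0) over NVLink/PCIe */
    ncclBcast(d_buf, count, ncclFloat, 0, node_nccl, stream);
    cudaStreamSynchronize(stream);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
}
```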
Pure MPI Large Message Broadcast
• MPI_Bcast: design and performance tuning for DL workloads
  – Ring-based algorithms designed for large messages (a pipelined sketch follows below)
  – Harness a multitude of algorithms and techniques for the best performance across the full range of message sizes and process/GPU counts
• Performance benefits
  – Performance comparable to or better than NCCL-augmented approaches for large messages
  – Up to 10X improvement for small/medium message sizes with micro-benchmarks
  – Up to 7% improvement for VGG training
(Figure: MPI_Bcast benchmark latency (ms, log scale) vs. message size on 128 GPUs (8 nodes), and VGG training time with CNTK vs. number of GPUs (2-128), comparing MV2-GDR-NCCL and MV2-GDR-Opt.)
A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, "Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?," arXiv '17 (https://arxiv.org/abs/1707.09414)
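The "ring-based algorithms for large messages" idea can be illustrated with a simple pipelined chain broadcast: the message is split into chunks that flow rank-to-rank, so every link of the chain stays busy once the pipeline fills. This is a hedged sketch of the general technique, not the tuned MVAPICH2-GDR implementation.

```c
/* Pipelined chain broadcast sketch for large messages: split the buffer into
 * chunks and forward them rank-to-rank so transfers overlap down the chain.
 * A tuned version would use non-blocking sends/receives for fuller overlap. */
#include <mpi.h>
#include <stddef.h>

void chain_bcast(char *buf, size_t total, size_t chunk, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int pos  = (rank - root + size) % size;                       /* position in the chain */
    int prev = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
    int next = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

    for (size_t off = 0; off < total; off += chunk) {
        int n = (int)((off + chunk <= total) ? chunk : total - off);

        /* Receive the next chunk from the previous rank (the root skips this) */
        if (prev != MPI_PROC_NULL)
            MPI_Recv(buf + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);

        /* Forward the chunk down the chain while later chunks are still in flight */
        if (next != MPI_PROC_NULL)
            MPI_Send(buf + off, n, MPI_BYTE, next, 0, comm);
    }
}
```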
Large Message Allreduce: MVAPICH2-GDR vs. Baidu-allreduce
• Performance gains for MVAPICH2-GDR 2.3a* compared to Baidu-allreduce (a sketch of passing GPU buffers to MPI_Allreduce follows below)
(Figure: Allreduce latency (us, log scale) vs. message size on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR – up to ~30X better and ~11% improvement depending on message size.)
* Available with MVAPICH2-GDR 2.3a
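For context on what "CUDA-Aware" buys the application in this comparison: the gradient buffer can stay in GPU memory and be handed to MPI_Allreduce directly, letting the runtime (MVAPICH2-GDR here) move the data, using GPUDirect RDMA where available. A small hedged sketch with illustrative names:

```c
/* Hedged sketch: a CUDA-Aware MPI lets a device pointer go straight into
 * MPI_Allreduce, with no explicit cudaMemcpy staging in the application. */
#include <mpi.h>
#include <cuda_runtime.h>

void allreduce_gradients_on_gpu(int n, MPI_Comm comm)
{
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, n * sizeof(float));   /* gradients live on the GPU */

    /* ... forward/backward pass fills d_grad ... */

    /* Sum gradients across all ranks directly on the device buffers */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, comm);

    cudaFree(d_grad);
}
```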
Large Message Optimized Collectives for Deep Learning
• MVAPICH2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available in MVAPICH2-GDR 2.2 and higher
(Figure: latency in ms for large-message collectives – Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs vs. message size (2-128 MB); Reduce at 64 MB on 128-192 GPUs; Allreduce and Bcast at 128 MB on 16-64 GPUs.)
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Understanding the Impact of Execution Environments
• Performance depends on many factors
• Hardware architectures
  – GPUs
  – Multi-/Many-core CPUs
  – Software libraries: cuDNN (for GPUs), MKL-DNN/MKL 2017 (for CPUs)
• Hardware and software co-design
  – Software libraries optimized for one platform will not help the other!
  – cuDNN vs. MKL-DNN
(Figure: the execution stack – DL applications (image recognition, speech processing, etc.) on DL frameworks (Caffe, TensorFlow, etc.) on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, other BLAS libraries) on hardware (multi-/many-core Xeon and Xeon Phi, many-core Pascal P100 GPU, other processors); the convolution layer maps to a generic, MKL-optimized, or cuDNN-optimized implementation depending on the stack.)
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.
Impact of the MKL engine and MCDRAM for Intel-Caffe
• We use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains are for the Intel Xeon Phi (many-core) architecture
• Both Haswell and Broadwell architectures get significant speedups (up to 1.5X)
(Figure: forward and backward training time in ms across CPU architectures, and across memory configurations – DDR-All, MCDRAM-All, and MCDRAM as Cache.)
The Full Landscape for AlexNet Training
• Convolutions in the Forward and Backward Pass
• Faster convolutions → faster training
• Most performance gains come from conv2 and conv3
(Figure: per-layer time in ms for conv1-conv5 in the forward and backward passes.)
Multi-node Results: ResNet-50 (Intel-Caffe)
• All results are weak scaling
  – The batch size remains constant per solver but the overall batch size grows by:
  – batch-size * #nodes, or
  – batch-size * #gpus
• Images/second is a derived metric, but it is more meaningful for understanding scalability (see the small sketch below)
• Efficiency is another story [1]
  – Larger DNN architectures → less scalability due to communication overhead
(Figure: ResNet-50 with Intel-Caffe – training time in seconds and images/second vs. number of nodes (2-32).)
1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
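As a small illustration of how the derived metric behaves under weak scaling (an illustrative helper, not taken from the slides): the per-solver batch size stays fixed, the effective batch grows with the number of solvers (nodes or GPUs), and throughput is the total images processed divided by wall-clock training time.

```c
/* Illustrative helper: images/second under weak scaling, where the effective
 * batch size is the per-solver batch size times the number of solvers. */
double images_per_second(int batch_per_solver, int num_solvers,
                         long iterations, double training_seconds)
{
    double effective_batch = (double)batch_per_solver * (double)num_solvers;
    return effective_batch * (double)iterations / training_seconds;
}
```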
Summary
• Deep Learning is on the rise
  – Rapid advances in software and hardware, and the availability of large datasets, are driving it
• A single node or a single GPU is not enough for Deep Learning workloads
• We need to focus on distributed Deep Learning, but there are many challenges
• MPI offers a great abstraction for communication in DL training tasks
• A co-design of Deep Learning frameworks and communication runtimes will be required to make DNN training scalable
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
High-Performance Deep Learning: http://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Please join us for other events at SC '17
• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning
Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details