Accelerating Deep Learning with MVAPICH
Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
OSU Booth Talk (SC '17)
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
DL Frameworks and Trends
• Caffe, TensorFlow, CNTK, and many more...
• Most frameworks are exploiting GPUs to accelerate training
• Diverse applications – Image Recognition, Cancer Detection, Self-Driving Cars, Speech Processing, etc.
https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
GPUs are great for Deep Learning
• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  – The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – A natural fit for DL due to their throughput-oriented nature
  – GPUs are also growing in the HPC arena!
* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/statistics/list/
And CPUs are catching up fast
• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
  – Many-core Xeon Phis are increasing
  – Xeon Phi 1st generation was a co-processor
  – Unlike Xeon Phi 2nd generation, which is a self-hosted processor!
• Usually, we hear CPUs are 10x – 100x slower than GPUs [1-3]
  – But can we do better?
(Figure: System Count for Xeon Phi, https://www.top500.org/statistics/list/)
1 - https://dl.acm.org/citation.cfm?id=1993516
2 - http://ieeexplore.ieee.org/abstract/document/5762730/
3 - https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
What to use for scale-out? (Distributed training of Neural Nets)
• What is the Message Passing Interface (MPI)?
  – A de-facto standard for expressing distributed-memory parallel programming
  – Used for communication between processes in multi-process applications
• MVAPICH2 is a high-performance implementation of the MPI standard
• What can MPI do for Deep Learning?
  – MPI has been used for large-scale scientific applications
  – Deep Learning can also exploit MPI to perform high-performance communication
• Why do I need communication in Deep Learning?
  – If you use one GPU or one CPU, you do not need communication
  – But one GPU or CPU is not enough!
  – DL wants as many compute elements as it can get!
  – MPI is a great fit – Broadcast, Reduce, and Allreduce are what most DL workloads require (see the sketch below)
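To make the last point concrete, here is a minimal, hedged sketch (not OSU-Caffe code) of how a data-parallel trainer could average its gradients with MPI_Allreduce. The buffer name, its length, and the assumption that gradients are already packed into one contiguous float array are illustrative.

```c
/* Minimal sketch: average gradients across data-parallel workers with
 * MPI_Allreduce. Assumes gradients are already packed into one contiguous
 * float buffer (grad, n elements); names are illustrative, not a real API. */
#include <mpi.h>

void average_gradients(float *grad, int n, MPI_Comm comm)
{
    int nworkers;
    MPI_Comm_size(comm, &nworkers);

    /* Sum the gradient buffers of all workers, in place on every rank */
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, comm);

    /* Turn the sum into an average before the solver applies it */
    for (int i = 0; i < n; i++)
        grad[i] /= (float)nworkers;
}
```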
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Deep Learning Frameworks – CPUs or GPUs?
• There are several Deep Learning (DL) or DNN training frameworks
  – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting...
• Almost every framework has been optimized for NVIDIA GPUs
  – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope, and actually a lot of great progress here!
  – MKL-DNN, just like cuDNN, has definitely rekindled this!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising.
The Key Question!
How do we efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources like GPUs and Xeon Phi(s)?
Research Challenges
Let us bring HPC and DL "together"!
• Computation and communication characteristics of DL workloads?
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based DNN training
• Performance behavior for hardware features
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Caffe Architecture
(Figure: data-parallel training loop across GPUs 0-3. Each iteration: 1. Data Propagation – Bcast of the packed_comm_buff of parameters from GPU 0; 2. Forward/Backward Pass over layers L1..Ln on each GPU; 3. Gradient Aggregation – Reduce of each GPU's packed_reduce_buff onto GPU 0, followed by Apply Updates; then Loop{}.)
http://hidl.cse.ohio-state.edu
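To make the loop in the figure concrete, here is a hedged sketch of the same three steps expressed with plain MPI, assuming one rank per GPU; forward_backward() and apply_updates() stand in for the framework's compute hooks and are hypothetical, as are the buffer names.

```c
/* Sketch of the per-iteration loop from the figure above (one MPI rank per GPU).
 * forward_backward() and apply_updates() are hypothetical framework hooks. */
#include <mpi.h>

void forward_backward(float *params, float *grads, int n);        /* hypothetical */
void apply_updates(float *params, const float *grads, int n);     /* hypothetical */

void training_loop(float *packed_comm_buff, float *packed_reduce_buff,
                   int n, int iterations, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int it = 0; it < iterations; it++) {
        /* 1. Data propagation: broadcast packed parameters from GPU/rank 0 */
        MPI_Bcast(packed_comm_buff, n, MPI_FLOAT, 0, comm);

        /* 2. Forward/backward pass over layers L1..Ln on the local batch */
        forward_backward(packed_comm_buff, packed_reduce_buff, n);

        /* 3. Gradient aggregation: reduce packed gradients onto rank 0 */
        if (rank == 0)
            MPI_Reduce(MPI_IN_PLACE, packed_reduce_buff, n,
                       MPI_FLOAT, MPI_SUM, 0, comm);
        else
            MPI_Reduce(packed_reduce_buff, NULL, n,
                       MPI_FLOAT, MPI_SUM, 0, comm);

        /* Rank 0 applies the solver update before the next broadcast */
        if (rank == 0)
            apply_updates(packed_comm_buff, packed_reduce_buff, n);
    }
}
```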
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
(Figure: positioning of cuDNN, cuBLAS, NCCL, MPI, gRPC, and Hadoop along scale-up vs. scale-out performance axes; the proposed co-designs target high performance on both.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
MVAPICH2-GDR: Scale-out for GPU-based Distributed Training
(Figure: OSU micro-benchmarks comparing MV2 (NO-GDR) and MV2-GDR 2.3a – GPU-GPU inter-node latency down to 1.88 us (up to 11X better), plus GPU-GPU inter-node bandwidth and bi-directional bandwidth improvements of roughly 9x and 10x, all plotted against message size.)
Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA.
MVAPICH2-GDR: performance that meets Deep Learning requirements!
OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
(Figure: GoogLeNet (ImageNet) training time in seconds vs. number of GPUs (8-128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); an "invalid use case" region is marked.)
OSU-Caffe 0.9 is available from the HiDL site
Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL
• NCCL has some limitations
  – Works only within a single node, thus no scale-out to multiple nodes
  – Degradation across the IOH (socket) for scale-up (within a node)
• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits (see the sketch below):
  – CUDA-Aware MPI_Bcast in MV2-GDR
  – the NCCL broadcast primitive
(Figure: performance benefits – OSU micro-benchmark latency (log scale) vs. message size for MV2-GDR vs. MV2-GDR-Opt, up to 100x improvement; Microsoft CNTK DL framework training time vs. number of GPUs (2-64), 25% average improvement.)
"Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning," A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
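As a hedged illustration of the hierarchical scheme (not the MVAPICH2-GDR internals), the sketch below broadcasts a GPU buffer across node leaders with CUDA-Aware MPI_Bcast and then fans it out within each node via NCCL's broadcast. It assumes one rank per GPU, that the global root is a node leader, that each leader is NCCL rank 0 inside its node communicator, and that a per-node ncclComm_t and CUDA stream already exist.

```c
/* Hedged sketch of a two-level broadcast: CUDA-Aware MPI_Bcast across node
 * leaders, NCCL broadcast inside each node. Not the MVAPICH2-GDR design itself. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hierarchical_bcast(float *d_buf, size_t count, MPI_Comm world,
                        ncclComm_t node_nccl, cudaStream_t stream)
{
    int rank, node_rank;
    MPI_Comm node, leaders;

    MPI_Comm_rank(world, &rank);

    /* Ranks sharing a node; local rank 0 acts as the node leader */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &node_rank);

    /* Communicator containing only the node leaders */
    MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Stage 1: inter-node broadcast among leaders; the GPU pointer is passed
     * directly, relying on a CUDA-Aware MPI such as MVAPICH2-GDR */
    if (node_rank == 0)
        MPI_Bcast(d_buf, (int)count, MPI_FLOAT, 0, leaders);

    /* Stage 2: intra-node fan-out from the leader (NCCL rank 0) over NVLink/PCIe */
    ncclBcast(d_buf, count, ncclFloat, 0, node_nccl, stream);
    cudaStreamSynchronize(stream);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
}
```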
Pure MPI Large Message Broadcast
• MPI_Bcast: design and performance tuning for DL workloads
  – Ring-based algorithms designed for large messages (a pipelined sketch follows below)
  – Harness a multitude of algorithms and techniques for the best performance across the full range of message sizes and process/GPU counts
• Performance benefits
  – Performance comparable to or better than NCCL-augmented approaches for large messages
  – Up to 10X improvement for small/medium message sizes with micro-benchmarks
  – Up to 7% improvement for VGG training
(Figure: MPI_Bcast benchmark latency (ms, log scale) vs. message size on 128 GPUs (8 nodes), and VGG training time with CNTK vs. number of GPUs (2-128), comparing MV2-GDR-NCCL and MV2-GDR-Opt.)
A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, "Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?," arXiv '17 (https://arxiv.org/abs/1707.09414)
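The "ring-based algorithms for large messages" idea can be illustrated with a simple pipelined chain broadcast: the message is split into chunks that flow rank-to-rank, so every link of the chain stays busy once the pipeline fills. This is a hedged sketch of the general technique, not the tuned MVAPICH2-GDR implementation.

```c
/* Pipelined chain broadcast sketch for large messages: split the buffer into
 * chunks and forward them rank-to-rank so transfers overlap down the chain.
 * A tuned version would use non-blocking sends/receives for fuller overlap. */
#include <mpi.h>
#include <stddef.h>

void chain_bcast(char *buf, size_t total, size_t chunk, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int pos  = (rank - root + size) % size;                       /* position in the chain */
    int prev = (pos == 0)        ? MPI_PROC_NULL : (rank - 1 + size) % size;
    int next = (pos == size - 1) ? MPI_PROC_NULL : (rank + 1) % size;

    for (size_t off = 0; off < total; off += chunk) {
        int n = (int)((off + chunk <= total) ? chunk : total - off);

        /* Receive the next chunk from the previous rank (the root skips this) */
        if (prev != MPI_PROC_NULL)
            MPI_Recv(buf + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);

        /* Forward the chunk down the chain while later chunks are still in flight */
        if (next != MPI_PROC_NULL)
            MPI_Send(buf + off, n, MPI_BYTE, next, 0, comm);
    }
}
```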
Large Message Allreduce: MVAPICH2-GDR vs. Baidu-allreduce
• Performance gains for MVAPICH2-GDR 2.3a* compared to Baidu-allreduce (a sketch of passing GPU buffers to MPI_Allreduce follows below)
(Figure: Allreduce latency (us, log scale) vs. message size on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR – up to ~30X better and ~11% improvement depending on message size.)
* Available with MVAPICH2-GDR 2.3a
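For context on what "CUDA-Aware" buys the application in this comparison: the gradient buffer can stay in GPU memory and be handed to MPI_Allreduce directly, letting the runtime (MVAPICH2-GDR here) move the data, using GPUDirect RDMA where available. A small hedged sketch with illustrative names:

```c
/* Hedged sketch: a CUDA-Aware MPI lets a device pointer go straight into
 * MPI_Allreduce, with no explicit cudaMemcpy staging in the application. */
#include <mpi.h>
#include <cuda_runtime.h>

void allreduce_gradients_on_gpu(int n, MPI_Comm comm)
{
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, n * sizeof(float));   /* gradients live on the GPU */

    /* ... forward/backward pass fills d_grad ... */

    /* Sum gradients across all ranks directly on the device buffers */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, comm);

    cudaFree(d_grad);
}
```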
Large Message Optimized Collectives for Deep Learning
• MVAPICH2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available in MVAPICH2-GDR 2.2 and higher
(Figure: latency in ms for large-message collectives – Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs vs. message size (2-128 MB); Reduce at 64 MB on 128-192 GPUs; Allreduce and Bcast at 128 MB on 16-64 GPUs.)
Agenda
• Introduction
  – Deep Learning Trends
  – CPUs and GPUs for Deep Learning
  – Message Passing Interface (MPI)
• Co-design Efforts
  – OSU-Caffe
  – NCCL-augmented MPI Broadcast
  – Large-message CUDA-Aware MPI Collectives
• Characterization of Deep Learning Workloads
  – CPUs vs. GPUs for Deep Learning with Caffe
Understanding the Impact of Execution Environments
• Performance depends on many factors
• Hardware architectures
  – GPUs
  – Multi-/Many-core CPUs
  – Software libraries: cuDNN (for GPUs), MKL-DNN/MKL 2017 (for CPUs)
• Hardware and software co-design
  – Software libraries optimized for one platform will not help the other!
  – cuDNN vs. MKL-DNN
(Figure: the execution stack – DL applications (image recognition, speech processing, etc.) on DL frameworks (Caffe, TensorFlow, etc.) on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, other BLAS libraries) on hardware (multi-/many-core Xeon and Xeon Phi, many-core Pascal P100 GPU, other processors); the convolution layer maps to a generic, MKL-optimized, or cuDNN-optimized implementation depending on the stack.)
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.
Impact of the MKL engine and MCDRAM for Intel-Caffe
• We use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains are for the Intel Xeon Phi (many-core) architecture
• Both Haswell and Broadwell architectures get significant speedups (up to 1.5X)
(Figure: forward and backward training time in ms across CPU architectures, and across memory configurations – DDR-All, MCDRAM-All, and MCDRAM as Cache.)
The Full Landscape for AlexNet Training
• Convolutions in the Forward and Backward Pass
• Faster convolutions → faster training
• Most performance gains come from conv2 and conv3
(Figure: per-layer time in ms for conv1-conv5 in the forward and backward passes.)
Multi-node Results: ResNet-50 (Intel-Caffe)
• All results are weak scaling
  – The batch size remains constant per solver but the overall batch size grows by:
  – batch-size * #nodes, or
  – batch-size * #gpus
• Images/second is a derived metric, but it is more meaningful for understanding scalability (see the small sketch below)
• Efficiency is another story [1]
  – Larger DNN architectures → less scalability due to communication overhead
(Figure: ResNet-50 with Intel-Caffe – training time in seconds and images/second vs. number of nodes (2-32).)
1. Experiences of Scaling TensorFlow On Up to 512 Nodes On CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
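As a small illustration of how the derived metric behaves under weak scaling (an illustrative helper, not taken from the slides): the per-solver batch size stays fixed, the effective batch grows with the number of solvers (nodes or GPUs), and throughput is the total images processed divided by wall-clock training time.

```c
/* Illustrative helper: images/second under weak scaling, where the effective
 * batch size is the per-solver batch size times the number of solvers. */
double images_per_second(int batch_per_solver, int num_solvers,
                         long iterations, double training_seconds)
{
    double effective_batch = (double)batch_per_solver * (double)num_solvers;
    return effective_batch * (double)iterations / training_seconds;
}
```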
Summary
• Deep Learning is on the rise
  – Rapid advances in software and hardware, and the availability of large datasets, are driving it
• A single node or a single GPU is not enough for Deep Learning workloads
• We need to focus on distributed Deep Learning, but there are many challenges
• MPI offers a great abstraction for communication in DL training tasks
• A co-design of Deep Learning frameworks and communication runtimes will be required to make DNN training scalable
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
High-Performance Deep Learning: http://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Please join us for other events at SC '17
• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning
Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details