ZIB Workshop "Hot Topics in Machine Learning"
Machine Learning on Accelerated Platforms
Thomas Steinke, February 27, 2017
Page 1:

ZIB Workshop "Hot Topics in Machine Learning"

Machine Learning on Accelerated Platforms

Thomas Steinke

February 27, 2017

Page 2:

CPUs & CUDA GPUs and Google's TPU


Source: http://www.theverge.com/circuitbreaker/2016/5/19/11716818/google-alphago-hardware-asic-chip-tensor-processor-unit-machine-learning

Page 3:

Outline

§ Machine Learning on Accelerators in Research Environments
  v Hardware
  v Software
  v Relative performance data

§ Outlook
  v Programming Heterogeneous Platforms
  v Future Devices


Page 4:

Machine Learning on Accelerators


Page 5:

Deep Learning: Face Recognition

Training data:
§ 10-100M images

Network architecture:
§ 10+ layers
§ 1B parameters

Learning algorithm:
§ ~30 exaflops
§ ~30 GPU days

Source: Nvidia
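For orientation, and assuming these rounded figures can simply be divided (an assumption, not stated on the slide), the implied sustained training rate is on the order of

```latex
\frac{\sim 3\times 10^{19}\ \text{Flop}}{30\ \text{days}\times 86\,400\ \text{s/day}}
  \approx 1.2\times 10^{13}\ \text{Flop/s} \approx 12\ \text{TFlop/s sustained per GPU},
```

which is roughly the 32-bit peak of a 2017-era accelerator such as the P40 listed later in the talk; hence multi-GPU nodes and multi-day training runs.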

Page 6:

Key Requirements in Research Environments

§ Ease of use
§ Low barriers for programmability
§ Broad spectrum of optimized libraries and tools
§ Affordable hardware and software
§ Sufficient speed
§ ???


Page 7:

What Needs to be Accelerated?

Building blocks in learning algorithms:

§ Compute:
  v Convolution
  v Matrix-matrix multiply in Deep Neural Networks (DNN)
    • Do not require IEEE floating-point ops
    • Weights may need only 3-8 bits (see the sketch after the references)

§ Access to data:
  v High memory bandwidth (low computational intensity?)

§ Scalability:
  v Inter-node communication


S. Gupta et al., Deep Learning with Limited Numerical Precision, arXiv, 2015
J. Park et al., FPGA Based Implementation of Deep Neural Networks Using On-chip Memory only, arXiv, 2016
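To make the low-precision point concrete, here is a minimal sketch (not from the slides) of symmetric uniform weight quantization: weights become a handful of signed integer codes plus one scale factor, the kind of representation that lets inference hardware drop full IEEE floating point. The 4-bit choice and the code layout are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int bits = 4;                      // illustrative; the slide range is 3-8 bits
    const int qmax = (1 << (bits - 1)) - 1;  // largest signed code, e.g. 7 for 4 bits

    // Stand-in for trained weights: zero-mean Gaussian values.
    std::mt19937 gen(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> w(1 << 16);
    for (auto& v : w) v = dist(gen);

    // One scale factor maps the largest |weight| onto qmax.
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = max_abs / qmax;

    // Quantize, dequantize, and measure the error introduced.
    double err = 0.0;
    for (float v : w) {
        const int q = std::clamp(static_cast<int>(std::lround(v / scale)), -qmax, qmax);
        err += std::fabs(v - q * scale);
    }
    std::cout << "mean abs quantization error: " << err / w.size() << "\n";
}
```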

Page 8:

Basic Compute Devices

§ Multi- and Many-core CPU
§ Graphics Processing Unit (GPU)
§ Field Programmable Gate Array (FPGA)

§ Application-Specific Integrated Circuit (ASIC)


Page 9:

Matching Requirements of DNN by Hardware

| Deep Neural Network | CPU    | Many-core CPU | GPU    | FPGA   | ASIC |
|---------------------|--------|---------------|--------|--------|------|
| Parallelism         | o      | +             | +      | +      | +    |
| Matrix operations   | o      | +             | +      | o/+    | ++   |
| FLOp/s              | o      | +             | +      | o      | ++   |
| Memory bandwidth    | -      | +             | +      | +      | ++   |
| Low precision ops   | -      | -/+           | +      | ++     | ++   |
| Energy efficiency   | o      | +             | +      | ++     | +++  |
| Ease of programming | +      | +             | +      | -- (+) | ?    |
| Price               | $...$$ | $$            | $...$$ | $$     | $$$$ |


Page 10:

A Look at Compute Performance

§ Single-core CPU: 1
§ Multi-core CPU: 10
§ Many-core CPU: 100
§ GPU: 100
§ FPGA: 1'000
§ ASIC: 10'000 … 1'000'000

Source: Philip Jama, The Future of Machine Learning Hardware, https://hackernoon.com/the-future-of-machine-learning-hardware-c872a0448be8#.x8t8vwae5

Data for Mining Cryptocurrencies

§ Computational capacity (throughput)
§ Energy efficiency (computations per Joule)
§ Cost efficiency (throughput per €)

Page 11:

Processors for NN Training and Inference (I)

§ Nvidia P40 (Pascal, 2017):
  v Performance: 47 INT8 TOp/s, 12 TFlop/s 32-bit
  v Configuration: 24 GB global mem, 346 GB/s bandwidth, 250 W

§ Nvidia M4 (Maxwell, 04/2016):
  v Performance: 2.2 TFlop/s 32-bit
  v Configuration: 4 GB global mem, 88 GB/s bandwidth, 75 W


Page 12:


NVIDIA DGX-1
AI supercomputer-appliance-in-a-box

• 8x Tesla P100 (170 TFlop/s 16bit perf)

• NVLink Hybrid Cube Mesh, 160 GB/s bi-directional per peer

• Dual Xeon, 512 GB Memory

• 7 TB SSD Deep Learning Cache

• Dual 10GbE, Quad IB 100Gb

• 3RU – 3200W

• Optimized Deep Learning Software across the entire stack

Source: Nvidia

Page 13:

Processors for NN Training and Inference (II)

§ Intel Knights Mill: 3rd gen Intel Xeon Phi, 2017
  v many cores
  v FP16 arithmetic, "enhanced variable precision"

§ Intel Xeon Skylake (AVX-512)

§ Intel Xeon Phi (Knights Landing)
  v Performance: 3.5 TFlop/s 64-bit, 6+ TFlop/s 32-bit
  v Configuration: 64-72 cores, 16 GB MCDRAM, 400+ GB/s memory bandwidth


Page 14:

Intel Nervana Platform

Source: Intel, 2017

Page 15:

FPGA in Image Recognition

§ FPGA vs. GPU: ~5x slower, 10x better energy efficiency
§ any precision possible: 32-bit float … 8-bit fixed

Source: Intel, 2017

Winograd minimal filtering algorithm: reduces compute complexity by up to 4x w.r.t. direct convolution (see the sketch below)
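As a toy illustration of the minimal-filtering idea (a sketch, not Intel's implementation), the 1-D Winograd transform F(2,3) below computes two outputs of a 3-tap filter with 4 multiplications instead of 6; the 2-D F(2x2,3x3) variant used for convolution layers is where the larger savings quoted above come from.

```cpp
#include <array>
#include <iostream>

// Two outputs of a 3-tap filter over a 4-element tile, computed directly: 6 multiplications.
std::array<double, 2> direct(const std::array<double, 4>& d, const std::array<double, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Same two outputs via Winograd F(2,3): only 4 multiplications (m1..m4).
std::array<double, 2> winograd_f23(const std::array<double, 4>& d, const std::array<double, 3>& g) {
    const double m1 = (d[0] - d[2]) * g[0];
    const double m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2;
    const double m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2;
    const double m4 = (d[1] - d[3]) * g[2];
    return {m1 + m2 + m3, m2 - m3 - m4};
}

int main() {
    const std::array<double, 4> d = {1, 2, 3, 4};     // input tile
    const std::array<double, 3> g = {0.5, -1, 0.25};  // filter weights
    const auto a = direct(d, g);
    const auto b = winograd_f23(d, g);
    std::cout << a[0] << " " << a[1] << "  ==  " << b[0] << " " << b[1] << "\n";  // identical
}
```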

Page 16:

Using FPGAs for NN Learning

§ Fixed point: 16-bit, fractional part from 8 to 14 bits

§ Multiply and accumulate (MACC) for inner products

§ Stochastic rounding (sketched below)


S. Gupta et al., Deep Learning with Limited Numerical Precision, arXiv, 2015
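A hedged sketch of the rounding scheme described by Gupta et al. (the function name and the coarse 4-bit demo grid are illustrative assumptions, not taken from the paper or an FPGA design): values are snapped to a fixed-point grid, rounding up with probability equal to the remainder, so the quantization is unbiased on average, which is what keeps low-precision training from drifting.

```cpp
#include <cmath>
#include <iostream>
#include <random>

// Round x to a fixed-point grid with `frac_bits` fractional bits using stochastic rounding:
// round up with probability equal to the distance to the lower grid point.
double to_fixed_stochastic(double x, int frac_bits, std::mt19937& gen) {
    const double step = std::ldexp(1.0, -frac_bits);  // grid spacing 2^-frac_bits
    const double scaled = x / step;
    const double lower = std::floor(scaled);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const bool round_up = u(gen) < (scaled - lower);
    return (lower + (round_up ? 1.0 : 0.0)) * step;
}

int main() {
    std::mt19937 gen(1);
    const int frac_bits = 4;     // deliberately coarse; the slide quotes 8-14 bits
    const double x = 1.0 / 3.0;  // not representable on the 2^-4 grid
    const int n = 100000;
    double mean_sr = 0.0;
    for (int i = 0; i < n; ++i) mean_sr += to_fixed_stochastic(x, frac_bits, gen) / n;
    const double nearest = std::round(x * 16.0) / 16.0;
    // Stochastic rounding averages back to ~0.3333; round-to-nearest sticks at 0.3125.
    std::cout << "stochastic mean: " << mean_sr << "  round-to-nearest: " << nearest << "\n";
}
```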

Page 17:

Software Stacks


Page 18:

Vendor Support of ML Software Frameworks


[Figure: logos of ML frameworks supported by the vendors, incl. Mocha.jl, and the NVIDIA Deep Learning SDK]

Page 19:

Nvidia Solutions


Nvidia cuDNN: Deep Learning Training Performance
[Chart: Caffe AlexNet training, speed-up of images/sec vs. a K40 in 2013 (scale 0x-80x); bars for K40, K80 + cuDNN, M40 + cuDNN4, P100 + cuDNN5.
AlexNet training throughput on CPU: 1x E5-2680v3 12 Core 2.5 GHz, 128 GB system memory, Ubuntu 14.04. M40 bar: 8x M40 GPUs in a node; P100: 8x P100 NVLink-enabled.]

Nvidia TensorRT: Up to 36x More Images/sec
[Chart: GoogLeNet inference throughput in images/second (0-7,000) over batch sizes 2, 8, 128; CPU-only vs. Tesla P40 + TensorRT (FP32) vs. Tesla P40 + TensorRT (INT8).
CPU: 1 socket E5-2690 v4 @ 2.6 GHz, HT on. GPU: 2 socket E5-2698 v3 @ 2.3 GHz, HT off, 1 P40 card in the box.]

Page 20:

Intel MKL & Python


using Intel MKL

Page 21:

Intel MKL & Python


using Intel Python distro

Page 22:

Programming for Heterogeneous Platforms


Page 23:

SYCL - C++ Single-Source Heterogeneous Programming for OpenCL

§ Cross-platform abstraction layer built on the concepts + portability of OpenCL
§ Enables "single-source" style standard C++ code for heterogeneous processors
§ C++ template functions with both host and device code (see the sketch below)
§ Open-source C++17 Parallel STL for SYCL and OpenCL 2.2 features
§ Target devices: CPU, GPU, FPGA (any OpenCL-supported device)

§ Compiler:
  v ComputeCpp (Codeplay), free community + commercial version
  v triSYCL (Xilinx), open source (github.com/Xilinx/triSYCL)
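For flavor, a minimal single-source vector addition in the SYCL 1.2 style that ComputeCpp and triSYCL target (a sketch under that assumption; the kernel tag `vadd` and the vector size are arbitrary). Host and device code live in one C++ file, and the runtime copies buffer contents back to the host when the buffers go out of scope.

```cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    {
        cl::sycl::queue q;  // default device selection (CPU, GPU, ... whatever OpenCL offers)
        cl::sycl::buffer<float, 1> A(a.data(), cl::sycl::range<1>(n));
        cl::sycl::buffer<float, 1> B(b.data(), cl::sycl::range<1>(n));
        cl::sycl::buffer<float, 1> C(c.data(), cl::sycl::range<1>(n));
        q.submit([&](cl::sycl::handler& cgh) {
            auto ra = A.get_access<cl::sycl::access::mode::read>(cgh);
            auto rb = B.get_access<cl::sycl::access::mode::read>(cgh);
            auto wc = C.get_access<cl::sycl::access::mode::write>(cgh);
            // The lambda below is the device kernel; the same source also compiles for the host.
            cgh.parallel_for<class vadd>(cl::sycl::range<1>(n),
                                         [=](cl::sycl::id<1> i) { wc[i] = ra[i] + rb[i]; });
        });
    }  // buffer destruction synchronizes and writes results back into c
    std::cout << c[0] << "\n";  // 3
}
```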


Page 24:

SYCL / OpenCL Stack


Source: Khronos Group, 2015

Page 25:

Code Generation Process with SYCL

Source: Khronos Group, 2015

Page 26:

SYCL in Practice?

§ TensorFlow for OpenCL using SYCL (Codeplay)
§ Parallel STL (C++ extensions for parallelism) → part of a future C++ standard

§ BMBF Project HighPerMeshes (04/2017-03/2020)
  v ZIB + Paderborn + Erlangen + Fraunhofer ITWM
    (M. Weiser, Th. Steinke)


Page 27:

(Near) Future Devices for Deep Learning


Page 28:

Intel Lake Crest

§ 1st gen Nervana Engine (2017)
§ Hardware for DL training
§ 32 GB HBM2
§ 1 TByte/s memory access speed
§ No caches
§ Flexpoint arithmetic (between fixed and floating point; see the sketch below)

§ 12 bi-directional links for inter-chip communication (20x of PCIe)


Source: http://www.eetimes.com/document.asp?doc_id=1330854

Intel Knights Crest (ca. 2020)
• 2nd gen of Nervana Engine with Xeon cores
• Bootable = stand-alone CPU
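Flexpoint itself is not spelled out on the slide; as a rough, hedged illustration of the "between fixed and floating point" idea, the sketch below uses a generic block-floating-point format in which a whole tensor shares a single exponent and stores only small integer mantissas (the struct and bit widths are assumptions, not Intel's format).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// A tensor as integer mantissas plus one shared exponent: value_i = mant_i * 2^exponent.
struct BlockFloat {
    std::vector<int16_t> mant;
    int exponent;
};

BlockFloat encode(const std::vector<float>& x, int mant_bits = 16) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    // Choose the shared exponent so the largest magnitude (roughly) fills the mantissa range.
    const int exp = static_cast<int>(std::ceil(std::log2(max_abs))) - (mant_bits - 1);
    BlockFloat b{{}, exp};
    for (float v : x) {
        // The clamp guards the corner case where rounding would overflow the 16-bit mantissa.
        const long m = std::clamp(std::lround(std::ldexp(v, -exp)), -32767L, 32767L);
        b.mant.push_back(static_cast<int16_t>(m));
    }
    return b;
}

float decode(int16_t m, int exponent) { return std::ldexp(static_cast<float>(m), exponent); }

int main() {
    const std::vector<float> x = {0.75f, -2.5f, 0.001f, 3.14159f};
    const BlockFloat b = encode(x);
    for (size_t i = 0; i < x.size(); ++i)
        std::cout << x[i] << " -> " << decode(b.mant[i], b.exponent) << "\n";
}
```

As in any block format, values much smaller than the block maximum (0.001 above) lose precision, which is part of the trade-off relative to full floating point.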

Page 29:

IBM Resistive Cross-Point Devices (RPU)


Source: Tayfun Gokmen, Yurii Vlasov (IBM), Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices

Page 30:

More Information

§ Intel:
  v www.intel.com/ai
    • Intel AI Day 2017: http://www.inteldevconference.com/ai-day-munich-slides/
    • Intel AI Technical Workshop 2017: http://www.inteldevconference.com/intel-ai-workshop-slides

§ Nvidia:
  v https://developer.nvidia.com/deep-learning-software


Page 31:

Acknowledgements

Axel Köhler, Nvidia

Thorsten Schmidt, Intel

???