Page 1

Deep Learning Cookbook: technology recipes to run deep learning workloads
Natalia Vassilieva, Sergey Serebryakov

Page 2

Deep learning applications

Vision:
• Search & information extraction
• Security/Video surveillance
• Self-driving cars
• Medical imaging
• Robotics

Speech:
• Interactive voice response (IVR) systems
• Voice interfaces (Mobile, Cars, Gaming, Home)
• Security (speaker identification)
• Health care
• Simultaneous interpretation

Text:
• Search and ranking
• Sentiment analysis
• Machine translation
• Question answering

Other:
• Recommendation engines
• Advertising
• Fraud detection
• AI challenges
• Drug discovery
• Sensor data analysis
• Diagnostic support

Page 3

Deep learning ecosystem

[Figure: hardware and software components of the deep learning ecosystem; Keras is among the software frameworks shown]

Page 4

How to pick the right hardware/software stack?

Does one size fit all?

Page 5

Applications break down

Task types:
– Detection: look for a known object/pattern
– Classification: assign a label from a predefined set of labels
– Generation: generate content
– Anomaly detection: look for abnormal, unknown patterns

Data types: Images, Video, Text, Sensor, Other, Speech

Example applications: Video surveillance, Speech recognition, Sentiment analysis, Predictive maintenance, Fraud detection, Tissue classification in medical images

Page 6

Types of artificial neural networks
Topology to fit data characteristics

[Figure: two example network topologies, each with an input layer, hidden layers 1-3, and an output layer]

Images: Convolutional (CNN)
Speech, time series, sequences: Fully Connected (FC), Recurrent (RNN)
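To make the topology distinction concrete, here is a minimal sketch using Keras (mentioned on Page 3); the layer counts and sizes are illustrative choices of mine, not taken from the slides.

# Minimal sketch: a convolutional topology for images vs. a recurrent one for sequences.
from tensorflow import keras
from tensorflow.keras import layers

# Convolutional network for image inputs (e.g. 224x224 RGB images).
cnn = keras.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1000, activation="softmax"),   # e.g. 1000 ImageNet classes
])

# Recurrent network for variable-length sequences (e.g. 40-dim audio features per step).
rnn = keras.Sequential([
    layers.Input(shape=(None, 40)),
    layers.LSTM(128),
    layers.Dense(64, activation="relu"),        # fully connected layer
    layers.Dense(10, activation="softmax"),
])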

Page 7

One size does NOT fit all

– Data type
– Data size
– Application

Model (topology of the artificial neural network):
– How many layers
– How many neurons per layer
– Connections between neurons (types of layers)

Page 8

Popular models

Name                Type  Model size (# params)  Model size (MB)  GFLOPs (forward pass)
AlexNet             CNN   60,965,224             233 MB           0.7
GoogleNet           CNN   6,998,552              27 MB            1.6
VGG-16              CNN   138,357,544            528 MB           15.5
VGG-19              CNN   143,667,240            548 MB           19.6
ResNet50            CNN   25,610,269             98 MB            3.9
ResNet101           CNN   44,654,608             170 MB           7.6
ResNet152           CNN   60,344,387             230 MB           11.3
Eng Acoustic Model  RNN   34,678,784             132 MB           0.035
TextCNN             CNN   151,690                0.6 MB           0.009
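As a sanity check on the "Model size (MB)" column, the sizes are roughly the parameter counts times 4 bytes (FP32 weights). A minimal sketch using three rows of the table above (the helper name is mine):

# Model size ~= number of parameters * 4 bytes (FP32), expressed in MB.
PARAMS = {
    "AlexNet": 60_965_224,
    "VGG-16": 138_357_544,
    "ResNet50": 25_610_269,
}

def size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate model size in MB (1 MB = 2**20 bytes)."""
    return num_params * bytes_per_param / 2**20

for name, n in PARAMS.items():
    print(f"{name}: ~{size_mb(n):.0f} MB")   # ~233, ~528 and ~98 MB, matching the table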

Page 9

Popular models

(Same table as Page 8.)

Page 10

Compute requirements

ResNet152  CNN  60,344,387 params  230 MB  11.3 GFLOPs (forward pass)

Training data: 14M images (ImageNet)
FLOPs per epoch: 3 * 11.3*10^9 * 14*10^6 ≈ 5*10^17
1 epoch per hour: ~140 TFLOPS

Today’s hardware:
– Google TPU2: 180 TFLOPS Tensor ops
– NVIDIA Tesla V100: 15 TFLOPS SP (30 TFLOPS FP16, 120 TFLOPS Tensor ops), 12 GB memory
– NVIDIA Tesla P100: 10.6 TFLOPS SP, 16 GB memory
– NVIDIA Tesla K40: 4.29 TFLOPS SP, 12 GB memory
– NVIDIA Tesla K80: 5.6 TFLOPS SP (8.74 TFLOPS SP with GPU boost), 24 GB memory
– INTEL Xeon Phi: 2.4 TFLOPS SP
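To make the arithmetic above reproducible, here is a minimal sketch; the numbers come from the slide, and the factor of 3 is the slide's forward-plus-backward multiplier.

# Estimated training compute for ResNet152 on ImageNet (values from the slide).
forward_flops = 11.3e9        # FLOPs per image, forward pass
train_factor = 3              # forward + backward multiplier used on the slide
images_per_epoch = 14e6       # ImageNet, ~14M images

flops_per_epoch = train_factor * forward_flops * images_per_epoch
print(f"FLOPs per epoch: {flops_per_epoch:.1e}")     # ~4.7e+17, i.e. ~5 * 10^17

seconds_per_epoch = 3600      # target: one epoch per hour
required_tflops = flops_per_epoch / seconds_per_epoch / 1e12
print(f"Sustained compute needed: ~{required_tflops:.0f} TFLOPS")   # ~132 (the slide rounds to ~140)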

Page 11

Model parallelism

– Can be achieved with scalable distributed matrix operations
– Requires a certain compute/bandwidth ratio

Let's assume:
– n: input size = batch size = output size
– γ: compute power of the device (FLOPS)
– β: bandwidth (memory or interconnect)
– p²: number of compute devices

T_compute = 2n³ / (p²γ)
T_data_read = 2n² / (pβ)

Keeping data reads from dominating compute requires, for FP32:
β ≥ 4pγ / n

[Figure: distributed multiplication of matrices A and B]

"SUMMA: Scalable Universal Matrix Multiplication Algorithm", R.A. van de Geijn, J. Watts

Page 12

Model parallelism

– Can be achieved with scalable distributed matrix operations
– Requires a certain compute/bandwidth ratio

Same assumptions and bound as Page 11:
T_compute = 2n³ / (p²γ)
T_data_read = 2n² / (pβ)
β ≥ 4pγ / n  (for FP32)

Example: n = 2000, γ = 15 TFLOPS
– p = 1: β ≥ 30 GB/s
– p = 10: β ≥ 300 GB/s
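A minimal sketch of the bandwidth bound above, reproducing the two example figures (the function name is mine):

# Minimum bandwidth so data movement does not dominate compute: beta >= 4 * p * gamma / n (FP32).
def min_bandwidth_gb_s(p: int, gamma_flops: float, n: int) -> float:
    """Lower bound on bandwidth in GB/s for the model-parallel bound above."""
    return 4 * p * gamma_flops / n / 1e9

gamma = 15e12    # 15 TFLOPS per device (slide example)
n = 2000         # input = batch = output size (slide example)

print(min_bandwidth_gb_s(p=1, gamma_flops=gamma, n=n))    # 30.0  -> beta >= 30 GB/s
print(min_bandwidth_gb_s(p=10, gamma_flops=gamma, n=n))   # 300.0 -> beta >= 300 GB/s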

Page 13

Data parallelism

T_compute(p, c, γ) = c / (pγ)
T_communicate(p, w, β) = 2w·log(p) / β

where
– p: number of workers (nodes)
– γ: the computational power of the node
– c: the computational complexity of the model
– β: bandwidth
– w: the size of the weights in bits

Page 14

Data parallelism

(Same formulas and definitions as Page 13.)

Example system: NVIDIA K40 (~4 TFLOPS), PCIe v3 (~16 GB/s)

Page 15

Data parallelism

(Same formulas and definitions as Page 13.)

Example system: NVIDIA K40 (~4 TFLOPS), Infiniband (~56 Gb/s)
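To illustrate how the data-parallel model trades compute against gradient exchange on the two configurations above, here is a minimal sketch; the model complexity c and weight size w are illustrative VGG-16-like values of mine, not from the slides, and the log base 2 is an assumption.

import math

def t_compute(p: int, c: float, gamma: float) -> float:
    """Compute time per batch: c / (p * gamma)."""
    return c / (p * gamma)

def t_communicate(p: int, w_bits: float, beta_bits: float) -> float:
    """Gradient exchange time: 2 * w * log(p) / beta (log base 2 assumed)."""
    return 2 * w_bits * math.log2(p) / beta_bits

gamma = 4e12                 # NVIDIA K40, ~4 TFLOPS
c = 3 * 15.5e9 * 64          # illustrative: ~3x VGG-16 forward FLOPs, batch of 64
w = 138_357_544 * 32         # illustrative: VGG-16 weights in bits (FP32)
p = 8

for name, beta in [("PCIe v3 (~16 GB/s)", 16e9 * 8), ("Infiniband (~56 Gb/s)", 56e9)]:
    print(f"{name}: T_compute = {t_compute(p, c, gamma) * 1e3:.0f} ms, "
          f"T_communicate = {t_communicate(p, w, beta) * 1e3:.0f} ms")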

Page 16

Deep Learning Cookbook helps to pick the right HW/SW stack

– Benchmarking suite
  – Benchmarking scripts
  – Set of benchmarks (for core operations and reference models)
– Performance measurements for a subset of applications, models and HW/SW stacks
  – 11 models
  – 8 frameworks
  – 6 hardware systems
– Analytical performance and scalability models
  – Performance prediction for arbitrary models
  – Scalability prediction
– Reference solutions, white papers

Page 17

Page 18

Selected scalability results

HPE Apollo 6500 (8 x NVIDIA P100)

[Figure: six weak-scaling plots of batch time (ms) vs. number of GPUs, one curve per batch size]
– AlexNet Weak Scaling (batch sizes 64, 128)
– DeepMNIST Weak Scaling (batch sizes 32, 64, 128)
– EngAcousticModel Weak Scaling (batch sizes 32, 64, 128)
– GoogleNet Weak Scaling (batch sizes 32, 64, 128)
– VGG16 Weak Scaling (batch sizes 16, 32, 64)
– VGG19 Weak Scaling (batch sizes 16, 32, 64)

Page 19

Selected observations and tips

– Larger models are easier to scale (such as ResNet and VGG)
– A single GPU can hold only small batches (the rest of the memory is occupied by the model)
– Fast interconnect is more important for less compute-intensive models (FC)
– A rule of thumb: 1 or 2 CPU cores per GPU
– The PCIe topology of the system is important

Page 20

Further into the future: neuromorphic research projects

Hewlett Packard Enterprise

Neuromorphic Computing – the integration of algorithms, architectures, and technologies, informed by neuroscience, to create new computational approaches.

– Memristor Dot-Product Engine (DPE) – successfully demonstrated
  – Memristor crossbar analog vector-matrix multiplication accelerator
– Hopfield Network (electronic and photonic) – in progress

I_j = Σ_i G_ij · V_i
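The equation above is an ordinary vector-matrix product, computed in analog by the crossbar (conductances G times input voltages V). A minimal numerical sketch with illustrative values:

import numpy as np

# Memristor crossbar as analog vector-matrix multiply: I_j = sum_i G_ij * V_i.
G = np.array([[1.0e-6, 2.0e-6],
              [3.0e-6, 0.5e-6],
              [2.5e-6, 1.5e-6]])    # conductances, 3 input rows x 2 output columns (siemens)
V = np.array([0.2, 0.1, 0.3])       # input voltages applied to the rows (volts)

I = V @ G                           # output currents collected on the columns (amperes)
print(I)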

Page 21

Thank you
Natalia Vassilieva
[email protected]