Page 1: Distributed Deep Learning

Distributed Deep Learning
Mathew Salvaris

Page 2: Distributed Deep Learning

What will be covered

• Overview of Distributed Training

• What affects distributed training
  • Network
  • Model
  • Data location
  • Data format

Page 3: Distributed Deep Learning

[Figure: Deep Learning Model (CNN). The RGB channels of an input image pass through convolution layers with kernels, a pooling layer and a fully connected (penultimate) layer to produce class predictions such as Cat, Dog and Mouse.]
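As a point of reference only (this code is not from the talk and the layer sizes are arbitrary), the pictured structure maps onto a few lines of Keras:

```python
# A minimal sketch of the pictured CNN in Keras; layer sizes are illustrative,
# only the overall structure (conv -> pool -> fully connected -> classes) follows the diagram.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),         # RGB channels of the input image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # convolution layer with kernels
    tf.keras.layers.MaxPooling2D(),                      # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),       # fully connected (penultimate) layer
    tf.keras.layers.Dense(3, activation="softmax"),      # Cat, Dog, Mouse
])
```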

Page 4: Distributed Deep Learning

Distributed training mode: Data parallelism

[Diagram: a job manager splits the dataset into subsets; Worker 1 and Worker 2 each hold a full copy of the CNN model and train on their own subset (Subset 1, Subset 2).]

Page 5: Distributed Deep Learning

Distributed training mode: Model parallelism

[Diagram: the job manager splits the CNN model itself across Worker 1 and Worker 2; both workers process the same data subset (Subset 1), each holding only part of the model.]
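To make the split concrete, here is a minimal sketch of model parallelism in PyTorch (an illustration assuming two GPUs, not code from the talk): the first part of the network lives on one GPU, the rest on another, and activations move between devices during the forward pass.

```python
# Minimal model-parallelism sketch: the model is split across two GPUs,
# while each batch stays whole and crosses the GPU boundary as activations.
import torch
import torch.nn as nn

class TwoGPUCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the model on GPU 0, second half on GPU 1
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        ).to("cuda:0")
        self.classifier = nn.Linear(64, 3).to("cuda:1")

    def forward(self, x):
        x = self.features(x.to("cuda:0"))
        return self.classifier(x.to("cuda:1"))   # activations move from GPU 0 to GPU 1

model = TwoGPUCNN()
logits = model(torch.randn(8, 3, 224, 224))      # the batch is not split; the model is
```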

Page 6: Distributed Deep Learning

Data parallelism vs model parallelism

Data parallelism

• Easier implementation

• Stronger fault tolerance

• Higher cluster utilization

Model parallelism

• Better scalability of large models

• Less memory on each GPU

Page 7: Distributed Deep Learning

Horovod: Ring All Reduce
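Horovod implements data parallelism with ring all-reduce: each worker computes gradients on its own batch, and the gradients are averaged around a ring of GPUs before the weight update. A minimal sketch with PyTorch (illustrative; the tiny model here is a stand-in, not the benchmark code):

```python
# Minimal Horovod data-parallel training sketch (PyTorch).
# Launch one process per GPU, e.g.: horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its local GPU

model = torch.nn.Linear(1000, 10).cuda()     # stand-in for ResNet50 etc.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with ring all-reduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# All workers start from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for _ in range(10):
    x = torch.randn(64, 1000).cuda()         # each worker sees its own shard of data
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

On a multi-node cluster the same script would be launched with a host list, for example horovodrun -np 32 -H node1:4,node2:4,... python train.py for 8 nodes with 4 GPUs each (host names here are placeholders).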

Page 8: Distributed Deep Learning

Effects of Network, Model and Precision

Page 9: Distributed Deep Learning

Setup

Clusters of 8 nodes using K80, P40, P100 and V100 GPUs (4 GPUs per node + InfiniBand)

Two MPI configurations: OpenMPI + NCCL and Intel MPI

Page 10: Distributed Deep Learning

Experiments

345 experiments across many different models, including ResNet50, MobileNet V2, etc.

Using synthetic data

Batch size remains 64 across all models and GPUs

Using the benchmarking scripts from TensorFlow

Page 11: Distributed Deep Learning

Distributed training with synthetic data

[Diagram: the compute pool generates synthetic data on the fly; no external storage is involved.]

Page 12: Distributed Deep Learning

Single GPU

Page 13: Distributed Deep Learning

32 GPUs

Page 14: Distributed Deep Learning

32 GPUs

Page 15: Distributed Deep Learning

MobileNet

Page 16: Distributed Deep Learning

MobileNet

Page 17: Distributed Deep Learning

[Chart: for each GPU type (K80, P40, P100, V100), the time it takes to transfer weights between GPUs (data transfer) compared with the time it takes to process a batch on the GPU (batch execution).]

Page 18: Distributed Deep Learning

ResNet50 Full Precision vs Mixed Precision [32 V100s]

Full precision [batch size 64]: 6,629 images/second, 54% scaling efficiency

Mixed precision [batch size 256]: 23,436 images/second, 82% scaling efficiency
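For reference, this is roughly what enabling mixed precision looks like with PyTorch automatic mixed precision (illustrative only; the numbers above come from the TensorFlow benchmarking scripts, not this code):

```python
# Minimal mixed-precision training sketch with PyTorch AMP.
import torch

model = torch.nn.Linear(1000, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()         # scales the loss to avoid fp16 underflow

for _ in range(10):
    x = torch.randn(256, 1000).cuda()        # the memory saved by fp16 allows a larger batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in fp16 where safe
        loss = model(x).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```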

Page 19: Distributed Deep Learning

Effects of Storage

Page 20: Distributed Deep Learning

Experiments

Using ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]

Using real and synthetic data. Real data on local, NFS and Blob storage

Batch size remains 64 across all configurations

Using V100 GPUs

Page 21: Distributed Deep Learning

Distributed training with NFS

[Diagram: data is copied to an NFS share, which is mounted as a fileshare on the nodes of the compute pool.]

Page 22: Distributed Deep Learning

Distributed training with blob storage

[Diagram: data is copied to blob storage, which is mounted on the nodes of the compute pool.]

Page 23: Distributed Deep Learning

Distributed training with local storage

[Diagram: data is copied from the mounted fileshare to the local storage of each node in the compute pool.]

Page 24: Distributed Deep Learning

[Chart: ResNet50 relative performance (0 to 1) across storage options for TensorFlow, Keras and PyTorch, comparing Synthetic, Local (SSD), NFS, Premium Blob and Blob.]

Page 25: Distributed Deep Learning

Data Loaders and Preprocessors

Keras data loader: simple, with no parameters for buffering and parallelizing

PyTorch data loader: specify the number of worker processes with num_workers (see the sketch below)
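A minimal sketch of the PyTorch side (illustrative; the dataset path and worker count are assumptions):

```python
# Minimal PyTorch DataLoader sketch with background worker processes.
import torch
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(
    "/data/train",                                   # hypothetical image folder
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,       # background processes decoding and augmenting images
    pin_memory=True,     # speeds up host-to-GPU copies
)
```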

Page 26: Distributed Deep Learning

TensorFlow

Highly configurable

Many options: buffer, shuffle, cache and shard (see the sketch below)

Daunting and easy to get wrong

https://www.tensorflow.org/guide/performance/datasets
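A minimal sketch of a tf.data pipeline using those options (illustrative; the file pattern, feature schema and shard settings are assumptions, and in a Horovod job the shard index would come from the worker rank):

```python
# Minimal tf.data input pipeline sketch: shard, interleave, shuffle, map, batch, prefetch.
import tensorflow as tf

def parse_example(record):
    # Hypothetical schema: a JPEG-encoded image and an integer label
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.resize(tf.io.decode_jpeg(features["image"], channels=3), [224, 224])
    return image, features["label"]

dataset = (
    tf.data.Dataset.list_files("/data/train-*.tfrecord")     # hypothetical file pattern
    .shard(num_shards=4, index=0)                             # one shard per worker
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)                              # shuffle within a buffer
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)                               # overlap input with training
)
```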

Page 27: Distributed Deep Learning

Effects of Data Type

Page 28: Distributed Deep Learning

TensorFlow Records

• Binary data format created for TensorFlow; the recommended format for TensorFlow

• Can aggregate many examples into a smaller number of TFRecord files, which is efficient for transferring and reading in the cloud

• Data has to be exported to the format, and the export has to be tailored to the use case (see the sketch below)
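A minimal sketch of such an export (illustrative; the paths and the grouping of examples into shards are assumptions):

```python
# Minimal TFRecord export sketch: aggregate many images into one TFRecord file.
import tensorflow as tf

def image_example(image_path, label):
    # Store the encoded JPEG bytes and an integer label as a tf.train.Example
    image_bytes = tf.io.read_file(image_path).numpy()
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

def write_shard(examples, path):
    # examples: list of (image_path, label) pairs to pack into one file
    with tf.io.TFRecordWriter(path) as writer:
        for image_path, label in examples:
            writer.write(image_example(image_path, label).SerializeToString())

# e.g. write_shard(examples_for_shard_0, "/data/train-00000.tfrecord")
```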

Page 29: Distributed Deep Learning

[Chart: ResNet50 – Data Type Performance [Average]. Average images/second on 8, 16 and 32 GPUs for synthetic data, raw images and TFRecords.]

Page 30: Distributed Deep Learning

[Chart: ResNet50 – Data Format Performance [Maximum]. Maximum images/second on 8, 16 and 32 GPUs for synthetic data, raw images and TFRecords.]

Page 31: Distributed Deep Learning

Things not discussed

Asynchronous distributed training

Tradeoff between batch size and other parameters

Optimization of TensorFlow pipeline

Other data formats such as Parquet (Petastorm)

Transform libraries [albumentations]

Distributed file systems such as BeeGFS and other storage options (GlusterFS, Lustre, etc.)

Models other than CNN

Page 32: Distributed Deep Learning

Summary

Do try to use enhanced networking wherever possible, especially for the latest GPUs

Training small models with distributed training is not recommended

Do use TFRecords or other columnar or row-based data formats

Not all data loaders are equal

Page 33: Distributed Deep Learning

Thanks & Questions?