Przemek Tredak, Simon Layton
S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING
Transcript
Page 1:

Przemek Tredak, Simon Layton

S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING

Page 2:

THE PROBLEM

Page 3:

CPU BOTTLENECK OF DL TRAINING

• Multi-GPU, dense systems are more common (DGX-1V, DGX-2)

• Using more cores / sockets is very expensive

• CPU to GPU ratio becomes lower:

• DGX-1V: 40 CPU cores / 8 GPUs = 5 cores per GPU

• DGX-2: 48 CPU cores / 16 GPUs = 3 cores per GPU

[Chart: CPU : GPU ratio]

Page 4:

CPU BOTTLENECK OF DL TRAINING
Complexity of I/O pipeline

AlexNet: 256x256 image -> 224x224 crop and mirror -> Training

ResNet 50: 480p image -> Random resize -> Color augment -> 224x224 crop and mirror -> Training
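To give a feel for the work involved, below is a rough CPU-only sketch of the ResNet-50-style augmentation above, with PIL and numpy standing in for the framework operators (the exact parameters are illustrative, not taken from the talk). Every step runs on the CPU for every sample, which is why this pipeline is far more expensive than the AlexNet one.

import random

import numpy as np
from PIL import Image

def augment_resnet50(path, out_size=224):
    img = Image.open(path).convert("RGB")      # decode on the CPU
    # random resize: scale the shorter side to somewhere in [256, 480]
    target = random.randint(256, 480)
    scale = target / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    arr = np.asarray(img, dtype=np.float32)
    # color augment (illustrative): random brightness jitter
    arr = arr * random.uniform(0.8, 1.2)
    # random 224x224 crop
    y = random.randint(0, arr.shape[0] - out_size)
    x = random.randint(0, arr.shape[1] - out_size)
    arr = arr[y:y + out_size, x:x + out_size]
    # random horizontal mirror
    if random.random() < 0.5:
        arr = arr[:, ::-1]
    return arr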

Page 5:

CPU BOTTLENECK OF DL TRAINING

Increased complexity of CPU-based I/O pipeline

Higher GPU to CPU ratio

[Chart: throughput over time for CPU and GPU]

Page 6:

LOTS OF FRAMEWORKS

Frameworks have their own I/O pipelines (often more than 1!)

Lots of duplicated effort to optimize them all

Training process is not portable even if the model is (e.g. via ONNX)

Lots of effort:

• Caffe2: ImageInputOp, Python
• MXNet: ImageRecordIter, Python, ImageIO
• TensorFlow: Dataset, Python, manual graph construction

Page 7:

LOTS OF FRAMEWORKS

Optimized I/O pipelines are not flexible and often unsuitable for research

Lots of effort

train = mx.io.ImageRecordIter(
    path_imgrec = args.data_train,
    path_imgidx = args.data_train_idx,
    label_width = 1,
    mean_r = rgb_mean[0],
    mean_g = rgb_mean[1],
    mean_b = rgb_mean[2],
    data_name = 'data',
    label_name = 'softmax_label',
    data_shape = image_shape,
    batch_size = 128,
    rand_crop = True,
    max_random_scale = 1,
    pad = 0,
    fill_value = 127,
    min_random_scale = 0.533,
    max_aspect_ratio = args.max_random_aspect_ratio,
    random_h = args.max_random_h,
    random_s = args.max_random_s,
    random_l = args.max_random_l,
    max_rotate_angle = args.max_random_rotate_angle,
    max_shear_ratio = args.max_random_shear_ratio,
    rand_mirror = args.random_mirror,
    preprocess_threads = args.data_nthreads,
    shuffle = True,
    num_parts = 0,
    part_index = 1)

vs

image, _ = mx.image.random_size_crop(image,
    (data_shape, data_shape), 0.08, (3/4., 4/3.))
image = mx.nd.image.random_flip_left_right(image)
image = mx.nd.image.to_tensor(image)
image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225))
return mx.nd.cast(image, dtype), label

Inflexible but fast vs. flexible but slow

Page 8:

SOLUTION: ONE LIBRARY

• Centralize the effort

• Integrate into all frameworks

• Provide both flexibility and performance

[Diagram: DALI as a single library feeding MXNet, Caffe2, PyTorch, TF, etc.]

Page 9:

DALI: OVERVIEW

Page 10:

DALI

• Flexible, high-performance image data pipeline

• Python / C++ frontends with C++ / CUDA backend

• Minimal (or no) changes to the frameworks required

• Full pipeline - from disk to GPU, ready to train

• OSS (soon)

[Diagram: Framework connected to DALI through a plugin]

Page 11:

GRAPH WITHIN A GRAPH

Data pipeline is just a (simple) graph

I/O in Frameworks today

[Diagram: JPEG -> Loader -> Decode -> Resize -> Augment -> Training, producing Images and Labels; the I/O stages run on the CPU, training runs on the GPU]

Page 12:

GPU OPTIMIZED PRIMITIVES

High performance, GPU optimized implementations

DALI

[Diagram: the same pipeline (JPEG -> Loader -> Decode -> Resize -> Augment -> Training, with Images and Labels) built from DALI's GPU-optimized operators]

Page 13:

GPU ACCELERATED JPEG DECODE

Hybrid approach to JPEG decoding – can move fully to GPU in the future

DALI with nvJPEG

[Diagram: the same pipeline, with JPEG decode split between the CPU (Huffman decoding) and the GPU (the rest of the decode via nvJPEG)]

Page 14:

SET YOUR DATA FREE

Use any file format in any framework

[Diagram: DALI reading from any of the following storage formats]

• LMDB (Caffe, Caffe2)
• RecordIO (MXNet)
• TFRecord (TensorFlow)
• List of JPEGs (PyTorch, others)
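As a rough sketch of what this means for pipeline code, switching storage format would only mean switching the loader operator; decode, resize, and crop stay identical. Caffe2Reader is the reader shown later in these slides; the other reader names and their keyword arguments are assumed analogues, not confirmed API.

import dali.ops as ops

def make_loader(data_format, path, shard_id, num_shards):
    # Hypothetical mapping from storage format to reader op; only Caffe2Reader
    # appears in these slides, the other names are assumptions.
    readers = {
        "lmdb": ops.Caffe2Reader,        # LMDB (Caffe, Caffe2)
        "recordio": ops.RecordIOReader,  # assumed name: RecordIO (MXNet)
        "tfrecord": ops.TFRecordReader,  # assumed name: TFRecord (TensorFlow)
        "jpeg_list": ops.FileReader,     # assumed name: list of JPEGs
    }
    return readers[data_format](path=path, shard_id=shard_id,
                                num_shards=num_shards)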

Page 15:

BEHIND THE SCENES: PIPELINE

Page 16:

PIPELINE
Overview

[Diagram: one DALI pipeline per GPU feeding the Framework]

One pipeline per GPU. The same logic for multithreaded and multiprocess frameworks.
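A minimal sketch of that rule, using the HybridRN50Pipe class defined later in these slides (the keyword arguments mirror its constructor signature; build() is the same call used in the MXNet example):

# One DALI pipeline instance per GPU; each one reads its own shard of the data.
num_gpus = 8

pipes = [HybridRN50Pipe(batch_size=128, num_threads=2,
                        device_id=gpu, num_devices=num_gpus)
         for gpu in range(num_gpus)]

for pipe in pipes:
    pipe.build()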

Page 17:

PIPELINE
Overview

[Diagram: CPU, Mixed, and GPU stages feeding the Framework]

Single direction, 3 stages: CPU -> Mixed -> GPU

Page 18:

PIPELINE
Overview

[Diagram: operators 1-9 of an example pipeline graph assigned to the CPU, Mixed, and GPU stages]

Simple scheduling of operations
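A minimal sketch of such a scheduler (not DALI's implementation): operators are issued in topological order, grouped by stage, which is valid because data only flows CPU -> Mixed -> GPU.

from collections import deque

def schedule(stage_of, deps):
    # stage_of: {op_name: "cpu" | "mixed" | "gpu"}
    # deps:     {op_name: [names of ops it consumes from]}
    remaining = {op: set(deps.get(op, [])) for op in stage_of}
    ready = deque(op for op, d in remaining.items() if not d)
    topo = []
    while ready:
        op = ready.popleft()
        topo.append(op)
        for other, d in remaining.items():
            if op in d:
                d.discard(op)
                if not d and other not in topo and other not in ready:
                    ready.append(other)
    # Issue all CPU ops, then Mixed, then GPU, keeping topological order
    # within each stage.
    return [op for stage in ("cpu", "mixed", "gpu")
            for op in topo if stage_of[op] == stage]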

Page 19:

PIPELINE
CPU

[Diagram: samples 1-5 of a batch flowing through the CPU-stage operators, one worker thread per sample]

Operations processed per-sample in a thread pool
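A minimal Python sketch of that execution model (an analogy, not DALI's implementation): the CPU stage applies its chain of operators to each sample independently, with worker threads drawn from a pool.

from concurrent.futures import ThreadPoolExecutor

def process_sample(sample, cpu_ops):
    # Apply the CPU-stage operators to one sample, in graph order.
    for op in cpu_ops:
        sample = op(sample)
    return sample

def run_cpu_stage(batch, cpu_ops, num_threads):
    # Every sample of the batch is processed independently by the thread pool;
    # the ordered results are then handed to the Mixed stage for batching.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda s: process_sample(s, cpu_ops), batch))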

Page 20:

PIPELINE
GPU

[Diagram: GPU-stage operators (8, 9) processing whole batches at a time]

Batched processing of data

Page 21:

PIPELINE
Mixed

[Diagram: the Mixed stage gathering per-sample CPU outputs into a batched buffer for the GPU]

A bridge between CPU and GPU: per-sample input, batched output

Also used for batching CPU data (for CPU outputs of the pipeline)
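A plain numpy sketch of what "per-sample input, batched output" means (not DALI code): the Mixed stage packs the individually processed samples into one contiguous batch, which can then be copied to the GPU in a single transfer.

import numpy as np

def mixed_stage(samples):
    # samples: list of per-sample HWC arrays from the CPU stage, all already
    # cropped to the same shape (e.g. 224x224x3). Stacking them produces one
    # contiguous NHWC buffer, ready for a single host-to-device copy.
    return np.ascontiguousarray(np.stack(samples, axis=0))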

Page 22:

EXECUTOR
Pipelining the pipeline

CPU, Mixed and GPU stages need to be executed serially

But each batch of data is independent…

[Timeline: CPU, Mixed, and GPU stages of successive batches laid out over time]

Page 23:

EXECUTOR
Pipelining the pipeline

Each stage is asynchronous

Stages of a given batch synchronized via events

[Timeline: CPU, Mixed, and GPU stages of batches 1, 2, 3, ... overlapped in time, with each batch's stages still executed in order]
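A minimal Python sketch of that overlap using threads and queues (DALI does this with CUDA streams and events; this is only an analogy): each stage runs asynchronously in its own worker, while the queues keep the stages of any single batch in order.

import queue
import threading

def stage_worker(fn, inbox, outbox):
    # Run one stage asynchronously: consume batches in order, process, pass on.
    while True:
        batch = inbox.get()
        if batch is None:          # shutdown signal
            outbox.put(None)
            break
        outbox.put(fn(batch))

def run_pipelined(batches, cpu_fn, mixed_fn, gpu_fn):
    q_in, q_mixed, q_gpu = (queue.Queue(maxsize=2) for _ in range(3))
    q_out = queue.Queue()          # unbounded so the feeder never deadlocks
    workers = [threading.Thread(target=stage_worker, args=args)
               for args in ((cpu_fn, q_in, q_mixed),
                            (mixed_fn, q_mixed, q_gpu),
                            (gpu_fn, q_gpu, q_out))]
    for w in workers:
        w.start()
    for batch in batches:          # stages of different batches overlap in time
        q_in.put(batch)
    q_in.put(None)
    results = []
    while True:
        out = q_out.get()
        if out is None:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results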

Page 24:

OPERATORS
Gallery

Page 25:

USING DALI

Page 26:

EXAMPLE: RESNET-50 PIPELINE
Pipeline class

import dali
import dali.ops as ops

class HybridRN50Pipe(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size,
                                             num_threads, device_id)
        # define used operators

    def define_graph(self):
        # define graph of operations

Page 27:

EXAMPLE: RESNET-50 PIPELINE
Defining operators

def __init__(self, batch_size, num_threads, device_id, num_devices):
    super(HybridRN50Pipe, self).__init__(batch_size, num_threads,
                                         device_id)
    self.loader = ops.Caffe2Reader(path=lmdb_path, shard_id=device_id,
                                   num_shards=num_devices)
    self.decode = ops.HybridDecode(output_type=dali.types.RGB)
    self.resize = ops.Resize(device="gpu", resize_a=256,
                             resize_b=480, random_resize=True,
                             image_type=dali.types.RGB)
    self.crop = ops.CropMirrorNormalize(device="gpu",
                                        random_crop=True, crop=(224, 224),
                                        mirror_prob=0.5, mean=[128., 128., 128.],
                                        std=[1., 1., 1.],
                                        output_layout=dali.types.NCHW)

Page 28:

EXAMPLE: RESNET-50 PIPELINE
Defining graph

def define_graph(self):
    jpeg, labels = self.loader(name="Reader")
    images = self.decode(jpeg)
    resized_images = self.resize(images)
    cropped_images = self.crop(resized_images)
    return [cropped_images, labels]

[Graph: Loader emits jpeg and labels; jpeg -> Decode -> Resize -> Crop -> Data output, labels -> MakeContiguous -> Label output]

Page 29:

EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet

import mxnet as mx
from dali.plugin.mxnet import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
pipe.build()
train = DALIIterator(pipe, pipe.epoch_size("Reader"))

model.fit(train,
          # other parameters
          )

Page 30:

EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow

import tensorflow as tf
from dali.plugin.tf import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
train = DALIIterator()

with tf.Session() as sess:
    images, labels = train(serialized_pipe)
    # rest of the model using images and labels
    sess.run(...)

Page 31:

EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe2

from caffe2.python import brew

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()

data, label = brew.dali_input(model, ["data", "label"],
                              serialized_pipe=serialized_pipe)

# Add the rest of your network as normal
conv1 = brew.conv(model, data, "conv1", ...)

Page 32:

PERFORMANCE

Page 33:

PERFORMANCE
I/O Pipeline

[Chart: Throughput (images/second), DGX-2, RN50 pipeline, batch 128, NCHW; bars at 5150, 5450, 8000, 14350, and 23000 images/second]

Page 34:

PERFORMANCE
End-to-end training

[Chart: End-to-end DGX-2 RN50 training throughput (images/second), MXNet, batch 192 per GPU; Native: 8000, DALI: 15500, Synthetic: 17000]

Page 35:

NEXT STEPS

Page 36:

NEXT: MORE WORKLOADS
Segmentation

def define_graph(self):
    images, masks = self.loader(name="Reader")
    images = self.decode(images)
    masks = self.decode(masks)
    # Apply identical transformations
    resized_images, resized_masks = self.resize([images, masks], ...)
    cropped_images, cropped_masks = self.crop([resized_images,
                                               resized_masks], ...)
    return [cropped_images, cropped_masks]

Page 37:

NEXT: MORE FORMATS

What would be useful to you?

PNG, video frames

Page 38:

NEXT++: MORE OFFLOADING

Fully GPU-based decode

HW-based via NVDEC

Transcode to video

Page 39:

SOON: EARLY ACCESS

Looking for:

General feedback

New workloads

New transformations

Contact: Milind Kukanur

[email protected]

Page 40:

ACKNOWLEDGEMENTS

Trevor Gale

Andrei Ivanov

Serge Panev

Cliff Woolley

DL Frameworks team @ NVIDIA

Page 41: