Page 1:

ImageNet is the new MNIST

Chris Ying
Research SWE @ Google Brain
g.co/brain

on behalf of many people across Google

Page 2:

Goal: “Interactive ML supercomputing”

● Hardware

○ Cloud TPUs

○ TPU pods

● Software

○ TensorFlow Datasets, Layers, and Estimator APIs (open-source)

○ XLA compiler (open-source) with TPU backend

● Research

○ Understanding of generalization gap

○ Large-batch training advances

Page 3:

Motivation
(classical workflow)

Page 4:

Motivation
(classical workflow)
x 10

Page 5:

Motivation
(what's happening now)
x 10

Page 6:

Motivation
(our vision of the future)
x 1000

Page 7:

ImageNet is the new MNIST

MNIST: 60,000 B&W images
ImageNet: 1,281,167 color images

Page 8:

Motivating results

# of TPU devices | Batch size | Time to 90 epochs   | Accuracy
1                | 256        | 23 hours 22 minutes | 76.6%
4                | 1024       | 5 hours 48 minutes  | 76.3%
16               | 4096       | 1 hour 30 minutes   | 76.5%
32               | 8192       | 45 minutes          | 76.1%
64               | 16384      | 22 minutes          | 75.0%

The only change between runs is the batch size (with linearly scaled learning rate) and the hardware: no model changes, no hyperparameter re-tuning!

ResNet-50-v2 on ImageNet
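(As an aside, the linear learning-rate scaling mentioned above can be sketched in a few lines of Python. This is an illustration, not the talk's training code; the base LR of 0.1 at batch size 256 is the common ResNet-50 convention from Goyal et al., assumed here.)

def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    # Linear scaling rule: grow the LR proportionally with the batch size.
    # base_lr=0.1 at batch 256 is an assumed convention, not from the talk.
    return base_lr * batch_size / base_batch_size

for bs in [256, 1024, 4096, 8192, 16384]:
    print(bs, scaled_learning_rate(bs))  # 0.1, 0.4, 1.6, 3.2, 6.4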

Page 9:

Cloud TPU

180 TFLOPS of computation, 64 GB of HBM memory, 2400 GB/s mem BW
(four TPUv2 chips per board: 4 × 45 TFLOPS, 4 × 16 GB HBM, 4 × 600 GB/s)

Page 10:

Cloud TPU

Page 11:

TPUv2 Chip
(diagram: two cores; each with 8 GB of HBM, a scalar unit, a vector unit, and a 128x128 MXU)

● 45 TFLOPS
● 16 GB of HBM
● 600 GB/s mem BW
● Vector unit: float32
● Scalar unit: float32
● Matrix unit (MXU): float32 input/output, reduced precision multiplication

Page 12:

TPUv2 Chip
(diagram: two cores; each with 8 GB of HBM, a scalar unit, a vector unit, and a 128x128 MXU)

● 45 TFLOPS
● 16 GB of HBM
● 600 GB/s mem BW
● Scalar unit: 32b float
● MXU: 32b float accumulation but reduced precision for multipliers

Matrix Unit: 128x128 systolic array, float32 results*
* reduced precision multiplication
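(A rough numpy sketch of what "float32 accumulation with reduced-precision multipliers" means. Truncating a float32 to its top 16 bits mimics bfloat16; the talk does not name the exact multiplier format, so treat that as an assumption.)

import numpy as np

def truncate_multiplier_inputs(x):
    # Keep only the top 16 bits of each float32 (bfloat16-style truncation).
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def mxu_like_matmul(W, X):
    # Multiplier inputs are reduced precision; products accumulate in float32.
    return truncate_multiplier_inputs(W) @ truncate_multiplier_inputs(X)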

Page 13:

Matrix Unit (MXU): Matrix Unit Systolic Array

Computing y = Wx. Toy example: 3x3 systolic array, W = 3x3 matrix, batch_size(x) = 3.
(diagram: the weights W11..W33 are resident in the array; the input elements X11..X33 are staged to stream in, one step per cycle; partial sums accumulate as they move through the array)

Page 14:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 1: X11 enters the array and W11·X11 is computed)

Page 15:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 2: the inputs advance one cell; W11·X21 and W21·X11 are computed while W12·X12 + W11·X11 accumulates)

Page 16:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 3: W11·X31, W21·X21, and W31·X11 are computed; W12·X22 + W11·X21, W22·X12 + W21·X11, and W13·X13 + ... accumulate)

Page 17:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 4: the first result leaves the array)

Outputs so far:
Y11 = W11X11 + W12X12 + W13X13

Page 18:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 5)

Outputs so far:
Y11 = W11X11 + W12X12 + W13X13
Y21 = W11X21 + W12X22 + W13X23
Y12 = W21X11 + W22X12 + W23X13

Page 19:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 6)

Outputs so far:
Y11 = W11X11 + W12X12 + W13X13
Y21 = W11X21 + W12X22 + W13X23
Y12 = W21X11 + W22X12 + W23X13
Y31 = W11X31 + W12X32 + W13X33
Y22 = W21X21 + W22X22 + W23X23
Y13 = W31X11 + W32X12 + W33X13

Page 20:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, step 7: only W33·X33 + ... remains in flight)

New outputs this step:
Y32 = W21X31 + W22X32 + W23X33
Y23 = W31X21 + W32X22 + W33X23

Page 21:

Matrix Unit Systolic Array
Computing y = Wx with W = 3x3, batch_size(x) = 3
(diagram, final step)

All nine outputs are now available:
Y11 = W11X11 + W12X12 + W13X13    Y12 = W21X11 + W22X12 + W23X13    Y13 = W31X11 + W32X12 + W33X13
Y21 = W11X21 + W12X22 + W13X23    Y22 = W21X21 + W22X22 + W23X23    Y23 = W31X21 + W32X22 + W33X23
Y31 = W11X31 + W12X32 + W13X33    Y32 = W21X31 + W22X32 + W23X33    Y33 = W31X31 + W32X32 + W33X33
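(To make the walkthrough concrete, here is a small numpy emulation of the wavefront order in a weight-stationary systolic array. It is a sketch under the slides' toy assumptions — weights resident in the cells, inputs skewed by one cycle per cell — not a model of actual TPU internals.)

import numpy as np

def systolic_matmul(W, X):
    # W: (n, n) weights, resident in the array (cell (i, k) holds W[i, k]).
    # X: (b, n) batch of input vectors; X[v] is input vector v.
    # Returns Y with Y[v] = W @ X[v], accumulated cell by cell in the same
    # wavefront order as the slides: input element X[v, k] reaches row i of
    # column k at cycle t = v + i + k.
    n, b = W.shape[0], X.shape[0]
    Y = np.zeros((b, n))
    for t in range(b + 2 * (n - 1)):      # cycles until the pipeline drains
        for i in range(n):                # row of the array (output element)
            for k in range(n):            # column of the array (input index)
                v = t - i - k             # which input vector is in this cell
                if 0 <= v < b:
                    Y[v, i] += W[i, k] * X[v, k]
    return Y

W = np.arange(1.0, 10.0).reshape(3, 3)
X = np.arange(1.0, 10.0).reshape(3, 3)
assert np.allclose(systolic_matmul(W, X), X @ W.T)  # Y[v] == W @ X[v]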

Page 22:

Cloud TPU Pod

64 Cloud TPUs in a 2-D toroidal mesh
11.5 petaFLOPS (64 × 180 TFLOPS), 4 terabytes of HBM memory (64 × 64 GB)

Page 23:

Accelerated Linear Algebra (XLA)

● JIT / AOT compiler for linear algebra
● Targets multiple backends, e.g. CPUs, GPUs, and TPUs
● Compiler, runtime, and accelerator-specific optimizer

The life of a neural network:

model.py → TF Estimator code → TF Graph

Page 24:

Accelerated Linear Algebra (XLA)

● JIT / AOT compiler for linear algebra
● Targets multiple backends, e.g. CPUs, GPUs, and TPUs
● Compiler, runtime, and accelerator-specific optimizer

The life of a neural network:

model.py → TF Estimator code → TF Graph → XLA (target-independent optimizations → target-specific code generation)
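(For reference, a minimal sketch of turning on XLA JIT compilation in the TF 1.x API of the era; this is illustrative, not from the talk, and on Cloud TPUs the TPUEstimator/XLA path is handled for you.)

import tensorflow as tf

# Ask TensorFlow to JIT-compile eligible subgraphs through XLA.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    # ... build and run the graph as usual; XLA clusters and compiles
    # supported ops behind the scenes.
    pass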

Page 25: ImageNet is the new MNIST - GitHub Pages TFLOPS of computation, 64 GB of HBM memory, 2400 GB/s mem BW. Cloud TPU. TPUv2 Chip core core HBM 8 GB HBM 8 GB scalar unit MXU 128x128 MXU

Large batch training

● Understanding generalization gap (2016 N. Keskar et. al., 2017 E. Hoffer et.

al.)

● Relationship of batch size and noise scale (2018 S. Smith et. al.)

● Learning rate scaling and schedule (2017 P. Goyal et. al.)

● New optimizers

○ K-FAC*: approximate Fisher information matrix (2015 J. Martens)

○ Neumann*: approximate inverse Hessian (2018 S. Krishnan et. al.)

○ LARS: per-layer learning rate (2018 Y. You et. al.)

* stick around after this talk to hear more about these!
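(A simplified numpy sketch of the LARS idea: scale each layer's step by the ratio of weight norm to gradient norm. The trust coefficient value and the omission of momentum and weight decay are simplifications of You et al., not the paper's full algorithm.)

import numpy as np

def lars_step(w, g, base_lr=1.0, trust=1e-3, eps=1e-9):
    # Per-layer learning rate: large weights / small gradients -> bigger step.
    local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
    return w - base_lr * local_lr * g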

Page 26:

Experiments

(chart: ResNet-50 training on ImageNet; batch sizes 256 / 1024 / 4096 / 8192 / 16384 on 1 / 4 / 16 / 32 / 64 TPUs; axes: validation accuracy and hours to 90 epochs; accuracy holds near 76.6% while training time drops to 45 min)

Page 27:

Experiments

Page 28:

Experiments

# of TPU devices | Batch size       | Time to 90 epochs | Accuracy
32               | 8192             | 44.9 minutes      | 76.1%
64               | 8192             | 29.8 minutes      | 75.7%
64               | 16384            | 22.3 minutes      | 75.0%
64               | 65536            | 17.5 minutes      | 65.4%
64               | 8192 → 16384 [1] | 29.5 minutes      | 76.1%

The only change between runs is the batch size (with linearly scaled learning rate) and the hardware: no model changes, no hyperparameter re-tuning!

[1] Don't Decay the Learning Rate, Increase the Batch Size (2018 S. Smith et al.)
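(A toy sketch of the 8192 → 16384 run's schedule: instead of decaying the learning rate partway through training, increase the batch size. The ramp epoch below is a made-up placeholder; the talk doesn't state when the switch happens.)

def batch_size_schedule(epoch, ramp_epoch=30):
    # Hypothetical ramp point; Smith et al. swap LR decay for batch growth.
    return 8192 if epoch < ramp_epoch else 16384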

Page 29:

More than just ImageNet

Transformer model from "Attention is All You Need" (2017 A. Vaswani et al.)
WMT'14 English-German translation task
Adam optimizer, same learning rate schedule across configurations

# TPUs | Batch size (i/o tokens) | Time to PPL=4.8
1      | 16k / 16k               | 17.9 hours
4      | 32k / 32k               | 3.5 hours
16     | 256k / 256k             | 1.1 hours
64     | 1M / 1M                 | 0.5 hours
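(For reference, "PPL" is perplexity, the exponentiated per-token cross-entropy; lower means the model assigns higher probability to the reference translations.)

import math

def perplexity(mean_cross_entropy_per_token):
    # Cross-entropy measured in nats; PPL = e^H.
    return math.exp(mean_cross_entropy_per_token)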

Page 30:

Implications

● Faster training enables neural architecture search
○ Architectures found by reinforcement learning beat existing models in accuracy and cost [1]

[1] Learning Transferable Architectures for Scalable Image Recognition (2017 B. Zoph et al.)

Page 31:

Implications

● Faster training enables neural architecture search
○ Architectures found by reinforcement learning beat existing models in accuracy and cost [1]
● What's the "new ImageNet"?
○ Full ImageNet (14M), Open Images (9M), YouTube-8M
○ Performance increases logarithmically with data [2]

[1] Learning Transferable Architectures for Scalable Image Recognition (2017 B. Zoph et al.)
[2] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era (2017 C. Sun et al.)

Page 32:

Thank you!
[email protected]

Thanks to: Pieter-jan, Brennan, Sam, Jonathan, Chris, Zak, Quoc, Bjarke, Noam, Naveen, Sameer

g.co/brain
g.co/tpusignup