Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz PRUNING CONVOLUTIONAL NEURAL NETWORKS 2017
PRUNING CONVOLUTIONAL NEURAL NETWORKS - …on-demand.gputechconf.com/gtc/2017/presentation/s7442-pavlo... · Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz PRUNING

Mar 31, 2018

Pavlo Molchanov

Stephen Tyree

Tero Karras

Timo Aila

Jan Kautz

PRUNING CONVOLUTIONAL NEURAL NETWORKS

2017

WHY CAN WE PRUNE CNNS?

Optimization “failures”:

• Some neurons are "dead": little activation

• Some neurons are uncorrelated with output

Modern CNNs are overparameterized:

• VGG16 has 138M parameters

• AlexNet has 61M parameters

• ImageNet has 1.2M images
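One way to read these numbers is as parameters per training image. The short calculation below uses only the figures quoted on the slide; the ratio itself is an illustrative measure chosen here, not one the authors report.

```python
# Parameter counts and dataset size as quoted on the slide.
params = {"VGG16": 138e6, "AlexNet": 61e6}
imagenet_images = 1.2e6

# Parameters available per training image for each network.
params_per_image = {name: count / imagenet_images for name, count in params.items()}

for name, ratio in params_per_image.items():
    print(f"{name}: about {ratio:.0f} parameters per ImageNet training image")
```

Both networks have far more parameters than training images, which is the sense of "overparameterized" used here.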

PRUNING FOR TRANSFER LEARNING

Small dataset: Caltech-UCSD Birds (200 classes, <6000 images), with classes such as Oriole and Goldfinch

[Diagram, built up over four slides; each option is rated on Accuracy and Size/Speed]

• Small dataset → Training → Small network
• Large pretrained network (AlexNet, VGG16, ResNet) → Fine-tuning on the small dataset
• Fine-tuned large network → Pruning → Smaller network

TYPES OF UNITS

• Convolutional units
  • Heavy on computation
  • Light on storage
• Fully connected (dense) units
  • Fast on computation
  • Heavy on storage

Ratio of floating point operations:

Network   Convolutional layers (our focus)   Fully connected layers
VGG16     99%                                1%
AlexNet   89%                                11%
R3DCNN    90%                                10%

To reduce computation, we focus pruning on convolutional units.
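The computation/storage contrast can be illustrated by counting weights and multiply-accumulates for one layer of each kind. This is a rough sketch: the layer shapes are assumptions taken from the standard VGG16 architecture with 224 x 224 input, biases are ignored, and the helper names are hypothetical.

```python
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    """Return (weights, multiply-accumulates) for a k x k convolution, biases ignored."""
    weights = k * k * c_in * c_out
    macs = weights * h_out * w_out  # every filter weight is reused at each output position
    return weights, macs


def dense_layer_cost(n_in, n_out):
    """Return (weights, multiply-accumulates) for a fully connected layer, biases ignored."""
    weights = n_in * n_out
    return weights, weights  # each weight is used exactly once per input


# Assumed shapes: a mid-network 3x3 convolution and the first dense layer of VGG16.
conv_weights, conv_macs = conv_layer_cost(k=3, c_in=128, c_out=256, h_out=56, w_out=56)
fc_weights, fc_macs = dense_layer_cost(n_in=7 * 7 * 512, n_out=4096)

print(f"conv:  {conv_weights:,} weights, {conv_macs:,} MACs")
print(f"dense: {fc_weights:,} weights, {fc_macs:,} MACs")
```

The convolution stores far fewer weights but performs far more operations, because each weight is reused at every spatial position.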

TYPES OF PRUNING

[Diagram: three networks side by side: no pruning, fine pruning, coarse pruning]

Fine pruning
• Remove connections between neurons/feature maps
• May require special SW/HW for full speed-up

Coarse pruning (our focus)
• Remove entire neurons/feature maps
• Instant speed-up
• No change to HW/SW
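The difference between the two kinds of pruning shows up in the shape of the weight matrix. The sketch below uses a made-up 3-neuron layer stored as nested lists; the function names and the magnitude threshold are illustrative assumptions, not the method from the talk.

```python
def fine_prune(weights, threshold):
    """Zero individual connections whose magnitude is below threshold; shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def coarse_prune(weights, keep_rows):
    """Drop whole output neurons (rows); the matrix itself becomes smaller."""
    return [row for i, row in enumerate(weights) if i in keep_rows]


# Toy 3-neuron layer with 4 inputs each (values are made up for illustration).
layer = [
    [0.9, -0.05, 0.4, 0.01],
    [0.02, 0.03, -0.01, 0.04],
    [-0.7, 0.6, 0.02, -0.5],
]

sparse_layer = fine_prune(layer, threshold=0.1)      # still 3 x 4, but with zeros
small_layer = coarse_prune(layer, keep_rows={0, 2})  # now 2 x 4, dense
```

The fine-pruned matrix keeps its size, so a speed-up depends on sparse kernels; the coarse-pruned matrix is simply a smaller dense matrix that existing software runs faster as-is.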

NETWORK PRUNING

• $C$: training cost function
• $\mathcal{D}$: training data
• $W$: network weights
• $W'$: pruned network weights

Training:

$\min_{W} C(W, \mathcal{D})$

Pruning:

$\min_{W'} \left| C(W', \mathcal{D}) - C(W, \mathcal{D}) \right| \quad \text{s.t.} \quad W' \subset W,\ |W'| < B$

The constraint is also written as $\|W'\|_0 \le B$, where $\|\cdot\|_0$ is the $\ell_0$ norm, the number of non-zero elements.
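The budget constraint can be made concrete by counting non-zero weights. A minimal sketch with toy values; `l0_norm` and `satisfies_budget` are hypothetical helper names.

```python
def l0_norm(weights):
    """Number of non-zero entries in a flat list of weights."""
    return sum(1 for w in weights if w != 0.0)


def satisfies_budget(pruned_weights, budget):
    """Check the pruning constraint: at most `budget` non-zero weights remain."""
    return l0_norm(pruned_weights) <= budget


w = [0.5, -1.2, 0.3, 0.8]          # original weights W (toy values)
w_pruned = [0.5, -1.2, 0.0, 0.0]   # candidate W' with two weights zeroed

print(l0_norm(w), l0_norm(w_pruned), satisfies_budget(w_pruned, budget=2))
```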

NETWORK PRUNING

Exact solution: a combinatorial optimization problem, too computationally expensive

• VGG-16 has $|W| = 4224$ convolutional units
• Number of candidate subsets: $2^{|W|} = 2^{4224} \approx 3.55 \times 10^{1271}$ (a 1272-digit number)

Greedy pruning:

• Assumes all neurons are independent (the same assumption is made for back propagation)
• Iteratively removes the neuron with the smallest contribution
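The size of the search space is easy to reproduce, since Python integers have arbitrary precision. The snippet below only restates the slide's count of 4224 units; the variable names are illustrative.

```python
num_units = 4224                 # convolutional units in VGG-16, from the slide
num_subsets = 2 ** num_units     # every unit is either kept or removed

digits = str(num_subsets)
print(f"2^{num_units} has {len(digits)} decimal digits, starting with {digits[:8]}...")

# By contrast, greedy pruning removes one unit per step,
# so it needs at most num_units ranking steps.
max_greedy_steps = num_units
```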

GREEDY NETWORK PRUNING
Iterative pruning

Algorithm:

1) Estimate importance of neurons (units)
2) Rank units
3) Remove the least important unit
4) Fine-tune the network for K iterations
5) Go back to step 1)
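The five steps can be sketched as a loop over a set of units. This is a schematic under stated assumptions: the stopping rule (`target_count`), the callback signatures, and the toy stand-ins for importance estimation and fine-tuning are all hypothetical, since the slide does not specify them.

```python
def greedy_prune(units, estimate_importance, fine_tune, target_count, k_iterations):
    """Remove one unit at a time until only `target_count` units remain."""
    units = list(units)
    while len(units) > target_count:
        scores = estimate_importance(units)               # 1) estimate importance
        ranked = sorted(units, key=lambda u: scores[u])   # 2) rank units
        units.remove(ranked[0])                           # 3) remove the least important
        fine_tune(units, k_iterations)                    # 4) fine-tune for K iterations
        # 5) the while loop returns to step 1
    return units


# Hypothetical stand-ins: fixed scores and a fine-tuning step that only logs its calls.
toy_scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
fine_tune_log = []


def toy_importance(units):
    return {u: toy_scores[u] for u in units}


def toy_fine_tune(units, k_iterations):
    fine_tune_log.append((tuple(units), k_iterations))


remaining = greedy_prune(toy_scores, toy_importance, toy_fine_tune,
                         target_count=2, k_iterations=30)
```

In a real network the importance scores would change after every fine-tuning phase, which is why they are re-estimated inside the loop rather than computed once.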

ORACLE
Caltech-UCSD Birds-200-2011 Dataset

• 200 classes
• <6000 training images

Source               Method             Test accuracy
S. Belongie et al.   *SIFT+SVM          19%
From scratch         CNN                25%
S. Razavian et al.   *OverFeat+SVM      62%
Our baseline         VGG16 fine-tuned   72.2%
N. Zhang et al.      R-CNN              74%
S. Branson et al.    *Pose-CNN          76%
J. Krause et al.     *R-CNN+            82%

*requires additional attributes

ORACLE

• Exhaustively computed change in loss by removing one unit at a time

[Figure: VGG16 on Birds-200 dataset; panels: first layer, last layer]
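The oracle amounts to switching off each unit in turn and re-measuring the loss. The sketch below does this for a tiny, randomly initialized ReLU network written in plain Python; the network, data, squared-error loss, and function names are illustrative assumptions and not the VGG16 setup from the talk.

```python
import random


def loss(inputs, targets, w_hidden, w_out, mask):
    """Mean squared error of a tiny ReLU network whose hidden units are gated by mask."""
    total = 0.0
    for x, y in zip(inputs, targets):
        hidden = [m * max(0.0, sum(wi * xi for wi, xi in zip(w, x)))
                  for w, m in zip(w_hidden, mask)]
        pred = sum(h * w for h, w in zip(hidden, w_out))
        total += (pred - y) ** 2
    return total / len(inputs)


def oracle_scores(inputs, targets, w_hidden, w_out):
    """Absolute change in loss when each hidden unit is removed on its own."""
    n = len(w_hidden)
    base = loss(inputs, targets, w_hidden, w_out, [1.0] * n)
    scores = []
    for i in range(n):
        mask = [0.0 if j == i else 1.0 for j in range(n)]
        scores.append(abs(loss(inputs, targets, w_hidden, w_out, mask) - base))
    return scores


rng = random.Random(0)
inputs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(32)]
targets = [rng.gauss(0, 1) for _ in range(32)]
w_hidden = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
w_out = [rng.gauss(0, 1) for _ in range(3)] + [0.0]  # last unit has no effect on the output

scores = oracle_scores(inputs, targets, w_hidden, w_out)
```

Each score costs one full evaluation of the loss, so the oracle needs as many evaluation passes as there are units, which is what makes it expensive at the scale of a real network.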

ORACLE
VGG-16 on Birds-200

• On average, the first layers are more important
• Every layer has very important units
• Every layer has unimportant units
• Layers with pooling are more important

[Figure: oracle rank of units by layer # (lower is better); only convolutional layers]

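APPROXIMATING THE ORACLE

Candidate criteria:

• Average activation (discard units with lower activations)
• Minimum weight (discard units with lower $\ell_2$ norm of weights)
• First-order Taylor expansion (TE): approximate the absolute difference in cost caused by removing a neuron by $\left|\frac{\delta C}{\delta h_i} h_i\right|$, ignoring the higher-order remainder
  • $\frac{\delta C}{\delta h_i}$: gradient of the cost w.r.t. activation $h_i$
  • $h_i$: the unit's output
  • Both are computed during standard backprop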
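A first-order criterion of this kind multiplies each unit's output by the gradient of the cost with respect to that output and takes the absolute value. The sketch below computes it for a tiny ReLU network with an analytic gradient; the network, the per-example squared-error cost, and the choice to average the absolute product over examples are simplifying assumptions made for this example.

```python
import random


def taylor_scores(inputs, targets, w_hidden, w_out):
    """Mean over examples of |dC/dh_i * h_i| for each hidden unit i."""
    n_units = len(w_hidden)
    sums = [0.0] * n_units
    for x, y in zip(inputs, targets):
        hidden = [max(0.0, sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
        pred = sum(h * w for h, w in zip(hidden, w_out))
        for i in range(n_units):
            grad = 2.0 * (pred - y) * w_out[i]   # dC/dh_i for the cost C = (pred - y)^2
            sums[i] += abs(grad * hidden[i])
    return [s / len(inputs) for s in sums]


rng = random.Random(0)
inputs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(32)]
targets = [rng.gauss(0, 1) for _ in range(32)]
w_hidden = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
w_out = [rng.gauss(0, 1) for _ in range(3)] + [0.0]  # last unit cannot affect the cost

scores = taylor_scores(inputs, targets, w_hidden, w_out)
```

Unlike an exhaustive search, this needs only one forward and backward pass over the data, because the activations and gradients for all units are available at once.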

APPROXIMATING THE ORACLE

Candidate criteria:

• Alternative: Optimal Brain Damage (OBD) by Y. LeCun et al., 1990
• Uses second-order derivatives to estimate the importance of neurons
• In the expansion, the first-order term is taken to be 0 and the remainder is ignored
• Needs extra computation of second-order derivatives

APPROXIMATING THE ORACLE
Comparison to OBD

• OBD: second-order expansion, with the first-order term taken to be 0
• We propose: absolute value of the first-order expansion

Assuming $y = \frac{\delta C}{\delta h_i} h_i$:

• For a perfectly trained model: $\mathbb{E}(y) = 0$
• If $y$ is Gaussian: $\mathbb{E}|y| = \sigma\sqrt{2/\pi}$, where $\sigma$ is the standard deviation of $y$

Properties of the first-order criterion:

• No extra computations
• We look at the absolute difference
• Limitation: can't predict the exact change in loss
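The Gaussian remark can be checked numerically. For a zero-mean Gaussian $y$ the mean is zero, while the mean of $|y|$ equals $\sigma\sqrt{2/\pi}$ (the mean of a half-normal distribution), so the absolute value still carries information even when the signed first-order term averages out. The value of sigma and the sample count below are arbitrary choices for the demonstration.

```python
import math
import random

rng = random.Random(0)
sigma = 2.0
samples = [rng.gauss(0.0, sigma) for _ in range(100_000)]

mean_y = sum(samples) / len(samples)                       # close to 0
mean_abs_y = sum(abs(s) for s in samples) / len(samples)   # clearly above 0
expected_abs = sigma * math.sqrt(2.0 / math.pi)            # half-normal mean

print(f"E(y) ~ {mean_y:.3f}, E|y| ~ {mean_abs_y:.3f}, sigma*sqrt(2/pi) = {expected_abs:.3f}")
```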

EVALUATING PRUNING CRITERIA
Spearman's rank correlation with oracle: VGG16 on Birds-200

Mean rank correlation (across layers):

Criterion          Correlation
Min weight         0.27
Activation         0.56
OBD                0.59
Taylor expansion   0.73

[Figure: correlation with oracle by layer # (layers 1 to 13) for Min weight, Activation, OBD, and Taylor expansion]
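Spearman's rank correlation compares the ordering two scoring methods give to the same units. The sketch below implements the textbook no-ties formula in plain Python; the importance values are made up, and handling of tied scores is deliberately left out.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


oracle = [0.40, 0.10, 0.30, 0.90]      # made-up oracle importances
criterion = [0.35, 0.20, 0.15, 0.80]   # made-up scores from a cheaper criterion
rho = spearman(oracle, criterion)
```

Only the ordering matters: a criterion can be on a completely different scale from the oracle and still reach a correlation of 1 if it ranks the units identically.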

EVALUATING PRUNING CRITERIA
Pruning with objective

• Regularize criteria with objective
• Regularizer can be:
  • FLOPs
  • Memory
  • Bandwidth
  • Target device
  • Exact inference time

[Figure: VGG16, FLOPs per unit by layer # (layers 1 to 14)]
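One plausible way to fold such an objective into the ranking is to subtract a cost penalty from each unit's importance, so that expensive units become more likely to be pruned. The linear form, the weight `lam`, and all numbers below are assumptions for illustration; the slide's own formula did not survive in this transcript.

```python
def regularized_scores(importance, flops_per_unit, lam):
    """Penalize each unit's importance by lam times its FLOPs cost (assumed linear form)."""
    return {unit: importance[unit] - lam * flops_per_unit[unit] for unit in importance}


# Made-up importances and per-unit costs for three hypothetical units.
importance = {"conv1_u0": 0.50, "conv3_u7": 0.45, "conv5_u2": 0.40}
flops_per_unit = {"conv1_u0": 60.0, "conv3_u7": 30.0, "conv5_u2": 5.0}

plain_first = min(importance, key=importance.get)          # pruned first without the penalty
scores = regularized_scores(importance, flops_per_unit, lam=0.005)
regularized_first = min(scores, key=scores.get)            # pruned first with the penalty
```

With the penalty, the costly early-layer unit is removed first even though its raw importance is the highest of the three.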

RESULTS
VGG16 on Birds 200 dataset

• Remove 1 conv unit every 30 updates
• Training from scratch doesn't work
• The Taylor criterion shows the best result compared with every other pruning metric

[Figure: results plotted against GFLOPs and against #convolutional kernels]

RESULTS
AlexNet on Oxford Flowers102

• 102 classes
• ~2k training images
• ~6k testing images

[Figure: changing the number of updates between pruning iterations; curves for 10, 30, 60, and 1000 updates]

• 3.8x FLOPs reduction
• 2.4x actual speed-up

RESULTS
VGG16 on ImageNet

• Pruned over 7 epochs
• Fine-tuning: 7 epochs

GFLOPs   FLOPs reduction   Actual speed-up   Top-5
31       1x                                  89.5%
12       2.6x              2.5x              -2.5%
8        3.9x              3.3x              -5.0%

Top-5 is measured on the validation set.
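The table's columns can be cross-checked with a few lines of arithmetic. The snippet assumes the Top-5 changes are absolute percentage points relative to the 89.5% baseline, and the "realized" ratio is an illustrative measure introduced here, not a number from the talk.

```python
baseline_gflops = 31.0
baseline_top5 = 89.5

# (GFLOPs after pruning, measured speed-up, change in top-5 accuracy), from the table.
pruned = [(12.0, 2.5, -2.5), (8.0, 3.3, -5.0)]

rows = []
for gflops, speed_up, top5_delta in pruned:
    flops_reduction = baseline_gflops / gflops      # theoretical reduction
    realized = speed_up / flops_reduction           # share of it seen as wall-clock speed-up
    rows.append((round(flops_reduction, 1), round(realized, 2), baseline_top5 + top5_delta))

for reduction, realized, top5 in rows:
    print(f"{reduction}x fewer FLOPs, {realized:.0%} realized as speed-up, top-5 {top5}%")
```

The measured speed-up tracks the FLOPs reduction closely at 12 GFLOPs and falls somewhat further behind at 8 GFLOPs.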

RESULTS
R3DCNN for gesture recognition

3D-CNN with recurrent layers fine-tuned for 25 dynamic gestures

P. Molchanov, Gesture recognition with 3D CNNs, GTC 2016

• 12.6x reduction in FLOPs
• 2.5% drop in accuracy
• 5.2x speed-up

How many neurons do we need to classify a cat?

DOGS VS. CATS

Dogs vs. Cats classification @kaggle

• 25,000 images
• Marco Lugo's solution, 3rd place

DOGS VS. CATS
Fine-tuned ResNet-101

• Full network: 99.2%
• Pruned network: 99.0%
• Convolutional units: 52,672 before pruning, 3,472 after pruning
• 15x compression

[Figure: number of convolutional units (0 to 60,000) vs. pruning iteration (0 to 1000)]
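The compression figure follows directly from the two unit counts shown in the plot. A quick check, using only numbers from the slide:

```python
units_before = 52_672   # convolutional units in the full fine-tuned ResNet-101
units_after = 3_472     # convolutional units left after pruning

compression = units_before / units_after
kept_fraction = units_after / units_before

print(f"{compression:.1f}x fewer convolutional units, {kept_fraction:.1%} of units kept")
```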

CONCLUSIONS

• Pruning as greedy feature selection

• New criterion based on Taylor expansion

• Pruning is especially effective (and necessary!) for transfer learning

• Pruning can incorporate desired objectives (such as FLOPs)

• Read more in our ICLR 2017 paper: https://openreview.net/pdf?id=SJGCiw5gl

THANK YOU!