Page 1:

Quantized Stochastic Gradient Descent: Communication vs. Convergence

Dan Alistarh¹, Jerry Li²,³, Ryota Tomioka², Milan Vojnovic²

¹ETH, ²Microsoft Research, ³MIT

Page 2:

Deep models: how to train them efficiently?

• Vision
  • ImageNet: 1.6 million images
  • ResNet-152 [He+15]: 152 layers, 60 million parameters

• Speech
  • NIST2000 Switchboard dataset: 2000 hours
  • LACEA [Yu+16]: 22 layers, 65 million parameters (w/o language model)

He et al. (2015) “Deep Residual Learning for Image Recognition”
Yu et al. (2016) “Deep convolutional neural networks with layer-wise context expansion and attention”

Page 3:

Data parallel SGD

[Figure: timeline of data-parallel SGD. Each worker computes its gradient, the gradients are exchanged, and the parameters are updated; this repeats over minibatches 1-3.]
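To make the three phases concrete, here is a minimal sketch of one synchronous data-parallel SGD step (my own illustration, not code from the talk; a least-squares gradient and an in-memory average stand in for real workers and a real all-reduce):

```python
import numpy as np

def local_gradient(params, minibatch):
    # Placeholder loss: gradient of mean squared error on (X, y).
    X, y = minibatch
    return 2.0 * X.T @ (X @ params - y) / len(y)

def data_parallel_sgd_step(params, minibatches, learning_rate=0.1):
    """One synchronous data-parallel SGD step.

    Compute phase : each worker computes a gradient on its own minibatch.
    Exchange phase: the gradients are averaged (stands in for an all-reduce).
    Update phase  : every worker applies the same averaged update.
    """
    grads = [local_gradient(params, mb) for mb in minibatches]  # compute
    avg_grad = np.mean(grads, axis=0)                           # exchange
    return params - learning_rate * avg_grad                    # update
```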

Page 4:

Data parallel SGD (bigger model)

[Figure: the same pipeline with a bigger model; the gradient exchange takes longer, so only minibatches 1-2 fit in the same time window.]

Page 5:

Data parallel SGD (biggggger model)

[Figure: with an even bigger model, the gradient exchange dominates the iteration time (minibatches 1-2).]

Page 6:

If we could compress the gradients…

[Figure: if the gradients were compressed, the exchange phase would shrink and the pipeline would again be dominated by computation (minibatches 1-2).]

Page 7:

Inspiration: 1-bit SGD [Seide et al. 2014]

• Quantization function

$$Q_i(v) = \begin{cases} \bar{v}_p & \text{if } v_i \ge 0, \\ \bar{v}_n & \text{otherwise,} \end{cases}$$

where $\bar{v}_p = \mathrm{mean}(\{v_i : v_i \ge 0\})$ and $\bar{v}_n = \mathrm{mean}(\{v_i : v_i < 0\})$.

[Figure: the n float coordinates (v1, ..., vn) are replaced by two floats ($\bar{v}_p$, $\bar{v}_n$) plus n sign bits; compression rate ≈ 32x.]

Seide et al. (2014) “1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs”

Unfortunately, no theoretical justification!
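As a concrete reading of the scheme above, here is a minimal sketch of 1-bit quantization (my own code, not the authors'; Seide et al. additionally feed the quantization error back into the next minibatch, which is omitted here):

```python
import numpy as np

def one_bit_quantize(v):
    """1-bit quantization: keep only the sign of each coordinate plus the
    means of the non-negative and negative parts, so n floats become
    n sign bits + 2 floats (compression rate about 32x)."""
    pos = v >= 0
    v_p = v[pos].mean() if pos.any() else 0.0      # mean over coords with v_i >= 0
    v_n = v[~pos].mean() if (~pos).any() else 0.0  # mean over coords with v_i < 0
    return pos, v_p, v_n

def one_bit_dequantize(pos, v_p, v_n):
    """Reconstruct Q(v): v_p where v_i >= 0, v_n elsewhere."""
    return np.where(pos, v_p, v_n)
```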

Page 8:

Our contribution

• We propose a family of new quantization functions

• The quantized gradient is an unbiased stochastic gradient

• Allows a super-constant $\tilde{O}(\sqrt{n})$ compression rate

• Convergence guarantee in $\tilde{O}(\sqrt{n})$ times more steps

• Hyper-parameters control the trade-off between convergence and compression

• We empirically show that convergence does not slow down too much

Page 9:

A simple randomized quantization function

• Quantization function

$$Q_i(v) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v),$$

where $\xi_i(v) = 1$ with probability $|v_i| / \|v\|_2$ and $0$ otherwise.

[Figure: the n float coordinates (v1, ..., vn) are replaced by one float, $\|v\|_2$, plus the signs and locations of the surviving coordinates: roughly $\sqrt{n}$ sign bits and integer indices, for a compression rate of about $\sqrt{n}$.]

Note that $E\left[\sum_i \xi_i(v)\right] \le \|v\|_1 / \|v\|_2 \le \sqrt{n}$.
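A minimal NumPy sketch of this quantizer (my own illustration; the function names and the sparse message format are assumptions, not the paper's encoding):

```python
import numpy as np

def qsgd_simple_quantize(v, rng=np.random.default_rng()):
    """Q_i(v) = ||v||_2 * sgn(v_i) * xi_i(v), with
    xi_i(v) = 1 with probability |v_i| / ||v||_2 and 0 otherwise.
    Returns the norm plus the signed sparse support (what would be sent)."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.array([], dtype=int), np.array([], dtype=int)
    keep = rng.random(v.shape) < np.abs(v) / norm   # Bernoulli(|v_i| / ||v||_2)
    idx = np.flatnonzero(keep)                      # locations of kept coordinates
    signs = np.sign(v[idx]).astype(int)             # one sign bit each
    return norm, idx, signs

def qsgd_simple_dequantize(norm, idx, signs, n):
    """Unbiased reconstruction: ||v||_2 * sgn(v_i) on kept coordinates, 0 elsewhere."""
    q = np.zeros(n)
    q[idx] = norm * signs
    return q
```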

Page 10:

Properties of the proposed quantization function

• Quantization function

$$Q_i(v) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v)$$

1. Sparsity: $E\left[\sum_i \xi_i(v)\right] \le \|v\|_1 / \|v\|_2 \le \sqrt{n}$

2. Unbiasedness: $E[Q_i(v)] = v_i$

3. Second moment bound: $E\left[\|Q(v)\|_2^2\right] \le \sqrt{n}\,\|v\|_2^2$

where $\xi_i(v) = 1$ with probability $|v_i| / \|v\|_2$ and $0$ otherwise.
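A quick empirical check of the three properties, reusing the qsgd_simple_quantize / qsgd_simple_dequantize sketch from the previous page (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
v = rng.standard_normal(n)

Q = np.stack([qsgd_simple_dequantize(*qsgd_simple_quantize(v, rng), n)
              for _ in range(5000)])

print("unbiasedness  max|mean(Q) - v| :", np.abs(Q.mean(axis=0) - v).max())   # -> 0 as samples grow
print("sparsity      E[#nonzeros]     :", (Q != 0).sum(axis=1).mean(), "<= sqrt(n) =", n**0.5)
print("second moment E||Q||^2/||v||^2 :", (Q**2).sum(axis=1).mean() / (v @ v), "<= sqrt(n)")
```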

Page 11:

Why this is a better quantization

• 1-bit SGD approximates large and tiny coordinates to the same mean value

• The proposed quantization function avoids this by randomizing

Page 12:

Theorem

• The quantized gradient can be communicated in

$$F + \sqrt{n}\,\bigl(\log n + \log(2e)\bigr)$$

bits in expectation

• F is the number of bits needed to represent one floating-point number

• There are only $\sqrt{n}$ non-zero coordinates in expectation

• For each non-zero entry we use $O(\log n)$ bits to encode the location and 1 bit to encode the sign

• This gives a $\sqrt{n}$-fold reduction in per-iteration communication
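A rough back-of-the-envelope check of the savings implied by this count (my own numbers; assumes 32-bit floats and base-2 logarithms in the bound above):

```python
import math

n = 1 << 20                      # illustrative gradient dimension (~1M parameters)
F = 32                           # bits per float
dense_bits = 32 * n              # uncompressed gradient
quantized_bits = F + math.sqrt(n) * (math.log2(n) + math.log2(2 * math.e))
print(f"dense: {dense_bits:.3g} bits, quantized: {quantized_bits:.3g} bits, "
      f"ratio ~{dense_bits / quantized_bits:.0f}x")   # ~1460x, i.e. sqrt(n) up to log factors
```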

Page 13:

Bucketing

• Apply the quantization to each block of $d$ consecutive coordinates

• Bucket size $d = 1$ corresponds to no quantization

• Reduced second moment bound => faster convergence

$$E\left[\|Q_d(v)\|_2^2\right] \le \sqrt{d}\,\|v\|_2^2$$

• Communication cost

$$\frac{n}{d}\,\bigl(F + \sqrt{d}\,(\log d + \log(2e))\bigr)$$

[Figure: the vector is split into $n/d$ buckets of $d$ consecutive coordinates (Bucket 1, Bucket 2, ..., Bucket $n/d$); the quantizer is applied to each bucket independently.]
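A sketch of bucketing on top of the simple quantizer above (my own code; how a trailing partial bucket is handled is an arbitrary choice here):

```python
import numpy as np

def qsgd_bucketed_quantize(v, d, rng=np.random.default_rng()):
    """Apply the simple quantizer independently to every block of d consecutive coords."""
    return [qsgd_simple_quantize(v[i:i + d], rng) for i in range(0, len(v), d)]

def qsgd_bucketed_dequantize(messages, d, n):
    """Reassemble the full-length vector from the n/d per-bucket messages."""
    out = np.zeros(n)
    for k, (norm, idx, signs) in enumerate(messages):
        out[k * d + idx] = norm * signs
    return out
```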

Page 14:

Generalized quantization scheme

• Quantization function

$$Q_i(v; s) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v, s)$$

where

$$\xi_i(v, s) = \begin{cases} \ell/s & \text{with probability } \ell + 1 - s\,|v_i| / \|v\|_2, \\ (\ell + 1)/s & \text{otherwise,} \end{cases}$$

with $\ell = \lfloor s\,|v_i| / \|v\|_2 \rfloor$ (s is a tuning parameter).

[Figure: number line from 0 to 1 with quantization levels 1/s, ..., ℓ/s, (ℓ+1)/s, ..., 1; the value $|v_i|/\|v\|_2$ is rounded at random to one of its two neighbouring levels.]

• Note: s = 1 reduces to the simple quantization function.
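A minimal sketch of the generalized quantizer (my own code; it returns the norm plus integer levels in [-s, s] rather than a real encoded bit stream):

```python
import numpy as np

def qsgd_general_quantize(v, s, rng=np.random.default_rng()):
    """Q_i(v; s) = ||v||_2 * sgn(v_i) * xi_i(v, s), where xi_i(v, s) rounds
    |v_i| / ||v||_2 at random to one of the two neighbouring multiples of 1/s."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.zeros(len(v), dtype=int)
    r = s * np.abs(v) / norm                       # in [0, s]
    ell = np.floor(r)                              # lower level
    up = rng.random(v.shape) < (r - ell)           # go up with probability r - ell
    levels = ell + up                              # so E[levels] = r (unbiased)
    return norm, (np.sign(v) * levels).astype(int)

def qsgd_general_dequantize(norm, signed_levels, s):
    """Reconstruct Q(v; s) = ||v||_2 * signed_level / s."""
    return norm * signed_levels / s
```

Returning integer levels keeps the sketch simple; the bit-level encoding of locations and levels is what the coding theorems on the following slides account for.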

Page 15:

Properties of the generalized quantization scheme

• Quantization function

$$Q_i(v; s) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v, s)$$

• Properties

1. Sparsity: $E\left[\|\xi(v, s)\|_0\right] \le s^2 + \sqrt{n}$

2. Unbiasedness: $E[Q_i(v; s)] = v_i$

3. Second moment bound: $E\left[\|Q(v; s)\|_2^2\right] \le \left(1 + \min\!\left(\tfrac{n}{s^2},\, \tfrac{\sqrt{n}}{s}\right)\right) \|v\|_2^2$ (only $2\,\|v\|_2^2$ for $s = \sqrt{n}$)

where $\xi_i(v, s) = \ell/s$ with probability $p(|v_i|/\|v\|_2, s)$ and $(\ell+1)/s$ otherwise, with $\ell = \lfloor s\,|v_i|/\|v\|_2 \rfloor$.

Page 16:

Sublinear Theorem (for small s)

• In expectation, the quantized gradient can be communicated in

$$F + \left(3 + \tfrac{3}{2}\,(1 + o(1))\,\log\frac{2\,(s^2 + n)}{s^2 + \sqrt{n}}\right)(s^2 + \sqrt{n})$$

bits

• Communicate the differences between consecutive non-zero locations (at most $s^2 + \sqrt{n}$ of them)

• Use Elias recursive coding

• Each magnitude can be encoded with about log(power per dimension) bits

• Recovers the simple case for s=1.
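The coding step above uses Elias recursive (omega-style) integer codes for the gaps between non-zero locations; as an illustration of this code family, here is the simpler Elias gamma code (my own sketch, not the exact code from the talk):

```python
def elias_gamma_encode(x: int) -> str:
    """Elias gamma code: floor(log2 x) zeros followed by x in binary."""
    assert x >= 1
    b = bin(x)[2:]                       # len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(code: str) -> int:
    """Count the leading zeros, then read that many + 1 bits as the value."""
    zeros = len(code) - len(code.lstrip("0"))
    return int(code[zeros:2 * zeros + 1], 2)

# Small location gaps get short codewords, so sparse supports encode cheaply:
for gap in (1, 2, 5, 100):
    print(gap, "->", elias_gamma_encode(gap))
```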

Page 17:

Linear Theorem (for large s)

• In expectation, the quantized gradient can be communicated in

$$F + \left(\frac{1 + o(1)}{2}\,\log\!\left(1 + \frac{s^2 + \min(n,\; s\sqrt{n})}{n + 1}\right) + 2\right) n$$

bits

• Don't communicate the locations

• For $s = \sqrt{n}$, we have the bound $F + 2.8\,n$

• Linear in the dimension $n$, but with a much smaller constant: about 2.8 bits per coordinate compared to 32 bits for an uncompressed float, and the second moment is only 2 times worse.
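As a quick worked check using the slide's own numbers, for large $n$ the per-iteration saving over sending 32-bit floats is roughly

$$\frac{32\,n}{F + 2.8\,n} \approx \frac{32}{2.8} \approx 11\times.$$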

Page 18:

Experiments

Page 19:

MNIST (digit recognition task)

• Two-layer network (non-convex!)
  • Input 784 -> hidden 4096 -> output 10

• Used the simple quantization scheme with bucketing (bucket size: d)

Page 20:

CIFAR-10 (object recognition task)

• Convolutional network (a small VGG network), 12 layers
  • Input → Conv → BN → Conv → BN → … → Hidden 4096 → Hidden 4096 → Hidden 4096 → Output 10

[Figure: training loss and test accuracy curves.]

Page 21:

Parallelization (preliminary)

Page 22:

Conclusion

• Simple, easy-to-implement quantization scheme

• Sublinear ($\sqrt{n}$) number of bits per iteration

• Performance guarantee

• Bucketing to control compression / convergence trade-off

• General quantization scheme

• Requires roughly 3 bits per coordinate, with a convergence guarantee only 2 times worse compared to SGD

Page 23:

Generalized quantization scheme

• Quantization function

$$Q_i(v; s) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v, s)$$

where

$$\xi_i(v, s) = \begin{cases} \ell/s & \text{with probability } \ell + 1 - s\,|v_i| / \|v\|_2, \\ (\ell + 1)/s & \text{otherwise,} \end{cases}$$

with $\ell = \lfloor s\,|v_i| / \|v\|_2 \rfloor$ and $s$ a hyper-parameter.

• Note: s=1 reduces to the simple quantization function.

Page 24:

Double buffering

[Figure: double-buffering pipeline across minibatches 1-4; the gradient computation for one minibatch overlaps with the gradient exchange and parameter update for the previous one, hiding communication behind computation.]

Page 25:

Thanks (icons: Streamline)