Training Machine Learning Models via Empirical Risk Minimization (Lecture 2) Peter Richtárik The 41st Woudschoten Conference - October 5-7, 2016
Transcript
Page 1:

Training Machine Learning Models via Empirical Risk Minimization
(Lecture 2) Peter Richtárik
The 41st Woudschoten Conference - October 5-7, 2016

Page 2:

Part 1: Lecture 1 Condensed to 2 Slides

Page 3: Lecture 1

• Empirical Risk Minimization
  – Primal Formulation (minimize the average of n convex functions of d variables)
  – Dual Formulation (maximize a concave function of n variables)
• 5 Basic Tools
  – Gradient Descent (GD)
  – Accelerated Gradient Descent (AGD)
  – Handling Nonsmoothness: Proximal Gradient Descent (PGD)
  – Randomized Decomposition
    • Stochastic Gradient Descent (SGD)
    • Randomized Coordinate Descent (RCD)
  – Parallelism / Minibatching

Page 4: Summary of Complexity Results from Lecture 1

\begin{tabular}{|c|c|c|}
\hline
\bf Method & \bf \# iterations & \bf Cost of 1 iter. \\
\hline\hline
Gradient Descent (GD) & $\frac{L}{\mu}\log(1/\epsilon)$ & $n$ \\
\hline
Accelerated Gradient Descent (AGD) & $\sqrt{\frac{L}{\mu}}\log(1/\epsilon)$ & $n$ \\
\hline
Proximal Gradient Descent (PGD) & $\frac{L}{\mu}\log(1/\epsilon)$ & $n$ + prox step \\
\hline
Stochastic Gradient Descent (SGD) & $\left(\frac{\max_i L_i}{\mu}+\frac{\sigma^2}{\mu^2\epsilon}\right)\log(1/\epsilon)$ & $1$ \\
\hline
Randomized Coordinate Descent (RCD) & $\frac{\max_i L_i}{\mu}\log(1/\epsilon)$ & $1$ \\
\hline
\end{tabular}

Page 5: Part 2: Arbitrary Sampling (A Unified Theory of Deterministic and Randomized Gradient-Type Methods)

P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, Optimization Letters 10(6), 1233-1243, 2016 (arXiv:1310.3438)

Page 6: The Problem

Page 7: The Problem

\[\min_{x\in\mathbb{R}^n} f(x)\]

where $f$ is smooth and $\lambda$-strongly convex.

Page 8: The Algorithm

Page 9:

At each iteration $t$, choose a random set of coordinates $S_t \subseteq \{1,2,\dots,n\}$, drawn i.i.d. across iterations with an arbitrary distribution, and write $p_i = \mathbf{P}(i \in S_t)$.

For $i \in S_t$ do
\[x_i^{t+1} \leftarrow x_i^t - \frac{1}{v_i}(\nabla f(x^t))^\top e_i\]

For $i \notin S_t$ do
\[x_i^{t+1} \leftarrow x_i^t\]

Example ($n = 3$):
\[e_1 = \begin{pmatrix}1\\0\\0\end{pmatrix}, \qquad e_2 = \begin{pmatrix}0\\1\\0\end{pmatrix}\]
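To make the update rule concrete, here is a minimal NumPy sketch (not from the slides); the function names, the quadratic test problem, and the choice of a $\tau$-nice sampling are illustrative assumptions, the $\tau$-nice sampling being just one instance of an arbitrary sampling.

```python
import numpy as np

def cd_arbitrary_sampling(grad_f, x0, v, sample, iters=1000):
    """Coordinate descent with an arbitrary sampling: each iteration updates
    a random set S_t of coordinates, x_i <- x_i - (1/v_i) * grad_i f(x),
    and leaves all other coordinates untouched."""
    x = x0.copy()
    for _ in range(iters):
        S = sample()          # random set S_t, drawn i.i.d. across iterations
        g = grad_f(x)         # full gradient; only g[i] for i in S_t is used
        for i in S:           # coordinates outside S_t stay put
            x[i] -= g[i] / v[i]
    return x

# Example: minimize f(x) = 0.5 * ||x - b||^2 with a tau-nice sampling
# (uniform random subsets of size tau, so p_i = tau / n for every i).
n, tau = 100, 10
rng = np.random.default_rng(0)
b = rng.standard_normal(n)
x = cd_arbitrary_sampling(grad_f=lambda x: x - b,
                          x0=np.zeros(n),
                          v=np.ones(n),   # v_i = 1 works for this separable f
                          sample=lambda: rng.choice(n, size=tau, replace=False))
```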

Page 10: Complexity

Page 11: Key Assumption

Parameters $v_1,\dots,v_n$ must satisfy, for all $x, h \in \mathbb{R}^n$:
\[\mathbf{E}\left[f\left(x+\sum_{i\in S_t}h_ie_i\right)\right] \;\;\leq\;\; f(x)+\sum_{i=1}^n p_i\nabla_if(x)\,h_i+\sum_{i=1}^n p_i v_i h_i^2,\]
where $p_i=\mathbf{P}(i\in S_t)$.
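As a sanity check (not on the slide): for a serial sampling, where $S_t=\{i\}$ with probability $p_i$ and $\sum_i p_i=1$, and for $f$ with coordinate-wise $L_i$-Lipschitz gradients, the assumption holds with $v_i = L_i/2$:
\[\mathbf{E}\left[f\left(x+\sum_{i\in S_t}h_ie_i\right)\right]=\sum_{i=1}^n p_i f(x+h_ie_i)\;\leq\;\sum_{i=1}^n p_i\left(f(x)+\nabla_if(x)\,h_i+\tfrac{L_i}{2}h_i^2\right)=f(x)+\sum_{i=1}^n p_i\nabla_if(x)\,h_i+\sum_{i=1}^n p_i\tfrac{L_i}{2}h_i^2.\]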

Page 12: Complexity Theorem

If
\[t \;\;\geq\;\; \left(\max_i\frac{v_i}{p_i\lambda}\right)\log\left(\frac{f(x^0)-f(x^*)}{\epsilon\rho}\right),\]
then
\[\mathbf{P}\left(f(x^t)-f(x^*)\leq\epsilon\right)\geq 1-\rho,\]
where $\lambda$ is the strong convexity constant and $p_i = \mathbf{P}(i\in S_t)$.

Page 13: Uniform vs Optimal Sampling

Uniform sampling, $p_i = \frac{1}{n}$:
\[\max_i\frac{v_i}{p_i\lambda}=\frac{n\max_i v_i}{\lambda}\]

Optimal sampling, $p_i = \frac{v_i}{\sum_i v_i}$:
\[\max_i\frac{v_i}{p_i\lambda}=\frac{\sum_i v_i}{\lambda}\]

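The gain of optimal over uniform sampling is easy to quantify numerically; a small sketch (with hypothetical $v_i$):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(1.0, 100.0, size=1000)   # hypothetical stepsize parameters v_i
lam = 0.05                               # strong convexity constant lambda

factor_uniform = len(v) * v.max() / lam  # n * max_i v_i / lambda
factor_optimal = v.sum() / lam           # sum_i v_i / lambda (never worse)
print(factor_optimal / factor_uniform)   # <= 1; equality iff all v_i are equal
```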

Page 14: How to Compute the Stepsize Parameters?

Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, Optimization Methods and Software 31(5), 829-857, 2016 (arXiv:1412.8060)

Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, Optimization Methods and Software 31(5), 858-884, 2016 (arXiv:1412.8063)

Page 15: Part 3: Quartz

Zheng Qu, P.R. and Tong Zhang, Quartz: Randomized dual coordinate ascent with arbitrary sampling, In Advances in Neural Information Processing Systems 28, 2015

Page 16: Empirical Risk Minimization

Page 17: Statistical Nature of Data

Data (e.g., image, text, measurements, ...) and label:
\[A_i\in\mathbb{R}^{d\times m},\qquad y_i\in\mathbb{R}^m,\qquad (A_i, y_i) \sim \text{Distribution}\]

Example (quadratic loss):
\[\phi_i(a)=\frac{1}{2\gamma}(a-b_i)^2\quad\Longrightarrow\quad\frac{1}{n}\sum_{i=1}^n\phi_i(A_i^\top w)=\frac{1}{2\gamma}\|Aw-b\|_2^2\]

Page 18: Prediction of Labels from Data

Find a linear predictor $w \in \mathbb{R}^d$ such that when a (data, label) pair is drawn from the distribution,
\[(A_i,y_i)\sim\text{Distribution},\]
the predicted label approximates the true label:
\[A_i^\top w\approx y_i\]

Page 19: Measure of Success

Given a loss function $loss(a,b)$ comparing predicted and true labels, with $(A_i, y_i) \sim \text{Distribution}$, we want the expected loss (= risk) to be small:
\[\mathbf{E}\left[loss(A_i^\top w,y_i)\right]\]

Page 20: Finding a Linear Predictor via Empirical Risk Minimization (ERM)

Draw i.i.d. data (samples) from the distribution:
\[(A_1,y_1),(A_2,y_2),\dots,(A_n,y_n)\sim\text{Distribution}\]

Output the predictor which minimizes the empirical risk:
\[\min_{w\in\mathbb{R}^d}\frac{1}{n}\sum_{i=1}^n loss(A_i^\top w,y_i)\]
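For illustration, here is a toy NumPy sketch of ERM with squared loss (the names and the synthetic data are assumptions, not from the slides), minimized with the gradient descent method from Lecture 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20                          # samples and features (m = 1)
A = rng.standard_normal((n, d))          # rows play the role of the A_i
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def empirical_risk(w):
    """(1/n) * sum_i loss(A_i^T w, y_i) with squared loss."""
    return 0.5 * np.mean((A @ w - y) ** 2)

def empirical_risk_grad(w):
    return A.T @ (A @ w - y) / n

# Minimize the empirical risk with plain gradient descent.
w = np.zeros(d)
L = np.linalg.eigvalsh(A.T @ A / n)[-1]  # smoothness constant of the risk
for _ in range(500):
    w -= empirical_risk_grad(w) / L
```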

Page 21: ERM: Primal & Dual Problems

Page 22: Primal Problem

\[\min_{w\in\mathbb{R}^d}\;\;\left[P(w)\equiv\frac{1}{n}\sum_{i=1}^n\phi_i(A_i^\top w)+\lambda g(w)\right]\]

Regularizer $g$: 1-strongly convex,
\[g(w)\geq g(w') + \left\langle \nabla g(w'), w-w' \right\rangle + \frac{1}{2}\|w-w'\|^2.\]

Loss $\phi_i$: convex and $(1/\gamma)$-smooth, i.e., its gradient has Lipschitz constant $\gamma^{-1}$:
\[\|\nabla\phi_i(t)-\nabla\phi_i(t')\| \leq \gamma^{-1}\|t-t'\|.\]
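A concrete instance satisfying both assumptions, combining the quadratic loss from Page 17 with the standard L2 regularizer (here $y_i$ denotes the label):
\[\phi_i(a)=\frac{1}{2\gamma}(a-y_i)^2\ \text{ is convex and } \tfrac{1}{\gamma}\text{-smooth},\qquad g(w)=\frac{1}{2}\|w\|_2^2\ \text{ is 1-strongly convex}.\]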

Page 23: Dual Problem

\[\max_{\alpha=(\alpha_1,\dots,\alpha_n)\in\mathbb{R}^{N}=\mathbb{R}^{nm}}D(\alpha),\qquad \alpha_i\in\mathbb{R}^m,\]
where
\[D(\alpha)\equiv-\lambda g^*\left(\frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i\right)-\frac{1}{n}\sum_{i=1}^n\phi_i^*(-\alpha_i).\]

Here $g^*$ (1-smooth and convex) and $\phi_i^*$ ($\gamma$-strongly convex) are the convex conjugates
\[g^*(w')=\max_{w\in\mathbb{R}^d}\left\{(w')^\top w-g(w)\right\},\qquad
\phi_i^*(a')=\max_{a\in\mathbb{R}^m}\left\{(a')^\top a-\phi_i(a)\right\}.\]
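For the quadratic instance above the conjugates are available in closed form (a quick check, not from the slides):
\[g(w)=\tfrac{1}{2}\|w\|_2^2\;\Rightarrow\;g^*(w')=\tfrac{1}{2}\|w'\|_2^2,\qquad \phi_i(a)=\tfrac{1}{2\gamma}(a-y_i)^2\;\Rightarrow\;\phi_i^*(a')=y_ia'+\tfrac{\gamma}{2}(a')^2,\]
so $g^*$ is indeed 1-smooth and $\phi_i^*$ is $\gamma$-strongly convex.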

Page 24: Fenchel Duality

With
\[\bar{\alpha}=\frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i,\]
the duality gap decomposes as
\[P(w)-D(\alpha)\;=\;\lambda\left(g(w)+g^*(\bar{\alpha})\right)+\frac{1}{n}\sum_{i=1}^n\left[\phi_i(A_i^\top w)+\phi_i^*(-\alpha_i)\right]\]
\[=\;\lambda\underbrace{\left(g(w)+g^*(\bar{\alpha})-\langle w,\bar{\alpha}\rangle\right)}_{\geq 0}+\frac{1}{n}\sum_{i=1}^n\underbrace{\left[\phi_i(A_i^\top w)+\phi_i^*(-\alpha_i)+\langle A_i^\top w,\alpha_i\rangle\right]}_{\geq 0}.\]

Weak duality follows. The optimality conditions are
\[w=\nabla g^*(\bar{\alpha}),\qquad \alpha_i=-\nabla\phi_i(A_i^\top w).\]
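Both nonnegativity claims are instances of the Fenchel-Young inequality, $f(x)+f^*(y)\geq\langle x,y\rangle$ for any convex $f$, applied to $(w,\bar{\alpha})$ for $g$ and to $(A_i^\top w,-\alpha_i)$ for $\phi_i$; weak duality $P(w)\geq D(\alpha)$ follows, and the optimality conditions above are exactly the corresponding equality cases.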

Page 25: Quartz Algorithm

\[(\alpha^t, w^t) \;\Rightarrow\; (\alpha^{t+1}, w^{t+1})\]

Page 26: Quartz: Bird's Eye View

One iteration maps $(\alpha^t,w^t)\Rightarrow(\alpha^{t+1},w^{t+1})$. Choose a random set $S_t$ of dual variables, with $p_i=\mathbf{P}(i\in S_t)$, and set
\[\theta=\min_i\frac{p_i\lambda\gamma n}{v_i+\lambda\gamma n}.\]

STEP 1: PRIMAL UPDATE
\[w^{t+1}\leftarrow(1-\theta)w^t+\theta\,\nabla g^*(\bar{\alpha}^t)\]

STEP 2: DUAL UPDATE. For $i \in S_t$ do
\[\alpha_i^{t+1}\leftarrow\left(1-\frac{\theta}{p_i}\right)\alpha_i^{t}+\frac{\theta}{p_i}\left(-\nabla\phi_i(A_i^\top w^{t+1})\right)\]
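Below is a hedged NumPy sketch of these two steps for one particular instantiation (m = 1, squared loss, L2 regularizer, $\tau$-nice sampling); it is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def quartz(A, y, lam, v, tau, iters=1000, seed=0):
    """Sketch of one Quartz instantiation (m = 1): squared loss
    phi_i(a) = 0.5 * (a - y_i)^2, so gamma = 1 and grad phi_i(a) = a - y_i,
    and g(w) = 0.5 * ||w||^2, so grad g*(a) = a. Sampling is tau-nice."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    p = np.full(n, tau / n)                      # p_i = P(i in S_t)
    theta = np.min(p * lam * n / (v + lam * n))  # theta with gamma = 1
    alpha = np.zeros(n)
    abar = A.T @ alpha / (lam * n)               # abar = (1/(lam*n)) sum_i A_i alpha_i
    w = np.zeros(d)
    for _ in range(iters):
        w = (1 - theta) * w + theta * abar       # STEP 1: primal update
        for i in rng.choice(n, size=tau, replace=False):  # STEP 2: dual update on S_t
            a_new = (1 - theta / p[i]) * alpha[i] - (theta / p[i]) * (A[i] @ w - y[i])
            abar += (a_new - alpha[i]) * A[i] / (lam * n)  # keep abar in sync
            alpha[i] = a_new
    return w

# Hypothetical usage with v_i = ||A_i||^2 (serial sampling, tau = 1):
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)); y = rng.standard_normal(200)
w = quartz(A, y, lam=0.1, v=(A ** 2).sum(axis=1), tau=1)
```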

Page 27: Randomized Primal-Dual Methods

SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P.R. & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
Quartz: Z. Qu, P.R. & T. Zhang, 11/2014

\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
Algorithm & 1-nice & 1-optimal & $\tau$-nice & arbitrary & \begin{tabular}{c}additional\\speedup\end{tabular} & \begin{tabular}{c}direct p-d\\analysis\end{tabular} & acceleration \\
\hline\hline
SDCA & $\bullet$ & & & & & & \\
\hline
mSDCA & $\bullet$ & & $\bullet$ & & $\bullet$ & & \\
\hline
ASDCA & $\bullet$ & & $\bullet$ & & & & $\bullet$ \\
\hline
AccProx-SDCA & $\bullet$ & & & & & & $\bullet$ \\
\hline
DisDCA & $\bullet$ & & $\bullet$ & & & & \\
\hline
Iprox-SDCA & $\bullet$ & $\bullet$ & & & & & \\
\hline
APCG & $\bullet$ & & & & & & $\bullet$ \\
\hline
SPDC & $\bullet$ & $\bullet$ & $\bullet$ & & & $\bullet$ & $\bullet$ \\
\hline
\bf Quartz & $\bullet$ & $\bullet$ & $\bullet$ & $\bullet$ & $\bullet$ & $\bullet$ & \\
\hline
\end{tabular}

Page 28: Complexity

Page 29: Assumption 3 (Expected Separable Overapproximation)

Parameters $v_1,\dots,v_n$ must satisfy, for all $\alpha_1,\dots,\alpha_n\in\mathbb{R}^m$:
\[\mathbf{E}\left\|\sum_{i\in S_t}A_i\alpha_i\right\|^2 \;\;\leq\;\; \sum_{i=1}^n p_i v_i\|\alpha_i\|^2,\]
where $p_i=\mathbf{P}(i\in S_t)$.
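For example (not spelled out on the slide), for a serial sampling ($S_t=\{i\}$ with probability $p_i$) the left-hand side equals $\sum_i p_i\|A_i\alpha_i\|^2$, so
\[\mathbf{E}\left\|\sum_{i\in S_t}A_i\alpha_i\right\|^2=\sum_{i=1}^n p_i\|A_i\alpha_i\|^2\;\leq\;\sum_{i=1}^n p_i\,\lambda_{\max}(A_i^\top A_i)\,\|\alpha_i\|^2,\]
and $v_i=\lambda_{\max}(A_i^\top A_i)$ works; this is the $L_i$ appearing on the following slides.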

Page 30: Complexity Theorem (QRZ'14)

If
\[t \;\;\geq\;\; \max_i\left(\frac{1}{p_i}+\frac{v_i}{p_i\lambda\gamma n}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right),\]
then
\[\mathbf{E}\left[P(w^t)-D(\alpha^t)\right]\leq\epsilon.\]

Page 31: Part 4: Quartz: Special Cases

Page 32: Special Case 1: Serial Sampling

Page 33: Complexity

With $L_i \equiv \lambda_{\max}(A_i^\top A_i)$:

\begin{tabular}{|c|c|c|}
\hline
Sampling & Probabilities & Complexity \\
\hline\hline
Optimal & $p_i=\frac{L_i}{\sum_j L_j}$ & $\displaystyle n+\frac{\frac{1}{n}\sum_{i=1}^n L_i}{\lambda\gamma}$ \\
\hline
Uniform & $p_i=\frac{1}{n}$ & $\displaystyle n+\frac{\max_i L_i}{\lambda\gamma}$ \\
\hline
\end{tabular}
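For $m=1$ these quantities are cheap to compute; a small sketch on synthetic data (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
A = rng.standard_normal((n, d))          # rows A_i (m = 1)
lam, gamma = 1e-3, 1.0

L = (A ** 2).sum(axis=1)                 # L_i = lambda_max(A_i^T A_i) = ||A_i||^2 when m = 1
bound_uniform = n + L.max() / (lam * gamma)
bound_optimal = n + L.mean() / (lam * gamma)   # uses p_i = L_i / sum_j L_j
print(bound_optimal <= bound_uniform)    # optimal (importance) sampling never loses
```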

Page 34: Data

\begin{tabular}{|c|c|c|c|}
\hline
\bf Dataset & \bf \# samples $n$ & \bf \# features $d$ & \bf density $\mathrm{nnz}(A)/(nd)$ \\
\hline\hline
astro-ph & 29,882 & 99,757 & 0.08\% \\
\hline
CCAT & 781,265 & 47,236 & 0.16\% \\
\hline
cov1 & 522,911 & 54 & 22.22\% \\
\hline
w8a & 49,749 & 300 & 3.91\% \\
\hline
ijcnn1 & 49,990 & 22 & 59.09\% \\
\hline
webspam & 350,000 & 254 & 33.52\% \\
\hline
\end{tabular}

Page 35: Experiment: Quartz vs SDCA, Uniform vs Optimal Sampling

[Figure: primal-dual gap vs. number of epochs on cov1 ($n=522{,}911$, $\lambda=10^{-6}$) for Prox-SDCA, Quartz-U, Iprox-SDCA and Quartz-IP. Left: standard primal update. Right: "aggressive" primal update ($10\theta$).]

Page 36: Special Case 2: Minibatching & Sparsity

Page 37: Data Sparsity

$\tilde{\omega}$ is a normalized measure of the average sparsity of the data, with $1 \leq \tilde{\omega} \leq n$: $\tilde{\omega}=1$ for "fully sparse data" and $\tilde{\omega}=n$ for "fully dense data".

Page 38: Complexity of Quartz

\begin{tabular}{|c|c|}
\hline
Fully sparse data ($\tilde{\omega}=1$) & $\displaystyle\frac{n}{\tau}+\frac{\max_i L_i}{\lambda\gamma\tau}$ \\
\hline
Fully dense data ($\tilde{\omega}=n$) & $\displaystyle\frac{n}{\tau}+\frac{\max_i L_i}{\lambda\gamma}$ \\
\hline
Any data ($1\leq\tilde{\omega}\leq n$) & $\displaystyle\frac{n}{\tau}+\left(1+\frac{(\tilde{\omega}-1)(\tau-1)}{n-1}\right)\frac{\max_i L_i}{\lambda\gamma\tau}\;\equiv\;T(\tau)$ \\
\hline
\end{tabular}

Page 39: Speedup

Assume the data is normalized: $L_i \equiv \lambda_{\max}(A_i^\top A_i) = 1$. Then:
\[T(\tau)\;\;=\;\;\frac{1+\frac{(\tilde{\omega}-1)(\tau-1)}{(n-1)(1+\lambda\gamma n)}}{\tau}\times T(1)\]

Speedup:
\[\frac{T(1)}{T(\tau)}\geq\frac{\tau}{1+\frac{\tilde{\omega}-1}{n-1}}\geq\frac{\tau}{2}\]

Linear speedup up to a certain data-independent minibatch size:
\[1\leq\tau\leq 2+\lambda\gamma n \quad\Rightarrow\quad T(\tau)\leq\frac{2}{\tau}\times T(1)\]

Further data-dependent speedup, up to the extreme case
\[T(\tau)=O\left(\frac{T(1)}{\tau}\right),\]
which holds, e.g., when $\tilde{\omega}=O(1)$ or, more generally, $\tilde{\omega}=O(\lambda\gamma n)$.
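The speedup curves on the next three slides follow directly from $T(\tau)$; a short sketch reproducing their setup ($n=10^6$, $\gamma=1$):

```python
import numpy as np

def T(tau, n, omega, lam, gamma=1.0):
    """Quartz bound for normalized data (L_i = 1); see the table on Page 38."""
    return n / tau + (1 + (omega - 1) * (tau - 1) / (n - 1)) / (lam * gamma * tau)

n, omega = 10**6, 10**2      # sparse-data setting of the plot on Page 40
taus = np.arange(1, 2001)
for lam in [1e-3, 1e-4, 1e-6]:
    speedup = T(1, n, omega, lam) / T(taus, n, omega, lam)
    print(f"lambda = {lam:g}: speedup at tau = 2000 is {speedup[-1]:.0f}")
```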

Page 40: Speedup: sparse data

[Figure: speedup factor vs. $\tau$ (0 to 2000) for $\lambda\in\{10^{-3},10^{-4},10^{-6}\}$; $n=10^6$, $\tilde{\omega}=10^2$, $\gamma=1$.]

Page 41: Speedup: denser data

[Figure: speedup factor vs. $\tau$ (0 to 2000) for $\lambda\in\{10^{-3},10^{-4},10^{-6}\}$; $n=10^6$, $\tilde{\omega}=10^4$, $\gamma=1$.]

Page 42: Speedup: fully dense data

[Figure: speedup factor (0 to 700) vs. $\tau$ (0 to 2000) for $\lambda\in\{10^{-3},10^{-4},10^{-6}\}$; $n=10^6$, $\tilde{\omega}=10^6$, $\gamma=1$.]

Page 43: astro-ph: n = 29,882, density = 0.08%

[Figure: speedup factor $T(1,1)/T(1,\tau)$ vs. $\tau$, in practice and in theory, for $\lambda\in\{5.8\times10^{-3},\,10^{-3},\,10^{-4}\}$.]

Page 44: CCAT: n = 781,265, density = 0.16%

[Figure: speedup factor $T(1,1)/T(1,\tau)$ vs. $\tau$, in practice and in theory, for $\lambda\in\{1.1\times10^{-3},\,10^{-4},\,10^{-5}\}$.]

Page 45: Primal-Dual Methods with $\tau$-nice Sampling ($L_i = 1$)

\begin{tabular}{c|c|c|c}
\hline
Algorithm & Iteration complexity & $g$ & Reference \\
\hline
SDCA & $\displaystyle n+\frac{1}{\lambda\gamma}$ & $\frac{1}{2}\|\cdot\|^2$ & S. Shalev-Shwartz \& T. Zhang '13 \\
\hline
ASDCA & $\displaystyle 4\times\max\left\{\frac{n}{\tau},\sqrt{\frac{n}{\lambda\gamma\tau}},\frac{1}{\lambda\gamma\tau},\frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}}\right\}$ & $\frac{1}{2}\|\cdot\|^2$ & S. Shalev-Shwartz \& T. Zhang '13 \\
\hline
SPDC & $\displaystyle\frac{n}{\tau}+\sqrt{\frac{n}{\lambda\gamma\tau}}$ & general & Y. Zhang \& L. Xiao '14 \\
\hline
\bf Quartz & $\displaystyle\frac{n}{\tau}+\left(1+\frac{(\tilde{\omega}-1)(\tau-1)}{n-1}\right)\frac{1}{\lambda\gamma\tau}$ & general & \\
\hline
\end{tabular}

Page 46:

\begin{tabular}{c|c|c|c|c}
\hline
Algorithm & $\gamma\lambda n=\Theta(\frac{1}{\tau})$ & $\gamma\lambda n=\Theta(1)$ & $\gamma\lambda n=\Theta(\tau)$ & $\gamma\lambda n=\Theta(\sqrt{n})$ \\
 & $\kappa=n\tau$ & $\kappa=n$ & $\kappa=n/\tau$ & $\kappa=\sqrt{n}$ \\
\hline\hline
SDCA & $n\tau$ & $n$ & $n$ & $n$ \\
\hline
ASDCA (accelerated) & $n$ & $\displaystyle\frac{n}{\sqrt{\tau}}$ & $\displaystyle\frac{n}{\tau}$ & $\displaystyle\frac{n}{\tau}+\frac{n^{3/4}}{\sqrt{\tau}}$ \\
\hline
SPDC (accelerated) & $n$ & $\displaystyle\frac{n}{\sqrt{\tau}}$ & $\displaystyle\frac{n}{\tau}$ & $\displaystyle\frac{n}{\tau}+\frac{n^{3/4}}{\sqrt{\tau}}$ \\
\hline
\bf Quartz & $n+\tilde{\omega}\tau$ & $\displaystyle\frac{n}{\tau}+\tilde{\omega}$ & $\displaystyle\frac{n}{\tau}$ & $\displaystyle\frac{n}{\tau}+\frac{\tilde{\omega}}{\sqrt{n}}$ \\
\hline
\end{tabular}

For sufficiently sparse data, Quartz wins even when compared against the accelerated methods.

Page 47: Special Case 3: Distributed Sampling

Page 48: References

Zheng Qu, P.R. and Tong Zhang, Quartz: Randomized dual coordinate ascent with arbitrary sampling, Neural Information Processing Systems 28, 865-873, 2015

P.R. and Martin Takáč, Distributed coordinate descent for learning with big data, Journal of Machine Learning Research 17(75), 1-25, 2016 (arXiv:1310.2059)

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int. Workshop on Machine Learning for Signal Processing, 2014

Page 49: Distributed Quartz: Perform the Dual Updates in a Distributed Manner

Quartz STEP 2 (DUAL UPDATE): choose a random set $S_t$ of dual variables; for $i \in S_t$ do
\[\alpha_i^{t+1}\leftarrow\left(1-\frac{\theta}{p_i}\right)\alpha_i^{t}+\frac{\theta}{p_i}\left(-\nabla\phi_i(A_i^\top w^{t+1})\right)\]

The data required to compute the update for dual variable $i$ is $A_i$.

Page 50: Distribution of Data

[Figure: the data matrix, with $n$ = \# dual variables, partitioned across $c$ nodes, each node owning $n/c$ dual variables (columns of the data matrix).]

Page 51: Distributed Sampling

Page 52: Distributed Sampling

Random set of dual variables: each node independently picks $\tau$ dual variables from those it owns, uniformly at random.

Also see: CoCoA+ [Ma, Smith, Jaggi et al. '15]
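A minimal sketch of this sampling (function name and sizes are illustrative; assumes $c$ divides $n$):

```python
import numpy as np

def distributed_sampling(n, c, tau, rng):
    """Each of the c nodes independently picks tau of the n/c dual variables
    it owns, uniformly at random."""
    block = n // c
    return np.concatenate([rng.choice(block, size=tau, replace=False) + k * block
                           for k in range(c)])

rng = np.random.default_rng(0)
print(distributed_sampling(n=12, c=3, tau=2, rng=rng))  # two indices per node
```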

Page 53: Complexity of Distributed Quartz

Key: get the right stepsize parameters $v$ (so that the ESO inequality holds). The leading term in the complexity bound,
\[\max_i\left(\frac{1}{p_i}+\frac{v_i}{p_i\lambda\gamma n}\right),\]
then becomes
\[\frac{n}{c\tau}+\max_i\frac{\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+\left(\frac{\tau c}{n}-\frac{\tau-1}{\max\{n/c-1,1\}}\right)\frac{\omega_j'-1}{\omega_j'}\omega_j\right)A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau}\]
\[=\;\frac{n}{c\tau}+\frac{\text{(something that looks complicated)}}{\lambda\gamma c\tau}.\]

Page 54: Experiment

Machine: 128 nodes of the Hector supercomputer (4,096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB

P.R. and Martin Takáč, Distributed coordinate descent for learning with big data, Journal of Machine Learning Research 17(75), 1-25, 2016 (arXiv:1310.2059)

Page 55: LASSO: 3 TB data + 128 nodes

Page 56: Experiment (Acceleration)

Machine: 128 nodes of the Archer supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int. Workshop on Machine Learning for Signal Processing, 2014

Page 57: LASSO: 5 TB data (d = 50 billion), 128 nodes

[Figure: $L(x^k)-L^*$ vs. iterations (left) and vs. elapsed time in seconds (right), comparing hydra and hydra$^2$.]

Page 58: THE END

Page 59: Coauthors

Zheng Qu (Hong Kong), Martin Takáč (Lehigh), Tong Zhang (Baidu)

Zheng Qu, P.R. and Tong Zhang, Quartz: Randomized dual coordinate ascent with arbitrary sampling, In Advances in Neural Information Processing Systems 28, 865-873, 2015 (arXiv:1411.5873)

P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, Optimization Letters 10(6), 1233-1243, 2016 (arXiv:1310.3438)