
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Transcript
Page 1: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Solving Large-Scale Machine Learning Problems in a Distributed Way

Martin Takac

Cognitive Systems Institute Group Speaker Series

June 09 2016

1 / 28

Page 2: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Outline

1 Machine Learning - Examples and Algorithm

2 Distributed Computing

3 Learning Large-Scale Deep Neural Network (DNN)

2 / 28

Page 3: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Examples of Machine Learning

binary classification

classify a person as having cancer or not

decide to which class an input image belongs, e.g. car/person

spam detection / credit card fraud detection

multi-class classification

hand-written digits classification

speech understanding

face detection

product recommendation (collaborative filtering)

stock trading

. . . and many many others. . .

3 / 28

Page 4: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Support Vector Machines (SVM)

blue: healthy person

green: e.g. patient with lung cancer

Exhaled breath analysis for lung cancer: predict if patient has cancer or not

4 / 28

Page 6: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

ImageNet - Large Scale Visual Recognition Challenge

Two main challenges

Object detection - 200 categories

Object localization - 1000 categories (over 1.2 million images for training)

The state-of-the-art solution method is Deep Neural Network (DNN)

E.g. the input layer has the dimension of the input image

The output layer has dimension e.g. 1000 (the number of categories we have)

5 / 28

Page 7: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Deep Neural Network

we have to learn the weights between the neurons (blue arrows in the figure)

the neural network defines a non-linear and non-convex function (of the weights w) from input x to output y (a minimal code sketch follows this slide):

y = f(w; x)

6 / 28
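To make the notation concrete, here is a minimal NumPy sketch (not the network from the talk) of how a small fully connected network defines y = f(w; x); the layer sizes, ReLU activations and softmax output are illustrative assumptions.

```python
import numpy as np

def f(w, x):
    """Forward pass of a small fully connected network: y = f(w; x).

    w is a list of (weight matrix, bias) pairs, one per layer; the ReLU
    hidden layers make f a non-linear (and, as a function of w, non-convex) map.
    """
    a = x
    for W, b in w[:-1]:
        a = np.maximum(0.0, W @ a + b)      # hidden layer: ReLU(W a + b)
    W_out, b_out = w[-1]
    z = W_out @ a + b_out
    return np.exp(z) / np.exp(z).sum()      # softmax output over the classes

# Illustrative sizes: 784-dim input (28x28 image), one hidden layer, 10 classes
rng = np.random.default_rng(0)
w = [(0.01 * rng.standard_normal((128, 784)), np.zeros(128)),
     (0.01 * rng.standard_normal((10, 128)), np.zeros(10))]
x = rng.random(784)
y = f(w, x)                                 # 10 numbers that sum to 1
```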

Page 8: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Example - MNIST handwritten digits recognition

A good w could give us, for an image of a handwritten digit, an output vector such as

f(w; [digit image]) ≈ (0, 0, 0, 0.99, . . . , 0)ᵀ

f(w; [another digit image]) ≈ (0, 0, . . . , 0, 0.999)ᵀ

i.e. the entry corresponding to the correct digit is close to 1 and all other entries are close to 0.

7 / 28

Page 9: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Mathematical Formulation

Expected Loss Minimization

let (X, Y) be the distribution of input samples and their labels

we would like to find w such that

w* = arg min_w E_{(x,y)∼(X,Y)} [ℓ(f(w; x), y)]

ℓ is a loss function, e.g. ℓ(f(w; x), y) = ‖f(w; x) − y‖²

Impossible, as we do not know the distribution (X, Y)

Common approach: Empirical loss minimization:

we sample n points from (X, Y): {(x_i, y_i)}_{i=1}^n

we minimize the regularized empirical loss (a small code sketch of this objective follows below)

w* = arg min_w (1/n) ∑_{i=1}^n ℓ(f(w; x_i), y_i) + (λ/2)‖w‖²

8 / 28
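As a concrete reading of the formula, a minimal NumPy sketch of the regularized empirical loss, using the squared loss from the slide and a toy linear model as a stand-in for f; the data, λ and the helper name regularized_empirical_loss are illustrative assumptions.

```python
import numpy as np

def regularized_empirical_loss(w, X, Y, f, lam):
    """F(w) = (1/n) sum_i ||f(w; x_i) - y_i||^2 + (lam/2) ||w||^2."""
    n = X.shape[0]
    data_term = sum(np.sum((f(w, X[i]) - Y[i]) ** 2) for i in range(n)) / n
    return data_term + 0.5 * lam * np.sum(w ** 2)

# Toy usage: a linear model f(w; x) = w . x as a stand-in for the DNN
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ w_true + 0.1 * rng.standard_normal(100)
f_lin = lambda w, x: w @ x
print(regularized_empirical_loss(w_true, X, Y, f_lin, lam=1e-3))      # small value
print(regularized_empirical_loss(np.zeros(5), X, Y, f_lin, lam=1e-3)) # much larger
```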


Page 12: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Stochastic Gradient Descent (SGD) Algorithm

How can we solve

min_w { F(w) := (1/n) ∑_{i=1}^n ℓ(f(w; x_i), y_i) + (λ/2)‖w‖² } ?

1 we can use an iterative algorithm
2 we start with some initial w
3 we compute g = ∇F(w)
4 we get a new iterate w ← w − αg
5 if w is still not good enough, go to step 3

If n is very large, computing g can take a while... even a few hours/days.

Trick:

choose i ∈ {1, . . . , n} randomly

define g_i = ∇(ℓ(f(w; x_i), y_i) + (λ/2)‖w‖²)

use g_i instead of g in the algorithm (step 4)

Note: E[g_i] = g, so in expectation the "direction" the algorithm is going is the same as if we used the true gradient, but we can compute it n times faster! (A minimal code sketch of this loop follows below.)

9 / 28
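A minimal sketch of the SGD loop from this slide, with a toy least-squares model standing in for ℓ(f(w; x_i), y_i); the step size, data and helper names are illustrative assumptions, not the talk's actual experiments.

```python
import numpy as np

def sgd(X, Y, lam=1e-3, alpha=0.05, epochs=5, seed=0):
    """Plain SGD from the slide: pick a random i, step along -g_i.

    g_i = grad of  (w . x_i - y_i)^2 + (lam/2)||w||^2,  a cheap unbiased
    estimate of the full gradient g = grad F(w) (E[g_i] = g).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                       # step 2: some initial w
    for _ in range(epochs * n):
        i = rng.integers(n)               # choose i in {1, ..., n} at random
        err = X[i] @ w - Y[i]
        g_i = 2.0 * err * X[i] + lam * w  # stochastic gradient g_i
        w -= alpha * g_i                  # step 4 with g_i instead of g
    return w

# Toy data: Y is (approximately) a linear function of X
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ w_true + 0.05 * rng.standard_normal(1000)
print(sgd(X, Y))                          # should be close to w_true
```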


Page 16: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Outline

1 Machine Learning - Examples and Algorithm

2 Distributed Computing

3 Learning Large-Scale Deep Neural Network (DNN)

10 / 28

Page 17: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

The Architecture

What if the size of the data {(x_i, y_i)} exceeds the memory of a single computing node?

each node can store a portion of the data {(x_i, y_i)}

each node is connected to the computer network

they can communicate with any other node (over maybe 1 or more switches)

Fact: every communication is much more expensive than accessing local data (it can be even 100,000 times slower).

11 / 28


Page 19: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Outline

1 Machine Learning - Examples and Algorithm

2 Distributed Computing

3 Learning Large-Scale Deep Neural Network (DNN)

12 / 28

Page 20: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Using SGD for DNN in Distributed Way

assume that the size of the data or the size of the weights (or both) is so big that we cannot store them on one machine

. . . or we can store them, but it takes too long to compute something . . .

SGD: we need to compute ∇_w ℓ(f(w; x_i), y_i)

The DNN has a nice structure

∇_w ℓ(f(w; x_i), y_i) can be computed by the backpropagation procedure (this is nothing else than automated differentiation)

13 / 28

Page 21: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Why is SGD a Bad Distributed Algorithm

it samples only 1 sample and computes g_i (this is very fast)

then w is updated

each update of w requires a communication (cost: c seconds)

hence one iteration is suddenly much slower than if we ran SGD on one computer

The trick: Mini-batch SGD
In each iteration
1 Choose randomly S ⊂ {1, 2, . . . , n} with |S| = b
2 Use g_b = (1/b) ∑_{i∈S} g_i instead of just g_i

Cost of one epoch
number of MPI calls / epoch: n/b
amount of data sent over the network: (n/b) × log(N) × sizeof(w)

If we increase b → n we minimize the amount of data and the number of communications per epoch! Caveat: there is no free lunch! Very large b means slower convergence! (A minimal distributed sketch follows below.)

14 / 28
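A minimal sketch of one epoch of distributed mini-batch SGD, assuming mpi4py is installed and the script is launched under mpirun; the local shards, toy linear model, step size and file name are illustrative stand-ins. It shows the point of the cost table: one allreduce (the MPI call) per mini-batch, i.e. n/b communications per epoch.

```python
# Run with e.g.:  mpirun -np 4 python minibatch_sgd.py   (file name is illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = comm.Get_size()

rng = np.random.default_rng(comm.Get_rank())
X_loc = rng.standard_normal((250, 5))               # hypothetical local data shard
Y_loc = X_loc @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

w = np.zeros(5)
lam, alpha, b = 1e-3, 0.05, 64                      # b = global mini-batch size
n_loc = X_loc.shape[0]
iters_per_epoch = (N * n_loc) // b                  # = n / b iterations per epoch

for _ in range(iters_per_epoch):
    idx = rng.integers(n_loc, size=b // N)          # this rank's part of S
    err = X_loc[idx] @ w - Y_loc[idx]
    g_loc = (2.0 * X_loc[idx].T @ err) / b + lam * w / N
    g = np.empty_like(g_loc)
    comm.Allreduce(g_loc, g, op=MPI.SUM)            # the one MPI call per mini-batch
    w -= alpha * g                                  # every rank takes the same step
```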


Page 25: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Model Parallelism

Model parallelism: we partition the weights w across many nodes; every node has all data points (but maybe just a few features of them); see the sketch after this slide

[Figure: model-parallel forward and backward propagation. The weights of each layer (Input, Hidden Layer 1, Hidden Layer 2, Output) are split between Node 1 and Node 2; every node processes all samples; the nodes exchange activations during forward propagation and exchange deltas during backward propagation.]

15 / 28
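A minimal single-process NumPy illustration of the model-parallel forward pass sketched in the figure: two "nodes" each own a slice of one layer's weights, compute their part of the activations from all samples, and the concatenation plays the role of the "exchange activation" step. The sizes and the two-node split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((32, 100))            # all samples (every node sees all of them)
W = rng.standard_normal((100, 64))   # full weight matrix of one hidden layer
b = np.zeros(64)

# Model parallelism: split the layer's weights (columns = output neurons)
# between two "nodes"; each node owns only its slice of W and b.
W1, W2 = W[:, :32], W[:, 32:]
b1, b2 = b[:32], b[32:]

# Forward pass: each node computes the activations of its own neurons ...
a1 = np.maximum(0.0, X @ W1 + b1)    # node 1's half of the hidden layer
a2 = np.maximum(0.0, X @ W2 + b2)    # node 2's half of the hidden layer

# ... and the "exchange activation" step concatenates them so that the
# next layer (wherever it lives) sees the full hidden representation.
a_full = np.concatenate([a1, a2], axis=1)

# Same result as if one node held the whole weight matrix:
assert np.allclose(a_full, np.maximum(0.0, X @ W + b))
```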

Page 26: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Data Parallelism

Data parallelism: we partition the data samples across many nodes; each node has a fresh copy of w

[Figure: data-parallel forward and backward propagation. Each node (Node 1, Node 2) holds the full network (Input, Hidden Layer 1, Hidden Layer 2, Output) and processes only its partial set of samples; the nodes exchange gradients.]

16 / 28

Page 27: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Large-Scale Deep Neural Network¹

¹Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey: Distributed Deep Learning Using Synchronous Stochastic Gradient Descent, arXiv:1602.06709

17 / 28

Page 28: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

There is almost no speedup for large b

18 / 28

Page 29: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

The Dilemma

large b allows the algorithm to be run efficiently on a large computer cluster (more nodes)

very large b doesn't reduce the number of iterations, but each iteration is more expensive!

The Trick: Do not use just the gradient, but also the Hessian (Martens 2010)

Caveat: the Hessian matrix can be very large, e.g. the dimension of the weights for the TIMIT dataset is almost 1.5M, hence storing the Hessian would need almost 10TB.

The Trick: We can use a Hessian-free approach (we only need to be able to compute Hessian-vector products; a small sketch follows below)

Algorithm:

w ← w − α [∇²F(w)]⁻¹ ∇F(w)

19 / 28
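A minimal sketch of the Hessian-free idea: a Hessian-vector product H(w)v can be formed from gradients alone, without ever storing H. The finite-difference approximation and the toy quadratic objective below are illustrative assumptions; in practice an exact Hessian-vector product computed by an extra backpropagation pass (Pearlmutter's trick) is used.

```python
import numpy as np

def hessian_vector_product(grad_F, w, v, eps=1e-6):
    """Approximate H(w) v without forming the Hessian H(w).

    Uses the finite-difference identity
        H(w) v  ~  (grad F(w + eps*v) - grad F(w)) / eps,
    so the memory cost is just a few vectors of the same size as w.
    """
    return (grad_F(w + eps * v) - grad_F(w)) / eps

# Toy objective F(w) = 0.5 w^T A w - b^T w, so grad F(w) = A w - b and H = A.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + np.eye(50)            # symmetric positive definite
b = rng.standard_normal(50)
grad_F = lambda w: A @ w - b

w = rng.standard_normal(50)
v = rng.standard_normal(50)
print(np.allclose(hessian_vector_product(grad_F, w, v), A @ v, atol=1e-3))
```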


Page 33: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Non-convexity

We want to minimize

min_w F(w)

∇²F(w) is NOT positive semi-definite at every w!

20 / 28

Page 34: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Computing Step

recall the algorithm

w ← w − α [∇²F(w)]⁻¹ ∇F(w)

we need to compute p = [∇²F(w)]⁻¹ ∇F(w), i.e. to solve

∇²F(w) p = ∇F(w)    (1)

we can use a few iterations of the CG method to solve it (CG assumes that ∇²F(w) ≻ 0)

In our case this may not be true; hence it is suggested to stop CG sooner if it is detected during CG that ∇²F(w) is indefinite

We can use a Bi-CG algorithm to solve (1) and modify the algorithm² as follows (a small code sketch follows below)

w ← w − α · { p, if pᵀ∇F(w) > 0; −p, otherwise }

PS: we use just b samples to estimate ∇²F(w)

²Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy and Martin Takac: Large Scale Distributed Hessian-Free Optimization for Deep Neural Network, arXiv:1606.00511, 2016.

21 / 28
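A minimal sketch of the step computation: a few CG iterations using only Hessian-vector products, stopping early if negative curvature (an indefinite ∇²F(w)) is detected, plus the ±p sign safeguard from the slide. It uses plain CG rather than the Bi-CG/bicgstab variants from the paper, and a toy convex quadratic instead of a DNN; those are illustrative assumptions.

```python
import numpy as np

def cg_solve(hvp, g, max_iter=50, tol=1e-8):
    """A few CG iterations for H p = g, using only Hessian-vector products.

    If negative curvature (d^T H d <= 0) is detected, H is indefinite and
    we stop early, returning the current iterate, as suggested on the slide.
    """
    p = np.zeros_like(g)
    r = g.copy()                 # residual g - H p (p = 0 initially)
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hd = hvp(d)
        dHd = d @ Hd
        if dHd <= 0:             # indefiniteness detected: stop CG early
            break
        a = rs / dHd
        p += a * d
        r -= a * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def newton_like_step(w, grad_F, hvp, step=0.5):
    """w <- w - alpha * (+/- p), with the sign safeguard from the slide."""
    g = grad_F(w)
    p = cg_solve(hvp, g)
    if p @ g <= 0:               # make sure we move in a descent direction
        p = -p
    return w - step * p

# Toy check on a convex quadratic F(w) = 0.5 w^T A w - b^T w (so H = A).
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 30))
A = M @ M.T + np.eye(30)
b = rng.standard_normal(30)
grad_F = lambda w: A @ w - b
hvp = lambda v: A @ v            # stands in for a backprop-based Hessian-vector product

w = np.zeros(30)
for _ in range(5):
    w = newton_like_step(w, grad_F, hvp, step=1.0)
print(np.linalg.norm(grad_F(w)))  # should be close to 0
```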

Page 35: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Saddle Point

Gradient descent slows down around saddle points. Second-order methods can help a lot to prevent that.

22 / 28

Page 36: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

[Figure: MNIST, 4 layers. Train error vs. number of iterations for SGD (b=64, 128) compared with ggn-cg, hess-bicgstab, hess-cg and hybrid-cg at mini-batch sizes b=512, 1024 and 2048; plus number of iterations vs. size of mini-batch for the four second-order variants.]

23 / 28

Page 37: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

[Figure: TIMIT, T=18. Run time per iteration, split into Gradient, CG and Linesearch, vs. log2(Number of Nodes) for b=512, 1024, 4096 and 8192.]

24 / 28

Page 38: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

[Figure: TIMIT, T=18. Run time per one line search vs. log2(Number of Nodes) for b=512, 1024, 4096 and 8192.]

25 / 28

Page 39: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Learning Artistic Style by Deep Neural Network³

³Joint work with Jiawei Zhang, based on Leon A. Gatys, Alexander S. Ecker, Matthias Bethge: A Neural Algorithm of Artistic Style, arXiv:1508.06576

26 / 28


Page 41: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

Learning Artistic Style by Deep Neural Network⁴

⁴Joint work with Jiawei Zhang, based on Leon A. Gatys, Alexander S. Ecker, Matthias Bethge: A Neural Algorithm of Artistic Style, arXiv:1508.06576

27 / 28


Page 43: Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distributed Way”

References

1 Albert Berahas, Jorge Nocedal and Martin Takac: A Multi-Batch L-BFGS Method for Machine Learning, arXiv:1605.06049, 2016.

2 Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy and Martin Takac: Large Scale Distributed Hessian-Free Optimization for Deep Neural Network, arXiv:1606.00511, 2016.

3 Chenxin Ma and Martin Takac: Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?, OptML@NIPS 2015.

4 Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtarik and Martin Takac: Adding vs. Averaging in Distributed Primal-Dual Optimization, ICML 2015.

5 Martin Jaggi, Virginia Smith, Martin Takac, Jonathan Terhorst, Thomas Hofmann and Michael I. Jordan: Communication-Efficient Distributed Dual Coordinate Ascent, NIPS 2014.

6 Richtarik, P. and Takac, M.: Distributed coordinate descent method for learning with big data, Journal of Machine Learning Research (to appear), 2016.

7 Richtarik, P. and Takac, M.: On optimal probabilities in stochastic coordinate descent methods, Optimization Letters, 2015.

8 Richtarik, P. and Takac, M.: Parallel coordinate descent methods for big data optimization, Mathematical Programming, 2015.

9 Richtarik, P. and Takac, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2012.

10 Takac, M., Bijral, A., Richtarik, P. and Srebro, N.: Mini-batch primal and dual methods for SVMs, In ICML, 2013.

11 Qu, Z., Richtarik, P. and Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014.

12 Qu, Z., Richtarik, P., Takac, M. and Fercoq, O.: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization, arXiv:1502.02268, 2015.

13 Tappenden, R., Takac, M. and Richtarik, P.: On the Complexity of Parallel Coordinate Descent, arXiv:1503.03033, 2015.

28 / 28