Page 1

Distributed nonsmooth optimization: variational analysis at work and flexible proximal algorithms

Jérôme MALICK

CNRS, Laboratoire Jean Kuntzmann, Grenoble (France)

International School of Mathematics “Guido Stampacchia”

71st Workshop: Advances in nonsmooth analysis and optimization

June 2019 – Erice (Sicily)

Page 2

Content at a glance

Today’s talk: a mix of various notions

– variational analysis at work: set convergence, sensitivity analysis
– nonsmooth algorithms: proximal gradient, level bundle
– distributed optimization: parallel computing, distributed learning, synchronous vs. asynchronous, unreliable communications

Page 14

Outline

1. Distributed optimization
   – Some main ideas in distributed optimization
   – Distributed/Federated learning

2. Asynchronous nonsmooth optimization methods
   – Asynchr. (proximal) gradient methods
   – Asynchr. level bundle methods

3. Communication-efficient proximal methods
   – Optimization with mirror-stratifiable regularizers
   – Variational analysis at work: sensitivity and identification
   – Sparse communications by adaptive dimension reduction

Page 17

Distributed optimization

Distributed optimization: past and present

Old history... Seminal work of D. Bertsekas and co-authors

Two classical frameworks:

(1) network   (2) master/slave

Recent challenges: new distributed systems

– Increase of available computing resources (data centers, cloud)
– Improvement of multicore infrastructure and networks

How to (efficiently) deploy our algorithms?

Page 19

Distributed optimization

Example in master/slave framework

Basic example: minimize m smooth functions over M = m workers

$$\min_{x\in\mathbb{R}^d}\ \sum_{i=1}^m f^i(x)$$

Standard/Distributed gradient descent (map-reduce):

$$x_{k+1} = x_k - \gamma \sum_{i=1}^m \nabla f^i(x_k)$$

Extensions: with a prox-term, nonsmooth f^i, ...

When the number of functions m is greater than the number of machines M (m > M), the algorithm extends to incremental/stochastic (batch) gradient descent (reviewed in Georg’s talk on Monday).

[Figures: the master with workers f^1, f^2, f^3, f^4 (m = M = 4), and the shared-memory setting with workers f^{i_1}, f^{i_2}, f^{i_3}, f^{i_4} (m > M = 4)]
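To make the map-reduce update above concrete, here is a minimal synchronous sketch in Python/NumPy. The quadratic local functions, the data sizes, and the fixed stepsize gamma are illustrative assumptions, not the setting of the talk’s experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 10, 4                      # dimension, number of workers (m = M here)

# Hypothetical local problems: worker i holds (A_i, b_i) and f^i(x) = 0.5*||A_i x - b_i||^2
A = [rng.standard_normal((20, d)) for _ in range(M)]
b = [rng.standard_normal(20) for _ in range(M)]

def local_grad(i, x):
    """Gradient of f^i at x, computed by worker i (the 'map' step)."""
    return A[i].T @ (A[i] @ x - b[i])

def distributed_gradient_descent(x0, gamma, n_iters=200):
    x = x0.copy()
    for _ in range(n_iters):
        # 'reduce' step: the master sums the M gradients, then takes one gradient step
        g = sum(local_grad(i, x) for i in range(M))
        x = x - gamma * g
    return x

x_final = distributed_gradient_descent(np.zeros(d), gamma=1e-3)
```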

Page 23

Distributed optimization

Limitations of optimization algorithms in master/slave framework

The previous slide hides the complexity of tasks and the variety of possible computing systems...

Two key limitations of usual algorithms

(1) Synchronization: the master has to wait for all workers at each iteration...

Example with 3 workers/agents (image: W. Yin). The notion of iteration is redefined: in the synchronous setting, a new iteration starts once all agents finish; in the asynchronous setting, a new iteration starts as soon as any agent finishes.

[Figure: timelines of Agents 1–3 over times t0, ..., t10, in the synchronous and asynchronous settings]

(2) Communications: unreliable, costly, ... (image: Google AI)

Page 27

Distributed optimization

Examples & applications of distributed/parallel optimization

Rich activity: e.g. [Nedic et al. ’14], [Yin et al. ’15], [Richtarik et al. ’16], [Chouzenoux et al. ’17]... just to name a few

Various practical situations, all different

In this workshop:
– Welington’s talk (Wednesday)
– mentioned in the talks of Russel, Silvia, and Antoine

Our community is more knowledgeable about the “data center” situation:
1. data is distributed among many machines (“big data”)
2. computation load is distributed (“big problems” or HPC)
Ex: the large-scale stochastic programming problems of Welington’s talk

Let’s discuss distributed (on-device) learning in more detail

Page 31

Distributed optimization

Recalls on supervised learning: context

Data: n observations (a_j, y_j), j = 1, ..., n
e.g. (a_j, y_j) ∈ R^m × R (regression) or (a_j, y_j) ∈ R^m × {−1, +1} (binary classification)

[Figure: five example inputs a_1, ..., a_5 with labels y_1 = 1, y_2 = −1, y_3 = 1, y_4 = −1, y_5 = 1]

Supervised learning: a prediction function h(a, x) ∈ R parameterized by x ∈ R^d
(usually denoted x = β in statistics, x = ω in learning, or x = θ in deep learning)
Goal: find x such that h(a_j, x) ≈ y_j (and generalizes well on unseen data)

Standard prediction functions:
– linear prediction: h(a, x) = ⟨a, x⟩ (or h(a, x) = ⟨φ(a), x⟩ for kernels)
– highly non-linear prediction by artificial neural networks:
  h(a, x) = ⟨x_m, σ(⟨x_{m−1}, ··· σ(⟨x_1, a⟩)⟩)⟩

[Figure: neural network with a single hidden layer; the hidden layer derives transformations of the inputs (nonlinear transformations of linear combinations), which are then used to model the output]
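As a quick illustration of the two prediction functions above, here is a small Python/NumPy sketch; the layer sizes and the tanh activation are arbitrary choices for the example, and the nested-inner-product formula is read as a plain feed-forward pass.

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = <a, x>."""
    return a @ x

def h_neural_net(a, layers, sigma=np.tanh):
    """One possible reading of h(a, x) = <x_m, sigma(<x_{m-1}, ... sigma(<x_1, a>)>)>:
    x = (x_1, ..., x_m) is a list of weight matrices applied with an activation in between."""
    z = a
    for W in layers[:-1]:
        z = sigma(W @ z)
    return layers[-1] @ z          # last layer: plain inner product

rng = np.random.default_rng(0)
a = rng.standard_normal(5)                            # one input in R^5
x_lin = rng.standard_normal(5)                        # linear model parameters
layers = [rng.standard_normal((4, 5)),                # x_1
          rng.standard_normal((3, 4)),                # x_2
          rng.standard_normal(3)]                     # x_3 (output weights)

print(h_linear(a, x_lin), h_neural_net(a, layers))
```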

Page 35

Distributed optimization

Recalls on supervised learning: learning is optimizing!

Learning = finding x, i.e. the best x from the data = solving an optimization problem

Regularized empirical risk minimization (regularization avoids overfitting, imposes structure on x, or helps numerically):

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) + \lambda R(x)$$

data-fitting term + regularizer

Ex: regularized logistic regression (used in the numerical experiments):

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{j=1}^n \log\big(1+\exp(-y_j\langle a_j, x\rangle)\big) + \frac{\lambda_2}{2}\|x\|_2^2 + \lambda_1\|x\|_1$$

Distributed setting: machine i has a chunk S_i of the data, so that

$$\frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) = \sum_{i=1}^m \underbrace{\frac{1}{n}\sum_{j\in S_i} \ell\big(y_j, h(a_j, x)\big)}_{f^i(x)}$$
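A minimal Python/NumPy sketch of this splitting, assuming each worker i holds its own chunk (A_i, y_i) of the data; the data and the regularization weights are placeholders. Only the smooth part (logistic loss plus the l2 term) is differentiated here; the l1 term is typically handled by a proximal step, as in the algorithms discussed later.

```python
import numpy as np

def local_value_grad(A_i, y_i, x, n):
    """Smooth local term f^i(x) = (1/n) * sum_{j in S_i} log(1 + exp(-y_j <a_j, x>))
    and its gradient, computed by the worker holding the chunk (A_i, y_i)."""
    z = -y_i * (A_i @ x)
    value = np.sum(np.logaddexp(0.0, z)) / n          # numerically stable log(1+exp)
    grad = A_i.T @ (-y_i / (1.0 + np.exp(-z))) / n    # chain rule on log(1+exp(z))
    return value, grad

def full_smooth_value_grad(chunks, x, lam2):
    """Master aggregates the workers' contributions and adds the (smooth) l2 term;
    the l1 term lambda_1*||x||_1 is left to a proximal step."""
    n = sum(A_i.shape[0] for A_i, _ in chunks)
    vg = [local_value_grad(A_i, y_i, x, n) for A_i, y_i in chunks]
    value = sum(v for v, _ in vg) + 0.5 * lam2 * (x @ x)
    grad = sum(g for _, g in vg) + lam2 * x
    return value, grad

# Hypothetical data split over 4 workers
rng = np.random.default_rng(0)
d = 8
chunks = [(rng.standard_normal((25, d)), rng.choice([-1.0, 1.0], size=25)) for _ in range(4)]
value, grad = full_smooth_value_grad(chunks, np.zeros(d), lam2=0.1)
```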

Page 38

Distributed optimization

New machine learning frameworks?

Machine learning has led to major breakthroughs in various areas (such as natural language processing, computer vision, and speech recognition...)

Much of this success has been based on collecting huge amounts of data (and requires a lot of energy...)

Collecting data has two major problems:
1. it is privacy-invasive
2. it requires a lot of storage

A solution: federated learning, i.e. decoupling the ability to do machine learning from the need to store the data

No data moves... but:
– unbalanced numbers of samples
– unstable communications...

Back to the initial question: how to (efficiently) deploy our algorithms?

Page 44

Asynchronous nonsmooth optimization methods

Asynchronous master/slave framework

Algorithm = global communication scheme + local optimization method
(what is x, what is i)

Master update: x_k = x_{k−1} + Δ

[Figure: the master exchanges with M workers, each with its own oracle f^1, ..., f^M; worker i = i(k) sends its contribution Δ up and receives x_k back. Two timelines show the exchanges from the viewpoint of worker i = i(k), with delays d_i^k and D_i^k, and of another worker j ≠ i(k), with delays d_j^k and D_j^k.]

– iteration = receive from a worker + master update + send
– time k = number of iterations
– machine i(k) = the machine updating at time k
– delay d_i^k = time since the last exchange with i:
  d_i^k = 0 iff i = i(k), and d_i^k = d_i^{k−1} + 1 otherwise
– second delay D_i^k = time since the penultimate exchange with i
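Here is a small Python sketch of the delay bookkeeping defined above, driven by a hypothetical random schedule of which worker responds at each time k; it only tracks d_i^k and D_i^k, with no optimization involved. The initialization at k = 0 (as if every worker had just exchanged) is one convention among several and is an assumption, not taken from the talk.

```python
import random

def simulate_delays(M=3, T=12, seed=0):
    """Track d_i^k (time since the last exchange with i) and
    D_i^k (time since the penultimate exchange with i)."""
    random.seed(seed)
    d = [0] * M            # d_i^k
    D = [0] * M            # D_i^k
    for k in range(1, T + 1):
        ik = random.randrange(M)            # machine i(k) updating at time k
        for i in range(M):
            if i == ik:
                D[i] = d[i] + 1             # the previous exchange becomes the penultimate one
                d[i] = 0                    # d_i^k = 0 iff i = i(k)
            else:
                d[i] += 1                   # d_i^k = d_i^{k-1} + 1 otherwise
                D[i] += 1
        print(f"k={k:2d}  i(k)={ik}  d={d}  D={D}")

simulate_delays()
```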

Page 45

Asynchronous nonsmooth optimization methods

An instantiation: asynchronous gradient

Averaged minimization of smooth functions

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m f^i(x) \qquad \text{with each } f^i \ L\text{-smooth and } \mu\text{-strongly convex}$$

Asynchronous averaged gradient [Vanli, Gurbuzbalaban, Ozdaglar ’17]:

$$x_k = x_{k-1} - \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_i^k}\big)$$

Problem: the stepsize γ depends on the delays between machines

Proposed algorithm: rather combine the iterates themselves [Mishchenko, Iutzeler, M. ’18]:

$$x_k = \frac{1}{m}\sum_{i=1}^m x_{k-D_i^k} \;-\; \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_i^k}\big)$$

Interest: the usual stepsize γ ∈ (0, 2/(µ + L)] gives linear convergence

Variational analysis at work? Not much yet
– the convergence proof follows from standard tools and the usual rationale
– originality: a new “epoch”-based analysis to get delay-free results
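Below is a toy Python/NumPy simulation of the combined-iterates update, a sketch rather than the authors’ implementation: a random worker responds at each time k and reports from the current master point (so the delays here are trivial), and the quadratic local functions and the stepsize 2/(mu + L) are illustrative assumptions; the point is only the shape of the update.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 4

# Hypothetical strongly convex local functions f^i(x) = 0.5*(x-c_i)' H_i (x-c_i)
H = [np.diag(rng.uniform(1.0, 10.0, size=d)) for _ in range(m)]
c = [rng.standard_normal(d) for _ in range(m)]
grad = lambda i, x: H[i] @ (x - c[i])

mu = min(np.min(np.diag(Hi)) for Hi in H)       # strong convexity constant
L = max(np.max(np.diag(Hi)) for Hi in H)        # smoothness constant
gamma = 2.0 / (mu + L)                          # "usual" stepsize

# Master memory: last point sent to / gradient received from each worker
x = np.zeros(d)
last_x = [x.copy() for _ in range(m)]           # x_{k - D_i^k}
last_g = [grad(i, x) for i in range(m)]         # grad f^i(x_{k - D_i^k})

for k in range(200):
    i = rng.integers(m)                          # asynchronous: a random worker responds
    last_x[i], last_g[i] = x.copy(), grad(i, x)  # worker i reports point and gradient
    # combined-iterates update: average of delayed points minus gamma * averaged delayed gradients
    x = sum(last_x) / m - gamma * sum(last_g) / m

x_star = np.linalg.solve(sum(H), sum(Hi @ ci for Hi, ci in zip(H, c)))  # exact minimizer
print(np.linalg.norm(x - x_star))
```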

Page 49

Asynchronous nonsmooth optimization methods

Numerical illustrations

Illustration on a toy problem: 2D quadratic functions on 5 workers, with one worker 10x slower

Illustration on standard regularized logistic regression: 100 machines in a cluster, with 10% of the data on machine one and the rest spread evenly

[Figure: suboptimality vs. wallclock time (s) on RCV1 (697641 × 47236) and URL (2396130 × 3231961), comparing DAve-PG, synchronous PG, and PIAG]

Conclusion: the stepsize matters! (recall Silvia’s talk!)

Page 53

Asynchronous nonsmooth optimization methods

The idea of the work with Welington in one slide/picture

Basic situation: minimize m convex functions over m machines/oracles

$$\min_{x\in X}\ \sum_{i=1}^m f^i(x)$$

At time k, machine i = i(k) sends up f^i(x_{k-d_i^k}) and g ∈ ∂f^i(x_{k-d_i^k})

Example: d = 1, m = 2 (oracle f^1 is 3x faster than oracle f^2)

[Figure: cutting-plane models of f^1, of f^2, and of f^1 + f^2, built from the delayed information]

The delayed cuts give a lower bound on f^1 + f^2, as in [Hintermuller ’01], [Oliveira, Sagastizabal ’15], but no upper bound, since f^1(x_{k_1}) and f^2(x_{k_2}) are evaluated at different points x_{k_1} ≠ x_{k_2}. An upper bound is obtained from the level f_lev via

$$f_{\mathrm{up}} = f_{\mathrm{lev}} + \sum_{i=1}^m \Big( L_i\,\big\|x_{k+1} - x_{k-d_i^k}\big\| - \big\langle g^i_{k-d_i^k},\ x_{k+1} - x_{k-d_i^k}\big\rangle \Big)$$
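A literal Python transcription of this bound, just to show the bookkeeping; the Lipschitz constants L_i, the delayed points, and the subgradients are assumed to be available and are placeholders here, and the indexing follows the formula above.

```python
import numpy as np

def upper_bound(f_lev, x_next, delayed_x, delayed_g, lipschitz):
    """f_up = f_lev + sum_i ( L_i*||x_{k+1} - x_{k-d_i^k}|| - <g^i_{k-d_i^k}, x_{k+1} - x_{k-d_i^k}> )."""
    f_up = f_lev
    for x_i, g_i, L_i in zip(delayed_x, delayed_g, lipschitz):
        diff = x_next - x_i
        f_up += L_i * np.linalg.norm(diff) - g_i @ diff
    return f_up

# Hypothetical m = 2 example in dimension d = 1 (as in the slide's picture)
f_lev = 1.0
x_next = np.array([0.3])
delayed_x = [np.array([0.1]), np.array([-0.5])]     # x_{k-d_1^k}, x_{k-d_2^k}
delayed_g = [np.array([0.8]), np.array([-0.2])]     # subgradients at those points
lipschitz = [1.0, 1.0]                              # L_1, L_2 (assumed known)
print(upper_bound(f_lev, x_next, delayed_x, delayed_g, lipschitz))
```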

Page 59

Asynchronous nonsmooth optimization methods

Direct convergence by basic variational analysis

Convergence of the usual level bundle method [Lemarechal, Nesterov, Nemirovski ’95] requires tight control of the iterates’ moves to get a complexity analysis.

Forget this: asynchronous iterates are out of control, so we can simplify the proof in this more complicated setting!

3-line proof: finite number of null steps (by contradiction)

– recall x_{k+1} = Proj_{X_k}(x)
– nested sequence of (compact) sets ··· ⊂ X_{k+1} ⊂ X_k
– X_k → ∩_k X_k = X_∞ ≠ ∅ (Painleve-Kuratowski)
– x_{k+1} = Proj_{X_k}(x) → Proj_{X_∞}(x) [Rockafellar-Wets book]
– f_up = f_lev + Σ_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_i^k}‖ − ⟨g^i_{k−d_i^k}, x_{k+1} − x_{k−d_i^k}⟩ ) → f_lev
– this leads to a contradiction

Variational analysis at work... a little... more to come... (Samir, be patient)

Page 63

Communication-efficient proximal methods

When communication is the bottleneck...

Communicating may be bad:
– Heterogeneous frameworks: communications are unreliable
– High-dimensional problems: communications can also be costly

A solution to reduce communication is to reduce dimension

How? Using the structure of the regularized problem

$$\min_{x\in\mathbb{R}^d}\ \sum_{i=1}^m f^i(x) + \lambda R(x)$$

Typically x* belongs to a low-dimensional manifold, because the convex regularizer R is usually highly structured...

Page 64

Communication-efficient proximal methods

Mirror-stratifiable regularizers

Most of the regularizers used in machine learning or image processing have a strong primal-dual structure: they are mirror-stratifiable [Fadili, M., Peyre ’18]

Examples (with the associated unit ball and the low-dimensional manifold M_x containing x):
– R = ‖·‖_1 (and ‖·‖_∞ or other polyhedral gauges)
– the nuclear norm (aka trace norm): R(X) = Σ_i |σ_i(X)| = ‖σ(X)‖_1
– the group-ℓ1 norm: R(x) = Σ_{b∈B} ‖x_b‖_2 (e.g. R(x) = |x_1| + ‖x_{2,3}‖)

[Figure: the three unit balls, each with a point x and its manifold M_x]
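The low-dimensional structure can be seen directly on the proximal operators of these regularizers. The Python/NumPy sketch below applies the standard closed-form proximal operators (soft-thresholding for l1, block soft-thresholding for group-l1, singular-value soft-thresholding for the nuclear norm) to random inputs; the inputs and thresholds are arbitrary, the point is only that the outputs are sparse / block-sparse / low-rank.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: prox of t*||.||_1 ; zeroes small coordinates."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_group_l1(x, groups, t):
    """Block soft-thresholding: prox of t * sum_b ||x_b||_2 ; zeroes whole blocks."""
    out = np.zeros_like(x)
    for b in groups:
        norm_b = np.linalg.norm(x[b])
        if norm_b > t:
            out[b] = (1.0 - t / norm_b) * x[b]
    return out

def prox_nuclear(X, t):
    """Singular-value soft-thresholding: prox of t*||.||_* ; lowers the rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
X = rng.standard_normal((5, 5))
groups = [slice(0, 2), slice(2, 4), slice(4, 6)]

print(prox_l1(x, 1.0))                                   # many exact zeros
print(prox_group_l1(x, groups, 1.5))                     # some blocks exactly zero
print(np.linalg.matrix_rank(prox_nuclear(X, 2.0)))       # rank typically drops
```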

Page 68

Communication-efficient proximal methods

Recall on stratifications

Recall the talks of Aris (Monday morning) and of Hasnaa (Tuesday afternoon)

A stratification of a set D ⊂ R^N is a finite partition M = {M_i}_{i∈I},

$$D = \bigcup_{i\in I} M_i,$$

with so-called “strata” (e.g. smooth/affine manifolds) which fit nicely:

$$M \cap \mathrm{cl}(M') \neq \emptyset \ \Longrightarrow\ M \subset \mathrm{cl}(M')$$

This relation induces a (partial) ordering M ≤ M′.

Example: B_∞, the unit ℓ∞-ball in R^2, admits a stratification with 9 (affine) strata, with for instance M_1 ≤ M_2 ≤ M_4 and M_1 ≤ M_3 ≤ M_4.

[Figure: the unit square with four of its strata labeled M_1, M_2, M_3, M_4]

Page 70

Communication-efficient proximal methods

Mirror-stratifiable function: formal definition

A convex function R : R^N → R ∪ {+∞} is mirror-stratifiable with respect to
– a (primal) stratification M = {M_i}_{i∈I} of dom(∂R)
– a (dual) stratification M* = {M*_i}_{i∈I} of dom(∂R*)
if J_R has two properties:

1. J_R : M → M* is invertible with inverse J_{R*}:
   M* ∋ M* = J_R(M)  ⟺  J_{R*}(M*) = M ∈ M
2. J_R is decreasing for the order relation ≤ between strata:
   M ≤ M′  ⟺  J_R(M) ≥ J_R(M′)

where the transfer operator J_R : R^N ⇒ R^N is defined by [Daniilidis, Drusvyatskiy, Lewis ’13]

$$J_R(S) = \bigcup_{x\in S} \mathrm{ri}\big(\partial R(x)\big)$$

Page 71

Communication-efficient proximal methods

Mirror-stratifiable function: simple example

R = ι_{B_∞} (the indicator of the unit ℓ∞-ball) and R* = ‖·‖_1

$$J_R(M_i) = \bigcup_{x\in M_i} \mathrm{ri}\,\partial R(x) = \bigcup_{x\in M_i} \mathrm{ri}\, N_{B_\infty}(x) = M^*_i$$

$$M_i = \bigcup_{u\in M^*_i} \mathrm{ri}\,\partial\|u\|_1 = \bigcup_{u\in M^*_i} \mathrm{ri}\,\partial R^*(u) = J_{R^*}(M^*_i)$$

[Figure: the strata M_1, ..., M_4 of B_∞ and the corresponding dual strata M*_1, ..., M*_4, exchanged by J_R and J_{R*}]

Page 73

Communication-efficient proximal methods

Sensitivity under small variations

Parameterized composite optimization problem (smooth + nonsmooth)

$$\min_{x\in\mathbb{R}^N}\ F(x, p) + R(x)$$

Optimality condition for a primal-dual solution (x*(p), u*(p)):

$$u^\star(p) = -\nabla F(x^\star(p), p) \in \partial R(x^\star(p))$$

For p ∼ p_0, can we localize x*(p) with respect to x*(p_0)?

Theorem (Enlarged sensitivity)
Under mild assumptions (unique minimizer x*(p_0) at p_0 and objective uniformly level-bounded in x), if R is mirror-stratifiable, then for p ∼ p_0,

$$M_{x^\star(p_0)} \ \leq\ M_{x^\star(p)} \ \leq\ J_{R^*}\big(M^*_{u^\star(p_0)}\big)$$

In the non-degenerate case u*(p_0) ∈ ri(∂R(x*(p_0))), we have M_{x*(p_0)} = M_{x*(p)} (= J_{R*}(M*_{u*(p_0)})): we retrieve exactly the active strata ([Lewis ’06] for partly-smooth functions).

Page 77: variational analysis at work and flexible proximal algorithms · 2019-07-02 · Variational analysis at work: sensitivity and identification Sparse communinations by adaptative

Communication-efficient proximal methods

First sensitivity result illustrated

Simple projection problem, and the associated ℓ1-regularized problem:

    min_x  (1/2)‖x − p‖²  s.t.  ‖x‖∞ ≤ 1            min_{u ∈ R^N}  (1/2)‖u − p‖² + ‖u‖1

Non-degenerate case: u*(p0) = p0 − x*(p0) ∈ ri N_{B∞}(x*(p0))

    ⇒  M1 = M_{x*(p0)} = M_{x*(p)}   (in this case x*(p) = x*(p0))

General case (no non-degeneracy assumed): u*(p0) = p0 − x*(p0) may lie outside ri N_{B∞}(x*(p0))

    ⇒  M1 = M_{x*(p0)}  ≤  M_{x*(p)}  ≤  J_{R*}( M*_{u*(p0)} ) = M2

[Figure: the ℓ∞ ball with p0, x*(p0), u*(p0), a perturbed p with u*(p), and the strata M*1, M*2, M2]
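A small numerical check (not part of the talk) of the primal-dual relation u*(p) = p − x*(p) used on this slide: by the Moreau decomposition, the projection onto the ℓ∞ ball and soft-thresholding with threshold 1 sum to the identity. The helper names below are ours.

# Moreau decomposition check: project_linf_ball(p) + soft_threshold(p) == p,
# so u*(p) = p - x*(p) relates the two problems on the slide.
import numpy as np

def project_linf_ball(p, radius=1.0):
    """x*(p): projection of p onto {x : ||x||_inf <= radius} (coordinate-wise clipping)."""
    return np.clip(p, -radius, radius)

def soft_threshold(p, lam=1.0):
    """u*(p): prox of lam*||.||_1 at p (coordinate-wise shrinkage)."""
    return np.sign(p) * np.maximum(np.abs(p) - lam, 0.0)

rng = np.random.default_rng(0)
p = 3.0 * rng.standard_normal(5)
x_star = project_linf_ball(p)            # solution of the constrained problem
u_star = soft_threshold(p)               # solution of the l1-regularized problem
print(np.allclose(x_star + u_star, p))   # True: Moreau decomposition
print(np.allclose(u_star, p - x_star))   # True: u*(p) = p - x*(p)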


Activity identification of proximal algorithms

Composite optimization problem (smooth + nonsmooth)

    min_{x ∈ R^N}  f(x) + R(x)

Optimality condition: −∇f(x*) ∈ ∂R(x*)

Proximal-gradient algorithm (aka forward-backward algorithm)

    x_{k+1} = prox_{γR}( x_k − γ∇f(x_k) )

Do the iterates x_k identify the low complexity of x*?

Theorem (Enlarged activity identification)
Under convergence assumptions, if R is mirror-stratifiable, then for k large,

    M_{x*}  ≤  M_{x_k}  ≤  J_{R*}( M*_{−∇f(x*)} )

In the non-degenerate case −∇f(x*) ∈ ri ∂R(x*), we have exact identification M_{x*} = M_{x_k} ( = J_{R*}( M*_{−∇f(x*)} ) ) [Liang et al '15], and we can bound the identification threshold in this case [Hare et al '19].
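To make the forward-backward update concrete, here is a minimal sketch (not from the talk); the names grad_f and prox_R and the fixed step size gamma are our assumptions, with prox_R(v, gamma) standing for prox_{γR}(v).

# Minimal sketch of the forward-backward iteration
#   x_{k+1} = prox_{gamma R}(x_k - gamma * grad f(x_k)).
import numpy as np

def proximal_gradient(x0, grad_f, prox_R, gamma, n_iter=500):
    """Run n_iter forward-backward steps from x0 and return the list of iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(n_iter):
        x = prox_R(x - gamma * grad_f(x), gamma)   # forward (gradient) step + backward (prox) step
        iterates.append(x.copy())
    return iterates

With R = λ‖·‖1, prox_R is coordinate-wise soft-thresholding and the identification result above concerns the support of these iterates x_k.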


For the case of sparse optimization

R = ‖·‖1 promotes sparsity:

    M_x = { z ∈ R^d : supp(z) = supp(x) },      dim(M_x) = #supp(x) = ‖x‖0

[Figure: the stratum M_{x0} of a 1-sparse point x0 (‖x0‖0 = 1)]

ℓ1-regularized least-squares (LASSO)

    min_{x ∈ R^d}  (1/2)‖Ax − y‖² + λ‖x‖1

Illustration: plot of supp(x_k) for one instance with d = 27.

The result reads: for all k large enough,

    supp(x*) ⊆ supp(x_k) ⊆ supp(y*_ε)      where y*_ε = prox_{γ(1−ε)R}(u* − x*) for any ε > 0

Gap between the two extreme strata:

    δ = dim( J_{R*}( M*_{−∇f(x*)} ) ) − dim(M_{x*}) = #supp(Aᵀ(Ax* − y)) − #supp(x*)
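A minimal sketch (not from the talk) of how the gap δ above can be computed for a given (approximate) LASSO solution; the tolerance deciding which entries count as nonzero is our choice.

# delta = #supp(A^T (A x* - y)) - #supp(x*), the gap between the two extreme strata.
import numpy as np

def support(v, tol=1e-8):
    """Indices of the entries of v that are nonzero up to the tolerance."""
    return np.flatnonzero(np.abs(v) > tol)

def degeneracy_gap(A, y, x_star, tol=1e-8):
    """Gap between dim(J_{R*}(M*_{-grad f(x*)})) and dim(M_{x*}) in the l1 case."""
    grad = A.T @ (A @ x_star - y)   # gradient of the smooth part at x*
    return support(grad, tol).size - support(x_star, tol).size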


Illustration of the identification of the proximal-gradient algorithm

Generate many random problems (with d = 100 and n = 50) and solve them.

Select those with #supp(x*) = 10 and δ = 0 or 10, where δ = dim( J_{R*}( M*_{Aᵀ(Ax*−y)} ) ) − dim(M_{x*}).

Plot the evolution of #supp(x_k) along the iterations

    x_{k+1} = prox_{γ‖·‖1}( x_k − γ Aᵀ(Ax_k − y) )

δ quantifies the degeneracy of the problem and the identification behavior of the algorithm:

    δ = 0: weak degeneracy → exact identification
    δ = 10: strong degeneracy → enlarged identification
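A sketch of the kind of experiment described above, under our own instance generation and choices of λ and γ (the talk's exact setup is not reproduced): build a random LASSO instance with d = 100 and n = 50, run the proximal-gradient iteration and print #supp(x_k).

# Random LASSO instance and proximal-gradient iterations; #supp(x_k) should stabilize.
import numpy as np

rng = np.random.default_rng(1)
d, n, s = 100, 50, 10                          # dimension, number of measurements, target sparsity
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true + 0.01 * rng.standard_normal(n)

lam = 0.1
gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L with L = ||A||_2^2

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(d)
for k in range(1, 2001):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    if k % 400 == 0:
        print(k, np.count_nonzero(np.abs(x) > 1e-8))   # evolution of #supp(x_k)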


Outline

1  Distributed optimization
   - Some main ideas in distributed optimization
   - Distributed/Federated learning

2  Asynchronous nonsmooth optimization methods
   - Asynchr. (proximal) gradient methods
   - Asynchr. level bundle methods

3  Communication-efficient proximal methods
   - Optimization with mirror-stratifiable regularizers
   - Variational analysis at work: sensitivity and identification
   - Sparse communications by adaptive dimension reduction


Sparse communication

For R = ‖·‖1, the iterates of the proximal gradient algorithm eventually become sparse...

In our distributed optimization context:
    ↓ downward communications (master → workers) become sparse :)
    ↑ upward communications (workers → master) stay dense... :(

[Diagram: the master computes x_k = prox_{λ‖·‖1}(x_{k−1} + ∆) and sends the (naturally sparse) x_k to worker i = i(k); worker i returns an update ∆ computed from its oracle f^i, and this ∆ is what we want to sparsify]

Proposed solution: sparsify the updates [Grishchenko, Iutzeler, M. '19]

Take a selection S of coordinates (at random or not...):

    (x_k^i)[j] = ( x_{k−D_k^i} − γ∇f^i(x_{k−D_k^i}) )[j]   if i = i(k) and j ∈ S
    (x_k^i)[j] = (x_{k−1}^i)[j]                            otherwise

Similar to block-coordinate methods (recall Silvia's talk yesterday)... but the iteration differs by averaging.
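A minimal sketch of this sparsified worker update (the master-side averaging and prox steps of [Grishchenko, Iutzeler, M. '19] are not reproduced here); all function and variable names are our assumptions.

# When worker i = i(k) returns, only the coordinates j in the selection S are
# refreshed with the delayed gradient step; the others keep their previous values,
# so the difference sent upward is supported on S only.
import numpy as np

def sparsified_worker_update(x_local_prev, x_master_delayed, grad_f_i, gamma, S):
    """New local vector x_k^i of the returning worker.

    x_local_prev     : previous local vector x_{k-1}^i
    x_master_delayed : master point x_{k-D_k^i} the worker started from
    grad_f_i         : gradient oracle of the local function f^i
    S                : indices of the selected (communicated) coordinates
    """
    x_new = x_local_prev.copy()
    step = x_master_delayed - gamma * grad_f_i(x_master_delayed)   # full local gradient step
    x_new[S] = step[S]                                             # ...kept only on the selection S
    return x_new

The difference x_new − x_local_prev is supported on S, so only #S coordinates need to travel upward.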


Choice of selection: random vs. adaptive

Random: take a fixed number of entries at random (e.g. 200)
Adaptive: take the mask of the support + some entries at random

ℓ1-regularized logistic regression, 10 machines in a cluster, madelon dataset (2000 × 500) evenly distributed;
comparison of the asynchronous prox-grad vs. 3 sparsified variants.

[Figure: suboptimality vs. quantity of data exchanged (up to ~10^7), for Full Update, 200 coord., Mask + 100 coord., Mask + sizeof(Mask)]

Tradeoff between sparsification (less communication) and identification (faster convergence).

Taking twice the size of the support works well :) ≃ automatic dimension reduction
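A sketch of the two selection rules with illustrative sizes (helper names are ours): a purely random selection of fixed size, and the adaptive rule that takes the mask of the current support plus some random extra coordinates (e.g. as many extras as the mask, giving "Mask + sizeof(Mask)").

# Random vs. adaptive coordinate selection for the sparsified updates.
import numpy as np

rng = np.random.default_rng(2)

def random_selection(d, n_coords, rng):
    """Pick n_coords coordinates uniformly at random."""
    return rng.choice(d, size=min(n_coords, d), replace=False)

def adaptive_selection(x_master, n_extra, rng, tol=1e-8):
    """Support of the current (sparse) master point + n_extra random extra coordinates."""
    d = x_master.size
    mask = np.flatnonzero(np.abs(x_master) > tol)
    extras = rng.choice(np.setdiff1d(np.arange(d), mask),
                        size=min(n_extra, d - mask.size), replace=False)
    return np.concatenate([mask, extras])

x_master = np.zeros(500)
x_master[:10] = 1.0                                            # a sparse master iterate
S_random = random_selection(500, 200, rng)                     # "200 coord." variant
S_adapt = adaptive_selection(x_master, n_extra=10, rng=rng)    # "Mask + sizeof(Mask)" variant
print(S_random.size, S_adapt.size)                             # 200 vs. 20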


Conclusions

Take-home message
    Distributed optimization is a hot topic
    Work is needed to adapt our algorithms to recent distributed computing systems
    Practical deployment + theoretical work (to design/analyze algorithms)

Key ideas
    Our algorithms are "flexible": independent from the computing system
    No assumptions on delays!
    We exploit the identification of proximal algorithms
    to automatically reduce the dimension and size of communications

Thanks!!


Talk based on joint work

Mirror-stratified functions
    J. Fadili, J. Malick, G. Peyré, "Sensitivity Analysis for Mirror-Stratifiable Convex Functions", SIAM Journal on Optimization, 2018.

Asynchronous level bundle algorithms
    F. Iutzeler, J. Malick, W. de Oliveira, "Asynchronous level bundle methods", in revision in Math. Programming, 2019.

Flexible distributed proximal algorithms
    K. Mishchenko, F. Iutzeler, J. Malick, "A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning", ICML, 2018.
    M. Grishchenko, F. Iutzeler, J. Malick, "Subspace Descent Methods with Identification-Adapted Sampling", submitted to Maths of OR, 2019.