Page 1

Distributed nonsmooth optimization: variational analysis at work and flexible proximal algorithms

Jérôme MALICK

CNRS, Laboratoire Jean Kuntzmann, Grenoble (France)

International School of Mathematics “Guido Stampacchia”

71st Workshop: Advances in nonsmooth analysis and optimization

June 2019 – Erice (Sicily)

Page 2

Content at a glance

Today’s talk: a mix of various notions

– variational analysis at work: set convergence, sensitivity analysis
– nonsmooth algorithms: proximal gradient, level bundle
– distributed optimization: parallel computing, distributed learning, synchronous vs. asynchronous, unreliable communications

Page 14

Outline

1. Distributed optimization
   – Some main ideas in distributed optimization
   – Distributed/Federated learning

2. Asynchronous nonsmooth optimization methods
   – Asynchr. (proximal) gradient methods
   – Asynchr. level bundle methods

3. Communication-efficient proximal methods
   – Optimization with mirror-stratifiable regularizers
   – Variational analysis at work: sensitivity and identification
   – Sparse communications by adaptive dimension reduction

Page 17

Distributed optimization

Distributed optimization: past and present

Old history... Seminal work of D. Bertsekas and co-authors

Two classical frameworks:

(1) network   (2) master/slave

Recent challenges: new distributed systems

– Increase of available computing resources (data centers, cloud)
– Improvement of multicore infrastructure and networks

How to (efficiently) deploy our algorithms?

Page 19

Distributed optimization

Example in master/slave framework

Basic example: minimize m smooth functions over M = m workers

$$\min_{x\in\mathbb{R}^d}\ \sum_{i=1}^m f^i(x)$$

Standard/Distributed gradient descent (map-reduce):

$$x_{k+1} = x_k - \gamma \sum_{i=1}^m \nabla f^i(x_k)$$

Extensions: with a prox-term, nonsmooth f^i, ...

When the number of functions m is greater than the number of machines M (m > M), the algorithm extends to incremental/stochastic (batch) gradient descent (reviewed in Georg’s talk on Monday).

[Figures: the master with workers f^1, f^2, f^3, f^4 (m = M = 4), and the shared-memory setting with workers f^{i_1}, f^{i_2}, f^{i_3}, f^{i_4} (m > M = 4)]
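To make the map-reduce update above concrete, here is a minimal synchronous sketch in Python/NumPy. The quadratic local functions, the data sizes, and the fixed stepsize gamma are illustrative assumptions, not the setting of the talk’s experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 10, 4                      # dimension, number of workers (m = M here)

# Hypothetical local problems: worker i holds (A_i, b_i) and f^i(x) = 0.5*||A_i x - b_i||^2
A = [rng.standard_normal((20, d)) for _ in range(M)]
b = [rng.standard_normal(20) for _ in range(M)]

def local_grad(i, x):
    """Gradient of f^i at x, computed by worker i (the 'map' step)."""
    return A[i].T @ (A[i] @ x - b[i])

def distributed_gradient_descent(x0, gamma, n_iters=200):
    x = x0.copy()
    for _ in range(n_iters):
        # 'reduce' step: the master sums the M gradients, then takes one gradient step
        g = sum(local_grad(i, x) for i in range(M))
        x = x - gamma * g
    return x

x_final = distributed_gradient_descent(np.zeros(d), gamma=1e-3)
```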

Page 23

Distributed optimization

Limitations of optimization algorithms in master/slave framework

The previous slide hides the complexity of tasks and the variety of possible computing systems...

Two key limitations of usual algorithms

(1) Synchronization: the master has to wait for all workers at each iteration...

Example with 3 workers/agents (image: W. Yin). The notion of iteration is redefined: in the synchronous setting, a new iteration starts once all agents finish; in the asynchronous setting, a new iteration starts as soon as any agent finishes.

[Figure: timelines of Agents 1–3 over times t0, ..., t10, in the synchronous and asynchronous settings]

(2) Communications: unreliable, costly, ... (image: Google AI)

Page 27

Distributed optimization

Examples & applications of distributed/parallel optimization

Rich activity: e.g. [Nedic et al. ’14], [Yin et al. ’15], [Richtarik et al. ’16], [Chouzenoux et al. ’17]... just to name a few

Various practical situations, all different

In this workshop:
– Welington’s talk (Wednesday)
– mentioned in the talks of Russel, Silvia, and Antoine

Our community is more knowledgeable about the “data center” situation:
1. data is distributed among many machines (“big data”)
2. computation load is distributed (“big problems” or HPC)
Ex: the large-scale stochastic programming problems of Welington’s talk

Let’s discuss distributed (on-device) learning in more detail

Page 31

Distributed optimization

Recalls on supervised learning: context

Data: n observations (a_j, y_j), j = 1, ..., n
e.g. (a_j, y_j) ∈ R^m × R (regression) or (a_j, y_j) ∈ R^m × {−1, +1} (binary classification)

[Figure: five example inputs a_1, ..., a_5 with labels y_1 = 1, y_2 = −1, y_3 = 1, y_4 = −1, y_5 = 1]

Supervised learning: a prediction function h(a, x) ∈ R parameterized by x ∈ R^d
(usually denoted x = β in statistics, x = ω in learning, or x = θ in deep learning)
Goal: find x such that h(a_j, x) ≈ y_j (and generalizes well on unseen data)

Standard prediction functions:
– linear prediction: h(a, x) = ⟨a, x⟩ (or h(a, x) = ⟨φ(a), x⟩ for kernels)
– highly non-linear prediction by artificial neural networks:
  h(a, x) = ⟨x_m, σ(⟨x_{m−1}, ··· σ(⟨x_1, a⟩)⟩)⟩

[Figure: neural network with a single hidden layer; the hidden layer derives transformations of the inputs (nonlinear transformations of linear combinations), which are then used to model the output]
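As a quick illustration of the two prediction functions above, here is a small Python/NumPy sketch; the layer sizes and the tanh activation are arbitrary choices for the example, and the nested-inner-product formula is read as a plain feed-forward pass.

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = <a, x>."""
    return a @ x

def h_neural_net(a, layers, sigma=np.tanh):
    """One possible reading of h(a, x) = <x_m, sigma(<x_{m-1}, ... sigma(<x_1, a>)>)>:
    x = (x_1, ..., x_m) is a list of weight matrices applied with an activation in between."""
    z = a
    for W in layers[:-1]:
        z = sigma(W @ z)
    return layers[-1] @ z          # last layer: plain inner product

rng = np.random.default_rng(0)
a = rng.standard_normal(5)                            # one input in R^5
x_lin = rng.standard_normal(5)                        # linear model parameters
layers = [rng.standard_normal((4, 5)),                # x_1
          rng.standard_normal((3, 4)),                # x_2
          rng.standard_normal(3)]                     # x_3 (output weights)

print(h_linear(a, x_lin), h_neural_net(a, layers))
```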

Page 35

Distributed optimization

Recalls on supervised learning: learning is optimizing!

Learning = finding x, i.e. the best x from the data = solving an optimization problem

Regularized empirical risk minimization (regularization avoids overfitting, imposes structure on x, or helps numerically):

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) + \lambda R(x)$$

data-fitting term + regularizer

Ex: regularized logistic regression (used in the numerical experiments):

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{j=1}^n \log\big(1+\exp(-y_j\langle a_j, x\rangle)\big) + \frac{\lambda_2}{2}\|x\|_2^2 + \lambda_1\|x\|_1$$

Distributed setting: machine i has a chunk S_i of the data, so that

$$\frac{1}{n}\sum_{j=1}^n \ell\big(y_j, h(a_j, x)\big) = \sum_{i=1}^m \underbrace{\frac{1}{n}\sum_{j\in S_i} \ell\big(y_j, h(a_j, x)\big)}_{f^i(x)}$$
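A minimal Python/NumPy sketch of this splitting, assuming each worker i holds its own chunk (A_i, y_i) of the data; the data and the regularization weights are placeholders. Only the smooth part (logistic loss plus the l2 term) is differentiated here; the l1 term is typically handled by a proximal step, as in the algorithms discussed later.

```python
import numpy as np

def local_value_grad(A_i, y_i, x, n):
    """Smooth local term f^i(x) = (1/n) * sum_{j in S_i} log(1 + exp(-y_j <a_j, x>))
    and its gradient, computed by the worker holding the chunk (A_i, y_i)."""
    z = -y_i * (A_i @ x)
    value = np.sum(np.logaddexp(0.0, z)) / n          # numerically stable log(1+exp)
    grad = A_i.T @ (-y_i / (1.0 + np.exp(-z))) / n    # chain rule on log(1+exp(z))
    return value, grad

def full_smooth_value_grad(chunks, x, lam2):
    """Master aggregates the workers' contributions and adds the (smooth) l2 term;
    the l1 term lambda_1*||x||_1 is left to a proximal step."""
    n = sum(A_i.shape[0] for A_i, _ in chunks)
    vg = [local_value_grad(A_i, y_i, x, n) for A_i, y_i in chunks]
    value = sum(v for v, _ in vg) + 0.5 * lam2 * (x @ x)
    grad = sum(g for _, g in vg) + lam2 * x
    return value, grad

# Hypothetical data split over 4 workers
rng = np.random.default_rng(0)
d = 8
chunks = [(rng.standard_normal((25, d)), rng.choice([-1.0, 1.0], size=25)) for _ in range(4)]
value, grad = full_smooth_value_grad(chunks, np.zeros(d), lam2=0.1)
```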

Page 38

Distributed optimization

New machine learning frameworks?

Machine learning has led to major breakthroughs in various areas (such as natural language processing, computer vision, and speech recognition...)

Much of this success has been based on collecting huge amounts of data (and requires a lot of energy...)

Collecting data has two major problems:
1. it is privacy-invasive
2. it requires a lot of storage

A solution: federated learning, i.e. decoupling the ability to do machine learning from the need to store the data

No data moves... but:
– unbalanced numbers of samples
– unstable communications...

Back to the initial question: how to (efficiently) deploy our algorithms?

Page 44

Asynchronous nonsmooth optimization methods

Asynchronous master/slave framework

Algorithm = global communication scheme + local optimization method
(what is x, what is i)

Master update: x_k = x_{k−1} + Δ

[Figure: the master exchanges with M workers, each with its own oracle f^1, ..., f^M; worker i = i(k) sends its contribution Δ up and receives x_k back. Two timelines show the exchanges from the viewpoint of worker i = i(k), with delays d_i^k and D_i^k, and of another worker j ≠ i(k), with delays d_j^k and D_j^k.]

– iteration = receive from a worker + master update + send
– time k = number of iterations
– machine i(k) = the machine updating at time k
– delay d_i^k = time since the last exchange with i:
  d_i^k = 0 iff i = i(k), and d_i^k = d_i^{k−1} + 1 otherwise
– second delay D_i^k = time since the penultimate exchange with i
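Here is a small Python sketch of the delay bookkeeping defined above, driven by a hypothetical random schedule of which worker responds at each time k; it only tracks d_i^k and D_i^k, with no optimization involved. The initialization at k = 0 (as if every worker had just exchanged) is one convention among several and is an assumption, not taken from the talk.

```python
import random

def simulate_delays(M=3, T=12, seed=0):
    """Track d_i^k (time since the last exchange with i) and
    D_i^k (time since the penultimate exchange with i)."""
    random.seed(seed)
    d = [0] * M            # d_i^k
    D = [0] * M            # D_i^k
    for k in range(1, T + 1):
        ik = random.randrange(M)            # machine i(k) updating at time k
        for i in range(M):
            if i == ik:
                D[i] = d[i] + 1             # the previous exchange becomes the penultimate one
                d[i] = 0                    # d_i^k = 0 iff i = i(k)
            else:
                d[i] += 1                   # d_i^k = d_i^{k-1} + 1 otherwise
                D[i] += 1
        print(f"k={k:2d}  i(k)={ik}  d={d}  D={D}")

simulate_delays()
```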

Page 45

Asynchronous nonsmooth optimization methods

An instantiation: asynchronous gradient

Averaged minimization of smooth functions

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m f^i(x) \qquad \text{with each } f^i \ L\text{-smooth and } \mu\text{-strongly convex}$$

Asynchronous averaged gradient [Vanli, Gurbuzbalaban, Ozdaglar ’17]:

$$x_k = x_{k-1} - \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_i^k}\big)$$

Problem: the stepsize γ depends on the delays between machines

Proposed algorithm: rather combine the iterates themselves [Mishchenko, Iutzeler, M. ’18]:

$$x_k = \frac{1}{m}\sum_{i=1}^m x_{k-D_i^k} \;-\; \frac{\gamma}{m}\sum_{i=1}^m \nabla f^i\big(x_{k-D_i^k}\big)$$

Interest: the usual stepsize γ ∈ (0, 2/(µ + L)] gives linear convergence

Variational analysis at work? Not much yet
– the convergence proof follows from standard tools and the usual rationale
– originality: a new “epoch”-based analysis to get delay-free results
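Below is a toy Python/NumPy simulation of the combined-iterates update, a sketch rather than the authors’ implementation: a random worker responds at each time k and reports from the current master point (so the delays here are trivial), and the quadratic local functions and the stepsize 2/(mu + L) are illustrative assumptions; the point is only the shape of the update.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 4

# Hypothetical strongly convex local functions f^i(x) = 0.5*(x-c_i)' H_i (x-c_i)
H = [np.diag(rng.uniform(1.0, 10.0, size=d)) for _ in range(m)]
c = [rng.standard_normal(d) for _ in range(m)]
grad = lambda i, x: H[i] @ (x - c[i])

mu = min(np.min(np.diag(Hi)) for Hi in H)       # strong convexity constant
L = max(np.max(np.diag(Hi)) for Hi in H)        # smoothness constant
gamma = 2.0 / (mu + L)                          # "usual" stepsize

# Master memory: last point sent to / gradient received from each worker
x = np.zeros(d)
last_x = [x.copy() for _ in range(m)]           # x_{k - D_i^k}
last_g = [grad(i, x) for i in range(m)]         # grad f^i(x_{k - D_i^k})

for k in range(200):
    i = rng.integers(m)                          # asynchronous: a random worker responds
    last_x[i], last_g[i] = x.copy(), grad(i, x)  # worker i reports point and gradient
    # combined-iterates update: average of delayed points minus gamma * averaged delayed gradients
    x = sum(last_x) / m - gamma * sum(last_g) / m

x_star = np.linalg.solve(sum(H), sum(Hi @ ci for Hi, ci in zip(H, c)))  # exact minimizer
print(np.linalg.norm(x - x_star))
```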

Page 49

Asynchronous nonsmooth optimization methods

Numerical illustrations

Illustration on a toy problem: 2D quadratic functions on 5 workers, with one worker 10x slower

Illustration on standard regularized logistic regression: 100 machines in a cluster, with 10% of the data on machine one and the rest spread evenly

[Figure: suboptimality vs. wallclock time (s) on RCV1 (697641 × 47236) and URL (2396130 × 3231961), comparing DAve-PG, synchronous PG, and PIAG]

Conclusion: the stepsize matters! (recall Silvia’s talk!)

Page 53

Asynchronous nonsmooth optimization methods

The idea of the work with Welington in one slide/picture

Basic situation: minimize m convex functions over m machines/oracles

$$\min_{x\in X}\ \sum_{i=1}^m f^i(x)$$

At time k, machine i = i(k) sends up f^i(x_{k-d_i^k}) and g ∈ ∂f^i(x_{k-d_i^k})

Example: d = 1, m = 2 (oracle f^1 is 3x faster than oracle f^2)

[Figure: cutting-plane models of f^1, of f^2, and of f^1 + f^2, built from the delayed information]

The delayed cuts give a lower bound on f^1 + f^2, as in [Hintermuller ’01], [Oliveira, Sagastizabal ’15], but no upper bound, since f^1(x_{k_1}) and f^2(x_{k_2}) are evaluated at different points x_{k_1} ≠ x_{k_2}. An upper bound is obtained from the level f_lev via

$$f_{\mathrm{up}} = f_{\mathrm{lev}} + \sum_{i=1}^m \Big( L_i\,\big\|x_{k+1} - x_{k-d_i^k}\big\| - \big\langle g^i_{k-d_i^k},\ x_{k+1} - x_{k-d_i^k}\big\rangle \Big)$$
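A literal Python transcription of this bound, just to show the bookkeeping; the Lipschitz constants L_i, the delayed points, and the subgradients are assumed to be available and are placeholders here, and the indexing follows the formula above.

```python
import numpy as np

def upper_bound(f_lev, x_next, delayed_x, delayed_g, lipschitz):
    """f_up = f_lev + sum_i ( L_i*||x_{k+1} - x_{k-d_i^k}|| - <g^i_{k-d_i^k}, x_{k+1} - x_{k-d_i^k}> )."""
    f_up = f_lev
    for x_i, g_i, L_i in zip(delayed_x, delayed_g, lipschitz):
        diff = x_next - x_i
        f_up += L_i * np.linalg.norm(diff) - g_i @ diff
    return f_up

# Hypothetical m = 2 example in dimension d = 1 (as in the slide's picture)
f_lev = 1.0
x_next = np.array([0.3])
delayed_x = [np.array([0.1]), np.array([-0.5])]     # x_{k-d_1^k}, x_{k-d_2^k}
delayed_g = [np.array([0.8]), np.array([-0.2])]     # subgradients at those points
lipschitz = [1.0, 1.0]                              # L_1, L_2 (assumed known)
print(upper_bound(f_lev, x_next, delayed_x, delayed_g, lipschitz))
```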

Page 59

Asynchronous nonsmooth optimization methods

Direct convergence by basic variational analysis

Convergence of the usual level bundle method [Lemarechal, Nesterov, Nemirovski ’95] requires tight control of the iterates’ moves to get a complexity analysis.

Forget this: asynchronous iterates are out of control, so we can simplify the proof in this more complicated setting!

3-line proof: finite number of null steps (by contradiction)

– recall x_{k+1} = Proj_{X_k}(x)
– nested sequence of (compact) sets ··· ⊂ X_{k+1} ⊂ X_k
– X_k → ∩_k X_k = X_∞ ≠ ∅ (Painleve-Kuratowski)
– x_{k+1} = Proj_{X_k}(x) → Proj_{X_∞}(x) [Rockafellar-Wets book]
– f_up = f_lev + Σ_{i=1}^m ( L_i ‖x_{k+1} − x_{k−d_i^k}‖ − ⟨g^i_{k−d_i^k}, x_{k+1} − x_{k−d_i^k}⟩ ) → f_lev
– this leads to a contradiction

Variational analysis at work... a little... more to come... (Samir, be patient)

Page 63

Communication-efficient proximal methods

When communication is the bottleneck...

Communicating may be bad:
– Heterogeneous frameworks: communications are unreliable
– High-dimensional problems: communications can also be costly

A solution to reduce communication is to reduce dimension

How? Using the structure of the regularized problem

$$\min_{x\in\mathbb{R}^d}\ \sum_{i=1}^m f^i(x) + \lambda R(x)$$

Typically x* belongs to a low-dimensional manifold, because the convex regularizer R is usually highly structured...

Page 64

Communication-efficient proximal methods

Mirror-stratifiable regularizers

Most of the regularizers used in machine learning or image processing have a strong primal-dual structure: they are mirror-stratifiable [Fadili, M., Peyre ’18]

Examples (with the associated unit ball and the low-dimensional manifold M_x containing x):
– R = ‖·‖_1 (and ‖·‖_∞ or other polyhedral gauges)
– the nuclear norm (aka trace norm): R(X) = Σ_i |σ_i(X)| = ‖σ(X)‖_1
– the group-ℓ1 norm: R(x) = Σ_{b∈B} ‖x_b‖_2 (e.g. R(x) = |x_1| + ‖x_{2,3}‖)

[Figure: the three unit balls, each with a point x and its manifold M_x]
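The low-dimensional structure can be seen directly on the proximal operators of these regularizers. The Python/NumPy sketch below applies the standard closed-form proximal operators (soft-thresholding for l1, block soft-thresholding for group-l1, singular-value soft-thresholding for the nuclear norm) to random inputs; the inputs and thresholds are arbitrary, the point is only that the outputs are sparse / block-sparse / low-rank.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: prox of t*||.||_1 ; zeroes small coordinates."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_group_l1(x, groups, t):
    """Block soft-thresholding: prox of t * sum_b ||x_b||_2 ; zeroes whole blocks."""
    out = np.zeros_like(x)
    for b in groups:
        norm_b = np.linalg.norm(x[b])
        if norm_b > t:
            out[b] = (1.0 - t / norm_b) * x[b]
    return out

def prox_nuclear(X, t):
    """Singular-value soft-thresholding: prox of t*||.||_* ; lowers the rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
X = rng.standard_normal((5, 5))
groups = [slice(0, 2), slice(2, 4), slice(4, 6)]

print(prox_l1(x, 1.0))                                   # many exact zeros
print(prox_group_l1(x, groups, 1.5))                     # some blocks exactly zero
print(np.linalg.matrix_rank(prox_nuclear(X, 2.0)))       # rank typically drops
```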

Page 68

Communication-efficient proximal methods

Recall on stratifications

Recall the talks of Aris (Monday morning) and of Hasnaa (Tuesday afternoon)

A stratification of a set D ⊂ R^N is a finite partition M = {M_i}_{i∈I},

$$D = \bigcup_{i\in I} M_i,$$

with so-called “strata” (e.g. smooth/affine manifolds) which fit nicely:

$$M \cap \mathrm{cl}(M') \neq \emptyset \ \Longrightarrow\ M \subset \mathrm{cl}(M')$$

This relation induces a (partial) ordering M ≤ M′.

Example: B_∞, the unit ℓ∞-ball in R^2, admits a stratification with 9 (affine) strata, with for instance M_1 ≤ M_2 ≤ M_4 and M_1 ≤ M_3 ≤ M_4.

[Figure: the unit square with four of its strata labeled M_1, M_2, M_3, M_4]

Page 70

Communication-efficient proximal methods

Mirror-stratifiable function: formal definition

A convex function R : R^N → R ∪ {+∞} is mirror-stratifiable with respect to
– a (primal) stratification M = {M_i}_{i∈I} of dom(∂R)
– a (dual) stratification M* = {M*_i}_{i∈I} of dom(∂R*)
if J_R has two properties:

1. J_R : M → M* is invertible with inverse J_{R*}:
   M* ∋ M* = J_R(M)  ⟺  J_{R*}(M*) = M ∈ M
2. J_R is decreasing for the order relation ≤ between strata:
   M ≤ M′  ⟺  J_R(M) ≥ J_R(M′)

where the transfer operator J_R : R^N ⇒ R^N is defined by [Daniilidis, Drusvyatskiy, Lewis ’13]

$$J_R(S) = \bigcup_{x\in S} \mathrm{ri}\big(\partial R(x)\big)$$

Page 71

Communication-efficient proximal methods

Mirror-stratifiable function: simple example

R = ι_{B_∞} (the indicator of the unit ℓ∞-ball) and R* = ‖·‖_1

$$J_R(M_i) = \bigcup_{x\in M_i} \mathrm{ri}\,\partial R(x) = \bigcup_{x\in M_i} \mathrm{ri}\, N_{B_\infty}(x) = M^*_i$$

$$M_i = \bigcup_{u\in M^*_i} \mathrm{ri}\,\partial\|u\|_1 = \bigcup_{u\in M^*_i} \mathrm{ri}\,\partial R^*(u) = J_{R^*}(M^*_i)$$

[Figure: the strata M_1, ..., M_4 of B_∞ and the corresponding dual strata M*_1, ..., M*_4, exchanged by J_R and J_{R*}]

Page 73

Communication-efficient proximal methods

Sensitivity under small variations

Parameterized composite optimization problem (smooth + nonsmooth)

$$\min_{x\in\mathbb{R}^N}\ F(x, p) + R(x)$$

Optimality condition for a primal-dual solution (x*(p), u*(p)):

$$u^\star(p) = -\nabla F(x^\star(p), p) \in \partial R(x^\star(p))$$

For p ∼ p_0, can we localize x*(p) with respect to x*(p_0)?

Theorem (Enlarged sensitivity)
Under mild assumptions (unique minimizer x*(p_0) at p_0 and objective uniformly level-bounded in x), if R is mirror-stratifiable, then for p ∼ p_0,

$$M_{x^\star(p_0)} \ \leq\ M_{x^\star(p)} \ \leq\ J_{R^*}\big(M^*_{u^\star(p_0)}\big)$$

In the non-degenerate case u*(p_0) ∈ ri(∂R(x*(p_0))), we have M_{x*(p_0)} = M_{x*(p)} (= J_{R*}(M*_{u*(p_0)})): we retrieve exactly the active strata ([Lewis ’06] for partly-smooth functions).

Page 77: variational analysis at work and flexible proximal algorithms · 2019-07-02 · Variational analysis at work: sensitivity and identification Sparse communinations by adaptative

Communication-efficient proximal methods

First sensitivity result illustrated

Simple projection problem, and the associated ℓ1-regularized problem:

    min_x  (1/2)‖x − p‖²  s.t.  ‖x‖∞ ≤ 1            min_{u ∈ R^N}  (1/2)‖u − p‖² + ‖u‖1

Non-degenerate case: u*(p0) = p0 − x*(p0) ∈ ri N_{B∞}(x*(p0))

    ⇒  M1 = M_{x*(p0)} = M_{x*(p)}   (in this case x*(p) = x*(p0))

General case (no non-degeneracy assumed): u*(p0) = p0 − x*(p0) may lie outside ri N_{B∞}(x*(p0))

    ⇒  M1 = M_{x*(p0)}  ≤  M_{x*(p)}  ≤  J_{R*}( M*_{u*(p0)} ) = M2

[Figure: the ℓ∞ ball with p0, x*(p0), u*(p0), a perturbed p with u*(p), and the strata M*1, M*2, M2]
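A small numerical check (not part of the talk) of the primal-dual relation u*(p) = p − x*(p) used on this slide: by the Moreau decomposition, the projection onto the ℓ∞ ball and soft-thresholding with threshold 1 sum to the identity. The helper names below are ours.

# Moreau decomposition check: project_linf_ball(p) + soft_threshold(p) == p,
# so u*(p) = p - x*(p) relates the two problems on the slide.
import numpy as np

def project_linf_ball(p, radius=1.0):
    """x*(p): projection of p onto {x : ||x||_inf <= radius} (coordinate-wise clipping)."""
    return np.clip(p, -radius, radius)

def soft_threshold(p, lam=1.0):
    """u*(p): prox of lam*||.||_1 at p (coordinate-wise shrinkage)."""
    return np.sign(p) * np.maximum(np.abs(p) - lam, 0.0)

rng = np.random.default_rng(0)
p = 3.0 * rng.standard_normal(5)
x_star = project_linf_ball(p)            # solution of the constrained problem
u_star = soft_threshold(p)               # solution of the l1-regularized problem
print(np.allclose(x_star + u_star, p))   # True: Moreau decomposition
print(np.allclose(u_star, p - x_star))   # True: u*(p) = p - x*(p)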


Activity identification of proximal algorithms

Composite optimization problem (smooth + nonsmooth)

    min_{x ∈ R^N}  f(x) + R(x)

Optimality condition: −∇f(x*) ∈ ∂R(x*)

Proximal-gradient algorithm (aka forward-backward algorithm)

    x_{k+1} = prox_{γR}( x_k − γ∇f(x_k) )

Do the iterates x_k identify the low complexity of x*?

Theorem (Enlarged activity identification)
Under convergence assumptions, if R is mirror-stratifiable, then for k large,

    M_{x*}  ≤  M_{x_k}  ≤  J_{R*}( M*_{−∇f(x*)} )

In the non-degenerate case −∇f(x*) ∈ ri ∂R(x*), we have exact identification M_{x*} = M_{x_k} ( = J_{R*}( M*_{−∇f(x*)} ) ) [Liang et al '15], and we can bound the identification threshold in this case [Hare et al '19].
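To make the forward-backward update concrete, here is a minimal sketch (not from the talk); the names grad_f and prox_R and the fixed step size gamma are our assumptions, with prox_R(v, gamma) standing for prox_{γR}(v).

# Minimal sketch of the forward-backward iteration
#   x_{k+1} = prox_{gamma R}(x_k - gamma * grad f(x_k)).
import numpy as np

def proximal_gradient(x0, grad_f, prox_R, gamma, n_iter=500):
    """Run n_iter forward-backward steps from x0 and return the list of iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(n_iter):
        x = prox_R(x - gamma * grad_f(x), gamma)   # forward (gradient) step + backward (prox) step
        iterates.append(x.copy())
    return iterates

With R = λ‖·‖1, prox_R is coordinate-wise soft-thresholding and the identification result above concerns the support of these iterates x_k.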


For the case of sparse optimization

R = ‖·‖1 promotes sparsity:

    M_x = { z ∈ R^d : supp(z) = supp(x) },      dim(M_x) = #supp(x) = ‖x‖0

[Figure: the stratum M_{x0} of a 1-sparse point x0 (‖x0‖0 = 1)]

ℓ1-regularized least-squares (LASSO)

    min_{x ∈ R^d}  (1/2)‖Ax − y‖² + λ‖x‖1

Illustration: plot of supp(x_k) for one instance with d = 27.

The result reads: for all k large enough,

    supp(x*) ⊆ supp(x_k) ⊆ supp(y*_ε)      where y*_ε = prox_{γ(1−ε)R}(u* − x*) for any ε > 0

Gap between the two extreme strata:

    δ = dim( J_{R*}( M*_{−∇f(x*)} ) ) − dim(M_{x*}) = #supp(Aᵀ(Ax* − y)) − #supp(x*)
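A minimal sketch (not from the talk) of how the gap δ above can be computed for a given (approximate) LASSO solution; the tolerance deciding which entries count as nonzero is our choice.

# delta = #supp(A^T (A x* - y)) - #supp(x*), the gap between the two extreme strata.
import numpy as np

def support(v, tol=1e-8):
    """Indices of the entries of v that are nonzero up to the tolerance."""
    return np.flatnonzero(np.abs(v) > tol)

def degeneracy_gap(A, y, x_star, tol=1e-8):
    """Gap between dim(J_{R*}(M*_{-grad f(x*)})) and dim(M_{x*}) in the l1 case."""
    grad = A.T @ (A @ x_star - y)   # gradient of the smooth part at x*
    return support(grad, tol).size - support(x_star, tol).size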


Illustration of the identification of the proximal-gradient algorithm

Generate many random problems (with d = 100 and n = 50) and solve them.

Select those with #supp(x*) = 10 and δ = 0 or 10, where δ = dim( J_{R*}( M*_{Aᵀ(Ax*−y)} ) ) − dim(M_{x*}).

Plot the evolution of #supp(x_k) along the iterations

    x_{k+1} = prox_{γ‖·‖1}( x_k − γ Aᵀ(Ax_k − y) )

δ quantifies the degeneracy of the problem and the identification behavior of the algorithm:

    δ = 0: weak degeneracy → exact identification
    δ = 10: strong degeneracy → enlarged identification
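A sketch of the kind of experiment described above, under our own instance generation and choices of λ and γ (the talk's exact setup is not reproduced): build a random LASSO instance with d = 100 and n = 50, run the proximal-gradient iteration and print #supp(x_k).

# Random LASSO instance and proximal-gradient iterations; #supp(x_k) should stabilize.
import numpy as np

rng = np.random.default_rng(1)
d, n, s = 100, 50, 10                          # dimension, number of measurements, target sparsity
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true + 0.01 * rng.standard_normal(n)

lam = 0.1
gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L with L = ||A||_2^2

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(d)
for k in range(1, 2001):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    if k % 400 == 0:
        print(k, np.count_nonzero(np.abs(x) > 1e-8))   # evolution of #supp(x_k)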


Outline

1  Distributed optimization
   - Some main ideas in distributed optimization
   - Distributed/Federated learning

2  Asynchronous nonsmooth optimization methods
   - Asynchr. (proximal) gradient methods
   - Asynchr. level bundle methods

3  Communication-efficient proximal methods
   - Optimization with mirror-stratifiable regularizers
   - Variational analysis at work: sensitivity and identification
   - Sparse communications by adaptive dimension reduction


Sparse communication

For R = ‖·‖1, the iterates of the proximal gradient algorithm eventually become sparse...

In our distributed optimization context:
    ↓ downward communications (master → workers) become sparse :)
    ↑ upward communications (workers → master) stay dense... :(

[Diagram: the master computes x_k = prox_{λ‖·‖1}(x_{k−1} + ∆) and sends the (naturally sparse) x_k to worker i = i(k); worker i returns an update ∆ computed from its oracle f^i, and this ∆ is what we want to sparsify]

Proposed solution: sparsify the updates [Grishchenko, Iutzeler, M. '19]

Take a selection S of coordinates (at random or not...):

    (x_k^i)[j] = ( x_{k−D_k^i} − γ∇f^i(x_{k−D_k^i}) )[j]   if i = i(k) and j ∈ S
    (x_k^i)[j] = (x_{k−1}^i)[j]                            otherwise

Similar to block-coordinate methods (recall Silvia's talk yesterday)... but the iteration differs by averaging.
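A minimal sketch of this sparsified worker update (the master-side averaging and prox steps of [Grishchenko, Iutzeler, M. '19] are not reproduced here); all function and variable names are our assumptions.

# When worker i = i(k) returns, only the coordinates j in the selection S are
# refreshed with the delayed gradient step; the others keep their previous values,
# so the difference sent upward is supported on S only.
import numpy as np

def sparsified_worker_update(x_local_prev, x_master_delayed, grad_f_i, gamma, S):
    """New local vector x_k^i of the returning worker.

    x_local_prev     : previous local vector x_{k-1}^i
    x_master_delayed : master point x_{k-D_k^i} the worker started from
    grad_f_i         : gradient oracle of the local function f^i
    S                : indices of the selected (communicated) coordinates
    """
    x_new = x_local_prev.copy()
    step = x_master_delayed - gamma * grad_f_i(x_master_delayed)   # full local gradient step
    x_new[S] = step[S]                                             # ...kept only on the selection S
    return x_new

The difference x_new − x_local_prev is supported on S, so only #S coordinates need to travel upward.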


Choice of selection: random vs. adaptive

Random: take a fixed number of entries at random (e.g. 200)
Adaptive: take the mask of the support + some entries at random

ℓ1-regularized logistic regression, 10 machines in a cluster, madelon dataset (2000 × 500) evenly distributed;
comparison of the asynchronous prox-grad vs. 3 sparsified variants.

[Figure: suboptimality vs. quantity of data exchanged (up to ~10^7), for Full Update, 200 coord., Mask + 100 coord., Mask + sizeof(Mask)]

Tradeoff between sparsification (less communication) and identification (faster convergence).

Taking twice the size of the support works well :) ≃ automatic dimension reduction
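A sketch of the two selection rules with illustrative sizes (helper names are ours): a purely random selection of fixed size, and the adaptive rule that takes the mask of the current support plus some random extra coordinates (e.g. as many extras as the mask, giving "Mask + sizeof(Mask)").

# Random vs. adaptive coordinate selection for the sparsified updates.
import numpy as np

rng = np.random.default_rng(2)

def random_selection(d, n_coords, rng):
    """Pick n_coords coordinates uniformly at random."""
    return rng.choice(d, size=min(n_coords, d), replace=False)

def adaptive_selection(x_master, n_extra, rng, tol=1e-8):
    """Support of the current (sparse) master point + n_extra random extra coordinates."""
    d = x_master.size
    mask = np.flatnonzero(np.abs(x_master) > tol)
    extras = rng.choice(np.setdiff1d(np.arange(d), mask),
                        size=min(n_extra, d - mask.size), replace=False)
    return np.concatenate([mask, extras])

x_master = np.zeros(500)
x_master[:10] = 1.0                                            # a sparse master iterate
S_random = random_selection(500, 200, rng)                     # "200 coord." variant
S_adapt = adaptive_selection(x_master, n_extra=10, rng=rng)    # "Mask + sizeof(Mask)" variant
print(S_random.size, S_adapt.size)                             # 200 vs. 20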


Conclusions

Take-home message
    Distributed optimization is a hot topic
    Work is needed to adapt our algorithms to recent distributed computing systems
    Practical deployment + theoretical work (to design/analyze algorithms)

Key ideas
    Our algorithms are "flexible": independent from the computing system
    No assumptions on delays!
    We exploit the identification of proximal algorithms
    to automatically reduce the dimension and size of communications

Thanks!!


Talk based on joint work

Mirror-stratified functions
    J. Fadili, J. Malick, G. Peyré, "Sensitivity Analysis for Mirror-Stratifiable Convex Functions", SIAM Journal on Optimization, 2018.

Asynchronous level bundle algorithms
    F. Iutzeler, J. Malick, W. de Oliveira, "Asynchronous level bundle methods", in revision in Math. Programming, 2019.

Flexible distributed proximal algorithms
    K. Mishchenko, F. Iutzeler, J. Malick, "A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning", ICML, 2018.
    M. Grishchenko, F. Iutzeler, J. Malick, "Subspace Descent Methods with Identification-Adapted Sampling", submitted to Maths of OR, 2019.