
Smoothing convex functions for non-differentiable optimization

Federico Pierucci

Joint work with Zaid Harchaoui, Jérôme Malick

Laboratoire Jean Kuntzmann - Inria

Séminaire BiPoP, Pinsot

November 17th, 2015


Outline

1 “Doubly” non-differentiable optimization problems

2 How to smooth a convex function?

3 Combining smoothing with algorithms

4 Conclusions and perspectives

Federico Pierucci Smoothing convex functions for non-differentiable optimization 2 / 27


Problem to solve

“Doubly” non-differentiable optimization problem:

min_{W∈R^{d×k}}  R(W) + λ‖W‖

where R(W) is the non-differentiable loss and λ‖W‖ is the non-differentiable regularization

The regularization is needed to make the learning task “robust”

Motivations

min_{W∈R^{d×k}}  ‖BW‖1 + λ‖W‖σ,1        and        min_{W∈R^{d×k}}  ‖BW‖∞ + λ‖W‖σ,1

B: an affine map that depends on the data.

‖W‖σ,1: 1) the nuclear norm, i.e. the sum of the singular values of W; 2) it is the convex envelope of rank(W) on the spectral-norm ball {W : σmax(W) ≤ 1}
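For concreteness (my own illustration, not from the slides), the nuclear norm can be computed from a singular value decomposition:

```python
import numpy as np

def nuclear_norm(W):
    """Nuclear norm ||W||_{sigma,1}: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

W = np.random.randn(5, 3) @ np.random.randn(3, 4)   # a rank-3 matrix
print(nuclear_norm(W))                               # sum of its 3 nonzero singular values
```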

Federico Pierucci Smoothing convex functions for non-differentiable optimization 3 / 27


Motivation 1: Collaborative filtering - Example: the Netflix challenge

Data: for user i and movie j, Xij ∈ {0, 0.5, ..., 4.5, 5} are the ratings; I is the set of indices of the observations.
Characteristics of collaborative filtering:
large scale: size(X) ∼ 100 000 × 100 000
sparse data: size(I) < 0.1%

The aim is to guess a future evaluation: new (i, j) ↦ Xij = ?

min_{W∈R^{d×k}}  (1/N) Σ_{(i,j)∈I} |Wij − Xij|  +  λ‖W‖σ,1,        with R(W) := (1/N) Σ_{(i,j)∈I} |Wij − Xij|

Xij ∈ R, with (i, j) ∈ I: known ratings (of movies)

‖·‖σ,1 regularization: enforces low rank solutions

|·| loss: enforces robustness to outliers
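A minimal sketch of how this objective could be evaluated (my own illustration with toy data, not the authors' code):

```python
import numpy as np

def objective(W, X, I, lam):
    """Absolute-value loss on observed entries plus nuclear-norm penalty.

    I is a list of observed (i, j) index pairs; N = |I|.
    """
    rows, cols = zip(*I)
    residuals = W[rows, cols] - X[rows, cols]
    R = np.abs(residuals).mean()                       # R(W) = (1/N) sum |W_ij - X_ij|
    nuc = np.linalg.svd(W, compute_uv=False).sum()     # ||W||_{sigma,1}
    return R + lam * nuc

# tiny example with hypothetical data
X = np.array([[5.0, 0.0], [0.0, 1.0]])
I = [(0, 0), (1, 1)]                                   # observed entries
W = np.zeros((2, 2))
print(objective(W, X, I, lam=1e-3))
```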

Federico Pierucci Smoothing convex functions for non-differentiable optimization 4 / 27


Motivation 2

Multiclass classification - adaptation of the SVM (a standard method in machine learning)

Data (xi, yi) ∈ R^d × R^k: pairs of (picture, label); Wj ∈ R^d: the j-th column of W

The aim is to guess a future evaluation: new picture x ↦ y = ?

min_{W∈R^{d×k}}  max{0, 1 + max_{r ≠ y} (W_r^T x − W_y^T x)}  +  λ‖W‖σ,1,        with R(W) := H(AW), where AW collects the scores W_r^T x − W_y^T x

Figure: H(·)

R loss: penalizes the misclassification error

‖·‖σ,1 regularization: enforces low rank models
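A minimal sketch (my own, not the authors' code) of the loss R(W) = max{0, 1 + max_{r≠y}(W_r^T x − W_y^T x)} for a single (picture, label) pair:

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """max{0, 1 + max_{r != y} (W_r^T x - W_y^T x)} for one (picture, label) pair.

    W has shape (d, k); column W[:, r] scores class r; y is the true class index.
    """
    scores = W.T @ x                      # scores[r] = W_r^T x
    margins = scores - scores[y]          # W_r^T x - W_y^T x
    margins[y] = -np.inf                  # exclude r = y from the max
    return max(0.0, 1.0 + margins.max())

d, k = 4, 3
W = np.random.randn(d, k)
x = np.random.randn(d)
print(multiclass_hinge(W, x, y=1))
```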

Federico Pierucci Smoothing convex functions for non-differentiable optimization 5 / 27


Why a nuclear-norm regularizer? Classes are embedded in a low-dimensional subspace of the feature space.

xkcd.com

Federico Pierucci Smoothing convex functions for non-differentiable optimization 6 / 27


Existing algorithms for nonsmooth optimization

min_{W∈R^{d×k}}  R(W) + λ‖W‖

where R(W) is the non-differentiable loss and λ‖W‖ is the non-differentiable regularization

General approach: subgradient algorithms.
Special approaches:
reformulations (e.g. QP, LP)
for special cases, the Douglas-Rachford algorithm [Douglas, Rachford 1956]

Neither approach scales to doubly nonsmooth problems with ‖·‖σ,1

What if the loss were smooth?

min_{W∈R^{d×k}}  R(W) + λ‖W‖

where R(W) is now a smooth loss and λ‖W‖ the nonsmooth regularization

Algorithms for a smooth loss are “good” (in terms of convergence):
Proximal gradient algorithms [Nemirovski, Yudin 1976] [Nesterov 2005] [Beck, Teboulle 2009]

Composite conditional gradient algorithm: efficient iterations for ‖·‖σ,1 [Harchaoui, Juditsky, Nemirovski 2013]

Federico Pierucci Smoothing convex functions for non-differentiable optimization 7 / 27



Our approach

The idea: combine existing algorithms with smoothing techniques.
“New algorithm = smoothing techniques + algorithm for smooth loss”

This talk: mainly about smoothing techniques

In my thesis:
Applications to machine learning problems
Real datasets: ImageNet, MovieLens
“Optimal” smoothing

Federico Pierucci Smoothing convex functions for non-differentiable optimization 9 / 27


Outline

1 “Doubly” non-differentiable optimization problems

2 How to smooth a convex function?

3 Combining smoothing with algorithms

4 Conclusions and perspectives

Federico Pierucci Smoothing convex functions for non-differentiable optimization 10 / 27


Definition (Smooth convex function)

The function f is differentiable on its domain.

The gradient ∇f is Lipschitz with modulus L, i.e. for any x, y:  ‖∇f(x) − ∇f(y)‖∗ ≤ L ‖x − y‖

where ‖·‖∗ is the dual norm of ‖·‖.

(Think of ‖·‖ = Euclidean norm = ‖·‖∗)
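For intuition, here is a tiny numerical check (my own illustration, not from the slides): the function f(x) = log(1 + e^x) is smooth in this sense, and the difference quotients of its gradient stay below L = 1/4.

```python
import numpy as np

f_grad = lambda x: 1.0 / (1.0 + np.exp(-x))       # gradient of f(x) = log(1 + exp(x))

x = np.linspace(-6, 6, 2001)
ratios = np.abs(np.diff(f_grad(x))) / np.diff(x)  # |f'(x_{i+1}) - f'(x_i)| / |x_{i+1} - x_i|
print(ratios.max())                               # stays below L = 1/4, attained near x = 0
```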

[Figure: a smooth function and its gradient]

Federico Pierucci Smoothing convex functions for non-differentiable optimization 11 / 27


Smoothing technique 1: convolution

We want to smooth g:

g^c_γ(x) := ∫_{R^n} g(x − z) µγ(z) dz

where µγ is a probability density function (concentration controlled by γ).

Let µγ be the uniform distribution on a ball, or a normal distribution. Then the smooth surrogate gγ has the following properties:

gγ is differentiable

its gradient  ∇g^c_γ(x) = ∫_{R^n} s(x − z) µγ(z) dz,  where s(x − z) ∈ ∂g(x − z),  is Lipschitz with modulus Lγ = O(1/γ)

gγ is a uniform approximation of g, i.e. ∃ m, M such that  g(x) − γm ≤ gγ(x) ≤ g(x) + γM  for all x

[Bertsekas 1978] [Duchi et al. 2012] [Pierucci et al. 2015]

Numerical integration is difficult. Our objective is to obtain a gγ that is easy to evaluate numerically, possibly in closed form.
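When no explicit expression is available, g^c_γ and its gradient can at least be approximated by Monte Carlo sampling from µγ; a sketch for g = |·| with a Gaussian µγ (my own illustration, under the stated assumptions on the subgradient s):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_abs(x, gamma, n_samples=100_000):
    """Monte Carlo estimates of g^c_gamma(x) and its gradient for g = |.|,
    with mu_gamma = N(0, gamma^2)."""
    z = gamma * rng.standard_normal(n_samples)
    value = np.mean(np.abs(x - z))            # E[g(x - Z)]
    grad = np.mean(np.sign(x - z))            # E[s(x - Z)], s a subgradient of |.|
    return value, grad

val, grad = smoothed_abs(0.3, gamma=0.5)
print(val, grad)   # close to the explicit Gaussian-smoothed values of the next slides
```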

Federico Pierucci Smoothing convex functions for non-differentiable optimization 12 / 27



Examples of explicit expressions in R

Uniform distribution: µγ(z) = (1/(2γ)) 1_{[−1,1]}(z/γ).
Gaussian distribution: µγ(z) = (1/(γ√(2π))) exp(−z²/(2γ²)), with F the cumulative distribution function.

Proof (of eq. (20)). We just separate the integral into the two subsets U1, U2 on which max{g1, g2} is attained by g1 and g2 respectively:

S(max{g1, g2})(ξ) = ∫ max{g1(ξ + z), g2(ξ + z)} µ(z) dz
= ∫_{ξ+z∈U1} max{g1, g2}(ξ + z) µ(z) dz + ∫_{ξ+z∈U2} max{g1, g2}(ξ + z) µ(z) dz
= ∫_{ξ+z∈U1} g1(ξ + z) µ(z) dz + ∫_{ξ+z∈U2} g2(ξ + z) µ(z) dz
= ∫ g1(ξ + z) i_{U1}(ξ + z) µ(z) dz + ∫ g2(ξ + z) i_{U2}(ξ + z) µ(z) dz
= S(g1 i_{U1}) + S(g2 i_{U2})

2.4 Examples

In this section F is the cumulative distribution function of the Gaussian distribution µ, i.e.
F(ξ) := (1/√(2π)) ∫_{−∞}^{ξ} e^{−t²/2} dt.

Table 4: smooth surrogates in R, computed explicitly from (8). F is the cumulative distribution function of the Gaussian distribution. The uniform distribution on [−1, 1] is µr(z) = (1/(2r)) 1_{[−1,1]}(z/r) and the Gaussian is µr(z) = (1/(r√(2π))) exp(−z²/(2r²)).

g(ξ) = |ξ|, uniform µ:
  gr(ξ) = (r/2)((ξ/r)² + 1) if |ξ| ≤ r,  |ξ| if |ξ| > r
  ∇gr(ξ) = ξ/r if |ξ| ≤ r,  sign(ξ) if |ξ| > r

g(ξ) = |ξ|, Gaussian µ:
  gr(ξ) = −ξ F(−ξ/r) + (√2/√π) r e^{−ξ²/(2r²)} + ξ F(ξ/r)
  ∇gr(ξ) = F(ξ/r) − F(−ξ/r)

g(ξ) = max{0, ξ}, uniform µ:
  gr(ξ) = 0 if ξ ≤ −r,  (r/4)(ξ/r + 1)² if −r < ξ < r,  ξ if ξ ≥ r
  ∇gr(ξ) = 0 if ξ ≤ −r,  ξ/(2r) + 1/2 if −r < ξ < r,  1 if ξ ≥ r

g(ξ) = max{0, ξ}, Gaussian µ:
  gr(ξ) = (r/√(2π)) e^{−ξ²/(2r²)} + ξ F(ξ/r)
  ∇gr(ξ) = F(ξ/r)

Example with uniform distribution

Proposition 2.8. Let µ be associated to the uniform distribution on B∞(0, 1), i.e. µ = (1/2^d) χ_{{‖·‖∞ ≤ 1}}. We smooth g(ξ) = ‖ξ‖1. Then the smooth surrogate and its gradient are

  gr(ξ) = r Σ_{i=1}^{k} h(ξi/r),        (22)

[Figure: the smooth surrogates of g(ξ) = |ξ| (left) and g(ξ) = max{0, ξ} (right), for uniform and Gaussian smoothing, compared with the nonsmooth function.]
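The Gaussian rows of the table translate directly into code; a sketch of those two formulas (my own, using the standard normal CDF F):

```python
import math

def F(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def smooth_abs_gauss(xi, r):
    """Gaussian smoothing of |xi|: -xi*F(-xi/r) + (sqrt(2)/sqrt(pi))*r*exp(-xi^2/(2r^2)) + xi*F(xi/r)."""
    return (-xi * F(-xi / r)
            + math.sqrt(2.0 / math.pi) * r * math.exp(-xi**2 / (2 * r**2))
            + xi * F(xi / r))

def smooth_relu_gauss(xi, r):
    """Gaussian smoothing of max{0, xi}: (r/sqrt(2*pi))*exp(-xi^2/(2r^2)) + xi*F(xi/r)."""
    return r / math.sqrt(2 * math.pi) * math.exp(-xi**2 / (2 * r**2)) + xi * F(xi / r)

print(smooth_abs_gauss(0.3, 0.5), smooth_relu_gauss(0.3, 0.5))
```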

Federico Pierucci Smoothing convex functions for non-differentiable optimization 14 / 27


Examples of explicit expressions in Rn

Smoothing in R^n can be complicated (if we want easy numerical evaluation). But for a decomposition

g(x) = Σ_{i=1}^n g^{(i)}(xi),   with each g^{(i)} defined on R,

we find a smooth g^{(i)}_γ for each component and get

gγ(x) = Σ_{i=1}^n g^{(i)}_γ(xi)

Example: the ℓ1 norm

g(x) = ‖x‖1 = Σ_{i=1}^n |xi| is the function to smooth

µγ(z) = (1/(2γ))^n 1_{B∞}(z/γ), the uniform distribution on B∞ = {‖·‖∞ ≤ 1}

gγ(x) = Σ_{i=1}^n γ H(xi/γ),   with H(t) = t²/2 + 1/2 if |t| ≤ 1,  and H(t) = |t| if |t| > 1
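A minimal sketch (mine) of this componentwise surrogate and its gradient:

```python
import numpy as np

def smooth_l1(x, gamma):
    """g_gamma(x) = sum_i gamma * H(x_i / gamma), the Huber-smoothed l1 norm,
    together with its gradient."""
    t = x / gamma
    inside = np.abs(t) <= 1.0
    H = np.where(inside, 0.5 * t**2 + 0.5, np.abs(t))      # H(t)
    grad = np.where(inside, t, np.sign(t))                  # d/dx_i [gamma * H(x_i/gamma)] = H'(x_i/gamma)
    return gamma * H.sum(), grad

x = np.array([-2.0, 0.3, 0.0, 5.0])
val, grad = smooth_l1(x, gamma=1.0)
print(val, grad)
```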

Federico Pierucci Smoothing convex functions for non-differentiable optimization 15 / 27


Smoothing technique 2: infimal convolution

We want to smooth g

g^ic_γ(x) := inf_{z∈R^n}  g(x − z) + ωγ(z)

where ωγ(·) = γ ω(·/γ) and ω is a smooth function.

Then the smooth surrogate gγ has the following properties:

gγ is differentiable

its gradient  ∇gγ(x) = ∇ωγ(x − z*(x)),  with z*(x) optimal in g^ic_γ(x),  is Lipschitz with modulus Lγ = O(1/γ)

gγ is a uniform approximation of g, i.e. ∃ m, M such that  g(x) − γm ≤ gγ(x) ≤ g(x) + γM  for all x

Federico Pierucci Smoothing convex functions for non-differentiable optimization 16 / 27


Examples of infimal convolution

We recover the usual smoothings from the literature:

Moreau-Yosida: ωγ(z) = (1/(2γ)) ‖z‖²   [Moreau 1965]

g^ic_γ(x) := inf_{z∈R^n}  g(z) + (1/(2γ)) ‖z − x‖₂²

Fenchel-type: ωγ = γ d*, with d strongly convex   [Nesterov 2007]

g^ic_γ(x) := max_{z∈Z}  ⟨x, Az⟩ − φ(z) − γ d(z)

where A is an affine map, φ is convex, and Z ⊂ R^n is a compact convex set.

Asymptotic: any smooth ωγ such that lim_{γ→0+} ωγ(x) = g(x)   [Beck, Teboulle 2012]

g^ic_γ(x) := ωγ(x)

Our objective is to obtain a gγ that is easy to evaluate numerically, possibly in closed form.
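For intuition (my own illustration), the Moreau-Yosida envelope of g = |·| can be evaluated by brute force and compared with its known closed form, the Huber function:

```python
import numpy as np

def moreau_env_abs(x, gamma):
    """Moreau-Yosida envelope of g = |.|: inf_z |z| + (1/(2*gamma)) * (z - x)^2,
    approximated by a brute-force search over a grid of z."""
    z = np.linspace(-10, 10, 200_001)
    return np.min(np.abs(z) + (z - x) ** 2 / (2 * gamma))

def huber(x, gamma):
    """Known closed form of that envelope."""
    return x**2 / (2 * gamma) if abs(x) <= gamma else abs(x) - gamma / 2

for x in (-3.0, 0.2, 1.0):
    print(moreau_env_abs(x, 0.5), huber(x, 0.5))   # the two values should nearly match
```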

Federico Pierucci Smoothing convex functions for non-differentiable optimization 17 / 27


Examples with Fenchel-type smoothing
(Excerpt from the paper by Pierucci et al.)

Table 1: Fenchel-type smooth surrogates. Columns: nonsmooth function σ(ξ) | ball Z | proximity function ω(z) | smooth surrogate σ(ξ, γ).

1. |ξ|  |  [−1, 1]  |  (1/2)|·|²  |  ξ²/(2γ) if |ξ| ≤ γ,  |ξ| − γ/2 if |ξ| > γ
2. |ξ|  |  [−1, 1]  |  (1 − |z|) ln(1 − |z|) + |z|  |  f(ξ, γ) = γ e^{−|ξ/γ|} + |ξ| − γ
3. maxi{ξi, 0}  |  co(Δn ∪ {0})  |  (1/2)‖·‖²  |  ⟨ξ, πZ(ξ/γ)⟩ − (γ/2)‖πZ(ξ/γ)‖²
4. maxi{ξi, 0}  |  co(Δn ∪ {0})  |  1 + Σi (zi log(zi) − zi)  |  γ(−1 + Σi exp(ξi/γ)) if ξ/γ ∈ C,  γ log(Σi exp(ξi/γ)) if ξ/γ ∈ B
5. (1/q) Σ_{i=1}^q ξα(i)  |  {z | Σi zi ≤ 1, zi ∈ [0, 1/q]}  |  (1/2)‖·‖²  |  ⟨ξ, πZ(ξ/γ)⟩ − (γ/2)‖πZ(ξ/γ)‖²
6. (1/q) Σ_{i=1}^q ξα(i)  |  {z | Σi zi ≤ 1, zi ∈ [0, 1/q]}  |  Σi zi ln(n zi)  |  Θ(λ*(ξ, γ)) (solve the dual problem)

On the first line we obtain the Huber function; on the third and fourth lines the smoothing of the multiclass hinge; on the 5th and 6th lines the smoothing of the top-q error. Here C := {s ∈ R^n | Σi exp(si) ≤ 1} and B := {s ∈ R^n | Σi exp(si) > 1}; we use the convention 0 log 0 = 0; α is the permutation that sorts in decreasing order: xα(1) = maxi xi.

Proof. We compute the partial derivatives

  ∂ω/∂zi = h′(zi),     ∂²ω/∂zi∂zj = h″(zi) if i = j,  0 if i ≠ j.

The Hessian of ω is positive definite, so ω is strongly convex with strong-convexity constant α.

Lemma 5. Let h : [a, b] → R be strongly convex on [a, b] with constant α. Then ω(z) := Σ_{i=1}^n h(zi) is strongly convex with constant α on [a, b]^n and on any convex subset of [a, b]^n.

Proof. For any t ∈ [0, 1] and x, y ∈ [a, b]^n,

  ω(tx + (1 − t)y) = Σi h(t xi + (1 − t) yi)
                   ≤ Σi [ t h(xi) + (1 − t) h(yi) − (α/2) t(1 − t) |xi − yi|² ]
                   = t Σi h(xi) + (1 − t) Σi h(yi) − (α/2) t(1 − t) Σi |xi − yi|²
                   = t ω(x) + (1 − t) ω(y) − (α/2) t(1 − t) ‖x − y‖².

The inequality is due to the strong convexity of h, which gives the statement; the strong convexity also holds on convex subsets.

Lemma 6. Let A be the function defined on [a, b], composed of two segments, such that A(a) = A(b) = 0 and A((a + b)/2) = 1. Let h(t) := A(t)(ln(A(t)) − 1). Then h is strongly convex on [a, b] with constant α = 4/(b − a)².

Proof. We define t* := (a + b)/2. For t ≠ t* we write the derivative |A′(t)| =: v and observe that v = 2/(b − a). We claim that h is twice differentiable; for t ≠ t* we compute the…

B = {s ∈ R^n | Σ_{i=1}^n exp(si) > 1}
C = {s ∈ R^n | Σ_{i=1}^n exp(si) ≤ 1}
α: the permutation that sorts in decreasing order

Note: statistics and optimization lead to the same surrogate for maxi{xi, 0}
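The fourth line of the table gives an explicit two-branch surrogate of maxi{ξi, 0}; a sketch of that formula as written above (my own code, not the authors'):

```python
import numpy as np

def smooth_max_pos(xi, gamma):
    """Entropy-smoothed surrogate of max_i{xi_i, 0} (Table 1, 4th line):
    gamma*(sum_i exp(xi_i/gamma) - 1) if sum_i exp(xi_i/gamma) <= 1,
    gamma*log(sum_i exp(xi_i/gamma))  otherwise."""
    s = np.exp(np.asarray(xi) / gamma).sum()
    return gamma * (s - 1.0) if s <= 1.0 else gamma * np.log(s)

xi = np.array([0.2, -0.5, 0.1])
for gamma in (1.0, 0.1, 0.01):
    print(smooth_max_pos(xi, gamma), max(xi.max(), 0.0))   # approaches max{xi_i, 0} as gamma -> 0
```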

Federico Pierucci Smoothing convex functions for non-differentiable optimization 18 / 27


Outline

1 “Doubly” non-differentiable optimization problems

2 How to smooth a convex function?

3 Combining smoothing with algorithms

4 Conclusions and perspectives

Federico Pierucci Smoothing convex functions for non-differentiable optimization 19 / 27


Algorithms

1 Doubly nonsmooth problem to solve:

minimize_{W∈R^{d×k}}  F(W) := R(W) + λ‖W‖σ,1

2 Smoothed problem, solved with a standard algorithm:

minimize_{W∈R^{d×k}}  Rγ(W) + λ‖W‖σ,1

3 Convergence + explicit formula for a good γ [Pierucci et al. 2013]

Theorem (Convergence)

If the iterates Wt are generated by the composite conditional gradient algorithm applied to the smoothed problem, then

F(Wt) − min_W F(W) ≤ O(γ) + O(1/(γt)) =: ε

i.e. for any ε, there exists γ = O(ε) such that we obtain an ε-optimal solution of the nonsmooth problem.
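As a rough sketch of how the smoothed loss plugs into a conditional gradient method, here is a Frank-Wolfe-style loop on the constrained variant min Rγ(W) s.t. ‖W‖σ,1 ≤ τ (my own simplification; the slides use the composite, penalized formulation of [Harchaoui, Juditsky, Nemirovski 2013]). The linear minimization oracle over the nuclear-norm ball only needs the top singular pair of the gradient:

```python
import numpy as np

def frank_wolfe_nuclear(grad_R, W0, tau, n_iters=100):
    """Conditional gradient (Frank-Wolfe) sketch for: min R_gamma(W)  s.t.  ||W||_{sigma,1} <= tau.

    grad_R(W) must return the gradient of the smoothed loss R_gamma at W.
    """
    W = W0.copy()
    for t in range(n_iters):
        G = grad_R(W)
        # Linear minimization oracle over the nuclear-norm ball:
        # argmin_{||S||_{sigma,1} <= tau} <G, S> = -tau * u1 v1^T, with (u1, v1) the top singular pair of G.
        U, _, Vt = np.linalg.svd(G)
        S = -tau * np.outer(U[:, 0], Vt[0, :])
        eta = 2.0 / (t + 2.0)             # standard open-loop step size
        W = (1.0 - eta) * W + eta * S
    return W

# toy usage: Huber-smoothed absolute loss on all entries of a small matrix X
X = np.random.randn(6, 5)
gamma = 0.5
huber_grad = lambda W: np.clip((W - X) / gamma, -1.0, 1.0)   # gradient of sum_ij Huber_gamma(W_ij - X_ij)
W_hat = frank_wolfe_nuclear(huber_grad, np.zeros_like(X), tau=5.0)
print(np.abs(W_hat - X).mean())
```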

Federico Pierucci Smoothing convex functions for non-differentiable optimization 20 / 27


Overview

1) Main objective (statistical learning): have accurate predictions fW(x) = y for new data.

2) One way to model 1) is to solve

min_W  R(W) + λ‖W‖σ,1,

because finding low-rank linear models is a robust technique for movie recommendation and image classification.

3) To solve the problem in 2) we are interested in smoothing techniques.

Our contribution is at point 3), finding accurate solutions to 2), but we keep in mind that the ultimate objective is 1).

Federico Pierucci Smoothing convex functions for non-differentiable optimization 21 / 27


Numerical illustration

X: matrix with ratings of movies, 943 (users) × 1682 (movies)

I = indices of known entries (1 %)

Yellow = ”nice” movie

Dark red = ”bad” movie

min_{W∈R^{d×k}}  (1/N) Σ_{(i,j)∈I} |Wij − Xij|  +  λ‖W‖σ,1,        with RI(W) := (1/N) Σ_{(i,j)∈I} |Wij − Xij|

Federico Pierucci Smoothing convex functions for non-differentiable optimization 22 / 27


Numerical illustration - optimization

A grid of different values for γ ∈ {0.0001, 0.01, 0.1, 0.5, 1, 5, 10, 50}

Each dataset is split into: train, validation, and test sets

On the train set we run the algorithm for each value of γ.

At each iteration we obtain parameters W^γ_t and plot RItrain(W^γ_t).

Stopping criterion: a fixed number of iterations. Simple, but enough to show the effect of smoothing.

Plot of empirical risk vs. iterations

[Figure 2: Movielens data - empirical risk versus iterations. Rows: Mov.small, Mov.medium, Mov.large; columns: train, validation, and test sets; one curve per γ ∈ {0.001, 0.01, 0.1, 0.5, 1, 5, 10, 50} plus γ = best; λ = 10⁻⁶ (10⁻⁹ for Mov.large).]

Federico Pierucci Smoothing convex functions for non-differentiable optimization 23 / 27


Numerical illustration - learning

1) Xtr Train

2) Xval Validation: to choose the best γ, i.e. the one that makes the most accurate predictions. We plot RIvalidation(W^γ_t)

3) Xts Test: to finally check the results, we plot RItest(W^γ_t)

[Figure 2 again: Movielens data - empirical risk versus iterations, on the train, validation, and test sets.]

Plots of empirical risk RI vs iterations

Federico Pierucci Smoothing convex functions for non-differentiable optimization 24 / 27


Outline

1 “Doubly” non-differentiable optimization problems

2 How to smooth a convex function?

3 Combining smoothing with algorithms

4 Conclusions and perspectives

Federico Pierucci Smoothing convex functions for non-differentiable optimization 25 / 27


Conclusions

This research opens:
Choice of γ ⇐ heavy computations

Need for a simple, automatic way of calibrating γ

We came up with an “optimal” (in the sense of the complexity analysis of the algorithm) and iteration-dependent

γt = O(1/√t)

In this talk:
A way to combine standard algorithms and smooth surrogates
Two smoothing techniques:
Infimal convolution
Convolution

Thank you for your attention

Federico Pierucci Smoothing convex functions for non-differentiable optimization 26 / 27


Pierucci, Harchaoui, Malick, 2015 - Smoothing convex functions for nonsmooth optimization (in preparation)

Pierucci, Harchaoui, Malick, 2015 - Conditional gradient algorithms for doubly non-smooth learning (in preparation)

Pierucci, Harchaoui, Malick, 2013 - A smoothing approach for composite conditional gradient with nonsmooth loss (CAP, Conférence d'Apprentissage)

Federico Pierucci Smoothing convex functions for non-differentiable optimization 27 / 27