
A unified perspective on convex structured sparsity

Guillaume Obozinski

Laboratoire d’Informatique Gaspard Monge

École des Ponts ParisTech

Joint work with Francis Bach

Mascot-Num Conference

École Centrale de Nantes, 21 March 2018

Guillaume Obozinski Unified perspective on convex structured sparsity 1/39


Structured Sparsity

The support is not only sparse; in addition, we have prior information about its structure.

Examples

The variables should be selected in groups.

The variables lie in a hierarchy.

The variables lie on a graph or network and the support should be localized or densely connected on the graph.


Applications: Difficult inverse problem in Brain Imaging

[Figure: brain activation maps, sagittal (y = −84), coronal (x = 17) and axial (z = −13) slices; color scale ±5.00e−02; "Scale 6 - Fold 9"]

Jenatton et al. (2011b)


Convex relaxation for classical sparsity

Empirical risk: for w ∈ ℝ^d,

L(w) = (1/(2n)) Σ_{i=1}^n (y_i − x_i^⊤ w)².

Support of the model:

Supp(w) = {i | w_i ≠ 0},  so that  |Supp(w)| = Σ_{i=1}^d 1_{w_i ≠ 0}.

For example, w = (0, 0, 1, −1, 0, 1, −1) has support {3, 4, 6, 7}.

Penalization for variable selection:

min_{w∈ℝ^d} L(w) + λ |Supp(w)|

Lasso:

min_{w∈ℝ^d} L(w) + λ ‖w‖₁

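What makes the ℓ₁ relaxation tractable is that its proximal operator is elementwise soft-thresholding, the building block of standard Lasso solvers such as ISTA/FISTA. A minimal sketch (standard textbook soft-thresholding, not code from the talk):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||.||_1: elementwise soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Toy vector with support {3, 4, 6, 7} (1-indexed).
w = np.array([0.0, 0.0, 1.0, -1.0, 0.0, 1.0, -1.0])
print(soft_threshold(w, 0.4))  # zeros stay zero; nonzeros shrink toward 0
```

Note that entries with magnitude below the threshold are set exactly to zero, which is how the relaxation still produces sparse supports.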

Formulation with combinatorial functions

Let V = {1, …, d}.

Let L be some empirical risk, such as L(w) = (1/(2n)) Σ_{i=1}^n (y_i − x_i^⊤ w)².

Given a set function F : 2^V → ℝ₊, consider

min_{w∈ℝ^d} L(w) + F(Supp(w))

Examples of combinatorial functions

Functions defined by recursion or by counts of structures (e.g. trees), computed with dynamic programming

Block-coding (Huang et al., 2011):

G(A) = min_{B_1,…,B_k} F(B_1) + … + F(B_k)  s.t.  B_1 ∪ … ∪ B_k ⊇ A

Submodular functions


Block-coding (Huang, Zhang and Metaxas (2009))

[Figure: a support A covered by overlapping blocks B_i]

Let F₊ : 2^V → ℝ₊ be a positive set function. Define

F∪(A) = min_S Σ_{B∈S} F₊(B)  s.t.  A ⊆ ⋃_{B∈S} B.

→ a minimum weighted set-cover problem.

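Weighted set cover is NP-hard in general, but for tiny instances F∪ can be evaluated by brute force over subfamilies, which is useful for sanity checks. A sketch (the groups and weights below are made-up illustrations, not from the talk):

```python
from itertools import combinations

def min_weighted_cover(A, groups):
    """F_union(A): minimum total weight of a subfamily of `groups`
    whose union contains A. Brute force over all subfamilies, so
    only usable for very small collections."""
    A = frozenset(A)
    if not A:
        return 0.0
    keys = list(groups)
    best = float("inf")
    for r in range(1, len(keys) + 1):
        for S in combinations(keys, r):
            if A <= frozenset().union(*S):
                best = min(best, sum(groups[B] for B in S))
    return best

# Hypothetical example: two cheap small groups and one larger group.
groups = {frozenset({1, 2}): 1.0, frozenset({2, 3}): 1.0, frozenset({1, 2, 3}): 1.5}
print(min_weighted_cover({1, 3}, groups))  # 1.5: one big group beats two small ones
```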

A relaxation for F ...?

How to solve

min_{w∈ℝ^d} L(w) + F(Supp(w)) ?

→ Greedy algorithms
→ Non-convex methods
→ Relaxation

For the cardinality |A| the convex relaxation is known; for a general F(A) it is the question:

L(w) + λ |Supp(w)|   →   L(w) + λ ‖w‖₁
L(w) + λ F(Supp(w))  →   L(w) + λ · ?


Penalizing and regularizing...

Given a function F : 2^V → ℝ₊, consider for ν, μ > 0 the combined penalty:

pen(w) = μ F(Supp(w)) + ν ‖w‖_p^p.

Motivations

Compromise between variable selection and smooth regularization

Required for functions F allowing large supports

Interpretable as a description length for the parameters w.


A convex and homogeneous relaxation

Looking for a convex relaxation of pen(w); we require as well that it be positively homogeneous → scale invariance.

Definition (homogeneous extension of a function g)

g_h : x ↦ inf_{λ>0} (1/λ) g(λx).

Proposition

The tightest convex positively homogeneous lower bound of a function g is the convex envelope of g_h.

Since Supp(λw) = Supp(w) for λ > 0, this leads us to consider

pen_h(w) = inf_{λ>0} (1/λ) ( μ F(Supp(λw)) + ν ‖λw‖_p^p ) ∝ Θ(w) := ‖w‖_p F(Supp(w))^{1/q},  with 1/p + 1/q = 1.

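The infimum over λ can be checked numerically. For p = q = 2 and F(A) = |A|, minimizing g(λ) = μF/λ + νλ‖w‖₂² in closed form gives 2√(μν)·√F·‖w‖₂, i.e. exactly 2√(μν)·Θ(w). A sketch verifying this on a grid (the grid-based verification is my own, not from the slides):

```python
import numpy as np

def pen_h(w, F_of_supp, mu, nu, lams):
    """Grid approximation of inf_lam (1/lam)(mu*F(Supp(w)) + nu*||lam*w||_2^2)."""
    c = float(np.sum(w ** 2))  # ||w||_2^2, i.e. the p = 2 case
    return float(np.min(mu * F_of_supp / lams + nu * lams * c))

mu, nu = 1.0, 1.0
lams = np.geomspace(1e-3, 1e3, 200001)  # fine multiplicative grid

w = np.array([1.0, 2.0, 0.0])
F = np.count_nonzero(w)                   # F(A) = |A|
theta = np.linalg.norm(w) * np.sqrt(F)    # Theta(w) = ||w||_2 * F^{1/2}
ratio = pen_h(w, F, mu, nu, lams) / theta
print(ratio)  # close to 2*sqrt(mu*nu) = 2
```

The ratio pen_h(w)/Θ(w) is the same constant for every w, which is exactly the proportionality claimed on the slide.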

Envelope of the homogeneous penalty Θ

Consider the norm Ω_p defined via its dual norm

Ω_p^*(s) = max_{A⊆V, A≠∅} ‖s_A‖_q / F(A)^{1/q}.

Proposition

The norm Ω_p is the convex envelope (tightest convex lower bound) of the function w ↦ ‖w‖_p F(Supp(w))^{1/q}.

Proof.

Denote Θ(w) = ‖w‖_p F(Supp(w))^{1/q}. Its Fenchel conjugate is

Θ^*(s) = max_{w∈ℝ^d} w^⊤s − ‖w‖_p F(Supp(w))^{1/q}
       = max_{A⊆V} max_{w_A∈ℝ^A} w_A^⊤ s_A − ‖w_A‖_p F(A)^{1/q}
       = max_{A⊆V} ι_{‖s_A‖_q ≤ F(A)^{1/q}} = ι_{Ω_p^*(s) ≤ 1},

where ι is the indicator (0 when the constraint holds, +∞ otherwise); conjugating once more yields Ω_p.

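For small d, the dual norm can be evaluated by brute force over subsets, which is a convenient way to sanity-check the special cases: F(A) = |A| should give ‖s‖_∞ (the dual of ℓ₁), and F = 1_{A≠∅} should give ‖s‖_q. A sketch (my own illustration, not code from the talk):

```python
import numpy as np
from itertools import combinations

def omega_dual(s, F, q):
    """Omega*_p(s) = max over nonempty A of ||s_A||_q / F(A)^{1/q}, by brute force."""
    s = np.abs(np.asarray(s, dtype=float))
    best = 0.0
    for r in range(1, len(s) + 1):
        for A in combinations(range(len(s)), r):
            sA = s[list(A)]
            best = max(best, (np.sum(sA ** q)) ** (1 / q) / F(set(A)) ** (1 / q))
    return best

s = [1.0, -3.0, 2.0]
print(omega_dual(s, lambda A: len(A), 2))  # F(A) = |A|      -> ||s||_inf = 3
print(omega_dual(s, lambda A: 1.0, 2))     # F(A) = 1_{A!=0} -> ||s||_2 = sqrt(14)
```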

Graphs of the different penalties

[Figure: surfaces of F(Supp(w)) and of pen(w) = μF(Supp(w)) + ν‖w‖₂²]


[Figure: surfaces of Θ(w) = √(F(Supp(w))) ‖w‖₂ and of its convex envelope Ω_F(w)]


A large latent group Lasso (Jacob et al., 2009)

Define the set of admissible decompositions

𝒱 = { v = (v^A)_{A⊆V} ∈ (ℝ^V)^{2^V}  s.t.  Supp(v^A) ⊆ A }.

Then

Ω_p(w) = min_{v∈𝒱} Σ_{A⊆V} F(A)^{1/q} ‖v^A‖_p  s.t.  w = Σ_{A⊆V} v^A.

[Figure: w decomposed as a sum of latent vectors v^{{1}}, v^{{2}}, v^{{1,2}}, …, v^{{1,2,3,4}}, each supported on its group]


Some simple examples

F(A)                                          Ω_p(w)
|A|                                           ‖w‖₁
1_{A≠∅}                                       ‖w‖_p
Σ_{B∈G} 1_{A∩B≠∅}, if G is a partition        Σ_{B∈G} ‖w_B‖_p
Σ_{B∈G} 1_{A∩B≠∅}, if G is not a partition    new: overlap count Lasso


Combinatorial norms as atomic norms

[Figure: unit balls of Θ₂^F(w) and of its convex envelope Ω₂^F(w), for F(A) = |A|^{1/2} and for F(A) = 1_{A∩{1,2,3}≠∅} + 1_{A∩{2,3}≠∅} + 1_{A∩{3}≠∅}]


Relation between combinatorial functions and norms

Name                      F(A)                                              Norm Ω_p
cardinality               |A|                                               Lasso (ℓ₁)
nb of groups              Σ_{B∈G} 1_{A∩B≠∅}                                 group Lasso (ℓ₁/ℓ_p)
nb of groups              δ_A if A ∈ G, +∞ else                             latent group Lasso
max nb of el. per group   max_{B∈G} |A ∩ B|                                 exclusive Lasso (ℓ_p/ℓ₁)
constant                  1_{A≠∅}                                           ℓ_p-norm
func. of cardinality      h(|A|), h sublinear:
                          1_{A≠∅} ∨ (|A|/k)                                 k-support norm (p = 2)
func. of cardinality      h(|A|), h concave:                                OWL (for p = ∞)
                          λ₁|A| + λ₂[(d choose k) − (d−|A| choose k)]       OSCAR (p = ∞, k = 2)
                          Σ_{i=1}^{|A|} Φ⁻¹(1 − qi/(2d))                    SLOPE (p = ∞)
chain length              h(max(A))                                         wedge penalty


Is the relaxation "faithful" to the original function?

Consider V = {1, …, d} and the function

F(A) = range(A) = max(A) − min(A) + 1.

→ Leads to the selection of interval patterns.

What is its convex relaxation? It is easy to show that |A| must have the same relaxation, hence

Ω_p^F(w) = ‖w‖₁:

the relaxation fails.

⇒ What are the good functions F?

→ Good functions are lower combinatorial envelopes (LCE).

Submodular functions are LCEs!


Min-cover vs overlap count functions

Given a collection of sets G with weights (d_B)_{B∈G}, two natural functions to consider:

Min-cover

F∪(A) := inf_{S⊆G} { Σ_{B∈S} d_B | A ⊆ ⋃_{B∈S} B };

F∪,− is the corresponding fractional min-cover value.

Overlap count

F∩(A) = Σ_{B∈G} d_B 1_{A∩B≠∅},

counting (with weights d_B) the number of sets of G intersected, i.e. a "maximal cover" by elements of G.

F∩ is a submodular function (as a sum of submodular functions).


Latent group Lasso vs Overlap count Lasso vs `1/`p

G = {{1, 2}, {2, 3}}.

[Figure: unit balls of Ω₂^{F∪}(w) ≤ 1, of Ω₂^{F∩}(w) ≤ 1, and of ‖w_{1,2}‖₂ + ‖w_{2,3}‖₂ ≤ 1]

F∩(A) = 1_{A∩{1,2}≠∅} + 1_{A∩{2,3}≠∅},

F∪(A) = min_{δ,δ′} { δ + δ′ | 1_A ≤ δ 1_{{1,2}} + δ′ 1_{{2,3}} }.

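For this G the two functions can be tabulated directly; with unit weights they already disagree on A = {2}, which intersects both groups but is covered by either one alone. A sketch (my own tabulation):

```python
from itertools import combinations

GROUPS = [frozenset({1, 2}), frozenset({2, 3})]  # d_B = 1 for both groups

def F_cap(A):
    """Overlap count: number of groups intersected by A (unit weights)."""
    return sum(1 for B in GROUPS if frozenset(A) & B)

def F_cup(A):
    """Min cover: fewest groups whose union contains A (brute force)."""
    A = frozenset(A)
    if not A:
        return 0
    best = float("inf")
    for r in range(1, len(GROUPS) + 1):
        for S in combinations(GROUPS, r):
            if A <= frozenset().union(*S):
                best = min(best, r)
    return best

for A in [{1}, {2}, {3}, {1, 3}, {1, 2, 3}]:
    print(sorted(A), F_cap(A), F_cup(A))
```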

Hierarchical sparsity

Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Significant literature: Zhao et al. (2009); Yuan et al. (2009); Jenatton et al. (2011c); Mairal et al. (2011); Bien et al. (2013); Yan and Bien (2015), and many others.

E.g., formulations with ℓ₁/ℓ_p norms (Zhao et al., 2009; Jenatton et al., 2011c):

Ω(w) = Σ_{i∈V} ‖w_{D(i)}‖₂,

with efficient algorithms for tree-structured groups.

[Figure: a rooted tree with root 1, children 2 and 3, and leaves 4, 5, 6]

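For the 6-node tree in the figure (root 1, children 2 and 3, leaves 4 and 5 under 2, leaf 6 under 3), the hierarchical ℓ₁/ℓ₂ penalty can be written out directly; deeper nodes belong to more groups, so they are zeroed out first, which is what enforces the hierarchy. A sketch (my own illustration of the formula above):

```python
import numpy as np

# Descendant sets D(i), including i itself, for the tree 1 -> {2,3}, 2 -> {4,5}, 3 -> {6}.
DESC = {1: [1, 2, 3, 4, 5, 6], 2: [2, 4, 5], 3: [3, 6], 4: [4], 5: [5], 6: [6]}

def hier_penalty(w):
    """Omega(w) = sum_i ||w_{D(i)}||_2, with 1-indexed nodes."""
    w = np.asarray(w, dtype=float)
    return float(sum(np.linalg.norm(w[[j - 1 for j in D]]) for D in DESC.values()))

print(hier_penalty([1, 0, 0, 0, 0, 0]))  # root only: appears in 1 group  -> 1.0
print(hier_penalty([0, 0, 0, 1, 0, 0]))  # leaf 4: in D(1), D(2), D(4)    -> 3.0
```

A unit weight at a leaf costs three times as much as the same weight at the root, so the penalty discourages selecting a node without its ancestors.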

Combinatorial functions for strong hierarchical sparsity

Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Strong hierarchical sparsity: "a node can be selected only if all its ancestors are selected".

Overlap count with the D_i:

F∩(B) := Σ_{i∈V} d_i 1_{B∩D_i≠∅} = Σ_{i∈A_B} d_i,

vs min-cover with the A_i:

F∪(B) := inf_{I⊆V} { Σ_{i∈I} f_i | B ⊆ ⋃_{i∈I} A_i }.

[Figure: the same rooted tree with nodes 1–6]


Results for different types of graphs

Chains

Families F∩ and F∪ are equivalent

Norms and prox can be computed using algorithms for isotonic regression.

Trees

Families F∩ and F∪ are different

Norms and prox for F∩ can be computed using a decomposition algorithm.

No efficient algorithm known for F∪.

DAGs

Norms and prox for F∩ can be computed using a general connection with isotonic regression on DAGs.

No efficient algorithm known for F∪.

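The chain algorithms referred to above reduce to isotonic regression, which the pool-adjacent-violators algorithm (PAVA) solves in linear time. A minimal sketch of PAVA for the increasing chain constraint (the generic building block, not the authors' specific prox code):

```python
def pava_increasing(y, w=None):
    """Weighted isotonic regression on a chain:
    argmin_x sum_i w_i (x_i - y_i)^2  s.t.  x_1 <= ... <= x_n,
    by pooling adjacent violating blocks."""
    w = [1.0] * len(y) if w is None else list(w)
    blocks = []  # each block: [mean, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge while the last two block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, c1 + c2])
    x = []
    for m, _, c in blocks:
        x.extend([m] * c)
    return x

print(pava_increasing([3.0, 1.0, 2.0]))  # -> [2.0, 2.0, 2.0]
```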

Sublinear functions of the cardinality

F(A) = Σ_{k=1}^d f_k 1_{|A|=k},

where the lower combinatorial envelope F− must be sublinear.

Let |s|_(1) ≥ … ≥ |s|_(d) be the reverse order statistics of the entries of s. Then

Ω_p^*(s) = max_{1≤j≤d} (1/f_j^{1/q}) [ Σ_{i=1}^j |s|_(i)^q ]^{1/q}.

First example

F₊(A) = 1 if |A| = k, ∞ otherwise

recovers the k-support norm of Argyriou et al. (2012) (p = 2).

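Because F depends on A only through |A|, the inner maximum over sets of size j is attained at the j largest entries of |s|, so the dual norm only needs the sorted magnitudes. A sketch checked against brute-force maximization over subsets (the weights f below are made up for illustration):

```python
import numpy as np
from itertools import combinations

def omega_dual_card(s, f, q):
    """Dual norm for F(A) = f_{|A|}: max_j (sum_{i<=j} |s|_(i)^q / f_j)^{1/q}."""
    a = np.sort(np.abs(np.asarray(s, dtype=float)))[::-1]  # reverse order statistics
    csum = np.cumsum(a ** q)
    f = np.asarray(f, dtype=float)
    return float(np.max((csum / f) ** (1.0 / q)))

def omega_dual_brute(s, f, q):
    """Same quantity by direct maximization over all nonempty subsets."""
    s = np.abs(np.asarray(s, dtype=float))
    best = 0.0
    for r in range(1, len(s) + 1):
        for A in combinations(range(len(s)), r):
            best = max(best, (np.sum(s[list(A)] ** q) / f[r - 1]) ** (1.0 / q))
    return best

s, f = [1.0, -3.0, 2.0], [1.0, 2.0, 3.0]  # hypothetical weights f_k
print(omega_dual_card(s, f, 2))  # -> 3.0, attained at j = 1
```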

Concave functions of the cardinality

If k ↦ f_k is concave then

Ω_∞(w) = Σ_{i=1}^d (f_i − f_{i−1}) |w|_(i):

the ordered weighted Lasso (OWL) (Figueiredo and Nowak, 2014).

Examples

OSCAR (Bondell and Reich, 2008): λ₁‖w‖₁ + λ₂ Ω(w) with

Ω(w) = Σ_{i<j} max(|w_i|, |w_j|),

obtained with f_k = (d choose 2) − (d−k choose 2).

SLOPE (Bogdan et al., 2015): f_k = Σ_{i=1}^k Φ⁻¹(1 − qi/(2d)).
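The OSCAR identity can be verified directly: with f_k = (d choose 2) − (d−k choose 2) the increments are f_i − f_{i−1} = d − i, and Σ_i (d−i)|w|_(i) counts each pair once at its larger element. A numeric check (my own illustration):

```python
import numpy as np
from math import comb
from itertools import combinations

def owl(w, f):
    """Omega_inf(w) = sum_i (f_i - f_{i-1}) |w|_(i), with f_0 = 0."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]
    incr = np.diff(np.concatenate(([0.0], np.asarray(f, dtype=float))))
    return float(incr @ a)

def oscar_pairs(w):
    """Direct OSCAR term: sum over i < j of max(|w_i|, |w_j|)."""
    a = np.abs(np.asarray(w, dtype=float))
    return float(sum(max(a[i], a[j]) for i, j in combinations(range(len(a)), 2)))

d = 5
f = [comb(d, 2) - comb(d - k, 2) for k in range(1, d + 1)]
w = [1.0, -2.0, 3.0, 0.0, 2.0]
print(owl(w, f), oscar_pairs(w))  # both 23.0
```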

Computations and extensions of OWL

Since F is submodular, Ω_∞^F is a linear function of |w| once the order of the coefficients is fixed. The computational problem can therefore be reduced to the case of the chain.

Proposition (Figueiredo and Nowak, 2014)

In the p = ∞ case the proximal operator can be computed efficiently via isotonic regression and PAVA.

Proposition (ℓ_p-OWL norms)

Norm definitions and efficient computations of norms and proximal operators extend naturally to Ω_p^F via isotonic regression and PAVA.


An example: penalizing the range

Structured prior on the support (Jenatton et al., 2011a): the support is an interval of {1, …, d}.

Natural associated penalization: F(A) = range(A) = i_max(A) − i_min(A) + 1.

→ F is not submodular: its relaxation is that of G(A) = |A|.

But F(A) := d − 1 + range(A) is submodular! In fact F(A) = Σ_{B∈B} 1_{A∩B≠∅} for a family B of intervals of the form shown in the figure.

[Figure: the family B of intervals]

Jenatton et al. (2011a) considered Ω(w) = Σ_{B∈B} ‖w_B ∘ d_B‖₂.


Experiments

[Figure: the five test signals, each plotted over 256 coefficients]

Signals:

S1: constant
S2: triangular shape
S3: x ↦ |sin(x) sin(5x)|
S4: a slope pattern
S5: i.i.d. Gaussian pattern

Compare:

Lasso
Elastic net
Naive ℓ₂ group Lasso
Ω₂ for F(A) = d − 1 + range(A)
Ω_∞ for F(A) = d − 1 + range(A)
The weighted ℓ₂ group Lasso of Jenatton et al. (2011a)


Constant signal

[Figure: best Hamming distance vs sample size n, for EN, GL+w, GL, L1, Sub p=∞, and Sub p=2; d = 256, k = 160, σ = 0.5]


Triangular signal

[Figure: best Hamming distance vs sample size n (d = 256, k = 160, σ = 0.5, signal S2, cov = id), for EN, GL+w, GL, L1, Sub p=∞, and Sub p=2]


(x₁, x₂) ↦ |sin(x₁) sin(5x₁) sin(x₂) sin(5x₂)| signal in 2D

[Figure: best Hamming distance vs sample size n (d = 256, k = 160, σ = 1.0), for EN, GL+w, GL, L1, Sub p=∞, and Sub p=2]


i.i.d. random signal in 2D

[Figure: best Hamming distance vs sample size n (d = 256, k = 160, σ = 1.0), for EN, GL+w, GL, L1, Sub p=∞, and Sub p=2]


Summary

A convex relaxation for functions penalizing

(a) the support, via a general set function;
(b) the ℓ_p norm of the parameter vector w.

Retrieves a large fraction of the norms used in the literature (Lasso, group Lasso, exclusive Lasso, OSCAR, OWL, SLOPE, etc.).

Generic efficient algorithms for chains/trees/graphs with the overlap count Lasso.

Open: efficient prox computation for F∩ on trees/DAGs. An alternative is a fast column generation/FCFW algorithm (Vinyes and Obozinski, 2017).

Not discussed here: general support recovery and fast convergence rates, which can be obtained from generalizations of the irrepresentability condition/restricted eigenvalue condition.


References I

Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474.

Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732.

Bien, J., Taylor, J., and Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141.

Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015). SLOPE: adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140.

Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123.

Figueiredo, M. and Nowak, R. D. (2014). Sparse estimation with strongly correlated variables using ordered weighted ℓ1 regularization. Technical Report 1409.4005, arXiv.

Groenevelt, H. (1991). Two algorithms for maximizing a separable concave function over a polymatroid feasible region. European Journal of Operational Research, 54(2):227–236.

Huang, J., Zhang, T., and Metaxas, D. (2011). Learning with structured sparsity. JMLR, 12:3371–3412.

Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In ICML.


References II

Jenatton, R., Audibert, J.-Y., and Bach, F. (2011a). Structured variable selection with sparsity-inducing norms. JMLR, 12:2777–2824.

Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F., and Thirion, B. (2011b). Multi-scale mining of fMRI data with hierarchical structured sparsity. arXiv preprint arXiv:1105.0363.

Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011c). Proximal methods for hierarchical sparse coding. JMLR, 12:2297–2334.

Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2011). Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720.

Vinyes, M. and Obozinski, G. (2017). Fast column generation for atomic norm regularization. In Artificial Intelligence and Statistics.

Wainwright, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202.

Yan, X. and Bien, J. (2015). Hierarchical sparse modeling: A choice of two regularizers. arXiv preprint arXiv:1512.01631.

Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4):1738–1757.

Zhao, P., Rocha, G., and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497.
