
Sparsity Models

Tong Zhang

Rutgers University


Topics

- Standard sparse regression model
  - algorithms: convex relaxation and greedy algorithm
  - sparse recovery analysis: high-level view

- Some extensions (complex regularization)
  - structured sparsity
  - graphical model
  - matrix regularization


Modern Sparsity Analysis: Motivation

- Modern datasets are often high dimensional
  - statistical estimation suffers from the curse of dimensionality

- Sparsity: a popular assumption to address the curse of dimensionality
  - motivated by real applications

- Challenges:
  - formulation, focusing on efficient computation
  - mathematical analysis


Standard Sparse Regression

Model: Y = X β̄ + ε

- Y ∈ R^n: observation
- X ∈ R^{n×p}: design matrix
- β̄ ∈ R^p: parameter vector to be estimated
- ε ∈ R^n: zero-mean stochastic noise with variance σ²

- High dimensional setting: n ≪ p
- Sparsity: β̄ has few nonzero components

  - supp(β̄) = {j : β̄_j ≠ 0}
  - ‖β̄‖₀ = |supp(β̄)| is small: ≪ n
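
As a minimal illustration (not from the slides), the data model above can be simulated in Python/NumPy; the sizes n, p, the sparsity level k, and the noise level sigma below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 1000, 5, 0.1           # n samples, p features, k nonzeros (n << p)

X = rng.standard_normal((n, p))              # design matrix X in R^{n x p}
beta_bar = np.zeros(p)                       # true parameter vector, sparse
support = rng.choice(p, size=k, replace=False)
beta_bar[support] = rng.standard_normal(k)

eps = sigma * rng.standard_normal(n)         # zero-mean noise with variance sigma^2
Y = X @ beta_bar + eps                       # observations: Y = X beta_bar + eps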


Algorithms for Standard Sparsity

L0 regularization: natural method (computationally inefficient)

β̂_{L0} = arg min_β ‖Y − Xβ‖₂²  subject to ‖β‖₀ ≤ k

L1 regularization (Lasso): convex relaxation (computationally efficient)

β̂_{L1} = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

- Theoretical questions:
  - how well can we estimate the parameter β̄ (recovery performance)?
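
For illustration only (not part of the slides), the L1-relaxed problem above can be solved with scikit-learn's Lasso on a synthetic instance of the sparse regression model; the sizes, the penalty value alpha, and the random seed are arbitrary choices. Note that scikit-learn scales the squared loss by 1/(2n), so its alpha corresponds to λ only up to that factor:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 100, 1000, 5
X = rng.standard_normal((n, p))
beta_bar = np.zeros(p)
beta_bar[:k] = 1.0
Y = X @ beta_bar + 0.1 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))||Y - X beta||_2^2 + alpha * ||beta||_1,
# i.e. alpha plays the role of lambda up to the 1/(2n) scaling of the squared loss.
beta_hat = Lasso(alpha=0.05, fit_intercept=False).fit(X, Y).coef_
print("selected variables:", np.flatnonzero(beta_hat))
print("parameter error ||beta_hat - beta_bar||_2^2:", np.sum((beta_hat - beta_bar) ** 2))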


Greedy Algorithms for standard sparse regularization

Reformulation: find a variable set F ⊂ {1, …, p} to minimize

min_β ‖Xβ − Y‖₂²  subject to supp(β) ⊂ F, |F| ≤ k

Forward Greedy Algorithm (OMP): select variables one by one

- Initialize the variable set F^0 = ∅ at k = 0
- Iterate k = 1, …, p:
  - find the best variable j to add to F^{k−1} (maximum reduction of squared error)
  - F^k = F^{k−1} ∪ {j}
- Terminate with some criterion; output β̂ using regression with the selected variables F^k

Theoretical question: recovery performance?
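
A short NumPy sketch of the forward greedy loop described above, selecting at each step the variable most correlated with the current residual (which matches the maximum reduction of squared error when the columns of X are normalized) and refitting by least squares on the selected set; the fixed number of steps k is a simplified stopping criterion:

import numpy as np

def omp(X, Y, k):
    """Forward greedy selection (OMP): add k variables one by one."""
    n, p = X.shape
    F = []                                    # selected variable set
    beta = np.zeros(p)
    residual = Y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))         # variable most correlated with the residual
        if j not in F:
            F.append(j)
        coef, *_ = np.linalg.lstsq(X[:, F], Y, rcond=None)  # least-squares refit on F
        beta = np.zeros(p)
        beta[F] = coef
        residual = Y - X @ beta
    return beta, F

scikit-learn's OrthogonalMatchingPursuit provides essentially the same procedure with more careful stopping rules.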


Conditions and Results

- Type of results (sparse recovery):
  - Variable selection (can we find the nonzero variables): can we recover the true support F̄?

supp(β̂) ≈ F̄?

  - Parameter estimation (how well we can estimate β̄): can we recover the parameters?

‖β̂ − β̄‖₂² ≤ ?

Are efficient algorithms (such as L1 or OMP) good enough?

- Yes, but they require conditions:
  - Irrepresentable condition: for support recovery
  - RIP (Restricted Isometry Property): for parameter recovery


KKT Condition for Lasso Solution

Lasso solution:

β̂_{L1} = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

KKT condition at β̂ = β̂_{L1}:

- There exists a subgradient equal to zero: for all j = 1, …, p (X_j is the j-th column of X),

2X_j^⊤(Xβ̂ − Y) + λ∇|β̂_j| = 0.

- Subgradient of the L1 norm: ∇|u| = sign(u), where sign(u) = 1 if u > 0, sign(u) = −1 if u < 0, and sign(u) can be any value in [−1, 1] if u = 0.

- If we can find a β̂ that satisfies the KKT condition, then it is a Lasso solution.
- A slightly stronger condition implies uniqueness.
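
An illustrative numerical check of this condition (not from the slides): fit a Lasso solution, form the terms 2X_j^⊤(Xβ̂ − Y), and verify they equal −λ·sign(β̂_j) on the active set and stay within [−λ, λ] elsewhere. The synthetic data, the penalty, and the mapping λ = 2n·alpha for scikit-learn's 1/(2n)-scaled objective are assumptions of this sketch:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 100, 200, 5
X = rng.standard_normal((n, p))
beta_bar = np.zeros(p)
beta_bar[:k] = 1.0
Y = X @ beta_bar + 0.1 * rng.standard_normal(n)

alpha = 0.05                                  # scikit-learn penalty; here lambda = 2 * n * alpha
lam = 2 * n * alpha
beta_hat = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, Y).coef_

g = 2 * X.T @ (X @ beta_hat - Y)              # the terms 2 X_j^T (X beta_hat - Y)
active = beta_hat != 0
print("max |g_j + lam*sign(beta_hat_j)| on the active set:",
      np.max(np.abs(g[active] + lam * np.sign(beta_hat[active]))))
print("max |g_j| off the active set (should be <= lam =", lam, "):",
      np.max(np.abs(g[~active])))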


Feature Selection Consistency of Lasso

Idea: construct a solution and check KKT condition.

Define β̂ such that β̂_F̄ satisfies

2X_F̄^⊤(X_F̄ β̂_F̄ − Y) + λ sign(β̄)_F̄ = 0,

and set β̂_{F̄^c} = 0.

- Condition A:
  - X_F̄^⊤ X_F̄ is full rank
  - sign(β̂_F̄) = sign(β̄_F̄)
  - 2|X_j^⊤(X_F̄ β̂_F̄ − Y)| < λ for j ∉ F̄

- Under Condition A: β̂ is the unique Lasso solution (it satisfies the KKT condition), hence supp(β̂) = F̄ and feature selection is consistent.


Irrepresentable Condition

The condition

µ = sup_{j ∉ F̄} |X_j^⊤ X_F̄ (X_F̄^⊤ X_F̄)⁻¹ sign(β̄)_F̄| < 1

is called the irrepresentable condition.
- It implies Condition A when Y = Xβ̄ (noiseless case) and λ is sufficiently small.

- Under the irrepresentable condition, if
  - the noise is sufficiently small, and
  - min_{j∈F̄} |β̄_j| is larger than the noise level,
  then there exists an appropriate λ such that Condition A holds. Thus the Lasso solution is unique and feature selection is consistent.

- A condition similar to the irrepresentable condition can be derived for OMP.
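
A small NumPy sketch (illustrative only; the random design, the support, and the sign pattern are made-up assumptions) of computing the irrepresentable constant µ from the formula above:

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5
X = rng.standard_normal((n, p))
F_bar = np.arange(k)                          # assumed true support
sign_F = np.ones(k)                           # assumed sign(beta_bar) on the support

X_F = X[:, F_bar]
X_rest = np.delete(X, F_bar, axis=1)
# mu = max_{j not in F_bar} | X_j^T X_F (X_F^T X_F)^{-1} sign(beta_bar)_F |
w = np.linalg.solve(X_F.T @ X_F, sign_F)
mu = np.max(np.abs(X_rest.T @ (X_F @ w)))
print("irrepresentable constant mu =", mu, "(the condition requires mu < 1)")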


RIP Conditions

- Feature selection consistency implies good parameter estimation. However, the irrepresentable condition is too strong.

- RIP (restricted isometry property): a weaker condition which can be used to obtain parameter estimation results.
- Definition of RIP: for some c > 1, the following condition holds:

ρ₊(ck̄) / ρ₋(ck̄) < ∞

ρ₊(s) = sup{ β^⊤X^⊤Xβ / β^⊤β : ‖β‖₀ ≤ s }
ρ₋(s) = inf{ β^⊤X^⊤Xβ / β^⊤β : ‖β‖₀ ≤ s }

with k̄ = |F̄| = ‖β̄‖₀.
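
For a toy design matrix, ρ₊(s) and ρ₋(s) can be computed exactly by enumerating all supports of size s and taking the extreme eigenvalues of the corresponding Gram submatrices. A brute-force sketch (exponential in s, so only feasible at toy sizes; the matrix dimensions are arbitrary):

import numpy as np
from itertools import combinations

def restricted_eigenvalues(X, s):
    """Brute-force rho_plus(s) and rho_minus(s) over all supports of size s."""
    p = X.shape[1]
    rho_plus, rho_minus = -np.inf, np.inf
    for F in combinations(range(p), s):
        G = X[:, F].T @ X[:, F]               # Gram matrix restricted to the support F
        eigvals = np.linalg.eigvalsh(G)
        rho_plus = max(rho_plus, eigvals[-1])
        rho_minus = min(rho_minus, eigvals[0])
    return rho_plus, rho_minus

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 12)) / np.sqrt(30)    # toy design, roughly column-normalized
rp, rm = restricted_eigenvalues(X, 3)
print("rho_plus(3) =", rp, " rho_minus(3) =", rm, " ratio =", rp / rm)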


Results under Restricted Isometry Property

Parameter estimation under RIP:

‖β̄ − β̂‖₂² = O(σ²‖β̄‖₀ ln p / n),

where σ² is the noise variance.
  - this result can be obtained both for Lasso and for OMP
  - it is the best possible

- Feature selection under RIP: neither procedure achieves feature selection consistency.

- Improvement:
  - a non-convex formulation is needed for optimal feature selection under RIP
  - trickier to analyze: general theory only appeared very recently


Complex Regularization: structured sparsity

Wavelet domain: sparsity pattern not random (structured)

(Figure: an image in the image domain and its representation in the wavelet domain)

can we take advantage of structure?


Structured Sparsity Characterization

- Observation:
  - the sparsity pattern is the set of nonzero coefficients
  - not all sparse patterns are equally likely

Our proposal: information theoretical characterization of “structure”:

  - a sparsity pattern F is associated with a cost c(F)
  - c(F) is the negative log-likelihood of F (or a multiple of it)

Optimization problem:

min_β ‖Xβ − Y‖₂²  subject to ‖β‖₀ + c(supp(β)) ≤ s

- c(supp(β)): cost for selecting the support supp(β)
- ‖β‖₀: cost for estimation after feature selection


Example: Group Structure

- Variables are divided into pre-defined groups G_1, …, G_{p/m}, with m variables per group
- Example (m = 4):

(Figure: a row of nodes partitioned into groups G_1, G_2, …, G_4, …, G_{p/m}; nodes are variables, gray nodes are the selected variables, which lie in groups 1, 2, and 4)

- Assumption:
  - coefficients are not completely random
  - coefficients in each group are simultaneously (or nearly simultaneously) zero or nonzero

How to take advantage of group structure?


Example: Group Structure

- Variables are divided into pre-defined groups G_1, …, G_{p/m}, with m variables per group

- Assumption: coefficients in each group are simultaneously zero or nonzero

- Group sparsity pattern cost: ‖β‖₀ + m⁻¹‖β‖₀ ln p
- Standard sparsity pattern cost (for Lasso): ‖β‖₀ ln p
- Theoretical question: can we take advantage of the group sparsity structure to improve Lasso?


Convex Relaxation for group sparsity

L1 − L2 convex relaxation (group Lasso)

β̂ = arg min_β ‖Xβ − Y‖₂² + λ Σ_j ‖β_{G_j}‖₂

- This is supposed to take advantage of the group sparsity structure:
  - within a group: L2 regularization (does not encourage sparsity)
  - across groups: L1 regularization (encourages sparsity)

Question: what is the benefit of the group Lasso formulation?
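
A compact proximal-gradient sketch of the group Lasso objective above (illustrative only; the groups, the regularization level, the step size rule, and the iteration count are arbitrary choices). Each iteration takes a gradient step on the squared loss and then applies block soft-thresholding, the proximal operator of λΣ_j‖β_{G_j}‖₂, to every group:

import numpy as np

def group_lasso(X, Y, groups, lam, n_iter=500):
    """Proximal gradient for min_beta ||X beta - Y||_2^2 + lam * sum_j ||beta_{G_j}||_2."""
    p = X.shape[1]
    beta = np.zeros(p)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (2 * X.T @ (X @ beta - Y))
        for G in groups:                            # block soft-thresholding of each group
            nrm = np.linalg.norm(z[G])
            z[G] = 0.0 if nrm <= step * lam else (1 - step * lam / nrm) * z[G]
        beta = z
    return beta

rng = np.random.default_rng(0)
n, p, m = 100, 60, 4
groups = [list(range(i, i + m)) for i in range(0, p, m)]     # contiguous groups of size m
beta_bar = np.zeros(p)
beta_bar[groups[0]] = 1.0
beta_bar[groups[2]] = -1.0
X = rng.standard_normal((n, p))
Y = X @ beta_bar + 0.1 * rng.standard_normal(n)
beta_hat = group_lasso(X, Y, groups, lam=2.0)
print("groups with nonzero coefficients:",
      [j for j, G in enumerate(groups) if np.linalg.norm(beta_hat[G]) > 1e-6])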


Recovery Analysis for Lasso and Group Lasso

- Simple sparsity: s = ‖β̄‖₀ variables out of p variables
  - information theoretical complexity (log of the number of choices): O(s ln p)
  - statistical recovery performance:

‖β̂ − β̄‖₂² = O(σ²‖β̄‖₀ ln p / n)

- Group sparsity: g groups out of p/m groups (ideally g = ‖β̄‖₀/m)
  - information theoretical complexity (log of the number of choices): O(g ln(p/m))
  - statistical recovery performance for group Lasso: if supp(β̄) is covered by g groups, then under group RIP (weaker than RIP)

‖β̂ − β̄‖₂² = O( (σ²/n) ( g ln(p/m) + mg ) ),

where the g ln(p/m) term accounts for group selection and the mg term for estimation after group selection.


Group sparsity: correct group structure

(Figure: true and estimated coefficients on 500 variables when the assumed group structure is correct; panels: (a) Original, (b) Lasso, (c) Group Lasso)


Group sparsity: incorrect group structure

(Figure: true and estimated coefficients on 500 variables when the assumed group structure is incorrect; panels: (a) Original, (b) Lasso, (c) Group Lasso)


Matrix Formulation: Graphical Model Example

Learning gene interaction network structure


Formulation: Gaussian Graphical Model

- Multi-dimensional Gaussian vectors: X_1, …, X_n ∼ N(µ, Σ)
- Precision matrix: Θ = Σ⁻¹

- Nonzeros of the precision matrix give the graphical model structure:

P(X_i) ∝ |Θ|^{1/2} exp[ −(1/2)(X_i − µ)^⊤ Θ (X_i − µ) ],

where | · | denotes the determinant.
- Estimation: L1-regularized maximum likelihood estimator

Θ̂ = arg min_Θ [ −ln|Θ| + tr(Σ̂Θ) + λ‖Θ‖₁ ],

- ‖·‖₁: elementwise L1 regularization to encourage sparsity
- Σ̂: empirical covariance matrix

- Analysis exists (feature selection and parameter estimation): techniques similar to the L1 analysis, but not satisfactory.
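
For illustration (these are not the slides' experiments), scikit-learn's GraphicalLasso fits this L1-penalized likelihood; alpha plays the role of λ, and the chain-structured precision matrix below is a made-up example:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 10, 2000
# made-up chain-structured precision matrix (nonzeros only on the diagonal and first off-diagonals)
Theta = np.eye(d) + 0.4 * np.eye(d, k=1) + 0.4 * np.eye(d, k=-1)
Sigma = np.linalg.inv(Theta)
Xs = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

model = GraphicalLasso(alpha=0.05).fit(Xs)     # alpha plays the role of lambda
Theta_hat = model.precision_
print("estimated nonzero pattern of Theta:")
print((np.abs(Theta_hat) > 1e-3).astype(int))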


Matrix Completion

user × movie ratings (? = unobserved):

        M1  M2  M3  M4  M5  M6  M7  M8
  U1     1   ?   0   4   2   1   ?   ?
  U2     2   4   ?   ?   4   ?   1   2
  U3     2   3   2   ?   4   ?   ?   2
  U4     ?   1   0   3   2   ?   1   1
  U5     1   ?   ?   1   ?   ?   3   2

- m × n matrix: m users and n movies with incomplete ratings
- can we fill in the missing values?

- requires assumptions:
  - intuition: U2 and U3 have similar ratings on the observed entries — assume they have similar preferences
  - low-rank (rank-r) structure: map user i to u_i ∈ R^r and movie j to v_j ∈ R^r, with rating ≈ u_i^⊤ v_j. Let X be the true rating matrix:

X ≈ UV^⊤   (U: m × r, V: n × r)


Formulation

- Let S = {observed (i, j) entries}
- Let y_{ij} be the observed value for (i, j) ∈ S
- Let X be the true underlying rating matrix

We want to find X to fit the observed y_{ij}, assuming X is low-rank:

min_{X ∈ R^{m×n}} Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ · rank(X)

- rank(X): a nonconvex function of X
- convex relaxation: the trace norm ‖X‖∗, defined as the sum of the singular values of X

The convex reformulation is

min_{X ∈ R^{m×n}} Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ‖X‖∗

- The solution of trace norm regularization is low-rank.
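
A proximal-gradient sketch of the trace-norm formulation above (illustrative only; the sizes, the rank, λ, the step size, and the iteration count are arbitrary). The proximal operator of λ‖·‖∗ soft-thresholds the singular values, which is what keeps the iterates low-rank:

import numpy as np

def svt(Z, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(Y, mask, lam, n_iter=300, step=0.5):
    """Proximal gradient for min_X sum_observed (X_ij - y_ij)^2 + lam * ||X||_*."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = 2 * mask * (X - Y)              # gradient of the squared loss on observed entries
        X = svt(X - step * grad, step * lam)
    return X

rng = np.random.default_rng(0)
m, n, r = 50, 40, 3
truth = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # true low-rank matrix
mask = rng.random((m, n)) < 0.5                                      # observe about half the entries
Y = np.where(mask, truth, 0.0)

X_hat = complete(Y, mask, lam=1.0)
print("relative error on the full matrix:",
      np.linalg.norm(X_hat - truth) / np.linalg.norm(truth))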


Sparsity versus Low-rank

- A vector β ∈ R^p: p parameters
  - reduce dimension — sparsity: ‖β‖₀ is small
  - the constraint ‖β‖₀ ≤ s is nonconvex
  - convex relaxation: convex hull of unit 1-sparse vectors, which gives L1 regularization ‖β‖₁ ≤ 1
  - the vector solution with L1 regularization is sparse

- A matrix X ∈ R^{m×n}: m × n parameters
  - reduce dimension — low rank: X = Σ_{j=1}^r u_j v_j^⊤, where u_j ∈ R^m and v_j ∈ R^n are vectors
  - number of parameters — no more than rm + rn
  - the rank constraint is nonconvex
  - convex relaxation: convex hull of unit rank-one matrices, which gives trace-norm regularization ‖X‖∗ ≤ 1
  - the matrix solution with trace-norm regularization is low-rank


Matrix Regularization Example: mixed sparsity and low rank

Y (observed) = X_L (low-rank) + X_S (sparse)

[X̂_S, X̂_L] = arg min_{X_S, X_L} (1/(2µ)) ‖(X_S + X_L) − Y‖₂² + λ‖X_S‖₁ + ‖X_L‖∗

- trace norm ‖·‖∗: sum of the singular values of a matrix – encourages a low-rank matrix
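
A short alternating-minimization sketch of this sparse-plus-low-rank objective (a simplification, not the slides' algorithm; µ, λ, and the iteration count are arbitrary assumptions). With the other block fixed, the minimization over X_S is elementwise soft-thresholding and the minimization over X_L is singular value soft-thresholding:

import numpy as np

def soft(Z, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def svt(Z, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def sparse_plus_lowrank(Y, mu=1.0, lam=0.1, n_iter=200):
    """Alternating minimization of (1/(2*mu))||X_S + X_L - Y||_2^2 + lam*||X_S||_1 + ||X_L||_*."""
    X_S = np.zeros_like(Y)
    X_L = np.zeros_like(Y)
    for _ in range(n_iter):
        X_S = soft(Y - X_L, lam * mu)          # exact minimization over X_S with X_L fixed
        X_L = svt(Y - X_S, mu)                 # exact minimization over X_L with X_S fixed
    return X_S, X_L

Each update is the exact minimizer of the convex objective over one block, so the objective value never increases; in practice µ and λ have to be tuned to the noise level and to the expected rank and sparsity.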


Theoretical Analysis

- We want to know under what conditions we can recover X_S and X_L
- The matrix is m × n
- X_S: sparse (spike-like outliers)
  - no more than n_0 outliers per row
  - no more than m_0 outliers per column
- X_L: rank r
  - incoherence: X_L is "flat" – no component is large

- Question: how many outliers per row n_0 and per column m_0 are allowed while still recovering X_S and X_L?
- Partial answer (not completely satisfactory):
  - if the sparsity pattern supp(X_S) is random: exact recovery under the conditions m_0 = O(m) and n_0 = O(n)
  - if the sparsity pattern supp(X_S) is not required to be random: m_0 ≤ c(m/r) and n_0 ≤ c(n/r) for some constant c (r is the rank of X_L)


References

- Statistical Science Special Issue on Sparsity and Regularization, 2012: http://www.imstat.org/sts/future_papers.html

- Structured Sparsity: F. Bach et al.
- General Theoretical Analysis: S. Negahban et al.
- Graphical Models: J. Lafferty et al.
- Nonconvex Methods: C.-H. Zhang and T. Zhang
- ...
