Statistics for high-dimensional data: Group Lasso and additive models
Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012
The Group Lasso (Yuan & Lin, 2006)
the high-dimensional parameter vector is structured into q groups or partitions (known a priori):

G_1, . . . , G_q ⊆ {1, . . . , p}, disjoint, with ∪_g G_g = {1, . . . , p}

corresponding coefficients: β_G = {β_j ; j ∈ G}
Example: categorical covariates
X^(1), . . . , X^(p) are factors (categorical variables), each with 4 levels (e.g. "letters" from DNA)

for encoding a main effect: 3 parameters
for encoding a first-order interaction: 9 parameters
and so on ...

parameterization (e.g. sum contrasts) is structured as follows:
- intercept: no penalty
- main effect of X^(1): group G_1 with df = 3
- main effect of X^(2): group G_2 with df = 3
- ...
- first-order interaction of X^(1) and X^(2): G_{p+1} with df = 9
- ...
often, we want sparsity on the group level:
either all parameters of an effect are zero or none are
this can be achieved with the Group-Lasso penalty
λ ∑_{g=1}^{q} m_g ‖β_{G_g}‖₂

typically m_g = √|G_g|
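As a minimal illustration (not from the slides), the penalty value for a given coefficient vector and grouping can be computed directly; the coefficient vector and grouping below are hypothetical:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Compute lam * sum_g m_g * ||beta_{G_g}||_2 with m_g = sqrt(|G_g|)."""
    total = 0.0
    for g in groups:                      # each g is a list of 0-based indices
        m_g = np.sqrt(len(g))             # default weight m_g = sqrt(|G_g|)
        total += m_g * np.linalg.norm(beta[g])
    return lam * total

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
groups = [[0, 1], [2, 3], [4]]            # disjoint groups covering all coordinates
print(group_lasso_penalty(beta, groups, lam=0.5))
```

Note that a group whose coefficients are all zero contributes nothing, which is exactly the group-level sparsity mechanism.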
properties of the Group-Lasso penalty:
- for group sizes |G_g| ≡ 1: standard Lasso penalty
- convex penalty ⇒ convex optimization for standard likelihoods (exponential family models)
- either (β̂_G(λ))_j = 0 for all j ∈ G, or ≠ 0 for all j ∈ G
- the penalty is invariant under orthonormal transformations within groups, e.g. invariant when requiring an orthonormal parameterization for factors
DNA splice site detection: (mainly) a prediction problem

DNA sequence around a potential donor site:
. . . ACGGC . . . E E E GC I I I I . . . AAC . . .
(3 exon positions, the GC dinucleotide, 4 intron positions)
response Y ∈ {0,1}: splice or non-splice site
predictor variables: 7 factors, each having 4 levels
(full dimension: 4^7 = 16'384)

data:
training: 5'610 true splice sites and 5'610 non-splice sites, plus an unbalanced validation set
test data: 4'208 true splice sites and 89'717 non-splice sites
logistic regression:
log( p(x) / (1 − p(x)) ) = β₀ + main effects + first-order interactions + . . .

up to second-order interactions: 1156 parameters
use the Group Lasso, which selects whole terms
[Figure: ℓ2-norms of the estimated coefficient groups per term, for GL, GL/R and GL/MLE; top panel: main effects 1–7 and first-order interactions 1:2–6:7, bottom panel: second-order interactions 1:2:3–5:6:7]
- mainly neighboring DNA positions show interactions (has been "known" and "debated")
- no interactions between exon and intron positions (with the Group Lasso method)
- no second-order interactions (with the Group Lasso method)

predictive power: competitive with "state of the art" maximum entropy modeling from Yeo and Burge (2004)
correlation between true and predicted class:

Logistic Group Lasso            0.6593
max. entropy (Yeo and Burge)    0.6589
our model (not necessarily the method/algorithm) is simple and has a clear interpretation
Generalized group Lasso penalty
λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ A_j β_{G_j} ),

where the A_j are T_j × T_j positive definite matrices
⇒ generalized group Lasso estimator:

β̂ = argmin_β ( ‖Y − Xβ‖₂²/n + λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ A_j β_{G_j} ) )

reparameterize:

β̃_{G_j} = A_j^{1/2} β_{G_j},   X̃_{G_j} = X_{G_j} A_j^{−1/2}

then the solution is recovered via β̂_{G_j} = A_j^{−1/2} β̃̂_{G_j}, where β̃̂ solves the standard group Lasso problem

β̃̂ = argmin_{β̃} ( ‖Y − X̃β̃‖₂²/n + λ ∑_{j=1}^{q} m_j ‖β̃_{G_j}‖₂ )
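The reparameterization works because β_{G_j}ᵀ A_j β_{G_j} = ‖A_j^{1/2} β_{G_j}‖₂². A quick numerical check of this identity, with a hypothetical random positive definite matrix standing in for A_j:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
M = rng.standard_normal((T, T))
A = M @ M.T + T * np.eye(T)          # positive definite stand-in for A_j
beta = rng.standard_normal(T)

# symmetric square root A^{1/2} via eigendecomposition
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T

lhs = np.sqrt(beta @ A @ beta)       # generalized group Lasso penalty term
rhs = np.linalg.norm(A_half @ beta)  # standard group Lasso term after reparameterization
print(lhs, rhs)
```

So any solver for the standard group Lasso can handle the generalized penalty after this change of variables.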
Groupwise prediction penalty and parameterization invariance
λ ∑_{j=1}^{q} m_j ‖X_{G_j} β_{G_j}‖₂ = λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ X_{G_j}ᵀ X_{G_j} β_{G_j} )

is a generalized group Lasso penalty if the X_{G_j}ᵀ X_{G_j} are positive definite (i.e. necessarily |G_j| ≤ n)

this penalty is invariant under any invertible transformation within groups, i.e. we can use β̃_{G_j} = B_j β_{G_j} where B_j is any T_j × T_j invertible matrix; then

X_{G_j} β̂_{G_j} = X̃_{G_j} β̃̂_{G_j},   {j ; β̂_{G_j} ≠ 0} = {j ; β̃̂_{G_j} ≠ 0}
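The invariance is immediate since X̃_{G_j} β̃_{G_j} = X_{G_j} B_j^{−1} B_j β_{G_j} = X_{G_j} β_{G_j}; a short numerical check with a hypothetical random group design and transformation B_j:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 20, 3
Xg = rng.standard_normal((n, T))                   # design columns of one group
beta_g = rng.standard_normal(T)
B = rng.standard_normal((T, T)) + 3 * np.eye(T)    # invertible within-group transformation

Xg_tilde = Xg @ np.linalg.inv(B)                   # transformed design
beta_tilde = B @ beta_g                            # transformed coefficients

# the fitted group contribution, hence the prediction penalty term, is unchanged
print(np.linalg.norm(Xg @ beta_g), np.linalg.norm(Xg_tilde @ beta_tilde))
```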
Some aspects from theory, "again":
- optimal prediction and estimation (oracle inequality)
- group screening: Ŝ ⊇ S₀ (the set of active groups) with high probability

but listen to Sara in ≈ "a few" minutes
interesting case:
- the G_j's are "large"
- the β_{G_j}'s are "smooth"

example: high-dimensional additive model

Y = ∑_{j=1}^{p} f_j(X^(j)) + ε

and expand f_j(x^(j)) = ∑_{k=1}^{n} β_k^(j) B_k^(j)(x^(j)), where β_k^(j) = (β_{G_j})_k and the B_k^(j) are basis functions

f_j(·) smooth ⇒ "smoothness" of β_{G_j}
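A sketch of the basis-expansion step: each original variable becomes one coefficient group in an expanded design matrix. The monomial basis below is a toy choice for illustration only; in practice one would use e.g. B-splines:

```python
import numpy as np

def expand_additive(X, K):
    """Expand each column X^(j) into K basis functions (here powers x, x^2, ..., x^K).
    Returns the expanded design H and index groups G_j, one group per variable."""
    n, p = X.shape
    cols, groups = [], []
    for j in range(p):
        groups.append(list(range(j * K, (j + 1) * K)))
        for k in range(1, K + 1):
            cols.append(X[:, j] ** k)     # B_k^(j)(x) = x^k (toy basis)
    return np.column_stack(cols), groups

X = np.random.default_rng(1).standard_normal((10, 3))
H, groups = expand_additive(X, K=4)
print(H.shape, groups)
```

Running the Group Lasso on H with these groups then sets whole functions f_j to zero, not individual basis coefficients.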
Computation and KKT
criterion function:

Q_λ(β) = n^{−1} ∑_{i=1}^{n} ρ_β(x_i, Y_i) + λ ∑_{g=1}^{q} m_g ‖β_{G_g}‖₂,

where the loss function ρ_β(·,·) is convex in β

KKT conditions:

∇ρ(β̂)_{G_g} + λ m_g β̂_{G_g} / ‖β̂_{G_g}‖₂ = 0   if β̂_{G_g} ≠ 0 (not the 0-vector),
‖∇ρ(β̂)_{G_g}‖₂ ≤ λ m_g   if β̂_{G_g} ≡ 0.
Block coordinate descent algorithm

generic description for both Lasso and Group-Lasso problems:
- cycle through all coordinates j = 1, . . . , p, 1, 2, . . . (Lasso) or all groups j = 1, . . . , q, 1, 2, . . . (Group Lasso)
- optimize the penalized log-likelihood w.r.t. β_j (or β_{G_j}), keeping all other coefficients β_k, k ≠ j (or β_{G_k}, k ≠ j) fixed

Lasso: cycle through (β₁, β₂, . . . , β_p), updating one coordinate at a time while all others are held at their current values

Group Lasso: cycle through (β_{G_1}, β_{G_2}, . . . , β_{G_q}), updating one group at a time in the same fashion
for the Gaussian log-likelihood (squared error loss): blockwise updates are easy and closed-form solutions exist (use KKT)

for other loss functions (e.g. logistic loss): blockwise updates have no closed-form solution

⇒ a fast strategy: improve every coordinate/group numerically, but not until numerical convergence (using a quadratic approximation of the log-likelihood for improving/optimizing a single block)

and further tricks ... (still allowing provable numerical convergence)
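For the squared error loss with groupwise orthonormalized design (n^{−1} X_gᵀ X_g = I), the blockwise update implied by the KKT conditions is groupwise soft-thresholding. A compact sketch on hypothetical toy data (not the grplasso implementation):

```python
import numpy as np

def group_lasso_bcd(X, Y, groups, lam, n_cycles=500):
    """Block coordinate descent for ||Y - X b||_2^2 / n + lam * sum_g m_g ||b_g||_2.
    Assumes the design is groupwise orthonormalized: n^{-1} X_g^T X_g = I."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_cycles):
        for g in groups:
            m_g = np.sqrt(len(g))
            r = Y - X @ beta + X[:, g] @ beta[g]      # partial residual for group g
            z = X[:, g].T @ r / n
            nz = np.linalg.norm(z)
            # closed-form blockwise minimizer from the KKT conditions:
            # the whole group is zero iff 2 * ||z|| <= lam * m_g
            beta[g] = 0.0 if nz <= lam * m_g / 2 else (1 - lam * m_g / (2 * nz)) * z
    return beta

# hypothetical toy data: 3 groups of size 2, only group 0 truly active
rng = np.random.default_rng(0)
n, groups = 50, [[0, 1], [2, 3], [4, 5]]
X = np.column_stack([np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, 2)))[0]
                     for _ in groups])
Y = X[:, [0, 1]] @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(n)

lam = 0.5
beta_hat = group_lasso_bcd(X, Y, groups, lam)
grad = -2 * X.T @ (Y - X @ beta_hat) / n              # loss gradient at the solution
print([np.linalg.norm(beta_hat[g]) for g in groups])
```

At convergence the solution satisfies the KKT conditions from the previous slide: zero groups have ‖∇ρ(β̂)_{G_g}‖₂ ≤ λ m_g, nonzero groups have a vanishing subgradient.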
How fast?

logistic case: p = 10^6, n = 100
group size = 20, sparsity: 2 active groups = 40 parameters

for 10 different λ-values:
CPU time using grplasso: 203.16 seconds ≈ 3.5 minutes
(dual-core processor with 2.6 GHz and 32 GB RAM)

today we can easily deal with numbers of predictors in the millions, i.e. p ≈ 10^6 – 10^7
The sparsity-smoothness penalty (SSP)
(whose corresponding optimization becomes again a Group-Lasso problem ...)

for additive modeling in high dimensions:

Y_i = ∑_{j=1}^{p} f_j(x_i^(j)) + ε_i   (i = 1, . . . , n)

with f_j : R → R smooth univariate functions and p ≫ n

in principle: basis expansion for every f_j(·) with basis functions h_{j,1}, . . . , h_{j,K} (j = 1, . . . , p), where K = O(n) (or e.g. K = O(n^{1/2}))
⇒ represent

∑_{j=1}^{p} f_j(x^(j)) = ∑_{j=1}^{p} ∑_{k=1}^{K} β_{j,k} h_{j,k}(x^(j))

⇒ a high-dimensional parametric problem

and use the Group-Lasso penalty to ensure sparsity of whole functions:

λ ∑_{j=1}^{p} ‖β_j‖₂,   β_j := (β_{j,1}, . . . , β_{j,K})ᵀ
drawback: this does not exploit smoothness (except when choosing an appropriate K, which is "bad" if different f_j's have different complexity)

when using a large number of basis functions (large K) to achieve a high degree of flexibility ⇒ need additional control for smoothness
Sparsity-Smoothness Penalty (SSP)
λ₁ ∑_{j=1}^{p} ‖f_j‖_n + λ₂ ∑_{j=1}^{p} I(f_j)

where ‖f_j‖_n = ‖H_j β_j‖₂/√n with f_j = (f_j(X_1^(j)), . . . , f_j(X_n^(j)))ᵀ, and

I²(f_j) = ∫ (f_j''(x))² dx = β_jᵀ W_j β_j,   (W_j)_{k,ℓ} = ∫ h_{j,k}''(x) h_{j,ℓ}''(x) dx

⇒ the SSP penalty does variable selection (f̂_j ≡ 0 for some j)
Orthogonal basis and diagonal smoothing matrices
assume n^{−1} H_jᵀ H_j = I and W_j ≡ diag(d₁², . . . , d_K²) := D², with d_k = k^m (m > 1/2)

then the penalty becomes

λ₁ ∑_{j=1}^{p} ‖β_j‖₂ + λ₂ ∑_{j=1}^{p} ‖Dβ_j‖₂

⇒

β̂(λ₁, λ₂) = argmin_β ‖Y − ∑_{j=1}^{p} H_j β_j‖₂²/n + λ₁ ∑_{j=1}^{p} ‖β_j‖₂ + λ₂ ∑_{j=1}^{p} ‖Dβ_j‖₂
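With d_k = k^m, the combined penalty is easy to evaluate for given coefficient groups; a toy sketch (the coefficient values and the choice m = 2 are arbitrary illustrations):

```python
import numpy as np

def ssp_penalty(betas, lam1, lam2, m=2):
    """lam1 * sum_j ||beta_j||_2 + lam2 * sum_j ||D beta_j||_2, D = diag(1^m, ..., K^m)."""
    K = len(betas[0])
    d = np.arange(1, K + 1) ** float(m)   # d_k = k^m: high-frequency coefficients cost more
    return (lam1 * sum(np.linalg.norm(b) for b in betas)
            + lam2 * sum(np.linalg.norm(d * b) for b in betas))

betas = [np.array([1.0, 0.5, 0.1]), np.zeros(3)]
print(ssp_penalty(betas, lam1=1.0, lam2=0.1))
```

The λ₁ term drives whole functions to zero; the λ₂ term shrinks high-order (wiggly) basis coefficients harder, which is the smoothness control.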
the difficulty is the computation, although this is still a convex optimization problem
see Section 5.3.3 in the book
A modified SSP-penalty
λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )
for additive modeling:
f̂₁, . . . , f̂_p = argmin_{f₁,...,f_p} ‖Y − ∑_{j=1}^{p} f_j‖₂² + λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )

assuming f_j is twice differentiable ⇒ the solution is a natural cubic spline with knots at the X_i^(j)
⇒ finite-dimensional parameterization with e.g. B-splines:

f = ∑_{j=1}^{p} f_j,   f_j = B_j β_j (B_j the B-spline design matrix)
penalty becomes:
λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )
= λ₁ ∑_{j=1}^{p} √( β_jᵀ Σ_j β_j + λ₂ β_jᵀ W_j β_j ),   Σ_j := B_jᵀ B_j, W_j the matrix of integrated second derivatives
= λ₁ ∑_{j=1}^{p} √( β_jᵀ (Σ_j + λ₂ W_j) β_j ),   A_j = A_j(λ₂) := Σ_j + λ₂ W_j

⇒ re-parameterize β̃_j = β̃_j(λ₂) = R_j β_j with R_jᵀ R_j = A_j = A_j(λ₂) (Cholesky); the penalty becomes

λ₁ ∑_{j=1}^{p} ‖β̃_j‖₂   (depending on λ₂ through the re-parameterization)

i.e., a Group-Lasso penalty
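The Cholesky step can be checked numerically: with R_jᵀ R_j = A_j(λ₂), the square-root penalty term equals a plain ℓ₂-norm of β̃_j = R_j β_j. The matrices below are random stand-ins for B_j and W_j:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
B = rng.standard_normal((12, K))     # stand-in for the B-spline design matrix B_j
Sigma = B.T @ B                      # Sigma_j = B_j^T B_j
W = np.diag(np.arange(1.0, K + 1))   # stand-in for the integrated-second-derivative matrix W_j
lam2 = 0.3
A = Sigma + lam2 * W                 # A_j(lam2), positive definite

R = np.linalg.cholesky(A).T          # upper-triangular R with R^T R = A
beta = rng.standard_normal(K)
beta_tilde = R @ beta                # re-parameterized coefficients

print(np.sqrt(beta @ A @ beta), np.linalg.norm(beta_tilde))
```

So for each fixed λ₂, the modified SSP problem is an ordinary Group-Lasso problem in the β̃_j.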
HIF1α motif additive regression
for finding HIF1α transcription factor binding sites on DNA sequences

n = 287, p = 196

the additive model with SSP has ≈ 20% better prediction performance than the linear model with the Lasso

bootstrap stability analysis: select the variables (functions) which occur in at least 50% of all bootstrap runs ⇒ only 2 stable variables/candidate motifs remain
[Figure: estimated partial effects for the two stable candidate motifs, Motif.P1.6.23 (left panel) and Motif.P1.6.26 (right panel)]
- right panel: the variable corresponds to a true, known motif
- variable/motif corresponding to the left panel: good additional support for relevance (nearness to the transcriptional start-site of important genes, ...)