Statistics for high-dimensional data: Group Lasso and additive models
Peter Bühlmann and Sara van de Geer
Seminar für Statistik, ETH Zürich
May 2012
The Group Lasso (Yuan & Lin, 2006)
the high-dimensional parameter vector is structured into q groups or partitions (known a priori):

G_1, . . . , G_q ⊆ {1, . . . , p}, disjoint, with ∪_g G_g = {1, . . . , p}

corresponding coefficients: β_G = {β_j ; j ∈ G}
Example: categorical covariates
X^(1), . . . , X^(p) are factors (categorical variables), each with 4 levels (e.g. "letters" from DNA)

for encoding a main effect: 3 parameters
for encoding a first-order interaction: 9 parameters
and so on ...

parameterization (e.g. sum contrasts) is structured as follows:
- intercept: no penalty
- main effect of X^(1): group G_1 with df = 3
- main effect of X^(2): group G_2 with df = 3
- ...
- first-order interaction of X^(1) and X^(2): G_{p+1} with df = 9
- ...
often, we want sparsity on the group level:
either all parameters of an effect are zero or none are
this can be achieved with the Group-Lasso penalty
λ ∑_{g=1}^{q} m_g ‖β_{G_g}‖₂

typically m_g = √|G_g|
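As a minimal illustration (not from the slides), the penalty value for a given coefficient vector and grouping can be computed directly; the coefficient vector and grouping below are hypothetical:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Compute lam * sum_g m_g * ||beta_{G_g}||_2 with m_g = sqrt(|G_g|)."""
    total = 0.0
    for g in groups:                      # each g is a list of 0-based indices
        m_g = np.sqrt(len(g))             # default weight m_g = sqrt(|G_g|)
        total += m_g * np.linalg.norm(beta[g])
    return lam * total

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
groups = [[0, 1], [2, 3], [4]]            # disjoint groups covering all coordinates
print(group_lasso_penalty(beta, groups, lam=0.5))
```

Note that a group whose coefficients are all zero contributes nothing, which is exactly the group-level sparsity mechanism.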
properties of the Group-Lasso penalty:
- for group sizes |G_g| ≡ 1: standard Lasso penalty
- convex penalty ⇒ convex optimization for standard likelihoods (exponential family models)
- either (β̂_G(λ))_j = 0 for all j ∈ G, or ≠ 0 for all j ∈ G
- the penalty is invariant under orthonormal transformations within groups, e.g. invariant when requiring an orthonormal parameterization for factors
DNA splice site detection: (mainly) a prediction problem

DNA sequence around a potential donor site:
. . . ACGGC . . . E E E GC I I I I . . . AAC . . .
(3 exon positions, the GC dinucleotide, 4 intron positions)
response Y ∈ {0,1}: splice or non-splice site
predictor variables: 7 factors, each having 4 levels
(full dimension: 4^7 = 16'384)

data:
training: 5'610 true splice sites and 5'610 non-splice sites, plus an unbalanced validation set
test data: 4'208 true splice sites and 89'717 non-splice sites
logistic regression:
log( p(x) / (1 − p(x)) ) = β₀ + main effects + first-order interactions + . . .

up to second-order interactions: 1156 parameters
use the Group Lasso, which selects whole terms
[Figure: ℓ2-norms of the estimated coefficient groups per term, for GL, GL/R and GL/MLE; top panel: main effects 1–7 and first-order interactions 1:2–6:7, bottom panel: second-order interactions 1:2:3–5:6:7]
- mainly neighboring DNA positions show interactions (has been "known" and "debated")
- no interactions between exon and intron positions (with the Group Lasso method)
- no second-order interactions (with the Group Lasso method)

predictive power: competitive with "state of the art" maximum entropy modeling from Yeo and Burge (2004)
correlation between true and predicted class:

Logistic Group Lasso            0.6593
max. entropy (Yeo and Burge)    0.6589
our model (not necessarily the method/algorithm) is simple and has a clear interpretation
Generalized group Lasso penalty
λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ A_j β_{G_j} ),

where the A_j are T_j × T_j positive definite matrices
⇒ generalized group Lasso estimator:

β̂ = argmin_β ( ‖Y − Xβ‖₂²/n + λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ A_j β_{G_j} ) )

reparameterize:

β̃_{G_j} = A_j^{1/2} β_{G_j},   X̃_{G_j} = X_{G_j} A_j^{−1/2}

then the solution is recovered via β̂_{G_j} = A_j^{−1/2} β̃̂_{G_j}, where β̃̂ solves the standard group Lasso problem

β̃̂ = argmin_{β̃} ( ‖Y − X̃β̃‖₂²/n + λ ∑_{j=1}^{q} m_j ‖β̃_{G_j}‖₂ )
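The reparameterization works because β_{G_j}ᵀ A_j β_{G_j} = ‖A_j^{1/2} β_{G_j}‖₂². A quick numerical check of this identity, with a hypothetical random positive definite matrix standing in for A_j:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
M = rng.standard_normal((T, T))
A = M @ M.T + T * np.eye(T)          # positive definite stand-in for A_j
beta = rng.standard_normal(T)

# symmetric square root A^{1/2} via eigendecomposition
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T

lhs = np.sqrt(beta @ A @ beta)       # generalized group Lasso penalty term
rhs = np.linalg.norm(A_half @ beta)  # standard group Lasso term after reparameterization
print(lhs, rhs)
```

So any solver for the standard group Lasso can handle the generalized penalty after this change of variables.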
Groupwise prediction penalty and parameterization invariance
λ ∑_{j=1}^{q} m_j ‖X_{G_j} β_{G_j}‖₂ = λ ∑_{j=1}^{q} m_j √( β_{G_j}ᵀ X_{G_j}ᵀ X_{G_j} β_{G_j} )

is a generalized group Lasso penalty if the X_{G_j}ᵀ X_{G_j} are positive definite (i.e. necessarily |G_j| ≤ n)

this penalty is invariant under any invertible transformation within groups, i.e. we can use β̃_{G_j} = B_j β_{G_j} where B_j is any T_j × T_j invertible matrix; then

X_{G_j} β̂_{G_j} = X̃_{G_j} β̃̂_{G_j},   {j ; β̂_{G_j} ≠ 0} = {j ; β̃̂_{G_j} ≠ 0}
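The invariance is immediate since X̃_{G_j} β̃_{G_j} = X_{G_j} B_j^{−1} B_j β_{G_j} = X_{G_j} β_{G_j}; a short numerical check with a hypothetical random group design and transformation B_j:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 20, 3
Xg = rng.standard_normal((n, T))                   # design columns of one group
beta_g = rng.standard_normal(T)
B = rng.standard_normal((T, T)) + 3 * np.eye(T)    # invertible within-group transformation

Xg_tilde = Xg @ np.linalg.inv(B)                   # transformed design
beta_tilde = B @ beta_g                            # transformed coefficients

# the fitted group contribution, hence the prediction penalty term, is unchanged
print(np.linalg.norm(Xg @ beta_g), np.linalg.norm(Xg_tilde @ beta_tilde))
```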
Some aspects from theory, "again":
- optimal prediction and estimation (oracle inequality)
- group screening: Ŝ ⊇ S₀ (the set of active groups) with high probability

but listen to Sara in ≈ "a few" minutes
interesting case:
- the G_j's are "large"
- the β_{G_j}'s are "smooth"

example: high-dimensional additive model

Y = ∑_{j=1}^{p} f_j(X^(j)) + ε

and expand f_j(x^(j)) = ∑_{k=1}^{n} β_k^(j) B_k^(j)(x^(j)), where β_k^(j) = (β_{G_j})_k and the B_k^(j) are basis functions

f_j(·) smooth ⇒ "smoothness" of β_{G_j}
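A sketch of the basis-expansion step: each original variable becomes one coefficient group in an expanded design matrix. The monomial basis below is a toy choice for illustration only; in practice one would use e.g. B-splines:

```python
import numpy as np

def expand_additive(X, K):
    """Expand each column X^(j) into K basis functions (here powers x, x^2, ..., x^K).
    Returns the expanded design H and index groups G_j, one group per variable."""
    n, p = X.shape
    cols, groups = [], []
    for j in range(p):
        groups.append(list(range(j * K, (j + 1) * K)))
        for k in range(1, K + 1):
            cols.append(X[:, j] ** k)     # B_k^(j)(x) = x^k (toy basis)
    return np.column_stack(cols), groups

X = np.random.default_rng(1).standard_normal((10, 3))
H, groups = expand_additive(X, K=4)
print(H.shape, groups)
```

Running the Group Lasso on H with these groups then sets whole functions f_j to zero, not individual basis coefficients.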
Computation and KKT
criterion function:

Q_λ(β) = n^{−1} ∑_{i=1}^{n} ρ_β(x_i, Y_i) + λ ∑_{g=1}^{q} m_g ‖β_{G_g}‖₂,

where the loss function ρ_β(·,·) is convex in β

KKT conditions:

∇ρ(β̂)_{G_g} + λ m_g β̂_{G_g} / ‖β̂_{G_g}‖₂ = 0   if β̂_{G_g} ≠ 0 (not the 0-vector),
‖∇ρ(β̂)_{G_g}‖₂ ≤ λ m_g   if β̂_{G_g} ≡ 0.
Block coordinate descent algorithm

generic description for both Lasso and Group-Lasso problems:
- cycle through all coordinates j = 1, . . . , p, 1, 2, . . . (Lasso) or all groups j = 1, . . . , q, 1, 2, . . . (Group Lasso)
- optimize the penalized log-likelihood w.r.t. β_j (or β_{G_j}), keeping all other coefficients β_k, k ≠ j (or β_{G_k}, k ≠ j) fixed

Lasso: cycle through (β₁, β₂, . . . , β_p), updating one coordinate at a time while all others are held at their current values

Group Lasso: cycle through (β_{G_1}, β_{G_2}, . . . , β_{G_q}), updating one group at a time in the same fashion
for the Gaussian log-likelihood (squared error loss): blockwise updates are easy and closed-form solutions exist (use KKT)

for other loss functions (e.g. logistic loss): blockwise updates have no closed-form solution

⇒ a fast strategy: improve every coordinate/group numerically, but not until numerical convergence (using a quadratic approximation of the log-likelihood for improving/optimizing a single block)

and further tricks ... (still allowing provable numerical convergence)
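For the squared error loss with groupwise orthonormalized design (n^{−1} X_gᵀ X_g = I), the blockwise update implied by the KKT conditions is groupwise soft-thresholding. A compact sketch on hypothetical toy data (not the grplasso implementation):

```python
import numpy as np

def group_lasso_bcd(X, Y, groups, lam, n_cycles=500):
    """Block coordinate descent for ||Y - X b||_2^2 / n + lam * sum_g m_g ||b_g||_2.
    Assumes the design is groupwise orthonormalized: n^{-1} X_g^T X_g = I."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_cycles):
        for g in groups:
            m_g = np.sqrt(len(g))
            r = Y - X @ beta + X[:, g] @ beta[g]      # partial residual for group g
            z = X[:, g].T @ r / n
            nz = np.linalg.norm(z)
            # closed-form blockwise minimizer from the KKT conditions:
            # the whole group is zero iff 2 * ||z|| <= lam * m_g
            beta[g] = 0.0 if nz <= lam * m_g / 2 else (1 - lam * m_g / (2 * nz)) * z
    return beta

# hypothetical toy data: 3 groups of size 2, only group 0 truly active
rng = np.random.default_rng(0)
n, groups = 50, [[0, 1], [2, 3], [4, 5]]
X = np.column_stack([np.sqrt(n) * np.linalg.qr(rng.standard_normal((n, 2)))[0]
                     for _ in groups])
Y = X[:, [0, 1]] @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(n)

lam = 0.5
beta_hat = group_lasso_bcd(X, Y, groups, lam)
grad = -2 * X.T @ (Y - X @ beta_hat) / n              # loss gradient at the solution
print([np.linalg.norm(beta_hat[g]) for g in groups])
```

At convergence the solution satisfies the KKT conditions from the previous slide: zero groups have ‖∇ρ(β̂)_{G_g}‖₂ ≤ λ m_g, nonzero groups have a vanishing subgradient.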
How fast?

logistic case: p = 10^6, n = 100
group size = 20, sparsity: 2 active groups = 40 parameters

for 10 different λ-values:
CPU time using grplasso: 203.16 seconds ≈ 3.5 minutes
(dual-core processor with 2.6 GHz and 32 GB RAM)

today we can easily deal with numbers of predictors in the millions, i.e. p ≈ 10^6 – 10^7
The sparsity-smoothness penalty (SSP)
(whose corresponding optimization becomes again a Group-Lasso problem ...)

for additive modeling in high dimensions:

Y_i = ∑_{j=1}^{p} f_j(x_i^(j)) + ε_i   (i = 1, . . . , n)

with f_j : R → R smooth univariate functions and p ≫ n

in principle: basis expansion for every f_j(·) with basis functions h_{j,1}, . . . , h_{j,K} (j = 1, . . . , p), where K = O(n) (or e.g. K = O(n^{1/2}))
⇒ represent

∑_{j=1}^{p} f_j(x^(j)) = ∑_{j=1}^{p} ∑_{k=1}^{K} β_{j,k} h_{j,k}(x^(j))

⇒ a high-dimensional parametric problem

and use the Group-Lasso penalty to ensure sparsity of whole functions:

λ ∑_{j=1}^{p} ‖β_j‖₂,   β_j := (β_{j,1}, . . . , β_{j,K})ᵀ
drawback: this does not exploit smoothness (except when choosing an appropriate K, which is "bad" if different f_j's have different complexity)

when using a large number of basis functions (large K) to achieve a high degree of flexibility ⇒ need additional control for smoothness
Sparsity-Smoothness Penalty (SSP)
λ₁ ∑_{j=1}^{p} ‖f_j‖_n + λ₂ ∑_{j=1}^{p} I(f_j)

where ‖f_j‖_n = ‖H_j β_j‖₂/√n with f_j = (f_j(X_1^(j)), . . . , f_j(X_n^(j)))ᵀ, and

I²(f_j) = ∫ (f_j''(x))² dx = β_jᵀ W_j β_j,   (W_j)_{k,ℓ} = ∫ h_{j,k}''(x) h_{j,ℓ}''(x) dx

⇒ the SSP penalty does variable selection (f̂_j ≡ 0 for some j)
Orthogonal basis and diagonal smoothing matrices
assume n^{−1} H_jᵀ H_j = I and W_j ≡ diag(d₁², . . . , d_K²) := D², with d_k = k^m (m > 1/2)

then the penalty becomes

λ₁ ∑_{j=1}^{p} ‖β_j‖₂ + λ₂ ∑_{j=1}^{p} ‖Dβ_j‖₂

⇒

β̂(λ₁, λ₂) = argmin_β ‖Y − ∑_{j=1}^{p} H_j β_j‖₂²/n + λ₁ ∑_{j=1}^{p} ‖β_j‖₂ + λ₂ ∑_{j=1}^{p} ‖Dβ_j‖₂
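With d_k = k^m, the combined penalty is easy to evaluate for given coefficient groups; a toy sketch (the coefficient values and the choice m = 2 are arbitrary illustrations):

```python
import numpy as np

def ssp_penalty(betas, lam1, lam2, m=2):
    """lam1 * sum_j ||beta_j||_2 + lam2 * sum_j ||D beta_j||_2, D = diag(1^m, ..., K^m)."""
    K = len(betas[0])
    d = np.arange(1, K + 1) ** float(m)   # d_k = k^m: high-frequency coefficients cost more
    return (lam1 * sum(np.linalg.norm(b) for b in betas)
            + lam2 * sum(np.linalg.norm(d * b) for b in betas))

betas = [np.array([1.0, 0.5, 0.1]), np.zeros(3)]
print(ssp_penalty(betas, lam1=1.0, lam2=0.1))
```

The λ₁ term drives whole functions to zero; the λ₂ term shrinks high-order (wiggly) basis coefficients harder, which is the smoothness control.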
the difficulty is the computation, although this is still a convex optimization problem
see Section 5.3.3 in the book
A modified SSP-penalty
λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )
for additive modeling:
f̂₁, . . . , f̂_p = argmin_{f₁,...,f_p} ‖Y − ∑_{j=1}^{p} f_j‖₂² + λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )

assuming f_j is twice differentiable ⇒ the solution is a natural cubic spline with knots at the X_i^(j)
⇒ finite-dimensional parameterization with e.g. B-splines:

f = ∑_{j=1}^{p} f_j,   f_j = B_j β_j (B_j the B-spline design matrix)
penalty becomes:
λ₁ ∑_{j=1}^{p} √( ‖f_j‖₂² + λ₂ I²(f_j) )
= λ₁ ∑_{j=1}^{p} √( β_jᵀ Σ_j β_j + λ₂ β_jᵀ W_j β_j ),   Σ_j := B_jᵀ B_j, W_j the matrix of integrated second derivatives
= λ₁ ∑_{j=1}^{p} √( β_jᵀ (Σ_j + λ₂ W_j) β_j ),   A_j = A_j(λ₂) := Σ_j + λ₂ W_j

⇒ re-parameterize β̃_j = β̃_j(λ₂) = R_j β_j with R_jᵀ R_j = A_j = A_j(λ₂) (Cholesky); the penalty becomes

λ₁ ∑_{j=1}^{p} ‖β̃_j‖₂   (depending on λ₂ through the re-parameterization)

i.e., a Group-Lasso penalty
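The Cholesky step can be checked numerically: with R_jᵀ R_j = A_j(λ₂), the square-root penalty term equals a plain ℓ₂-norm of β̃_j = R_j β_j. The matrices below are random stand-ins for B_j and W_j:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
B = rng.standard_normal((12, K))     # stand-in for the B-spline design matrix B_j
Sigma = B.T @ B                      # Sigma_j = B_j^T B_j
W = np.diag(np.arange(1.0, K + 1))   # stand-in for the integrated-second-derivative matrix W_j
lam2 = 0.3
A = Sigma + lam2 * W                 # A_j(lam2), positive definite

R = np.linalg.cholesky(A).T          # upper-triangular R with R^T R = A
beta = rng.standard_normal(K)
beta_tilde = R @ beta                # re-parameterized coefficients

print(np.sqrt(beta @ A @ beta), np.linalg.norm(beta_tilde))
```

So for each fixed λ₂, the modified SSP problem is an ordinary Group-Lasso problem in the β̃_j.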
HIF1α motif additive regression
for finding HIF1α transcription factor binding sites on DNA sequences

n = 287, p = 196

the additive model with SSP has ≈ 20% better prediction performance than the linear model with the Lasso

bootstrap stability analysis: select the variables (functions) which occur in at least 50% of all bootstrap runs ⇒ only 2 stable variables/candidate motifs remain
[Figure: estimated partial effects for the two stable candidate motifs, Motif.P1.6.23 (left panel) and Motif.P1.6.26 (right panel)]
- right panel: the variable corresponds to a true, known motif
- variable/motif corresponding to the left panel: good additional support for relevance (nearness to the transcriptional start-site of important genes, ...)