Chapter 2: Lasso for linear models
Statistics for High-Dimensional Data (Bühlmann & van de Geer)

Dec 17, 2015

Transcript
Page 1:

Chapter 2: Lasso for linear models

Statistics for High-Dimensional Data (Bühlmann & van de Geer)

Page 2:

Lasso: proposed by Tibshirani (1996), the Least Absolute Shrinkage and Selection Operator.

Why we still use it: accurate in prediction and variable selection (under certain assumptions) and computationally feasible.

Page 3:

2.2. Introduction and preliminaries

Univariate response Y_i and covariate vector X_i; these can be fixed or random variables.

Typically independence is assumed, but the Lasso can be applied to correlated data.

For simplicity, we can assume the intercept is zero and all covariates are centered and measured on the same scale (standardization).
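As a reference point for the notation used in the rest of the chapter, the underlying linear model (standard form; not written out on the slide) is

\[
Y_i = \sum_{j=1}^{p} \beta_j^0 X_i^{(j)} + \varepsilon_i, \qquad i = 1, \ldots, n,
\]

with errors ε_i of mean zero and variance σ².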

Page 4:

2.2.1. The Lasso estimator

A “shrunken” least squares estimator, setting β_j = 0 for some j.

Convex optimization, hence computationally efficient.

Equivalent to solving a constrained least squares problem (reconstructed below) for some R, with a data-dependent one-to-one correspondence between R and λ.

Estimating the variance σ²: we can use the residual sum of squares and the degrees of freedom of the Lasso, or we can estimate β and σ² simultaneously (Ch. 9).
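A reconstruction of the two displays referred to above (the standard penalized and constrained forms of the Lasso; the scaling of the penalty may differ slightly from the book):

\[
\hat{\beta}(\lambda) = \arg\min_{\beta} \Big( \|Y - X\beta\|_2^2 / n + \lambda \|\beta\|_1 \Big),
\qquad
\hat{\beta}_{\mathrm{constr}}(R) = \arg\min_{\|\beta\|_1 \le R} \|Y - X\beta\|_2^2 / n .
\]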

Page 5:

2.3. Orthonormal design

Suppose p = n and n⁻¹XᵀX = I (the p×p identity).

Then the Lasso has an explicit solution: componentwise soft-thresholding of the ordinary least squares estimates (reconstructed below).
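A reconstruction of the missing display (the standard soft-thresholding formula; the threshold constant λ/2 follows from the scaling of the objective used above):

\[
Z_j = n^{-1} (X^{\top} Y)_j, \qquad
\hat{\beta}_j(\lambda) = \operatorname{sign}(Z_j)\,\big( |Z_j| - \lambda/2 \big)_+, \qquad j = 1, \ldots, p .
\]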

Page 6:

2.4. Prediction

Cross-validation is often used to choose the λ minimizing the squared-error risk; cross-validation can also be used to assess predictive accuracy.

Some asymptotics (more in Ch. 6): for high-dimensional scenarios (allowing p to depend on n), consistency requires a sparsity assumption on the true coefficient vector.

Then the Lasso is consistent for prediction (a reconstruction of the statement follows).
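A hedged reconstruction of the missing consistency statement (the standard form of the result; precise constants and conditions are given in Ch. 6):

\[
\lambda \asymp \sqrt{\log p / n}, \quad \|\beta^0\|_1 = o\big(\sqrt{n / \log p}\,\big)
\;\Longrightarrow\;
\|X(\hat{\beta}(\lambda) - \beta^0)\|_2^2 / n = o_P(1) \qquad (n \to \infty).
\]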

Page 7:

2.5. Variable screening

Under certain assumptions, including (true) model sparsity and conditions on the design matrix (Thm. 7.1), for a suitable range of λ the estimation error ‖β̂(λ) − β⁰‖_q converges to zero, for q = 1, 2 (derivation in Ch. 6).

We can use the Lasso estimate for variable screening:

Variables with non-zero coefficients remain the same across different solutions (more than one solution can occur because the problem need not be strictly convex, e.g. when p > n).

The number of variables estimated as non-zero does not exceed min(n, p).

And the estimated active set contains the true active set with high probability (the screening property, reconstructed below).
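A reconstruction of the screening property in the standard notation (active sets defined via the non-zero coefficients):

\[
\hat{S}(\lambda) = \{ j : \hat{\beta}_j(\lambda) \neq 0 \}, \qquad
S_0 = \{ j : \beta_j^0 \neq 0 \},
\]

and, for a suitable choice of λ, with high probability

\[
\hat{S}(\lambda) \supseteq S_0 .
\]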

Page 8:

Lemma 2.1 and tuning parameter selection

Tuning parameter selection: for prediction a smaller λ is preferred, while a larger penalty is needed for good variable selection (discussed more in Ch. 10-11).

The smaller λ tuned by cross-validation can still be utilized for the screening process.

Page 9:

2.6. Variable selection

AIC and BIC use the ℓ0-norm as a penalty.

Computation is infeasible for any relatively large p, since the objective function using this norm is non-convex.

The set of all Lasso sub-models is denoted by {Ŝ(λ); λ ≥ 0}, where Ŝ(λ) is the set of variables with non-zero Lasso coefficients at penalty level λ.

We want to know whether the true active set S₀ is contained in this collection and, if so, which value of λ will identify S₀.

With a “neighborhood stability” assumption on X (more later) and assuming the non-zero coefficients are sufficiently large, the Lasso recovers S₀ for a suitable λ (a reconstruction follows).
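A hedged reconstruction of the missing displays (a beta-min condition and the selection-consistency statement; the precise rates are given in the book):

\[
\min_{j \in S_0} |\beta_j^0| \gg \sqrt{s_0 \log p / n}, \qquad s_0 = |S_0|,
\]

and, for λ of somewhat larger order than the prediction-optimal choice,

\[
\mathbb{P}\big( \hat{S}(\lambda) = S_0 \big) \to 1 \qquad (n \to \infty).
\]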

Page 10:

2.6.1. Neighborhood stability and irrepresentable condition

For consistency of the Lasso's variable selection, we make assumptions about X.

WLOG, let the first s₀ variables form the active set S₀, and let Σ̂ = n⁻¹XᵀX be partitioned into the blocks corresponding to active and inactive variables.

Then the irrepresentable condition (reconstructed below) is assumed to hold.

This is a sufficient condition for consistency of Lasso model selection; with “≤ 1” on the right-hand side it is the corresponding necessary condition.

It is easier to state than the neighborhood stability condition, but the two conditions are equivalent.

Essentially, consistency fails under too much linear dependence within sub-matrices of X.
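A reconstruction of the irrepresentable condition (standard form, written with the block partition of Σ̂ = n⁻¹XᵀX; the book states it with an explicit constant θ < 1):

\[
\big\| \hat{\Sigma}_{2,1} \, \hat{\Sigma}_{1,1}^{-1} \, \operatorname{sign}\!\big(\beta_{S_0}^0\big) \big\|_{\infty} < 1,
\]

where Σ̂₁,₁ is the s₀×s₀ block for the active variables and Σ̂₂,₁ is the (p−s₀)×s₀ cross block.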

Page 11:

Summary of some Lasso properties (Table 2.2)

Page 12:

2.8. Adaptive Lasso

A two-stage procedure instead of just using the ℓ1-penalty (objective reconstructed below).

The Lasso can be used for the initial estimation, with cross-validation for tuning λ, followed by the same procedure for the second stage.

The adaptive Lasso gives a small penalty to those β_j whose initial estimates have large magnitude.
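A reconstruction of the adaptive Lasso objective (standard form, with an initial estimator β̂_init such as the Lasso; the scaling of the penalty may differ slightly from the book):

\[
\hat{\beta}_{\mathrm{adapt}}(\lambda) = \arg\min_{\beta}
\Big( \|Y - X\beta\|_2^2 / n + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat{\beta}_{\mathrm{init},j}|} \Big),
\]

with the convention that variables with β̂_init,j = 0 are excluded (infinite weight).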

Page 13:

2.8.1. Illustration: simulated data

1000 variables, 3 of which carry true signal; “medium-sized” signal-to-noise ratio.

Both methods selected the active set, but the adaptive Lasso selects only 10 noise variables as opposed to 41 for the Lasso.

Page 14:

2.8.2. Orthonormal design

For the orthonormal design of Section 2.3 (p = n and n⁻¹XᵀX = I), with the componentwise least squares estimates Z_j as initial estimator, the adaptive Lasso again has an explicit thresholding form (reconstructed below).
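A hedged reconstruction of the closed-form solution (obtained by minimizing the one-dimensional adaptive Lasso criterion with weight 1/|Z_j|; the threshold constant depends on the scaling used in the book):

\[
\hat{\beta}_{\mathrm{adapt},j}(\lambda) = \operatorname{sign}(Z_j)\,\Big( |Z_j| - \frac{\lambda}{2\,|Z_j|} \Big)_+, \qquad Z_j = n^{-1}(X^{\top}Y)_j ,
\]

a thresholding rule that interpolates between soft and hard thresholding.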

Page 15:

2.8.3. The adaptive Lasso: variable selection under weak conditions

For consistency of variable selection of the adaptive Lasso, we need large enough non-zero coefficients (a beta-min type condition; reconstructed below).

But we can assume weaker conditions on the design matrix than the neighborhood stability (irrepresentable) condition required for consistency of the Lasso (more in Ch. 7).

Then the adaptive Lasso consistently selects the true active set.
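A hedged reconstruction of the missing displays (only the shape of the statement; the exact rates and conditions are given in Ch. 7):

\[
\min_{j \in S_0} |\beta_j^0| \ge C\,\zeta_n \quad \text{for a suitable sequence } \zeta_n \to 0,
\qquad
\mathbb{P}\big( \hat{S}_{\mathrm{adapt}}(\lambda) = S_0 \big) \to 1 \qquad (n \to \infty).
\]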

Page 16:

2.8.4. Computation

We can reparameterize the adaptive Lasso into an ordinary Lasso problem by rescaling the columns of X with the initial estimates; the objective and the back-transformation are reconstructed below.

Any algorithm for computing Lasso estimates can then be used to compute the adaptive Lasso (more on algorithms later).
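A reconstruction of the reparameterization (a standard trick; the notation X̃, β̃ is introduced here):

\[
\tilde{X}^{(j)} = |\hat{\beta}_{\mathrm{init},j}|\, X^{(j)}, \qquad
\tilde{\beta}(\lambda) = \arg\min_{\tilde{\beta}} \Big( \|Y - \tilde{X}\tilde{\beta}\|_2^2 / n + \lambda \|\tilde{\beta}\|_1 \Big),
\qquad
\hat{\beta}_{\mathrm{adapt},j}(\lambda) = |\hat{\beta}_{\mathrm{init},j}|\, \tilde{\beta}_j(\lambda).
\]

A minimal sketch in Python, assuming scikit-learn is available (function and variable names are mine; sklearn's Lasso minimizes ‖y − Xw‖²/(2n) + α‖w‖₁, so its α corresponds to λ/2 in the notation above):

import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, alpha, init_coef):
    # Rescale each column of X by the magnitude of its initial estimate.
    w = np.abs(init_coef)
    X_tilde = X * w  # broadcasts over columns
    # Solve an ordinary Lasso in the rescaled parameterization.
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X_tilde, y)
    # Transform back: beta_j = |init_j| * beta_tilde_j.
    # Columns with init_coef_j = 0 are all-zero after rescaling, so their
    # coefficients stay at zero (the "infinite weight" convention).
    return w * fit.coef_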

Page 17:

2.8.5. Multi-step adaptive Lasso

The adaptive idea is iterated over several stages (a sketch of the procedure follows).
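A hedged sketch of the multi-step (weighted) adaptive Lasso, reconstructed from the two-stage adaptive Lasso above (the book gives the exact weighting scheme):

1. Stage m = 1: compute an ordinary Lasso fit β̂^(1).
2. Stage m ≥ 2: solve a weighted Lasso with weights from the previous stage,
\[
\hat{\beta}^{(m)}(\lambda) = \arg\min_{\beta} \Big( \|Y - X\beta\|_2^2 / n + \lambda \sum_{j} w_j^{(m-1)} |\beta_j| \Big),
\qquad w_j^{(m-1)} = 1 / |\hat{\beta}_j^{(m-1)}|,
\]
with w_j^(m−1) = ∞ (variable excluded) if β̂_j^(m−1) = 0.
3. Stop after a small, fixed number of stages; the two-stage version is the adaptive Lasso.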

Page 18:

2.8.6. Non-convex penalty functions

General penalized linear regression replaces the ℓ1-norm by a penalty function pen_λ(·) applied to each |β_j| (reconstructed below).

Example: SCAD (Smoothly Clipped Absolute Deviation).

The SCAD penalty is non-differentiable at zero and non-convex; SCAD is related to the multi-step weighted Lasso.

Another commonly used non-convex penalty is the ℓ_r quasi-norm for r close to zero (Ch. 6-7).
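A reconstruction of the displays (general penalized form, plus the SCAD penalty written through its derivative as in Fan and Li (2001), with tuning constant a > 2):

\[
\hat{\beta} = \arg\min_{\beta} \Big( \|Y - X\beta\|_2^2 / n + \sum_{j=1}^{p} \mathrm{pen}_{\lambda}(|\beta_j|) \Big),
\]

\[
\mathrm{pen}'_{\lambda}(u) = \lambda \Big\{ I(u \le \lambda) + \frac{(a\lambda - u)_+}{(a - 1)\lambda}\, I(u > \lambda) \Big\}, \qquad u > 0 .
\]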

Page 19:

2.9. Thresholding the Lasso

If we want a sparser model, instead of the adaptive Lasso we can threshold the Lasso estimates, keeping only the coefficients whose magnitude exceeds a threshold (reconstructed below).

Then we can refit the model (via OLS) on the variables with non-zero thresholded estimates.

Theoretical properties are as good as or better than those of the adaptive Lasso (more in Ch. 6-7).

The tuning parameters can be chosen sequentially, as in the adaptive Lasso.
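A reconstruction of the two-step procedure (threshold level δ; the notation is introduced here):

\[
\hat{S}_{\mathrm{thres}}(\lambda, \delta) = \{ j : |\hat{\beta}_j(\lambda)| > \delta \}, \qquad
\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta :\, \beta_j = 0 \ \forall j \notin \hat{S}_{\mathrm{thres}}} \|Y - X\beta\|_2^2 .
\]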

Page 20:

2.10. The relaxed Lasso

First stage: all possible sub-models are computed (across λ).

Second stage: the Lasso is re-fitted on the selected sub-model with a smaller penalty (relaxation parameter ϕ ∈ [0, 1]; reconstructed below) to lessen the bias of the Lasso estimation.

Performs similarly to the adaptive Lasso in practice.

ϕ = 0 yields the Lasso-OLS hybrid: Lasso first stage, OLS second stage.
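A hedged reconstruction of the second-stage objective (in the spirit of the relaxed Lasso of Meinshausen (2007); the exact formulation in the book may differ in detail):

\[
\hat{\beta}_{\mathrm{relax}}(\lambda, \phi) = \arg\min_{\beta :\, \beta_j = 0 \ \forall j \notin \hat{S}(\lambda)}
\Big( \|Y - X\beta\|_2^2 / n + \phi\, \lambda \|\beta\|_1 \Big), \qquad \phi \in [0, 1].
\]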

Page 21:

2.11. Degrees of freedom of the Lasso

The hat operator maps the observed values Y to the fitted values Ŷ(λ) = Xβ̂(λ).

Then, from Stein's theory of unbiased risk estimation, the degrees of freedom can be defined through the covariances between fitted and observed values (reconstructed below).

For MLEs, df = the number of parameters; for linear hat operators (such as OLS), df = the trace of the hat matrix.

For the low-dimensional case with rank(X) = p, the number of non-zero Lasso coefficients is an unbiased estimate of the degrees of freedom; this yields the plug-in estimator reconstructed below.

Then the BIC can be used to select λ (also reconstructed below).
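A hedged reconstruction of the definitions and of one common form of the BIC criterion (the book may use an equivalent variant; σ̂² denotes an estimate of the error variance):

\[
\mathrm{df}(\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}\big( \hat{Y}_i(\lambda), Y_i \big),
\qquad
\widehat{\mathrm{df}}(\lambda) = |\hat{S}(\lambda)| ,
\]

\[
\mathrm{BIC}(\lambda) = \frac{\|Y - X\hat{\beta}(\lambda)\|_2^2}{\hat{\sigma}^2} + \log(n)\, \widehat{\mathrm{df}}(\lambda) .
\]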

Page 22:

2.12. Path-following algorithms

We typically want to compute the estimator for many values of λ.

We can compute the entire regularized solution path over all λ, because the solution path is piecewise linear in λ (reconstructed below).

Typically the number of kink points λ_k is O(n).

The modified LARS algorithm (Efron et al., 2004) can be used to construct the whole regularized solution path; it is an exact method.
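A reconstruction of the piecewise-linearity statement (λ_k > λ_{k+1} denote successive kink points of the path):

\[
\hat{\beta}(\lambda) = \hat{\beta}(\lambda_k) + \frac{\lambda_k - \lambda}{\lambda_k - \lambda_{k+1}}
\big( \hat{\beta}(\lambda_{k+1}) - \hat{\beta}(\lambda_k) \big),
\qquad \lambda_{k+1} \le \lambda \le \lambda_k .
\]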

Page 23:

2.12.1. Coordinatewise optimization and shooting algorithms

For very high-dimensional problems, coordinate descent algorithms are much faster than exact methods such as the LARS algorithm.

With loss functions other than squared error loss, exact methods are often not possible.

It is often sufficient to compute estimates over a grid of λ values λ₁ > λ₂ > … > λ_K chosen to cover the relevant range of penalties.

Page 24:

Optimization (continued)

The update in step 4 is explicit due to the squared-error loss function: it amounts to a componentwise soft-thresholding step (reconstructed below).
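A hedged reconstruction of the explicit update, followed by a minimal coordinate-descent sketch in Python. The update is the standard one for the objective ‖Y − Xβ‖²₂/n + λ‖β‖₁ with standardized covariates (n⁻¹‖X^(j)‖² = 1); the function and variable names in the code are mine.

\[
Z_j = n^{-1} X^{(j)\top} \Big( Y - \sum_{k \neq j} X^{(k)} \beta_k \Big),
\qquad
\beta_j \leftarrow \operatorname{sign}(Z_j)\,\big( |Z_j| - \lambda/2 \big)_+ .
\]

import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding operator: sign(z) * max(|z| - t, 0).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    # Minimize ||y - X b||^2 / n + lam * ||b||_1 by cyclic coordinate descent.
    # Assumes the columns of X are standardized so that (1/n) * ||X[:, j]||^2 = 1.
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.astype(float).copy()   # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: add back the contribution of coordinate j.
            r_j = residual + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j / n
            new_bj = soft_threshold(z_j, lam / 2.0)
            # Update the residual to reflect the new value of beta_j.
            residual = r_j - X[:, j] * new_bj
            beta[j] = new_bj
    return beta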

Page 25:

2.13. Elastic net: an extension

A double-penalization version of the Lasso, combining ℓ1- and ℓ2-penalties (reconstructed below).

After a correction (rescaling) of the naive estimator, this is equivalent to the Lasso under orthonormal design.
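A hedged reconstruction of the elastic net (standard form with two tuning parameters; the correction factor shown is the usual one, but it depends on the scaling convention):

\[
\hat{\beta}_{\mathrm{naive}}(\lambda_1, \lambda_2) = \arg\min_{\beta}
\Big( \|Y - X\beta\|_2^2 / n + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \Big),
\qquad
\hat{\beta}_{\mathrm{enet}} = (1 + \lambda_2)\, \hat{\beta}_{\mathrm{naive}} .
\]

Under orthonormal design the corrected estimator reduces to the soft-thresholding (Lasso) solution of Section 2.3.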