Regularization Paths for Generalized Linear Models via … · 2017. 5. 5. · 2 Regularization Paths for GLMs via Coordinate Descent 4. ‘ 1 regularization paths for generalized

JSS Journal of Statistical SoftwareJanuary 2010 Volume 33 Issue 1 httpwwwjstatsoftorg

Regularization Paths for Generalized Linear Models

via Coordinate Descent

Jerome FriedmanStanford University

Trevor HastieStanford University

Rob TibshiraniStanford University

Abstract

We develop fast algorithms for estimation of generalized linear models with convexpenalties The models include linear regression two-class logistic regression and multi-nomial regression problems while the penalties include `1 (the lasso) `2 (ridge regression)and mixtures of the two (the elastic net) The algorithms use cyclical coordinate descentcomputed along a regularization path The methods can handle large problems and canalso deal efficiently with sparse features In comparative timings we find that the newalgorithms are considerably faster than competing methods

Keywords lasso elastic net logistic regression `1 penalty regularization path coordinate-descent

1 Introduction

The lasso (Tibshirani 1996) is a popular method for regression that uses an `1 penalty toachieve a sparse solution In the signal processing literature the lasso is also known as basispursuit (Chen et al 1998) This idea has been broadly applied for example to general-ized linear models (Tibshirani 1996) and Coxrsquos proportional hazard models for survival data(Tibshirani 1997) In recent years there has been an enormous amount of research activitydevoted to related regularization methods

1 The grouped lasso (Yuan and Lin 2007 Meier et al 2008) where variables are includedor excluded in groups

2 The Dantzig selector (Candes and Tao 2007 and discussion) a slightly modified versionof the lasso

3 The elastic net (Zou and Hastie 2005) for correlated variables which uses a penalty thatis part `1 part `2

2 Regularization Paths for GLMs via Coordinate Descent

4 `1 regularization paths for generalized linear models (Park and Hastie 2007a)

5 Methods using non-concave penalties such as SCAD (Fan and Li 2005) and Friedmanrsquosgeneralized elastic net (Friedman 2008) enforce more severe variable selection than thelasso

6 Regularization paths for the support-vector machine (Hastie et al 2004)

7 The graphical lasso (Friedman et al 2008) for sparse covariance estimation and undi-rected graphs

Efron et al (2004) developed an efficient algorithm for computing the entire regularizationpath for the lasso for linear regression models Their algorithm exploits the fact that the coef-ficient profiles are piecewise linear which leads to an algorithm with the same computationalcost as the full least-squares fit on the data (see also Osborne et al 2000)

In some of the extensions above (items 23 and 6) piecewise-linearity can be exploited as inEfron et al (2004) to yield efficient algorithms Rosset and Zhu (2007) characterize the classof problems where piecewise-linearity existsmdashboth the loss function and the penalty have tobe quadratic or piecewise linear

Here we instead focus on cyclical coordinate descent methods These methods have beenproposed for the lasso a number of times but only recently was their power fully appreciatedEarly references include Fu (1998) Shevade and Keerthi (2003) and Daubechies et al (2004)Van der Kooij (2007) independently used coordinate descent for solving elastic-net penalizedregression models Recent rediscoveries include Friedman et al (2007) and Wu and Lange(2008) The first paper recognized the value of solving the problem along an entire path ofvalues for the regularization parameters using the current estimates as warm starts Thisstrategy turns out to be remarkably efficient for this problem Several other researchers havealso re-discovered coordinate descent many for solving the same problems we address in thispapermdashnotably Shevade and Keerthi (2003) Krishnapuram and Hartemink (2005) Genkinet al (2007) and Wu et al (2009)

In this paper we extend the work of Friedman et al (2007) and develop fast algorithmsfor fitting generalized linear models with elastic-net penalties In particular our modelsinclude regression two-class logistic regression and multinomial regression problems Ouralgorithms can work on very large datasets and can take advantage of sparsity in the featureset We provide a publicly available package glmnet (Friedman et al 2009) implemented inthe R programming system (R Development Core Team 2009) We do not revisit the well-established convergence properties of coordinate descent in convex problems (Tseng 2001) inthis article

Lasso procedures are frequently used in domains with very large datasets such as genomicsand web analysis Consequently a focus of our research has been algorithmic efficiency andspeed We demonstrate through simulations that our procedures outperform all competitorsmdash even those based on coordinate descent

In Section 2 we present the algorithm for the elastic net which includes the lasso and ridgeregression as special cases Section 3 and 4 discuss (two-class) logistic regression and multi-nomial logistic regression Comparative timings are presented in Section 5

Although the title of this paper advertises regularization paths for GLMs we only cover threeimportant members of this family However exactly the same technology extends trivially to

Journal of Statistical Software 3

other members of the exponential family such as the Poisson model We plan to extend oursoftware to cover these important other cases as well as the Cox model for survival data

Note that this article is about algorithms for fitting particular families of models and notabout the statistical properties of these models themselves Such discussions have taken placeelsewhere

2 Algorithms for the lasso ridge regression and elastic net

We consider the usual setup for linear regression We have a response variable Y isin R anda predictor vector X isin Rp and we approximate the regression function by a linear modelE(Y |X = x) = β0 + xgtβ We have N observation pairs (xi yi) For simplicity we assumethe xij are standardized

sumNi=1 xij = 0 1

sumNi=1 x

2ij = 1 for j = 1 p Our algorithms

generalize naturally to the unstandardized case The elastic net solves the following problem

min(β0β)isinRp+1

Rλ(β0 β) = min(β0β)isinRp+1

Nsumi=1

(yi minus β0 minus xgti β)2 + λPα(β)

Pα(β) = (1minus α)12||β||2`2 + α||β||`1 (2)

=psumj=1

[12(1minus α)β2

j + α|βj |] (3)

Pα is the elastic-net penalty (Zou and Hastie 2005) and is a compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1) This penalty is particularly usefulin the p N situation or any situation where there are many correlated predictor variables1

Ridge regression is known to shrink the coefficients of correlated predictors towards eachother allowing them to borrow strength from each other In the extreme case of k identicalpredictors they each get identical coefficients with 1kth the size that any single one wouldget if fit alone From a Bayesian point of view the ridge penalty is ideal if there are manypredictors and all have non-zero coefficients (drawn from a Gaussian distribution)

Lasso on the other hand is somewhat indifferent to very correlated predictors and will tendto pick one and ignore the rest In the extreme case above the lasso problem breaks downThe lasso penalty corresponds to a Laplace prior which expects many coefficients to be closeto zero and a small subset to be larger and nonzero

The elastic net with α = 1minusε for some small ε gt 0 performs much like the lasso but removesany degeneracies and wild behavior caused by extreme correlations More generally the entirefamily Pα creates a useful compromise between ridge and lasso As α increases from 0 to 1for a given λ the sparsity of the solution to (1) (ie the number of coefficients equal to zero)increases monotonically from 0 to the sparsity of the lasso solution

Figure 1 shows an example that demonstrates the effect of varying α The dataset is from(Golub et al 1999) consisting of 72 observations on 3571 genes measured with DNA microar-rays The observations fall in two classes so we use the penalties in conjunction with the

1Zou and Hastie (2005) called this penalty the naive elastic net and preferred a rescaled version which theycalled elastic net We drop this distinction here

Figure 1 Leukemia data profiles of estimated coefficients for three methods showing onlyfirst 10 steps (values for λ) in each case For the elastic net α = 02

logistic regression models of Section 3 The coefficient profiles from the first 10 steps (gridvalues for λ) for each of the three regularization methods are shown The lasso penalty admitsat most N = 72 genes into the model while ridge regression gives all 3571 genes non-zerocoefficients The elastic-net penalty provides a compromise between these two and has theeffect of averaging genes that are highly correlated and then entering the averaged gene intothe model Using the algorithm described below computation of the entire path of solutionsfor each method at 100 values of the regularization parameter evenly spaced on the log-scaletook under a second in total Because of the large number of non-zero coefficients for theridge penalty they are individually much smaller than the coefficients for the other methods

Consider a coordinate descent step for solving (1) That is suppose we have estimates β0 andβ` for ` 6= j and we wish to partially optimize with respect to βj We would like to computethe gradient at βj = βj which only exists if βj 6= 0 If βj gt 0 then

partRλpartβj|β=β = minus 1

Nsumi=1

xij(yi minus βo minus xgti β) + λ(1minus α)βj + λα (4)

A similar expression exists if βj lt 0 and βj = 0 is treated separately Simple calculus shows(Donoho and Johnstone 1994) that the coordinate-wise update has the form

βj larrS(

sumNi=1 xij(yi minus y

(j)i ) λα

)1 + λ(1minus α)

y(j)i = β0 +

sum6=j xi`β` is the fitted value excluding the contribution from xij and

hence yi minus y(j)i the partial residual for fitting βj Because of the standardization

(j)i ) is the simple least-squares coefficient when fitting this partial

residual to xij

S(z γ) is the soft-thresholding operator with value

sign(z)(|z| minus γ)+ =

z minus γ if z gt 0 and γ lt |z|z + γ if z lt 0 and γ lt |z|0 if γ ge |z|

The details of this derivation are spelled out in Friedman et al (2007)

Thus we compute the simple least-squares coefficient on the partial residual apply soft-thresholding to take care of the lasso contribution to the penalty and then apply a pro-portional shrinkage for the ridge penalty This algorithm was suggested by Van der Kooij(2007)

21 Naive updates

Looking more closely at (5) we see that

yi minus y(j)i = yi minus yi + xij βj

= ri + xij βj (7)

where yi is the current fit of the model for observation i and hence ri the current residualThus

Nsumi=1

xij(yi minus y(j)i ) =

Nsumi=1

xijri + βj (8)

because the xj are standardized The first term on the right-hand side is the gradient ofthe loss with respect to βj It is clear from (8) why coordinate descent is computationallyefficient Many coefficients are zero remain zero after the thresholding and so nothing needsto be changed Such a step costs O(N) operationsmdash the sum to compute the gradient Onthe other hand if a coefficient does change after the thresholding ri is changed in O(N) andthe step costs O(2N) Thus a complete cycle through all p variables costs O(pN) operationsWe refer to this as the naive algorithm since it is generally less efficient than the covarianceupdating algorithm to follow Later we use these algorithms in the context of iterativelyreweighted least squares (IRLS) where the observation weights change frequently there thenaive algorithm dominates

22 Covariance updates

Further efficiencies can be achieved in computing the updates in (8) We can write the firstterm on the right (up to a factor 1N) as

Nsumi=1

xijri = 〈xj y〉 minussum

k|βk|gt0

〈xj xk〉βk (9)

where 〈xj y〉 =sumN

i=1 xijyi Hence we need to compute inner products of each feature with yinitially and then each time a new feature xk enters the model (for the first time) we needto compute and store its inner product with all the rest of the features (O(Np) operations)We also store the p gradient components (9) If one of the coefficients currently in the modelchanges we can update each gradient in O(p) operations Hence with m non-zero terms inthe model a complete cycle costs O(pm) operations if no new variables become non-zero andcosts O(Np) for each new variable entered Importantly O(N) calculations do not have tobe made at every step This is the case for all penalized procedures with squared error loss

23 Sparse updates

We are sometimes faced with problems where the Ntimesp feature matrix X is extremely sparseA leading example is from document classification where the feature vector uses the so-called ldquobag-of-wordsrdquo model Each document is scored for the presenceabsence of each ofthe words in the entire dictionary under consideration (sometimes counts are used or sometransformation of counts) Since most words are absent the feature vector for each documentis mostly zero and so the entire matrix is mostly zero We store such matrices efficiently insparse column format where we store only the non-zero entries and the coordinates wherethey occur

Coordinate descent is ideally set up to exploit such sparsity in an obvious way The O(N)inner-product operations in either the naive or covariance updates can exploit the sparsityby summing over only the non-zero entries Note that in this case scaling of the variables will

not alter the sparsity but centering will So scaling is performed up front but the centeringis incorporated in the algorithm in an efficient and obvious manner

24 Weighted updates

Often a weight wi (other than 1N) is associated with each observation This will arisenaturally in later sections where observations receive weights in the IRLS algorithm In thiscase the update step (5) becomes only slightly more complicated

βj larrS(sumN

i=1wixij(yi minus y(j)i ) λα

i=1wix2ij + λ(1minus α)

If the xj are not standardized there is a similar sum-of-squares term in the denominator(even without weights) The presence of weights does not change the computational costs ofeither algorithm much as long as the weights remain fixed

25 Pathwise coordinate descent

We compute the solutions for a decreasing sequence of values for λ starting at the smallestvalue λmax for which the entire vector β = 0 Apart from giving us a path of solutions thisscheme exploits warm starts and leads to a more stable algorithm We have examples whereit is faster to compute the path down to λ (for small λ) than the solution only at that valuefor λ

When β = 0 we see from (5) that βj will stay zero if 1N |〈xj y〉| lt λα Hence Nαλmax =

max` |〈x` y〉| Our strategy is to select a minimum value λmin = ελmax and construct asequence of K values of λ decreasing from λmax to λmin on the log scale Typical values areε = 0001 and K = 100

26 Other details

Irrespective of whether the variables are standardized to have variance 1 we always centereach predictor variable Since the intercept is not regularized this means that β0 = y themean of the yi for all values of α and λ

It is easy to allow different penalties λj for each of the variables We implement this via apenalty scaling parameter γj ge 0 If γj gt 0 then the penalty applied to βj is λj = λγj If γj = 0 that variable does not get penalized and always enters the model unrestricted atthe first step and remains in the model Penalty rescaling would also allow for example oursoftware to be used to implement the adaptive lasso (Zou 2006)

Considerable speedup is obtained by organizing the iterations around the active set of featuresmdashthose with nonzero coefficients After a complete cycle through all the variables we iterateon only the active set till convergence If another complete cycle does not change the activeset we are done otherwise the process is repeated Active-set convergence is also mentionedin Meier et al (2008) and Krishnapuram and Hartemink (2005)

3 Regularized logistic regression

When the response variable is binary the linear logistic regression model is often used Denoteby G the response variable taking values in G = 1 2 (the labeling of the elements isarbitrary) The logistic regression model represents the class-conditional probabilities througha linear function of the predictors

Pr(G = 1|x) =1

1 + eminus(β0+xgtβ) (11)

Pr(G = 2|x) =1

1 + e+(β0+xgtβ)

= 1minus Pr(G = 1|x)

Alternatively this implies that

logPr(G = 1|x)Pr(G = 2|x)

= β0 + xgtβ (12)

Here we fit this model by regularized maximum (binomial) likelihood Let p(xi) = Pr(G =1|xi) be the probability (11) for observation i at a particular value for the parameters (β0 β)then we maximize the penalized log likelihood

max(β0β)isinRp+1

Nsumi=1

I(gi = 1) log p(xi) + I(gi = 2) log(1minus p(xi))

minus λPα(β)

] (13)

Denoting yi = I(gi = 1) the log-likelihood part of (13) can be written in the more explicitform

`(β0 β) =1N

Nsumi=1

yi middot (β0 + xgti β)minus log(1 + e(β0+xgti β)) (14)

a concave function of the parameters The Newton algorithm for maximizing the (unpe-nalized) log-likelihood (14) amounts to iteratively reweighted least squares Hence if thecurrent estimates of the parameters are (β0 β) we form a quadratic approximation to thelog-likelihood (Taylor expansion about current estimates) which is

`Q(β0 β) = minus 12N

Nsumi=1

wi(zi minus β0 minus xgti β)2 + C(β0 β)2 (15)

zi = β0 + xgti β +yi minus p(xi)

p(xi)(1minus p(xi)) (working response) (16)

wi = p(xi)(1minus p(xi)) (weights) (17)

and p(xi) is evaluated at the current parameters The last term is constant The Newtonupdate is obtained by minimizing `QOur approach is similar For each value of λ we create an outer loop which computes thequadratic approximation `Q about the current parameters (β0 β) Then we use coordinatedescent to solve the penalized weighted least-squares problem

min(β0β)isinRp+1

minus`Q(β0 β) + λPα(β) (18)

This amounts to a sequence of nested loops

outer loop Decrement λ

middle loop Update the quadratic approximation `Q using the current parameters (β0 β)

inner loop Run the coordinate descent algorithm on the penalized weighted-least-squaresproblem (18)

There are several important details in the implementation of this algorithm

When p N one cannot run λ all the way to zero because the saturated logisticregression fit is undefined (parameters wander off to plusmninfin in order to achieve probabilitiesof 0 or 1) Hence the default λ sequence runs down to λmin = ελmax gt 0

Care is taken to avoid coefficients diverging in order to achieve fitted probabilities of 0or 1 When a probability is within ε = 10minus5 of 1 we set it to 1 and set the weights toε 0 is treated similarly

Our code has an option to approximate the Hessian terms by an exact upper-boundThis is obtained by setting the wi in (17) all equal to 025 (Krishnapuram and Hartemink2005)

We allow the response data to be supplied in the form of a two-column matrix of countssometimes referred to as grouped data We discuss this in more detail in Section 42

The Newton algorithm is not guaranteed to converge without step-size optimization(Lee et al 2006) Our code does not implement any checks for divergence this wouldslow it down and when used as recommended we do not feel it is necessary We havea closed form expression for the starting solutions and each subsequent solution iswarm-started from the previous close-by solution which generally makes the quadraticapproximations very accurate We have not encountered any divergence problems sofar

4 Regularized multinomial regression

When the categorical response variable G has K gt 2 levels the linear logistic regressionmodel can be generalized to a multi-logit model The traditional approach is to extend (12)to K minus 1 logits

logPr(G = `|x)Pr(G = K|x)

= β0` + xgtβ` ` = 1 K minus 1 (19)

Here β` is a p-vector of coefficients As in Zhu and Hastie (2004) here we choose a moresymmetric approach We model

Pr(G = `|x) =eβ0`+x

gtβ`sumKk=1 e

β0k+xgtβk

This parametrization is not estimable without constraints because for any values for theparameters β0` β`K1 β0` minus c0 β` minus cK1 give identical probabilities (20) Regularizationdeals with this ambiguity in a natural way see Section 41 below

We fit the model (20) by regularized maximum (multinomial) likelihood Using a similarnotation as before let p`(xi) = Pr(G = `|xi) and let gi isin 1 2 K be the ith responseWe maximize the penalized log-likelihood

maxβ0`β`K1 isinRK(p+1)

Nsumi=1

log pgi(xi)minus λKsum`=1

Pα(β`)

] (21)

Denote by Y the N timesK indicator response matrix with elements yi` = I(gi = `) Then wecan write the log-likelihood part of (21) in the more explicit form

`(β0` β`K1 ) =1N

Nsumi=1

[Ksum`=1

yi`(β0` + xgti β`)minus log

(Ksum`=1

eβ0`+xgti β`

)] (22)

The Newton algorithm for multinomial regression can be tedious because of the vector natureof the response observations Instead of weights wi as in (17) we get weight matrices forexample However in the spirit of coordinate descent we can avoid these complexitiesWe perform partial Newton steps by forming a partial quadratic approximation to the log-likelihood (22) allowing only (β0` β`) to vary for a single class at a time It is not hard toshow that this is

`Q`(β0` β`) = minus 12N

Nsumi=1

wi`(zi` minus β0` minus xgti β`)2 + C(β0k βkK1 ) (23)

where as before

zi` = β0` + xgti β` +yi` minus p`(xi)

p`(xi)(1minus p`(xi)) (24)

wi` = p`(xi)(1minus p`(xi)) (25)

Our approach is similar to the two-class case except now we have to cycle over the classesas well in the outer loop For each value of λ we create an outer loop which cycles over `and computes the partial quadratic approximation `Q` about the current parameters (β0 β)Then we use coordinate descent to solve the penalized weighted least-squares problem

min(β0`β`)isinRp+1

minus`Q`(β0` β`) + λPα(β`) (26)

This amounts to the sequence of nested loops

middle loop (outer) Cycle over ` isin 1 2 K 1 2

middle loop (inner) Update the quadratic approximation `Q` using the current parame-ters β0k βkK1

inner loop Run the co-ordinate descent algorithm on the penalized weighted-least-squaresproblem (26)

41 Regularization and parameter ambiguity

As was pointed out earlier if β0` β`K1 characterizes a fitted model for (20) then β0` minusc0 β`minuscK1 gives an identical fit (c is a p-vector) Although this means that the log-likelihoodpart of (21) is insensitive to (c0 c) the penalty is not In particular we can always improvean estimate β0` β`K1 (wrt (21)) by solving

mincisinRp

Ksum`=1

Pα(β` minus c) (27)

This can be done separately for each coordinate hence

cj = arg mint

Ksum`=1

[12(1minus α)(βj` minus t)2 + α|βj` minus t|

] (28)

Theorem 1 Consider problem (28) for values α isin [0 1] Let βj be the mean of the βj` andβMj a median of the βj` (and for simplicity assume βj le βMj Then we have

cj isin [βj βMj ] (29)

with the left endpoint achieved if α = 0 and the right if α = 1

The two endpoints are obvious The proof of Theorem 1 is given in Appendix A A con-sequence of the theorem is that a very simple search algorithm can be used to solve (28)The objective is piecewise quadratic with knots defined by the βj` We need only evaluatesolutions in the intervals including the mean and median and those in between We recenterthe parameters in each index set j after each inner middle loop step using the the solutioncj for each j

Not all the parameters in our model are regularized The intercepts β0` are not and with ourpenalty modifiers γj (Section 26) others need not be as well For these parameters we usemean centering

42 Grouped and matrix responses

As in the two class case the data can be presented in the form of a N times K matrix mi` ofnon-negative numbers For example if the data are grouped at each xi we have a numberof multinomial samples with mi` falling into category ` In this case we divide each row bythe row-sum mi =

sum`mi` and produce our response matrix yi` = mi`mi mi becomes an

observation weight Our penalized maximum likelihood algorithm changes in a trivial wayThe working response (24) is defined exactly the same way (using yi` just defined) Theweights in (25) get augmented with the observation weight mi

wi` = mip`(xi)(1minus p`(xi)) (30)

Equivalently the data can be presented directly as a matrix of class proportions along witha weight vector From the point of view of the algorithm any matrix of positive numbers andany non-negative weight vector will be treated in the same way

5 Timings

In this section we compare the run times of the coordinate-wise algorithm to some competingalgorithms These use the lasso penalty (α = 1) in both the regression and logistic regressionsettings All timings were carried out on an Intel Xeon 280GH processor

We do not perform comparisons on the elastic net versions of the penalties since there is notmuch software available for elastic net Comparisons of our glmnet code with the R packageelasticnet will mimic the comparisons with lars (Hastie and Efron 2007) for the lasso sinceelasticnet (Zou and Hastie 2004) is built on the lars package

51 Regression with the lasso

We generated Gaussian data withN observations and p predictors with each pair of predictorsXj Xjprime having the same population correlation ρ We tried a number of combinations of Nand p with ρ varying from zero to 095 The outcome values were generated by

Y =psumj=1

Xjβj + k middot Z (31)

where βj = (minus1)j exp(minus2(j minus 1)20) Z sim N(0 1) and k is chosen so that the signal-to-noiseratio is 30 The coefficients are constructed to have alternating signs and to be exponentiallydecreasing

Table 1 shows the average CPU timings for the coordinate-wise algorithm and the lars proce-dure (Efron et al 2004) All algorithms are implemented as R functions The coordinate-wisealgorithm does all of its numerical work in Fortran while lars (Hastie and Efron 2007) doesmuch of its work in R calling Fortran routines for some matrix operations However com-parisons in Friedman et al (2007) showed that lars was actually faster than a version codedentirely in Fortran Comparisons between different programs are always tricky in particularthe lars procedure computes the entire path of solutions while the coordinate-wise proceduresolves the problem for a set of pre-defined points along the solution path In the orthogonalcase lars takes min(N p) steps hence to make things roughly comparable we called thelatter two algorithms to solve a total of min(N p) problems along the path Table 1 showstimings in seconds averaged over three runs We see that glmnet is considerably faster thanlars the covariance-updating version of the algorithm is a little faster than the naive versionwhen N gt p and a little slower when p gt N We had expected that high correlation betweenthe features would increase the run time of glmnet but this does not seem to be the case

52 Lasso-logistic regression

We used the same simulation setup as above except that we took the continuous outcome ydefined p = 1(1 + exp(minusy)) and used this to generate a two-class outcome z with Pr(z =1) = p Pr(z = 0) = 1 minus p We compared the speed of glmnet to the interior point methodl1logreg (Koh et al 2007ba) Bayesian binary regression (BBR Madigan and Lewis 2007Genkin et al 2007) and the lasso penalized logistic program LPL supplied by Ken Lange (seeWu and Lange 2008) The latter two methods also use a coordinate descent approach

The BBR software automatically performs ten-fold cross-validation when given a set of λvalues Hence we report the total time for ten-fold cross-validation for all methods using the

Linear regression ndash Dense features

Correlation0 01 02 05 09 095

N = 1000 p = 100glmnet (type = naive) 005 006 006 009 008 007glmnet (type = cov) 002 002 002 002 002 002lars 011 011 011 011 011 011

Table 1 Timings (in seconds) for glmnet and lars algorithms for linear regression with lassopenalty The first line is glmnet using naive updating while the second uses covarianceupdating Total time for 100 λ values averaged over 3 runs

same 100 λ values for all Table 2 shows the results in some cases we omitted a methodwhen it was seen to be very slow at smaller values for N or p Again we see that glmnet isthe clear winner it slows down a little under high correlation The computation seems tobe roughly linear in N but grows faster than linear in p Table 3 shows some results whenthe feature matrix is sparse we randomly set 95 of the feature values to zero Again theglmnet procedure is significantly faster than l1logreg

Logistic regression ndash Dense features

Correlation0 01 02 05 09 095

N = 1000 p = 100glmnet 165 181 231 387 599 848l1logreg 31475 3186 3435 3221 3185 3181BBR 4070 4757 5418 7006 10672 12141LPL 2468 3164 4799 17077 74100 144825

N = 5000 p = 100glmnet 789 848 901 1339 2668 2636l1logreg 23988 23200 22962 22949 2219 22309

N = 100 000 p = 100glmnet 7856 17845 20594 27433 55248 63850

N = 100 p = 100 000glmnet 13727 13940 14655 19798 21965 20193

Table 2 Timings (seconds) for logistic models with lasso penalty Total time for tenfoldcross-validation over a grid of 100 λ values

53 Real data

Table 4 shows some timing results for four different datasets

Cancer (Ramaswamy et al 2002) gene-expression data with 14 cancer classes Here wecompare glmnet with BMR (Genkin et al 2007) a multinomial version of BBR

Leukemia (Golub et al 1999) gene-expression data with a binary response indicatingtype of leukemiamdashAML vs ALL We used the preprocessed data of Dettling (2004)

InternetAd (Kushmerick 1999) document classification problem with mostly binaryfeatures The response is binary and indicates whether the document is an advertise-ment Only 12 nonzero values in the predictor matrix

Logistic regression ndash Sparse features

Correlation0 01 02 05 09 095

N = 1000 p = 100glmnet 077 074 072 073 084 088l1logreg 519 521 514 540 614 626BBR 201 195 198 206 273 288

N = 10 000 p = 100glmnet 321 302 295 325 458 508l1logreg 4587 4663 4433 4399 4560 4316BBR 1180 1164 1158 1330 1246 1183

N = 100 p = 10 000glmnet 1018 1035 993 1004 902 891l1logreg 13027 12488 12418 12984 13721 15954BBR 4572 4750 4746 4849 5629 6021

Table 3 Timings (seconds) for logistic model with lasso penalty and sparse features (95zero) Total time for ten-fold cross-validation over a grid of 100 λ values

NewsGroup (Lang 1995) document classification problem We used the training setcultured from these data by Koh et al (2007a) The response is binary and indicatesa subclass of topics the predictors are binary and indicate the presence of particulartri-gram sequences The predictor matrix has 005 nonzero values

All four datasets are available online with this publication as saved R data objects (the lattertwo in sparse format using the Matrix package Bates and Maechler 2009)

For the Leukemia and InternetAd datasets the BBR program used fewer than 100 λ valuesso we estimated the total time by scaling up the time for smaller number of values TheInternetAd and NewsGroup datasets are both sparse 1 nonzero values for the former005 for the latter Again glmnet is considerably faster than the competing methods

54 Other comparisons

When making comparisons one invariably leaves out someones favorite method We leftout our own glmpath (Park and Hastie 2007b) extension of lars for GLMs (Park and Hastie2007a) since it does not scale well to the size problems we consider here Two referees of

Name Type N p glmnet l1logreg BBRBMR

DenseCancer 14 class 144 16063 25 mins 21 hrsLeukemia 2 class 72 3571 250 550 450

SparseInternetAd 2 class 2359 1430 50 209 347NewsGroup 2 class 11314 777811 2 mins 35 hrs

Table 4 Timings (seconds unless stated otherwise) for some real datasets For the CancerLeukemia and InternetAd datasets times are for ten-fold cross-validation using 100 valuesof λ for NewsGroup we performed a single run with 100 values of λ with λmin = 005λmax

MacBook Pro HP Linux server

glmnet 034 013penalized 1031OWL-QN 31435

Table 5 Timings (seconds) for the Leukemia dataset using 100 λ values These timingswere performed on two different platforms which were different again from those used in theearlier timings in this paper

an earlier draft of this paper suggested two methods of which we were not aware We ran asingle benchmark against each of these using the Leukemia data fitting models at 100 valuesof λ in each case

OWL-QN Orthant-Wise Limited-memory Quasi-Newton Optimizer for `1-regularizedObjectives (Andrew and Gao 2007ab) The software is written in C++ and availablefrom the authors upon request

The R package penalized (Goeman 2009ba) which fits GLMs using a fast implementa-tion of gradient ascent

Table 5 shows these comparisons (on two different machines) glmnet is considerably fasterin both cases

6 Selecting the tuning parameters

The algorithms discussed in this paper compute an entire path of solutions (in λ) for anyparticular model leaving the user to select a particular solution from the ensemble Onegeneral approach is to use prediction error to guide this choice If a user is data rich they canset aside some fraction (say a third) of their data for this purpose They would then evaluatethe prediction performance at each value of λ and pick the model with the best performance

minus6 minus5 minus4 minus3 minus2 minus1 0

log(Lambda)

99 99 97 95 93 75 54 21 12 5 2 1

Gaussian Family

minus9 minus8 minus7 minus6 minus5 minus4 minus3 minus2

log(Lambda)

100 98 97 88 74 55 30 9 7 3 2

Binomial Family

log(Lambda)

100 98 97 88 74 55 30 9 7 3 2

Binomial Family

Figure 2 Ten-fold cross-validation on simulated data The first row is for regression with aGaussian response the second row logistic regression with a binomial response In both caseswe have 1000 observations and 100 predictors but the response depends on only 10 predictorsFor regression we use mean-squared prediction error as the measure of risk For logisticregression the left panel shows the mean deviance (minus twice the log-likelihood on theleft-out data) while the right panel shows misclassification error which is a rougher measureIn all cases we show the mean cross-validated error curve as well as a one-standard-deviationband In each figure the left vertical line corresponds to the minimum error while the rightvertical line the largest value of lambda such that the error is within one standard-error ofthe minimummdashthe so called ldquoone-standard-errorrdquo rule The top of each plot is annotated withthe size of the models

Alternatively they can use K-fold cross-validation (Hastie et al 2009 for example) wherethe training data is used both for training and testing in an unbiased wayFigure 2 illustrates cross-validation on a simulated dataset For logistic regression we some-times use the binomial deviance rather than misclassification error since the latter is smootherWe often use the ldquoone-standard-errorrdquo rule when selecting the best model this acknowledgesthe fact that the risk curves are estimated with error so errs on the side of parsimony (Hastieet al 2009) Cross-validation can be used to select α as well although it is often viewed as ahigher-level parameter and chosen on more subjective grounds

7 Discussion

Cyclical coordinate descent methods are a natural approach for solving convex problemswith `1 or `2 constraints or mixtures of the two (elastic net) Each coordinate-descent stepis fast with an explicit formula for each coordinate-wise minimization The method alsoexploits the sparsity of the model spending much of its time evaluating only inner productsfor variables with non-zero coefficients Its computational speed both for large N and p arequite remarkableAn R-language package glmnet is available under general public licence (GPL-2) from theComprehensive R Archive Network at httpCRANR-projectorgpackage=glmnet Sparsedata inputs are handled by the Matrix package MATLAB functions (Jiang 2009) are availablefrom httpwww-statstanfordedu~tibsglmnet-matlab

Acknowledgments

We would like to thank Holger Hoefling for helpful discussions and Hui Jiang for writing theMATLAB interface to our Fortran routines We thank the associate editor production editorand two referees who gave useful comments on an earlier draft of this articleFriedman was partially supported by grant DMS-97-64431 from the National Science Foun-dation Hastie was partially supported by grant DMS-0505676 from the National ScienceFoundation and grant 2R01 CA 72028-07 from the National Institutes of Health Tibshiraniwas partially supported by National Science Foundation Grant DMS-9971405 and NationalInstitutes of Health Contract N01-HV-28183

References

Andrew G Gao J (2007a) OWL-QN Orthant-Wise Limited-Memory Quasi-Newton Op-timizer for L1-Regularized Objectives URL httpresearchmicrosoftcomen-usdownloadsb1eb1016-1738-4bd5-83a9-370c9d498a03

Andrew G Gao J (2007b) ldquoScalable Training of L1-Regularized Log-Linear Modelsrdquo InICML rsquo07 Proceedings of the 24th International Conference on Machine Learning pp33ndash40 ACM New York NY USA doi10114512734961273501

Bates D Maechler M (2009) Matrix Sparse and Dense Matrix Classes and MethodsR package version 0999375-30 URL httpCRANR-projectorgpackage=Matrix

Candes E Tao T (2007) ldquoThe Dantzig Selector Statistical Estimation When p is muchLarger than nrdquo The Annals of Statistics 35(6) 2313ndash2351

Chen SS Donoho D Saunders M (1998) ldquoAtomic Decomposition by Basis Pursuitrdquo SIAMJournal on Scientific Computing 20(1) 33ndash61

Daubechies I Defrise M De Mol C (2004) ldquoAn Iterative Thresholding Algorithm for Lin-ear Inverse Problems with a Sparsity Constraintrdquo Communications on Pure and AppliedMathematics 57 1413ndash1457

Dettling M (2004) ldquoBagBoosting for Tumor Classification with Gene Expression DatardquoBioinformatics 20 3583ndash3593

Donoho DL Johnstone IM (1994) ldquoIdeal Spatial Adaptation by Wavelet ShrinkagerdquoBiometrika 81 425ndash455

Efron B Hastie T Johnstone I Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annalsof Statistics 32(2) 407ndash499

Fan J Li R (2005) ldquoVariable Selection via Nonconcave Penalized Likelihood and its OraclePropertiesrdquo Journal of the American Statistical Association 96 1348ndash1360

Friedman J (2008) ldquoFast Sparse Regression and Classificationrdquo Technical report Depart-ment of Statistics Stanford University URL httpwww-statstanfordedu~jhfftpGPSpubpdf

Friedman J Hastie T Hoefling H Tibshirani R (2007) ldquoPathwise Coordinate OptimizationrdquoThe Annals of Applied Statistics 2(1) 302ndash332

Friedman J Hastie T Tibshirani R (2008) ldquoSparse Inverse Covariance Estimation with theGraphical Lassordquo Biostatistics 9 432ndash441

Friedman J Hastie T Tibshirani R (2009) glmnet Lasso and Elastic-Net RegularizedGeneralized Linear Models R package version 11-4 URL httpCRANR-projectorgpackage=glmnet

Fu W (1998) ldquoPenalized Regressions The Bridge vs the Lassordquo Journal of Computationaland Graphical Statistics 7(3) 397ndash416

Genkin A Lewis D Madigan D (2007) ldquoLarge-scale Bayesian Logistic Regression for TextCategorizationrdquo Technometrics 49(3) 291ndash304

Goeman J (2009a) ldquoL1 Penalized Estimation in the Cox Proportional Hazards ModelrdquoBiometrical Journal doi101002bimj200900028 Forthcoming

Goeman J (2009b) penalized L1 (Lasso) and L2 (Ridge) Penalized Estimation in GLMsand in the Cox Model R package version 09-27 URL httpCRANR-projectorgpackage=penalized

Golub T Slonim DK Tamayo P Huard C Gaasenbeek M Mesirov JP Coller H Loh MLDowning JR Caligiuri MA Bloomfield CD Lander ES (1999) ldquoMolecular Classification ofCancer Class Discovery and Class Prediction by Gene Expression Monitoringrdquo Science286 531ndash536

Hastie T Efron B (2007) lars Least Angle Regression Lasso and Forward StagewiseR package version 09-7 URL httpCRANR-projectorgpackage=Matrix

Hastie T Rosset S Tibshirani R Zhu J (2004) ldquoThe Entire Regularization Path for theSupport Vector Machinerdquo Journal of Machine Learning Research 5 1391ndash1415

Hastie T Tibshirani R Friedman J (2009) The Elements of Statistical Learning PredictionInference and Data Mining 2nd edition Springer-Verlag New York

Jiang H (2009) ldquoA MATLAB Implementation of glmnetrdquo Stanford University URL httpwww-statstanfordedu~tibsglmnet-matlab

Koh K Kim SJ Boyd S (2007a) ldquoAn Interior-Point Method for Large-Scale L1-RegularizedLogistic Regressionrdquo Journal of Machine Learning Research 8 1519ndash1555

Koh K Kim SJ Boyd S (2007b) l1logreg A Solver for L1-Regularized Logistic RegressionR package version 01-1 Avaliable from Kwangmoo Koh (deneb1stanfordedu)

Krishnapuram B Hartemink AJ (2005) ldquoSparse Multinomial Logistic Regression Fast Al-gorithms and Generalization Boundsrdquo IEEE Transactions on Pattern Analysis and Ma-chine Intelligence 27(6) 957ndash968 Fellow-Lawrence Carin and Senior Member-Mario AT Figueiredo

Kushmerick N (1999) ldquoLearning to Remove Internet Advertisementsrdquo In AGENTS rsquo99Proceedings of the Third Annual Conference on Autonomous Agents pp 175ndash181 ACMNew York NY USA ISBN 1-58113-066-X doi101145301136301186

Lang K (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In A Prieditis S Russell (eds)Proceedings of the 12th International Conference on Machine Learning pp 331ndash339 SanFrancisco

Lee S Lee H Abbeel P Ng A (2006) ldquoEfficient L1 Logistic Regressionrdquo In Proceedings ofthe Twenty-First National Conference on Artificial Intelligence (AAAI-06) URL httpswwwaaaiorgPapersAAAI2006AAAI06-064pdf

Madigan D Lewis D (2007) BBR BMR Bayesian Logistic Regression Open-source stand-alone software URL httpwwwbayesianregressionorg

Meier L van de Geer S Buhlmann P (2008) ldquoThe Group Lasso for Logistic RegressionrdquoJournal of the Royal Statistical Society B 70(1) 53ndash71

Osborne M Presnell B Turlach B (2000) ldquoA New Approach to Variable Selection in LeastSquares Problemsrdquo IMA Journal of Numerical Analysis 20 389ndash404

Park MY Hastie T (2007a) ldquoL1-Regularization Path Algorithm for Generalized Linear Mod-elsrdquo Journal of the Royal Statistical Society B 69 659ndash677

Park MY Hastie T (2007b) glmpath L1 Regularization Path for Generalized Lin-ear Models and Cox Proportional Hazards Model R package version 094 URL httpCRANR-projectorgpackage=glmpath

Ramaswamy S Tamayo P Rifkin R Mukherjee S Yeang C Angelo M Ladd C Reich M Lat-ulippe E Mesirov J Poggio T Gerald W Loda M Lander E Golub T (2002) ldquoMulticlassCancer Diagnosis Using Tumor Gene Expression Signaturerdquo Proceedings of the NationalAcademy of Sciences 98 15149ndash15154

R Development Core Team (2009) R A Language and Environment for Statistical ComputingR Foundation for Statistical Computing Vienna Austria ISBN 3-900051-07-0 URL httpwwwR-projectorg

Rosset S Zhu J (2007) ldquoPiecewise Linear Regularized Solution Pathsrdquo The Annals ofStatistics 35(3) 1012ndash1030

Shevade K Keerthi S (2003) ldquoA Simple and Efficient Algorithm for Gene Selection UsingSparse Logistic Regressionrdquo Bioinformatics 19 2246ndash2253

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the RoyalStatistical Society B 58 267ndash288

Tibshirani R (1997) ldquoThe Lasso Method for Variable Selection in the Cox Modelrdquo Statisticsin Medicine 16 385ndash395

Tseng P (2001) ldquoConvergence of a Block Coordinate Descent Method for NondifferentiableMinimizationrdquo Journal of Optimization Theory and Applications 109 475ndash494

Van der Kooij A (2007) Prediction Accuracy and Stability of Regrsssion with Optimal ScalingTransformations PhD thesis Department of Data Theory University of Leiden URLhttpsopenaccessleidenunivnldspacehandle188712096

Wu T Chen Y Hastie T Sobel E Lange K (2009) ldquoGenome-Wide Association Analysis byPenalized Logistic Regressionrdquo Bioinformatics 25(6) 714ndash721

Wu T Lange K (2008) ldquoCoordinate Descent Procedures for Lasso Penalized RegressionrdquoThe Annals of Applied Statistics 2(1) 224ndash244

Yuan M Lin Y (2007) ldquoModel Selection and Estimation in Regression with Grouped Vari-ablesrdquo Journal of the Royal Statistical Society B 68(1) 49ndash67

Zhu J Hastie T (2004) ldquoClassification of Expression Arrays by Penalized Logistic RegressionrdquoBiostatistics 5(3) 427ndash443

Zou H (2006) ldquoThe Adaptive Lasso and its Oracle Propertiesrdquo Journal of the AmericanStatistical Association 101 1418ndash1429

Zou H Hastie T (2004) elasticnet Elastic Net Regularization and Variable SelectionR package version 102 URL httpCRANR-projectorgpackage=elasticnet

Zou H Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journalof the Royal Statistical Society B 67(2) 301ndash320

A Proof of Theorem 1

We have

cj = arg mint

Ksum`=1

] (32)

Suppose α isin (0 1) Differentiating wrt t (using a sub-gradient representation) we have

Ksum`=1

[minus(1minus α)(βj` minus t)minus αsj`] = 0 (33)

where sj` = sign(βj` minus t) if βj` 6= t and sj` isin [minus1 1] otherwise This gives

t = βj +1K

1minus α

Ksum`=1

sj` (34)

It follows that t cannot be larger than βMj since then the second term above would be negativeand this would imply that t is less than βj Similarly t cannot be less than βj since then thesecond term above would have to be negative implying that t is larger than βMj

Affiliation

Trevor HastieDepartment of StatisticsStanford UniversityCalifornia 94305 United States of AmericaE-mail hastiestanfordeduURL httpwww-statstanfordedu~hastie

Journal of Statistical Software httpwwwjstatsoftorgpublished by the American Statistical Association httpwwwamstatorg

Volume 33 Issue 1 Submitted 2009-04-22January 2010 Accepted 2009-12-15

Introduction

Algorithms for the lasso ridge regression and elastic net

Naive updates
Covariance updates
Sparse updates
Weighted updates
Pathwise coordinate descent
Other details
Regularized logistic regression
Regularized multinomial regression
Regularization and parameter ambiguity
Grouped and matrix responses
Timings
Regression with the lasso
Lasso-logistic regression
Real data
Other comparisons
Selecting the tuning parameters
Discussion
Proof of Theorem 1